Updated on 07/13/2016. See the update later in this post; I may have found the ultimate solution, even though I am still not sure what caused the issue.
Recently I had an issue vMotioning some VMs between hosts in a vSphere 6.x cluster. Long story short, here are the symptoms:
- I consistently got the error “Failed waiting for data. Error 195887167. Connection closed by remote host, possibly due to timeout” when vMotioning (host only, no Storage vMotion) some VMs, particularly the vCenter Server Appliance (vCSA) VM. I had two vCSA VMs, and both had the same issue, but the vCSA was not the only VM that hit the error.
- I successfully vMotioned some VMs between the hosts, so the vMotion network configuration should be okay.
- In other words, some VMs are okay and some are not. The size (CPU, RAM, storage) of the VMs does not seem to be the problem: the successfully vMotioned VMs can have more or less CPU, RAM, and storage than the failed ones.
- The failed VMs have more disks than the other VMs; the vCSA, for example, is created with 11 disks by default.
- When the VMs were powered off, vMotion succeeded.
- Restarting the hosts and the VMs made no difference.
- Verified no IP address conflicts.
- Tried both a single VMkernel adapter for management and vMotion, and a dedicated VMkernel adapter for vMotion. No difference.
- Tested vmkping successfully between the hosts.
- In the vmkernel.log of the hosts, the error is “2016-05-18T22:47:34.959Z cpu15:39089)WARNING: Migrate: 270: 1463611379229538 D: Failed: Connection closed by remote host, possibly due to timeout (0xbad003f) @0x418000e149ee” on the destination host or “2016-05-19T18:29:23.930Z cpu1:130133)WARNING: Migrate: 270: 1463682486286991 S: Failed: Migration determined a failure by the VMX (0xbad0092) @0x41803a7f6993” on the source host.
Possible solutions
- Remove the snapshot on the VM if it has one.
- After removing the snapshot on one of the vCSA VMs, vMotion worked fine. But the other vCSA had no snapshot, and it still failed.
- Try using the vSphere Client instead of the vSphere Web Client. This worked on some VMs, but not always.
- Assign the VM’s network adapter to a different port group, then change it back to its original port group.
- This seems to be the ultimate fix (a scripted version is sketched below). After doing this, the vCSA VMs, which had consistently failed to vMotion, vMotioned successfully.
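For reference, the port-group flip can be scripted with pyVmomi. This is a minimal sketch under stated assumptions, not the exact procedure from the post: the vCenter address, credentials, VM name, and port group names are all placeholders, and the NIC is assumed to be on a standard vSwitch port group (the distributed switch case is what the update below deals with).

```python
# Minimal pyVmomi sketch of the port-group flip. All names and credentials
# below are placeholders; the NIC is assumed to be on a standard vSwitch
# port group.
import ssl
import time

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only: skip certificate checks
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)
content = si.RetrieveContent()

def find_vm(name):
    # Walk the inventory for a VM by name.
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    return next(v for v in view.view if v.name == name)

def set_portgroup(vm, pg_name):
    # Point the VM's first NIC at a (standard) port group by name and wait.
    nic = next(d for d in vm.config.hardware.device
               if isinstance(d, vim.vm.device.VirtualEthernetCard))
    nic.backing = vim.vm.device.VirtualEthernetCard.NetworkBackingInfo(
        deviceName=pg_name)
    change = vim.vm.device.VirtualDeviceSpec(
        operation=vim.vm.device.VirtualDeviceSpec.Operation.edit, device=nic)
    task = vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[change]))
    while task.info.state not in (vim.TaskInfo.State.success,
                                  vim.TaskInfo.State.error):
        time.sleep(1)

vm = find_vm("vcsa01")
set_portgroup(vm, "Temp-PG")     # flip to a temporary port group...
set_portgroup(vm, "VM Network")  # ...then back to the original, then retry vMotion
Disconnect(si)
```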
Conclusion
- I’m not sure of the root cause of this issue, but it may be related to the network settings on the vSwitch or port group. Some hints about this: vMotion fails with the general system error: 0xbad003f (KB2008394)
- Several other KB articles I found are not the solution in my case.
07/13/2016 update
- I run into the same error when migrating some VMs, including the vCenter 6.x appliance (vCSA), between hosts in a VMware cluster. I am able to migrate other VMs, which leads me to believe the problem is with the VM itself rather than with the infrastructure.
- Among the VMs hitting this error, the ones I can power off migrate successfully once powered off. However, I cannot power down the vCSA VM, because I cannot perform the migration without vCenter available.
- I try assigning the NIC of the vCSA VM to a different port group and changing it back. However, I cannot do that, because this VM is the vCenter Server and it is configured with a distributed switch. If I change the vCSA to a different port group (configured with a different VLAN), the vCenter Server goes down (because its NIC is on the wrong port group with the wrong VLAN), and then I cannot change it back to the original port group with the right VLAN.
- I try connecting directly to the ESXi host of the vCSA VM via the vSphere C# client and then assigning the NIC of the vCSA VM to a different port group. However, I cannot do that either: because the host is configured only with distributed switches, no other port group is available for selection in this situation.
- I try creating a new port group with ephemeral port binding on the distributed switch (a scripted version is sketched after this list). This new ephemeral port group is available in the vSphere C# client when connecting to the host directly. I then assign the VM to the ephemeral port group and change it back. However, the migration still fails with the same error.
- Since I was able to fix this issue last time by changing the port group and changing it back, I guess that doing so somehow reset the VM’s NIC or the virtual switch port to which the VM is connected. That gives me the idea of manually assigning the VM to another virtual switch port.
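Creating the ephemeral port group can also be scripted. A minimal pyVmomi sketch, assuming you already have a handle to the distributed switch; the port group name and VLAN ID are placeholders:

```python
# Minimal pyVmomi sketch: add an ephemeral-binding port group to a
# distributed switch. "dvs" is assumed to be a vim.DistributedVirtualSwitch
# object obtained elsewhere; the name and VLAN ID are placeholders.
from pyVmomi import vim

def add_ephemeral_portgroup(dvs, name="Ephemeral-PG", vlan_id=100):
    spec = vim.dvs.DistributedVirtualPortgroup.ConfigSpec()
    spec.name = name
    spec.type = "ephemeral"  # ports bind at power-on, unlike "earlyBinding"
    spec.defaultPortConfig = \
        vim.dvs.VmwareDistributedVirtualSwitch.VmwarePortConfigPolicy(
            vlan=vim.dvs.VmwareDistributedVirtualSwitch.VlanIdSpec(
                vlanId=vlan_id, inherited=False))
    return dvs.AddDVPortgroup_Task([spec])  # returns a task to wait on
```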
Solutions
- In the screenshot above, the NIC of the vCSA is assigned to port 214 on the vSwitch.
- Log back in to vCenter via the vSphere C# client or the Web Client (I cannot see the ports on the distributed switch, nor change the port assigned to the VM’s NIC, when connecting directly to the ESXi host via the vSphere C# client).
- I find an unused port on the same port group (e.g. port 423 in my case).
- Edit the vCSA VM settings and assign its NIC to the unused port (a scripted version is sketched after this list).
- Then I can successfully migrate the vCSA VM to another host.
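For reference, the port reassignment can be done with pyVmomi as well. This is a minimal sketch under assumptions: the VM, distributed switch, and port group handles (vm, dvs, pg) are looked up elsewhere, and port "423" simply mirrors the example above.

```python
# Minimal pyVmomi sketch: pin the VM's NIC to a specific unused port on the
# same distributed port group. "vm", "dvs", and "pg" are assumed to be
# looked up elsewhere; port "423" mirrors the example in the post.
from pyVmomi import vim

def find_unused_port(dvs, pg):
    # List free (unconnected) ports in the port group.
    criteria = vim.dvs.PortCriteria(connected=False, portgroupKey=[pg.key])
    ports = dvs.FetchDVPorts(criteria=criteria)
    return ports[0].key if ports else None  # e.g. "423"

def pin_nic_to_port(vm, dvs, pg, port_key):
    nic = next(d for d in vm.config.hardware.device
               if isinstance(d, vim.vm.device.VirtualEthernetCard))
    nic.backing = \
        vim.vm.device.VirtualEthernetCard.DistributedVirtualPortBackingInfo(
            port=vim.dvs.PortConnection(switchUuid=dvs.uuid,
                                        portgroupKey=pg.key,
                                        portKey=port_key))
    change = vim.vm.device.VirtualDeviceSpec(
        operation=vim.vm.device.VirtualDeviceSpec.Operation.edit, device=nic)
    return vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[change]))

# pin_nic_to_port(vm, dvs, pg, find_unused_port(dvs, pg)), then retry vMotion
```

The reconfigure task returned here is the same operation as editing the VM's settings in the client; once it completes, retry the vMotion.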
Conclusion
- Changing the VM’s NIC to an unused port will be my first attempt when this issue happens again (I bet it will).
- I still don’t know the cause of this issue.
10/15/2019 update
- I got the exact same error again when Storage vMotioning a VM across two vCenters (from vCSA 6.0 to vCSA 6.5). I deleted the VM snapshot, reran the vMotion, and it completed successfully (the snapshot cleanup can also be scripted, as sketched below).
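A minimal pyVmomi sketch of that snapshot cleanup, assuming an existing SmartConnect session si; the VM name is a placeholder:

```python
# Minimal pyVmomi sketch: remove all snapshots from a VM before retrying the
# vMotion. "si" is assumed to be a live SmartConnect session; the VM name is
# a placeholder, not from the post.
from pyVmomi import vim

def remove_all_snapshots(si, vm_name="problem-vm"):
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == vm_name)
    if vm.snapshot:  # the property is unset when the VM has no snapshots
        return vm.RemoveAllSnapshots_Task()
    return None
```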