Eddie's Blog: February 2017

vSphere 6.5 New Feature – VMware Orchestrated Restart

Let me go back to the old ESXi 3 days, when I was using standalone ESXi hosts, or vCenter without HA and DRS. If a power outage or air conditioning failure hit the data center, all the ESXi hosts were powered down. Once the environmental problem was resolved, I could manage the VM startup sequence by configuring the switched PDU to power on the hosts in order, and by configuring the VM startup order at the host level.

However, once I deployed vCenter Server with HA and DRS, I lost control of the VM startup order, because a VM could be running on any host in the cluster. Some people said that I should not worry about the VM startup order in a cluster, because the ESXi cluster would never go down if I had designed the infrastructure with enough redundancy. As we all know, we never have enough redundancy in a small ESXi deployment.

I have long been curious why VMware did not “fix” this issue. Now vSphere 6.5 introduces the VMware Orchestrated Restart feature. At a high level, Orchestrated Restart, like the VM affinity and anti-affinity rules, puts the VMs into VM groups and sets the startup dependencies among those groups. To learn more, please go to “What is VMware Orchestrated Restart?”.
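To make the idea of startup dependencies among VM groups concrete, here is a minimal Python sketch. It is only an illustration of dependency ordering, not how vSphere implements Orchestrated Restart. The group names (database, app, web) and the restart_order function are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical VM groups. Each key maps to the set of groups
# that must be restarted before it.
dependencies = {
    "database": set(),
    "app": {"database"},
    "web": {"app"},
}


def restart_order(deps):
    """Return the VM groups in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


print(restart_order(dependencies))
```

With a simple chain like this one, the database group comes first, then the app group, then the web group.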

I am so glad to learn about this new vSphere 6.5 feature. It is one more reason to upgrade to vSphere 6.5.

vSAN Performance Service “Hosts Not Contributing Stats” Fix

I have a four-host vSAN cluster running vSAN 6.2. Recently the Performance service check in vSAN health showed two of the hosts not contributing stats.

The following are all the steps that I tried while troubleshooting and ultimately fixing the issue in my environment. Some of the steps did not fix my issue, but they may be applicable to your situation. PS. I opened a VMware support case on this issue. The support engineer did not directly solve my issue. However, he did give me a hint about the cause that led me to discover the solution.

Turn the Performance Service off and on again in vSphere Web Client: vSAN cluster, Manage, Settings, Health and Performance.
Turn off the Performance Service, restart the vSAN management agent with “/etc/init.d/vsanmgmtd restart”, then start the Performance Service again.
Place the vSAN host in maintenance mode and restart the host.
SSH to the vCenter Server Appliance and restart the vmware-vpxd service with “service vmware-vpxd restart”.
Verify that the vSAN storage provider of each vSAN host is online in vSphere Web Client: vCenter Server, Manage, Storage Providers. If a host’s vSAN provider is offline, unregister the host’s storage provider and synchronize all vSAN storage providers. This brings the host’s vSAN storage provider back online.
Caution: doing this can cause the VMs on the host to fail over to other hosts in the cluster.
(I think this is where I started heading toward the ultimate fix) Check the certificate info of each vSAN host in Storage Providers. The certificates should all be issued by the same Platform Services Controller (my vCenter is a vCSA with an external PSC, rather than an embedded PSC). In my case, the certificates of the two “problem” vSAN hosts were issued by the VC host, while the certificates of the “good” vSAN hosts were issued by the PSC host. I don’t know why these hosts have different certificate issuers, since I don’t have the history of how the PSC and VC were deployed.
To further confirm that the ESXi host certificate is the problem:

Log in to vCenter Server as “administrator@vsphere.local”
Home, Administration, Deployment, System Configuration, Nodes, PSC node, Manage, Certificate Authority (if selecting VC node, there is no Certificate Authority tab under Manage)
Enter the password of “administrator@vsphere.local” again
Under Active Certificate, all the ESXi hosts are listed except the two “problem” vSAN hosts
It makes sense that the certificates of the two “problem” vSAN hosts are missing here, because they were issued by the VC host, not the PSC host. But it does not make sense how they received the “problem” certificates, since there is no Certificate Authority on the VC host.
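The issuer comparison above can be sketched in a few lines of Python. This is only an illustration under assumptions: the host names, the issuer names, and the hosts_not_issued_by function are hypothetical, and the issuer values are assumed to have been copied by hand from the Storage Providers view rather than fetched through any VMware API.

```python
def hosts_not_issued_by(issuer_by_host, expected_issuer):
    """Return the hosts whose certificate issuer is not the expected one."""
    return sorted(
        host
        for host, issuer in issuer_by_host.items()
        if issuer != expected_issuer
    )


# Hypothetical data: two hosts issued by the PSC, two by the VC host.
issuer_by_host = {
    "esxi01": "psc.example.local",
    "esxi02": "psc.example.local",
    "esxi03": "vc.example.local",
    "esxi04": "vc.example.local",
}

print(hosts_not_issued_by(issuer_by_host, "psc.example.local"))
```

The expected issuer is passed in explicitly instead of being taken as the most common value, because in a case like mine, where half of the hosts are wrong, there is no majority to rely on.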

Once the cause is identified, the fix is to re-issue the certificates of the two “problem” vSAN hosts.
In vSphere Web Client, go to the “problem” vSAN host, Manage, Settings, Certificate

This view also shows that the host certificate was issued by the wrong host (the VC host)
Click Renew to request a new certificate
Caution: once I clicked the Renew button, the host HA agent restarted. Some VMs on the host failed over to the remaining hosts, although the VMs appeared to have no downtime.
Before renewing the certificate

After renewing the certificate

Once the host certificates are re-issued by the PSC, the vSAN Performance service status shows “Passed”

Conclusion

The cause of the vSAN Performance service “Hosts Not Contributing Stats” issue in my case was that the “problem” vSAN hosts had the wrong host certificate.
I don’t know how these “problem” hosts received the wrong host certificate.
When the vCSA uses an external PSC, the host certificates are issued by the PSC host.
Re-issuing or renewing the host certificate restarts the host HA agent, which can cause the VMs on the host to migrate to other hosts.

vCenter Server 6.5 Native High Availability Feature Summary

Available exclusively for vCenter Server Appliance (vCSA)
Consists of three nodes – active, passive, and witness
- Passive and Witness nodes are cloned from the existing vCSA (active node)
vCenter HA cluster can be enabled, disabled, or destroyed at any time
There is a maintenance mode to prevent planned maintenance from causing an unwanted failover
Uses two types of replication between the active and passive nodes
- Native PostgreSQL synchronous replication for the vCenter Server database
- Separate asynchronous file system replication for key data outside the database
Two vCenter HA deployment workflows
- Basic: all vCenter HA nodes are deployed within the same cluster
- Advanced: the active, passive, and witness nodes are deployed to different clusters
There is little benefit to using vCenter HA without also providing high availability at the Platform Services Controller layer
- An external Platform Services Controller instance is required when there are multiple vCenter Server instances in an Enhanced Linked Mode configuration.
Failover can occur when a host fails, or when certain key services fail
For the initial release of vCenter HA, the recovery time objective (RTO) is about 5 minutes

I already knew some of this information from testing vCenter HA in my lab. I highlighted the items I learned from this white paper.

Source: “What’s New in VMware vSphere 6.5” technical white paper
