Search This Blog

vSphere 6.5 New Feature – VMware Orchestrated Restart

Let me back to the old ESXi 3 day – when I was just using the standalone ESXi hosts or vCenter without HA and DRS. In case of the power outage or air conditioning failure in the data center, all the ESXi hosts were powered down. Once the environment problem was resolved, I could manage the VM startup sequence by configuring the switched PDU to start the hosts accordingly, and configuring the VM startup order at the host level.

However, once I deployed vCenter Server with HA and DRS, I lost the control of the VM startup order. Because the VMs could be hosted at any host in the cluster. Someone said that I should not worry about the VM startup order in the cluster. Because the ESXi cluster would never go down if I had designed the infrastructure with enough redundancy. As we all know, we never have enough redundancy in a small ESXi deployment.

I have been curious why VMware do not “fix” this issue for so long. Until now, vSphere 6.5 introduces the VMware Orchestrated Restart feature. At the high level, the Orchestrated Restart, likes the VM affinity and anti-affinity rules, put the VMs in different VM groups and set the startup dependence among the VM groups. To learn more about this, please go to “What is VMware Orchestrated Restart?”.

I am so glad to know about this new vSphere 6.5 feature – one more reason to upgrading to vSphere 6.5.

vSAN Performance Service “Hosts Not Contributing Stats” Fix

I have a four-host vSAN cluster running vSAN 6.2. Recently the vSAN health’s Performance service check shows two of the hosts not contributing stats.

vsan.host.not.contrubting.stats.01

The following are all the steps that I tried during troubleshooting and ultimately fixing the issue in my environment. Some of the steps do not fix my issue, however they may be applicable to your situation. PS. I opened a VMware support case on this issue. The support engineer did not directly solve my issue. However, he did give the hint on the cause of the issue that led me to discover the solution.

  1. Turn off and turn on the Performance Services in vSphere web client, vSAN cluster, Manage, Settings, Health and Performance.
  2. Turn off the Performance Services, restart the vSAN management agent “/etc/init.d/vsanmgmtd restart”, then restart the service.
  3. Place the vSAN host in the maintenance mode and restart the host.
  4. SSH to the vCenter server appliance, restart the vmware-vpxd service “service vmware-vpxd restart”.
  5. Verify the vSAN storage provider status of each vSAN host is online in vSphere web client, vCenter server, Manage, Storage Providers. If the host’s vSAN provider is offline, unregister the host’s storage provider and synchronize all vSAN storage providers. This brings the host’s vSAN storage provider back online.
    Caution: doing this can cause the VMs on the host to failover to other hosts in the cluster.
    vsan.host.not.contrubting.stats.02
  6. (I think this is to begin to lead me to the ultimate fix) Check the certificate info of each vSAN host in Storage Provider. They should be issued by the same Platform Service Controller (my vCenter is the vCSA wit the external PSC, instead of the embedded PSC). In my case, the certificate of the two “problem” vSAN hosts is issued by the VC host; the certificate of the “good” vSAN hosts is issued by the PSC host. I don’t know what the cause of these hosts having different certificate issuers, since I don’t have the history of how these PSC and VC were deployed.
    vsan.host.not.contrubting.stats.03
    vsan.host.not.contrubting.stats.04
  7. To further confirm the ESXi host certificate is the problem
    1. Login vCenter server as “administrator@vsphere.local’
    2. Home, Administration, Deployment, System Configuration, Nodes, PSC node, Manage, Certificate Authority (if selecting VC node, there is no Certificate Authority tab under Manage)
    3. Enter the password of “administrator@vsphere.local” again
    4. Active Certificate, all the ESXi hosts are listed, except the two “problem” vSAN hosts
    5. It makes sense why the certificates of the two “problem” vSAN hosts are missing here, because they are issued by the VC host, not the PSC host. But it does not make sense how they received the “problem” certificate since there is no Certificate Authority on the VC host.
      vsan.host.not.contrubting.stats.05
  8. Once the cause is identified, the fix is to re-issue the certificate to the two “problem” vSAN hosts.
  9. In vSphere web client, the “problem” vSAN host, Manage, Settings, Certificate
    1. Here is also showed the host certificate issuing by the wrong host (the VC host)
    2. Click Renew to request a new certificate
    3. Caution: Once clicking the Renew button, the host HA agent was restarted. Some VMs on the host failed over to the remaining hosts, even the VMs seem no downtime.
      Before renewing the certificate
      vsan.host.not.contrubting.stats.06
      After renewing the certificate
      vsan.host.not.contrubting.stats.07
  10. Once the host certificates are re-issued by the PSC, the vSAN Performance service status is showed “Passed”
    vsan.host.not.contrubting.stats.08

Conclusion

  • The cause of the vSAN Performance service “Host Not Contributing Stats” in my case is the “problem” vSAN host having the wrong host certificate.
  • I don’t know how these “problem” hosts received the wrong host certificate.
  • When the vCSA with the external PSC, the host certificate is issued by the PSC host.
  • Re-issuing or renewing the host certificate will restart the host HA agent. It can cause the VMs on the host migrating to other hosts.

vCenter Server 6.5 Native High Availability Feature Summary

  • Available exclusively for vCenter Server Appliance (vCSA)
  • Consist of three nodes – active, passive, and witness nodes
    • Passive and Witness nodes are cloned from the existing vCSA (active node)
  • vCenter HA cluster can be enabled, disabled, or destroyed at any time
  • There is a maintenance mode to prevent planned maintenance from causing an unwanted failover
  • Use two types of replication between active and passive nodes
    • Native PostgreSQL synchronous replication for the vCenter Server database
    • A separated asynchronous file system replication for key data outside the database
  • Two vCenter HA deployment workflows
    • Basic: all vCenter HA nodes are deployed within the same cluster
    • Advanced: the active, passive, and witness nodes are deployed to different clusters
  • There is little benefit to using vCenter HA without also providing high availability at the Platform Service Controller layer
    • An external Platform Services Controller instance is required when there are multiple vCenter Server instances in an Enhanced Linked Mode configuration.
  • Failover can occur when a host failure, or when certain key services fail
  • For the initial release of vCenter HA, a recovery time objective (RTO) is about 5 minutes

I have already known about some of these information when testing vCenter HA in my lab. I highlighted the ones I learned from this white paper.

Source: “What’s New in VMware vSphere”" 6.5” technical white paper

New Year Resolution - Improve Productivity

Here is my another new year resolution in 2017 - improve productivity (see my previous 2017 new year resolution here. The source of these ideas are from http://www.businessinsider.com/bad-habits-that-killing-productivity-2016-12.

  • Get out of the bed when the alarm clock buzzes
  • Get enough sleep
  • Do not keep the tablet next to the bed. I keep the smartphone next to the bed as my alarm clock
  • Do not skip breakfast and drink some hot tea before going to the toilet in the morning
  • Complete the hardest and most important tasks at the beginning of the day
  • Do not check email throughout the day, especially in the middle of the night. When wake up in the morning, only check if there is missing call or text message. Do not read the email until later of the day
  • Do not eat junk food or eat less junk food
  • Focus on 3 ~ 5 of the most important goals and ignore the rest
  • Do not sit all day and walk 50,000 steps in a week
  • Do not multitask
  • Do not skip the workout
  • Do not look up the answer of a random question that just popped into your head. Write it down and search later
  • Do not overplan the schedule, instead plan for 4 ~ 5 hours of read work each day
  • Do not underplan
  • Do not accept a meeting unless the person who requested it has put forth a clear agenda and stated exactly how much time they will need
  • Abandon perfectionism

Lessons from Security Breaches

Here are my short summary of the article “Learning From A Year of Security Breaches” that are applicable to most of work environments.

  • Centralize logs, including host, application, authentication, and infrastructure, into as few system as possible; make critical logs alertable; but be aware of user privacy in what you log
  • You might not find the root cause of a beach because of weakness in the environment, systems or people; practicing incident response can indentify these weakness
  • Attackers will target employee’s home, personal email, or device to breach the corporate security; Educate your employees to improve their security practices and involve the corporate security team even if they have personal security issues
  • Avoid putting secrets and keys into source code
  • Protect employees’ credential by integrating Single Sing On or Multi Factor Authentication
  • Be aware of insider threats
  • Measure and eliminate the security debt - cutting corners for fast growth

First Day Result of Improving Sleep Quality

Here is the first day result of following my 2017 new year resolution - improve sleep quality

  • Went to bed at 10:2x p.m. According to my Fitbit, slept at 10:34 p.m. and woke up at 3:53 a.m., and time asleep 5 hours and 9 minutes with 10 minutes restless. I know I woke up at 2:19 a.m. and 3:24 a.m. to check the time on my Fitbit. Then I am fully awake from a dream at 3:24 a.m.
  • Did not drink any sola the whole day
  • Drank half glass of water before going to bed
  • Did not exercise the whole day, and walked 7,578 steps
  • Shut down the computer at 10:1x p.m., did not read on the phone and tablet in bed
  • Kept the phone next to the bed as the alarm clock. When waking up, I checked if there is any phone call or text message (the anwers is no); I did not read any email even there are some on the phone.

Conclusion

  • Going to bed at 10 p.m. may be too early for me. I may move to 10:30 p.m.
  • When wake up, check for phone call or text message only. Do not read any email until 6 a.m. or the start of a normal day

New Year Resolution - Improve Sleep Quality

To improve the sleep quality in 2017, I have some plans (the source of these ideas from https://medium.com/personal-growth/how-to-wake-up-early-your-ultimate-blueprint-1f8bb2045b90)

  • Sleep at least 6 hours and 30 minutes each day. I will go to bed at 10 p.m. except Saturday.
  • Do not drink sola after 4 p.m. or 6 hours before bedtime. I do not drink alcohol and rarely drink coffee.
  • Drink half glass of water before going to bed
  • Do not exercise after 7 p.m. or 3 hours before bedtime
  • Do not read or watch on the smartphone, tablet, or computer after 9 p.m. or 1 hour before bedtime. This one will be difficult.
  • Do not put the smartphone or tablet next to the bed

Use WinSCP to Transfer Files in vCSA 6.7

This is a quick update on my previous post “ Use WinSCP to Transfer Files in vCSA 6.5 ”. When I try the same SFTP server setting in vCSA 6.7...