vSAN Performance Service “Hosts Not Contributing Stats” Fix

I have a four-host vSAN cluster running vSAN 6.2. Recently, the Performance service check in vSAN Health started showing two of the hosts as not contributing stats.

vsan.host.not.contrubting.stats.01

The following are all the steps I tried while troubleshooting and ultimately fixing the issue in my environment. Some of the steps did not fix my issue, but they may be applicable to your situation. PS: I opened a VMware support case on this issue. The support engineer did not directly solve it, but he did give me a hint about the cause that led me to the solution.

  1. Turn the Performance Service off and on again in the vSphere Web Client: vSAN cluster, Manage, Settings, Health and Performance.
  2. Turn off the Performance Service, restart the vSAN management agent with “/etc/init.d/vsanmgmtd restart”, then turn the Performance Service back on.
  3. Place the vSAN host in maintenance mode and restart the host.
  4. SSH to the vCenter Server Appliance and restart the vmware-vpxd service with “service vmware-vpxd restart”.
  5. Verify that the vSAN storage provider of each vSAN host is online in the vSphere Web Client: vCenter server, Manage, Storage Providers. If a host’s vSAN provider is offline, unregister that host’s storage provider and synchronize all vSAN storage providers. This brings the host’s vSAN storage provider back online.
    Caution: doing this can cause the VMs on the host to fail over to other hosts in the cluster.
    vsan.host.not.contrubting.stats.02
  6. (I think this is where I began to be led to the ultimate fix.) Check the certificate info of each vSAN host under Storage Providers. They should all be issued by the same Platform Services Controller (my vCenter is a vCSA with an external PSC, not an embedded PSC). In my case, the certificates of the two “problem” vSAN hosts were issued by the VC host; the certificates of the “good” vSAN hosts were issued by the PSC host. I don’t know why these hosts have different certificate issuers, since I don’t have the history of how the PSC and VC were deployed.
    vsan.host.not.contrubting.stats.03
    vsan.host.not.contrubting.stats.04
  7. To further confirm that the ESXi host certificate is the problem:
    1. Log in to the vCenter server as “administrator@vsphere.local”
    2. Go to Home, Administration, Deployment, System Configuration, Nodes, PSC node, Manage, Certificate Authority (if you select the VC node, there is no Certificate Authority tab under Manage)
    3. Enter the password of “administrator@vsphere.local” again
    4. Under Active Certificate, all the ESXi hosts are listed except the two “problem” vSAN hosts
    5. It makes sense that the certificates of the two “problem” vSAN hosts are missing here, because they were issued by the VC host, not the PSC host. But it does not make sense how they received the “problem” certificates, since there is no Certificate Authority on the VC host.
      vsan.host.not.contrubting.stats.05
  8. Once the cause is identified, the fix is to re-issue the certificates of the two “problem” vSAN hosts.
  9. In the vSphere Web Client, select the “problem” vSAN host, then Manage, Settings, Certificate
    1. This page also shows the host certificate issued by the wrong host (the VC host)
    2. Click Renew to request a new certificate
    3. Caution: after I clicked the Renew button, the host HA agent restarted. Some VMs on the host failed over to the remaining hosts, even though the VMs seemed to have no downtime.
      Before renewing the certificate
      vsan.host.not.contrubting.stats.06
      After renewing the certificate
      vsan.host.not.contrubting.stats.07
  10. Once the host certificates are re-issued by the PSC, the vSAN Performance service status shows “Passed”
    vsan.host.not.contrubting.stats.08
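
The issuer comparison in steps 6 to 9 can also be sketched from a shell. This is a minimal sketch, not the method I used: it assumes openssl is installed, that each host is reachable on port 443, and that the certificate served there is the same host certificate shown in the web client. The function names and the host names in the usage comment (esx01 and so on) are hypothetical.

```shell
#!/bin/sh
# Print the issuer line of the certificate a host presents on port 443.
# Assumption: openssl is available and the host answers on 443.
cert_issuer() {
    openssl s_client -connect "$1:443" </dev/null 2>/dev/null |
        openssl x509 -noout -issuer
}

# Read "host<TAB>issuer" lines on stdin and print every host whose
# issuer differs from the expected issuer given as the first argument.
hosts_with_other_issuer() {
    want=$1
    while IFS="$(printf '\t')" read -r host issuer; do
        [ "$issuer" = "$want" ] || printf '%s\n' "$host"
    done
}

# Usage (hypothetical host names, esx01 taken as the known-good host):
#   good=$(cert_issuer esx01)
#   for h in esx01 esx02 esx03 esx04; do
#       printf '%s\t%s\n' "$h" "$(cert_issuer "$h")"
#   done | hosts_with_other_issuer "$good"
```

Any host printed by the last command has a certificate from a different issuer than the reference host and is a candidate for the Renew step.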

Conclusion

  • The cause of the vSAN Performance service “Hosts Not Contributing Stats” warning in my case was that the “problem” vSAN hosts had the wrong host certificate.
  • I don’t know how these “problem” hosts received the wrong host certificate.
  • When the vCSA uses an external PSC, the host certificate is issued by the PSC host.
  • Re-issuing or renewing the host certificate restarts the host HA agent. This can cause the VMs on the host to migrate to other hosts.

vCenter Server 6.5 Native High Availability Feature Summary

  • Available exclusively for vCenter Server Appliance (vCSA)
  • Consists of three nodes – active, passive, and witness
    • Passive and Witness nodes are cloned from the existing vCSA (active node)
  • vCenter HA cluster can be enabled, disabled, or destroyed at any time
  • There is a maintenance mode to prevent planned maintenance from causing an unwanted failover
  • Uses two types of replication between the active and passive nodes
    • Native PostgreSQL synchronous replication for the vCenter Server database
    • A separate asynchronous file system replication for key data outside the database
  • Two vCenter HA deployment workflows
    • Basic: all vCenter HA nodes are deployed within the same cluster
    • Advanced: the active, passive, and witness nodes are deployed to different clusters
  • There is little benefit to using vCenter HA without also providing high availability at the Platform Service Controller layer
    • An external Platform Services Controller instance is required when there are multiple vCenter Server instances in an Enhanced Linked Mode configuration.
  • Failover can occur when a host fails, or when certain key services fail
  • For the initial release of vCenter HA, the recovery time objective (RTO) is about 5 minutes

I already knew some of this information from testing vCenter HA in my lab. I highlighted the ones I learned from this white paper.

Source: “What’s New in VMware vSphere 6.5” technical white paper

New Year Resolution - Improve Productivity

Here is another of my New Year resolutions for 2017: improve productivity (see my previous 2017 New Year resolution here). The source of these ideas is http://www.businessinsider.com/bad-habits-that-killing-productivity-2016-12.

  • Get out of bed when the alarm clock buzzes
  • Get enough sleep
  • Do not keep the tablet next to the bed. I keep the smartphone next to the bed as my alarm clock
  • Do not skip breakfast and drink some hot tea before going to the toilet in the morning
  • Complete the hardest and most important tasks at the beginning of the day
  • Do not check email throughout the day, especially in the middle of the night. When waking up in the morning, only check for missed calls or text messages. Do not read email until later in the day
  • Do not eat junk food or eat less junk food
  • Focus on 3 ~ 5 of the most important goals and ignore the rest
  • Do not sit all day and walk 50,000 steps in a week
  • Do not multitask
  • Do not skip the workout
  • Do not look up the answer to a random question that just popped into your head. Write it down and search later
  • Do not overplan the schedule; instead, plan for 4 ~ 5 hours of real work each day
  • Do not underplan
  • Do not accept a meeting unless the person who requested it has put forth a clear agenda and stated exactly how much time they will need
  • Abandon perfectionism

Lessons from Security Breaches

Here is my short summary of the points from the article “Learning From A Year of Security Breaches” that are applicable to most work environments.

  • Centralize logs, including host, application, authentication, and infrastructure logs, into as few systems as possible; make critical logs alertable; but be aware of user privacy in what you log
  • You might not find the root cause of a breach because of weaknesses in the environment, systems, or people; practicing incident response can identify these weaknesses
  • Attackers will target an employee’s home, personal email, or devices to breach corporate security; educate your employees to improve their security practices, and involve the corporate security team even when they have personal security issues
  • Avoid putting secrets and keys into source code
  • Protect employees’ credentials by integrating Single Sign-On or Multi-Factor Authentication
  • Be aware of insider threats
  • Measure and eliminate security debt, the result of cutting corners for fast growth

First Day Result of Improving Sleep Quality

Here is the first-day result of following my 2017 New Year resolution - improve sleep quality

  • Went to bed at 10:2x p.m. According to my Fitbit, I fell asleep at 10:34 p.m. and woke up at 3:53 a.m., for a time asleep of 5 hours and 9 minutes with 10 minutes restless. I know I woke up at 2:19 a.m. and 3:24 a.m. to check the time on my Fitbit. Then I was fully awake from a dream at 3:24 a.m.
  • Did not drink any soda the whole day
  • Drank half a glass of water before going to bed
  • Did not exercise the whole day, and walked 7,578 steps
  • Shut down the computer at 10:1x p.m., and did not read on the phone or tablet in bed
  • Kept the phone next to the bed as the alarm clock. When I woke up, I checked whether there was any phone call or text message (the answer was no); I did not read any email even though there was some on the phone.

Conclusion

  • Going to bed at 10 p.m. may be too early for me. I may move to 10:30 p.m.
  • When waking up, check for phone calls or text messages only. Do not read any email until 6 a.m. or the start of a normal day

New Year Resolution - Improve Sleep Quality

To improve my sleep quality in 2017, I have some plans (the source of these ideas is https://medium.com/personal-growth/how-to-wake-up-early-your-ultimate-blueprint-1f8bb2045b90)

  • Sleep at least 6 hours and 30 minutes each day. I will go to bed at 10 p.m. except Saturday.
  • Do not drink soda after 4 p.m. or 6 hours before bedtime. I do not drink alcohol and rarely drink coffee.
  • Drink half a glass of water before going to bed
  • Do not exercise after 7 p.m. or 3 hours before bedtime
  • Do not read or watch on the smartphone, tablet, or computer after 9 p.m. or 1 hour before bedtime. This one will be difficult.
  • Do not put the smartphone or tablet next to the bed

Configuring VCSA 6.5 Backup Lessons Learned

vCenter Server Appliance (vCSA) 6.5 comes with built-in backup functionality. Starting a backup is quite easy: log in to the vCSA web console and click the Backup button on the Summary page (see this post for step-by-step screenshots).
Even though it looks like a very simple task, I learned a few lessons when configuring the vCSA backup.
Lesson #1: vCSA backup location is <host_name>/<folder_name>
If using the FTP protocol, the backup location is not just the FTP server host name or IP address; it MUST include the folder name, with a “/” between the host name and the folder name.
Otherwise, the error message is “FTP location is invalid”.
vCSA.Backup.FTP.Location.Is.Invalid
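
As a small illustration of Lesson #1, the location string can be assembled and sanity-checked before pasting it into the wizard. This is only a sketch: the function name backup_location and the example host and folder are hypothetical, and the wizard may apply checks beyond the "host/folder" shape described above.

```shell
#!/bin/sh
# Join an FTP host name and a folder name into "<host_name>/<folder_name>".
# Fails when either part is missing, which mirrors the situation that
# produced "FTP location is invalid" in the post.
backup_location() {
    host=${1%/}      # drop one trailing "/" from the host, if present
    folder=${2#/}    # drop one leading "/" from the folder, if present
    if [ -z "$host" ] || [ -z "$folder" ]; then
        echo "both a host name and a folder name are required" >&2
        return 1
    fi
    printf '%s/%s\n' "$host" "$folder"
}

# Usage (hypothetical names):
#   backup_location ftp.example.com vcsa-backup
```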
Lesson #2: vCSA backup supports the FTP virtual host name when the username is entered as <ftp virtual hostname>|<ftp username>
See my Lesson #2 in “Setting Up IIS 8 FTP Server Lessons Learned” about the FTP virtual host name login. There is a “|” between the hostname and username.
Otherwise, the error message is “Access to the remote server is denied. Check your credentials and permissions”.
vCSA.Backup.Access.to.The.Remote.Server.Is.Denied
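
The login format from Lesson #2 can be sketched as a tiny helper. The function name and the example values are hypothetical. The only thing taken from the post is the "|" separator between the virtual host name and the username, and the fact that a plain username is enough when the FTP site has no virtual host name.

```shell
#!/bin/sh
# Build the FTP login name: "<virtual hostname>|<username>" when a
# virtual host name is given, or just "<username>" when it is empty.
ftp_login_name() {
    if [ -n "$1" ]; then
        printf '%s|%s\n' "$1" "$2"
    else
        printf '%s\n' "$2"
    fi
}

# Usage (hypothetical names). Keep the result quoted so the shell does not
# treat "|" as a pipe; curl prompts for the password when none is given:
#   curl -u "$(ftp_login_name ftp.example.com backupuser)" -l "ftp://<ftp server>/<folder>/"
```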
Lesson #3: Use curl to troubleshoot vCSA backup error
After I entered the correct settings, the vCSA backup wizard validated the settings and started the backup. The backup failed with “BackupManager encountered an exception. Please check logs for details”, but the message does not provide much detail or the location of the log file.
vCSA.Backup.BackupManager.Encountered.An.Exception
After some digging, I found the backup log file at /var/log/vmware/applmgmt/backup.log. In the log file, there is a curl error “Connection time-out”.
vCSA.Backup.Backup.log
This gave me a hint that vCSA backup uses curl to transfer the backup file from the vCSA to the FTP location. I have recently been learning to transfer files with curl, so I’m a little familiar with it. (I will publish what I learn about curl in a future post.)
From the vCSA console, enter “curl -u <ftp user>:<password> -l <ftp server>”. It should list the files and directories on the FTP server, but I got the timeout error. I also tried running curl on a Windows computer and got the timeout error too. This led me to think the problem was on the FTP server. In the end, the fix was to restart the FTP service (see Lesson #1 in “Setting Up IIS 8 FTP Server Lessons Learned”).
I am not sure why the wizard was able to validate the FTP server settings successfully when the FTP server connection was blocked by the Windows Firewall. While troubleshooting the Windows Firewall, I thought I could connect to the FTP site with the FTP command while curl would fail. I’m not 100% sure about this, since I can’t reproduce the issue. After I restarted the Microsoft FTP service, everything worked.
Anyway, curl is the best tool for troubleshooting a vCSA backup failure.
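
The troubleshooting flow in Lesson #3 can be wrapped into two small shell helpers. This is a sketch under a few assumptions: the function names are hypothetical, the 10-second connect timeout is an arbitrary choice so a blocked connection fails quickly, and the FTP server accepts a folder listing at the given URL. The log path and the -u and -l options come from the post.

```shell
#!/bin/sh
# Show the curl-related lines in the vCSA backup log.
# An alternative log path can be passed as the first argument.
backup_curl_errors() {
    grep -i 'curl' "${1:-/var/log/vmware/applmgmt/backup.log}"
}

# List a folder on the FTP server, as in the post, but give up after
# 10 seconds if the connection cannot be established.
# Arguments: server, folder, username (curl prompts for the password).
ftp_list() {
    curl --connect-timeout 10 -sS -u "$3" -l "ftp://$1/$2/"
}

# Usage (placeholders as in the post):
#   backup_curl_errors
#   ftp_list <ftp server> <folder_name> <ftp user>
```

If ftp_list times out from both the vCSA and another machine, the problem is more likely on the FTP server or a firewall in between than in the vCSA backup settings.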
Lesson #4: vCSA backup location must be an empty folder
After successfully running a backup, I tried running the backup one more time with the same settings and got the following error. (PS: In the screenshot below, I removed the virtual hostname on the FTP site, so I can just use the username.)
vCSA.Backup.Location.Folder.Is.Not.Empty
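
Since the backup location must be an empty folder, it can help to check the folder before starting a backup. A minimal sketch, assuming the folder listing comes from curl's -l option as in Lesson #3; the function name is hypothetical, and ignoring "." and ".." entries is my assumption about what some FTP servers return.

```shell
#!/bin/sh
# Succeed (exit status 0) when the listing on stdin has no entries.
# Blank lines and the "." and ".." entries are ignored; trailing
# carriage returns from the FTP listing are tolerated.
listing_is_empty() {
    if grep -qvE '^\.?\.?[[:space:]]*$'; then
        return 1    # at least one real entry was found
    fi
    return 0
}

# Usage (placeholders as in the post):
#   curl -sS -u <ftp user> -l "ftp://<ftp server>/<folder_name>/" | listing_is_empty \
#       && echo "folder is empty" || echo "folder is not empty, choose another folder"
```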

Use WinSCP to Transfer Files in vCSA 6.7

This is a quick update on my previous post “Use WinSCP to Transfer Files in vCSA 6.5”. When I try the same SFTP server setting in vCSA 6.7...