VCSA 6.5 “The appliance management service is not running” Fix

Scenario

In vSphere Web Client 6.5, under Home, Administration, Deployment/System Configuration, Nodes, the vCenter Server node shows the error message “The appliance management service is not running”. A second error message also appears in the web client: “HTTP response with status code 503, 503 Service Unavailable (Failed to connect to endpoint: _serverNamespace = /vmonapi action =Allow _port = 8900”.

Troubleshooting

  • Log in to the VMware Appliance Management UI (https://psc:5480 or https://vc:5480). All the health statuses are good.
  • SSH to the VC appliance and check the service status (KB2109887)
    • # service-control --list
    • # service-control --status
    • applmgmt (VMware Appliance Management Service) is running
    • vmonapi (VMware Service Lifecycle Manager API) is not running

Solution

  • Start the vmonapi service, or start all services
    • # service-control --start vmonapi
    • # service-control --start --all
  • PS: if you start all the services, it may take some time before they all turn back to Good (green) on the node’s Summary page. For example, the VMware Performance Charts service took more than 30 minutes to change from Warning to Unknown and then to Good.
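
The check-then-start sequence above could be scripted roughly as follows. This is only a minimal sketch: it assumes that `service-control --status` prints a `Running:` heading and a `Stopped:` heading, each followed by a line of space-separated service names, and the function name `stopped_services` is hypothetical. Confirm the output format on your own appliance before relying on it.

```shell
#!/bin/sh
# Print the service names listed under the "Stopped:" heading of
# `service-control --status` output read from stdin, one name per line.
# Assumption: the output uses "Running:" / "Stopped:" style headings,
# each followed by space-separated service names.
stopped_services() {
    awk '
        /^Stopped:/   { in_stopped = 1; next }   # start of the stopped list
        /^[A-Za-z]+:/ { in_stopped = 0; next }   # any other heading ends it
        in_stopped    { for (i = 1; i <= NF; i++) print $i }
    '
}

# Usage on the appliance (shown as a comment, not executed here):
#   if service-control --status 2>&1 | stopped_services | grep -qx vmonapi; then
#       service-control --start vmonapi
#   fi
```

Starting only vmonapi avoids the long wait for every service to return to Good that starting all services can cause.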

VCSA 6.5 Syslog vs vRLI’s vSphere Integration

I wrote this post after reading William Lam’s “What logs do I get when I enable syslog in VCSA 6.5?” and running some experiments of my own on my VCSA 6.5 and vRLI 4.5 setup.

Background

Recently I completed a fresh VCSA 6.5 deployment (external PSC and VC) together with vRealize Operations Manager (vROPS) 6.6 and vRealize Log Insight (vRLI) 4.5. In vROPS, I configured the vSphere and vRLI solutions; in vRLI, I configured the vSphere and vROPS integrations. I thought the setup was complete until I read William’s blog post.

Confusion

There is a lot of information in his blog post. I was a little lost at the beginning, and I wondered: should I configure VCSA syslog to send to vRLI? Is that the same as vRLI’s vSphere integration? Had I read his post more carefully, I would have found the answer there, but I didn’t fully understand it until I did my own experiment. Here is the quote. I highlighted a few key points.

I personally think the vSphere Integration is a nice solution if you have both Windows vCenter Server and the VCSA and to be able to get data consistency between the two platforms from a logging standpoint. It is definitely useful if you need to quickly enable all ESXi hosts connected to the vCenter Server and have them remotely syslog to the vRLI instance. If you only have the VCSA, you would get more information by configuring the remote syslog capability in VCSA rather than using the vSphere integration feature of vRLI. This especially true if you need the vpxd.log which is generally required for troubleshooting and debugging vCenter Server issues when calling into VMware Support. The other added benefit to using the VCSA option is that structure log entries are processed directly on the VCSA rather than having to be remotely queried via the vSphere APIs, processed and then store in vRLI which would add additional load onto vRLI, especially if you need to configure additional vCenter Server instances.

Summary

Here is a summary based on my understanding of this topic. Please refer to his blog for the full details.

  • VCSA 6.5 has new remote syslog functionality compared to VCSA 6.0. This function is not available in Windows vCenter Server 6.5
  • VCSA 6.5’s remote syslog configuration is in the VAMI UI (https://[VCSA]:5480). The setting is available on both the PSC and the VC in an external deployment. See “Logs forwarded by VCSA Deployment Type” in William’s post for the logs forwarded by each VCSA deployment type
  • VCSA 6.0’s remote syslog configuration is in vCenter, via the vSphere Web Client
  • VCSA 6.5 has a new Enhanced Logging feature (see William’s blog for what “enhanced” means; see my screenshots in this post for an example)
  • After vRLI’s vSphere integration is completed, “enable streaming of events to syslog” is enabled (vSphere Web Client, vCenter, Configure, Advanced Settings, vpxd.event.syslog.enabled). This setting is mentioned in another person’s blog. I am not sure what the default VCSA setting is; I note it here for reference only
  • VCSA 6.5 remote syslog is not configured even after completing the vSphere integration in vRLI
  • VCSA 6.5 remote syslog is “pushing” the logs to vRLI
  • vRLI’s vSphere integration is “pulling” the logs from VCSA (via vSphere API). This supports both a Windows vCenter Server and VCSA.
  • vRLI’s vSphere integration can also automatically configure the ESXi hosts connected to the vCenter Server and have them remotely syslog to vRLI. (vSphere Web Client, ESXi host, Configure, System/Advanced System Settings, Syslog.global.logHost)
  • By default, the vCenter Server log (vpxd.log) is not forwarded to a remote syslog server. Enabling it is recommended for troubleshooting purposes. (vSphere Web Client, vCenter, Configure, Advanced Settings, config.log.outputToSyslog; then restart the vCenter Server service in System Configuration, Services, VMware vCenter Server)
  • Other VCSA 6.5 logs can be forwarded to a remote syslog server, but this is not supported by VMware. See the link at the end of William’s post for more details
  • This is the most important and useful point I have learned: VCSA 6.5 remote syslog sends more information to vRLI than vRLI’s vSphere integration does. I think this is what Enhanced Logging means. See my screenshots below. For example, I modified the Tools Upgrades option on a VM.
    • Without VCSA remote syslog configured, vRLI has one entry in the log. It shows that the toolsUpgradePolicy of the VM (name highlighted in yellow) changed from “manual” to “upgradeAtPowerCycle”.
      vRLI.log.without.VCSA.remote.log.enabled
    • With VCSA remote syslog configured, vRLI has two entries in the log. In addition to the regular entry, the second entry shows the name of the user who made the change (highlighted in the red box).
      vRLI.log.with.VCSA.remote.log.enabled
  • My recommendation is to configure both vRLI’s vSphere integration (to automate configuring the ESXi log host) and VCSA remote syslog (for the enhanced logging). This duplicates some log entries in vRLI and consumes more vRLI log storage, but it is well worth it!
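
Before comparing what the “push” path delivers, it can help to confirm that vRLI is reachable as a syslog target at all. The sketch below sends one test message with the util-linux `logger` command so it can be searched for in vRLI. It makes several assumptions: that `logger` is available on the machine you run it from, that vRLI listens for UDP syslog on port 514, and that the host name `vrli.example.com` is a placeholder. It does not go through the VCSA forwarding configuration in the VAMI UI; it only checks the receiving side. The function name and the `LOGGER_CMD` dry-run variable are hypothetical.

```shell
#!/bin/sh
# Send one test message over UDP syslog to a remote log server.
# Assumptions: util-linux `logger` is installed; host and port are
# placeholders for your own vRLI instance.
# Set LOGGER_CMD=echo for a dry run that only prints the arguments.
send_test_syslog() {
    host=$1
    port=${2:-514}
    msg=${3:-remote syslog test}
    ${LOGGER_CMD:-logger} --server "$host" --port "$port" --udp \
        --tag vcsa-syslog-test "$msg"
}

# Usage (shown as a comment, not executed here):
#   send_test_syslog vrli.example.com 514 "hello from my workstation"
```

After running it, searching vRLI for the tag `vcsa-syslog-test` should show the message if the listener is reachable.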

vSphere 6.5 New Feature – VMware Orchestrated Restart

Let me go back to the old ESXi 3 days, when I was using only standalone ESXi hosts, or vCenter without HA and DRS. After a power outage or air conditioning failure in the data center, all the ESXi hosts were powered down. Once the environmental problem was resolved, I could manage the VM startup sequence by configuring the switched PDU to start the hosts in order and by configuring the VM startup order at the host level.

However, once I deployed vCenter Server with HA and DRS, I lost control of the VM startup order, because a VM could be running on any host in the cluster. Someone said that I should not worry about the VM startup order in a cluster, because the ESXi cluster would never go down if I had designed the infrastructure with enough redundancy. As we all know, we never have enough redundancy in a small ESXi deployment.

I have long been curious why VMware did not “fix” this issue. Now vSphere 6.5 introduces the VMware Orchestrated Restart feature. At a high level, Orchestrated Restart, like the VM affinity and anti-affinity rules, puts the VMs in different VM groups and sets the startup dependencies among those groups. To learn more about this, please go to “What is VMware Orchestrated Restart?”.

I am so glad to know about this new vSphere 6.5 feature: one more reason to upgrade to vSphere 6.5.

vSAN Performance Service “Hosts Not Contributing Stats” Fix

I have a four-host vSAN cluster running vSAN 6.2. Recently the Performance service check in vSAN health showed two of the hosts not contributing stats.

vsan.host.not.contrubting.stats.01

The following are all the steps that I tried while troubleshooting and ultimately fixing the issue in my environment. Some of the steps did not fix my issue, but they may be applicable to your situation. PS: I opened a VMware support case on this issue. The support engineer did not directly solve it, but he did give me a hint about the cause that led me to the solution.

  1. Turn the Performance Service off and on again in vSphere web client, vSAN cluster, Manage, Settings, Health and Performance.
  2. Turn off the Performance Service, restart the vSAN management agent with “/etc/init.d/vsanmgmtd restart”, then turn the Performance Service back on.
  3. Place the vSAN host in maintenance mode and restart the host.
  4. SSH to the vCenter server appliance and restart the vmware-vpxd service with “service vmware-vpxd restart”.
  5. Verify that the vSAN storage provider status of each vSAN host is online in vSphere web client, vCenter server, Manage, Storage Providers. If a host’s vSAN provider is offline, unregister the host’s storage provider and synchronize all vSAN storage providers. This brings the host’s vSAN storage provider back online.
    Caution: doing this can cause the VMs on the host to fail over to other hosts in the cluster.
    vsan.host.not.contrubting.stats.02
  6. (I think this is the step that began to lead me to the ultimate fix.) Check the certificate info of each vSAN host in Storage Providers. The certificates should all be issued by the same Platform Services Controller (my vCenter is a vCSA with an external PSC, not an embedded PSC). In my case, the certificates of the two “problem” vSAN hosts were issued by the VC host; the certificates of the “good” vSAN hosts were issued by the PSC host. I don’t know why these hosts have different certificate issuers, since I don’t have the history of how the PSC and VC were deployed.
    vsan.host.not.contrubting.stats.03
    vsan.host.not.contrubting.stats.04
  7. To further confirm that the ESXi host certificate is the problem:
    1. Log in to vCenter server as “administrator@vsphere.local”
    2. Home, Administration, Deployment, System Configuration, Nodes, PSC node, Manage, Certificate Authority (if you select the VC node, there is no Certificate Authority tab under Manage)
    3. Enter the password of “administrator@vsphere.local” again
    4. Under Active Certificates, all the ESXi hosts are listed except the two “problem” vSAN hosts
    5. It makes sense that the certificates of the two “problem” vSAN hosts are missing here, because they were issued by the VC host, not the PSC host. But it does not make sense how they received the “problem” certificates, since there is no Certificate Authority on the VC host.
      vsan.host.not.contrubting.stats.05
  8. Once the cause is identified, the fix is to re-issue the certificates of the two “problem” vSAN hosts.
  9. In vSphere web client, go to the “problem” vSAN host, Manage, Settings, Certificate
    1. This page also shows that the host certificate was issued by the wrong host (the VC host)
    2. Click Renew to request a new certificate
    3. Caution: once I clicked the Renew button, the host HA agent was restarted. Some VMs on the host failed over to the remaining hosts, although the VMs seemed to have no downtime.
      Before renewing the certificate
      vsan.host.not.contrubting.stats.06
      After renewing the certificate
      vsan.host.not.contrubting.stats.07
  10. Once the host certificates are re-issued by the PSC, the vSAN Performance service status shows “Passed”
    vsan.host.not.contrubting.stats.08
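
The issuer comparison in step 6 can also be done from a shell instead of clicking through each host. The following is a minimal sketch, not the method used above: it assumes `openssl` is installed, that each ESXi host serves on port 443 the same certificate that the Storage Providers view displays, and that the host names are placeholders. Both function names are hypothetical.

```shell
#!/bin/sh
# Print the issuer line of the TLS certificate a host serves on port 443.
# Assumption: openssl is installed and the host is reachable.
cert_issuer() {
    openssl s_client -connect "$1:443" </dev/null 2>/dev/null |
        openssl x509 -noout -issuer
}

# Read "host<TAB>issuer" lines on stdin and print the hosts whose issuer
# differs from the expected issuer passed as the first argument.
hosts_with_other_issuer() {
    WANT="$1" awk -F '\t' '$2 != ENVIRON["WANT"] { print $1 }'
}

# Usage (shown as a comment, not executed here; host names are placeholders):
#   want=$(cert_issuer esx01.example.com)      # a known "good" host
#   for h in esx01.example.com esx02.example.com esx03.example.com; do
#       printf '%s\t%s\n' "$h" "$(cert_issuer "$h")"
#   done | hosts_with_other_issuer "$want"
```

Any host printed by the loop has a certificate from a different issuer than the known good host and is a candidate for the Renew action in step 9.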

Conclusion

  • In my case, the cause of the vSAN Performance service “Hosts Not Contributing Stats” error was that the “problem” vSAN hosts had the wrong host certificate.
  • I don’t know how these “problem” hosts received the wrong host certificate.
  • When the vCSA uses an external PSC, the host certificate is issued by the PSC host.
  • Re-issuing or renewing the host certificate restarts the host HA agent. This can cause the VMs on the host to migrate to other hosts.

vCenter Server 6.5 Native High Availability Feature Summary

  • Available exclusively for vCenter Server Appliance (vCSA)
  • Consists of three nodes – active, passive, and witness nodes
    • Passive and Witness nodes are cloned from the existing vCSA (active node)
  • vCenter HA cluster can be enabled, disabled, or destroyed at any time
  • There is a maintenance mode to prevent planned maintenance from causing an unwanted failover
  • Uses two types of replication between the active and passive nodes
    • Native PostgreSQL synchronous replication for the vCenter Server database
    • A separate asynchronous file system replication for key data outside the database
  • Two vCenter HA deployment workflows
    • Basic: all vCenter HA nodes are deployed within the same cluster
    • Advanced: the active, passive, and witness nodes are deployed to different clusters
  • There is little benefit to using vCenter HA without also providing high availability at the Platform Service Controller layer
    • An external Platform Services Controller instance is required when there are multiple vCenter Server instances in an Enhanced Linked Mode configuration.
  • Failover can occur when a host fails, or when certain key services fail
  • For the initial release of vCenter HA, the recovery time objective (RTO) is about 5 minutes

I already knew some of this information from testing vCenter HA in my lab. I highlighted the items I learned from this white paper.

Source: “What’s New in VMware vSphere 6.5” technical white paper

New Year Resolution - Improve Productivity

Here is another of my new year resolutions for 2017 - improve productivity (see my previous 2017 new year resolution here). The ideas come from http://www.businessinsider.com/bad-habits-that-killing-productivity-2016-12.

  • Get out of bed when the alarm clock buzzes
  • Get enough sleep
  • Do not keep the tablet next to the bed. I keep the smartphone next to the bed as my alarm clock
  • Do not skip breakfast and drink some hot tea before going to the toilet in the morning
  • Complete the hardest and most important tasks at the beginning of the day
  • Do not check email throughout the day, especially in the middle of the night. When you wake up in the morning, only check for missed calls or text messages. Do not read email until later in the day
  • Do not eat junk food or eat less junk food
  • Focus on 3 ~ 5 of the most important goals and ignore the rest
  • Do not sit all day and walk 50,000 steps in a week
  • Do not multitask
  • Do not skip the workout
  • Do not look up the answer to a random question that just popped into your head. Write it down and search later
  • Do not overplan the schedule; instead, plan for 4 ~ 5 hours of real work each day
  • Do not underplan
  • Do not accept a meeting unless the person who requested it has put forth a clear agenda and stated exactly how much time they will need
  • Abandon perfectionism

Lessons from Security Breaches

Here is my short summary of the points from the article “Learning From A Year of Security Breaches” that are applicable to most work environments.

  • Centralize logs, including host, application, authentication, and infrastructure logs, into as few systems as possible; make critical logs alertable; but be mindful of user privacy in what you log
  • You might not find the root cause of a breach because of weaknesses in the environment, systems, or people; practicing incident response can identify these weaknesses
  • Attackers will target employees’ homes, personal email, or devices to breach corporate security; educate your employees to improve their security practices, and involve the corporate security team even when the security issue is a personal one
  • Avoid putting secrets and keys into source code
  • Protect employees’ credentials by integrating Single Sign-On or Multi-Factor Authentication
  • Be aware of insider threats
  • Measure and eliminate the security debt - cutting corners for fast growth

Use WinSCP to Transfer Files in vCSA 6.7

This is a quick update on my previous post “ Use WinSCP to Transfer Files in vCSA 6.5 ”. When I try the same SFTP server setting in vCSA 6.7...