Eddie's Blog: vsan

vSAN Performance Service “Hosts Not Contributing Stats” Fix

I have a four-host vSAN cluster running vSAN 6.2. Recently, the Performance service check in vSAN Health started reporting that two of the hosts were not contributing stats.

The following are all the steps I tried while troubleshooting and ultimately fixing the issue in my environment. Some of the steps did not fix my issue, but they may apply to your situation. PS: I opened a VMware support case on this issue. The support engineer did not solve it directly, but he gave me a hint about the cause that led me to the solution.

Turn the Performance service off and back on in vSphere Web Client: vSAN cluster, Manage, Settings, Health and Performance.
Turn off the Performance service, restart the vSAN management agent with “/etc/init.d/vsanmgmtd restart”, then turn the Performance service back on.
Place the vSAN host in maintenance mode and restart the host.
SSH to the vCenter Server Appliance and restart the vmware-vpxd service with “service vmware-vpxd restart”.
Verify that the vSAN storage provider of each vSAN host is online in vSphere Web Client: vCenter server, Manage, Storage Providers. If a host’s vSAN provider is offline, unregister that host’s storage provider and synchronize all vSAN storage providers. This brings the host’s vSAN storage provider back online.
Caution: doing this can cause the VMs on the host to fail over to other hosts in the cluster.
(This is the step that started to lead me to the ultimate fix.) Check the certificate info of each vSAN host under Storage Providers. The certificates should all be issued by the same Platform Services Controller (my vCenter is a vCSA with an external PSC rather than an embedded PSC). In my case, the certificates of the two “problem” vSAN hosts were issued by the VC host, while the certificates of the “good” vSAN hosts were issued by the PSC host. I don’t know why these hosts have different certificate issuers, since I don’t have the history of how the PSC and VC were deployed.
To further confirm that the ESXi host certificate is the problem:

Log in to vCenter server as “administrator@vsphere.local”
Go to Home, Administration, Deployment, System Configuration, Nodes, PSC node, Manage, Certificate Authority (if you select the VC node, there is no Certificate Authority tab under Manage)
Enter the password of “administrator@vsphere.local” again
Under Active Certificate, all the ESXi hosts are listed except the two “problem” vSAN hosts
It makes sense that the certificates of the two “problem” vSAN hosts are missing here, because they were issued by the VC host, not the PSC host. What does not make sense is how they received the “problem” certificates in the first place, since there is no Certificate Authority on the VC host.
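
The issuer comparison described above can be sketched in a few lines of Python. This is only an illustration: the host names and issuer strings below are hypothetical placeholders that would be copied by hand from the Storage Providers view, and the function name is made up for this sketch. It does not query vCenter or the hosts.

```python
def hosts_not_issued_by(issuer_by_host, expected_issuer):
    """Return the hosts whose certificate issuer is not the expected one.

    issuer_by_host maps a host name to the issuer string shown for its
    certificate. expected_issuer is the issuer all hosts should share
    (the PSC, in the setup described above).
    """
    return sorted(
        host
        for host, issuer in issuer_by_host.items()
        if issuer != expected_issuer
    )


# Hypothetical values for a four-host cluster; not real host names.
issuers = {
    "esx01": "psc.example.local",
    "esx02": "psc.example.local",
    "esx03": "vc.example.local",
    "esx04": "vc.example.local",
}

problem_hosts = hosts_not_issued_by(issuers, "psc.example.local")
print(problem_hosts)
```

Passing the expected issuer explicitly, rather than picking the most common one, matters here: with two hosts on each issuer there is no majority to rely on.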

Once the cause is identified, the fix is to re-issue the certificate to the two “problem” vSAN hosts.
In vSphere Web Client, select the “problem” vSAN host, then Manage, Settings, Certificate

This page also shows that the host certificate was issued by the wrong host (the VC host)
Click Renew to request a new certificate
Caution: As soon as I clicked the Renew button, the host’s HA agent restarted. Some VMs on the host failed over to the remaining hosts, although the VMs appeared to have no downtime.
Before renewing the certificate

After renewing the certificate

Once the host certificates were re-issued by the PSC, the vSAN Performance service status showed “Passed”.

Conclusion

The cause of the vSAN Performance service “Hosts Not Contributing Stats” warning in my case was that the “problem” vSAN hosts had the wrong host certificate.
I don’t know how these “problem” hosts received the wrong host certificate.
When the vCSA is deployed with an external PSC, host certificates are issued by the PSC host.
Re-issuing or renewing the host certificate restarts the host’s HA agent, which can cause the VMs on the host to migrate to other hosts.

VSAN 6.2 On-disk Format Upgrade Fails at 5%

I am working on upgrading our VSAN from 6.1 to 6.2. See this for an overview of the upgrade steps.

After upgrading each VSAN host to ESXi 6.0U2 (the latest build 4510822 as of 11/01/2016), the last step is to upgrade the on-disk format from v2 to v3.

In our case, the on-disk format upgrade fails at 5% with the error message “General Virtual SAN error. Disk Format conversion failed due to unexpected error”.

However, checking the disk format under VSAN cluster, Manage, Settings, Virtual SAN / Disk Management shows that one disk group is upgraded to the interim version 2.5 each time I run the on-disk format upgrade. In the screenshots below, I had run the on-disk format upgrade twice, and two of the disk groups were upgraded to v2.5.

I kept running the on-disk format upgrade. Our VSAN has 4 hosts with 2 disk groups on each node. The on-disk format upgrade failed six times. On the seventh run, all disk groups were upgraded to v2.5.

Then the upgrade moved on to the next phase: it started removing disks from one of the VSAN hosts.

I have not figured out the cause of the failure. Re-running the upgrade until all the disk groups are upgraded to format v2.5 keeps the process moving forward.

VSAN v6 Provision Thick Disk

I always thought that when creating or migrating a VM on a VSAN datastore, its disks would be thin provisioned. However, I discovered that some VM disks in our VSAN datastore are “thick” provisioned even though all the VM storage policies are set to 0% object space reservation. How is that possible? After some digging, here is what I learned.

Thick Disk Format on VSAN

VSAN defines the disk type (thin or thick) via the Object Space Reservation setting in the VM Storage Policies. By default, this value is 0%, implying the disk is deployed as thin.

If the value is set to 100%, the space for the disk is fully reserved, which can be thought of as fully thick provisioned. This behaves similarly to thick provision lazy zeroed. There is no eager-zeroed thick format on VSAN. (reference: Virtual SAN 6.2 Design and Sizing Guide, page 65)
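
As a rough illustration of how the Object Space Reservation percentage translates into reserved capacity, here is a minimal Python sketch. The function names, the 100 GB example disk, and the `replicas` parameter are assumptions made for this illustration and are not part of any VMware API. The sketch only models the 0% and 100% cases described above and labels everything in between as a partial reservation.

```python
def reserved_gb(disk_size_gb, osr_percent, replicas=1):
    """Capacity reserved up front for one virtual disk, in GB.

    replicas is an assumption of this sketch: with mirroring, each
    copy of the object would carry its own reservation.
    """
    if not 0 <= osr_percent <= 100:
        raise ValueError("osr_percent must be between 0 and 100")
    return disk_size_gb * osr_percent / 100 * replicas


def provisioning_label(osr_percent):
    """Map an Object Space Reservation value to the terms used above."""
    if osr_percent == 0:
        return "thin"
    if osr_percent == 100:
        return "thick (similar to lazy zeroed)"
    return "partially reserved"


# Example: a hypothetical 100 GB disk under three reservation settings.
for osr in (0, 50, 100):
    print(osr, provisioning_label(osr), reserved_gb(100, osr))
```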

Benefit to Provision Thick Disk on VSAN

Based on my understanding of how VSAN handles disk IO (VSAN mirrors write IOs to all active mirrors, and they are acknowledged when they hit the flash buffer!), there is typically no performance difference between thin and lazy zeroed thick provisioning on VSAN. Remember, there is no eager-zeroed thick format on VSAN (see above). Also see the Yellow-Bricks post. (PS: Duncan’s post may misspeak about eager zeroed thick provisioning on VSAN.)

Provision Thick Disk on VSAN (Intentionally or By Accident)

There are several possible ways to provision a thick disk on VSAN.

Possibility #1

Define a thick VM Storage Policy
Set the Object Space Reservation to 100%
Use vSphere Web Client (cannot use vSphere C# Client)
Select the thick VM storage policy

Possibility #2

Use vSphere C# Client
Select “Thick Provision Lazy Zeroed” or “Thick Provision Eager Zeroed” as the disk type
I don’t know what the actual impact on VSAN is when eager zeroed is selected. In my test, the VM disk was still created correctly. I will do more research and post an update.

Possibility #3

P2V a physical server to a VM
By default, P2V uses thick provisioning for the disk
Change the destination disk to thin provision by selecting Advanced, Destination layout, Type, Thin

For VSAN 5.5, there is one more method, see here.

Change Thick Provisioned Disk to Thin on VSAN

Unfortunately, there is not a simple way to change a thick provisioned disk to thin on VSAN. Simply changing the VM storage policy on the disk has no impact.

To convert a thick disk to thin provisioned, do a storage migration of the disk to SAN / NFS / local storage, then migrate it back to the VSAN datastore. Make sure to select the thin provision storage policy during the migration.

Do Not Upgrade Dell Server with H730 and FD332-PERC Controller to VSAN 6.2

VMware released VSAN 6.2 on March 15, 2016. However, if your VSAN is running on a Dell server with H730 or FD332-PERC controller, do not upgrade to VSAN 6.2.

See KB2144614 for more information.

VSAN Free Storage Catches

VSAN is a hot topic nowadays. Once it is set up, it’s easy to manage and use. No more creating LUNs and zoning.

We recently ran into some catches with its free storage, ones that we hadn’t thought about or been told about beforehand; or maybe our expectations of VSAN were too high.

Our VSAN hardware disk configuration:

3 x Dell PowerEdge R730 nodes
2 x 400 GB SSD per node (372.61 GB is shown in VSAN Disk Management)
14 x 1 TB SATA per node (931.51 GB is shown in VSAN Disk Management)
Two disk groups (7 SATA + 1 SSD) per node

Calculation of each node storage capacity (RAW):

931.51 x 14 = 13,041.14 GB = 12.73549 TB

Total storage capacity (RAW)

931.51 x 14 x 3 = 39,123.42 GB = 38.20646 TB

This calculation matches the storage capacity shown in the VSAN Cluster’s Summary.
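
The arithmetic above can be reproduced with a short Python sketch. The function names are made up for this illustration. The only inputs are the figures from the post, and the GB-to-TB conversion divides by 1024, which is what the numbers above imply.

```python
GB_PER_TB = 1024  # the post's TB figures correspond to dividing GB by 1024


def raw_capacity_gb(disk_gb, disks_per_node, nodes=1):
    """Raw capacity contributed by the capacity disks, in GB."""
    return disk_gb * disks_per_node * nodes


def gb_to_tb(gb):
    """Convert GB to TB using the 1024 factor noted above."""
    return gb / GB_PER_TB


# 14 capacity disks of 931.51 GB per node, 3 nodes in the cluster.
per_node_gb = raw_capacity_gb(931.51, 14)
cluster_gb = raw_capacity_gb(931.51, 14, nodes=3)

print(f"Per node: {per_node_gb:,.2f} GB = {gb_to_tb(per_node_gb):.5f} TB")
print(f"Cluster:  {cluster_gb:,.2f} GB = {gb_to_tb(cluster_gb):.5f} TB")
```

The SSDs are left out on purpose, matching the calculation above, which counts only the SATA capacity disks.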

We are adding more VMs to the VSAN. Once the free storage drops below about 12 TB (about one node’s RAW capacity), the VSAN health check starts showing the critical alert “Limits Health - After 1 additional host failure” (KB2108743).

Component resyncing also starts more frequently.

My takeaways:

I understand that there is an overhead for VSAN (or any storage product) to offer redundancy. But the way VSAN displays free storage is quite different from traditional SAN storage, and it can be confusing. The free storage shown in VSAN is not necessarily storage you should use. If you do use it, VMs may go down when a host fails or is taken down for maintenance.
The used storage in the Summary tab is the provisioned storage, not the actual space in use.
Frequent component resyncing can potentially impact overall VSAN storage performance.
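
To make the free-space catch concrete, here is a deliberately simplified Python sketch of the idea behind the “After 1 additional host failure” alert: if the largest host disappears, does the used space still fit on what remains? This is an assumption-laden model, not the actual logic of the VSAN health check, which works on components and storage policies. The function name is hypothetical.

```python
def survives_one_host_failure(host_capacities_gb, used_gb):
    """True if the used space still fits after the largest host is lost."""
    remaining_gb = sum(host_capacities_gb) - max(host_capacities_gb)
    return used_gb <= remaining_gb


# Three nodes with 14 x 931.51 GB capacity disks each, as in this post.
node_gb = 931.51 * 14
hosts = [node_gb] * 3
total_gb = sum(hosts)

for free_tb in (13, 12):
    used_gb = total_gb - free_tb * 1024
    ok = survives_one_host_failure(hosts, used_gb)
    print(f"{free_tb} TB free -> survives a host failure: {ok}")
```

Under this model the cluster stops being safe once free space falls below one node’s raw capacity, which lines up with the point at which the alert appeared in our environment.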

Recover Microsoft Cluster VMs Not Power On After Migration

A lesson to remember if you do not have the time to read this entire post: do not migrate the cluster VMs without fully understanding the impact.

Here is our story.

We had Microsoft SQL 2008 cluster VMs in a CIB (cluster-in-a-box) configuration (see my previous post about the various Microsoft cluster VM configurations). The shared disks of the cluster VMs were on an EMC SAN. When free space on the EMC SAN was running low, an engineer migrated the cluster VMs (the VMs were powered off during the migration) to the VSAN v.6.1 hosts and storage. The migration completed successfully, but the VMs would not power on, failing with the error message “Cannot use non-thick disks with clustering enabled (sharedBus='physical'). The disk for scsi1:0 is of the type thin.”

Because VSAN does not support Microsoft clusters with shared disks (clusters without shared disks, e.g. a SQL AlwaysOn Availability Group, are supported), there was no option but to migrate the VMs back to the original hosts and SAN storage.

PS: In this case, the new target storage was VSAN. I think the cluster would have broken even if the new target storage had been a traditional SAN, because the cluster VMs no longer shared their disks after the migration (see below). In that case, though, you could probably recover the cluster by reconfiguring the VMs to share the shared disks, without migrating the VMs back to the original storage.

When we reviewed the disks of the migrated VMs on the VSAN storage, each VM had its own copy of the shared disks, so the cluster VMs no longer shared them. This meant we could not simply migrate the VMs back to the original hosts and SAN storage.

When we reviewed the original EMC SAN storage, the VMDK files of the shared disks were still there; only the non-shared disk (e.g. the OS’s C drive) had been completely migrated to the VSAN storage.

Recovery Procedure:

Document the SCSI controller ID (e.g. SCSI (1:0)) of each shared disk from the migrated VMs. This may not be very important, but we are going to use the same SCSI ID for each corresponding disk when re-adding the shared disks.
Since the VMDK files of the shared disks were still on the original SAN storage, we could speed up the recovery by migrating only the non-shared disks of each VM. In this case, we migrated only hard disk 1 of each VM (the OS drive) back to the original SAN.
How do you migrate only the OS drive back to the original host and storage? We used VMware vCenter Converter and selected only hard disk 1. This worked beautifully.

PS. In this case the VMs had been migrated to the VSAN storage, so we could not use scp to copy the VMDK files manually between the hosts. To use scp, we would have needed to migrate the VMDK files to non-VSAN storage first. This is why I think vCenter Converter is the best tool in this case.
Now the non-shared disks of each VM are back on the original host and SAN storage. Make sure both VMs are registered on the same ESXi host.
If the VMs are not on the same ESXi host, use Migrate, Change host, and check the checkbox “Allow host selection within this cluster” (this option is not selected by default) to put both VMs on the same ESXi host.
Re-add the SCSI controller(s) to the first VM and set the SCSI Bus Sharing to Virtual
Re-add the shared disks to the first VM using the existing VMDK files; match the SCSI ID documented in the first step. We also made sure the order of the hard drives matched the original VM’s configuration

Power on the first VM
Log in to Windows and verify that the shared drives’ drive assignments are correct
Launch Failover Cluster Manager to verify the cluster services and applications are online
Re-add the SCSI controller(s) to the second VM and set the SCSI Bus Sharing to Virtual
Re-add the shared disks using the existing VMDK files to the second VM; match the SCSI ID documented in the first step
Power on the second VM
Log in to Windows and verify that no shared drive is shown in Windows Explorer; the shared disks should be shown as “reserved” in Disk Management
Launch Failover Cluster Manager to verify the second node is online
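
The “document the SCSI IDs first, then match them when re-adding” part of the procedure above can be double-checked with a small Python sketch. Everything here is hypothetical: the slot names, the VMDK paths, and the function name are placeholders for notes you would keep by hand, and the sketch does not read anything from vCenter.

```python
def slot_mismatches(documented, current):
    """Return slots whose VMDK differs from the documented layout.

    Both arguments map a SCSI slot such as "scsi1:0" to a VMDK path.
    The result maps each problem slot to (expected, found); found is
    None when the slot has no disk attached.
    """
    problems = {}
    for slot, expected in documented.items():
        found = current.get(slot)
        if found != expected:
            problems[slot] = (expected, found)
    return problems


# Hypothetical notes taken in the first step of the procedure.
documented = {
    "scsi1:0": "[san-ds] sqlnode1/quorum.vmdk",
    "scsi1:1": "[san-ds] sqlnode1/data.vmdk",
}

# Hypothetical layout of the second VM after re-adding its disks.
second_vm = {
    "scsi1:0": "[san-ds] sqlnode1/quorum.vmdk",
    "scsi1:2": "[san-ds] sqlnode1/data.vmdk",  # wrong slot
}

print(slot_mismatches(documented, second_vm))
```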

Fix A VSAN Host Shows 0 of 0 Disks In Use

We have three hosts running VSAN 6.1. Today, Disk Management in vSphere Client showed one of the hosts with 0 of 0 disks in use.

And under VSAN General, there was a Mixed On-disk Format Version warning with an upgrade button next to it. (Do Not Click It. I didn’t click it, and I am not sure what the impact would be.) Because our VSAN environment was built from scratch with VSAN 6.1 and was not upgraded from VSAN 5.5, it does not make sense that the disk format would require an upgrade.

Troubleshoot

Run the VSAN Health check; everything is green.

The affected host shows all the disks under its Manage, Storage, Storage Devices.

Solution

Click the first icon under Storage Devices to refresh the host’s storage information.

Now Disk Management and the On-disk Format are back to normal.

VSAN Storage Controller Cache

The “VSAN 6.0 Design and Sizing Guide” v.1.0.5, April 2015, says in its Storage controller cache considerations section: “VMware’s recommendation is to disable the cache on controller if possible. Virtual SAN is already caching data at the storage layer – there is no need to do this again at the controller layer. If this cannot be done due to restrictions on the storage controller, the recommendation is to set the cache to 100% read.”

However, in “VSAN Ready Nodes”, the storage controller in some configurations includes a cache. For example, the storage controller in the Dell PowerEdge R630.

Why include the controller cache when VMware recommends disabling it?

It turns out the controller cache allows a larger queue depth; see this.

In the “VSAN 6.0 Design and Sizing Guide”, VMware recommends a minimum queue depth of 256, and choosing a controller with a much larger queue depth when possible.
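
As a small illustration of applying that minimum, the following Python sketch filters a list of controllers by queue depth. The controller names and queue depth values are hypothetical placeholders, not figures from the HCL, and the function name is made up for this sketch; only the 256 minimum comes from the guide quoted above.

```python
MIN_QUEUE_DEPTH = 256  # minimum from the design guide quoted above


def below_minimum(queue_depth_by_controller, minimum=MIN_QUEUE_DEPTH):
    """Return the controllers whose queue depth is under the minimum."""
    return {
        name: depth
        for name, depth in queue_depth_by_controller.items()
        if depth < minimum
    }


# Hypothetical controllers and queue depths, for illustration only.
candidates = {
    "controller-a": 25,
    "controller-b": 256,
    "controller-c": 895,
}

print(below_minimum(candidates))
```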

For more information about queue depth, see the following:

Disk Controller features and Queue Depth? (Yellow-Bricks)
Why Queue Depth matters! (Yellow-Bricks)
Queue Depth info in the VSAN HCL! (Yellow-Bricks)
“Community” VSAN Storage Controller Queue Depth List (virtuallyGhetto)
