Hello.
I have a problem with my Hyper-V cluster.
It is simply a failover cluster with the Hyper-V role, consisting of two nodes. It uses an SOFS share for VM storage.
The SOFS is run by a second storage failover cluster dedicated solely to this role. The storage cluster consists of two nodes plus shared iSCSI storage; the disks are added as CSVs and the SOFS shares live on them.
All Hyper-V and SOFS cluster nodes have dedicated 2x10G interfaces, so SMB3 multichannel is in place.
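Just to confirm that Multichannel is actually active over the 10G SAN interfaces, a quick check like this can be done (the name SOFS here is the SOFS role name, same as below):

# On a Hyper-V node: active multichannel connections to the SOFS
Get-SmbMultichannelConnection -ServerName SOFS
# On a SOFS node: which interfaces SMB advertises to clients
Get-SmbServerNetworkInterface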
- SMBv1 removed
- NETBIOS disabled
- TCP timestamps enabled: "netsh int tcp set global timestamps=enabled"
- TcpAckFrequency and TcpNoDelay set to 1 (REG_DWORD) in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\<SAN interface GUID> (see the sketch after this list)
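For reference, a sketch of how those per-interface values were set; the GUID below is just a placeholder for the SAN interface GUID:

# Per-interface TCP tuning on the SAN adapters; replace the GUID placeholder
$if = 'HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\{00000000-0000-0000-0000-000000000000}'
Set-ItemProperty -Path $if -Name TcpAckFrequency -Value 1 -Type DWord  # 1 = no delayed ACK
Set-ItemProperty -Path $if -Name TcpNoDelay -Value 1 -Type DWord       # 1 = Nagle disabled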
Approximately every two weeks all VMs hang because they lose their connection to the SOFS share.
Symptoms:
- The UNC address \\SOFS.INSIDE.LOCAL cannot be accessed from the Hyper-V cluster nodes; the error is "The remote procedure failed and did not execute." https://i.imgur.com/ye69RKt.png
- The SOFS share can be accessed via the UNC address \\SOFS from the Hyper-V cluster nodes
- The SOFS share can be accessed directly via \\SOFS.INSIDE.LOCAL\SHARENAME from the Hyper-V cluster nodes
- The SOFS share can be accessed from any other server via \\SOFS.INSIDE.LOCAL or \\SOFS
Known workaround: reboot the Hyper-V cluster nodes (or even just one of the two). Rebooting the SOFS cluster nodes doesn't help.
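For completeness, these are the kind of checks that can be run from an affected Hyper-V node while the problem is happening, just to narrow down whether it is name resolution, TCP reachability, or the existing SMB sessions (a sketch, names as above):

# Does the FQDN still resolve to the SOFS cluster IPs?
Resolve-DnsName SOFS.INSIDE.LOCAL
# Is TCP 445 still reachable over the SAN interfaces?
Test-NetConnection SOFS.INSIDE.LOCAL -Port 445
# Which SOFS node are the existing SMB sessions bound to?
Get-SmbConnection | Where-Object ServerName -like 'SOFS*'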
OS: Windows Server 2016 everywhere, 2018-06 updates
Of course I could go back to connecting the iSCSI storage directly to the Hyper-V cluster, but in my case this dedicated SOFS storage cluster was put in place to simplify the setup of the Hyper-V and (in the future) SQL cluster nodes. That way I won't need to update the storage array software on all cluster nodes (~20 nodes in the future) whenever a new version comes out, and for troubleshooting purposes all storage-array-to-host relationships stay between the two storage nodes and the array.
I believe the problem is somewhere in the SMB client-server relationship.
I've already tried "Set-SmbClientConfiguration -MaxCmds 32768" on the Hyper-V nodes and "Set-SmbServerConfiguration -MaxThreadsPerQueue 64 -AsynchronousCredits 8192" on the SOFS nodes, but it didn't help. All other SMB settings are at their defaults.
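To double-check what is actually in effect after those changes, the values can be read back on the respective nodes like this:

# On the Hyper-V nodes (SMB client side)
Get-SmbClientConfiguration | Select-Object MaxCmds
# On the SOFS nodes (SMB server side)
Get-SmbServerConfiguration | Select-Object MaxThreadsPerQueue, AsynchronousCredits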
From my point of view this setup looks pretty simple: Hyper-V running VMs with storage over SMB, without anything insane or special.
Captured the problem with Procmon: https://i.imgur.com/ewDDpL9.png
Captured the problem with Network Monitor: https://i.imgur.com/gbVvrZm.png (with the filter ProtocolName == "SMB2")
In this sample, 10.10.10.101 is SOFS node #1, SAN interface 0, and 10.10.10.155 is HV node #5, SAN interface 0.
It looks like the problem is in the RPC-over-SMB communication via the Server Service Remote Protocol (https://msdn.microsoft.com/en-us/library/dd303117.aspx), but I have no idea what exactly fails there.
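If it really is the srvsvc pipe that breaks, then a plain share enumeration from an affected Hyper-V node should reproduce the error on demand, since that goes over the same RPC interface; this is just my assumption:

# NetShareEnum over the srvsvc pipe (RPC over SMB) - should fail the same way if the pipe is the problem
net view \\SOFS.INSIDE.LOCAL
# For comparison, the same information over WinRM/CIM, which bypasses the SMB RPC path
Get-SmbShare -CimSession (New-CimSession -ComputerName SOFS)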
According to this blog post (https://blogs.technet.microsoft.com/josebda/2013/10/30/automatic-smb-scale-out-rebalancing-in-windows-server-2012-r2/), the Hyper-V servers' access to the SOFS share should be considered symmetric, because both SOFS nodes are connected to the SAN identically via iSCSI. Nevertheless, I see a lot of 30814 events logged at a 1-second interval, the first stating that the share type is asymmetric https://i.imgur.com/LJ425BN.png and the second stating that it is symmetric https://i.imgur.com/MnfxtDQ.png .
I can't find any documentation (other than that blog post) about this behavior or about how SOFS determines the share type (symmetric/asymmetric).
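To correlate those events with the hangs, they can be pulled out with something like the following; I'm assuming 30814 is logged to the SMB client Connectivity channel, so the log name may need to be adjusted to whichever channel the events actually appear in:

# List the SMB-related client/witness channels first, then pull event 30814 from the right one
Get-WinEvent -ListLog *SmbClient*, *SmbWitness* -ErrorAction SilentlyContinue
Get-WinEvent -FilterHashtable @{ LogName = 'Microsoft-Windows-SmbClient/Connectivity'; Id = 30814 } |
    Select-Object TimeCreated, Message -First 20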
Also, in the SMB Witness Client event log I can see a lot of "Witness registration has completed." and "Witness Client received a share move request" events.
These events look related, but I can't dig any further into this SMB interaction.
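The only other data point I can think of collecting is the witness registrations on the server side; run on a SOFS node, this shows which client is registered against which file server node at that moment (a sketch, the output fields may vary):

# On a SOFS node: current witness registrations (which client is tracked by which node)
Get-SmbWitnessClient | Format-List *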
Yes, we have a support case open (118072618661320), but I haven't been able to get any response for more than two weeks now.