Adjusting Exchange 2010 DAG Failover in High Latency Networks

Some Exchange 2010 DAG implementations have DAG members separated by high-latency WAN links. On these networks, you may find that the increased latency causes unexpected DAG failovers or failures.



This is especially likely to happen with a two-node DAG that uses a File Share Witness (FSW). When network latency increases to the point that the cluster heartbeat threshold is reached, the node farthest from the FSW goes offline. The node on the same LAN as the FSW stays online because, together with the FSW, it still holds two of the three quorum votes and therefore maintains quorum.



Two cluster properties control how cluster health is measured between subnets, in terms of heartbeats:

  • CrossSubnetDelay specifies the heartbeat interval (in milliseconds) between subnets. The default is 1000 milliseconds (1 second).
  • CrossSubnetThreshold specifies how many heartbeats can be missed between subnets before cluster failover (or failure) occurs. The default is 5 heartbeats.

With the default settings, any WAN condition that causes 5 missed heartbeats over 5 seconds will cause the cluster to fail. These defaults match those of the SameSubnetDelay and SameSubnetThreshold properties, which govern heartbeats between nodes in the same subnet.
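If you want to confirm the values your cluster is currently using before changing anything, you can read them with the Failover Cluster PowerShell module (loaded as described in the steps below):

# Display the current heartbeat delay and threshold settings
Get-Cluster | Format-List CrossSubnetDelay, CrossSubnetThreshold, SameSubnetDelay, SameSubnetThreshold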



If WAN latency causes unexpected cluster failovers or failures, adjust the CrossSubnetDelay value to its maximum of 4000 milliseconds (4 seconds) and the CrossSubnetThreshold property to its maximum of 10. With these settings, the cluster will not fail over (or fail) until 10 consecutive heartbeats are missed at 4-second intervals (40 seconds total).



This is accomplished from PowerShell as follows:



  • From one of the DAG members, open Windows PowerShell Modules in Administrative Tools. This launches PowerShell and imports the Windows PowerShell modules for all installed features, including the FailoverClusters module used below.
  • Run the following one-liner to configure the maximum values:



$cluster = Get-Cluster; $cluster.CrossSubnetThreshold = 10; $cluster.CrossSubnetDelay = 4000



  • Check your settings with the following command:

Get-Cluster | fl *
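Because fl * returns every cluster property, you can narrow the output to just the heartbeat settings with a wildcard:

Get-Cluster | fl *Subnet*

Incidentally, if you prefer not to go through Administrative Tools, the FailoverClusters module can also be loaded directly in an elevated PowerShell session with Import-Module FailoverClusters (assuming the Failover Clustering feature is installed).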




Since cluster properties are replicated to all nodes in the cluster immediately, this only needs to be configured on one node in the DAG. The changes take effect right away; there is no need to restart any services or reboot the server.



Note that you can configure the same properties using cluster.exe, but I’m using PowerShell here because cluster.exe is deprecated in Windows Server 2008 R2.

Exchange DAG Cluster Service Terminated with Error 7024

I ran into an interesting issue at a client site yesterday on an Exchange 2010 SP1 DAG member. One DAG member’s databases would not mount (not even the Public Folder database) and the Cluster service would not start. The DAG is configured as a three-member stretched DAG, with two nodes in the main site and a third in the DR site.



The event IDs logged were:



Log Name:      System
Source:        Service Control Manager
Date:          10/23/2011 12:07:44 AM
Event ID:      7024
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      EX03.domain.local
Description:
The Cluster Service service terminated with service-specific error Log service encountered an invalid log block..



Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          10/23/2011 12:00:43 AM
Event ID:      1177
Task Category: Quorum Manager
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      EX03.domain.local
Description:
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.



Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          10/23/2011 12:00:43 AM
Event ID:      1135
Task Category: Node Mgr
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      EX03.domain.local
Description:
Cluster node ‘EX02’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.



Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          10/23/2011 12:00:43 AM
Event ID:      1135
Task Category: Node Mgr
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      EX03.domain.local
Description:
Cluster node ‘EX01’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

The cluster failed when two of the three DAG members (EX01 and EX02) went offline at 12:00:43 AM due to a network failure in the active site. For some reason, this corrupted the CLUSDB.BLF file on the member in the DR site, which prevented that node from coming online when the network came back up. CLUSDB.BLF is a CLFS (Common Log File System) Base Log File used by the Cluster service; it holds the metadata used to manage access to the cluster log data.



To correct the problem, navigate to the %WINDIR%\Cluster folder, rename CLUSDB.BLF to CLUSDB.BLF.OLD, and restart the Exchange server. The Cluster service generates a new CLUSDB.BLF file on restart, after which the service starts normally and the databases mount.
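If you would rather script the fix, here is a minimal PowerShell sketch of the same steps (ClusSvc is the service name behind the Cluster service; Restart-Computer reboots the server immediately, so run this only when you are ready):

# Make sure the Cluster service is stopped (in this scenario it has usually already terminated)
Stop-Service -Name ClusSvc -Force -ErrorAction SilentlyContinue
# Rename the corrupt CLFS base log file; the Cluster service regenerates it on restart
Rename-Item -Path "$env:WINDIR\Cluster\CLUSDB.BLF" -NewName "CLUSDB.BLF.OLD"
# Reboot so the Cluster service starts clean and the databases can mount
Restart-Computer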

Configuring Domain Controller Usage in Exchange 2007 CCR Geo-Clusters

Exchange 2007 Cluster Continuous Replication (CCR) can be configured to span different geographic sites. These are sometimes called “stretched” or “geographically dispersed” clusters. In Windows Server 2003, special networking configurations are needed to stretch a single subnet across the two geographically dispersed locations. This is much easier with Windows Server 2008, since 2008 failover clustering can span different subnets, one in each location.

Even so, Exchange 2007 requires that both nodes of the CCR cluster reside in the same Active Directory site. Best practice says there should be redundant Global Catalog servers in each location, in case of an outage at either one. The trouble is that when both nodes of the CCR cluster and all of the Global Catalogs reside in the same AD site, the Exchange servers may (and probably will) bind to a GC that is not in the same geographic location as the server, which can lead to problems.

Consider the following example:


A CCR geo-cluster exists in an Active Directory site called E2K7. NODE1 is in San Francisco and NODE2 is in Las Vegas. There are two Global Catalog servers in each location: SFDC1 and SFDC2 in San Francisco, and LVDC1 and LVDC2 in Las Vegas. Because all six servers reside in the same AD site, Exchange will bind to any one of the four GCs. In this example, NODE1 is active and NODE2 happens to be using SFDC1 for Global Catalog and Configuration Domain Controller services. That means NODE2 is reaching across the WAN for GC services, which is not very efficient.

If there is a location-specific outage in San Francisco (an earthquake, a power interruption, or some yahoo taking out a fiber trunk with a backhoe), the CCR cluster will fail over to Las Vegas, but the GC that NODE2 was using (SFDC1) is unavailable, too. Exchange services will not fail over correctly and an outage occurs, which is exactly what the CCR cluster is supposed to prevent.

The way to design around this problem is to configure the CCR node in each location to exclude the GCs in the remote location. This is done using the following command from the Exchange Management Shell, as shown here for NODE2:

Set-ExchangeServer -id NODE2 -StaticExcludedDomainControllers:sfdc1.domain.com,sfdc2.domain.com

Note that the Domain Controllers must be specified in FQDN form, separated by commas, with no spaces. You would do the same for NODE1, specifying LVDC1 and LVDC2, as shown below.
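The equivalent command for NODE1, which excludes the Las Vegas Global Catalogs, would be:

Set-ExchangeServer -id NODE1 -StaticExcludedDomainControllers:lvdc1.domain.com,lvdc2.domain.com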

The result is that each node will always use the local GCs for that node. If both of those local GCs become unavailable for some reason, Exchange will temporarily bind to another GC in the domain, even one at the remote location; this binding occurs automatically within 15 minutes. When the local GCs become available again, Exchange re-binds to them, again within 15 minutes. Perfect!



While researching this article, I came across something unexpected. I set the StaticExcludedDomainControllers value using the Set-ExchangeServer cmdlet and it works as expected. But when I try to view the configuration using the Get-ExchangeServer cmdlet, the value appears empty, as shown below:
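For example, querying the node that was just configured returns empty Static* values (the wildcard filter is just to trim the output):

Get-ExchangeServer -id NODE2 | fl Static*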


The reason it shows null is that the StaticDomainControllers, StaticGlobalCatalogs, StaticConfigDomainController, and StaticExcludedDomainControllers values are stored in the Exchange server’s registry, not in Active Directory. According to Microsoft, this is “by design” to avoid the performance hit of the remote registry call needed to retrieve the values. I’m not aware of any other cmdlet that behaves this way.

In any event, to view the configuration of these values you must use the -Status switch, with a command along these lines:
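Get-ExchangeServer -id NODE2 -Status | fl Static*

With the -Status switch, the cmdlet makes the remote registry call, and the StaticExcludedDomainControllers value set earlier is returned as expected.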