CCR and Multi-Site Environments

I have been hearing more and more people talk about the virtues of using Cluster Continuous Replication (CCR) with a node in each site, and the talk has escalated now that Windows Server 2008 has been released to manufacturing. With Windows Server 2008, Failover Cluster environments can have nodes in multiple sites without having to use Virtual LANs (VLANs) to provide the networking support.

On the surface, CCR and Windows Server 2008 in a multi-site cluster sounds like the answer to many organizations' needs. Obviously, I am setting up the argument against this kind of implementation. OK, maybe it wasn't obvious to some of you. <G>

Anyway, here is a rough sketch (meaning that lots of non-discussed components are not shown: CAS, DC/GC, DNS, etc.) of how this would look if you had two physical locations, both in the same AD site to support CCR. In the drawing, Node1 is the active node, and replication traffic flows over the WAN link to Node2, the passive node. If you look at the drawing, you should immediately see some issues.

[Diagram: CCR - Multi Site]

Consideration number 1. Where should you put the File Share Witness (FSW)? In this drawing, it is in the site on the left. Well, what if that is the site that goes down in a flood, tornado, meteor strike, or whatever? If the FSW is lost along with one of the nodes, there will not be an automated failover. OK, this is fixable, since we can manually force the cluster to start, but it will impact life in the real world if there is a major disaster, especially if you lose your administrators along with the site. Make sure you document the process in your DR documentation, as somebody else might need to perform the task.
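To see why losing the FSW along with a node blocks automatic failover, consider the vote math in a Node and File Share Majority quorum: two nodes plus the witness make three votes, and the cluster stays online only with a strict majority. Here is a minimal sketch of that arithmetic (illustrative code, not a cluster API):

```python
# Illustrative sketch of Node and File Share Majority quorum arithmetic.
def has_quorum(surviving_votes, total_votes):
    """A cluster keeps quorum only with a strict majority of votes."""
    return surviving_votes > total_votes // 2

total = 3  # Node1 + Node2 + FSW

# Node2 alone fails -> Node1 + FSW survive with 2 of 3 votes:
# automatic failover works.
print(has_quorum(2, total))  # True

# The site holding Node1 AND the FSW is lost -> Node2 has 1 of 3 votes:
# no quorum, so an administrator must force the cluster to start.
print(has_quorum(1, total))  # False
```

The second case is exactly the disaster scenario above: the surviving node is healthy, but it cannot form quorum on its own.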

Consideration number 2. How do you know which Hub Transport to use for the transport dumpster in order to back fill the surviving node? After all, HT1 and HT2 are in the same AD site, which means they are used in a load-balanced manner, so no single one of them can provide a full replay of lost transactions. Yes, you can hard code which HT to use, but that makes no sense to me in an HA environment: you would lose the redundancy/load-balancing functionality gained by having multiple HTs in a site. Of course, you might even have two in the same physical location. Also, let's say you hard code HT1 (in the same location as Node1) for the CMS while it is active on Node1. In the event of a major disaster, you then lose the transport dumpster along with the location. OK, so let's say you instead hard code HT2 for the CMS while it is active on Node1. That would mean all of your transport traffic would be going across the WAN link, which is not exactly a good idea.

Consideration number 3. What about the use of the Wide Area Network (WAN) and its uncontrolled use by many different services? After all, if both physical locations are in the same AD site, will you have issues with clients logging on and authenticating across the WAN link? Will you have problems with the Clustered Mailbox Server (CMS) using the Hub Transport (HT) on the other side of the WAN link? What about the HT using the wrong Domain Controller/Global Catalog server, so that all of its queries run over the WAN link? Again, you can hard code some of these settings for some applications and services, but even if you do, you are back to the issue of potentially losing redundancy/load balancing.

Consideration number 4. Using Windows Server 2008 and its multi-site improvements impacts DNS and name resolution. For example, when Node1 is active, its VIP address is registered with the CMS name. If there is a failover, the other VIP (for the physical location of Node2) must be registered in DNS, and that DNS update needs to replicate to all DNS servers in the organization. During the update and shortly afterward, there will be clients that still have the old VIP address in their caches, so the CMS name will resolve incorrectly until those caches expire. This is not an Exchange issue, but it is something else that should be considered.
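The client-side effect can be modeled as a simple TTL cache: after the CMS name is re-registered to the new site's VIP, a client keeps resolving the old VIP until its cached record expires. A toy model (not a real resolver; names, addresses, and the 300-second TTL are made up for illustration):

```python
# Toy model of a client-side DNS cache with a TTL, showing stale resolution
# of the CMS name after a failover re-registers it to the other site's VIP.
class ClientDnsCache:
    def __init__(self):
        self.cache = {}  # name -> (address, expiry_time)

    def resolve(self, name, authoritative, now):
        entry = self.cache.get(name)
        if entry and now < entry[1]:
            return entry[0]                 # still serving the cached answer
        address, ttl = authoritative[name]  # cache miss or expired: re-query
        self.cache[name] = (address, now + ttl)
        return address

dns = {"cms.example.com": ("10.0.1.50", 300)}   # VIP in site 1, 300 s TTL
client = ClientDnsCache()

print(client.resolve("cms.example.com", dns, now=0))    # 10.0.1.50
dns["cms.example.com"] = ("10.0.2.50", 300)             # failover: site 2 VIP
print(client.resolve("cms.example.com", dns, now=100))  # still 10.0.1.50 (stale)
print(client.resolve("cms.example.com", dns, now=400))  # 10.0.2.50 after expiry
```

Until the cached record times out, the client happily connects to an address that no longer hosts the CMS, and there is nothing Exchange can do about it.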

So, what do I recommend? I am glad you asked that question. If you didn't, too bad, I will answer it anyway.

I highly recommend using CCR within a single physical site that is also an AD site. For disaster recovery reasons, I recommend using Standby Continuous Replication (SCR) to copy transactions to a remote site’s Exchange mailbox server.

FYI, I updated this post based on some of Scott Schnoll's comments to me. Scott had some excellent points regarding my concerns listed above. I won't go through them one by one, but it basically came down to my assumption that CCR in a multi-site (stretched AD site) environment would be configured for automatic failover. I made this assumption because, if we were looking for a manual process that requires administrator intervention to get things up and running, then we should be talking about SCR, not CCR. High Availability (HA) and Disaster Recovery (DR) are very different in my mind. HA means that processes are automated to reduce downtime to a minimal amount. DR is what you do when a major disaster requires steps to be taken to recover the environment. CCR is an HA technology and SCR is a DR technology, in my opinion.

Domain Controllers as Cluster Nodes – Bad Idea

This is an issue that pops up all the time when it comes to best practices and building server clusters.  

It is considered a very bad practice in the community to run Domain Controllers (DCs) as nodes in a cluster, even though Microsoft says it is possible and the configuration is even discussed in KB171390.

So, why do so many people recommend against doing it? Let’s hit the main reasons:

  • Microsoft clearly recommends against it in KB281662

  • It is not supported for Exchange per KB898634

  • There are known issues with file share clusters per KB834231

  • The SQL team strongly recommends against it for performance reasons

  • Some hotfixes for DC/GCs may not be recommended for clusters

  • Running the DC/GC role on each node adds overhead: approximately 130 MB of RAM per node, plus replication traffic and the cost of responding to authentication and logon requests

  • There are issues with multihomed DCs where the private connections also get registered in DNS, which can cause many systems to fail to log on/authenticate properly; the check box to not register the private heartbeat connection is not honored by a domain controller without the proper hotfixes or registry hacks

  • If they are the only DCs in the org, then they must also be Global Catalog servers (GCs) and must also host DNS

  • If they host DNS, they should point to each other for their own DNS resolution, which will cause failures in resolution if one node is down

  • There are issues with FSMO roles and how they will be handled if the node that hosts them is down

  • There are problems with the first node coming online if the cluster nodes are the only DCs in the org: the Cluster service needs to validate its own account, but it cannot find a DC because the node points to the other node for DNS, per proper DNS practices. The same is true for services such as SQL and Exchange that use service accounts

  • There are possible failures if the DC is too busy being a DC and the Cluster service cannot access the quorum drive as required

  • The hisecdc security template will break clustering if it is used to secure domain controllers

It is vital to remember why we implement server clustering when making decisions like whether to run the nodes as domain controllers. We implement server clustering because the service that the cluster is hosting is vital to the company, and we want to mitigate the risks that could cause its failure. Making the nodes domain controllers introduces too many new risks to the cluster, and that is a huge violation of high availability practices.

I hope I have made it clear that it is a really bad idea to use DCs as cluster nodes.

Windows Server 2008 Clustering Documents

Microsoft released a whole set of white papers and made them all available for download from the same page.

The documents available include:

  • Microsoft High Availability Strategy White Paper.doc

  • Overview of Failover Clustering with Windows Server 2008.doc

  • Quick Migration with Hyper-V.doc

  • Windows Server 2008 Failover Clustering Architecture Overview.doc

  • WS2008 Failover Clustering Datasheet.doc

  • WS2008 Multi Site Clustering.doc

They are all fantastic reads and I highly recommend downloading them.

IT Manager Webcast: Delivering High Availability

Manish Kalra will be delivering a 60-minute presentation on the value of High Availability. I strongly recommend that everyone give it a view. Details below:

IT Manager Webcast: Delivering High Availability to Your Infrastructure (Level 100)
Event ID: 1032365516
Register Online
Language(s): English
Product(s): Windows Server
Audience(s): IT Professionals
Duration: 60 minutes
Start Date: Thursday, February 07, 2008, 11:00 AM Pacific Time (US & Canada)
Event Overview

Organizations put a lot of value on mission-critical servers and rely on them heavily to run their businesses. As a result, server downtime can be very costly. For every benefit and advantage brought to an organization by an IT solution, technology and business decision-makers should also think about how to deal with the inevitable downtime of these solutions. In this webcast, we discuss how Microsoft can help you avoid downtime and deliver an available infrastructure to meet the demands of your business and customers.

Presenter: Manish Kalra, Product Manager, Microsoft Corporation

Manish Kalra is a product manager in the Windows Server group, where he is responsible for high availability planning. Manish joined Microsoft in 2000 and has held positions as a product manager for Microsoft Systems Management Server, a management solutions specialist, and an infrastructure consultant with Microsoft Consulting Services.

View other sessions from: IT Manager Connections: Build Business and Careers on the Microsoft Platform.

