This is kind of a regurgitation of a couple of threads on the microsoft.public.exchange.clustering newsgroup. In those threads, there were questions about the whole Active/Active issue. Several people, including a couple of good friends and a couple of top-notch Microsofties, pointed out the evils of Active/Active. To be clear, Microsoft supports A/A for Exchange but does not recommend it. Best practices are developed from Microsoft's internal usage (often referred to as eating their own dog food), the early deployment programs, and the trouble reports and experiences of many customers as reported and tracked through PSS.
Over the years, I have explained to my students that Active/Passive is the best practice when it comes to clustering Exchange. Almost always, a student will protest that their managers and others don't want a wasted node, and they want to know why A/A is such a problem. I point out that store.exe is well known for sucking up all the RAM it can get. So, if you have two servers (node1 and node2) both running store.exe and each consuming a very large amount of RAM, you can expect problems when one resource hog fails over to a node where another resource hog lives. According to all of the literature, the store.exe on the surviving node should give up enough memory for the failing node's store.exe to exist alongside it; both store.exe processes will basically drain down (this is a very high-level summary and not the term normally used, but I think it helps in understanding what is happening) so that both have smaller memory footprints and can coexist. In practice, this process is less than smooth.

Another well-documented concern is that if both Exchange Virtual Servers (EVSs) live on the same node, their stores and storage groups add together and count against the limits in Exchange. For example, if EVS1 has three storage groups and EVS2 has three storage groups, combined they exceed the Exchange limit of four storage groups per server, and they cannot both function on the same node.
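The storage group math above can be written out as a quick sanity check. This is just an illustrative sketch (the function and names are hypothetical, not any Exchange API), using the limit of four storage groups per server from the example:

```python
# Hypothetical sketch, not an Exchange API: can two Exchange Virtual
# Servers' storage groups coexist on one node after a failover?
MAX_STORAGE_GROUPS = 4  # per-server limit cited above

def can_coexist(evs1_groups: int, evs2_groups: int) -> bool:
    """True if the combined storage groups fit under the per-node limit."""
    return evs1_groups + evs2_groups <= MAX_STORAGE_GROUPS

print(can_coexist(3, 3))  # False: 3 + 3 = 6 exceeds the limit of 4
print(can_coexist(2, 2))  # True: 2 + 2 = 4 just fits
```

The point of writing it this way is that the constraint is on the node, not the EVS: each EVS is legal on its own, but the failover target inherits the sum.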
Anyway, the issue in this discussion was performance. With two active nodes, their memory, their CPUs, and their disk spindles should (by some basic logic) provide better overall performance than one active node with the same resources. At first glance this makes a great deal of sense.
When you dig deeper, this common sense stops making sense. Wow, did I just type that? Try to follow me here (it should be easy, I am a pretty big guy).
According to Microsoft, in a best practice configuration you should manage resource consumption so that you don't exceed 80% CPU utilization. This is for a single server. If two nodes are active in an A/A cluster, and there must be room to fail over to a single node, then to stay within best practices each node should run at no more than 40% CPU utilization. This is basic math: 40 + 40 = 80. This is discussed in KB article 815180 here http://support.microsoft.com/default.aspx?scid=kb;en-us;815180&product=exch2003. That article also discusses the limit of 1,900 concurrent users per node. It doesn't, however, address the added scalability of multiple server backplanes, multiple Fibre Channel adapters, and multiple spindles. So the argument becomes: do you really get enough benefit out of the additional I/O provided by an A/A cluster while still strictly limiting CPU? I would probably go so far as to say yes, but only because it is very clear from all of the Exchange work I have done that disk I/O is the limiting factor for higher performance.
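The 40/40 arithmetic generalizes to a tiny capacity rule. This is an illustrative sketch with hypothetical names (not from KB 815180 or any Microsoft tool), assuming only the 80% single-server ceiling discussed above:

```python
# Illustrative capacity math: in an A/A cluster a single surviving node
# must absorb every workload, so each node's steady-state CPU must stay
# under ceiling / active_nodes to keep the survivor within best practice.
SINGLE_NODE_CEILING = 80.0  # best-practice max CPU % for one server

def max_cpu_per_node(active_nodes: int) -> float:
    """Max steady-state CPU % per node so one survivor stays at the ceiling."""
    return SINGLE_NODE_CEILING / active_nodes

print(max_cpu_per_node(2))  # 40.0 -> the 40 + 40 = 80 rule
```

Note that the rule cuts both ways: the more active nodes you add, the less of each node you are allowed to use, which is exactly why the "wasted node" argument for A/A is weaker than it looks.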
So to summarize the arguments and discussions, Active/Active:
- Provides greater disk I/O, but only if you assume that you would not use the same number of spindles in a comparable A/P configuration. I am not sure that is a fair assumption, although I can say from experience that it is easier to ask the storage group for more spindles when you have multiple active nodes; I don't feel most organizations would consider dedicating that many spindles to an A/P configuration to be within reason.
- Provides more RAM for the two store.exe processes when not in a failed state, which results in better performance.
- Provides greater throughput via additional HBAs and additional server backplanes when not in a failed state, which results in better performance.

Active/Passive, on the other hand:
- Provides the same CPU as an A/A cluster, based on the 40/40 rule previously discussed.
- Provides the same performance whether in a failed state or not.
- Is the best practice and the recommended configuration.
There are other issues to consider, for example:
- Do the inter-Exchange messaging (email from node1 to node2) and the loss of single-instance storage outweigh any performance gains from A/A?
- With two A/A nodes fully subscribed at 40% CPU, are they hitting I/O bottlenecks and thus actually using the additional RAM, additional spindles, and greater backplane bandwidth?
- Is there a tendency in most organizations to oversubscribe the two A/A nodes, pushing them well over 40% CPU utilization?
- Is there a tendency to oversubscribe a single A/P node as well?
The reason I bring up this whole topic in this blog entry is that the A/A vs A/P issue really isn’t as cut and dried as many of us would like to believe.
However, A/A is strongly discouraged for a reason, and that reason is its impact on high availability. Therefore, Active/Active is evil. 🙂