13 Nov 2007

On Follow-Up

Author: q | Filed under: Frustrations, SBS, Security

In yesterday's post, I covered the problems being seen in the community regarding the unexpected behavior on SBS 2003 R2 boxes caused by a problem with a WSUS definition update. Given the volume of traffic that post generated (more hits in the first 4 hours than any other single post on this blog, period), a lot of people were impacted by this issue, and apparently not a lot of information was out there. Yes, I found a number of threads in other discussion forums, but most hinted at the behavior and didn't document the full error codes, etc. So a lot of internet traffic and human effort was expended over this issue yesterday.

Late yesterday afternoon (well, my time anyway) the Official WSUS Blog finally put up a post about the issue and detailed the causes behind it. A few hours earlier, the folks at the Official SBS Blog put up a post detailing the resolution, specifically noting that the normal course of updates for the WSUS services on the server would fix the problem, so that today everyone's SBS boxes should be back to normal.

I checked on the last of my managed servers this morning, the one I left untouched to test this theory for myself, and sure enough, it updated, and WSUS and the Performance Reports are back to “normal” on the servers.

So, all's well that ends well, right? Ah, not exactly.

This event has raised some concern in the community about the WSUS product and the SBS R2 implementation of WSUS. For the remainder of this post, I'm not speaking for the community, but from my own personal concerns about the topic.

Hindsight allows us to look back and see that, in the grand scheme of things, this was not a major catastrophe. In fact, the server that I left completely untouched yesterday to test the automatic update fix had no performance issues at all. The customer who uses this server didn't lose a piece of e-mail, didn't lose access to the server, didn't lose any productivity; in fact, they were never aware that there was even an issue we were looking at. That's good, because that's one less client I have to explain this to, and that makes my life a little easier today.

But at the time we were dealing with this yesterday, we didn't have that insight. What initially looked like a Performance Monitor issue quickly became a WSUS issue, and in the midst of it, we had no idea whether WSUS was completely broken, what it might take to get it back, or what other functionality might be affected. To be honest, when something affects a class of devices across the world, I'm a little more apt to spend time figuring out how it could be impacting my own client base, whom I am ultimately responsible for. The lack of information was frustrating (one of the reasons I put the post up yesterday was so that someone seeing the issue could get concrete evidence that there was a larger problem and that someone was looking into it, even if it wasn't an official Microsoft source), and I really, really hate operating in a vacuum. In total, our operation lost 75% of our business day identifying the problem, diagnosing it, communicating with others about it, and ultimately implementing the workaround for a few of our clients to get them back on track, given that we still didn't know the breadth of the problem. And I know we were not the only business impacted in this way.

Ultimately, I'm concerned that given the nature of the problem and the “fix,” the community has absolutely no way to ensure that this issue won't happen again. By the very nature of the way WSUS operates, and specifically the way SBS R2 implements WSUS, the exact type of mistake made by Microsoft yesterday could happen again and bring down thousands of WSUS processes again. This fact is what is giving me serious pause about WSUS in general and the SBS R2 implementation of WSUS.

In the interest of full disclosure, I am NOT a WSUS guru by any stretch of the imagination. The extent of my understanding of the R2 implementation of WSUS is to make sure that I leave the default settings enabled so that I can see the Green Check of Health and not the Blue Check of Misconfiguration, which should help me better identify when my R2 installations are out of compliance. Reports say that those who manually installed WSUS, specifically configuring it to identify only the updates needed by that particular installation, were not affected by the problem yesterday. In fact, since the problematic update was for a BETA build of a product that I do not have installed at ANY of my client sites (I am not participating in that particular beta), I should not have had any system pull down the definitions for that product. But somehow, an SBS R2 box with a single NIC (i.e., one that could never run ISA to begin with, much less one participating in the ISA Nitro beta) got the definition update for this beta program and lived with a crashed WSUS for a full 24 hours. At least, that's the way I understand it, given my relative inexperience with WSUS.

This simply should not have happened.

For the next few days, I now get to spend time learning about WSUS and see how I can modify the configuration of WSUS on the servers I manage to minimize the risk of this happening again. This means I have to reprioritize my workload so that I can try to make sure my clients have a lower risk of being affected by a problem that, quite frankly, may never appear again. But given Murphy's Law, if I take the road that it won't happen again so I don't need to do anything, as soon as I leave the country (which is happening in less than a week) another mistake will happen that will impact these boxes, and the rest of my operation will be left scrambling to deal with the issue while I'm stuck in a plane. Thanks a lot, Microsoft, for recalibrating my work week for me.

Understand, I don't specifically fault Microsoft for making a mistake. Who among us hasn't made mistakes? Though some have said that this type of mistake should never have occurred, well, stuff happens, you know. What I do fault Microsoft for is the design of the system, which allowed this particular mistake to have such a widespread impact on systems that should never have seen this specific update, ever. How did a server that's not even capable of running ISA get a definition update for a product that's not even a released product? This is what I have to spend time on now: getting a better understanding of how WSUS works so that I better understand the risks I am putting on my clients by using this tool.

Wait, did I just say that running WSUS increases the risk vector for my clients? I thought the entire purpose of WSUS was to help reduce the risk vector for my clients. Ironic.
