One of my client's servers stopped working over the weekend. It literally just stopped: we could not log in via the console, and remote connectivity was unresponsive. We had to power cycle the system to get it running again. After it came up, I began to comb the event logs for more information. I found that it had stopped working on Saturday at around 1:50pm. I know this because Windows logs to a file every five minutes to say “I was running at this time…” and then, if a reboot was unscheduled, it reports in the system event log with an event ID of 6008 and the following text: “The previous system shutdown at 1:50:52 PM on 17/01/2009 was unexpected.” The system event log also held an event with an event ID of 1 and a source of WHEA_Logger, with the following text: “An uncorrected hardware error occurred. A record describing the condition is contained in the data section of this event.”
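If you want to pull that timestamp out of the log without reading through it by hand, something like the following could work. It is only a sketch: the function names are my own, the message layout is taken from the event text quoted above, and the day/month/year date order is an assumption based on this server's locale. Real log entries on other systems may be formatted differently. The second function just builds a `wevtutil` command line for listing the newest events with a given ID; it does not run it.

```python
import re
from datetime import datetime

# Assumed message layout, copied from the event 6008 text quoted above.
# The date order (day/month/year) depends on the server's locale and is an assumption.
PATTERN = re.compile(
    r"The previous system shutdown at (?P<time>\d{1,2}:\d{2}:\d{2} [AP]M) "
    r"on (?P<date>\d{1,2}/\d{1,2}/\d{4}) was unexpected"
)


def parse_unexpected_shutdown(message):
    """Return the last recorded running time from an event 6008 message, or None."""
    match = PATTERN.search(message)
    if match is None:
        return None
    return datetime.strptime(
        f"{match['date']} {match['time']}", "%d/%m/%Y %I:%M:%S %p"
    )


def build_query_command(event_id=6008, count=5):
    """Build (but do not run) a wevtutil command listing the newest matching System events."""
    return [
        "wevtutil", "qe", "System",
        f"/q:*[System[(EventID={event_id})]]",  # XPath filter on the event ID
        f"/c:{count}",                          # maximum number of events to return
        "/rd:true",                             # newest events first
        "/f:text",                              # plain-text output
    ]


if __name__ == "__main__":
    text = "The previous system shutdown at 1:50:52 PM on 17/01/2009 was unexpected."
    stamp = parse_unexpected_shutdown(text)
    print(stamp, stamp.strftime("%A"))
    print(" ".join(build_query_command()))
```

On the server itself, the list returned by `build_query_command()` could be passed to `subprocess.run` from an elevated prompt, and each message in the output fed to `parse_unexpected_shutdown`.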
Hmm, interesting. Windows thinks there was a significant hardware error, one that caused the system not to blue screen but to lock up completely. Digging into WHEA_Logger, I found this document that describes just what the Windows Hardware Error Architecture is and what it does.
I looked into the HP iLO logs, which should record all hardware failures, but they are clear. This suggests that the OS thinks there was a major hardware failure, but that the hardware knows nothing about it. I don’t like this at all. The plan of attack at this point is to upgrade the firmware and drivers on the server to the current versions across the board and monitor the situation. If the system fails again, we will log the fault with HP (it’s an HP DL580 G5). Given that we will already have upgraded the firmware and drivers to the latest versions, that should let us skip some of the diagnostics they would normally perform.