Server Health Checks

I’d like to share some of the things I look at while do a health check on a server.  Its funny how few resources there are out there on the Internet.  I believe people keep this kind of stuff to them self because they are scared they are going to miss something and they will never live it down.  My response to that is, So What!  Heck, I don’t claim to know it all but why not share what I do know and maybe others can share via the Comments!!!

When I’m troubleshooting I like to compartmentalize what I”m looking for.  With that my health checks are set up the same way.  I also believe health checks are quick snapshots of the health of a server.  Sure there are tools that you can use to analyze systems further but in this case we are doing a quick health check.  Not all of these need to be done but some should, you get to decide.

CPU

Occasional high CPU spikes are ok as long as you are aware of the process causing this. A server should maintain 80% CPU utilization for an extended period of time.  If it does it may be time to upgrade.  Its a good idea to keep Task Manager open during the duration of your troubleshooting to see trends.

Check CPU Usage

  1. Open Task Manager

  2. Check the Processes tab, ensure there are no processes consuming excessive CPU

  3. Check the Performance tab, ensure there are no single CPU’s that have excessive CPU usage

Check CPU HW

  1. Open Device Manager (right click computer –> Manage)

  2. Ensure that no CPU’s have red X or yellow ! underneath the Processors

Processes

This is one area that you may not want to do for quick health checks but is something you should be familiar with.  Task Manager only gives you basic info on processes and you will find that you may need to dig a bit deeper.  For that I recommend Process Monitor from the great SysInternal tools.  Process Explorer can also be used.  In fact download and play with all these tools…they will save your bacon, I guarantee it.

In-Depth Check
SysInternals:

Copy Process Monitor locally, then launch it.

  1. Analyze each process and watch what operations open the reg keys, file etc.

Copy Process Explorer locally, then launch it.

  1. Analyze each process based upon the number of threads, handles, loaded DLL’s,etc.

Two great webcasts can be viewed here to see these types of tools in action.

Memory

General rule of thumb is to make sure the general memory utilization does not exceed 80%within a given period of time.

Check Memory Availability

    1. Open Task Manager
    2. Select the Performance tab

    3. Look at the Physical memory box,and multiply the total memory by .2

    4. If the total available memory is less than this number then the box is currently utilizing more than 80 percent of the memory.

Current utilization by process

  1. Select the Process tab

  2. Check the ‘show processes from all users’ box in the bottom left corner

  3. Click the column header ‘Mem Usage’ to sort the processes by memory utilization, highest to lowest. This will help you determine what processes are currently utilizing the memory on the box and can help you narrow your search for memory intensive processes.

Network

Check NIC HW

  1. Verify both ends of the network cable are securely seated in the port

  2. On the back of the server verify you have a green blinking link light on the NIC port

  3. Verify NIC HW is working properly by using Device Manager and ensure the active NICs are showing green

  4. Verify gateway, IP, subnet mask, DNS, DNS suffixes, etc. are properly configured.

  5. If everything is properly configured and HW is working, you should be able to get a ping response from the gateway.

Check Network Connections
Here are some other checks you should perform to ensure proper network connectivity:

  1. ipconfig /all will display all you TCP/IP settings including you MAC address

  2. ipconfig /flushdns will flush your dns resolver cache

  3. ipconfig/displaydns will display what is in your dns name cache

  4. Netstat -an command will show all the connections & ports from a machine

  5. Nbtstat command will show net bios tcp/ip connection stats

  6. Tracert <IP or DNS Name> command will show you the path the packet takes, the routers, and the response time for each hop.

  7. pathping <IP or DNS Name> command combines ping and tracert to the 100th degree.  It pings each hop 100 times and is great for testing wan connectivity

Disk Space

All kinds of bad stuff can happen when your disk space is filling up.  The best way to alleviate this is to write a script to notify you when you reach a certain threshold. In a future post I”ll share a method for you to do just that…however if there is a problem and you need to perform a health check then here is how you check the space the old fashion way.

To check disk space manually:

  1. Right Click on My Computer

  2. Select Manage

  3. Select Disk Management

  4. Validate each disk more than 10 percent free space

Event Logs

Event logs can reveal a more historical perspective on what is going on with the system and applications. Things to look for when troubleshooting event logs is to query either the system or the application logs and look for the presence of events that have a timestamp near the time of the issue you are troubleshooting.

Events have 3 categories in the event viewer:

  • Informational: Noted with a white icon and letter ‘i’. Successful operations are logged as informational. Usually not used in troubleshooting problems or failures

  • Warning: Noted with a yellow icon and exclamation point. These usually are looked up as they serve as predictive future failure indicators, such as disk space running low, dhcp ip address lease renewal failures, etc.

  • Error: Noted with a red circle icon and ‘x’. These are indications that something has failed outright and are a good starting point for troubleshooting.

When looking at event logs, use the information to determine the following:

  • Is the incident tied to a particular time or outage incident?

  • Is this a one-off, or has this particular error occurred multiple times in the past?

  • Does this error appear on other systems or is it unique to the system that has failed?

Also make sure you take a look at eventcombmt from Microsoft.  This tool allows you to search the logs of multiple machines.  The benefit to this is to see if a specific error or warning message is also occurring on other systems.  This can help rule out issues.

Services

Troubleshooting services should be limited to the specific that is affected by the problem being troubleshot. Each server will have specific services varying upon the types of applications running. You should document how your servers services are configured to and compare that to the server in question to see if anything is not configured correctly.

Cluster

Servers that host applications and services that require high availability should be clustered so that if one node fails the other can pick up the workload.  Clustered servers need the same type of health checks as stand-alone systems except you will want to check on the health of the cluster.

Check Cluster Resource Status

  1. Open Cluster Administrator: Log onto server, select Start –> Run –> cluadmin

  2. Check the Resources and ensure all are Online

  3. If Cluster Administrator does not open, ensure that the Cluster Service is running on the node.

  4. Cluster resource status can also be checked from a remote server. From a command prompt, just type – cluster res <cluster name>

Client Side Health

  1. Right click on My Computer, select Manage

  2. Open Device Manage

  3. Drill down to SCSI and RAID Controllers, verify that the HBA HW is visible and does not show any errors

  4. If it does not show up in Device Manager, you may need to re-scan for the HW, re-seat the fiber card, or re-install the driver.

  5. If the HBA is showing healthy in Device Manager, open the tool that you use to view configuration and settings for the fiber card and verify there aren’t any transmit/receive errors on link statistics or counters

Switch Health

  1. Make sure fiber is properly connected to each switch

  2. Make sure switch has no errors

  3. If you’re using zoning verify it is properly configured

Check Fiber and SAN Connectivity

  1. Log onto san appliance and verify that the SAN is in general good health and no major errors are present for the controllers, loops, switches, or ports.

  2. Ensure that the LUNs are presented to the servers in the cluster

NLBS

Some applications will require you to spread the load across multiple servers.  Web servers are a very popular choice to network load balance.  As with clusters we will need to check the status of the load balancing.

Check NLBS Status CMD Line

  1. From a command prompt on the local system, run ‘wlbs query’. This will give you the convergence status of the local node with the nlbs cluster.

  2. Other useful NLBS commands: wlbs stop (stops nlbs), wlbs start (starts nlbs), wlbs drainstop (drains node)

Check NLBS Configurations

  1. Open up the network properties –> Network Load Balancing, right click & select Properties

  2. On the Cluster Parameters tab, verify that the IP address is configured for the shared NLBS IP and that the subnet mask, domain, and operation mode are configured correct1y.

  3. On the Host Paramters tab, make sure each node of the cluster has a unique host identifier. Also verify the IP and subnet mask are configured for the local values.

  4. Also make sure that your switch has a static ARP entry if using multi-cast NLBS. The entry should be that of the virtual MAC of the cluster. To get the virtual MAC of the cluster, you can run the following command: WLBS IP2MAC <virtual IP address>

Name Resolution

To healthcheck name resolution, open a command prompt and enter the following

  • nslookup <servername>

Verify that the servername is correctly entered in DNS

If a record does not show up in the DNS query, or maps to a different name, perform a reverse lookup by IP address to see what name is associated with the IP address * nslookup <IP address>

If no name shows up associated with the IP address, log into the domain controller and check the DNS records for this particular name/ip address

  1. From a Domain Controller go to start–>run–>dnsmgmt.msc

  2. Expand the Forward Lookup Zones

  3. Expand the zone for you primary zone that holds the records for the system/s you are troubleshooting

Validate that the record exists. If it does not exist manually enter the record name and IP address by right clicking on this same zone,

  1. Select new host (a)

  2. Enter the name and IP address

  3. Check the box next to Create associated pointer (PTR) record

  4. Click add Host

Additionally log back into the node that you manually entered the record for and ensure that DNS is registering in DNS

  1. Right click on the My Network Places icon on the desktop and select Properties

  2. Double click on the primary adapter

  3. Select properties

  4. Highlight internet protocol (TCP/IP) and select properties

  5. Validate the IP addresses of the DNS servers are correct

  6. Select Advanced

  7. Select DNS tab

  8. Make sure the box is checked next to Register this connection’s address in DNS

As I wrap this up I realize there is so much more that can be done.  Each application type of server needs its own set off health checks.  For example web servers, terminal servers and database servers.  Remember this is just the baseline for each server and that other components can and should be layered on top of it.  Again I would love to hear from others so please feel free to add you comments below.

1 Comment

  1. Ravi says:

    Thanks a lot for your valuble info.

Leave a comment

*