Whole farm is down because timer jobs are not running

One of my clients managed to take his entire farm offline this week by upsetting the timer service. First a little background: currently they are scrambling to get SharePoint back to a happy state. Why? Well, as happens with lots of customers, SharePoint is too successful. When we originally set up their farm and upgraded from SPS 2003 to MOSS 2007 they had about 20 GB of content that was growing at a very controlled pace. Fast forward a little more than a year and their content database is about 320 GB. YIKES! Even scarier, most of their data is in one site collection. This is bad, very bad! Typical guidance is that content databases should stay under 100 GB.

Part of this growth has forced some moving of the databases to different drives and a database restore to deal with another issue. Well, anytime you want to move SharePoint databases around you should first run the command stsadm -o preparetomove, as documented by Cory Burns in the post Detaching databases in MOSS. If you don’t, you will start getting sync errors once an hour such as:

Failure trying to synch web application 09a21da5-4485-4b00-8268-772aea7fea12, ContentDB 65301403-c277-4b4c-ad5a-e822572d10ea: A duplicate site ID 3b3a4372-aa91-4e0c-ba57-2567958d81bb(http://portal/sites/test1) was found. This might be caused by restoring a content database from one server farm into a different server farm without first removing the original database and then running stsadm -o preparetomove. If this is the cause, the stsadm -o preparetomove command can be used with the -OldContentDB command line option to resolve this issue.

Cory then goes on to explain how to fix it using stsadm -o sync. This is where my client was. He ran this command, but for some reason (possibly specifying the wrong switches and accidentally deleting a content db) the command hung for a long time, and the portal users were unable to access the environment. So he killed the stsadm process. From that point all hell broke loose.

For several hours they attempted a lot of fixes found on the web. One of the fixes had them rename the folder located at C:\Documents and Settings\All Users\Application Data\Microsoft\SharePoint\Config\<guid>\. This was a bad option. The folder contains XML files for all of the timer job definitions that need to be run. The idea was that renaming the folder would cause SharePoint to create a new empty copy of it, start creating the XML files again, and get back to work. Nope, that isn’t how it works. What they needed to do was delete all of the XML files and leave the folder alone. Then when they restarted the timer service the proper XML files would have magically reappeared.
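
For anyone who wants to script the cleanup, here is a rough Python sketch of the "delete the XML files, leave the folder alone" step. It is only an illustration, not an official tool: the function name clear_config_cache is made up, it assumes the timer service has already been stopped on the server, and the cache.ini reset to 1 follows the procedure Anthony describes in the comments rather than anything documented here.

```python
from pathlib import Path


def clear_config_cache(cache_dir):
    """Delete cached *.xml files and reset cache.ini to "1".

    Hypothetical helper: assumes the timer service is already stopped
    and that cache_dir is the Config\\<guid> folder for this farm.
    Returns the number of XML files removed.
    """
    cache_dir = Path(cache_dir)
    cache_ini = cache_dir / "cache.ini"
    if not cache_ini.is_file():
        # Refuse to touch a folder that does not look like the config cache.
        raise FileNotFoundError(f"no cache.ini in {cache_dir}")

    removed = 0
    for entry in cache_dir.iterdir():
        # Only the XML files go; the folder itself and cache.ini stay put.
        if entry.is_file() and entry.suffix.lower() == ".xml":
            entry.unlink()
            removed += 1

    # Overwrite the contents of cache.ini with the number 1.
    cache_ini.write_text("1")
    return removed
```

You would call it once per server with the full path to that server's Config\<guid> folder, then restart the timer service. The cache.ini check is just a guard so the function fails loudly if it is pointed at the wrong directory.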

Hope this helps you

Shane

SharePoint Consulting

6 thoughts on “Whole farm is down because timer jobs are not running”

  1. I want to clarify that the “sync” operation of STSADM actually has zero destructive switches. There is a switch called “-deleteolddatabases” but it is not what the name implies. This switch simply purges the sync tables which will force SharePoint to repopulate them the next time the sync job runs.

    Just bad naming on Microsoft’s part.

  2. Hi Shane,

    Thanks for the post. Last thing I heard, Microsoft did not officially support moving the configuration database – period.

    However I’ve become very familiar with the procedure around moving and restoring config databases over the past year. It’s really not that big of a deal.

    Firstly, I always use SQL2005 aliases. That way you can freely move all your databases without having to change anything (well almost – see below) in SharePoint. You just change the database server in the alias, and hey presto you’re redirected to the new SQL box.

    Secondly, if you either move or restore the config database, you must ALWAYS clear the SharePoint cache on all servers in the farm. Otherwise chances are your timer service will always refuse to play ball.

    This is the procedure you describe above, and the best practice I’ve developed is:

    1) Stop the timer service on all servers in the farm

    2) On all servers in the farm, delete all the xml files in C:\Documents and Settings\All Users\Application Data\Microsoft\SharePoint\Config\<guid>\.

    [NB: If you've done repeated deployments on the same farm/server, you may have several GUIDs in that folder. To establish which is the current one in use, check the id string in this reg key: HKLM\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Secure\ConfigDB]

    Do NOT delete the cache.ini file!!!!!

    3) Open the cache.ini file in that folder, and replace the contents with the number 1, and save. Repeat on all servers in the farm.

    4) Restart the timer service on all servers.

    I’ve written a nice little script which takes care of all of this, let me know if you want a copy. sharepoint@antpoole.com

    Regards,
    Anthony

  3. The web is filled with lots of tricks on how to get into the databases, what files and fields can be hacked, and cool tricks that MS does not support. Although these are really great and can get you out of a bind, it is the I’ll-do-it-myself endgame. If going down this path does not fix the problem, you’re on your own to figure out a fix, since you will not be supported by MS. Creating a new content database takes no more than a few minutes (maybe even less time than all the cool hack steps), and you know you will not be haunted later by some setting that was buried so deep in the database you would never have found it.

    I suggest tricks be used when all supported methods have failed and you are totally without hope.

  4. I have had a problem with my timer jobs ever since I moved the config database onto a new SQL 2005 server. The timer jobs link errors out, and it looks like it’s still looking for my old SQL server, as I get the following error message:
    Unable to connect to database. Check database connection information and make sure the database server is running.

    If anyone could offer any advice it would be greatly appreciated. Regards
    Nick

  5. @Nick

    Did you ever resolve the timer job link error:
    Unable to connect to database. Check database connection information and make sure the database server is running.

    romie
