It’s been quite some time since I wrote about changing passwords on a Windows service, and then provided a simple tool written in Visual Basic to propagate a password among several systems sharing the same account.
I hinted at the time that this was a relatively naïve approach, and that the requirement to bring all the services down at the same time is perhaps not what you want to do.
So now it’s finally time for me to provide a couple of notes about how this operation could be done better.
1. If you can’t afford an outage, don’t have a single point of failure
One complaint I have heard at numerous organisations is this one, or words to this effect:
“We can’t afford to cycle the service on a password rotation once every quarter, because the service has to be up twenty-four hours a day, every day.”
That’s the sort of thing that makes novice service owners feel really important, because their service is, of course, the most valuable thing in their world, and sure enough, they may lose a little in the way of business while the service is down.
So how do you update the service when the software or OS needs patching? How do you fix bugs in your service? What happens when you have to take it down because the password has escaped your grasp? [See my previous post on rotating passwords as a kind of “Business Continuity Drill”, so that you know you can rotate the password in an emergency]
All of these activities require stopping and cycling the service.
Modern computer engineering practices have taken this into consideration, and the simplest solution is to have a ‘failover’ service – when the primary instance of the service is taken offline, the secondary instance starts up and takes over providing service. Then when the primary comes back online, the secondary can go back into slumber.
This is often extended to the idea of having a ‘pool’ of services, all running all the time, and only taking out one instance at a time as you need to make changes, bringing the instance back into operation when the change is complete.
Woah – heady stuff, Mr Jones!
Sure, but in the world of enterprise computing, this is basic scaling, and if your systems of applications can’t be managed this way, you will have problems as you reach large scale.
So, a single instance of a service that you can’t afford to go offline – is a failure from the start, and an indication that you didn’t think the design through.
2. Old passwords and new passwords are both valid – for a while
OK, so that sounds like heresy – if you’ve changed the password on an account, it shouldn’t be possible for the old password to work any more, should it?
Well, yes and no.
Again, in an enterprise world, you have to consider scale.
Changing the password on an account isn’t an instantaneous operation. That password change has to be distributed among the authentication servers you use (in the Windows world, this means domain controllers replicating new password information).
To account for this, and the prospect that you may have a process running that didn’t yet have a chance to pick up the new password, most authentication schemes allow tokens and/or passwords to be valid for some period after a password change.
By default, NTLM tokens are valid for an hour, and Kerberos tickets are valid for ten hours.
This means that if you have a pool or fleet of services whose passwords need to change, you can generally take the simple process of iteratively stopping them, propagating the new password to them, and then re-starting them, without the prospect of killing the overall service that you’re providing (sure, you’ll kill any connections that are specifically tied to that one service instance, but there are other ways to handle that).
3. Even if you don’t trust that, there’s help – use two accounts
Interesting, but I can’t afford the risk that I change the password just before my token / ticket is going to expire.
Very precious of you, I’m sure.
OK, you might have a valid concern that the service startup might not be as robust as you hoped, and that you want to ensure you test the new startup of the service before allowing it to proceed and provide live service.
That’s very ‘enterprise scale’, too. There’s nothing worse than taking down a dozen servers only to find that they won’t start up again, because the startup code requires that they talk to a remote service which is currently down.
You wouldn’t believe how many systems I’ve seen where the running service is working fine, but no more can be started up because startup conditions for the service cannot be replicated any longer.
So, to allow for the prospect that you may fail on restarting your services, here’s what I want you to do:
- Start with a large(-ish) pool of services. [See “no single point of failure” above]
- All of your services are running as one user account. Create another user account, with the same rights and privileges. Make sure that access rights are provided to the group that these accounts are a member of, rather than to individual accounts.
- Wind down one of the services, and shut it down.
- Change the downed service to use the second account you just created.
- Start up the downed service.
- Monitor this newly started service to make sure it starts up successfully and is providing correct service. (Yes, this means that you have the ability to roll-back if something goes wrong)
- Repeat steps 3 – 6 with each of the other services in the pool in turn, until all are using the second account.
- De-activate / disable logon to the old user account. Do not delete it.
As you can probably imagine, when you next do this process, you don’t need to create the second user account for the server, because the first account is already there, but disabled. You can use this as the account to switch to.
This way, with the two accounts, every time a password change is required, you can just follow the steps above, and not worry.
You should be able to merge this process into your standard patching process, because the two follow similar routines – bring a service down, make a change, bring it up, check it for continued function, go to the next service, continue until all services are done.
So, with those techniques under your belt – and the necessary design and deployment practices to put them into place – you should be able to handle all requests to rotate passwords, as well as to handle patching of your service while it is live.
Sorry that this doesn’t come with a script to execute this behaviour, but there are some things I’m hoping you’ll be able to do for yourselves here, and the bulk of the process is specific to your environment – since it’s mostly about testing to ensure that the service is correctly functioning.