Lessons Learned from last week’s Windows Azure Outage – Data Recovery

With last week’s Windows Azure outage I’ve learned some lessons, and I already talked about a few of them in previous posts. In this one I’ll focus on Data Recovery, since one of the scariest parts of any outage is the fear of losing our data. Fortunately that didn’t happen this time. And why was that? Have you thought about it?

So why wasn’t there any data loss in this outage? Let’s dig into it.

Normally our data lives either in Windows Azure Storage or in SQL Azure, and for both of them Windows Azure has an automatic process in place: every piece of content we store gets 3 replicas, placed in different parts of the Data Center (and, in the case of Windows Azure Storage, also replicated to a different Data Center in the same Region). This was very important in avoiding data loss, because in this “Leap Year bug” outage the Data Centers didn’t shut down completely, so the parts that kept working continued to hold our data. Of course this replication strategy doesn’t cover every scenario: if a whole Data Center crashes at once there could still be data loss, but that isn’t the typical outage, so this approach solves the most common problems.
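The idea behind that replication can be sketched roughly as follows. This is a minimal illustration of spreading copies across failure units, not Azure’s actual placement algorithm; the fault-domain names and the hash-based round-robin choice are my own assumptions:

```python
import itertools

def place_replicas(blob_name, fault_domains, replica_count=3):
    """Pick `replica_count` distinct fault domains for a blob's replicas.

    A toy stand-in for the platform's placement logic: spreading copies
    across separate failure units means a partial outage leaves some
    copies alive.
    """
    if len(fault_domains) < replica_count:
        raise ValueError("need at least as many fault domains as replicas")
    # Deterministic spread: start from a hash of the name, then round-robin.
    start = hash(blob_name) % len(fault_domains)
    chosen = itertools.islice(
        itertools.cycle(fault_domains), start, start + replica_count
    )
    return list(chosen)

domains = ["FD-0", "FD-1", "FD-2", "FD-3"]
placement = place_replicas("invoices.vhd", domains)
print(placement)  # three distinct fault domains out of the four
```

Because the three copies land on distinct fault domains, losing any single one still leaves two replicas reachable.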

Also, the fact that Microsoft has at least 2 Data Centers in the same region greatly reduces the chances of data loss.

But what if the whole Data Center were completely destroyed for some reason? Would I still have my data?

And the answer is “it depends”. In the case of Windows Azure Storage the answer would be yes, because we would have a replica in the other Data Center in the same region, so we would be able to get back into action; it would just take a bit more time. For other services the answer is different: the replicas in SQL Azure are placed inside the same Data Center, so if everything goes down, machines included, we could lose everything. But what are the odds of that? Not too high.

If you don’t like those odds, the best thing you can do is implement a Data Recovery strategy yourself, replicating your data to another Data Center from inside your app. For example, with SQL Azure we could use SQL Azure Data Sync to sync the database to another one in a different Data Center, or even a different Region. We could also use the SQL Azure Import/Export capability to create “backups” (not real backups, since we don’t get the actual transaction log, but good enough, since they contain all the schema and data at a particular point in time, giving us a way to “restore” our data to a previous state). Those exports can be placed in a Windows Azure Blob Storage container, or copied to one of our on-premises machines or any other machine.
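If you schedule such exports regularly, the container fills up, so some retention policy is needed. Here is a small sketch of one that keeps only the most recent exports; the file-name pattern and the keep-count are my own assumptions, not anything SQL Azure prescribes:

```python
from datetime import datetime

def exports_to_delete(export_names, keep=7):
    """Given export blob names like 'mydb-20120229T0300.bacpac',
    return the ones to delete, keeping the `keep` most recent."""
    def stamp(name):
        # Parse the timestamp embedded between the last '-' and the extension.
        raw = name.rsplit("-", 1)[1].split(".")[0]
        return datetime.strptime(raw, "%Y%m%dT%H%M")
    ordered = sorted(export_names, key=stamp, reverse=True)
    return ordered[keep:]

# Ten nightly exports: one per day, Feb 20 through Feb 29.
names = [f"mydb-201202{d:02d}T0300.bacpac" for d in range(20, 30)]
old = exports_to_delete(names, keep=7)
print(old)  # the three oldest exports
```

A job like this could run after each export and delete the returned blobs, so storage costs stay bounded while a week of restore points remains available.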

Another option would be to have the service available in several different geographies and fall back to the others in case of an outage like this one. Of course this has costs, and in some cases it may be enough to point users to a static site hosted in Windows Azure Storage, or to another deployment you have elsewhere, so you never stop working. It always depends on the business requirements we are talking about.
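That fallback idea boils down to probing endpoints in order of preference and taking the first healthy one. A minimal sketch, where the URLs are invented placeholders and the health probe is injectable so the logic can be exercised without a network:

```python
import urllib.request

PRIMARY = "https://myapp-north.cloudapp.example/health"  # hypothetical URLs
FALLBACKS = [
    "https://myapp-west.cloudapp.example/health",
    "https://mystatic.blob.core.example/site/index.html",
]

def pick_endpoint(endpoints, probe=None, timeout=3):
    """Return the first endpoint whose health probe succeeds."""
    if probe is None:
        def probe(url):
            # Default probe: a plain HTTP GET that must return 200.
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return resp.status == 200
            except OSError:
                return False
    for url in endpoints:
        if probe(url):
            return url
    raise RuntimeError("no endpoint is healthy")

# Simulated outage: the primary is down, the first fallback is up.
up = {FALLBACKS[0]}
chosen = pick_endpoint([PRIMARY] + FALLBACKS, probe=lambda u: u in up)
print(chosen)  # the healthy fallback in the other geography
```

In practice this decision would usually sit in a traffic manager or in the client, but the ordering logic is the same.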

So one thing I think we should never forget is Data Recovery: thinking about it during the architecture phase will help your business a lot when something like this happens. And not only in situations like this, since data loss can happen even without an outage, for example because of a bad update that was pushed, or any other issue.
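For the bad-update case, recovery means picking the newest export taken before the update went out. A tiny sketch of that selection, with invented timestamps and names for illustration:

```python
from datetime import datetime

def backup_before(backups, bad_update_at):
    """Return the (timestamp, name) of the newest backup taken strictly
    before `bad_update_at`, or None if there is none."""
    earlier = [(ts, name) for ts, name in backups if ts < bad_update_at]
    return max(earlier) if earlier else None

backups = [
    (datetime(2012, 2, 27, 3, 0), "mydb-27.bacpac"),
    (datetime(2012, 2, 28, 3, 0), "mydb-28.bacpac"),
    (datetime(2012, 2, 29, 3, 0), "mydb-29.bacpac"),
]
picked = backup_before(backups, datetime(2012, 2, 28, 14, 30))
print(picked[1])  # mydb-28.bacpac
```

Restoring that export loses whatever was written between it and the bad update, which is exactly why the post notes these exports are not full backups: without the transaction log there is no finer-grained point-in-time recovery.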

I hope this also helps you understand how to plan and put measures in place to avoid data loss. I’ll continue to blog about other lessons learned here.
