Visual Studio Live @ Las Vegas Presentations – Tips and Tricks on Architecting Windows Azure for Costs

Unfortunately I wasn't able to go and speak at Visual Studio Live @ Las Vegas as scheduled, due to an illness that made it impossible for me to travel and kept me in bed for a few days.

But even though I wasn't there, I'd like to share with you some of the key points of the topic "Tips and Tricks on Architecting Windows Azure for Costs".

Tips & Tricks On Architecting Windows Azure For Costs
The key points to achieve this are:
  • Cloud pricing isn't more complex than on-premises pricing; it's just different.
  • Every component has its own characteristics; adjust them to your needs.
  • Always remember that requirements impact costs, so choose the ones that are really important.
  • Always remember that developers and the way things are developed impact costs, so plan, learn and then code.
  • The Windows Azure pricing model can improve code quality, because you pay for what you use and can discover very early where things are going off plan (see the small example after this list).
  • But don't over-analyze! Don't freeze just because everything has a cost impact; the same things impact you today, the difference being that you normally don't see them as quickly or as transparently. So "GO FOR IT", you'll find it's really worth it.
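To make the "pay for what you use" point concrete, here is a toy back-of-the-envelope calculation of how instance size and count turn into a monthly compute bill. The hourly rates below are made-up placeholders, not real Windows Azure prices; the point is only that the trade-offs become visible very early.

```python
# Toy cost estimate: how instance size and count turn into a monthly bill.
# The hourly rates are hypothetical placeholders, not real prices.
HOURLY_RATE = {"Small": 0.12, "Medium": 0.24, "Large": 0.48}  # $/hour, made up
HOURS_PER_MONTH = 730

def monthly_compute_cost(size: str, instance_count: int) -> float:
    """Rough monthly cost of running `instance_count` instances of a given size."""
    return HOURLY_RATE[size] * instance_count * HOURS_PER_MONTH

# Two ways to buy roughly the same capacity: same cost, but very different
# redundancy characteristics. The pricing model surfaces that choice early.
print(monthly_compute_cost("Large", 2))   # 2 large instances
print(monthly_compute_cost("Small", 8))   # 8 small instances
```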

In upcoming posts I'll go in depth on each one of those.

Special thanks to Maarten Balliauw for providing a presentation he had done previously that I could build on.

Visual Studio Live @ Las Vegas Presentations – Architecture Best Practices in Windows Azure

Unfortunately I wasn't able to go and speak at Visual Studio Live @ Las Vegas as scheduled, due to an illness that made it impossible for me to travel and kept me in bed for a few days.

But even though I wasn't there, I'd like to share with you some of the key points of the topic "Architecture Best Practices in Windows Azure".

Here are 10 key Architecture Best Practices in Windows Azure:

  1. Architect for Scale
  2. Plan for Disaster Recovery
  3. Secure your Communications
  4. Pick the right Compute size
  5. Partition your Data
  6. Instrument your Solution
  7. Federate your Identity
  8. Use Asynchronous Communication and Reduce Coupling (see the sketch after this list)
  9. Reduce Latency
  10. Make Internal Communication Secure
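To illustrate number 8 (asynchronous communication and reduced coupling), here is a minimal sketch of decoupling a front end from a worker through a queue. It assumes the azure-storage-queue Python package purely for illustration; the connection string, queue name and handler are placeholders, not anything from the original talk.

```python
# Minimal sketch of queue-based decoupling: the front end enqueues work and
# returns immediately, the worker drains the queue at its own pace.
# The connection string and queue name are placeholders.
from azure.storage.queue import QueueClient

CONNECTION_STRING = "<your-storage-connection-string>"  # placeholder

def enqueue_order(order_id: str) -> None:
    """Front end: drop the work item on the queue and return right away."""
    queue = QueueClient.from_connection_string(CONNECTION_STRING, "orders")
    queue.send_message(order_id)

def handle_order(order_id: str) -> None:
    # Placeholder for the real business logic.
    print(f"processing order {order_id}")

def process_orders() -> None:
    """Worker: process queued items independently of the front end."""
    queue = QueueClient.from_connection_string(CONNECTION_STRING, "orders")
    for message in queue.receive_messages():
        handle_order(message.content)
        queue.delete_message(message)  # delete only after successful processing
```

Because the front end only has to enqueue and return, a slow or temporarily unavailable worker never blocks it.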

In upcoming entries I'll go in depth on each one of those.

Lessons Learned from last week’s Windows Azure Outage – Data Recovery

With last week's Windows Azure outage I learned some lessons, and I already talked about some of them in previous posts. In this one I'll focus on Data Recovery, since one of the scariest parts of any outage is the fear of losing our data. Fortunately that didn't happen this time. Why was that? Have you thought about it?

So why wasn't there any data loss in this outage? Let's dig into it.

Normally our data is placed either in Windows Azure Storage or in SQL Azure, and for both of them Windows Azure has an automatic process in place: for every piece of content we store, three replicas are kept in different parts of the Data Center, and in the case of Windows Azure Storage one copy is also geo-replicated to a different Data Center in the same region. This was very important in avoiding data loss, because in this "Leap Year bug" outage the Data Centers didn't shut down completely, so parts of them kept working and kept our data available. Of course this replication strategy doesn't cover every problem; if an entire Data Center crashed at the same time there could be data loss, but that isn't the typical outage, so this approach solves the biggest problems.

The fact that Microsoft has at least two Data Centers in the same region also greatly reduces the chances of data loss.

But what if the whole Data Center were completely destroyed for some reason? Would I still have my data?

And the answer is "it depends". In the case of Windows Azure Storage the answer would be yes, because we would still have a replica in the other Data Center in the same region, so we would be able to get back into action; it would just take a bit more time. For other services the answer is different: the SQL Azure replicas are all kept inside the same Data Center, so if the whole Data Center, machines included, went down we could lose everything. But what are the odds of that? Not too high.

If you don't like those odds, the best thing you can do is implement a Data Recovery strategy, such as replicating all your data to another Data Center from within your application. For example, with SQL Azure we could use SQL Azure Data Sync to synchronize the database to another one in a different Data Center or even a different region, or use the SQL Azure Import/Export capability to take "backups" (not real backups, since we don't get the transaction log, but usually enough, since they contain all the schema and data at a point in time and give us a way to "restore" to a previous state) and place them in a Windows Azure Blob Storage container, or even copy them to one of our on-premises machines or any other machine.
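As a rough sketch of the "keep copies of your exports somewhere else" idea, the snippet below copies exported backup files from a storage account in one region to an account in another region. It assumes the azure-storage-blob Python package; the connection strings and container name are placeholders, and for a private container the source URL would need a SAS token appended.

```python
# Minimal sketch: copy exported database backups from a primary storage account
# to a secondary account in a different region. All names are placeholders.
from azure.storage.blob import BlobServiceClient

PRIMARY_CONN = "<primary-storage-connection-string>"      # account in region A
SECONDARY_CONN = "<secondary-storage-connection-string>"  # account in region B
CONTAINER = "sql-exports"

def replicate_backups() -> None:
    primary = BlobServiceClient.from_connection_string(PRIMARY_CONN)
    secondary = BlobServiceClient.from_connection_string(SECONDARY_CONN)

    source_container = primary.get_container_client(CONTAINER)
    target_container = secondary.get_container_client(CONTAINER)

    for blob in source_container.list_blobs():
        # Server-side copy: the secondary account pulls the blob from the source URL.
        # For a private container, append a SAS token to this URL.
        source_url = source_container.get_blob_client(blob.name).url
        target_container.get_blob_client(blob.name).start_copy_from_url(source_url)
        print(f"started copy of {blob.name} to the secondary region")

if __name__ == "__main__":
    replicate_backups()
```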

Another option would be to have the service available in several different geographies and fall back to the others in the event of an outage like this one. Of course this has costs; in some cases it may be enough to point to a static site hosted in Windows Azure Storage, or to another deployment you have elsewhere, just so you don't stop working. It always depends on the business requirements in each case.
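As a small illustration of that fallback idea, here is a sketch of a client-side health probe that walks a list of deployments and ends with a static page as the last resort. The URLs are hypothetical, and in a real solution you would more likely put a traffic-routing service in front rather than rolling your own.

```python
# Minimal sketch: probe a list of deployments in order and use the first healthy one.
# All URLs are hypothetical placeholders.
import requests

ENDPOINTS = [
    "https://myservice-primary.cloudapp.net/health",        # primary deployment
    "https://myservice-secondary.cloudapp.net/health",      # deployment in another geography
    "https://mystatus.blob.core.windows.net/site/index.html",  # static fallback page
]

def pick_endpoint(timeout_seconds: float = 3.0) -> str:
    """Return the first endpoint that answers; route traffic there."""
    for url in ENDPOINTS:
        try:
            if requests.get(url, timeout=timeout_seconds).status_code == 200:
                return url
        except requests.RequestException:
            continue  # unreachable, try the next one
    raise RuntimeError("no endpoint available")

if __name__ == "__main__":
    print("routing traffic to", pick_endpoint())
```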

So one thing we should never forget is Data Recovery: thinking about it during the architecture phase will help your business a lot when something like this happens, and not only during outages, since data can also be lost without one, for example because of a bad update that was pushed, or any other issue.

I hope this also helps you understand how to plan and put measures in place to avoid data loss. I'll continue to blog about other lessons learned here.

Lessons Learned from last week's Windows Azure Outage – Redundancy

As you might already know, Windows Azure had an outage last week, and it generated a lot of fuss, just as when the same happened to Amazon last year, or to other providers. Based on these outages, lots of people are now saying that Cloud Computing shouldn't be an option because it sometimes fails, because it has outages, and so on. That isn't really the right way to look at the issue: when we have everything inside our own Data Center, bad things happen too, from someone doing a network update that crashes it, to machines that just "die" from one moment to the next, and a lot more.

What these outages remind us is that even when going to the Cloud, and to Windows Azure in particular, we still need to analyze the impact an outage might have on our business. Cloud Computing gives us a better platform and a more secure starting point, since it already has Disaster Recovery and Data Replication mechanisms in place and provides SLAs, but if we need more than those SLAs we really need to work on it and architect for it right from the start. That isn't a fault of Cloud Computing; it's a requirement our business has, and will keep having, inside or outside our own Data Center.

When we are dealing with something inside our own Data Center, what we do is add Redundancy. Let's take a real-world example that isn't IT related: airplanes don't need all of their engines to fly, but they have them because if 50% fail, the other 50% will still get the airplane to its destination without problems. The level of redundancy depends on the reliability of the engines and on the impact of a failure; since airplanes don't work very well without working engines, this is critical, so sometimes you see 50% redundancy and sometimes 75%, as happened in the earlier days of aviation.

We need to do the same with our solutions when building them on Windows Azure: understand the impact an outage has on the business, and then plan Redundancy and Disaster Recovery based on that. We already have some things we can count on, namely how Windows Azure takes care of Storage, SQL Azure, Compute and so on, since it provides SLAs that already give us a very good baseline. In the case of Storage, it keeps three replicas of everything placed in a Storage account, be it Tables, Queues or Blobs, and also geo-replicates one copy to another Data Center in the same region.

What this means is that when a Data Center goes down, like it happened last week with Windows Azure, it normally isn't the entire Data Center; some part keeps working, and as soon as the platform detects that part of the Data Center is down it promotes one of the remaining replicas to primary and everything works again. And if for some reason the entire Data Center does go down, there is still a replica in the other Data Center of the same region.

If we talk about SQL Azure, the same thing happens as with Storage, except that the Geo-Replication isn't there; if the entire Data Center goes down there's no geo-replica to fall back on, so we need to plan for it ourselves.

So, based on all this, we should really look at Redundancy and Disaster Recovery as a very important part of our architecture and system design, but we also need to take into account that they have costs, so we need to find the right approach for each customer, because there's no "one size fits all" solution here.

In upcoming posts I'll talk about some approaches to designing for Redundancy and Disaster Recovery in Windows Azure.

You can also leave a comment saying what you'd like to hear about, and I'll do my best to write about it.

Importance of Affinity Groups in Windows Azure

During the last week some friends asked me what Affinity Groups in Windows Azure really are and what their benefits are, since for some people they are nothing more than a way to logically group Compute and Storage.

In order to explain this we need to dig a little deeper into how Windows Azure Data Centers are built. Basically, Windows Azure Data Centers are built from "Containers" that are full of clusters and racks. Each of those Containers hosts specific services, for example Compute and Storage, SQL Azure, Service Bus, Access Control Service, and so on. Those Containers are spread across the Data Center, and each time we subscribe to or deploy a service the Fabric Controller (which chooses, based on our solution configuration, where the services should be deployed) can end up spreading our services across the Data Center.

This means we need to be very careful about where we create the different services, because if we place the Hosted Service in North Central US and the Storage Account in South Central US, that won't be good in terms of either latency or cost, since we get charged whenever traffic leaves the Data Center. But even if we choose the same Data Center, nothing guarantees the services will be close together; one can be placed at one end of the Data Center and the other at the opposite end. Choosing the same Data Center removes the extra cost and improves the latency, but it would be even better to go a little further and place them in the same Container, or even in the same Cluster. The answer to this is Affinity Groups.

Basically, an Affinity Group is a way to tell the Fabric Controller that those two elements, Compute and Storage, should always be kept together and close to one another. When the Fabric Controller searches for the best-suited Container in which to deploy those services, it will look for one where it can deploy both in the same Cluster, making them as close as possible, reducing latency and increasing performance.

So in summary, Affinity Groups give us:

  • Aggregation, since they group our Compute and Storage services and give the Fabric Controller the information it needs to keep them in the same Data Center, and even in the same Cluster.
  • Lower latency, because by telling the Fabric Controller that they should be kept together we get much better latency when accessing Storage from the Compute nodes, which makes a difference in a highly available environment.
  • Lower costs, since by using them we avoid ending up with one service in one Data Center and the other in a different one, whether because we picked the wrong locations or because we chose one of the "Anywhere" options for both.

Based on this, don't forget to use Affinity Groups right from the start, since it isn't possible to move Compute or Storage into an Affinity Group after they have been deployed.
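As a sketch of what "right from the start" can look like in code, the snippet below creates the Affinity Group first and then creates the storage account and hosted service inside it. It assumes the old (now retired) azure-servicemanagement-legacy Python package and management-certificate authentication; all names, locations and paths are placeholders, so treat it as an outline rather than a definitive recipe.

```python
# Minimal sketch: create the Affinity Group first, then create storage and compute
# inside it instead of picking locations directly. Assumes the retired
# azure-servicemanagement-legacy package; all names and paths are placeholders.
from azure.servicemanagement import ServiceManagementService

subscription_id = "<subscription-id>"
certificate_path = "<path-to-management-certificate.pem>"
sms = ServiceManagementService(subscription_id, certificate_path)

# 1. The Affinity Group pins everything to the same corner of one Data Center.
sms.create_affinity_group("my-affinity-group", "my-affinity-group", "North Central US")

# 2. Create the storage account and hosted service inside the group.
sms.create_storage_account("mystorageaccount",
                           "storage kept close to compute",
                           "mystorageaccount",
                           affinity_group="my-affinity-group")

sms.create_hosted_service("myhostedservice",
                          "myhostedservice",
                          affinity_group="my-affinity-group")
```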

Finally, in case you're now thinking this would be very interesting for other services as well: no other services can take advantage of this affinity, since none of them share the same Container.

Hope this helped and see you in the CLOUD.