Exporting Stack Overflow Data to SQL Azure

There are many ways to export data to SQL Azure. Plain T-SQL INSERT statements, bulk copy utility (bcp.exe) of SQL Server 2008 R2, SQL Azure Migration Wizard and SQL Server Integration Services (SSIS) are some of them. This weekend I wanted to upload the latest Stack Overflow data dump to the SQL Azure database for testing purposes. As you know, Stack Overflow data dump is available only in XML format and exporting XML into SQL Azure (or even SQL Server 2005/2008/R2) is a bit tricky because none of the out-of-the box utilities directly supports this. Remember, I am not talking about the XML data type columns which store the complete XML as a BLOB. Rather, the attributes of XML elements and child elements will contain the column values. For example, here is a simple XML fragment that represents an order and its product line items:

When importing this XML data into SQL Azure, the OriderID and CustomerID attribute values go into their respective columns of the Order table and ProductID goes into its own column in the OrderItems table. Most importantly, the relationship between the Order and Product elements (entities) should be maintained when importing this XML.

On a bit of investigation I found XML Bulk Load, a COM library from Microsoft doing exactly what I wanted but I am not sure how it would work with SQL Azure and didn’t spend further on that (for the moment). I thought I would do the XML data import using the ADO.NET class System.Data.SqlClient.SqlBulkCopy. Actually, this approach turned out be very simple and did the job quite well. With a blank database and tables to receive the Stack Overflow data, the below few lines of code imports the XML data into SQL Azure database:

var sbc = new SqlBulkCopy (@”Server=tcp:server_name.database.windows.net;Database=StackOverflow;User ID=user_id@server_name;Password=azure_password;Trusted_Connection=False;Encrypt=True;”)
    BatchSize = 100,
    DestinationTableName = “Users”,
    NotifyAfter = 1000

var ds = new DataSet();
sbc.SqlRowsCopied += new SqlRowsCopiedEventHandler((sender, e) => Console.WriteLine(e.RowsCopied + ” rows copied.”));


    ds.ReadXml (@”E:\Downloads\Stack Overflow Data Dump – Jun 10\Content\Users.xml”);
    foreach (DataColumn dc in ds.Tables[0].Columns)
catch (Exception xcp)



As you can see, it’s not a lot of code but it did the job.

One big challenge I had in the whole process was that some of the XML data dump files were too large, above 2GB and the ADO.NET data set declined to read such files to infer the XML schema. The result was ‘out of memory’ exception – obvious! My work-around was to split the large XML files into sizes that the data set can read and proceed with the same code above (I used the Windows version of the famous UNIX shell commands head and tail).

Architecture Vs. Design

“What is the difference between architecture and design?” Or how is one different from the other? From time to time I come across this question at my work and online forums – An all too often discussed topic in the software world!

This is not an uncommon question in today’s poorly defined and often overlapping roles and responsibilities that architects, senior developers and developers play today in many organizations. It becomes more confusing when a single person or everyone in a small team does a mix of architecture, design and coding without a defined boundary. The mere term architecture could mean many things depending on the context it is referenced in. But for this writing I mostly assume application architecture and design. Yes, there are other architectures types exist in this software world!

Application architecture is all about decomposing an application into modules and sub-modules, defining their attributes and responsibilities and the relationships among them. During this process, numerous parameters are considered and thoroughly analyzed and based on the merits and demerits of each, various constraints/trade-offs are made. Strict lines are drawn to arrive at an optimal architecture that solves the business problem and aligns well with the organization’s enterprise architecture or business requirements directly. Ideally, software architecture should be technology and platform neutral but often it is defined around a specific technology such as J2EE, .NET and open-source, which is not too bad, in my perspective. On the other end, it potentially locks down oneself into a specific stream where alternatives exist. Instead of going in depth in to the subject, here is the crisp of software architecture.

The term architecture itself doesn’t have a global definition but standards bodies such as IEEE and SEI-CMM have their own versions and reinforce them in their respective publications/open talks. However, here is my version closely derived from one of the standards body’s definition that I feel closely defines software architecture: a software architecture is the definition and description of the organization of a system in terms of its sub-systems, modules, their interdependencies, relationships, interaction with external systems and how they communicate with each other cohesively in order to give the intended functionality and meet the definition of the system in question.

In the above representative diagram (a loose form of architecture diagram), a CRM system has been broken down into, for simplicity, three modules. One of the modules, User Interface is further broken into multiple modules representing various UI options the CRM system is expected to support. Further, Rich Client is pieced into by client types, and so on. Don’t get carried away by this break-and-divide rule, too much of de-modularization can result in too many small pieces making the whole process complicated.

Like its definition, software architecture does not have a single representation to represent itself. In order to communicate its purpose and meaning to various stakeholders in an unambiguous way, different types of representations or views have been developed. For example, deployment architecture doesn’t make any sense for an end user, right? Multiple architectural representations exist each targeting a different type of audience. For example, Rational’s 4+1 View has:

  1. Use case view (business users)
  2. Logical view (architects)
  3. Development view (developers)
  4. Process View (performance tuning engineers/testers)
  5. Physical View (IT engineers)

Microsoft has a similar one:

  1. Conceptual view (business users)
  2. Logical view (architects)
  3. Physical view (Developers)
  4. Implementation view (IT engineers)

To an extent, architecture is all about making the right tradeoffs and decisions at the right time based on the priority and importance of various system features because they could potentially influence other down-level sub-systems and their behavior.

Design is the realization process of application architecture, in terms of a specific technology, platform and set of tools. Design process breaks down sub-modules to lower levels and provide technology-specific design definitions of each broken-down module. In case of dev view/physical view, it involves defining abstractions, contracts or interfaces, inter-class or module communication mechanisms, classes using various patterns, all detail enough for a programmer to actually implement the design by writing code. Design specs are the input for programmers. In some cases, an intermediate state is brought in between design and development, intended to develop pseudo code implementation of complex logic/algorithms in the design.

In today’s RAD ecosystem, the difference between design and architecture is often blurred thus making it difficult to understand the difference between them. Of course, for small applications, one may not see the real benefit of developing architecture and design due to unnecessary overhead resulting it redundant information and duplication of effort.


I had been thinking of writing on this subject for quite some time now but Thanks to a co-architect who forwarded me a write up I wrote few years ago on this subject and cut my laziness! :-)