Exporting Stack Overflow Data to SQL Azure

There are many ways to export data to SQL Azure: plain T-SQL INSERT statements, the bulk copy utility (bcp.exe) of SQL Server 2008 R2, the SQL Azure Migration Wizard and SQL Server Integration Services (SSIS), to name a few. This weekend I wanted to upload the latest Stack Overflow data dump to a SQL Azure database for testing purposes. As you know, the Stack Overflow data dump is available only in XML format, and importing XML into SQL Azure (or even SQL Server 2005/2008/R2) is a bit tricky because none of the out-of-the-box utilities directly supports it. Remember, I am not talking about XML data type columns, which store the complete XML as a BLOB. Rather, the attributes of the XML elements and their child elements contain the column values. For example, here is a simple XML fragment (with illustrative names) that represents an order and its product line items:
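
<Order OrderID="1001" CustomerID="C042">
  <Product ProductID="2001" />
  <Product ProductID="2002" />
</Order>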

When importing this XML data into SQL Azure, the OrderID and CustomerID attribute values go into their respective columns of the Order table, and ProductID goes into its own column in the OrderItems table. Most importantly, the relationship between the Order and Product elements (entities) must be maintained when importing this XML.

After a bit of investigation I found XML Bulk Load, a COM library from Microsoft that does exactly what I wanted, but I was not sure how it would work with SQL Azure and didn't spend further time on it (for the moment). Instead, I thought I would do the XML data import using the ADO.NET class System.Data.SqlClient.SqlBulkCopy. This approach turned out to be very simple and did the job quite well. With a blank database and tables created to receive the Stack Overflow data, the few lines of code below import the XML data into the SQL Azure database:

using System;
using System.Data;
using System.Data.SqlClient;

var sbc = new SqlBulkCopy(@"Server=tcp:server_name.database.windows.net;Database=StackOverflow;User ID=user_id@server_name;Password=azure_password;Trusted_Connection=False;Encrypt=True;")
{
    BatchSize = 100,
    DestinationTableName = "Users",
    NotifyAfter = 1000
};

var ds = new DataSet();
sbc.SqlRowsCopied += new SqlRowsCopiedEventHandler((sender, e) => Console.WriteLine(e.RowsCopied + " rows copied."));

try
{

    // Infer the schema and load the rows from the XML dump file
    ds.ReadXml(@"E:\Downloads\Stack Overflow Data Dump – Jun 10\Content\Users.xml");
    // Map each inferred column to the destination column of the same name;
    // without explicit mappings, SqlBulkCopy maps columns by ordinal position
    foreach (DataColumn dc in ds.Tables[0].Columns)
    {
        sbc.ColumnMappings.Add(dc.Caption, dc.Caption);
    }
    sbc.WriteToServer(ds.Tables[0]);
}
catch (Exception xcp)
{

    Console.WriteLine(xcp.Message);
}
finally
{
    sbc.Close();
    ds.Dispose();
}
Console.WriteLine("Done…!");

As you can see, it’s not a lot of code but it did the job.

One big challenge in the whole process was that some of the XML data dump files were too large (over 2 GB), and the ADO.NET DataSet refused to read such files to infer the XML schema; the result was an 'out of memory' exception, which is hardly surprising. My workaround was to split the large XML files into sizes the DataSet could read and run the same code above against each piece (I used the Windows ports of the famous UNIX shell commands head and tail).
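
An alternative that would avoid the memory problem altogether is to stream the XML with XmlReader and flush rows to SqlBulkCopy in small batches, so the whole file never has to be loaded at once. This is not what I ran, just a minimal sketch of the idea, assuming the dump's <row ... /> element format and destination columns that match the attribute names (the column list here is illustrative, not complete):

using System;
using System.Data;
using System.Data.SqlClient;
using System.Xml;

var connectionString = @"Server=tcp:server_name.database.windows.net;Database=StackOverflow;User ID=user_id@server_name;Password=azure_password;Trusted_Connection=False;Encrypt=True;";

using (var sbc = new SqlBulkCopy(connectionString) { DestinationTableName = "Users", BatchSize = 100 })
using (var reader = XmlReader.Create(@"E:\Downloads\Stack Overflow Data Dump – Jun 10\Content\Users.xml"))
{
    // Buffer a limited number of rows at a time instead of the whole file
    var buffer = new DataTable();
    buffer.Columns.Add("Id");          // illustrative columns; add the rest as needed
    buffer.Columns.Add("Reputation");
    buffer.Columns.Add("DisplayName");

    // Map by name; without explicit mappings SqlBulkCopy maps columns by ordinal
    foreach (DataColumn dc in buffer.Columns)
    {
        sbc.ColumnMappings.Add(dc.ColumnName, dc.ColumnName);
    }

    while (reader.ReadToFollowing("row"))
    {
        var row = buffer.NewRow();
        foreach (DataColumn dc in buffer.Columns)
        {
            row[dc.ColumnName] = (object)reader.GetAttribute(dc.ColumnName) ?? DBNull.Value;
        }
        buffer.Rows.Add(row);

        if (buffer.Rows.Count == 10000)
        {
            sbc.WriteToServer(buffer);  // flush a batch and keep memory flat
            buffer.Clear();
        }
    }

    if (buffer.Rows.Count > 0)
    {
        sbc.WriteToServer(buffer);      // flush the final partial batch
    }
}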