As some of you may know, I assist Susan with administering & maintaining the blogs here at msmvps.com. For various reasons, over the past few months I have become familiar with several approaches to blog migrations, most notably the BlogML project. As a result, I’ve sort of become the neighborhood go-to guy for moving blogs, including assisting Steve Riley with his move from msinfluentials.com to wordpress.com.
A couple of weeks ago I was presented with an intriguing request / challenge. My friend Wayne Small over at sbsfaq.com had been running his site and his blog on SharePoint for several years, but was now in the process of moving everything over to a single WordPress site, and wanted to migrate his content from his SharePoint blog. The challenge wasn’t so much getting the content into WordPress, since there are several importers available, including a BlogML importer. The problem was getting the content out of SharePoint 3.0. BlogML exporters for most platforms are web based, allowing you to initiate the export from within the blog platform and download the resulting export file. While I have coding experience, I don’t have any experience building add-ins for SharePoint and wasn’t about to open that can of worms, so I decided on a different approach.
For those of you who don’t know, there is rather impressive integration between SharePoint 3.0 & Access 2007, so I opted to use Access to extract the information out of Wayne’s old SharePoint blog. This approach actually gave me more flexibility in meeting the various requirements:
- Where the SharePoint blog used Categories, Wayne wanted to use Tags in WordPress.
- We wanted to migrate all content: posts, comments, & embedded content (images in posts, etc.).
Moving categories to tags seemed simple enough; however, I discovered that the current 2.0 iteration of BlogML doesn’t support tags (which admittedly surprised me). As a result, Aaron Lerch’s BlogML import class for WordPress did not support tags either. Scouring the web, I found that Wayne John had updated Aaron’s BlogML import class to allow for importing tags from Blog Engine exports. I did a quick & dirty install of Blog Engine on my sandbox server to examine its default BlogML export, so that I could match how it marks up post tags in its XML.
One of the major behind-the-scenes differences between SharePoint blogs & WordPress is how embedded content is stored. When you are composing posts using an offline editor such as Windows Live Writer, inserted images are stored differently in each platform when the post is uploaded & published. WordPress stores the images in the file system on the site, whereas SharePoint stores the images in the database as attachments to the post record. Luckily for us, Access 2007 can handle attachments on SharePoint lists natively. In early test runs, I found that there was some duplication in image names between various posts in Wayne’s blog (especially capture.png). As a result, I decided to save the attachments for each SharePoint post to a different folder to avoid name collision issues.
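To illustrate the folder-per-post idea, here is a minimal Python sketch. It is not the code I used (the real export ran inside Access); the function name `save_attachment` and its parameters are hypothetical, and it assumes the attachment bytes have already been read out of the source.

```python
from pathlib import Path


def save_attachment(export_root: Path, post_id: int, filename: str, data: bytes) -> Path:
    """Write one attachment to <export_root>/<post_id>/<filename>.

    Hypothetical helper: keying the subfolder on the post ID means two
    posts can both own a 'capture.png' without overwriting each other.
    """
    post_dir = export_root / str(post_id)
    post_dir.mkdir(parents=True, exist_ok=True)  # one subfolder per post
    target = post_dir / filename
    target.write_bytes(data)
    return target
```

Calling this for posts 12 and 15 with the same file name would leave two separate files, one under `12` and one under `15`.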
So – how does the final solution look?
- I created a new Access 2007 database, and used the External Data functionality to link to the Posts, Comments, & Categories lists on the SharePoint blog site.
- I created a simple form that allowed me to enter the path & filename I wanted for the resulting BlogML export file, as well as a path to where I wanted the embedded images from the SharePoint blog stored. Obviously, the form contained a Start button as well . . .
- When the start button was clicked, the code behind the button did all the heavy lifting:
  - It creates the BlogML export file using the path / filename listed on the form, and writes the various header information.
  - We open a new recordset containing the Posts table. For each post:
    - We call a helper function to format the post published date how the XML file wants it.
    - We open a second recordset that contains the attachments for the current post we are processing.
    - If the current post has attachments, then:
      - We save each attachment to the local file system, using the path specified on the main form. To prevent filename collision, we create a new subfolder for each post, using SharePoint’s numeric post ID as the folder name. (So if we listed C:\export as the folder we wanted to save embedded images to on the main form, we would end up with something like C:\export\<post_id>\capture.png)
      - For each attachment, we populate the <attachment /> node of the BlogML output.
      - We parse the body of the current post and replace every path we find pointing to the old attachment path with the new attachment path. E.g., links to embedded content in the SharePoint blog are referenced via “/Lists/Posts/Attachments/<post_ID>/<filename>”, whereas once we import content into WordPress, the path will be something like “/wp-content/uploads/<post_ID>/<filename>”. By updating the relative paths to our embedded content during the export, we can better ensure that our embedded content will transfer seamlessly. This isn’t the case with most BlogML exports, because they are simply exporting the content, not updating embedded links to match the target platform.
      - We cycle through and repeat for each attachment for the current post.
    - We write the post content to the output file.
    - For each category listed in the Posts recordset, we write a <Tag /> to the output file (to match Wayne’s requirements).
    - We open a third recordset which contains all of the comments for the current post. If the current post has comments, then:
      - We call a helper function to format the comment date how the XML file wants it.
      - We write the current comment to the output file.
      - We cycle through & repeat for each remaining comment for the current post.
    - We cycle through and repeat for each post.
  - We write the closing tags to the output file and complete the process.
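Two of the steps above, the date helper and the path rewrite, are easy to show in isolation. The following is a rough Python sketch rather than the Access code I actually ran. The function names are hypothetical, the ISO 8601 style timestamp is my assumption about what the importer expects, and matching the old path case-insensitively is also an assumption.

```python
import re
from datetime import datetime

# Path prefixes taken from the description above.
OLD_PREFIX = "/Lists/Posts/Attachments"
NEW_PREFIX = "/wp-content/uploads"


def format_blogml_date(value: datetime) -> str:
    """Format a date as an ISO 8601 style timestamp (assumed target format)."""
    return value.strftime("%Y-%m-%dT%H:%M:%S")


def rewrite_attachment_paths(body: str, post_id: int) -> str:
    """Point this post's embedded-content links at the new uploads folder.

    Only links under the current post's ID are touched; the match is
    case-insensitive, which is an assumption about how the links were written.
    """
    pattern = re.compile(re.escape(f"{OLD_PREFIX}/{post_id}/"), re.IGNORECASE)
    return pattern.sub(f"{NEW_PREFIX}/{post_id}/", body)
```

For example, `rewrite_attachment_paths('<img src="/Lists/Posts/Attachments/12/capture.png">', 12)` yields `<img src="/wp-content/uploads/12/capture.png">`, while a link under a different post ID is left alone.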
So in the end, we have one BlogML.xml file with all of the blog content (posts, comments, categories, tags, etc.), and a folder that contains all of the exported embedded images from the SharePoint blog posts (in separate sub-folders by post ID).
At this point, Wayne simply had to copy the embedded images subfolders to his /wp-content/uploads folder for his site, then run the BlogML import process to import the content generated. Voila!
Post on SharePoint blog: SBS 2008 R2 I want it now!
Migrated post on WordPress blog: SBS 2008 R2 I want it now!
Notice the embedded image is displayed as expected, and if you look at the image properties on the WordPress post, you’ll see it is referencing the relative path to the image on the WordPress site. In addition, the original post date & author info have been maintained, as have all comments, with their original comment dates & author info.
One caveat that is important to share: when I was first starting to test the import process, the BlogML import in WordPress was failing with an error that invalid characters were encountered on line X at column Y. However, opening the BlogML export file in Notepad, WordPad, or IE, I couldn’t see anything that appeared to be an invalid character. Finally, opening the BlogML output file in Visual Studio 2008 allowed me to see the invalid characters, and I was able to remove them with a simple find/replace in Visual Studio, after which the import process in WordPress completed successfully. I’m not sure what caused these characters to be present in the first place. Perhaps it was the authoring tool Wayne used, or perhaps something in the export process, pulling from SharePoint to Access to text.
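I never pinned down exactly which characters were the culprits, so the following Python sketch is only a guess at how one might hunt for them. It assumes the problem was characters that the XML 1.0 specification does not allow (mostly stray control characters); the function names are hypothetical.

```python
import re

# Anything outside the XML 1.0 "Char" ranges: tab, LF, CR,
# U+0020-U+D7FF, U+E000-U+FFFD, and U+10000-U+10FFFF.
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, code point) for each disallowed character (1-based)."""
    hits = []
    for match in INVALID_XML_CHARS.finditer(text):
        pos = match.start()
        line = text.count("\n", 0, pos) + 1
        column = pos - (text.rfind("\n", 0, pos) + 1) + 1
        hits.append((line, column, f"U+{ord(match.group()):04X}"))
    return hits


def strip_invalid_xml_chars(text: str) -> str:
    """Remove every character that XML 1.0 does not permit."""
    return INVALID_XML_CHARS.sub("", text)
```

Running `find_invalid_xml_chars` over the export text first gives line and column numbers to compare against the importer’s error message before anything is removed.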
But anyway, that was one of my recent side-projects. Let me know what you think – the Access database I used was rather rough around the edges and relied on a number of assumptions I was able to make which resulted in certain options being hard-coded. If there is enough interest, I’ll polish it up a bit and post it for others to use to export their SharePoint blogs to BlogML.