Extracting data from Word forms with XSL Transforms (XSLT)

“What’s the point,” pondered Alice, “Of getting other people to stuff things in a box, if one cannot ever get them out?”

Ok, she never did say that, but it’s the sort of thing Alice would wonder.

Particularly if she noticed how often modern businesses send around Word forms with input fields designed to be filled out by team members, only to then be manually copied into spreadsheets, databases, or other documents.

I’d put this as the second most irritating waste of document functionality.

And it doesn’t have to be this way.

What is a Word form, anyway?

FIrst, let’s look at what you get with a Word form. There really isn’t anything quite as specific a beast as a Word form. It’s just a Word document. With form fields. Form fields are places into which users can type text, check boxes, select from drop down lists, etc.

Once form fields have been put into a document, the original document author can ‘restrict’ the document such that only editing the form fields is allowed. This is usually done with a password, to make it less likely that others will edit the document beyond the form fields.

The presence of a password should not be taken to indicate that this is a security measure.

Removing the restriction can be done by guessing the password, or accessing the settings.xml inside the docx file, and changing the value of “w:enforcement” from “1” to “0”. Other methods include saving to RTF, then editing the file in a text editor before saving it as docx again.

Restricting the document is done to make it less likely that blithe nonces will return your document to you with changes that are outside of the fields you’ve provided to them, or with fields removed. This is important, because you can’t as easily extract data from a document if you don’t know where it is.

Show me

Annotation 2020-05-16 165908

Here’s what a form looks like when it’s restricted for editing, and has a number of form field elements provided – I’ve given a text field for a person’s name, a drop-down list for their zodiac sign, and a check box for education level. This is the sort of thing you might expect a form to be really useful for collecting.

Now that you’ve sent this out to a hundred recipients, though, you want to extract the data from each form.

Secret knowledge: docx files are ZIP files containing XML files

First we’ve got to get the part of the document containing the data out. Knowing, as we do, that a docx file is just a ZIP file full of XML files, we could unzip it and go searching for the data. I’ve already done that – the data is in the file called “word/document.xml”. You could just rename the docx file to a zip file, open it in Explorer, navigate into the ‘word’ folder, and then drag the document.xml file out for handling, but that’s cumbersome, and we want an eventual automated solution.

Turning to PowerShell

Yeah, you could write this in a batch file using whatever ZIP program you’ve downloaded, it wouldn’t be that difficult, but I’m thinking about PowerShell a lot these days for my automation. Here’s code that will take a docx file and extract just the word/document.xml component into an output file whose name is provided.

# Load up the types required to handle a zip file.
Add-Type -AssemblyName System.IO.Compression.Filesystem

Function Get-DocXDocFile ($infilename, $outfilename){
$infileloc = [System.IO.Path]::Combine($pwd,$infilename)
$zip = [System.IO.Compression.ZipFile]::OpenRead($infileloc)
$zip.Entries | where { $_.FullName -eq “word/document.xml” } | foreach {
$outfileloc = [System.IO.Path]::Combine($pwd,$outfilename)
[System.IO.Compression.ZipFileExtensions]::ExtractToFile($_, “$outfileloc”,$true)
}
}

 

By now, if you’re like me, you’ve opened up that XML file and looked into it, and decided you don’t care that much to read its entrails.

That’s OK, I did it for you.

The new-style fields are all in “w:sdt” elements, and can be found by the “w:tag” name under the “w:sdtPr” element.

Old-style fields are all in “w:fldChar” elements, and can be found by the “w:name” value under the “w:ffData” element.

In XPath, a way of describing how you find a specific element / attribute in an XML file, that’s expressed as follows:

//w:sdt/w:sdtPr/w:tag[@w:val=’Txt1Tag’]/../..

//w:fldChar/w:ffData/w:name[@w:val=’Text1′]/../..

This does assume that you gave each of your fields names or tags. But it would be madness to expect data out if you aren’t naming your fields.

If you’re handy with .NET programming, you’re probably half way done writing the code to parse this using XmlDocument.

If only there was an easy – well, easier – way to do this!

If you’re not handy with .NET programming, you might need something a little (but sadly, not a lot) easier.

Remember those XPath elements? Wouldn’t it be really cool if we could embed those into a document, and then have that document automatically expand them into their contents, so we could do that for every form file we’ve got?

Well, we can.

It’s called XSLT

Short for Extensible Stylesheet Language Transformation (which is definitely long enough to need something to be short for it), XSLT, which really has no good pronunciation because I’m never going to say something that sounds like “ex-slut” at work, XSLT is a way to turn one XML-formatted document into some kind of output.

Let’s say we’re working with the document I outlined above (and which I will forget to attach to this blog post until someone points it out). We’ve already extracted document.xml, and with the right XSL file, and a suitable XSLT command (such as the Microsoft msxml tool, or whatever works in your native environment), we can do something like this:

Annotation 2020-05-18 172303

Maybe instead of text, you prefer something more like CSV:

Annotation 2020-05-18 191107

I will probably forget to attach the XSL stylesheets that I used for these two transformations to this blog post.

Maybe next time we can see about building this into a tool…

Here’s the files I forgot to add: ExtractData

Leave a Reply

Your email address will not be published. Required fields are marked *