"What's the point," pondered Alice, "of getting other people to stuff things in a box, if one cannot ever get them out?"
OK, she never did say that, but it's the sort of thing Alice would wonder.
Particularly if she noticed how often modern businesses send around Word forms with input fields designed to be filled out by team members, only for the answers to then be manually copied into spreadsheets, databases, or other documents.
I'd put this as the second most irritating waste of document functionality.
And it doesn't have to be this way.
First, let's look at what you get with a Word form. There really isn't anything quite as specific a beast as a Word form. It's just a Word document. With form fields. Form fields are places into which users can type text, check boxes, select from drop-down lists, etc.
Once form fields have been put into a document, the original document author can "restrict" the document such that only editing the form fields is allowed. This is usually done with a password, to make it less likely that others will edit the document beyond the form fields.
The presence of a password should not be taken to indicate that this is a security measure.
Removing the restriction can be done by guessing the password, or by opening settings.xml inside the docx file and changing the value of "w:enforcement" from "1" to "0". Other methods include saving to RTF, then editing the file in a text editor before saving it as docx again.
Restricting the document is done to make it less likely that blithe nonces will return your document to you with changes that are outside of the fields you've provided to them, or with fields removed. This is important, because you can't as easily extract data from a document if you don't know where it is.
Here's what a form looks like when it's restricted for editing, and has a number of form field elements provided: I've given a text field for a person's name, a drop-down list for their zodiac sign, and a check box for education level. This is the sort of thing you might expect a form to be really useful for collecting.
Now that you've sent this out to a hundred recipients, though, you want to extract the data from each form.
First we've got to get the part of the document containing the data out. Knowing, as we do, that a docx file is just a ZIP file full of XML files, we could unzip it and go searching for the data. I've already done that: the data is in the file called "word/document.xml". You could just rename the docx file to a zip file, open it in Explorer, navigate into the "word" folder, and then drag the document.xml file out for handling, but that's cumbersome, and we want an eventual automated solution.
Yeah, you could write this in a batch file using whatever ZIP program you've downloaded; it wouldn't be that difficult, but I'm thinking about PowerShell a lot these days for my automation. Here's code that will take a docx file and extract just the word/document.xml component into an output file whose name is provided.
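To see how little there is to that flag, here is a minimal Python sketch that only reads it. It assumes the setting lives in a "w:documentProtection" element directly under the root of settings.xml; the function name and the trimmed-down sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, which the "w:" prefix stands for.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def protection_enforced(settings_xml):
    """Return True if settings.xml has w:documentProtection with enforcement on."""
    root = ET.fromstring(settings_xml)
    prot = root.find(f"{{{W_NS}}}documentProtection")
    if prot is None:
        return False  # no protection element at all
    value = prot.get(f"{{{W_NS}}}enforcement", "0")
    return value in ("1", "true", "on")


# Hypothetical, trimmed-down settings.xml for illustration only.
sample = (
    f'<w:settings xmlns:w="{W_NS}">'
    '<w:documentProtection w:edit="forms" w:enforcement="1"/>'
    "</w:settings>"
)
print(protection_enforced(sample))
```

A single attribute in a plain XML file is all that stands between a recipient and the rest of the document, which is why the password is a nudge rather than a lock.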
# Load up the types required to handle a zip file.
Add-Type -AssemblyName System.IO.Compression.FileSystem

Function Get-DocXDocFile ($infilename, $outfilename){
    # Resolve both names against the current directory.
    $infileloc = [System.IO.Path]::Combine($pwd,$infilename)
    $outfileloc = [System.IO.Path]::Combine($pwd,$outfilename)
    $zip = [System.IO.Compression.ZipFile]::OpenRead($infileloc)
    try {
        # Pick out the one entry we care about and overwrite any existing output file.
        $zip.Entries | where { $_.FullName -eq "word/document.xml" } | foreach {
            [System.IO.Compression.ZipFileExtensions]::ExtractToFile($_, $outfileloc, $true)
        }
    } finally {
        # Release the handle on the docx file.
        $zip.Dispose()
    }
}
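If PowerShell isn't your native environment, roughly the same step can be sketched in Python using only the standard library. The function name and the file names in the usage note are made up for illustration.

```python
import zipfile
from pathlib import Path


def extract_document_xml(docx_path, out_path):
    """Copy word/document.xml out of a .docx (a ZIP archive) into out_path.

    Returns the number of bytes written. Raises KeyError if the archive
    has no word/document.xml entry.
    """
    with zipfile.ZipFile(docx_path) as archive:
        data = archive.read("word/document.xml")
    Path(out_path).write_bytes(data)
    return len(data)
```

Calling extract_document_xml("form.docx", "form.xml") would then leave the form's XML in form.xml, overwriting any file already there, much as the PowerShell version does.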
By now, if you're like me, you've opened up that XML file and looked into it, and decided you don't care that much to read its entrails.
That's OK, I did it for you.
The new-style fields are all in "w:sdt" elements, and can be found by the "w:tag" name under the "w:sdtPr" element.
Old-style fields are all in "w:fldChar" elements, and can be found by the "w:name" value under the "w:ffData" element.
In XPath, a way of describing how you find a specific element / attribute in an XML file, that's expressed as follows:
//w:sdt/w:sdtPr/w:tag[@w:val='Txt1Tag']/../..
//w:fldChar/w:ffData/w:name[@w:val='Text1']/../..
This does assume that you gave each of your fields names or tags. But it would be madness to expect data out if you aren't naming your fields.
If you're handy with .NET programming, you're probably halfway done writing the code to parse this using XmlDocument.
If you're not handy with .NET programming, you might need something a little (but sadly, not a lot) easier.
Remember those XPath expressions? Wouldn't it be really cool if we could embed those into a document, and then have that document automatically expand them into their contents, so we could do that for every form file we've got?
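To try the new-style expression without writing any .NET, here is a small Python sketch using the standard library's ElementTree, which understands a subset of XPath. It assumes the field's visible text sits in "w:t" elements under "w:sdtContent"; the function name and the sample fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

# The "w:" prefix has to be mapped to the WordprocessingML namespace.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def content_control_text(root, tag_name):
    """Return the text of the first w:sdt whose w:tag has this w:val, or None."""
    # Same shape as the XPath above, with a leading "." because ElementTree
    # only accepts paths relative to the element being searched.
    path = f".//w:sdt/w:sdtPr/w:tag[@w:val='{tag_name}']/../.."
    for sdt in root.findall(path, NS):
        content = sdt.find("w:sdtContent", NS)
        if content is None:
            return ""
        return "".join(t.text or "" for t in content.iter(f"{{{W_NS}}}t"))
    return None


# Hypothetical fragment of document.xml; real files carry far more markup.
sample = f"""
<w:document xmlns:w="{W_NS}"><w:body><w:p>
  <w:sdt>
    <w:sdtPr><w:tag w:val="Txt1Tag"/></w:sdtPr>
    <w:sdtContent><w:r><w:t>Alice</w:t></w:r></w:sdtContent>
  </w:sdt>
</w:p></w:body></w:document>
""".strip()

root = ET.fromstring(sample)
print(content_control_text(root, "Txt1Tag"))
```

The trailing "/../.." is what climbs back up from the matching "w:tag" to the "w:sdt" that owns it, so the content can be read from the same field.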
Well, we can.
XSLT is short for Extensible Stylesheet Language Transformations (which is definitely long enough to need something to be short for it). It really has no good pronunciation, because I'm never going to say something that sounds like "ex-slut" at work. XSLT is a way to turn one XML-formatted document into some kind of output.
Let's say we're working with the document I outlined above (and which I will forget to attach to this blog post until someone points it out). We've already extracted document.xml, and with the right XSL file, and a suitable XSLT command (such as Microsoft's msxsl command-line tool, or whatever works in your native environment), we can do something like this:
Maybe instead of text, you prefer something more like CSV:
I will probably forget to attach the XSL stylesheets that I used for these two transformations to this blog post.
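If you have no XSLT processor to hand, the CSV output can be approximated in plain Python. To be clear, this swaps the XSL stylesheet for ordinary code rather than reproducing it, and it only handles new-style fields. The "Txt1Tag" name comes from the XPath earlier; "SignTag", the function names, and the sample forms are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}
W = f"{{{W_NS}}}"


def tagged_fields(document_xml):
    """Map each content control's w:tag value to the text it contains."""
    root = ET.fromstring(document_xml)
    fields = {}
    for sdt in root.iter(W + "sdt"):
        tag = sdt.find("w:sdtPr/w:tag", NS)
        content = sdt.find("w:sdtContent", NS)
        if tag is None or content is None:
            continue  # untagged controls cannot be matched to a column
        fields[tag.get(W + "val")] = "".join(t.text or "" for t in content.iter(W + "t"))
    return fields


def forms_to_csv(documents, columns):
    """Render one CSV row per document.xml string, in the given column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for xml_text in documents:
        writer.writerow(tagged_fields(xml_text))  # missing fields become empty cells
    return buffer.getvalue()


def sample_form(name, sign):
    """Build a hypothetical, heavily trimmed document.xml for the demo."""
    def control(tag, text):
        return (f'<w:sdt><w:sdtPr><w:tag w:val="{tag}"/></w:sdtPr>'
                f"<w:sdtContent><w:r><w:t>{text}</w:t></w:r></w:sdtContent></w:sdt>")
    return (f'<w:document xmlns:w="{W_NS}"><w:body>'
            f'{control("Txt1Tag", name)}{control("SignTag", sign)}'
            "</w:body></w:document>")


print(forms_to_csv([sample_form("Alice", "Leo"), sample_form("Bob", "Virgo")],
                   ["Txt1Tag", "SignTag"]))
```

Feeding it the document.xml from each returned form gives one row per respondent, which is the whole point of sending the form out in the first place.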
Maybe next time we can see about building this into a tool...
Here are the files I forgot to add: ExtractData