Let’s take a quick break from the text encoding idea so I can talk about a new tool that I recently got into.

One of the things that every product needs, regardless of how simple it is to use, is good documentation. It’s not fun, it takes time, and it isn’t technically intriguing. Regardless, it has to be done. The part that my team members and I have struggled with is finding a tool that makes it easy. We looked at a few commercial applications such as RoboHelp, but they always left me with the impression that we were rabbit hunting with a Barrett M107 .50 rifle. Our requirements were pretty simple:

  • Easy to use
  • Text based – This makes differentials and merging easy
  • Reasonably priced
  • Able to produce different types of documents (HTML, PDF, etc)

We finally settled on what we think is the best solution (not to mention, it’s open source and free): DocBook. It’s based on XML and is a published OASIS standard. XML is extremely flexible, and DocBook output is generated by XSL transformations, so we can easily customize the output to meet our requirements. We started using the e-novative DocBook Environment, which gives you a simple command line environment for compiling your DocBook books. It is released under the GPL, so you can customize it to your needs.

A simple book looks something like this:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE book
  PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN" "file:/c:/docbook/dtd/docbookx.dtd" [
    <!ENTITY % global.entities SYSTEM "file:/c:/docbook/include/global.xml">
    <!ENTITY % entities SYSTEM "entities.xml">
]>
<book lang="en">
    <bookinfo>
        <title>My First Book</title>
        <pubdate>May, 2008</pubdate>
        <copyright>
            <year>2008</year>
            <holder>Kevin Jones</holder>
        </copyright>
    </bookinfo>
    <chapter>
        <title>My First Chapter</title>
        <section>
            <title>How to crash a .NET Application</title>
            <para>Call System.AppDomain.Unload(System.AppDomain.CurrentDomain)</para>
        </section>
    </chapter>
</book>

Pretty simple, right? Each book is broken down into chapters, which are broken down into sections, then paragraphs. DocBook takes care of some of the dirty work for you, such as maintaining the table of contents. It offers a lot of other standard features as well, such as embedding graphics and cross-referencing other places in the document.

Since DocBook is capable of understanding external entities, I can place chapters, sections, any part of the document that I want into another file and create an <!ENTITY … > for it.

Compiling it is pretty easy. From the e-novative environment, just use docbook_html.bat or docbook_pdf.bat to create your generated output, something like this:

>C:\docbook\bat\docbook_html.bat MyFirstBook

MyFirstBook is the name of the project in the projects folder, which is created for you automatically by the docbook_create.bat script. Using the compiler, the out-of-the-box HTML template looks like this:

[Screenshot: default HTML output]

There you have it, a simple documentation tool. Not very pretty at the moment, but of course it’s easy enough to theme it to your company or product by changing the XSL.

Text Encoding (Part-1)

I was recently on the ASP.NET Forums and a member was asking, “How can I figure out the encoding of text?” and that got me thinking. There should be a reasonable way to do this, right? It’s a useful thing to know. First, we need a little background on how text is encoded into bytes.

Long ago, back when 64K of memory was a big deal, characters took up a single byte. A byte ranges from 0 to 255, which allows us to support a total of 256 characters. Seems like plenty, no? English has 26 letters, 52 counting both cases, 62 with digits, 94 with punctuation, and a few extra for spaces, line feeds, carriage returns, and tabs. So about 100, give or take a few. So what’s the problem?

Well, this worked great and all, but other languages use different characters. The Russian alphabet, written in Cyrillic, has 33 letters by itself. This is where encodings were introduced. In order to support multiple character sets, what each byte meant was determined by its encoding. The same byte value could stand for different characters, so you could only read the text correctly if you knew which encoding was used.
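To make that concrete, here is a small sketch that reads the same byte under two different Windows code pages. The names CodePageDemo and DecodeByte are made up for this example, and it assumes modern .NET (Core 3.0 or later), where the legacy code pages have to be registered before use.

```csharp
using System;
using System.Text;

public static class CodePageDemo
{
    // Decodes a single byte using the given Windows code page.
    public static string DecodeByte(byte value, int codePage)
    {
        // Assumption: modern .NET, where legacy code pages must be registered
        // first. On .NET Framework they are built in; delete this line there.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(codePage).GetString(new[] { value });
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        // The same byte, 0xE9, read as Western European and as Cyrillic text.
        Console.WriteLine(DecodeByte(0xE9, 1252)); // é
        Console.WriteLine(DecodeByte(0xE9, 1251)); // й
    }
}
```

Nothing in the byte itself says which reading is the right one, which is exactly the problem the rest of this post is about.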

In today’s world, where the average calculator has more memory than PCs did long ago, we now also use 2-byte encodings. That means that we can support 256 to the second power characters, or 65,536. That is enough to support all languages in a single encoding, even though it takes up double the space. Problem solved, right? Not exactly.

While in this day and age we support double-byte encodings, there are still other factors involved, such as endianness (the order of the bytes: big endian stores the most significant byte first, little endian stores it last). Even then, there is still a lot of legacy single-byte data to support.
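Byte order is easier to see than to describe. The sketch below dumps the bytes of the letter A in three of the built-in encodings; EndianDemo and ToHex are names invented for this example.

```csharp
using System;
using System.Text;

public static class EndianDemo
{
    // Returns the bytes of the text in the given encoding as hex, e.g. "41-00".
    public static string ToHex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        Console.WriteLine(ToHex("A", Encoding.ASCII));            // 41
        Console.WriteLine(ToHex("A", Encoding.Unicode));          // 41-00 (UTF-16, little endian)
        Console.WriteLine(ToHex("A", Encoding.BigEndianUnicode)); // 00-41 (UTF-16, big endian)
    }
}
```

In .NET, Encoding.Unicode is the little-endian flavor of UTF-16 and Encoding.BigEndianUnicode is the big-endian one.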

Say I give you a big binary chunk of data, and I tell you to convert it to text. How do you know which encoding is used? How do you even know which language it is in? I could be giving you a chunk of data using IBM-Latin. So how do we figure this out? Some smarts and process of elimination. Let’s start with things we know.

The Unicode (UTF) encodings can carry what’s called a Byte Order Mark, or BOM for short. This is a small amount of binary data prepended to the rest of the data that identifies which encoding it is. In the .NET world, this is called the preamble. Since the BOM is defined by the Unicode standard, it is always the same for a given encoding regardless of whether you are using .NET, Python, Ruby on Rails, etc. We can look at our data and see if the BOM can tell us.

To achieve this in .NET, we will be using classes in the System.Text namespace, specifically the Encoding class. An instance of the Encoding class has a method called GetPreamble(), which will give us the BOM for that encoding. A BOM can be from 2 to 4 bytes long, depending on the encoding. Remember when I said two bytes would be plenty? Well, I fibbed: there is an encoding called UTF-32 that uses 4 bytes per character (room for a whopping 4.2 billion values, although Unicode itself defines only about 1.1 million code points).
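If you are curious what those preambles actually look like, this little sketch prints them for the built-in Unicode encodings. PreambleDemo and BomHex are names I made up for the example.

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
    // Returns the encoding's BOM as hex, or an empty string if it has none.
    public static string BomHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    public static void Main()
    {
        Console.WriteLine(BomHex(Encoding.UTF8));             // EF-BB-BF
        Console.WriteLine(BomHex(Encoding.Unicode));          // FF-FE
        Console.WriteLine(BomHex(Encoding.BigEndianUnicode)); // FE-FF
        Console.WriteLine(BomHex(Encoding.UTF32));            // FF-FE-00-00
        Console.WriteLine(BomHex(Encoding.ASCII).Length);     // 0 (no BOM)
    }
}
```

Notice that the UTF-32 preamble begins with the same two bytes as the little-endian UTF-16 one. That matters if you check BOMs in the wrong order.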

We can then check our data to see if it starts with the BOM.

private static bool DataStartsWithBom(byte[] data, byte[] bom)
{
    // An empty BOM never matches, and data shorter than the BOM cannot start with it.
    bool success = data.Length >= bom.Length && bom.Length > 0;
    for (int j = 0; success && j < bom.Length; j++)
    {
        success = data[j] == bom[j];
    }
    return success;
}

So let’s look at this method. It takes our data and a BOM, and determines whether the data starts with the BOM. It checks two things up front:

  1. The data length is greater than or equal to the BOM length. If it is not, the data cannot start with that BOM, and we’ll cover the no-BOM case in a bit.
  2. The BOM’s length is greater than zero.

So let’s put it to use (assume the local data is a byte[]):

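foreach (EncodingInfo encodingInfo in Encoding.GetEncodings())
{
    Encoding encoding = encodingInfo.GetEncoding();
    byte[] bom = encoding.GetPreamble();
    // Caveat: the first match wins, and FF FE (UTF-16 LE) is also a prefix of the UTF-32 LE BOM.
    if (DataStartsWithBom(data, bom))
    {
        return encoding;
    }
}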

Here, we get all of the encodings that .NET knows of, and look to see if our data byte array starts with that encoding’s BOM. If the encoding has no BOM, the DataStartsWithBom method handles that with its bom.Length > 0 check. Once we know the encoding, we can decode the data. You have to make sure that you don’t try to decode the BOM itself:

encoding.GetString(data, bom.Length, data.Length - bom.Length);
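Putting the pieces together, here is a self-contained sketch of BOM detection and decoding. It is not the only way to do it: BomSniffer, DetectByBom, and DecodeByBom are names invented for this example, and it assumes you only care about the built-in Unicode encodings rather than everything Encoding.GetEncodings() returns. UTF-32 is checked before UTF-16 because the UTF-16 little-endian BOM (FF FE) is also the start of the UTF-32 little-endian BOM (FF FE 00 00).

```csharp
using System;
using System.Text;

public static class BomSniffer
{
    // Assumption: only these Unicode encodings are checked, longest BOM first.
    private static readonly Encoding[] Candidates =
    {
        Encoding.UTF32,                 // FF FE 00 00
        new UTF32Encoding(true, true),  // 00 00 FE FF (big endian, with BOM)
        Encoding.UTF8,                  // EF BB BF
        Encoding.Unicode,               // FF FE
        Encoding.BigEndianUnicode       // FE FF
    };

    // Returns the encoding whose BOM starts the data, or null if none matches.
    public static Encoding DetectByBom(byte[] data)
    {
        foreach (Encoding encoding in Candidates)
        {
            byte[] bom = encoding.GetPreamble();
            bool match = bom.Length > 0 && data.Length >= bom.Length;
            for (int j = 0; match && j < bom.Length; j++)
            {
                match = data[j] == bom[j];
            }
            if (match)
            {
                return encoding;
            }
        }
        return null;
    }

    // Decodes the data without its BOM; returns null when no BOM is present.
    public static string DecodeByBom(byte[] data)
    {
        Encoding encoding = DetectByBom(data);
        if (encoding == null)
        {
            return null;
        }
        int skip = encoding.GetPreamble().Length;
        return encoding.GetString(data, skip, data.Length - skip);
    }

    public static void Main()
    {
        // UTF-16 LE BOM followed by "hi".
        byte[] data = { 0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00 };
        Console.WriteLine(DecodeByBom(data)); // hi
    }
}
```

One trade-off to be aware of: UTF-16 little-endian text whose first character is U+0000 would be mistaken for UTF-32 by this ordering. That is rare in real text, but it shows that a BOM is a strong hint rather than proof.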

Pretty straightforward so far, right?

Yes? OK, let’s move on. What about the case where we can’t figure it out by the BOM? Most encodings don’t have a BOM; only the UTF encodings do. ISO and OEM encodings do not.

This is where it gets tricky, and where some pretty complex algorithms can come into play. The most important piece of information that you can have at this point is which language the text is in. With that, we can take a reasonable stab at which encoding it is.

.NET supports languages through the System.Globalization.CultureInfo class. This class will be very useful from here on forward. Let’s take baby steps on attacking this problem, and while we don’t know everything, we can use clues.

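Each language has what Windows calls an ANSI code page. Despite the name, these are Windows code pages rather than standards issued by the American National Standards Institute. For most Western languages the ANSI code page is a single-byte encoding, although some East Asian code pages (932 for Japanese, for example) use one or two bytes per character. This seems like a reasonable place to start.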

We can get this Encoding by calling cultureInfoInstance.TextInfo.ANSICodePage. This only gives us the numeric code page (an identifier), but it’s simple enough to create an instance of the Encoding class with the code page by calling Encoding.GetEncoding(int codePage).
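Here is a minimal sketch of that lookup. AnsiDecoder and Decode are hypothetical names, the sample bytes are simply a Russian word I encoded in code page 1251, and the code assumes modern .NET (Core 3.0 or later), where the legacy code pages have to be registered first. What ANSICodePage reports comes from the platform’s culture data; on Windows, ru-RU reports 1251.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class AnsiDecoder
{
    // Decodes the data with the ANSI code page of the given culture.
    public static string Decode(byte[] data, CultureInfo culture)
    {
        // Assumption: modern .NET, where legacy code pages must be registered
        // first. On .NET Framework they are built in; delete this line there.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        int codePage = culture.TextInfo.ANSICodePage;
        return Encoding.GetEncoding(codePage).GetString(data);
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        // A Russian word encoded in Windows code page 1251.
        byte[] data = { 0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2 };
        CultureInfo russian = CultureInfo.GetCultureInfo("ru-RU");
        Console.WriteLine(russian.TextInfo.ANSICodePage);
        Console.WriteLine(Decode(data, russian));
    }
}
```

Keep in mind that some cultures have no ANSI code page at all and report 0, in which case Encoding.GetEncoding(0) falls back to the default encoding rather than anything language-specific.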

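How do I figure out the language? Chances are you know what language your users are using, or at least most of them. A case where you wouldn’t know is screen-scraping. There, the response itself usually declares its character set: look at the CharacterSet property of the HttpWebResponse instance, which comes from the charset in the Content-Type header. Don’t confuse it with the ContentEncoding property of HttpWebResponse, which reports things like gzip compression rather than the text encoding.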
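When all you have from a scraped response is the raw Content-Type header, the charset can be pulled out of it. This is only a sketch, assuming the header string has already been obtained; CharsetReader and FromContentType are names made up for the example.

```csharp
using System;
using System.Net.Mime;
using System.Text;

public static class CharsetReader
{
    // Returns the encoding named by the header's charset parameter,
    // or null when the header does not declare one.
    public static Encoding FromContentType(string contentTypeHeader)
    {
        string charset = new ContentType(contentTypeHeader).CharSet;
        return string.IsNullOrEmpty(charset) ? null : Encoding.GetEncoding(charset);
    }

    public static void Main()
    {
        Encoding encoding = FromContentType("text/html; charset=utf-8");
        Console.WriteLine(encoding.WebName); // utf-8
    }
}
```

Encoding.GetEncoding throws an ArgumentException for a charset name it does not recognize, so real code would want to catch that and fall back to one of the other approaches.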

In most cases, this will probably work. By no means am I saying “this will always work”; in fact, there are a lot of bases that I haven’t covered that I hope to cover in future blog posts. There are other code bits out there that do this already, and do a good job, but it’s always good to know how it actually works, and to fully understand the problem you are trying to solve.

So what’ll be in part 2? How to decode text without knowing the language, and maybe in part 2 (part 3?) lossy decoding.

Too Little Too Late?

Adobe Logo, Flash Logo, and the Adobe name are used under fair use.

I booted up my PC today and saw this nice message from Adobe telling me that there was an update for my Flash installation.

I couldn’t help but notice that one of the highlighted features was support for HD content, and I suspect this is a response to Silverlight’s support for high-definition content. I wonder, though: is it too little too late? I’ve heard a lot of stories from people switching from Flash to Silverlight just for the HD support. Now don’t misunderstand me, I think this will be a huge hit for the Flash community, and it definitely merits use.

This means to me that Silverlight is doing something right, and is going to be able to hold its ground.

CMAP User Group Presentation

On Tuesday, May 6th, the CMAP User Group Meeting will host the Heroes Happen {here} Launch, where I will be discussing the new features in SQL Server 2008 and Steve Michelotti will be discussing the new C# 3.0 language enhancements. If you are in the Baltimore area, I encourage you to come to the presentation to learn some cool new stuff.

Take a look at the meeting details here for more information and directions.