May 03

So, you know everything about text, right?–part VIII

Posted in .NET Basics, C#

As we’ve seen, all chars are represented by 16-bit Unicode values. If you’re a Win32 programmer who’s been lucky enough to go “managed”, I bet you’re delighted, because it means you no longer have to write that lovely code for converting between MBCS and Unicode, right? Unfortunately, there are still times when we do need to encode and decode strings. For instance, if we need to send a file to a specific client, we might need to encode the string. If you don’t know anything about encodings, then this primer by Joel Spolsky is a fantastic read!

By default, if we don’t specify an encoder, all encoding operations end up using the UTF-8 encoder. With UTF-8, characters are encoded with 1, 2, 3 or 4 bytes. Since characters below 0x0080 are encoded with a single byte, this type of encoding tends to work well with chars used in the USA. European languages also tend to use chars between 0x0080 and 0x07FF, which require 2 bytes. Most East Asian characters require 3 bytes, and surrogate pairs are always encoded with 4 bytes.
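A quick way to confirm those byte counts is `Encoding.UTF8.GetByteCount` (a minimal sketch; the sample characters are just illustrative picks from each range):

```csharp
using System;
using System.Text;

class Utf8Sizes
{
    static void Main()
    {
        // 1 byte: below 0x0080
        Console.WriteLine(Encoding.UTF8.GetByteCount("a"));  // 1
        // 2 bytes: 0x0080–0x07FF (é is 0x00E9)
        Console.WriteLine(Encoding.UTF8.GetByteCount("é"));  // 2
        // 3 bytes: most East Asian chars (中 is 0x4E2D)
        Console.WriteLine(Encoding.UTF8.GetByteCount("中")); // 3
        // 4 bytes: a surrogate pair (U+1F600)
        Console.WriteLine(Encoding.UTF8.GetByteCount("\uD83D\uDE00")); // 4
    }
}
```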

Even though UTF-8 is a popular encoding, it’s not that efficient when you need to encode characters above 0x07FF. In those cases, UTF-16 or UTF-32 might be a better option. With UTF-16, every char requires 2 bytes (so a surrogate pair takes 4). In practice, this means that you won’t get any compression at all (like you do when using UTF-8 with chars below 0x0080), but the operation should be fast (after all, this is a “direct copy” of a .NET char, because chars are represented with 2 bytes too!). UTF-32 encodes all chars as 4 bytes. Even though it uses more space, it simplifies the algorithm used for traversing the chars because you don’t have to worry about surrogate pairs.
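To compare the three encoders side by side, a sketch like this should do (the string is an arbitrary example of chars above 0x07FF; `Encoding.Unicode` is .NET’s name for UTF-16):

```csharp
using System;
using System.Text;

class EncodingSizes
{
    static void Main()
    {
        string s = "日本語"; // three chars, all above 0x07FF

        Console.WriteLine(Encoding.UTF8.GetByteCount(s));    // 9  (3 bytes each)
        Console.WriteLine(Encoding.Unicode.GetByteCount(s)); // 6  (2 bytes each)
        Console.WriteLine(Encoding.UTF32.GetByteCount(s));   // 12 (4 bytes each)
    }
}
```

Notice how UTF-8 loses its size advantage here: for this kind of text, UTF-16 is the most compact of the three.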

.NET does expose two other predefined encoders: UTF-7 and ASCII. UTF-7 encodes chars using only 7-bit values and should only be used with legacy systems that require this format. ASCII encodes a char into an ASCII character (no surprise here!), and you need to be careful because you might end up losing chars when you use it (chars greater than 0x7F can’t be converted and are replaced with ‘?’ during the encoding).
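Here’s a small sketch of that ASCII data loss (the string is an arbitrary example; with the default fallback, each non-ASCII char comes back as a ‘?’):

```csharp
using System;
using System.Text;

class AsciiLoss
{
    static void Main()
    {
        // 'é' (0x00E9) is above 0x7F, so the ASCII encoder can't
        // represent it; the default fallback emits '?' (0x3F) instead.
        byte[] bytes = Encoding.ASCII.GetBytes("résumé");
        string roundTripped = Encoding.ASCII.GetString(bytes);

        Console.WriteLine(roundTripped); // r?sum?
    }
}
```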

Besides these encoders, you should also know that you can encode any char to a specific code page (if you do, keep in mind that you might end up losing chars that can’t be represented in that code page). In practice, you should always work with UTF-16 or UTF-8; the only excuse to use one of the other encoders is having to interoperate with legacy systems. And I guess this covers the theory. In the next post, we’ll take a look at some code. Stay tuned for more!
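As a quick taste of the code-page scenario, a hypothetical round-trip through Windows-1252 might look like this (I’m assuming the code page is available on your runtime; on newer .NET runtimes you’d need to register the code-pages encoding provider first):

```csharp
using System;
using System.Text;

class CodePageDemo
{
    static void Main()
    {
        // Windows-1252 ships with the full .NET Framework; newer
        // runtimes need CodePagesEncodingProvider registered first.
        Encoding cp1252 = Encoding.GetEncoding(1252);

        // 'é' fits in this code page, so it survives as a single byte.
        byte[] bytes = cp1252.GetBytes("café");
        Console.WriteLine(bytes.Length); // 4

        // '中' has no slot in Windows-1252, so it degrades to '?'.
        Console.WriteLine(cp1252.GetString(cp1252.GetBytes("中"))); // ?
    }
}
```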