Subtle Bug When Converting Strings to Lowercase

Suppose that you want to convert a std::string object to lowercase.

The first thing you would probably do is search the std::string documentation for a convenient method named to_lower, or something like that. Unfortunately, there’s nothing like that.

So, you might start developing your own “to_lower” function. A typical implementation I’ve seen of such a custom function goes something like this: for each character in the input string, convert it to lowercase by invoking std::tolower. In fact, there’s even this sample code on cppreference.com:

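// From http://en.cppreference.com/w/cpp/string/byte/tolower

#include <algorithm>  // std::transform
#include <cctype>     // std::tolower
#include <string>     // std::string

std::string str_tolower(std::string s) {
    // Convert each char (as unsigned char) to lowercase, in place.
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return std::tolower(c); }
                  );
    return s;
}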

Well, if you try this code with something like str_tolower("Connie"), everything seems to work fine, and you get “connie” as expected.

Now, since C++ folks like storing UTF-8-encoded text in std::string objects, in some large code base someone happily takes the aforementioned str_tolower function, and invokes it with their lovely UTF-8 strings. Fun ensues! …Well, actually, bugs ensue.

So, the problem is that str_tolower, under the hood, calls std::tolower on each char in the input string. While this works fine for pure ASCII strings like “Connie”, such code is a bug farm for UTF-8 strings. UTF-8 is a variable-width character encoding: some Unicode “characters” (code points) are encoded in UTF-8 using one byte, while others are encoded using two, three, or four bytes. The poor std::tolower knows nothing about UTF-8: it sees one byte at a time and interprets it according to the current C locale, so it innocently spits out wrong results, char by char.
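To make the “variable-width” point concrete, here is a minimal sketch that inspects UTF-8 bytes directly. The function names utf8_sequence_length and utf8_code_point_count are hypothetical helpers introduced only for illustration, and the code assumes well-formed UTF-8 input (it does no validation).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the number of bytes in the UTF-8 sequence
// that starts with `lead`, or 0 if `lead` cannot start a sequence
// (i.e. it is a continuation byte or an invalid lead byte).
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80)           return 1;  // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx or invalid
}

// Hypothetical helper: counts code points by counting every byte that is
// not a continuation byte (10xxxxxx).
// Assumes `s` holds well-formed UTF-8; no validation is performed.
std::size_t utf8_code_point_count(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For an ASCII-only string, utf8_code_point_count(s) equals s.size(). As soon as the text contains a non-ASCII code point the two numbers diverge, which is exactly the situation a byte-by-byte std::tolower loop is not prepared for.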

For example, I tried invoking the above function on “PERCHÉ” (the last character is the Unicode U+00C9 LATIN CAPITAL LETTER E WITH ACUTE, encoded in UTF-8 as the two-byte sequence 0xC3 0x89), and the result I got was “perchÉ” instead of the expected “perché” (é is Unicode U+00E9, LATIN SMALL LETTER E WITH ACUTE). So, the pure ASCII characters in the input string were all correctly converted to lowercase, but the final non-ASCII character wasn’t.
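The behavior described above can be reproduced with a small sketch. It repeats the per-byte str_tolower and adds a hypothetical to_hex helper (not part of any library) for dumping the bytes. It assumes the program runs in the default "C" locale, in which std::tolower only maps the ASCII letters A to Z.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdio>
#include <string>

// Same per-byte approach as the cppreference.com sample.
std::string str_tolower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return s;
}

// Hypothetical helper: formats each byte of `s` as two uppercase hex digits
// followed by a space, so the raw UTF-8 bytes can be inspected.
std::string to_hex(const std::string& s) {
    std::string out;
    char buf[4];  // two hex digits, a space, and the terminating NUL
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02X ", c);
        out += buf;
    }
    return out;
}
```

Calling to_hex(str_tolower("PERCH\xC3\x89")) in the "C" locale shows that the five ASCII letters become lowercase, while the trailing bytes C3 89 come back untouched. A correct lowercase conversion would end with C3 A9, the UTF-8 encoding of U+00E9. Under a different single-byte locale the outcome may differ, and the bytes could even be altered into an invalid UTF-8 sequence.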

Actually, the fault doesn’t lie with std::tolower itself: the function was misused, invoked in a way it was not designed for.

This is one of the perils of taking std::string-based C++ code that initially worked with ASCII strings, and thoughtlessly reusing it for UTF-8-encoded text.

In fact, we saw a very similar bug in a previous blog post.

So, how can you fix that problem? Well, a portable way is to use the ICU library, with its icu::UnicodeString class and its toLower method.

On the other hand, if you are writing Windows-specific C++ code, you can use the LCMapStringEx API with the LCMAP_LOWERCASE flag. Note that this function uses the UTF-16 encoding (as almost all Windows Unicode APIs do). So, if you have UTF-8-encoded text stored in std::string objects, you first have to convert it from UTF-8 to UTF-16, then invoke the aforementioned API, and finally convert the UTF-16-encoded result back to UTF-8. For these UTF-8/UTF-16 conversions, you may find my MSDN Magazine article on “Unicode Encoding Conversions with STL Strings and Win32 APIs” interesting.

 
