Maps with Case Insensitive String Keys

How to implement a map with case insensitive string keys? If you use the standard std::map associative container with std::string or std::wstring as key types, you get a case sensitive comparison by default.

If you take a look at std::map documentation, you’ll see that in addition to the key type and value type, there’s also a third template parameter that you can plug into std::map: it’s a comparison function object to sort the keys. The default option for this comparison function is std::less<Key>.

So, if you provide a custom comparison object that ignores the key string case, you can have a map with case insensitive keys:

map<string, ValueType, StringIgnoreCaseLess> myMap;

Now, the question is: What would such comparison object look like?

A Failed Approach

An approach I saw sometimes (e.g. on StackOverflow) is to use std::lexicographical_compare, comparing the strings char-by-char, after invoking tolower on each char. Basically, the idea is to compare the lowercase versions of each corresponding characters in the input strings. The code from the aforementioned SO answer follows:

// Code from: https://stackoverflow.com/a/1801913/1629821

struct ci_less
{
  // case-independent (ci) compare_less binary function
  struct nocase_compare
  {
    bool operator() (const unsigned char& c1, const unsigned char& c2) const {
      return tolower (c1) < tolower (c2); 
    }
  };
  
  bool operator() (const std::string & s1, const std::string & s2) const {
    return std::lexicographical_compare 
        (s1.begin (), s1.end (),   // source range
         s2.begin (), s2.end (),   // dest range
         nocase_compare ());       // comparison
  }
};

The problem with this code is that it doesn’t work for international strings. In fact, while for a pure ASCII string like “Connie” you can use this technique to successfully compare “Connie” with “connie” or “CONNIE”, this won’t work for strings containing international characters.

For example, consider the Italian word “perché”. The last character in “perché” is U+00E9, i.e. the ‘LATIN SMALL LETTER E WITH ACUTE’, which is encoded in UTF-8 as the hex byte sequence 0xC3 0xA9. Its uppercase form is É U+00C9 (encoded in UTF-8 as 0xC3 0x89). So, let’s assume you use the UTF-8 encoding to store your international text in std::string objects. Well, invoking tolower char by char, as implemented in the aforementioned SO answer, will fail. In fact, tolower is unable to correctly process UTF-8 sequences (at least in the Microsoft VS2015 CRT implementation I used in my tests).

A Better Approach for International Text

So, how to fix that? Well, on Windows there’s a CompareStringEx API (available since Vista, according to the MSDN documentation) that seems to work, at least in my tests with some Italian text. You can call this API passing the NORM_IGNORECASE comparison flag.

As for most Windows APIs, the Unicode encoding used for text is UTF-16. So, let’s start writing a nice C++ wrapper around this CompareStringEx C-interface API. Let’s assume that we want to compare two UTF-16 wstrings, and that they are “ordinary” strings that don’t contain embedded NULs. A possible implementation for this C++ helper function follows:

// C++ wrapper around the Windows CompareStringEx C API
inline int CompareStringIgnoreCase(const std::wstring& s1, 
                                   const std::wstring& s2)
{
    // According to the MSDN documentation, the CompareStringEx function 
    // is optimized for NORM_IGNORECASE and string lengths specified as -1.

    return ::CompareStringEx(
        LOCALE_NAME_INVARIANT,
        NORM_IGNORECASE,
        s1.c_str(),
        -1,
        s2.c_str(),
        -1,
        nullptr,        // reserved
        nullptr,        // reserved
        0               // reserved
    );
}

Now, you can simply invoke this C++ helper function inside the comparison object that will be used with std::map:

// Comparison object for std::map, ignoring string case
struct StringIgnoreCaseLess
{
    bool operator()(const std::wstring& s1, const std::wstring& s2) const
    {
        // (s1 < s2) ignoring string case
        return CompareStringIgnoreCase(s1, s2) == CSTR_LESS_THAN;
    }
};

This basically implements the condition (s1 < s2) ignoring the string case.

And, finally, you can simply plug this comparison object into a std::map with UTF-16-encoded wstring keys:

map<wstring, ValueType, StringIgnoreCaseLess> myCaseInsensitiveStringMap;

Or, using the nice C++11 alias template feature:

template <typename ValueType>
using CaseInsensitiveStringMap = std::map<std::wstring, ValueType, 
                                          StringIgnoreCaseLess>;

// Simply use CaseInsensitiveStringMap<ValueType>

You can find some compilable C++ sample code on GitHub.

Note on UTF-8 String Keys

If you really want to use UTF-8-encoded std::string keys, then you have to add some code to the comparison object, to first convert the input strings from UTF-8 to UTF-16, and then invoke CompareStringEx for the UTF-16 text comparison.

If you are working on C++ code that is already Windows platform specific, I think that choosing the UTF-16 encoding to represent international text is more “natural” (and more efficient) than converting back and forth between UTF-8 and UTF-16.

 

Detecting Unicode Space Characters

Some programmers love UTF-8 as they believe they can reuse old “ANSI” APIs with UTF-8-encoded text. UTF-8 does have some advantages (like being endian-neutral), but pretending to blindly reuse old “ANSI” APIs for UTF-8 text is not one of them.

For example: There are various space characters defined in Unicode.

You can use the iswspace function to check if a Unicode UTF-16 wide character is a white-space, e.g.:

#include <ctype.h>      // for iswspace
#include <iostream>

int main()
{
    // Let’s do a test with the punctuation space (U+2008)
    const wchar_t wch = 0x2008;

    if (iswspace(wch)) {
        std::cout << "OK.\n";
    }
}

The corresponding old “ANSI” function is isspace: Can you use it with Unicode text encoded in UTF-8? I’m open to be proven wrong, but I think that’s not possible.