Maps with Case Insensitive String Keys

How to implement a map with case insensitive string keys? If you use the standard std::map associative container with std::string or std::wstring as key types, you get a case sensitive comparison by default.

If you take a look at std::map documentation, you’ll see that in addition to the key type and value type, there’s also a third template parameter that you can plug into std::map: it’s a comparison function object to sort the keys. The default option for this comparison function is std::less<Key>.

So, if you provide a custom comparison object that ignores the key string case, you can have a map with case insensitive keys:

map<string, ValueType, StringIgnoreCaseLess> myMap;

Now, the question is: What would such a comparison object look like?

A Failed Approach

An approach I've sometimes seen (e.g. on StackOverflow) is to use std::lexicographical_compare, comparing the strings char by char after invoking tolower on each char. Basically, the idea is to compare the lowercase versions of the corresponding characters in the input strings. The code from the aforementioned SO answer follows:

// Code from: https://stackoverflow.com/a/1801913/1629821

struct ci_less
{
  // case-independent (ci) compare_less binary function
  struct nocase_compare
  {
    bool operator() (const unsigned char& c1, const unsigned char& c2) const {
      return tolower (c1) < tolower (c2); 
    }
  };
  
  bool operator() (const std::string & s1, const std::string & s2) const {
    return std::lexicographical_compare 
        (s1.begin (), s1.end (),   // source range
         s2.begin (), s2.end (),   // dest range
         nocase_compare ());       // comparison
  }
};

The problem with this code is that it doesn’t work for international strings. In fact, while for a pure ASCII string like “Connie” you can use this technique to successfully compare “Connie” with “connie” or “CONNIE”, this won’t work for strings containing international characters.

For example, consider the Italian word “perché”. The last character in “perché” is U+00E9, i.e. the ‘LATIN SMALL LETTER E WITH ACUTE’, which is encoded in UTF-8 as the hex byte sequence 0xC3 0xA9. Its uppercase form is É U+00C9 (encoded in UTF-8 as 0xC3 0x89). So, let’s assume you use the UTF-8 encoding to store your international text in std::string objects. Well, invoking tolower char by char, as implemented in the aforementioned SO answer, will fail. In fact, tolower is unable to correctly process UTF-8 sequences (at least in the Microsoft VS2015 CRT implementation I used in my tests).

A Better Approach for International Text

So, how to fix that? Well, on Windows there’s a CompareStringEx API (available since Vista, according to the MSDN documentation) that seems to work, at least in my tests with some Italian text. You can call this API passing the NORM_IGNORECASE comparison flag.

As with most Windows APIs, the Unicode encoding used for text is UTF-16. So, let’s start writing a nice C++ wrapper around this CompareStringEx C-interface API. Let’s assume that we want to compare two UTF-16 wstrings, and that they are “ordinary” strings that don’t contain embedded NULs. A possible implementation for this C++ helper function follows:

#include <Windows.h>    // for CompareStringEx
#include <string>       // for std::wstring

// C++ wrapper around the Windows CompareStringEx C API
inline int CompareStringIgnoreCase(const std::wstring& s1, 
                                   const std::wstring& s2)
{
    // According to the MSDN documentation, the CompareStringEx function 
    // is optimized for NORM_IGNORECASE and string lengths specified as -1.

    return ::CompareStringEx(
        LOCALE_NAME_INVARIANT,
        NORM_IGNORECASE,
        s1.c_str(),
        -1,
        s2.c_str(),
        -1,
        nullptr,        // reserved
        nullptr,        // reserved
        0               // reserved
    );
}

Now, you can simply invoke this C++ helper function inside the comparison object that will be used with std::map:

// Comparison object for std::map, ignoring string case
struct StringIgnoreCaseLess
{
    bool operator()(const std::wstring& s1, const std::wstring& s2) const
    {
        // (s1 < s2) ignoring string case
        return CompareStringIgnoreCase(s1, s2) == CSTR_LESS_THAN;
    }
};

This basically implements the condition (s1 < s2) ignoring the string case.

And, finally, you can simply plug this comparison object into a std::map with UTF-16-encoded wstring keys:

map<wstring, ValueType, StringIgnoreCaseLess> myCaseInsensitiveStringMap;

Or, using the nice C++11 alias template feature:

template <typename ValueType>
using CaseInsensitiveStringMap = std::map<std::wstring, ValueType, 
                                          StringIgnoreCaseLess>;

// Simply use CaseInsensitiveStringMap<ValueType>

You can find some compilable C++ sample code on GitHub.

Note on UTF-8 String Keys

If you really want to use UTF-8-encoded std::string keys, then you have to add some code to the comparison object, to first convert the input strings from UTF-8 to UTF-16, and then invoke CompareStringEx for the UTF-16 text comparison.
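
For example, a comparison object for UTF-8-encoded keys might look something like the following sketch. It reuses the CompareStringIgnoreCase wrapper shown above, plus a hypothetical Utf8ToUtf16 helper based on MultiByteToWideChar with the CP_UTF8 flag; error handling is reduced to a bare minimum:

#include <Windows.h>
#include <string>

// Hypothetical helper: converts a UTF-8-encoded std::string
// to a UTF-16-encoded std::wstring (minimal error handling)
inline std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty())
    {
        return std::wstring();
    }

    // First call: ask for the required length, in wchar_ts
    const int utf16Length = ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS,
        utf8.data(), static_cast<int>(utf8.length()),
        nullptr, 0);

    std::wstring utf16(utf16Length, L'\0');

    // Second call: do the actual conversion
    ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS,
        utf8.data(), static_cast<int>(utf8.length()),
        &utf16[0], utf16Length);

    return utf16;
}

// Comparison object for std::map with UTF-8-encoded string keys:
// convert both keys to UTF-16, then compare ignoring case
struct Utf8StringIgnoreCaseLess
{
    bool operator()(const std::string& s1, const std::string& s2) const
    {
        return CompareStringIgnoreCase(Utf8ToUtf16(s1), Utf8ToUtf16(s2))
            == CSTR_LESS_THAN;
    }
};

You could then plug this comparison object into a map<string, ValueType, Utf8StringIgnoreCaseLess>. Keep in mind, though, that every key comparison now pays for two UTF-8-to-UTF-16 conversions.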

If you are working on C++ code that is already Windows platform specific, I think that choosing the UTF-16 encoding to represent international text is more “natural” (and more efficient) than converting back and forth between UTF-8 and UTF-16.


Detecting Unicode Space Characters

Some programmers love UTF-8 as they believe they can reuse old “ANSI” APIs with UTF-8-encoded text. UTF-8 does have some advantages (like being endian-neutral), but being able to blindly reuse old “ANSI” APIs for UTF-8 text is not one of them.

For example: There are various space characters defined in Unicode.

You can use the iswspace function to check if a Unicode UTF-16 wide character is a white-space character, e.g.:

#include <wctype.h>     // for iswspace
#include <iostream>

int main()
{
    // Let’s do a test with the punctuation space (U+2008)
    const wchar_t wch = 0x2008;

    if (iswspace(wch)) {
        std::cout << "OK.\n";
    }
}

The corresponding old “ANSI” function is isspace: can you use it with Unicode text encoded in UTF-8? I’m open to being proven wrong, but I don’t think that’s possible: isspace examines one byte at a time, so it can’t recognize multi-byte UTF-8 sequences like the three bytes 0xE2 0x80 0x88 that encode the punctuation space U+2008.
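
As a quick illustration, here’s a minimal sketch that feeds the individual bytes of that UTF-8 sequence to isspace: in the default “C” locale, none of them is classified as white-space, so the punctuation space goes undetected.

#include <ctype.h>      // for isspace
#include <iostream>

int main()
{
    // Punctuation space (U+2008), encoded in UTF-8
    const unsigned char utf8[] = { 0xE2, 0x80, 0x88 };

    // isspace sees one byte at a time: none of these bytes
    // is classified as white-space in the default "C" locale
    for (unsigned char byte : utf8)
    {
        std::cout << "isspace(0x" << std::hex << static_cast<int>(byte)
                  << ") = " << std::dec << isspace(byte) << '\n';
    }
}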


Printing UTF-8 Text to the Windows Console

Let’s suppose you have a string encoded in UTF-8, and you want to print it to the Windows console.

You might have heard of the _setmode function and the _O_U8TEXT flag.

The MSDN documentation contains this compilable code snippet that you can use as the starting point for your experimentation:

// crt_setmodeunicode.c  
// This program uses _setmode to change  
// stdout to Unicode. Cyrillic and Ideographic  
// characters will appear on the console (if  
// your console font supports those character sets).  
  
#include <fcntl.h>  
#include <io.h>  
#include <stdio.h>  
  
int main(void) {  
    _setmode(_fileno(stdout), _O_U16TEXT);  
    wprintf(L"\x043a\x043e\x0448\x043a\x0430 \x65e5\x672c\x56fd\n");  
    return 0;  
}  

So, to print UTF-8-encoded text, you may think of substituting the _O_U16TEXT flag with _O_U8TEXT, and using printf (or cout) with a byte sequence representing your UTF-8-encoded string.

For example, let’s consider the Japanese name for Japan, written using the kanji 日本.

The first kanji is the Unicode code point U+65E5; the second kanji is U+672C. Their UTF-8 encodings are the 3-byte sequences 0xE6 0x97 0xA5 and 0xE6 0x9C 0xAC, respectively.

So, let’s consider this compilable code snippet that tries to print a UTF-8-encoded string:

#include <fcntl.h>
#include <io.h>
#include <stdio.h>      // for _fileno and stdout
#include <stdint.h>
#include <iostream>

int main()
{
    _setmode(_fileno(stdout), _O_U8TEXT);

    // Japanese name for Japan, 
    // encoded in UTF-8
    uint8_t utf8[] = { 
        0xE6, 0x97, 0xA5, // U+65E5
        0xE6, 0x9C, 0xAC, // U+672C
        0x00 
    };

    std::cout << reinterpret_cast<const char*>(utf8) << '\n';
}

This code compiles fine. However, if you run it, you’ll get this error message:

[Image: error when trying to print UTF-8 text to the console]

So, how to print some UTF-8 encoded text to the Windows command prompt?

Well, it seems that you have to first convert from UTF-8 to UTF-16, and then use wprintf or wcout to print the UTF-16-encoded text. This isn’t optimal, but at least it seems to work.
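
For example, here’s a minimal sketch of this approach (error handling is reduced to a bare minimum): the UTF-8 bytes are converted to UTF-16 with MultiByteToWideChar and the CP_UTF8 flag, and the resulting wstring is printed with wcout in _O_U16TEXT mode.

#include <Windows.h>
#include <fcntl.h>
#include <io.h>
#include <stdio.h>
#include <iostream>
#include <string>

int main()
{
    _setmode(_fileno(stdout), _O_U16TEXT);

    // Japanese name for Japan (0xE6 0x97 0xA5, 0xE6 0x9C 0xAC),
    // encoded in UTF-8
    const char utf8[] = "\xE6\x97\xA5\xE6\x9C\xAC";

    // First call: ask MultiByteToWideChar for the required length
    // (in wchar_ts), including the terminating NUL since we pass -1
    const int utf16Length = ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS,
        utf8, -1,
        nullptr, 0);

    std::wstring utf16(utf16Length, L'\0');

    // Second call: do the actual UTF-8 to UTF-16 conversion
    ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS,
        utf8, -1,
        &utf16[0], utf16Length);

    // Print the UTF-16 text (wprintf would work as well)
    std::wcout << utf16.c_str() << L'\n';
}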


What Is the Encoding Used by the error_code message String?

std::system_error is an exception class, introduced in C++11, that is thrown by various functions interacting with OS-specific APIs. The platform-dependent error code is represented using the std::error_code class (returned by system_error::code). The error_code::message method returns an explanatory string for the error code. So, what is the encoding used to store the text in the returned std::string object? UTF-8? Some other code page?

To answer this question, I spelunked inside the MSVC STL implementation code, and found this _Winerror_message helper function that is used to get the description of a Windows error code.

[Image: the STL _Winerror_message helper function]

This function first calls the FormatMessageW API to get the error message encoded in Unicode UTF-16. Then, the returned wchar_t-string is converted to a char-string, which is written to an output buffer allocated by the caller.

The conversion is done by invoking the WideCharToMultiByte API, with the CodePage parameter set to CP_ACP, meaning “the system default Windows ANSI code page” (quoting the official MSDN documentation).

I think that, in modern C++ code, it’s generally good practice to store UTF-8-encoded text in std::strings. Code pages are a source of subtle bugs: there are many of them, they can change, and you end up with garbage characters (mojibake) when char-strings using different code pages are mixed and appended together (e.g. when written to UTF-8 log files).

So, I’d have preferred using the CP_UTF8 flag with the WideCharToMultiByte call above, getting a char-string containing the error message encoded as a UTF-8 string.

However, this would cause mojibake bugs for C++/Windows code that uses cout or printf to print message strings, as this code assumes CP_ACP by default.

So, my point is still that char-strings should in general use the UTF-8 encoding; but unless the Windows console and cout/printf move to UTF-8 as their default encoding, it sounds like the current usage of CP_ACP in the error message string is understandable.

Anyway, due to the use of CP_ACP in the wchar_t-to-char string conversion discussed above, you should pay attention when writing error_code::message strings to UTF-8-encoded log files. Maybe the best thing would be writing custom code to get the message string from the error code identifier, and encoding it using UTF-8 (basically invoking FormatMessage followed by WideCharToMultiByte with CP_UTF8).
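
For example, a custom helper along those lines might look like the following minimal sketch (the function name is made up, and error handling is kept to a bare minimum):

#include <Windows.h>
#include <string>

// Returns the description of a Windows error code
// as a UTF-8-encoded std::string
std::string Win32ErrorMessageUtf8(DWORD errorCode)
{
    // Ask FormatMessageW to allocate the message buffer
    wchar_t* buffer = nullptr;
    const DWORD length = ::FormatMessageW(
        FORMAT_MESSAGE_ALLOCATE_BUFFER
            | FORMAT_MESSAGE_FROM_SYSTEM
            | FORMAT_MESSAGE_IGNORE_INSERTS,
        nullptr,                                // no message source
        errorCode,
        0,                                      // default language
        reinterpret_cast<wchar_t*>(&buffer),
        0,
        nullptr);

    if (length == 0 || buffer == nullptr)
    {
        return std::string();
    }

    // Convert the UTF-16 message to UTF-8
    // (note: the message usually ends with CR-LF)
    const int utf8Length = ::WideCharToMultiByte(
        CP_UTF8, 0,
        buffer, static_cast<int>(length),
        nullptr, 0, nullptr, nullptr);

    std::string utf8(utf8Length, '\0');
    ::WideCharToMultiByte(
        CP_UTF8, 0,
        buffer, static_cast<int>(length),
        &utf8[0], utf8Length, nullptr, nullptr);

    ::LocalFree(buffer);
    return utf8;
}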

Thanks to Stephan T. Lavavej, Casey Carter and Billy O’Neal for e-mail communication on this issue.


The CStringW with wcout Bug Under the Hood

I discussed in a previous blog post a subtle bug involving CStringW and wcout, and later I showed how to fix it.

In this blog post, I’d like to discuss in more detail what’s happening under the hood, and what triggers that bug.

Well, to understand the dynamics of that bug, you can consider the following simplified case of a function and a function template, implemented like this:

void f(const void*) {
  cout << "f(const void*)\n";
}

template <typename CharT> 
void f(const CharT*) {
  cout << "f(const CharT*)\n";
}

If s is a CStringW object, and you write f(s), which function will be invoked?

Well, you can write a simple compilable program containing these two functions, the required headers, and a simple main implementation like this:

int main() {
  CStringW s = L"Connie";
  f(s);
}
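
For reference, here’s what the complete program might look like (a minimal sketch, assuming ATL is available for CStringW):

#include <atlstr.h>     // for CStringW
#include <iostream>     // for std::cout

using std::cout;

void f(const void*) {
  cout << "f(const void*)\n";
}

template <typename CharT>
void f(const CharT*) {
  cout << "f(const CharT*)\n";
}

int main() {
  CStringW s = L"Connie";
  f(s);
}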

Then compile it, and observe the output. You know, printf-debugging™ is so cool! 🙂

Well, you’ll see that the program outputs “f(const void*)”. This means that the first function (the non-templated one, taking a const void*), is invoked.

So, why did the C++ compiler choose that overload? Why not f(const wchar_t*), synthesized from the second function template?

Well, the answer lies in the rules that C++ compilers follow when doing template argument deduction. In particular, when deducing template arguments, implicit conversions are not considered. So, in this case, the implicit CStringW conversion to const wchar_t* is not considered.

So, when overload resolution happens later, the only candidate available is f(const void*). Now the implicit CStringW conversion to const wchar_t* is considered (followed by the standard conversion from const wchar_t* to const void*), and the first function is invoked.

Out of curiosity, if you comment out the first function, you’ll get a compiler error. MSVC complains with a message like this:

error C2672: ‘f’: no matching overloaded function found

error C2784: ‘void f(const CharT *)’: could not deduce template argument for ‘const CharT *’ from ‘ATL::CStringW’

The message is clear (almost…): “Could not deduce template argument for const CharT* from CStringW”: that’s because implicit conversions like this are not considered when deducing template arguments.

Well, what I’ve described above in a simplified case is basically what happens in the slightly more complex case of wcout.

wcout is an instance of wostream. wostream is declared in <iosfwd> as:

typedef basic_ostream<wchar_t, char_traits<wchar_t>> wostream;

Instead of the initial function f, in this case you have operator<<. In particular, here the candidates are an operator<< overload that is a member function of basic_ostream:

basic_ostream& basic_ostream::operator<<(const void *_Val)

and a template non-member function:

template<class _Elem, class _Traits> 
inline basic_ostream<_Elem, _Traits>& 
operator<<(basic_ostream<_Elem, _Traits>& _Ostr, const _Elem *_Val)

(This code is edited from the <ostream> standard header that comes with MSVC.)

When you write code like “wcout << s” (for a CStringW s), the implicit conversion from CStringW to const wchar_t* is not considered during template argument deduction. Then, overload resolution picks the basic_ostream::operator<<(const void*) member function (corresponding to the first f in the initial simplified case), so the string’s address is printed via this “const void*” overload (instead of the string itself).

On the other hand, when CStringW::GetString is explicitly invoked (as in “wcout << s.GetString()”), the compiler successfully deduces the template arguments for the non-member operator<< (deducing wchar_t for _Elem). And this operator<<(wostream&, const wchar_t*) prints the expected wchar_t string.

I know… There are aspects of C++ templates that are not easy.


Wiring CStringW with Output Streams

We saw earlier that there’s a subtle bug involving CStringW and wcout.

If you really want to make CStringW work with wcout without the additional GetString call, you can follow a common pattern that is used to enable a C++ class to work with output streams.

In particular, you can define an overload of operator<<, that takes references to the output stream and to the class of interest, e.g.:

std::wostream& operator<<(std::wostream& os, 
                          const CStringW& str)
{
    return (os << str.GetString());
}

Note that the call to CStringW::GetString is hidden inside the implementation of this operator<< overload.

With this overload, the following code will output the expected string:

CStringW s = L"Connie";
wcout << s;

Note that this will work also for other wcout-ish objects (output streams), like wostringstream.


Subtle Bug with CStringW and wcout: Where’s My String??

Someone wrote some C++ code like this to print the content of a CStringW using wcout:

CStringW s = L"Connie";
…
wcout << s << …

The code compiles fine, but the output is a hexadecimal sequence, not the string “Connie” as expected. Surprise, surprise! So, what’s going on here?

Well, wcout doesn’t have any clue how to deal with CStringW objects. After all, CStringW is part of ATL, while wcout is part of <iostream>: they are two separate worlds.

However, CStringW defines an implicit conversion operator to const wchar_t*. This makes it possible to simply pass CStringW objects to Win32 APIs expecting input C-style NUL-terminated string pointers (although, there’s a school of thought that discourages the use of implicit conversions in C++).

So, wcout has no idea how to print a CStringW object. However, wcout does know how to print raw pointers (const void*). So, in the initial code snippet, the C++ compiler first invokes the CStringW implicit const wchar_t* conversion operator. Then, the <iostream> operator<< overload that takes a const void* parameter is used to print the pointer with wcout.

In other words, the hexadecimal value printed is the raw C-style string pointer to the string wrapped by the CStringW object, instead of the string itself.

If you want to print the actual string (not the pointer), you can invoke the GetString method of CStringW. GetString returns a const wchar_t*, which is the pointer to the wrapped string; wcout does know how to print a Unicode UTF-16 wchar_t string via its raw C-style pointer. In fact, there’s a specific overload of operator<< that takes a const wchar_t*; this gets invoked, for example, when you pass wchar_t-string literals to wcout, e.g. wcout << L”Connie”, instead of picking the const void* overload that prints the pointer.

So, this is the correct code:

// Print a CStringW to wcout
wcout << s.GetString() << …

Another option is to explicitly static_cast the CStringW object to const wchar_t*, so the proper operator<< overload gets invoked:

wcout << static_cast<const wchar_t*>(s) << …

Although I prefer the shorter and clearer GetString call.

(Of course, this applies also to other wcout-ish objects, like instances of std::wostringstream.)

P.S. In my view, it would be more reasonable if “wcout << …” picked the const wchar_t* overload for CStringW objects as well. It probably doesn’t happen because of some rule involving templates or name lookup in the I/O stream library. Sometimes C++ is not reasonable (the same happens with C, too).

Safe Array Sample Code on GitHub

I uploaded some sample code to GitHub that shows how to create safe arrays in C++ and consume them in a C# application.

This project contains a C++ DLL (with a C interface) that exports two functions to produce safe arrays containing bytes and strings.

Then, there’s a WinForms C# application that consumes these safe arrays using proper PInvoke declarations, and shows their content on screen.


Marshal STL String Vectors Using Safe Arrays

Suppose you have a vector<string> in some cross-platform C++ code, and you want to marshal it across module or language boundaries on the Windows platform: Using a safe array is a valid option.

So, how can you achieve this goal?

Well, as it’s common in programming, you have to combine together some building blocks, and you get the solution to your problem.

A safe array can contain many different types, but the “natural” type for a Unicode string is BSTR. A BSTR is basically a length-prefixed Unicode string encoded using UTF-16.

ATL offers a convenient helper class to simplify safe array programming in C++: CComSafeArray. The MSDN Magazine article “Simplify Safe Array Programming in C++ with CComSafeArray” discusses with concrete sample code how to use this class. In particular, the paragraph “Producing a Safe Array of Strings” is the section of interest here.

So, this was the first building block. Now, let’s discuss the second.

You have a vector<string> as input. An important question to ask is what kind of encoding is used for the strings stored in the vector. It’s very common to store Unicode strings in std::string using the UTF-8 encoding. So, there’s an encoding impedance mismatch here: the input strings stored in the std::vector use UTF-8, but the output strings that will be stored as BSTRs in the safe array use UTF-16. OK, not a big problem: you just have to convert from UTF-8 to UTF-16. This is the other building block to solve the initial problem, and it’s discussed in the MSDN Magazine article “Unicode Encoding Conversions with STL Strings and Win32 APIs”.

So, to wrap up: You can go from a vector<string> to a safe array of BSTR strings following this path:

  1. Create a CComSafeArray<BSTR> of the same size as the input std::vector
  2. For each string in the input vector<string>, convert the UTF-8-encoded string to the corresponding UTF-16 wstring
  3. Create a CComBSTR from the previous wstring
  4. Invoke CComSafeArray::SetAt() to copy the CComBSTR into the safe array

Steps #1, #3, and #4 are discussed in the CComSafeArray MSDN Magazine article; step #2 is discussed in the Unicode encoding conversions MSDN Magazine article.
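
Putting these building blocks together, the producing side might look something like the following minimal sketch (this is my own sketch, not the articles’ code; Utf8ToUtf16 is an assumed helper along the lines of the conversion code discussed in the encoding conversions article, and error handling is omitted):

#include <atlbase.h>    // for CComBSTR
#include <atlsafe.h>    // for CComSafeArray
#include <string>
#include <vector>

// Assumed helper (hypothetical here): converts a UTF-8-encoded std::string
// to a UTF-16-encoded std::wstring, e.g. via MultiByteToWideChar with CP_UTF8
std::wstring Utf8ToUtf16(const std::string& utf8);

// Builds a safe array of BSTRs from a vector of UTF-8-encoded strings;
// the resulting safe array is returned through the ppsa output parameter.
void BuildSafeArrayOfStrings(const std::vector<std::string>& utf8Strings,
                             /* [out] */ SAFEARRAY** ppsa)
{
    // 1. Create a safe array of BSTRs of the same size as the input vector
    const ULONG count = static_cast<ULONG>(utf8Strings.size());
    CComSafeArray<BSTR> sa(count);

    for (ULONG i = 0; i < count; ++i)
    {
        // 2. Convert the current UTF-8 string to UTF-16
        const std::wstring utf16 = Utf8ToUtf16(utf8Strings[i]);

        // 3. Wrap the UTF-16 string in a CComBSTR
        const CComBSTR bstr(utf16.c_str());

        // 4. Copy the CComBSTR's string into the safe array
        sa.SetAt(static_cast<LONG>(i), bstr);
    }

    // Transfer ownership of the safe array to the caller
    *ppsa = sa.Detach();
}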

Subtle Bug with std::min/max Function Templates

Suppose you have a function f that returns a double, and you want to store in a variable the value returned by this function if it’s a positive number, or zero if the return value is negative. This line of C++ code tries to do that:

double x = std::max(0, f(/* something */));

Unfortunately, this apparently innocent code won’t compile!

The error message produced by VS2015’s MSVC compiler is not very clear, as is often the case with C++ code involving templates.

So, what’s the problem with that code?

The problem is that the std::max function template is declared something like this:

template <typename T> 
const T& max(const T& a, const T& b)

If you look at the initial code invoking std::max, the first argument is of type int; the second argument is of type double (i.e. the return type of f).

Now, if you look at the declaration of std::max, you’ll see that both parameters are expected to be of the same type T. So, the C++ compiler complains as it’s unable to deduce the type of T in the code calling std::max: should T be int or double?

This ambiguity triggers a compile-time error.

To fix this error, you can use the double literal 0.0 instead of 0.
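
For example:

double x = std::max(0.0, f(/* something */));

Now both arguments are of type double, so the compiler deduces T = double without any ambiguity.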

And, what if instead of 0 there’s a variable of type int?

Well, in this case you can either static_cast that variable to double:

double x = std::max(static_cast<double>(n), f(/* something */));

or, as an alternative, you can explicitly specify the double template type for std::max:

double x = std::max<double>(n, f(/* something */));