Comparing Unicode Strings Containing Combining Characters

Suppose you have a precomposed Unicode character, i.e. a single code point that can also be represented as a base character followed by one or more combining characters. For instance: é (U+00E9, Latin small letter e with acute). This character is common in Italian; for example, you can find it in “Perché?” (“Why?”).

This é character can be decomposed into an equivalent string made of the base letter e (U+0065, Latin small letter e) followed by the combining acute accent (U+0301).

So, it’s very reasonable to expect that two Unicode strings, one containing the precomposed character “é” (U+00E9), and another made of the base letter “e” (U+0065) followed by the combining acute accent (U+0301), should be considered equivalent.

However, given those two Unicode strings defined in C++ as follows:

  // Latin small letter e with acute
  const wchar_t s1[] = L"\x00E9";

  // Latin small letter e + combining acute
  const wchar_t s2[] = L"\x0065\x0301";

calling wcscmp(s1, s2) to compare them returns a nonzero value, meaning that those two equivalent Unicode strings are actually considered different. This makes sense from a “physical” perspective: wcscmp compares the raw wchar_t code units one by one, and the two sequences are not the same.
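To see what wcscmp is actually looking at, it can help to print the code units of each string. The following is only a minimal sketch: DumpCodeUnits is a hypothetical helper name introduced here for illustration, not a standard or Win32 function.

```cpp
#include <cstdio>
#include <cwchar>
#include <string>

// Hypothetical helper (not a standard function): returns the wchar_t
// code units of a null-terminated string as space-separated hex values.
std::string DumpCodeUnits(const wchar_t* s)
{
    std::string out;
    for (; *s != L'\0'; ++s)
    {
        char buf[16];
        std::snprintf(buf, sizeof(buf), "%04X", static_cast<unsigned>(*s));
        if (!out.empty())
        {
            out += ' ';
        }
        out += buf;
    }
    return out;
}

// Code-unit equality, which is all that wcscmp can tell us.
bool SameCodeUnits(const wchar_t* a, const wchar_t* b)
{
    return std::wcscmp(a, b) == 0;
}
```

For the two strings above, DumpCodeUnits(s1) gives "00E9" while DumpCodeUnits(s2) gives "0065 0301": one code unit versus two, so SameCodeUnits(s1, s2) is false even though both strings render as the same "é".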

However, if those same strings are compared using the CompareStringEx() Win32 API as follows:

  int result = ::CompareStringEx(
    LOCALE_NAME_INVARIANT,
    0, // default behavior
    s1, -1,
    s2, -1,
    nullptr,
    nullptr,
    0);

then the return value is CSTR_EQUAL, meaning that the two aforementioned strings are considered equivalent, as initially expected.
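If you need this kind of comparison in several places, the call can be wrapped in a small helper. What follows is just one possible sketch: the Equivalence enum and the CompareEquivalence function are hypothetical names made up for this example. CompareStringEx returns 0 on failure, so the sketch reports that case separately instead of treating it as "different". The #ifdef _WIN32 guard is there only so that the snippet still compiles on other platforms, where it always reports Unknown.

```cpp
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical result type for this sketch.
enum class Equivalence { Equal, Different, Unknown };

// Hypothetical wrapper around CompareStringEx (Windows only).
Equivalence CompareEquivalence(const std::wstring& a, const std::wstring& b)
{
#ifdef _WIN32
    const int result = ::CompareStringEx(
        LOCALE_NAME_INVARIANT,
        0,              // default behavior
        a.c_str(), -1,  // -1: string is null-terminated
        b.c_str(), -1,
        nullptr,        // no specific NLS version information
        nullptr,        // reserved
        0);             // reserved

    if (result == 0)
    {
        // The call failed; GetLastError() has the details.
        return Equivalence::Unknown;
    }
    return (result == CSTR_EQUAL) ? Equivalence::Equal
                                  : Equivalence::Different;
#else
    // CompareStringEx is not available outside Windows.
    (void)a;
    (void)b;
    return Equivalence::Unknown;
#endif
}
```

On Windows, going by the result described above, CompareEquivalence(s1, s2) should come back as Equivalence::Equal for the two "é" strings.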

 
