The Sticky Preprocessor-Based TCHAR Model – Part 1: Introduction

If you have been doing a fair amount of Win32 programming in C++, chances are good that you have been exposed to some basic APIs like SetWindowText.

Its prototype is very simple:

BOOL SetWindowText(HWND hWnd,
                   LPCTSTR lpString);

The LPCTSTR typedef is equivalent to const TCHAR*: it represents a pointer to an input NUL-terminated string. The purpose of this API is to change the text of the specified window’s title bar (or the text of the control, if the hWnd parameter represents a control) using the string passed as the second parameter.

But, truth be told, there’s no SetWindowText function implemented and exposed as a Win32 API!

There are actually two slightly different functions: SetWindowTextA and SetWindowTextW.

This can be easily verified spelunking inside <WinUser.h>:

WINUSERAPI
BOOL
WINAPI
SetWindowTextA(
    _In_ HWND hWnd,
    _In_opt_ LPCSTR lpString);
WINUSERAPI
BOOL
WINAPI
SetWindowTextW(
    _In_ HWND hWnd,
    _In_opt_ LPCWSTR lpString);
#ifdef UNICODE
#define SetWindowText  SetWindowTextW
#else
#define SetWindowText  SetWindowTextA
#endif // !UNICODE

Removing some “noise” from the above code snippet (don’t get me wrong: SAL annotations and calling conventions are important; they are “noise” only from the particular perspective of this blog post), and substituting the LPCSTR and LPCWSTR typedefs with their longer equivalent forms, we have:

// LPCSTR == const char*
BOOL SetWindowTextA(HWND hWnd, 
                    const char* lpString);

// LPCWSTR == const wchar_t*
BOOL SetWindowTextW(HWND hWnd, 
                    const wchar_t* lpString);

So, basically, the main difference between these two functions is in the string parameter: the function with the A suffix (SetWindowTextA) expects a char-based string, whereas the function with the W suffix (SetWindowTextW) expects a wchar_t-based string.

These char-based strings are commonly called “ANSI” or “MBCS” (“Multi-Byte Character Set”) strings. The “A” suffix originates from “ANSI”.

Conversely, the wchar_t-based strings are commonly called “wide” strings, or Unicode strings. And, as you can easily imagine, the “W” suffix stems from “wide”.

The ANSI/MBCS form refers to legacy strings, with lots of associated potential problems, including the mess of mismatched code pages.

The Unicode form is the “modern” one, and should be the preferred form in Windows applications written in C++. Note that, in this context, the particular Unicode encoding used is UTF-16 (with wchar_t being a UTF-16 16-bit code unit in Visual C++).

Now, let’s have a look at the last part of the aforementioned code snippet:

#ifdef UNICODE
#define SetWindowText  SetWindowTextW
#else
#define SetWindowText  SetWindowTextA
#endif

So, it’s clear that SetWindowText is just a preprocessor #define, expanded to SetWindowTextW in Unicode builds (which have been the default since VS2005!), and to SetWindowTextA in ANSI/MBCS builds (which IMHO should be considered deprecated).

The Unicode vs. ANSI/MBCS mode is controlled by the UNICODE preprocessor macro.

As already written, Unicode builds have been the default since VS2005; in any case, you can change the build mode via the Visual Studio IDE, following the path: Project Properties | Configuration Properties | General | Character Set (as described, for example, in this StackOverflow answer).

The idea of this legacy TCHAR model is to let C/C++ Win32 programmers keep a single code base, using a common “generic” character type named TCHAR (instead of explicitly using char and wchar_t) and a single apparent function name (for example: SetWindowText). Depending on the particular ANSI/MBCS or Unicode build mode setting, TCHAR expands to either char or wchar_t, and the corresponding A-suffixed or W-suffixed function is called.

In this model, string literals should be decorated with the TEXT, _TEXT, or _T macros: in ANSI/MBCS builds a literal like TEXT("Connie") simply expands to "Connie", while in Unicode builds an L prefix is automatically added, producing L"Connie".

Following this TCHAR model, a SetWindowText call would appear in C++ code something like this:

SetWindowText(myWindow, TEXT("Connie"));

In ANSI/MBCS builds, SetWindowText is expanded to SetWindowTextA and TEXT("Connie") to "Connie", so the above statement gets transformed to:

SetWindowTextA(myWindow, "Connie");

In Unicode builds, instead, SetWindowText is expanded to SetWindowTextW and TEXT("Connie") becomes L"Connie" (with the L prefix denoting a wide, UTF-16 string literal), so the aforementioned statement becomes:

SetWindowTextW(myWindow, L"Connie");

So, given a single code base, you could switch the build mode between Unicode and ANSI/MBCS, and automatically get two different binary executables: one Unicode-enabled, and the other one using the ANSI/MBCS legacy APIs.

Well, this might have made sense in the old days of Windows, when Unicode-enabled versions of Windows (for example: Windows 2000, XP, etc.) coexisted with older Unicode-unaware versions of the OS, which didn’t implement the “W” version of the Win32 APIs. So you could build software capable of targeting both Unicode-enabled and Unicode-unaware versions of Windows, starting from a common single TCHAR-enabled code base, and just #define’ing/#undef’ing a few preprocessor macros (UNICODE and _UNICODE), more or less…

Anyway, considering that all versions of Windows in widespread use today (Windows 7 and later) are Unicode-enabled, there’s really no reason nowadays to use this messy legacy TCHAR model: just build your Windows C++ applications in Unicode.

(Bonus historical note: to simplify creating Unicode-aware applications for Windows 95 and 98, Microsoft built UNICOWS.DLL or “cows”, a.k.a. “Microsoft Layer for Unicode”, released in July 2001.)

However, this TCHAR preprocessor-based model has some nasty effects still today, as we’ll see in the next blog post.

 

Comparing Unicode Strings Containing Combining Characters

Suppose you have a precomposed character, i.e. a single Unicode code point that is canonically equivalent to a sequence of two or more other characters. For instance: é (U+00E9, Latin small letter e with acute). This character is common in Italian; for example, you can find it in “Perché?” (“Why?”).

This é character can be decomposed into an equivalent string made by the base letter e (U+0065, Latin small letter e) and the combining acute accent (U+0301).

So, it’s very reasonable that two Unicode strings, one containing the precomposed character “é” (U+00E9), and another made by the base letter “e” (U+0065) and the combining acute accent (U+0301), should be considered equivalent.

However, given those two Unicode strings defined in C++ as follows:

  // Latin small letter e with acute
  const wchar_t s1[] = L"\x00E9";

  // Latin small letter e + combining acute
  const wchar_t s2[] = L"\x0065\x0301";

calling wcscmp(s1, s2) to compare them returns a value different from zero, meaning that these two equivalent Unicode strings are actually considered different (which makes sense from a “physical” perspective, since wcscmp compares the raw code units).

However, if those same strings are compared using the CompareStringEx() Win32 API as follows:

  int result = ::CompareStringEx(
    LOCALE_NAME_INVARIANT,
    0, // default behavior
    s1, -1,
    s2, -1,
    nullptr,
    nullptr,
    0);

then the return value is CSTR_EQUAL, meaning that the two aforementioned strings are considered equivalent, as initially expected. (A return value of zero from CompareStringEx would indicate an error.)

 

Conversion between Unicode UTF-16 and UTF-8 in C++/Win32

For up-to-date and richer information, including modern C++ usage, please read my MSDN Magazine article (published in the September 2016 issue):

Unicode Encoding Conversions with STL Strings and Win32 APIs

Updated, modern C++ code can be found here on GitHub.


Reusable C++ code for mixed ATL/STL conversions can be found here on GitHub. In that code, ATL’s CString(W) stores Unicode text encoded in UTF-16, and std::string stores UTF-8-encoded text.


Code working with ATL’s CStringW/A classes and throwing exceptions via AtlThrow() can be found here on GitHub. For convenience, the core part of that code is copied below:

//////////////////////////////////////////////////////////////////////////////
//
// *** Functions to convert between Unicode UTF-8 and Unicode UTF-16 ***
//                      using ATL CStringA/W classes
//
// By Giovanni Dicanio 
//
//////////////////////////////////////////////////////////////////////////////


//----------------------------------------------------------------------------
// FUNCTION: Utf8ToUtf16
// DESC:     Converts Unicode UTF-8 text to Unicode UTF-16 (Windows default).
//----------------------------------------------------------------------------
CStringW Utf8ToUtf16(const CStringA& utf8)
{
    // Special case of empty input string
    if (utf8.IsEmpty())
    {
        // Return empty string
        return CStringW();
    }


    // "Code page" value used with MultiByteToWideChar() for UTF-8 conversion 
    const UINT codePageUtf8 = CP_UTF8;

    // Safely fails if an invalid UTF-8 character is encountered
    const DWORD flags = MB_ERR_INVALID_CHARS;

    // Get the length, in WCHARs, of the resulting UTF-16 string
    const int utf16Length = ::MultiByteToWideChar(
            codePageUtf8,       // source string is in UTF-8
            flags,              // conversion flags
            utf8.GetString(),   // source UTF-8 string
            utf8.GetLength(),   // length of source UTF-8 string, in chars
            nullptr,            // unused - no conversion done in this step
            0);                 // request size of destination buffer, in WCHARs
    if (utf16Length == 0)
    {
        // Conversion error
        AtlThrowLastWin32();
    }


    // Allocate destination buffer to store the resulting UTF-16 string
    CStringW utf16;
    WCHAR* const utf16Buffer = utf16.GetBuffer(utf16Length);
    ATLASSERT(utf16Buffer != nullptr);


    // Do the conversion from UTF-8 to UTF-16
    int result = ::MultiByteToWideChar(
            codePageUtf8,       // source string is in UTF-8
            flags,              // conversion flags
            utf8.GetString(),   // source UTF-8 string
            utf8.GetLength(),   // length of source UTF-8 string, in chars
            utf16Buffer,        // pointer to destination buffer
            utf16Length);       // size of destination buffer, in WCHARs  
    if (result == 0)
    {
        // Conversion error
        AtlThrowLastWin32();
    }

    // Don't forget to release internal CString buffer 
    // before returning the string to the caller
    utf16.ReleaseBufferSetLength(utf16Length);

    // Return resulting UTF-16 string
    return utf16;
}



//----------------------------------------------------------------------------
// FUNCTION: Utf16ToUtf8
// DESC:     Converts Unicode UTF-16 (Windows default) text to Unicode UTF-8.
//----------------------------------------------------------------------------
CStringA Utf16ToUtf8(const CStringW& utf16)
{
    // Special case of empty input string
    if (utf16.IsEmpty())
    {
        // Return empty string
        return CStringA();
    }


    // "Code page" value used with WideCharToMultiByte() for UTF-8 conversion 
    const UINT codePageUtf8 = CP_UTF8;

    // Safely fails if an invalid UTF-16 character is encountered
    const DWORD flags = WC_ERR_INVALID_CHARS;

    // Get the length, in chars, of the resulting UTF-8 string
    const int utf8Length = ::WideCharToMultiByte(
            codePageUtf8,       // convert to UTF-8
            flags,              // conversion flags
            utf16.GetString(),  // source UTF-16 string
            utf16.GetLength(),  // length of source UTF-16 string, in WCHARs
            nullptr,            // unused - no conversion required in this step
            0,                  // request size of destination buffer, in chars
            nullptr, nullptr);  // unused
    if (utf8Length == 0)
    {
        // Conversion error
        AtlThrowLastWin32();
    }


    // Allocate destination buffer to store the resulting UTF-8 string
    CStringA utf8;
    char* const utf8Buffer = utf8.GetBuffer(utf8Length);
    ATLASSERT(utf8Buffer != nullptr);


    // Do the conversion from UTF-16 to UTF-8
    int result = ::WideCharToMultiByte(
            codePageUtf8,       // convert to UTF-8
            flags,              // conversion flags
            utf16.GetString(),  // source UTF-16 string
            utf16.GetLength(),  // length of source UTF-16 string, in WCHARs
            utf8Buffer,         // pointer to destination buffer
            utf8Length,         // size of destination buffer, in chars
            nullptr, nullptr);  // unused
    if (result == 0)
    {
        // Conversion error
        AtlThrowLastWin32();
    }


    // Don't forget to release internal CString buffer 
    // before returning the string to the caller
    utf8.ReleaseBufferSetLength(utf8Length);

    // Return resulting UTF-8 string
    return utf8;
}