The Sticky Preprocessor-Based TCHAR Model – Part 2: Where’s My Function?!?

In the previous blog post, I briefly introduced the TCHAR model. I did that not because I think it's a good model that should be used in modern Windows C++ applications: on the contrary, I dislike it and consider it useless nowadays. I introduced the TCHAR model to help you understand what can be a very nasty bug in your C++ Windows projects.

So, suppose that you are building a cross-platform abstraction layer for some C++ application: in particular, you have a function that returns a string containing the path of the directory designated for temporary files, something like this:

// FILE: Utils.h

#pragma once

#include <string>

namespace Utils
{
    std::string GetTempPath();

    // ... Other functions
}

For the Windows implementation, this function is defined in terms of the GetTempPath Win32 API. To use that API, <Windows.h> is included inside the corresponding Utils.cpp source file:

// FILE: Utils.cpp

#include <Windows.h>
#include "Utils.h"

// Utils::GetTempPath() implementation ...
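
For reference, the implementation itself might look something like this minimal sketch (continuing the Utils.cpp file shown above): it calls the Unicode GetTempPathW API explicitly and converts the UTF-16 result to UTF-8. The buffer handling and error handling here are illustrative assumptions, not the original code:

namespace Utils
{
    std::string GetTempPath()
    {
        // Ask Windows for the temporary directory path (UTF-16).
        wchar_t buffer[MAX_PATH + 1] = {0};
        const DWORD length = ::GetTempPathW(MAX_PATH + 1, buffer);
        if (length == 0)
        {
            return std::string(); // call failed; keep the sketch simple
        }

        // Figure out the required UTF-8 length, then convert.
        const int utf8Length = ::WideCharToMultiByte(
            CP_UTF8, 0, buffer, static_cast<int>(length),
            nullptr, 0, nullptr, nullptr);
        if (utf8Length <= 0)
        {
            return std::string();
        }

        std::string result(utf8Length, '\0');
        ::WideCharToMultiByte(
            CP_UTF8, 0, buffer, static_cast<int>(length),
            &result[0], utf8Length, nullptr, nullptr);
        return result;
    }
}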

Now, suppose that you have another .cpp file, with cross-platform C++ code, that uses Utils::GetTempPath(). Note that, since this is cross-platform C++ code, <Windows.h> is not included there. Think, for example, of something as simple as:

// FILE: Main.cpp

#include <iostream>
#include "Utils.h"  // for Utils::GetTempPath()

int main()
{
    std::cout << Utils::GetTempPath() << '\n';
}

Well, this code won’t build. You’ll get a linker error, something like this:

1>Main.obj : error LNK2019: unresolved external symbol 
"class std::basic_string<char,struct std::char_traits<char>,
class std::allocator<char> > __cdecl Utils::GetTempPath(void)" 
(?GetTempPath@Utils@@YA?AV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@XZ) 
referenced in function _main

After removing a little bit of “noise” (including some C++ name mangling), basically the error is:

1>Main.obj : error LNK2019: unresolved external symbol "std::string Utils::GetTempPath()" referenced in function main

So, the linker is complaining about the Utils::GetTempPath() function.

Then you may start going crazy, double- and triple-checking the correct spelling of “GetTempPath” inside your Utils.h header, inside Utils.cpp, inside Main.cpp, etc. But there are no typos: GetTempPath is actually spelled correctly in every place.

Then, you try to rebuild the solution inside Visual Studio one more time, but the mysterious linker error shows up again.

What’s going on? Is this a linker bug? Time to file a bug on Connect?

Nope.

It’s just the nasty preprocessor-based TCHAR model that sneaked into our code!

Let's try to analyze what happened in some detail.

In this case, there are a couple of translation units to focus our attention on: one is from the Utils.cpp source file, containing the definition (implementation) of Utils::GetTempPath. The other is from the Main.cpp source file, calling the Utils::GetTempPath function (which is expected to be implemented in the former translation unit).

In the Utils.cpp’s translation unit, the <Windows.h> header is included. This header brings with it the preprocessor-based TCHAR model, discussed in the previous blog post. So, a preprocessor macro named “GetTempPath” is defined, and it is expanded to “GetTempPathW” in Unicode builds.

Think of it as an automatic search-and-replace process: before the actual compilation of C++ code begins, the preprocessor examines the source code, and automatically replaces all instances of “GetTempPath” with “GetTempPathW”. The Utils::GetTempPath function name is found and replaced as well, just like the other occurrences of “GetTempPath”. So, to the C++ compiler and linker, the actual function name for this translation unit is Utils::GetTempPathW (not Utils::GetTempPath, as written in source code!).
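
For illustration, this is roughly what the compiler ends up seeing in Utils.cpp's translation unit after preprocessing, in a Unicode build (a simplified sketch, not actual preprocessor output):

// Simplified view of Utils.cpp's translation unit after preprocessing:

namespace Utils
{
    // Declaration from Utils.h, renamed by the GetTempPath macro:
    std::string GetTempPathW();
}

// Definition written in Utils.cpp, renamed in the same way:
std::string Utils::GetTempPathW()
{
    // ... actual implementation
}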

Now, what’s happening at the Main.cpp’s translation unit? Since here <Windows.h> was not included (directly or indirectly), the TCHAR preprocessor model didn’t kick in. So, this translation unit is genuinely expecting a Utils::GetTempPath function, just as specified in the Utils.h header. But since the Utils.cpp’s translation unit produced a Utils::GetTempPathW function (because of the TCHAR model’s preprocessor #define), the linker can’t find any definition (implementation) of Utils::GetTempPath, hence the aforementioned apparently mysterious linker error.

[Figure: TCHAR Preprocessor Bug]

This can be a time-wasting subtle bug to spot, especially in non-trivial code bases, and especially when you don’t know about the TCHAR preprocessor model.

You should pay attention to functions and methods that have the same name as Win32 APIs, since they can be subject to this subtle TCHAR preprocessor transformation.

To fix that, one option is to #undef the offending Win32 GetTempPath macro in Utils.h:

//
// Remove TCHAR preprocessor redefinition 
// of GetTempPath
//
#ifdef GetTempPath
#undef GetTempPath
#endif
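
In context, the fixed header might look something like this (a sketch; the #undef has an effect only in translation units where <Windows.h> was included before this header, which is exactly the problematic case):

// FILE: Utils.h

#pragma once

#include <string>

//
// Remove TCHAR preprocessor redefinition
// of GetTempPath
//
#ifdef GetTempPath
#undef GetTempPath
#endif

namespace Utils
{
    std::string GetTempPath();

    // ... Other functions
}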

A simple repro solution can be downloaded here from GitHub.


The Sticky Preprocessor-Based TCHAR Model – Part 1: Introduction

If you have been doing a fair amount of Win32 programming in C++, chances are good that you have been exposed to some basic APIs like SetWindowText.

Its prototype is very simple:

BOOL SetWindowText(HWND hWnd,
                   LPCTSTR lpString);

The LPCTSTR typedef is equivalent to const TCHAR*: it basically represents a pointer to a read-only (input) NUL-terminated string. The purpose of this API is to change the text of the specified window's title bar, or the text of the control (if the hWnd parameter represents a control), using the string passed as the second parameter.

But, truth be told, there’s no SetWindowText function implemented and exposed as a Win32 API!

There are actually two slightly different functions: SetWindowTextA and SetWindowTextW.

This can be easily verified spelunking inside <WinUser.h>:

WINUSERAPI
BOOL
WINAPI
SetWindowTextA(
    _In_ HWND hWnd,
    _In_opt_ LPCSTR lpString);
WINUSERAPI
BOOL
WINAPI
SetWindowTextW(
    _In_ HWND hWnd,
    _In_opt_ LPCWSTR lpString);
#ifdef UNICODE
#define SetWindowText  SetWindowTextW
#else
#define SetWindowText  SetWindowTextA
#endif // !UNICODE

Removing some “noise” (don’t get me wrong: SAL and calling conventions are important; it’s “noise” just from the particular perspective of this blog post) from the above code snippet, and substituting the LPCSTR and LPCWSTR typedefs with their longer equivalent forms, we have:

// LPCSTR == const char*
BOOL SetWindowTextA(HWND hWnd, 
                    const char* lpString);

// LPCWSTR == const wchar_t*
BOOL SetWindowTextW(HWND hWnd, 
                    const wchar_t* lpString);

So, basically, the main difference between these two functions is in the string parameter: the function with the A suffix (SetWindowTextA) expects a char-based string, while the function with the W suffix (SetWindowTextW) expects a wchar_t-based string.

These char-based strings are commonly called “ANSI” or “MBCS” (“Multi-Byte Character Set”) strings. The “A” suffix originates from “ANSI”.

Conversely, the wchar_t-based strings are commonly called “wide” strings, or Unicode strings. And, as you can easily imagine, the “W” suffix stems from “wide”.

The ANSI/MBCS form refers to legacy strings, with lots of associated potential problems, including code-page mismatches.

The Unicode form is the “modern” one, and should be the preferred form in Windows applications written in C++. Note that, in this context, the particular Unicode encoding used is UTF-16 (with wchar_t being a UTF-16 16-bit code unit in Visual C++).

Now, let’s have a look at the last part of the aforementioned code snippet:

#ifdef UNICODE
#define SetWindowText  SetWindowTextW
#else
#define SetWindowText  SetWindowTextA
#endif

So, it’s clear that SetWindowText is just a preprocessor #define, expanded to SetWindowTextW in Unicode builds (which have been the default since VS2005!), and to SetWindowTextA in ANSI/MBCS builds (which IMHO should be considered deprecated).

The Unicode vs. ANSI/MBCS mode is controlled by the UNICODE preprocessor macro (with the companion _UNICODE macro playing the same role for the C runtime headers, like <tchar.h>).

As already written, Unicode builds have been the default since VS2005; however, you can change the build mode via the Visual Studio IDE, following the path: Project Properties | Configuration Properties | General | Character Set (as described, for example, in this StackOverflow answer).
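
If you ever need to double-check which mode a particular translation unit is being compiled in, a quick MSVC-specific diagnostic sketch like this can help:

// Compile-time check of the character-set build mode (Visual C++):
#if defined(UNICODE) && defined(_UNICODE)
    #pragma message("Unicode build")
#elif !defined(UNICODE) && !defined(_UNICODE)
    #pragma message("ANSI/MBCS build")
#else
    #pragma message("Warning: UNICODE and _UNICODE are out of sync")
#endif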

The idea behind this legacy TCHAR model is to let C/C++ Win32 programmers keep a single code base, using a common “generic” character type named TCHAR (instead of explicitly using char and wchar_t) and a single apparent function name (for example: SetWindowText). Depending on the ANSI/MBCS or Unicode build mode setting, TCHAR then resolves to either char or wchar_t, and the proper corresponding A-ending or W-ending function is called.

In this model, string literals should be decorated with the TEXT (or _TEXT, or _T) macro, so that in ANSI/MBCS builds a literal like TEXT("Connie") simply expands to "Connie", while in Unicode builds an L prefix is automatically added, making it L"Connie".
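
Under the hood, TCHAR and TEXT are defined by the Windows headers roughly along these lines (a simplified sketch; the real definitions in <winnt.h> and <tchar.h> go through a couple of extra helper macros and typedefs):

#ifdef UNICODE
    typedef wchar_t TCHAR;
    #define TEXT(s)  L##s
#else
    typedef char TCHAR;
    #define TEXT(s)  s
#endif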

Following this TCHAR model, a SetWindowText call would appear in C++ code something like this:

SetWindowText(myWindow, TEXT("Connie"));

In ANSI/MBCS builds, SetWindowText would actually be expanded to SetWindowTextA, TEXT("Connie") to "Connie", so the above statement gets transformed to:

SetWindowTextA(myWindow, "Connie");

Instead, in Unicode builds, SetWindowText is expanded to SetWindowTextW, TEXT("Connie") becomes L"Connie" (with the L prefix denoting a Unicode UTF-16 string literal), and the aforementioned statement becomes:

SetWindowTextW(myWindow, L"Connie");

So, given a single code base, you could switch the build mode between Unicode and ANSI/MBCS, and automatically get two different binary executables: one Unicode-enabled, and the other one using the ANSI/MBCS legacy APIs.

Well, this might have made sense in the old days of Windows, when Unicode-enabled versions of Windows (for example: Windows 2000, XP, etc.) coexisted with older Unicode-unaware versions of the OS, which didn’t implement the “W” version of the Win32 APIs. So you could build software capable of targeting both Unicode-enabled and Unicode-unaware versions of Windows, starting from a common single TCHAR-enabled code base, and just #define’ing/#undef’ing a few preprocessor macros (UNICODE and _UNICODE), more or less…

Anyway, considering that recent widespread versions of Windows, like Windows 7, are Unicode-enabled, there’s really no reason nowadays to use this legacy messy TCHAR model: just build your Windows C++ applications in Unicode.
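
And once you target Unicode only, you can also skip the TCHAR decorations entirely and write the wide-character calls explicitly, for example (a sketch, with myWindow assumed to be a valid HWND):

::SetWindowTextW(myWindow, L"Connie");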

(Bonus historical note: to simplify creating Unicode-aware applications for Windows 95 and 98, Microsoft built UNICOWS.DLL or “cows”, a.k.a. “Microsoft Layer for Unicode”, released in July 2001.)

However, this TCHAR preprocessor-based model has some nasty effects still today, as we’ll see in the next blog post.


Pluralsight Course: Building Context-Menu Shell Extensions in C++

I “wrote” a couple of video courses published on Pluralsight (a third one is work in progress, stay tuned!).

My first Pluralsight course was “Building Context-Menu Shell Extensions in C++”. It’s a slightly less than three-hour course, in which I teach you how to build context-menu shell extensions in C++, using Visual Studio.

The course starts with a brief introduction to COM, limited to just those COM concepts required by the remaining course modules.

Then, in the following module, I introduce the use of IExecuteCommand to build a simple context-menu shell extension. In this module, I use just “raw” C++, without any frameworks (like ATL). This approach gives the opportunity to show how some things work “under the hood”.

In the next module, I revisit the IExecuteCommand technique, but this time with the help of ATL. ATL is a very useful productive framework for C++/COM programmers: comparing the work done in the previous module with the ATL-based approach presented in this module will make you appreciate the productivity improvements brought by ATL (and Visual Studio ATL Wizards).

In the final module I introduce you to an IContextMenu-based technique for building context-menu shell extensions. There are pros and cons in using IExecuteCommand vs. IContextMenu. For example, while IContextMenu is available in Windows XP, IExecuteCommand is a Win7+ COM interface. So, if you need to develop a context-menu shell extension that supports XP, you have to use IContextMenu.

Moreover, while IExecuteCommand simplifies some common operations, more advanced techniques like building fancy UIs in the context-menu (for example, implementing owner-drawn menu items) require the use of IContextMenu and its later incarnations (like IContextMenu3).

I hope you enjoy the course.


Comparing Unicode Strings Containing Combining Characters

Suppose you have a precomposed Unicode character, i.e. a Unicode character that can equivalently be represented as a sequence of a base character plus one or more combining characters. For instance: é (U+00E9, Latin small letter e with acute). This character is common in Italian; for example, you can find it in “Perché?” (“Why?”).

This é character can be decomposed into an equivalent string made by the base letter e (U+0065, Latin small letter e) and the combining acute accent (U+0301).

So, it’s very reasonable that two Unicode strings, one containing the precomposed character “é” (U+00E9), and another made by the base letter “e” (U+0065) and the combining acute accent (U+0301), should be considered equivalent.

However, given those two Unicode strings defined in C++ as follows:

  // Latin small letter e with acute
  const wchar_t s1[] = L"\x00E9";

  // Latin small letter e + combining acute
  const wchar_t s2[] = L"\x0065\x0301";

calling wcscmp(s1, s2) to compare them returns a value different from zero, meaning that those two equivalent Unicode strings are actually considered different (which makes sense from a “physical” perspective, since wcscmp just compares the raw wchar_t code units).

However, if those same strings are compared using the CompareStringEx() Win32 API as follows:

  int result = ::CompareStringEx(
    LOCALE_NAME_INVARIANT,
    0, // default behavior
    s1, -1,
    s2, -1,
    nullptr,
    nullptr,
    0);

then the return value is CSTR_EQUAL, meaning that the two aforementioned strings are considered equivalent, as initially expected.
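
Putting the two comparisons together, a minimal self-contained test might look like this (a sketch, assuming a Windows Unicode build):

#include <Windows.h>
#include <cwchar>
#include <iostream>

int main()
{
    // Latin small letter e with acute (precomposed form)
    const wchar_t s1[] = L"\x00E9";

    // Latin small letter e + combining acute accent (decomposed form)
    const wchar_t s2[] = L"\x0065\x0301";

    // Ordinal comparison of the raw UTF-16 code units: reports "different".
    std::cout << "wcscmp:          "
              << (std::wcscmp(s1, s2) == 0 ? "equal" : "different") << '\n';

    // Linguistic comparison: returns CSTR_EQUAL, i.e. "equal".
    const int result = ::CompareStringEx(
        LOCALE_NAME_INVARIANT,
        0, // default behavior
        s1, -1,
        s2, -1,
        nullptr,
        nullptr,
        0);
    std::cout << "CompareStringEx: "
              << (result == CSTR_EQUAL ? "equal" : "different") << '\n';
}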