Updates to the ATL/STL Unicode Encoding Conversion Code

I’ve updated my code on GitHub for converting between UTF-8, using STL std::string, and UTF-16, using ATL CStringW.

Now, on errors, the code throws instances of a custom exception class derived from std::runtime_error, which can carry more information than a simple CAtlException.
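
As an illustration, here’s a minimal sketch of what such an exception class might look like (the class name and members below are hypothetical, not the actual code from the repo):

#include <Windows.h>   // for DWORD
#include <stdexcept>   // for std::runtime_error

// Hypothetical sketch: names and details may differ from the repo code.
class Utf8ConversionException : public std::runtime_error
{
public:
    Utf8ConversionException(const char* message, DWORD errorCode)
        : std::runtime_error{message}
        , m_errorCode{errorCode}
    {
    }

    // The Win32 error code (e.g. from GetLastError) describing the failure
    DWORD ErrorCode() const noexcept { return m_errorCode; }

private:
    DWORD m_errorCode;
};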

Moreover, I’ve added a couple of overloads for converting from source string views (specified using an STL-style [start, finish) pointer range). This makes it possible to efficiently convert only portions of longer strings, without creating ad hoc CString or std::string instances to store those partial views.
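
Just to give the idea, here’s a hypothetical sketch of one such overload (the actual names and error handling in the repo differ), with the core conversion done via the MultiByteToWideChar Win32 API:

#include <Windows.h>  // for MultiByteToWideChar
#include <atlstr.h>   // for CStringW
#include <string>     // for std::string

// Hypothetical sketch of a pointer-range overload: converts the UTF-8
// bytes in [start, finish) to a UTF-16 CStringW.
CStringW Utf16FromUtf8(const char* start, const char* finish)
{
    const int utf8Length = static_cast<int>(finish - start);
    const int utf16Length = ::MultiByteToWideChar(
        CP_UTF8, 0, start, utf8Length, nullptr, 0);

    CStringW utf16;
    ::MultiByteToWideChar(
        CP_UTF8, 0, start, utf8Length,
        utf16.GetBuffer(utf16Length), utf16Length);
    utf16.ReleaseBuffer(utf16Length);
    return utf16;
}

// Usage: convert just the first five bytes ("Hello") of a longer
// UTF-8 string, without creating a temporary std::string for them:
//
//   std::string utf8{"Hello, world"};
//   CStringW hello = Utf16FromUtf8(utf8.data(), utf8.data() + 5);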


Unicode Encoding Conversions Between ATL and STL Strings

At the Windows platform-specific level, when using C++ frameworks like ATL (or MFC), it makes sense to represent strings using CString (CStringW in Unicode builds, which should be the default in modern Windows C++ code).

On the other hand, in cross-platform C++ code, using std::string makes sense as well.

The same Unicode text can be encoded in UTF-16 when stored in CString(W), and in UTF-8 when stored in std::string.
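
For example, here’s the same text stored in both representations:

// The same text ("Caffè") in the two encodings (Unicode build assumed):
CStringW utf16{L"Caff\u00E8"};     // UTF-16 code units: 0043 0061 0066 0066 00E8
std::string utf8{"Caff\xC3\xA8"};  // UTF-8 bytes:       43 61 66 66 C3 A8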

So, there’s a need to convert between those two Unicode representations. I discussed the details of such conversions in my MSDN Magazine article “Unicode Encoding Conversions with STL Strings and Win32 APIs”. However, the C++ code associated with that article used std::wstring for UTF-16 strings.

I’ve created a new repository on GitHub, where I uploaded reusable code (in the form of a convenient header-only module) for converting between UTF-16 and UTF-8, using CStringW for UTF-16, and std::string for UTF-8. Please feel free to check it out!


What’s Wrong with My UTF-8 Strings in Visual Studio?

Probably nothing: maybe you just have to tell Visual Studio it’s a UTF-8-encoded string!

The std::string class can be used to store UTF-8-encoded Unicode text.

For example:

std::string s{"LATIN SMALL LETTER E WITH GRAVE (U+00E8): \xC3\xA8"};

However, in the Locals window, instead of the expected è (Latin small letter e with grave, U+00E8), Visual Studio displays what looks like garbage: Ã (Latin capital letter A with tilde) followed by ¨ (diaeresis).

UTF-8 String Misinterpreted in the Locals Window

Why is that? Is there a bug in our C++ code? Did we pick the wrong UTF-8 bytes for ‘è’?

No. The UTF-8 encoding for è (U+00E8) is exactly the 2-byte sequence 0xC3 0xA8, so the above C++ code is correct.

The problem is that Visual Studio doesn’t use the UTF-8 encoding to display that string in the Locals window. It turns out that VS is probably using the Windows-1252 code page (a character encoding commonly mislabeled as “ANSI” on Windows…). In this encoding, the first byte, 0xC3, is mapped to Ã (U+00C3, Latin capital letter A with tilde), and the second byte, 0xA8, is mapped to ¨ (diaeresis, U+00A8).

To display the string content using the correct UTF-8 encoding, you can use the explicit “s8” format specifier. For example, typing this in the Command Window:

? &s[0],s8

displays the correct string, as this time the bytes stored in the std::string variable are interpreted as a UTF-8 sequence.

The s8 Format Specifier in the Command Window

Similarly, the s8 format specifier can be used in the Watch window as well.

The s8 Format Specifier in the Watch Window


Printing Non-ASCII Characters to the Console on Windows

…A subtitle could read: “From Greek letters to Os”.

I came across some Windows C++ source code that printed the Greek small letter pi (π, U+03C0) and capital letter sigma (Σ, U+03A3) to the console, as part of some mathematical formulae. A minimal reproducible code snippet based on the original code looks like this:

#include <stdio.h>

#define CHAR_SIGMA 228
#define CHAR_PI 227

int main()
{
    printf("Greek pi: %c\n", CHAR_PI);
    printf("Greek Sigma: %c\n", CHAR_SIGMA);
}

When executed on the original author’s PC, the above code prints the expected Greek letters correctly. However, when I ran the exact same code on my PC, I got this incorrect output:

Greek letters wrongly printed to the console

What’s going wrong here?

The problem is that the author of that code looked up the codes for those Greek letters in some code page, most likely code page 437.

However, on my Windows PC, the default code page for the console seems to be a different one, namely code page 850; in fact, in this code page the codes 227 and 228 are mapped to Ò and õ (instead of π and Σ), respectively.
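
By the way, you can check the console’s current code page with the chcp command; for example, on my PC:

C:\>chcp
Active code page: 850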

How to fix that?

I would suggest using Unicode instead of code pages. On Windows, the default “native” Unicode encoding is UTF-16. Using UTF-16, π (U+03C0) is encoded as the 16-bit code unit 0x03C0, and Σ (U+03A3) as 0x03A3.

This code snippet shows UTF-16 console output in action:

#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int wmain( /* int argc, wchar_t* argv[] */ )
{
    // Enable Unicode UTF-16 output to console
    _setmode(_fileno(stdout), _O_U16TEXT);

    // Use UTF-16 encoding for Greek letters
    static const wchar_t kPi = 0x03C0;
    static const wchar_t kSigma = 0x03A3;
    
    wprintf(L"Greek pi: %c\n", kPi);
    wprintf(L"Greek Sigma: %c\n", kSigma);
}

Greek letters correctly printed to the console

Note the use of _setmode and _O_U16TEXT to enable UTF-16 output to the console, and the associated <fcntl.h> and <io.h> additional headers. (More details on those can be found in this blog post.)

P.S. Bonus reading:

Unicode Encoding Conversions with STL Strings and Win32 APIs


The Unreadable Obscure LPXYSTR Typedefs

If you write Win32 C++ code, you’ll find a lot of unreadable and apparently obscure typedefs for strings, for example: LPCTSTR, LPTSTR, LPCWSTR, etc.

What are those? What do they mean?

The key to understanding those apparently obscure (but certainly unreadable!) typedefs is to split their names into smaller parts.

Let’s start with the “decoding” of LPCTSTR:

  • LP means pointer. It actually stands for “long pointer”: back in 16-bit Windows there were both “near” (short) and “far” (long) pointers, a distinction that predates my own experience with the C++ Windows platform, but the prefix survived.
  • C means: const, constant.
  • T means: TCHAR.
  • STR means: C-style NUL-terminated string.

So, putting those elements together, an “LPCTSTR” is a “pointer to a read-only (const) NUL-terminated TCHAR string”, i.e. in C and C++ code: “const TCHAR*”. You can find LPCTSTR parameters in many Win32 APIs, e.g. in SetWindowText.

Now, to be honest, to me “const TCHAR*” is much more readable than “LPCTSTR”: in the former, I can clearly see all the composing elements: “const”, “TCHAR”, “pointer”. In the latter, compressed “LPCTSTR” form, those components are zipped together in an obscure, unreadable way.

Since I gave a friendly piece of advice to forget about the TCHAR model, let’s substitute “TCHAR” with “wchar_t” in Unicode builds: LPCTSTR becomes “const wchar_t*”, i.e. basically a pointer to a read-only C-style NUL-terminated Unicode (UTF-16) string.

Let’s do another one: LPTSTR.

  • LP: pointer
  • T: TCHAR
  • STR: NUL-terminated string

So, this is basically like LPCTSTR, minus the “C” (“const”): a string buffer that isn’t read-only, i.e. one you can write into; think, for example, of the output buffer you pass to GetWindowText.

In code, it’s: “TCHAR *”, which is equivalent to “wchar_t*” in Unicode builds. Let’s see: “wchar_t*” vs. “LPTSTR”. Wow! The former (“wchar_t*”) is much simpler and more readable to me!

And let’s be honest: the difference between LPCTSTR and LPTSTR is very subtle, with the “C” for “const” buried in the middle of the LPCTSTR typedef! The difference between “const wchar_t*” and “wchar_t*” is much clearer!

Finally, let’s analyze LPCWSTR.

  • LP: pointer
  • C: const
  • W: wide, i.e. WCHAR/wchar_t based
  • STR: NUL-terminated string

So, this is a pointer to a read-only (const) wchar_t-based (i.e. Unicode UTF-16) NUL-terminated string; in C++ code, that translates to “const wchar_t*”. Again, in my opinion, the expanded form is much more readable than the obscure “LPCWSTR” typedef. You can find this typedef used, for example, in the DrawThemeTextEx Win32 API.

Let’s summarize these results in a nice table:

Win32 Obscure Typedef | Expanded Equivalent | Equivalent in Unicode Builds
--------------------- | ------------------- | ----------------------------
LPCTSTR               | const TCHAR*        | const wchar_t*
LPTSTR                | TCHAR*              | wchar_t*
LPCWSTR               | const wchar_t*      | const wchar_t*
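
If you want, these equivalences can be verified at compile time with a small sketch like this (assuming a Unicode build, i.e. UNICODE defined):

#include <Windows.h>
#include <type_traits>

// Compile-time checks of the table above (Unicode build assumed):
static_assert(std::is_same<LPCTSTR, const wchar_t*>::value, "LPCTSTR mismatch");
static_assert(std::is_same<LPTSTR, wchar_t*>::value, "LPTSTR mismatch");
static_assert(std::is_same<LPCWSTR, const wchar_t*>::value, "LPCWSTR mismatch");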

Decoding other obscure Win32 typedefs is left as an Exercise for The Reader (TM). Actually, I hope those typedefs will now be a little bit less obscure…

Note also that you may find the pointer part sometimes abbreviated using “P” instead of “LP”: Welcome to the New World of (just) pointers, without long/short/near/far pointers 🙂

And, as a coding style, I think that using the underlying C++ types (like in “const wchar_t*”) is much better (e.g. clearer, less bug-prone) than using those Win32 typedefs (like “PCWSTR”).


A Friendly Piece of Advice: The TCHAR Model? Forget About It!

I believe modern Windows Win32/C++ applications should just use Unicode builds, instead of the obsolete TCHAR model.

For example, consider some Win32/C++ code like this:

CString str;
TCHAR* buf = str.GetBuffer(maxLength);
// Write something in buf ...
buf[n] = L'\0'; // (1)
str.ReleaseBuffer();

// Other code ...

TCHAR * text = /* some valid output buffer */;
swprintf_s(text, L"Something ...", ...); // (2)

// DoSomething() is declared as:
//
//   void DoSomething(const wchar_t* psz);
//
DoSomething(text); // (3)

The author of that piece of code thinks he followed the TCHAR model, probably because he used TCHAR throughout his code; but he’s wrong. For example:

  1. If buf is a TCHAR*, the correct TCHAR-model assignment should be “buf[n] = _T(‘\0’);”, i.e. use the _T or TEXT preprocessor macros instead of the L prefix.
  2. swprintf_s should be substituted with _stprintf_s, and again the L prefix should be substituted with the _T or TEXT preprocessor macros.
  3. The TCHAR string “text” should be converted to a wchar_t string before passing it to the DoSomething function, e.g. using an ATL helper like CT2W or CT2CW (e.g. “DoSomething(CT2W(text));”). A fully TCHAR-conformant version of the whole fragment is sketched below.
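
Putting those three fixes together, a strictly TCHAR-conformant version of the fragment would look something like this (just a sketch, with concrete buffers in place of the original placeholders):

#include <tchar.h>    // for _T, _stprintf_s
#include <atlbase.h>  // for CT2W
#include <atlstr.h>   // for CString

void DoSomething(const wchar_t* /*psz*/) {}

void TcharConformantFragment()
{
    const int maxLength = 64;

    CString str;
    TCHAR* buf = str.GetBuffer(maxLength);
    // Write something in buf ...
    buf[0] = _T('\0');  // (1) _T instead of the L prefix
    str.ReleaseBuffer();

    TCHAR text[128];
    _stprintf_s(text, _T("Something %d ..."), 42);  // (2) _stprintf_s instead of swprintf_s

    DoSomething(CT2W(text));  // (3) explicit TCHAR -> wchar_t conversion
}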

So, following the TCHAR model would actually make the code more complicated. And for what?

For being able to compile the same source code also in ANSI/MBCS builds? Is the whole code base properly tested for ANSI/MBCS builds? In ANSI/MBCS builds, what code page(s) are used for char-based strings? Are the conversions from those code pages to Unicode done properly? This can be a bug factory! And do you (or your clients) really need an ANSI/MBCS build of your software?

So, in my opinion, the best thing to do in modern Win32 C++ applications is to forget about the TCHAR model: use Unicode UTF-16 (at least at the Win32 API boundary), use WCHAR or wchar_t instead of TCHAR (in other words, assume that TCHAR is wchar_t), and assume that Win32 APIs taking TCHAR strings (e.g. SetWindowText) simply take Unicode UTF-16 wchar_t strings.

In other words: Just assume Unicode builds for Win32 C++ code.

I’d rewrite the above code fragment simply assuming that TCHAR is wchar_t, like this:

CString str;
// CString stores Unicode UTF-16 text 
// in Unicode builds
wchar_t* buf = str.GetBuffer(maxLength);

// Write something in buf ...
buf[n] = L'\0';
str.ReleaseBuffer();

// Other code ...

wchar_t * text = /* some valid output buffer */;
swprintf_s(text, L"Something ...", ...);

// DoSomething() is declared as:
//
//   void DoSomething(const wchar_t* psz);
//
DoSomething(text);

It’s also interesting to point out that some “modern” (e.g. Vista+) Windows APIs, like DrawThemeTextEx, just take Unicode (UTF-16) strings: the pszText parameter of that API is of type LPCWSTR, i.e. “const WCHAR*”, instead of LPCTSTR, i.e. “const TCHAR*”.
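
Here’s its signature from <Uxtheme.h>; note the LPCWSTR pszText parameter:

// From <Uxtheme.h> (Vista+): the text parameter is LPCWSTR, not LPCTSTR
HRESULT DrawThemeTextEx(
    HTHEME hTheme,
    HDC hdc,
    int iPartId,
    int iStateId,
    LPCWSTR pszText,
    int cchText,
    DWORD dwTextFlags,
    LPRECT pRect,
    const DTTOPTS *pOptions
);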


The Chinese Dictionary Loading Benchmark Revised

Available here on GitHub.

[…] All in all, I’d be happy with the optimization level reached in #2: Ditch C++ standard I/O streams and locale/codecvt in favor of memory-mapped files for reading files and MultiByteToWideChar Win32 API for UTF-8 to UTF-16 conversions, but just continue using the STL’s wstring (or CString) class!

Unicode UTF-16/UTF-8 Encoding Conversions: Win32 API vs. C++ Standard Library Performance

In this MSDN Magazine article, I showed how to convert Unicode text between UTF-16 and UTF-8 encodings using direct Win32 API calls (in particular, I discussed in detail the use of the MultiByteToWideChar API).

In addition to that, the C++ standard library offers some classes to perform such conversions.

For example, you can combine std::codecvt_utf8_utf16 with std::wstring_convert to convert from a UTF-16-encoded std::wstring to a UTF-8 std::string:

#include <codecvt>  // for codecvt_utf8_utf16
#include <locale>   // for wstring_convert
#include <string>   // for string, wstring

...
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> conversion;
std::wstring utf16 = L"Some UTF-16 string";
std::string utf8 = conversion.to_bytes(utf16);

I developed some C++ code to convert many strings and measure the time spent for the conversions using the aforementioned C++ standard library classes vs. direct Win32 API calls, and the result is clearly in favor of the latter (code compiled with VS2015 in release build):

STL: 1050 ms

Win32: 379 ms  <-- clearly wins
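
For reference, here’s a minimal sketch of the direct Win32 approach for the same UTF-16-to-UTF-8 direction (error handling omitted; the article shows a production-quality version):

#include <Windows.h>
#include <string>

// Minimal sketch: UTF-16 wstring to UTF-8 string via WideCharToMultiByte.
std::string Utf8FromUtf16(const std::wstring& utf16)
{
    if (utf16.empty())
    {
        return std::string{};
    }

    // First call: get the required size of the destination UTF-8 buffer
    const int utf8Length = ::WideCharToMultiByte(
        CP_UTF8, 0,
        utf16.data(), static_cast<int>(utf16.length()),
        nullptr, 0, nullptr, nullptr);

    // Second call: do the actual conversion
    std::string utf8(utf8Length, '\0');
    ::WideCharToMultiByte(
        CP_UTF8, 0,
        utf16.data(), static_cast<int>(utf16.length()),
        &utf8[0], utf8Length, nullptr, nullptr);

    return utf8;
}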


MSDN Magazine article on Unicode Encoding Conversions with STL Strings and Win32 APIs

Another C++ article of mine was published in MSDN Magazine (in the September 2016 issue):

“Unicode Encoding Conversions with STL Strings and Win32 APIs”

Check it out!

Thanks to my editor Sharon Terdeman for her excellent editing work, and to my tech reviewers David Cravey and Marc Gregoire for their useful suggestions and feedback.

EDIT (2016-SEP-02): A Visual Studio solution containing C++ code based on this article can be found on GitHub here.


The Sticky Preprocessor-Based TCHAR Model – Part 2: Where’s My Function?!?

In the previous blog post, I briefly introduced the TCHAR model. I did that not because I think it’s a quality model that should be used in modern Windows C++ applications: on the contrary, I dislike it and consider it useless nowadays. The reason I introduced the TCHAR model was to help you understand what can be a very nasty bug in your C++ Windows projects.

So, suppose that you are building a cross-platform abstraction layer for some C++ application: in particular, you have a function that returns a string containing the path of the directory designated for temporary files, something like this:

// FILE: Utils.h

#pragma once

#include <string>

namespace Utils
{
    std::string GetTempPath();

    // ... Other functions
}

For the Windows implementation, this function is defined in terms of the GetTempPath Win32 API. In order to use that API, inside the corresponding Utils.cpp source, <Windows.h> is included:

// FILE: Utils.cpp

#include <Windows.h>
#include "Utils.h"

// Utils::GetTempPath() implementation ...
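
Here’s one possible sketch of that implementation (minimal error handling; the UTF-16 path returned by the API is converted to UTF-8 for the returned std::string). As we’ll see shortly, the function’s very name in this file is where the trouble starts:

// Possible sketch of the implementation (minimal error handling)
std::string Utils::GetTempPath()
{
    wchar_t buffer[MAX_PATH + 1] = {};
    const DWORD length = ::GetTempPathW(MAX_PATH + 1, buffer);
    if (length == 0)
    {
        return std::string{};
    }

    // Convert the UTF-16 path to UTF-8
    const int utf8Length = ::WideCharToMultiByte(
        CP_UTF8, 0, buffer, static_cast<int>(length),
        nullptr, 0, nullptr, nullptr);
    std::string result(utf8Length, '\0');
    ::WideCharToMultiByte(
        CP_UTF8, 0, buffer, static_cast<int>(length),
        &result[0], utf8Length, nullptr, nullptr);
    return result;
}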

Now, suppose that you have another .cpp file, with cross-platform C++ code, that uses Utils::GetTempPath(). Note that, since this is cross-platform C++ code, <Windows.h> is not included there. Think, for example, of something as simple as:

// FILE: Main.cpp

#include <iostream>
#include "Utils.h"  // for Utils::GetTempPath()

int main()
{
    std::cout << Utils::GetTempPath() << '\n';
}

Well, this code won’t build. You’ll get a linker error, something like this:

1>Main.obj : error LNK2019: unresolved external symbol 
"class std::basic_string<char,struct std::char_traits<char>,
class std::allocator<char> > __cdecl Utils::GetTempPath(void)" 
(?GetTempPath@Utils@@YA?AV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@XZ) 
referenced in function _main

After removing a little bit of “noise” (including some C++ name mangling), basically the error is:

1>Main.obj : error LNK2019: unresolved external symbol “std::string Utils::GetTempPath()” referenced in function main

So, the linker is complaining about the Utils::GetTempPath() function.

Then you may start going crazy, double- and triple-checking the correct spelling of “GetTempPath” inside your Utils.h header, inside Utils.cpp, inside Main.cpp, etc. But there are no typos: GetTempPath is actually spelled correctly in every place.

Then, you try to rebuild the solution inside Visual Studio one more time, but the mysterious linker error shows up again.

What’s going on? Is this a linker bug? Time to file a bug on Connect?

Nope.

It’s just the nasty preprocessor-based TCHAR model that sneaked into our code!

Let’s try to analyze what happened in some detail.

In this case, there are a couple of translation units to focus our attention on: one is from the Utils.cpp source file, containing the definition (implementation) of Utils::GetTempPath. The other is from the Main.cpp source file, calling the Utils::GetTempPath function (which is expected to be implemented in the former translation unit).

In the Utils.cpp’s translation unit, the <Windows.h> header is included. This header brings with it the preprocessor-based TCHAR model, discussed in the previous blog post. So, a preprocessor macro named “GetTempPath” is defined, and it is expanded to “GetTempPathW” in Unicode builds.
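
In fact, the Windows SDK headers contain something morally equivalent to this:

// Simplified from the Windows SDK headers (winbase.h):
#ifdef UNICODE
#define GetTempPath  GetTempPathW
#else
#define GetTempPath  GetTempPathA
#endif // !UNICODE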

Think of it as an automatic search-and-replace process: before the actual compilation of C++ code begins, the preprocessor examines the source code, and automatically replaces all instances of “GetTempPath” with “GetTempPathW”. The Utils::GetTempPath function name is found and replaced as well, just like the other occurrences of “GetTempPath”. So, to the C++ compiler and linker, the actual function name for this translation unit is Utils::GetTempPathW (not Utils::GetTempPath, as written in source code!).

Now, what’s happening at the Main.cpp’s translation unit? Since here <Windows.h> was not included (directly or indirectly), the TCHAR preprocessor model didn’t kick in. So, this translation unit is genuinely expecting a Utils::GetTempPath function, just as specified in the Utils.h header. But since the Utils.cpp’s translation unit produced a Utils::GetTempPathW function (because of the TCHAR model’s preprocessor #define), the linker can’t find any definition (implementation) of Utils::GetTempPath, hence the aforementioned apparently mysterious linker error.

TCHAR Preprocessor Bug

This can be a subtle, time-wasting bug to spot, especially in non-trivial code bases, and especially when you don’t know about the TCHAR preprocessor model.

You should pay attention to functions and methods that have the same names as Win32 APIs, as they can be subject to this subtle TCHAR preprocessor transformation.

To fix that, one option is to #undef the preprocessor redefinition of the identifier in Utils.h:

//
// Remove TCHAR preprocessor redefinition 
// of GetTempPath
//
#ifdef GetTempPath
#undef GetTempPath
#endif

A simple repro solution can be downloaded here from GitHub.