What’s Wrong with My UTF-8 Strings in Visual Studio?

Probably nothing: maybe you just have to tell Visual Studio that it’s a UTF-8-encoded string!

The std::string class can be used to store UTF-8-encoded Unicode text.

For example:

std::string s{"LATIN SMALL LETTER E WITH GRAVE (U+00E8): \xC3\xA8"};

However, in the Locals window, instead of the expected è (Latin small letter e with grave, U+00E8), Visual Studio displays some apparently garbage characters: a Latin capital letter A with tilde, followed by a diaeresis.

UTF-8 String Misinterpreted in the Locals Window

Why is that? Is there a bug in our C++ code? Did we pick the wrong UTF-8 bytes for ‘è’?

No. The UTF-8 encoding for è (U+00E8) is exactly the 2-byte sequence 0xC3 0xA8, so the above C++ code is correct.
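As a sanity check on that byte sequence, here is a minimal sketch of a UTF-8 encoder for a single code point. The helper name EncodeUtf8 is hypothetical and is not part of the original example; it simply follows the standard UTF-8 bit layout.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (an assumption for illustration): encodes one Unicode
// code point as UTF-8 using the standard 1- to 4-byte bit layout.
std::string EncodeUtf8(char32_t cp)
{
    // Surrogates and values above U+10FFFF are not valid Unicode scalar values.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+00E8 the two-byte branch applies: the upper five bits give 0xC0 | 0x03 = 0xC3, and the lower six bits give 0x80 | 0x28 = 0xA8.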

The problem is that Visual Studio doesn’t use the UTF-8 encoding to display that string in the Locals window. VS is probably using the Windows-1252 code page (a character encoding commonly mislabeled as “ANSI” on Windows). In this encoding each byte is a separate character: the first byte 0xC3 is mapped to Ã (U+00C3: Latin capital letter A with tilde), and the second byte 0xA8 is mapped to ¨ (U+00A8: diaeresis).
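The misinterpretation can be reproduced outside the debugger. The sketch below decodes bytes one at a time as Windows-1252 would, assuming we only care about the ranges where the byte value equals the Unicode code point. The function name DecodeWindows1252Subset is hypothetical, and this is not a claim about how Visual Studio implements its display.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (an assumption for illustration): decodes bytes as
// Windows-1252, handling only the ranges where the mapping is the identity.
std::u32string DecodeWindows1252Subset(const std::string& bytes)
{
    std::u32string out;
    for (unsigned char b : bytes) {
        if (b < 0x80 || b >= 0xA0) {
            // In these ranges the Windows-1252 byte value equals the
            // Unicode code point (the same as ISO-8859-1).
            out += static_cast<char32_t>(b);
        } else {
            // 0x80..0x9F would need a lookup table (e.g. 0x80 is the euro
            // sign); this sketch emits U+FFFD REPLACEMENT CHARACTER instead.
            out += U'\uFFFD';
        }
    }
    // A single-byte encoding yields exactly one code point per input byte.
    assert(out.size() == bytes.size());
    return out;
}
```

Feeding it the two UTF-8 bytes of è ("\xC3\xA8") gives two code points, U+00C3 and U+00A8, which matches the pair of characters seen in the Locals window.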

To display the string content using the correct UTF-8 encoding, you can use the explicit “s8” format specifier. For example, type the following in the Command Window:

? &s[0],s8

This time the correct string is displayed, because the bytes in the std::string variable are interpreted as a UTF-8 sequence.

The s8 Format Specifier in the Command Window

The s8 format specifier can be used in the Watch window as well.

The s8 Format Specifier in the Watch Window