Simplifying Windows Registry Programming with the C++ WinReg Library

The native Windows Registry API is a C-interface API, that is low-level and kind of hard and cumbersome to use.

For example, suppose that you simply want to read a string value under a given key. You would end up writing code like this:

Sample code excerpt to read a string value from the Windows Registry using the native Windows API.
Sample code excerpt to read a string value from the Windows Registry using the native Windows API.

And this is just the part to query the destination string length. Then, you need to allocate a string object with proper size (and pay attention to proper size-in-bytes-to-size-in-wchar_ts conversion!), and after that you can finally read the actual string value into the local string object.

That’s definitely a lot of bug-prone C++ code, and this is just to query a string value!

Moreover, in modern C++ code you should prefer using nice higher-level resource manager classes with automatic resource cleanup, instead of raw HKEY handles.

Fortunately, it’s possible to hide that kind of complex and bug-prone code in a nice C++ library, that offers a much more programmer-friendly interface. This is basically what my C++ WinReg library does.

You can query a string value with just one simple line of C++ code using WinReg.
You can query a string value with just one simple line of C++ code using WinReg.

WinReg is an open-source C++ library, available on GitHub. For the sake of convenience, I packaged and distribute it as a header-only library, which is also available via the vcpkg package manager.

If you need to access the Windows Registry from your C++ code, you may want to give C++ WinReg a try.

 

Printing UTF-8 Text to the Windows Console: Sample Code

In a previous blog post, you saw that to print some UTF-8-encoded text to the Windows console, you first have to convert it to UTF-16.

In fact, calling _setmode to change stdout to _O_U8TEXT, and then trying to print UTF-8-encoded text with cout, resulted in a debug assertion failure in the VC++ runtime library. (Please take a look at the aforementioned blog post for more details.)

That blog post lacked some compilable demo code showing the solution, though. So, here you are:

// Test printing UTF-8-encoded text to the Windows console

#include "UnicodeConv.hpp"  // My Unicode conversion helpers

#include <fcntl.h>
#include <io.h>
#include <stdint.h>
#include <iostream>
#include <string>

int main()
{
    // Change stdout to Unicode UTF-16.
    //
    // Note: _O_U8TEXT doesn't seem to work, e.g.:
    // https://blogs.msmvps.com/gdicanio/2017/08/22/printing-utf-8-text-to-the-windows-console/
    //
    _setmode(_fileno(stdout), _O_U16TEXT);


    // Japanese name for Japan, encoded in UTF-8
    uint8_t utf8[] = {
        0xE6, 0x97, 0xA5, // U+65E5
        0xE6, 0x9C, 0xAC, // U+672C
        0x00
    };
    std::string japan(reinterpret_cast<const char*>(utf8));

    // Print UTF-16-encoded text
    std::wcout << Utf16FromUtf8("Japan") << L"\n\n";
    std::wcout << Utf16FromUtf8(japan)   << L'\n';
}

This is the output:

Unicode UTF-8 text converted to UTF-16 and printed out to the Windows console

All right.

Note also that I set the Windows console font to MS Gothic to be able to correctly render the Japanese kanjis.

The compilable C++ source code, including the implementation of the Utf16FromUtf8 Unicode conversion helper function, can be found here on GitHub.

 

C++ WinReg Update

The Windows Registry native C API is very low-level and hard to use.

A while back I developed some convenient higher-level C++ wrappers, that should make Windows Registry programming easier and more fun. You can find them on GitHub.

Just to give you a taste of this C++ WinReg library, you can simply open a registry key with code like this:

RegKey key{ HKEY_CURRENT_USER, L"SOFTWARE\\MyKey" };

And you can read registry values like that:

DWORD dw = key.GetDwordValue(L"MagicCode");
wstring s = key.GetStringValue(L"ReadMe");

Enumerating values under a given key is as easy as:

auto values = key.EnumValues();

Error management is performed using C++ exceptions (instead of “raw” error codes like in the native C API).

You can find more info in the README file.

I haven’t touched that project in a few months (being very busy), but yesterday I accepted an external contribution and merged a pull request adding a DeleteTree feature.

I take this occasion to thank everyone who showed interest and appreciation for this project.

Happy coding!

 

ATL::CStringW vs. std::wstring Performance: String Sorting and Comparisons

According to previous measurements, std::wstring performs better than ATL::CStringW:

  1. in all string concatenation tests
  2. when sorting string vectors that are made by small strings, thanks to std::basic_string’s SSO

So, I focused my attention on the string vector sorting scenario, and it seems to me that (at least in the VS2015 implementation), the slowdown of (non-SSO) wstrings is caused by wmemcmp calls, that are used to compare wstrings when sorting the string vectors. On the other hand, CStringW invokes wcscmp, that seems to run faster.

In fact, invoking std::sort with a custom comparator function that calls wcscmp to compare the C-style pointers returned by wstring::c_str, results in faster vector<wstring> sorting times. So, in this case wstring sorting performs better than CStringW even for non-SSO strings.

However, as Stephan T. Lavavej pointed out, std::basic_string supports embedded nulls, so wstring cannot use wcscmp (that works only for C-style null-terminated strings, without embedded nulls).

 

String Performance Tests: ATL vs. STL

I wrote some C++ code to test the performance of the ATL CStringW class vs. the C++ Standard Library’s std::wstring.

There are several aspects that can be considered when comparing string class performance: In the aforementioned code, I tested string vector sorting and string concatenation.

The code is available in this repository on GitHub.

You can take a look at the README for further details (including my test results).

 

Subtle Bug When Converting Strings to Lowercase

Suppose that you want to convert a std::string object to lowercase.

The first thing you would do is probably searching the std::string documentation for a convenient easy simple method named to_lower, or something like that. Unfortunately, there’s nothing like that.

So, you might start developing your own “to_lower” function. A typical implementation I’ve seen of such custom function goes something like this: For each character in the input string, convert it to lowercase invoking std::tolower. In fact, there’s even this sample code on cppreference.com:

// From http://en.cppreference.com/w/cpp/string/byte/tolower

std::string str_tolower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), 
                   [](unsigned char c) { return std::tolower(c); }
                  );
    return s;
}

Well, if you try this code with something like str_tolower(“Connie”), everything seems to work fine, and you get “connie” as expected.

Now, since C++ folks like storing UTF-8-encoded text in std::string objects, in some large code base someone happily takes the aforementioned str_tolower function, and invokes it with their lovely UTF-8 strings. Fun ensured! …Well, actually, bugs ensured.

So, the problem is that str_tolower, under the hood, calls std::tolower on each char in the input string. While this works fine for pure ASCII strings like “Connie”, such code is a bug farm for UTF-8 strings. In fact, UTF-8 is a variable-width character encoding. So, there are some Unicode “characters” (code points) that are encoded in UTF-8 using one byte, while other characters are encoded using two bytes, and so on, up to four bytes. The poor std::tolower has no clue of such UTF-8 encoding features, so it innocently spits out wrong results, char by char.

For example, I tried invoking the above function on “PERCHÉ” (the last character is the Unicode U+00C9 LATIN CAPITAL LETTER E WITH ACUTE, encoded in UTF-8 as the two-byte sequence 0xC3 0x89), and the result I got was “perchÉ” instead of the expected “perché” (é is Unicode U+00E9, LATIN SMALL LETTER E WITH ACUTE). So, the pure ASCII characters in the input string were all correctly converted to lowercase, but the final non-ASCII character wasn’t.

Actually, it’s not the std::tolower function: It’s that this function was misused, invoking it in a way that the function was not designed for.

This is one of the perils of taking std::string-based C++ code that initially worked with ASCII strings, and thoughtlessly reuse it for UTF-8-encoded text.

In fact, we saw a very similar bug in a previous blog post.

So, how can you fix that problem? Well, a portable way is using the ICU library with its icu::UnicodeString class and its toLower method.

On the other hand, if you are writing Windows-specific C++ code, you can use the LcMapStringEx API. Note that this function uses the UTF-16 encoding (as almost all Windows Unicode APIs do). So, if you have UTF-8-encoded text stored in std::string objects, you first have to convert it from UTF-8 to UTF-16, then invoke the aforementioned API, and finally convert the UTF-16-encoded result back to UTF-8. For these UTF-8/UTF-16 conversions, you may find my MSDN Magazine article on “Unicode Encoding Conversions with STL Strings and Win32 APIs” interesting.

 

Maps with Case Insensitive String Keys

How to implement a map with case insensitive string keys? If you use the standard std::map associative container with std::string or std::wstring as key types, you get a case sensitive comparison by default.

If you take a look at std::map documentation, you’ll see that in addition to the key type and value type, there’s also a third template parameter that you can plug into std::map: it’s a comparison function object to sort the keys. The default option for this comparison function is std::less<Key>.

So, if you provide a custom comparison object that ignores the key string case, you can have a map with case insensitive keys:

map<string, ValueType, StringIgnoreCaseLess> myMap;

Now, the question is: What would such comparison object look like?

A Failed Approach

An approach I saw sometimes (e.g. on StackOverflow) is to use std::lexicographical_compare, comparing the strings char-by-char, after invoking tolower on each char. Basically, the idea is to compare the lowercase versions of each corresponding characters in the input strings. The code from the aforementioned SO answer follows:

// Code from: https://stackoverflow.com/a/1801913/1629821

struct ci_less
{
  // case-independent (ci) compare_less binary function
  struct nocase_compare
  {
    bool operator() (const unsigned char& c1, const unsigned char& c2) const {
      return tolower (c1) < tolower (c2); 
    }
  };
  
  bool operator() (const std::string & s1, const std::string & s2) const {
    return std::lexicographical_compare 
        (s1.begin (), s1.end (),   // source range
         s2.begin (), s2.end (),   // dest range
         nocase_compare ());       // comparison
  }
};

The problem with this code is that it doesn’t work for international strings. In fact, while for a pure ASCII string like “Connie” you can use this technique to successfully compare “Connie” with “connie” or “CONNIE”, this won’t work for strings containing international characters.

For example, consider the Italian word “perché”. The last character in “perché” is U+00E9, i.e. the ‘LATIN SMALL LETTER E WITH ACUTE’, which is encoded in UTF-8 as the hex byte sequence 0xC3 0xA9. Its uppercase form is É U+00C9 (encoded in UTF-8 as 0xC3 0x89). So, let’s assume you use the UTF-8 encoding to store your international text in std::string objects. Well, invoking tolower char by char, as implemented in the aforementioned SO answer, will fail. In fact, tolower is unable to correctly process UTF-8 sequences (at least in the Microsoft VS2015 CRT implementation I used in my tests).

A Better Approach for International Text

So, how to fix that? Well, on Windows there’s a CompareStringEx API (available since Vista, according to the MSDN documentation) that seems to work, at least in my tests with some Italian text. You can call this API passing the NORM_IGNORECASE comparison flag.

As for most Windows APIs, the Unicode encoding used for text is UTF-16. So, let’s start writing a nice C++ wrapper around this CompareStringEx C-interface API. Let’s assume that we want to compare two UTF-16 wstrings, and that they are “ordinary” strings that don’t contain embedded NULs. A possible implementation for this C++ helper function follows:

// C++ wrapper around the Windows CompareStringEx C API
inline int CompareStringIgnoreCase(const std::wstring& s1, 
                                   const std::wstring& s2)
{
    // According to the MSDN documentation, the CompareStringEx function 
    // is optimized for NORM_IGNORECASE and string lengths specified as -1.

    return ::CompareStringEx(
        LOCALE_NAME_INVARIANT,
        NORM_IGNORECASE,
        s1.c_str(),
        -1,
        s2.c_str(),
        -1,
        nullptr,        // reserved
        nullptr,        // reserved
        0               // reserved
    );
}

Now, you can simply invoke this C++ helper function inside the comparison object that will be used with std::map:

// Comparison object for std::map, ignoring string case
struct StringIgnoreCaseLess
{
    bool operator()(const std::wstring& s1, const std::wstring& s2) const
    {
        // (s1 < s2) ignoring string case
        return CompareStringIgnoreCase(s1, s2) == CSTR_LESS_THAN;
    }
};

This basically implements the condition (s1 < s2) ignoring the string case.

And, finally, you can simply plug this comparison object into a std::map with UTF-16-encoded wstring keys:

map<wstring, ValueType, StringIgnoreCaseLess> myCaseInsensitiveStringMap;

Or, using the nice C++11 alias template feature:

template <typename ValueType>
using CaseInsensitiveStringMap = std::map<std::wstring, ValueType, 
                                          StringIgnoreCaseLess>;

// Simply use CaseInsensitiveStringMap<ValueType>

You can find some compilable C++ sample code on GitHub.

Note on UTF-8 String Keys

If you really want to use UTF-8-encoded std::string keys, then you have to add some code to the comparison object, to first convert the input strings from UTF-8 to UTF-16, and then invoke CompareStringEx for the UTF-16 text comparison.

If you are working on C++ code that is already Windows platform specific, I think that choosing the UTF-16 encoding to represent international text is more “natural” (and more efficient) than converting back and forth between UTF-8 and UTF-16.

 

Detecting Unicode Space Characters

Some programmers love UTF-8 as they believe they can reuse old “ANSI” APIs with UTF-8-encoded text. UTF-8 does have some advantages (like being endian-neutral), but pretending to blindly reuse old “ANSI” APIs for UTF-8 text is not one of them.

For example: There are various space characters defined in Unicode.

You can use the iswspace function to check if a Unicode UTF-16 wide character is a white-space, e.g.:

#include <ctype.h>      // for iswspace
#include <iostream>

int main()
{
    // Let’s do a test with the punctuation space (U+2008)
    const wchar_t wch = 0x2008;

    if (iswspace(wch)) {
        std::cout << "OK.\n";
    }
}

The corresponding old “ANSI” function is isspace: Can you use it with Unicode text encoded in UTF-8? I’m open to be proven wrong, but I think that’s not possible.

 

Printing UTF-8 Text to the Windows Console

Let’s suppose you have a string encoded in UTF-8, and you want to print it to the Windows console.

You might have heard of the _setmode function and the _O_U8TEXT flag.

The MSDN documentation contains this compilable code snippet that you can use as the starting point for your experimentation:

// crt_setmodeunicode.c  
// This program uses _setmode to change  
// stdout to Unicode. Cyrillic and Ideographic  
// characters will appear on the console (if  
// your console font supports those character sets).  
  
#include <fcntl.h>  
#include <io.h>  
#include <stdio.h>  
  
int main(void) {  
    _setmode(_fileno(stdout), _O_U16TEXT);  
    wprintf(L"\x043a\x043e\x0448\x043a\x0430 \x65e5\x672c\x56fd\n");  
    return 0;  
}  

So, to print UTF-8-encoded text, you may think of substituting the _O_U16TEXT flag with _O_U8TEXT, and use printf (or cout) with a byte sequence representing your UTF-8-encoded string.

For example, let’s consider the Japanese name for Japan, written using the kanji 日本.

The first kanji is the Unicode code point U+65E5; the second kanji is U+672C. Their UTF-8 encodings are the 3-byte sequences 0xE6 0x97 0xA5 and 0xE6 0x9C 0xAC, respectively.

So, let’s consider this compilable code snippet that tries to print a UTF-8-encoded string:

#include <fcntl.h>
#include <io.h>
#include <stdint.h>
#include <iostream>

int main()
{
    _setmode(_fileno(stdout), _O_U8TEXT);

    // Japanese name for Japan, 
    // encoded in UTF-8
    uint8_t utf8[] = { 
        0xE6, 0x97, 0xA5, // U+65E5
        0xE6, 0x9C, 0xAC, // U+672C
        0x00 
    };

    std::cout << reinterpret_cast<const char*>(utf8) << '\n';
}

This code compiles fine. However, if you run it, you’ll get this error message:

Error when trying to print UTF-8 text to the console
Error when trying to print UTF-8 text to the console

So, how to print some UTF-8 encoded text to the Windows command prompt?

Well, it seems that you have to first convert from UTF-8 to UTF-16, and then use wprintf or wcout to print the UTF-16-encoded text. This isn’t optimal, but at least it seems to work.

EDIT 2019-04-09: See this new blog post for a solution with compilable demo code.