A Few Options for Crossing Module Boundaries

It’s common to build complex software systems mixing components written in different languages.

For example, you may have a GUI written in C#, and some high-performance component written in C++, and you need to exchange data between these.

In such cases, there are several options. For example:

  1. COM: You can embed the C++ high-performance code in some COM component, exposing COM interfaces. The C# GUI subsystem talks to this high-performance component using COM interop.
  2. C-interface DLL: You can build a C-interface native DLL, “flattening” the C++ component interface using C functions. You can use PInvoke declarations on the C# side to communicate with the C++ component.
  3. C++/CLI: You can build a bridging layer between C++ and C# using C++/CLI.

Each one of these options has pros and cons.

For example, the C++/CLI approach is much easier than COM. However, C++/CLI is restricted to clients written in C# (and other .NET languages), whereas COM components can be consumed by a broader audience.

The C-interface DLL option is also widely usable, as C is a great language for module boundaries, and many languages are able to “talk” to C interfaces. However, in this case you are flattening an object-oriented API into a C-style function-based interface (by contrast, both COM and C++/CLI maintain a more object-oriented nature).

Moreover, both COM and C++/CLI are Windows-specific technologies; on the other hand, a C interface resonates better with cross-platform code.
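To give a concrete flavor of the “flattening” mentioned above, here’s a minimal, hypothetical sketch of a tiny C++ class exposed through a C-interface DLL (the Counter class and the exported function names are made up just for illustration):

// C++ implementation, hidden inside the native DLL
class Counter {
public:
    void Increment() { ++m_value; }
    int Value() const { return m_value; }
private:
    int m_value = 0;
};

// Flat C interface exported from the DLL;
// an opaque handle (void*) stands in for the C++ object
extern "C" {

__declspec(dllexport) void* __stdcall Counter_Create() {
    return new Counter();
}

__declspec(dllexport) void __stdcall Counter_Increment(void* handle) {
    static_cast<Counter*>(handle)->Increment();
}

__declspec(dllexport) int __stdcall Counter_GetValue(void* handle) {
    return static_cast<Counter*>(handle)->Value();
}

__declspec(dllexport) void __stdcall Counter_Destroy(void* handle) {
    delete static_cast<Counter*>(handle);
}

} // extern "C"

The C# side would then declare matching PInvoke signatures for these exported functions (using IntPtr for the opaque handle), similar in spirit to the PInvoke declarations shown in the next section.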

 

A Subtle Bug with PInvoke and Safe Arrays Storing Variant Bytes

When exchanging array data across module boundaries using safe arrays, I tend to prefer (and suggest) safe arrays of direct types, like BYTEs or BSTR strings, instead of safe arrays storing variants (which in turn contain BYTEs, or BSTRs, etc.).

However, there are some scripting clients that only understand safe arrays storing variants. So, if you want to support such clients, you have to pack the original array data items into variants, and build a safe array of variants.
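On the native side, building such a safe array of variants might look like the following minimal sketch (assuming ATL’s CComSafeArray and CComVariant; the function name and the sample data are just for illustration, matching the BuildVariantByteArray PInvoke shown below):

#include <atlbase.h>
#include <atlsafe.h>

extern "C" HRESULT __stdcall BuildVariantByteArray(/* [out] */ SAFEARRAY** ppsa)
{
    if (ppsa == nullptr) {
        return E_POINTER;
    }
    *ppsa = nullptr;

    try {
        const BYTE sourceData[] = { 0x11, 0x22, 0x33 }; // sample data
        const LONG count = _countof(sourceData);

        // Create a safe array of VT_VARIANT elements
        CComSafeArray<VARIANT> sa(count);

        // Pack each BYTE into a VARIANT (VT_UI1) and store it in the safe array
        for (LONG i = 0; i < count; ++i) {
            CComVariant v(sourceData[i]);
            HRESULT hr = sa.SetAt(i, v);
            if (FAILED(hr)) {
                return hr;
            }
        }

        // Transfer ownership of the safe array to the caller
        *ppsa = sa.Detach();
        return S_OK;
    } catch (const CAtlException& e) {
        return e; // CAtlException converts to its stored HRESULT
    }
}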

If you have a COM interface method or a C-interface function that produces a safe array of variants that contain BSTR strings, and you want to consume this array in C# code, the following PInvoke declaration seems to work fine:

[DllImport("NativeDll.dll", PreserveSig = false)]
public static extern void BuildVariantStringArray(
  [Out, MarshalAs(UnmanagedType.SafeArray, SafeArraySubType = VarEnum.VT_VARIANT)]
  out string[] result);

So, if you have a safe array of variants that contain BYTEs, you may deduce that such a PInvoke declaration would work fine as well:

[DllImport("NativeDll.dll", PreserveSig = false)]
public static extern void BuildVariantByteArray(
  [Out, MarshalAs(UnmanagedType.SafeArray, SafeArraySubType = VarEnum.VT_VARIANT)]
  out byte[] result);

I’ve just changed “string[]” to “byte[]” in the declaration of the “result” out parameter.

Unfortunately, this doesn’t work. What you get as a result in the output byte array is garbage.

The fix, in this case of a safe array of variant bytes, is to use an object[] array in C#, which directly maps to the original safe array of variants (as variants are marshaled to objects in C#):

[DllImport("NativeDll.dll", PreserveSig = false)]
public static extern void BuildVariantByteArray(
  [Out, MarshalAs(UnmanagedType.SafeArray, SafeArraySubType = VarEnum.VT_VARIANT)]
  out object[] result);

Then, manually convert the returned object[] array to a byte[] array, for example using Array.CopyTo:

// Get a safe array of variants (that contain bytes).
object[] data;
BuildVariantByteArray(out data);

// "Render" (copy) the previous object array 
// to a new byte array.
byte[] byteData = new byte[data.Length];
data.CopyTo(byteData, 0);

// Use byteData...

A variant is marshaled using object in C#. So a safe array of variants is marshaled using an object array in C#. In the case of safe arrays of variant bytes, the returned bytes are boxed in objects. Using Array.CopyTo, these bytes get unboxed and stuffed into a byte array.

The additional CopyTo step doesn’t seem necessary in the case of safe arrays of string variants, probably because strings are already objects in C#.

Still, I think this aspect of the .NET/C# marshaler should be fixed: if a PInvoke declaration clearly states byte[] on the C# side, the marshaler should automatically unbox the bytes from the safe array of variants.

 

MSDN Magazine Article: Simplify Safe Array Programming in C++

The March 2017 issue of MSDN Magazine contains a feature article of mine on simplifying safe array programming in C++ with the help of ATL’s CComSafeArray class template.

There is also an accompanying web-only sidebar introducing the SAFEARRAY C data structure and some of the basic operations available for it via Win32 API calls, although for C++ code I encourage the use of a convenient higher-level C++ object-oriented wrapper like ATL::CComSafeArray.
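Just to give an idea of what the raw C-level API looks like (the kind of code that CComSafeArray wraps for you), here’s a minimal sketch that creates and fills a safe array of BYTEs; the helper function name and the sample data are made up for illustration:

#include <windows.h>
#include <oleauto.h>

SAFEARRAY* CreateByteSafeArray()
{
    BYTE data[] = { 10, 20, 30 }; // sample data
    const ULONG count = ARRAYSIZE(data);

    // Create a one-dimensional safe array of BYTEs (VT_UI1)
    SAFEARRAY* psa = ::SafeArrayCreateVector(VT_UI1, 0, count);
    if (psa == nullptr) {
        return nullptr;
    }

    // Copy the source bytes into the safe array, one element at a time
    for (LONG i = 0; i < static_cast<LONG>(count); ++i) {
        HRESULT hr = ::SafeArrayPutElement(psa, &i, &data[i]);
        if (FAILED(hr)) {
            ::SafeArrayDestroy(psa);
            return nullptr;
        }
    }

    return psa; // the caller owns the returned safe array
}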

Safe arrays are useful, for example, when you have a COM component and you want to exchange array data between the component and its clients (which can potentially be written in languages other than C++, e.g. C#, or scripting languages).

I wish I could have had such a resource available when I did some safe array programming in C++.

Some of the insights and experience I developed in that regard are distilled in the aforementioned article.

I hope it may be helpful to someone.

Check it out here!

 

Updates to the ATL/STL Unicode Encoding Conversion Code

I’ve updated my code on GitHub for converting between UTF-8, using STL std::string, and UTF-16, using ATL CStringW.

Now, on errors, the code throws instances of a custom exception class derived from std::runtime_error, which can carry more information than a simple CAtlException.

Moreover, I’ve added a couple of overloads for converting from source string views (specified using an STL-style [start, finish) pointer range). This makes it possible to efficiently convert only portions of longer strings, without creating ad hoc CString or std::string instances to store those partial views.
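As a rough idea of what such a range-based overload might look like (the function name here is illustrative; the actual API in the repository may differ), consider this sketch built on MultiByteToWideChar:

#include <windows.h>
#include <atlstr.h>
#include <stdexcept>

CStringW Utf16FromUtf8(const char* first, const char* last)
{
    if (first == last) {
        return CStringW(); // empty input --> empty output
    }

    const int utf8Length = static_cast<int>(last - first);

    // First call: get the required UTF-16 length (in wchar_ts)
    const int utf16Length = ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS, first, utf8Length, nullptr, 0);
    if (utf16Length == 0) {
        throw std::runtime_error("MultiByteToWideChar failed (invalid UTF-8?)");
    }

    CStringW utf16;
    wchar_t* buffer = utf16.GetBuffer(utf16Length);

    // Second call: do the actual conversion into the CStringW buffer
    ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS, first, utf8Length, buffer, utf16Length);

    utf16.ReleaseBufferSetLength(utf16Length);
    return utf16;
}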

 

Custom C++ String Pool Allocator on GitHub

I’ve uploaded to GitHub some C++ code of mine implementing a custom string pool allocator.

The basic idea is to allocate big chunks of memory, and then serve individual string allocations by carving memory from those chunks with a simple, fast pointer increment.
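In code, the core idea might look something like this minimal sketch (a simplification for illustration, not the actual code in the repository):

#include <cstddef>
#include <cwchar>
#include <memory>
#include <vector>

class StringPool {
public:
    // Copies [source, source + length) into the pool and returns the
    // pool-owned, NUL-terminated copy.
    const wchar_t* AllocString(const wchar_t* source, size_t length) {
        const size_t required = length + 1; // +1 for the terminating NUL
        if (m_used + required > m_chunkCapacity) {
            // Current chunk exhausted (or first call): allocate a new big chunk
            size_t newCapacity = kChunkSize;
            if (required > newCapacity) {
                newCapacity = required; // oversized string gets its own chunk
            }
            m_chunks.push_back(std::make_unique<wchar_t[]>(newCapacity));
            m_chunkCapacity = newCapacity;
            m_used = 0;
        }
        wchar_t* dest = m_chunks.back().get() + m_used;
        std::wmemcpy(dest, source, length);
        dest[length] = L'\0';
        m_used += required; // simple, fast pointer bump
        return dest;
    }

private:
    static const size_t kChunkSize = 64 * 1024; // chunk size, in wchar_ts
    std::vector<std::unique_ptr<wchar_t[]>> m_chunks;
    size_t m_chunkCapacity = 0;
    size_t m_used = 0;
};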

There’s also a benchmark comparing this custom allocator against the default allocations of STL strings.

Custom string pool allocator benchmark results.

The results clearly show that both allocating strings that way and sorting them are faster than using the default std::wstring class.

 

The New C++11 u16string Doesn’t Play Well with Win32 APIs

Someone asked why in this article I used std::wstring instead of the new C++11 std::u16string for Unicode UTF-16 text.

The key point is that Win32 Unicode UTF-16 APIs use wchar_t as their code unit type; wstring is based on wchar_t, so it works fine with those APIs.

On the other hand, u16string is based on the char16_t type, which is a new built-in type introduced in C++11, and is different from wchar_t.

So, if you have a u16string variable and you try to use it with a Win32 Unicode API, e.g.:

// std::u16string s;
SetWindowText(hWnd, s.c_str());

Visual Studio 2015 complains (emphasis mine):

error C2664: ‘BOOL SetWindowTextW(HWND,LPCWSTR)’: cannot convert argument 2 from ‘const char16_t *’ to ‘LPCWSTR’

note: Types pointed to are unrelated; conversion requires reinterpret_cast, C-style cast or function-style cast

wchar_t is non-portable in the sense that its size isn’t specified by the standard; but, all in all, if you are invoking Win32 APIs you are already in an area of code that is non-portable (as Win32 APIs are Windows platform specific), so adding wstring (or even CString!) to that mix doesn’t change anything with respect to portability (or lack thereof).
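By contrast, std::wstring works directly with the same API, since its code unit type is wchar_t, which is exactly what the Win32 Unicode APIs expect (a minimal sketch, assuming a Unicode build):

#include <string>
#include <windows.h>

void SetTitle(HWND hWnd)
{
    std::wstring s = L"Connie";

    // Compiles fine: c_str() returns const wchar_t*, which matches LPCWSTR
    // (in Unicode builds SetWindowText expands to SetWindowTextW).
    ::SetWindowText(hWnd, s.c_str());
}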

 

Unicode Encoding Conversions Between ATL and STL Strings

At the Windows platform-specific level, when using C++ frameworks like ATL (or MFC), it makes sense to represent strings using CString (CStringW in Unicode builds, which should be the default in modern Windows C++ code).

On the other hand, in cross-platform C++ code, using std::string makes sense as well.

The same Unicode text can be encoded in UTF-16 when stored in CString(W), and in UTF-8 for std::string.

So, there’s a need to convert between those two Unicode representations. I discussed the details of such conversions in my MSDN Magazine article “Unicode Encoding Conversions with STL Strings and Win32 APIs”. However, the C++ code associated with that article used std::wstring for UTF-16 strings.

I’ve created a new repository on GitHub, where I uploaded reusable code (in the form of a convenient header-only module) for converting between UTF-16 and UTF-8, using CStringW for UTF-16, and std::string for UTF-8. Please feel free to check it out!
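Just to sketch the kind of conversion involved (with an illustrative function name, not necessarily the repository’s actual API), the UTF-16-to-UTF-8 direction can be built on WideCharToMultiByte along these lines:

#include <windows.h>
#include <atlstr.h>
#include <stdexcept>
#include <string>

std::string Utf8FromUtf16(const CStringW& utf16)
{
    if (utf16.IsEmpty()) {
        return std::string();
    }

    // First call: get the required UTF-8 length, in bytes
    const int utf8Length = ::WideCharToMultiByte(
        CP_UTF8, WC_ERR_INVALID_CHARS,
        utf16.GetString(), utf16.GetLength(),
        nullptr, 0, nullptr, nullptr);
    if (utf8Length == 0) {
        throw std::runtime_error("WideCharToMultiByte failed");
    }

    std::string utf8(utf8Length, '\0');

    // Second call: do the actual conversion into the std::string buffer
    ::WideCharToMultiByte(
        CP_UTF8, WC_ERR_INVALID_CHARS,
        utf16.GetString(), utf16.GetLength(),
        &utf8[0], utf8Length, nullptr, nullptr);

    return utf8;
}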

 

Passing std::vector’s Underlying Array to C APIs

Often, there’s a need to pass some data stored as an array from C++ to C-interface APIs. The “default” first-choice STL container for storing arrays in C++ is std::vector. So, how to pass the array content managed by std::vector to a C-interface API?

The Wrong Way

I saw this kind of C++ code:

// v is a std::vector<BYTE>.
// Pass it to a C-interface API: pointer + size in bytes
DoSomethingC( 
  /* Some cast, e.g.: (BYTE*) */ &v, 
  sizeof(v) 
);

That’s wrong, in two ways: both the pointer and the size are wrong. Let’s talk about the size first: sizeof(v) is the size, in bytes, of an instance of std::vector itself, which is in general different from the size in bytes of the array data managed by the vector. For example, suppose that a std::vector is implemented using three pointers, e.g. to the beginning of the data, to the end of the data, and to the end of the reserved capacity; in this case, sizeof(v) would be 3 * sizeof(pointer), i.e. 3 * 8 = 24 bytes on 64-bit architectures (3 * 4 = 12 bytes on 32-bit), regardless of how many elements the vector actually stores.

But what the author of that piece of code actually wanted was the size in bytes of the array data managed (pointed to) by the std::vector, which you can get by multiplying the vector’s element count, returned by v.size(), by the size in bytes of a single vector element. For a vector<BYTE>, the value returned by v.size() is just fine as is (in fact, sizeof(BYTE) is one).

Now let’s discuss the address (pointer) problem. “&v” points to the beginning of the std::vector object’s internal representation (i.e. the internal “guts” of std::vector), which is implementation-defined and not at all what that piece of code needs. Mistaking the std::vector’s internal implementation details for the array data managed by the vector is dangerous: in case of write access, the called function will end up stomping over the vector’s internal state with unrelated bytes. So, on return, the vector object will be in a corrupted and unusable state, and the memory previously owned by the vector will be leaked.

In case of read access, the vector’s internal state will be read instead of the vector’s actual array content.

The presence of a cast is also a signal that something may be wrong in the user’s code, and maybe the C++ compiler was actually helping with a warning or an error message, but it was silenced instead.

So, how to fix that? Well, the pointer to the array data managed by std::vector can be retrieved by calling the vector::data() method. This method is offered in both a const version, for read-only access to the vector’s content, and a non-const version, for read-write access.

The Correct Way

So, for the case discussed above, the correct code to pass the std::vector’s underlying array data to a C-interface API expecting a pointer and a size is:

DoSomethingC(v.data(), v.size());

Or, if you have e.g. a std::vector<double> and the size parameter is expressed in bytes (instead of element count):

DoSomethingC(v.data(), v.size() * sizeof(double));

An alternative syntax to calling vector::data() is “&v[0]”, although the intent seems clearer to me with vector::data(). Moreover, vector::data() also works for empty vectors, returning nullptr in that case. Instead, “&v[0]” triggers a “vector subscript out of range” debug assertion failure in MSVC when used on an empty vector (in fact, for an empty vector it doesn’t make sense to access the first item at index zero, as there is no first item).

&v[0] on an empty vector: debug assertion failure

What’s Wrong with My UTF-8 Strings in Visual Studio?

Probably nothing: maybe you just have to tell Visual Studio it’s a UTF-8-encoded string!

The std::string class can be used to store UTF-8-encoded Unicode text.

For example:

std::string s{"LATIN SMALL LETTER E WITH GRAVE (U+00E8): \xC3\xA8"};

However, in the Locals window, instead of the expected è (Latin small letter e with grave, U+00E8), Visual Studio displays some (apparently) garbage characters, like a Latin capital letter A with tilde, followed by diaeresis.

UTF-8 String Misinterpreted in the Locals Window

Why is that? Is there a bug in our C++ code? Did we pick the wrong UTF-8 bytes for ‘è’?

No. The UTF-8 encoding for è (U+00E8) is exactly the 2-byte sequence 0xC3 0xA8, so the above C++ code is correct.

The problem is that Visual Studio doesn’t use the UTF-8 encoding to display that string in the Locals window. It turns out that VS is probably using the Windows-1252 code page (a character encoding commonly mislabeled as “ANSI” on Windows…). And, in this character encoding, the first byte 0xC3 is mapped to Ã (U+00C3: Latin capital letter A with tilde), and the second byte 0xA8 is mapped to ¨ (U+00A8: diaeresis).

To display the string content using the correct UTF-8 encoding, you can use the explicit “s8” format specifier. For example, typing in the Command Window:

? &s[0],s8

the correct string is displayed, as this time the bytes in the std::string variable are interpreted as a UTF-8 sequence.

 

The s8 Format Specifier in the Command Window

 

Similarly, the s8 format specifier can be used in the Watch window as well.

The s8 Format Specifier in the Watch Window

 

How Small Is Small Enough for the SSO?

Yesterday I wrote about the SSO.

A good question would be: “What is the maximum length threshold that triggers the SSO?” In other words: how long is too long for a string to be eligible for the SSO?

Well, this length limit is not specified by the standard. However, for the sake of curiosity, in VS2015’s implementation, the limit is 15 for “narrow” (i.e. char-based) strings, and 7 for “wide” (i.e. wchar_t-based) strings.

This means that “meow”, “Connie” and “Commodore” are valid candidates for the SSO, as their lengths are less than 16 chars. On the other hand, considering the corresponding wide strings, L“meow” and L“Connie” are still SSO’ed, as their lengths are less than 8 wchar_ts; however, L“Commodore” breaks the 7 wide character limit, so it’s not SSO’ed.
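As an informal experiment (not from the original post), you can observe the SSO by checking whether a string’s character buffer lives inside the std::string object itself; with VS2015’s limits, something like this sketch should print 1 for the small string and 0 for the long one:

#include <cstdint>
#include <iostream>
#include <string>

// Returns true if the string's buffer is located inside the string object
// itself (i.e. no separate heap allocation was done).
bool IsSsoActive(const std::string& s)
{
    const auto dataAddress = reinterpret_cast<std::uintptr_t>(s.data());
    const auto objectBegin = reinterpret_cast<std::uintptr_t>(&s);
    const auto objectEnd = objectBegin + sizeof(s);
    return dataAddress >= objectBegin && dataAddress < objectEnd;
}

int main()
{
    std::cout << IsSsoActive(std::string{"Commodore"}) << '\n';   // 9 chars: SSO expected
    std::cout << IsSsoActive(std::string{"A definitely longer string, well beyond the SSO limit"}) << '\n';
}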

Thanks to Mr. STL for his clarification on the SSO length limit in Visual Studio’s implementation.

 

Giovanni Dicanio's C++ Corner on the Internet