Adventures with string_view: Optimizing Code with O(1) Operations

Last time we saw that you can invoke the CString::GetString method to get a C-style null-terminated const string pointer, then pass it to functions that take wstring_view input parameters:

// name is a CString instance;
// DoSomething takes a wstring_view input parameter
DoSomething(name.GetString());

While this code works fine, it’s possible to optimize it.

As the saying goes, first make things work, then make things fast.

The Big Picture

A typical implementation of wstring_view holds two data members: a pointer to the string characters, and a size (or length). Basically, the pointer indicates where the string starts, and the size/length specifies how many consecutive characters belong to the string view (note that string views are not necessarily null-terminated).
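
To make this concrete, here’s a minimal sketch of what such a class might look like (an illustration only, not the actual standard library implementation):

#include <cstddef> // for std::size_t

// Simplified sketch: a string view is just a pointer plus a size
class SimpleStringView
{
public:
    SimpleStringView(const wchar_t* data, std::size_t size)
        : m_data{ data }, m_size{ size }
    { }

    const wchar_t* data() const { return m_data; }
    std::size_t size() const { return m_size; }

private:
    const wchar_t* m_data;  // where the viewed string starts
    std::size_t m_size;     // how many consecutive characters belong to the view
};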

The above code invokes a wstring_view constructor overload that takes a null-terminated C-style string pointer. To get the size (or length) of the string view, the implementation code needs to traverse the input string’s characters one by one, until it finds the null terminator. This is a linear-time, or O(N), operation.
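
In terms of the SimpleStringView sketch above, that constructor overload would look something like this (again, just an illustration):

// Constructor overload taking a null-terminated C-style string:
// computing the size requires scanning the whole string - an O(N) operation
SimpleStringView(const wchar_t* data)
    : m_data{ data }, m_size{ std::wcslen(data) } // linear scan (needs <cwchar>)
{ }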

Fortunately, there’s another wstring_view constructor overload that takes two parameters: a pointer and a length. Since CString objects know their own length, you can invoke the CString::GetLength method to get the value for the length parameter.

// Create a wstring_view from CString (*)
DoSomething({ name.GetString(), name.GetLength() });

The great news is that CString objects keep track of their own string length, so CString::GetLength doesn’t have to traverse the string’s characters looking for the terminating null: the length value is already available when you invoke the method.

In other words, creating a string view by invoking CString::GetString and CString::GetLength replaces a linear-time O(N) operation with a constant-time O(1) operation, which is great.

Fixing and Refining the Code

When you try to compile the above code marked with (*), the C++ compiler actually complains with the following message:

Error C2398 Element '2': 
conversion from 'int' 
to 'const std::basic_string_view<wchar_t,std::char_traits<wchar_t>>::size_type' 
requires a narrowing conversion

The problem here is that CString::GetLength returns an int, which doesn’t match the size type expected by wstring_view. Well, it’s not a big deal: we can safely cast the int value returned by CString::GetLength to wstring_view::size_type, or just size_t:

DoSomething({ name.GetString(), 
    static_cast<size_t>(name.GetLength()) });

As a further refinement, we can wrap the above wstring_view-from-CString creation code in a nice helper function:

#include <atlstr.h>      // for CString
#include <string_view>   // for std::wstring_view

// Helper function that *efficiently* creates a wstring_view to a CString
inline std::wstring_view AsView(const CString& str)
{
    return { str.GetString(), 
             static_cast<size_t>(str.GetLength()) };
}
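
With this helper function, the call site becomes short and self-documenting:

// name is a CString instance;
// DoSomething takes a wstring_view input parameter
DoSomething(AsView(name));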

Measuring the Performance Gain

Using the above helper function, you can safely and efficiently create a string view to a CString object.

Now, you may ask, how much speed gain are we talking here?

Good question!

I developed a small, simple C++ benchmark, which shows that replacing the O(N) operation with the O(1) operation in this case brings the time down from 445 ms to 88 ms: an 80% performance gain! Wow.
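
The measuring pattern is the usual one: time a loop that creates many string views with each constructor overload. A minimal sketch for the O(N) case might look like this (DoSomething is the placeholder function used earlier; testStrings is a hypothetical vector of CString objects):

#include <atlstr.h>     // for CString
#include <chrono>       // for high_resolution_clock, milliseconds
#include <iostream>     // for cout
#include <string_view>  // for wstring_view
#include <vector>       // for vector

// Measures the total time spent creating views with the O(N) overload
void TimeNullTerminatedOverload(const std::vector<CString>& testStrings)
{
    using namespace std::chrono;

    const auto start = high_resolution_clock::now();
    for (const auto& s : testStrings)
    {
        // O(N): the constructor must scan for the null terminator
        DoSomething(std::wstring_view{ s.GetString() });
    }
    const auto finish = high_resolution_clock::now();

    std::cout << duration_cast<milliseconds>(finish - start).count() << " ms\n";
}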

Benchmark comparing two wstring_view constructor overloads

If you want to learn more about Big-O notation and other related topics, you can watch my Pluralsight course on Introduction to Data Structures and Algorithms in C++.

Big-O doesn’t have to be boring!

C++ String Benchmark: STL vs. ATL vs. Custom Pool Allocator

I was curious to compare the performance of the STL string implementation versus ATL CString, using Visual Studio 2019, so I wrote some simple C++ benchmark code for this purpose.

I also added into the mix a custom string pool allocator based on some code from The Old New Thing blog, which I modified and bug-fixed. This allocator maintains a singly-linked list of chunks; string memory is carved from the current chunk by simply increasing a pointer. When there isn’t enough memory left in the current chunk to serve an allocation, a new chunk is allocated, safely linked to the chunk list, and the requested string memory is carved from the new chunk. The linked list of chunks is traversed at destruction time to properly release the allocated memory blocks.
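
To give a rough idea of the approach, here’s a simplified sketch of such a chunk-based allocator (the names, chunk size, and details are illustrative, not the actual code from the repository):

#include <cstddef> // for std::size_t
#include <cstring> // for std::memcpy
#include <new>     // for ::operator new/delete

class StringPoolAllocator
{
public:
    StringPoolAllocator() = default;

    ~StringPoolAllocator()
    {
        // Traverse the linked list of chunks, releasing each memory block
        ChunkHeader* chunk = m_head;
        while (chunk != nullptr)
        {
            ChunkHeader* next = chunk->Next;
            ::operator delete(chunk);
            chunk = next;
        }
    }

    // Copies the source string (plus its null terminator) into pool memory,
    // carving space from the current chunk with a simple pointer increase
    const wchar_t* AllocString(const wchar_t* source, std::size_t length)
    {
        const std::size_t bytes = (length + 1) * sizeof(wchar_t);
        if (static_cast<std::size_t>(m_end - m_current) < bytes)
        {
            NewChunk(bytes); // current chunk exhausted: allocate a new one
        }

        wchar_t* dest = reinterpret_cast<wchar_t*>(m_current);
        m_current += bytes;
        std::memcpy(dest, source, length * sizeof(wchar_t));
        dest[length] = L'\0';
        return dest;
    }

private:
    struct ChunkHeader { ChunkHeader* Next; };

    void NewChunk(std::size_t minBytes)
    {
        const std::size_t payload = (minBytes > kChunkSize) ? minBytes : kChunkSize;
        const std::size_t total = sizeof(ChunkHeader) + payload;

        auto* chunk = static_cast<ChunkHeader*>(::operator new(total));
        chunk->Next = m_head; // safely link the new chunk to the chunk list
        m_head = chunk;

        m_current = reinterpret_cast<unsigned char*>(chunk) + sizeof(ChunkHeader);
        m_end = reinterpret_cast<unsigned char*>(chunk) + total;
    }

    static constexpr std::size_t kChunkSize = 64 * 1024; // e.g. 64 KB chunks

    ChunkHeader* m_head = nullptr;      // head of the singly-linked chunk list
    unsigned char* m_current = nullptr; // next free byte in the current chunk
    unsigned char* m_end = nullptr;     // one past the end of the current chunk
};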

I measured the times to create and fill string vectors, and the times to sort the same vectors.

TL;DR: The STL string performance is great! You can improve creation times with a custom pool allocator.

These times are measured for vectors storing each kind of string: STL’s wstring, ATL’s CString (i.e. CStringW in Unicode builds), and the strings created with the custom string pool allocator.

This is a sample run (executed on a Windows 10 64-bit Intel i7 PC):

String benchmark: STL vs. ATL vs. custom string pool allocator

As you can see, the best creation times are obtained with the custom string pool allocator. This was expected, as the allocation strategy of carving string memory from pre-allocated blocks is very efficient.

On the other hand, regarding the sorting times, STL and the custom pool strings perform very similarly.

ATL’s CString, which is based on CoW (Copy on Write), shows the worst times for both creation and sorting.

Benchmark Variation: Tiny Strings

It’s also possible to run the benchmark with short strings, triggering the SSO (compile the code with TEST_TINY_STRINGS #defined).

String benchmark with tiny strings: STL vs. ATL vs. custom string pool allocator

As you can see in this case, thanks to the SSO, STL strings win by an important margin in both creation and sorting times.

Note: Making the Sorting Comparison Uniform

I’d also like to clarify that, to make the sorting comparison more uniform, I used custom comparator functions with std::sort, invoking wcscmp for each kind of string. In fact, spelunking in the VS 2019 STL and ATL implementation code (Thank You, Step Into Specific menu), I found that std::wstring invokes wmemcmp (file <xstring>, line #235), while ATL::CString uses wcscmp (file <cstringt.h>, line #564). Probably std::wstring uses wmemcmp because wstring instances can contain embedded NULs.
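
For reference, the uniform comparators look something like this (a sketch; the vector names are hypothetical):

#include <algorithm>  // for std::sort
#include <atlstr.h>   // for CString
#include <cwchar>     // for wcscmp
#include <string>     // for std::wstring
#include <vector>     // for std::vector

// Sort both vectors with the same comparison routine (wcscmp),
// so the benchmark measures the string classes, not different comparators
void SortBothKinds(std::vector<std::wstring>& stlStrings,
                   std::vector<CString>& atlStrings)
{
    std::sort(stlStrings.begin(), stlStrings.end(),
              [](const std::wstring& a, const std::wstring& b)
              {
                  return wcscmp(a.c_str(), b.c_str()) < 0;
              });

    std::sort(atlStrings.begin(), atlStrings.end(),
              [](const CString& a, const CString& b)
              {
                  return wcscmp(a.GetString(), b.GetString()) < 0;
              });
}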


ATL::CStringW vs. std::wstring Performance: String Sorting and Comparisons

According to previous measurements, std::wstring performs better than ATL::CStringW:

  1. in all string concatenation tests
  2. when sorting string vectors made of small strings, thanks to std::basic_string’s SSO

So, I focused my attention on the string vector sorting scenario, and it seems to me that (at least in the VS2015 implementation) the slowdown of (non-SSO) wstrings is caused by the wmemcmp calls used to compare wstrings when sorting the string vectors. On the other hand, CStringW invokes wcscmp, which seems to run faster.

In fact, invoking std::sort with a custom comparator function that calls wcscmp to compare the C-style pointers returned by wstring::c_str results in faster vector<wstring> sorting times. So, in this case wstring sorting performs better than CStringW even for non-SSO strings.

However, as Stephan T. Lavavej pointed out, std::basic_string supports embedded nulls, so wstring cannot use wcscmp (which works only for C-style null-terminated strings, without embedded nulls).


String Performance Tests: ATL vs. STL

I wrote some C++ code to test the performance of the ATL CStringW class vs. the C++ Standard Library’s std::wstring.

There are several aspects that can be considered when comparing string class performance; in the aforementioned code, I tested string vector sorting and string concatenation.

The code is available in this repository on GitHub.

You can take a look at the README for further details (including my test results).


Custom C++ String Pool Allocator on GitHub

I’ve uploaded on GitHub some C++ code of mine, implementing a custom string pool allocator.

The basic idea is to allocate big chunks of memory, and then serve individual string allocations by carving memory from inside those blocks, with a simple, fast pointer increase.

There’s also a benchmark comparing this custom allocator vs. STL’s strings.

Custom string pool allocator benchmark results.

The results clearly show that both allocating strings this way and sorting them are faster than using the default std::wstring class.


The Small String Optimization

How do “Connie” and “meow” differ from “The Commodore 64 is a great computer”?

(Don’t get me wrong: They are all great strings! 🙂 )

In several implementations, including Visual C++’s, the STL string classes are empowered by an interesting optimization: the Small String Optimization (SSO).

What does that mean?

Well, it basically means that small strings get a special treatment. In other words, there’s a difference in how strings like “Connie”, “meow” or “The Commodore 64 is a great computer” are allocated and stored by std::string.

In general, a typical string class allocates the storage for the string’s text dynamically from the heap, using new[]. In Visual Studio’s C/C++ run-time implementation on Windows, new[] calls malloc, which calls HeapAlloc (which, in turn, may call VirtualAlloc). The bottom line is that dynamically allocating memory with new[] is a non-trivial task that has overhead, and implies a trip down the Windows memory manager.

So, the std::string class says: “OK, for small strings, instead of taking a trip down the new[]-malloc-HeapAlloc-etc. “memory lane” 🙂 , let’s do something much faster and cooler! I, the std::string class, will reserve a small chunk of memory, a “small buffer” embedded inside std::string objects, and when strings are small enough, they will be kept (deep-copied) in that buffer, without triggering dynamic memory allocations.”

That’s a big saving! For example, for something like:

std::string s{"Connie"};

there’s no memory allocated on the heap! “Connie” is just stack-allocated. No new[], no malloc, no HeapAlloc, no trip down the Windows memory manager.

That’s kind of the equivalent of this C-ish code:

char buffer[ /* some short length */ ];
strcpy_s(buffer, "Connie");

No new[], no HeapAlloc, no virtual memory manager overhead! It’s just a simple snappy stack allocation, followed by a string copy.
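
If you’re curious, you can get a rough idea of whether a given string instance is using its internal SSO buffer with a heuristic check like the following (just a sketch: it probes implementation-specific behavior and is not guaranteed by the C++ standard):

#include <iostream>
#include <string>

// Heuristic: does s.data() point inside the string object itself?
// If so, the text lives in the internal SSO buffer, not on the heap.
bool UsesInternalBuffer(const std::string& s)
{
    const char* objectBegin = reinterpret_cast<const char*>(&s);
    const char* objectEnd = objectBegin + sizeof(s);
    return s.data() >= objectBegin && s.data() < objectEnd;
}

int main()
{
    std::string tiny{ "Connie" };
    std::string big{ "The Commodore 64 is a great computer" };

    std::cout << std::boolalpha
              << UsesInternalBuffer(tiny) << '\n'   // typically true (SSO)
              << UsesInternalBuffer(big)  << '\n';  // typically false (heap)
}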

But there’s more! In fact, having the string’s text embedded inside the std::string object offers great locality, better than chasing pointers to scattered memory blocks allocated on the heap. This is also very good when strings are stored in a std::vector, as small strings are basically stored in contiguous memory locations, and modern CPUs love blasting contiguous data in memory!

Optimizations similar to the SSO can also be applied to other data structures, for example to vectors. This year’s CppCon had an interesting session discussing that: “High Performance Code 201: Hybrid Data Structures”.

I’ve prepared some C++ code implementing a simple benchmark to measure the effects of the SSO. The results I got for vectors of 200,000 small strings clearly show a significant advantage of STL strings. For example, in a 64-bit build on an Intel i7 CPU @ 3.40 GHz, the vector::push_back time for ATL (CStringW) is 29 ms, while for STL (wstring) it’s just 14 ms: half the time! Moreover, sorting times are 135 ms for ATL vs. 77 ms for the STL: again, a big win for the SSO implemented in the STL!


The Chinese Dictionary Loading Benchmark Revised

Available here on GitHub.

[…] All in all, I’d be happy with the optimization level reached in #2: Ditch C++ standard I/O streams and locale/codecvt in favor of memory-mapped files for reading files and MultiByteToWideChar Win32 API for UTF-8 to UTF-16 conversions, but just continue using the STL’s wstring (or CString) class!

Unicode UTF-16/UTF-8 Encoding Conversions: Win32 API vs. C++ Standard Library Performance

In this MSDN Magazine article, I showed how to convert Unicode text between UTF-16 and UTF-8 encodings using direct Win32 API calls (in particular, I discussed in detail the use of the MultiByteToWideChar API).

In addition to that, the C++ standard library offers some classes to perform such conversions.

For example, you can combine std::codecvt_utf8_utf16 with std::wstring_convert to convert from a UTF-16-encoded std::wstring to a UTF-8 std::string:

#include <codecvt>  // for codecvt_utf8_utf16
#include <locale>   // for wstring_convert
#include <string>   // for string, wstring

...
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> conversion;
std::wstring utf16 = L"Some UTF-16 string";
std::string utf8 = conversion.to_bytes(utf16);
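
For comparison, the UTF-16-to-UTF-8 direction via direct Win32 API calls uses the WideCharToMultiByte API (MultiByteToWideChar covers the opposite direction); here’s a simplified sketch with minimal error handling:

#include <windows.h>  // for WideCharToMultiByte
#include <stdexcept>  // for runtime_error
#include <string>     // for string, wstring

std::string Utf16ToUtf8(const std::wstring& utf16)
{
    if (utf16.empty())
        return std::string{};

    // First call: query the required size (in bytes) of the UTF-8 buffer
    const int utf8Length = ::WideCharToMultiByte(
        CP_UTF8, WC_ERR_INVALID_CHARS,
        utf16.data(), static_cast<int>(utf16.length()),
        nullptr, 0, nullptr, nullptr);
    if (utf8Length == 0)
        throw std::runtime_error("WideCharToMultiByte failed");

    std::string utf8(utf8Length, ' ');

    // Second call: do the actual conversion
    ::WideCharToMultiByte(
        CP_UTF8, WC_ERR_INVALID_CHARS,
        utf16.data(), static_cast<int>(utf16.length()),
        &utf8[0], utf8Length, nullptr, nullptr);

    return utf8;
}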

I developed some C++ code to convert many strings and measure the time spent for the conversions using the aforementioned C++ standard library classes vs. direct Win32 API calls, and the result is clearly in favor of the latter (code compiled with VS2015 in release build):

STL:   1050 ms
Win32:  379 ms   <-- clearly wins