Conversion between Unicode UTF-8 and UTF-16 with STL strings

Suppose there is a need to convert between Unicode UTF-8 and Unicode UTF-16 in a Windows C++ application. This situation is common, because it is convenient to use UTF-16 as the Unicode encoding inside a C++ app (UTF-16 is the encoding used by the Win32 Unicode APIs) and UTF-8 outside the app boundaries (e.g. in text files).



To do that, it is possible to use ATL conversion helpers like CA2W and CW2A, as shown in this blog post by Kenny Kerr. Alternatively, MultiByteToWideChar and WideCharToMultiByte can be called directly, together with the CString(A/W) classes, as illustrated in a previous blog post here.


Another option is to use STL strings instead of ATL/MFC CString. One advantage is that this approach also works with the Express editions of Visual Studio, which do not include ATL and MFC. Moreover, STL strings integrate better with the rest of the STL and with Boost, and some C++ programmers simply prefer them to ATL/MFC CString.


The code that uses STL strings is similar to that shown previously for CString. For a conversion from UTF-8 to UTF-16, the MultiByteToWideChar API is called twice: the first call, made with no output buffer, returns the length of the resulting UTF-16 string, so that enough memory can be allocated for it; the second call performs the actual conversion. The reverse conversion (from UTF-16 to UTF-8) follows the same pattern, this time using the WideCharToMultiByte API.
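The two-call pattern can be sketched roughly as follows. This is only an illustration and not the code from “utf8conv.h”: the function name Utf8ToUtf16, the use of std::runtime_error and std::overflow_error, and the choice of the MB_ERR_INVALID_CHARS flag are assumptions made for this example. The whole block is wrapped in an _WIN32 guard because it depends on <windows.h>.

```cpp
#ifdef _WIN32   // Win32-only sketch: compiles to nothing on other platforms

#include <windows.h>

#include <cstddef>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper (name and error handling are assumptions).
inline std::wstring Utf8ToUtf16(const std::string& utf8)
{
    // MultiByteToWideChar fails when given a zero-length input,
    // so the empty string is handled as a special case.
    if (utf8.empty())
        return std::wstring();

    // The API takes lengths as int, so reject strings that do not fit.
    // (max is parenthesized to avoid the max macro from <windows.h>.)
    if (utf8.size() > static_cast<std::size_t>((std::numeric_limits<int>::max)()))
        throw std::overflow_error("Input string too long.");
    const int utf8Length = static_cast<int>(utf8.size());

    // First call: no output buffer; returns the required length in wchar_ts.
    const int utf16Length = ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS,
        utf8.data(), utf8Length,
        nullptr, 0);
    if (utf16Length == 0)
        throw std::runtime_error("Cannot get the length of the UTF-16 string.");

    // Make room for the result.
    std::wstring utf16(static_cast<std::size_t>(utf16Length), L'\0');

    // Second call: performs the actual conversion into the string's buffer.
    const int written = ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS,
        utf8.data(), utf8Length,
        &utf16[0], utf16Length);
    if (written == 0)
        throw std::runtime_error("Cannot convert from UTF-8 to UTF-16.");

    return utf16;
}

#endif // _WIN32
```

Because the input length is passed explicitly, the API does not count or write a terminating NUL, and the std::wstring ends up with exactly the converted characters. A UTF-16 to UTF-8 function would have the same shape, with WideCharToMultiByte in place of MultiByteToWideChar.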


A couple of differences between CString and STL’s strings in the context of Win32 programming are worth noting.


First, Win32 APIs tend to receive input strings as LPCTSTR, which is a typedef for “const TCHAR *”, i.e. a raw, NUL-terminated C string. CString fits this model well: a CString instance can be passed directly where an LPCTSTR parameter is expected, thanks to the conversion operator PCXSTR() implemented by CSimpleStringT, the base class of CStringT. std::[w]string has no such implicit conversion, so c_str() must be called explicitly. (data() can also be used, but it is guaranteed to return a NUL-terminated string only from C++11 onwards.)
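As a small illustration of this first difference, here is a sketch in portable C++. CountCharacters is a hypothetical stand-in for a Win32 function that takes an LPCWSTR parameter; it is not a real API, and it is used here only to keep the example independent of <windows.h>.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical stand-in for a Win32 API taking an LPCWSTR,
// i.e. a const wchar_t* pointing to a NUL-terminated string.
std::size_t CountCharacters(const wchar_t* text)
{
    return std::wcslen(text);
}

std::size_t CountCharactersOf(const std::wstring& text)
{
    // CountCharacters(text);   // would not compile: std::wstring has no
    //                          // implicit conversion to const wchar_t*
    return CountCharacters(text.c_str());   // explicit c_str() call
}
```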


Second, when there is a need to modify the content of a CString buffer directly, it is possible to call the GetBuffer() or GetBufferSetLength() methods, which return a non-const pointer to the internal string buffer. With STL strings, the equivalent is to call resize() to set the string to the required length (reserve() alone is not enough, because it does not change the string’s size), and then use code like &myString[0] to get direct, non-const access to the string content. (This technique works with the Visual C++ implementation of STL strings, and since C++11 the standard guarantees that std::basic_string stores its characters contiguously.)
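The resize() and &myString[0] technique might look like the following sketch. GetGreeting is a hypothetical C-style function invented for this example; it only imitates the common Win32 convention of returning the required length when no buffer is supplied.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical C-style API (an assumption for this example).
// With a null buffer it returns the number of chars required (no NUL);
// otherwise it copies up to bufferSize chars and returns the count copied.
std::size_t GetGreeting(char* buffer, std::size_t bufferSize)
{
    static const char kText[] = "hello";
    const std::size_t length = sizeof(kText) - 1;   // exclude the NUL
    if (buffer == nullptr)
        return length;
    const std::size_t toCopy = (bufferSize < length) ? bufferSize : length;
    std::memcpy(buffer, kText, toCopy);
    return toCopy;
}

std::string GetGreetingString()
{
    std::string result;

    // resize() changes the string's size, so that result[0] up to
    // result[size() - 1] are valid, writable characters.
    result.resize(GetGreeting(nullptr, 0));

    // &result[0] gives non-const access to the contiguous character buffer.
    if (!result.empty())
        GetGreeting(&result[0], result.size());

    return result;
}
```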


With these two differences between CString and STL’s strings in mind, it should be easy to follow the commented code in “utf8conv.h” file, attached to this blog post.


As a final note, the Win32 APIs used in the UTF-8 conversion process can fail; as is common in the Win32 programming model, the GetLastError function can be used to retrieve more details about the error. Instead of using return codes for error conditions, the attached source code throws C++ exceptions. For this purpose, an exception class named utf8_error is derived from std::exception and used to signal error conditions during the conversion process.
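An exception class of this kind could be sketched as below. Only the class name and the std::exception base come from the description above; the constructor parameters, the error_code() accessor, and the idea of storing the GetLastError value as an unsigned long are assumptions made for this example, and the class in the attached file may look different.

```cpp
#include <exception>
#include <string>

// Sketch of an exception type for conversion failures.
// The members below are assumptions, not the attached implementation.
class utf8_error : public std::exception
{
public:
    // errorCode is meant to hold the value returned by GetLastError()
    // (a DWORD on Windows; unsigned long keeps this sketch portable).
    utf8_error(const std::string& message, unsigned long errorCode)
        : m_message(message), m_errorCode(errorCode)
    {
    }

    // Human-readable description of the failure.
    const char* what() const noexcept override
    {
        return m_message.c_str();
    }

    // Error code captured at the point of failure.
    unsigned long error_code() const noexcept
    {
        return m_errorCode;
    }

private:
    std::string m_message;
    unsigned long m_errorCode;
};
```

With a class like this, a conversion function would call GetLastError() right after the failing API call and pass the result to the constructor, so that the caller can catch utf8_error (or std::exception) and inspect the code.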


EDIT 2011, October 15th: Code Gallery sample can be found here.
