Conversion between Unicode UTF-16 and UTF-8 in C++/Win32

There are several possible representations of Unicode text, e.g. UTF-8, UTF-16 and UTF-32.


UTF-16 is the default Unicode encoding form used by Windows.


UTF-8 is a common encoding form used to exchange text data on the Internet.
One of the advantages of UTF-8 is that it has no endianness problem (i.e. big-endian vs. little-endian), because UTF-8 is interpreted simply as a sequence of bytes. By contrast, for UTF-16 and UTF-32 it is important to specify the correct endianness of the code units.
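
To make the endianness point concrete, here is a small portable sketch (not part of the original code) that serializes the single character U+20AC (the euro sign) in both encodings. The helper names Utf16UnitToBytes and Utf8ThreeByte are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: serialize one UTF-16 code unit as two bytes.
// The byte order depends on the chosen endianness.
std::vector<std::uint8_t> Utf16UnitToBytes(char16_t unit, bool bigEndian)
{
    const std::uint8_t hi = static_cast<std::uint8_t>(unit >> 8);
    const std::uint8_t lo = static_cast<std::uint8_t>(unit & 0xFF);
    if (bigEndian)
        return { hi, lo };
    return { lo, hi };
}

// Hypothetical helper: encode a code point in the range U+0800..U+FFFF
// (excluding surrogates) as its three UTF-8 bytes.
// There is only one possible byte order.
std::vector<std::uint8_t> Utf8ThreeByte(char32_t cp)
{
    assert(cp >= 0x0800 && cp <= 0xFFFF);
    assert(cp < 0xD800 || cp > 0xDFFF);
    return {
        static_cast<std::uint8_t>(0xE0 | (cp >> 12)),
        static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)),
        static_cast<std::uint8_t>(0x80 | (cp & 0x3F))
    };
}
```

With these helpers, U+20AC becomes the bytes 20 AC in UTF-16 big-endian and AC 20 in UTF-16 little-endian, while its UTF-8 form is always E2 82 AC.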


To convert text between the Unicode UTF-8 and UTF-16 encodings, two Win32 API functions come in handy: MultiByteToWideChar and WideCharToMultiByte.


Suppose we want to convert text from UTF-8 to UTF-16. In this case, the MultiByteToWideChar function can be used. To request a conversion from UTF-8, the CP_UTF8 code page value must be specified as the first parameter of MultiByteToWideChar.
This function should be called twice. In the first call, we set the cchWideChar parameter to 0, so the function performs no conversion and instead returns the required size of the destination buffer, in WCHARs, for the resulting UTF-16 (“wide char”) string. We can then dynamically allocate a buffer of that size to store the UTF-16 string (this is done using the CStringW::GetBuffer method in the code sample in this post). Finally, we call MultiByteToWideChar again to perform the actual conversion from UTF-8 to UTF-16.


(So, to summarize: the purpose of the first call to the function is to get the destination buffer size, the second call to the function does the actual conversion.)
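
The two-call pattern summarized above can also be sketched without ATL, using std::string and std::wstring. This is only an illustration under stated assumptions: the function name Utf8ToUtf16Sketch and the use of standard C++ exceptions are hypothetical choices, not part of the original routines. Unlike the routines in this post, the sketch passes the explicit byte length of the input, so the terminating NUL is not part of the conversion.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <cassert>
#include <climits>
#include <stdexcept>
#include <string>

// Hypothetical helper (assumption, not the author's code):
// converts UTF-8 to UTF-16 using the two-call pattern.
std::wstring Utf8ToUtf16Sketch(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();

    // MultiByteToWideChar takes the length as an int
    if (utf8.size() > static_cast<std::size_t>(INT_MAX))
        throw std::length_error("UTF-8 input too long");
    const int cbUtf8 = static_cast<int>(utf8.size());

    // First call: cchWideChar == 0, so only the required
    // destination size (in WCHARs) is returned
    const int cchUtf16 = ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS,
        utf8.data(), cbUtf8,
        NULL, 0);
    if (cchUtf16 == 0)
        throw std::runtime_error("MultiByteToWideChar failed (size query)");

    // Allocate the destination buffer
    std::wstring utf16(static_cast<std::size_t>(cchUtf16), L'\0');

    // Second call: perform the actual conversion
    const int written = ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS,
        utf8.data(), cbUtf8,
        &utf16[0], cchUtf16);
    if (written == 0)
        throw std::runtime_error("MultiByteToWideChar failed (conversion)");
    assert(written == cchUtf16);

    return utf16;
}
#endif // _WIN32
```

For example, calling Utf8ToUtf16Sketch("\xE2\x82\xAC") should yield a one-element wide string holding U+20AC.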


A similar process occurs for WideCharToMultiByte, which can be used to convert text from Unicode UTF-16 (“wide char”) to UTF-8.
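
As a rough sketch of the reverse direction, the same size-then-convert sequence can be written with std::wstring and std::string. The function name Utf16ToUtf8Sketch and the standard exceptions are assumptions made for this illustration only. The WC_ERR_INVALID_CHARS flag is used only when the SDK headers define it, since it is available on Windows Vista and later.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <cassert>
#include <climits>
#include <stdexcept>
#include <string>

// Hypothetical helper (assumption, not the author's code):
// converts UTF-16 to UTF-8 using the two-call pattern.
std::string Utf16ToUtf8Sketch(const std::wstring& utf16)
{
    if (utf16.empty())
        return std::string();

    // WideCharToMultiByte takes the length as an int
    if (utf16.size() > static_cast<std::size_t>(INT_MAX))
        throw std::length_error("UTF-16 input too long");
    const int cchUtf16 = static_cast<int>(utf16.size());

    // Fail on invalid input where the flag is available (Vista and later)
#ifdef WC_ERR_INVALID_CHARS
    const DWORD flags = WC_ERR_INVALID_CHARS;
#else
    const DWORD flags = 0;
#endif

    // First call: cbMultiByte == 0, so only the required
    // destination size (in bytes) is returned
    const int cbUtf8 = ::WideCharToMultiByte(
        CP_UTF8, flags,
        utf16.data(), cchUtf16,
        NULL, 0,
        NULL, NULL);
    if (cbUtf8 == 0)
        throw std::runtime_error("WideCharToMultiByte failed (size query)");

    // Allocate the destination buffer
    std::string utf8(static_cast<std::size_t>(cbUtf8), '\0');

    // Second call: perform the actual conversion
    const int written = ::WideCharToMultiByte(
        CP_UTF8, flags,
        utf16.data(), cchUtf16,
        &utf8[0], cbUtf8,
        NULL, NULL);
    if (written == 0)
        throw std::runtime_error("WideCharToMultiByte failed (conversion)");
    assert(written == cbUtf8);

    return utf8;
}
#endif // _WIN32
```

The last two parameters (default character and used-default flag) are passed as NULL here, as in the routines in this post.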


The following commented C++ code shows how to use these Win32 functions to convert text between UTF-8 and UTF-16.


This code is pure Win32 C++ code; it uses ATL's convenient CString class (UTF-16 strings are stored in instances of CStringW; UTF-8 strings are stored in instances of CStringA). This code can be used in the context of MFC as well.


 


//////////////////////////////////////////////////////////////////////////////
//
// *** Routines to convert between Unicode UTF-8 and Unicode UTF-16 ***
//
// By Giovanni Dicanio <giovanni.dicanio AT gmail.com>
//
// Last update: 2010, January 2nd
//
//
// These routines use ::MultiByteToWideChar and ::WideCharToMultiByte
// Win32 API functions to convert between Unicode UTF-8 and UTF-16.
//
// UTF-16 strings are stored in instances of CStringW.
// UTF-8 strings are stored in instances of CStringA.
//
// On error, the conversion routines use AtlThrow to signal the
// error condition.
//
// If input string pointers are NULL, empty strings are returned.
//
//
// Prefixes used in these routines:
// --------------------------------
//
//  - cch  : count of characters (CHAR's or WCHAR's)
//  - cb   : count of bytes
//  - psz  : pointer to a NUL-terminated string (CHAR* or WCHAR*)
//  - str  : instance of CString(A/W) class
//
//
//
// Useful Web References:
// ----------------------
//
// WideCharToMultiByte Function
// http://msdn.microsoft.com/en-us/library/dd374130.aspx
//
// MultiByteToWideChar Function
// http://msdn.microsoft.com/en-us/library/dd319072.aspx
//
// AtlThrow
// http://msdn.microsoft.com/en-us/library/z325eyx0.aspx
//
//
// Developed on VC9 (Visual Studio 2008 SP1)
//
//
//////////////////////////////////////////////////////////////////////////////


 


 


 


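// Headers required by the routines below (in an MFC project, the
// precompiled header may already include some of them)
#include <atlbase.h>    // AtlThrow, AtlThrowLastWin32, ATLASSERT
#include <atlstr.h>     // CStringA, CStringW
#include <strsafe.h>    // StringCchLengthA, StringCchLengthW
#include <limits.h>     // INT_MAX


namespace UTF8Util
{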


 


 


 


//----------------------------------------------------------------------------
// FUNCTION: ConvertUTF8ToUTF16
// DESC: Converts Unicode UTF-8 text to Unicode UTF-16 (Windows default).
//----------------------------------------------------------------------------
CStringW ConvertUTF8ToUTF16( __in const CHAR * pszTextUTF8 )
{
    //
    // Special case of NULL or empty input string
    //
    if ( (pszTextUTF8 == NULL) || (*pszTextUTF8 == '\0') )
    {
        // Return empty string
        return L"";
    }


    //
    // Consider CHAR's count corresponding to total input string length,
    // including end-of-string ('\0') character
    //
    const size_t cchUTF8Max = INT_MAX - 1;
    size_t cchUTF8;
    HRESULT hr = ::StringCchLengthA( pszTextUTF8, cchUTF8Max, &cchUTF8 );
    if ( FAILED( hr ) )
    {
        AtlThrow( hr );
    }

    // Consider also terminating '\0'
    ++cchUTF8;

    // Convert to 'int' for use with MultiByteToWideChar API
    int cbUTF8 = static_cast<int>( cchUTF8 );


    //
    // Get size of destination UTF-16 buffer, in WCHAR's
    //
    int cchUTF16 = ::MultiByteToWideChar(
        CP_UTF8,                // convert from UTF-8
        MB_ERR_INVALID_CHARS,   // error on invalid chars
        pszTextUTF8,            // source UTF-8 string
        cbUTF8,                 // total length of source UTF-8 string,
                                // in CHAR's (= bytes), including end-of-string
        NULL,                   // unused - no conversion done in this step
        0                       // request size of destination buffer, in WCHAR's
        );
    ATLASSERT( cchUTF16 != 0 );
    if ( cchUTF16 == 0 )
    {
        AtlThrowLastWin32();
    }


    //
    // Allocate destination buffer to store UTF-16 string
    //
    CStringW strUTF16;
    WCHAR * pszUTF16 = strUTF16.GetBuffer( cchUTF16 );

    //
    // Do the conversion from UTF-8 to UTF-16
    //
    int result = ::MultiByteToWideChar(
        CP_UTF8,                // convert from UTF-8
        MB_ERR_INVALID_CHARS,   // error on invalid chars
        pszTextUTF8,            // source UTF-8 string
        cbUTF8,                 // total length of source UTF-8 string,
                                // in CHAR's (= bytes), including end-of-string
        pszUTF16,               // destination buffer
        cchUTF16                // size of destination buffer, in WCHAR's
        );
    ATLASSERT( result != 0 );
    if ( result == 0 )
    {
        AtlThrowLastWin32();
    }

    // Release internal CString buffer
    strUTF16.ReleaseBuffer();

    // Return resulting UTF-16 string
    return strUTF16;
}


 


 


 


//----------------------------------------------------------------------------
// FUNCTION: ConvertUTF16ToUTF8
// DESC: Converts Unicode UTF-16 (Windows default) text to Unicode UTF-8.
//----------------------------------------------------------------------------
CStringA ConvertUTF16ToUTF8( __in const WCHAR * pszTextUTF16 )
{
    //
    // Special case of NULL or empty input string
    //
    if ( (pszTextUTF16 == NULL) || (*pszTextUTF16 == L'\0') )
    {
        // Return empty string
        return "";
    }


    //
    // Consider WCHAR's count corresponding to total input string length,
    // including end-of-string (L'\0') character.
    //
    const size_t cchUTF16Max = INT_MAX - 1;
    size_t cchUTF16;
    HRESULT hr = ::StringCchLengthW( pszTextUTF16, cchUTF16Max, &cchUTF16 );
    if ( FAILED( hr ) )
    {
        AtlThrow( hr );
    }

    // Consider also terminating L'\0'
    ++cchUTF16;


    //
    // WC_ERR_INVALID_CHARS flag is set to fail if invalid input character
    // is encountered.
    // This flag is supported on Windows Vista and later.
    // Don't use it on Windows XP and previous.
    //
#if (WINVER >= 0x0600)
    DWORD dwConversionFlags = WC_ERR_INVALID_CHARS;
#else
    DWORD dwConversionFlags = 0;
#endif

    //
    // Get size of destination UTF-8 buffer, in CHAR's (= bytes)
    //
    int cbUTF8 = ::WideCharToMultiByte(
        CP_UTF8,                // convert to UTF-8
        dwConversionFlags,      // specify conversion behavior
        pszTextUTF16,           // source UTF-16 string
        static_cast<int>( cchUTF16 ),   // total source string length, in WCHAR's,
                                        // including end-of-string
        NULL,                   // unused - no conversion required in this step
        0,                      // request buffer size
        NULL, NULL              // unused
        );
    ATLASSERT( cbUTF8 != 0 );
    if ( cbUTF8 == 0 )
    {
        AtlThrowLastWin32();
    }


    //
    // Allocate destination buffer for UTF-8 string
    //
    CStringA strUTF8;
    int cchUTF8 = cbUTF8; // sizeof(CHAR) = 1 byte
    CHAR * pszUTF8 = strUTF8.GetBuffer( cchUTF8 );


    //
    // Do the conversion from UTF-16 to UTF-8
    //
    int result = ::WideCharToMultiByte(
        CP_UTF8,                // convert to UTF-8
        dwConversionFlags,      // specify conversion behavior
        pszTextUTF16,           // source UTF-16 string
        static_cast<int>( cchUTF16 ),   // total source string length, in WCHAR's,
                                        // including end-of-string
        pszUTF8,                // destination buffer
        cbUTF8,                 // destination buffer size, in bytes
        NULL, NULL              // unused
        );
    ATLASSERT( result != 0 );
    if ( result == 0 )
    {
        AtlThrowLastWin32();
    }

    // Release internal CString buffer
    strUTF8.ReleaseBuffer();

    // Return resulting UTF-8 string
    return strUTF8;
}


 


 


 


} // namespace UTF8Util


 


 


//////////////////////////////////////////////////////////////////////////////


 

A Visual Studio 2008 solution is attached to this blog post. This solution contains an MFC test app and a console test app regarding the aforementioned conversion functions.

I’d like to close this blog post with a couple of interesting links about Unicode:

  The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

  UTF-8, UTF-16, UTF-32 & BOM

 

 

6 thoughts on “Conversion between Unicode UTF-16 and UTF-8 in C++/Win32”

  1. I am just wondering how the buffer allocation using CStringA|W works. It allocates a buffer using a number of characters only. How does this work since the number of bytes depends on exactly what the characters are for variable width encodings such as UTF-8 and UTF-16. (Unless what MultiByteToWideChar returns isn’t strictly the number of characters?)

  2. @Ben, MultiByteToWideChar tells you how large a buffer of WCHAR values is needed to store the output string (on the first invocation).

    When it says “characters” it really means “code-units” (*not* code-points), in this case WCHAR objects (each of which is 16 bits since we have UTF-16).

  3. Very good, but here is a simplified function based on this one that works:

    char* _UTF16ToUTF8( nunichar * pszTextUTF16 ){
    if ( (pszTextUTF16 == NULL) || (*pszTextUTF16 == L'\0') ) {
    return 0;
    }
    int cchUTF16;
    cchUTF16=n_strlen( pszTextUTF16)+1;
    int cbUTF8 = WideCharToMultiByte(CP_UTF8,0,pszTextUTF16,cchUTF16,NULL,0/* request buffer size*/,NULL, NULL );
    ASSERT2(cbUTF8);
    char *strUTF8=new char[cbUTF8],*pszUTF8 =strUTF8;
    int result = WideCharToMultiByte(CP_UTF8, 0,pszTextUTF16,cchUTF16 ,pszUTF8, cbUTF8,NULL,NULL );
    ASSERT2( result);
    return strUTF8;
    }

  4. this is the correct function(sorry):

    char* _UTF16ToUTF8( wchar_t * pszTextUTF16 ){
    if ( (pszTextUTF16 == NULL) || (*pszTextUTF16 == L'\0') ) {
    return 0;
    }
    int cchUTF16;
    cchUTF16=wcslen( pszTextUTF16)+1;
    int cbUTF8 = WideCharToMultiByte(CP_UTF8,0,pszTextUTF16,cchUTF16,NULL,0/* request buffer size*/,NULL, NULL );
    ASSERT(cbUTF8);
    char *strUTF8=new char[cbUTF8],*pszUTF8 =strUTF8;
    int result = WideCharToMultiByte(CP_UTF8, 0,pszTextUTF16,cchUTF16 ,pszUTF8, cbUTF8,NULL,NULL );
    ASSERT( result);
    return strUTF8;
    }

  5. @Nei Amaral F.
    I’d add const correctness to your function, using “const” for “pszTextUTF16”.

    Moreover, I’d prefer using a string class for return value instead of a raw char* pointer (which the caller must manually free with delete[], and is a potential source for memory leaks).
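
The distinction between code units and code points raised in the comments above can be checked with a small portable sketch. It is an illustration only: the helper name CountCodePoints is hypothetical, and the function assumes that an unpaired surrogate simply counts as one item.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count the code points in a UTF-16 string.
// std::u16string::size() returns the number of 16-bit code units,
// which is what buffer sizes are measured in.
std::size_t CountCodePoints(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const char16_t unit = s[i];
        const bool isHigh = (unit >= 0xD800 && unit <= 0xDBFF);
        const bool nextIsLow = (i + 1 < s.size())
            && (s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF);

        // A high surrogate followed by a low surrogate forms ONE code point
        if (isHigh && nextIsLow)
            ++i; // skip the low surrogate

        ++count;
    }
    return count;
}
```

For the string u"A\U0001F600", size() reports 3 code units while CountCodePoints reports 2 code points, because U+1F600 lies outside the Basic Multilingual Plane and needs a surrogate pair. A destination buffer therefore has to be sized in code units, which is the count MultiByteToWideChar returns.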
