static int CMarkup::UTF8To16( unsigned short* pwszUTF16, const char* pszUTF8, int nUTF8Count );
UTF8To16 converts the UTF-8 string in pszUTF8 to UTF-16 in the pwszUTF16 string buffer. It uses the same arguments as the ANSI C mbstowcs function, but instead of converting from the locale charset it converts from UTF-8.
Update December 17, 2008: With CMarkup release 10.1 the UTF-16 string type in the UTF8To16 and UTF16To8 functions changed from wchar_t* to unsigned short*, since wchar_t means UTF-32 on Linux and OS X.
The pszUTF8 source must be a UTF-8 string, which is processed up to its null-terminator or nUTF8Count bytes, whichever comes first. nUTF8Count is the maximum number of UTF-8 bytes to convert and should include the null-terminator if a null-terminator is desired in the result. If pwszUTF16 is NULL, the number of UTF-16 units required (i.e. the UTF-16 length) is returned. If pwszUTF16 is not NULL, it is filled with the result string, so it must be large enough to hold it. The result is null-terminated only if a null-terminator is encountered in pszUTF8 before nUTF8Count bytes have been processed. When pwszUTF16 is not NULL, the return value is the number of UTF-8 bytes converted rather than the UTF-16 length.
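To make the contract above concrete, here is a minimal standalone sketch of a converter that follows the same rules. It is not CMarkup's source code: the name Utf8To16Sketch is hypothetical, and two details are assumptions rather than documented behavior, namely that the returned byte count excludes the terminating null and that malformed or truncated sequences become '?'. It also does not reject overlong encodings.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical illustration only; this is NOT CMarkup's implementation.
// Contract, as described in the text above:
//   - stops at a null byte or after nUTF8Count bytes, whichever comes first
//   - pwszUTF16 == NULL: returns the number of UTF-16 units required
//   - pwszUTF16 != NULL: writes the result, returns the UTF-8 bytes converted
//     (assumption: the terminating null byte is not counted)
int Utf8To16Sketch(unsigned short* pwszUTF16, const char* pszUTF8, int nUTF8Count)
{
    assert(pszUTF8 != NULL);
    const unsigned char* p = reinterpret_cast<const unsigned char*>(pszUTF8);
    int nBytes = 0;  // UTF-8 bytes consumed so far
    int nUnits = 0;  // UTF-16 units produced so far

    while (nBytes < nUTF8Count)
    {
        unsigned long cp = p[nBytes];
        if (cp == 0)
        {
            // Null reached before nUTF8Count: terminate the result and stop.
            if (pwszUTF16)
                pwszUTF16[nUnits] = 0;
            break;
        }

        // The lead byte tells us how many continuation bytes should follow.
        int nExtra = 0;
        if (cp < 0x80)                  { nExtra = 0; }
        else if ((cp & 0xE0) == 0xC0)   { nExtra = 1; cp &= 0x1F; }
        else if ((cp & 0xF0) == 0xE0)   { nExtra = 2; cp &= 0x0F; }
        else if ((cp & 0xF8) == 0xF0)   { nExtra = 3; cp &= 0x07; }
        else                            { cp = '?'; }  // invalid lead byte (assumption)

        // Read continuation bytes one at a time so we never read past the
        // count limit or past a null-terminator.
        int nConsumed = 1;
        for (int i = 1; i <= nExtra; ++i)
        {
            if (nBytes + i >= nUTF8Count || (p[nBytes + i] & 0xC0) != 0x80)
            {
                cp = '?';  // truncated or malformed sequence (assumption)
                break;
            }
            cp = (cp << 6) | (p[nBytes + i] & 0x3F);
            nConsumed = i + 1;
        }
        nBytes += nConsumed;
        if (cp > 0x10FFFF)
            cp = '?';  // outside the Unicode range

        if (cp >= 0x10000)
        {
            // Code points above U+FFFF need a surrogate pair in UTF-16.
            cp -= 0x10000;
            if (pwszUTF16)
            {
                pwszUTF16[nUnits]     = static_cast<unsigned short>(0xD800 | (cp >> 10));
                pwszUTF16[nUnits + 1] = static_cast<unsigned short>(0xDC00 | (cp & 0x3FF));
            }
            nUnits += 2;
        }
        else
        {
            if (pwszUTF16)
                pwszUTF16[nUnits] = static_cast<unsigned short>(cp);
            ++nUnits;
        }
    }
    return pwszUTF16 ? nBytes : nUnits;
}
```

Note that the result is only null-terminated when the loop actually reaches the source null, which is why callers pass the source length + 1 when they want a terminated result.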
The following example illustrates converting the letter z from UTF-16 to UTF-8, and then back to UTF-16. In the UTF16To8 call, we pass an unsigned short array holding 0x007A, the UTF-16 code unit for z, followed by a null-terminator. In the UTF8To16 call, we pass the wszUTF16 buffer to receive the result, "z", and specify the length of the UTF-8 source + 1 to include the null-terminator.
char szUTF8[5];
unsigned short wszUTF16[3];
const unsigned short wszZ[] = { 0x007A, 0 }; // z
int nUTFLen;
nUTFLen = CMarkup::UTF16To8(szUTF8,wszZ,5);
Check( strcmp(szUTF8,"z") == 0 );
nUTFLen = CMarkup::UTF8To16(wszUTF16,szUTF8,nUTFLen+1);
Check( wszUTF16[0] == 0x007A && wszUTF16[1] == 0 );
Here is an example to demonstrate the common technique of passing a NULL result buffer so that the function returns the necessary result length, before allocating the result buffer and calling the function again.
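const char* pszTest = "hello";
unsigned short* pwszBuffer;
int nLen = strlen( pszTest );
// First call: NULL buffer, so the required UTF-16 length is returned
int nUTF16Len = CMarkup::UTF8To16(NULL,pszTest,nLen);
// Allocate room for the result plus its null-terminator
pwszBuffer = new unsigned short[nUTF16Len+1];
// Second call: nLen+1 includes the source null-terminator
CMarkup::UTF8To16(pwszBuffer,pszTest,nLen+1);
// Same technique in the other direction, back to UTF-8
nLen = CMarkup::UTF16To8(NULL,pwszBuffer,0);
CString csTest;
CMarkup::UTF16To8(csTest.GetBuffer(nLen),pwszBuffer,nLen);
csTest.ReleaseBuffer(nLen);
delete [] pwszBuffer;
Check( strcmp(csTest,pszTest) == 0 );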
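The two-call technique can also be wrapped in a small helper so that the buffer is owned by a std::vector rather than managed with new and delete. The sketch below is hypothetical and not part of CMarkup: ToUtf16Vector and the Utf8To16Fn typedef are names introduced here, and AsciiOnlyUtf8To16 is only a stand-in converter so the example runs without the library. Assuming the static declaration shown at the top of this page, CMarkup::UTF8To16 should be usable as the pfnConvert argument.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Function-pointer type matching the UTF8To16 declaration on this page.
typedef int (*Utf8To16Fn)(unsigned short*, const char*, int);

// Hypothetical helper: sizes the buffer with a NULL-buffer call, then converts.
std::vector<unsigned short> ToUtf16Vector(Utf8To16Fn pfnConvert, const char* pszUTF8)
{
    assert(pfnConvert != NULL && pszUTF8 != NULL);
    int nLen = static_cast<int>(std::strlen(pszUTF8));

    // First call: NULL buffer, so the required UTF-16 length is returned.
    int nUTF16Len = pfnConvert(NULL, pszUTF8, nLen);

    // One extra unit for the null-terminator; the vector zero-fills its elements.
    std::vector<unsigned short> buffer(nUTF16Len + 1);

    // Second call: nLen + 1 includes the source null-terminator.
    pfnConvert(&buffer[0], pszUTF8, nLen + 1);
    return buffer;
}

// Stand-in converter for this sketch only: handles ASCII input and nothing else.
int AsciiOnlyUtf8To16(unsigned short* pwszUTF16, const char* pszUTF8, int nUTF8Count)
{
    int n = 0;
    while (n < nUTF8Count && pszUTF8[n] != 0)
    {
        if (pwszUTF16)
            pwszUTF16[n] = static_cast<unsigned char>(pszUTF8[n]);
        ++n;
    }
    // Terminate only if the source null was reached before the count limit.
    if (pwszUTF16 && n < nUTF8Count)
        pwszUTF16[n] = 0;
    return n;
}
```

Calling ToUtf16Vector(&AsciiOnlyUtf8To16, "hello") yields a vector of six units: the five letters followed by the null-terminator.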
UTF8To16 and UTF16To8 have no dependencies and can be used in place of the MultiByteToWideChar and WideCharToMultiByte Win32 APIs, which do not support UTF-8 on Windows 9X, NT 3.5 and versions of CE, and which are not available on other platforms.