CMarkup UTF8To16 Method

static int CMarkup::UTF8To16(
  unsigned short* pwszUTF16,
  const char* pszUTF8,
  int nUTF8Count
  );

UTF8To16 converts the UTF-8 string in pszUTF8 to UTF-16 in the pwszUTF16 string buffer. It uses the same arguments as the ANSI C mbstowcs function, but instead of converting from the locale charset it converts from UTF-8.

Update December 17, 2008: With CMarkup release 10.1 the UTF-16 string type in the UTF8To16 and UTF16To8 functions changed from wchar_t* to unsigned short*, since wchar_t means UTF-32 on Linux and OS X.
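As a quick way to see why wchar_t is not a portable UTF-16 unit, the following sketch reports the bit width of each type on the current platform. The helper names are hypothetical and are not part of CMarkup.

```cpp
#include <cassert>
#include <climits>

// Hypothetical helpers for illustration only; not part of CMarkup.
// Width of wchar_t in bits: platform dependent.
inline int WcharBits()  { return (int)(sizeof(wchar_t) * CHAR_BIT); }
// Width of unsigned short in bits: 16 on the mainstream desktop platforms.
inline int UShortBits() { return (int)(sizeof(unsigned short) * CHAR_BIT); }
```

On Windows WcharBits() is typically 16, while on Linux and OS X it is typically 32, so a wchar_t buffer there holds UTF-32 code points rather than UTF-16 units.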

The pszUTF8 source must be a UTF-8 string; it is processed up to its null-terminator or nUTF8Count bytes, whichever comes first. nUTF8Count is the maximum number of UTF-8 bytes to convert and should include the null-terminator if a null-terminated result is desired. If pwszUTF16 is NULL, nothing is written and the function returns the number of UTF-16 units required (i.e. the UTF-16 length). If pwszUTF16 is not NULL, it is filled with the result string; there is no parameter for the buffer size, so the buffer must be large enough! The result is null-terminated only if a null byte is encountered in pszUTF8 before nUTF8Count bytes have been processed. When pwszUTF16 is not NULL, the return value is the number of UTF-8 bytes converted rather than the UTF-16 size.
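To make the two return-value modes concrete, here is a minimal stand-in that follows the contract described above. It is not CMarkup's implementation: the name Utf8To16Sketch is hypothetical, it stops at the first malformed sequence, and it assumes that neither return value counts the null-terminator (the text above does not say).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in, NOT CMarkup's code. Same argument order as UTF8To16.
// Assumption: the null-terminator is not counted in either return value.
int Utf8To16Sketch(unsigned short* pwszUTF16, const char* pszUTF8, int nUTF8Count)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(pszUTF8);
    int nBytes = 0; // UTF-8 bytes consumed so far
    int nUnits = 0; // UTF-16 units produced so far
    while (nBytes < nUTF8Count) {
        unsigned char lead = p[nBytes];
        if (lead == 0) {
            // Null reached before nUTF8Count: terminate the result
            if (pwszUTF16) pwszUTF16[nUnits] = 0;
            break;
        }
        // Sequence length from the lead byte (0 marks a stray continuation byte)
        int nSeq = lead < 0x80 ? 1 : lead < 0xC0 ? 0 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        if (nSeq == 0 || nBytes + nSeq > nUTF8Count) break;
        unsigned long cp = (nSeq == 1) ? lead : (lead & (0xFF >> (nSeq + 1)));
        bool ok = true;
        for (int i = 1; i < nSeq && ok; ++i) {
            unsigned char c = p[nBytes + i];
            ok = (c & 0xC0) == 0x80;       // continuation bytes look like 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) break;
        if (cp >= 0x10000) {
            // Outside the BMP: encode as a surrogate pair (two UTF-16 units)
            cp -= 0x10000;
            if (pwszUTF16) {
                pwszUTF16[nUnits]     = (unsigned short)(0xD800 | (cp >> 10));
                pwszUTF16[nUnits + 1] = (unsigned short)(0xDC00 | (cp & 0x3FF));
            }
            nUnits += 2;
        } else {
            if (pwszUTF16) pwszUTF16[nUnits] = (unsigned short)cp;
            nUnits += 1;
        }
        nBytes += nSeq;
    }
    // NULL buffer: UTF-16 length; otherwise: UTF-8 bytes converted
    return pwszUTF16 ? nBytes : nUnits;
}
```

Under these assumptions, a 4-byte UTF-8 sequence yields a length of 2 in the NULL-buffer mode, which is why the required size should be queried rather than inferred from the byte count.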

The following example converts the letter z from UTF-16 to UTF-8, and then back to UTF-16. Because the UTF-16 string type is unsigned short* rather than wchar_t*, the UTF16To8 call is passed a small array holding 0x007A, the UTF-16 code unit for z, followed by a null-terminator. The UTF8To16 call is passed the wszUTF16 buffer to receive the result, "z", and the length of the UTF-8 source + 1 so that the null-terminator is included.

char szUTF8[5];
unsigned short wszUTF16[3];
const unsigned short wszZ[] = { 0x007A, 0 }; // UTF-16 "z", null-terminated
int nUTFLen;
nUTFLen = CMarkup::UTF16To8(szUTF8,wszZ,5); // z
Check( strcmp(szUTF8,"z") == 0 );
nUTFLen = CMarkup::UTF8To16(wszUTF16,szUTF8,nUTFLen+1); // +1 includes the null-terminator
Check( wszUTF16[0] == 0x007A && wszUTF16[1] == 0 ); // "z" in UTF-16

Here is an example to demonstrate the common technique of passing a NULL result buffer so that the function returns the necessary result length, before allocating the result buffer and calling the function again.

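const char* pszTest = "hello";
unsigned short* pwszBuffer;
int nLen = (int)strlen( pszTest );
// First call: a NULL buffer returns the UTF-16 length needed
int nUTF16Len = CMarkup::UTF8To16(NULL,pszTest,nLen);
pwszBuffer = new unsigned short[nUTF16Len+1];
// Second call: nLen+1 includes the null-terminator in the result
CMarkup::UTF8To16(pwszBuffer,pszTest,nLen+1);
// Convert back the same way: query the UTF-8 length, then fill the CString
nLen = CMarkup::UTF16To8(NULL,pwszBuffer,0);
CString csTest;
CMarkup::UTF16To8(csTest.GetBuffer(nLen),pwszBuffer,nLen);
csTest.ReleaseBuffer(nLen);
delete [] pwszBuffer;
Check( strcmp(csTest,pszTest) == 0 );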
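The same query-then-fill pattern can be wrapped so that the buffer is owned by a std::vector instead of new and delete []. The sketch below is an illustration only: ToUtf16 and AsciiStandIn are hypothetical names, and AsciiStandIn is an ASCII-only placeholder with the same signature as UTF8To16 so that the block compiles without CMarkup.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Any function with the UTF8To16 signature
typedef int (*Utf8To16Fn)(unsigned short*, const char*, int);

// Hypothetical wrapper: query the length with a NULL buffer, then fill.
std::vector<unsigned short> ToUtf16(Utf8To16Fn convert, const char* pszUTF8)
{
    int nLen = (int)std::strlen(pszUTF8);
    int nUnits = convert(NULL, pszUTF8, nLen);       // UTF-16 length required
    std::vector<unsigned short> buf(nUnits + 1, 0);  // +1 for the null-terminator
    convert(&buf[0], pszUTF8, nLen + 1);             // +1 so the result is terminated
    return buf;
}

// ASCII-only placeholder converter, NOT CMarkup's code.
// For ASCII input the byte count and the UTF-16 unit count are equal.
int AsciiStandIn(unsigned short* out, const char* in, int count)
{
    int n = 0;
    while (n < count && in[n] != 0) {
        if (out) out[n] = (unsigned char)in[n];
        ++n;
    }
    if (out && n < count) out[n] = 0; // null reached before count: terminate
    return n;
}
```

With CMarkup available, the static member could presumably be passed directly, as in ToUtf16(&CMarkup::UTF8To16, pszTest), since a static member function converts to an ordinary function pointer.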

UTF8To16 and UTF16To8 have no dependencies and can be used in place of the MultiByteToWideChar and WideCharToMultiByte Win32 APIs, which do not support UTF-8 on Windows 9X, NT 3.5 and versions of CE, and which are not available on other platforms.