CMarkup UTF16To8 Method

static int CMarkup::UTF16To8(
  char *pszUTF8,
  const unsigned short* pwszUTF16,
  int nUTF8Count
  );

UTF16To8 converts the UTF-16 string in pwszUTF16 to UTF-8 in the pszUTF8 string buffer. It uses the same arguments as the ANSI C wcstombs function, but instead of converting to the locale charset it converts to UTF-8.

Update December 17, 2008: With CMarkup release 10.1 the UTF-16 string type in the UTF16To8 and UTF8To16 functions changed from wchar_t* to unsigned short*, since wchar_t is 32 bits wide (UTF-32) on Linux and OS X.

The pwszUTF16 source must be a null-terminated UTF-16 string. If pszUTF8 is NULL, the number of bytes required is returned and nUTF8Count is ignored. Otherwise pszUTF8 is filled with the result string. nUTF8Count is the size of the pszUTF8 buffer in bytes; if a null terminator is desired in pszUTF8, nUTF8Count must be large enough to include it. The return value is the number of bytes, excluding the null terminator.
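
The NULL-buffer behavior described above lends itself to a two-call pattern: ask for the required size first, then allocate and convert. The following is a minimal sketch of that pattern. So that it compiles on its own, it uses a hypothetical stand-in, Utf16To8Sketch, written to follow the contract described here; it is not CMarkup's implementation, and its handling of unpaired surrogates (replacing them with '?') and of a too-small buffer (stopping before a partial character) are assumptions of the sketch. The helper name Utf16ToUtf8String is also hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in following the contract described above.
// NOT CMarkup's implementation; replace with CMarkup::UTF16To8 in real code.
static int Utf16To8Sketch(char* pszUTF8, const unsigned short* pwszUTF16, int nUTF8Count)
{
    int nLen = 0;
    while (*pwszUTF16)
    {
        unsigned long c = *pwszUTF16++;
        if (c >= 0xD800 && c <= 0xDBFF && *pwszUTF16 >= 0xDC00 && *pwszUTF16 <= 0xDFFF)
            c = 0x10000UL + ((c - 0xD800) << 10) + (unsigned long)(*pwszUTF16++ - 0xDC00);
        else if (c >= 0xD800 && c <= 0xDFFF)
            c = '?'; // assumption: unpaired surrogate becomes '?'

        // Encode the code point into 1 to 4 UTF-8 bytes
        char buf[4];
        int n;
        if (c < 0x80) { buf[0] = (char)c; n = 1; }
        else if (c < 0x800)
        {
            buf[0] = (char)(0xC0 | (c >> 6));
            buf[1] = (char)(0x80 | (c & 0x3F));
            n = 2;
        }
        else if (c < 0x10000UL)
        {
            buf[0] = (char)(0xE0 | (c >> 12));
            buf[1] = (char)(0x80 | ((c >> 6) & 0x3F));
            buf[2] = (char)(0x80 | (c & 0x3F));
            n = 3;
        }
        else
        {
            buf[0] = (char)(0xF0 | (c >> 18));
            buf[1] = (char)(0x80 | ((c >> 12) & 0x3F));
            buf[2] = (char)(0x80 | ((c >> 6) & 0x3F));
            buf[3] = (char)(0x80 | (c & 0x3F));
            n = 4;
        }

        if (pszUTF8)
        {
            if (nLen + n > nUTF8Count)
                return nLen; // assumption: stop before writing a partial character
            for (int i = 0; i < n; ++i)
                pszUTF8[nLen + i] = buf[i];
        }
        nLen += n;
    }
    if (pszUTF8 && nLen < nUTF8Count)
        pszUTF8[nLen] = 0; // null-terminate only if the buffer has room
    return nLen;
}

// Two-call pattern: query the size with a NULL buffer, then convert.
static std::string Utf16ToUtf8String(const unsigned short* pwszUTF16)
{
    int nNeeded = Utf16To8Sketch(nullptr, pwszUTF16, 0); // count is ignored here
    std::vector<char> buf(nNeeded + 1);                  // +1 for the null terminator
    int nWritten = Utf16To8Sketch(&buf[0], pwszUTF16, nNeeded + 1);
    assert(nWritten == nNeeded);
    return std::string(&buf[0], nWritten);
}
```

With CMarkup available, the two calls inside Utf16ToUtf8String would presumably go to CMarkup::UTF16To8 with the same arguments.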

The following example converts the supplementary-plane code point 0x64321 from UTF-16 to UTF-8, and then back to UTF-16. This is an example of a (rare) character that requires a surrogate pair in UTF-16 (see UTF-16 Files and the Byte Order Mark (BOM)) and 4 bytes in UTF-8. Note that the 5 passed into UTF16To8 allows for the null terminator, which is needed both for the strcmp check and so that UTF8To16 generates a null terminator in its UTF-16 result.

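// Surrogate pair for code point 0x64321, followed by the null terminator
unsigned short szUTF16[3] = { 0xD950, 0xDF21, 0 };
char szUTF8[5]; // 4 UTF-8 bytes plus the null terminator
int nUTFLen = CMarkup::UTF16To8(szUTF8,szUTF16,5); // returns 4
Check( strcmp(szUTF8,"\xF1\xA4\x8C\xA1") == 0 );
unsigned short szUTF16Result[3];
// nUTFLen+1 includes the null terminator so the UTF-16 result is terminated too
nUTFLen = CMarkup::UTF8To16(szUTF16Result,szUTF8,nUTFLen+1);
Check( szUTF16Result[0] == szUTF16[0] );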
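
The byte values in the example above can be checked by hand. The sketch below works through the arithmetic: it combines the surrogate pair 0xD950, 0xDF21 into the code point 0x64321 and then splits that code point into the four UTF-8 bytes F1 A4 8C A1. The function names CombineSurrogates and EncodeFourByteUtf8 are hypothetical helpers written for this illustration and are not part of CMarkup.

```cpp
#include <cassert>

// Combine a UTF-16 high/low surrogate pair into a single code point.
// Each surrogate carries 10 bits; the result is offset by 0x10000.
static unsigned long CombineSurrogates(unsigned short hi, unsigned short lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF); // high (lead) surrogate range
    assert(lo >= 0xDC00 && lo <= 0xDFFF); // low (trail) surrogate range
    return 0x10000UL
         + ((unsigned long)(hi - 0xD800) << 10)
         + (unsigned long)(lo - 0xDC00);
}

// Encode a code point in the range 0x10000..0x10FFFF as 4 UTF-8 bytes:
// 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
static void EncodeFourByteUtf8(unsigned long cp, unsigned char out[4])
{
    assert(cp >= 0x10000UL && cp <= 0x10FFFFUL);
    out[0] = (unsigned char)(0xF0 | (cp >> 18));          // top 3 bits
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F)); // next 6 bits
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));  // next 6 bits
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));         // low 6 bits
}
```

For the pair in the example, 0xD950 - 0xD800 = 0x150 and 0xDF21 - 0xDC00 = 0x321, so the code point is 0x10000 + (0x150 << 10) + 0x321 = 0x64321.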

UTF16To8 and UTF8To16 have no dependencies and can be used in place of the WideCharToMultiByte and MultiByteToWideChar Win32 APIs respectively. Those APIs do not support UTF-8 on Windows 9X, NT3.5 and some versions of CE, and are not available on other platforms.