UTF-8 in CMarkup and CDataEdit

UTF-8 (RFC 3629) is the default encoding for XML and is widely used in web-enabled applications. It is a multi-byte encoding of Unicode that uses from 1 to 4 bytes per character (the original definition allowed sequences of up to 6 bytes, but RFC 3629 limits them to 4). A plain ASCII file (with all character codes under 128) is already a valid UTF-8 file, so you need not worry about UTF-8 support if your XML stays within this limitation.
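
The byte counts above follow directly from the lead byte of each sequence. The following is a minimal sketch, not part of CMarkup or CDataEdit; the function names Utf8SequenceLength and IsPlainAscii are hypothetical and chosen only for illustration.

```cpp
#include <string>

// Number of bytes in a UTF-8 sequence, judged from its lead byte.
// Returns 0 for a byte that cannot start a sequence: a continuation
// byte (0x80-0xBF) or a value RFC 3629 never allows as a lead byte.
int Utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;                   // 0xxxxxxx: plain ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 110xxxxx
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 11110xxx, up to U+10FFFF
    return 0;
}

// True if every byte is below 128, meaning the text is plain ASCII
// and is therefore already valid UTF-8.
bool IsPlainAscii(const std::string& text)
{
    for (unsigned char c : text)
        if (c >= 0x80) return false;
    return true;
}
```

A document for which IsPlainAscii returns true can be handled by ASCII-only code without any UTF-8 awareness.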

Most text-handling code in Western software is compatible with UTF-8 because UTF-8 uses the same zero byte as the string terminator and shares the lower 128 characters with ASCII. This is significant because every byte of a multi-byte UTF-8 sequence has a value of 128 or above: if you see a less-than sign, you know it is a less-than sign regardless of the bytes around it, because a less-than sign is one of those standard 128 characters.
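
As an illustration of that property, a byte-by-byte scan for markup characters works on UTF-8 text without decoding it. This is only a sketch; CountLessThan is a hypothetical helper, not a CMarkup method.

```cpp
#include <cstddef>
#include <string>

// Count the '<' characters in UTF-8 text by comparing raw bytes.
// Every byte of a multi-byte UTF-8 sequence has its high bit set, so
// a byte equal to 0x3C can only be a real less-than sign.
std::size_t CountLessThan(const std::string& utf8)
{
    std::size_t count = 0;
    for (char c : utf8)
        if (c == '<') ++count;
    return count;
}
```

For example, a paragraph element whose content is two three-byte characters still yields exactly two matches, one for each tag.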

To draw UTF-8 text in Windows, you convert the UTF-8 to wide char (Windows Unicode, which uses 2-byte units: UCS-2 originally, UTF-16 since Windows 2000). The CDataEdit control displays UTF-8 on all Win32 operating systems back to Windows 95. Windows NT and its successors, and Windows CE, support wide char in all string APIs, allowing you to compile your program for Windows UNICODE with all of your strings in wide char, automatically calling the wide char Win32 APIs. But CDataEdit is made to work without the UNICODE build by using only the specific wide char text drawing APIs available since Windows 95 and explicitly converting from UTF-8 to wide char.
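
The explicit conversion step can be sketched in portable C++ as follows. This is not CDataEdit's actual code: Utf8ToWide is a hypothetical name, the result is held in a std::u16string rather than a Windows wide char buffer, and the sketch emits UTF-16 surrogate pairs for characters above U+FFFF. Malformed or truncated sequences become U+FFFD; overlong forms and encoded surrogates are not rejected here. Where the platform supports the CP_UTF8 code page, the Win32 function MultiByteToWideChar performs the same job.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 into 16-bit units (assumption: UTF-16 output).
std::u16string Utf8ToWide(const std::string& utf8)
{
    std::u16string wide;
    const std::size_t n = utf8.size();
    std::size_t i = 0;
    while (i < n)
    {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp;
        int extra; // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { wide.push_back(0xFFFD); ++i; continue; } // not a lead byte
        ++i;
        bool ok = true;
        for (int k = 0; k < extra; ++k)
        {
            // Each continuation byte must look like 10xxxxxx.
            if (i >= n || (static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80)
            {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);
            ++i;
        }
        if (!ok || cp > 0x10FFFF) { wide.push_back(0xFFFD); continue; }
        if (cp < 0x10000)
        {
            wide.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Split into a high and low surrogate.
            cp -= 0x10000;
            wide.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            wide.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return wide;
}
```

The converted buffer is what would then be handed to a wide char text drawing call.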

Supporting UTF-8 in the CMarkup classes requires nothing beyond ASCII support, because methods such as GetData and GetAttrib return the UTF-8 strings without being aware that they are UTF-8. In addition, UNICODE builds of MFC CMarkup and CMarkupMSXML are supported for Windows NT (4.0 and above) and Windows CE (which is wide char only). These builds store the document in wide char, and all of the methods have wide char parameters.

MBCS is the Windows multibyte (or double-byte) character system. Unlike UTF-8, MBCS does not let you assume that a byte which looks like an ASCII character really is that character, because it might be the trail byte of a double-byte character (for example, 0x5C, the backslash, is a valid trail byte in Shift-JIS). Therefore, you must step through any MBCS string sequentially, keeping track of where the boundaries between characters are. You can compile CMarkup for MBCS, in which case it internally processes the XML document as MBCS using the MFC and Windows functions and macros that support MBCS string manipulation.
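
The sequential stepping described above can be sketched as follows. This is an illustration only, not CMarkup's implementation: IsLeadByteSjis is a hypothetical stand-in for a lead-byte test such as the Win32 IsDBCSLeadByte function, hard-coded here to the Shift-JIS lead-byte ranges so the example does not depend on the system code page. The example searches for a backslash (0x5C), a byte that is also a valid trail byte in Shift-JIS.

```cpp
#include <cstddef>
#include <string>

// Assumption: Shift-JIS lead bytes are 0x81-0x9F and 0xE0-0xFC.
bool IsLeadByteSjis(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Find the first real occurrence of an ASCII character in a
// double-byte string by stepping one character at a time.
// Returns std::string::npos if the character is not present.
std::size_t FindCharMbcs(const std::string& text, char target)
{
    std::size_t i = 0;
    while (i < text.size())
    {
        unsigned char b = static_cast<unsigned char>(text[i]);
        if (IsLeadByteSjis(b) && i + 1 < text.size())
        {
            i += 2; // skip the lead byte and its trail byte together
            continue;
        }
        if (text[i] == target) return i;
        ++i;
    }
    return std::string::npos;
}
```

A plain byte search on the three bytes 0x95 0x5C 0x5C reports a match at offset 1, inside the double-byte character, while FindCharMbcs reports the genuine backslash at offset 2.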