UTF-16 Files and the Byte Order Mark (BOM)
UTF-16 (RFC 2781) is a Unicode encoding (the same as "wide char" on Windows) that is sometimes used for files on Windows operating systems. It is the exception rather than the rule: even Windows programs compiled for UNICODE (wide char) normally store text to file in a single-byte or multi-byte encoding (ANSI or UTF-8). See ANSI and Unicode files and C++ strings for more about this.
As with UTF-8, a UTF-16 file does not need an XML Declaration specifying the encoding (see the XML 1.0 Spec, 4.3.3). While UTF-8 and ANSI characters appear the same when they are in the ASCII range, UTF-16 is unmistakably different because it uses at least two bytes per character.
Update December 17, 2008: As of CMarkup release 10.1, all UTF-16 files are fully supported in the CMarkup Evaluation version. This includes UTF-16LE and UTF-16BE across little endian and big endian platforms, Windows, Linux and OS X.
CMarkup looks for the Byte Order Mark (BOM) at the beginning of the file, which indicates that it is a UTF-16 file (LE "Little Endian" or BE "Big Endian"). This BOM consists of two bytes with the hexadecimal values ff fe for LE and fe ff for BE. Here is an example of the beginning of a UTF-16LE XML file:
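For a file whose text starts with <?xml, those leading bytes would be ff fe 3c 00 3f 00 78 00 6d 00 6c 00. The BOM check itself can be sketched as follows. This is a minimal illustration rather than CMarkup's actual code, and the names Utf16Bom and DetectUtf16Bom are hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper (not part of CMarkup): classify the first two bytes of a buffer.
enum class Utf16Bom { None, LittleEndian, BigEndian };

Utf16Bom DetectUtf16Bom(const unsigned char* data, std::size_t size)
{
    if (size < 2)
        return Utf16Bom::None;          // too short to hold a BOM
    if (data[0] == 0xFF && data[1] == 0xFE)
        return Utf16Bom::LittleEndian;  // ff fe
    if (data[0] == 0xFE && data[1] == 0xFF)
        return Utf16Bom::BigEndian;     // fe ff
    return Utf16Bom::None;              // no UTF-16 BOM found
}
```

A caller would read the first bytes of the file into a buffer and pass them in. The sketch only looks at two bytes, so it does not try to tell a UTF-16LE BOM apart from a UTF-32LE BOM (ff fe 00 00).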
If the UTF-16 BOM is discovered on Load, the MDF_UTF16LEFILE or MDF_UTF16BEFILE CMarkup document flag is set, and the 2 bytes of the BOM are not loaded into the document string in memory. In the Save method, if one of those flags is set, the document is stored in the file in the corresponding UTF-16 encoding with the BOM inserted at the start of the file. By setting or unsetting the flag, you can control whether the file is saved with UTF-16 encoding.
xml.SetDocFlags( xml.GetDocFlags() | xml.MDF_UTF16LEFILE ); // on
xml.SetDocFlags( xml.GetDocFlags() & ~xml.MDF_UTF16LEFILE ); // off
Before Windows 2000 and XP, these wide char files were usually called UCS-2. UTF-16 is an extension of UCS-2 such that all UCS-2 files are legitimate UTF-16 files. UCS-2 is a very simple Unicode encoding: each two-byte integer is simply the Unicode value for the character. The only problem is that it can only represent values up to 65535 (ffff), a range called the "basic multilingual plane", and the Unicode standard has expanded beyond that limit. So now we have UTF-16, which is the same as UCS-2 except that it takes advantage of a special range of values between d800 and dfff that are not valid Unicode characters. UTF-16 uses "surrogate" values in that range, and two of them go together to represent a Unicode value beyond the UCS-2 limit.
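The surrogate arithmetic can be sketched as below. This is an illustration under stated assumptions, not CMarkup code: EncodeUtf16 is a hypothetical name, and the input is assumed to be a valid Unicode value no larger than 10ffff and outside the d800 to dfff range.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of CMarkup): encode one Unicode value as UTF-16 code units.
// Assumes codePoint <= 0x10FFFF and is not itself in the surrogate range.
std::vector<std::uint16_t> EncodeUtf16(std::uint32_t codePoint)
{
    std::vector<std::uint16_t> units;
    if (codePoint < 0x10000) {
        // Within the UCS-2 range: one 16-bit unit holding the value itself
        units.push_back(static_cast<std::uint16_t>(codePoint));
    } else {
        // Beyond ffff: subtract 0x10000, leaving a 20-bit number split into two halves
        std::uint32_t v = codePoint - 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));   // high surrogate, top 10 bits
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))); // low surrogate, bottom 10 bits
    }
    return units;
}
```

For example, the value 1D11E becomes the pair D834 DD1E, while 20AC stays a single unit exactly as in UCS-2.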
Below are examples of three different Unicode encodings (UTF-8, UCS-2 and UTF-16). The last row is an example of a Unicode value beyond ffff that uses a surrogate pair in UTF-16; it will most likely appear as a little square because only rare fonts support it.
|Character||Unicode value, hex (decimal)||UTF-8||UCS-2||UTF-16|
|€ (Euro)||20AC (8364)||E2 82 AC||20AC||20AC|
|中 (middle)||4E2D (20013)||E4 B8 AD||4E2D||4E2D|
|𝄞 (Treble Clef)||1D11E (119070)||F0 9D 84 9E||N/A||D834 DD1E|
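The UTF-8 byte sequences for these characters follow from the standard UTF-8 bit layout, which can be sketched as below. EncodeUtf8 is a hypothetical name used for illustration; it is not CMarkup's UTF16To8 function, and it assumes a valid Unicode value no larger than 10ffff.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of CMarkup): encode one Unicode value as UTF-8 bytes.
// Assumes cp <= 0x10FFFF; no validation is performed.
std::vector<std::uint8_t> EncodeUtf8(std::uint32_t cp)
{
    std::vector<std::uint8_t> out;
    if (cp < 0x80) {
        // ASCII range: one byte, identical to ANSI/ASCII
        out.push_back(static_cast<std::uint8_t>(cp));
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

Values up to ffff need at most three bytes, and anything that requires a surrogate pair in UTF-16 takes four bytes in UTF-8.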
In a UNICODE (wide char) Windows build of CMarkup, a UTF-16LE file does not need to be converted. In a normal plain C build, it is converted to UTF-8 on Load and back to UTF-16 on Save.
CMarkup provides UTF-16 / UTF-8 conversion functions (UTF8To16, UTF16To8) in case the MultiByteToWideChar and
WideCharToMultiByte Win32 APIs are not usable. They are in Windows CE 1.01 but not Windows CE 1.0, and on early versions of Windows they exist but do not support UTF-8 (Windows 95/98/ME/NT3.5 and versions of CE). By default, CMarkup builds with
MARKUP_WINCONV and utilizes the Win32 APIs, but you can turn this functionality off by defining