UTF-16 Files and the Byte Order Mark (BOM)

UTF-16 (RFC 2781) is a Unicode encoding (the same as "wide char" on Windows) that is sometimes used for files on Windows operating systems. Even Windows programs that are compiled for UNICODE (wide char) normally store text to file in a single-byte or multi-byte encoding (ANSI or UTF-8). See ANSI and Unicode files and C++ strings for more about this.

As with UTF-8, a UTF-16 file does not need an XML Declaration specifying the encoding (see the XML 1.0 Spec, 4.3.3). While UTF-8 and ANSI characters appear the same when they are in the ASCII range, UTF-16 is unmistakably different because it uses at least two bytes per character.

comment posted Support for UTF-16?

Scott Wilson 11-Feb-2004

Does CMarkup support UTF-16? I've found a sample file that it thinks is not well-formed... The first two bytes of the UTF-16 file are 0xFF 0xFE. Do you know why these characters are there?

Update December 17, 2008: As of CMarkup release 10.1, all UTF-16 files are fully supported in the CMarkup Evaluation version. This includes UTF-16LE and UTF-16BE across little endian and big endian platforms, Windows, Linux and OS X.

UTF-16 BOM

CMarkup looks for the Byte Order Mark (BOM) at the beginning of the file indicating that it is a UTF-16 file (LE "Little Endian" or BE "Big Endian"). This BOM consists of two bytes with the hexadecimal values ff and fe for LE, or fe and ff for BE. For example, a UTF-16LE XML file that starts with an XML declaration begins with the bytes ff fe 3c 00 3f 00 78 00 6d 00 6c 00: the BOM followed by <?xml at two bytes per character.
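The BOM check itself is small. The following is a minimal sketch of one way to do it, not CMarkup's actual code; the enum Utf16Bom and the function DetectUtf16Bom are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for this sketch; they are not CMarkup identifiers.
enum class Utf16Bom { None, LittleEndian, BigEndian };

// Inspect the first two bytes of a buffer for a UTF-16 byte order mark.
Utf16Bom DetectUtf16Bom(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    if (size >= 2) {
        if (data[0] == 0xFF && data[1] == 0xFE)
            return Utf16Bom::LittleEndian;   // ff fe
        if (data[0] == 0xFE && data[1] == 0xFF)
            return Utf16Bom::BigEndian;      // fe ff
    }
    return Utf16Bom::None;                   // too short, or no UTF-16 BOM
}
```

A fuller check might also rule out the UTF-32LE mark (ff fe 00 00), which starts with the same two bytes as the UTF-16LE BOM.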

If the UTF-16 BOM is discovered on Load, the MDF_UTF16LEFILE or MDF_UTF16BEFILE CMarkup document flag is set, and the 2 bytes of the BOM are not loaded into the document string in memory. In the Save method, if that flag is set, the document is stored in the file in the corresponding UTF-16 encoding with the BOM inserted at the start of the file. By setting or unsetting this flag, you can control whether the file is saved with UTF-16 encoding.
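Saving in UTF-16 amounts to writing the BOM and then each 16-bit unit in the chosen byte order. Here is a rough sketch of that step, assuming the document is already held as UTF-16 code units; Utf16ToBytesWithBom is a hypothetical helper for illustration and is not part of CMarkup.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of CMarkup: serialize UTF-16 code units to
// bytes in the requested byte order, with the BOM written first.
std::string Utf16ToBytesWithBom(const std::u16string& text, bool bigEndian)
{
    std::string bytes;
    bytes.reserve((text.size() + 1) * 2);
    auto put = [&](char16_t unit) {
        char hi = static_cast<char>((unit >> 8) & 0xFF);
        char lo = static_cast<char>(unit & 0xFF);
        if (bigEndian) { bytes += hi; bytes += lo; }
        else           { bytes += lo; bytes += hi; }
    };
    put(0xFEFF);                 // U+FEFF becomes ff fe (LE) or fe ff (BE)
    for (char16_t unit : text)
        put(unit);
    // The BOM and every code unit each contributed exactly two bytes.
    assert(bytes.size() == (text.size() + 1) * 2);
    return bytes;
}
```

Writing the BOM through the same routine as the text is a design choice that keeps the two byte orders in one code path: the single value U+FEFF comes out as ff fe or fe ff depending on the flag.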

The MDF_UTF16LEFILE and MDF_UTF16BEFILE flags are also supported in the ReadTextFile and WriteTextFile functions.

Unicode encodings

Before Windows 2000 and XP, these wide char files were usually called UCS-2. UTF-16 is an extension of UCS-2 such that all UCS-2 files are legitimate UTF-16 files. UCS-2 is a very simple Unicode encoding: each two-byte integer is simply the Unicode value for the character. The only problem is that it can only represent values up to 65535 (ffff), a range called the "basic multilingual plane", and the Unicode standard has expanded beyond that limit. So now we have UTF-16, which is the same as UCS-2 except that it takes advantage of a special range of values between d800 and dfff that are not valid Unicode characters. UTF-16 uses "surrogate" values in that range, and two of them go together (a high surrogate from d800 to dbff followed by a low surrogate from dc00 to dfff) to represent a Unicode value beyond the UCS-2 limit.
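The arithmetic behind a surrogate pair is short enough to show. This is a minimal sketch of the standard UTF-16 calculation; the function name ToSurrogatePair is made up for this example and is not a CMarkup function.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper for illustration. Splits a code point above ffff into
// a UTF-16 surrogate pair: {high (d800..dbff), low (dc00..dfff)}.
std::pair<char16_t, char16_t> ToSurrogatePair(std::uint32_t codePoint)
{
    assert(codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    std::uint32_t v = codePoint - 0x10000;                        // 20 bits remain
    char16_t high = static_cast<char16_t>(0xD800 + (v >> 10));    // top 10 bits
    char16_t low  = static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low 10 bits
    return {high, low};
}
```

For the value 1D11E this gives 1D11E - 10000 = D11E, whose top ten bits are 34 and low ten bits are 11E, so the pair is D834 DD1E.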

Below are examples of three characters in different Unicode encodings. The last row is a Unicode value beyond ffff that requires a surrogate pair in UTF-16, and it will most likely appear as a little square because only rare fonts support it.

character	Unicode	UTF-8	UCS-2	UTF-16
€ (Euro)	20AC (8364)	E2 82 AC	20AC	20AC
中 (middle)	4E2D (20013)	E4 B8 AD	4E2D	4E2D
𝄞 (Treble Clef)	1D11E (119070)	F0 9D 84 9E	N/A	D834 DD1E
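The UTF-8 column can be reproduced from the Unicode values with the standard UTF-8 bit layout. The sketch below is for illustration only; EncodeUtf8 is a hypothetical name, not a CMarkup function.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
std::string EncodeUtf8(std::uint32_t cp)
{
    // Surrogate values and anything above 10FFFF are not encodable.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this helper, 20AC encodes to E2 82 AC, 4E2D to E4 B8 AD, and 1D11E to F0 9D 84 9E, matching the table.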

In a UNICODE (wide char) Windows build of CMarkup, a UTF-16LE file does not need to be converted. In a normal plain C build, it is converted to UTF-8 on Load and back to UTF-16 on Save.

CMarkup provides its own conversion functions between UTF-16 and UTF-8 (UTF8To16, UTF16To8) in case the MultiByteToWideChar and WideCharToMultiByte Win32 APIs are not usable. Those APIs are in Windows CE 1.01 but not Windows CE 1.0, and on early versions of Windows (Windows 95/98/ME/NT3.5 and versions of CE) they exist but do not support UTF-8. By default, CMarkup builds with MARKUP_WINCONV and uses the Win32 APIs, but you can turn this functionality off by defining MARKUP_STDCONV.
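To give a feel for what a portable UTF-16 to UTF-8 conversion involves without the Win32 APIs, here is a small sketch. It is not CMarkup's UTF16To8, whose signature and implementation are not shown in this article; the name Utf16ToUtf8Sketch and the choice to replace unpaired surrogates with U+FFFD are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Illustration only: not CMarkup's UTF16To8.
std::string Utf16ToUtf8Sketch(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a high and a low surrogate into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;   // unpaired surrogate: substitute U+FFFD (assumption)
        }
        assert(cp <= 0x10FFFF);   // holds by construction at this point
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The input is taken as code units already in native byte order, so a UTF-16BE file read on a little endian machine would need its bytes swapped before this step.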