UTF-8 Files and the Preamble

The UTF-8 preamble, also known as the UTF-8 BOM or signature, is a 3-byte sequence at the start of a file indicating that the file is UTF-8. Like the UTF-16 BOM, it is not particular to XML; it can appear in any text file. Unlike the UTF-16 BOM, however, "Byte Order Mark" is not a correct term in this case because UTF-8 has no byte order. In hex, the UTF-8 preamble is ef bb bf.
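
As a rough illustration of the byte check, here is a minimal C++ sketch that tests whether a buffer starts with ef bb bf. The function name HasUtf8Preamble and the constant kUtf8Preamble are hypothetical and are not part of CMarkup.

```cpp
#include <cstddef>
#include <string>

// The three bytes of the UTF-8 preamble: ef bb bf.
static const unsigned char kUtf8Preamble[3] = { 0xEF, 0xBB, 0xBF };

// Hypothetical helper (not a CMarkup function): returns true if the
// buffer begins with the 3-byte UTF-8 preamble.
bool HasUtf8Preamble(const std::string& data)
{
    if (data.size() < 3)
        return false; // too short to hold the preamble
    for (std::size_t i = 0; i < 3; ++i)
    {
        // Compare as unsigned bytes, since char may be signed.
        if (static_cast<unsigned char>(data[i]) != kUtf8Preamble[i])
            return false;
    }
    return true;
}
```

The comparison is done on unsigned bytes because the preamble values are all above 0x7f, where a plain char may be negative.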

While the UTF-16 BOM is standard, the UTF-8 preamble is not widely accepted, and it is discouraged on UNIX operating systems. Microsoft Notepad writes the UTF-8 preamble when it saves UTF-8 documents, but does not need it to recognize UTF-8 encoding when it loads files. The 3-byte UTF-8 preamble is not recommended in XML files because a file that begins with an ASCII less-than sign is already assumed to be UTF-8 unless the XML Declaration specifies another encoding.

If circumstances leave your file with a UTF-8 preamble, CMarkup 7.2 (developer version) will support it in much the same way it supports the UTF-16 BOM. If the UTF-8 preamble is found on Load, the MDF_UTF8PREAMBLE CMarkup flag is set, and the 3 bytes of the preamble are not loaded into the document. In the Save method, the 3-byte preamble is written to the start of the file if that flag is set. By setting or clearing this flag, you can control whether the preamble is written to the saved file.

xml.SetDocFlags( xml.GetDocFlags() | xml.MDF_UTF8PREAMBLE ); // on: Save writes the preamble
xml.SetDocFlags( xml.GetDocFlags() & ~xml.MDF_UTF8PREAMBLE ); // off: Save omits the preamble

This flag is also supported in the ReadTextFile and WriteTextFile functions.
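
The load and save behavior described above can be pictured with a standalone sketch. This is not CMarkup's implementation: TextDoc, LoadFromBytes, and SaveToBytes are hypothetical names, the utf8Preamble member merely plays the role of MDF_UTF8PREAMBLE, and in-memory strings stand in for files.

```cpp
#include <string>

// Hypothetical stand-in for a document and its flag; not the CMarkup class.
struct TextDoc
{
    std::string text;          // document content, without the preamble
    bool utf8Preamble = false; // plays the role of MDF_UTF8PREAMBLE
};

// "Load": if the bytes start with ef bb bf, set the flag and keep the
// 3 preamble bytes out of the document text.
TextDoc LoadFromBytes(const std::string& bytes)
{
    TextDoc doc;
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0)
    {
        doc.utf8Preamble = true;
        doc.text = bytes.substr(3);
    }
    else
    {
        doc.text = bytes;
    }
    return doc;
}

// "Save": write the preamble first only if the flag is set.
std::string SaveToBytes(const TextDoc& doc)
{
    std::string bytes;
    if (doc.utf8Preamble)
        bytes = "\xEF\xBB\xBF";
    bytes += doc.text;
    return bytes;
}
```

In this sketch, clearing utf8Preamble between the load and the save drops the preamble from the output, and setting it adds one to a document that was loaded without it.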