ANSI and Unicode Files

If you are trying to save your ANSI file into a Unicode encoding see Convert ANSI file to Unicode. This article explains more about XML file encodings in relation to your C++ project build options.

UTF-8 is the recommended encoding for XML files. If your text is all ASCII, then it is also valid UTF-8. But if your XML file is not ASCII and not Unicode (i.e. not UTF-8 or UTF-16/UCS-2), then you really should use an XML Declaration to declare your encoding. For example, if it is the default U.S. ANSI charset, use this declaration:

<?xml version="1.0" encoding="windows-1252"?>

Overview of ANSI Code Pages

In Windows programming, the term ANSI is used to collectively refer to all the non-Unicode single-byte and multibyte character sets that can be selected as the system locale code page. These include the single-byte charsets for Europe and the "double-byte" charsets for Chinese, Japanese and Korean, which actually use one or two bytes per character.

In character sets like Windows-1252 and Cyrillic Windows-1251, a character is always one byte with a value up to 255. The problem is that values 128 to 255 (hex 80 to FF) are assigned to specific characters that can differ between charsets. For example, the Euro symbol (€) is hex 80 (decimal 128) in Windows-1252, but in Windows-1251 the value hex 80 represents the capital letter DJE (Ђ) and hex 88 is the Euro. So, when the computer sees the byte value hex 80, the character it displays depends on the current system locale setting.

character        Windows-1252 (Latin-1)   Windows-1251 (Cyrillic)   UTF-8      UTF-16
€ (Euro)         80 (128)                 88 (136)                  E2 82 AC   20AC
® (Registered)   AE (174)                 AE (174)                  C2 AE      00AE
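
The table above can be sketched in code. This is a minimal illustration, not part of CMarkup: the enum CodePage and the functions DecodeByte and EncodeUtf8 are hypothetical names, and only the byte values discussed here are mapped (a real converter needs the full code page tables).

```cpp
#include <string>

// Hypothetical enum for this sketch; not a Windows or CMarkup type.
enum class CodePage { Windows1252, Windows1251 };

// Map a single byte to a Unicode code point for the few values shown
// in the table. Returns 0 for bytes this sketch does not cover.
char32_t DecodeByte(CodePage cp, unsigned char b)
{
    if (b < 0x80) return b;                  // ASCII is the same in both
    if (cp == CodePage::Windows1252) {
        if (b == 0x80) return 0x20AC;        // Euro sign
        if (b == 0xAE) return 0x00AE;        // Registered sign
    } else {
        if (b == 0x80) return 0x0402;        // Cyrillic capital letter DJE
        if (b == 0x88) return 0x20AC;        // Euro sign
        if (b == 0xAE) return 0x00AE;        // Registered sign
    }
    return 0;
}

// Encode one code point below U+10000 as UTF-8 (1 to 3 bytes).
std::string EncodeUtf8(char32_t c)
{
    std::string out;
    if (c < 0x80) {
        out += static_cast<char>(c);
    } else if (c < 0x800) {
        out += static_cast<char>(0xC0 | (c >> 6));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (c >> 12));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    }
    return out;
}
```

The same byte, hex 80, decodes to two different code points depending on the code page passed in, while the UTF-8 bytes for the Euro are E2 82 AC regardless of locale.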

In double-byte character sets like GB2312 (Chinese Simplified), a character in the ASCII range is one byte, but other characters can be one or two bytes: some byte values of hex 80 and above are lead bytes that are interpreted together with a trailing byte. For example, the value of the ASCII character z stays 7A in each encoding (it is only widened to 007A in UTF-16), but the sample Chinese character is encoded completely differently:

character    GB2312   UTF-8      UTF-16
z            7A       7A         007A
中 (middle)  D6 D0    E4 B8 AD   4E2D
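
As a rough sketch of what lead-byte handling means, the following counts characters rather than bytes in a GB2312 string. CountDbcsChars is a hypothetical name, and the code assumes the common EUC-CN byte form of GB2312, in which every byte with the high bit set starts a two-byte character; other double-byte code pages use different lead-byte ranges.

```cpp
#include <cstddef>
#include <string>

// Count characters in a GB2312 (EUC-CN form) byte string.
// Assumption: a byte of hex 80 or above is a lead byte followed by
// exactly one trail byte. A dangling lead byte at the end of the
// string is counted as one character.
std::size_t CountDbcsChars(const std::string& bytes)
{
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        bool twoBytes = (b >= 0x80) && (i + 1 < bytes.size());
        i += twoBytes ? 2 : 1;
        ++count;
    }
    return count;
}
```

For the bytes 7A D6 D0 (the letter z followed by the sample Chinese character), the byte length is 3 but the character count is 2.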

Windows systems allow non-UNICODE programs to work with only one ANSI character set at a time. To change charsets, you have to change the computer's system locale setting. This is a major advantage of Unicode: it covers all of these charsets at once, so no locale change is needed.

The charset of your program

In terms of charsets, there are 3 ways to build your C++ program:

  • UNICODE means use wide char UTF-16 (or its precursor UCS-2)
  • MBCS means use the Windows system locale charset, single or multi byte (DBCS)
  • Plain C byte-based charset (non-MBCS): although the Windows APIs are MBCS, the C functions assume 1 byte per character

    UNICODE build

    In a UNICODE build, you simply convert the file to wide char on read, and back to what it was on write. If the file is UTF-8, CMarkup does the conversion using the UTF16To8 and UTF8To16 functions. If the file is already UTF-16, no conversion is necessary (unless the byte order is reversed). If the file is ANSI, you need to implement ANSI to UNICODE conversion (sample source code is provided in remarks in the CMarkup::ReadTextFile and CMarkup::WriteTextFile functions in Markup.cpp).
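
To give an idea of what a UTF-8 to UTF-16 conversion involves, here is a minimal sketch. It is not CMarkup's UTF8To16 implementation: the name Utf8ToUtf16Sketch is hypothetical, and the code assumes well-formed UTF-8 input and does no validation.

```cpp
#include <cstddef>
#include <string>

// Convert well-formed UTF-8 to UTF-16 (no error checking).
std::u16string Utf8ToUtf16Sketch(const std::string& utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte tells us how many bytes the character uses.
        if (b < 0x80)      { cp = b;        len = 1; }
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }
        else               { cp = b & 0x07; len = 4; }
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            // Code points above U+FFFF become a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 | (cp >> 10));
            out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}
```

With the sample character from the earlier table, the UTF-8 bytes E4 B8 AD become the single UTF-16 unit 4E2D.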

    MBCS build

    In Visual C++, a non-UNICODE program by default has MBCS defined. With MBCS you use your system locale ANSI charset internally, and you often load and save the same ANSI charset in a file. For string operations to work correctly, the computer must have the same current system locale code page as the encoding of the file. This is the most convenient arrangement when files are not shared between machines in different locales, but it is not recommended because it is short-sighted: the files are misread as soon as they reach a machine with a different locale. UTF-8 is the recommended encoding for XML files.

    With MBCS, if your file is Unicode (UTF-8 or UTF-16) it must be converted to your ANSI code page when it is loaded and back to Unicode when it is saved. If a Unicode file contains characters that are not supported in your locale charset then those characters will be lost when the file is loaded, and if it is saved they will be lost in the file as well.
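
The loss described above can be illustrated with a small round trip. This sketch uses ISO-8859-1 as a stand-in for an ANSI code page, because its mapping to Unicode is simply the identity for U+0000 to U+00FF (Windows-1252 differs in the range hex 80 to 9F). The function names ToLatin1Lossy and FromLatin1 are hypothetical, and the '?' substitution is an assumption about how an unsupported character is replaced.

```cpp
#include <string>

// Narrow UTF-16 text to a single-byte charset. Any character the
// charset cannot represent is replaced with '?', which is where the
// information is lost.
std::string ToLatin1Lossy(const std::u16string& wide)
{
    std::string out;
    for (char16_t c : wide)
        out += (c <= 0xFF) ? static_cast<char>(c) : '?';
    return out;
}

// Widen the single-byte text back to UTF-16.
std::u16string FromLatin1(const std::string& narrow)
{
    std::u16string out;
    for (char ch : narrow)
        out += static_cast<char16_t>(static_cast<unsigned char>(ch));
    return out;
}
```

Loading text containing ® (00AE) and 中 (4E2D) and then saving it again yields ® followed by a question mark: the Chinese character cannot be recovered from the narrowed string.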

    CMarkup will check the XML Declaration encoding and if it is UTF-8 convert it to the ANSI locale charset on file read and back to UTF-8 on file write.
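
A check of the XML Declaration encoding could look roughly like the following. This is only a sketch with a hypothetical function name, DeclaredEncoding, and it is not how CMarkup itself parses the declaration; it assumes the document starts directly with the declaration and is stored in an ASCII-compatible encoding.

```cpp
#include <cstddef>
#include <string>

// Return the value of the encoding attribute in the XML Declaration,
// or an empty string if there is no declaration or no encoding.
std::string DeclaredEncoding(const std::string& doc)
{
    if (doc.compare(0, 5, "<?xml") != 0) return "";
    std::size_t end = doc.find("?>");
    if (end == std::string::npos) return "";
    std::size_t pos = doc.find("encoding", 5);
    if (pos == std::string::npos || pos > end) return "";
    // The value may be quoted with either double or single quotes.
    pos = doc.find_first_of("\"'", pos);
    if (pos == std::string::npos || pos > end) return "";
    char quote = doc[pos];
    std::size_t close = doc.find(quote, pos + 1);
    if (close == std::string::npos || close > end) return "";
    return doc.substr(pos + 1, close - pos - 1);
}
```

A caller would then compare the result case-insensitively against names such as UTF-8 or windows-1252 to decide whether a conversion is needed.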

    CMarkup is a bit slower in an MBCS build than a non-MBCS build because it must compute the character length (1 or 2 bytes) as it processes strings. Although UTF-8 is multibyte, this character length computation is not necessary during most UTF-8 string processing because of how UTF-8 is designed. See the next section on using UTF-8 internally.
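
The difference can be shown with a byte search. In some double-byte code pages a trail byte can fall in the ASCII range, so a search has to track character boundaries, whereas in UTF-8 every byte of a multibyte character is hex 80 or above, so a plain byte search for an ASCII character is safe. The sketch below uses Shift-JIS (Windows code page 932) as the double-byte example, which is an assumption beyond the charsets named in this article; the function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Plain byte search: no character length computation at all.
std::size_t FindByte(const std::string& s, char target)
{
    return s.find(target);
}

// Lead byte ranges of Shift-JIS as used by Windows code page 932.
bool IsShiftJisLead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// DBCS-aware search: steps over both bytes of each two-byte
// character so that a trail byte is never mistaken for the target.
std::size_t FindCharDbcs(const std::string& s, char target)
{
    std::size_t i = 0;
    while (i < s.size()) {
        if (IsShiftJisLead(static_cast<unsigned char>(s[i]))) {
            i += 2;
            continue;
        }
        if (s[i] == target) return i;
        ++i;
    }
    return std::string::npos;
}
```

The katakana character ソ is 83 5C in Shift-JIS, and 5C is also the ASCII backslash, so FindByte reports a false match at offset 1 while FindCharDbcs correctly finds nothing. In UTF-8 the same character is E3 82 BD, and the plain FindByte is already correct.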

    non-MBCS build

    If you do not have _MBCS (nor UNICODE) in your compiler defines, CMarkup is designed to use UTF-8 internally. This allows you to keep the document in Unicode in memory even in a non-UNICODE build, and avoids the multi-language text loss described above for the MBCS build. You should convert strings between UTF-8 and ANSI only as needed for display or for passing to Windows APIs, using UTF8ToA and AToUTF8.

    If your file is UTF-8, this allows you to keep the document as UTF-8 in memory too without conversion on load and save.

    UTF-16 files

    UTF-8 is the recommended Unicode encoding for XML, but sometimes UTF-16 is used. As of release 6.6 the developer version of CMarkup will detect UTF-16 files (this includes UCS-2) containing a Byte Order Mark (BOM). See UTF-16 Files and the Byte Order Mark (BOM).
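
Detecting a Byte Order Mark comes down to comparing the first bytes of the file against the known signatures. The following is a minimal sketch rather than CMarkup's own detection code; BomKind and DetectBom are hypothetical names.

```cpp
#include <cstddef>
#include <string>

enum class BomKind { None, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of a file's contents for a BOM.
BomKind DetectBom(const std::string& data)
{
    auto at = [&data](std::size_t i) {
        return static_cast<unsigned char>(data[i]);
    };
    if (data.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return BomKind::Utf8;      // UTF-8 preamble
    if (data.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return BomKind::Utf16LE;   // little-endian UTF-16
    if (data.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return BomKind::Utf16BE;   // big-endian UTF-16
    return BomKind::None;
}
```

If the detected byte order does not match the machine's own, each 16-bit unit has to be byte-swapped before the text is used as wide char.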

    In a non-MBCS build where UTF-8 is used internally, CMarkup converts the document with UTF16To8 and UTF8To16.

    In an MBCS build the document is converted with wcstombs and mbstowcs (you may need to call setlocale). In Visual C++, unless MARKUP_STDCONV is defined, WideCharToMultiByte and MultiByteToWideChar are used instead. See non-Unicode text handling in CMarkup.
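
For reference, a conversion with the standard C functions can be wrapped as below. This is a sketch, not the code in Markup.cpp: MbToWide and WideToMb are hypothetical helper names, and the result depends on the locale selected with setlocale before they are called.

```cpp
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Convert a multibyte string in the current C locale to wide char.
// Returns an empty string if the input is invalid in that locale.
std::wstring MbToWide(const std::string& mb)
{
    // There can never be more wide characters than input bytes.
    std::vector<wchar_t> buf(mb.size() + 1);
    std::size_t n = std::mbstowcs(buf.data(), mb.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return std::wstring();
    return std::wstring(buf.data(), n);
}

// Convert a wide string to multibyte in the current C locale.
// Returns an empty string if a character cannot be represented.
std::string WideToMb(const std::wstring& wide)
{
    // Each wide character needs at most MB_CUR_MAX bytes.
    std::vector<char> buf(wide.size() * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(buf.data(), wide.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return std::string();
    return std::string(buf.data(), n);
}
```

A program would typically call std::setlocale(LC_ALL, "") once at startup so that these functions use the user's locale instead of the default "C" locale, in which only ASCII is guaranteed to convert.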

    See Also:

    UTF-8 Files and the Preamble
    Setting the XML Declaration With CMarkup