ANSI and Unicode files and C++ strings

How do you handle XML and HTML text files whether your strings in memory are UNICODE/wchar_t, MBCS, or plain C? Loading and saving text in a Unicode or ANSI file may involve a conversion to and from the text encoding of your C++ strings.

If you are just trying to use a text editor to save your ANSI file in a Unicode encoding, see Convert ANSI file to Unicode.

UTF-8 is the recommended encoding for XML and HTML files. If your text is all ASCII, then it is also valid UTF-8. But if your XML file is not ASCII and not Unicode (i.e. not UTF-8 or UTF-16/UCS-2) then you really should use an XML Declaration to declare your encoding. For example if it is the default U.S. ANSI charset use this declaration:

<?xml version="1.0" encoding="windows-1252"?>

Overview of ANSI Code Pages

In Windows programming, the term ANSI is used to collectively refer to all the non-Unicode single and multibyte character sets that can be selected as the system locale code page. These include the single byte systems for Europe and the "double byte" for Chinese, Japanese and Korean which actually use one or two bytes per character.

In character sets like Windows-1252 and Cyrillic Windows-1251, a character is always one byte with a value up to 255. The problem is that values 128 to 255 (hex 80 to ff) are assigned to specific characters that can differ between charsets. For example, the Euro symbol (€) is hex 80 (decimal 128) in Windows-1252, but in Windows-1251 the value hex 80 represents the capital letter DJE (Ђ) and hex 88 is the Euro. So when the computer sees the value hex 80, the character it displays depends on the current system locale setting.

character Windows-1252 (Latin-1) Windows-1251 (Cyrillic) UTF-8 UTF-16
€ (Euro) 80 (128) 88 (136) E2 82 AC 20AC
® (Registered) AE (174) AE (174) C2 AE 00AE
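
The UTF-8 column of the table follows mechanically from the Unicode code point shown in the UTF-16 column. Here is a minimal sketch of that encoding step; `EncodeUtf8` is a hypothetical helper written for this illustration, not a CMarkup function, and it assumes the input is a valid code point (it does not reject surrogate values).

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper for illustration only (not part of CMarkup).
// Encodes one Unicode code point (assumed valid, up to U+10FFFF) as UTF-8.
std::string EncodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx (ASCII is unchanged)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, `EncodeUtf8(0x20AC)` gives the three bytes E2 82 AC for the Euro and `EncodeUtf8(0xAE)` gives C2 AE for the Registered sign, matching the table.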

In double-byte character sets like GB2312 (Chinese Simplified), a character in the ASCII range is always one byte, but a character outside that range can be one or two bytes (some byte values of hex 80 and above are lead bytes that are interpreted together with a trailing byte). For example, the value of the ASCII character z is the same in each byte-based encoding and only widens to 16 bits in UTF-16, but the sample Chinese character is completely different:

character GB2312 UTF-8 UTF-16
z 7A 7A 007A
中 (middle) D6 D0 E4 B8 AD 4E2D
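
To make the lead byte idea concrete, here is a rough sketch of counting characters in a double-byte string. It assumes a deliberately simplified rule, that any byte from hex 81 to FE is a lead byte followed by exactly one trail byte; real code pages have narrower ranges, and on Windows you would normally ask the system (for example with IsDBCSLeadByteEx) instead. Both function names are made up for this example.

```cpp
#include <cstddef>
#include <string>

// Simplifying assumption for illustration: bytes hex 81..FE are lead bytes.
// Real double-byte code pages define narrower lead byte ranges.
bool IsLeadByteSketch(unsigned char b)
{
    return b >= 0x81 && b <= 0xFE;
}

// Counts characters (not bytes) in a double-byte encoded string.
std::size_t CountDbcsChars(const std::string& bytes)
{
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        // A lead byte consumes the following trail byte too, if there is one.
        if (IsLeadByteSketch(b) && i + 1 < bytes.size())
            i += 2;
        else
            i += 1;
        ++count;
    }
    return count;
}
```

For the GB2312 bytes 7A D6 D0 (z followed by 中) this counts two characters in three bytes.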

Windows systems allow non-UNICODE programs to work with one ANSI character set at a time. To change charsets, you have to change the computer's system locale setting (one advantage of Unicode is that you do not have to do this).

Your charset build options on Windows

In terms of the charset of your in-memory text, there are 3 ways to build your Windows C++ program:

  • UNICODE means use wide char UTF-16 (or its precursor UCS-2)
  • MBCS means use the Windows system locale charset, single or multi byte (DBCS)
  • Plain C byte-based charset: Windows APIs are MBCS, but C functions assume 1 byte per character. For CMarkup this is Unicode UTF-8.

Your charset build options on Linux, OS X and other platforms

    OS X and Linux C++ programs do not have the ANSI and UNICODE modes that Windows programs do. But you can use char-based strings which are usually UTF-8, or wchar_t-based UTF-32 strings. The corresponding STL string classes are std::string and std::wstring. Although I recommend UTF-8 which requires less space and fewer conversions, CMarkup supports wide strings on these platforms with the MARKUP_WCHAR define.

    Loading and saving text files

    Update December 17, 2008: With CMarkup release 10.1, the Save and Load methods, and the underlying WriteTextFile and ReadTextFile functions have greatly expanded character conversion capabilities to handle most common ANSI and double-byte encodings specified in the XML declaration or HTML Content-Type meta tag (see GetDeclaredEncoding) of the document.
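
To show what "the encoding specified in the XML declaration" means in practice, here is a minimal sketch that pulls the encoding name out of a declaration at the start of a document. This is not CMarkup's GetDeclaredEncoding; `DeclaredEncodingSketch` is a hypothetical stand-in that handles only the XML declaration (no HTML meta tag) and assumes the document text is in an ASCII-compatible byte encoding.

```cpp
#include <string>

// Hypothetical illustration, not CMarkup's GetDeclaredEncoding.
// Returns the value of encoding="..." in a leading XML declaration,
// or an empty string if there is no declaration or no encoding attribute.
std::string DeclaredEncodingSketch(const std::string& doc)
{
    const std::string::size_type npos = std::string::npos;

    // The declaration must be the very first thing in the document.
    if (doc.compare(0, 5, "<?xml") != 0)
        return "";
    std::string::size_type end = doc.find("?>");
    if (end == npos)
        return "";

    // Only look inside the declaration itself.
    std::string decl = doc.substr(0, end);
    std::string::size_type pos = decl.find("encoding");
    if (pos == npos)
        return "";

    // The value may be in single or double quotes.
    pos = decl.find_first_of("\"'", pos);
    if (pos == npos)
        return "";
    std::string::size_type close = decl.find(decl[pos], pos + 1);
    if (close == npos)
        return "";
    return decl.substr(pos + 1, close - pos - 1);
}
```

Given the windows-1252 declaration shown near the top of this article, the sketch returns "windows-1252"; a document with no declaration yields an empty string, in which case UTF-8 would be the natural assumption for XML.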

    Whichever charset you use in memory, CMarkup converts between that and the encoding of your file when you read or write the file.

    On Windows, the CMarkup text conversion functionality uses the MultiByteToWideChar and WideCharToMultiByte Windows APIs (see the preprocessor define MARKUP_WINCONV). In Visual C++, MARKUP_WINCONV is selected automatically. In g++ for cygwin and other compilers for Windows, add MARKUP_WINCONV to your preprocessor defines or specify -DMARKUP_WINCONV on the command line.

    If not on Windows, CMarkup uses the iconv API available on OS X, Linux and some other platforms (see the preprocessor define MARKUP_ICONV). The g++ GNU compiler selects MARKUP_ICONV automatically. Put MARKUP_ICONV in your preprocessor defines if needed. On OS X you may need to specify the iconv library to the linker. The following command is used to compile the CMarkup test program on OS X:

    g++ main.cpp Markup.cpp MarkupTest.cpp -liconv
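
For readers who have not used iconv directly, here is a minimal sketch of a one-shot conversion with it. This is not how CMarkup calls iconv internally; `ConvertWithIconv` is a hypothetical wrapper, the output buffer size is a generous guess that suits conversions to UTF-8, and the encoding names accepted (such as "WINDOWS-1252") depend on the iconv implementation installed on your system.

```cpp
#include <iconv.h>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical wrapper for illustration; not CMarkup code.
// Converts input from the fromCode encoding to the toCode encoding.
std::string ConvertWithIconv(const std::string& input,
                             const char* fromCode, const char* toCode)
{
    // Note the argument order: destination encoding first.
    iconv_t cd = iconv_open(toCode, fromCode);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed: unsupported encoding pair?");

    std::string in = input;                       // iconv takes non-const pointers
    std::string out(input.size() * 4 + 4, '\0');  // assumed large enough for this sketch
    char* inPtr = &in[0];
    std::size_t inLeft = in.size();
    char* outPtr = &out[0];
    std::size_t outLeft = out.size();

    // Some platforms declare the second parameter as const char**;
    // a cast may be needed there.
    std::size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv failed: invalid input or buffer too small");

    out.resize(out.size() - outLeft);             // keep only the bytes written
    return out;
}
```

Converting the single Windows-1252 byte hex 80 to UTF-8 this way should give the Euro bytes E2 82 AC from the table earlier in this article.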

    Without a conversion API

    Define MARKUP_STDCONV to use neither Windows conversion APIs nor iconv. See non-Unicode text handling in CMarkup.

    If you do not use either MARKUP_WINCONV or MARKUP_ICONV, CMarkup still supports conversion between Unicode encodings (UTF-8 and UTF-16 and wchar_t which is UTF-32 on OS X and Linux), as well as the system locale encoding if you call setlocale. A non-Unicode encoding is supported with the ANSI C mbtowc and wctomb functions to convert to/from Unicode.
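
As an illustration of the ANSI C route, here is a small sketch that converts a string in the current locale's multibyte encoding to wide characters with mbtowc. It is not taken from CMarkup; `LocaleToWide` is a made-up name, and substituting a question mark for an undecodable byte is a choice made for this example.

```cpp
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical illustration; not CMarkup code.
// Converts bytes in the current C locale's encoding to a wide string.
// Call setlocale first so that mbtowc knows which encoding to expect.
std::wstring LocaleToWide(const std::string& bytes)
{
    std::wstring wide;
    std::mbtowc(nullptr, nullptr, 0);   // reset any shift state

    const char* p = bytes.data();
    const char* end = p + bytes.size();
    while (p < end) {
        wchar_t wc = 0;
        int n = std::mbtowc(&wc, p, static_cast<std::size_t>(end - p));
        if (n < 0) {
            // Byte sequence not valid in this locale: substitute and move on.
            wide += L'?';
            ++p;
        } else if (n == 0) {
            // Embedded null character.
            wide += L'\0';
            ++p;
        } else {
            wide += wc;
            p += n;
        }
    }
    return wide;
}
```

A program would typically call `setlocale(LC_ALL, "")` once at startup to adopt the system locale before using a function like this.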

    MBCS build

    This is the default build mode in older Windows Visual C++ projects using the system locale ANSI code page in strings and Windows APIs. It means that your strings in memory are not Unicode.

    CMarkup is a bit slower in an MBCS build than in a non-MBCS build because it must compute the character length (1 or 2 bytes) as it processes strings. UTF-8 does not have this cost: although it is multibyte, every byte of a multibyte UTF-8 character has a value of hex 80 or above, so ASCII markup characters can be found by scanning bytes and the character length computation is not necessary during most UTF-8 string processing. See the next section on using UTF-8 internally.
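
Here is a sketch of why the character length matters when scanning double-byte text. It uses Shift-JIS (Windows code page 932) as the example because its trail bytes can fall in the ASCII range: the katakana ソ is the bytes 83 5C, and 5C is also the backslash. The function names are hypothetical and this is not CMarkup's parsing code.

```cpp
#include <cstddef>
#include <string>

// Shift-JIS lead bytes are hex 81..9F and E0..FC.
bool IsShiftJisLeadByte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Finds an ASCII byte in Shift-JIS text, skipping trail bytes so that a
// trail byte with the same value is not mistaken for the target.
std::size_t FindByteShiftJis(const std::string& text, char target)
{
    for (std::size_t i = 0; i < text.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(text[i]);
        if (IsShiftJisLeadByte(b)) {
            ++i;            // step over the trail byte as well
            continue;
        }
        if (text[i] == target)
            return i;
    }
    return std::string::npos;
}
```

A plain `std::string::find('\\')` on the bytes 83 5C reports a backslash at offset 1 that is really half of a character, while the lead byte aware search does not. With UTF-8 the plain byte search is already safe, which is the design property the paragraph above relies on.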

    If a Unicode file is loaded into an MBCS build, you will lose any characters not supported by the charset being used in memory. You can likewise lose characters if the file is in a different non-Unicode character set than the one in memory. The conversion process generally replaces these lost characters with a question mark and reports in the result string (GetResult) that characters were lost during conversion.
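
The question mark substitution can be sketched in a few lines. To keep the example self-contained it converts to ISO-8859-1 (Latin-1), where every code point up to hex FF maps to the same byte value, rather than to a Windows code page, which would need a lookup table or a conversion API. `ToLatin1Lossy` and `LossyResult` are hypothetical names, and this does not reproduce how CMarkup formats its GetResult report.

```cpp
#include <cstddef>
#include <string>

// Result of a lossy conversion: the converted bytes plus a count of
// characters that had no representation in the target charset.
struct LossyResult {
    std::string bytes;
    std::size_t lost;
};

// Hypothetical illustration; converts Unicode code points to ISO-8859-1,
// replacing anything above U+00FF with a question mark.
LossyResult ToLatin1Lossy(const std::u32string& text)
{
    LossyResult result{std::string(), 0};
    for (char32_t cp : text) {
        if (cp <= 0xFF) {
            result.bytes += static_cast<char>(static_cast<unsigned char>(cp));
        } else {
            result.bytes += '?';
            ++result.lost;
        }
    }
    return result;
}
```

Converting the three characters a, 中, z this way yields "a?z" with one character reported lost, which is the kind of outcome to expect when a Chinese UTF-8 file is loaded into a Western European MBCS build.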

     

    comment posted Re: Upgrade to 10.1

    David Emmerich 28-Nov-2008

    We had discovered a problem reading a UTF-8 XML file and I was researching if CMarkup had a way of converting UTF-8 to ANSI. When I noticed the 10.1 version on your web site, I figured I had better try the latest before doing anything else. I downloaded it and [replaced 6.5] with no change in my code, everything just worked, the conversion happening automatically. Good job!

    Great. Just FYI, the automatic conversion from UTF-8 to ANSI in memory in an MBCS build was implemented in CMarkup developer release 7.3, and made part of the evaluation version in release 9.0. CMarkup release 10.1 further adds the ability to automatically convert from an ANSI file even if it does not correspond to the system locale ANSI encoding.

    UTF-8 build (non-MBCS)

    If you do not have MBCS (nor UNICODE nor MARKUP_WCHAR) in your compiler defines, CMarkup is designed to use UTF-8 internally. This allows you to keep the document in Unicode in memory and avoid the multi-language text loss described above for the MBCS build. Convert strings between UTF-8 and ANSI with UTF8ToA and AToUTF8 only as needed, for displaying text or passing it to Windows APIs.

    If your file is UTF-8 (as recommended for XML), this allows you to keep the document as UTF-8 in memory too without conversion on load and save.

    UTF-16 files

    UTF-8 is the recommended Unicode encoding for XML, but sometimes UTF-16 is used. CMarkup will detect UTF-16 files (this includes UCS-2) containing a Byte Order Mark (BOM). See UTF-16 Files and the Byte Order Mark (BOM).
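
Detecting a BOM comes down to comparing the first two or three bytes of the file. The following is a minimal sketch of that check, not CMarkup's own detection code; `DetectBom` and `BomKind` are names invented for this example, and it does not try to distinguish a UTF-16 little-endian BOM from the UTF-32 one that starts with the same two bytes.

```cpp
#include <cstddef>
#include <string>

enum class BomKind { None, Utf8, Utf16LE, Utf16BE };

// Hypothetical illustration; inspects the leading bytes of a file.
BomKind DetectBom(const std::string& bytes)
{
    auto at = [&bytes](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };

    // UTF-8 preamble: EF BB BF
    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return BomKind::Utf8;
    // UTF-16 little-endian: FF FE (U+FEFF stored low byte first)
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return BomKind::Utf16LE;
    // UTF-16 big-endian: FE FF
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return BomKind::Utf16BE;
    return BomKind::None;
}
```

A file with no BOM returns `BomKind::None`, in which case the declared encoding or a default has to decide how the bytes are interpreted.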

    See Also:

    UTF-8 Files and the Preamble
    Setting the XML Declaration With CMarkup
    wchar_t string on Linux, OS X and Windows