ANSI and Unicode files and C++ strings
How do you handle XML and HTML text files, whether your strings in memory are wide char, MBCS, or plain C strings? Loading and saving text in a Unicode or ANSI file may involve a conversion to and from the text encoding of your C++ strings.
Hey, in case you are just trying to use a text editor to save your ANSI file into a Unicode encoding see Convert ANSI file to Unicode.
UTF-8 is the recommended encoding for XML and HTML files. If your text is all ASCII, then it is also valid UTF-8. But if your XML file is not ASCII and not Unicode (i.e. not UTF-8 or UTF-16/UCS-2) then you really should use an XML Declaration to declare your encoding. For example if it is the default U.S. ANSI charset use this declaration:
<?xml version="1.0" encoding="windows-1252"?>
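To illustrate how a loader might act on such a declaration, here is a minimal sketch that pulls the encoding name out of an XML declaration at the start of a document. The function name `declared_encoding` is hypothetical and this is not CMarkup's own GetDeclaredEncoding; it assumes the document starts with ASCII-compatible bytes.

```cpp
#include <string>

// Hypothetical helper: return the encoding named in a leading XML declaration,
// or an empty string if there is no declaration or no encoding attribute.
std::string declared_encoding(const std::string& doc) {
    if (doc.compare(0, 5, "<?xml") != 0) return "";
    const std::size_t end = doc.find("?>");
    if (end == std::string::npos) return "";
    const std::string decl = doc.substr(0, end);   // text of the declaration only

    std::size_t pos = decl.find("encoding");
    if (pos == std::string::npos) return "";
    pos = decl.find_first_of("\"'", pos);           // opening quote of the value
    if (pos == std::string::npos) return "";
    const std::size_t close = decl.find(decl[pos], pos + 1);  // matching quote
    if (close == std::string::npos) return "";
    return decl.substr(pos + 1, close - pos - 1);
}
```

Called on a document that begins with the declaration above, this returns `windows-1252`; a document with no declaration yields an empty string, in which case UTF-8 is the usual assumption.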
Overview of ANSI Code Pages
In Windows programming, the term ANSI is used to collectively refer to all the non-Unicode single and multibyte character sets that can be selected as the system locale code page. These include the single byte systems for Europe and the "double byte" for Chinese, Japanese and Korean which actually use one or two bytes per character.
In character sets like Windows-1252 and Cyrillic Windows-1251, a character is always one byte with a value up to 255. The problem is that values 128 to 255 (hex 80 to ff) are assigned to specific characters that can differ between charsets. For example, the Euro symbol (€) is hex 80 (decimal 128) in Windows-1252, but in Windows-1251 the value hex 80 represents the capital letter DJE (Ђ) and hex 88 is the Euro. So when the computer sees the value 80, how it displays that character depends on the current system locale setting.
|Character||Windows-1252||Windows-1251||UTF-8||Unicode code point|
|€ (Euro)||80 (128)||88 (136)||E2 82 AC||20AC|
|® (Registered)||AE (174)||AE (174)||C2 AE||00AE|
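The UTF-8 byte sequences in the table follow mechanically from the Unicode code points. The sketch below shows the standard UTF-8 bit layout; the function name `encode_utf8` is just an illustrative choice and is not part of CMarkup.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as a UTF-8 byte sequence.
std::string encode_utf8(char32_t cp) {
    // Valid code points only: at most U+10FFFF and not a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx (ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, `encode_utf8(0x20AC)` yields the three bytes E2 82 AC and `encode_utf8(0x00AE)` yields C2 AE, matching the Euro and Registered rows.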
In double-byte character sets like GB2312 (Chinese Simplified), a character in the ASCII range is always one byte, but other characters can be one or two bytes: some byte values above hex 80 are lead bytes that are interpreted together with a trailing byte. For example, the value of the ASCII character z does not change (much) between the different encodings, but the sample Chinese character is completely different:
|Character||GB2312||UTF-8||Unicode code point|
|z||7A||7A||007A|
|中 (middle)||D6 D0||E4 B8 AD||4E2D|
Windows systems allow non-UNICODE programs to work with one ANSI character set at a time. To change charsets, you have to change the computer's system locale setting (that's why Unicode is great because you do not have to do this).
Your charset build options on Windows
In terms of the charset of your in-memory text, there are 3 ways to build your Windows C++ program:
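UNICODE means use wide char UTF-16 (or its precursor UCS-2)
MBCS means use the Windows system locale charset, single or multi byte (DBCS)
Neither defined means use char strings that CMarkup treats as UTF-8 (see the UTF-8 build section below)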
Your charset build options on Linux, OS X and other platforms
OS X and Linux C++ programs do not have the ANSI and UNICODE modes that Windows programs do. But you can use char-based strings, which are usually UTF-8, or wchar_t-based UTF-32 strings. The corresponding STL string classes are std::string and std::wstring. Although I recommend UTF-8, which requires less space and fewer conversions, CMarkup also supports wide strings on these platforms.
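To make the difference between the two representations concrete, here is a minimal sketch that decodes UTF-8 into UTF-32 code points. It uses `std::u32string` rather than `std::wstring` so the result is the same on every platform; the function name `utf8_to_utf32` is hypothetical, and the code assumes well-formed input and does no validation.

```cpp
#include <string>

// Decode UTF-8 into UTF-32 code points.
// Sketch only: assumes well-formed UTF-8 and does not reject bad sequences.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        int extra = 0;          // number of continuation bytes that follow
        char32_t cp = b;        // ASCII bytes map to themselves
        if (b >= 0xF0)      { extra = 3; cp = b & 0x07; }
        else if (b >= 0xE0) { extra = 2; cp = b & 0x0F; }
        else if (b >= 0xC0) { extra = 1; cp = b & 0x1F; }
        ++i;
        // Each continuation byte contributes its low six bits.
        for (; extra > 0 && i < in.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        out += cp;
    }
    return out;
}
```

A string holding z followed by 中 is four bytes as UTF-8 (7A E4 B8 AD) but two elements, or eight bytes, as UTF-32, which is the space trade-off mentioned above.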
Loading and saving text files
Update December 17, 2008: With CMarkup release 10.1, the Save and Load methods, and the underlying WriteTextFile and ReadTextFile functions have greatly expanded character conversion capabilities to handle most common ANSI and double-byte encodings specified in the XML declaration or HTML Content-Type meta tag (see GetDeclaredEncoding) of the document.
Whichever charset you use in memory, CMarkup converts between that and the encoding of your file when you read or write the file.
On Windows, the CMarkup text conversion functionality uses Windows APIs such as WideCharToMultiByte; see the preprocessor define MARKUP_WINCONV. In Visual C++, MARKUP_WINCONV is selected automatically. In g++ for Cygwin and other compilers for Windows, add MARKUP_WINCONV to your preprocessor defines or specify -DMARKUP_WINCONV on the command line.
If not on Windows, CMarkup uses the iconv API available on OS X, Linux and some other platforms; see the preprocessor define MARKUP_ICONV. With the GNU g++ compiler, MARKUP_ICONV is selected automatically if needed. On OS X you may need to specify the iconv library to the linker. The following command is used to compile the CMarkup test program on OS X:
g++ main.cpp Markup.cpp MarkupTest.cpp -liconv
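For readers unfamiliar with iconv, the following is a minimal sketch of a one-shot conversion with the POSIX `iconv_open`, `iconv` and `iconv_close` calls. The wrapper name `convert_encoding` is hypothetical and this is not CMarkup's internal code. It assumes the platform declares the input pointer as `char**` (as glibc does) and that the encoding names used are known to the local iconv.

```cpp
#include <iconv.h>
#include <stdexcept>
#include <string>

// Convert `in` from encoding `from` to encoding `to` in a single iconv call.
std::string convert_encoding(const std::string& in, const char* from, const char* to) {
    iconv_t cd = iconv_open(to, from);            // note the order: target first
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    std::string out(in.size() * 4 + 4, '\0');     // generous bound for this sketch
    char* src = const_cast<char*>(in.data());     // iconv does not modify the input
    size_t src_left = in.size();
    char* dst = &out[0];
    size_t dst_left = out.size();

    const size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1) throw std::runtime_error("iconv conversion failed");

    out.resize(out.size() - dst_left);            // keep only the bytes written
    return out;
}
```

With this helper, converting the single byte hex 80 from `WINDOWS-1252` to `UTF-8` should give E2 82 AC, while the same byte converted from `WINDOWS-1251` gives the UTF-8 form of Ђ instead. On OS X, link with -liconv as in the command above.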
Without a conversion API
Define MARKUP_STDCONV to use neither the Windows conversion APIs nor iconv. See non-Unicode text handling in CMarkup.
If you use neither MARKUP_WINCONV nor MARKUP_ICONV, CMarkup still supports conversion between Unicode encodings (UTF-8, UTF-16, and wchar_t, which is UTF-32 on OS X and Linux), as well as the system locale encoding if you call setlocale. A non-Unicode encoding is supported by using ANSI C functions such as wctomb to convert to and from Unicode.
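As a rough sketch of this locale-based approach, the code below sets the C locale and converts a multibyte string to wide characters with the standard `mbstowcs` function, the whole-string counterpart of the per-character ANSI C functions. The wrapper name `locale_to_wide` is hypothetical. Note that `setlocale` changes process-wide state, and which locale names are available depends on the system.

```cpp
#include <clocale>
#include <cstdlib>
#include <string>

// Convert a string in the named locale's multibyte encoding to a wide string.
// Returns an empty string if the locale is unavailable or the input is invalid.
std::wstring locale_to_wide(const std::string& in, const char* locale_name) {
    if (!std::setlocale(LC_CTYPE, locale_name)) return std::wstring();

    // A multibyte string never yields more wide characters than it has bytes.
    std::wstring out(in.size(), L'\0');
    const std::size_t n = std::mbstowcs(&out[0], in.c_str(), out.size());
    if (n == static_cast<std::size_t>(-1)) return std::wstring();

    out.resize(n);   // n is the number of wide characters produced
    return out;
}
```

Passing an empty string as the locale name selects the locale from the user's environment, which is the usual way to pick up the system locale encoding.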
The MBCS build is the default build mode in older Windows Visual C++ projects, which use the system locale ANSI code page in strings and Windows APIs. It means that your strings in memory are not Unicode.
CMarkup is a bit slower in an MBCS build than in a non-MBCS build because it must compute the character length (1 or 2 bytes) as it processes strings. This is not the case with UTF-8: although UTF-8 is multibyte, its design makes this character length computation unnecessary during most string processing. See the next section on using UTF-8 internally.
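The difference can be seen in something as simple as searching for an ASCII character. In a double-byte charset a trail byte can have the same value as an ASCII character, so the search has to step over character pairs, whereas every byte of a UTF-8 multibyte sequence is hex 80 or higher and a plain byte search is safe. The sketch below is illustrative only: the function names are hypothetical, and the lead-byte test is a deliberate simplification, since real code pages define narrower lead-byte ranges.

```cpp
#include <string>

// Simplifying assumption for this sketch: any byte from 0x81 to 0xFE is a
// DBCS lead byte. Real double-byte code pages use narrower ranges.
bool is_lead_byte(unsigned char b) { return b >= 0x81 && b <= 0xFE; }

// Find an ASCII character in DBCS text, stepping over two-byte characters so
// that a trail byte is never mistaken for the target.
std::size_t find_ascii_dbcs(const std::string& s, char target) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_lead_byte(static_cast<unsigned char>(s[i]))) {
            ++i;            // skip the trail byte as well
            continue;
        }
        if (s[i] == target) return i;
    }
    return std::string::npos;
}

// In UTF-8 no byte of a multibyte sequence is below 0x80, so an ASCII
// character can be found with an ordinary byte search.
std::size_t find_ascii_utf8(const std::string& s, char target) {
    return s.find(target);
}
```

For the three bytes 81 5C 5C (a two-byte character whose trail byte equals the backslash, followed by a real backslash), a naive byte search reports offset 1 while `find_ascii_dbcs` reports offset 2.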
If a Unicode file is loaded into an MBCS build, you will lose any characters not supported by the charset being used in memory. You can also lose characters if the file is in a different character set from the one in memory. The conversion process generally replaces these lost characters with a question mark and reports in the result string (GetResult) that characters were lost during conversion.
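The replace-and-report behavior can be sketched as follows. This is not CMarkup's code: the names `NarrowResult` and `to_latin1` are hypothetical, and ISO-8859-1 stands in for the in-memory ANSI charset because its byte values equal the first 256 Unicode code points, which keeps the example short.

```cpp
#include <string>

// Result of a lossy conversion: the converted bytes and a loss count.
struct NarrowResult {
    std::string text;   // converted single-byte text
    int lost;           // number of characters replaced with '?'
};

// Convert UTF-32 code points to ISO-8859-1 (a stand-in target charset).
// Characters the target cannot represent become '?' and are counted.
NarrowResult to_latin1(const std::u32string& in) {
    NarrowResult r{std::string(), 0};
    for (char32_t cp : in) {
        if (cp <= 0xFF) {
            r.text += static_cast<char>(cp);   // same value in ISO-8859-1
        } else {
            r.text += '?';
            ++r.lost;
        }
    }
    return r;
}
```

Converting the three characters a, ® and 中 gives the bytes 61 AE 3F with a loss count of 1, since 中 has no single-byte equivalent. A caller can check the count to decide whether to warn the user, much as the text describes for the result string.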
The automatic conversion from UTF-8 to ANSI in memory in an MBCS build was implemented in CMarkup developer release 7.3 and became part of the evaluation version in release 9.0. CMarkup release 10.1 further adds the ability to convert automatically from an ANSI file even if its encoding does not correspond to the system locale ANSI encoding.
UTF-8 build (non-MBCS)
If you do not have MBCS (or a wide-character define such as MARKUP_WCHAR) in your compiler defines, CMarkup is designed to use UTF-8 internally. This allows you to keep the document in Unicode in memory and avoid the multi-language text loss described above for the MBCS build. Convert strings to and from ANSI with UTF8ToA and AToUTF8 only as needed for display or for passing to Windows APIs.
If your file is UTF-8 (as recommended for XML), this allows you to keep the document as UTF-8 in memory too without conversion on load and save.
UTF-8 is the recommended Unicode encoding for XML, but sometimes UTF-16 is used. CMarkup will detect UTF-16 files (this includes UCS-2) containing a Byte Order Mark (BOM). See UTF-16 Files and the Byte Order Mark (BOM).
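Byte Order Mark detection itself is simple to sketch. The code below checks the first bytes of a file buffer for the standard UTF-8 and UTF-16 marks; the names `Bom` and `detect_bom` are hypothetical and the code is not taken from CMarkup. It deliberately ignores UTF-32, whose little-endian mark begins with the same two bytes as the UTF-16 little-endian mark.

```cpp
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of a buffer for a Byte Order Mark.
Bom detect_bom(const std::string& bytes) {
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0) return Bom::Utf8;     // EF BB BF
    if (bytes.compare(0, 2, "\xFF\xFE") == 0)     return Bom::Utf16LE;  // FF FE
    if (bytes.compare(0, 2, "\xFE\xFF") == 0)     return Bom::Utf16BE;  // FE FF
    return Bom::None;
}
```

A file with no mark returns `Bom::None`, in which case the loader has to fall back on the declared encoding or a default.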