Euro and other non-ASCII chars in XML with CMarkup
You might work only with text in the ASCII range (below 128), or you might have some non-ASCII text such as the Euro character or Western European characters with accents and umlauts. Here are some examples of how to handle encoding issues as you move beyond ASCII.
Using \xE2\x82\xAC for the euro is correct in your case because your string encoding is UTF-8.
When you specify a non-ASCII character in a source file on Windows, it is compiled into your program in the locale charset. So the problem with the euro symbol in sFmt is that it is in your locale's MBCS (in which the euro is represented by one byte) while CMarkup is expecting UTF-8 (which is the case when your project is set to use neither MBCS nor UNICODE). You were able to work around it by putting the UTF-8 encoding directly in the string.
If compiling for MBCS, you could have used the euro character directly in your source string, but the result would only be satisfactory as long as the program runs on a machine with the same "Language for non-Unicode programs" locale setting as yours.
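To see where the \xE2\x82\xAC bytes come from, here is a minimal sketch of UTF-8 encoding for a single Unicode code point. EncodeUtf8 is a hypothetical helper written for this illustration; it is not part of CMarkup.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a CMarkup function): encode one Unicode
// code point as a UTF-8 byte sequence.
std::string EncodeUtf8(char32_t cp)
{
    // Valid code points are at most U+10FFFF and exclude the surrogates.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx (plain ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The euro is U+20AC, which falls in the three-byte range, so EncodeUtf8(0x20AC) produces the bytes E2 82 AC. Writing those bytes as escapes keeps the source file pure ASCII, so the result does not depend on the locale charset of the machine that compiles it.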
C++ string charset build options
This is another opportunity to discuss the internal memory string encoding choices in C++ (also described in ANSI and Unicode files and C++ strings).
The CMarkup class has a string member m_strDoc that holds the XML document (or part of it in the case of file mode). Also, the CMarkup methods accept and return strings. The encoding of these strings depends on platform and compiler options.
You select the UTF-8 option in Windows by turning off the UNICODE (wide char) and MBCS project defines. In Visual Studio 2005 and later, under Properties, General, Character Set, choose the "Not Set" option. In this case, all the strings going into and out of the CMarkup methods are expected in UTF-8.
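If you want your own code to check which of the three choices a build is using, a sketch like the following can help. It relies on the preprocessor symbols that the Visual Studio Character Set setting defines (UNICODE and _UNICODE for the Unicode option, _MBCS for the multi-byte option). InMemoryEncoding is a hypothetical name used only for this illustration, and CMarkup's own internal configuration macros are not shown here.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper (not a CMarkup function): report the in-memory
// string encoding implied by the project's Character Set defines.
const char* InMemoryEncoding()
{
#if defined(UNICODE) || defined(_UNICODE)
    return "UTF-16";   // "Use Unicode Character Set": wide strings
#elif defined(_MBCS)
    return "MBCS";     // "Use Multi-Byte Character Set": locale code page
#else
    return "UTF-8";    // "Not Set": char strings are treated as UTF-8
#endif
}

// True when char strings passed to and from CMarkup are expected in UTF-8.
bool CharStringsAreUtf8()
{
    return std::strcmp(InMemoryEncoding(), "UTF-8") == 0;
}
```

A build with neither define present falls through to the last branch, which matches the "Not Set" case described above.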
Note: the terminology is confusing, but in this Windows context UTF-8 is neither MBCS nor UNICODE. UTF-8 is "multibyte," but Windows uses MBCS to refer only to those character sets that can be selected for the machine locale and used in "A" APIs and Windows messages. And although UTF-8 is Unicode, Windows uses UNICODE to refer only to UTF-16 used in "W" APIs and Windows messages. In Windows you must convert UTF-8 strings to MBCS for "A" APIs like SetWindowTextA or, better yet, to UTF-16 for "W" APIs like SetWindowTextW (that's what CMarkup's UTF8To16 and UTF16To8 are for).
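To illustrate what a UTF-8 to UTF-16 conversion involves, here is a portable sketch. It is not CMarkup's UTF8To16 implementation; Utf8ToUtf16 is a hypothetical function written for this example, and it assumes that malformed input should be replaced with U+FFFD.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not CMarkup's UTF8To16): decode UTF-8 into UTF-16.
// Malformed sequences become U+FFFD. For brevity this sketch does not
// reject overlong encodings.
std::u16string Utf8ToUtf16(const std::string& utf8)
{
    const char16_t kReplacement = 0xFFFD;
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i++]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out += kReplacement; continue; }  // stray or invalid lead byte

        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= utf8.size()) { ok = false; break; }  // truncated input
            unsigned char c = static_cast<unsigned char>(utf8[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }  // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
            ++i;
        }
        if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out += kReplacement;
            continue;
        }

        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);  // one UTF-16 code unit
        } else {
            assert(cp <= 0x10FFFF);
            cp -= 0x10000;                     // split into a surrogate pair
            out += static_cast<char16_t>(0xD800 | (cp >> 10));
            out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}
```

The three UTF-8 bytes of the euro become the single UTF-16 unit 0x20AC, while characters above U+FFFF become two units. On Windows, where wchar_t is 16 bits, the resulting units are what a "W" API expects.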
If you have a UTF-8 file, using UTF-8 in memory eliminates the need to convert the text encoding between file and memory.
If you have a UTF-8 file and compile for MBCS in memory, CMarkup converts the XML to the locale code page when it is loaded into memory. This has performance and multi-language disadvantages. The conversion between memory and file adds time (though less than the time to read from disk), which might be a performance consideration depending on your requirements. In addition, you will lose any Unicode characters not supported in the locale code page where the program is running. For example, ö (o with an umlaut) is 246 in Windows-1252, the standard U.S. Windows code page, but it is not supported in Greek Windows-1253 (though the Euro is).
The character in your attribute is U+FF5E (decimal 65374, in the Halfwidth and Fullwidth Forms block FF00-FFEF; UTF-8 EF BD 9E). When this character is not supported in the character set in memory, it is replaced by a question mark. You likely have an MBCS build, which expects strings to be in the system locale Windows code page (not Unicode). It is best if you can use a Unicode charset in memory -- either UTF-8 or wide string, as explained above.
When your project is set to use MBCS, CMarkup converts the file to your system locale charset in memory. If you can set your Character Set to the "Not Set" option in your Project Properties it will keep your XML in UTF-8.
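The loss of characters in an MBCS build can be sketched with a simplified conversion to Windows-1252. The helpers below are hypothetical and deliberately partial: they cover ASCII, the 0xA0 to 0xFF range (which Windows-1252 shares with Unicode), and the euro at 0x80, and they treat everything else as unsupported. The real conversion is done by CMarkup and the operating system, not by this code.

```cpp
#include <cassert>
#include <string>

// Hypothetical, partial mapping from a Unicode code point to a
// Windows-1252 byte. Most of the 0x81-0x9F range is omitted for brevity.
// Unsupported characters come back as '?'.
char Windows1252Byte(char32_t cp)
{
    if (cp < 0x80) return static_cast<char>(cp);                // ASCII
    if (cp >= 0xA0 && cp <= 0xFF) return static_cast<char>(cp); // same as Unicode
    if (cp == 0x20AC) return static_cast<char>(0x80);           // euro sign
    return '?';                                                 // not representable
}

// Convert a whole string, one byte per code point.
std::string ToWindows1252(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text) {
        out += Windows1252Byte(cp);
    }
    assert(out.size() == text.size());
    return out;
}
```

With this sketch, ö (U+00F6) becomes byte 246 and the euro becomes byte 0x80, but U+FF5E has no Windows-1252 equivalent and turns into a question mark. Once that happens the original character cannot be recovered, which is why keeping UTF-8 or wide strings in memory is the safer choice.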
Yes, in the real world, you get situations where you need to salvage improperly declared XML documents. Say you have a header (an "XML declaration") at the top of your XML file:
<?xml version="1.0" encoding="UTF-8"?>
But the encoding of the non-ASCII characters is actually Windows-1252. You can get CMarkup to ignore the "UTF-8" specified there by using the ReadTextFile and WriteTextFile functions directly and specifying the desired encoding:
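// Read the file as Windows-1252 regardless of the XML declaration
string strDoc, strEncoding = "Windows-1252";
CMarkup::ReadTextFile("C:\\file.xml", strDoc, NULL, NULL, &strEncoding);
CMarkup xml;
xml.SetDoc(strDoc);
...
// Write the document back using the same encoding
strDoc = xml.GetDoc();
CMarkup::WriteTextFile("C:\\file.xml", strDoc, NULL, NULL, &strEncoding);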
This should allow you to leave (and ignore) the incorrect encoding in the XML declaration.
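Before overriding the declared encoding, you may want to confirm that the file really is not UTF-8. The following is a heuristic sketch, not a CMarkup feature: LooksLikeUtf8 is a hypothetical function that checks only the lead-byte and continuation-byte structure, so it does not catch every invalid form (for example, some overlong or surrogate encodings).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a CMarkup function): return true if the bytes
// have the structure of UTF-8. This is a structural check only.
bool LooksLikeUtf8(const std::string& bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;  // continuation bytes required after b
        if (b < 0x80)                    extra = 0;
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;
        else return false;  // 0x80-0xC1 and 0xF5-0xFF cannot start a character

        if (bytes.size() - i - 1 < extra) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // expected 10xxxxxx
        }
        i += extra + 1;
    }
    return true;
}
```

Text saved as Windows-1252 usually fails this check as soon as it contains an accented character, because a single byte such as 0xE9 followed by an ASCII character is not a valid UTF-8 sequence. Pure ASCII content passes either way, and in that case the declared encoding does no harm.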