You might work only with text in the ASCII range (below 128), or you might have some non-ASCII text like the euro character or Western European characters with accents and umlauts. Here are some examples of how to handle encoding issues as you move beyond ASCII.
Using \xE2\x82\xAC for the euro is correct in your case because your string encoding is UTF-8.
When you specify a non-ASCII character in a source file on Windows, it is compiled into your program in the locale charset. So the problem with the euro symbol in sFmt is that it is in your locale's MBCS (in which the euro is represented by one byte), while CMarkup expects UTF-8 (which is the case when your project is set to use neither MBCS nor UNICODE). You were able to work around it by putting the UTF-8 encoding directly in the string.
If compiling for MBCS you could have used the euro character directly in your source string, but the result would only be satisfactory as long as the program runs on a machine with the same locale "Language for non-Unicode programs" setting as yours.
This is another opportunity to discuss the internal memory string encoding choices in C++ (also described in ANSI and Unicode files and C++ strings).
The CMarkup class has a string member m_strDoc that holds the XML document (or part of it in the case of file mode). The CMarkup methods also accept and return strings. The encoding of these strings depends on platform and compiler options.
You select the UTF-8 option in Windows by turning off the UNICODE (wide char) and MBCS project defines. In Visual Studio 2005 and later, under Properties > General > Character Set, choose the "Not Set" option. In this case, all of the strings going into and out of the CMarkup methods are expected to be UTF-8.
Note: the terminology is confusing, but in this Windows context UTF-8 is neither MBCS nor UNICODE. UTF-8 is "multibyte," but Windows uses MBCS to refer only to those character sets that can be selected for the machine locale and used in "A" APIs and Windows messages. And although UTF-8 is Unicode, Windows uses UNICODE to refer only to the UTF-16 used in "W" APIs and Windows messages. In Windows you must convert UTF-8 strings to MBCS for "A" APIs like SetWindowTextA, or better yet to UTF-16 for "W" APIs like SetWindowTextW (that is what CMarkup's UTF8To16 and UTF16To8 functions are for).
If you have a UTF-8 file, using UTF-8 in memory eliminates the need to convert the text encoding between file and memory.
If you have a UTF-8 file and compile for MBCS in memory, CMarkup converts the XML to the locale code page when it is loaded into memory. This has both performance and multi-language disadvantages. The conversion between file and memory adds time (though less than the time to read from disk), which might be a performance consideration depending on your requirements. Worse, you lose any Unicode characters not supported in the locale code page where the program is running. For example, ö (o-umlaut) is 246 in the standard Windows U.S. code page Windows-1252, but it is not supported in Greek Windows-1253 (whereas the euro is).
GetAttrib result is a question mark
Chen 22-Aug-2011
When the xml data has a node like the following:
<block solution="～" />
CMarkup's GetAttrib("solution") returns "?".
The character in your attribute is U+FF5E (decimal 65374, from the Halfwidth and Fullwidth Forms block FF00-FFEF; UTF-8 bytes EF BD 9E). When this character is not supported by the character set in memory, it is replaced by a question mark. You likely have an MBCS build, which expects strings in the system locale Windows code page (not Unicode). It is best if you can use a Unicode charset in memory -- either UTF-8 or wide strings, as explained above.
Trouble working with Arabic XML
Greg 16-Jun-2011
I am reading in a bunch of small Arabic XML files and combining them into a larger one. My problem is that I must not be handling something correctly when I read the data into memory because the output is gibberish (examples follow). Here are the relevant details of my environment: CMarkup version 11, Microsoft Visual Studio 2010 Ultimate, Unmanaged C++, MBCS, XML is UTF-8 with no BOM. Here is an example of what I am reading in:
<?xml version="1.0" encoding="UTF-8"?>
<Symbol xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="..\schemas\symbolcanonical.xsd">
<Filename>11.esds</Filename>
<Description>
<LocText>ضع هذاالرمز على رسم المخطط .ثم ابداء الطباعة .</LocText>
</Description>
</Symbol>
And here is an example of what I am writing out:
<?xml version="1.0" encoding="UTF-8"?>
<SymbolCollection>
<SymbolList>
<Symbol>
<Filename>11.esds</Filename>
<Description>?? ???????? ??? ??? ?????? .?? ????? ??????? .</Description>
</Symbol>
</SymbolList>
</SymbolCollection>
I’ve played around with various SetDocFlags and setlocale
options and whatnot...
[Selecting "Not Set" for the project character set solved the problem.] I didn't realize I had a viable third choice on that build setting.
When your project is set to use MBCS, CMarkup converts the file to your system locale charset in memory. If you can set your Character Set to the "Not Set" option in your Project Properties it will keep your XML in UTF-8.
Encoding of XML with German umlauts
Ahyan 19-Jul-2011
Developers sometimes save their XML files in an unfortunately inconsistent way, where the encoding named in the XML header does not match the file content. This happens when the XML files are modified in a text editor that neither knows it is saving an XML file nor checks that the declared encoding matches what it writes; in this editor, the encoding of the text file (which happens to be XML) must be chosen explicitly in the "Save File As" options. So we end up with XML files that have incorrect headers: a file containing German umlauts (special German characters like ä, ö, etc.) is effectively invalid because the header states an encoding (e.g. "UTF-8" or "8859-1") that does not match the actual content. This can be improved by training the developers...
Yes, in the real world, you get situations where you need to salvage improperly declared XML documents. Say you have a header (an "XML declaration") at the top of your XML file:
<?xml version="1.0" encoding="UTF-8"?>
But the encoding of the non-ASCII characters is actually Windows-1252. You can get CMarkup to ignore the "UTF-8" specified there by using the ReadTextFile and WriteTextFile functions directly and specifying the desired encoding:
string strDoc, strEncoding = "Windows-1252";
CMarkup::ReadTextFile( "C:\\file.xml", strDoc, NULL, NULL, &strEncoding );
CMarkup xml;
xml.SetDoc( strDoc );
...
strDoc = xml.GetDoc();
CMarkup::WriteTextFile( "C:\\file.xml", strDoc, NULL, NULL, &strEncoding );
This should allow you to leave (and ignore) the incorrect encoding in the XML declaration.
Davide 02-Mar-2011
I need to insert an amount in a UTF-8 xml file, something like:
But the resulting xml is unreadable. I've found a workaround using: