Euro and other non-ASCII chars in XML with CMarkup

You might work only with text in the ASCII range (below 128) or have some non-ASCII text like the Euro character or Western European characters with accents and umlauts. Here are some examples of how to handle encoding issues as you move beyond ASCII.

 

comment posted euro is unreadable in XML

Davide 02-Mar-2011

I need to insert an amount in a UTF-8 xml file, something like:

CString sFmt;
sFmt.Format( _T("%d €"), nPrice );
xml.AddElem( _T("Price"), sFmt );

But the resulting xml is unreadable. I've found a workaround using:

CString sFmt;
sFmt.Format( _T("%d \xE2\x82\xAC"), nPrice );
xml.AddElem( _T ("Price"), sFmt );

Using \xE2\x82\xAC for the euro is correct in your case because your string encoding is UTF-8.

When you specify a non-ASCII character in a source file on Windows, it is compiled into your program in the locale charset. So the problem with the euro symbol in sFmt is that it is in your locale's MBCS (in which the euro is represented by one byte), while CMarkup is expecting UTF-8 (which is the case when your project is set to use neither MBCS nor UNICODE). You were able to work around it by putting the UTF-8 encoding directly in the string.

If compiling for MBCS you could have used the euro character directly in your source string, but the result would only be satisfactory as long as the program runs on a machine with the same "Language for non-Unicode programs" locale setting as yours.
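
To see where the bytes \xE2\x82\xAC come from, here is a minimal sketch of UTF-8 encoding for a single Unicode code point. The function name EncodeUtf8 is hypothetical and is not part of CMarkup. It only illustrates the byte layout and does not reject surrogate code points.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
// Hypothetical helper for illustration; not a CMarkup function.
std::string EncodeUtf8(unsigned long cp)
{
    assert(cp <= 0x10FFFF); // values beyond the Unicode range are a caller error
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx (plain ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling EncodeUtf8(0x20AC) for the euro sign U+20AC yields the three bytes E2 82 AC, which is the sequence used in the workaround above.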

C++ string charset build options

This is another opportunity to discuss the internal memory string encoding choices in C++ (also described in ANSI and Unicode files and C++ strings).

The CMarkup class has a string member m_strDoc that holds the XML document (or part of it in the case of file mode). Also, the CMarkup methods accept and return strings. The encoding of these strings depends on platform and compiler options.

  • Wide char Unicode: UTF-16 on Windows, generally UTF-32 on OS X and Linux
  • MBCS: Windows only, depends on machine's locale setting e.g. Western European, Korean
  • UTF-8 Unicode: plain byte-based text, UTF-8 on OS X and Linux, can be UTF-8 on Windows

    You select the UTF-8 option in Windows by turning off the UNICODE (wide char) and MBCS project defines. In Visual Studio 2005+, under Properties > General > Character Set, choose the "Not Set" option. In this case, all the strings going into and out of the CMarkup methods are expected in UTF-8.

    If you have a UTF-8 file, using UTF-8 in memory eliminates the need to convert the text encoding between file and memory.

    If you have a UTF-8 file and compile for MBCS in memory, CMarkup converts the XML to the locale code page when it is loaded into memory. This has performance and multi-language disadvantages. The conversion is needed every time the document moves between file and memory, which adds time (though less than the time to read from disk) and might matter depending on your requirements. More importantly, you lose any Unicode characters not supported in the locale code page where the program is running. For example, the ö (o with umlaut) is 246 in the standard Windows U.S. code page Windows-1252, but it is not supported in Greek Windows-1253 (although the Euro is).
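
The loss of unsupported characters can be sketched with a simplified converter. This is not CMarkup's actual conversion code: the function name Utf8ToLatin1Lossy is hypothetical, and ISO-8859-1 is used as a stand-in for a Windows code page because its mapping is trivial (code points 0 to 255 map to the same byte value). The sketch also assumes the continuation bytes of each sequence are well formed.

```cpp
#include <cstddef>
#include <string>

// Convert UTF-8 text to ISO-8859-1, replacing anything the target
// charset cannot represent with '?'.
// Hypothetical helper; ISO-8859-1 stands in for a real Windows code page.
std::string Utf8ToLatin1Lossy(const std::string& utf8)
{
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        unsigned long cp;
        std::size_t len;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else { out += '?'; ++i; continue; }  // stray or invalid lead byte
        if (i + len > utf8.size()) { out += '?'; break; }  // truncated sequence
        // Fold in the 6 payload bits of each continuation byte
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        // Only code points 0..255 exist in ISO-8859-1
        out += (cp <= 0xFF) ? static_cast<char>(cp) : '?';
        i += len;
    }
    return out;
}
```

With this sketch, the UTF-8 bytes C3 B6 for ö become the single byte 246, while an Arabic letter or any other character outside the target charset becomes a question mark. Once that happens the original character cannot be recovered from the in-memory string.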

     

    comment posted GetAttrib result is a question mark

    Chen 22-Aug-2011

    When the xml data has a node like the following:

    <block solution ="～" />

    CMarkup's function GetAttrib("solution") has the result "?".

    The character in your attribute is U+FF5E (decimal 65374, in the Halfwidth and Fullwidth Forms block FF00-FFEF, UTF-8 bytes EF BD 9E). When this character is not supported in the character set in memory, it is replaced by a question mark. You likely have an MBCS build, which expects strings to be in the system locale Windows code page (not Unicode). It is best if you can use a Unicode charset in memory, either UTF-8 or wide string, as explained above.

     

    comment posted Trouble working with Arabic XML

    Greg 16-Jun-2011

    I am reading in a bunch of small Arabic XML files and combining them into a larger one. My problem is that I must not be handling something correctly when I read the data into memory because the output is gibberish (examples follow). Here are the relevant details of my environment: CMarkup version 11, Microsoft Visual Studio 2010 Ultimate, Unmanaged C++, MBCS, XML is UTF-8 with no BOM. Here is an example of what I am reading in:

    <?xml version="1.0" encoding="UTF-8"?>
    <Symbol xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="..\schemas\symbolcanonical.xsd">
    <Filename>11.esds</Filename>
    <Description>
    <LocText>ضع هذاالرمز  على رسم المخطط  .ثم ابداء الطباعة  .</LocText>
    </Description>
    </Symbol>

    And here is an example of what I am writing out:

    <?xml version="1.0" encoding="UTF-8"?>
    <SymbolCollection>
    <SymbolList>
    <Symbol>
    <Filename>11.esds</Filename>
    <Description>?? ????????  ??? ??? ??????  .?? ????? ???????  .</Description>
    </Symbol>
    </SymbolList>
    </SymbolCollection>

    I’ve played around with various SetDocFlags and setlocale options and whatnot...

    [Selecting "Not Set" for the project character set solved the problem.] I didn't realize I had a viable third choice on that build setting.

    When your project is set to use MBCS, CMarkup converts the file to your system locale charset in memory. If you can set your Character Set to the "Not Set" option in your Project Properties it will keep your XML in UTF-8.

     

    comment posted encoding of XML with german umlaute

    Ahyan 19-Jul-2011

    It is possible for developers to save their XML files in an unfortunately inconsistent way where the XML header encoding information does not fit to the file content. That happens when the XML files are modified in a text editor that does not care if it is saving an XML file and if the encoding header matches the content. Within this text editor one has to explicitly specify the encoding of the text file (which is actually XML in this case) with the "Save File As" options. So we end up having XML files with incorrect headers and a given XML file that can contain "german umlaute" (special german characters like ä,ö etc) will be invalid because the XML header states e.g. an encoding ("UTF-8" or "8859-1") that doesn't fit the actual content. This can be improved by training the developers...

    Yes, in the real world, you get situations where you need to salvage improperly declared XML documents. Say you have a header (an "XML declaration") at the top of your XML file:

    <?xml version="1.0" encoding="UTF-8"?>

    But the encoding of the non-ASCII characters is actually Windows-1252. You can get CMarkup to ignore the "UTF-8" specified there by using the ReadTextFile and WriteTextFile functions directly and specifying the desired encoding:

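    // Read the file as Windows-1252, overriding the encoding named in the XML declaration
    std::string strDoc, strEncoding="Windows-1252";
    CMarkup::ReadTextFile("C:\\file.xml",strDoc,NULL,NULL,&strEncoding);
    CMarkup xml;
    xml.SetDoc(strDoc);
    ...
    // Write the document back out in the same encoding
    strDoc = xml.GetDoc();
    CMarkup::WriteTextFile("C:\\file.xml",strDoc,NULL,NULL,&strEncoding);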

    This should allow you to leave (and ignore) the incorrect encoding in the XML declaration.
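
If you do not know in advance which files are mislabeled, one possible heuristic is to check whether the raw file bytes are well-formed UTF-8. Windows-1252 text containing accented characters usually is not. The sketch below assumes you have already read the raw bytes yourself (for example with std::ifstream in binary mode). The function name IsValidUtf8 is hypothetical and is not a CMarkup API.

```cpp
#include <cstddef>
#include <string>

// Return true if the bytes form well-formed UTF-8.
// Hypothetical helper for illustration; not part of CMarkup.
bool IsValidUtf8(const std::string& bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) { ++i; continue; }  // ASCII byte
        std::size_t len;
        unsigned long cp, minCp;
        if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; minCp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; minCp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; minCp = 0x10000; }
        else return false;  // continuation byte or invalid lead byte
        if (i + len > bytes.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values past U+10FFFF
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += len;
    }
    return true;
}
```

When a file declares UTF-8 but IsValidUtf8 returns false for its bytes, that suggests trying the ReadTextFile call shown above with an explicit encoding such as Windows-1252. A pure ASCII file passes the check either way, and in that case the declared encoding does not matter.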

    See also:

    ANSI and Unicode files and C++ strings

    GetDeclaredEncoding