Control Characters in XML

Aside from the carriage return, linefeed, tab, and null, the "non-printable" control characters only mean something to ancient terminals and transmission protocols. Nowadays they are avoided in text documents, and therefore should be avoided in XML too. But if for some reason you need to represent them in XML, you would expect to be able to use XML's numeric character references (such as &#x9; for tab). Unfortunately, it is not that simple.

In its infinite wisdom the XML 1.0 standard excluded the control characters in the range 0x01 to 0x1f, except the whitespace characters 0x09, 0x0a and 0x0d, even in escaped form. XML 1.1 reversed this for the escaped form, but it was too late: many parsers still follow the XML 1.0 rule. This mistake is perfectly understandable if you agree that the purpose of the XML standard was to over-engineer the concept of a simple markup format. ;)
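For reference, the XML 1.0 rule can be written down as a small check. This is a minimal sketch based on the Char production in the XML 1.0 recommendation; IsXml10Char is a hypothetical helper name, not a CMarkup function.

```cpp
#include <cstdint>

// Hypothetical helper (not part of CMarkup): true if the Unicode code point
// is allowed in an XML 1.0 document, whether literal or as a numeric reference.
bool IsXml10Char( std::uint32_t cp )
{
  return cp == 0x09 || cp == 0x0A || cp == 0x0D   // tab, linefeed, carriage return
      || ( cp >= 0x20 && cp <= 0xD7FF )           // excludes the surrogate range
      || ( cp >= 0xE000 && cp <= 0xFFFD )         // excludes 0xFFFE and 0xFFFF
      || ( cp >= 0x10000 && cp <= 0x10FFFF );
}
```

With this check, 0x0e and null are rejected while tab, linefeed and carriage return pass.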

 

comment posted CMarkup with non-printable characters

Matt 09-Feb-2012

Basically, CMarkup::EscapeText(...) allows non-printable characters like 0xE to be transmitted as-is. But some XML parsers seem to be picky about this, and cite the spec as the reason: http://www.w3.org/TR/xml/#charsets. This is leading to this real-world problem [where we pass text from a third party source application to a third party receiving application via XML. When the text in the XML contains a 0x0e (which only has meaning in the source application) a failure is triggered in the receiving application].

After testing, it looks like a lot of parsers still reject the XML if it has an escaped non-print character like &#x0e;. Our plan is to convert non-prints to question marks.

I think in your case you should convert the control characters to question marks (or remove them) as part of scrubbing the source data because the receiving application would never want them anyway.
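A scrubbing step of that kind might look like the following sketch. ScrubControlChars is a hypothetical helper name, not part of CMarkup, and it assumes the text is held in a std::string as ASCII or UTF-8, where bytes below 0x20 never occur inside a multi-byte sequence.

```cpp
#include <string>

// Hypothetical helper (not part of CMarkup): returns a copy of text in which
// every control character below 0x20, other than tab, linefeed and carriage
// return, is replaced. An embedded null is replaced as well.
std::string ScrubControlChars( const std::string& text, char replacement = '?' )
{
  std::string result = text;
  for ( char& c : result )
  {
    unsigned char u = static_cast<unsigned char>( c );
    if ( u < 0x20 && u != 0x09 && u != 0x0A && u != 0x0D )
      c = replacement;
  }
  return result;
}
```

Run the text through this before handing it to EscapeText or SetData, so the receiving application never sees the offending characters.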

But if someone needed to make CMarkup escape control characters automatically, it would require a small modification to CMarkup::EscapeText. In release 11.5 the spot is Markup.cpp:2967; in any recent release, find the final else { nCharLen = MCD_CLEN( pSource );... in that function and add the following else if clause:

else if ( cSource<0x20 && cSource>0 && cSource!=0x0a && cSource!=0x0d && cSource!=0x09 )
{
  // 0x0e becomes &#xe;
  MCD_CHAR szEscaped[10];
  MCD_SPRINTF( MCD_SSZ(szEscaped), MCD_T("&#x%x;"), (int)cSource );
  MCD_BLDAPPEND(strText,szEscaped);
  ++pSource;
}

before this:

else
{
  nCharLen = MCD_CLEN( pSource );
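For readers without the CMarkup source at hand, the same idea can be sketched as a standalone function. This is only an illustration under assumptions: EscapeControlChars is a hypothetical name, it works on a plain std::string rather than CMarkup's MCD_ types, and it does not escape markup characters such as & and < the way EscapeText does.

```cpp
#include <cstdio>
#include <string>

// Hypothetical standalone sketch (not CMarkup code): writes control
// characters below 0x20, other than null, tab, linefeed and carriage
// return, as hexadecimal numeric references. Everything else is copied as-is.
std::string EscapeControlChars( const std::string& text )
{
  std::string result;
  for ( unsigned char u : text )
  {
    if ( u > 0 && u < 0x20 && u != 0x09 && u != 0x0A && u != 0x0D )
    {
      char szEscaped[10];
      std::snprintf( szEscaped, sizeof(szEscaped), "&#x%x;", static_cast<unsigned>( u ) );
      result += szEscaped;
    }
    else
      result += static_cast<char>( u );
  }
  return result;
}
```

Keep in mind that XML 1.0 parsers will still reject the resulting references, so this is only useful when the receiving parser accepts them.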

Re: ASCII control characters in XML
Yes, the XML spec clearly rules these characters out. We didn't discuss it that much during the process - it seemed like a good idea, and nobody on any of the committees seemed troubled at the prospect of losing them; so I'm afraid this is a hardwired characteristic of XML 1.0, and you're stuck with it.
-Tim Bray, Tue, 28 Apr 1998

Re: control characters
I'm not sure we'd do it the same way if we were doing it again. I don't see that they do any real harm.
-Tim Bray, Sat, 17 Jun 2000

XML 1.1 allows the use of character references to the control characters #x1 through #x1F, most of which are forbidden in XML 1.0. For reasons of robustness, however, these characters still cannot be used directly in documents. In order to improve the robustness of character encoding detection [and prove the committee consists of incorrigible meddlers], the additional control characters #x7F through #x9F, which were freely allowed in XML 1.0 documents, now must also appear only as character references. (Whitespace characters are of course exempt.)
-Extensible Markup Language (XML) 1.1, W3C Candidate Recommendation, 15 October 2002
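As a rough sketch of the XML 1.1 rule quoted above, the following hypothetical helper (not part of CMarkup) reports whether a code point is one that XML 1.1 permits only as a character reference. The ranges follow the RestrictedChar production of the XML 1.1 recommendation, which exempts the whitespace characters including NEL (#x85).

```cpp
#include <cstdint>

// Hypothetical helper (not part of CMarkup): true if XML 1.1 allows the
// code point only in escaped form, e.g. &#xe; rather than the raw character.
bool IsXml11RestrictedChar( std::uint32_t cp )
{
  return ( cp >= 0x01 && cp <= 0x08 )
      || ( cp >= 0x0B && cp <= 0x0C )
      || ( cp >= 0x0E && cp <= 0x1F )
      || ( cp >= 0x7F && cp <= 0x84 )
      || ( cp >= 0x86 && cp <= 0x9F );
}
```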

See also:

EscapeText

UnescapeText