As part of the support for generic markup in CMarkup release 8.0 (see Generic Markup In CMarkup), any non-XML arrangement of tags can be navigated using CMarkup. See also HTML And CMarkup.

It is possible to use an abbreviated markup system in which end tags are intentionally omitted for brevity. This is not recommended because it means leaving the XML standard for questionable benefit, but it can be navigated using CMarkup methods. Compare the well-formed RECORD and then the alternative markup option following it:

<RECORD><NAME>John Smith</NAME><ID>7632</ID><SCORE>10.6</SCORE></RECORD>
<RECORD><NAME>John Smith<ID>7632<SCORE>10.6</RECORD>

If there is no end tag, and it is not an XML empty element ending in />, CMarkup treats the element much like an empty element with no data (see Containment Hierarchy for more information on how the hierarchy is determined in non-XML markup documents). One way to retrieve the information from the above record is as follows (this could be optimized depending on knowledge of how tags are arranged):

xml.FindElem( "RECORD" );
xml.IntoElem();
xml.FindElem( "NAME" );
xml.FindNode();
CString csName = xml.GetData();
xml.FindElem( "ID" );
xml.FindNode();
CString csID = xml.GetData();
xml.FindElem( "SCORE" );
xml.FindNode();
CString csScore = xml.GetData();

Essentially, once you find the non-ended element, you extract the text node that comes after it. The complication is that if the element is immediately followed by the next non-ended element without a text node in between, you will not be in a good position to find the next element. For example, suppose it is possible in your data to have a NAME and SCORE and an empty ID.

<RECORD><NAME>James Smith<ID><SCORE>4.6</RECORD>

After finding the ID element, you call FindNode to move to the text node holding its value. But because the ID is empty, the next node is the SCORE element, so FindNode leaves the current position on SCORE. At this point, a call to FindElem("SCORE") will return false because FindElem starts searching after the current position. Calling ResetMainPos between FindElem calls is the simplest way to avoid this problem, and checking that FindNode returns MNT_TEXT ensures a value is stored only when a text node was actually found.

CString csName, csID, csScore;
xml.FindElem( "RECORD" );
xml.IntoElem();
xml.FindElem( "NAME" );
if ( xml.FindNode() == xml.MNT_TEXT )
  csName = xml.GetData();
xml.ResetMainPos();
xml.FindElem( "ID" );
if ( xml.FindNode() == xml.MNT_TEXT )
  csID = xml.GetData();
xml.ResetMainPos();
xml.FindElem( "SCORE" );
if ( xml.FindNode() == xml.MNT_TEXT )
  csScore = xml.GetData();
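
The reason for testing the node type can also be illustrated outside CMarkup. The following is a minimal standalone sketch, not CMarkup code: ValueAfterTag is a hypothetical helper that assumes simple start tags without attributes and searches from the start of the string on every call, much as ResetMainPos restarts the search. It returns an empty value when a tag is immediately followed by another tag, as with the empty ID above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of CMarkup): returns the text that directly
// follows the start tag <tag> in doc, or an empty string if the tag is
// missing or is immediately followed by another tag.
// Assumes start tags have no attributes.
std::string ValueAfterTag( const std::string& doc, const std::string& tag )
{
  assert( !tag.empty() );
  const std::string open = "<" + tag + ">";

  // Always search from the start of the document, so the order of the
  // calls does not matter.
  std::string::size_type pos = doc.find( open );
  if ( pos == std::string::npos )
    return std::string();              // tag not present

  pos += open.size();

  // The value runs up to the next '<' (or to the end of the document).
  std::string::size_type end = doc.find( '<', pos );
  if ( end == std::string::npos )
    end = doc.size();

  return doc.substr( pos, end - pos ); // empty when a tag follows directly
}
```

With the James Smith record above, this sketch yields "James Smith" for NAME, an empty string for ID and "4.6" for SCORE.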

The code can be more efficient if it is known that the ID element is omitted entirely whenever the ID value is empty. In that case you can test the result of the FindElem calls, and ResetMainPos is not needed.

<RECORD><NAME>James Smith<SCORE>4.6</RECORD>

CString csName, csID, csScore;
xml.FindElem( "RECORD" );
xml.IntoElem();
if ( xml.FindElem("NAME") )
{
  xml.FindNode();
  csName = xml.GetData();
}
if ( xml.FindElem("ID") )
{
  xml.FindNode();
  csID = xml.GetData();
}
if ( xml.FindElem("SCORE") )
{
  xml.FindNode();
  csScore = xml.GetData();
}

 

Comment posted: Not quite XML documents

Warren Stevens 24-Jan-2005

Some of the documents I have to read are not quite XML documents, but are quite close (they're essentially SGML files). The difference is that some of the elements do not have an end tag; the element is finished by an end of line or by another start tag. Here is a snippet from one of the files:

<SONRS>
  <STATUS>
    <CODE>0
    <SEVERITY>INFO
    <MESSAGE>OK
  </STATUS>
  <DTSERVER>20041013225504[-5]
  <LANGUAGE>ENG
  <INTU.BID>00015
</SONRS>

This can be handled much like the previous example, but with the MDF_TRIMWHITESPACE document flag, a new feature in CMarkup release 11.3, set so that the trailing whitespace is trimmed when the values are extracted from the document. The example code below extracts the SEVERITY and MESSAGE values.

  // Retrieve severity and message
  CString csMsg, csSev;
  xml.SetDocFlags( xml.MDF_TRIMWHITESPACE );
  xml.FindElem(); // root
  xml.IntoElem();
  xml.FindElem( "STATUS" );
  xml.IntoElem();
  if ( xml.FindElem("SEVERITY") && xml.FindNode() == xml.MNT_TEXT )
    csSev = xml.GetData();
  xml.ResetMainPos();
  if ( xml.FindElem("MESSAGE") && xml.FindNode() == xml.MNT_TEXT )
    csMsg = xml.GetData();
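
To picture what the trimming saves you from, consider that the raw text after <SEVERITY> in the snippet runs up to the next tag and so includes the line break and indentation. The following standalone sketch is not CMarkup's implementation of the flag; TrimTrailingWhitespace is a hypothetical helper showing the kind of cleanup you would otherwise apply to each extracted value yourself.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not part of CMarkup): returns a copy of s with
// trailing whitespace (spaces, tabs, line breaks) removed.
std::string TrimTrailingWhitespace( const std::string& s )
{
  std::string::size_type end = s.size();
  while ( end > 0 && std::isspace( static_cast<unsigned char>( s[end - 1] ) ) )
    --end;
  return s.substr( 0, end );
}

// Example use: the raw SEVERITY text in the snippet above would be "INFO"
// followed by a line break and the indentation of the next line.
void CheckTrimExample()
{
  assert( TrimTrailingWhitespace( "INFO\n    " ) == "INFO" );
}
```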

 

Comment posted: Case insensitive?

Geert 06-Feb-2007

Is it possible to parse case insensitive?

Yes, one of the advantages of CMarkup is the option to ignore case when parsing and navigating documents. See SetDocFlags. Most XML tools do not allow this because XML names are case sensitive, but CMarkup is designed for XML as well as non-compliant XML and other kinds of markup. Turn the ignore case flag on for a CMarkup object m as follows:

m.SetDocFlags( CMarkup::MDF_IGNORECASE );

Or to compile this as the default for all CMarkup objects:

#define MARKUP_IGNORECASE

Because HTML tag and attribute names are not case sensitive, this is especially useful for HTML And CMarkup.
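
As an illustration of what ignoring case means for tag names, here is a minimal standalone sketch. It is not how CMarkup implements MDF_IGNORECASE; TagNamesEqualNoCase is a hypothetical helper that compares ASCII characters only, and CMarkup's own handling of non-ASCII names may differ.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not part of CMarkup): compares two tag names,
// ignoring ASCII letter case.
bool TagNamesEqualNoCase( const std::string& a, const std::string& b )
{
  if ( a.size() != b.size() )
    return false;
  for ( std::string::size_type i = 0; i < a.size(); ++i )
  {
    if ( std::tolower( static_cast<unsigned char>( a[i] ) ) !=
         std::tolower( static_cast<unsigned char>( b[i] ) ) )
      return false;
  }
  return true;
}

// Example use: HTML authors may write the same tag as BODY, body or Body.
void CheckTagNameExample()
{
  assert( TagNamesEqualNoCase( "BODY", "body" ) );
  assert( !TagNamesEqualNoCase( "BODY", "BOD" ) );
}
```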