Node Methods in CMarkup

Update July 12, 2005: With CMarkup release 8.0, the node methods have been made available in the evaluation version so that everyone can try the full HTML and generic markup capabilities of CMarkup. Node methods are key to providing access to mixed content as well as Other Markup (non-XML).

The node methods are FindNode, AddNode, InsertNode, RemoveNode, and GetNodeType. In addition, the SetData, GetData, SetAttrib, GetAttrib, and GetAttribName methods can be used with non-element nodes in certain cases.

The node methods complement but in no way replace the standard element methods. The element methods still map out the real structure of the XML document, and the node methods are a mechanism for manipulating the nodes found in between the elements. If you want to ignore comments and other extraneous non-element nodes, just don't use the node methods. The node methods are completely optional, even if there are non-element nodes in your document.

Node Types
Type	Example	`GetTagName`	`GetData`
Element	`<ITEM>data</ITEM>`	ITEM	contained between start tag and end tag
Comment	`<!-- comment -->`	#comment	contained between dashes
Processing Instruction	`<?xml version="1.0"?>`	xml	contained between question marks
Document Type	`<!DOCTYPE greeting SYSTEM "hello.dtd">`	greeting	everything including first < and last >
CDATA Section	`<![CDATA[data]]>`	#cdata-section	character data between inner brackets
Text	`hello`	#text	parsed character data between markup tags
Whitespace		#text	whitespace character data between markup tags
Lone End Tag	`</ITEM>`	ITEM	ill-formed support added in CMarkup 8.0

When creating and modifying nodes, CMarkup does not verify the text, so you can easily make the document ill-formed. For example, do not add a comment containing "-->" or a processing instruction containing "?>". However, the usual special characters are properly encoded and decoded in Text nodes.
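Because the text is not verified for you, one option is to check it yourself before passing it to AddNode, InsertNode, or SetData. The following is only a minimal sketch of such a check. The helper names are hypothetical and are not part of CMarkup; the rules come from the XML 1.0 grammar, which is slightly stricter than the "-->" case mentioned above (a comment may not contain "--" at all, or end with a dash).

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers (not part of CMarkup): each returns false for
// text that would end the node early and leave the document ill-formed.

// Comment text must not contain "--" (which also rules out "-->")
// and must not end with '-', per the XML 1.0 comment grammar.
bool IsSafeCommentText(const std::string& text)
{
    if (text.find("--") != std::string::npos)
        return false;
    return text.empty() || text.back() != '-';
}

// Processing instruction data must not contain "?>".
bool IsSafePIData(const std::string& data)
{
    return data.find("?>") == std::string::npos;
}

// CDATA section content must not contain "]]>".
bool IsSafeCDataText(const std::string& text)
{
    return text.find("]]>") == std::string::npos;
}
```

A caller would test the text first, for example with IsSafeCommentText, and only then create the comment node.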

The following paragraphs explain how to manipulate nodes according to three main uses for node methods in CMarkup.

Comments

To navigate a document watching for elements and comments, call FindNode(), checking the return value for MNT_ELEMENT or MNT_COMMENT. If you are also interested in processing instructions, check for MNT_PROCESSING_INSTRUCTION as well. Call GetData() to get the text of the comment or the entire processing instruction (including the target). You still need to use IntoElem and OutOfElem when you encounter elements, according to how you wish to traverse the hierarchy of the document.

To create a comment after the current main position, call AddNode( CMarkup::MNT_COMMENT, "comment" ). For a comment, the text you specify is what goes in between the dashes. You can also use InsertNode to insert before the current position, or navigate to a non-element node using FindNode and add or insert there, so you have full flexibility over where to put nodes.

Mixed Content

To navigate the mixed content of an element, navigate to the element in the main position and call IntoElem(). Then call FindNode(), checking the return value for the type of node. When FindNode() returns 0, there are no more child nodes in the current element. When FindNode() returns MNT_ELEMENT, you generally want to go into that element and navigate the nodes in there too, since it may contain mixed content as well. To do this you should keep track of how many levels deep you have gone. When positioned on a non-element node, you retrieve its text using the GetData method. In a CDATA section it is the character data inside the inner brackets. In a text node it is the parsed character data from the end of the previous markup tag to the start of the following markup tag. In the following example, the trailing space is included in the Text node "I ", while the space between the I element and the B element is a Whitespace node.

<poem>I <I>saw</I> <B>her</B></poem>
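The Text versus Whitespace distinction in this example can be illustrated with a small standalone sketch. It does not use CMarkup and is not how CMarkup is implemented; SplitCharacterData and the CharRun type are hypothetical names. It collects each run of character data between markup tags and labels it Whitespace when it consists only of whitespace. It assumes simple well-formed input with no comments, CDATA sections, or '>' inside attribute values, and it ignores the element hierarchy.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class RunKind { Text, Whitespace };

struct CharRun {
    RunKind kind;
    std::string text;
};

// True if every character in s is an XML whitespace character.
bool IsAllWhitespace(const std::string& s)
{
    return s.find_first_not_of(" \t\r\n") == std::string::npos;
}

// Collect the runs of character data that lie between markup tags.
std::vector<CharRun> SplitCharacterData(const std::string& content)
{
    std::vector<CharRun> runs;
    std::string::size_type pos = 0;
    while (pos < content.size()) {
        if (content[pos] == '<') {
            // Skip over the tag; stop if it is never closed.
            std::string::size_type close = content.find('>', pos);
            if (close == std::string::npos)
                break;
            pos = close + 1;
        } else {
            // Character data runs up to the next tag or the end of input.
            std::string::size_type next = content.find('<', pos);
            if (next == std::string::npos)
                next = content.size();
            std::string run = content.substr(pos, next - pos);
            RunKind kind = IsAllWhitespace(run) ? RunKind::Whitespace
                                                : RunKind::Text;
            runs.push_back({kind, run});
            pos = next;
        }
    }
    return runs;
}
```

Given the content of the poem element, I <I>saw</I> <B>her</B>, this yields four runs: the Text "I " with its trailing space, the Text "saw", a Whitespace run of one space, and the Text "her".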

To create mixed content, use AddNode(nodetype,text). This method places mixed content nodes immediately after the current main position node. You can also use it to create an empty element, in which case the text argument is the tag name and no end of line (EOL) is put after the element. The following table shows which node types have an EOL placed after them by AddNode and InsertNode; the types primarily used in mixed content have none:

Type	End Of Line (EOL) Added
Element	no
Text (or Whitespace)	no
CDATA Section	no
Comment	yes
Processing Instruction	yes
Document Type	yes
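If you build markup text in your own code and want it to follow the same convention, the table can be expressed as a small lookup. This is only a sketch under stated assumptions: NodeKind, EolAddedAfter, and RenderNode are hypothetical names rather than CMarkup API, the "\n" default for the EOL is an assumption, and Text is emitted as given without the special character encoding that CMarkup performs.

```cpp
#include <cassert>
#include <string>

// Hypothetical node kinds for this sketch; CMarkup itself uses MNT_* constants.
enum class NodeKind {
    Element, Text, CDataSection, Comment, ProcessingInstruction, DocumentType
};

// Mirrors the table above: kinds used in mixed content get no EOL.
bool EolAddedAfter(NodeKind kind)
{
    switch (kind) {
    case NodeKind::Element:
    case NodeKind::Text:            // also covers Whitespace
    case NodeKind::CDataSection:
        return false;
    case NodeKind::Comment:
    case NodeKind::ProcessingInstruction:
    case NodeKind::DocumentType:
        return true;
    }
    return false; // not reached for valid kinds
}

// Wrap the text in the standard XML delimiters for the kind, then
// append the EOL only where the table says one is added.
std::string RenderNode(NodeKind kind, const std::string& text,
                       const std::string& eol = "\n")
{
    std::string out;
    switch (kind) {
    case NodeKind::Element:               out = "<" + text + "/>"; break; // text is the tag name
    case NodeKind::Text:                  out = text; break;              // no encoding in this sketch
    case NodeKind::CDataSection:          out = "<![CDATA[" + text + "]]>"; break;
    case NodeKind::Comment:               out = "<!--" + text + "-->"; break;
    case NodeKind::ProcessingInstruction: out = "<?" + text + "?>"; break;
    case NodeKind::DocumentType:          out = text; break;              // data already includes < and >
    }
    if (EolAddedAfter(kind))
        out += eol;
    return out;
}
```

For instance, rendering a Comment with the text " note " gives the comment markup followed by the EOL, while rendering an Element with the text "B" gives an empty B element with nothing after it.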

Other Nodes

The xml version declaration (it is a processing instruction with the reserved target "xml") and the DTD are special nodes appearing at the top of the XML document. You can access these by calling ResetPos and then the FindNode method. You can also add and remove them with the other node methods. When you set or get the data, you have to be familiar with the part of the node that is considered "data." In the processing instruction, it is everything between the question marks, i.e. xml version="1.0". Processing instructions are also used to give application-specific instructions, for example appname highdef="true". In the DTD, the data is everything from the first < to the last >.
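Since the data of a processing instruction includes the target, application code often needs to separate the two. The following standalone sketch shows one way to do it; SplitPIData is a hypothetical helper, not a CMarkup method, and it simply treats everything up to the first whitespace as the target.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper: split processing instruction data such as
// xml version="1.0" into its target ("xml") and the remainder.
std::pair<std::string, std::string> SplitPIData(const std::string& data)
{
    const char* ws = " \t\r\n";

    // The target ends at the first whitespace character.
    std::string::size_type targetEnd = data.find_first_of(ws);
    if (targetEnd == std::string::npos)
        return {data, std::string()};   // target only, nothing after it

    // The remainder starts at the next non-whitespace character.
    std::string::size_type restStart = data.find_first_not_of(ws, targetEnd);
    std::string rest = (restStart == std::string::npos)
                           ? std::string()
                           : data.substr(restStart);
    return {data.substr(0, targetEnd), rest};
}
```

Applied to the two examples above, the targets are "xml" and "appname", and the remainders are the version="1.0" and highdef="true" pseudo-attributes respectively.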

An Example

The following sample document has an xml declaration (processing instruction), comments, DTD, and mixed content in the third ITEM element.

<?xml version="1.0"?>
<!DOCTYPE PARSETEST [
<!ELEMENT PARSETEST (ITEM*)>
<!ATTLIST PARSETEST v CDATA '' s CDATA ''>
<!ELEMENT ITEM (#PCDATA|B|I)*>
<!ATTLIST ITEM note CDATA '' xml:space (default|preserve) 'preserve'>
<!ELEMENT B ANY>
<!ELEMENT I ANY>
]>
<!--tightcomment-->
<PARSETEST v="1" s='6'>
  <!-- mid comment -->
  <ITEM note="hi"/>
  <ITEM note="see data">hi</ITEM>
  <ITEM> mixed <B>content</B> <I>okay</I></ITEM>
</PARSETEST>
<!-- end comment -->

After instantiating a CMarkup object with the above document, the following code demonstrates methods to navigate and modify it:

// Start at the top: the first node is the xml declaration
xml.ResetPos();
ASSERT( xml.FindNode() == xml.MNT_PROCESSING_INSTRUCTION );
ASSERT( xml.GetData() == _T("xml version=\"1.0\"") );

// Find the first comment (this passes over the DTD) and change its text
ASSERT( xml.FindNode(xml.MNT_COMMENT) );
ASSERT( xml.GetData() == _T("tightcomment") );
ASSERT( xml.SetData( _T("comment 1 changed") ) );
ASSERT( xml.GetData() == _T("comment 1 changed") );

// Move to the root element, ignoring whitespace nodes, and go inside
ASSERT( xml.FindNode(xml.MNT_EXCLUDE_WHITESPACE) == xml.MNT_ELEMENT );
ASSERT( xml.GetTagName() == _T("PARSETEST") );
ASSERT( xml.IntoElem() );
ASSERT( xml.FindNode(xml.MNT_EXCLUDE_WHITESPACE) == xml.MNT_COMMENT );
ASSERT( xml.GetData() == _T(" mid comment ") );

// Step over the first two ITEM elements to the third one
ASSERT( xml.FindNode(xml.MNT_EXCLUDE_WHITESPACE) == xml.MNT_ELEMENT );
ASSERT( xml.FindNode(xml.MNT_EXCLUDE_WHITESPACE) == xml.MNT_ELEMENT );
ASSERT( xml.FindNode(xml.MNT_EXCLUDE_WHITESPACE) == xml.MNT_ELEMENT );

// Walk the mixed content of the third ITEM, including whitespace nodes
ASSERT( xml.IntoElem() );
ASSERT( xml.FindNode() == xml.MNT_TEXT );
ASSERT( xml.GetData() == _T(" mixed ") );
ASSERT( xml.FindNode() == xml.MNT_ELEMENT );
ASSERT( xml.GetTagName() == _T("B") );
ASSERT( xml.IntoElem() );
ASSERT( xml.FindNode() == xml.MNT_TEXT );
ASSERT( xml.GetData() == _T("content") );
ASSERT( xml.FindNode() == 0 ); // no more nodes inside B
ASSERT( xml.OutOfElem() );

// The single space between </B> and <I> is a Whitespace node
ASSERT( xml.FindNode() == xml.MNT_WHITESPACE );
ASSERT( xml.GetData() == _T(" ") );
ASSERT( xml.FindNode() == xml.MNT_ELEMENT );
ASSERT( xml.GetTagName() == _T("I") );
ASSERT( xml.IntoElem() );
ASSERT( xml.FindNode() == xml.MNT_TEXT );
ASSERT( xml.GetData() == _T("okay") );
ASSERT( xml.FindNode() == 0 ); // no more nodes inside I
ASSERT( xml.OutOfElem() );
ASSERT( xml.FindNode() == 0 ); // no more nodes inside the third ITEM
ASSERT( xml.OutOfElem() );

MSXML Differences

The above code works with CMarkup and CMarkupMSXML. However, note that it does not test for the Document Type node because in the MSXML wrapper (CMarkupMSXML) the Document Type node is never found by the FindNode method. In DOM the Document Type node is not treated like the other nodes; it is more like a separate read-only member of the document object.

Whitespace is simple in CMarkup because the document is kept as one large string rather than being broken into node objects, so no whitespace information is lost. In MSXML, whitespace is usually lost unless explicitly retained, and the preserveWhiteSpace member does not seem to make any difference. There is also a concern about efficiency when a DOM implementation like MSXML stores whitespace nodes. Usually whitespace is not very important except in literary text and mixed content.

If you are rendering the mixed content in the above example, you do not want to lose the space between the two elements in <B>content</B> <I>okay</I>. To get MSXML to preserve whitespace in an element's content, you can use a DTD like the one in the example with an xml:space attribute for the element in which you want to preserve whitespace. For more information see "White Space Handling" in the XML Specification.