Node Methods in CMarkup
Update July 12, 2005: With CMarkup release 8.0, the node methods have been made available in the evaluation version so that everyone can try the full HTML and generic markup capabilities of CMarkup. Node methods are key to providing access to mixed content as well as Other Markup (non-XML).
The node methods are FindNode, AddNode, InsertNode, RemoveNode, and GetNodeType. In addition, the SetData, GetData, SetAttrib, GetAttrib, and GetAttribName methods can be used with non-element nodes in certain cases.
The node methods complement but in no way replace the standard element methods. The element methods still map out the real structure of the XML document, and the node methods are a mechanism for manipulating the nodes found here and there in between the elements. If you want to ignore comments and other extraneous non-element nodes, just don't use the node methods. The node methods are completely optional, even if there are non-element nodes in your document.
|Node Type||Name||Data|
|Element||ITEM||contained between start tag and end tag|
|Comment||#comment||contained between dashes|
|Processing Instruction||xml||contained between question marks|
|Document Type||greeting||everything including first < and last >|
|CDATA Section||#cdata-section||character data between inner brackets|
|Text||#text||parsed character data between markup tags|
|Whitespace||#text||whitespace character data between markup tags|
|Lone End Tag||ITEM||ill-formed support added in CMarkup 8.0|
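The table can be read as a rule for which part of a node counts as its data. As a rough illustration only, the following standalone C++ sketch classifies a single token by its delimiters and returns the data portion described above. It is not CMarkup's implementation, it assumes well-formed tokens, and the names NodeKind, classifyToken, and tokenData are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical node kinds mirroring the table; these are not CMarkup's MNT_ constants.
enum class NodeKind { Element, EndTag, Comment, ProcessingInstruction,
                      DocumentType, CDataSection, Text, Whitespace };

static bool startsWith(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Classify one token: either a single markup tag or a run of character data.
// An end tag is only a "lone" end tag when no start tag matches it, which
// cannot be decided from the token alone.
NodeKind classifyToken(const std::string& tok) {
    if (startsWith(tok, "<!--"))      return NodeKind::Comment;
    if (startsWith(tok, "<![CDATA[")) return NodeKind::CDataSection;
    if (startsWith(tok, "<!DOCTYPE")) return NodeKind::DocumentType;
    if (startsWith(tok, "<?"))        return NodeKind::ProcessingInstruction;
    if (startsWith(tok, "</"))        return NodeKind::EndTag;
    if (startsWith(tok, "<"))         return NodeKind::Element;
    // Character data: whitespace-only runs are reported separately from text.
    if (tok.find_first_not_of(" \t\r\n") == std::string::npos)
        return NodeKind::Whitespace;
    return NodeKind::Text;
}

// Return the part of the token that the table calls its data.
std::string tokenData(const std::string& tok) {
    switch (classifyToken(tok)) {
    case NodeKind::Comment:               // between "<!--" and "-->"
        assert(tok.size() >= 7);
        return tok.substr(4, tok.size() - 7);
    case NodeKind::ProcessingInstruction: // between "<?" and "?>"
        assert(tok.size() >= 4);
        return tok.substr(2, tok.size() - 4);
    case NodeKind::CDataSection:          // between "<![CDATA[" and "]]>"
        assert(tok.size() >= 12);
        return tok.substr(9, tok.size() - 12);
    default:
        // Document type, text, and whitespace: the whole token is the data.
        // Start and end tags are returned unchanged because element data
        // spans more than one token.
        return tok;
    }
}
```

For instance, tokenData applied to the comment `<!--comment-->` yields `comment`, matching the "contained between dashes" row.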
When creating and modifying nodes, CMarkup does not verify the text, so you can easily cause the document to become ill-formed. For example, do not add a comment containing "-->" or a processing instruction containing "?>". However, the usual special characters are properly encoded and decoded in Text nodes.
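Because the text is not verified, a caller may want to check it before creating a node. The sketch below shows one way to do that in standalone C++. The helper names are hypothetical and are not part of CMarkup, and the CDATA check for "]]>" is an addition based on the XML rules rather than on the text above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers; CMarkup itself performs no such checks.

// Comment text must not contain the comment terminator.
bool isSafeCommentText(const std::string& text) {
    return text.find("-->") == std::string::npos;
}

// Processing instruction text must not contain the instruction terminator.
bool isSafeProcessingInstructionText(const std::string& text) {
    return text.find("?>") == std::string::npos;
}

// CDATA section text must not contain the section terminator.
bool isSafeCDataText(const std::string& text) {
    return text.find("]]>") == std::string::npos;
}

// Build comment markup, refusing text that would end the comment early.
std::string makeComment(const std::string& text) {
    assert(isSafeCommentText(text));
    return "<!--" + text + "-->";
}
```

A caller could run the matching check on the text before passing it to AddNode or SetData, and reject or rewrite the text when the check fails.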
The following paragraphs explain how to manipulate nodes according to three main uses for node methods in CMarkup.
To navigate a document watching for elements and comments, call FindNode(), checking the return value for MNT_COMMENT. You may also be interested in processing instructions, in which case you would check for MNT_PROCESSING_INSTRUCTION. Use GetData() to get the text of the comment or the entire processing instruction (including the target). You still need to use IntoElem and OutOfElem when you encounter elements, according to how you wish to traverse the hierarchy of the document.
To create a comment after the current main position, call AddNode( CMarkup::MNT_COMMENT, "comment" ) (for a comment, you specify the text in between the dashes). You can also use InsertNode to insert before, or even navigate to non-element nodes using FindNode and add or insert there. So you have full flexibility on where to put nodes.
To navigate the mixed content of an element, navigate to the element in the main position and call IntoElem(). Then call FindNode(), checking the return value for the type of node. When the node type is 0, there are no more child nodes in the current element. When it is MNT_ELEMENT, you generally want to go into that element and navigate the nodes there too, in case it also contains mixed content. To do this you should keep track of how many levels deep you have gone. When positioned on a non-element node, you retrieve its text using the GetData method. In a CDATA section it is the character data inside the inner brackets. In a text node it is the parsed character data from the end of the previous markup tag to the start of the following markup tag. In the following example, the trailing space is included in the Text node "I ", while the space between the I element and the B element is a Whitespace node.
<poem>I <I>saw</I> <B>her</B></poem>
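To make the split between Text and Whitespace nodes concrete, here is a small standalone C++ sketch that breaks markup like the poem example into tags and character data while tracking how many levels deep each piece sits. It only illustrates the idea and is not how CMarkup works internally. The Piece struct and splitMixedContent function are hypothetical names, and only plain start and end tags are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Piece {
    std::string kind;  // "start", "end", "text", or "whitespace"
    std::string value; // tag including angle brackets, or raw character data
    int depth;         // nesting level at which the piece occurs
};

// Split markup into tags and character data, tracking element depth.
// Simplification: no comments, CDATA sections, empty-element tags, or
// entity decoding are handled.
std::vector<Piece> splitMixedContent(const std::string& doc) {
    std::vector<Piece> pieces;
    int depth = 0;
    std::size_t pos = 0;
    while (pos < doc.size()) {
        if (doc[pos] == '<') {
            std::size_t close = doc.find('>', pos);
            assert(close != std::string::npos); // sketch assumes terminated tags
            std::string tag = doc.substr(pos, close - pos + 1);
            if (tag.size() > 1 && tag[1] == '/') {
                --depth;                        // leaving an element
                pieces.push_back({"end", tag, depth});
            } else {
                pieces.push_back({"start", tag, depth});
                ++depth;                        // entering an element
            }
            pos = close + 1;
        } else {
            // Character data runs up to the next tag or the end of input.
            std::size_t next = doc.find('<', pos);
            if (next == std::string::npos) next = doc.size();
            std::string data = doc.substr(pos, next - pos);
            bool blank = data.find_first_not_of(" \t\r\n") == std::string::npos;
            pieces.push_back({blank ? "whitespace" : "text", data, depth});
            pos = next;
        }
    }
    return pieces;
}
```

Applied to the poem example, the second piece is the text "I " at depth 1, and the piece between the closing I tag and the opening B tag is a single-space whitespace piece, also at depth 1.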
To create mixed content use AddNode(nodetype,text). This method places mixed content nodes immediately after the current main position node. Use this method to create an empty element where the text argument is the tag name and no end of line (EOL) is put after the element. The following table shows the nodes (primarily used in mixed content) that have no EOL placed after them:
|Type||End Of Line (EOL) Added|
|Text (or Whitespace)||no|
The xml version declaration (it is a processing instruction with the reserved target "xml") and the DTD are special nodes appearing at the top of the XML document. You can access these after calling ResetPos with the FindNode method. You can also add and remove them with the other node methods. When you set or get the data, you have to be familiar with the part of the node that is considered "data." In the processing instruction, it is everything between the question marks, i.e. xml version="1.0". Processing instructions are also used to give application-specific instructions, for example appname highdef="true". In the DTD, the data is everything from the first < to the end >.
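Since the data of a processing instruction includes the target, code that wants the target alone has to split it off. A minimal standalone sketch of that split follows. The function splitInstruction is a hypothetical helper, not a CMarkup method, and it assumes the target is separated from the rest by whitespace.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: split processing instruction data such as
//   xml version="1.0"
// into its target ("xml") and the remaining text (version="1.0").
std::pair<std::string, std::string> splitInstruction(const std::string& data) {
    assert(!data.empty()); // a processing instruction always has a target
    const char* ws = " \t\r\n";
    std::size_t targetEnd = data.find_first_of(ws);
    if (targetEnd == std::string::npos)
        return std::make_pair(data, std::string());   // target only
    std::string target = data.substr(0, targetEnd);
    std::size_t restStart = data.find_first_not_of(ws, targetEnd);
    if (restStart == std::string::npos)
        return std::make_pair(target, std::string()); // trailing whitespace only
    return std::make_pair(target, data.substr(restStart));
}
```

With the application-specific example above, the target would be `appname` and the remainder `highdef="true"`.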
The following sample document has an xml declaration (processing instruction), comments, DTD, and mixed content in the third ITEM element.
<?xml version="1.0"?>
<!DOCTYPE PARSETEST [
<!ELEMENT PARSETEST (ITEM*)>
<!ATTLIST PARSETEST v CDATA '' s CDATA ''>
<!ELEMENT ITEM (#PCDATA|B|I)*>
<!ATTLIST ITEM note CDATA '' xml:space (default|preserve) 'preserve'>
<!ELEMENT B ANY>
<!ELEMENT I ANY>
]>
<!--tightcomment-->
<PARSETEST v="1" s='6'>
<!-- mid comment -->
<ITEM note="hi"/>
<ITEM note="see data">hi</ITEM>
<ITEM> mixed <B>content</B> <I>okay</I></ITEM>
</PARSETEST>
<!-- end comment -->
After instantiating a CMarkup object with the above document, the following code demonstrates methods to navigate and modify it:
// Document-level nodes: xml declaration, then the first comment
xml.ResetPos();
ASSERT( xml.FindNode() == xml.MNT_PROCESSING_INSTRUCTION );
ASSERT( xml.GetData() == _T("xml version=\"1.0\"") );
ASSERT( xml.FindNode(xml.MNT_COMMENT) );
ASSERT( xml.GetData() == _T("tightcomment") );
ASSERT( xml.SetData( _T("comment 1 changed") ) );
ASSERT( xml.GetData() == _T("comment 1 changed") );
// Root element and its children, skipping whitespace nodes
ASSERT( xml.FindNode(xml.MNT_EXCLUDE_WHITESPACE) == xml.MNT_ELEMENT );
ASSERT( xml.GetTagName() == _T("PARSETEST") );
ASSERT( xml.IntoElem() );
ASSERT( xml.FindNode(xml.MNT_EXCLUDE_WHITESPACE) == xml.MNT_COMMENT );
ASSERT( xml.GetData() == _T(" mid comment ") );
ASSERT( xml.FindNode(xml.MNT_EXCLUDE_WHITESPACE) == xml.MNT_ELEMENT );
ASSERT( xml.FindNode(xml.MNT_EXCLUDE_WHITESPACE) == xml.MNT_ELEMENT );
ASSERT( xml.FindNode(xml.MNT_EXCLUDE_WHITESPACE) == xml.MNT_ELEMENT );
// Mixed content of the third ITEM element
ASSERT( xml.IntoElem() );
ASSERT( xml.FindNode() == xml.MNT_TEXT );
ASSERT( xml.GetData() == _T(" mixed ") );
ASSERT( xml.FindNode() == xml.MNT_ELEMENT );
ASSERT( xml.GetTagName() == _T("B") );
ASSERT( xml.IntoElem() );
ASSERT( xml.FindNode() == xml.MNT_TEXT );
ASSERT( xml.GetData() == _T("content") );
ASSERT( xml.FindNode() == 0 );
ASSERT( xml.OutOfElem() );
ASSERT( xml.FindNode() == xml.MNT_WHITESPACE );
ASSERT( xml.GetData() == _T(" ") );
ASSERT( xml.FindNode() == xml.MNT_ELEMENT );
ASSERT( xml.GetTagName() == _T("I") );
ASSERT( xml.IntoElem() );
ASSERT( xml.FindNode() == xml.MNT_TEXT );
ASSERT( xml.GetData() == _T("okay") );
ASSERT( xml.FindNode() == 0 );
ASSERT( xml.OutOfElem() );
ASSERT( xml.FindNode() == 0 );
ASSERT( xml.OutOfElem() );
The above code works with CMarkup and CMarkupMSXML. However, note that it does not test for the Document Type node because in the MSXML wrapper (CMarkupMSXML) the Document Type node is never found by the FindNode method. In DOM the Document Type node is not treated like the other nodes; it is more like a separate read-only member of the document object.
Whitespace is simple in CMarkup because the document is kept as one large string rather than being broken into node objects, so no whitespace information is lost. In MSXML, whitespace is usually lost unless explicitly retained, and the preserveWhiteSpace member does not seem to make any difference. There is also a concern about efficiency when a DOM implementation like MSXML stores whitespace nodes. Usually whitespace is not very important except in literary text and mixed content.
If you are rendering the mixed content in the above example, you do not want to lose the space between the two elements in
<B>content</B> <I>okay</I>. To get MSXML to preserve whitespace in an element's content, you can use a DTD like the one in the example with an xml:space attribute for the element in which you want to preserve whitespace. For more information see "White Space Handling" in the XML Specification.