Inside the CMarkup Parser

Although CMarkup is often called a "parser," parsing is only part of what CMarkup does since it also supports navigating, creating and modifying documents as well as other functions like UTF16To8 and EncodeBase64. Nevertheless, the parser is the single most important part of CMarkup because that is how it imports and provides access to existing XML documents.

The parser does its work in the SetDoc and Load methods to process the document and build an index array. It is a non-validating parser, which means that it does not check the validity of the document against a DTD or schema (see How Compliant is CMarkup with XML?). It checks some well-formedness, producing an error if end tags do not match start tags, or there is an extra root element, or some cases of incorrectly formed nodes.

In release 7.0, a new parsing implementation was introduced that does not use recursion. Previously, in the recursive solution, the parse function called itself each time an element was found inside of an element, returning when it reached the end tag of the parent. Since each nested call adds local variables and arguments to the stack, recursion might have been a problem on platforms where stack size was extremely limited such as Palm OS. The vast majority of documents do not even go 10 elements deep (nested) which might have been a total of 500 bytes on the stack, but nevertheless it was a potential vulnerability. The new parser uses a small light array called ElemStack that behaves like a stack, but is on the heap. It keeps track of the nested element tag names up to the current depth.

Built for speed (see CMarkup Performance Tests), the parse function x_ParseElem is actually quite simple. It loops through all of the nodes until it reaches the end of the document calling x_ParseNode to find and check the next node. See the article on Node Methods in CMarkup for a discussion of node types. The end tag of an element is denoted by a node type of 0 because it is the end of the element node.

The ElemStack is used to keep track of nested elements. When a start tag is encountered, it stays on the NodeStack until the corresponding end tag is found. If the element contains elements, then they are placed above it in the NodeStack and their corresponding end tags would need to be matched and removed from the NodeStack before getting back down to the parent element.

The x_ParseNode function identifies a node given a starting character offset in the document. For speed and simplicity, the node is parsed in a simple loop forward from that first character, maintaining a bit-flag state until the node type has been determined.

If the starting character is a less than sign <, it is a tag <...> (as opposed to text or whitespace). If it is a tag, on the second character it will check if it is the start of an element tag name, or a slash (end tag), or an exclamation mark (comment, CDATA Section, or DOCTYPE) or question mark (PI). If it is an exclamation mark you have to go to the next character to determine the type of node. Once it knows the node type, it just scans for the appropriate terminating character sequence for that node type.

If the first character was not a less than sign and non-whitespace then it is a text node and it goes until reaching a less than sign < or end of document. If the first character is whitespace it also goes until a less than sign < or end of document, but watching for the first non-whitespace to indicate that it is a text node. If it reaches the end of the node without encountering non-whitespace, then it is a whitespace node.

Now let's walk through this process with the following sample XML document:

<?xml version="1.0"?><test> hello world </test>

In the first call to x_ParseNode, the starting character is the first character of the document. It is a less than sign, denoting a tag, so the OPENTAG bit is set. The next character is a question mark, so in the OPENTAG state the question mark means that it is a processing instruction (PI). Now it knows the node type and it unsets the OPENTAG bit. It scans for the PI terminating sequence ?>, and returns.

In the second call to x_ParseNode, the starting character is the first character of the test element. It is a less than sign, denoting a tag, so the OPENTAG bit is set. The next character is an allowable first character of an element tag name, so in the OPENTAG state this means that it is an element. Now it knows the node type and it unsets the OPENTAG bit. It scans for the element tag terminating character >, and returns.

In the third call to x_ParseNode, the starting character is the first character of the " hello world ". It is not a less than sign, it is a space, so the TEXTORWS (text or whitespace) bit is set. The next character is a text character, so in the TEXTORWS state this means that it is a text node. Now it knows the node type and it unsets the TEXTORWS bit. It scans for a less than sign < or end of document, and returns.

In the fourth call to x_ParseNode, the starting character is the first character of the end tag of the test element. It is a less than sign, denoting a tag, so the OPENTAG bit is set. The next character is a slash, so in the OPENTAG state this means that it is an end tag. Now it knows the type of tag and it unsets the OPENTAG bit. It scans for the end tag terminating character >, and returns.

Update July 12, 2005: In release 8.0, the parser was modified slightly to continue despite encountering a blatant error such as a non-ended element, a lone end tag, or other malformed node. And it continues even if it finds siblings to the root element. It still provides an error string and flags the document as ill-formed, but it allows full navigation and modification. See Generic Markup In CMarkup.