Containment Hierarchy

CMarkup maintains a hierarchy of elements that contain other elements (see Navigating Levels in CMarkup). In well-formed XML this is an unambiguous logical representation of the markup because all elements are correctly ended and nested. Even in a document that is not well-formed, a containment hierarchy can still be useful. This article describes how the containment hierarchy is determined for such documents, as part of the support for Generic Markup In CMarkup as of release 8.0.

A lot of non-XML markup uses tags without corresponding end tags. Examples in HTML are <BR> indicating a line break and <IMG src="a.jpg"> describing an image to be displayed.

The CMarkup parser uses a simple algorithm to create the containment hierarchy of elements. All non-ended elements in a subdocument are closed when the end tag of the subdocument's container element is encountered. These non-ended elements can be reached with all the normal navigation methods of the CMarkup object, but they never contain child elements. In the following example, when the closing tag of the P element is encountered, the BR tags are marked as non-ended elements and both are linked in as immediate child elements of the P element.

<P>We<BR>see<BR>tree</P>
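
The matching step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not CMarkup's actual C++ implementation: the function name classify_start_tags and the simplified tag pattern are assumptions made for the example, and comments, self-closing tags and case differences are not handled.

```python
import re

# Simplified tag pattern (an assumption for this sketch, not CMarkup's tokenizer).
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def classify_start_tags(markup):
    """Return [(name, 'ended' | 'non-ended'), ...] in document order."""
    status = []      # one entry per start tag, in document order
    open_tags = []   # stack of indices into status for elements still open
    for match in TAG.finditer(markup):
        is_end, name = match.group(1) == "/", match.group(2)
        if not is_end:
            open_tags.append(len(status))
            status.append([name, "non-ended"])  # provisional until an end tag matches
            continue
        # Look for the innermost open element with the same name.
        for depth in range(len(open_tags) - 1, -1, -1):
            if status[open_tags[depth]][0] == name:
                status[open_tags[depth]][1] = "ended"
                # Everything opened after it is closed here and stays non-ended.
                del open_tags[depth:]
                break
        # An end tag with no open match is a lone end tag; this sketch ignores it.
    return [tuple(entry) for entry in status]

print(classify_start_tags("<P>We<BR>see<BR>tree</P>"))
# [('P', 'ended'), ('BR', 'non-ended'), ('BR', 'non-ended')]
```

Only the P start tag finds its end tag, so both BR tags remain non-ended at the moment </P> is processed.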

HTML allows for non-ended container elements such as <P> and <LI> which are logically ended according to rules expressed in the HTML document type definition. Rather than use any specific knowledge of HTML, CMarkup treats them like any other non-ended element, that is, as empty elements. This does not stop you from navigating the nodes after one of these elements and processing them as if they were inside the relevant block.

Another common issue in HTML is incorrectly nested tags. In the following example, the B start and end tags overlap the I start and end tags:

<P><B>He <I>then</B> shouted</I></P>

In CMarkup, the I element is treated as a child of the B element. The B element therefore contains three child nodes: the text node "He ", the empty element I, and the text node "then". The I end tag is treated as a lone end tag node inside the content of the P element, for which GetNodeType returns MNT_LONE_END_TAG. So the P element also contains three child nodes: the B element, the text node " shouted", and the lone I end tag.
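
To make the resulting node structure concrete, here is a small Python sketch that builds such a tree for the overlapping example. It is a model under stated assumptions rather than CMarkup's code: the Node class, build_tree, describe and the "lone_end_tag" kind (standing in for MNT_LONE_END_TAG) are hypothetical names, and the tokenizer only understands plain tags and text.

```python
import re
from dataclasses import dataclass, field

# Simplified tokenizer: a tag, or a run of text (an assumption for this sketch).
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|[^<]+")

@dataclass
class Node:
    kind: str                       # "document", "element", "text" or "lone_end_tag"
    value: str                      # tag name or text content
    children: list = field(default_factory=list)

def build_tree(markup):
    root = Node("document", "")
    stack = [root]                  # open elements, innermost last

    def close_down_to(depth):
        # Elements above `depth` never got an end tag: move their provisional
        # content up to the parent, leaving them as empty elements.
        while len(stack) > depth + 1:
            node = stack.pop()
            stack[-1].children.extend(node.children)
            node.children = []

    for m in TOKEN.finditer(markup):
        if m.group(2) is None:      # text run
            stack[-1].children.append(Node("text", m.group(0)))
        elif m.group(1) == "":      # start tag
            node = Node("element", m.group(2))
            stack[-1].children.append(node)
            stack.append(node)
        else:                       # end tag: find the innermost open match
            for depth in range(len(stack) - 1, 0, -1):
                if stack[depth].value == m.group(2):
                    close_down_to(depth)
                    stack.pop()
                    break
            else:
                stack[-1].children.append(Node("lone_end_tag", m.group(2)))
    close_down_to(0)                # anything still open at the end is non-ended
    return root

def describe(node):
    return [(child.kind, child.value) for child in node.children]

p = build_tree("<P><B>He <I>then</B> shouted</I></P>").children[0]
b = p.children[0]
print(describe(b))  # [('text', 'He '), ('element', 'I'), ('text', 'then')]
print(describe(p))  # [('element', 'B'), ('text', ' shouted'), ('lone_end_tag', 'I')]
```

The text "then" is first collected under I and moved up into B when </B> arrives, which is why it appears exactly once in the tree.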

At first glance it seems unsatisfactory not to represent the overlap logically. In practice, however, it is better to keep the text in the overlapped part from being duplicated as data content of two different elements. Also, there are far more complicated ways of overlapping elements that can begin to make your mind squirm and are probably best dealt with in ways specific to the meaning of the elements.

Incorrect nesting is also caused by accidental lone end tags such as the paragraph end tag in the following example.

<P>Hello<BR>
<TABLE><TR><TD>A One Cell Table</P></TD></TR></TABLE>
</P>

When the first </P> end tag (the one inside the TD) is encountered, it is matched with the <P> start tag at the beginning, and all interim tags are closed. This turns TABLE, TR and TD into non-ended sibling elements of each other, hiding the intended hierarchical relationship, and leaves their end tags as lone end tags trailing outside the content of the paragraph.
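
The flattening effect can be checked with a short Python sketch that records the parent of every start tag and collects the end tags that match nothing. As before, this is a hedged model of the behavior described in this article, not CMarkup itself; parents_and_lone_end_tags is a hypothetical helper and the tag pattern is deliberately simplified.

```python
import re

# Simplified tag pattern (an assumption for this sketch, not CMarkup's tokenizer).
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def parents_and_lone_end_tags(markup):
    names, parent = [], []   # per start tag: name and parent index (None = top level)
    open_tags = []           # stack of start-tag indices still open
    lone = []                # names of end tags that matched no open element

    def release(unended, new_parent):
        # Elements provisionally placed under a non-ended element move up.
        for i, p in enumerate(parent):
            if p in unended:
                parent[i] = new_parent

    for m in TAG.finditer(markup):
        name = m.group(2)
        if m.group(1) == "":                       # start tag
            parent.append(open_tags[-1] if open_tags else None)
            open_tags.append(len(names))
            names.append(name)
            continue
        for depth in range(len(open_tags) - 1, -1, -1):
            if names[open_tags[depth]] == name:    # innermost open match
                release(open_tags[depth + 1:], open_tags[depth])
                del open_tags[depth:]
                break
        else:
            lone.append(name)
    release(open_tags, None)                       # still open at the end: non-ended
    named = [(names[i], None if p is None else names[p]) for i, p in enumerate(parent)]
    return named, lone

markup = """<P>Hello<BR>
<TABLE><TR><TD>A One Cell Table</P></TD></TR></TABLE>
</P>"""
parents, lone = parents_and_lone_end_tags(markup)
print(parents)  # [('P', None), ('BR', 'P'), ('TABLE', 'P'), ('TR', 'P'), ('TD', 'P')]
print(lone)     # ['TD', 'TR', 'TABLE', 'P']
```

Every element ends up directly under P, so the table structure is lost. In this sketch the final </P> is also reported as a lone end tag because no P element is open by then; the article does not state how CMarkup reports that particular tag.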

CMarkup does not attempt to repair the markup or to resolve the complex document object model issues created by incorrectly nested tags. Browsers handle invalid HTML according to the actual semantic meaning of the tags in relation to displaying the page (for example, the scopes of non-ended TABLE, TR and LI elements are all treated differently). CMarkup, by contrast, provides a straightforward tree in which each element exists once, based on the occurrence of its start tag.