HTML And CMarkup

With CMarkup you can navigate an HTML document to find all hyperlink or image elements, or whatever else you are looking for. You can also add and remove elements, attributes and content. The following HTML document happens to be well-formed XML except for the mismatched case of the P element, whose start tag is upper case and whose end tag is lower case.

<html>
  <head>
    <title>The Title</title>
  </head>
  <body>
    <P>Hello World</p>
  </body>
</html>

To use CMarkup with hand-generated HTML, set the MDF_IGNORECASE document flag. The following example reads the above HTML page from test.htm, extracts the title, and then changes it and writes it back out.

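CMarkup html;
// Match tag names case-insensitively so that <P> pairs with </p>
html.SetDocFlags( CMarkup::MDF_IGNORECASE );
html.Load( "test.htm" );
// Navigate html > head > title
html.FindElem( "html" );
html.IntoElem();
html.FindElem( "head" );
html.IntoElem();
html.FindElem( "title" );
// Read the title, append to it, and write the file back out
CString csTitle = html.GetData();
html.SetData( csTitle + " Is Changed" );
html.Save( "test.htm" );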

By default, all tag name and attribute name matching in CMarkup is case sensitive (unless you have defined MARKUP_IGNORECASE). To make one particular CMarkup object ASCII case insensitive, set the MDF_IGNORECASE flag. Set this flag before parsing the document with Load or SetDoc if there is any chance that an end tag differs in case from its start tag, as with the upper case P start tag and lower case p end tag of the Hello World paragraph above. If you don't want to affect any other document flags, you can set MDF_IGNORECASE as follows:

html.SetDocFlags( html.GetDocFlags() | CMarkup::MDF_IGNORECASE );
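
To make the two ideas above concrete, here is a minimal standalone sketch of combining bit flags with OR and of ASCII-only case-insensitive tag name matching. It does not use CMarkup and is not its implementation; the flag values and the helpers WithFlag and TagNamesMatch are hypothetical names chosen for illustration.

```cpp
#include <string>

// Hypothetical flag values for illustration only; the real constants are
// defined by CMarkup and may differ.
enum SketchDocFlags
{
    SKETCH_OTHER_FLAG = 4,
    SKETCH_IGNORECASE = 8
};

// OR-ing a flag into the current value leaves every other bit unchanged.
int WithFlag( int currentFlags, int flag )
{
    return currentFlags | flag;
}

// Lower-case a single character only if it is an ASCII upper case letter.
static char ToLowerAscii( char c )
{
    return ( c >= 'A' && c <= 'Z' ) ? static_cast<char>( c - 'A' + 'a' ) : c;
}

// Hypothetical helper: compare two tag names, ignoring ASCII case only
// when the ignore-case bit is set in docFlags.
bool TagNamesMatch( const std::string& a, const std::string& b, int docFlags )
{
    if ( a.size() != b.size() )
        return false;
    const bool ignoreCase = ( docFlags & SKETCH_IGNORECASE ) != 0;
    for ( std::size_t i = 0; i < a.size(); ++i )
    {
        char ca = ignoreCase ? ToLowerAscii( a[i] ) : a[i];
        char cb = ignoreCase ? ToLowerAscii( b[i] ) : b[i];
        if ( ca != cb )
            return false;
    }
    return true;
}
```

With no flags set, TagNamesMatch( "P", "p", 0 ) is false, which is why the mismatched paragraph tags are a problem under case-sensitive matching. After WithFlag( SKETCH_OTHER_FLAG, SKETCH_IGNORECASE ) both bits are set and the same comparison succeeds.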

The SetElemContent method (in CMarkup release 8.0) is convenient for setting HTML directly into the content of an element, such as a paragraph element p. The following two calls produce the paragraph shown after them.

html.AddElem( "p" );
html.SetElemContent( "This small image <br><img src=a.jpg>" );

<p>This small image <br><img src=a.jpg></p>

There are also AddElem and SetData Flags for generating HTML idiosyncrasies explicitly. Attributes without quotes or without values are parsed properly with GetAttrib and GetAttribName, although SetAttrib always generates attributes with quotes. For an attribute without a value, such as the HTML wrap attribute, GetAttrib returns the attribute name as the value.
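
The attribute behavior described above can be illustrated with a small standalone sketch. It is not CMarkup's parser; ParseStartTagAttribs and AttribList are hypothetical names, and the sketch assumes a single start tag as input. It accepts quoted and unquoted values and, mirroring the GetAttrib behavior described above, reports the attribute name as the value when an attribute has no value.

```cpp
#include <string>
#include <utility>
#include <vector>

typedef std::vector< std::pair<std::string, std::string> > AttribList;

static bool IsSpaceChar( char c )
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

// Hypothetical helper: parse the attributes of one start tag,
// for example <img src=a.jpg> or <textarea wrap cols="40">.
AttribList ParseStartTagAttribs( const std::string& tag )
{
    AttribList attribs;
    const std::size_t n = tag.size();
    std::size_t i = 0;

    // Skip the opening '<' and the tag name
    if ( i < n && tag[i] == '<' )
        ++i;
    while ( i < n && !IsSpaceChar( tag[i] ) && tag[i] != '>' && tag[i] != '/' )
        ++i;

    while ( i < n )
    {
        // Skip whitespace and any '/' of a self-closing tag
        while ( i < n && ( IsSpaceChar( tag[i] ) || tag[i] == '/' ) )
            ++i;
        if ( i >= n || tag[i] == '>' )
            break;

        // Read the attribute name
        std::size_t nameStart = i;
        while ( i < n && !IsSpaceChar( tag[i] ) && tag[i] != '='
                && tag[i] != '>' && tag[i] != '/' )
            ++i;
        std::string name = tag.substr( nameStart, i - nameStart );
        while ( i < n && IsSpaceChar( tag[i] ) )
            ++i;

        // An attribute without a value reports its own name as the value
        std::string value = name;
        if ( i < n && tag[i] == '=' )
        {
            ++i;
            while ( i < n && IsSpaceChar( tag[i] ) )
                ++i;
            if ( i < n && ( tag[i] == '"' || tag[i] == '\'' ) )
            {
                // Quoted value: read up to the matching quote
                const char quote = tag[i++];
                std::size_t valueStart = i;
                while ( i < n && tag[i] != quote )
                    ++i;
                value = tag.substr( valueStart, i - valueStart );
                if ( i < n )
                    ++i; // step over the closing quote
            }
            else
            {
                // Unquoted value: read up to whitespace or '>'
                std::size_t valueStart = i;
                while ( i < n && !IsSpaceChar( tag[i] ) && tag[i] != '>' )
                    ++i;
                value = tag.substr( valueStart, i - valueStart );
            }
        }
        attribs.push_back( std::make_pair( name, value ) );
    }
    return attribs;
}
```

For the tag <img src=a.jpg> the sketch yields one pair, src and a.jpg. For <textarea wrap cols="40"> it yields wrap with the value wrap, followed by cols with the value 40.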

There is no specific support for HTML in CMarkup; it is just part of Generic Markup In CMarkup. See also the navigation examples in Other Markup for more insight into navigating outside of well-formed XML.

CMarkup works best with properly nested HTML elements, because improperly nested elements can cause unpredictable results when navigating a document. Do not assume that HTML is nested properly just because it displays properly in a browser; browsers use workarounds for bad nesting. See Containment Hierarchy for more on this.
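
One way to check nesting before relying on navigation is a simple stack-based scan. The following is a rough standalone sketch, not a CMarkup feature: IsProperlyNested is a hypothetical helper, its list of elements without end tags is partial, it compares names with ASCII case ignored, and it assumes no '>' characters appear inside comments or attribute values.

```cpp
#include <string>
#include <vector>

// Lower-case ASCII letters only, leaving all other characters unchanged.
static std::string ToLowerAsciiStr( std::string s )
{
    for ( std::size_t i = 0; i < s.size(); ++i )
        if ( s[i] >= 'A' && s[i] <= 'Z' )
            s[i] = static_cast<char>( s[i] - 'A' + 'a' );
    return s;
}

// A few HTML elements that never have an end tag (partial list).
static bool IsVoidElement( const std::string& name )
{
    return name == "br" || name == "img" || name == "hr"
        || name == "input" || name == "meta" || name == "link";
}

// Hypothetical helper: returns true if every end tag closes the most
// recently opened element and nothing is left open at the end.
bool IsProperlyNested( const std::string& html )
{
    std::vector<std::string> openElems;
    std::size_t pos = 0;
    while ( ( pos = html.find( '<', pos ) ) != std::string::npos )
    {
        const std::size_t end = html.find( '>', pos );
        if ( end == std::string::npos )
            return false; // unterminated tag

        const bool isEndTag = ( pos + 1 < end && html[pos + 1] == '/' );
        std::size_t nameStart = pos + ( isEndTag ? 2 : 1 );
        std::size_t nameEnd = nameStart;
        while ( nameEnd < end && html[nameEnd] != ' ' && html[nameEnd] != '\t'
                && html[nameEnd] != '\r' && html[nameEnd] != '\n'
                && html[nameEnd] != '/' )
            ++nameEnd;
        const std::string name =
            ToLowerAsciiStr( html.substr( nameStart, nameEnd - nameStart ) );
        const bool selfClosed = ( end > pos + 1 && html[end - 1] == '/' );
        pos = end + 1;

        // Skip empty names, comments, doctype and processing instructions
        if ( name.empty() || name[0] == '!' || name[0] == '?' )
            continue;

        if ( isEndTag )
        {
            if ( openElems.empty() || openElems.back() != name )
                return false; // end tag does not match the innermost element
            openElems.pop_back();
        }
        else if ( !selfClosed && !IsVoidElement( name ) )
        {
            openElems.push_back( name );
        }
    }
    return openElems.empty();
}
```

Under these assumptions, <b><i>text</b></i> is reported as improperly nested even though a browser would typically display it, while <P>Hello World</p> passes because names are compared without regard to ASCII case.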

With paths (see Using Paths In CMarkup, CMarkup Developer only), you can get the title more quickly by calling html.FindGetData("/html/head/title") or, better yet, html.FindGetData("//title"). The anywhere path feature in the developer version of CMarkup Release 8.0 is also a powerful way to loop through hyperlinks or images in HTML files. For example, the following code visits every hyperlink in the HTML page.

html.Load( "test.htm" );
while ( html.FindElem("//A") )
{
  CString csHref = html.GetAttrib( "href" );
  // use csHref here, for example add it to a list of links
}