C++ XML reader parses a very large XML file

Got a very large XML file to parse in C++? Need to process a huge XML file to calculate statistics, import into a database, or generate a report? Just want to look at the top of your big XML file to decide whether to continue, without ever reading the whole document into memory? The CMarkup file read mode provides a simple, high-performance C++ XML reader for these kinds of tasks.

File read and write modes (see C++ XML writer too) are in the developer version of CMarkup.

Even though CMarkup is surprisingly light and quick to load and modify multi-megabyte files in memory, this XML reader mode is for cases where that is not good enough. Read mode provides read-only forward-only pull parser access, even in huge multi-gigabyte XML files.

How to use CMarkup file read mode

Here's some code that scans the elements in a large XML file fast, without trying to load the entire file at once. The main difference from regular CMarkup usage is that rather than calling Load, you Open the file for read. Notice the Open and Close calls:

CMarkup xmlreader;
xmlreader.Open( "largeXMLfile.xml", MDF_READFILE );
xmlreader.FindElem(); // root
MCD_STR sType = xmlreader.GetAttrib( "infotype" );
xmlreader.IntoElem();
while ( xmlreader.FindElem() )
{
  MCD_STR sID = xmlreader.GetAttrib( "id" );
  xmlreader.IntoElem();
  MCD_STR sName = xmlreader.FindGetData( "name" );
  MCD_STR sRef = xmlreader.FindGetData( "ref" );
  xmlreader.OutOfElem();
}
xmlreader.Close();

Pull parser design

The CMarkup C++ XML reader requires no callbacks, no events, and no setup. Just open the file and pull what you want from it using the same CMarkup methods you would use if you were navigating it in memory. Rather than introducing an entirely new interface for the XML reader (and XML writer) functionality of CMarkup, most of the interface remains the same, except that you open the file instead of accessing the document in memory. See XML reader models: SAX versus XML pull parser for a discussion of the major XML reader design options.

Here is an example of a query lookup based on an id. Open the file, query the information, and close it. This example will do a sequential read through the file until it finds the matching information, or until it reaches the end of the file.

CMarkup xmlreader;
xmlreader.Open( "largeXMLfile.xml", MDF_READFILE );
if ( xmlreader.FindElem("//data[@id='5632av']") )
{
  xmlreader.IntoElem();
  MCD_STR sName = xmlreader.FindGetData( "name" );
}
xmlreader.Close();

C++ XML reader methods

CMarkup's file read mode limits the methods you can use and the ways you can use those methods. The key thing to remember is that it is forward-only pull parsing from file so you can only navigate forward in the document you are reading once-through. And since you can only read in a single position, you cannot use child element methods.

Here are the CMarkup methods that can be used, and a brief explanation of how they work in file read mode:

Open	With flag `MDF_READFILE`, opens file for read
Close	Closes file and ends file mode. Automatically invoked by destructor
FindElem	In file read mode, locates next sibling element, optionally matching tag name or path; however, unlike regular mode, if an element is not found then the current position will be at the end tag of the parent element or at the end of the document if it was not within a parent element
GetData	In file read mode, returns the string value of the current element or node
FindGetData	In file read mode, locates the next element matching the specified path and returns the string value; however, unlike regular mode, if an element is not found then the current position will be at the end tag of the parent element or at the end of the document if it was not within a parent element
GetAttrib	In file read mode, returns the string value of the specified attribute of the current element (or processing instruction)
HasAttrib	In file read mode, returns true if the specified attribute of the current element (or processing instruction) exists
GetNthAttrib	In file read mode, returns the name and value of attribute specified by number for the current element
GetAttribName	In file read mode, returns the name of attribute specified by number for the current element
GetNodeType	In file read mode, returns the node type of the current node
GetTagName	In file read mode, returns the tag name of the current element (or processing instruction)
FindNode	In file read mode, locates next sibling node, optionally matching node type(s); however, unlike regular mode, if a node is not found then the current position will be at the end tag of the parent element or at the end of the document if it was not within a parent element
IntoElem	In file read mode, goes "into" current element to find elements and nodes between its start and end tags
OutOfElem	In file read mode, goes "out of" current element to find elements and nodes after its end tag
GetElemPath	In file read mode, returns a string representing the absolute path of the main position element, allowing for a maximum of 255 uniquely named sibling elements
GetDoc	In file read mode, returns the partial document markup string which is the most recently retrieved from the file
GetSubDoc	In file read mode (supported as of Release 11.1; update of June 7, 2009), returns the markup string of the subdocument rooted in the current position element. If the element has no child elements, the element remains the current position; otherwise the current position is after the end of the subdocument

You can also use any CMarkup static utility function because these do not involve the CMarkup object state or data members.

A window into the document

File read mode reads the file one "read block" at a time, with the block size based on MARKUP_FILEBLOCKSIZE, and converts each block on the fly to the in-memory charset. The m_strDoc document string member serves as a partial document buffer, so you can view the current block of the document in the debugger variables the same way you do when not in file mode. This gives great visibility into the document and into the behind-the-scenes positioning in the actual document text.
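
To make the "read block" idea concrete, here is a minimal sketch in standard C++ of a reader that holds only one block of the file in memory at a time. It is not CMarkup's implementation: the BlockWindow class, the CountBytesByBlock helper, and the kBlockSize constant (standing in for a MARKUP_FILEBLOCKSIZE-based size) are hypothetical names made up for this illustration, and charset conversion is left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical block size for the sketch; it plays the role that a
// MARKUP_FILEBLOCKSIZE-based size plays in the text above.
constexpr std::size_t kBlockSize = 16;

// Hypothetical forward-only reader that keeps a single block in memory.
class BlockWindow
{
public:
    explicit BlockWindow(const char* path) : m_fp(std::fopen(path, "rb")) {}
    ~BlockWindow() { if (m_fp) std::fclose(m_fp); }

    // The object owns the open file, so copying is disabled.
    BlockWindow(const BlockWindow&) = delete;
    BlockWindow& operator=(const BlockWindow&) = delete;

    bool IsOpen() const { return m_fp != nullptr; }

    // Replace the window with the next block of the file.
    // Returns false once there is nothing left to read.
    bool NextBlock()
    {
        if (!m_fp)
            return false;
        char buf[kBlockSize];
        std::size_t n = std::fread(buf, 1, kBlockSize, m_fp);
        m_offset += m_window.size(); // the old window is now behind us
        m_window.assign(buf, n);
        return n > 0;
    }

    // The partial document currently held in memory.
    const std::string& Window() const { return m_window; }

    // File offset at which the current window starts.
    std::size_t Offset() const { return m_offset; }

private:
    std::FILE* m_fp;
    std::string m_window;
    std::size_t m_offset = 0;
};

// Walks the whole file while never holding more than one block.
std::size_t CountBytesByBlock(const char* path)
{
    BlockWindow reader(path);
    std::size_t total = 0;
    while (reader.NextBlock())
    {
        assert(reader.Window().size() <= kBlockSize);
        total += reader.Window().size();
    }
    return total;
}
```

Inspecting Window() and Offset() in a debugger after each NextBlock call gives the same kind of view the text describes for m_strDoc: a small slice of the document plus where that slice sits in the file.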

Copying a CMarkup object in file mode

The copy constructor and assignment operator = do not work when copying a CMarkup object in either read or write file mode. This is because the CMarkup object encapsulates an open file pointer, a system handle which can only be managed by one CMarkup object at a time.
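
The general C++ pattern behind this restriction can be sketched with a small wrapper class. FileOwner below is a hypothetical class written for this illustration, not part of CMarkup, and it does not claim to show how CMarkup handles copying internally. It only shows why an object that owns an open file is usually made non-copyable, and how ownership can be handed over explicitly instead.

```cpp
#include <cassert>
#include <cstdio>
#include <type_traits>
#include <utility>

// Hypothetical owner of an open file pointer (illustration only).
class FileOwner
{
public:
    FileOwner() = default;
    FileOwner(const char* path, const char* mode)
        : m_fp(std::fopen(path, mode)) {}
    ~FileOwner() { Close(); }

    // Two owners would both try to close the same handle,
    // so copying is disabled at compile time.
    FileOwner(const FileOwner&) = delete;
    FileOwner& operator=(const FileOwner&) = delete;

    // Moving transfers the handle and leaves the source closed.
    FileOwner(FileOwner&& other) noexcept : m_fp(other.m_fp)
    {
        other.m_fp = nullptr;
    }
    FileOwner& operator=(FileOwner&& other) noexcept
    {
        if (this != &other)
        {
            Close();
            m_fp = other.m_fp;
            other.m_fp = nullptr;
        }
        return *this;
    }

    bool IsOpen() const { return m_fp != nullptr; }

    // Access to the raw handle; only valid while the file is open.
    std::FILE* Get() const
    {
        assert(m_fp != nullptr);
        return m_fp;
    }

    void Close()
    {
        if (m_fp)
        {
            std::fclose(m_fp);
            m_fp = nullptr;
        }
    }

private:
    std::FILE* m_fp = nullptr;
};

static_assert(!std::is_copy_constructible<FileOwner>::value,
              "an open file has exactly one owner");
static_assert(!std::is_copy_assignable<FileOwner>::value,
              "an open file has exactly one owner");
```

In practice the same concern applies to a CMarkup object in file mode: pass it to other functions by reference or pointer rather than by value, so that only one object ever manages the open file.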

Poorly formed markup containment hierarchy

This note is only for developers dealing with HTML or loosely formed markup. CMarkup is designed to deal with ill-formed XML and things such as non-ended <br> line break tags in HTML. See Generic Markup In CMarkup.

So in that same spirit, file read mode is able to keep going despite non-hierarchically formed markup. However, the recovery algorithm works differently in read mode because CMarkup has not parsed the whole file and does not know what is to come in the rest of the file.

In the case of a non-ended element tag, file read mode has to assume that the end tag will be found later in the file, and it keeps that assumption until it encounters the end tag of the enclosing element. This differs from the in-memory policy for dealing with non-hierarchical markup described in the CMarkup Containment Hierarchy.
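
The forward-only constraint can be illustrated with a small stack-based sketch. This is not CMarkup's recovery algorithm, only a simplified model of the behavior described above. The TagEvent type and the FindNonEnded function are hypothetical names, and the input is assumed to be an already tokenized sequence of start and end tags. A start tag is kept open until either its own end tag or the end tag of an enclosing element arrives, and only at that point can the reader know the element was never ended.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token: a start tag or an end tag with its name.
struct TagEvent
{
    bool isEnd;
    std::string name;
};

// Returns the names of elements that never received their own end tag,
// in the order a forward-only reader would discover them.
std::vector<std::string> FindNonEnded(const std::vector<TagEvent>& events)
{
    std::vector<std::string> open;     // elements still assumed to be open
    std::vector<std::string> nonEnded; // elements closed by an enclosing end tag

    for (const TagEvent& ev : events)
    {
        if (!ev.isEnd)
        {
            // Reading forward, assume the end tag will show up later.
            open.push_back(ev.name);
            continue;
        }

        // Look for the nearest open element with this name.
        std::size_t i = open.size();
        while (i > 0 && open[i - 1] != ev.name)
            --i;
        if (i == 0)
            continue; // stray end tag with no matching start tag: ignore it

        assert(open[i - 1] == ev.name);

        // Everything opened after the match was never ended on its own.
        for (std::size_t j = open.size(); j > i; --j)
            nonEnded.push_back(open[j - 1]);
        open.erase(open.begin() + static_cast<std::ptrdiff_t>(i - 1), open.end());
    }

    // Elements still open at the end of the document were never ended either.
    for (std::size_t j = open.size(); j > 0; --j)
        nonEnded.push_back(open[j - 1]);
    return nonEnded;
}
```

For a token sequence corresponding to <p><br><b></b></p>, the br element stays on the stack while b is read, so b is treated as being inside br, and br is only recognized as non-ended when the end tag of p arrives. An in-memory parser that has seen the whole document is free to make a different decision.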