Parse huge XML file in C++

Dave of NeoCurve wrote in asking to evaluate the developer version of CMarkup. He was quickly up and running with a 460MB subset of his XML data, and then with a 1.8GB XML file. He later said the "warp-speed pull-parser design allowed us to easily manage our huge data files with very little overhead."

CMarkup's XML pull functionality is designed for working with a large XML file in C++: it provides the low overhead of an XMLReader without the complexity of a SAX event-based parser. An XML pull parser lets you pull from a large XML file a little bit at a time, forward-only and read-only, processing the data you want while keeping only a small block of the file in memory at a time. The overhead is very small, and the speed is almost as fast as the I/O read operation itself.

XML pull parser functionality (file read mode) is in the developer version of CMarkup.

comment posted pull from a huge XML file a little bit at a time

Dave Terracino 05-May-2009

What we have is a rather large XML file [1.82GB] that we need to pull all the data from a little bit at a time. This is a file that was serialized to an XML file from a .NET application and needs to be read back in by our Visual C++ 6.0 application. We need this ASAP, as we've been trying other paths to no avail.

You can quickly write an application to process a file a little bit at a time starting with the Open method. The following code loops through all of the object elements to process the properties of each object.

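CMarkup xmlpullparser;
// MDF_READFILE opens the file in forward-only, read-only file read mode
if ( xmlpullparser.Open( "hugexmlfile.xml", CMarkup::MDF_READFILE ) )
{
  while ( xmlpullparser.FindElem("//object") )
  {
    // process object properties...
  }
  xmlpullparser.Close();
}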

comment posted missing tag in huge XML file

Dave Terracino 08-May-2009

So here is where we get into trouble... if a property is NULL, the .NET serializer doesn't write out the tag at all. So what happens when we use FindElem("property1") is that it scans to the end, and we can't read any more of the data. Do you have any suggestions? Is there some way to record where we started, and roll the file pointer back to that location so we can avoid the problem of a missing tag in the XML?

If you were loading the entire file into memory, you could use methods like ResetMainPos, RestorePos and GotoElemIndex to go back and scan for each property from the beginning of the object. But in file read mode you are forward-only. So don't do this:

xmlpullparser.IntoElem();
xmlpullparser.FindElem( "property1" );
str sProp1 = xmlpullparser.GetData();
xmlpullparser.FindElem( "property2" );
str sProp2 = xmlpullparser.GetData();
xmlpullparser.OutOfElem();

If any of those properties is not found, the parser will scan to the end of the object element, bypassing the remaining properties. Likewise, if the properties are not in the expected order, you could bypass some of them. Instead, you must handle each property in the order of occurrence, like this:

xmlpullparser.IntoElem();
while ( xmlpullparser.FindElem() )
{
  str sPropName = xmlpullparser.GetTagName();
  str sPropValue = xmlpullparser.GetData();
  if ( sPropName == "property1" )
    ; // do something with property 1
  else if ( sPropName == "property2" )
    ; // do something with property 2
  // etc
}
xmlpullparser.OutOfElem();
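One way to recover the convenience of looking properties up by name is to collect each object's properties in their order of occurrence and query them afterwards. The sketch below is only an illustration under stated assumptions: `PropertyMap`, `readProperties` and `propertyOr` are hypothetical names, not part of CMarkup, and the template assumes a reader whose `GetTagName` and `GetData` results convert to `std::string` (as in a CMarkup build that uses STL strings rather than MFC `CString`). It also assumes property names are unique within an object; a repeated name would overwrite the earlier value.

```cpp
#include <map>
#include <string>

typedef std::map<std::string, std::string> PropertyMap;

// Reads every child element of the reader's current element in document
// order, so a missing or reordered property never causes a forward scan
// past the remaining properties.
// Reader is any type offering the CMarkup-style calls used below.
template <typename Reader>
PropertyMap readProperties(Reader& reader)
{
    PropertyMap props;
    reader.IntoElem();
    while (reader.FindElem()) {
        const std::string name = reader.GetTagName();
        props[name] = reader.GetData();
    }
    reader.OutOfElem();
    return props;
}

// Returns the collected value, or `fallback` when the tag was absent,
// for example because the .NET serializer omitted a NULL property.
inline std::string propertyOr(const PropertyMap& props,
                              const std::string& name,
                              const std::string& fallback)
{
    const PropertyMap::const_iterator it = props.find(name);
    return it == props.end() ? fallback : it->second;
}
```

Inside the `FindElem("//object")` loop you would call `readProperties( xmlpullparser )` once per object and then use, say, `propertyOr( props, "property1", "" )` for each expected property. Only the properties of one object are held in memory at a time, so the low overhead of file read mode is preserved.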