Split XML file into smaller pieces

To split an XML file into smaller pieces, you read through the input file, creating output files and transferring subdocuments as you go. Whether in C++ or scripting in FOAL, CMarkup makes it simple. For large XML files, use CMarkup file read mode, shown below, to read the file with very little memory while extracting subdocuments. See this video of an XML splitter script to watch the process in action. Also check out this other article with even simpler techniques to split XML.

The first question when splitting XML is where to split it. There may be a logical place to divide the XML, such as at the subdocuments immediately under the root. Or you might simply have a size limit and want to divide a large XML file of ten million objects into files of one million each.

Below is C++ XML splitter code to split an XML file containing N million objects into N files containing 1 million objects. Here is the idea:
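As a rough sketch of that loop, here is a version that uses only the C++ standard library so it compiles without CMarkup. The function name splitByCount and the in-memory vectors are assumptions made for illustration: each input string stands in for one subdocument as GetSubDoc() would return it, and each output string stands in for one output file wrapped in a root element.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the splitter loop, not CMarkup code.
// subDocs: the subdocuments found in the input, in document order.
// Returns one string per output file, each wrapped in <root>...</root>.
std::vector<std::string> splitByCount(const std::vector<std::string>& subDocs,
                                      std::size_t nObjectsPerFile)
{
    assert(nObjectsPerFile > 0);
    std::vector<std::string> outputFiles;
    std::size_t nObjectCount = 0;
    for (const std::string& subDoc : subDocs)    // plays the role of FindElem()
    {
        if (nObjectCount == 0)
            outputFiles.push_back("<root>\n");   // start a new output file
        outputFiles.back() += subDoc + "\n";     // transfer the subdocument
        if (++nObjectCount == nObjectsPerFile)
        {
            outputFiles.back() += "</root>\n";   // this file is full, close it
            nObjectCount = 0;
        }
    }
    if (nObjectCount != 0)
        outputFiles.back() += "</root>\n";       // close a partial last file
    return outputFiles;
}
```

With five subdocuments and nObjectsPerFile set to 2, this yields three outputs, the last holding a single subdocument. In a real splitter the vectors would be replaced by a file opened in read mode and files opened in write mode, so that only one subdocument is held in memory at a time.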

You could also use size rather than object count as the basis of where to split the XML document. To do this, keep a tally of the subdocument sizes until a threshold is reached. The subdocument transfer shown above occurs in xmlOutput.AddSubDoc( xmlInput.GetSubDoc() ). You can instead do it in two steps: first store the string returned by GetSubDoc() and add its length to a running total, nOutputLength, then pass the stored string to AddSubDoc().
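The size tally described above can be sketched with the standard library alone. This is an illustration under assumptions, not CMarkup code: splitBySize, nThreshold and the vectors are hypothetical names, and std::string::size() stands in for whatever length function your string class provides.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a size-based splitter, not CMarkup code.
// A new output is started once the tally of subdocument lengths in the
// current output reaches nThreshold.
std::vector<std::string> splitBySize(const std::vector<std::string>& subDocs,
                                     std::size_t nThreshold)
{
    assert(nThreshold > 0);
    std::vector<std::string> outputFiles;
    std::size_t nOutputLength = 0;
    bool bFileOpen = false;
    for (const std::string& subDoc : subDocs)
    {
        if (!bFileOpen)
        {
            outputFiles.push_back("<root>\n");   // start a new output file
            nOutputLength = 0;
            bFileOpen = true;
        }
        // Step 1: hold the subdocument (the role of GetSubDoc()).
        const std::string& strSubDoc = subDoc;
        // Step 2: tally its length, then transfer it (the role of AddSubDoc()).
        nOutputLength += strSubDoc.size();
        outputFiles.back() += strSubDoc + "\n";
        if (nOutputLength >= nThreshold)
        {
            outputFiles.back() += "</root>\n";   // threshold reached, close file
            bFileOpen = false;
        }
    }
    if (bFileOpen)
        outputFiles.back() += "</root>\n";       // close a partial last file
    return outputFiles;
}
```

The threshold is checked after the subdocument is added, so an output can exceed nThreshold by up to one subdocument. Checking before the transfer instead would keep every output under the limit, at the cost of handling a single subdocument that is larger than the limit on its own.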

Note, though, that nOutputLength is not the same as the output file's byte size if your in-memory encoding differs from your file encoding, or if the encoding is UTF-16, where the length counts 2-byte units rather than bytes.
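To make the difference concrete, here is a small standard C++ sketch, assuming the text is held in memory as UTF-16 and written to the file as UTF-8. The helper name utf8ByteLength is hypothetical and is not part of CMarkup.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes the given UTF-16 text occupies once encoded as UTF-8.
// Assumes well-formed UTF-16: every high surrogate is followed by a low one.
std::size_t utf8ByteLength(const std::u16string& text)
{
    std::size_t nBytes = 0;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const char16_t c = text[i];
        if (c < 0x80)
            nBytes += 1;                 // ASCII
        else if (c < 0x800)
            nBytes += 2;                 // e.g. accented Latin letters
        else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < text.size())
        {
            nBytes += 4;                 // surrogate pair is one code point
            ++i;                         // skip the low surrogate
        }
        else
            nBytes += 3;                 // rest of the Basic Multilingual Plane
    }
    return nBytes;
}
```

For the four-unit string u"caf\u00E9", text.size() is 4 while utf8ByteLength returns 5, and the same text written as UTF-16 would take 8 bytes. If the file size limit matters, tally a byte count like this rather than the in-memory length.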

comment posted File read GetSubDoc incomplete

Aditya Raut 07-Apr-2011

I believe I have found a bug. I am sending you a sample file with this email along with my FOAL script. The problem.xml file contains the same record five times. When we split it using my code, part1.xml is corrupt and not valid according to MSXML, because the GetSubDoc() function has failed to get the element properly. Surprisingly, the next four parts are fine! It's the same record copied and pasted, but it is wrong the first time and right the next four times.

split()
{
  CMarkup xmlInput, xmlOutput;
  xmlInput.Open( "E:\\problem.xml", MDF_READFILE );
  int nObjectCount = 0, nFileCount = 0;
  while ( xmlInput.FindElem("//title") )   
  {
    if ( nObjectCount == 0 )
    {
      ++nFileCount;
      xmlOutput.Open( "E:\\part" + nFileCount + ".xml", MDF_WRITEFILE );
      xmlOutput.AddElem( "root" );
      xmlOutput.IntoElem();  //till here a copy of your example script
    }    
    xmlOutput.AddElem( "simplepage" ); //adding an element of my own
    xmlOutput.IntoElem();
    xmlOutput.AddSubDoc( xmlInput.GetSubDoc() ); //adding the title
    xmlInput.FindElem("//text"); //finding the text tag
    xmlOutput.AddSubDoc( xmlInput.GetSubDoc() ); //adding the text
    xmlOutput.OutOfElem(); //going out of the "simplepage" element created above
    ++nObjectCount;
    if ( nObjectCount == 1 ) //splitting the file per article
    {
      xmlOutput.Close();
      nObjectCount = 0;
    }    
  }  
  xmlOutput.Close();
  xmlInput.Close();
  return nFileCount;
}

The key to fixing this bug was the sample problem.xml file you attached. Thanks for reporting this and providing such an excellent reproducible example! This bug has been fixed in foxe release 2.4.2 and CMarkup release 11.5. When the memory buffer was doubled twice, from 16k to 64k, to hold the 48k element node without subelements, CMarkup miscalculated the starting offset of the node. After the first such element node, the buffer was large enough for the remaining elements, which is why it only failed on part1.xml.