Split XML file into smaller pieces

To split an XML file into smaller pieces, you read through the input file, creating output files and transferring subdocuments as you go. Whether you are working in C++ or scripting in FOAL, CMarkup makes it simple. For large XML files, use the CMarkup file read mode shown below, which reads the input with very little memory while extracting subdocuments. See this video of an XML splitter script to watch the process in action. Also check out this other article with even simpler techniques to split XML.

The first question when splitting XML is where to split it. There may be a logical place to divide the XML, such as at the subdocuments immediately under the root. Or you might simply have a size limit and want to divide a large XML file with ten million objects into files with one million objects each.

[Figure: an XML file with ten million objects split into 10 XML files with one million objects each]
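The arithmetic behind that picture can be sketched in a few lines. This is only an illustration: the function names PieceCount and LastPieceSize are made up here and are not part of CMarkup. It shows that when the object total is not an exact multiple of the per-file maximum, one extra file is needed and the last file holds the remainder.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of CMarkup): number of output files needed
// to hold totalObjects when each file takes at most maxPerFile objects.
std::int64_t PieceCount(std::int64_t totalObjects, std::int64_t maxPerFile)
{
    assert(totalObjects >= 0 && maxPerFile > 0);
    // Ceiling division: a partial final file still counts as a file
    return (totalObjects + maxPerFile - 1) / maxPerFile;
}

// Hypothetical helper: number of objects in the final output file
// (0 when there are no objects at all).
std::int64_t LastPieceSize(std::int64_t totalObjects, std::int64_t maxPerFile)
{
    assert(totalObjects >= 0 && maxPerFile > 0);
    if (totalObjects == 0)
        return 0;
    const std::int64_t remainder = totalObjects % maxPerFile;
    return remainder == 0 ? maxPerFile : remainder;
}
```

With ten million objects and a maximum of one million per file, PieceCount gives 10 and every file is full. With one object more, it gives 11 and the last file holds a single object.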

Below is C++ XML splitter code to split an XML file containing N million objects into N files containing 1 million objects each. Here is the idea:

  • Use two CMarkup objects, one for the input file to be split, and one for the output files
  • Open the big input file to begin looping through all the objects in it
  • Open an output file using the output file count to form the filename
  • Transfer object subdocuments from the input file to the output file until the object count reaches the maximum
  • Close the output file, reset the object count, increment the output file count
  • If not at the end of the input file, open a new output file as above and continue
  • At the end of the input file, exit loop, close output file (if left open), close input file

    // Split XML
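    CMarkup xmlInput, xmlOutput;
    // Open the input in file read mode so it is read with very little memory
    xmlInput.Open( "please_split.xml", MDF_READFILE );
    int nObjectCount = 0, nFileCount = 0;
    // Loop through every object element in the input file
    while ( xmlInput.FindElem("//object") )
    {
      if ( nObjectCount == 0 )
      {
        // Start a new output file, using the output file count to form the filename
        ++nFileCount;
        xmlOutput.Open( "piece" + StrFromInt(nFileCount) + ".xml", MDF_WRITEFILE );
        xmlOutput.AddElem( "root" );
        xmlOutput.IntoElem();
      }
      // Transfer the object subdocument from the input file to the output file
      xmlOutput.AddSubDoc( xmlInput.GetSubDoc() );
      ++nObjectCount;
      if ( nObjectCount == 1000000 )
      {
        // Object count maximum reached: close this output file and reset the count
        xmlOutput.Close();
        nObjectCount = 0;
      }
    }
    // Close the last output file if it was left open
    if ( nObjectCount )
      xmlOutput.Close();
    xmlInput.Close();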
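
The splitter code calls StrFromInt to build each output filename but does not define it. If you need to supply it yourself in C++, a minimal stand-in might look like the sketch below. It assumes StrFromInt simply converts an integer to its decimal string and that a std::string is acceptable where the filename is used. PieceFileName is a made-up name for illustration.

```cpp
#include <cassert>
#include <string>

// Assumed behavior: convert an integer to its decimal string form.
// This is a stand-in written for this sketch, not CMarkup's own code.
std::string StrFromInt(int n)
{
    return std::to_string(n);
}

// Hypothetical helper: form the output filename from the output file count,
// matching the "piece" + count + ".xml" pattern used in the splitter.
std::string PieceFileName(int nFileCount)
{
    assert(nFileCount >= 1);  // the splitter increments the count before first use
    return "piece" + StrFromInt(nFileCount) + ".xml";
}
```

The first output file is then piece1.xml, the second piece2.xml, and so on.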

You could also use size rather than object count as the basis for where to split the XML document. To do this, keep a tally of the subdocument sizes until a threshold is reached. The subdocument transfer shown above occurs in xmlOutput.AddSubDoc( xmlInput.GetSubDoc() ). You can instead do it in two steps and track the size like this:

    MCD_STR sObject = xmlInput.GetSubDoc();
    nOutputLength += MCD_STRLENGTH(sObject);
    xmlOutput.AddSubDoc( sObject );
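
To see how the tally drives the rollover to a new file, here is a sketch of the size-based loop with the CMarkup calls replaced by plain strings so it can run on its own. The function name AssignPiecesBySize and the rule "close the file once the tally reaches the threshold" are assumptions for illustration. Under this rule a file can exceed the threshold by up to one subdocument.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: returns the 1-based output file number that each
// subdocument would be written to when splitting by size.
std::vector<int> AssignPiecesBySize(const std::vector<std::string>& subDocs,
                                    std::size_t nMaxLength)
{
    assert(nMaxLength > 0);
    std::vector<int> pieces;
    int nFileCount = 0;
    std::size_t nOutputLength = 0;
    bool bOutputOpen = false;
    for (const std::string& sObject : subDocs)
    {
        if (!bOutputOpen)
        {
            ++nFileCount;       // stands in for opening the next output file
            nOutputLength = 0;  // reset the tally for the new file
            bOutputOpen = true;
        }
        nOutputLength += sObject.size();  // tally, as with MCD_STRLENGTH(sObject)
        pieces.push_back(nFileCount);     // stands in for xmlOutput.AddSubDoc(sObject)
        if (nOutputLength >= nMaxLength)
            bOutputOpen = false;          // stands in for xmlOutput.Close()
    }
    return pieces;
}
```

For example, five 4-character subdocuments with a threshold of 10 put the first three in file 1 (the tally reaches 12) and the remaining two in file 2.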

Note, though, that nOutputLength is not the same as the output file byte size if your in-memory encoding is different from your file encoding, or if your in-memory encoding is UTF-16, where the length counts 2-byte units rather than bytes.
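
The mismatch is easy to see with a small subdocument containing one non-ASCII character. The sketch below uses only standard C++ string types rather than MCD_STR, and the helper names are made up for illustration. The asserts record the lengths for the subdocument <a>é</a> in each encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes a UTF-8 string occupies: one byte per code unit.
std::size_t Utf8ByteSize(const std::string& s) { return s.size(); }

// Bytes a UTF-16 string occupies when written without a byte order mark:
// two bytes per code unit.
std::size_t Utf16ByteSize(const std::u16string& s) { return s.size() * 2; }

// The same small subdocument, <a>é</a>, held in each encoding.
void CheckLengthVersusByteSize()
{
    const std::string    sUtf8  = "<a>\xC3\xA9</a>";  // é takes 2 bytes in UTF-8
    const std::u16string sUtf16 = u"<a>\u00E9</a>";   // é takes 1 code unit in UTF-16

    assert(sUtf16.size() == 8);           // in-memory length, in 2-byte units
    assert(Utf16ByteSize(sUtf16) == 16);  // bytes if written as UTF-16
    assert(Utf8ByteSize(sUtf8) == 9);     // bytes if written as UTF-8
}
```

So a tally of 8 for this subdocument held in memory as UTF-16 corresponds to 9 bytes in a UTF-8 output file and 16 bytes in a UTF-16 one. If the byte size matters, leave some margin in the threshold.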