Split XML file into smaller pieces

To split an XML file into smaller pieces, you read through the input file, creating output files and transferring subdocuments as you go. Whether you are writing C++ or scripting in FOAL, CMarkup makes this simple. For large XML files, use the CMarkup file read mode shown below, which reads the input with very little memory while extracting subdocuments. See this video of an XML splitter script to watch the process in action. Also check out this other article with even simpler techniques to split XML.

The first question when splitting XML is where to split it. There may be a logical place to divide the document, such as at the subdocuments immediately under the root. Or you might simply have a size limit and want to divide a large XML file of ten million objects into files of one million objects each.

Figure: Split XML file with ten million objects into 10 XML files with one million objects each

Below is C++ XML splitter code to split an XML file containing N million objects into N files containing 1 million objects each. Here is the idea:

  • Use two CMarkup objects, one for the input file to be split, and one for the output files
  • Open the big input file to begin looping through all the objects in it
  • Open an output file using the output file count to form the filename
  • Transfer object subdocuments from input file to output file until the object count reaches the maximum
  • Close the output file, reset the object count, increment the output file count
  • If not at the end of the input file, open a new output file as above and continue
  • At the end of the input file, exit loop, close output file (if left open), close input file
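
    // Split XML
    // StrFromInt is assumed to be a helper that converts an int to a string
    CMarkup xmlInput, xmlOutput;
    xmlInput.Open( "please_split.xml", MDF_READFILE );
    int nObjectCount = 0, nFileCount = 0;
    while ( xmlInput.FindElem("//object") )
    {
      if ( nObjectCount == 0 )
      {
        // Start the next output file with its own root element
        ++nFileCount;
        xmlOutput.Open( "piece" + StrFromInt(nFileCount) + ".xml", MDF_WRITEFILE );
        xmlOutput.AddElem( "root" );
        xmlOutput.IntoElem();
      }
      // Transfer the current object subdocument to the output file
      xmlOutput.AddSubDoc( xmlInput.GetSubDoc() );
      ++nObjectCount;
      if ( nObjectCount == 1000000 )
      {
        // Piece is full: close it so the next object opens a new file
        xmlOutput.Close();
        nObjectCount = 0;
      }
    }
    // Close the last output file if it was left open
    if ( nObjectCount )
      xmlOutput.Close();
    xmlInput.Close();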
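
The listing above calls StrFromInt without defining it. If your build does not already supply such a function, a minimal standard C++ stand-in might look like the sketch below. It assumes a build in which MCD_STR is std::string; both functions here are hypothetical illustrations, not part of CMarkup.

```cpp
#include <string>

// Hypothetical stand-in for the StrFromInt helper used in the listing:
// converts an int to its decimal string form.
std::string StrFromInt( int n )
{
    return std::to_string( n );
}

// Hypothetical helper: builds the name of the nFileCount-th output file
// the same way the listing does, "piece" + number + ".xml".
std::string PieceFileName( int nFileCount )
{
    return "piece" + StrFromInt( nFileCount ) + ".xml";
}
```

Because the number is not zero padded, piece10.xml sorts before piece2.xml in a plain alphabetical listing; pad the number if the order of the pieces matters to you.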

You could also use size rather than object count as the basis for where to split the XML document. To do this, keep a running tally of the subdocument sizes until a threshold is reached. The subdocument transfer shown above occurs in xmlOutput.AddSubDoc( xmlInput.GetSubDoc() ). You can instead do it in two steps and track the size like this:

    MCD_STR sObject = xmlInput.GetSubDoc();
    nOutputLength += MCD_STRLENGTH(sObject);
    xmlOutput.AddSubDoc( sObject );
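
To show how the tally decides where each piece ends, here is a sketch of the size-based loop using only the standard library. It is not CMarkup code: each string stands in for one GetSubDoc() result, each inner vector stands in for one output file, and SplitBySize and nMaxLength are hypothetical names chosen for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Groups subdocuments into pieces, closing a piece once the running
// length tally reaches nMaxLength (counted in characters, not file bytes).
std::vector<std::vector<std::string>> SplitBySize(
    const std::vector<std::string>& subDocs, std::size_t nMaxLength )
{
    assert( nMaxLength > 0 );
    std::vector<std::vector<std::string>> pieces;
    std::size_t nOutputLength = 0;
    bool bOpen = false; // true while an output piece is open
    for ( const std::string& sObject : subDocs )
    {
        if ( ! bOpen )
        {
            // Corresponds to opening the next output file
            pieces.emplace_back();
            nOutputLength = 0;
            bOpen = true;
        }
        // Two-step transfer: measure the subdocument, then add it
        nOutputLength += sObject.size();
        pieces.back().push_back( sObject );
        if ( nOutputLength >= nMaxLength )
            bOpen = false; // corresponds to closing the output file
    }
    return pieces;
}
```

The threshold is checked after each subdocument is added, so a piece can exceed nMaxLength by up to one subdocument. If the limit is strict, check the tally before adding instead.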

Note, however, that nOutputLength will not match the output file's byte size if your in-memory encoding differs from your file encoding, or if the in-memory encoding is UTF-16, where each length unit occupies 2 bytes.
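
To illustrate the difference, here is a sketch that assumes the in-memory string is UTF-16 and the output file is written as UTF-8. Utf8ByteLength is a hypothetical helper, not part of CMarkup; it estimates how many bytes a subdocument will occupy in the file so that the tally can track file size rather than string length.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes a UTF-16 string occupies
// once it is encoded as UTF-8.
std::size_t Utf8ByteLength( const std::u16string& s )
{
    std::size_t nBytes = 0;
    for ( std::size_t i = 0; i < s.size(); ++i )
    {
        char16_t c = s[i];
        if ( c < 0x80 )
            nBytes += 1; // ASCII
        else if ( c < 0x800 )
            nBytes += 2; // e.g. accented Latin letters
        else if ( c >= 0xD800 && c <= 0xDBFF && i + 1 < s.size()
                  && s[i+1] >= 0xDC00 && s[i+1] <= 0xDFFF )
        {
            nBytes += 4; // surrogate pair encodes one 4-byte character
            ++i;         // skip the low surrogate
        }
        else
            nBytes += 3; // rest of the Basic Multilingual Plane
    }
    return nBytes;
}
```

For example, the string "héllo" has a length of 5 in UTF-16 code units and takes 10 bytes in memory, but occupies 6 bytes in a UTF-8 file.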

Comment posted: File read GetSubDoc incomplete

Aditya Raut 07-Apr-2011

I believe I have found a bug. I am sending you a sample file with this email along with my FOAL script. The problem.xml file contains the same record five times. When we split it using my code, part1.xml is corrupt and not valid according to MSXML, because the GetSubDoc() function has failed to get the element properly. Surprisingly, the next four parts are fine! It's the same record copied and pasted, but it is wrong the first time and right the next four times.

    split()
    {
      CMarkup xmlInput, xmlOutput;
      xmlInput.Open( "E:\\problem.xml", MDF_READFILE );
      int nObjectCount = 0, nFileCount = 0;
      while ( xmlInput.FindElem("//title") )   
      {
        if ( nObjectCount == 0 )
        {
          ++nFileCount;
          xmlOutput.Open( "E:\\part" + nFileCount + ".xml", MDF_WRITEFILE );
          xmlOutput.AddElem( "root" );
          xmlOutput.IntoElem();  //till here a copy of your example script
        }    
        xmlOutput.AddElem( "simplepage" ); //adding a element of my own
        xmlOutput.IntoElem(); 
        xmlOutput.AddSubDoc( xmlInput.GetSubDoc() ); //adding the title
        xmlInput.FindElem("//text"); 		//finding text tag
        xmlOutput.AddSubDoc( xmlInput.GetSubDoc() ); //adding text
        xmlOutput.OutOfElem(); 	//going out of ( "simplepage" ) tag created
        ++nObjectCount;
        if ( nObjectCount == 1 ) //splitting the file per article
        {
          xmlOutput.Close();
          nObjectCount = 0;
        }    
      }  
      xmlOutput.Close();
      xmlInput.Close();
      return nFileCount;
    }

The key to fixing this bug was the sample problem.xml file you attached. Thanks for reporting this and providing such an excellent reproducible example! This bug has been fixed in foxe release 2.4.2 and CMarkup release 11.5. When the memory buffer was doubled twice, from 16k to 64k, to hold the 48k element node without subelements, CMarkup miscalculated the starting offset of the node. After the first such element node, the buffer was already large enough for the remaining elements, which is why it only failed on part1.xml.