Split XML file into smaller pieces
To split an XML file into smaller pieces, you read through the input file, creating output files and transferring subdocuments as you go. Whether in C++ or scripting in FOAL, CMarkup makes it simple. For large XML files, use the CMarkup file read mode shown below, which reads the input with very little memory while extracting subdocuments. See this video of an XML splitter script to watch the process in action. Also check out this other article with even simpler techniques to split XML.
The question when splitting XML is where to split it. There may be a logical place to divide the XML, such as at the subdocuments immediately under the root. Or you might simply have a size limit and want to divide a large XML file of ten million objects into files of one million each.

Below is C++ XML splitter code to split an XML file containing N million objects into N files containing 1 million objects each. Here is the idea:
// Split XML
CMarkup xmlInput, xmlOutput;
xmlInput.Open( "please_split.xml", MDF_READFILE );
int nObjectCount = 0, nFileCount = 0;
while ( xmlInput.FindElem("//object") )
{
    if ( nObjectCount == 0 )
    {
        ++nFileCount;
        xmlOutput.Open( "piece" + StrFromInt(nFileCount) + ".xml", MDF_WRITEFILE );
        xmlOutput.AddElem( "root" );
        xmlOutput.IntoElem();
    }
    xmlOutput.AddSubDoc( xmlInput.GetSubDoc() );
    ++nObjectCount;
    if ( nObjectCount == 1000000 )
    {
        xmlOutput.Close();
        nObjectCount = 0;
    }
}
if ( nObjectCount )
    xmlOutput.Close();
xmlInput.Close();
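The listing calls StrFromInt without defining it. As a minimal sketch, assuming a build in which the string type is std::string, the helper could look like the following. PieceFileName is a hypothetical wrapper added here only to show the resulting file name; neither function is part of CMarkup.

```cpp
#include <string>

// Hypothetical helper: the listing above uses StrFromInt but does not define it.
// This version assumes the string type in use is std::string.
std::string StrFromInt( int n )
{
    return std::to_string( n );
}

// Hypothetical wrapper that builds the output name the same way the listing does,
// for example "piece3.xml" for the third output file.
std::string PieceFileName( int nFileCount )
{
    return "piece" + StrFromInt( nFileCount ) + ".xml";
}
```

With these definitions, the expression passed to xmlOutput.Open in the listing is equivalent to PieceFileName(nFileCount).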
You could also use size rather than object count as the basis of where to split the XML document. To do this, keep a tally of the subdocument sizes until a threshold is reached. The subdocument transfer shown above occurs in xmlOutput.AddSubDoc( xmlInput.GetSubDoc() ). You can instead do it in two steps and track the size like this:
MCD_STR sObject = xmlInput.GetSubDoc();
nOutputLength += MCD_STRLENGTH(sObject);
xmlOutput.AddSubDoc( sObject );
Note, though, that nOutputLength is not the same as the output file size in bytes if your in-memory encoding differs from your file encoding, or if your encoding is UTF-16, where each unit of length occupies 2 bytes.
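To make the size tally and the encoding caveat concrete, here is a sketch in standard C++ that does not use CMarkup at all. It assumes the subdocuments are held in memory as UTF-16 and the output files are written as UTF-8; Utf8ByteLength and PlanPieces are hypothetical names introduced for illustration. It only decides which piece each subdocument would go to, based on a byte threshold.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of bytes the string would occupy when written as UTF-8.
// Assumes well-formed UTF-16 (each high surrogate is followed by a low surrogate).
std::size_t Utf8ByteLength( const std::u16string& s )
{
    std::size_t nBytes = 0;
    for ( std::size_t i = 0; i < s.size(); ++i )
    {
        char16_t c = s[i];
        if ( c < 0x80 )
            nBytes += 1;
        else if ( c < 0x800 )
            nBytes += 2;
        else if ( c >= 0xD800 && c <= 0xDBFF )
        {
            nBytes += 4; // a surrogate pair is one code point above U+FFFF
            ++i;         // skip the low surrogate
        }
        else
            nBytes += 3;
    }
    return nBytes;
}

// Returns the 1-based piece number for each subdocument. A new piece is started
// once the running UTF-8 byte tally of the current piece reaches nThreshold.
std::vector<int> PlanPieces( const std::vector<std::u16string>& subdocs,
                             std::size_t nThreshold )
{
    std::vector<int> pieces;
    std::size_t nOutputLength = 0;
    int nFileCount = 0;
    bool bStartPiece = true;
    for ( const std::u16string& sObject : subdocs )
    {
        if ( bStartPiece )
        {
            ++nFileCount;
            nOutputLength = 0;
            bStartPiece = false;
        }
        pieces.push_back( nFileCount );
        nOutputLength += Utf8ByteLength( sObject );
        if ( nOutputLength >= nThreshold )
            bStartPiece = true;
    }
    return pieces;
}
```

The tally here ignores the root element and any declaration written to each piece, so a real splitter would need some margin below the actual file size limit.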

