Split XML with XML editor script

On another forum, a user asked how to "Split XML and output to different files" and got no answer, only several confusing responses about versions of XSLT. This is how simple the answer is in the free firstobject XML editor:

split_XML()
{
  CMarkup input;
  input.Load( "input.xml" );
  while ( input.FindElem("//npc") )
    WriteTextFile( "npc"+input.GetAttrib("id")+".xml", input.GetSubDoc() );
}

This script reads input.xml and creates an npc[ID].xml file for each npc subdocument, where [ID] is the value of that element's id attribute. This was the question:

Starting with...

"input.xml"

<root>
<npc id="1">
  <p pid="1"/>
  <p pid="2"/>
  <p pid="3"/>
</npc>
<npc id="2">
  <p pid="3"/>
  <p pid="4"/>
  <p pid="5"/>
</npc>
<npc id="3">
  <p pid="4"/>
  <p pid="5"/>
  <p pid="6"/>
</npc>
</root>

Desired result

"npc1.xml"

<npc id="1">
  <p pid="1"/>
  <p pid="2"/>
  <p pid="3"/>
</npc>

"npc2.xml"

<npc id="2">
  <p pid="3"/>
  <p pid="4"/>
  <p pid="5"/>
</npc>

"npc3.xml"

<npc id="3">
  <p pid="4"/>
  <p pid="5"/>
  <p pid="6"/>
</npc>
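
For readers who want to check the result outside the editor, here is a rough Python sketch of the same split. It uses the standard library's xml.etree.ElementTree rather than CMarkup, so it is an analogue and not the script above. The names split_npc and write_files are made up for illustration, and the sample is a shortened version of input.xml. ElementTree writes empty elements as <p pid="1" />, so the output can differ from GetSubDoc in whitespace.

```python
import os
import xml.etree.ElementTree as ET


def split_npc(xml_text):
    """Return {filename: subdocument text} for every npc element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    files = {}
    # iter("npc") walks the whole tree, similar in spirit to the "//npc" path
    for npc in root.iter("npc"):
        npc.tail = None  # drop the whitespace that followed the element in its parent
        files["npc" + npc.get("id") + ".xml"] = ET.tostring(npc, encoding="unicode")
    return files


def write_files(files, folder="."):
    """Write each subdocument to its own file (hypothetical helper)."""
    for name, text in files.items():
        with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
            f.write(text)


# Shortened sample, not the full input.xml from the question
sample = """<root>
<npc id="1">
  <p pid="1"/>
  <p pid="2"/>
</npc>
<npc id="2">
  <p pid="3"/>
</npc>
</root>"""

files = split_npc(sample)
```

Calling write_files(files) would then create npc1.xml and npc2.xml in the current folder.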

The simplicity of a short bit of CMarkup code is yet another reason to avoid XSLT. A similar question was asked in a comment on this site:

comment posted xml splitter

Dita Ciulacu 01-Jul-2009

I am searching for a xml splitter to generate the file name using values from a child field. I can't make [your script] work for my specific file name. Is there a way to have the file name this way: xmlOutput.Open( "test" + "_" + [Child value from REFERRAL_ID] + "_"+ nFileCount + ".xml", MDF_WRITEFILE ); My XML is:

<REFERRAL_DISCHARGE>
  <FILE_VERSION>1.0</FILE_VERSION>
  <REFERRAL_ID>9999</REFERRAL_ID>
  <ORGANISATION_ID>A12345-6</ORGANISATION_ID>
  <ORGANISATION_TYPE>100</ORGANISATION_TYPE>
  <EXTRACT_FROM_DATE_TIME>2009-06-01T00:00:00</EXTRACT_FROM_DATE_TIME>
  <EXTRACTED_DATE_TIME>2009-06-30T00:30:00</EXTRACTED_DATE_TIME>
  <TEAM_CODE>1111</TEAM_CODE>
  <EVENT_HCU_ID>AAA1234</EVENT_HCU_ID>
  <SEX>M</SEX>
  <DATE_OF_BIRTH>1900-05-05</DATE_OF_BIRTH>
  <REFERRAL_FROM>UN</REFERRAL_FROM>
  <START_DATE_TIME>2008-12-24T00:00:00</START_DATE_TIME>
</REFERRAL_DISCHARGE>

The parent is REFERRAL_DISCHARGE. I need the file name exactly how you have it plus the individual value from REFERRAL_ID to make it easy to link to the data included. We are a not-for-profit organization and we have to report to the Ministry of Health and our data is to be packed as individual XML files. We are not dealing with huge files (this one is only 316kb).

Since the input file is well under 10MB, it makes sense to Load it all at once rather than use the Open method in read mode. It is also easier to pick out the data for naming the output file from an in-memory XML document than to work within the restrictions of forward-only file read mode. Here's the script to divide the XML into files with one REFERRAL_DISCHARGE each:

split()
{
  CMarkup xmlInput;
  xmlInput.Load( "split.xml" );
  int nFileCount = 0;
  while ( xmlInput.FindElem("//REFERRAL_DISCHARGE") )
  {
    ++nFileCount;
    xmlInput.FindChildElem( "REFERRAL_ID" );
    str sID = xmlInput.GetChildData();
    str sFilename = "test" + "_" + sID + "_"+ nFileCount + ".xml";
    WriteTextFile( sFilename, xmlInput.GetSubDoc() );
  }
  return nFileCount;
}

In this simple REFERRAL_DISCHARGE subdocument, the REFERRAL_ID is a child element, so we can find it with the FindChildElem method while keeping the main position at the REFERRAL_DISCHARGE element. That lets us still retrieve the whole REFERRAL_DISCHARGE subdocument with GetSubDoc after building the filename.
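
The same idea of reading a child value while still holding the parent can be sketched in Python with xml.etree.ElementTree. This is an illustration under assumptions, not the CMarkup script: split_referrals is a made-up name, the EXPORT wrapper element is a guess because the comment does not show the root of the file, and the second record is invented sample data.

```python
import xml.etree.ElementTree as ET


def split_referrals(xml_text, prefix="test"):
    """Return {filename: subdocument text}, one per REFERRAL_DISCHARGE (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    files = {}
    count = 0
    for referral in root.iter("REFERRAL_DISCHARGE"):
        count += 1
        # A plain tag name searches direct children only, much like FindChildElem
        referral_id = referral.findtext("REFERRAL_ID", default="")
        name = f"{prefix}_{referral_id}_{count}.xml"
        referral.tail = None  # drop whitespace that followed the element
        # The parent element is still in hand, so the whole subdocument can be written
        files[name] = ET.tostring(referral, encoding="unicode")
    return files


# EXPORT is an assumed wrapper element; the second record is invented
sample = """<EXPORT>
<REFERRAL_DISCHARGE>
  <FILE_VERSION>1.0</FILE_VERSION>
  <REFERRAL_ID>9999</REFERRAL_ID>
</REFERRAL_DISCHARGE>
<REFERRAL_DISCHARGE>
  <FILE_VERSION>1.0</FILE_VERSION>
  <REFERRAL_ID>8888</REFERRAL_ID>
</REFERRAL_DISCHARGE>
</EXPORT>"""

files = split_referrals(sample)
```

With this sample the dictionary keys are test_9999_1.xml and test_8888_2.xml, matching the naming pattern asked for in the comment.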

If the source is a large XML file, and loading it all into memory is not feasible, it is still pretty simple:

split()
{
  CMarkup xmlInput, xmlReferralDischarge;
  xmlInput.Open( "big_split.xml", MDF_READFILE );
  int nFileCount = 0;
  while ( xmlInput.FindElem("//REFERRAL_DISCHARGE") )
  {
    ++nFileCount;
    xmlReferralDischarge.SetDoc( xmlInput.GetSubDoc() );
    xmlReferralDischarge.FindChildElem( "REFERRAL_ID" );
    str sID = xmlReferralDischarge.GetChildData();
    str sFilename = "test" + "_" + sID + "_"+ nFileCount + ".xml";
    xmlReferralDischarge.Save( sFilename );
  }
  xmlInput.Close();
  return nFileCount;
}

Since each individual REFERRAL_DISCHARGE subdocument is small, we populate xmlReferralDischarge in-memory using SetDoc, grab the ID out of it, and then use the Save method to write the file.
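
For comparison, Python's standard library has a rough counterpart to forward-only reading in xml.etree.ElementTree.iterparse. The sketch below is an analogue under assumptions, not CMarkup file mode: split_stream is a made-up name, the EXPORT wrapper is assumed, and the second record is invented. Each record is fully parsed on its own, the ID is read from it, and then the record is cleared before moving on.

```python
import io
import xml.etree.ElementTree as ET


def split_stream(source, prefix="test"):
    """Yield (filename, subdocument text) pairs while reading source incrementally."""
    count = 0
    # "end" events fire once an element and all of its children have been read
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "REFERRAL_DISCHARGE":
            continue
        count += 1
        referral_id = elem.findtext("REFERRAL_ID", default="")
        elem.tail = None
        yield f"{prefix}_{referral_id}_{count}.xml", ET.tostring(elem, encoding="unicode")
        # Drop the record's children and text; the emptied element itself
        # stays attached to its parent, so memory still grows slightly per record
        elem.clear()


# EXPORT is an assumed wrapper element; the second record is invented
sample = b"""<EXPORT>
<REFERRAL_DISCHARGE>
  <REFERRAL_ID>9999</REFERRAL_ID>
  <TEAM_CODE>1111</TEAM_CODE>
</REFERRAL_DISCHARGE>
<REFERRAL_DISCHARGE>
  <REFERRAL_ID>8888</REFERRAL_ID>
  <TEAM_CODE>2222</TEAM_CODE>
</REFERRAL_DISCHARGE>
</EXPORT>"""

pairs = list(split_stream(io.BytesIO(sample)))
```

In real use, source would be a path or an open file, and each pair would be written to disk as it is produced instead of being collected in a list.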

It is also not too hard to put multiple subdocuments into each output file, or to split an extremely large file into output files that are themselves very large. In "Split XML file into smaller pieces" and the video of an XML splitter script, I put multiple subdocuments into each output file and use "file mode" to keep a low memory footprint.
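
To give a feel for the batching idea only, here is a small in-memory Python sketch that groups a fixed number of subdocuments under a new wrapper element per output document. It is not the script from the linked article and does not use file mode; split_in_batches, the per_file parameter, and the wrapper name "root" are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def split_in_batches(xml_text, tag, per_file, wrapper="root"):
    """Return a list of documents, each wrapping up to per_file subdocuments."""
    source_root = ET.fromstring(xml_text)
    items = list(source_root.iter(tag))
    documents = []
    for start in range(0, len(items), per_file):
        batch = ET.Element(wrapper)  # assumed wrapper element for each output file
        batch.extend(items[start:start + per_file])
        documents.append(ET.tostring(batch, encoding="unicode"))
    return documents


sample = """<root>
<npc id="1"><p pid="1"/></npc>
<npc id="2"><p pid="3"/></npc>
<npc id="3"><p pid="4"/></npc>
</root>"""

documents = split_in_batches(sample, "npc", per_file=2)
```

Here three npc elements at two per file give two documents, the first holding npc 1 and 2 and the second holding npc 3.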