Counting XML tag names and values with foal

I was browsing an XML file and wondered what all the different elements were. I could see the most common element tag names, but I also wanted to know what some of the rare elements were. To demonstrate how you can count element names in your own XML document, I'll use play.xml, the example XML file of a Shakespeare play that is in the CMarkup download.

Rather than write the script from scratch, I asked myself what similar script the freeware firstobject XML editor could generate for me that I could then modify to do what I want. This led to another interesting idea: if I counted unique speakers, I could see who did the most talking in the play. Then I could modify the code for counting unique speakers to count unique tag names.

List unique values for a certain element name

So with the file open in the editor I first right-clicked on the SPEAKER element, and went to Generate Program -> Gather Unique Values. This generated the following foal program:

str UniqueSPEAKER_Generated( CMarkup mDocToQuery )
{
  // Unique list of SPEAKER
  CMarkup mList;
  mDocToQuery.ResetPos();
  while ( mDocToQuery.FindElem("//SPEAKER") )
  {
    str sVal = mDocToQuery.GetData();
    if ( mList.RestorePos(sVal) )
    {
      // Increment count
      mList.SetAttrib( "n", StrToInt(mList.GetAttrib("n"))+1 );
    }
    else
    {
      // First time for this unique value
      mList.AddElem( "I", sVal );
      mList.SetAttrib( "n", 1 );
      mList.SavePos( sVal );
    }
  }
  return mList;
}

I pressed F9 to run it, selected play.xml from the drop-down list, and saw the following results:

<I n="2">PHILO</I>
<I n="204">CLEOPATRA</I>
<I n="8">Clown</I>
<I n="2">Guard</I>
<I n="3">SELEUCUS</I>
<I n="10">CANIDIUS</I>
<I n="1">Attendants</I>
<I n="41">POMPEY</I>
<I n="1">VARRIUS</I>
<I n="35">MENAS</I>
<I n="4">VENTIDIUS</I>
<I n="3">SILIUS</I>
<I n="4">First Servant</I>
<I n="3">Second Servant</I>
<I n="2">MENECRATES</I>
<I n="7">MARDIAN</I>
<I n="42">Messenger</I>
<I n="204">MARK ANTONY</I>
<I n="4">Second Guard</I>
<I n="1">Third Guard</I>
<I n="5">DERCETAS</I>
<I n="7">DIOMEDES</I>
<I n="11">First Guard</I>
<I n="14">First Soldier</I>
<I n="11">Second Soldier</I>
<I n="10">Third Soldier</I>
<I n="3">Fourth Soldier</I>
<I n="9">All</I>
<I n="13">Soldier</I>
<I n="1">Captain</I>
<I n="13">OCTAVIA</I>
<I n="98">OCTAVIUS CAESAR</I>
<I n="10">PROCULEIUS</I>
<I n="1">GALLUS</I>
<I n="2">Egyptian</I>
<I n="12">THYREUS</I>
<I n="5">EUPHRONIUS</I>
<I n="23">DOLABELLA</I>
<I n="1">TAURUS</I>
<I n="29">AGRIPPA</I>
<I n="32">LEPIDUS</I>
<I n="16">MECAENAS</I>
<I n="2">Second Messenger</I>
<I n="3">First Attendant</I>
<I n="1">Second Attendant</I>
<I n="2">DEMETRIUS</I>
<I n="63">CHARMIAN</I>
<I n="18">IRAS</I>
<I n="15">ALEXAS</I>
<I n="113">DOMITIUS ENOBARBUS</I>
<I n="12">SCARUS</I>
<I n="27">EROS</I>
<I n="14">Soothsayer</I>
<I n="2">Attendant</I>

A quick scan of the list shows that Mark Antony and Cleopatra each appear as the speaker exactly 204 times. It is interesting to see such a stunning example of equality from 400 years ago, when "Antony and Cleopatra" was written!
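The list above comes out in no particular count order, so finding the top speakers means scanning it by eye. As a minimal sketch in standard C++ (not foal, and not part of the generated program), the counts could be ordered from most to least frequent. The function name SortByCount and the use of std::map to hold the counts are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not part of CMarkup or the generated foal program.
// Takes value -> count pairs and returns them ordered by descending count.
// Values with equal counts keep the alphabetical order they had in the map.
std::vector<std::pair<std::string, int>> SortByCount(
    const std::map<std::string, int>& counts)
{
    std::vector<std::pair<std::string, int>> sorted(counts.begin(), counts.end());
    std::stable_sort(sorted.begin(), sorted.end(),
        [](const std::pair<std::string, int>& a,
           const std::pair<std::string, int>& b)
        {
            return a.second > b.second; // higher counts first
        });
    return sorted;
}
```

Fed the counts shown above, this would put CLEOPATRA and MARK ANTONY (204 each) at the front, followed by DOMITIUS ENOBARBUS (113) and OCTAVIUS CAESAR (98).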

List element names in any XML document

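The following script can be used as-is to list element names in any XML document. I modified the generated script above to gather unique tag names by changing four things: the function name, the opening comment, the FindElem path (from "//SPEAKER" to "//*" so that it matches every element), and the call that reads the value (from GetData() to GetTagName()):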

str UniqueTagNames( CMarkup mDocToQuery )
{
  // Unique list of tag names
  CMarkup mList;
  mDocToQuery.ResetPos();
  while ( mDocToQuery.FindElem("//*") )
  {
    str sVal = mDocToQuery.GetTagName();
    if ( mList.RestorePos(sVal) )
    {
      // Increment count
      mList.SetAttrib( "n", StrToInt(mList.GetAttrib("n"))+1 );
    }
    else
    {
      // First time for this unique value
      mList.AddElem( "I", sVal );
      mList.SetAttrib( "n", 1 );
      mList.SavePos( sVal );
    }
  }
  return mList;
}

You can go into the firstobject XML editor, from the File menu select New Program, paste the UniqueTagNames function in, and press F9 to run it on your own XML (or HTML or other markup) document. You will be prompted with a list of all files open in your editor. Select the file you want to analyze and it will display the list of element names with their counts.

For play.xml, it outputs the following result:

<I n="1">PLAY</I>
<I n="49">TITLE</I>
<I n="281">STAGEDIR</I>
<I n="1">SUBHEAD</I>
<I n="1174">SPEECH</I>
<I n="1179">SPEAKER</I>
<I n="3560">LINE</I>
<I n="42">SCENE</I>
<I n="6">PGROUP</I>
<I n="35">PERSONA</I>
<I n="1">SCNDESCR</I>
<I n="1">PLAYSUBT</I>
<I n="5">ACT</I>
<I n="6">GRPDESCR</I>
<I n="1">PERSONAE</I>

This gives me a survey of all the elements in any XML document and how often they are used, which provides interesting insights into the makeup and content of the XML.

foal programs use C++ syntax, so the above programs can also be easily adapted for your own C++ programs that use CMarkup.
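To illustrate the counting idea outside the editor, here is a minimal sketch in standard C++. It deliberately does not use CMarkup: a std::map lookup stands in for the SavePos/RestorePos bookkeeping, and a naive scanner stands in for FindElem("//*"). The function name CountTagNames is hypothetical, and the scanner is only an assumption-laden stand-in for a real parser: it does not handle CDATA sections, and it would miscount a "<" that appears inside a comment or an attribute value.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper, not part of CMarkup: counts start tags by name.
// Naive scan for illustration only; see the limitations noted above.
std::map<std::string, int> CountTagNames(const std::string& xml)
{
    std::map<std::string, int> counts;
    std::size_t pos = 0;
    while ((pos = xml.find('<', pos)) != std::string::npos)
    {
        ++pos; // step past '<'
        if (pos >= xml.size())
            break;
        const char c = xml[pos];
        if (c == '/' || c == '?' || c == '!')
            continue; // end tag, processing instruction, comment or DOCTYPE

        // The tag name runs until whitespace, '/' or '>'
        std::size_t end = pos;
        while (end < xml.size() && xml[end] != '>' && xml[end] != '/' &&
               !std::isspace(static_cast<unsigned char>(xml[end])))
        {
            ++end;
        }
        if (end > pos)
            ++counts[xml.substr(pos, end - pos)]; // new names start at 0
        pos = end;
    }
    return counts;
}
```

Calling CountTagNames on the text of a document returns a map from tag name to count, which plays the same role as the I elements and their n attributes in the foal output. In a real C++ program using CMarkup, you would keep CMarkup's own navigation and replace only the scanner.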