Lookup XML Data with CMarkup

CMarkup's core methods make navigating a document easy and efficient. Lookups built on them run fast, and the code stays easy to maintain and extend.

Loop And Compare

What could be easier than using familiar core functions of your XML tool to loop through all the items until finding the one that matches? Here is an example involving the need to ignore case while searching for a value. It is adapted from MSDN article 315719 on MSXML case-insensitive search.

<Domains>
 <DomainName userid="rain5">Uhdomain1.COM</DomainName>
 <DomainName userid="cloud1">Mydomain1.COM</DomainName>
</Domains>

This code loops through the DomainName elements under the root element and does something with the userid when a matching value is found. The beauty of this solution is that the difference between a case-sensitive search and a case-insensitive one is trivial. Someone modifying another person's code would not have to do any research. They could even implement a more complex comparison, such as matching with or without the http prefix, without much difficulty.

xml.ResetPos();
while ( xml.FindChildElem("DomainName") )
  if ( xml.GetChildData().CompareNoCase("mydomain1.com") == 0 )
  {
    DoSomething( xml.GetChildAttrib("userid") );
    break;
  }

Incidentally, these same CMarkup methods work whether xml is a CMarkup or a CMarkupMSXML object (although CMarkupMSXML carries performance tradeoffs because it is a wrapper around MSXML). Note that the example uses an MFC string comparison function; you can use whatever function your programming environment provides, such as stricmp.
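For environments without MFC, the comparison can be written with the standard C++ library alone. The following is a minimal sketch, not part of CMarkup: the helper names ToLowerAscii, NormalizeDomain and SameDomain are hypothetical, and the handling of the http:// prefix is just one assumed way to implement the "more complex comparison" mentioned earlier. It lowercases ASCII characters only.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Lowercase a copy of s (ASCII-oriented, uses the current C locale).
std::string ToLowerAscii(std::string s)
{
    for (char& c : s)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

// Hypothetical helper: lowercase the name and drop a leading "http://".
std::string NormalizeDomain(const std::string& raw)
{
    std::string s = ToLowerAscii(raw);
    const std::string prefix = "http://";
    if (s.compare(0, prefix.size(), prefix) == 0)
        s.erase(0, prefix.size());
    return s;
}

// True if the two domain strings are equal after normalization.
bool SameDomain(const std::string& a, const std::string& b)
{
    return NormalizeDomain(a) == NormalizeDomain(b);
}

// Usage example with the sample values from the document above.
inline void SameDomainExample()
{
    assert(SameDomain("Mydomain1.COM", "mydomain1.com"));
    assert(SameDomain("HTTP://Mydomain1.COM", "mydomain1.com"));
    assert(!SameDomain("Uhdomain1.COM", "mydomain1.com"));
}
```

Inside the loop, the test would then read something like if ( SameDomain( xml.GetChildData(), "mydomain1.com" ) ), assuming your build of CMarkup returns a string type that converts to std::string.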

Why XPath is a Bad Idea

So, CMarkup makes it easy and efficient to look up something in your document. If instead you try to use XPath (a lookup technology used in some XML tools), it is not easy and likely not efficient either. With MSXML XPath, the complexity begins with the differences in functionality between product versions. With MSXML 3.0 you must turn on XPath and use the translate function. This is VBScript; the long strings inside selectSingleNode are divided into 3 parts only for readability.

oXML.setProperty "SelectionLanguage", "XPath"
set node = oXML.selectSingleNode( "Domains/DomainName[
  translate(.,'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')
   = 'mydomain1.com']" )

With MSXML 4.0 you have the option of using the ms:string-compare function with the 'i' flag once you set the namespace property.

oXML.setProperty "SelectionNamespaces",
  "xmlns:ms='urn:schemas-microsoft-com:xslt'"
set node = oXML.selectSingleNode( "Domains/DomainName[
  ms:string-compare(., 'mydomain1.com', 'en-US', 'i')
   = 0]" )

A potential advantage of XPath in this situation is that it can sometimes achieve high performance by taking advantage of the inner workings of the component, since it goes all the way to the result in one call. But the disadvantages of XPath are many, and the CMarkup solution shown above has none of them:

values must be manually escaped: the example shows the XPath string with the domain name already in it, but in practice you will be building this string out of variables and literals. If the variables might contain special characters such as quotes, you will need to escape them.
runtime parsing: the XPath expression is parsed at runtime, which costs resources and processing time and hides errors in the expression until the code runs.
learning curve and maintenance: XPath represents additional functions and syntax for all developers who maintain the code to learn.
implementation differences: for example, in IE the first node is [0], while the W3C standard stipulates [1] is the first node.
version dependencies: you must be sensitive to the version of the component that will be available to your program such as with the MSXML 3 and 4 difference shown above.
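To make the first point concrete: XPath 1.0 string literals have no escape character, so a value that contains both an apostrophe and a double quote has to be assembled with the concat function. The sketch below shows one common way to do that; the function name XPathLiteral is hypothetical and is not part of MSXML or CMarkup.

```cpp
#include <cassert>
#include <string>

// Build an XPath 1.0 string literal that evaluates to exactly `value`.
std::string XPathLiteral(const std::string& value)
{
    const std::string::size_type npos = std::string::npos;
    if (value.find('\'') == npos)
        return "'" + value + "'";          // no apostrophe: single-quote it
    if (value.find('"') == npos)
        return "\"" + value + "\"";        // no double quote: double-quote it

    // Both kinds present: split at each apostrophe and emit
    // concat('part', "'", 'part', ...).
    std::string out = "concat(";
    bool first = true;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = value.find('\'', start);
        const std::string part =
            value.substr(start, pos == npos ? npos : pos - start);
        if (!part.empty()) {
            if (!first) out += ", ";
            out += "'" + part + "'";
            first = false;
        }
        if (pos == npos)
            break;
        if (!first) out += ", ";
        out += "\"'\"";                    // the apostrophe itself
        first = false;
        start = pos + 1;
    }
    return out + ")";
}

// Usage example: three kinds of input value.
inline void XPathLiteralExample()
{
    assert(XPathLiteral("mydomain1.com") == "'mydomain1.com'");
    assert(XPathLiteral("o'brien.com") == "\"o'brien.com\"");
    assert(XPathLiteral("a'b\"c") == "concat('a', \"'\", 'b\"c')");
}
```

The result is then spliced into the expression, for example "Domains/DomainName[. = " + XPathLiteral(domain) + "]". The host language's own string quoting still applies on top of this.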

But perhaps the biggest disadvantage of all with XPath is the additional complexity of going from a simple search to one that ignores case. What might be assumed to be a trivial modification becomes a potential headache. And again, XPath becomes even more difficult and less efficient when you add a complication such as not assuming the domains are normalized to the http:// form, or needing to check uniqueness.

Creating a Lookup Table (Unique Name Map)

CMarkup also allows you to easily build a map of all the domains in the document for quick lookup. Suppose we need to look up domain names quickly. We loop through them once and save their positions.

xml.ResetPos(); // top of doc
xml.FindElem(); // /Domains
xml.IntoElem();
while ( xml.FindElem("DomainName") )
{
  CString strDomain = xml.GetData();
  strDomain.MakeLower();
  xml.SavePos( strDomain );
}

Internally, CMarkup uses the string name as the key to a hash map so it is a very quick lookup (SavePos/RestorePos/SetMapSize support multiple logical lookup tables per document). Then whenever we need to look up the domain name and do something with the userid, just:

strDomain.MakeLower();
if ( xml.RestorePos(strDomain) )
  DoSomething( xml.GetAttrib("userid") );
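The idea behind the named positions can be illustrated outside CMarkup with a standard hash map. The sketch below is only an analogy under stated assumptions, not CMarkup's internal implementation: DomainRecord, LowerKey, BuildDomainIndex and FindDomain are hypothetical names, and a vector of records stands in for the parsed document.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for one <DomainName userid="..."> element (illustrative type).
struct DomainRecord
{
    std::string name;
    std::string userid;
};

// Lowercase a copy of s so keys compare without regard to ASCII case.
std::string LowerKey(std::string s)
{
    for (char& c : s)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

// One pass over the records: map each lowercased name to its position.
// A later duplicate name replaces the earlier entry.
std::unordered_map<std::string, std::size_t>
BuildDomainIndex(const std::vector<DomainRecord>& records)
{
    std::unordered_map<std::string, std::size_t> index;
    index.reserve(records.size());
    for (std::size_t i = 0; i < records.size(); ++i)
        index[LowerKey(records[i].name)] = i;
    return index;
}

// Returns the matching record, or nullptr when the name is unknown.
const DomainRecord* FindDomain(
    const std::vector<DomainRecord>& records,
    const std::unordered_map<std::string, std::size_t>& index,
    const std::string& name)
{
    auto it = index.find(LowerKey(name));
    return it == index.end() ? nullptr : &records[it->second];
}

// Usage example with the sample data from this article.
inline void DomainIndexExample()
{
    std::vector<DomainRecord> records = {
        { "Uhdomain1.COM", "rain5" },
        { "Mydomain1.COM", "cloud1" }
    };
    auto index = BuildDomainIndex(records);
    const DomainRecord* hit = FindDomain(records, index, "MYDOMAIN1.com");
    assert(hit != nullptr && hit->userid == "cloud1");
    assert(FindDomain(records, index, "nowhere.org") == nullptr);
}
```

As in the CMarkup version, the key is normalized the same way when the table is built and when it is queried, which is what makes the lookup case-insensitive.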

Building A Unique List

Another application of unique named positions in CMarkup is compiling a count of unique words. For example, a customer database has a country element telling where each customer is located. The following code loops through the customer XML database and generates a small document listing countries and counts. This example uses the anywhere path //Country, which is a feature of the developer version of CMarkup (see Paths In CMarkup), but it can easily be replaced with plain navigation as used in the above examples, depending on the format of the XML customer database xmlCustomerDB.

CMarkup xmlCountries;
xmlCustomerDB.ResetPos();
while ( xmlCustomerDB.FindElem("//Country") )
{
  CString csCountry = xmlCustomerDB.GetData();
  if ( xmlCountries.RestorePos(csCountry) )
  {
    // Increment count
    xmlCountries.SetAttrib( "n", atoi(xmlCountries.GetAttrib("n"))+1 );
  }
  else
  {
    // Add country to list
    xmlCountries.AddElem( "C", csCountry );
    xmlCountries.SetAttrib( "n", 1 );
    xmlCountries.SavePos( csCountry );
  }
}

<C n="32">United States</C>
<C n="12">Canada</C>
<C n="14">United Kingdom</C>
<C n="2">China</C>
<C n="8">Japan</C>
<C n="1">Kenya</C>
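The same increment-or-add logic can be sketched with standard containers, which may help when checking the CMarkup version against expected counts. This is an illustrative analogy only: CountUnique is a hypothetical name, and a plain list of strings stands in for the Country values read from the document.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Count occurrences of each value, keeping the order of first appearance.
std::vector<std::pair<std::string, int>>
CountUnique(const std::vector<std::string>& values)
{
    std::vector<std::pair<std::string, int>> counts;
    std::unordered_map<std::string, std::size_t> slot;  // value -> index in counts
    for (const std::string& v : values) {
        auto it = slot.find(v);
        if (it != slot.end()) {
            ++counts[it->second].second;      // seen before: increment count
        } else {
            slot.emplace(v, counts.size());   // first sighting: remember its slot
            counts.emplace_back(v, 1);
        }
    }
    return counts;
}

// Usage example: three distinct countries in first-seen order.
inline void CountUniqueExample()
{
    auto counts = CountUnique({ "Canada", "Japan", "Canada", "Kenya", "Canada" });
    assert(counts.size() == 3);
    assert(counts[0].first == "Canada" && counts[0].second == 3);
    assert(counts[1].first == "Japan" && counts[1].second == 1);
    assert(counts[2].first == "Kenya" && counts[2].second == 1);
}
```

Like the CMarkup code above, this comparison is case-sensitive, so "canada" and "Canada" would be counted separately.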

More On Navigating XML

There are several other articles about getting around in your XML with CMarkup.

comment posted Locate elements fast

M 26-Apr-2007

Let's say I have a large XML file with elements and attributes, and let's say each element has an attribute ID="number" where the number is unique (1 to unlimited). I will save this number as a reference when parsing the XML the first time. Later I want to locate the element quickly by this ID so I can read other data from its parent or child elements. What's the fastest way to do this without rescanning the XML elements and comparing IDs to find the right one?

While the SavePos/RestorePos hash functions used below are available in the Evaluation Version, the path (// and []) and index features referred to are only in CMarkup Developer and the free XML editor FOAL C++ scripting.

There are a number of ways to go about it where you can weigh performance issues. The simplest to code if you have implemented the ID attribute is to use the attribute value predicate to find it (see Paths In CMarkup):

xml.ResetPos();
xml.FindElem( "//*[@ID='5']" );

That will do a depth first traversal internally to find the element. If you are finding them in order, you don't need to ResetPos() before each one and it will be quite fast. However, if you need to go to any one of them at any time (random access), and you do this kind of lookup several times, consider the SavePos and GetElemIndex options described below.

To utilize a hash table lookup for quicker random access, save each position with the string ID as you are creating it.

xml.SetAttrib( "ID", "5" );
xml.SavePos( "5" );

Then later you can go directly back to that position:

xml.RestorePos( "5" );

See SavePos and RestorePos. If you have hundreds of IDs the saved position performance will degrade but still be better than the attribute value predicate for random access. These saved positions are lost when the document is reparsed using Load or SetDoc, but you can set them with a quick scan through the document (using the anywhere path and attribute predicate described in Paths In CMarkup):

xml.ResetPos();
while ( xml.FindElem("//*[@ID]") )
  xml.SavePos( xml.GetAttrib("ID") );

Ultimately, you can control implementation and performance using indexes (see ElemIndex Navigation). Since your IDs are a simple sequence from 1 to n, you can just use an integer array or vector to store the indexes.

xml.SetAttrib( "ID", i );
a[i] = xml.GetElemIndex();
++i;

and later return to ID i as follows:

xml.GotoElemIndex( a[i] );

These indexes remain valid even as the document is modified, until it is reparsed. So you would need to rebuild this array every time the document is parsed.

Building this array of indexes is actually a very quick process, roughly the same order of magnitude as the time to parse the document. Use a "grow by" mechanism or size estimation to reserve array space ahead of time and avoid reallocation churn. This quick once-through every time you parse will give you constant-time random access to your large document. If every element you need to look up has an ID attribute, something like this will build the array:

CArray<int,int> a; // MFC template array, declared in afxtempl.h
xml.ResetPos();
while ( xml.FindElem("//*[@ID]") )
{
  int i = atoi(xml.GetAttrib("ID"));
  a.SetAtGrow( i, xml.GetElemIndex() );
}

You may need to scan all the ID values every time you re-parse to know what the next available ID is, anyway.
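Outside MFC, the same table can be kept in a std::vector. The sketch below is one possible shape for it, under stated assumptions: the class name IdIndexTable and the sentinel kNoIndex are hypothetical (kNoIndex is not a CMarkup constant), IDs are assumed to start at 1, and the stored values are whatever GetElemIndex returned.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int kNoIndex = -1;  // sentinel chosen for this sketch only

// Maps small positive integer IDs to saved element indexes.
class IdIndexTable
{
public:
    // Reserve space up front when the number of IDs can be estimated.
    explicit IdIndexTable(std::size_t expectedCount = 0)
    {
        slots_.reserve(expectedCount + 1);  // +1 because IDs start at 1
    }

    // Store elemIndex for id, growing the table as needed.
    // Returns false for IDs below 1.
    bool Set(int id, int elemIndex)
    {
        if (id < 1)
            return false;
        const std::size_t pos = static_cast<std::size_t>(id);
        if (pos >= slots_.size())
            slots_.resize(pos + 1, kNoIndex);
        slots_[pos] = elemIndex;
        return true;
    }

    // Returns the saved index, or kNoIndex if the ID was never stored.
    int Get(int id) const
    {
        if (id < 1 || static_cast<std::size_t>(id) >= slots_.size())
            return kNoIndex;
        return slots_[static_cast<std::size_t>(id)];
    }

    // One past the highest ID stored so far (1 for an empty table).
    int NextFreeId() const
    {
        return slots_.empty() ? 1 : static_cast<int>(slots_.size());
    }

private:
    std::vector<int> slots_;
};

// Usage example: sparse IDs and the next available ID.
inline void IdIndexTableExample()
{
    IdIndexTable table(3);
    table.Set(1, 10);
    table.Set(3, 30);
    assert(table.Get(1) == 10);
    assert(table.Get(2) == kNoIndex);
    assert(table.Get(3) == 30);
    assert(table.NextFreeId() == 4);
}
```

In the scan loop, the call would be along the lines of table.Set( atoi(xml.GetAttrib("ID")), xml.GetElemIndex() ), and NextFreeId covers the case where the next available ID is needed after a reparse.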

comment posted optimized search feature

Davide 05-Dec-2007

Something like FindElem( "Name", "Filippo") or (dream) FindRegexData( "Name", "Fil*") to retrieve the node containing the data is needed in daily use.

<Data>
    <Record>
        <ID>1</ID>
        <Name>Davide</Name>
    </Record>
    <Record>
        <ID>2</ID>
        <Name>Filippo</Name>
    </Record>
    ......
</Data>

Since I've gone with an XPath subset in the FindElem and FindGetData methods, I'll likely stick with that (i.e. the element value predicate "Record[Name=Filippo]"), although using a separate argument for the value as you suggested does make for quicker code, since you don't need to escape quotes in the value. Comparison and substring functions have been avoided because they lead down a never-ending path, such as the semi-procedural functions in XPath.

However, taking a Regex approach is an interesting point I hadn't considered. I don't recall ever needing to do something like find "Fil*"; it is amazing the breadth of different needs of different developers/projects. One issue with supporting "Fil*" is that you need an escape code for the asterisk in case you actually want to compare with the asterisk character and that leads to an extension/incompatibility with other path specifications.

CMarkup always errs on the side of simplicity, letting you perform the full range of comparison options in your natural procedural language (as mentioned above). For example, you would search for "Fil*" as follows:

xml.ResetPos();
while ( xml.FindElem("//Name") )
{
  if ( strncmp(xml.GetData(),"Fil",3) == 0 )
  {
    // process match for "Fil*"
  }
}
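If a general wildcard is needed rather than a fixed prefix, the comparison can still live in your own code. The following sketch is one assumed design, not a CMarkup feature: WildcardMatch is a hypothetical function in which * matches any run of characters and a backslash makes the next character literal, which addresses the escape issue raised above. It is case-sensitive and, being a simple recursive matcher, can be slow on patterns with many asterisks.

```cpp
#include <cassert>

// Returns true if text matches pat.
// '*' matches any run of characters (including none);
// '\' makes the following character literal, so "\*" matches an asterisk.
bool WildcardMatch(const char* text, const char* pat)
{
    while (*pat) {
        if (*pat == '*') {
            while (*pat == '*')       // collapse consecutive asterisks
                ++pat;
            if (!*pat)
                return true;          // trailing '*' matches the rest
            for (; *text; ++text)     // try each position for the remainder
                if (WildcardMatch(text, pat))
                    return true;
            return false;
        }
        if (*pat == '\\' && pat[1])   // escaped character: compare it literally
            ++pat;
        if (*text != *pat)
            return false;
        ++text;
        ++pat;
    }
    return *text == '\0';             // pattern used up: text must be too
}

// Usage example with the names from the sample document.
inline void WildcardMatchExample()
{
    assert(WildcardMatch("Filippo", "Fil*"));
    assert(!WildcardMatch("Davide", "Fil*"));
    assert(WildcardMatch("Fil*", "Fil\\*"));      // escaped: literal asterisk
    assert(!WildcardMatch("Filippo", "Fil\\*"));
}
```

In the loop above, the strncmp test would be replaced by a call such as WildcardMatch( xml.GetData(), "Fil*" ), assuming the returned string type converts to const char* in your build (for example via c_str() when the data methods return std::string).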