CMarkup Indexing Explained

CMarkup release 7.0 is an overhaul of the indexing system used before, but not a fundamental change. For each element, there are still eight 32-bit integers used to store the location and tree links. These integers are now used more efficiently and more intuitively.

CMarkup is fast and efficient due to its minimalist strategy which is to leave the data in the document string and maintain a hierarchical arrangement of indexes mapping out the document. So really all you have is a document string, and integer indexes mapping out the elements.

As mentioned above, for each element in the document there is a structure consisting of 8 integers. This structure is 32 bytes in size. As a generalization, CMarkup usually allocates roughly the same amount of memory for the indexes as is used for the document (depending on whether there are more or less than 1 element per 32 bytes in the document). A sample 250KB document called play.xml has 6343 elements and requires about 200KB of memory for the indexes to map those elements.

int nStart;
int nLength;
int nTagLengths;
int nFlags;
int iElemParent;
int iElemChild;
int iElemNext;
int iElemPrev;

The first 3 integers tell where in the document the element starts, its length and start and end tag lengths. Since nStart is a 32-bit integer, the maximum document size is about 2GB, and likewise the maximum element length is the same. The nTagLengths integer is broken down into 22 bits (4194304 limit) for the starting tag length (the start tag can contain attributes), and 10 bits (1024 limit) for the end tag. In the following element, the length of the start tag is 14, the end tag is 8, and the length of the whole element is 29:

<topic id="5">triumph</topic>

The nFlags integer actually stores the level or depth of the element in the lower 16 bits (65536 limit), and specialized flags in the upper 16 bits. The level is 0 for the root element, 1 for the children of the root element and so forth. The flags tell whether the element is the first or last sibling, an empty element, or if the element has been deleted (and this index structure can be reused).

The four iElem integers link to the surrounding elements. iElemParent points to the parent element, iElemChild points to the first child element, and iElemNext points to the next element. When an element is the last sibling, iElemNext is 0. When an element is not the first sibling, iElemPrev points to the sibling before it. If an element is the first sibling, iElemPrev points to the last sibling. So, to get a parent's last child, follow the iElemChild link and from there follow the iElemPrev link.

If you are familiar with the way tree nodes are linked together like this, you will understand the impact of this mapping. It is designed for efficient navigation up and down the levels of the tree, and looping through siblings, but random access to the nth child element requires looping through all of the siblings before it. At any rate, navigating around in the document no longer requires parsing once the indexes have been created.

When a document is parsed this information is gathered. Then as the document is modified, this information is modified accordingly. For example, if an attribute is added, the start tag length changes, the element length changes, and all subsequent and containing elements are adjusted as well. If an element is removed, offsets are changed and the previous element's iElemNext is changed to bypass the deleted element and so forth.

 

comment posted Memory limit question

Mark 28-Mar-2007

I was wondering if you could tell me if there is a size limit that a Markup (STL based) XML doc can be? We seem to be having a problem when the doc gets slightly larger than 300 megabytes.

CMarkup keeps the entire document in a single contiguous string object. In addition to that an array of indexes is built but the index array is allocated in single megabyte chunks so it is not a concern although it does use approximately the size of the document again. If the document is modified, the string will be reallocated which means a moment when both the old and the new string buffer are both allocated in memory and the new buffer is 1.5 times the size of the old. For big documents it depends on whether a contiguous area of memory is available, and if it is modified, the two contiguous areas are available. There is no rule, but if 5 times the size of the document is available in physical memory, you should be okay. One customer that gets into the 400MB+ range, uses machines with 2GB memory without any other memory hungry applications.

So here is a rough idea with your 300MB document:

read-only:
300MB contiguous document string
300MB 300 1MB index blocks
-------
600MB total (but note 300MB contiguous requirement)

added to:
300MB contiguous document string
450MB contiguous document string to replace 300MB one
300MB 300 1MB index blocks
-------
1050MB total (but note 300MB,450MB contiguous requirement)

modified document ends up with:
450MB contiguous document string
300MB 300 1MB index blocks
-------
750MB total (but note 450MB contiguous requirement)

I realize it is not ideal, but customers dealing with very large documents generally just ensure that there is plenty of memory available; there is no graceful handling of "out of memory."

 

comment posted Memory limit workaround

Harold E Austin Jr 15-Jul-2007

One minor limitation with CMarkup seems to be its handling of modified very large input files. For the example given, a modified 300MB document temporarily requires 1050MB of memory (with contiguous blocks of 300MB and 450MB required simultaneously for duplication). What about adding an optional flag to ReadTextFile() and Load() to indicate the explicit desire to modify the document string, so that the full 450MB contiguous chunk gets allocated up front? Or even an expected allocation GrowthFactor>=1, which is currently assumed to be 1.5? -hea

Yes, a pre-allocate and/or "Grow By" value will do the trick, and it is easy to implement. The good thing about CMarkup is you have the source so you have full control in extreme situations like this.

Update September 27, 2008: With CMarkup release 10.0, ReadTextFile allocates an extra 1% for the document string so that modifications slightly increasing the document size do not require reallocations.