| ||||||||
CMarkup Indexing ExplainedOver 3 years ago, I wrote an explanation of CMarkup indexing which is now being replaced by this article. Release 7.0 is an overhaul of the indexing system used before this point, but not a fundamental change. For each element, there are still eight 32-bit integers used to store the location and tree links. These integers are now used more efficiently and more intuitively. CMarkup is fast (see CMarkup Performance Tests) and efficient due to its minimalist strategy which is to leave the data in the document string and maintain a hierarchical arrangement of indexes mapping out the document. So really all you have is a document string, and integer indexes mapping out the elements.
As mentioned above, for each element in the document there is a structure consisting of 8 integers. This structure is 32 bytes in size. As a generalization, CMarkup usually allocates roughly the same amount of memory for the indexes as is used for the document (depending on whether there are more or less than 1 element per 32 bytes in the document). A sample 250KB document called play.xml has 6343 elements and requires about 200KB of memory for the indexes to map those elements. int nStart; int nLength; int nTagLengths; int nFlags; int iElemParent; int iElemChild; int iElemNext; int iElemPrev; The first 3 integers tell where in the document the element starts, its length and start and end tag lengths. Since
<topic id="5">triumph</topic>
The The four If you are familiar with the way tree nodes are linked together like this, you will understand the impact of this mapping. It is designed for efficient navigation up and down the levels of the tree, and looping through siblings, but random access to the nth child element requires looping through all of the siblings before it. At any rate, navigating around in the document no longer requires parsing once the indexes have been created. When a document is parsed this information is gathered. Then as the document is modified, this information is modified accordingly. For example, if an attribute is added, the start tag length changes, the element length changes, and all subsequent and containing elements are adjusted as well. If an element is removed, offsets are changed and the previous element's
CMarkup keeps the entire document in a single contiguous string object. In addition to that an array of indexes is built but the index array is allocated in single megabyte chunks so it is not a concern although it does use approximately the size of the document again. If the document is modified, the string will be reallocated which means a moment when both the old and the new string buffer are both allocated in memory and the new buffer is 1.5 times the size of the old. For big documents it depends on whether a contiguous area of memory is available, and if it is modified, the two contiguous areas are available. There is no rule, but if 5 times the size of the document is available in physical memory, you should be okay. One customer that gets into the 400MB+ range, uses machines with 2GB memory without any other memory hungry applications. So here is a rough idea with your 300MB document: read-only: added to: modified document ends up with: I realize it is not ideal, but customers dealing with very large documents generally just ensure that there is plenty of memory available; there is no graceful handling of "out of memory."
Yes, a pre-allocate and/or "Grow By" value will do the trick, and it is easy to implement. The good thing about CMarkup is you have the source so you have full control in extreme situations like this. But I would like to add this capability as soon as the best design becomes apparent. It is a great suggestion. The least impact design would be to declare a flag called something like But wouldn't you say that for a lot of applications when you are going to modify a huge file, 50% extra is too much? The default grow-by factor is 50% but that is chosen primarily for the case when you are generating a document from scratch. If you open up a 300MB file, it might be adequate instead of 450MB (50% extra) to just allocate 330MB (10% extra) which would allow for significant modifications without a reallocation. I think 10% would be good in most cases, and a huge enhancement (thanks!). What do you think? We could add an argument to | ||||||||||
|
Posted July 17, 2004. Question or comment about this article? ©Copyright 2008 First Objective Software, Inc. All rights reserved. |