XML formats, languages, and vocabularies tend to evolve. Versioning is an extremely important consideration in the design of an XML format.
The very first version of an XML format invariably fails to foresee even all of its immediate uses! That doesn't matter during initial development, because none of the products using the XML have been released yet and you can keep revising it until you have what we'll dub "version 1.0". It should not matter if you need to change it after version 1.0 either, provided you follow the advice here.
Why is versioning so critical?
XML is used for things like configuration, data interchange, web service APIs, persistence, and documentation. The agents that produce and consume this XML are programs, stored procedures, web services, and sometimes even people using an editor. Once you release software or a specification for version 1.0 of your XML format, you have created an interdependency between the XML and the agents. Even if there is only one program involved, you may be faced with the issue of XML versioning as you build new releases of the program.
Oftentimes developers will say their XML format is frozen... that it will not change from a certain point forward. It is true that you don't want to be changing the format without good reason, but this freeze will only last until a good enough reason comes along to unfreeze it. Otherwise it might simply be obsoleted by its rigidity.
Rather than trying to know everything beforehand so you can freeze the format, follow a few simple guidelines in this article so that extending your format to completely unforeseen uses will be as smooth and painless as possible.
One of the chief benefits of XML is the ability to add additional kinds of information to an XML format without necessarily upsetting existing agents or documents.
Say version 1.0 contains height and weight elements for each container.
<shipyard> <container> <height>12</height> <weight>6745</weight> </container> </shipyard>
Then in version 1.1 you add a width element.
<shipyard> <container> <height>12</height> <width>16</width> <weight>6745</weight> </container> </shipyard>
Software can often be designed to work with both versions of documents so that it will remain backward compatible with archived documents or legacy agents that generate the old version, and so that various legacy consumers of those documents will remain forward compatible with newer document versions.
When you add new information to an XML format, you can usually make it optional in the sense that if it is left out then there is an implied or default behavior or value.
In the above example the width element was added. In order to seamlessly use old documents, new agents can be designed to handle an unknown width. On reports, the width field can be left blank for records that do not supply the width. Width statistics reports can be designed to provide a footnote indicating the number of records for which width data was not available. These steps will make your solution more robust and versatile.
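As a rough illustration of treating width as optional, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The function names read_containers and width_report are made up for this example, and reporting "unknown" as None is just one possible convention.

```python
import xml.etree.ElementTree as ET

V1_0 = "<shipyard><container><height>12</height><weight>6745</weight></container></shipyard>"
V1_1 = (
    "<shipyard><container><height>12</height><width>16</width>"
    "<weight>6745</weight></container></shipyard>"
)


def read_containers(xml_text):
    """Return one dict per container; width is None when the document omits it."""
    containers = []
    for container in ET.fromstring(xml_text).iter("container"):
        width_text = container.findtext("width")  # None if there is no width element
        containers.append({
            "height": int(container.findtext("height")),
            "width": int(width_text) if width_text is not None else None,
            "weight": int(container.findtext("weight")),
        })
    return containers


def width_report(containers):
    """Average the known widths and count the records that lack one (the footnote)."""
    known = [c["width"] for c in containers if c["width"] is not None]
    average = sum(known) / len(known) if known else None
    return {
        "average_width": average,
        "records_without_width": len(containers) - len(known),
    }


if __name__ == "__main__":
    records = read_containers(V1_0) + read_containers(V1_1)
    print(width_report(records))
```

The same reader accepts both document versions, so archived 1.0 documents and new 1.1 documents can be mixed in one report.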
People often think that to move forward you need to migrate all of the old data and upgrade all of the agents involved. But in many circumstances this is not practical. Rather than create a rift between new and old systems and data, you can introduce intelligent and pragmatic solutions that allow you to move forward immediately while still benefiting as older systems and data are gradually upgraded.
When you build code to process XML, don't make assumptions that could cause newer document versions to break your code.
There is one simple rule: select values explicitly by tag name. Don't be tempted during coding to assume there won't be additional siblings in between the ones you are extracting.
If you have 12 database columns, and the original document format has exactly 12 corresponding elements in the correct order, don't put them directly into the database without regard to their tag names. If you really want to do this for performance reasons, check the first row to map elements to columns and then assume the same order from there on, or agree on a version attribute like ver="1.0" to guarantee the content.
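To make the rule concrete, here is a small Python sketch (standard library only) that compares lookup by tag name with the "map the first row, then assume" shortcut. The second container's values, the COLUMNS tuple, and both function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Version 1.1 document: width now sits between height and weight.
# The second container is made-up sample data.
DOC = """<shipyard>
  <container><height>12</height><width>16</width><weight>6745</weight></container>
  <container><height>9</height><width>8</width><weight>3100</weight></container>
</shipyard>"""

# Hypothetical database columns this consumer cares about.
COLUMNS = ("height", "weight")


def rows_by_name(root):
    """Look up every value by tag name; extra siblings are simply ignored."""
    return [
        tuple(int(container.findtext(name)) for name in COLUMNS)
        for container in root.iter("container")
    ]


def rows_by_mapped_position(root):
    """Map tag names to positions once, using the first row, then reuse them."""
    containers = list(root.iter("container"))
    if not containers:
        return []
    first_row_tags = [child.tag for child in containers[0]]
    positions = [first_row_tags.index(name) for name in COLUMNS]
    return [
        tuple(int(container[i].text) for i in positions)
        for container in containers
    ]


if __name__ == "__main__":
    root = ET.fromstring(DOC)
    # A consumer hard-coded to "weight is the second child" would read width here.
    print(root[0][1].tag)
    print(rows_by_name(root))
    print(rows_by_mapped_position(root))
```

The mapped-position variant still relies on every row having the same layout as the first, which is exactly the assumption the text warns about; lookup by name does not.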
RSS formats have a lot of optional elements that are ignored by many RSS readers. And new custom elements can be introduced by anyone without upsetting existing RSS readers.
Make extending your format easy
Just because it is XML (eXtensible Markup Language) doesn't mean adding new information will result in a natural and self-explanatory format. Below are two common-sense guidelines to better prepare you for unexpected expansion in the future (applicable to most common hierarchical uses of XML, but not documentation or mixed content).
1. Use an element rather than an attribute if you can conceive of there being more than one. While attributes are for tiny pieces of descriptive info, the key schematic difference between elements and attributes is that the same element can occur multiple times under a parent, whereas a given attribute name can appear only once on an element (note that I am avoiding any generic debate of element vs. attribute).
For example, your version 1.0 might include a conference room number, and later you have a case where there are 2 or 3 conference rooms (with video conferencing anything is possible!); you'll be glad you used an element instead of an attribute.
Here is the short-sighted format where you can't easily include two rooms:
<meeting room="512"/>
What are you going to do? Add a new attribute like room2?
<meeting room="512" room2="216"/>
Here is a format that would logically allow multiple rooms:
<meeting> <room>512</room> </meeting>
Adding another room would be self-explanatory, even if you have no intention of supporting that capability right now:
<meeting> <room>512</room> <room>216</room> </meeting>
But this is still not ideal. See the next guideline:
2. Only use the data value of elements that are atomic. Atomic means it cannot be divided or expanded into multiple pieces of information. Once you use the content of the element for data your hands are tied; you cannot give it child elements later.
Continuing the above example with the room, the number is not the only data you can imagine being associated with the room, so using an attribute or a sub-element would be better. Here I use an attribute to hold the room number:
<meeting> <room number="512"/> </meeting>
The first option (512 as element data) is bad if room is not an atomic thing. If you ever need to hang additional information off of the room element, it will be problematic.
With the second option you could easily add building number and multiple rooms (in version 1.1), still allowing for whole trees of unforeseen sub-elements to be added inside the room elements. Legacy applications would still find the first room number in new documents (ignoring the additional information), and newer applications would still find something useful from old documents.
<meeting> <room number="512" building="4"/> <room number="216" building="5"/> </meeting>
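The compatibility claim can be checked with a short Python sketch using the standard library. The function names legacy_room_number and all_rooms are invented here to stand in for an old and a new consumer.

```python
import xml.etree.ElementTree as ET

V1_0 = '<meeting><room number="512"/></meeting>'
V1_1 = (
    '<meeting><room number="512" building="4"/>'
    '<room number="216" building="5"/></meeting>'
)


def legacy_room_number(xml_text):
    """Stand-in for a 1.0 consumer: reads the first room's number and nothing else."""
    room = ET.fromstring(xml_text).find("room")
    return room.get("number") if room is not None else None


def all_rooms(xml_text):
    """Stand-in for a 1.1 consumer: every room, with building when it is supplied."""
    return [
        (room.get("number"), room.get("building"))  # building is None in old documents
        for room in ET.fromstring(xml_text).findall("room")
    ]


if __name__ == "__main__":
    print(legacy_room_number(V1_1))  # old code, new document
    print(all_rooms(V1_0))           # new code, old document
    print(all_rooms(V1_1))
```

Neither consumer needs a version check: the old one ignores what it does not know, and the new one treats the missing building as unknown.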
For another example, say you want to have a car element with model name as the data:
<car>Brera</car>
Later you start attaching additional information such as make and year and you can get away with using attributes:
<car make="Alfa Romeo" year="2007">Brera</car>
But then you want to add the previous owner, and then you realize there may be multiple previous owners. Suddenly you wish you had the foresight to make car a parent element in the first place. If you had applied this guideline you would have realized car is not atomic, but model is.
<car> <model>Brera</model> </car>
Or at least you would have avoided using the data of the car element:
<car model="Brera"/>
In either case, additional information would fit without upsetting the original format because model is still in the same place relative to the car element:
<car> <model>Brera</model> <make>Alfa Romeo</make> <year>2007</year> <owner> <date>2007-01-05</date> </owner> </car>
<car model="Brera" make="Alfa Romeo" year="2007"> <owner date="2007-01-05"/> </car>
Keep it simple
Planning for extensions is good, but don't overdo it. Although these illustrations show possible expansions in the formats, you want to be happy with the format even if it never expands. So don't go overboard in preparing for future possibilities.
Using version number attributes in the XML format turns out to be rarely helpful except when needed as a flag to govern behavior as in the performance shortcut mentioned above. A much more useful piece of information would be the version of the agent that generated the document.
Even if you use a format version number, try to design changes and extensions so that the simple existence of the information is the sign that it is there. Newer agents can generate and look for the additional information and act accordingly, while old agents continue as they were. This is more self-explanatory than a version number which requires additional explanation to understand.
In his MSDN treatise "Designing Extensible, Versionable XML Formats", Dare Obasanjo recommends introducing new namespaces for your XML format extensions. This would mean something like:
<shipyard xmlns="http://yourcompany.com/shipyard" xmlns:yce2="http://yourcompany.com/ext2"> <container> <height>12</height> <yce2:width>16</yce2:width> <weight>6745</weight> </container> </shipyard>
Using namespaces sounds sensible on the surface, but it adds complexity. Not only does it appear complicated (and appearances are everything when looking at XML), but it forces agents to deal with namespace constructs. It could be argued that namespaces are appropriate to avoid clashes when adding to widely deployed formats like RSS (e.g. slash:comments in RSS 2.0). But in general this is a step away from the simplicity behind the power and popularity of XML.
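To show what "dealing with namespace constructs" can look like for a consuming agent, here is a minimal Python sketch with the standard library's ElementTree. The yc prefix and the read_container function are assumptions made for this example; the URIs are the ones from the sample document.

```python
import xml.etree.ElementTree as ET

DOC = """<shipyard xmlns="http://yourcompany.com/shipyard"
          xmlns:yce2="http://yourcompany.com/ext2">
  <container>
    <height>12</height>
    <yce2:width>16</yce2:width>
    <weight>6745</weight>
  </container>
</shipyard>"""

# Prefix-to-URI map used only by this code; "yc" is an arbitrary local prefix.
NS = {
    "yc": "http://yourcompany.com/shipyard",
    "yce2": "http://yourcompany.com/ext2",
}


def read_container(xml_text):
    root = ET.fromstring(xml_text)
    # The default namespace qualifies every unprefixed element,
    # so a plain tag-name lookup no longer matches anything.
    plain_lookup = root.find("container")
    container = root.find("yc:container", NS)
    return {
        "plain_lookup_found": plain_lookup is not None,
        "height": container.findtext("yc:height", namespaces=NS),
        "width": container.findtext("yce2:width", namespaces=NS),
    }


if __name__ == "__main__":
    print(read_container(DOC))
```

Every lookup now has to carry a namespace, including lookups for elements that existed before the extension was introduced.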
XML validation, whether using DTD, XML Schema, RELAX NG, or Schematron, will always hinder your flexibility in XML versioning. When validation is deployed with agents it is likely to become an obstacle to modifying your format. The best practice is for agents to check only the data they consume as they extract it from the document, rather than relying on XML validation.
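One way to read "check only the data they consume" is sketched below in Python with the standard library. The read_weight function, the non-negative rule, and the color element are all assumptions made for illustration, not part of the shipyard format.

```python
import xml.etree.ElementTree as ET


def read_weight(container):
    """Extract and check only the value this agent consumes."""
    text = container.findtext("weight")
    if text is None:
        raise ValueError("container has no weight element")
    try:
        weight = int(text)
    except ValueError:
        raise ValueError(f"weight is not a whole number: {text!r}") from None
    if weight < 0:  # assumed business rule for this sketch
        raise ValueError(f"weight cannot be negative: {weight}")
    return weight


if __name__ == "__main__":
    # width comes from version 1.1; color is a made-up element from a future version.
    newer = ET.fromstring(
        "<container><height>12</height><width>16</width>"
        "<color>red</color><weight>6745</weight></container>"
    )
    print(read_weight(newer))
```

Unknown elements pass through untouched, while a missing or malformed weight is still reported at the point where the agent needs it.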
Versioning of your XML format can be achieved painlessly due to the raw power of XML in its simplest form. In many cases, expanding your XML format should be expected and welcomed as part of developing more iterative, adaptive, robust and loosely coupled solutions.