The Problem With DTD and XML Schema Validation
Validation by DTD or XML Schema is often used for the sake of industry compliance or simply because developers think that to use XML you must formally validate it. But it is rarely something that directly or powerfully improves the software. In fact, the use of validation against DTDs and XML Schemas is counter-productive to creating good software in many different situations.
There is one case where XML Schemas are useful, and that is with the XML datatype in databases, particularly for queries. The XML Schema allows you to specify data types for attribute and element values, which can yield efficiencies in querying the data. Without the type, the value is stored as a string, so for a query such as "amount>5.9" every string amount would have to be converted to a number to do the comparison. Knowing the type, the database engine can actually store the value as a number internally in the post-schema-validation infoset (PSVI), and/or create an additional index of amounts to speed up searching and sorting.
Basically, an XML Schema is useful with the XML type in a database for the same tried and true purpose that database schemas are useful. But there are so many places where XML Schemas and DTDs have been used where they don't help.
DTDs and Schemas are especially bad in user interfaces. With a web form, the concept that an entire document is either valid or not is too inflexible for programmers trying to create a good user experience. The programmer usually needs to give user-friendly, helpful feedback on multiple fields at once, which DTD/Schema validation does not provide, and would otherwise have to implement all of the same checks twice: procedurally in the programming language and declaratively in the DTD/Schema.
Since a Schema tells what can be in the document, there have been many efforts to implement user interface form technology that is guided by a Schema, validating as you enter information into the form. It sounds great in theory, but a Schema is just not adequate for any but the simplest forms and ends up only being used as a starting point for developing the form. In Microsoft InfoPath, forms are templated in documents called manifests, while the Schema/DTD is used as a starting point (a sample XML document can be used too). The idea going into this technology is that the XML Schema will reduce the work, but the result coming out is that, as the form evolves, synchronizing the schema with the form and understanding the functionality gap between them becomes counter-productive to the project.
In a client application, XML need not be validated against a DTD or Schema as you pass XML back and forth or read it from disk. Merely testing the return codes of the parser and navigation methods will catch any corruption. It is more straightforward to check values against bounds and so forth as you normally would in any program than to refrain from checking values on the grounds that the document has already been validated.
DTDs and Schemas are supposed to be useful for communicating document layout between teams developing collaborating software systems. However, they are so difficult to read and modify that they end up slowing down the process, unless both collaborators share a good graphical tool for displaying and modifying the DTD or Schema. For programmers, sample documents are usually best because there is no conceptual leap between what an XML sample looks like and what the XML in question looks like.
In these collaborative systems, validation can be used by the consumer to accept or reject an XML document and by the producer to check before sending the document. However, DTDs and Schemas are never quite up to the task of describing all of the ins and outs of the document structure, so additional validation such as checking an order number against a database is done in the object model anyway. These are checks that have always been done between collaborative systems, and since DTD and Schema validation are not able to do it all, you've now created two places where validation has to be managed and it is probably more work.
A DTD or Schema represents an additional technology requiring expertise and creating another locus of maintenance that adds to the complexity of software development (see also What Stops You From Using XML). For example, if you have a list of valid category codes in a Schema and in a database, they need to be kept up to date in both places whenever the list changes. In addition, old archived XML documents might no longer be valid according to the new Schema, so you'd better archive the old Schema with them. The complexity compounds over time.
Validation against DTD or Schema is often a technology that generates a lot of work without benefiting software development. XML validation comes out of the SGML content management world and as XML rode a wave of hype it was envisioned that validation was going to be part of a software development revolution. But checking data is a basic part of programming; there is no magic solution! It is better to let the programmer do validation in his own familiar language as the requirements dictate.
Thanks. I've also written about this in The Versatile Way to Program XML. As the problem with validation becomes more generally admitted it may diminish the reputation of XML itself because unfortunately people think of these add-on technologies as part and parcel of XML (even though they shouldn't). In a sign of the changing times Dare Obasanjo, an important figure in the XML world, wrote that "XSD has held back the proliferation and advancement of XML technologies by about two or three years".
You need to validate XML? This can mean one of two things. On one hand, it might be a directive or requirement of your project to use DTD or Schema or Relax NG or Schematron (in which case your "hands are tied"). On the other hand, it may mean just to check your data. Validation sounds like a big thing, but ultimately the goal is simply to "check your data", something you often do without thinking about it if you were trained as a programmer.
If your data is an address and phone number, validation would include checking that the necessary values are provided and non-empty, that the phone number has the correct number of numeric digits, and so forth. This can take some time to figure out with a big XML validation technology, but it is probably easy for you in Java since that is the language you are already programming in. I can't help you specifically with Java; checking if a string is empty is simple, and checking the digits in the phone number string involves a bit of string manipulation.
With the emphasis on "unfortunately," some projects require you to implement a particular validation. Validation works differently depending on the validation technology, be it DTD, XSD, Relax NG or Schematron. CMarkup does not support any of the established XML validation technologies but instead encourages you to check your document using the natural techniques described above. However, the CMarkup wrapper for MSXML (CMarkupMSXML) provides a quick way to use MSXML and gives you access to its validation capabilities. Without going into the complexities of validation, here is some sample code that may give some clues about using validation. I believe the validateOnParse property is true by default, but it is set here for illustration.
  CMarkupMSXML xml;
  xml.m_pDOMDoc->validateOnParse = VARIANT_TRUE;
  xml.Load( "test.xml" );
  CString csError = xml.GetError();
If you need to specify an XSD file, you can turn off validateOnParse and associate your schema with the document (look up XMLSchemaCache), and then:

  IXMLDOMParseErrorPtr pErr = xml.m_pDOMDoc->validate();

and use pErr->reason. Needless to say, this is only the tip of the iceberg. These validation techniques are rife with complexities such as dependencies on file paths and remote URLs. Proceed at your own risk.
No validation technology has ever been completely satisfactory, which is why new ones are always being invented. In addition to the reasons expressed above, this is also because the logic behind XML validation is unnatural for the software development principle of module ownership. When the producer and consumer of an XML document are separate parties, the XML is essentially behaving as a point of data interchange between them. Traditionally, the producer of a data format makes a contract to provide data in a certain way, and the consumer does some checks to assert that the data meets its needs.
The concept of XML validation foists a common declarative definition on that contract, which impedes its flexibility and ties both parties into a particular validation vocabulary. In addition, they must either use the same exact version of the validation technology or risk implementation conflicts, putting them into an unfortunate interdependency that never before existed in data interchange solutions. The challenge for the XML development industry as I see it is to get past this mistake and move on.
Mark's is a practical perspective, from seeing firsthand with an architectural eye that traditional XML validation is harmful. The best practice is for agents to check only the data they consume as they extract it from the document, not to use XML validation. This allows the XML format to evolve without affecting the agents it doesn't need to affect (see XML Versioning).
Proponents of XML validation always try to turn it into an argument about whether you should check your data or not. Dare Obasanjo seems to understand that versioning is the key to Mark's argument, but redirects the point like this:
"The fact that you enforce that the XML documents you receive must follow a certain structure or must conform to certain constraints does not mean that your system cannot be flexible in the face of new versions."
Of course enforcing constraints doesn't necessarily hurt your system. The question is how to do it in a way that isn't harmful. XML validation enforces constraints in an all-or-nothing gatekeeper style, separated from the rest of your business logic. This is the problem. As new schema versions are introduced for a particular document type, your software's complexities and interdependencies multiply.
You can check your data right within the business logic that uses it, as you extract it from the document. But people responding to Mark's argument keep implying that he is saying not to check it at all. In the same paragraph, Dare asks "what happens if there are no constraints on the values of elements and attributes in an input document?" And Marc de Graauw worries about a program not checking the currency type of a monetary amount. These are legitimate concerns but no one said you don't need to check your data, particularly the data that you are using!
When new versions of documents start including the price of tea in China, why should all legacy systems have to upgrade to the new schema just to extract the same information they were getting before?
Developers must check the data they receive from external sources like files or TCP/IP streams, but an XML validation mechanism is not a good way to do that.
You mention trying to debug an XML validation error message, and that is a good starting point for understanding its limitations. XML Validation always enters into the solution as an additional technology, which entails added integration, learning curve and maintenance costs. Many of the checks you can do with XML Validation could be done almost as easily without it (as you process the data, not as a separate validation step), some checks can be done more easily outside of it, and some (such as database lookups) are impractical with it. All told, with the costs mentioned above, your solution will be much more effective without XML Validation.
Proponents of XML Validation envisioned in an n-tier architecture that form field values could be delivered as an XML document and screened using XML validation before processing in the business logic layer, or even as part of the client side checking to validate the input before contacting the server. Early examples of this technology actually showed the XML validation output when the document was "invalid," which was horrible from a user standpoint (your input is rejected wholesale due to some cryptic error). Even assuming implementation advances, trying to insert XML validation into web form applications creates work and adds costs, rather than improving the software.
Take for example a required web form field that is a store ID for one of a chain's 20 or so stores in the country. Listing the store IDs in the DTD or schema would be a bad practice because it would require the programmer to keep the DTD or schema synchronized with the master database of stores. Updating a schema might seem simple, but good design practice should always seek to reduce the software complexity of an action like adding a store to the system.
But however you choose to check it, the store ID value should be checked on the client side giving useful DHTML feedback to the user about it. Any measures to improve the user experience such as loading the list of IDs in a drop-down or even displaying a map, will essentially duplicate any checking that could be done with client side XML validation. The browser script will need to know all the details of the various fields in order to provide a user friendly experience. If you use XML validation, your browser-side script will need to translate and cross-reference any validation results between the XML validation stage and indicating the specific problems to the user next to the web form fields, increasing the workload, complexity, and chances for failure.
On the server side you will be looking up the store ID in a database, which is essentially the authority on the validity of the store ID, so performing validation against a DTD or schema would be redundant. If you are using XML validation on the server side as a middle-tier gatekeeper security measure against a malicious client, that is a lot of added infrastructure to perform a simple value safety check commonly supported by web platforms.
XML validation is the creation of architecture astronauts who identified the general need to check documents and felt it could be abstracted as a modular component of XML development. It is a design born in theory without a good justification in practice because checking data permeates so many parts of a solution and the most practical place to check data is in the code that also processes the data. As mentioned above, proponents of XML Validation usually fall back to suggesting the alternative is not to check your data, which could not be further from the truth.
If you are really trying to evaluate XML validation in your solution, look at what specific values and data relationships you are checking with XML validation and also which checks cannot be supported by XML validation, and which checks are done both in the XML validation stage as well as other parts of the solution.