Working with XML Schemas: Comparing DTDs and XML Schemas

Support Knowledge Base, Article 677

by Dan Wahlin

Validating data contained within XML documents is an important part of many applications used for data exchange, content management, and configuration. If you've had the chance to validate XML documents using Document Type Definitions (DTDs) then you know why the approval of XML schemas on May 2, 2001 by the W3C was so highly anticipated by the XML community. DTDs are limited in several areas: they do not support namespaces, they are not written as well-formed XML, and they offer virtually no support for data types. The XML community knew that in order for XML to become even more useful and widely adopted, a new way of validating documents needed to be created. With the release of XML schemas, XML documents can now be validated much more precisely.

DTDs do have several good qualities. For example, their syntax is very compact and easy to learn once the appropriate keywords are known. DTDs are also very good at validating an XML document's structure, meaning that element and attribute names can be checked and parent/child relationships among elements can be enforced. DTDs also provide support for entities (reusable pieces of data). One of the biggest arguments in favor of DTDs is that the majority of validating XML parsers allow for validation against DTDs.

The number of validating parsers providing XML Schema support is growing but still relatively small compared to those that support DTDs at this point. That will change with time, however, and as a result this article will compare DTDs and XML schemas to provide you with an overview of why XML schemas are so useful and why you should consider migrating to them from DTDs.

Why do we need DTDs and Schemas?
Before jumping into a comparison of DTDs and XML schemas it's important that I discuss why XML documents need to be validated using DTDs or XML schemas. One of the reasons XML has become such a hot topic over the last few years is due to the fact that it is "extensible." XML document authors have complete control over how they structure the data in the document and what names are used for elements and attributes.

While this extensibility does provide many benefits, it also causes difficulties when separate parties exchange data marked up using XML. To see this problem more clearly, examine the XML document shown in Listing 1.

<?xml version="1.0"?>
<customers>
  <customer custID="AB-1453">
    <firstName>Dave</firstName>
    <lastName>Todd</lastName>
  </customer>
  <customer custID="AB-1489">
    <firstName>Thomas</firstName>
    <lastName>Rawlings</lastName>
  </customer>
</customers>
Listing 1. Customer information marked up using XML

Looking at the XML document you'll see that it contains an attribute named custID. While this attribute name and position may make perfect sense to you, the company you exchange the document with may expect custID to be an element or may expect the data it contains to be described using a name other than custID. They also may expect that each value contained within the custID attribute is unique so that they can easily import the data into some type of data store.

DTDs and XML schemas provide a roadmap for how XML documents should be created and what they should contain. Once the "map" is known, a company receiving XML data can use the associated DTD or Schema to validate that the XML document contains proper element and attribute names and follows a defined structure. Validation allows the "extensible" nature of XML to be constrained enough to allow agreements between companies and applications to be achieved.

Listing 2 shows an example of a DTD that could be used to validate the XML in Listing 1. Listing 3 shows an equivalent schema. Details on some of the differences between the two types of documents will follow.

<!ELEMENT customers (customer+)>
<!ELEMENT customer (firstName, lastName)>
<!ATTLIST customer custID ID #REQUIRED>
<!ELEMENT firstName (#PCDATA)>
<!ELEMENT lastName (#PCDATA)>
Listing 2. A DTD used to validate the customer information shown in Listing 1.

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="customers">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="customer" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="customer">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="firstName"/>
        <xs:element ref="lastName"/>
      </xs:sequence>
      <xs:attribute name="custID" type="xs:ID" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="firstName" type="xs:string"/>
  <xs:element name="lastName" type="xs:string"/>
</xs:schema>
Listing 3. A W3C XML Schema used to validate the customer information shown in Listing 1.

Validation of XML documents against DTDs or XML schemas is not always necessary and generally adds additional overhead. This being the case, when should XML documents be validated using DTDs or XML schemas? There is no right or wrong answer to this question unfortunately. If you develop an XML document and an application that consumes it, validation may be overkill since no one else is involved in the processing of the document. If you exchange or receive XML documents with other companies (or even other people or departments in your company) then validation may be warranted. The bottom line is that if you're concerned whether or not the structure and/or data contained within an XML document meets specific standards set by you, another company, a specification, or an application, then validation is appropriate.

To clarify this more, let me give a quick example. A while back I wrote an application that creates hierarchical menus similar to the start menu in Windows that can be used on a web page (see http://www.xmlforasp.net if you'd like the C# code for these menus). The menus use XML to mark up their data and nesting structure. Because I wrote the XML document as well as the application and since the menus needed to load as quickly as possible, validation of the XML document was not necessary. In fact, validation would only serve to slow the application down.

The XML menus example can be contrasted with a company I'm currently working with that handles millions of XML records containing customer service information on a monthly basis. The documents can be exchanged with many other companies and in order for exchange to be possible and the data to be useful, the documents must adhere to a particular specification. Validation is crucial in this situation because documents not conforming to the specification will cause a chain of "reject" records to appear slowing down the overall process.

Ultimately, the choice to validate is up to you and needs to be made on a case-by-case basis. Your decision comes down to balancing the necessity of validating the XML against the cost of validating it.

Comparing DTDs and Schemas
Now that you've had a chance to see a simple DTD and XML Schema and know more about when and when not to use validation, let's take a look at some of the differences between the two document types.

Defining Elements
The customer information data shown in Listing 1 contained an element named customer as shown below:

<customer custID="AB-1453">
  <firstName>Dave</firstName>
  <lastName>Todd</lastName>
</customer>

To define the customer element in a DTD as well as its content model (the content model is everything between its start and end tags) the following line of code should be added:

<!ELEMENT customer (firstName, lastName)>

Analyzing this code you'll see that it does not follow the rules of well-formed XML. DTD definitions start with the <! characters and end with the > character. The content model for the customer element is defined by placing the appropriate names within parentheses and separating them with a comma.

Let's contrast this DTD element definition with the same definition in an XML Schema.

<xs:element name="customer">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="firstName"/>
      <xs:element ref="lastName"/>
    </xs:sequence>
    <xs:attribute name="custID" type="xs:ID" use="required"/>
  </xs:complexType>
</xs:element>

While more verbose, this definition contains well-formed XML and involves the use of a tag named element. As with DTDs, the element being defined is named (customer in this case) although this is done using the name attribute in the schema. Child nodes of the customer element are defined within a complexType tag. The complexType tag's child is named sequence and specifies that the firstName element must precede the lastName element as a child of customer. This is similar to using the comma as a separator in the DTD definition. If the author of the schema didn't care in which order firstName and lastName appeared under the customer element, a different tag named all could be used instead of the sequence tag as shown below:

<xs:element name="customer">
  <xs:complexType>
    <xs:all>
      <xs:element ref="firstName"/>
      <xs:element ref="lastName"/>
    </xs:all>
    <xs:attribute name="custID" type="xs:ID" use="required"/>
  </xs:complexType>
</xs:element>

There is no equivalent to the all tag in DTDs without resorting to complex occurrence indicator nesting. Occurrence indicators will be discussed next.
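
To illustrate the kind of nesting a DTD would need, here is a minimal sketch of a document whose internal DTD accepts firstName and lastName in either order by listing both orderings as alternatives. It reuses the names from Listing 2; the instance data is only for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE customer [
  <!-- Each allowed ordering is spelled out as its own alternative -->
  <!ELEMENT customer ((firstName, lastName) | (lastName, firstName))>
  <!ATTLIST customer custID ID #REQUIRED>
  <!ELEMENT firstName (#PCDATA)>
  <!ELEMENT lastName (#PCDATA)>
]>
<customer custID="AB-1453">
  <lastName>Todd</lastName>
  <firstName>Dave</firstName>
</customer>
```

With two children this is manageable, but the number of alternatives grows factorially with the number of children, which is why the all tag is so much more convenient.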

Looking at the customer element definition shown previously you'll notice that the firstName and lastName elements are not fully defined here. Instead, each element is "referenced" by using the ref attribute. The referenced definitions are shown below:

<xs:element name="firstName" type="xs:string"/>
<xs:element name="lastName" type="xs:string"/>

This is much like a parameter entity in DTDs since the element definitions are defined once and can be reused many times within the schema. Parameter entities allow reusable content to be defined once and then used in several places throughout an external DTD as shown below:

<!-- Contents of an external DTD file. The XML specification does not allow
     parameter entity references inside declarations in the internal subset. -->
<!ENTITY % addr "(Address,City,Zip)">
<!ELEMENT homeAddress %addr;>
<!ELEMENT busAddress %addr;>
<!ELEMENT shipAddress %addr;>
<!-- More definitions would follow -->

By defining the firstName and lastName elements separately from where they are actually used, they can easily be re-used throughout the schema as appropriate.

Occurrence Indicators
In cases where children of a particular element may appear more than once or not at all, DTDs allow for special characters called "occurrence indicators" (these are also referred to as "cardinality operators"). An example of using an occurrence indicator is shown in the following DTD element definition:

<!ELEMENT customers (customer+)>

This definition states that the customers element can have 1 or more child elements named customer. Table 1 provides a list of possible occurrence indicators and how they can be used.

Occurrence Indicator	Description
+	Indicates that an element may appear 1 or more times
*	Indicates that an element may appear 0 or more times
?	Indicates that an element may appear 0 or 1 times

Table 1

A child element can be made optional by using the ? character:

<!ELEMENT homeAddress (address1,address2?,city,street)>

An element with multiple child elements that must appear together as a group 0 or more times can be represented in the following manner:

<!ELEMENT email (username,password,provider)*>

While these characters are easy to use and certainly compact, XML schemas make it even easier to define how many times an element can occur by using the minOccurs and maxOccurs attributes. You saw an example of using the maxOccurs attribute with the customer element reference in the schema back in Listing 3. Here's the definition again:

<xs:element ref="customer" maxOccurs="unbounded"/>

If maxOccurs or minOccurs is not included, its value defaults to 1. Unlike DTD occurrence indicators, these attributes can also accept specific numeric values. If the customer element must appear at least 2 times but no more than 5 times, it is straightforward to represent this in the schema:

<xs:element ref="customer" minOccurs="2" maxOccurs="5"/>
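
As a rough side-by-side, the sketch below shows how each DTD occurrence indicator from Table 1 could be expressed with minOccurs and maxOccurs, including a repeating group like the email example above. The contact, phone, and note names are hypothetical and chosen only for this illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="contact">
    <xs:complexType>
      <xs:sequence>
        <!-- No indicator in a DTD: exactly once (both attributes default to 1) -->
        <xs:element name="address1" type="xs:string"/>
        <!-- DTD "?": zero or one -->
        <xs:element name="address2" type="xs:string" minOccurs="0"/>
        <!-- DTD "*": zero or more -->
        <xs:element name="phone" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
        <!-- DTD "+": one or more -->
        <xs:element name="note" type="xs:string" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="email">
    <xs:complexType>
      <!-- DTD "(username,password,provider)*": the whole group repeats -->
      <xs:sequence minOccurs="0" maxOccurs="unbounded">
        <xs:element name="username" type="xs:string"/>
        <xs:element name="password" type="xs:string"/>
        <xs:element name="provider" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```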

Defining Attributes
Attributes are fairly straightforward to define in both DTDs and Schemas. The structure used for each type of document is shown below:

DTD:
<!ATTLIST customer custID ID #REQUIRED>

Schema:
<xs:attribute name="custID" type="xs:ID" use="required"/>

One big difference between the two definitions is that DTDs allow all attributes for a given element to be defined using one ATTLIST keyword, whereas schemas define each attribute individually, much like elements are defined. Schemas also allow attributes to be grouped together for reuse (similar to using parameter entities for frequently used attributes in DTDs) and allow attribute values to be associated with several different data types.
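
Attribute grouping could look something like the following sketch, which uses the attributeGroup tag to define two attributes once and reference them from two elements. The auditAttributes group, its createdOn and createdBy attributes, and the order element are hypothetical names used only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Defined once at the top level of the schema -->
  <xs:attributeGroup name="auditAttributes">
    <xs:attribute name="createdOn" type="xs:date" use="required"/>
    <xs:attribute name="createdBy" type="xs:string"/>
  </xs:attributeGroup>

  <xs:element name="customer">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="firstName" type="xs:string"/>
        <xs:element name="lastName" type="xs:string"/>
      </xs:sequence>
      <xs:attribute name="custID" type="xs:ID" use="required"/>
      <!-- Pulls in createdOn and createdBy -->
      <xs:attributeGroup ref="auditAttributes"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="order">
    <xs:complexType>
      <!-- The same group reused on a different element -->
      <xs:attributeGroup ref="auditAttributes"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```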

Attributes defined in a schema can't simply be nested under the element definition. Instead, they must be wrapped with a complexType element as shown below:
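
A minimal sketch of that wrapping, based on the customer definition from Listing 3, might look like this. The price element and its currency attribute are hypothetical additions that show one way to attach an attribute to an element containing only text.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Element with children: attributes follow the content model inside complexType -->
  <xs:element name="customer">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="firstName" type="xs:string"/>
        <xs:element name="lastName" type="xs:string"/>
      </xs:sequence>
      <xs:attribute name="custID" type="xs:ID" use="required"/>
    </xs:complexType>
  </xs:element>

  <!-- Text-only element: simpleContent extends a simple type with an attribute -->
  <xs:element name="price">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:decimal">
          <xs:attribute name="currency" type="xs:string" use="required"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>
```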

Data Types
While DTDs are very good at defining and validating the structure of an XML document, they contain next to no support for standard data types. Although all of the different types won't be listed here, for the most part all data in an XML document is either considered parsed or unparsed character data (text). This presents problems when data in an XML document needs to be checked to ensure that it is a valid date or number.

XML schemas offer robust support for data types. Using schemas you can validate if dates are valid, values are integers, as well as check for many other things. This is very important as XML is used to move data between different storage systems such as databases.

Figure 1 shows an image published by the W3C that represents the different data types available in the XML Schema specification.

Figure 1. XML Schema Data Types

Using these different data types in an XML schema is accomplished through the type attribute. Both elements and attributes can use this attribute to specify their data type:

<xs:attribute name="birthday" type="xs:date" use="required"/>
<xs:element name="GPA" type="xs:decimal" />

This is a huge improvement over DTDs where a given element's child text node can only contain PCDATA (parsed character data) as a data type.

XML schemas also allow custom data types to be created in cases where a base type needs to be customized. For example, a custom type can be created that uses an integer as its base but requires that the integer value fall between 1001 and 1005. Custom data types can also leverage the power of regular expressions to do pattern matching. More information about creating custom data types will be provided in the next article in this series.
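
As a preview, the two kinds of custom type just described could be sketched as follows. The type names regionCode and custIDType, the region element, and the specific pattern are assumptions made for this example; the pattern is simply one that matches IDs shaped like those in Listing 1.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- An integer limited to the range 1001 through 1005 -->
  <xs:simpleType name="regionCode">
    <xs:restriction base="xs:integer">
      <xs:minInclusive value="1001"/>
      <xs:maxInclusive value="1005"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- An ID that must look like two capital letters, a hyphen, and four digits -->
  <xs:simpleType name="custIDType">
    <xs:restriction base="xs:ID">
      <xs:pattern value="[A-Z]{2}-[0-9]{4}"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:element name="region" type="regionCode"/>
  <xs:element name="customer">
    <xs:complexType>
      <xs:attribute name="custID" type="custIDType" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Under these definitions a region of 1003 or a custID of AB-1453 would be accepted, while a region of 1006 or a custID of ab-1453 would be rejected.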

Entity Support
DTDs provide excellent support for entities. If you're not familiar with entities, you can think of them as being similar to server-side includes or any other type of reusable piece of data. Defining entities in DTDs is accomplished by using the ENTITY keyword as shown in Listing 4.

<!ELEMENT businessInfo (address+,businessSummary?)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT businessSummary (#PCDATA)>
<!ENTITY busAddress "1234 Anywhere St.">
<!ENTITY busSummary SYSTEM "summary.txt">
Listing 4. Defining Internal and External Entities in a DTD

Once defined in the DTD, entities can be used in the XML by prefixing the entity name with an ampersand character (&) and ending it with a semi-colon. Listing 5 shows an example.

<?xml version="1.0"?>
<businessInfo>
  <address>&busAddress;</address>
  <businessSummary>&busSummary;</businessSummary>
</businessInfo>
Listing 5. Using Entities in an XML document

So where do entities fit into XML schemas? They don't actually, at least in the same way as in DTDs. You can define pieces of information once in a schema by making the definition fixed, but the flexibility associated with using entities in DTDs is not there with schemas. Here's an example from the W3C schema primer document that shows an example of defining a fixed value by using an element:

<xsd:element name="eacute" type="xsd:token" fixed="é"/>
--------------------------------------------------------
<?xml version="1.0" ?>
<purchaseOrder xmlns="http://www.example.com/PO1"
               xmlns:c="http://www.example.com/characterElements"
               orderDate="1999-10-20">
  <city>Montr<c:eacute/>al</city>
</purchaseOrder>

Although the above example gets the job done in this situation, using DTDs for entity support is not only easier but more flexible since internal and external entities can be referenced with minimal effort. This means that you may use DTDs and schemas together when working with XML documents. DTDs may be used for entity support while schemas will be used for validation of the XML document.
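
One way such a combination could look is sketched below: an internal DTD subset supplies only the entity, while the root element points at a schema for validation. The file name businessInfo.xsd is hypothetical, and the sketch assumes a parser that expands DTD-declared entities while performing schema validation rather than DTD validation.

```xml
<?xml version="1.0"?>
<!DOCTYPE businessInfo [
  <!-- The DTD is used only to declare the entity -->
  <!ENTITY busAddress "1234 Anywhere St.">
]>
<businessInfo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:noNamespaceSchemaLocation="businessInfo.xsd">
  <!-- The entity reference is expanded before the schema sees the content -->
  <address>&busAddress;</address>
</businessInfo>
```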

Namespace Support
Namespaces are an integral part of XML documents that allow naming collisions to be avoided by associating an element (or attribute) with a unique identifier. These identifiers are typically referred to as Uniform Resource Identifiers (URIs) or Uniform Resource Names (URNs). Due to the importance of namespaces, the XML Schema specification includes support for them.

DTDs do not provide support for namespaces at all. This is mainly due to the fact that they stem from Standard Generalized Markup Language (SGML) and are therefore more dated. Rather than adding support for namespaces in DTDs, the W3C working group decided to put all of the effort related to this task into XML schemas.

To see namespaces in action, Listing 6 shows the addition of a local namespace to the XML document first introduced in Listing 1. Notice that the namespace prefix of "acme" is associated with the customer element.

<?xml version="1.0"?>
<customers xmlns:acme="http://www.acme.com/namespace">
  <acme:customer custID="AB-1453">
    <firstName>Dave</firstName>
    <lastName>Todd</lastName>
  </acme:customer>
  <acme:customer custID="AB-1489">
    <firstName>Thomas</firstName>
    <lastName>Rawlings</lastName>
  </acme:customer>
</customers>
Listing 6. Using a namespace in an XML document

If this XML document is validated against the DTD shown earlier in Listing 2 an error will be generated. This is because the DTD treats the namespace prefix and element name as one name: acme:customer. DTDs don't know anything about namespaces and if you change the namespace prefix in the XML you'll need to create a different DTD (or leverage parameter entities) to accommodate this change. Listing 7 shows how the DTD would need to be modified. Notice that the namespace prefix is "hard-coded" into the different element and attribute definitions.

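<!ELEMENT customers (acme:customer+)>
<!-- A validating parser also expects the namespace declaration attribute to be declared -->
<!ATTLIST customers xmlns:acme CDATA #FIXED "http://www.acme.com/namespace">
<!ELEMENT acme:customer (firstName, lastName)>
<!ATTLIST acme:customer custID ID #REQUIRED>
<!ELEMENT firstName (#PCDATA)>
<!ELEMENT lastName (#PCDATA)>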
Listing 7. Modifying the DTD to handle the namespace prefix

XML schemas are fully namespace compliant and therefore know how to deal with them properly. In fact, XML Schema documents are themselves part of a namespace (bound either to a prefix or as the default namespace) with a URI equal to "http://www.w3.org/2001/XMLSchema".

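The schema shown below in Listing 8 uses the acme namespace as its target namespace. Note that it declares customers, customer, firstName, and lastName all in that namespace, so it validates a variant of the Listing 6 document in which every element, not just customer, belongs to http://www.acme.com/namespace.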

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema targetNamespace="http://www.acme.com/namespace"
           xmlns:acme="http://www.acme.com/namespace"
           xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">
  <xs:element name="customers">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="acme:customer" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="customer">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="acme:firstName"/>
        <xs:element ref="acme:lastName"/>
      </xs:sequence>
      <xs:attribute name="custID" type="xs:ID" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="firstName" type="xs:string"/>
  <xs:element name="lastName" type="xs:string"/>
</xs:schema>
Listing 8. Using Namespaces in XML schemas

Notice that the acme namespace prefix is defined in the XML schema root tag and then used as appropriate when different elements are referenced. Many different namespace declarations can be combined in one document without having to write a separate schema.
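
Because Listing 8 declares a targetNamespace and all four of its elements are top-level declarations, an instance document is expected to place each of those elements in the acme namespace. The following is a sketch of such an instance, adapted from Listing 6. The custID attribute stays unprefixed because the schema sets attributeFormDefault to "unqualified".

```xml
<?xml version="1.0"?>
<acme:customers xmlns:acme="http://www.acme.com/namespace">
  <acme:customer custID="AB-1453">
    <acme:firstName>Dave</acme:firstName>
    <acme:lastName>Todd</acme:lastName>
  </acme:customer>
  <acme:customer custID="AB-1489">
    <acme:firstName>Thomas</acme:firstName>
    <acme:lastName>Rawlings</acme:lastName>
  </acme:customer>
</acme:customers>
```

Declaring the same URI as the default namespace on the root element would have the same effect without the prefixes.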

This short discussion of namespaces does not do them justice unfortunately. You can read more about namespaces at the following URL: http://msdn.microsoft.com/msdnmag/issues/01/07/xml/xml0107.asp

Associating DTDs and Schemas with XML Documents
DTDs can be internal or external to an XML document. In both cases, hooking the DTD up with the XML document is accomplished by using the DOCTYPE keyword. For example, to reference an external DTD named customers.dtd, the following syntax can be placed immediately after the XML declaration and before the root element of the XML document:

<!DOCTYPE root SYSTEM "customers.dtd">

Associating XML documents with XML schemas is similar although namespaces have to be taken into consideration. If an XML document has no namespaces defined, the following attributes can be added to the document's root element to associate the document with an external schema:

<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="customers.xsd">
  <!-- elements follow -->
</root>

If the XML document does have one or more namespaces associated with it the following syntax can be used:

<root xmlns:acme="http://www.acme.com/namespace"
      xmlns="http://www.acme.com/namespace"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.acme.com/namespace customers.xsd">
  <!-- elements follow -->
</root>

Validating against XML Schemas in .NET
It's a fairly simple process to validate an XML document against a schema using the .NET platform. The validation process involves using two classes in the System.Xml namespace (installed with the .NET SDK) named XmlTextReader and XmlValidatingReader. Listing 9 shows how these classes can be used for validation purposes. The code shown in Listing 9 is contained in a file named ValidatationTest.aspx.cs. This file is available in this article's downloadable code - located at the bottom of this page.

bool _valid = true;

private void Page_Load(object sender, System.EventArgs e) {
  XmlTextReader reader = null;
  XmlValidatingReader vReader = null;
  try {
    reader = new XmlTextReader(Server.MapPath("Listing6.xml"));
    vReader = new XmlValidatingReader(reader);
    vReader.ValidationType = ValidationType.Schema;
    vReader.ValidationEventHandler +=
      new ValidationEventHandler(ValidationChecker);
    // Read through the validating reader; reading the underlying
    // XmlTextReader directly would not trigger validation
    while (vReader.Read()) {}
    if (_valid) {
      this.message.ForeColor = Color.Navy;
      this.message.Text = "Validation Succeeded";
    } else {
      this.message.ForeColor = Color.Red;
      this.message.Text = "Validation Failed";
    }
  }
  catch (Exception exp) {
    Response.Write(exp.Message);
  }
  finally {
    // Either reader may still be null if its constructor threw
    if (reader != null) reader.Close();
    if (vReader != null) vReader.Close();
  }
}

// Called once for each validation error found while reading
public void ValidationChecker(object sender, ValidationEventArgs args) {
  _valid = false;
  Response.Write(args.Message);
}
Listing 9. Validating with the XmlValidatingReader (found in ValidatationTest.aspx.cs in code download for this article)

Though not shown in Listing 9, the System.Xml and System.Xml.Schema namespaces must be referenced by the ASP.NET code-behind page in order to have access to the proper classes used in validating the XML document.

The code starts off by instantiating the XmlTextReader and passing the path to the XML document to its constructor. The XmlTextReader is then passed into the XmlValidatingReader's constructor. Next, the type of validation to perform is established by setting the ValidationType property of the XmlValidatingReader to ValidationType.Schema. Other possibilities include ValidationType.Auto, ValidationType.DTD, ValidationType.XDR, and ValidationType.None.

The next line of code hooks the XmlValidatingReader up with an event handler that will be called if an error is encountered during validation. In this example, ValidationChecker() will be called and information about the error event will be passed into it via the ValidationEventArgs object. You'll learn more about the event handler and arguments in future articles.

Now that the validation type has been specified and the event handler is ready to go, the Read() method of the XmlValidatingReader is called and the XML document is read from start to finish. As it is being read, the different elements, attributes, and other nodes in the document are compared to the XML schema to ensure that they are valid. If anything in the XML document is inconsistent with the schema, the event handler sets a class-level Boolean variable named _valid to false.

Summary
Although not every difference between DTDs and XML schemas has been covered in this article, you've been presented with some of the major differences. Both DTDs and XML schemas provide XML validation and documentation capabilities. DTDs fall short in validating data types in XML documents and they do not contain namespace support. While most validating parsers support DTDs currently, more and more parsers are beginning to support XML schemas due to their inherent advantages. As a result, you can bet that the incorporation of XML schemas into different applications will increase in the near future.

Sample Code
Download the sample code, Wahlin-Code.zip, at the bottom of this page.

Dan Wahlin is the author of XML for ASP.NET Developers (Sams) and is an independent consultant for Wahlin Consulting, LLC, which offers XML and Web Service training and consulting. He also founded the XML for ASP.NET Developers website (http://www.XMLforASP.NET), which focuses on using XML and Web Services in Microsoft's .NET platform. Dan co-authored Professional Windows DNA (Wrox), ASP.NET: Tips, Tutorials, and Code (Sams), and writes for several magazines.

Attachments

Attachments/KB677_Wahlin-Code.zip

Created : 7/15/2003 1:00:07 PM (last modified : 10/24/2008 4:26:24 PM)
