Working with XML Schemas: Comparing DTDs and XML Schemas
by Dan Wahlin
Validating data contained within XML documents is an important part of
many applications used for data exchange, content management, and configuration.
If you've had the chance to validate XML documents using Document Type
Definitions (DTDs) then you know why the approval of XML schemas on May
2, 2001 by the W3C was so highly anticipated by the XML community. DTDs
are limited in several areas: they do not support namespaces, they are
not written in well-formed XML, and they contain virtually no support for data types.
The XML community knew that in order for XML to become even more useful
and widely adopted, a new manner of validating needed to be created. With
the release of XML schemas, XML documents can now be validated much more
precisely.
DTDs do have several good qualities. For example, their syntax is very
compact and easy to learn once the appropriate keywords are known. DTDs
are also very good at validating an XML document's structure, meaning that
element and attribute names can be checked and parent/child relationships
among elements can be ensured. DTDs also provide support for entities
(reusable pieces of data). One of the biggest arguments in favor of DTDs
is that the majority of validating XML parsers allow for validation against
DTDs.
The number of validating parsers providing XML Schema support is growing
but still relatively small compared to those that support DTDs at this
point. That will change with time, however, and as a result this article
will compare DTDs and XML schemas to provide you with an overview of why
XML schemas are so useful and why you should consider migrating to them
from DTDs.
Why do we need DTDs and Schemas?
Before jumping into a comparison of DTDs and XML schemas it's important
that I discuss why XML documents need to be validated using DTDs or XML
schemas. One of the reasons XML has become such a hot topic over the last
few years is due to the fact that it is "extensible." XML document authors
have complete control over how they structure the data in the document
and what names are used for elements and attributes.
While this extensibility does provide many benefits, it also causes difficulties
when separate parties exchange data marked up using XML. To see this problem
more clearly, examine the XML document shown in Listing 1.
<?xml version="1.0"?>
<customers>
<customer custID="AB-1453">
<firstName>Dave</firstName>
<lastName>Todd</lastName>
</customer>
<customer custID="AB-1489">
<firstName>Thomas</firstName>
<lastName>Rawlings</lastName>
</customer>
</customers>
Listing 1. Customer information marked up using XML
Looking at the XML document you'll see that it contains an attribute
named custID. While this attribute name and position may make perfect
sense to you, the company you exchange the document with may expect custID
to be an element or may expect the data it contains to be described using
a name other than custID. They also may expect that each value contained
within the custID attribute is unique so that they can easily import the
data into some type of data store.
DTDs and XML schemas provide a roadmap for how XML documents should be
created and what they should contain. Once the "map" is known, a company
receiving XML data can use the associated DTD or Schema to validate that
the XML document contains proper element and attribute names and follows
a defined structure. Validation allows the "extensible" nature of XML
to be constrained enough to allow agreements between companies and applications
to be achieved.
Listing 2 shows an example of a DTD that could be used to validate the
XML in Listing 1. Listing 3 shows an equivalent schema. Details on some
of the differences between the two types of documents will follow.
<!ELEMENT customers (customer+)>
<!ELEMENT customer (firstName, lastName)>
<!ATTLIST customer custID ID #REQUIRED>
<!ELEMENT firstName (#PCDATA)>
<!ELEMENT lastName (#PCDATA)>
Listing 2. A DTD used to validate the customer information shown in
Listing 1.
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified">
<xs:element name="customers">
<xs:complexType>
<xs:sequence>
<xs:element ref="customer"
maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="customer">
<xs:complexType>
<xs:sequence>
<xs:element ref="firstName"/>
<xs:element ref="lastName"/>
</xs:sequence>
<xs:attribute name="custID" type="xs:ID"
use="required"/>
</xs:complexType>
</xs:element>
<xs:element name="firstName" type="xs:string"/>
<xs:element name="lastName" type="xs:string"/>
</xs:schema>
Listing 3. A W3C XML Schema used to validate the customer information
shown in Listing 1.
Validation of XML documents against DTDs or XML schemas is not always
necessary and generally adds processing overhead. This being the case,
when should XML documents be validated using DTDs or XML schemas? Unfortunately,
there is no single right answer to this question. If you develop
an XML document and an application that consumes it, validation may be
overkill since no one else is involved in the processing of the document.
If you exchange or receive XML documents with other companies (or even
other people or departments in your company) then validation may be warranted.
The bottom line is that if you're concerned whether or not the structure
and/or data contained within an XML document meets specific standards
set by you, another company, a specification, or an application, then
validation is appropriate.
To clarify this more, let me give a quick example. A while back I wrote
an application that creates hierarchical menus similar to the start menu
in Windows that can be used on a web page (see http://www.xmlforasp.net
if you'd like the C# code for these menus). The menus use XML to mark up
their data and nesting structure. Because I wrote the XML document as
well as the application and since the menus needed to load as quickly
as possible, validation of the XML document was not necessary. In fact,
validation would only serve the purpose of slowing the application down.
The XML menus example can be contrasted with a company I'm currently
working with that handles millions of XML records containing customer
service information on a monthly basis. The documents can be exchanged
with many other companies and in order for exchange to be possible and
the data to be useful, the documents must adhere to a particular specification.
Validation is crucial in this situation because documents not conforming
to the specification will cause a chain of "reject" records to appear,
slowing down the overall process.
Ultimately, the choice to validate is up to you and needs to be made
on a case-by-case basis by balancing the need to validate the XML against
the cost of validating it.
Comparing DTDs and Schemas
Now that you've had a chance to see a simple DTD and XML Schema and know
more about when and when not to use validation, let's take a look at some
of the differences between the two document types.
Defining Elements
The customer information data shown in Listing 1 contained an element
named customer as shown below:
<customer custID="AB-1453">
<firstName>Dave</firstName>
<lastName>Todd</lastName>
</customer>
To define the customer element in a DTD as well as its content model
(the content model is everything between its start and end tags) the following
line of code should be added:
<!ELEMENT customer (firstName, lastName)>
Analyzing this code you'll see that it is not written in XML element syntax.
DTD declarations start with the <! characters and end with the >
character, with no matching end tag. The content model for the customer element is defined by placing
the appropriate names within parentheses and separating them with a comma.
Let's contrast this DTD element definition with the same definition in
an XML Schema.
<xs:element name="customer">
<xs:complexType>
<xs:sequence>
<xs:element ref="firstName"/>
<xs:element ref="lastName"/>
</xs:sequence>
<xs:attribute name="custID" type="xs:ID"
use="required"/>
</xs:complexType>
</xs:element>
While more verbose, this definition contains well-formed XML and involves
the use of a tag named element. As with DTDs, the element being defined
is named (customer in this case) although this is done using the name
attribute in the schema. Child nodes of the customer
element are defined within a complexType tag.
The complexType tag's child is named sequence
and specifies that the firstName element must
precede the lastName element as a child of customer.
This is similar to using the comma as a separator in the DTD definition.
If the author of the XML document schema didn't care which order firstName
or lastName appeared under the customer element, a different tag named
all could be used instead of the sequence tag as shown below:
<xs:element name="customer">
<xs:complexType>
<xs:all>
<xs:element ref="firstName"/>
<xs:element ref="lastName"/>
</xs:all>
<xs:attribute name="custID" type="xs:ID"
use="required"/>
</xs:complexType>
</xs:element>
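There is no direct equivalent to the all tag in DTDs. The same effect
requires listing every permitted ordering as an alternative in the content
model, which quickly becomes unwieldy as the number of child elements grows.
Occurrence indicators, which control how many times a child may appear,
are discussed later in this article.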
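To make the DTD workaround concrete, here is a minimal sketch (my own illustration, not one of this article's listings) of a content model that accepts firstName and lastName in either order by listing both orderings as alternatives. The document below places lastName first and should still be valid against its internal DTD.

```xml
<?xml version="1.0"?>
<!DOCTYPE customer [
  <!-- Each ordering is spelled out as one branch of a choice group -->
  <!ELEMENT customer ((firstName, lastName) | (lastName, firstName))>
  <!ATTLIST customer custID ID #REQUIRED>
  <!ELEMENT firstName (#PCDATA)>
  <!ELEMENT lastName (#PCDATA)>
]>
<customer custID="AB-1453">
  <lastName>Todd</lastName>
  <firstName>Dave</firstName>
</customer>
```

With only two children this is manageable, but three children already require six alternatives, which is where the all tag in a schema becomes much more attractive.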
Looking at the customer element definition shown previously you'll notice
that the firstName and lastName
elements are not fully defined here. Instead, each element is "referenced"
by using the ref attribute. The referenced definitions are shown below:
<xs:element name="firstName" type="xs:string"/>
<xs:element name="lastName" type="xs:string"/>
This is much like a parameter entity in DTDs since the element definitions
are defined once and can be re-used many times within the schema. Parameter
entities allow reusable content to be defined once and then used in
several places throughout the DTD as shown below:
<!-- Contents of an external DTD file. Parameter entity references inside
     declarations are not allowed in an internal DTD subset. -->
<!ENTITY % addr "(Address,City,Zip)">
<!ELEMENT homeAddress %addr;>
<!ELEMENT busAddress %addr;>
<!ELEMENT shipAddress %addr;>
<!-- More definitions would follow -->
By defining the firstName and lastName
elements separately from where they are actually used, they can easily
be re-used throughout the schema as appropriate.
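For comparison, the address content model reused through the parameter entity above might be expressed in a schema with a named complex type. This is only a sketch: the type name AddressType and the choice of xs:string for the child elements are my assumptions, since the DTD fragment does not say what Address, City, and Zip contain.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical named type that plays the role of the %addr; entity -->
  <xs:complexType name="AddressType">
    <xs:sequence>
      <xs:element name="Address" type="xs:string"/>
      <xs:element name="City" type="xs:string"/>
      <xs:element name="Zip" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
  <!-- Each element reuses the same content model by naming the type -->
  <xs:element name="homeAddress" type="AddressType"/>
  <xs:element name="busAddress" type="AddressType"/>
  <xs:element name="shipAddress" type="AddressType"/>
</xs:schema>
```

Unlike a parameter entity, which is a text substitution, the named type is a real schema component, so a schema-aware tool can report which type an element was validated against.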
Occurrence Indicators
In cases where children of a particular element may appear more than once
or not at all, DTDs allow for special characters called "occurrence indicators"
(these are also referred to as "cardinality operators"). An example of
using an occurrence indicator is shown in the following DTD element definition:
<!ELEMENT customers (customer+)>
This definition states that the customers element can have 1 or more
child elements named customer. Table 1 provides a list of possible occurrence
indicators and how they can be used.
Occurrence Indicator | Description
+ | Indicates that an element may appear 1 or more times
* | Indicates that an element may appear 0 or more times
? | Indicates that an element may appear 0 or 1 times
Table 1. DTD occurrence indicators
A child element can be made optional by using the ? character:
<!ELEMENT homeAddress (address1,address2?,city,street)>
An element with multiple child elements that must appear together as
a group 0 or more times can be represented in the following manner:
<!ELEMENT email (username,password,provider)*>
While these characters are easy to use and certainly compact, XML schemas
make it even easier to define how many times an element can occur by using
minOccurs and maxOccurs
attributes. You saw an example of using the maxOccurs attribute with the
customer element reference in the schema shown in Listing 3. Here's the definition
again:
<xs:element ref="customer" maxOccurs="unbounded"/>
If maxOccurs or minOccurs
is not included, its value defaults to 1. Unlike DTD occurrence indicators, these attributes
accept specific numeric values. If the customer element must appear at least 2
times but no more than 5 times, it is straightforward to represent this
in the schema:
<xs:element ref="customer" minOccurs="2" maxOccurs="5"/>
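As a rough sketch of how the two DTD examples earlier in this section might translate, the schema below expresses the optional address2 child with minOccurs="0" and the repeating username/password/provider group by placing the occurrence attributes on the sequence itself. Typing every child as xs:string is my assumption; the DTD declarations do not show what the children contain.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- DTD: <!ELEMENT homeAddress (address1,address2?,city,street)> -->
  <xs:element name="homeAddress">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="address1" type="xs:string"/>
        <!-- minOccurs="0" with the default maxOccurs of 1 mirrors ? -->
        <xs:element name="address2" type="xs:string" minOccurs="0"/>
        <xs:element name="city" type="xs:string"/>
        <xs:element name="street" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <!-- DTD: <!ELEMENT email (username,password,provider)*> -->
  <xs:element name="email">
    <xs:complexType>
      <!-- Occurrence attributes on the group mirror the trailing * -->
      <xs:sequence minOccurs="0" maxOccurs="unbounded">
        <xs:element name="username" type="xs:string"/>
        <xs:element name="password" type="xs:string"/>
        <xs:element name="provider" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```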
Defining Attributes
Attributes are fairly straightforward to define in both DTDs
and Schemas. The structure used for each type
of document is shown below:
DTD: <!ATTLIST customer custID ID #REQUIRED>
Schema: <xs:attribute name="custID" type="xs:ID" use="required"/>
One big difference between the two definitions is that DTDs allow all
attributes for a given element to be defined using one ATTLIST
keyword. Schemas define each attribute individually,
much like elements are defined. Schemas also allow attributes
to be grouped together (similar to using Parameter Entities
for frequently used attributes in DTDs) for re-use purposes and allow
attribute values to be associated with several different data types.
Attributes defined in a schema can't simply be nested under the element
definition. Instead, they must be wrapped with a complexType element as
shown below:
<xs:element name="customer">
<xs:complexType>
<xs:sequence>
<xs:element ref="firstName"/>
<xs:element ref="lastName"/>
</xs:sequence>
<xs:attribute name="custID" type="xs:ID"
use="required"/>
</xs:complexType>
</xs:element>
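The attribute grouping mentioned earlier in this section can be sketched with an attributeGroup. The group name customerAttributes and the extra status attribute are hypothetical and are included only to show more than one attribute being reused.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical group: defined once, referenced wherever needed -->
  <xs:attributeGroup name="customerAttributes">
    <xs:attribute name="custID" type="xs:ID" use="required"/>
    <xs:attribute name="status" type="xs:string"/>
  </xs:attributeGroup>
  <xs:element name="customer">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="firstName" type="xs:string"/>
        <xs:element name="lastName" type="xs:string"/>
      </xs:sequence>
      <!-- Attribute declarations and groups follow the content model -->
      <xs:attributeGroup ref="customerAttributes"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```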
Data Types
While DTDs are very good at defining and validating the structure of an
XML document, they contain next to no support for standard data types.
Although all of the different types won't be listed here, for the most
part all data in an XML document is either considered parsed or unparsed
character data (text). This presents problems when data in an XML document
needs to be checked to ensure that it is a valid date or number.
XML schemas offer robust support for data types. Using schemas you can
validate if dates are valid, values are integers, as well as check for
many other things. This is very important as XML is used to move data
between different storage systems such as databases.
Figure 1 shows an image published by the W3C
that represents the different data types available in the XML Schema specification.
Figure 1. XML Schema Data Types
Using these different data types in an XML schema is accomplished through
the type attribute. Both elements and attributes can use this attribute
to specify their data type:
<xs:attribute name="birthday" type="xs:date" use="required"/>
<xs:element name="GPA" type="xs:decimal" />
This is a huge improvement over DTDs where a given element's child text
node can only contain PCDATA (parsed character
data) as a data type.
XML schemas also allow custom data types to be created in cases where
a base type needs to be restricted or extended. For example, a custom type can be created
that uses an integer as its base but requires that the integer value fall
between 1001 and 1005. Custom data types can also leverage the power of
regular expressions to do pattern matching. More information about creating
custom data types will be provided in the next article in this series.
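As a brief preview, the two kinds of custom type described above might look something like the following sketch. The type and element names (DeptCodeType, CustCodeType, deptCode, custCode) are hypothetical, and the pattern is simply one that would match values shaped like the custID values in Listing 1.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- An integer restricted to the range 1001 through 1005 -->
  <xs:simpleType name="DeptCodeType">
    <xs:restriction base="xs:integer">
      <xs:minInclusive value="1001"/>
      <xs:maxInclusive value="1005"/>
    </xs:restriction>
  </xs:simpleType>
  <!-- A string that must match a regular expression such as AB-1453 -->
  <xs:simpleType name="CustCodeType">
    <xs:restriction base="xs:string">
      <xs:pattern value="[A-Z]{2}-[0-9]{4}"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:element name="deptCode" type="DeptCodeType"/>
  <xs:element name="custCode" type="CustCodeType"/>
</xs:schema>
```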
Entity Support
DTDs provide excellent support for entities. If you're not familiar with
entities, you can think of them as being similar to server-side includes
or any other type of reusable piece of data. Defining entities in DTDs
is accomplished by using the ENTITY keyword as
shown in Listing 4.
<!ELEMENT businessInfo (address+,businessSummary?)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT businessSummary (#PCDATA)>
<!ENTITY busAddress "1234 Anywhere St.">
<!ENTITY busSummary SYSTEM "summary.txt">
Listing 4. Defining Internal and External Entities in a DTD
Once defined in the DTD, entities can be used in the XML by prefixing
the entity name with an ampersand character (&) and ending it with a semicolon.
Listing 5 shows an example.
<?xml version="1.0"?>
<businessInfo>
<address>&busAddress;</address>
<businessSummary>&busSummary;</businessSummary>
</businessInfo>
Listing 5. Using Entities in an XML document
So where do entities fit into XML schemas? They don't actually, at least
not in the same way as in DTDs. You can define pieces of information once
in a schema by making the definition fixed, but the flexibility associated
with using entities in DTDs is not there with schemas. Here's an example
from the W3C schema primer document that shows an example of defining
a fixed value by using an element:
<xsd:element name="eacute" type="xsd:token" fixed="é"/>
--------------------------------------------------------
<?xml version="1.0" ?>
<purchaseOrder xmlns="http://www.example.com/PO1"
xmlns:c="http://www.example.com/characterElements"
orderDate="1999-10-20">
<!-- etc. -->
<city>Montr<c:eacute/>al</city>
<!-- etc. -->
</purchaseOrder>
Although the above example gets the job done in this situation, using
DTDs for entity support is not only easier but more flexible since internal
and external entities can be referenced with minimal effort. This means
that you may use DTDs and schemas together when working with XML documents.
DTDs may be used for entity support while schemas will be used for validation
of the XML document.
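One way this combination might look is sketched below: an internal DTD subset supplies the entity while an xsi:noNamespaceSchemaLocation attribute points to a schema for validation. The file name businessInfo.xsd is hypothetical, and whether a given parser both expands the entity and validates against the schema in a single pass depends on how that parser is configured.

```xml
<?xml version="1.0"?>
<!-- The internal DTD subset is used only to declare the entity -->
<!DOCTYPE businessInfo [
  <!ENTITY busAddress "1234 Anywhere St.">
]>
<!-- businessInfo.xsd is a hypothetical schema describing this document -->
<businessInfo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:noNamespaceSchemaLocation="businessInfo.xsd">
  <address>&busAddress;</address>
</businessInfo>
```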
Namespace Support
Namespaces are an integral part of XML documents
that allow naming collisions to be avoided by associating an element (or
attribute) with a unique identifier. These identifiers are typically referred
to as Uniform Resource Identifiers (URIs) or
Uniform Resource Names (URNs). Due to the importance
of namespaces, the XML Schema specification includes support for them.
DTDs do not provide support for namespaces at all. This is mainly due
to the fact that they stem from Standard Generalized
Markup Language (SGML) and are therefore more dated. Rather than
adding support for namespaces in DTDs, the W3C working group decided to put
all of the effort related to this task into XML schemas.
To see namespaces in action, Listing 6 shows the addition of a local
namespace to the XML document first introduced in Listing 1. Notice that
the namespace prefix of "acme" is associated with the customer element.
<?xml version="1.0"?>
<customers xmlns:acme="http://www.acme.com/namespace">
<acme:customer custID="AB-1453">
<firstName>Dave</firstName>
<lastName>Todd</lastName>
</acme:customer>
<acme:customer custID="AB-1489">
<firstName>Thomas</firstName>
<lastName>Rawlings</lastName>
</acme:customer>
</customers>
Listing 6. Using a namespace in an XML document
If this XML document is validated against the DTD shown earlier in Listing
2 an error will be generated. This is because the DTD treats the namespace
prefix and element name as one name: acme:customer. DTDs don't know anything
about namespaces and if you change the namespace prefix in the XML you'll
need to create a different DTD (or leverage parameter entities) to accommodate
this change. Listing 7 shows how the DTD would need to be modified. Notice
that the namespace prefix is "hard-coded" into the different element and
attribute definitions.
<!ELEMENT customers (acme:customer+)>
<!ELEMENT acme:customer (firstName, lastName)>
<!ATTLIST acme:customer
custID ID #REQUIRED
>
<!ELEMENT firstName (#PCDATA)>
<!ELEMENT lastName (#PCDATA)>
Listing 7. Modifying the DTD to handle the namespace prefix
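XML schemas are fully namespace aware and therefore know how to deal
with namespaces properly. In fact, the elements of an XML Schema document themselves belong to a namespace
(bound to a prefix or declared as the default) with the URI "http://www.w3.org/2001/XMLSchema".
Listing 8 shows a schema for the customer data in the http://www.acme.com/namespace namespace.
Because every element in it is declared globally in the target namespace, it expects customers,
firstName, and lastName to be namespace-qualified along with customer, which goes one step
further than the document in Listing 6.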
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema targetNamespace="http://www.acme.com/namespace"
xmlns:acme="http://www.acme.com/namespace"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified"
attributeFormDefault="unqualified">
<xs:element name="customers">
<xs:complexType>
<xs:sequence>
<xs:element ref="acme:customer"
maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="customer">
<xs:complexType>
<xs:sequence>
<xs:element ref="acme:firstName"/>
<xs:element ref="acme:lastName"/>
</xs:sequence>
<xs:attribute name="custID" type="xs:ID"
use="required"/>
</xs:complexType>
</xs:element>
<xs:element name="firstName" type="xs:string"/>
<xs:element name="lastName" type="xs:string"/>
</xs:schema>
Listing 8. Using Namespaces in XML schemas
Notice that the acme namespace prefix is defined in the XML schema root
tag and then used as appropriate when different elements are referenced.
Many different namespace declarations can be combined in one document
without having to write a separate schema.
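As a sketch of what the schema in Listing 8 expects, here is an instance document in which every element is qualified with the acme prefix. This is my own illustration rather than one of the article's downloadable files. The custID attribute stays unprefixed because attributeFormDefault is set to "unqualified".

```xml
<?xml version="1.0"?>
<acme:customers xmlns:acme="http://www.acme.com/namespace">
  <acme:customer custID="AB-1453">
    <acme:firstName>Dave</acme:firstName>
    <acme:lastName>Todd</acme:lastName>
  </acme:customer>
  <acme:customer custID="AB-1489">
    <acme:firstName>Thomas</acme:firstName>
    <acme:lastName>Rawlings</acme:lastName>
  </acme:customer>
</acme:customers>
```

Declaring the same URI as the default namespace on the root element (xmlns="http://www.acme.com/namespace") and dropping the prefixes would describe the same elements, since only the namespace URI matters to the schema, not the prefix.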
This short discussion of namespaces does not do them justice unfortunately.
You can read more about namespaces at the following URL: http://msdn.microsoft.com/msdnmag/issues/01/07/xml/xml0107.asp
Associating DTDs and Schemas with XML Documents
DTDs can be internal or external to an XML document. In both cases, hooking
the DTD up with the XML document is accomplished by using the DOCTYPE
keyword. For example, to reference an external DTD named customers.dtd,
the following declaration can be placed immediately after the XML declaration,
before the root element of the XML document:
<!DOCTYPE root SYSTEM "customers.dtd">
Associating XML documents with XML schemas is similar although namespaces
have to be taken into consideration. If an XML document has no namespaces
defined, the following attributes can be added to the document's root
element to associate the document with an external schema:
<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="customers.xsd">
<!-- elements follow -->
</root>
If the XML document does have one or more namespaces associated with
it the following syntax can be used:
<root xmlns:acme="http://www.acme.com/namespace"
xmlns="http://www.acme.com/namespace"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.acme.com/namespace
customers.xsd">
<!-- elements follow -->
</root>
Validating against XML Schemas in .NET
It's a fairly simple process to validate an XML document against a schema
using the .NET platform. The validation process involves using two
classes in the System.Xml namespace (installed with the .NET SDK)
named XmlTextReader and XmlValidatingReader.
Listing 9 shows how these classes can be used for validation purposes.
The code shown in Listing 9 is contained in a file named ValidatationTest.aspx.cs. This file is available in this article's downloadable code - located at the bottom of this page.
bool _valid = true;

private void Page_Load(object sender, System.EventArgs e) {
    XmlTextReader reader = null;
    XmlValidatingReader vReader = null;
    try {
        reader = new XmlTextReader(Server.MapPath("Listing6.xml"));
        // Wrap the text reader so nodes are validated as they are read
        vReader = new XmlValidatingReader(reader);
        vReader.ValidationType = ValidationType.Schema;
        vReader.ValidationEventHandler +=
            new ValidationEventHandler(ValidationChecker);
        // Read through the validating reader; reading from the
        // XmlTextReader directly would bypass validation
        while (vReader.Read()) {}
        if (_valid) {
            this.message.ForeColor = Color.Navy;
            this.message.Text = "Validation Succeeded";
        } else {
            this.message.ForeColor = Color.Red;
            this.message.Text = "Validation Failed";
        }
    }
    catch (Exception exp) {
        Response.Write(exp.Message);
    }
    finally {
        // Either reader may still be null if construction failed
        if (vReader != null) vReader.Close();
        if (reader != null) reader.Close();
    }
}

// Called for each problem reported while validating
public void ValidationChecker(object sender, ValidationEventArgs args)
{
    _valid = false;
    Response.Write(args.Message);
}
Listing 9. Validating with the XmlValidatingReader (found in ValidatationTest.aspx.cs
in code download for this article)
Though not shown in Listing 9, the System.Xml
and System.Xml.Schema namespaces must be referenced
by the ASP.NET code-behind page in order to have access to the proper
classes used in validating the XML document.
The code starts off by instantiating the XmlTextReader
and passing the path to the XML document to its constructor. The XmlTextReader
is then passed into the XmlValidatingReader's constructor.
Next, the type of validation to perform is established by setting the
ValidationType property of the XmlValidatingReader
to ValidationType.Schema. Other possibilities include ValidationType.Auto, ValidationType.DTD,
ValidationType.XDR, and ValidationType.None.
The next line of code hooks the XmlValidatingReader
up with an event handler that will be called if an error is encountered
during validation. In this example, ValidationChecker()
will be called and information about the error event will be passed into
it via the ValidationEventArgs object. You'll
learn more about the event handler and arguments in future articles.
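Now that the validation type has been specified and the event handler
is ready to go, the Read() method of the XmlValidatingReader
is called repeatedly so that the XML document is read from start to finish.
(Calling Read() on the underlying XmlTextReader instead would parse the
document without validating it.) As it is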
being read, the different elements, attributes, and other nodes in the
document are compared to the XML schema to ensure that they are valid.
If anything in the XML document is inconsistent with the schema, a global
Boolean variable named _valid will be set to false.
Summary
Although not every difference between DTDs and XML schemas has been covered
in this article, you've been presented with some of the major differences.
Both DTDs and XML schemas provide XML validation and documentation capabilities.
DTDs fall short in validating data types in XML documents and they do
not contain namespace support. While most validating parsers support DTDs
currently, more and more parsers are beginning to support XML schemas
due to their inherent advantages. As a result, you can bet that the incorporation
of XML schemas into different applications will increase in the near future.
Sample Code
Download the sample code, Wahlin-Code.zip, at the bottom of this page.
Dan Wahlin is the author of XML for ASP.NET Developers (Sams) and
is an independent consultant for Wahlin Consulting, LLC which offers XML
and Web Service training and consulting. He also founded the XML for ASP.NET
Developers website (http://www.XMLforASP.NET)
which focuses on using XML and Web Services in Microsoft's .NET platform.
Dan co-authored Professional Windows DNA (Wrox), ASP.NET: Tips, Tutorials,
and Code (Sams), and writes for several magazines.