DTD - Document Type Definition

Planet Node: XML schema example

In this example we show the syntax of a simple DTD and XML document. Here the DTD is included at the start of the XML document.

The XML document shown starts with the DTD - this tells us that there must be a top-level node planet which may contain any number of country nodes. Each country must contain exactly two nodes: a name followed by a pop.

Following the DTD section we have the data itself - so far we have only two countries - France and Ireland.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE planet [
  <!ELEMENT planet (country*)>
  <!ELEMENT country (name,pop)>
  <!ELEMENT name    (#PCDATA)>
  <!ELEMENT pop     (#PCDATA)>
]>
<planet>
 <country>
  <name>France</name>
  <pop>59.7</pop>
 </country>
 <country>
  <name>Ireland</name>
  <pop>3.8</pop>
 </country>
</planet>
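As a rough illustration of what this DTD demands, the sketch below parses the document with Python's standard xml.etree.ElementTree module and checks the same two rules by hand. ElementTree reads the DOCTYPE section but does not validate against it, so the checks are written out explicitly; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The document from above, internal DTD included. Bytes are used so that
# the encoding declaration is handled by the parser itself.
DOCUMENT = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE planet [
  <!ELEMENT planet (country*)>
  <!ELEMENT country (name,pop)>
  <!ELEMENT name    (#PCDATA)>
  <!ELEMENT pop     (#PCDATA)>
]>
<planet>
 <country>
  <name>France</name>
  <pop>59.7</pop>
 </country>
 <country>
  <name>Ireland</name>
  <pop>3.8</pop>
 </country>
</planet>
"""

root = ET.fromstring(DOCUMENT)

# Rule 1: the top-level node is planet and it holds only country nodes.
planet_ok = root.tag == "planet" and all(c.tag == "country" for c in root)

# Rule 2: every country holds exactly a name followed by a pop.
countries_ok = all([child.tag for child in c] == ["name", "pop"] for c in root)

# Collect the data itself, keyed by country name.
populations = {c.findtext("name"): float(c.findtext("pop")) for c in root}
print(planet_ok, countries_ok, populations)
```

A validating parser would perform these checks from the DTD automatically; the hand-written version only shows what is being checked.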

Specifying child nodes

We use an external DTD, held in its own file (country.dtd) and referenced from the XML document with a SYSTEM identifier. We specify the children of a node using a form of BNF.
We alter the planet node here. Rather than store data in the attributes we record details as character data in the node content.

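Rules to be enforced: every country has a name followed by a capital; it may then have a king; it may then have a queen.

The following symbols may be used: a comma (A,B) means A followed by B; a question mark (A?) means A is optional.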

The DTD must suit the United Kingdom (a queen but no king), France (neither) and Norway (both a king and a queen). The version given below does so by making king and queen optional.

country.dtd

<!ELEMENT country (name,capital,king?,queen?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT capital (#PCDATA)>
<!ELEMENT king (#PCDATA)>
<!ELEMENT queen (#PCDATA)>

Valid input:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <name>United Kingdom</name>
  <capital>London</capital>
  <queen>Elizabeth II</queen>
</country>

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <name>France</name>
  <capital>Paris</capital>
</country>

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <name>Norway</name>
  <capital>Oslo</capital>
  <king>Harald</king>
  <queen>Sonja</queen>
</country>

Invalid input:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <name>Ruritania</name>
  <capital>Strelsau</capital>
  <king>Rudolf</king>
  <king>Michael</king>
  <king>Rudolf</king>
  <!--Should not allow three kings -->
</country>

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <name>Ruritania</name>
  <king>Rudolf</king>
  <!-- capital is missing-->
</country>

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <capital>Strelsau</capital>
  <!-- name is missing -->
</country>
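A content model such as (name,capital,king?,queen?) behaves much like a regular expression over the sequence of child nodes. The sketch below makes that concrete by translating the model into a Python regular expression and testing the child sequences from the examples above. The helper names model_to_regex and matches are our own, and the translation assumes an element-only model (no #PCDATA, EMPTY or ANY); it is not how a real validator is implemented.

```python
import re

def model_to_regex(model):
    """Turn an element-only content model into a Python regular expression.

    Each element name becomes a group matching "name;" and commas are
    dropped, because sequencing is implicit in a regular expression.
    The symbols ( ) | ? * + carry over unchanged.
    """
    named = re.sub(r"[A-Za-z_][\w.-]*",
                   lambda m: "(?:" + re.escape(m.group(0)) + ";)",
                   model)
    return re.compile(named.replace(",", "").replace(" ", ""))

def matches(model, children):
    """True when the list of child element names fits the content model."""
    text = "".join(name + ";" for name in children)
    return model_to_regex(model).fullmatch(text) is not None

MODEL = "(name,capital,king?,queen?)"

print(matches(MODEL, ["name", "capital", "queen"]))                 # United Kingdom
print(matches(MODEL, ["name", "capital"]))                          # France
print(matches(MODEL, ["name", "capital", "king", "queen"]))         # Norway
print(matches(MODEL, ["name", "capital", "king", "king", "king"]))  # three kings
print(matches(MODEL, ["name", "king"]))                             # capital missing
```

The first three calls correspond to the valid documents and the last two to invalid ones.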

Allowing options

The bar | may be used to allow options. A|B means either A or B is permitted. We can use brackets in these regular expressions.

We wish to enforce the following rules: every country has a name, followed by either a president or a monarch, but not both.

country.dtd

<!ELEMENT country (name,(president|monarch))>
<!ELEMENT name      (#PCDATA)>
<!ELEMENT president (#PCDATA)>
<!ELEMENT monarch   (#PCDATA)>

Valid input:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <name>United Kingdom</name>
  <monarch>Elizabeth II</monarch>
</country>

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <name>France</name>
  <president>Chirac</president>
</country>

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <name>Norway</name>
  <monarch>Harald</monarch>
</country>

Invalid input:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <name>Ruritania</name>
  <monarch>Rudolf</monarch>
  <monarch>Michael</monarch>
  <monarch>Rudolf</monarch>
  <!--Should not allow three monarchs -->
</country>

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <name>Ruritania</name>
  <president>Robert</president>
  <monarch>Rudolf</monarch>
  <!-- Cannot have president and monarch -->
</country>

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <name>Ruritania</name>
  <!-- Should have president or monarch-->
</country>
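One way to see what the bar permits is to list every child sequence that the model accepts. The sketch below does this for (name,(president|monarch)) by trying all sequences of up to three nodes against a regular expression built from the model. The helper model_to_regex is our own illustration and assumes an element-only content model.

```python
import re
from itertools import product

def model_to_regex(model):
    """Translate an element-only content model into a compiled regex.

    Names become groups matching "name;" and commas are dropped; the
    symbols ( ) | ? * + mean the same in both notations.
    """
    named = re.sub(r"[A-Za-z_][\w.-]*",
                   lambda m: "(?:" + re.escape(m.group(0)) + ";)",
                   model)
    return re.compile(named.replace(",", ""))

MODEL = "(name,(president|monarch))"
NAMES = ["name", "president", "monarch"]
pattern = model_to_regex(MODEL)

# Try every sequence of 0 to 3 child nodes and keep those that fit the model.
accepted = [
    list(seq)
    for length in range(4)
    for seq in product(NAMES, repeat=length)
    if pattern.fullmatch("".join(n + ";" for n in seq))
]
print(accepted)
```

Only two sequences survive: a name followed by a president, and a name followed by a monarch.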

Options and brackets

We can nest options, which allows us to specify complicated choices - but we must give “deterministic” expressions.

Any of the following is allowed: a president; a king; a king followed by a queen; a queen.

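A DTD that simply listed these four options, for example (president|king|(king,queen)|queen), would be correct - but it cannot be used because it is not deterministic. The DTD below describes the same options deterministically by writing (king,queen?).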

Non-Deterministic Expression

A non-deterministic expression can be parsed in more than one way. The validator uses a “one symbol look ahead” - if two options start with the same element, the validator cannot tell which one to follow.

country.dtd

<!ELEMENT country (president|(king,queen?)|queen)>
<!ELEMENT president (#PCDATA)>
<!ELEMENT king      (#PCDATA)>
<!ELEMENT queen     (#PCDATA)>

Valid input:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <!-- United Kingdom -->
  <queen>Elizabeth II</queen>
</country>

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <!-- France -->
  <president>Chirac</president>
</country>

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <!-- Norway -->
  <king>Harald</king>
  <queen>Sonja</queen>
</country>

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <!-- Nowhere -->
  <king>Andrew</king>
</country>

Invalid input:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <king>Rudolf</king>
  <king>Michael</king>
  <king>Rudolf</king>
  <!--Should not allow three kings -->
</country>

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <president>Robert</president>
  <king>Rudolf</king>
  <!-- Cannot have president and king-->
</country>

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <!-- King comes before queen-->
  <queen>Henrietta</queen>
  <king>Rudolf</king>
</country>

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <!-- Should have president, king or queen -->
</country>
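The “one symbol look ahead” restriction can be explored with a small check: for each option of the outermost choice, work out which elements it could start with, and report any element shared by two options. The sketch below is a partial test only. It inspects just the top-level choice of an element-only content model, whereas a full determinism check would also have to consider what may follow each node. The function names first_sets and clashes, and the non-deterministic model used for comparison, are our own illustrations.

```python
import re

def first_sets(model):
    """Return the possible first elements of each top-level alternative.

    Handles element-only models built from names and ( ) , | ? * +.
    Returns an empty list when the outermost group is not a choice.
    """
    tokens = re.findall(r"[A-Za-z_][\w.-]*|[()|,?*+]", model)

    def particle(i):
        # Returns (first elements, may be empty?, per-option firsts, next index).
        branches = None
        if tokens[i] == "(":
            members, sep, i = [], ",", i + 1
            while True:
                first, nullable, _, i = particle(i)
                members.append((first, nullable))
                if tokens[i] == ")":
                    i += 1
                    break
                sep = tokens[i]      # "," for a sequence, "|" for a choice
                i += 1
            if sep == "|":
                branches = [f for f, _ in members]
                first = set().union(*branches)
                nullable = any(n for _, n in members)
            else:
                first, nullable = set(), True
                for f, n in members:
                    if nullable:     # reachable only if all earlier items may be empty
                        first |= f
                    nullable = nullable and n
        else:
            first, nullable = {tokens[i]}, False
            i += 1
        if i < len(tokens) and tokens[i] in ("?", "*", "+"):
            nullable = nullable or tokens[i] != "+"
            i += 1
        return first, nullable, branches, i

    return particle(0)[2] or []

def clashes(model):
    """Elements that could start more than one top-level alternative."""
    sets = first_sets(model)
    found = set()
    for a in range(len(sets)):
        for b in range(a + 1, len(sets)):
            found |= sets[a] & sets[b]
    return found

print(clashes("(president|king|(king,queen)|queen)"))  # two options start with king
print(clashes("(president|(king,queen?)|queen)"))      # no shared first element
```

On seeing a king, a validator reading the first model cannot tell whether it is in the option king or the option (king,queen); the second model removes that doubt by folding both cases into (king,queen?).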

Repeats and brackets

We can use * to permit items to be repeated.

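Rule to be enforced: a country contains either any number of states, or any mixture of counties and boroughs - states may not be mixed with counties or boroughs.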

country.dtd

<!ELEMENT country (state*|(county|borough)*)>
<!ELEMENT state   (#PCDATA)>
<!ELEMENT county  (#PCDATA)>
<!ELEMENT borough (#PCDATA)>

Valid input:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <!-- United Kingdom -->
  <county>Kent</county>
  <county>Essex</county>
  <borough>Coventry</borough>
</country>

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <!-- USA -->
  <state>Ohio</state>
  <state>Oklahoma</state>
</country>

Invalid input:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <state>Airstrip 1</state>
  <county>Ambridge</county>
  <!-- May not mix state and county -->
</country>

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE country SYSTEM "country.dtd">
<country>
  <state>England</state>
  <borough>Coventry</borough>
  <!-- May not mix state and borough-->
</country>
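The effect of * inside and outside brackets can be tried out by treating the content model as a regular expression over the child node names. The sketch below checks the four examples above against (state*|(county|borough)*). The helper matches is our own and assumes an element-only content model.

```python
import re

def matches(model, children):
    """True when the list of child element names fits the content model.

    Each name in the model becomes a group matching "name;" and commas
    are dropped; ( ) | * keep their meaning in the regular expression.
    """
    named = re.sub(r"[A-Za-z_][\w.-]*",
                   lambda m: "(?:" + re.escape(m.group(0)) + ";)",
                   model)
    text = "".join(name + ";" for name in children)
    return re.fullmatch(named.replace(",", ""), text) is not None

MODEL = "(state*|(county|borough)*)"

print(matches(MODEL, ["county", "county", "borough"]))  # United Kingdom
print(matches(MODEL, ["state", "state"]))               # USA
print(matches(MODEL, ["state", "county"]))              # mixes state and county
print(matches(MODEL, ["state", "borough"]))             # mixes state and borough
print(matches(MODEL, []))                               # no children at all
```

Because * allows zero repetitions, a country with no children at all also fits this model.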

The plus operator

The + operator indicates that at least one item should be present. The expression A+ allows for one or more A nodes.

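Rule to be enforced: a module has exactly one name, then one or more teachers, then any number of prerequisites (possibly none).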

module.dtd

<!ELEMENT module (name,teacher+,prerequisite*)>
<!ELEMENT name         (#PCDATA)>
<!ELEMENT teacher      (#PCDATA)>
<!ELEMENT prerequisite (#PCDATA)>

Valid input:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE module SYSTEM "module.dtd">
<module>
  <name>XML 3</name>
  <teacher>Andrew</teacher>
  <prerequisite>Programming 2</prerequisite>
</module>

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE module SYSTEM "module.dtd">
<module>
  <name>Database 2</name>
  <teacher>Andrew</teacher>
  <teacher>Ken</teacher>
</module>

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE module SYSTEM "module.dtd">
<module>
  <name>XML4</name>
  <teacher>Ken</teacher>
  <prerequisite>XML 3</prerequisite>
  <prerequisite>Web Scripts 3</prerequisite>
</module>

Invalid input:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE module SYSTEM "module.dtd">
<module>
  <name>XML5</name>
  <!-- no teacher -->
  <prerequisite>XML 3</prerequisite>
  <prerequisite>Web Scripts 3</prerequisite>
</module>

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE module SYSTEM "module.dtd">
<module>
  <name>XML3</name>
  <name>IML3</name>
  <!-- Two names -->
  <teacher>Ken</teacher>
  <prerequisite>XML 3</prerequisite>
  <prerequisite>Web Scripts 3</prerequisite>
</module>

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE module SYSTEM "module.dtd">
<module>
  <name>XML3</name>
  <teacher>Andrew</teacher>
  <room>A17</room>
  <!-- no such node allowed -->
</module>
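To connect the + operator to real documents, the sketch below reads the child nodes of a module with Python's standard xml.etree.ElementTree and tests their order against (name,teacher+,prerequisite*), treated as a regular expression. The function child_names_fit is our own illustration. The DOCTYPE line is left out of the sample strings because ElementTree does not validate, and the samples are shortened versions of the documents above.

```python
import re
import xml.etree.ElementTree as ET

MODEL = "(name,teacher+,prerequisite*)"

def child_names_fit(xml_text, model=MODEL):
    """Parse xml_text and test the root's children against the content model."""
    root = ET.fromstring(xml_text)
    # Names become groups matching "name;"; commas are dropped; + and * carry over.
    regex = re.sub(r"[A-Za-z_][\w.-]*",
                   lambda m: "(?:" + re.escape(m.group(0)) + ";)",
                   model).replace(",", "")
    children = "".join(child.tag + ";" for child in root)
    return re.fullmatch(regex, children) is not None

two_teachers = ("<module><name>Database 2</name>"
                "<teacher>Andrew</teacher><teacher>Ken</teacher></module>")
no_teacher = ("<module><name>XML5</name>"
              "<prerequisite>XML 3</prerequisite></module>")
extra_room = ("<module><name>XML3</name>"
              "<teacher>Andrew</teacher><room>A17</room></module>")

print(child_names_fit(two_teachers))  # teacher+ accepts two teachers
print(child_names_fit(no_teacher))    # teacher+ needs at least one teacher
print(child_names_fit(extra_room))    # room is not part of the model
```

Replacing teacher+ with teacher* in MODEL would make the second document fit, which is the whole difference between the two operators.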