Document Type Definitions      Peter Komisar v.4.0

references: 'Mastering XML', Navarro, White & Burman; 'XML in a Nutshell', Harold & Means;
'XML and Web Services Unleashed', Ron Schmelzer et al. TH. Estier credits M. Marcotty & H. Ledgard,
The World of Programming Languages, Springer-Verlag, Berlin 1986, for the BNF table used below.


XML Content Models    



XML content can be anything that doesn't break the rules of XML. Wherever
necessary, special characters have to be escaped appropriately. The combination
of elements, attributes and text collectively represents the message that an XML
document contains. The policies deciding how this content can be extended are the
subject of content models. A content model reflects the intent of the document
creator. The user nevertheless has ultimate control and can always override that
intent with an internally defined DTD.


Open Content Model

An XML page that is well formed but is not constrained by a document type
declaration of any kind adheres to what is referred to as an 'open content
model'. An open model allows a page to be extended in any way.
 

Closed Content Models

Closed models restrict elements and attributes to those specified in the DTD or
schema being used. The XML document creator maintains strict control over
name, number and order of elements. Closed models are useful for strict data
exchanges where there has to be a guarantee of data compliance for the system
to work properly. A shipping slip for example would be a candidate for a closed
content model.

// A page that is governed by a DTD is also open to the extent that any internally
// defined DTD may add definitions to the externally defined DTDs. This is the only
// way a DTD governed document can be extended.

XML Schema - XML Schema allows alternate schemes to be used to dictate
how a model may be extended. Schemas are still limited by some basic rules;
for instance, content cannot be removed if doing so 'damages' the existing model. All
required elements must be present, though additional elements can be added.

In a more specific sense the idea of content models applies to what each XML
element contains.

XML Content Models & Whitespace

It is sometimes not clear whether white space should be treated as significant or not.
How XML processors process white space depends on the content model. In either
an open or closed model, white space is not treated as significant. In a hybrid model,
white space is treated as significant because in this case the parser is not sure. Besides
the space character, whitespace is also created by characters like tabs, linefeeds and
carriage returns.

As a rule, the XML parser still passes all white space along to the application
intact. In the case of a browser, XML white space is not displayed.
(XML defines a special attribute called xml:space which can be used to signal that
white space should be preserved explicitly.)

// white space is not treated as significant in the common open or closed models
// as a rule the XML parser will pass all white space intact to an application.



Extended Backus-Naur Form                // mainly for reference



This section gives a general idea of what Backus-Naur Form is and shows
the origins of much of the syntax that is used in XML.

Backus Naur Form

XML makes use of a syntax known as Extended Backus-Naur Form or, more
cryptically, as EBNF. The form is named after its inventors. John Backus invented
the form and Peter Naur improved on it. The authors invented the notation circa
1958 and used it to describe the programming language ALGOL 60. Extended
BNF or EBNF attempts to improve the readability and expressiveness of BNF
through the addition of extensions. ISO, the international standards body, has
published an EBNF standard, ISO/IEC 14977. The following is an interesting paper
by R. S. Scowen that discusses how EBNF could be used to improve the specification
of various vintage languages.

http://www.cl.cam.ac.uk/~mgk25/iso-14977-paper.pdf 

EBNF has since served as the chief model for describing new programming languages,
with almost every author of a new programming language using it to specify the syntax
rules of his or her new language. You will recognize EBNF in XML, in the language
proper and in associated technologies like XPath. The influence of EBNF is also evident
in various scripting languages where regular expressions are used. This includes Perl,
JavaScript and the java.util.regex package introduced in Java J2SE 1.4. (Check the syntax
described in the development kit documentation for the class 'Pattern'.)

Following is a brief description of the meta-symbols that were defined in BNF and
later in EBNF. (This is not a comprehensive list, only an introduction so we
know where the EBNF used in XML comes from.)
 

The meta-symbols of BNF are found in the following table.
// TH. Estier credits M. Marcotty & H. Ledgard, The World of Programming Languages,
// Springer-Verlag, Berlin 1986. for the following definitions
 
 

 Symbol     Meaning

 ::=        "is defined as"

 |          "or"

 < >        angle brackets used to surround category names. The angle brackets
            distinguish syntax rule names (also called non-terminal symbols) from
            terminal symbols, which are written exactly as they are to be represented.

 Common Extensions (Extended BNF inclusions)

 [ ]        optional symbol

 { }        repetitive symbol

 ' '        single quotes to enclose single-character terminals


 

BNF Production Rules

BNF is used to define the set of all possible strings of symbols that constitute legal
programs (i.e. strings) in a language. Each 'production rule' in the grammar is written
as a BNF rule. The production rules use what are called terminal and non-terminal
symbols. The following table provides some detail on how the notation is used.
 

Definitions by D. Biggar http://www3.sympatico.ca/dbiggar/BNF.home.html // for reference
 

 terminal       Terminal symbols (characters or character sequences) are bracketed
                by the meta-symbol "'". For example, the character a is written as follows.

                example     'a'

 non-terminal   Non-terminal symbols are bracketed by the meta-symbols "<"
                and ">". For example, the non-terminal symbol set is written as follows.

                example     <set>

 production     Each production rule has a left hand side (LHS) and a right hand
 rule           side (RHS) separated by the meta-symbol "::=" (read as
                "consists of" or "defined as"). The LHS is defined by the RHS.

                The LHS is a non-terminal symbol. The RHS is some sequence
                of terminal and non-terminal symbols that define the rule. For
                example: set is defined as a subset followed by another subset.

                example     <set> ::= <subset> <subset>

 repetition     A symbol or symbols enclosed in curly brackets ( { and } ) denotes
                possible repetition of the enclosed symbols zero or more times. For
                example: set is defined as 0 or more subsets.

                example     <set> ::= { <subset> }

 alternate      The meta-symbol "|" (read as "or") is used to define alternate RHS
                definitions. For example: set is defined as a subset or a set and a
                subset.

                example     <set> ::= <subset> | <set> <subset>


 

Some EBNF Examples  // from 'Mastering XML' by A. Navarro, C. White & L. Burman

Every BNF grammar rule has the following form.
 

Basic BNF Rule Form

symbol ::= expression    //  where ::= represents the phrase "is defined as"
 

Example of Lowercase Vowels    // from 'Mastering XML'

vowels ::= [ aeiou ]     // the symbol vowels represents a, e, i, o or u.
 

XML definition of  White Space

S ::= (#x20 | #x9 | #xD | #xA)+  // hex values for space, tab, carriage return and line feed

// the + sign stands for "one or more"
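
The white space production above translates almost symbol for symbol into a regular expression. The following is a minimal sketch in Python; the names XML_S and is_xml_whitespace are chosen here for illustration and do not come from the XML specification.

```python
import re

# XML's white space production: S ::= (#x20 | #x9 | #xD | #xA)+
# The character class lists the same four code points; '+' means one or more.
XML_S = re.compile(r"[\x20\x09\x0D\x0A]+")


def is_xml_whitespace(text: str) -> bool:
    """Return True if text is made up only of characters allowed by S."""
    return XML_S.fullmatch(text) is not None


print(is_xml_whitespace(" \t\r\n"))  # True
print(is_xml_whitespace(""))         # False, '+' requires at least one character
print(is_xml_whitespace("\u00a0"))   # False, a no-break space is not part of S
```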

The following two examples are included for reference to show the origins of
symbols that are often used in XML as well as other pattern matching languages.

Some EBNF extension definitions
// from  http://www.augustana.ab.ca/~mohrj/courses/2000.fall/csc370/lecture_notes/ebnf.html

'The Kleene Cross' -- a sequence of one or more elements of the class marked.
      <unsigned integer> ::= <digit>+

//  the plus symbol is introduced to represent 'one or more'


'The Kleene Star' -- a sequence of zero or more elements of the class marked.
      <identifier> ::= <letter><alphanumeric>*

//  Kleene also introduced the asterisk to represent 'zero or more'
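
The same two symbols survive unchanged in regular expressions, which is one way to try the rules out. The sketch below assumes that <digit> means 0 to 9, that <letter> means an ASCII letter and that <alphanumeric> means either; the lecture notes quoted above do not define those classes, so these are assumptions made for the example.

```python
import re

# <unsigned integer> ::= <digit>+            (the Kleene cross: one or more)
UNSIGNED_INTEGER = re.compile(r"[0-9]+")

# <identifier> ::= <letter><alphanumeric>*   (the Kleene star: zero or more)
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")

for candidate in ["42", "", "x", "x25", "9lives"]:
    kinds = []
    if UNSIGNED_INTEGER.fullmatch(candidate):
        kinds.append("unsigned integer")
    if IDENTIFIER.fullmatch(candidate):
        kinds.append("identifier")
    print(repr(candidate), kinds or "no match")
```

Note that "x" is accepted as an identifier because the star allows zero alphanumerics after the first letter, while the empty string is rejected as an integer because the cross demands at least one digit.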

Although not perhaps the best example, the following link shows James Gosling's
summary of the syntactical symbols used in Java and makes a passing reference
to BNF symbols he used to define this set or family of symbols.

 http://java.sun.com/docs/books/jls/second_edition/html/syntax.doc.html#44467

As we switch to a study of DTDs we will have some background as to where
some of the symbols originated.


Overview of DTDs



XML supplies two techniques for creating templates that constrain what
can go into an XML document. The older technique involves the
creation of DTDs or Document Type Definitions. The newer way is to use
the XML Schema language. We need to know both. There
is a vast amount of DTD legacy already in use, and DTDs are not being
deprecated, so they remain available for future use. The XML Schema
language is both more capable and more complex than DTDs. XML
Schema has already been widely adopted and is serving an important
role in the emerging web services architecture. We begin by looking at
the legacy system, Document Type Definitions.

How Document Type Definition Are Used 

We have seen that whole languages are easily created using XML. SVG
(an abbreviation for Scalable Vector Graphics) is one of many examples.
The interpreter that is written to process an SVG document
will 'know' how to deal with each of the defined tags used in the language.
We can also bet that if a document is not written correctly with respect
to the SVG standard, an SVG interpreter will not give us the results
we are looking for. This is where a document type definition can be used
to ensure an SVG document is valid.

DTDs are Used to Specify XML Languages

A DTD will specify the exact format that each markup tag of the language
will take, and what kind of content the tag will have. The DTD also controls
the order and number of occurrences of elements in a document instance.

If an XML instance document is created and it conforms fully to the set of rules
described in the DTD, as tested by a validating parser, then the SVG application
'guarantees' it will be able to transform this XML document into a graphical rendering.

Validation is an optional process. For instance, most browsers will render a well
formed page even if the document is not valid with reference to its DTD. The
full utility of a DTD is applied when a validating parser is used to ensure that an
XML document conforms to a specification. This is handy as a document can be
checked for validity before it is loaded by an application, avoiding the processing
of corrupt or invalid data.

// Most browsers render regardless of validity. Practically speaking, we need to use
// other applications to test our XML documents for validity.

Commonly referenced DTDs that are used by several organizations are often
published on the web where changes and modifications can be centrally
managed.

DTDs are also popular for enforcing correctness in configuration files. The J2EE
platform, for example, uses DTDs to enforce correctness in the creation of web
application configuration files. As an additional example, DTDs are used to dictate
the content and structure of XML configuration files that specify custom tags created
in conjunction with the JavaServer Pages API.

// more recently XML schema has been adopted to do these tasks


Internal and External DTDs



"Hello DTD" In An Internally Defined DTD 

Before surveying the individual aspects of DTDs we can inaugurate our entry into
this domain with a look at a simple Hello World Document Type Definition. The
document starts with the standard XML declaration. The DTD uses the internally defined
form we were introduced to when we looked briefly at internal ENTITY declarations.
The form is characterized by the use of square brackets, [ ], inside the document
type declaration.

// a prolog by definition is an introduction or anticipatory event

The DOCTYPE declaration has a special place in the document, following the XML
declaration and preceding the first element of the document. This area is called
the 'prolog'. Notice that the name of the first element is the name supplied in the
DOCTYPE declaration.

This DTD, which is of the internally defined variety, determines that a compliant
document will have a 'salute' element that must contain #PCDATA. 'PCDATA' is an
abbreviation for 'parsed character data'. There is a minor trap here for C, C++
and Java programmers. You may automatically wish to create something that looks
like a function, i.e. salute(#PCDATA), which is not acceptable. You need the space
between the element's identifier and its type.

Example <!ELEMENT salute  (#PCDATA)>  
 

'Hello DTD' in an Internally Defined DTD   // the DTD is part of the xml document it governs

<?xml version="1.0"?>
<!DOCTYPE salute [  <!--  prolog - the area between the xml declaration & the root element -->
<!ELEMENT salute  (#PCDATA)>  
]>
<salute>
Hello DTD!
</salute>

This internal form provides a convenient way to develop DTDs, as the type definitions
can be tested in the body of the XML document. Later, after everything is tested, the
type definitions can be moved to an externally defined DTD.

External DTDs

Consider what happens if we move the single element definition for salute (and nothing
else) into its own file called 'salute.dtd'. This definition would then be referenceable
externally via the file name, as is shown in the following example.

 

The Same File Referencing An Externally Defined DTD

<?xml version="1.0"?>
<!DOCTYPE salute  SYSTEM  "salute.dtd">
<salute>
Hello DTD!
</salute>

In both cases we have introduced a constraint on the page from a validation
point of view. If we were to add a tag into this page, something like
<Wave>Waving </Wave>, a validating parser would declare the document
invalid even though it was well formed.

// note you can add an element to the body of this xml document and
// the browsers don't complain. Browsers at this time don't validate
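
The difference between well formed and valid can be observed with any non-validating parser, not only a browser. The following sketch uses Python's standard xml.etree.ElementTree module, which checks well-formedness but does not enforce the element declarations in a DTD. The variable names are chosen for the example.

```python
import xml.etree.ElementTree as ET

# The 'Hello DTD' document with an undeclared <Wave> element added.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE salute [
<!ELEMENT salute (#PCDATA)>
]>
<salute>
Hello DTD!
<Wave>Waving </Wave>
</salute>"""

# A non-validating parser accepts the document although it breaks the DTD.
root = ET.fromstring(DOCUMENT)
child_tags = [child.tag for child in root]
print(root.tag, child_tags)  # salute ['Wave']

# A document that is not well formed is rejected outright.
try:
    ET.fromstring("<salute><Wave>Waving</salute>")
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)  # False
```

To have the undeclared Wave element reported, a validating parser would have to be used instead.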

 
Mixing Internal and External DTD Forms

A single XML document can use both internal and external DTD forms,
referencing an external DTD file while defining an additional internal DTD
subset. Together, the internal and external DTD subsets form the complete DTD.

In this situation the two subsets must be compatible. They must work together.
Harold & Means state that, as a rule, neither subset may override the declarations
the other makes. Entity declarations (which we encounter in the next section)
may be overridden, though.

The following example from 'XML in a Nutshell' shows how both an internal
and external DTD can be referenced from the same document. You will recognize
the internal form which we used in defining character data sections. The internal
form is everything between the square brackets, while the external form is referenced
by the system identifier, 'name.dtd'.


Example     
// from 'XML in a Nutshell'

<!DOCTYPE person SYSTEM "name.dtd" [
    <!ELEMENT profession  (#PCDATA)>
    <!ELEMENT person  (name, profession*)>
   ]>

// the person element depends on name.dtd for the definition of the name element




The Document Type Declaration



We have used the DOCTYPE declaration several times now and should stop
to look at it in detail. The Document Type Declaration is what is used inside
an XML document to reference a DTD or Document Type Definition.
 

The Document Type Declaration   // <!DOCTYPE ... >

The document type declaration is used to specify the document type definition.
It is written with the DOCTYPE keyword. Stated more simply,
the DOCTYPE declaration declares the DTD, whether internal, external or both.

SGML requires a DOCTYPE declaration but XML does not. This implies that
XML documents that are designated 'well-formed' are not required to contain
a document type declaration.

However, if the Document Type Declaration is included it must appear in the
prolog, after the XML declaration and before the root element. Only comments,
white space and processing instructions may appear around it there. All XML
documents that use DTDs to validate will have a document type declaration.

The DOCTYPE takes the following form.

Form of the DOCTYPE Declaration

<!DOCTYPE  name  ( SYSTEM  DTD_URL  |  PUBLIC  PUBLIC_ID  DTD_URL )  [ internal DTD subset ] >

Where - <! - the opening angle bracket and exclamation mark begin the declaration
      - DOCTYPE - keyword that abbreviates Document Type Declaration
      - name - the name of the root tag of the XML document
      - SYSTEM - used in conjunction with a URL locating an externally defined DTD
      - PUBLIC - used in conjunction with a public ID, which is backed up by a URL
      - [ ] - square brackets optionally house an internally defined DTD subset

The external identifier and the internal subset are each optional.
 

name - This name is the same as that of the root or first element of the XML document.
The name in the DOCTYPE specifically refers to the identifier that is enclosed in the
outermost tag of the XML document. For example, Atlas is the DOCTYPE name in the
following example.

Example    <Atlas>
                <Europe></Europe>
                <NorthAmerica></NorthAmerica>
                <!-- etc. -->
                </Atlas>


The simplest DOCTYPE tag for this document would be the following.

Example   <!DOCTYPE Atlas >


The second simplest scenario is the use of an internal DTD in which case we
use the square brackets. We can condense the white space out of our earlier
example to illustrate this variation.

Example <!DOCTYPE salute [ <!ELEMENT salute  (#PCDATA)> ]>

If the DOCTYPE is also specifying an external document type definition, other
keywords of the declaration are used.

Optionally, a document type declaration will have a SYSTEM or PUBLIC keyword
if a DTD is available externally. The more common one to use is SYSTEM, where the
external DTD is available somewhere by URL, either on the system or over a network.
The PUBLIC keyword is used for 'well known' DTDs which are referenced by their
'public identifiers'. The public identifiers are published by standards bodies for DTDs
that may be shared by many organizations. These keyword definitions are restated
below.

SYSTEM - SYSTEM is a keyword that indicates that the immediately following
DTD file is available somewhere on the local system or the Internet. Its location
is given in URI form.

PUBLIC - The PUBLIC keyword is used to supply a public identifier that is not
a URL. It follows a different encoding syntax. In case the public ID is not recognized
by the system, it is followed by a URL that serves as a backup and points to a DTD
supplied on the Internet. (SGML allows this URL to be omitted; XML requires it.)

The following is the DOCTYPE declaration for XHTML and is commonly used as
an example.
 

The XHTML DOCTYPE Declaration

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "DTD/xhtml1-transitional.dtd">
 

Using PUBLIC and a Public ID

In the above declaration the "-//W3C . . . EN" string is called a Public ID.
More precisely, it is a Public ID in literal form, which the W3C XML specification
calls a 'PubidLiteral'. Together with the PUBLIC keyword and the URL it makes up what
the specification refers to as an 'External ID'. The Public ID is there to represent a
format that is understood to be well known and public. If the Public ID is not known
to the application that is interpreting this information, the program can refer
to the accompanying URL, which locates a DTD containing the appropriate
formatting information. Although there is nothing
disallowing the literal from being used in a different pattern than that which is
described below, the general use of the Public ID is as follows.

General Use Public ID Form

"Public ID Character // DTD_proprietor // DTD_description // ISO_Language_Identifier"

In the example above, inside the first string the name starts with a hyphen, '-'.
This indicates that the identifier belongs to an organization that is not registered
with ISO. If the owner were registered, the plus sign or '+' would have been used. In
this format, the //W3C indicates that the W3C owns the DTD. The next part
of the literal is a description of what the DTD is about. The //EN specifies that the
language used is English.
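
The general form can be taken apart mechanically by splitting on the double slash. The following is a small sketch only; the function name split_public_id and the dictionary keys are invented for this example, and the code assumes the identifier has exactly the four fields of the general form.

```python
def split_public_id(public_id: str) -> dict:
    """Split a public ID of the general four-field form into named parts."""
    # Assumes exactly four fields separated by '//'; other shapes raise ValueError.
    registration, owner, description, language = public_id.split("//")
    return {
        "registration": registration,  # '-' or '+'
        "owner": owner,                # the DTD proprietor
        "description": description,    # what the DTD is about
        "language": language,          # ISO language identifier
    }


parts = split_public_id("-//W3C//DTD XHTML 1.0 Transitional//EN")
for key, value in parts.items():
    print(key, "=", value)
```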

Square brackets, [ ], may be used to include internally defined DTDs. If both external
and internal DTDs are present, the declarations found inside the square brackets
are read first and take precedence over those in the external DTD wherever
overriding is permitted, as it is for entity declarations. (We look at this syntax in more
detail later in the course.) For now, here is an example from 'XML in a Nutshell' by
E. Harold and S. Means.
 

Earlier Example from XML in a Nutshell

         <!DOCTYPE person SYSTEM "name.dtd" [
               <!ELEMENT profession  (#PCDATA)>
               <!ELEMENT person  (name, profession*)>
                ]>
 

DOCTYPE Components

Following is a summary of the parts of the DOCTYPE declaration in table form. Notice we are
avoiding the term attributes as these identifiers are not being assigned values.

 Table describing parts of a DOCTYPE declaration

 Symbol       Description

 < >          Start and end of the declaration
 !DOCTYPE     Doctype declaration keyword (no space after the exclamation mark)
 name         The name of the outer tag of the document
 SYSTEM       Specifies that the DTD that follows is located by a URL, on the system or over a network.
 PUBLIC       Specifies that a public identifier follows, backed up by a URL.
 [ ]          Encloses an optional internal DTD subset.
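
Most XML APIs expose these parts once a document has been parsed. As a hedged illustration, the sketch below uses Python's standard xml.dom.minidom module, whose document object carries a doctype node with name, publicId and systemId attributes. The tiny html document is made up for the example, and minidom does not fetch or validate against the external DTD.

```python
from xml.dom import minidom

DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "DTD/xhtml1-transitional.dtd">
<html></html>"""

# The parser records the parts of the declaration on the doctype node.
doctype = minidom.parseString(DOCUMENT).doctype

print(doctype.name)      # html
print(doctype.publicId)  # -//W3C//DTD XHTML 1.0 Transitional//EN
print(doctype.systemId)  # DTD/xhtml1-transitional.dtd
```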


White Space in DTDs

Extra white space is ignored in DTDs, except around the exclamation mark
of a declaration, where no white space is allowed: <!ELEMENT, not < !ELEMENT or <! ELEMENT.
 


DTD Element Definitions



Element Names

Elements follow the same XML naming conventions described earlier, where
names may start with a letter or underscore. Subsequent characters may be
letters, digits, underscores, hyphens, or periods. A single colon may be
included in conjunction with a namespace prefix. Also, a name may not start
with the character string xml, in any combination of upper and lower case.
 

Element Declarations

The following example shows the form of the element declaration. Notice the
ELEMENT keyword is case sensitive and has to be in its all uppercase form.

Element Form    <!ELEMENT elementName ( rule )  >    // ELEMENT all uppercase

Where  - ELEMENT is the keyword that begins every element declaration
       - elementName is the identifier that is selected
       - rule describes the type of data that can be contained inside the element

'XML in a Nutshell' offers what is perhaps a more intuitive description of the
form of an element, describing the contents of the element in terms of a content
model (what we discussed earlier). This is a good approach because what
characterizes an element is totally determined by what it contains.


Element Form from 'XML in a Nutshell'
 

<!ELEMENT element_Name  ( content_model ) >

The simplest case is an element that contains parsed character data. Following
is an example showing the PCDATA form. Parsed character data is represented
by the keyword #PCDATA, which must be written in upper case, and represents
data that the XML parser can parse or process.


Example
  <!ELEMENT pliers (#PCDATA) >

The element that contains a single child element is a simple form that is not seen
often. Following is an example.

Example  <!ELEMENT  case ( guitar )>

Typically we see elements that have sequences of children. Following is a sample
DTD that shows a series of element declarations. The sequence is represented by
a comma-separated list. If this next example is saved as a file with a .dtd ending it
becomes a viable external DTD. 


Example

<!ELEMENT  toolcase ( wrench, screwdriver, pliers, drillcase )>
<!ELEMENT  wrench (#PCDATA)              >
<!ELEMENT  screwdriver (#PCDATA)        >
<!ELEMENT  pliers (#PCDATA )                >
<!ELEMENT  drillcase (drills, sharpener)     >
<!ELEMENT  drills (#PCDATA)                  >
<!ELEMENT  sharpener (#PCDATA)           >

A Note on Document Order

Notice that there is an ordering constraint put on the document that is being governed
by this DTD. The significant ordering is controlled by the order of the sub-elements
listed inside the parentheses of each element definition. The order in which the
element declarations appear in the DTD is itself not significant. It does make
organizational sense though to keep things consistent.

In the above example, the first element, which is the document or root element
definition, is called 'toolcase'. It has four nested elements that are declared in a
comma-separated list inside parentheses. Each of the sub-elements is then
individually defined as containing parsed character data except for the 'drillcase'
element. This element further nests two more elements that are of the PCDATA
type, 'drills' and 'sharpener'. The following diagram depicts the structural hierarchy
that this DTD describes.
 

Diagram of the hierarchical structure that the Toolcase DTD dictates.

toolcase
   |__wrench
   |__screwdriver
   |__pliers
   |__drillcase
         |__drills
         |__sharpener
 

This DTD can be applied to a document by using the following DOCTYPE
declaration with an appropriately structured XML document.
 

Example

<?xml version="1.0"?>
<!DOCTYPE toolcase SYSTEM  "Toolcase.dtd">
<toolcase>
<wrench>Box wrench</wrench>
<screwdriver>Robertson screwdriver</screwdriver>
<pliers>needlenose pliers</pliers>
<drillcase>
<drills>Set of 10 metric</drills>
<sharpener> Tungsten hand sharpener</sharpener>
</drillcase>
</toolcase>
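
The hierarchy in the diagram can also be read straight out of the element declarations. The sketch below is one way to do this with Python's standard xml.parsers.expat module, whose ElementDeclHandler callback receives each declared element name together with a tuple describing its content model. To keep the example self-contained the toolcase declarations are placed in an internal subset and the root is left empty, which is an assumption made for illustration; expat does not validate, so the empty root is not reported.

```python
from xml.parsers import expat

DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE toolcase [
<!ELEMENT toolcase (wrench, screwdriver, pliers, drillcase)>
<!ELEMENT wrench (#PCDATA)>
<!ELEMENT screwdriver (#PCDATA)>
<!ELEMENT pliers (#PCDATA)>
<!ELEMENT drillcase (drills, sharpener)>
<!ELEMENT drills (#PCDATA)>
<!ELEMENT sharpener (#PCDATA)>
]>
<toolcase/>"""

children = {}  # element name -> names of the child elements it declares


def on_element_decl(name, model):
    # model is a (type, quantifier, name, children) tuple; every entry in
    # children has the same shape, so index 2 holds a child's element name.
    children[name] = [child[2] for child in model[3]]


parser = expat.ParserCreate()
parser.ElementDeclHandler = on_element_decl
parser.Parse(DOCUMENT, True)


def show(name, depth=0):
    """Print the declared hierarchy as an indented outline."""
    print("   " * depth + name)
    for child in children[name]:
        show(child, depth + 1)


show("toolcase")
```

This simple version only handles flat sequences such as the ones in the toolcase DTD. Nested groups would need a recursive walk over the model tuple.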

 


Element Content & Structure



DTD elements use a set of keywords, symbols and general forms to rule
or to further refine what their content will be and what structure that content
will take. These are sometimes called rules, which they are, but the term
'forms' is also an easy fit.

In terms of content, DTD elements can specify that they contain parsed character
data, other elements, both or neither. In terms of structure, symbols can be used
to specify whether an element's sub-parts occur zero or one, zero or many, or
one or many times.

Points made using our toolcase example are reiterated in the context of the
different forms that may govern an element's content.

// Note: Topics have been reordered to follow the order of presentation found
//  in 'XML in a Nutshell'.


The #PCDATA Form

The #PCDATA keyword is used to specify that an element will contain regular
character data that can be parsed by the XML parser. The keyword is case sensitive
and must be written in upper case. Following is the form an
element takes when it is declared as containing parsed character data.


Form of an Element Declared With #PCDATA Rule

<!ELEMENT elementName  ( #PCDATA )  >

In an xml document this element would take the form of the following example.
 

Example   <elementName> Parsable Character Data goes here </elementName>
 

The Element Only Form

The 'Element Only' rule restricts the content of the element to other child elements.
This specification is determined simply by how the element is declared. For instance, the
following example shows that an element called 'Super' will have a single element
called 'Sub' as its content.

Example    <!ELEMENT Super  ( Sub )>

An element may have more than one child element. Compound declarations can be
specified using a comma-separated list of child elements.

Example

<!ELEMENT elementName  ( elementOne, elementTwo, elementThree, elementFour )>


We have seen both the '#PCDATA' and the 'Element Only' forms in our toolcase example
above.


Using OR Groupings

Child elements may be declared in OR groupings, which offer a choice. The next example shows the
'OR' (aka pipe) symbol being used to declare that the child that is selected can be
one element or the other (male or female).
 

Example    <!ELEMENT gender  ( male  |  female ) >
 

Parentheses can be used to nest comma-separated sequences or choice ( | ) groupings.
 

Example   <!ELEMENT elementName ( ( One | Two ) , ( Three | Four ) )>
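
It can help to spell out what such a nested declaration permits. The short sketch below enumerates the child sequences allowed by ( ( One | Two ) , ( Three | Four ) ) by treating each choice group as a list and the comma as a product of the groups. The variable names are chosen for the example.

```python
from itertools import product

# ( ( One | Two ) , ( Three | Four ) ): a sequence of two choice groups
first_choice = ["One", "Two"]
second_choice = ["Three", "Four"]

# The comma demands one pick from the first group followed by one from the second.
valid_sequences = [list(pair) for pair in product(first_choice, second_choice)]

for sequence in valid_sequences:
    print(sequence)
```

Exactly four sequences come out. A document that placed Three before One, or that used both One and Two, would not be valid.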
 

The Mixed Form

The mixed content form is new. This form allows an element to be declared that
may contain mixed content, either parsed character data or child elements, or both.
The choices of content that may appear in an associated XML document are listed
inside the parentheses, separated by the OR or pipe symbol. The asterisk, along
with the pipe symbol, is borrowed from EBNF. It is applied to the total contents
of the parentheses. It signals that any of these items may appear zero or
more times.



Example    <!ELEMENT nameMix ( #PCDATA  |  elementOne |  elementTwo )*>


The key feature of this form, besides the asterisk that indicates zero or more, is
that the #PCDATA declaration must appear first in the listing. Any number of
child elements may follow.


A Complete Example Showing the Mixed Form

First we show a simple DTD that defines a root element that can contain text
with term elements mixed in.

A DTD Defining a Mixed Type

<!ELEMENT definitions (#PCDATA | term )* >
<!ELEMENT term  (#PCDATA) >

An XML Instance Governed By this DTD

<?xml version="1.0"?>
<!DOCTYPE definitions SYSTEM  "definitions.dtd">

<definitions>

The description is full of terms that require redefining, since we no longer
have Earth as our current context. For instance,<term>air</term> no
longer means the same thing as it did on earth.

The term <term>insect</term>, as we have found, certainly needs some
redefinition. 

</definitions>

 
A style sheet can easily be referenced from this xml document to supply
different markup for the term elements.
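
An application that reads mixed content receives the text and the child elements interleaved in document order. The following sketch shows how this looks with Python's standard xml.etree.ElementTree module, where the text before the first child is held in the parent's text attribute and the text after each child in that child's tail attribute. The one-line document is a shortened stand-in for the instance above.

```python
import xml.etree.ElementTree as ET

DOCUMENT = (
    "<definitions>For instance, <term>air</term>"
    " no longer means the same thing.</definitions>"
)

root = ET.fromstring(DOCUMENT)

# Collect the mixed content in document order.
pieces = [("text", root.text)]
for child in root:
    pieces.append((child.tag, child.text))  # the child element itself
    pieces.append(("text", child.tail))     # character data that follows it

for kind, value in pieces:
    print(kind, repr(value))
```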

The EMPTY Form

There are occasions when an element might need to be specified that should
explicitly have no content. The EMPTY form is used to accommodate this pattern.
The EMPTY content rule states the element will be empty of all content. This
may not sound that useful at first but the real payload for this form comes with
information that may be associated with an element's attribute(s).

A quick way to illustrate this is to bring attention to the IMG tag in HTML. This
tag has no content; however, its SRC attribute is used to specify an
image. The EMPTY form may also be used to create tags that can be used to
store a directive to format the page a certain way or other sorts of metadata that
would be useful for describing content or perhaps doing diagnostics.
 

Form of an Element Declared With EMPTY

<!ELEMENT elementName EMPTY  >


The ANY Form

The opposite of the EMPTY form is the ANY form. Sometimes a DTD will want
to include a loose definition that allows some latitude to the document writer to
supply information. The ANY keyword allows any legal sort of content to be
contained in an element. The ANY keyword signals the parser to consider valid
the element's content whether that content be empty, parsed character data or
other elements. The ANY rule creates a content rule that makes the validation
process relatively meaningless for the element in question.

Form of an Element Declared With ANY

<!ELEMENT elementName ANY >


There is one constraint on using this form. Any elements that are introduced
must themselves be declared in the DTD. The following example shows
this. We use an internally defined DTD here.


Example

<?xml version="1.0"?>
<!DOCTYPE whatever [<!ELEMENT whatever ANY >
                    <!ELEMENT like (#PCDATA) > 
                   ]>

<whatever>
<!-- validates empty, in mixed form, with just PCData or with just subelement  -->
This is mixing it <like> totally </like>
</whatever>




Adding Element Symbols



We previewed how the asterisk and pipe symbols can be used to control
an element's content at the granular level. We now look at the complete
collection of symbols inherited from EBNF that can be applied at the most
granular level that DTDs afford us. Following is a summary of the different
element symbols.

 

No Symbol - When no symbols are applied to data items, this signifies that
the data will appear once. This is the commonest form of element declaration.
The following example can be thought of as a singular form of a
comma-separated plural type.

Example   <!ELEMENT RegularForm ( one_value ) >  // one argument


The next three symbols control other variations of how many times an
element may appear. Note that they may be applied to individual elements
or to groups of elements enclosed in parentheses.

 
Question Mark ? - To specify that an element may appear optionally in a
document, the question mark symbol is used. It enforces a 'zero or one' rule.
In the following example, if no description is supplied a standard package
could be presumed. Otherwise the added package described could be included.

Example   <!ELEMENT OptionsPackage ( name, description? ) >  // ? --> zero or 1 parameter

 
Complete Example Demonstrating the Declaration of an Optional Element

<?xml version = "1.0" standalone="yes"?>
<!DOCTYPE options  [

<!ELEMENT options (name, description?) > 
<!ELEMENT name (#PCDATA) >
<!ELEMENT description (#PCDATA) >
]>
<options>  
<name>chrome trim</name>
<!-- optional: the user is free to leave out description and the doc is still valid -->
<description>includes chrome hub caps</description>
</options>


Asterisk * - The asterisk applied to an element parameter implies this type
will appear zero or more times. That is 0,1,2,3, . . .


Example 1

<!ELEMENT Hurricanes ( name* )  >
 

Example 2

<!ELEMENT nameMix ( #PCDATA  |  elementOne |  elementTwo )* >
 
// another example of a mixed form element
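
A complete document using the mixed form might look like the sketch below. The keyword must be written in upper case as #PCDATA and must come first in the group. The declarations of elementOne and elementTwo as character data are assumptions added so the example validates.

```xml
<?xml version="1.0"?>
<!DOCTYPE nameMix [
  <!ELEMENT nameMix (#PCDATA | elementOne | elementTwo)* >
  <!ELEMENT elementOne (#PCDATA) >
  <!ELEMENT elementTwo (#PCDATA) >
]>
<nameMix>
  Some text, then <elementTwo>a child</elementTwo>, more text,
  <elementOne>another child</elementOne> and <elementTwo>a repeat</elementTwo>.
</nameMix>
```

Because the asterisk applies to the whole group, text and the two child elements may appear in any order and any number of times, including not at all.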

The Plus sign + - Data must appear one or more times. The following example
requires at least one report element and allows any number of additional ones.

// memo -->there is always one data item specified.  This is the Kleene cross

Example   <!ELEMENT UTO_Sitings ( report+ ) >

 // 'Unidentified Tunneling Organism' Sitings - see 'Tremors', the Movie
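
The 'one or more' rule can be tried out with a small self-contained document such as the sketch below. The declaration of report as character data and the report text are assumptions made for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE UTO_Sitings [
  <!ELEMENT UTO_Sitings (report+) >
  <!ELEMENT report (#PCDATA) >
]>
<UTO_Sitings>
  <report>Ground tremors near the water tower</report>
  <report>Fence posts pulled under overnight</report>
</UTO_Sitings>
```

Deleting both report elements should make the document invalid. Changing report+ to report* in the DTD would make the empty form valid again.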


The next set of symbols control groupings of elements.

Comma , - The comma separates the items of a sequence. The items must appear in the document in the order listed.

Example    <!ELEMENT Farm (cows, chickens, horses )  >
 

Parentheses ( ) - The parentheses contain the parameters or rules supplied to an
element. They also can be nested to provide sub-sequences.

Example <!ELEMENT Structure ( single1,  single2,  ( left | right ),  single3 )  >

// the nested form would be good for things like (province | state) or (ZIP | Postal_Code)
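
The address case mentioned in the note above could be sketched as follows. The element names street and city, and all of the sample values, are hypothetical. Only the (province | state) and (ZIP | Postal_Code) groups come from the note.

```xml
<?xml version="1.0"?>
<!DOCTYPE address [
  <!ELEMENT address (street, city, (province | state), (ZIP | Postal_Code)) >
  <!ELEMENT street (#PCDATA) >
  <!ELEMENT city (#PCDATA) >
  <!ELEMENT province (#PCDATA) >
  <!ELEMENT state (#PCDATA) >
  <!ELEMENT ZIP (#PCDATA) >
  <!ELEMENT Postal_Code (#PCDATA) >
]>
<address>
  <street>12 Main Street</street>
  <city>Guelph</city>
  <!-- each nested group contributes exactly one of its alternatives -->
  <province>Ontario</province>
  <Postal_Code>N1H 0A0</Postal_Code>
</address>
```

Note that a nested group is written as a bare pair of parentheses inside the outer sequence. No name is placed in front of it.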
 

OR Symbol  | - Also called a pipe, the OR symbol is used to separate a set
of options. We see it used in the example above and again below.

Example <!ELEMENT drink ( tea | coffee | milk | pop )  >
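
A short sketch of the drink declaration in a complete document follows. Declaring the four choices as EMPTY elements is an assumption made to keep the example small.

```xml
<?xml version="1.0"?>
<!DOCTYPE drink [
  <!ELEMENT drink (tea | coffee | milk | pop) >
  <!ELEMENT tea EMPTY >
  <!ELEMENT coffee EMPTY >
  <!ELEMENT milk EMPTY >
  <!ELEMENT pop EMPTY >
]>
<!-- exactly one of the four alternatives must appear -->
<drink><coffee/></drink>
```

A drink holding both <tea/> and <coffee/> should be rejected. To allow several choices the group would need a suffix, for example ( tea | coffee | milk | pop )+.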
 

 

Applying Element Symbols in DTDs



Following is an XML data definition that might be used to describe a membership
form for a web site. Each member is described as a separate member element
of the larger Membership context. The data structure this DTD defines asks for
the member's initials, last name, type of membership (free or paid), e-mail,
phone number and address.

 
Example of a Simple XML DTD For a Web Membership

<!ELEMENT  Membership  ( member ) >     // needs to allow more than one member
<!ELEMENT  member  ( initials, lastName, memberType, e-mail, phNumber, address ) >
<!ELEMENT  initials  ( #PCDATA ) >          // may have zero or more
<!ELEMENT  lastName  ( #PCDATA ) >      // may have more than one
<!ELEMENT  memberType  ( free, paid ) >  // may be free or paid but not both
<!ELEMENT  free  ( #PCDATA ) >
<!ELEMENT  paid  ( #PCDATA ) >
<!ELEMENT  e-mail  ( #PCDATA ) >                  // may be required
<!ELEMENT  phNumber  ( #PCDATA ) >    // may be optional
<!ELEMENT  address  ( #PCDATA ) >               // also may be optional
 

While the definition as it stands is good, it is in fact very inflexible. The member
may wish to specify zero, one or more initials. A person may have more than
one last name. Membership may be free or paid. The e-mail can be required,
but the user may wish to associate more than one e-mail, so this becomes a
'one or more' situation. The member may not wish to provide a phone number,
and let us assume the same can be said about providing an address. These two
fields need to be defined as optional, in other words specified in a 'zero or one'
relationship. EBNF derived element symbols can be added to the above form
to add the flexibility that is needed in the document instance.

One major flaw of the above DTD specimen is the fact that the 'No Symbol'
rule is restricting the club to a single member! By suffixing the member element
with a plus sign to indicate one or more members this problem is resolved. In the
same way, an asterisk allows zero or more initials and a plus sign allows one or
more last names and e-mails. We can then make phone number and address optional
by suffixing these fields with question marks. Finally, we can use the OR
symbol to make the memberType free or paid.

 

Example of an XML DTD For a Web Membership That is Made Flexible Using Symbols

<!ELEMENT  Membership  (member+)>
<!ELEMENT  member  (initials*, lastName+, memberType, email+, phNumber?, address? ) >
<!ELEMENT  initials  (#PCDATA) >
<!ELEMENT  lastName  (#PCDATA) >
<!ELEMENT  memberType  (free | paid ) >
<!ELEMENT  free  (#PCDATA) >
<!ELEMENT  paid  (#PCDATA) >
<!ELEMENT  email  (#PCDATA) >
<!ELEMENT  phNumber  (#PCDATA) >
<!ELEMENT  address  (#PCDATA) >

// save to 5_DTDsymbols.dtd
 

Sample of an XML Document that uses the DTD for Validation

Following is an example of an XML document that takes advantage of the
membership document type definition specified above.
 

<?xml version="1.0"?>
<!DOCTYPE Membership SYSTEM "5_DTDsymbols.dtd">
<Membership>
<member>
<initials > P</initials>
<lastName>Taylor </lastName>
<lastName>Rockwell</lastName>
<memberType>
<free>90 day</free>
<!-- <paid>3 year subscription</paid> -->
</memberType>
<email>bill@bob.net</email>
<phNumber>519 929 2257</phNumber>
<address>RR#3 Ono Township</address>
</member>
</Membership>
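
To see the optional and repeating rules at work, a second instance can be sketched that leaves out everything the DTD allows to be left out. The member data below is made up for illustration, and the sketch assumes the flexible DTD above has been saved as 5_DTDsymbols.dtd in the same directory.

```xml
<?xml version="1.0"?>
<!DOCTYPE Membership SYSTEM "5_DTDsymbols.dtd">
<Membership>
  <member>
    <!-- no initials: initials* allows zero -->
    <lastName>Nguyen</lastName>
    <memberType>
      <paid>1 year subscription</paid>
    </memberType>
    <!-- email+ requires at least one and allows more -->
    <email>first@example.org</email>
    <email>second@example.org</email>
    <!-- phNumber and address omitted: both are marked with ? -->
  </member>
</Membership>
```

Removing both email elements, or placing address before email, should cause a validating parser to report an error.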
 

 


Practical  DTD Validation


 

XML Validation


  Java Command Line Tools in J2SDK1.4.x

 The latest Java Development Kit, version 1.4.x, has added many
 packages to support web services. In addition, on the Sun site they
 have an in-depth web services tutorial that is really an online book.
 In it they develop a parsing program called Echo which by the 10th
 rendition includes validation in its parsing process. You can get a
 copy of it at the Sun site or, for convenience, I have copied it to the
 following page, Echo10.html. Cut and paste the program into a text
 editor and compile the program using j2sdk1.4.x. Running it against
 your xml documents will check for well-formedness and validity.
 To validate your xml documents, follow the steps below.


 To compile save the file as Echo10.java and run at the command line:

  Example      javac    Echo10.java

 To run type the following at the command line.

  Example      java    Echo10   YourFile.xml

 
 Apache Command Line Tools

 
// note I think Apache has changed some of the class paths of their latest
//  packages so you will need to check their latest documentation to get
//  their parsers to run.

 If you are not running 1.4.x you can go to the www.apache.org site and 
 download version 2 of the xerces package. It expands after unzipping or
 'tar gizzing' to reveal a couple of .jar files. Copy these into your  jre/lib/ext
 directory.

 Example   JAVA_HOME/jre/lib/ext // Linux / Unix 
                 JAVA_HOME\jre\lib\ext  // Windows 

 JAVA_HOME is popularly defined in the operating system configuration
 files (autoexec.bat on Windows, or bashrc, .bashrc or profile on Linux/Unix)
 as the directory that the Java development kit is in.

Example.  JAVA_HOME=jdk1.3.1_03       // might want to export

 Check the latest documentation at the Apache site to use the Xerces parsers.

 Validating Editors 

 You can also use an XML editor. The fast and straightforward 'XML writer'
 is available for a free 30 day trial period at the site,   http://xmlwriter.net/

 What seems to be a very famous XML Editor is available at the following site.
 It also comes with a 30 day trial.  http://www.xmlspy.com/

 There is also Xeena available at the IBM web site. You can get it at this link.
 http://www.alphaworks.ibm.com/aw.nsf/download/xeena.  JEdit is a popular
 open source editor that has a huge number of plugins available including ones
 for XML. Feel free to take some time to explore what different editors are
 available that can be used to ensure that your documents are valid.

 Online Validators

 Harold & Means' 'XML in a Nutshell' points out that there are online sites which
 will validate your documents. The only catch is you need to load your page
 to an accessible server along with the associated DTDs and then load one
 of the validating sites into your browser. At this site you load the page you
 want validated and the site validates your document.

 'XML in a Nutshell' lists Brown University's XML Validation form at
 www.stg.brown.edu/service/xmlvalid and Richard Tobin's XML checker,
 which can be found at www.cogsci.ed.ac.uk/%7Erichard/xml-check.html




DTD I  Self Test                                       Self Test With Answers



 

1) True or False? An XML document may use an internal or external
     DTD but not both. True \ False

2) Which of the following may not appear in a Document Type Declaration
    Element?

a) PRIVATE
b) SYSTEM
c) PUBLIC
d) DOCTYPE

3) True or False? All whitespace in a DTD is ignored. True \ False
 

4) True or False? A DTD is an XML file stored with an xml extension.
    True \ False

5) True or False? The ANY rule allows any sort of data as long as it is
    not empty.  True \ False

6) True or False?  The Element Only Rule restricts the content of an element
    to child elements which may be comma-separated sequences with a number
    of elements declared in optional OR groupings. True \ False

7) True or False? No symbol signifies data will appear once.   True \ False
 


Exercise



1) Create a short XML document that declares a document type definition internally.
The document will have for its first tag SIGNAL. Three internal tags will be
defined called GO, STOP and CAUTION. Each of these will be defined to
take parsable character data. (The key to an internal DTD is that the definitions
go inside the square brackets of the DOCTYPE tag.)

2) Create a document type that might be used at a medical clinic. The first tag
may be called visitors and able to accommodate one or more patients. The DTD
should go on to specify name, address and phone number. The form should be
able to accommodate that the person may not have a phone or may have more
than one phone number. The form will specify an element for each of 'age'
and 'sex'. The DTD should allow that the client may omit entering their age.
Finally, the form should have a supplemental health_plan_ID element the
appearance of which is optional. The document should also contain an element
for payment that permits payment by cash, credit card or health plan.

3) After these elements have been listed and a DTD created, create an XML
document that is governed by the DTD. Use one of the suggested validating
techniques described above to confirm your entries have met the criteria
dictated by the DTD.

// Alternatively you can do a DTD scheme of your own making as long as
// you take care to make use of all the control symbols used in the above
// example.