Document Type
Definitions
Peter Komisar v.4.0 ©
References: 'Mastering XML', Navarro, White & Burman; 'XML in a Nutshell', Harold & Means; 'XML and Web Services Unleashed', Ron Schmelzer et al. TH. Estier credits M. Marcotty & H. Ledgard, The World of Programming Languages, Springer-Verlag, Berlin 1986 for the BNF table used below.
XML Content Models
XML content can be anything that doesn't break the rules of XML. Wherever necessary, special characters have to be escaped appropriately. The combination of elements, attributes and text collectively represents the message that an XML document contains.
The policies deciding how this content can be extended are the subject of content models. The model reflects the intent of the document creator. The user has ultimate control and can always override the intent of the document creator with an internally defined data definition document.
Open Content Model
An XML page that is well formed but is not constrained by a document type declaration of any kind adheres to what is referred to as an 'open content model'. An open model allows a page to be extended in any way.
Closed Content Models
Closed models restrict elements and attributes to those specified in the DTD or schema being used. The XML document creator maintains strict control over the name, number and order of elements. Closed models are useful for strict data exchanges where there has to be a guarantee of data compliance for the system to work properly. A shipping slip, for example, would be a candidate for a closed content model.
// A page that is governed by a DTD is also open to the extent that any internally
// defined DTD may add definitions to the externally defined DTDs. This is the only
// way a DTD governed document can be extended.
XML Schema
- XML Schema allows alternate schemes to be used to dictate how a model may be extended. Schemas are still limited by some basic rules; for instance, content cannot be removed in a way that 'damages' the existing model. All required elements must be present, though additional elements can be added.
In a more specific sense the idea of content models applies to what each XML element contains.
XML Content Models & Whitespace
It is sometimes not clear whether white space should be treated as significant or not. How XML processors process white space depends on the content model. In either an open or closed model, white space is not treated as significant. In a hybrid model, white space is treated as significant because in this case the parser cannot be sure. Besides the space character, white space is also created by characters like tabs, linefeeds and carriage returns.
As a rule, the XML parser still passes all white space along to the application intact. In the case of a browser, XML white space is not displayed. (XML has a special attribute called xml:space which can be used to 'preserve' white space explicitly.)
// white space is not treated as significant in the common open or closed models
// as a rule the XML parser will pass all white space intact to an application.
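The point that a parser hands white space on to the application can be checked with a short sketch. It assumes Python's standard xml.etree.ElementTree module as the 'application' side; the sample document and the variable names are invented for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the line breaks and indentation are white space only.
DOC = "<greeting>\n  <salute>Hello</salute>\n</greeting>"

root = ET.fromstring(DOC)
salute = root[0]

# White space before the first child is stored as the parent's text.
before_child = root.text
# White space after the child's end tag is stored as the child's tail.
after_child = salute.tail

print(repr(before_child))
print(repr(after_child))
print(repr(salute.text))
```

Nothing is discarded by the parser here; it is left to the application to decide whether the white-space-only strings matter.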
Extended Backus-Naur Form // mainly for reference
Beyond giving a general idea of what Backus-Naur Form is, this section is provided to show the origins of much of the syntax that is used in XML.
Backus Naur Form
XML makes use of a syntax known as Extended Backus-Naur Form or, more cryptically, EBNF. The form is named after its inventors. John Backus invented the form and Peter Naur improved on it. The authors invented the notation circa 1958 and used it to describe the programming language ALGOL 60. Extended BNF, or EBNF, attempts to improve the readability and expressiveness of BNF through the addition of extensions.
ISO, the international standards body, has published an EBNF standard, ISO/IEC 14977. The following is an interesting paper by R. S. Scowen that discusses how EBNF could be used to improve the specification of various vintage languages.
http://www.cl.cam.ac.uk/~mgk25/iso-14977-paper.pdf
EBNF has since served as the chief model for describing new programming languages, with almost every author of a new programming language using it to specify the syntax rules of his or her new language. You will recognize EBNF in XML, in the language proper and in associated technologies like XPath. The influence of EBNF is also evident in various scripting languages where regular expressions are used. This includes Perl, JavaScript and the java.util.regex package introduced in J2SE 1.4. (Check the syntax described for the class 'Pattern' in the development kit documentation.)
Following is a brief description of the meta-symbols that were defined in BNF and later in EBNF. (This is not a comprehensive list, only an introduction so we know where the EBNF used in XML comes from.)
The meta-symbols of BNF are found in the following table.
// TH. Estier credits M. Marcotty & H. Ledgard, The World of Programming Languages,
// Springer-Verlag, Berlin 1986, for the following definitions
Symbol | Meaning
::= | "is defined as"
| | "or"
< > | angle brackets used to surround category names
Common Extensions (Extended BNF inclusions)
[ ] | optional symbol
{ } | repetitive symbol
' ' | single quotes to enclose single character terminals
BNF Production Rules
BNF is used to define the set of all possible strings of symbols that constitute legal programs (i.e. strings) in a language. Each 'production rule' in the grammar is written in the BNF rule form. The production rules use what are called terminal and non-terminal symbols. The following table provides some detail on how the notation is used.
Definitions by D. Biggar http://www3.sympatico.ca/dbiggar/BNF.home.html // for reference
terminal | Terminal symbols (characters or character sequences) are bracketed
non-terminal | Non-terminal symbols are bracketed by the meta-symbols "<"
production | Each production rule has a left hand side (LHS) and a right hand
repetition | A symbol or symbols enclosed in curly brackets ( { and } ) denotes
alternate | The meta-symbol "|" (read as "or") is used to define alternate RHS
Some EBNF Examples // from 'Mastering XML' by A. Navarro, C. White & L. Burman
Every BNF grammar rule has the following form.
Basic BNF Rule Form
symbol ::= expression
// where ::= represents the phrase "is defined as"
Example of Lowercase Vowels // from 'Mastering XML'
vowels ::= [aeiou] // the symbol vowels represents a, e, i, o or u
XML definition of White Space
S ::= (#x20 | #x9 | #xD | #xA)+ // hex values for space, tab, carriage return and linefeed
// the + sign stands for "one or more"
The following two examples are included for reference to show the origins of symbols that are often used in XML as well as in other pattern matching languages.
Some EBNF extension definitions
// from http://www.augustana.ab.ca/~mohrj/courses/2000.fall/csc370/lecture_notes/ebnf.html
'The Kleene Cross' -- a sequence of one or more elements of the class marked.
<unsigned integer> ::= <digit>+
// the plus symbol is introduced to represent 'one or more'
'The Kleene Star' -- a sequence of zero or more elements of the class marked.
<identifier> ::= <letter><alphanumeric>*
// 'Mr. Kleene' also introduces the asterisk to represent 'zero or more'
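Since the same two symbols survive in regular expressions, the two productions above can be tried out directly. The following is a minimal sketch using Python's standard re module. It assumes that a digit means 0-9 and a letter means an ASCII letter, which the grammar above does not actually state, and the function names are made up for this illustration.

```python
import re

# <unsigned integer> ::= <digit>+     (Kleene cross: one or more)
UNSIGNED_INTEGER = re.compile(r"[0-9]+")

# <identifier> ::= <letter><alphanumeric>*     (Kleene star: zero or more)
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_unsigned_integer(text):
    """True when the whole string is one or more digits."""
    return UNSIGNED_INTEGER.fullmatch(text) is not None


def is_identifier(text):
    """True when the string is a letter followed by zero or more alphanumerics."""
    return IDENTIFIER.fullmatch(text) is not None


print(is_unsigned_integer("1986"), is_unsigned_integer(""))
print(is_identifier("x"), is_identifier("x25"), is_identifier("9lives"))
```

The empty string fails the '+' rule because at least one digit is required, while a single letter passes the '*' rule because zero further characters are acceptable.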
Although perhaps not the best example, the following link shows James Gosling's summary of the syntactical symbols used in Java and makes a passing reference to the BNF symbols he used to define this set or family of symbols.
http://java.sun.com/docs/books/jls/second_edition/html/syntax.doc.html#44467
As we switch to a study of DTDs we will have some background as to where some of the symbols originated.
Overview of DTDs
XML supplies two techniques for creating templates that constrain what can go into an XML document. The older, legacy technique involves the creation of DTDs or Document Type Definitions. The newer way is to use the XML Schema language. We need to know both. In the first case there is a vast amount of DTD legacy already in use, and DTDs are not being deprecated, so they remain available for future use. The XML Schema language is both more capable and more complex than DTDs. XML Schema has already been widely adopted and is serving an important role in the emerging web services architecture. We begin by looking at the legacy system, Document Type Definitions.
How Document Type Definitions Are Used
We have seen that whole languages are easily created using XML. SVG (an abbreviation for Scalable Vector Graphics) is one of many examples. The interpreter that is written to process an SVG-type XML document will 'know' how to deal with each of the defined tags used in the language. We can also bet that if a document is not written correctly with respect to the SVG standard, an SVG interpreter will not give us the results we are looking for. This is where a document type definition can be used to ensure an SVG document is valid.
DTDs are Used to Specify XML Languages
A DTD will specify the exact format that each markup tag of the language will take, and what kind of content the tag will have. The DTD also controls the order and number of occurrences of elements in a document instance.
If an XML instance document is created and it conforms fully to the set of rules described in the DTD, as tested by a validating parser, then the SVG application 'guarantees' it will be able to transform this XML document into a graphical rendering.
Validation is an optional process. For instance, most browsers will render a well formed page even if the document is not valid with reference to its DTD. The full utility of a DTD is applied when a validating parser is used to ensure that an XML document conforms to a specification. This is handy as a document can be checked for validity before it is loaded by an application, avoiding the processing of corrupt or invalid data.
// Most browsers render regardless of validity. Practically speaking, we need to adopt
// other applications to test our XML documents for validity.
Commonly referenced DTDs that are used by several organizations are often published on the web, where changes and modifications can be centrally managed.
DTDs are also popular for enforcing correctness in configuration files. The J2EE platform, for example, uses DTDs to enforce correctness in the creation of web application configuration files. As an additional example, DTDs are used to dictate the content and structure of XML configuration files that specify custom tags created in conjunction with the JavaServer Pages API.
// more recently XML Schema has been adopted to do these tasks
Internal and External DTDs
"Hello DTD" In An
Internally Defined DTD
Before surveying the individual aspects of DTDs we can inaugurate our entry into this domain with a look at a simple 'Hello World' Document Type Definition. The document starts with the standard XML declaration. The DTD uses the internally defined form we were introduced to when we looked briefly at internal ENTITY declarations. The form is characterized by the use of square brackets, [ ], inside the document type declaration.
// a prolog by definition is an introduction or anticipatory event
The DOCTYPE declaration has a special place in the document, following the XML declaration and preceding the first element of the document. This area is called the 'prolog'. Notice that the name supplied in the DOCTYPE declaration is the name of the first element.
This DTD, which is of the internally defined variety, determines that a compliant document will have a 'salute' element that must contain #PCDATA. 'PCDATA' is an abbreviation for 'parsed character data'. There is a minor trap here for C, C++ and Java programmers. You may automatically wish to write something that looks like a function call, i.e. salute(#PCDATA), which is not acceptable. You need the space between the element's identifier and its content specification.
Example
<!ELEMENT salute (#PCDATA)>
'Hello DTD' in an Internally Defined DTD // the DTD is part of the xml document it governs
<?xml version="1.0"?>
<!DOCTYPE salute [
<!-- prolog - the area between the xml declaration & the root element -->
<!ELEMENT salute (#PCDATA)>
]>
<salute>
Hello DTD!
</salute>
This internal form provides a convenient way to develop DTDs, as the type definitions can be tested in the body of the XML document. Later, after everything is tested, the type definitions can be moved to an externally defined DTD.
External DTDs
Consider what happens if we move the single element definition for salute (and nothing else) into its own file called 'salute.dtd'. This definition would then be referenceable externally via the file name, as is shown in the following example.
The Same File Referencing An Externally Defined DTD
<?xml version="1.0"?>
<!DOCTYPE salute SYSTEM "salute.dtd">
<salute>
Hello DTD!
</salute>
In both cases we have introduced a constraint on the page from a validation point of view. If we were to add a tag into this page, something like <Wave>Waving</Wave>, a validating parser would declare the document invalid even though it was well formed.
// note you can add an element to the body of this xml document and
// the browsers don't complain.
Browsers at this time don't validate
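The difference between 'well formed' and 'valid' can be seen with a non-validating parser. The sketch below assumes Python's standard xml.parsers.expat module, which reads the declarations in an internal DTD but does not validate against them. The helper name undeclared_elements is made up for this illustration, and the check it performs is only a rough stand-in for real validation.

```python
import xml.parsers.expat as expat

# The 'Hello DTD' document with an extra, undeclared <Wave> element added.
DOC = """<?xml version="1.0"?>
<!DOCTYPE salute [
<!ELEMENT salute (#PCDATA)>
]>
<salute>
Hello DTD!
<Wave>Waving</Wave>
</salute>"""


def undeclared_elements(xml_text):
    """Return the element names used in the body but never declared in the DTD."""
    declared = set()
    used = set()
    parser = expat.ParserCreate()
    # Called once for every <!ELEMENT ...> declaration the parser reads.
    parser.ElementDeclHandler = lambda name, model: declared.add(name)
    # Called once for every start tag in the document body.
    parser.StartElementHandler = lambda name, attrs: used.add(name)
    parser.Parse(xml_text, True)  # raises ExpatError only if not well formed
    return used - declared


print(undeclared_elements(DOC))
```

The parse completes without complaint because the page is well formed. Only the comparison of used names against declared names reveals the problem a validating parser would report. A real validator would also check each element's content against its content model, which this sketch does not attempt.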
Mixing Internal and External DTD Forms
A single XML document can use both internal and external DTD forms, referencing an external DTD file while defining an additional internal DTD subset. Together, the internal and external DTDs form the complete DTD.
In this situation the two DTDs must be compatible. They must work together. Harold & Means state that, as a rule, neither DTD may override the declarations the other makes. Entity declarations (which we encounter in the next section) may, however, be overridden.
The following example from 'XML in a Nutshell' shows how both an internal and an external DTD can be referenced from the same document. You will recognize the internal form which we used in defining character data sections. The internal form is everything between the square brackets, while the external form is referenced by the identifier 'name.dtd'.
Example // from 'XML in a Nutshell'
<!DOCTYPE person SYSTEM "name.dtd" [
<!ELEMENT profession (#PCDATA)>
<!ELEMENT person (name, profession*)>
]>
// the person element depends on name.dtd for the definition of the name element
The Document Type Declaration
We have used the DOCTYPE declaration several times now and should stop to look at it in detail. The Document Type Declaration is what is used inside an XML document to reference a DTD or Document Type Definition.
The Document Type Declaration // the DOCTYPE declaration
The document type declaration is used to specify the document type definition. This declaration is written with the DOCTYPE keyword. Stated more simply, the DOCTYPE declaration declares the DTD, whether internal, external or both.
SGML requires a DOCTYPE declaration but XML does not. This implies that XML documents that are designated 'well-formed' are not required to contain a document type declaration.
However, if the Document Type Declaration is included it must appear in the prolog, after the XML declaration and before the root element. Comments, white space and processing instructions may appear before or after it within the prolog. All XML documents that use DTDs to validate will have a document type declaration.
The DOCTYPE takes the following form.
Form of the DOCTYPE Declaration
<!DOCTYPE name ( SYSTEM "DTD_URL" | PUBLIC "PUBLIC_ID" "DTD_URL" ) [ internal DTD subset ] >
Where
- <! - marks the beginning of the declaration.
- DOCTYPE - the keyword for the Document Type Declaration
- name - the name of the root tag of the XML document
- SYSTEM - used in conjunction with a URL describing an externally defined DTD
- PUBLIC - used in conjunction with a public ID, which is followed by a URL as a backup
- [ ] - square brackets optionally house an internally defined DTD subset.
Both the external reference and the internal subset are optional.
name - This identity is the same as the root or first element of the XML document. The name in the DOCTYPE specifically refers to the identifier used in the outermost tag of the XML document. For example, Atlas is the DOCTYPE name in the following example.
Example
<Atlas>
<Europe></Europe>
<NorthAmerica></NorthAmerica>
<!-- etc. -->
</Atlas>
The simplest DOCTYPE declaration for this document would be the following.
Example <!DOCTYPE Atlas>
The second simplest scenario is the use of an internal DTD, in which case we use the square brackets. We can condense the white space out of our earlier example to illustrate this variation.
Example <!DOCTYPE salute [ <!ELEMENT salute (#PCDATA)> ]>
If the DOCTYPE is also specifying an external document type definition, other keywords of the declaration are used.
Optionally, a document type declaration will have a SYSTEM or PUBLIC keyword if a DTD is available externally. The commoner one to use is SYSTEM, where the external DTD is available somewhere by URL, either on the system or over a network. The PUBLIC keyword is used for 'well known' DTDs which are referenced by their 'public identifiers'. The public identifiers are published by standards bodies for DTDs that may be shared by many organizations. These keyword definitions are restated below.
SYSTEM - SYSTEM is a keyword that indicates that the DTD file named immediately after it is available somewhere on the local system or the Internet. Its location is given in URI form.
PUBLIC - The PUBLIC keyword is used to supply a public identifier that is not a URL. It follows a different encoding syntax. In case the public ID is not recognized by the system, a URL follows as a backup that points to a DTD supplied on the Internet.
The following is the DOCTYPE declaration for XHTML, which is commonly used as an example.
The XHTML DOCTYPE Declaration
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">
Using PUBLIC and a Public ID
In the above declaration the "-//W3C . . . EN" string is called a Public ID. More precisely it is a Public ID in literal form, which the W3C XML specification calls a Public ID Literal. The specification refers to the whole SYSTEM or PUBLIC reference as an 'External ID'. This ID is there to represent a format that is understood to be well known and public. In case the Public ID is not known to the application that is interpreting this information, the program can refer to the accompanying URL, which points to a DTD holding the appropriate formatting information. Although there is nothing disallowing the form from being used in a different pattern than that which is described below, the general use of the Public ID is as follows.
General Use Public ID Form
"Public ID Character // DTD_proprietor // DTD_description // ISO_Language_Identifier"
In the example above, the string starts with a hyphen, '-'. This indicates that the identifier belongs to an organization that is not registered with ISO. For a registered owner, the plus sign, '+', would have been used. In this format, //W3C indicates that the W3C owns the DTD. The next part of the literal is a description of what the DTD is about. The //EN specifies that the language used is English.
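The parts of a DOCTYPE declaration can also be pulled apart programmatically. The following is a minimal sketch that assumes Python's standard xml.parsers.expat module and its StartDoctypeDeclHandler callback; the helper name doctype_parts and the one-line sample document are invented for this illustration. Splitting the public ID on '//' is only a convenience for reading the general form shown above, not something the XML specification defines.

```python
import xml.parsers.expat as expat

# Sample document: the XHTML DOCTYPE followed by an empty root element.
DOC = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"DTD/xhtml1-transitional.dtd">'
    "<html/>"
)


def doctype_parts(xml_text):
    """Return the name, system ID and public ID found in the DOCTYPE."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found["name"] = name
        found["system_id"] = system_id
        found["public_id"] = public_id
        found["has_internal_subset"] = bool(has_internal_subset)

    parser = expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return found


parts = doctype_parts(DOC)
# Break the public ID into its '//' separated fields.
public_id_fields = parts["public_id"].split("//")

print(parts)
print(public_id_fields)
```

The four fields correspond to the leading character, the proprietor, the description and the language identifier of the general form. By default expat does not fetch the external DTD, so the relative URL is only reported, never opened.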
Square brackets, [ ], may be used to include internally defined DTDs. If both external and internal DTDs are present, the internal subset found inside the square brackets is read first, so its entity and attribute-list declarations take precedence over those in the external DTD. Element declarations may not be repeated. (We look at this syntax in more detail later in the course.) For now, here is an example from 'XML in a Nutshell' by E. Harold and S. Means.
Earlier Example from XML in a Nutshell
<!DOCTYPE person SYSTEM "name.dtd" [
<!ELEMENT profession (#PCDATA)>
<!ELEMENT person (name, profession*)>
]>
DOCTYPE Components
Following is a summary of the parts of the DOCTYPE declaration in table form. Notice we are avoiding the term attributes as these identifiers are not being assigned values.
Table describing parts of a DOCTYPE declaration
Symbol | Description
< > | Start and end of the declaration
!DOCTYPE | Doctype declaration
name | The name of the outer tag of the document
SYSTEM | Specifies that the following DTD is found on the system or at a URL.
PUBLIC | Specifies that the following DTD is identified by a public ID, backed up by a URL.
[ ] | Encloses an optional internal DTD subset.
White Space in DTDs
White space is generally ignored in DTDs. The exception is around the exclamation mark: no white space is allowed between the opening '<' and the '!', or between the '!' and the keyword that follows.
DTD Element Definitions
Element Names
Element names follow the same XML naming conventions described earlier, where names may start with a letter or underscore. Subsequent characters may be letters, digits, underscores, dashes or periods. A single colon may be included in conjunction with a namespace prefix. Also, a name may not start with the character string xml.
Element Declarations
The following example shows the form of the element declaration. Notice the ELEMENT keyword is case sensitive and has to be in its all-uppercase form.
Element Form <!ELEMENT elementName ( rule ) > // ELEMENT all uppercase
Where
- ELEMENT is the generic tag name for element definitions.
- elementName is the identifier that is selected
- rule describes the type of data that can be contained inside the element.
'XML in a Nutshell' offers what is perhaps a more intuitive description of the form of an element, describing the contents of the element in terms of a content model (what we discussed earlier). This is a good approach because what characterizes an element is totally determined by what it contains.
Element Form from 'XML in a Nutshell'
<!ELEMENT element_Name ( content_model ) >
The simplest case is an element that contains parsed character data. Following is an example showing the PCDATA form. Parsed character data is represented by the keyword #PCDATA, which must be written in uppercase, and represents text that the XML parser will parse or process.
Example <!ELEMENT pliers (#PCDATA) >
The element that contains a single child element is a simple form that is not seen often. Following is an example.
Example
<!ELEMENT case ( guitar )>
Typically we see elements that have sequences of children. Following is a sample DTD that shows a series of element declarations. The sequence is represented by a comma-separated list. If this next example is saved as a file with a .dtd ending it becomes a viable external DTD.
Example
<!ELEMENT toolcase ( wrench, screwdriver, pliers, drillcase )>
<!ELEMENT wrench (#PCDATA) >
<!ELEMENT screwdriver (#PCDATA) >
<!ELEMENT pliers (#PCDATA) >
<!ELEMENT drillcase ( drills, sharpener ) >
<!ELEMENT drills (#PCDATA) >
<!ELEMENT sharpener (#PCDATA) >
A Note on Document Order
Notice that there is an ordering constraint put on the document that is being governed by this DTD. The significant ordering is controlled by the order of the sub-elements listed inside the parentheses of each element definition. The order in which the element declarations appear in the DTD is itself not significant. It does make organizational sense, though, to keep things consistent.
In the above example, the first element, which is the document or root element definition, is called 'toolcase'. It has four nested elements that are declared in a comma-separated list inside parentheses. Each of the sub-elements is then individually defined as containing parsed character data, except for the 'drillcase' element. This element further nests two more elements that are of the PCDATA type, 'drills' and 'sharpener'. We can depict the structural hierarchy that this DTD describes in the following diagram.
Diagram of the hierarchical structure that the Toolcase DTD dictates.
toolcase
|__wrench
|__screwdriver
|__pliers
|__drillcase
   |__drills
   |__sharpener
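The hierarchy can also be derived from the element declarations themselves. The sketch below assumes Python's standard xml.parsers.expat module, whose ElementDeclHandler reports each content model as a nested tuple of the form (type, quantifier, name, children). The toolcase DTD is placed in an internal subset so that the parser reads it, and the helper names declared_children and outline are invented for this illustration. Only flat lists of child names are handled; nested groups are ignored.

```python
import xml.parsers.expat as expat

# The toolcase DTD as an internal subset, followed by a minimal document body.
DOC = """<!DOCTYPE toolcase [
<!ELEMENT toolcase (wrench, screwdriver, pliers, drillcase)>
<!ELEMENT wrench (#PCDATA)>
<!ELEMENT screwdriver (#PCDATA)>
<!ELEMENT pliers (#PCDATA)>
<!ELEMENT drillcase (drills, sharpener)>
<!ELEMENT drills (#PCDATA)>
<!ELEMENT sharpener (#PCDATA)>
]>
<toolcase><wrench/><screwdriver/><pliers/>
<drillcase><drills/><sharpener/></drillcase></toolcase>"""


def declared_children(xml_text):
    """Map each declared element to the child names listed in its content model."""
    children = {}

    def on_element_decl(name, model):
        # model[3] holds the child models; index 2 of each child is its name.
        children[name] = [child[2] for child in model[3] if child[2] is not None]

    parser = expat.ParserCreate()
    parser.ElementDeclHandler = on_element_decl
    parser.Parse(xml_text, True)
    return children


def outline(children, name, depth=0):
    """Return indented lines showing the element and everything nested under it."""
    lines = ["  " * depth + name]
    for child in children.get(name, []):
        lines.extend(outline(children, child, depth + 1))
    return lines


print("\n".join(outline(declared_children(DOC), "toolcase")))
```

Elements declared as (#PCDATA) have no child names, so they appear as leaves in the outline.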
This DTD can be applied to a document by using the following DOCTYPE declaration with an appropriately structured XML document.
Example
<?xml version="1.0"?>
<!DOCTYPE toolcase SYSTEM "Toolcase.dtd">
<toolcase>
<wrench>Box wrench</wrench>
<screwdriver>Robertson screwdriver</screwdriver>
<pliers>needlenose pliers</pliers>
<drillcase>
<drills>Set of 10 metric</drills>
<sharpener>Tungsten hand sharpener</sharpener>
</drillcase>
</toolcase>
Element Content & Structure
DTD elements use a set of keywords, symbols and general forms to rule or further refine what their content will be and what structure that content will take. These are sometimes called rules, which they are, but the term 'forms' is also an easy fit.
In terms of content, DTD elements can specify that they contain parsed character data, other elements, both or neither. In terms of structure, symbols can be used to specify whether an element's sub-parts occur zero or one, zero or many, or one or many times.
Points made using our toolcase example are reiterated in the context of the different forms that may govern an element's content.
// Note: Topics have been reordered to follow the order of presentation found
// in 'XML in a Nutshell'.
The #PCDATA Form
The #PCDATA keyword is used to specify that an element will contain regular character data that is parsed by the XML parser. The keyword is case sensitive and must be written in uppercase. Following is the form an element takes when it is declared as containing parsed character data.
Form of an Element Declared With #PCDATA Rule
<!ELEMENT elementName ( #PCDATA ) >
In an XML document this element would take the form of the following example.
Example
<elementName> Parsed Character Data goes here </elementName>
The Element Only Form
The 'Element Only' rule restricts the content of the element to other child elements. This specification is determined simply by how the element is declared. For instance, the following example shows that an element called 'Super' will have a single element called 'Sub' as its content.
Example <!ELEMENT Super ( Sub )>
An element may have more than one child element. Compound declarations can be specified using a comma-separated list of child elements.
Example
<!ELEMENT elementName ( elementOne, elementTwo, elementThree, elementFour )>
We have seen both the '#PCDATA' and the 'Element Only' forms in our toolcase example above.
Using OR Groupings
Child elements may be declared in OR groupings of alternatives. The next example shows the 'OR' (aka pipe) symbol being used to declare that the child that is selected can be one element or the other (male or female).
Example
<!ELEMENT gender ( male | female ) >
Parentheses can be used to nest comma-separated sequences or choice ( | ) groupings.
Example
<!ELEMENT elementName ( ( One | Two ) , ( Three | Four ) )>
The Mixed Form
The mixed content form is new. This form allows an element to be declared that may contain mixed content: parsed character data, child elements, or both. The choices of content that may appear in an associated XML document are listed inside the parentheses, separated by the OR or pipe symbol. The asterisk, along with the pipe symbol, is borrowed from EBNF. It is applied to the total contents of the parentheses. It signals that any of these items may appear zero or more times.
Example <!ELEMENT nameMix ( #PCDATA | elementOne | elementTwo )*>
The key feature of this form, besides the asterisk that indicates zero or more, is that the #PCDATA declaration must appear first in the listing. Any number of child elements may follow.
A Complete Example Showing the Mixed Form
First we show a simple DTD that defines a root element that can contain text with term elements mixed in.
A DTD Defining a Mixed Type
<!ELEMENT definitions (#PCDATA | term )* >
<!ELEMENT term (#PCDATA) >
An XML Instance Governed By this DTD
<?xml version="1.0"?>
<!DOCTYPE definitions SYSTEM "definitions.dtd">
<definitions>
The description is full of terms that require redefining, since we no longer have Earth as our current context.
For instance, <term>air</term> no longer means the same thing as it did on Earth.
The term <term>insect</term>, as we have found, certainly needs some redefinition.
</definitions>
A style sheet can easily be referenced from this XML document to supply different markup for the term elements.
The EMPTY Form
There are occasions when an element might need to be specified that should explicitly have no content. The EMPTY form is used to accommodate this pattern. The EMPTY content rule states that the element will be empty of all content. This may not sound that useful at first, but the real payload for this form comes with information that may be associated with an element's attribute(s).
A quick way to illustrate this is to bring attention to the IMG tag in HTML. This tag has no content; however, the IMG SRC attribute is used to specify an image. The EMPTY form may also be used to create tags that store a directive to format the page a certain way, or other sorts of metadata that would be useful for describing content or perhaps doing diagnostics.
Form of an Element Declared With EMPTY
<!ELEMENT elementName EMPTY >
The ANY Form
The opposite of the EMPTY form is the ANY form. Sometimes a DTD will want to include a loose definition that allows some latitude to the document writer to supply information. The ANY keyword allows any legal sort of content to be contained in an element. The ANY keyword signals the parser to consider the element's content valid whether that content is empty, parsed character data or other elements. The ANY rule makes the validation process relatively meaningless for the element in question.
Form of an Element Declared With ANY
<!ELEMENT elementName ANY >
There is one constraint on using this form. Any elements that are introduced must themselves be declared in the DTD. The following example shows this. We use an internally defined DTD here.
Example
<?xml version="1.0"?>
<!DOCTYPE whatever [
<!ELEMENT whatever ANY >
<!ELEMENT like (#PCDATA) >
]>
<whatever>
<!-- validates empty, in mixed form, with just PCDATA or with just subelement -->
This is mixing it <like> totally </like>
</whatever>
Adding Element Symbols
We previewed how the asterisk and pipe symbols can be used to control an element's content at the granular level. We now look at the complete collection of symbols inherited from EBNF that can be applied at the most granular level that DTDs afford us. Following is a summary of the different element symbols.
No Symbol - When no symbols are applied to data items, this signifies that the data will appear exactly once. This is the commonest form of element declaration. The following example can be thought of as a singular form of a comma-separated plural type.
Example <!ELEMENT RegularForm ( one_value ) > // one argument
The next three symbols control other variations of how many times an element may appear. Note that they may be applied to individual elements or to groups of elements enclosed in parentheses.
Question Mark ? - In order to specify that an element can appear optionally in a document, the question mark symbol is used. It enforces a 'zero or one' rule. In the following example, if no description is supplied a standard package could be presumed. Otherwise the added package specified could be included.
Example <!ELEMENT OptionsPackage ( name, description? ) > // ? --> zero or 1 parameter
Complete Example Demonstrating the Declaration of an Optional Element
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE options [
<!ELEMENT options (name, description?) >
<!ELEMENT name (#PCDATA) >
<!ELEMENT description (#PCDATA) >
]>
<options>
<name>chrome trim</name>
<!-- optional: user is free to leave out description and the doc is still valid -->
<description>includes chrome hub caps</description>
</options>
Asterisk * - The asterisk applied to an element parameter implies this type will appear zero or more times. That is 0, 1, 2, 3, . . .
Example 1
<!ELEMENT Hurricanes ( name* ) >
Example 2
<!ELEMENT nameMix ( #PCDATA | elementOne | elementTwo )* >
// another example of a mixed form element
The Plus sign + - Data must appear one or more times. The following example will have at least one report and any number of additional data items.
// memo --> there is always at least one data item specified. This is the Kleene cross
Example <!ELEMENT UTO_Sitings ( report+ ) >
// 'Unidentified Tunneling Organism' Sitings - see 'Tremors', the Movie
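Comma , -
The comma separates the items of a compound parameter list. The child elements
must appear in the document instance in the order in which they are listed.
Example
<!ELEMENT Farm ( cows, chickens, horses ) >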
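The effect of a comma-separated list on a document instance can be sketched as follows. The child declarations and the numeric content are assumptions made for the example.

```xml
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE Farm [
<!ELEMENT Farm (cows, chickens, horses) >
<!ELEMENT cows (#PCDATA) >
<!ELEMENT chickens (#PCDATA) >
<!ELEMENT horses (#PCDATA) >
]>
<Farm>
<!-- each child appears exactly once, in the declared order -->
<cows>12</cows>
<chickens>40</chickens>
<horses>3</horses>
</Farm>
```

Swapping two of the children, for instance placing horses before cows, would make the document invalid even though all three elements are still present.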
Parentheses ( ) -
The parentheses contain the parameters or rules supplied to an element.
They can also be nested to provide sub-sequences.
Example <!ELEMENT Structure ( single1, single2, ( left | right ), single3 ) >
// the nested form would be good for things like (province | state) or (ZIP | Postal_Code)
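The (province | state) and (ZIP | Postal_Code) idea might be worked out as in the following sketch. The Address, street and city element names, and the sample values, are hypothetical and are not taken from the notes.

```xml
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE Address [
<!ELEMENT Address (street, city, (province | state), (ZIP | Postal_Code)) >
<!ELEMENT street (#PCDATA) >
<!ELEMENT city (#PCDATA) >
<!ELEMENT province (#PCDATA) >
<!ELEMENT state (#PCDATA) >
<!ELEMENT ZIP (#PCDATA) >
<!ELEMENT Postal_Code (#PCDATA) >
]>
<Address>
<street>12 Main Street</street>
<city>Windsor</city>
<!-- one member of each nested group is chosen -->
<province>Ontario</province>
<Postal_Code>N9A 1A1</Postal_Code>
</Address>
```

Each nested group occupies a single position in the outer sequence, so exactly one of its alternatives appears at that position.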
OR Symbol | -
Also called a pipe, the OR symbol is used to separate a set of options,
exactly one of which may appear. We see it used in the example above and again below.
Example
<!ELEMENT drink ( tea | coffee | milk | pop ) >
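A document instance for the drink declaration could look like the sketch below. Declaring the four alternatives as EMPTY elements is an assumption. They could equally be declared to hold character data.

```xml
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE drink [
<!ELEMENT drink (tea | coffee | milk | pop) >
<!ELEMENT tea EMPTY >
<!ELEMENT coffee EMPTY >
<!ELEMENT milk EMPTY >
<!ELEMENT pop EMPTY >
]>
<!-- exactly one of the four alternatives may appear -->
<drink><coffee/></drink>
```

A drink holding both tea and coffee, or holding nothing at all, would not validate, because no occurrence symbol has been applied to the group.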
Applying Element Symbols in DTDs
Following is an XML data definition that might be used to describe a membership
form for a web site. Each member is described as a separate member element
of the larger Membership context. The data structure this DTD defines asks for
the member's initials, last name, type of membership (free or paid), and the
member's e-mail, phone number and address.
Example of a Simple XML DTD For a Web Membership
<!ELEMENT Membership ( member ) >
// needs to allow more than one member
<!ELEMENT member ( initials, lastName, memberType, e-mail, phNumber, address ) >
<!ELEMENT initials ( #PCDATA ) >
// may have zero or more
<!ELEMENT lastName ( #PCDATA ) >
// may have more than one
<!ELEMENT memberType ( free, paid ) >
// may be free or paid but not both
<!ELEMENT free ( #PCDATA ) >
<!ELEMENT paid ( #PCDATA ) >
<!ELEMENT e-mail ( #PCDATA ) >
// may be required
<!ELEMENT phNumber ( #PCDATA ) >
// may be optional
<!ELEMENT address ( #PCDATA ) >
// also may be optional
While the definition as it stands is usable, it is in fact very inflexible.
The member may wish to specify zero, one or more initials. A person may have
more than one last name. Membership may be free or paid, but not both. The
e-mail can be required, but the user may wish to associate more than one e-mail
address, so this becomes a 'one or more' situation. The member may not wish to
provide a phone number, and let us assume the same can be said about providing
an address. These two fields need to be defined as optional, in other words
specified in a 'zero or one' relationship.
EBNF-derived element symbols can be added to the above form to add the
flexibility that is needed in the document instance.
One major flaw of the above DTD specimen is the fact that the 'No Symbol'
rule restricts the club to a single member! Suffixing the member element
with a plus sign, to indicate one or more members, resolves this problem.
In the same way, an asterisk allows any number of initials, and plus signs
allow more than one last name and more than one e-mail. We can then make phone
number and address optional by suffixing these fields with question marks.
Finally, we can use the OR symbol to make the memberType either free or paid.
Example of an XML DTD For a Web Membership That is Made Flexible Using Symbols
<!ELEMENT Membership (member+) >
<!ELEMENT member (initials*, lastName+, memberType, email+, phNumber?, address? ) >
<!ELEMENT initials (#PCDATA) >
<!ELEMENT lastName (#PCDATA) >
<!ELEMENT memberType (free | paid ) >
<!ELEMENT free (#PCDATA) >
<!ELEMENT paid (#PCDATA) >
<!ELEMENT email (#PCDATA) >
<!ELEMENT phNumber (#PCDATA) >
<!ELEMENT address (#PCDATA) >
// save to 5_DTDsymbols.dtd
Sample of an XML Document that uses the DTD for Validation
Following is an example of an XML document that takes advantage of the
membership document type definition specified above.
<?xml version="1.0"?>
<!DOCTYPE Membership SYSTEM "5_DTDsymbols.dtd">
<Membership>
<member>
<initials>P</initials>
<lastName>Taylor</lastName>
<lastName>Rockwell</lastName>
<memberType>
<free>90 day</free>
<!-- <paid>3 year subscription</paid> -->
</memberType>
<email>bill@bob.net</email>
<phNumber>519 929 2257</phNumber>
<address>RR#3 Ono Township</address>
</member>
</Membership>
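To see the optional symbols at work, a second instance can leave out everything the DTD does not insist on. This is only a sketch: the member data is invented, and it assumes the DTD above has been saved as 5_DTDsymbols.dtd in the same directory.

```xml
<?xml version="1.0"?>
<!DOCTYPE Membership SYSTEM "5_DTDsymbols.dtd">
<Membership>
<member>
<!-- initials* : none supplied -->
<lastName>Nguyen</lastName>
<memberType>
<paid>3 year subscription</paid>
</memberType>
<!-- email+ : two supplied -->
<email>first@example.org</email>
<email>second@example.org</email>
<!-- phNumber? and address? : both omitted -->
</member>
</Membership>
```

The member still supplies the three required items (lastName, memberType and at least one email) in the declared order, so the document should validate.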
XML Validation
Java Command Line Tools in J2SDK1.4.x
If you are not running 1.4.x you can go to the www.apache.org site and obtain
the Xerces parsers. Jar files placed in the Java extensions directory are
picked up automatically.
Example
JAVA_HOME/jre/lib/ext // Linux / Unix
JAVA_HOME is popularly defined in the operating system configuration.
Example
JAVA_HOME=jdk1.3.1_03 // might want to export
Check the latest documentation at the Apache site to use the Xerces parsers.
Validating Editors
You can also use an XML editor. The fast and straightforward 'XML Writer' is
one choice. What seems to be a very famous XML editor is available at the
following site. There is also Xeena, available at the IBM web site. Harold &
Means' 'XML in a Nutshell' points out that there are online sites which
validate XML documents, and lists Brown University's XML Validation form.
DTD I Self Test
1) True or False? An XML document may use an internal or external DTD but
not both. True / False
2) Which of the following may not appear in a Document Type Declaration?
a) PRIVATE
b) SYSTEM
c) PUBLIC
d) DOCTYPE
3) True or False? All whitespace in a DTD is ignored. True / False
4) True or False? A DTD is an XML file stored with an xml extension.
True / False
5) True or False? The ANY rule allows any sort of data as long as it is
not empty. True / False
6) True or False? The Element Only Rule restricts the content of an element
to child elements, which may be comma-separated sequences with a number of
elements declared in optional OR groupings. True / False
7) True or False? No symbol signifies data will appear once. True / False
Exercise
1) Create a short XML document that declares a document type definition
internally. The document will have SIGNAL for its first tag. Three internal
tags will be defined called GO, STOP and CAUTION. Each of these will be
defined to take parsed character data. (The key to an internal DTD is that
the definitions go inside the square brackets of the DOCTYPE declaration.)
2) Create a document type that might be used at a medical clinic. The first
tag may be called visitors and should be able to accommodate one or more
patients. The DTD should go on to specify name, address and phone number.
The form should allow for the fact that a person may not have a phone or may
have more than one phone number. The form will specify an element for each
of 'age' and 'sex'. The DTD should allow the client to omit entering their
age. Finally, the form should have a supplemental health_plan_ID element, the
appearance of which is optional. The document should also contain an element
for payment that permits payment by cash, credit card or health plan.
3) After these elements have been listed and a DTD created, create an XML
document that is governed by the DTD. Use one of the validating techniques
described above to confirm that your entries meet the criteria dictated by
the DTD.
// Alternatively you can do a DTD scheme of your own making as long as
// you take care to make use of all the control symbols used in the above example.