M.Sc. thesis

Introduction

The first chapter of the thesis I submitted in partial fulfilment of an M.Sc. in Informatics at the University of Edinburgh

XPath1 is a language for addressing parts of an XML document, designed to be used by other languages for such a purpose and so provide a common syntax for document addressing. XML (Extensible Markup Language)2 is a markup language that is used to create other markup languages, instances of which are stored as documents. XML came about after a concerted effort by the World Wide Web Consortium (commonly known as the W3C, the Web’s governing body) to design a flexible language for use over the Internet. It is a sub-set of the enormous Standardized General Markup Language (SGML)3, which as its name implies is a general markup language with huge capabilities. However the sheer complexity of SGML means it can be hard to learn and document processing requires much effort. A W3C working group agreed on a sub-set of SGML that meant processing would be much simpler, documents would be easy to create and human-readable, whilst still being flexible enough to ‘support a wide variety of applications’2.

Although a markup language such as XML is not a new idea, it has found great popularity due to well thought-out parts of its design, notably:

<?xml version="1.0" encoding="utf-8"?>
<memorandum id="30df7">
  <header>
    <from>John Smith</from>
    <to>Steve Blois</to>
    <to>Joe Bloggs</to>
    <to>Peter O’Neil</to>
    <subject>Sales figures</subject>
  </header>
  <message>
    Sales figures for August required by noon Friday.
  </message>
</memorandum>
Figure 1.1: A simple XML document.

Of course the structure of the document in figure 1.1 is only apparent if one speaks English, which computers have a tendency not to do. To check whether the structure and syntax of an XML document is valid, XML allows for document type definitions (DTDs) to be stored either as part of the document itself, or referenced from a separate file. Although part of the XML specification, the DTD is a hangover from the days of SGML, and has a syntax all of its own. Once an XML document is well-formed (i.e. it complies with the XML syntax), it is also valid if it conforms to a given DTD. The DTD defines what elements and attributes can appear in a document and where they can appear, along with having some control over what type of data can occur inside elements and attributes.

However, as XML became more popular and more widely used, DTDs were decried for being too complex; so the W3C set out to create a successor. Early 2001 saw the recommendation7 of XML Schema8 9 10, and XML-based syntax for defining XML documents11. Contrary to being simpler than DTDs, XML Schema is a very powerful specification that allows for much more control over XML documents, including: specifying data types for elements and attribute content; inheritance from other schemas; and minimum and maximum occurrences of elements (DTDs rely on regular expressions; XML Schema allows for limits to be set explicitly). XML Schema has superseded DTDs and is likely to become more dominant as awareness of it grows.

XML and its companion recommendations

XML was proverbially in the right place at the right time. It has become the lingua franca of the Internet and ideas for its use has gone beyond the original intentions. Attempts are being made to represent databases in it, and XML now sports among other things query languages12 13 and processing languages14. To aid this development of XML technology — and also to control it in a benevolent fashion — the W3C publishes companion recommendations, i.e. standards that complement and augment XML. Two of these recommendations are XSLT (Extensible Style Language transformations)15 and XPointer5. These are used to transform XML documents into XML documents of different types, and for specifying locations inside an XML document respectively.

Clearly both these languages require a method of addressing parts of an XML document. So rather than duplicate the effort the W3C decided on creating a new language with the primary purpose of providing document addressing functionality, viz. XPath8. Along with its primary purpose XPath also provides basic string, number, and Boolean manipulation. XPath uses a non-XML syntax so that it can be used within XML attribute values in documents themselves. The most important construction in XPath is the expression, an example of which is given below.

/child::document/descendant::paragraph[attribute::type]

This expression looks for the document element as the root element (the element inside which the entire XML document is held), and returns any children or children of children that have the name paragraph and contain an attribute named type. (A large number of examples of XPath expressions can be found in the XPath specification8; further discussion of XPath can be found in this thesis in section 3.3.) When evaluated, an expression will result in either a node set (a node being an element, an attribute, or one of a few other such parts of an XML document), Boolean, string, or number.

Figure 1.2
Figure 1.2: An XML document represented as a tree of nodes.

XPath models XML documents as unranked trees of nodes. Such trees are ordered, finite, and labelled, where nodes can have a arbitrary number of children and so each label is not associated with a fixed rank. Figure 1.2 has such a representation of the document given in figure 1.1. In that view inner nodes correspond to elements which determine the structure of the document while the leaf nodes and the attributes provide the content. Using thirteen axes (relationships between nodes; child and descendant in the expression above are examples of axes), XPath can traverse the tree and locate certain elements, attributes. To take the expression

child::memorandum/descendant::to

(which finds any to nodes underneath the memorandum element in the tree) as an example and evaluate it against the XML document in figure 1.1, XPath would return the node set [to, to, to], each corresponding to one of the three to elements in the document.

In this instance the node set contains three elements, but it also possible that the node set will be returned empty. Evaluating the expression

child::to/descendant::memorandum

would return an empty node set because memorandum does not appear as a child of any to element. Similarly, what if the evaluation is looking for elements with certain attributes, when the elements don’t have any attributes? What if the elements don’t actually exist in the document?

With hindsight it would be obvious that an evaluation of this sort is pointless. On documents the size of that in figure 1.1 it makes little difference as it is quite simple. However, complex documents that are time-consuming to process (e.g. very flat trees with a large number of instances of an element) or very large documents (e.g. databases) bring to question this blind faith in XPath evaluation. Something as simple as a spelling mistake would cause unnecessary processing — and if the evaluation took place over multiple documents this could only add to the problem. Something is required to evaluate whether the evaluation of the XPath expression is feasible or necessary before the XML document itself is consulted.

Evaluating against document types

The one place the details of the document are found other than the document itself is in its type definition, in the form of either a DTD or an XML schema. From the document type one can find out what elements and attributes are allowed, what their data content is, where they can appear, and at what frequency. Indeed, by evaluating the XPath expression against the document type one can perform operations as simple as a spell-check and as complex as discovering whether the fifth item child of a list element that is somewhere below a paragraph element can have an attribute named language with the content español.

There are obvious advantages to this approach. In an extreme example an evaluation against a ten kilobyte schema might show evaluation against a five-hundred megabyte XML document database to be pointless. The hypothesis of this thesis then, is that introducing an intermediate step — before an XPath expression is evaluated against an XML document — where the expression is evaluated against the document type to prove the feasibility of a full evaluation, would be a sagacious measure. The objective of this project is to implement such a system to be used to provide this functionality.

Project overview

The original aim of this project was to design and implement a complete system for such abstract evaluation of XPath mentioned above. While a theory was produced for such a system (see chapter 3), XPath proved to be a more complex language than expected and only a partial implementation of the language was possible given the time limit of this project. The implementation concentrated on a particular type of XPath expression, essentially ignoring XPath’s string, number, and Boolean manipulation facilities and instead concentrating on the language’s most important function, the traversal of XML document trees. The occurrent system implemented six of the thirteen axes (specifically those that traverse the tree in a downward fashion) as proof that the system is feasible.

The system does not handle DTDs. A decision to concentrate solely on XML schemas was made for two reasons, viz. that schemas are the more interesting of the two, but mainly because a DTD’s functionality is a sub-set of XML Schema and it is a simple matter to transform a DTD into an XML schema16. Indeed tools exist to do this (the W3C has a Perl script, DTD2XSD, available from its Web site for example) and it would be trivial to add functionality to the system to take in a DTD, transform it into a schema and use it that way.

The system works as a black box. It takes in input, and based on that input returns output representing either a success or a failure of the evaluation. The input is an XML schema, an XPath expression, the document root element (the upper-most element in the document tree), along with some optional arguments. From here, the schema is modelled as a finite-state automaton and the XPath expression is used as an input language for the automaton. The evaluation is successful if the input language corresponds to the automaton, otherwise it fails17. This project builds upon an XML Schema parser and validation tool by Henry Thompson and Richard Tobin of the W3C and the University of Edinburgh18. As this is written in the Python programming language, the practical work in this project also uses Python.

Thesis outline

The following chapter contains a review of related literature. Chapter 3 contains a discussion on the theory, design, and methods behind the implementation. The system itself and its workings are described in chapter 4, while analysis, tests, and future work follow in chapter 5. The main text concludes with a summary in chapter 6, after which come a small number of appendices and the bibliography.


  1. J. Clark and S. DeRose. XML Path Language (XPath) Version 1.0, 1999. http://www.w3.org/TR/1999/REC-xpath-19991116. ↩︎
  2. T. Bray, J. Paoli, C. M. Sperberg-McQueen, and E. Maler. Extensible Markup Language (XML) 1.0 (Second Edition), 2000. http://www.w3.org/TR/2000/REC-xml-20001006. ↩︎
  3. International Organization for Standardization. ISO 8879:1986(E). Information processing — Text and office systems — Standard Generalized Markup Language (SGML), 1986. Geneva, Switzerland. ↩︎
  4. The Unicode Consortium. The Unicode Standard. Addison-Wesley Developers Press, Reading, Mass., 2000. ↩︎
  5. J. Ferraiolo. Scalable Vector Graphics (SVG) 1.0 Specification, 2001. http://www.w3.org/TR/SVG/. ↩︎
  6. D. Carlisle, P. Ion, R. Miner, and N. Poppelier. Mathematical Markup Language (MathML) Version 2.0, 2001. http://www.w3.org/TR/MathML2/. ↩︎
  7. The W3C is made up of constituent organizations, and based in three academic institutions across the world. As such it is not a government agency or a recognized international standards organization. To indicate this it uses the softer wording of ‘recommendation’ for what is essentially an international standard. ↩︎
  8. D. C. Fallside. XML Schema Part 0: Primer, 2001. http://www.w3.org/TR/2001/REC-xmlschema-0-20010502/. ↩︎
  9. H. S. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML Schema Part 1: Structures, 2001. http://www.w3.org/TR/xmlschema-1/. ↩︎
  10. P. V. Biron and A. Malhotra. XML Schema Part 2: Datatypes, 2001. http://www.w3.org/TR/xmlschema-2/. ↩︎
  11. Imagine a future where XML etc. has been lost and people come across an XML document which is described using an XML schema — which is itself an XML document. One can see this as being an equivalent of the chicken-or-the-egg argument: which came first, XML or XML Schema? ↩︎
  12. A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. XML-QL: A Query Language for XML, 2002. http://www.w3.org/TR/NOTE-xml-ql/. ↩︎
  13. S. Boag, D. Chamberlin, M. Fernandez, D. Florescu, J. Robie, and J. Siméon. XQuery 1.0: An XML Query Language, 2002. http://www.w3.org/TR/xquery/. ↩︎
  14. H. Hosoya and B. C. Pierce. XDuce: a typed XML processing language. In Int’l Workshop on the Web and Databases, Dallas, Tex., 2000. ↩︎
  15. J. Clark. XSL Transformations (XSLT), 1999. http://www.w3.org/TR/xslt. ↩︎
  16. It is prudent to note here that XML Schema (as a proper noun) is used throughout this thesis to refer to the actual recommendation, while schema or schemas refers to specific instances of the recommendation. ↩︎
  17. Finite-state automata and the theory used in this implementation are discussed in further detail in section 3.1. ↩︎
  18. H. S. Thompson and R. Tobin. Current status of XSV: Coverage, known bugs, etc., 2003. http://www.ltg.ed.ac.uk/~ht/xsv-status.html. ↩︎

This is a chapter of the thesis I submitted in partial fulfilment of an M.Sc. in Informatics at the University of Edinburgh. Read the full thesis.