1 Document Tree

First XML File

We will need an XML file to play with. In principle it can be any XML file for this lesson, but to match the output below you'll need mine. Please download or view it as file-a.xml.

First XML Program

Believe it or not, we are ready for our first XML program. Don't panic! The arrow in it just parses an XML file (the one I have prepared, for example) and pretty-prints the document tree as ASCII art. That sounds like a lot of jazz, but have faith! Two function calls to HXT will do!

So, here we go! You can also download it as lesson-1.hs.

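import Text.XML.HXT.Core

-- Run the arrow from GHCi; runX collects every result the arrow produces.
play :: FilePath -> IO [XmlTree]
play arg = runX (processor arg)

-- Read and parse the file (no validation), then pretty-print the tree to stdout.
processor :: FilePath -> IOSArrow XmlTree XmlTree
processor filename =
    readDocument [withValidate no] filename >>>
    putXmlTree "-"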
How to run this program? At a GHCi prompt:

Prelude> :load lesson-1.hs
*Main> play "file-a.xml"

Try it. Marvel at the resulting ASCII art. Match up things in the ASCII art with things in the XML file. Try some other XML files. When you're done, come back and I have a few more words to say.

Reading and parsing an XML file is usually accomplished by a single call to readDocument. It is an HXT arrow that ignores its input, does its job according to its parameters instead, and outputs the resulting document tree. It is an all-in-one deal, and many features can be enabled or suppressed through the parameters. By default, it reads the file, parses it as XML, validates it, and canonicalizes the result. The filename can also be an HTTP URL, in which case the document is fetched over the Internet, and you can specify a proxy if desired. (In HXT 9, as far as I know, HTTP access has to be enabled first with the withCurl or withHTTP option from the separate hxt-curl or hxt-http package.) Please see the HXT docs for more options. Here I use [withValidate no] to suppress validation (as my XML file cites no DTD) and leave the other options at their defaults.
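To illustrate passing more than one option, here is a minimal sketch of a variant of the processor that also asks readDocument to drop whitespace-only text nodes. The names processorNoWS and playNoWS are made up for this example; withRemoveWS is another readDocument option, used here on the assumption that you want a less cluttered tree.

```haskell
import Text.XML.HXT.Core

-- Hypothetical variant of the lesson's processor with a second option.
processorNoWS :: FilePath -> IOSArrow XmlTree XmlTree
processorNoWS filename =
    readDocument [ withValidate no    -- as in the lesson: do not validate
                 , withRemoveWS yes   -- drop text nodes holding only whitespace
                 ] filename >>>
    putXmlTree "-"

-- Hypothetical runner, analogous to play.
playNoWS :: FilePath -> IO [XmlTree]
playNoWS = runX . processorNoWS
```

Running playNoWS "file-a.xml" should print a tree without the XText "\n" nodes that sit between elements, which makes the ASCII art easier to match against the XML file.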

Remark: If the XML file cites a DTD, the DTD will be read whether you enable validation or not; if the DTD is given by an HTTP URL, it will be read from the Internet. This may surprise some of you. There is a way to skip reading the DTD altogether, but it requires calling a chain of functions instead of a simple readDocument.

First Tree

We are now interested in the structure of the document tree. You will not need to use it, but you still have to be aware of it. Knowing what it looks like, even just roughly, helps you write a working XML processor.

A document tree can be pretty-printed by putXmlTree. The pretty-printed output is written to the file named by the parameter; "-" stands for stdout. The document tree comes from the arrow's input, and it is passed through unchanged to the output. (So any time you write an XML processor and want to see the tree at some point, just insert a call to putXmlTree there. It's quite handy.)
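The pass-through behaviour can be seen in a small sketch that dumps the same tree twice, once to a file and once to stdout. The name dumpTwice and its dumpFile parameter are hypothetical, chosen only for this illustration.

```haskell
import Text.XML.HXT.Core

-- Sketch: write the ASCII art to a file, then print the same tree to stdout.
-- dumpFile is a hypothetical output file name chosen by the caller.
dumpTwice :: FilePath -> FilePath -> IO [XmlTree]
dumpTwice dumpFile filename = runX $
    readDocument [withValidate no] filename >>>
    putXmlTree dumpFile >>>   -- side effect only; the tree is handed on
    putXmlTree "-"            -- so this second dump sees the same tree
```

Because the first putXmlTree hands its input on untouched, the second one still has a tree to print, and runX still returns the document at the end.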

Here is the pretty-printing of file-a.xml. (For brevity, I have cut out some regular portions.)

content of: file-a.xml
======================

---XTag "/"
   |   "source"="file-a.xml"
   |   "transfer-URI"="file:///home/trebla/public_html/haskell/hxt-arrow/file-a.xml"
   |   "transfer-Message"="OK"
   |   "transfer-Status"="200"
   |   "version"="1.0"
   |   "encoding"="UTF-8"
   |   "transfer-Encoding"="UTF-8"
   |
   +---XPi "xml-stylesheet"
   |   |   "value"="href=\"../coding-tutorial.css\" type=\"text/css\""
   |
   +---XTag "html"
       |   "xmlns"="http://www.w3.org/1999/xhtml"
       |   "xml:lang"="en"
       |
       +---XText "\n"
       |
       +---XTag "head"
       |   |
       |   +---XText "\n  "
       |   |
       |   +---XTag "meta"
       |   |   |   "http-equiv"="Content-Language"
       |   |   |   "content"="en"
       |   |
       |   +---XText "\n  "
       |   |
       |   +---XTag "title"
       |   |   |
       |   |   +---XText "Hello World"
       |   |
       |   +---XText "\n"
       |
       +---XText "\n"
       |
       +---XTag "body"
       |   |
       |   +---XText "\n\n"
       |   |
       |   +---XTag "h1"
       |   |   |
       |   |   +---XText "Hello World"
       |   |
       |   +---XText "\n"

       ...

       |   +---XTag "pre"
       |   |   |
       |   |   +---XText "Prelude> :load hello1.hs\n*Main> main\n"
       |   |
       |   +---XText "\n\n"

       ...

An HXT parse tree always starts with a root element (tag) of name "/". If the tree is produced by one of the HXT read functions such as readDocument, this root holds a lot of FYI trivia in its attributes, such as the file name and the URL. The exact details of these attributes are largely undocumented, but it also seems that most of them are non-essential (you wouldn't need most of them should you later write the parse tree out as a new XML file). The only essential part seems to be that you have a root element of name "/" and all your XML stuff goes under it. The HXT docs call this a "complete document including root node".
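If you are curious about one of those root attributes, a minimal sketch like the following can read it. The function name rootAttr is invented for this example; getAttrValue is the HXT accessor for attribute values, which gives an empty string when the attribute is absent.

```haskell
import Text.XML.HXT.Core

-- Sketch: return the value of one attribute on the "/" root node.
-- getAttrValue yields "" when the attribute is not present.
rootAttr :: String -> FilePath -> IO [String]
rootAttr name filename = runX $
    readDocument [withValidate no] filename >>>
    getAttrValue name
```

Going by the dump above, rootAttr "transfer-Status" "file-a.xml" should give the status that the reader recorded, but since these attributes are largely undocumented I would not build anything important on them.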

Under the root, processing instructions and the (one and only) top-level element (html in this file) are siblings. Starting from the top-level element is the tree of content that we are interested in most of the time. There are element nodes and text nodes. Element nodes can have attributes and child nodes. In the remaining lessons we will see how to access, change, and create them.

Character references, entity references, and CDATA nodes are converted and merged into text nodes. Comments and the DTD are discarded. These features can be turned off by certain options of readDocument or in some cases by calling some other read/parse functions instead.
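As one example of turning such a feature off, here is a sketch that keeps comments and lists their text. The function name comments is made up; withPreserveComment is the readDocument option I am assuming does the job, and deep, isCmt, and getCmt are HXT arrows for searching the tree and inspecting comment nodes.

```haskell
import Text.XML.HXT.Core

-- Sketch: collect the text of every comment in a document.
-- withPreserveComment yes asks readDocument to keep comment nodes.
comments :: FilePath -> IO [String]
comments filename = runX $
    readDocument [withValidate no, withPreserveComment yes] filename >>>
    deep isCmt >>>   -- find comment nodes anywhere below the root
    getCmt           -- extract their text
```

With the option left at its default, the same pipeline should return an empty list, because the comments are gone before deep ever looks for them.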

One last exercise before we close off this lesson. If there is any error, including problems finding, reading, parsing, or validating the XML file or the DTD, what tree will you get? Try it! Screw up some tag, add a non-existent DTD URL, even give a non-existent XML file name... See what you get. You will get a root node with no children and with some attributes set to indicate the problem found.
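The observation above suggests a crude check for whether reading worked at all: see if the root has any children. This is only a sketch built on that observation; the name loadedOk is hypothetical, and HXT has its own status handling that a real program would probably prefer.

```haskell
import Text.XML.HXT.Core

-- Sketch: True when reading produced at least one node under the "/" root.
loadedOk :: FilePath -> IO Bool
loadedOk filename = do
    kids <- runX (readDocument [withValidate no] filename >>> getChildren)
    return (not (null kids))
```

Try loadedOk on a file name that does not exist; HXT reports the error on stderr and the function should come back with False.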

The complete data structure of HXT document trees is in the module Text.XML.HXT.DOM.TypeDefs. Again, you do not need to know it, except out of curiosity. We will be using HXT accessor arrows exclusively to process document trees and nodes, and I assure you they have the same expressive power as the data structure itself. Thus in practice we will treat the data structure as abstract. The important thing to know, though, is that to use these accessors effectively, you just need to know the topology of the trees, e.g., how many hops there are from the root node to the node you want, and what's in between. Looking at a few pieces of ASCII art such as the one in this lesson is enough for that.
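To make "hops" concrete, here is a sketch that walks from the root down to the text of the title element, one getChildren per hop. The function name titleText is invented for this example, and the path assumes a document shaped like file-a.xml; the accessors themselves (getChildren, hasName, getText) are ordinary HXT arrows.

```haskell
import Text.XML.HXT.Core

-- Sketch: four hops from the "/" root down to the text inside <title>.
titleText :: FilePath -> IO [String]
titleText filename = runX $
    readDocument [withValidate no] filename >>>
    getChildren >>> hasName "html"  >>>   -- hop 1: root -> html (skips the PI)
    getChildren >>> hasName "head"  >>>   -- hop 2: html -> head (skips text nodes)
    getChildren >>> hasName "title" >>>   -- hop 3: head -> title
    getChildren >>> getText               -- hop 4: title -> its text node
```

Each hasName filter is there because of what the ASCII art shows in between: a processing instruction next to html, and whitespace text nodes next to head and title.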

New Friends from This Lesson

Name               From Module                                Summary
readDocument       Text.XML.HXT.Arrow.ReadDocument            reads, parses, validates, canonicalizes
withValidate, no   Text.XML.HXT.Arrow.XmlState.SystemConfig   configures readDocument
putXmlTree         Text.XML.HXT.Arrow.DocumentOutput          writes ASCII art of tree
(module itself)    Text.XML.HXT.DOM.TypeDefs                  document data structure