2 Walking A Path

First Task: What Is The Title?

We are ready to extract some content from XML files!

We will again use file-a.xml, or something similar, as our XML input file. For the purpose of this lesson, it looks like:

    <?xml version="1.0" encoding="UTF-8"?>
    ...
    <html ... >
      <head>
        ...
        <title>what is here?</title>
        ...

and the parse tree looks like:

    ---XTag "/"
       |
       .
       |
       +---XTag "html"
           |
           .
           +---XTag "head"
           |   |
           .   .
           |   +---XTag "title"
           |   |   |
           |   |   +---XText "what is here?"
           ...

We are interested in the text content of the title element, i.e., whatever stands in place of "what is here?". The following program finds it. You can also download it as lesson-2.hs.

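    -- Umbrella module of HXT 9; provides readDocument, getChildren, isElem, etc.
    import Text.XML.HXT.Core

    -- Walk the path / -> html -> head -> title -> text node
    -- and output the text found at the end of it.
    processor :: FilePath -> IOSArrow XmlTree String
    processor filename =
        readDocument [withValidate no] filename >>>
        getChildren >>>
        isElem >>> hasName "html" >>>
        getChildren >>>
        isElem >>> hasName "head" >>>
        getChildren >>>
        isElem >>> hasName "title" >>>
        getChildren >>>
        getText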

How to run this program? At a GHCi prompt:

    Prelude> :load lesson-2.hs
    *Main> play "file-a.xml"

Try to run it, modify it for variations, run it with a slightly different XML file, look up the HXT documentation for the functions used... until you're thoroughly satisfied or utterly confused. Then you're ready to read on.
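The GHCi session above calls a function named play, which the listing does not show. Here is a minimal sketch of a complete lesson-2.hs, assuming that play simply runs the arrow with runX and prints the resulting list. That definition of play is an assumption, not taken from the original file; the processor is repeated only so that the file loads on its own.

```haskell
import Text.XML.HXT.Core

-- The processor from this lesson: walk / -> html -> head -> title -> text.
processor :: FilePath -> IOSArrow XmlTree String
processor filename =
    readDocument [withValidate no] filename >>>
    getChildren >>>
    isElem >>> hasName "html" >>>
    getChildren >>>
    isElem >>> hasName "head" >>>
    getChildren >>>
    isElem >>> hasName "title" >>>
    getChildren >>>
    getText

-- Assumed driver: run the arrow in IO and print the list of strings it outputs.
play :: FilePath -> IO ()
play filename = do
    results <- runX (processor filename)
    print results
```

Since runX returns every output of the arrow as a list, play prints a list of strings even when exactly one title is found.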

Anatomy of The Program

Now let's walk through the program.

    readDocument [withValidate no] filename >>>
    ...

As we learned from the previous two lessons, this parses the XML document specified by the file name and passes on the parse tree to the next arrow.

    ... >>>
    getChildren

As the name suggests, getChildren is an arrow that inputs a parse tree and outputs the child nodes/subtrees of the root. Now, the root has an arbitrary number of children, and getChildren needs to output all of them. This is the main reason why an HXT arrow can output multiple values, passing them on internally as a list: that is how getChildren outputs the list of children.

I will use "node", "subtree", and "subtree rooted at node" interchangeably; to distinguish them too much is counterproductive in a high-level language.

Specifically, at this stage, the children of the document root are possibly: processing instructions, comments, whitespace text, and the (one and only) top-level element. The list of these things (more precisely the list of subtrees rooted at these things) is passed on.

    getChildren >>>
    ...

Recall that the >>> operator chains two arrows by taking the list from the upstream and calling the downstream multiple times, once for each item in the list. This is the right thing to do if we want to apply the same downstream arrow operation to all items. (We will quickly see that it is the case for our task.)
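The chaining rule can be sketched with plain Haskell functions. This is a toy model only: it treats an arrow as a function returning a list, and the names ListArrow, chain, neighbours, and keepEven are made up for illustration. HXT's real IOSArrow also carries state and IO, which this sketch ignores.

```haskell
-- Toy model: a "list arrow" is a function that returns a list of outputs.
type ListArrow a b = a -> [b]

-- Chain two list arrows: run the upstream once, then run the downstream
-- once per upstream output, and pool all results in order.
chain :: ListArrow a b -> ListArrow b c -> ListArrow a c
chain f g = concatMap g . f

-- An upstream with several outputs (it plays the role of getChildren).
neighbours :: ListArrow Int Int
neighbours n = [n - 1, n, n + 1]

-- A downstream that outputs zero or one value (it plays the role of a filter).
keepEven :: ListArrow Int Int
keepEven n = [n | even n]
```

In GHCi, chain neighbours keepEven 5 evaluates to [4,6]: neighbours outputs 4, 5, and 6, keepEven is called once on each, and only the calls on 4 and 6 contribute an output.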

    getChildren >>>
    isElem

isElem passes on the input to the output (as a singleton list) if the input's root is an element; otherwise it outputs nothing, i.e., the empty list.

Recall that from the upstream we may receive a processing instruction, a comment, a segment of whitespace text, or an element; and for our task we only care about the last case. So isElem is a great way to discard the irrelevant cases and let through the relevant case for further processing. It can be thought of as a filter or test.

Now we have the element, but we also want to make sure it is html. This is accomplished by:

    getChildren >>>
    isElem >>> hasName "html"

hasName passes on the input to the output if the input's root is an element or an attribute with the desired name; otherwise it outputs nothing. So this serves as a test that our element has the right name.

Now we have the html element for sure, but it may have many possible children: a head element, some other elements, and some text nodes. We need to proceed to the head element and discard all the other cases. But taking a hint from the whole ordeal above, we see the way: get all children, then keep only elements, then keep only those with the right name. And afterwards we can do the same to get to the title element too! So here it goes.

    getChildren >>>
    isElem >>> hasName "html" >>>
    getChildren >>>
    isElem >>> hasName "head" >>>
    getChildren >>>
    isElem >>> hasName "title" >>>
    ...

In general a strategy emerges for tasks of the form "walk a specific path and ignore everything else": use getChildren to travel to the next hop, and use test arrows to narrow down to the desired path.
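Since every hop repeats the same three arrows, the pattern can be given a name. The sketch below does that; child and titleText are hypothetical names introduced here for illustration, not part of HXT.

```haskell
import Text.XML.HXT.Core

-- Hypothetical helper: one hop, to the child elements with the given name.
child :: ArrowXml a => String -> a XmlTree XmlTree
child name = getChildren >>> isElem >>> hasName name

-- The path of this lesson as three hops, followed by the step
-- to the text nodes below title.
titleText :: ArrowXml a => a XmlTree String
titleText =
    child "html" >>> child "head" >>> child "title" >>>
    getChildren >>> getText
```

It would be used as readDocument [withValidate no] filename >>> titleText. As far as I know, Control.Arrow.ArrowTree also exports an operator /> that abbreviates f >>> getChildren >>> g; check the HXT documentation before relying on it.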

To obtain the text inside the title element, recall that the string is stored in a text node as a child node of the owning element. (See the parse tree above.) So, we will call getChildren one last time to get to the child nodes of title, then apply an arrow getText that combines two jobs into one: discard input if it is not a text node, and output the text string if the input is a text node - exactly the right operation for our purpose!

    getChildren >>>
    isElem >>> hasName "html" >>>
    getChildren >>>
    isElem >>> hasName "head" >>>
    getChildren >>>
    isElem >>> hasName "title" >>>
    getChildren >>>
    getText

Whew! That's it!

Another Mental Model

In the beginning, I gave a mental model for such arrow chaining as f >>> g. Namely, f may output many values, so call g just as many times, once for each value from f; at the end, pool together all output values from all the calls of g.

This is an operational model, meaning it tells you how to execute things. Many people love operational models as a first step towards an understanding. But operational models are hard to keep track of in one's head once we lengthen the chain:

    f1 >>> f2 >>> f3 >>> f4 >>> ...

The cascade of multiple values is harder to imagine.

The strategy for the task of this lesson suggests an abstract model, tractable for long chains: the chain selects paths in the tree and walks them. Some of the arrows in the chain, such as getChildren, jump hops; some others, such as isElem, decide whether to continue or not. All in all, the chain is a path specification, and it applies to those paths in the tree that satisfy the conditions on the chain. This model invites you to think of one path at a time, and so it is easier to reason with; it also fits the path-oriented paradigm of XML querying.
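To see the path reading in isolation, here is a toy model in plain Haskell with no HXT involved. Everything in it (Node, Step, and the local definitions of >>>, getChildren, hasName, and getText) is a simplified stand-in written for this sketch, not HXT's actual types or implementation.

```haskell
-- Toy stand-ins for HXT's tree and arrows.
data Node = Elem String [Node]   -- an element: name and children
          | Text String          -- a text node
          deriving (Show, Eq)

type Step a b = a -> [b]

-- Chain two steps, in the spirit of HXT's >>>.
(>>>) :: Step a b -> Step b c -> Step a c
f >>> g = concatMap g . f
infixr 1 >>>

-- Jump one hop down the tree.
getChildren :: Step Node Node
getChildren (Elem _ cs) = cs
getChildren (Text _)    = []

-- Continue only on elements with the given name.
hasName :: String -> Step Node Node
hasName n e@(Elem m _) | n == m = [e]
hasName _ _                     = []

-- Continue only on text nodes, and output their string.
getText :: Step Node String
getText (Text s) = [s]
getText _        = []

-- The chain read as a path specification: html -> head -> title -> text.
titlePath :: Step Node String
titlePath =
    getChildren >>> hasName "html" >>>
    getChildren >>> hasName "head" >>>
    getChildren >>> hasName "title" >>>
    getChildren >>> getText

sample :: Node
sample =
    Elem "/"
      [ Elem "html"
          [ Elem "head" [Elem "meta" [], Elem "title" [Text "what is here?"]]
          , Elem "body" [Elem "title" [Text "not on the path"]]
          ]
      ]
```

Evaluating titlePath sample gives ["what is here?"]. The title under body is never reached, because the walk is already abandoned at body, which fails the hasName "head" test.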

So for example, the solution program in this lesson picks out paths of the pattern html -> head -> title -> text node. Then you may like to ask: what if the XML file contains many such paths? The program will match all of them and report all of the strings found (recall that the very end result is a list of strings anyway). Similarly, if the file contains no such path, the result is the empty list. I encourage you to modify the XML file to contain more or fewer matching paths, or the program to match some other paths, and verify the result.
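One quick way to try this without editing files is to parse from a string. The sketch below assumes readString from Text.XML.HXT.Core as a drop-in for readDocument; titlesIn, twoTitles, and noTitle are names made up for this example.

```haskell
import Text.XML.HXT.Core

-- Same path as in this lesson, but reading the document from a string.
titlesIn :: String -> IO [String]
titlesIn xml = runX $
    readString [withValidate no] xml >>>
    getChildren >>> isElem >>> hasName "html" >>>
    getChildren >>> isElem >>> hasName "head" >>>
    getChildren >>> isElem >>> hasName "title" >>>
    getChildren >>> getText

-- Two matching paths, and none at all.
twoTitles, noTitle :: String
twoTitles = "<html><head><title>one</title><title>two</title></head></html>"
noTitle   = "<html><head></head></html>"
```

At a GHCi prompt, titlesIn twoTitles should report both strings in document order, and titlesIn noTitle should report the empty list.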

The next question you may ask is: suppose you don't like this; how do you modify the program to reject files with no matching path or with too many? There are two ways. One is to write a DTD dictating existence and uniqueness, and tell readDocument to validate. Another way will be covered in a later lesson.
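A simple stopgap, separate from both ways mentioned above, is to inspect the result list after the arrow has run. The function exactlyOne below is a hypothetical helper written for this sketch; it is not part of HXT and not necessarily what the later lesson does.

```haskell
-- Hypothetical post-check on the list of results returned by runX:
-- accept exactly one match, and reject zero or several with a message.
exactlyOne :: [a] -> Either String a
exactlyOne [x] = Right x
exactlyOne []  = Left "no matching path"
exactlyOne xs  = Left ("expected one match, found " ++ show (length xs))
```

For instance, exactlyOne ["what is here?"] is Right "what is here?", while a result list with two strings yields Left "expected one match, found 2".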

New Friends from This Lesson

Name         From module                   Summary
getChildren  Control.Arrow.ArrowTree       outputs child nodes/subtrees
isElem       Text.XML.HXT.Arrow.XmlArrow   lets through elements only
hasName      Text.XML.HXT.Arrow.XmlArrow   lets through nodes with the given name only
getText      Text.XML.HXT.Arrow.XmlArrow   extracts the text in the input text node (only if it is a text node)

There are more tree-traversing arrows in Control.Arrow.ArrowTree; they are not XML-specific. There are more XML-specific arrows in Text.XML.HXT.Arrow.XmlArrow: those named isXXX and hasXXX are filters, and those named getXXX extract data (and usually double as filters too).