3 Non-Linear Plumbing

Task: Attribute Names And Values

In this lesson, we extract and report slightly more structured data. We extracted data just fine in the last lesson, but as you will see in a moment, this time we need something new. In a file like file-a.xml, there are meta elements:

<html ... >
  <head>
    <meta attr="value" ... />
    ...

We want to fetch each attribute's name and value, turn the name into uppercase, and put the two in a tuple. (In general there may be none or there may be many, so we actually return a list of tuples. But HXT arrows do this for free anyway.)

Each HXT arrow we have learned before returns or processes just one datum. There is one arrow that returns a name (works for both tags and attributes). There is one arrow that returns an attribute value. If we just wanted one of them, the technique in the last lesson would suffice. But now we want both of them at the same time, and we want to process them both and combine them into a tuple. How to do that? This requires new operators. The purpose of this lesson is to introduce the new operators.

The following program performs this task. You can also download it as lesson-3.hs.

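-- Assumes HXT 9 or later (suggested by the use of withValidate).
import Text.XML.HXT.Core
import Data.Char (toUpper)

-- Assumed driver (its definition is not shown in this section):
-- run the arrow and print the list of results.
play :: FilePath -> IO ()
play filename = do
    results <- runX (processor filename)
    print results

processor :: FilePath -> IOSArrow XmlTree (String,String)
processor filename =
    readDocument [withValidate no] filename >>>
    getChildren >>>
    isElem >>> hasName "html" >>>
    getChildren >>>
    isElem >>> hasName "head" >>>
    getChildren >>>
    isElem >>> hasName "meta" >>>
    getTuple

-- For each attribute of the current node: (uppercased name, value).
getTuple :: IOSArrow XmlTree (String,String)
getTuple =
    getAttrl >>>
    getName &&& (getChildren >>> getText) >>>
    arr (map toUpper) *** returnA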

How to run this program? At a GHCi prompt:

Prelude> :load lesson-3.hs
*Main> play "file-a.xml"

Anatomy of The Program

Up to the call to getTuple we are just inputting a file and walking paths to meta nodes. This is covered in previous lessons. I now describe what's new.

getAttrl >>> ...

getAttrl gets all the attributes of the current node (which is a meta node in this case). As usual, the >>> operator causes the next arrow to receive these attributes, one at a time.

getName &&& (getChildren >>> getText) >>> ...

To get the name of an attribute, getName on the attribute node does the job. The value of an attribute is stored further down, in a child of the attribute node that is a text node, so getChildren >>> getText on the attribute node does the job. Now, how do we do both to the same node? That is exactly what the &&& operator does. Specialized to IOSArrow, its type is:

(&&&) :: IOSArrow x y0 -> IOSArrow x y1 -> IOSArrow x (y0,y1)

It calls the arrow on its left, then the arrow on its right, both with the same input; then it tuples up the two results.
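To see the fork in isolation, here is a minimal sketch that uses plain functions instead of HXT arrows (plain functions are also instances of Arrow, so &&& from Control.Arrow applies to them). The Attr record and its fields are hypothetical stand-ins for an attribute node; they are not part of HXT.

```haskell
import Control.Arrow ((&&&))

-- Hypothetical stand-in for an attribute node; not an HXT type.
data Attr = Attr { attrName :: String, attrValue :: String }

-- Feed the same Attr to both functions and pair up the two results.
nameAndValue :: Attr -> (String, String)
nameAndValue = attrName &&& attrValue
```

At a GHCi prompt, nameAndValue (Attr "charset" "utf-8") evaluates to ("charset","utf-8").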

In general usage, each of the two arrows may output many results (although in this lesson we get single results). &&& multiplies up all combinations. E.g., if the left arrow produces ["a","b"], and the right arrow ["x"], the overall result is [("a","x"),("b","x")]; likewise, if one of the arrows produces the empty list, the overall result is also the empty list.
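The example in the paragraph above can be tried without HXT. This sketch assumes that Kleisli [] from Control.Arrow (a function that returns a list of results) is a fair model of a list arrow for this purpose; the names ListK, leftArrow, rightArrow, and noResults are made up for illustration.

```haskell
import Control.Arrow (Kleisli (..), (&&&))

-- Assumption: Kleisli [] is used here as a simple model of a list arrow.
-- It is not an HXT type.
type ListK a b = Kleisli [] a b

-- Ignores its input and yields two results.
leftArrow :: ListK () String
leftArrow = Kleisli (const ["a", "b"])

-- Ignores its input and yields one result.
rightArrow :: ListK () String
rightArrow = Kleisli (const ["x"])

-- Ignores its input and yields no results.
noResults :: ListK () String
noResults = Kleisli (const [])

-- Every left result paired with every right result.
combos :: [(String, String)]
combos = runKleisli (leftArrow &&& rightArrow) ()

-- An empty side leaves nothing to pair with.
emptyCombos :: [(String, String)]
emptyCombos = runKleisli (leftArrow &&& noResults) ()
```

At a GHCi prompt, combos evaluates to [("a","x"),("b","x")] and emptyCombos to [].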

Note: even if one of the two arrows produces the empty list, both arrows are still called regardless. The empty result of one of them does not short-circuit the other. This is important to know when you use HXT arrows with side effects.
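What "no short-circuit" means is easiest to see with a side effect that leaves a trace. The sketch below is a simplified, hypothetical model of a list arrow with IO, written only to mirror the behaviour described in the note; it is not HXT's implementation (the real IOSArrow also threads state), and IOList, logged, and demo are invented names.

```haskell
import Control.Arrow (Arrow (..))
import qualified Control.Category as C
import Data.IORef (IORef, modifyIORef, newIORef, readIORef)

-- Hypothetical, simplified model of a list arrow with IO side effects.
newtype IOList a b = IOList { runIOList :: a -> IO [b] }

instance C.Category IOList where
  id = IOList (\x -> return [x])
  IOList g . IOList f = IOList (\x -> do
    ys <- f x
    concat <$> mapM g ys)

instance Arrow IOList where
  arr f = IOList (\x -> return [f x])
  first (IOList f) = IOList (\(x, z) -> do
    ys <- f x
    return [(y, z) | y <- ys])
  -- Run both sides on the same input, then pair up all combinations.
  IOList f &&& IOList g = IOList (\x -> do
    ys <- f x
    zs <- g x
    return [(y, z) | y <- ys, z <- zs])

-- An arrow that appends its label to a log, then yields the given results.
logged :: IORef [String] -> String -> [b] -> IOList a b
logged ref label results = IOList (\_ -> do
  modifyIORef ref (++ [label])
  return results)

-- The left side yields nothing, yet the right side still runs and logs.
demo :: IO ([(String, String)], [String])
demo = do
  ref <- newIORef []
  out <- runIOList (logged ref "left" [] &&& logged ref "right" ["x"]) ()
  calls <- readIORef ref
  return (out, calls)
```

Running demo returns ([],["left","right"]): the overall result is empty, but the log shows that both arrows were called.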

Now we have a tuple with the name and the value. We still want to change the name component before outputting the result. The essence of this is calling two arrows for the two components respectively, so that one arrow changes the name to uppercase, and the other passes the value through unchanged. This is done by the *** operator. Specialized to IOSArrow, its type is:

(***) :: IOSArrow x0 y0 -> IOSArrow x1 y1 -> IOSArrow (x0,x1) (y0,y1)

It calls the arrow on the left with the left component as input, and the arrow on the right with the right component as input; then it tuples up the two results. So we just have to construct an arrow for uppercasing a string (it can be done by a pure function, so we just lift that to the arrow level), another to change nothing, and combine them with ***. That is,

arr (map toUpper) *** returnA
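The expression arr (map toUpper) *** returnA can be tried on its own. In this sketch it is typed as a plain function on pairs rather than as an IOSArrow, an assumption made so that it runs without HXT; upperName is a made-up name.

```haskell
import Control.Arrow (arr, returnA, (***))
import Data.Char (toUpper)

-- The same expression as in getTuple, but typed as a plain function on
-- pairs (an assumption made so the sketch runs without HXT).
upperName :: (String, String) -> (String, String)
upperName = arr (map toUpper) *** returnA
```

At a GHCi prompt, upperName ("charset","utf-8") evaluates to ("CHARSET","utf-8"): the left component is uppercased and the right one passes through unchanged.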

In general usage, the two component arrows may each output no results or many, and *** multiplies up all combinations, without short-circuiting.
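Here is a small sketch of that multiplication for ***, again assuming Kleisli [] from Control.Arrow as a stand-in for a list arrow (it is not an HXT type). The names bothCases, withNext, and products are made up.

```haskell
import Control.Arrow (Kleisli (..), (***))
import Data.Char (toUpper)

-- Assumption: Kleisli [] models a list arrow; it is not an HXT type.
-- Yields the string as given and in uppercase: two results.
bothCases :: Kleisli [] String String
bothCases = Kleisli (\s -> [s, map toUpper s])

-- Yields the number and its successor: two results.
withNext :: Kleisli [] Int Int
withNext = Kleisli (\n -> [n, n + 1])

-- The left component goes to bothCases, the right component to withNext.
products :: [(String, Int)]
products = runKleisli (bothCases *** withNext) ("a", 1)
```

At a GHCi prompt, products evaluates to [("a",1),("a",2),("A",1),("A",2)]: two results on each side give four tuples.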

In practice, you probably go much further than this toy. Maybe you compute something based on the name and the value; maybe you pass them to another function or arrow for further processing; maybe in some stage you need both of them and in some other stage you process them separately. Still, the &&& and *** operators are an essential stepping stone: they fork off dataflow, and then you can do whatever you want.
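As one illustration of going further, the sketch below forks, processes each branch separately, and then merges the tuple back into a single value. It uses plain functions rather than HXT arrows, and the Attr record and describe are hypothetical names invented for this example.

```haskell
import Control.Arrow (arr, (&&&), (>>>))
import Data.Char (toUpper)

-- Hypothetical stand-in for an attribute node; not an HXT type.
data Attr = Attr { attrName :: String, attrValue :: String }

-- Fork into (uppercased name, length of value), then merge the tuple
-- into one report line.
describe :: Attr -> String
describe =
    (attrName >>> map toUpper) &&& (attrValue >>> length) >>>
    arr (\(name, len) -> name ++ ": " ++ show len ++ " chars")
```

At a GHCi prompt, describe (Attr "charset" "utf-8") evaluates to "CHARSET: 5 chars".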

New Friends from This Lesson

Name      From Module                  Summary
&&&       Control.Arrow (GHC)          fork processing
***       Control.Arrow (GHC)          separate processing
getAttrl  Text.XML.HXT.Arrow.XmlArrow  outputs attribute nodes