Top

Appendix

# Parsec Generally

Albert Y. C. Lai, trebla [at] vex [dot] net

Let's begin with a very basic Parsec example. It parses a lot of numbers with + signs between them, e.g., 3+1+4, and computes the sum.

```import Text.Parsec
import Text.Parsec.String

play :: String -> Either ParseError Integer
play s = parse pmain "parameter" s

pmain :: Parser Integer
pmain = do
x <- pnum `chainl1` pplus
eof
return x

pnum = read `fmap` many1 digit

pplus = char '+' >> return (+)
```

`Parser` is a special case. There are 3 directions of generalization:

• user state (like StateT): `()` here
• input type: `[Char]` here
• combine with another monad: `Identity` here

And so Text.Parsec.String defines

```type Parser a = ParsecT [Char] () Identity a
```

The next 3 sections show how to use other states, other input types, and other monads.

## User State

This example shows using user state. I use `Int` for my state, and it counts how many numbers are successfully parsed.

```import Text.Parsec

play :: String -> Either ParseError (Integer,Int)
play s = runParser pmain 0 "parameter" s  -- 0 is the initial value

-- type Parsec s u a = ParsecT s u Identity a
pmain :: Parsec [Char] Int (Integer,Int)
pmain = do
x <- pnum `chainl1` pplus
eof
n <- getState  -- obtain the user state value
return (x,n)

pnum = do
x <- read `fmap` many1 digit
modifyState (1 +)  -- change the user state value
return x

pplus = char '+' >> return (+)
```

## Input Type

### Text and Bytestring

Besides `[Char]` as input type, `Text` (both lazy and non-lazy) and `Bytestring` (both lazy and non-lazy) are also supported out of the box. This example shows non-lazy `Text`; feel free to try `Bytestring` and lazy variants too.

```import Text.Parsec
import Text.Parsec.Text
import Data.Text

play :: Text -> Either ParseError Integer
play s = parse pmain "parameter" s

example :: Text
example = pack "3+1+4"

-- from Text.Parsec.Text: type Parser a = ParsecT Text () Identity a
pmain :: Parser Integer
pmain = do
x <- pnum `chainl1` pplus
eof
return x

pnum = read `fmap` many1 digit

pplus = char '+' >> return (+)
```

Almost no change from the basic example! This is because Text.Parsec.Text provides an appropriate `Stream` instance:

```instance (Monad m) => Stream Text m Char
```

This instance says that `Text` is an input type with basic unit `Char`. Then `digit` and `char` work because they only require the basic unit to be `Char`, e.g., `digit` has type:

```digit :: Stream s m Char => ParsecT s u m Char
```

### User Token Type

Another generalization: the basic unit of parsing does not have to be `Char`; it can be your own token type. Parsec already provides this general `Stream` instance:

```instance Monad m => Stream [tok] m tok
```

Setting `tok=Char` is exactly how `[Char]` is a supported input type with basic unit `Char`. We can set `tok=Token` for our own `Token` type, and the input type is then `[Token]`.

Our `Token` type will have two cases: numbers and plus signs. It also records line numbers and column numbers: Presumably, a tokenizer converts a string or a file to these tokens, which is where line numbers and column numbers come from. We have to record them because we have to pass them to Parsec for error message purposes. Here is our `Token` type:

```data Token = N {lin,col :: Int, val :: Integer}
| Plus {lin,col :: Int}
instance Show Token where
show N{val=x} = "number " ++ show x
show Plus{} = "plus"
-- instance Show Token is needed for eof so it can produce error messages
```

With our own basic unit instead of `Char`, we cannot use the nice basic parsers in Text.Parsec.Char; we must write our own. But we can learn from how Text.Parsec.Char does it: by calling `tokenPrim`.

`tokenPrim` parses one token. It is customized by 3 parameters:

1. a function for showing a token, for error messages
2. a function for updating Parsec's internal state of file position (filename, line number, column number); this function receives the current position, the current token, and the input after; it should return the position of the next (not current) token
3. a function to determine whether a token is accepted or reject, and if accepted, what value to return; so the function returns `Nothing` to reject, a `Just` to accept
```tokenPrim :: (Stream s m t) =>
(t -> String)
-> (SourcePos -> t -> s -> SourcePos)
-> (t -> Maybe a)
-> ParsecT s u m a
```

We use `tokenPrim` to re-implement `pnum` and `pplus`. `pnum` accepts an `N` token and extracts its number value. `pplus` accepts a `Plus` token and returns `(+)`.

Here is the complete code:

```import Text.Parsec

data Token = N {lin,col :: Int, val :: Integer}
| Plus {lin,col :: Int}
instance Show Token where
show N{val=x} = "number " ++ show x
show Plus{} = "plus"
-- instance Show Token is needed for eof so it can produce error messages

example = [N 1 3 111, Plus 1 10, N 2 1 222]
wrong = [N 1 3 111, Plus 1 10, N 2 1 222, N 2 5 333]
bad = [N 1 3 111, Plus 1 10, N 2 1 222, Plus 2 5]

play :: [Token] -> Either ParseError Integer
play s = parse pmain "parameter" s

-- type Parsec s u a = ParsecT s u Identity a
pmain :: Parsec [Token] () Integer
pmain = do
x <- pnum `chainl1` pplus
eof
return x

pnum = tokenPrim show update_pos get_num where
get_num N{val=x} = Just x
get_num _ = Nothing

pplus = tokenPrim show update_pos is_plus where
is_plus Plus{} = Just (+)
is_plus _ = Nothing

update_pos :: SourcePos -> Token -> [Token] -> SourcePos
update_pos pos _ (tok:_) = setSourceLine (setSourceColumn pos (col tok)) (lin tok)
update_pos pos _ [] = pos
{-
If there is no next token, I keep the position unchanged for simplicity.
A better answer: use the length of the current token to deduce the ending
position.
-}
```

### User Input Type

We now see how to support more input types. Still using `Token` as the basic unit, let's just switch the container type: let's use `Vector`. We just need to provide a `Stream` instance. Its method `uncons` simply needs to tell: Is the input empty? If not, what are the head and the tail?

```instance (Monad m) => Stream (Vector a) m a where
uncons v | V.null v = return Nothing
| otherwise = return (Just (V.head v, V.tail v))
```

With this instance, the parser needs little change, since it is not too sensitive to the input type. Here is the complete code:

```{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances #-}
import Text.Parsec
import qualified Data.Vector as V
import Data.Vector(Vector)

data Token = N {lin,col :: Int, val :: Integer}
| Plus {lin,col :: Int}
instance Show Token where
show N{val=x} = "number " ++ show x
show Plus{} = "plus"
-- Show Token is needed for eof so it can produce error messages

example, wrong, bad :: Vector Token
example = V.fromList [N 1 3 111, Plus 1 10, N 2 1 222]
wrong = V.fromList [N 1 3 111, Plus 1 10, N 2 1 222, N 2 5 333]
bad = V.fromList [N 1 3 111, Plus 1 10, N 2 1 222, Plus 2 5]

play :: Vector Token -> Either ParseError Integer
play s = parse pmain "parameter" s

-- type Parsec s u = ParsecT s u Identity
pmain :: Parsec (Vector Token) () Integer
pmain = do
x <- pnum `chainl1` pplus
eof
return x

pnum = tokenPrim show update_pos get_num where
get_num N{val=x} = Just x
get_num _ = Nothing

pplus = tokenPrim show update_pos is_plus where
is_plus Plus{} = Just (+)
is_plus _ = Nothing

update_pos :: SourcePos -> Token -> Vector Token -> SourcePos
update_pos pos _ v
| V.null v = pos  -- for simplicity; there is a better answer
| otherwise = setSourceLine (setSourceColumn pos (col tok)) (lin tok)

instance (Monad m) => Stream (Vector a) m a where
uncons v | V.null v = return Nothing
| otherwise = return (Just (V.head v, V.tail v))
```

Parsec's monad is actually a monad transformer. In common usage it just transforms `Identity` to give a plain monad, but we can make it transform other monads for extra features.

The first example transforms `IO`. The parser can now print messages after parsing each number. `ParsecT` is also an instance of `MonadIO` when used this way, so we can just use `liftIO`.

```import Text.Parsec

play :: String -> IO (Either ParseError Integer)
play s = runParserT pmain () "parameter" s

pmain :: ParsecT [Char] () IO Integer
pmain = do
x <- pnum `chainl1` pplus
eof
return x

pnum = do
x <- read `fmap` many1 digit
liftIO (putStrLn "bling!")
return x

pplus = char '+' >> return (+)
```

If you know me, you know that I always sneak in `ContT` whenever I get to choose a monad. Here it is: I make Parsec transform `Cont` (`ContT` of `Identity`). `ParsecT` is also an instance of `MonadCont` when used this way, so we can directly use `callCC`.

In the following example, we add some bad code to `pnum`, but we use `callCC` to skip the bad code. The parser behaves as original.

```import Text.Parsec

play :: String -> Either ParseError Integer
play s = runCont (runParserT pmain () "parameter" s) id

pmain :: ParsecT [Char] () (Cont r) Integer
pmain = do
x <- pnum `chainl1` pplus
eof
return x

pnum :: ParsecT [Char] () (Cont r) Integer
pnum = callCC \$ \k -> do
x <- read `fmap` many1 digit
k x
-- the following two lines are skipped
many1 letter
return (x+10000)

pplus :: ParsecT [Char] () (Cont r) (Integer -> Integer -> Integer)
pplus = char '+' >> return (+)
```

### Yielding Parser: Introduction

With call/cc and mutable references, we can make and use generators (the yield command), which is in many people's Parsec wish list! Because in common usage, Parsec parses the input completely and returns one complete answer, or parses until a parse error and returns no answer, rather than incrementally reporting intermediate results. This is not what some applications want: they may want the intermediate results up to the parse error, or they may want to avoid spending memory for the complete, huge answer. But now we wield the power to yield!

My basic implementation of generators is in the Appendix. Here I just say how to use it. Use `mkgen` to create a generator from a body, like this:

```g <- mkgen (\yield c0 -> do
c1 <- yield a0
c2 <- yield a1
return b
)
```

The body receives its `yield` command and an initial value as function parameters. Calling `mkgen` does not start the body yet; it only returns the generator. Calling the generator `g` starts it (providing the initial value `c0`) and runs until hitting `yield`. Calling the generator again resumes (providing the return values of `yield`, here `c1` or `c2`). Like this:

```dot0 <- g c0  -- dot0 = More a0
dot1 <- g c1  -- dot1 = More a1
dot2 <- g c2  -- dot2 = End b
```

`g` takes a parameter to be communicated to the body, and returns a `More` or `End` depending on whether the body is yielding or finishing. `More` and `End` are defined in the Yield module.

Here is a simple Parsec example with yield. The parser normally fetches a character, yields it, increments a count, and repeats; but if it is told `False` when resuming, it finishes and returns the count as the final answer. The caller of the parser just calls the generator a few times, with the last time passing in `False` and getting the count. Therefore, though we use the infinite file /dev/zero as input, the parser reads and finishes finitely.

```import Text.Parsec
import Yield

main :: IO (Either ParseError Int)
main = runContT cmain return

cmain :: ContT r IO (Either ParseError Int)
cmain = do
g <- mkgen (\yield x -> runParserT (pmain (lift . yield) x) () "/dev/zero" s)
g True >>= liftIO . print
g True >>= liftIO . print
End z <- g False
return z

pmain yield b = loop b 0 where
loop False n = return n
loop True n = do
c <- anyChar
b <- yield c
loop b \$! (n+1)
```

You may note that the `yield` command and the generator do not live in `ParsecT`, but rather simply in `ContT r IO`. Accordingly, I need to add a level of `lift` to the `yield` command for use inside the parser. I do this because it suffices for my purpose. It can be done either way; there is no problem creating and using a generator at the `ParsecT` level, but it is usually unnecessary because we usually do not use Parsec operations between two invocations of the generator.

### Yielding Parser: Tree Application

And now, we put Yield to good use. When we talk about incremental yielding parsers, we likely have XML parsing in mind! So here is a language much simpler than XML but with the same basic idea:

A tree is a tag, optionally followed by subtrees; subtrees must be enclosed in square brackets. A tag is one or more letters and/or digits. Spacing is usually optional, except that naturally two consecutive tags need spaces to separate.

Tree ::= Tag [ '`[`' {Tree} '`]`' ]
Tag ::= alphanum{alphanum}

An example is “`f[x g[a b]]`”. The root tree has tag f and two children. The first child has tag x and is a leaf. The second child has tag g and two leaves, a and b.

The parser yields two kinds of events to mark the beginning and the end of each subtree: after parsing a tag, yield a `Start` event giving that tag; after detecting the end of a subtree, yield a `Stop` event giving that subtree.

Here is the complete example:

```import Control.Applicative((<*))
import Data.List(intercalate)
import Text.Parsec
import Yield(mkgen)

data Tree = Tree { tag :: String, sub :: [Tree] }
instance Show Tree where
show Tree{tag=t, sub=s} = case s of
[] -> t
_ -> t ++ "[" ++ intercalate " " (map show s) ++ "]"

data Event = Start String | Stop Tree deriving Show

pmain yield = spaces >> (ptree <* eof)
where
ptree = do
name <- ptag
yield (Start name)
cs <- option [] (between popen pclose (many ptree))
let tree = Tree name cs
yield (Stop tree)
return tree
ptag = (many1 alphaNum <?> "tagname") <* spaces
popen = char '[' <* spaces
pclose = char ']' <* spaces

main :: IO ()
main = runContT cmain return

cmain :: ContT r IO ()
cmain = do
let s = "f [ x g[a b]" ++ cycle "huge["
g <- mkgen (\yield _ -> runParserT (pmain (lift . yield)) () "hardcoded" s)
replicateM_ 12 (g () >>= liftIO . print)
```

Despite the infinite input, the whole program finishes because it does not resume the parser infinitely. It just obtains the first few events and quit.

This parser still spends memory to build the whole tree. You can change this if you want and if the caller can live with it: do not build any subtree, do not return any tree (return `()` instead), and at `Stop` events, give just the tag and not a subtree.

## Appendix: Generator and Yield

Here is the Yield module. It is explained in my Cont Monad article.

```{-| Yield-style generators. -}
module Yield where

import Data.IORef

{-|
Type of return values of generators. @More@ means the generator yields. @End@
means the generator finishes. See below for examples.
-}
data Dot a b = More a | End b deriving Show

{-|
Create a yield-style generator from a body. Example:

> g <- mkgen (\yield c0 -> do
>                c1 <- yield a0
>                c2 <- yield a1
>                return b
>            )

Then we can use:

> dot0 <- g c0  -- dot0 = More a0
> dot1 <- g c1  -- dot1 = More a1
> dot2 <- g c2  -- dot2 = End b

Further calls to @g@ return @End b@ too.

Although usually @m=n@, i.e., @g@ and the body are in the same monad as the
@mkgen@ call, technically they can be different. The @mkgen@ call
is in m with MonadIO for allocating IORef. The body and @g@ are in n with
-}
((a -> n c) -> c -> n b) -> m (c -> n (Dot a b))
mkgen body = do
inside <- liftIO (newIORef undefined)
outside <- liftIO (newIORef undefined)
let yield y = do
callCC (\ki -> do
liftIO (writeIORef inside ki)
o (More y)
)
next x = do
callCC (\ko -> do
liftIO (writeIORef outside ko)
i x
)
start x = do
e <- body yield x
liftIO (writeIORef inside (\_ -> return (End e)))