The basic type of a monadic parser is a function from an input string to a parse result, along with the rest of the string to be parsed.
One example is ReadS a = String -> [(a,String)], where returning a null list indicates no parse, and returning multiple values allows a parser to indicate ambiguity.
There are a few deficiencies in this data structure:
- The representation of ambiguity can lead to large space leaks, as the traditional combinators allow for unlimited backtracking
- It is difficult to wedge good error reporting into this setup.
The parsec parser, instead of returning a list of parses returns one of four results of the parse:
- I errored and did not consume input
- I errored and did consume input
- I succeded and did not consume input
- I succeded and did consume input
The idea is that once a parser consumes any input (that is, looks ahead more than one charecter) we prohibit backtracking. This limitation on backtracking means that we can drop the beging of the token-stream sooner, allowing it to be cleaned up by the garbage collector. There are also advantages to error reporting to making these distinctions.
Partridge & Wright seem to be the first to have introduced this splitting of the return value of a parse, but I can't find a non-paywalled version of their paper "Predictive parser combinators need four values to report errors." It seems like exactly the sort of paper I should be reading, but I'm also not sure on how to get in touch with the authors. Edit: Thanks to Chung-chieh Shan for sending me a copy of Partridge & Wright.
We still have a problem - we still need to hang on to the input string until we know that the parser succeeded or failed. In many cases we know that the parser consumes input long before we know that it was successful.
Parsec combines two data structures to return these four values:
data Consumed a = Consumed a | Empty a
data Reply a = Ok a State Message | Error Message
type Parser a = State -> Cosumed (Reply a)
Now our choice combinator can determine if a parser consumes input before we finish the parse - this means that we allow the GC to drop the head of the input as soon as the parser consumes any of it, solving the mentioned memory leak.
Part II - My obsession with functionalizing monad transformers
I have a bit of an obsession with the dual nature of data and functions, and converting algebraic data structures to their equivalent function form to see what I can uncover.
For example, in the mtl the ErrorT monad transformer is defined as follows:
newtype ErrorT e m a = ErrorT {
runErrorT :: m (Either e a)
}
throwError :: e -> ErrorT e m a
throwError = ErrorT . return . Left
When we give this monad semantics, in between each (>>=) we check the value on the LHS, ad if it's in Left we short circuit what's on the RHS so that the error value returns all the way to the end of the computation. This means at each stage of the computation we have to do a case analysis of the inner Eitehr value. Since the inner type is 'm (Either e a)' we also need to perform a (>>=) operation in the inner monad to perform (>>=) in ErrorT.
An equivalent type is:
newtype ErrorT e m a = ErrorT {
unError :: forall r . (e -> m r) -- error!
-> (a -> m r) -- success!
-> m r
}
The second function is our success continuation - it's passed in to an action, and called if the action successfully returns a value -
return :: a -> ErrorT e m a
return x = ErrorT $ \_ successK ->
successK x
The first function is the error continuation, and is called if an action needs to terminate abnormally:
throwError :: e -> ErrorT e m a
throwError e = ErrorT $ \errorK _ ->
errorK e
We then weave the continuations together in the implementation of (>>=) to get the
short-curcuiting we want:
(>>=) :: ErrorT e m a -> (a -> ErrorT e m b) -> ErrorT e m b
m >>= f = ErrorT $ \topErrorK topSuccessK->
unError m topErrorK $ \x ->
unError (f x) topErrorK topSuccessK
The LHS is given our top-level erorr handler to call if it errors out. If it succeds, it calls a lambda which evaluates the RHS. The RHS is given the same error handler as the LHS, and the success continuation for the RHS is the success continuation for the expression as a whole. So successes move from left to right, but if there's an error it can only go to one place.
Interesting points to note:
- There's no case analysis on data structures - short circuiting works because every action gets passed the same error continuation (unless we implement a 'catch' combinator).
- There are no constraints on the nature of the 'm' type variable. ErrorT is a monad independent of whatever it's wrapping.
I haven't benchmarked whether or not this is faster for anything, but I find the whole thing a lot of fun.
Part III - A faster Parsec 3
Parsec version 3 was released on hackage a bit back, and it improved on the prior version in two ways:
1) It was a monad transformer. This is pretty fun - as an exercise I wrote a unification engine in a monad, and then wrapped parsec around it. When it hit a syntax error in equality terms it could print the in-progress results of performing unification up until that point. Very fun.
2) It was parameterized over the input type. The previous version required a list of tokens as input.
The downside is that (when using the non-transformer compatibility layer) parsec-3 is 1.8x slower in some benchmarks as parsec-2.
But how can it be made faster without losing the new abstractions? The first thing I tried was to have parsec be split into two parsers - one transformer and one not, with two implementations of the core combinators. The core combinators would be moved into a type-class, and the higher-level combinators would then be polymorphic over either of them. Foks writing parsers could write them polymorphic, and then the folks running the parsers only pay for the new abstractions if they need it.
It worked, but the type-class itself didn't make any sense - it was more of a "put what I need over in this bucket" job than a careful consideration of what the core primitives of a monadic parser like parsec truly are. I also seem to remember that in introduced a problem in the compatibility layer, but it's been a while since I did that testing. You can see the haddocks here:
Text.Parsec.Class
Text.Parsec.Core
Text.Parsec.Combinator
This is also a radical restructuring of the parsec API - new constraints and new modules, and changing the meaning of another. Lots of fun.
Another approach which is a much smaller change to the visible part of parsec but is a much more radical change to the inner workings is to do to ParsecT what we did to ErrorT - return via continuations rather than an algebraic data type. Where ErrorT needed two continuations, ParsecT requires four:
newtype ParsecT s u m a
= ParsecT {unParser :: forall b .
State s u
-> (a -> State s u -> ParseError -> m b) -- consumed ok
-> (ParseError -> m b) -- consumed error
-> (a -> State s u -> ParseError -> m b) -- empty ok
-> (ParseError -> m b) -- empty error
-> m b
}
When the parser errors we call one of the error continuations with the information about the error. When the parse succeeds we call one of the success continuations, passing along the parsed value and the new state of the parse.
And this is practically as fast as parsec-2 for folks not using the new abstractions in parsec-3. I believe it's because we no longer pay for the abstraction in the core cobinators - none of the primitives or combinators place any constraints on the type of 'm', just as for the continuation-based ErrorT transformer.
But what of Patridge & Wright's space leak? Where is the laziness introduced in the Parsec technical report? I've gone from the nested structure back to a flat structure. How can this be as fast as the Parsec of the report if I've re-introduced the space leak?
It was the case analisys on the not-lazy-enough return value of the parser which introduced the space leak, but we don't have have that. We just pass continuations in to the parsers, which may call them as they will. As long as the core primitves aren't capable of calling the "I haven't consummed input" continuations after they have consumed input then we're free to garbage collect those continuations as well as the consummed bits of input. Space leak averted.
The darcs repo for the continuation based parsec is here: http://community.haskell.org/~aslatter/code/parsec/cps/
Appendix A: Further adventures in ErrorT
In case you needed convincing that the above formulation of ErrorT is equivalent to that in the mtl:
catch :: ErrorT e m a -> (e -> ErrorT e m a) ->ErrorT e m a
catch m handler = ErrorT $ \topErrorK topSuccessK ->
let errorK e = unError (handler e) topErrorK topSuccessK
in unError m errorK topSuccessK
runErrorT :: Monad m => ErrorT e m a -> m (Either e a)
runErrorT (ErrorT f) = f errorK successK
where successK = return . Right
errorK = return . Left
