FS

Monday, December 27, 2010

Editing XML in Haskell

edit: Here's an hpaste for the solution I eventually came up with: http://hpaste.org/42628/xml_cursor_monad

I recently found myself wanting to make small tweaks to an XML file not under my control before processing it in my Haskell app.

My end goal would be a way to a) declare what in the XML I want to edit and then b) apply a function to the located element and have it update in place.

The hxt (Haskell Xml Toolkit) has a lot of pieces, and includes and XPath parser, so I looked at it first. But like XSLT, hxt seems geared towards XML processing - not editing. It has tools for applying an translation recursively through a tree, but that requires finding an element based on a predicate on the element, not based on it's location within the document. hxt does let you extract elements based on location, but I don't see how to put the original document back together again. Maybe I'm missing something.

Then I noticed that the xml package (sometimes referred to as xml-light) has a Cursor written for it! So all I need to do is navigate the cursor down to where I need it, apply the update function and then I'm done. That's declarative enough for me.

There were two problems with this:

A lot of the cursor manipulation functions return maybe types
I didn't feel comfortable composing functions on cursors - if a sub-function goes off the deep end it could have left the cursor anywhere in the document DOM

So I did what any other Haskell programmer would do - I wrote my own XML editing monad to fix and then encapsulate the problems above.

> data Update a = ...

The Update type has instances for Monad, Functor, and Applicative - and to handle the failing traversal functions it has instances for Alternative and MonadPlus.

It has the primitive operations:


> runUpdate :: Cursor -> Update a -> Maybe (a, Cursor)

> perform :: (Cursor -> Maybe Cursor) -> Update ()

> asks :: (Cursor -> a) -> Update a

These are used to wrap up all of the functions from Text.XML.Light.Cursor in a straightforward way.

To give me more control over the composition of cursor update there are also the following primitives:


> sandbox :: Update a -> Update a

> run :: Update a -> Update ()

The function 'sandbox' executes the passed in action, but contains it to the current scope of the cursor - the action may not access the parent or siblings of the current node. In addition, the cursor is returned to its current position regardless of where the passed in action left it.

The function 'run' is the same as 'sandbox' - except that if the passed action fails we pretend that nothing happened.

So go nuts - you can now declare edits and traversals without worrying about how to fit them in to the bigger picture. We have combinators for that.

Am I off the deep end with this? Are there other tools on Hackage I should be using?

Monday, December 28, 2009

Adventures in Parsec

Part I - An introduction to Parsec

The basic type of a monadic parser is a function from an input string to a parse result, along with the rest of the string to be parsed.

One example is ReadS a = String -> [(a,String)], where returning a null list indicates no parse, and returning multiple values allows a parser to indicate ambiguity.

There are a few deficiencies in this data structure:

The representation of ambiguity can lead to large space leaks, as the traditional combinators allow for unlimited backtracking
It is difficult to wedge good error reporting into this setup.

The parsec parser, instead of returning a list of parses returns one of four results of the parse:

I errored and did not consume input
I errored and did consume input
I succeded and did not consume input
I succeded and did consume input

The idea is that once a parser consumes any input (that is, looks ahead more than one charecter) we prohibit backtracking. This limitation on backtracking means that we can drop the beging of the token-stream sooner, allowing it to be cleaned up by the garbage collector. There are also advantages to error reporting to making these distinctions.

Partridge & Wright seem to be the first to have introduced this splitting of the return value of a parse, but I can't find a non-paywalled version of their paper "Predictive parser combinators need four values to report errors." It seems like exactly the sort of paper I should be reading, but I'm also not sure on how to get in touch with the authors. Edit: Thanks to Chung-chieh Shan for sending me a copy of Partridge & Wright.

We still have a problem - we still need to hang on to the input string until we know that the parser succeeded or failed. In many cases we know that the parser consumes input long before we know that it was successful.

Parsec combines two data structures to return these four values:


data Consumed a = Consumed a | Empty a

data Reply a = Ok a State Message | Error Message

type Parser a = State -> Cosumed (Reply a)

Now our choice combinator can determine if a parser consumes input before we finish the parse - this means that we allow the GC to drop the head of the input as soon as the parser consumes any of it, solving the mentioned memory leak.

Part II - My obsession with functionalizing monad transformers

I have a bit of an obsession with the dual nature of data and functions, and converting algebraic data structures to their equivalent function form to see what I can uncover.

For example, in the mtl the ErrorT monad transformer is defined as follows:


newtype ErrorT e m a = ErrorT {
  runErrorT :: m (Either e a)
}

throwError :: e -> ErrorT e m a
throwError = ErrorT . return . Left

When we give this monad semantics, in between each (>>=) we check the value on the LHS, ad if it's in Left we short circuit what's on the RHS so that the error value returns all the way to the end of the computation. This means at each stage of the computation we have to do a case analysis of the inner Eitehr value. Since the inner type is 'm (Either e a)' we also need to perform a (>>=) operation in the inner monad to perform (>>=) in ErrorT.

An equivalent type is:


newtype ErrorT e m a = ErrorT {
  unError :: forall r . (e -> m r)  -- error!
                       -> (a -> m r)  -- success!
                       -> m r
}

The second function is our success continuation - it's passed in to an action, and called if the action successfully returns a value -


return :: a -> ErrorT e m a
return x = ErrorT $ \_ successK ->
   successK x

The first function is the error continuation, and is called if an action needs to terminate abnormally:


throwError :: e -> ErrorT e m a
throwError e = ErrorT $ \errorK _ ->
              errorK e

We then weave the continuations together in the implementation of (>>=) to get the
short-curcuiting we want:


(>>=) :: ErrorT e m a -> (a -> ErrorT e m b) -> ErrorT e m b
m >>= f = ErrorT $ \topErrorK topSuccessK->
   unError m topErrorK $ \x ->
   unError (f x) topErrorK topSuccessK

The LHS is given our top-level erorr handler to call if it errors out. If it succeds, it calls a lambda which evaluates the RHS. The RHS is given the same error handler as the LHS, and the success continuation for the RHS is the success continuation for the expression as a whole. So successes move from left to right, but if there's an error it can only go to one place.

Interesting points to note:

There's no case analysis on data structures - short circuiting works because every action gets passed the same error continuation (unless we implement a 'catch' combinator).
There are no constraints on the nature of the 'm' type variable. ErrorT is a monad independent of whatever it's wrapping.

I haven't benchmarked whether or not this is faster for anything, but I find the whole thing a lot of fun.

Part III - A faster Parsec 3

Parsec version 3 was released on hackage a bit back, and it improved on the prior version in two ways:

1) It was a monad transformer. This is pretty fun - as an exercise I wrote a unification engine in a monad, and then wrapped parsec around it. When it hit a syntax error in equality terms it could print the in-progress results of performing unification up until that point. Very fun.

2) It was parameterized over the input type. The previous version required a list of tokens as input.

The downside is that (when using the non-transformer compatibility layer) parsec-3 is 1.8x slower in some benchmarks as parsec-2.

But how can it be made faster without losing the new abstractions? The first thing I tried was to have parsec be split into two parsers - one transformer and one not, with two implementations of the core combinators. The core combinators would be moved into a type-class, and the higher-level combinators would then be polymorphic over either of them. Foks writing parsers could write them polymorphic, and then the folks running the parsers only pay for the new abstractions if they need it.

It worked, but the type-class itself didn't make any sense - it was more of a "put what I need over in this bucket" job than a careful consideration of what the core primitives of a monadic parser like parsec truly are. I also seem to remember that in introduced a problem in the compatibility layer, but it's been a while since I did that testing. You can see the haddocks here:

Text.Parsec.Class
Text.Parsec.Core
Text.Parsec.Combinator

This is also a radical restructuring of the parsec API - new constraints and new modules, and changing the meaning of another. Lots of fun.

Another approach which is a much smaller change to the visible part of parsec but is a much more radical change to the inner workings is to do to ParsecT what we did to ErrorT - return via continuations rather than an algebraic data type. Where ErrorT needed two continuations, ParsecT requires four:


newtype ParsecT s u m a
  = ParsecT {unParser :: forall b .
               State s u
            -> (a -> State s u -> ParseError -> m b) -- consumed ok
            -> (ParseError -> m b)                   -- consumed error
            -> (a -> State s u -> ParseError -> m b) -- empty ok
            -> (ParseError -> m b)                   -- empty error
            -> m b
           }

When the parser errors we call one of the error continuations with the information about the error. When the parse succeeds we call one of the success continuations, passing along the parsed value and the new state of the parse.

And this is practically as fast as parsec-2 for folks not using the new abstractions in parsec-3. I believe it's because we no longer pay for the abstraction in the core cobinators - none of the primitives or combinators place any constraints on the type of 'm', just as for the continuation-based ErrorT transformer.

But what of Patridge & Wright's space leak? Where is the laziness introduced in the Parsec technical report? I've gone from the nested structure back to a flat structure. How can this be as fast as the Parsec of the report if I've re-introduced the space leak?

It was the case analisys on the not-lazy-enough return value of the parser which introduced the space leak, but we don't have have that. We just pass continuations in to the parsers, which may call them as they will. As long as the core primitves aren't capable of calling the "I haven't consummed input" continuations after they have consumed input then we're free to garbage collect those continuations as well as the consummed bits of input. Space leak averted.

The darcs repo for the continuation based parsec is here: http://community.haskell.org/~aslatter/code/parsec/cps/

Appendix A: Further adventures in ErrorT

In case you needed convincing that the above formulation of ErrorT is equivalent to that in the mtl:


catch :: ErrorT e m a -> (e -> ErrorT e m a) ->ErrorT e m a
catch m handler = ErrorT $ \topErrorK topSuccessK ->
 let errorK e = unError (handler e) topErrorK topSuccessK
 in unError m errorK topSuccessK

runErrorT :: Monad m => ErrorT e m a -> m (Either e a)
runErrorT (ErrorT f) = f errorK successK
 where successK = return . Right
       errorK = return . Left

Saturday, April 18, 2009

Using Haskeline

Earlier today I decided to unearth an old project of mine - figuring that the best way to learn two languages was to implement one in the other, I wrote a MUMPS interpreter in Haskell. I was learning MUMPS for work, and Haskell for fun.

Back when I wrote it, I used readline in the REPL part of the interpreter - during the cleanup I wanted to move away from readline as GHC doesn't ship with it any more, and sometimes it can be a pain to install on its own. So I switched to Haskeline. It doesn't ship with GHC either, but it's proven easier for me to install.

Haskeline has got a really friendly API, with all of the functions operating inside the InputT m monad transformer. "Great," I think, "I can just pile this on top of my existing monad transformers stack in the interpreter!"

All was not so simple, as InputT has it's own instance of MonadState and MonadReader which allows the user of the library to peer into the guts of the implementation. But I didn't want to monkey with my text entry, I just wanted it to work and get out of my way, and I wanted the rest of my code to use the MonadState instance further down the stack that I had already set up.

So I wrote a small wrapper for Haskeline that's more friendly to mtl-style monad transformer composition. As written, it only composes with MonadIO and MonadState, but it would be straightforward to do more.

My wrapper uses HaskelineT instead of InputT, and exposes the same core functions as Haskeline (except for withInterrupt). It doesn't do anything I couldn't do by peppering lifts all over the place, but this way feels a but cleaner to me - Haskeline keeps its workings to itself, and I don't have to think about the order of the layered monad transformers.


{-# LANGUAGE FlexibleInstances
           , MultiParamTypeClasses
           , UndecidableInstances
           , GeneralizedNewtypeDeriving
  #-}

import qualified System.Console.Haskeline as H
import System.Console.Haskeline.Completion
import System.Console.Haskeline.MonadException

import Control.Applicative
import Control.Monad.State

newtype HaskelineT m a = HaskelineT {unHaskeline :: H.InputT m a}
 deriving (Monad, Functor, Applicative, MonadIO, MonadException, MonadTrans, MonadHaskeline)

runHaskelineT :: MonadException m => H.Settings m -> HaskelineT m a -> m a
runHaskelineT s m = H.runInputT s (unHaskeline m)

runHaskelineTWithPrefs :: MonadException m => H.Prefs -> H.Settings m -> HaskelineT m a -> m a
runHaskelineTWithPrefs p s m = H.runInputTWithPrefs p s (unHaskeline m)

class MonadException m => MonadHaskeline m where
    getInputLine :: String -> m (Maybe String)
    getInputChar :: String -> m (Maybe Char)
    outputStr :: String -> m ()
    outputStrLn :: String -> m ()


instance MonadException m => MonadHaskeline (H.InputT m) where
    getInputLine = H.getInputLine
    getInputChar = H.getInputChar
    outputStr = H.outputStr
    outputStrLn = H.outputStrLn


instance MonadState s m => MonadState s (HaskelineT m) where
    get = lift get
    put = lift . put

instance MonadHaskeline m => MonadHaskeline (StateT s m) where
    getInputLine = lift . getInputLine
    getInputChar = lift . getInputChar
    outputStr = lift . outputStr
    outputStrLn = lift . outputStrLn

Sunday, February 15, 2009

MaybeT - The CPS Version

> {-# LANGUAGE Rank2Types #-}
> import Control.Monad

I think I finally understand writing code into continuation passing style. I've understood it at an academic level for some time - but that's different from being able to write the code.

This post presents a different implementation of the Maybe monad transformer - usually presented as so:


data MaybeT m a = MaybeT {runMaybeT :: m (Maybe a)}

which can be used to add the notion of short-circuiting failure to any other monad (sortof a simpler version of ErrorT from the MTL).
I first came across MaybeT in a page on the Haskell Wiki.
This presentation of MaybeT uses the Church encoding of the data-type:

> newtype MaybeT m a = MaybeT {unMaybeT :: forall b . m b -> (a -> m b) -> m b}

Note the similarity to the Prelude function maybe. We can unwrap the transformer like so:


> runMaybeT :: Monad m => MaybeT m a -> m (Maybe a)
> runMaybeT m = unMaybeT m (return Nothing) (return . Just)

This runMaybeT should be a drop-in replacement for the old one.
The advantage here is that we can write the Monad and MonadPlus instances without calling bind or return in the underlying monad m, and without doing any case analysis on Just or Nothing values:

> instance Monad (MaybeT m) where
>     return x = MaybeT $ \_ suc -> suc a
> 
>     m >>= k = MaybeT $ \fail suc ->
>               unMaybeT m fail $ \x ->
>               unMaybeT (k x) fail suc
>
>     fail _ = mzero

> instance MonadPlus (MaybeT m) where
>     mzero = MaybeT $ \fail _ -> fail
>
>     m `mplus` n = MaybeT $ \fail suc ->
>                   unMaybeT m (unMaybeT n fail suc) suc

It's just a matter of threading the failure and success continuations to the right place at the right time.
To show that this is equivalent to the old implementation, here's a re-write of the old MaybeT data constructor from above:


> fromMaybe :: Monad m => m (Maybe a) -> MaybeT m a
> fromMaybe m = MaybeT $ \fail suc -> do
>                res <- m
>                case res of
>                  Nothing -> fail
>                  Just x -> suc x

So anything you can do with the other version, you can do with this version. And for most things it should be a drop-in replacement.

Thursday, February 12, 2009

Dependencies in Hackage, revisited

In a previous post I described how to scrape the Hackage website to do reverse lookups on package dependency data for packages hosted on Hackage.

With the release of the new HTTP library (version 4000) that code doesn't work anymore. This post presents a different solution to the problem.

Instead of pulling data out of html documents, we instead load and parse the local .tar file that cabal-install uses to do its own dependency chasing.

You'll need tar and utf8-string from Hackage.

First, the necessary imports:


> import Data.Maybe
> import Data.List

> import Codec.Archive.Tar

> import Data.ByteString.Lazy (ByteString)
> import qualified Data.ByteString.Lazy as BS

> import qualified Data.ByteString.Lazy.UTF8 as UTF8

> import System.IO
> import System.Environment

> import Distribution.Text
> import Distribution.Package
> import Distribution.PackageDescription
> import Distribution.PackageDescription.Parse

And now the 'main' method.

The first argument is the name of the package you're checking dependencies for, and the second argument is the path to the Hackage index tar-file (for me this is ~/.cabal/packages/hackage.haskell.org/00-index.tar).


> main :: IO ()
> main = do
>  [target,tarball] <- getArgs
>  withFile tarball ReadMode $ \h -> do
>        contents <- BS.hGetContents h
>        let matches = matchesFromIndex contents (== target)
>        sequence_ $ map (print . disp) matches

And then we have the function which, given the contents of the tar-file as a ByteString, returns back the list of PackageIds which depend on the indicated package.


> matchesFromIndex :: ByteString -> (String -> Bool) -> [PackageId]
> matchesFromIndex index p =

>  let tarchive = readTarArchive index
>      cabalFiles = map UTF8.toString $ findCabalEntries tarchive
>      parseResults = map parsePackageDescription cabalFiles
>      gPckgDiscs = okayOnly parseResults
>      matches = filter (match p) gPckgDiscs

>  in map packageId matches

> okayOnly :: [ParseResult a] -> [a]
> okayOnly = mapMaybe fromOkay
>  where fromOkay (ParseOk _ a) = Just a
>        fromOkay _ = Nothing


> -- Does this package have a dependency which matches our
> -- query?
> match ::  (String -> Bool) -> GenericPackageDescription -> Bool
> match p pkg = any (matchDep p) (gPckgDeps pkg)

> -- Does this dependency match our query?
> matchDep :: (String -> Bool) -> Dependency -> Bool
> matchDep p (Dependency (PackageName name) _) = p name

There's a bit of black-magic going on here - I don't entirely understand the structure of the new 'library' and 'executable' sections of the .cabal file, but I scrape everything out which has the right type.


> gPckgDeps :: GenericPackageDescription -> [Dependency]
> gPckgDeps pkg = normalDeps ++ libDeps ++ execDeps
>  where
>   normalDeps = buildDepends $ packageDescription pkg

>   libDeps = case condLibrary pkg of
>               Nothing -> []
>               Just cndTree -> depsFromCndTree exLibDeps cndTree

>   execDeps = concatMap (depsFromCndTree exExecDeps . snd)
>                        (condExecutables pkg)

>   exLibDeps = pkgconfigDepends . libBuildInfo
>   exExecDeps = pkgconfigDepends . buildInfo

> depsFromCndTree f tree =
>   let x = condTreeData tree
    
>       parts = condTreeComponents tree
>       mdlTrees = map mdl parts
>       thrdTrees = mapMaybe thrd parts

>       trees = mdlTrees ++ thrdTrees


>   in f x ++
>      condTreeConstraints tree ++
>      concatMap (depsFromCndTree f) trees

>  where mdl (_,x,_) = x
>        thrd (_,_,x) = x

And this is the bit which takes the decoded tar-file and returns back the entries which look like they could be .cabal files.


> findCabalEntries :: TarArchive -> [ByteString]
> findCabalEntries TarArchive{archiveEntries = xs} = mapMaybe go xs

>  where go :: TarEntry -> Maybe ByteString
>        go x | fileType x /= TarNormalFile = Nothing
>             | isBoringName (fileName x) = Nothing
>             | otherwise = Just $ entryData x

>        fileType = tarFileType . entryHeader
>        fileName = tarFileName . entryHeader

>        isBoringName = not . isSuffixOf ".cabal"

Not too shabby.

Saturday, June 21, 2008

Haskell Snippets

I'm a huge fan of the function mapMaybe, but once I move from the 'Maybe' monad into something more complex (such as ReaderT r Maybe) things become tricky.

First, what is mapMaybe?

Its type is: (a -> Maybe b) -> [a] -> [b]

It maps the input function over the list, and drops any values which evaluate to nothing. It's like a combination of map and filter, where the input function is given the option to either transform the input or filter it out.

But then I needed more information threaded around in my functions, and the types went from a -> Maybe b to a -> ReaderT r Maybe b.

So I needed:

> mapAlt :: Alternative f => (a -> f b) -> [a] -> f [b]

It's just like mapMaybe, except it works for any Alternative functor.

The output is still in the functor f so I can have it work for effectful monads and such, but it will always return a value (even if it's the empty list).

Here's the implementation:

> mapAlt f xs = go xs
>  where go [] = pure []
>        go (y:ys) = (pure (:) <*> f y <*> go ys)
>                <|> go ys

Links:

Hurrah for simple, useful functions.

Sunday, February 10, 2008

HTML Scraping with TagSoup

Earlier today I wanted to know the packages on Hackage which stated a dependency on Parsec, so I wrote a command-line utility to do it. This post presents the utility.

The plan is simple: grab http://hackage.haskell.org/packages/archive/pkg-list.html, extract all of the links which look like links to packages, and then for each of the package-description pages find out if the dependency list includes parsec. If it does, print the package name.

First, a few preliminaries:


> import Data.Maybe
> import Network.HTTP
> import Network.URI
> import System.Environment
> import Text.HTML.TagSoup
> import Text.Regex.Base
> import Text.Regex.Posix.String
> import Text.Regex.Posix.Wrap

You could probably use a different Regex package if you wanted to without too much trouble.

First up, a few strings broken out of the body of the program for convenience should they need changing.


> name = "hackage-dep"
> version = "0.1.0"


> baseURIString = "http://hackage.haskell.org"
> packagesURI =
>    fromJust $ parseURI $ baseURIString ++ "/packages/archive/pkg-list.html"
> basePath = "/cgi-bin/hackage-scripts/package/"

The function parseURI comes from the Network.URI module. It converts a String to the URI datatype used by the Network.* modules.

Next, I need a few functions to fetch an HTML document given a URI:


> mkSimpleGet :: URI -> Request
> mkSimpleGet uri =
>   Request uri GET [Header HdrUserAgent (name ++ " v" ++ version)] []

> simpleGet :: URI -> IO (Result Response)
> simpleGet = simpleHTTP . mkSimpleGet

> body :: Result Response -> Either String String
> body (Right (Response (2,_,_) _ _ str)) = Right str
> body (Right (Response code _ _ _)) = Left $ printCode code
> body (Left e) = Left $ show e

> printCode :: ResponseCode -> String
> printCode (a,b,c) = show a ++ show b ++ show c

> errorString :: String -> String -> String
> errorString uri err =
>   "Error getting " ++ uri ++ "\n" ++ "Error: " ++ err

The two interesting functions here are simpleGet and body: simpleGet performs an HTTP GET request with the passed-in URI, and body extracts the body from the response if it was successful.
Now we can start on the HTML manipulation.


> type HTML = String

> links :: HTML -> [Tag]
> links = filter (~== TagOpen "a" []) . parseTags

links converts an HTML document into a list of link tags, using TagSoup.

And then the function packageInfo extracts the package name from a link to that package.


> type Package = String

> packageInfo :: Tag -> Maybe Package
> packageInfo (TagOpen "a" [])      = Nothing
> packageInfo t@(TagOpen "a" attrs) =
>     case fromAttrib "href" t of
>       [] -> Nothing
>       path -> info path
> packageInfo _ = Nothing

> packageName = "^" ++ basePath ++ "(.+)$"

> info :: String -> Maybe Package
> info str =
>     case str =~ packageName of
>       (_,_,_,[]) -> Nothing
>       (_,_,_,[package]) -> Just package
>       (_::(String,String,String,[String])) -> Nothing

And once I have a list of package names, I'll want to grab the web-page describing the package:


> packageURI :: Package -> URI
> packageURI =
>   fromJust . parseURI . ((baseURIString ++ basePath) ++)

> packageGet :: Package -> IO (Result Response)
> packageGet = simpleGet . packageURI

The idea is that I can call packageGet on an extracted Package, and then I can use the previously defined body function to get the HTML out of the HTTP response.

Now, let's get on with the main function:


> main :: IO ()
> main = do
>   arg <- (do {[arg] <- getArgs; return arg})
>      `catch`
>      (\_ -> error "Requires a single command line argument")
>   res <- simpleGet packagesURI
>   case body res of
>     Left str -> putStrLn $ errorString (show packagesURI) str
>     Right html -> findDeps (=~ arg) $ filterJust $ map packageInfo $ links html

The filterJust $ map packageInfo $ links html bit extracts a list of package names from the HTML list pulled off of hackage. The function findDeps takes this list along with a passed in testing function and prints out which packages depend on the package specified at the command line. The passed-in testing function is just a regex-match based on the single command-line argument.


> filterJust :: [Maybe a] -> [a]
> filterJust xs = [x | Just x <- xs]

> findDeps :: (String -> Bool) -> [Package] -> IO ()
> findDeps p ps = mapM_ (printIfDep p) ps

> printIfDep :: (String -> Bool) -> Package -> IO ()
> printIfDep p pTest = do
>   res <- packageGet pTest
>   case body res of
>     Left e     -> putStrLn $ errorString pTest e
>     Right html ->
>         if hasDep html p
>         then putStrLn pTest
>         else return ()

The function hasDep picks the "Dependencies" field out of the passed-in HTML text, and then returns true if the passed-in test returns true on any bit of string in the dependencies field.


> hasDep :: HTML -> (String -> Bool) -> Bool
> hasDep html p =
>     let tags = parseTags html
>         depTags = takeWhile (~/= (TagClose "tr")) $
>                   drop 1 $
>                   dropWhile (~/= (TagText "Dependencies")) $
>                   tags
>         depText = filterText depTags
>
>         filterText xs = [x | TagText x <- xs] :: [String]
>     in any p depText

After saving and compiling, executing ./Main parsec will (slowly) list all of the packages on Hackage which depend on Parsec. Success!

Exercise for the reader: Implement the above functionality by grabbing the 00-index.tar.gz off of Hackage instead of scraping HTML pages. This file contains all of the .cabal files for every version of every package hosted on Hackage. For bonus points cache the index on disk between calls.

FS

Blog Archive