Blog Archive

Sunday, February 10, 2008

HTML Scraping with TagSoup

Earlier today I wanted to know the packages on Hackage which stated a dependency on Parsec, so I wrote a command-line utility to do it. This post presents the utility.

The plan is simple: grab http://hackage.haskell.org/packages/archive/pkg-list.html, extract all of the links which look like links to packages, and then for each of the package-description pages find out if the dependency list includes parsec. If it does, print the package name.

First, a few preliminaries:

> import Data.Maybe
> import Network.HTTP
> import Network.URI
> import System.Environment
> import Text.HTML.TagSoup
> import Text.Regex.Base
> import Text.Regex.Posix.String
> import Text.Regex.Posix.Wrap

You could probably use a different Regex package if you wanted to without too much trouble.

First up, a few strings broken out of the body of the program for convenience should they need changing.

> name = "hackage-dep"
> version = "0.1.0"

> baseURIString = "http://hackage.haskell.org"
> packagesURI =
> fromJust $ parseURI $ baseURIString ++ "/packages/archive/pkg-list.html"
> basePath = "/cgi-bin/hackage-scripts/package/"
The function parseURI comes from the Network.URI module. It converts a String to the URI datatype used by the Network.* modules.

Next, I need a few functions to fetch an HTML document given a URI:

> mkSimpleGet :: URI -> Request
> mkSimpleGet uri =
> Request uri GET [Header HdrUserAgent (name ++ " v" ++ version)] []

> simpleGet :: URI -> IO (Result Response)
> simpleGet = simpleHTTP . mkSimpleGet

> body :: Result Response -> Either String String
> body (Right (Response (2,_,_) _ _ str)) = Right str
> body (Right (Response code _ _ _)) = Left $ printCode code
> body (Left e) = Left $ show e

> printCode :: ResponseCode -> String
> printCode (a,b,c) = show a ++ show b ++ show c

> errorString :: String -> String -> String
> errorString uri err =
> "Error getting " ++ uri ++ "\n" ++ "Error: " ++ err

The two interesting functions here are simpleGet and body: simpleGet performs an HTTP GET request with the passed-in URI, and body extracts the body from the response if it was successful.
Now we can start on the HTML manipulation.

> type HTML = String

> links :: HTML -> [Tag]
> links = filter (~== TagOpen "a" []) . parseTags
links converts an HTML document into a list of link tags, using TagSoup.

And then the function packageInfo extracts the package name from a link to that package.

> type Package = String

> packageInfo :: Tag -> Maybe Package
> packageInfo (TagOpen "a" []) = Nothing
> packageInfo t@(TagOpen "a" attrs) =
> case fromAttrib "href" t of
> [] -> Nothing
> path -> info path
> packageInfo _ = Nothing

> packageName = "^" ++ basePath ++ "(.+)$"

> info :: String -> Maybe Package
> info str =
> case str =~ packageName of
> (_,_,_,[]) -> Nothing
> (_,_,_,[package]) -> Just package
> (_::(String,String,String,[String])) -> Nothing
And once I have a list of package names, I'll want to grab the web-page describing the package:

> packageURI :: Package -> URI
> packageURI =
> fromJust . parseURI . ((baseURIString ++ basePath) ++)

> packageGet :: Package -> IO (Result Response)
> packageGet = simpleGet . packageURI
The idea is that I can call packageGet on an extracted Package, and then I can use the previously defined body function to get the HTML out of the HTTP response.

Now, let's get on with the main function:

> main :: IO ()
> main = do
> arg <- (do {[arg] <- getArgs; return arg})
> `catch`
> (\_ -> error "Requires a single command line argument")
> res <- simpleGet packagesURI
> case body res of
> Left str -> putStrLn $ errorString (show packagesURI) str
> Right html -> findDeps (=~ arg) $ filterJust $ map packageInfo $ links html

The filterJust $ map packageInfo $ links html bit extracts a list of package names from the HTML list pulled off of hackage. The function findDeps takes this list along with a passed in testing function and prints out which packages depend on the package specified at the command line. The passed-in testing function is just a regex-match based on the single command-line argument.

> filterJust :: [Maybe a] -> [a]
> filterJust xs = [x | Just x <- xs]

> findDeps :: (String -> Bool) -> [Package] -> IO ()
> findDeps p ps = mapM_ (printIfDep p) ps

> printIfDep :: (String -> Bool) -> Package -> IO ()
> printIfDep p pTest = do
> res <- packageGet pTest
> case body res of
> Left e -> putStrLn $ errorString pTest e
> Right html ->
> if hasDep html p
> then putStrLn pTest
> else return ()

The function hasDep picks the "Dependencies" field out of the passed-in HTML text, and then returns true if the passed-in test returns true on any bit of string in the dependencies field.

> hasDep :: HTML -> (String -> Bool) -> Bool
> hasDep html p =
> let tags = parseTags html
> depTags = takeWhile (~/= (TagClose "tr")) $
> drop 1 $
> dropWhile (~/= (TagText "Dependencies")) $
> tags
> depText = filterText depTags
>
> filterText xs = [x | TagText x <- xs] :: [String]
> in any p depText


After saving and compiling, executing ./Main parsec will (slowly) list all of the packages on Hackage which depend on Parsec. Success!

Exercise for the reader: Implement the above functionality by grabbing the 00-index.tar.gz off of Hackage instead of scraping HTML pages. This file contains all of the .cabal files for every version of every package hosted on Hackage. For bonus points cache the index on disk between calls.

1 comment:

Dougal said...

The filterJust function is the catMaybes function from Data.Maybe. :-)

Listening:

Watching:

  • House
  • Ride Back