HTML Parsing in Clojure using HtmlCleaner

As part of a Clojure project I’m working on, I needed to parse some HTML pages and extract their titles and tag-stripped (textual) content, which would then be fed to Solr.

Now, there are many, many HTML parsers available for Java. (One cannot just parse HTML from the World Wild Web by treating it as XML; most of the HTML there is not even well-formed. HTML parsers are able to deal with the kludge and present a clean picture of the page to the consumer.) TagSoup and HtmlCleaner are two of the more popular ones. I like HtmlCleaner better.

(TagSoup provides a SAX interface, and I find SAX to be a PITA when dealing with HTML. HtmlCleaner, on the other hand, provides a simple DOM interface; it also has helpful options like omitComments, pruneTags, etc. Besides, I’ve found that, in practice, HtmlCleaner usually does a better job of handling badly written HTML than TagSoup; some others also agree with me. And if all this weren’t reason enough to pick HtmlCleaner, it is also generally faster than TagSoup.)

So, here is the code to parse HTML with HtmlCleaner and extract the title and content of the page:
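What follows is a minimal sketch of such a function rather than a definitive implementation. It assumes HtmlCleaner 2.x and commons-lang 2.x are on the classpath (with commons-lang3 the class lives in `org.apache.commons.lang3` and the method is `unescapeHtml4`). The namespace `example.html-parser` and the helper `node-text` are made up for illustration, and pruning `script` and `style` is an assumed choice.

```clojure
(ns example.html-parser ; hypothetical namespace name
  (:import (org.htmlcleaner HtmlCleaner TagNode)
           (org.apache.commons.lang StringEscapeUtils)))

(defn- node-text
  "Returns the text content of a TagNode with HTML entities unescaped.
  (Illustrative helper, not part of HtmlCleaner.)"
  [^TagNode node]
  (-> node .getText str StringEscapeUtils/unescapeHtml))

(defn parse-page
  "Parses an HTML source string and returns a map with the page's :title
  and its tag-stripped :content. Returns nil when page-src is nil.
  No encoding detection is done here."
  [^String page-src]
  (when page-src
    (let [cleaner (HtmlCleaner.)]
      ;; Configure the cleaner before parsing.
      (doto (.getProperties cleaner)
        (.setOmitComments true)          ; drop HTML comments
        (.setPruneTags "script,style"))  ; drop these tags together with their bodies
      (let [root (.clean cleaner page-src)]
        {:title   (when-let [title-node (.findElementByName root "title" true)]
                    (node-text title-node))
         ;; Text of the whole document, which includes the title text as well.
         :content (node-text root)}))))
```

Calling `(parse-page "<html><head><title>Tom &amp; Jerry</title></head><body><p>Hi</p></body></html>")` should give a map whose `:title` is `"Tom & Jerry"`. Returning nil for nil input keeps the function easy to chain after a fetch step that may fail.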

You’ll notice that I have used Apache commons-lang’s StringEscapeUtils to ‘unescape’ HTML entities (convert &amp;amp; to &amp;, etc.); HtmlCleaner does not do this automatically.

I couldn’t find a Maven repository for HtmlCleaner, so I uploaded it to clojars. If you are going to use this in a lein project, you’ll want to add it to your dependencies:

Happy parsing!

6 Responses to “HTML Parsing in Clojure using HtmlCleaner”

  1. [...] This post was mentioned on Twitter by Siddhartha Reddy. Siddhartha Reddy said: Just published a blog post on how to do HTML parsing in Clojure: http://bit.ly/cWwDSV [...]

  2. [...] so there. You could use a Java-based HTML parser, such as HtmlCleaner. There was recently an excellent article about it. But lets say, that you would prefer to do it in a more functional style. Well, this is [...]

  3. [...] HTML Parsing in Clojure using HtmlCleaner – @ infinity, plus 1 (tags: clojure html parsing) [...]

  4. You can write this code with Enlive:
    http://gist.github.com/393194

  5. sids says:

    Christophe,

    Thanks a ton for posting this, this is so much nicer than using HtmlCleaner. I’ve put off going through the Enlive tutorial for far too long now; after seeing this little snippet, I’m unwilling to put it off any more.

  6. Vladimir says:

    Does anyone have an idea how to modify the parse-page function to access the value of the href attribute? For example, if I have:

    <a href="http://www.w3schools.com">Visit W3Schools</a>

    I want to extract http://www.w3schools.com ?

    Thanks.
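One possible way to approach the href question above, as a minimal sketch: HtmlCleaner's `TagNode` offers `getElementsByName` and `getAttributeByName`, so the links can be collected from the cleaned tree. The namespace `example.links` and the function name `extract-hrefs` are hypothetical, and HtmlCleaner 2.x is assumed to be on the classpath.

```clojure
(ns example.links ; hypothetical namespace name
  (:import (org.htmlcleaner HtmlCleaner TagNode)))

(defn extract-hrefs
  "Returns a vector with the href value of every <a> element in the
  HTML source. Anchors without an href attribute are skipped."
  [^String page-src]
  (let [root (.clean (HtmlCleaner.) page-src)]
    (->> (.getElementsByName root "a" true) ; true = search recursively
         (keep (fn [^TagNode a] (.getAttributeByName a "href")))
         vec)))
```

For the example in the comment, `(extract-hrefs "<a href=\"http://www.w3schools.com\">Visit W3Schools</a>")` should return a one-element vector holding `"http://www.w3schools.com"`. The same two calls could be folded into a parse-page style function by adding a `:links` key to the map it returns.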