Posts Tagged ‘clojure’

`lein run` plugin

Friday, May 7th, 2010

Did I ever mention that leiningen, the Clojure build tool, is awesome? Anything that spares a programmer in the Java world from the hell that is XML config files (I’m looking at you, Ant and Maven, the gatekeepers of hell) earns that label. To top it off, leiningen (lein, for short) uses Maven under the hood, thus reusing a lot of the existing infrastructure (repositories, dependency trees, etc.).

Custom “tasks” in leiningen are simple Clojure functions. Oh the joy! I’ve written several custom tasks for several different projects over the last couple of months. But the one I wrote most recently — lein run — is the task that has the most utility outside of those projects. So I created a plugin out of it:

From the README:

A leiningen plugin to call a function in a new process or run a .clj file.

lein-run is extremely useful when you want to launch a long-running Clojure process from the command line. For example, it can be used to start a server (a web server like Compojure) or to start a process that will run in an infinite loop (a process waiting for messages from a message queue, a Twitter client, etc.).
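For anyone who has not written one, a custom task really is just a function. Here is a minimal sketch, assuming the Leiningen 1.x convention of a function named after the task, living in a `leiningen.<task>` namespace and receiving the project map as its first argument. The `hello` task is purely hypothetical and is not part of lein-run:

```clojure
(ns leiningen.hello
  "Hypothetical task for illustration: `lein hello` greets the current project.")

(defn hello
  "Print a greeting with the project name, then echo any extra arguments."
  [project & args]
  (println "Hello from" (:name project))
  (doseq [arg args]
    (println "  extra argument:" arg)))
```

Because it is an ordinary function, it can also be called from the REPL with a hand-made project map, e.g. `(hello {:name "demo"} "foo")`.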

HTML Parsing in Clojure using HtmlCleaner

Thursday, May 6th, 2010

As part of a Clojure project I’m working on, I needed to parse some HTML pages and extract their titles and tag-stripped (textual) content, which would then be fed to Solr.

Now, there are many many HTML parsers available for Java. (One cannot just parse HTML from the World Wild Web by treating it as XML — most of the HTML there is not even well-formed. HTML parsers are able to deal with the kludge and present a clean picture of the page to the consumer.) TagSoup and HtmlCleaner are two of the more popular ones. I like HtmlCleaner better.

(TagSoup provides a SAX interface and I find SAX to be a PITA when dealing with HTML. HtmlCleaner, on the other hand, provides a simple DOM interface; it also has helpful options like omitComments, pruneTags etc. Besides, I’ve found that, in practice, HtmlCleaner is usually able to do a better job of handling badly written HTML than TagSoup; some others also agree with me. If all this wasn’t enough reason to pick HtmlCleaner, it is in general faster than TagSoup.)

So, here is the code to parse HTML with HtmlCleaner and extract the title and content of the page:
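A minimal sketch of such a function, assuming HtmlCleaner 2.x and commons-lang 2.x are on the classpath. The namespace, the function name and the choice of pruned tags are illustrative assumptions, so treat this as one possible shape of the code rather than the definitive listing:

```clojure
(ns example.html
  "Illustrative sketch; assumes HtmlCleaner 2.x and commons-lang 2.x."
  (:import [org.htmlcleaner HtmlCleaner]
           [org.apache.commons.lang StringEscapeUtils]))

(defn parse-page
  "Return a map with the :title and tag-stripped :content of an HTML string."
  [html]
  (let [cleaner (HtmlCleaner.)
        props   (.getProperties cleaner)]
    ;; Drop comments and subtrees whose text is not page content
    ;; (the pruned tag list is an assumption; adjust to taste).
    (.setOmitComments props true)
    (.setPruneTags props "script,style")
    (let [root  (.clean cleaner html)
          title (.findElementByName root "title" true)
          body  (.findElementByName root "body" true)
          text  (fn [node]
                  ;; getText concatenates the text of a subtree; entities
                  ;; such as &amp; are unescaped separately.
                  (when node
                    (.trim (StringEscapeUtils/unescapeHtml
                            (str (.getText node))))))]
      {:title   (text title)
       :content (text body)})))
```

Typical use would be something like `(parse-page (slurp "page.html"))`, with the resulting map handed on to whatever indexes the documents.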

You’ll notice that I have used Apache commons-lang’s StringEscapeUtils to ‘unescape’ HTML entities (convert &amp;amp; to &amp;, etc.); HtmlCleaner does not do this automatically.

I couldn’t find a Maven repository for HtmlCleaner, so I uploaded it to clojars. If you are going to use this in a lein project, you’ll want to add it to your dependencies:

Happy parsing!


Monday, May 3rd, 2010

A few months ago, I had mentioned that I was planning to learn a Functional Programming language. In that blog post I listed out the different functional languages I was looking at and outlined my thought process for picking one of them. I picked Clojure.

Was that a good decision? No. It was a great decision. (Pardon the corniness.)

I have learned more while programming in Clojure than while programming in any other language since I first learned C some 12 years ago. Alan Perlis once said: “A language that doesn’t affect the way you think about programming, is not worth knowing”. Amen.

Over the last decade, I have dabbled in several programming languages including C, C++, VB, C#, Java, Perl, Python, Ruby… Now, these are all more similar than they are different. I know, you won’t truly believe that statement unless you’ve some experience with a language which is drastically different from them; ignorance is bliss. Clojure is different. Very Different. Very.

So different indeed that I have taken more time to become sort-of-comfortable with Clojure than I have ever taken to reach the same state for any other language I’ve learned. Being homoiconic, Clojure actually has very little syntax. But the problem lay in retraining my OOP-corrupted brain to think functionally and in getting comfortable working with immutable data structures.
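A tiny illustration of what that retraining involves (the data is made up for the example): functions like `assoc` and `update-in` never change the map they are given, they hand back a new one.

```clojure
(def original {:lang "Clojure" :tags ["lisp" "jvm"]})

;; Both calls return new maps; `original` itself is never modified.
(def renamed (assoc original :lang "Scheme"))
(def tagged  (update-in original [:tags] conj "functional"))

(println original) ; still has :lang "Clojure" and two tags
(println renamed)  ; same tags, but :lang "Scheme"
(println tagged)   ; :lang "Clojure", with three tags
```

Coming from languages where `map.put(...)` mutates in place, the habit to unlearn is expecting `original` to have changed after those calls.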

Luckily, SICP rescued me. (SICP stands for Structure and Interpretation of Computer Programs. It is one of the most legendary books in the history of Computer Science. That many people I talk to in India have never even heard of it tells a sad story about the state of Computer Science education here.) Quite simply, it is the best programming book I’ve ever read. And I’ve only read (and worked out the exercises of) the first chapter; there are only five chapters in all. SICP uses Scheme (a Lisp dialect) as the language of instruction, but it is quite trivial to translate the programs to Clojure. Then again, SICP is not teaching a programming language; it is teaching programming. Big difference. Take a moment to think about that. I highly recommend you read at least the first chapter of SICP, whether you care to learn to program in Lisp or not. If you’re in India, you can order a copy of it from here (the book is also available as a free download from its website).
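To give a feel for how direct the Scheme-to-Clojure translation is, here is the factorial example from the first chapter of SICP, with the Scheme original in comments. The Clojure function names are my own; the one real difference is that the iterative version uses `loop`/`recur`, since the JVM does not optimise tail calls the way Scheme does.

```clojure
;; Scheme (SICP, linear recursive process):
;;   (define (factorial n)
;;     (if (= n 1)
;;         1
;;         (* n (factorial (- n 1)))))
(defn factorial
  "Recursive factorial, translated almost token for token."
  [n]
  (if (= n 1)
    1
    (* n (factorial (- n 1)))))

;; Scheme (SICP, linear iterative process, via a fact-iter helper):
;;   (define (fact-iter product counter max-count)
;;     (if (> counter max-count)
;;         product
;;         (fact-iter (* counter product) (+ counter 1) max-count)))
(defn factorial-iter
  "Iterative factorial; loop/recur stands in for Scheme's tail call."
  [n]
  (loop [product 1
         counter 1]
    (if (> counter n)
      product
      (recur (* counter product) (inc counter)))))

(println (factorial 5))      ; prints 120
(println (factorial-iter 5)) ; prints 120
```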

Now I’ve reached a stage where, when I’m programming in any other language — even Ruby, the most beautiful of all the other languages I know — I feel as if I’m doing it wrong. All wrong. I’ve slowly but surely started to apply some of the concepts I’ve learned while programming Clojure when programming in other languages too. See Alan Perlis’ quote above.

I’m working on a couple of open source libraries in Clojure, which I hope to announce soon. Almost all of the weekend hacks I do these days are also in Clojure. I’m addicted!

I’m planning to blog some adventures and neat tricks I’m conjuring up using Clojure. If you are interested, you can subscribe to all my Clojure posts.

I’ll end this with a few interesting quotes from SICP:

The acts of the mind, wherein it exerts its power over simple ideas, are chiefly these three: 1. Combining several simple ideas into one compound one, and thus all complex ideas are made. 2. The second is bringing two ideas, whether simple or complex, together, and setting them by one another so as to take a view of them at once, without uniting them into one, by which it gets all its ideas of relations. 3. The third is separating them from all other ideas that accompany them in their real existence: this is called abstraction, and thus all its general ideas are made.

John Locke, An Essay Concerning Human Understanding (1690)

This is the preamble to chapter 1. “Programming is abstraction.”


We are about to study the idea of a computational process. Computational processes are abstract beings that inhabit computers. As they evolve, processes manipulate other abstract things called data. The evolution of a process is directed by a pattern of rules called a program. People create programs to direct processes. In effect, we conjure the spirits of the computer with our spells.

A computational process is indeed much like a sorcerer’s idea of a spirit. It cannot be seen or touched. It is not composed of matter at all. However, it is very real. It can perform intellectual work. It can answer questions. It can affect the world by disbursing money at a bank or by controlling a robot arm in a factory. The programs we use to conjure processes are like a sorcerer’s spells. They are carefully composed from symbolic expressions in arcane and esoteric programming languages that prescribe the tasks we want our processes to perform.

A computational process, in a correctly working computer, executes programs precisely and accurately. Thus, like the sorcerer’s apprentice, novice programmers must learn to understand and to anticipate the consequences of their conjuring. Even small errors (usually called bugs or glitches) in programs can have complex and unanticipated consequences.

These are the first three paragraphs of chapter 1. What a cool metaphor! Sadly, the use of this metaphor is limited to these paragraphs.


“I think that it’s extraordinarily important that we in computer science keep fun in computing. When it started out, it was an awful lot of fun. Of course, the paying customers got shafted every now and then, and after a while we began to take their complaints seriously. We began to feel as if we really were responsible for the successful, error-free perfect use of these machines. I don’t think we are. I think we’re responsible for stretching them, setting them off in new directions, and keeping fun in the house. I hope the field of computer science never loses its sense of fun. Above all, I hope we don’t become missionaries. Don’t feel as if you’re Bible salesmen. The world has too many of those already. What you know about computing other people will learn. Don’t feel as if the key to successful computing is only in your hands. What’s in your hands, I think and hope, is intelligence: the ability to see the machine as more than when you were first led up to it, that you can make it more.”

Alan J. Perlis (April 1, 1922-February 7, 1990)

Alan Perlis isn’t kidding (does he, ever?) — programming is fun again. Thank you, Clojure.

Beginning Functional Programming

Tuesday, October 20th, 2009

Warning: Geekspeak ahead.

I’ve been wanting to learn a functional programming (FP) language for quite a while now.

Ruby being my current fav programming language while also being malleable to FP, I thought I would dabble with some FP in Ruby. But that was not to be. While Ruby does have some basic features like closures that are necessary for FP (although it does not boast of a lot of clarity), one has to jump through a lot of hoops to achieve immutability; the language offers no assistance here. The following video provides a good overview of what it takes to do FP using Ruby: Better Ruby through Functional Programming by Dean Wampler.

With Ruby out, I now needed a language that embraced FP wholeheartedly. And there is no dearth of choices here; see the Functional Languages category on Wikipedia. I decided to narrow down the list by picking up the more popular and/or adopted languages. This left me with Clojure, Common Lisp, Erlang, F Sharp, Haskell and Scala (but many don’t seem to consider Scala to be truly functional). That’s still too many languages. But the list is small enough that I could learn enough about each of these languages to be able to pick one that I want to learn right now.

Clojure: Lisp dialect, dynamically typed, runs on JVM (and embraces it; implying that any Java library can be used with Clojure with little extra effort), great community. Moreover, it is embraced by my company (one of our very important backend platforms is written mostly in Clojure). More on Clojure later.

Common Lisp: The oldest of them all, dynamically typed, really cool. But too many parens for my liking!

Erlang: Runs on its own rock solid concurrent runtime system, offers distributed concurrency for free, dynamically typed, allows hot swapping and…

Meanwhile, back at Ericsson, some Erlang-based products that were already in progress when the “ban” went into effect came to market, including the AXD 301, an ATM switch with 99.9999999 percent reliability (9 nines, or 31 ms. downtime a year!), which has captured 11% of the world market. The AXD 301 system includes 1.7 million lines of Erlang: This isn’t just some academic language.

(from: Erlang in; emphasis mine.)

F Sharp: Runs on the CLR (.NET Framework).

Haskell: Pure FP language, static typing.

Scala: Runs on JVM, static typing. Very complicated type system, from what I hear.
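As an aside, the “9 nines, or 31 ms” figure in the Erlang quote above is easy to check. A quick sketch, assuming a 365-day year (the function name is just for illustration):

```clojure
(defn downtime-ms-per-year
  "Expected downtime in milliseconds per 365-day year for an availability
  of `nines` nines (e.g. 9 means 99.9999999 percent)."
  [nines]
  (let [unavailability (/ 1 (reduce * (repeat nines 10))) ; exact ratio 10^-nines
        ms-per-year    (* 365 24 60 60 1000)]
    (double (* unavailability ms-per-year))))

(println (downtime-ms-per-year 9)) ; prints 31.536
```

For comparison, `(downtime-ms-per-year 5)`, the more commonly quoted “five nines”, comes to 315360 ms, a little over five minutes a year.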

The above is obviously not a very thorough overview of the languages. It was not my intention to develop an overview at all; I just gathered enough information to be able to make a decision.

I like dynamic typing and working on Linux. My company’s main platform is the JVM. I mostly write Information Retrieval (IR) software and a lot of IR libraries are in Java. Clojure!

But I also like Erlang. I’ll get to it after I have tamed Clojure. And I also want to learn the R programming language so that I can implement statistical algorithms with ease.

But what’s the point of this blog post? Nothing. I plan to blog a bit about FP in general and Clojure in particular and thought that it would be nice to provide some context. Also, I wanted to document at least part of the decision making process I have been through for picking up Clojure.