`lein run` plugin   ★

Did I ever mention that leiningen, the Clojure build tool, is awesome? Anything that spares a programmer in the Java world from the hell that is XML config files (I’m looking at you, ant and maven, the gatekeepers of hell) will find itself being declared as much. To top it off, leiningen (lein, for short) uses Maven under the hood, thus utilizing a lot of the existing infrastructure (repositories, dependency trees etc.)

Custom “tasks” in leiningen are simple Clojure functions. Oh the joy! I’ve written several custom tasks for several different projects over the last couple of months. But the one I wrote most recently — lein run — is the task that has the most utility outside of those projects. So I created a plugin out of it: http://github.com/sids/lein-run/.
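To make the "tasks are just functions" point concrete, here is a minimal sketch of a task. The task name hello and the helper greeting are hypothetical, made up for illustration. The sketch assumes Leiningen's convention that a task named X is a function X in the namespace leiningen.X, called with the project map as its first argument.

```clojure
(ns leiningen.hello)

(defn greeting
  "Build the message for a project map (hypothetical helper)."
  [project]
  (str "Hello from " (:name project) "!"))

(defn hello
  "Hypothetical task: prints a greeting that includes the project name."
  [project & args]
  (println (greeting project)))
```

With this file on Leiningen's classpath, `lein hello` would call the function with the current project's map.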

From the README:

A leiningen plugin to call a function in a new process or run a .clj file.

lein-run is extremely useful when you want to launch a long-running Clojure process from the command line. For example, it can be used to start a server (a web server like Compojure) or to start a process that will run in an infinite loop (a process waiting for messages from a message queue, a Twitter client etc.).

HTML Parsing in Clojure using HtmlCleaner   ★

As part of a Clojure project I’m working on, I needed to parse some HTML pages and extract their titles and tag-stripped (textual) content, which would then be fed to Solr.

Now, there are many many HTML parsers available for Java. (One cannot just parse HTML from the World Wild Web by treating it as XML — most of the HTML there is not even well-formed. HTML parsers are able to deal with the kludge and present a clean picture of the page to the consumer.) TagSoup and HtmlCleaner are two of the more popular ones. I like HtmlCleaner better.

(TagSoup provides a SAX interface and I find SAX to be a PITA when dealing with HTML. HtmlCleaner, on the other hand, provides a simple DOM interface; it also has helpful options like omitComments, pruneTags etc. Besides, I’ve found that, in practice, HtmlCleaner is usually able to do a better job of handling badly written HTML than TagSoup; some others also agree with me. If all this wasn’t enough reason to pick HtmlCleaner, it is in general faster than TagSoup.)

So, here is the code to parse HTML with HtmlCleaner and extract the title and content of the page:
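A minimal sketch of what such code can look like (not necessarily the original listing). It assumes HtmlCleaner 2.x and Apache commons-lang 2.x are on the classpath; the namespace, the function name parse-page and the choice of pruned tags are illustrative assumptions.

```clojure
(ns example.html-parse
  (:import (org.htmlcleaner HtmlCleaner)
           (org.apache.commons.lang StringEscapeUtils)))

(defn- node-text
  "Tag-stripped, entity-unescaped, trimmed text of a TagNode (nil if absent)."
  [node]
  (when node
    (.trim (StringEscapeUtils/unescapeHtml (str (.getText node))))))

(defn parse-page
  "Parse an HTML string and return a map with its :title and :content."
  [^String html]
  (let [cleaner (HtmlCleaner.)
        props   (.getProperties cleaner)]
    ;; Drop comments and tags whose text we never want to index.
    (.setOmitComments props true)
    (.setPruneTags props "script,style")
    (let [root (.clean cleaner html)]
      {:title   (node-text (.findElementByName root "title" true))
       :content (node-text (.findElementByName root "body" true))})))
```

Calling `(parse-page html-string)` then yields a map that can be handed to whatever indexes the page.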

You’ll notice that I have used Apache commons-lang’s StringEscapeUtils to ‘unescape’ HTML entities (convert &amp;amp; to &amp; etc.); HtmlCleaner does not do this automatically.

I couldn’t find a Maven repository for HtmlCleaner, so I uploaded it to clojars. If you are going to use this in a lein project, you’ll want to add it to your dependencies:

Happy parsing!

Clojure.   ★

A few months ago, I had mentioned that I was planning to learn a Functional Programming language. In that blog post I listed out the different functional languages I was looking at and outlined my thought process for picking one of them. I picked Clojure.

Was that a good decision? No. It was a great decision. (Pardon the corniness.)

I have learned more while programming in Clojure than while programming in any other language since I first learned C some 12 years ago. Alan Perlis once said: “A language that doesn’t affect the way you think about programming, is not worth knowing”. Amen.

Over the last decade, I have dabbled in several programming languages including C, C++, VB, C#, Java, Perl, Python, Ruby… Now, these are all more similar than they are different. I know, you won’t truly believe that statement unless you’ve some experience with a language which is drastically different from them; ignorance is bliss. Clojure is different. Very Different. Very.

So different indeed that I have taken more time to become sort-of-comfortable with Clojure than I have ever taken to reach the same state for any other language I’ve learned. Being homoiconic, Clojure actually has very little syntax. But the problem lay in retraining my OOP-corrupted brain to think functionally and in getting comfortable working with immutable data structures.
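For anyone who hasn’t met immutable data structures, a tiny sketch of what they mean in practice (all names here are made up for illustration): "modifying" a vector or a map returns a new value and leaves the original untouched.

```clojure
;; "Adding" to a vector returns a new vector; the original is unchanged.
(def langs ["C" "Java" "Ruby"])
(def more-langs (conj langs "Clojure"))

;; The same holds for maps: assoc returns a new map.
(def post {:title "Clojure."})
(def tagged-post (assoc post :tags ["clojure" "fp"]))

(println langs)       ; still the original three languages
(println more-langs)  ; the new vector, with "Clojure" appended
(println post)        ; still has only :title
(println tagged-post) ; the new map, with :tags added
```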

Luckily, SICP rescued me. (SICP stands for Structure and Interpretation of Computer Programs. It is one of the most legendary books in the history of Computer Science. That many people I talk to in India have never even heard of it tells a sad story about the state of Computer Science education here.) Quite simply, it is the best programming book I’ve ever read. And I’ve only read (and worked out the exercises of) the first chapter; there are only five chapters in all. SICP uses Scheme (a Lisp dialect) as the language of instruction, but it is quite trivial to translate the programs to Clojure. That said, SICP is not teaching a programming language; it is teaching programming. Big difference. Take a moment to think about that. I highly recommend you read at least the first chapter of SICP, whether you care to learn to program in Lisp or not. If you’re in India, you can order a copy of it from here (the book is also available as a free download from its website).
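To give a feel for how direct the Scheme-to-Clojure translation is, here is a sketch of the square-root-by-Newton’s-method example from chapter 1, rewritten in Clojure. The structure follows the book’s version; the main differences are defn in place of define, square brackets around parameters, and recur for the self-call in tail position.

```clojure
(defn square [x] (* x x))

(defn average [x y] (/ (+ x y) 2))

(defn good-enough?
  "Is the square of guess within 0.001 of x?"
  [guess x]
  (< (Math/abs (double (- (square guess) x))) 0.001))

(defn improve
  "Average the guess with x/guess to get a better guess."
  [guess x]
  (average guess (/ x guess)))

(defn sqrt-iter
  "Keep improving the guess until it is good enough."
  [guess x]
  (if (good-enough? guess x)
    guess
    (recur (improve guess x) x)))

(defn sqrt [x] (sqrt-iter 1.0 x))

(println (sqrt 9))
```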

Now I’ve reached a stage where, when I’m programming in any other language — even Ruby, the most beautiful of all the other languages I know — I feel as if I’m doing it wrong. All wrong. I’ve slowly but surely started to apply some of the concepts I’ve learned while programming Clojure when programming in other languages too. See Alan Perlis’ quote above.

I’m working on a couple of open source libraries in Clojure, which I hope to announce soon. Almost all of the weekend hacks I do these days are also in Clojure. I’m addicted!

I’m planning to blog some adventures and neat tricks I’m conjuring up using Clojure. If you are interested, you can subscribe to all my Clojure posts.

I’ll end this with a few interesting quotes from SICP:

The acts of the mind, wherein it exerts its power over simple ideas, are chiefly these three: 1. Combining several simple ideas into one compound one, and thus all complex ideas are made. 2. The second is bringing two ideas, whether simple or complex, together, and setting them by one another so as to take a view of them at once, without uniting them into one, by which it gets all its ideas of relations. 3. The third is separating them from all other ideas that accompany them in their real existence: this is called abstraction, and thus all its general ideas are made.

John Locke, An Essay Concerning Human Understanding (1690)

This is the preamble to chapter 1. “Programming is abstraction.”

-

We are about to study the idea of a computational process. Computational processes are abstract beings that inhabit computers. As they evolve, processes manipulate other abstract things called data. The evolution of a process is directed by a pattern of rules called a program. People create programs to direct processes. In effect, we conjure the spirits of the computer with our spells.

A computational process is indeed much like a sorcerer’s idea of a spirit. It cannot be seen or touched. It is not composed of matter at all. However, it is very real. It can perform intellectual work. It can answer questions. It can affect the world by disbursing money at a bank or by controlling a robot arm in a factory. The programs we use to conjure processes are like a sorcerer’s spells. They are carefully composed from symbolic expressions in arcane and esoteric programming languages that prescribe the tasks we want our processes to perform.

A computational process, in a correctly working computer, executes programs precisely and accurately. Thus, like the sorcerer’s apprentice, novice programmers must learn to understand and to anticipate the consequences of their conjuring. Even small errors (usually called bugs or glitches) in programs can have complex and unanticipated consequences.

These are the first three paragraphs of chapter 1. What a cool metaphor! Sadly, the use of this metaphor is limited to these paragraphs.

-

“I think that it’s extraordinarily important that we in computer science keep fun in computing. When it started out, it was an awful lot of fun. Of course, the paying customers got shafted every now and then, and after a while we began to take their complaints seriously. We began to feel as if we really were responsible for the successful, error-free perfect use of these machines. I don’t think we are. I think we’re responsible for stretching them, setting them off in new directions, and keeping fun in the house. I hope the field of computer science never loses its sense of fun. Above all, I hope we don’t become missionaries. Don’t feel as if you’re Bible salesmen. The world has too many of those already. What you know about computing other people will learn. Don’t feel as if the key to successful computing is only in your hands. What’s in your hands, I think and hope, is intelligence: the ability to see the machine as more than when you were first led up to it, that you can make it more.”

Alan J. Perlis (April 1, 1922-February 7, 1990)

Alan Perlis isn’t kidding (does he, ever?) — programming is fun again. Thank you, Clojure.

Regina Spektor – On The Radio   ★

What a beautiful song! Totally love the lyrics, especially in the second half. Thanks to Suman for introducing me to the artist.

Watch the video:

I love this phrase:

And everyone must breathe
Until their dying breath

And my favourite bit:

You peer inside yourself
You take the things you like
And try to love the things you took

And then you take that love you made
And stick it into some
Someone else’s heart
Pumping someone else’s blood

And walking arm in arm
You hope it don’t get harmed
But even if it does
You’ll just do it all again

Casual and Unprofessional: Not the same   ★

I’m peeved by a trend I’ve started to see among Indian startups: trying to be casual and ending up being unprofessional.

Being casual may be a good thing or a bad thing, depending on who you are targeting. But being unprofessional is never a good thing.

<Edit: I’ve taken off the rant about a particular website that I had here.>

The worrying thing is that this kind of unprofessionalism is all too common among Indian startups. (Even my own company’s corporate website is an offender. At least we are not trying to use it as the forefront of our offerings. But that’s not an excuse.) We need to buckle down and get our act together.

Now, I’m no spelling/grammar Nazi. Heck, my grammar sucks and I look up spellings on Google at least five times a day. But that’s no excuse for putting up a public website that is sprinkled with spelling mistakes and grammatical errors. Your website is an indicator of how professionally you handle everything else.

Here are some concrete suggestions (based on my observations):

I’ll keep updating this list. You can make suggestions in the comments.

It doesn’t take a lot to avoid these mistakes. Please do.

Beginning Functional Programming   ★

Warning: Geekspeak ahead.

I’ve been wanting to learn a functional programming (FP) language for quite a while now.

Ruby being my current fav programming language while also being malleable to FP, I thought I would dabble with some FP in Ruby. But that was not to be. While Ruby does have some basic features like closures that are necessary for FP (although it does not boast of a lot of clarity), one has to jump through a lot of hoops to achieve immutability; the language offers no assistance here. The following video provides a good overview of what it takes to do FP using Ruby: Better Ruby through Functional Programming by Dean Wampler.

With Ruby out, I now needed a language that embraced FP wholeheartedly. And there is no dearth of choices here; see the Functional Languages category on Wikipedia. I decided to narrow down the list by picking up the more popular and/or adopted languages. This left me with Clojure, Common Lisp, Erlang, F Sharp, Haskell and Scala (but many don’t seem to consider Scala to be truly functional). That’s still too many languages. But the list is small enough that I could learn enough about each of these languages to be able to pick one that I want to learn right now.

Clojure: Lisp dialect, dynamically typed, runs on JVM (and embraces it; implying that any Java library can be used with Clojure with little extra effort), great community. Moreover, it is embraced by my company (one of our very important backend platforms is written mostly in Clojure). More on Clojure later.

Common Lisp: The oldest of them all, dynamically typed, really cool. But too many parens for my liking!

Erlang: Runs on its own rock solid concurrent runtime system, offers distributed concurrency for free, dynamically typed, allows hot swapping and…

Meanwhile, back at Ericsson, some Erlang-based products that were already in progress when the “ban” went into effect came to market, including the AXD 301, an ATM switch with 99.9999999 percent reliability (9 nines, or 31 ms. downtime a year!), which has captured 11% of the world market. The AXD 301 system includes 1.7 million lines of Erlang: This isn’t just some academic language.

(from: Erlang in BYTE.com; emphasis mine.)
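The "31 ms" figure in that quote is easy to check. A quick sketch, assuming a 365-day year and reading "N nines" as an unavailable fraction of 10^-N (the function name is made up for illustration):

```clojure
;; Milliseconds in a 365-day year.
(def ms-per-year (* 365 24 60 60 1000))

(defn downtime-ms-per-year
  "Allowed downtime per year, in ms, for an availability of `nines` nines.
   Uses exact rational arithmetic to avoid floating-point noise."
  [nines]
  (/ ms-per-year (apply * (repeat nines 10))))

;; Nine nines (99.9999999%) works out to roughly 31.5 ms a year.
(println (double (downtime-ms-per-year 9)))
```

By the same formula, the more commonly quoted "five nines" allows a little over five minutes of downtime a year.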

F Sharp: Runs on the CLR (.NET Framework).

Haskell: Pure FP language, static typing.

Scala: Runs on JVM, static typing. Very complicated type system, from what I hear.

The above is obviously not a very thorough overview of the languages. It was not my intention to develop an overview at all; I just gathered enough information to be able to make a decision.

I like dynamic typing and working on Linux. My company’s main platform is the JVM. I mostly write Information Retrieval (IR) software and a lot of IR libraries are in Java. Clojure!

But I also like Erlang. I’ll get to it after I have tamed Clojure. And I also want to learn the R programming language so that I can implement statistical algorithms with ease.

But what’s the point of this blog post? Nothing. I plan to blog a bit about FP in general and Clojure in particular and thought that it would be nice to provide some context. Also, I wanted to document at least part of the decision making process I have been through for picking up Clojure.

I’m NOT proud to be an Indian   ★

Summary: Stop being proud of your country’s history. Stop being proud of your countrymen’s achievements. Stop being proud of your country. Strive to make your country proud of you.

(Yes, I’m preaching. Only for today.)

Social networks — Facebook in particular — have created a super-culture of show-off. So when it’s the Indian Independence Day, everyone wants to show off how patriotic they are. In come shouts of “Proud to be an Indian.” The following ad by Airtel is an embodiment of all such sentiments:

(Make no mistake, they’re not talking about parental pride here. That I don’t have a problem with.)

This annoys me to no end. I just don’t get how anyone is able to take pride in something to which their own contribution has been an absolute zilch. I can understand a feeling of gratefulness. Or reverence. Or even happiness. But pride? How can you ever be proud of something you were merely born into? This applies to taking pride in a country as much as taking pride in a religion, caste or family.

PS: Lest somebody misunderstand me: I love India as much as anyone else; I’m happy to be here and would rather be here than anywhere else.

Cognitive Dissonance of the Comma   ★

Here is an excerpt from Jawaharlal Nehru’s Tryst with Destiny speech. Beautiful speech. But can you see what is wrong with how it has been reproduced here?

Long years ago we made a tryst with destiny , and now the time comes when we shall redeem our pledge , not wholly or in full measure , but very substantially . At the stroke of the midnight hour , when the world sleeps , India will awake to life and freedom . A moment comes , which comes but rarely in history , when we step out from the old to the new , when an age ends , and when the soul of a nation , long suppressed , finds utterance . Its fitting that at this solemn moment we take the pledge of dedication to the service of India and her people and to the still larger cause of humanity .

At the dawn of history India started on her unending quest , and trackless centuries are filled with her striving and the grandeur of her success and her failures . Through good and ill fortune alike she has never lost sight of that quest or forgotten the ideals which gave her strength . We end today a period of ill fortune and India discovers herself again . The achievement we celebrate today is but a step , an opening of opportunity , to the greater triumphs and achievements that await us . Are we brave enough and wise enough to grasp this opportunity and accept the challenge of the future ?

The glaringly obvious problem: there is a space before every comma, full stop and question mark. It is so obvious only because your brain wasn’t expecting it; the brain has certain ideas about how these should be used, which clash with how the above snippet uses them. I call this conflict cognitive dissonance of the comma.

(I do realize that cognitive dissonance may not be the most appropriate term to use here. However, it’s pretty close and makes it sound like I know what I’m talking about. Maybe I do.)

Now, I’m not suggesting that people are going around thinking “this is how a comma should be used” all the time. Neither am I suggesting that people will have any problem in understanding the above; our brain is capable of ignoring such aberrations to some extent, thus allowing us to understand what is being said. But years of setting expectations for the brain take their toll. When an expectation isn’t met, the brain takes notice, at least at a subconscious level. I can’t imagine any prose where you would want to divert the reader’s attention towards the punctuation; the only accomplishment would be to take away some of the focus from the message.

I have heard some very amusing reasons for why people continue to use punctuation like this. The common theme seems to be: “I like it this way. I find text more readable when it is this way.” Great, so your brain is wired differently from most other people; nothing wrong with that. But the question you need to ask yourself is: “Who am I writing for? For myself or for others?”

Another mistake in the above reproduction — one that is not as obvious but far more common — is the use of “its” where “it’s” should have been used. http://www.its-not-its.info/ does a great job at explaining the difference, so I’ll not attempt to repeat it here. In summary: “it’s” is just short for “it is” and “its” indicates possessiveness.

Yet another mistake that I notice being made very often (this is not present in the above snippet): using “few” when one means “a few.” This is a big problem because the meaning is the exact opposite! “Few” has a negative connotation, indicating an almost complete absence, whereas “a few” has a positive connotation, indicating the presence of something. This discussion at EnglishForums.com offers some good examples: Difference between ‘few’ and ‘a few’. Here are a few posts from PluGGd.in, a popular blog on Indian startups, where this mistake has been made in the title of the post itself:

(My intention for pointing to PluGGd.in is only to show how common this mistake is, not to pick on them; Ashish, the guy who runs the blog, is a friend.)

When PluGGd.in says “Content Filtering: Few Great Lessons from Twitter and FriendFeed,” it seems like they want to say that Twitter and FriendFeed have done a terrible job and offer almost no lessons we can use, when, in fact, they list out some very good lessons they have gleaned for us!

What I want to be when I grow up…   ★

(via)

They dont understand life.

Few people do understand life.

Twitter is down   ★

Twitter is down. I’m missing it. I’m addicted. Sigh.

PS: This post is shorter than 140 characters. It should have been a tweet.