Friday, October 26, 2007

Beautiful Code

I just finished reading Beautiful Code on the train this morning. After all the attention to writing algorithms and solving interesting programming problems in a beautiful way, what actually struck me most was the second-to-last chapter, "Code in Motion," by a couple of the developers of the Perforce source-code control system. The chapter is a discussion and elaboration of a whitepaper on the Perforce website called "Seven Pillars of Pretty Code." The beauty under discussion here has nothing to do with the elegance of algorithms, or domain models, or APIs, though it has powerful indirect relationships to all these things. No, this paper is aimed purely at the stylistic beauty of source code, and the way it visually communicates its structure and intent. Or, to quote the opening sentence:

The essence of pretty code is that one can infer much about the code's structure from a glance, without completely reading it.

The paper doesn't need any additional explanation from me. I only want to point out a couple of things that really crystallized for me while I read it.

First of all, I've always been fairly adamant about keeping code to 80 columns. This is not because I do a lot of coding in vi or anything; I just like to view multiple "panes" of code side-by-side while I'm working. (At least, that's the most persuasive reason I give to co-workers who use the monitor-width argument to defend their heinously long line-writing practices.) I've always intuited that narrow code is easier to mentally "parse" and comprehend, but I'd never before heard it expressed so beautifully:

... the left edge of the code holds the structure and the right side holds the detail, and big long lines mix zones of structure and detail, confusing the reader.

It was difficult not to jump up out of my seat on the train and shout, "Yes! That's it!" Of course, mindlessly making your code narrower in and of itself won't fix structure/detail problems; but long lines are almost guaranteed to confuse the two. I began to think about some of my stylistic quirks, and realized that almost without fail, they are driven by the structure/detail issue, even if the motivation has been a mostly subconscious one. For example: in general, each item in a comma-separated list goes on its own line unless the entire list fits compactly and readably on one line. Breaking a list arbitrarily because you've hit the 80-column mark makes it impossible to see at a glance what your parameters are.
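To make the list rule concrete, here is a small sketch in Python. The helper `format_entry` and its parameters are made up purely for illustration; the only point is the layout of the two calls.

```python
def format_entry(title, editors, publisher, year):
    """Return a one-line description of a book (hypothetical helper)."""
    return "%s, ed. %s (%s, %d)" % (title, editors, publisher, year)


# Short list: it fits compactly and readably on one line, so it stays there.
short = format_entry("Beautiful Code", "Oram and Wilson", "O'Reilly", 2007)

# Longer list: one argument per line, so the left edge shows the call
# and each detail can be picked out at a glance.
long_form = format_entry(
    "Beautiful Code: Leading Programmers Explain How They Think",
    "Andy Oram and Greg Wilson",
    "O'Reilly Media",
    2007,
)
```

The alternative, filling each line up to column 80 and wrapping wherever the limit happens to fall, produces the same result at runtime but hides where one argument ends and the next begins.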

This discussion also segues nicely into the problem of indentation. Since I've been programming in Python for many years, this is an issue I've had a long time to think about. Most people coming to Python for the first time complain that the syntactically significant indentation is the one thing that stands in their way of really liking the language. It is frequently claimed by Python evangelists that code that uses syntactic indentation is actually easier to read than code that uses braces or do/end-type control statements to delineate blocks. I agree with this statement up to a point, but in a larger sense it misses the real issue. What matters more is that we should be going out of our way to eliminate indentation as much as possible:

Forcibly align the main flow of control down the left side, with one level of indentation for if/while/for/do/switch statements. Use break, continue, return, even 'goto' to coerce the code into left-side alignment. Rearrange conditionals so that the block with the quickest exit comes first, and then return (or break, or continue) so that the other leg can continue at the same indentation level.

Of course, it goes without saying, pull out conditional logic into smaller functions and methods wherever appropriate, with their own sensible indentation schemes.
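As a rough illustration of the "quickest exit first" advice, here are two versions of the same function in Python. The function names and the dictionary layout of `book` are assumptions invented for this sketch; they do not come from the paper.

```python
def label_nested(book):
    """Deeply nested version: the main flow drifts to the right."""
    if book is not None:
        if book.get("title"):
            if book.get("author"):
                return "%s by %s" % (book["title"], book["author"])
            else:
                return book["title"]
        else:
            return "(untitled)"
    else:
        return "(missing)"


def label_flat(book):
    """Same logic with early returns: the main flow stays on the left."""
    if book is None:
        return "(missing)"
    if not book.get("title"):
        return "(untitled)"
    if not book.get("author"):
        return book["title"]
    return "%s by %s" % (book["title"], book["author"])
```

Both functions return the same results for every input; the second one simply handles each quick exit first so that the remaining code never needs more than one level of indentation.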

What I love most about Python indentation isn't that short functions are easier to read; it's that long, deeply indented functions are impossible to read, and so you're virtually forced to refactor your code to make it comprehensible.

Thursday, October 25, 2007

Google Books and Google Feeds

I've really been enjoying playing around with Google Books lately. I've wanted to get my (ever-growing) home library into some kind of shared system for years, but entering hundreds of ISBNs into a spreadsheet just never seemed like how I wanted to spend my time. And even then, how would I convert or import that data into something useful?

The first big win was that, with Google Books, I could perform advanced queries, such as by title and author, much the same way I'd perform an advanced web search. So, for example:
intitle:"agile web development with rails" inauthor:"hansson"

Even though the search would be slightly fuzzier than an exact ISBN search, if it meant not having to pull each and every book off the shelf to get the ISBN, I was willing to deal with a few errors and mismatches that I could correct by hand.

Since I was really itching to see my books show up in the library as soon as possible, I decided to go quick and dirty and write a Python screen-scraper. The basic idea is: for each book in the spreadsheet, submit the search query, then scrape the results looking for an "Add to my library" link. However, in order to do that, I first needed to log in to Google Books and capture the User-Agent and Cookie headers that associated me with my session and my library. That logic makes up one of the only two interesting parts of the bot:
import urllib, urllib2

def getBookSearchRequest(title, author):
    query = urllib.quote(
        'intitle:"'+title+'" inauthor:"'+author+'"')
    req = urllib2.Request(
        'http://books.google.com/books'+
        '?as_brr=0'+ #the advanced search flag
        '&q='+query+
        '&btnG=Search+Books')
    req.add_header('Host', 'books.google.com')
    req.add_header('User-Agent', HARDCODED_USER_AGENT)
    req.add_header('Cookie', HARDCODED_GOOGLE_COOKIE)
    return req

The other interesting part was scraping for the link:
import re, cgi

for book in books:
    page = urllib2.urlopen(
        getBookSearchRequest(
            book.name, book.author)).read()
    match = re.search(
        r'<a href="([^"]*)">\s*Add to my library\s*</a>',
        page,
        re.DOTALL)
    if not match:
        continue
    link = match.group(1)
    # handle the result...

The rest basically boils down to HTTP retry logic and gracefully bailing out when no results are found. Anyway, before I had time to groan and say, "Ugh!" I had about 230 out of 250 books imported, most of which were actually the books I had requested. Not too shabby.
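The retry logic itself isn't shown in the post, so the following is only a guess at its general shape, not the bot's actual code. The name `fetch_with_retry` and its parameters are hypothetical. It takes any zero-argument callable, which keeps the sketch independent of `urllib2`; catching `IOError` covers `urllib2.URLError`, which is a subclass of it.

```python
import time


def fetch_with_retry(fetch, attempts=3, delay=1.0):
    """Call fetch() until it succeeds or the attempts run out.

    Returns whatever fetch() returns, or None if every attempt
    raised IOError (hypothetical helper, not from the original bot).
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except IOError:
            # Network failures from urllib2 arrive as IOError subclasses.
            # Pause before trying again, but not after the final attempt.
            if attempt + 1 < attempts:
                time.sleep(delay)
    return None
```

Inside the scraping loop, this could wrap the download as `fetch_with_retry(lambda: urllib2.urlopen(req).read())`, followed by a `continue` when the result is `None`, which keeps the bail-out path as a quick exit at the top of the loop body.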

Having a library at Google Books also gave me an opportunity to play with the Google Feed JavaScript API. With a simple JavaScript call, you can retrieve data from any RSS or Atom feed, and dynamically inject it into your page with your Ajax library of choice.

Here is my otherwise empty personal page with the five books most recently added to my library. And here's the JavaScript code to pull the feed:
var feed = new google.feeds.Feed(
    'http://books.google.com/books?as_list='+
    'BDToX1-EQuq7cjr6nqdzfARoU-HJfh-GeA1cvLGf59B-j5Y0JG3Y'+
    '&output=rss');
feed.setNumEntries(5);
feed.load(function(result) {
    if (!result.error) {
        for (var i = 0; i < result.feed.entries.length; ++i) {
            // handle the item...
        }
    }
});
