Thursday, October 25, 2007

Google Books and Google Feeds

I've really been enjoying playing around with Google Books lately. I've wanted to get my (ever-growing) home library into some kind of shared system for years, but entering hundreds of ISBNs into a spreadsheet never seemed like how I wanted to spend my time. And even then, how would I convert or import that data into something useful?

The first big win was that, with Google Books, I could perform advanced queries, such as by title and author, much the same way I'd perform an advanced web search. So, for example:
intitle:"agile web development with rails" inauthor:"hansson"

Even though the search would be slightly fuzzier than an exact ISBN search, if it meant not having to pull each and every book off the shelf to get the ISBN, I was willing to deal with a few errors and mismatches that I could correct by hand.

Since I was really itching to see my books show up in the library as soon as possible, I decided to go quick and dirty and write a Python screen-scraper. The basic idea is: for each book in the spreadsheet, submit the search query, then scrape the results looking for an "Add to my library" link. To do that, however, I first needed to log in to Google Books and capture the User-Agent and Cookie headers that associated me with my session and my library. That logic makes up one of the only two interesting parts of the bot:
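import urllib, urllib2

# HARDCODED_USER_AGENT and HARDCODED_GOOGLE_COOKIE are the header values
# captured from a logged-in browser session, defined elsewhere in the bot.

def getBookSearchRequest(title, author):
    # URL-encode the advanced query, e.g. intitle:"..." inauthor:"..."
    query = urllib.quote(
        'intitle:"'+title+'" inauthor:"'+author+'"')
    req = urllib2.Request(
        'http://books.google.com/books'+
        '?as_brr=0'+ #the advanced search flag
        '&q='+query+
        '&btnG=Search+Books')
    req.add_header('Host', 'books.google.com')
    # Reuse the browser's identity so the request is tied to my library.
    req.add_header('User-Agent', HARDCODED_USER_AGENT)
    req.add_header('Cookie', HARDCODED_GOOGLE_COOKIE)
    return req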
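
For anyone following along on Python 3, where urllib2 no longer exists, here is a minimal sketch of the same request builder. The function names and the idea of passing the headers in as arguments are my own choices for illustration, not part of the original bot; the URL parameters are copied from the code above.

```python
import urllib.request
from urllib.parse import quote


def build_book_search_url(title, author):
    """Build the search URL for an intitle/inauthor query."""
    query = 'intitle:"%s" inauthor:"%s"' % (title, author)
    return ('http://books.google.com/books'
            '?as_brr=0'                 # the advanced search flag, as above
            '&q=' + quote(query) +      # percent-encode quotes, colons, spaces
            '&btnG=Search+Books')


def get_book_search_request(title, author, user_agent, cookie):
    """Return a Request carrying the captured session headers.

    user_agent and cookie are assumed to be copied by hand from a
    logged-in browser session, as described in the post.
    """
    req = urllib.request.Request(build_book_search_url(title, author))
    req.add_header('User-Agent', user_agent)
    req.add_header('Cookie', cookie)
    return req


if __name__ == '__main__':
    print(build_book_search_url('agile web development with rails', 'hansson'))
```

Splitting the URL construction from the request makes the encoding easy to check by eye before any network traffic happens.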

The other interesting part was scraping for the link:
import re, cgi

for book in books:
    page = urllib2.urlopen(
        getBookSearchRequest(
            book.name, book.author)).read()
    match = re.search(
        r'<a href="([^"]*)">\s*Add to my library\s*</a>',
        page,
        re.DOTALL)
    if not match:
        continue
    link = match.group(1)
    # handle the result...
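
To make the scraping step easier to try in isolation, here is a small Python 3 sketch that runs the same regular expression against a canned snippet of HTML. The sample markup and the `find_add_link` helper are invented for illustration, and unescaping the captured href is an assumption about what "handling the result" might involve; the real results page may well have looked different.

```python
import html
import re

# Same pattern as in the loop above: capture the href of the
# "Add to my library" anchor, allowing whitespace around the link text.
ADD_LINK_RE = re.compile(
    r'<a href="([^"]*)">\s*Add to my library\s*</a>')


def find_add_link(page):
    """Return the unescaped href of the add link, or None if absent."""
    match = ADD_LINK_RE.search(page)
    if match is None:
        return None
    # Attribute values in HTML source carry entities such as &amp;
    return html.unescape(match.group(1))


# Made-up markup, only meant to exercise the pattern.
SAMPLE_PAGE = '''
<div class="result">
  <a href="/books?id=EXAMPLE_ID&amp;action=add">
    Add to my library
  </a>
</div>
'''

if __name__ == '__main__':
    print(find_add_link(SAMPLE_PAGE))
    print(find_add_link('<p>No results found</p>'))
```

A regex like this is brittle by design: any extra attribute on the anchor tag would make it miss, which is the usual trade-off for a quick and dirty scraper.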

The rest basically boils down to HTTP retry logic and gracefully bailing out when no results are found. Anyway, before I had time to groan and say, "Ugh!" I had about 230 out of 250 books imported, most of which were actually not in the system as requested. Not too shabby.
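
The retry logic mentioned above is not shown in the post, so what follows is only a sketch of one common way to do it in Python 3: retry a fetch a fixed number of times with a growing pause, then give up. The `fetch_with_retries` name, the attempt count, and the doubling delay are all assumptions of mine.

```python
import time


def fetch_with_retries(fetch, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or attempts run out.

    fetch is any zero-argument callable, for example a lambda wrapping
    urllib.request.urlopen(...).read(). Network failures from urllib
    (URLError and friends) are subclasses of OSError, so that is what
    gets retried here; anything else propagates immediately.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError as error:
            last_error = error
            if attempt < attempts - 1:
                # Wait 1x, 2x, 4x... the base delay between tries.
                sleep(delay * (2 ** attempt))
    raise last_error
```

Passing `sleep` in as a parameter is purely a convenience so the function can be exercised without actually waiting.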

Having a library at Google Books also gave me an opportunity to play with the Google Feed JavaScript API. With a simple JavaScript call, you can retrieve data from any RSS or Atom feed, and dynamically inject it into your page with your Ajax library of choice.

Here is my otherwise empty personal page with the five books most recently added to my library. And here's the JavaScript code to pull the feed:
var feed = new google.feeds.Feed(
    'http://books.google.com/books?as_list='+
    'BDToX1-EQuq7cjr6nqdzfARoU-HJfh-GeA1cvLGf59B-j5Y0JG3Y'+
    '&output=rss');
feed.setNumEntries(5);
feed.load(function(result) {
    if (!result.error) {
        for (var i = 0; i < result.feed.entries.length; ++i) {
            // handle the item...
        }
    }
});
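
The same "latest five entries" idea can also be tried outside the browser. The sketch below is not the Google Feed API; it is a plain Python 3 stand-in that parses an RSS 2.0 document with the standard library and keeps the first few items. The sample feed is made up, and I am assuming the library feed follows the usual channel/item layout with title and link elements.

```python
import xml.etree.ElementTree as ET


def latest_entries(rss_text, limit=5):
    """Return up to `limit` items from an RSS 2.0 document, in feed order."""
    root = ET.fromstring(rss_text)
    entries = []
    for item in root.findall('./channel/item')[:limit]:
        entries.append({
            'title': item.findtext('title', default=''),
            'link': item.findtext('link', default=''),
        })
    return entries


# Invented feed content, only to show the shape of the data.
SAMPLE_RSS = '''<rss version="2.0">
  <channel>
    <title>My library</title>
    <item><title>First book</title><link>http://example.com/1</link></item>
    <item><title>Second book</title><link>http://example.com/2</link></item>
  </channel>
</rss>'''

if __name__ == '__main__':
    for entry in latest_entries(SAMPLE_RSS):
        print(entry['title'], entry['link'])
```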


2 Comments:

At October 25, 2007 5:27 PM , Blogger Andy Maleh said...

Sweet! What web technology do you use for your personal page?

 
At October 26, 2007 9:01 AM , Blogger Frederick Polgardy said...

I've become a huge evangelist for jQuery. For DOM traversal and manipulation it's just a clean, gorgeous API.

 
