Friday, July 30, 2010

Scraping with style: scrAPI toolkit for Ruby

There are a lot of ways to scrape HTML.
There are regular expressions, which deal well with text. But HTML is not just text, it’s markup. So you have to deal with elements that are implicitly closed or out of balance. Attributes are sometimes quoted, sometimes not. Nested lists and tables are a challenge. Good regular expressions take a lot of time to write and are nearly impossible to read.
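To make the quoting problem concrete, here is a minimal sketch in plain Ruby. The HTML fragment and both patterns are made up for illustration. The first pattern assumes double-quoted attributes and quietly misses the other two links; the second copes with all three quoting styles, but is already harder to read.

```ruby
# Hypothetical fragment: implicitly closed <li> elements and three quoting styles.
html = <<~HTML
  <ul>
    <li><a href="/one">One</a>
    <li><a href='/two'>Two</a>
    <li><a href=/three>Three</a>
  </ul>
HTML

# Naive pattern: only matches double-quoted href values.
href_quoted = /<a\s+href="([^"]*)"/i
naive = html.scan(href_quoted).flatten

# More tolerant pattern: double-quoted, single-quoted, or unquoted values.
href_any = /<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i
# Each match yields three capture groups, only one of which is non-nil.
tolerant = html.scan(href_any).map { |groups| groups.compact.first }

puts naive.inspect
puts tolerant.inspect
```

Even the tolerant pattern still ignores comments, scripts, and attribute values that contain a ">" character, which is where the time goes.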
Or you can clean up the HTML with Tidy, get a DOM, and walk the tree. The DOM is much easier to work with: it’s clean markup with a nice API. But you have to do a lot of walking to find the few elements you’re scraping. That’s still too much work.
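As a rough illustration of the walking involved, the sketch below uses Ruby's REXML library as a stand-in DOM. It assumes the markup has already been cleaned into well-formed XHTML (the job Tidy would do); the document and the collect_links helper are hypothetical.

```ruby
require "rexml/document"

# Assumed to be already tidied: well-formed, every element closed.
xhtml = <<~XHTML
  <html>
    <body>
      <div id="nav"><a href="/home">Home</a></div>
      <div id="content">
        <h1>Title</h1>
        <p>First <a href="/one">link</a></p>
      </div>
    </body>
  </html>
XHTML

doc = REXML::Document.new(xhtml)

# Hypothetical helper: visit every element depth-first and collect href values.
def collect_links(node, found = [])
  node.each_element do |child|
    href = child.attributes["href"]
    found << href if child.name == "a" && href
    collect_links(child, found)
  end
  found
end

links = collect_links(doc.root)
puts links.inspect
```

The walk visits every element in the tree just to pick out two anchors, and any extra condition (only links inside the content div, say) means more hand-written traversal logic.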

1 comment:

Darius said...

How to compare and choose scraping tools?
This series of posts is dedicated to executives taking charge of projects that entail scraping information from one or more websites.
http://www.fornova.net/blog/?p=18