I have been getting more and more requests to collect data from websites, so I wanted to share how I approach data mining and browser scraping. Keep in mind that every website has its own terms of use and rights to its content. Make sure you have the right to scrape data before you do.
Tools
I am a Ruby on Rails programmer, so my natural flow is to use Active Record and Ruby to solve just about all my problems. Here is what I use:
- Ruby 2.0.0+ and Rails
- Linux Ubuntu 12.04
- Chrome Dev Tools
- Nokogiri gem – Parses HTML into a neat, easy-to-navigate data structure
- Watir WebDriver gem – Drives a real browser for scraping
- StreetAddress gem – Parses addresses using an old Perl algorithm
These gem versions work together well as of October 2013:
[ruby]
gem 'nokogiri', '1.5.6'
gem 'watir-webdriver', '0.6.2'
gem 'StreetAddress', '1.0.3', :require => "street_address"
[/ruby]
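To give a feel for how these gems fit together, here is a minimal sketch: Watir drives the browser, Nokogiri parses the rendered HTML, and StreetAddress normalizes an address string. The URL and selectors below are placeholders, not from a real site.

[ruby]
require 'watir-webdriver'
require 'nokogiri'
require 'street_address'

# Hypothetical page -- substitute a site you have permission to scrape.
browser = Watir::Browser.new :chrome
browser.goto 'http://example.com/listings'

# Hand the rendered HTML to Nokogiri for fast, convenient parsing.
doc = Nokogiri::HTML(browser.html)
doc.css('h1').each { |heading| puts heading.text.strip }

# StreetAddress turns a free-form address string into structured fields.
address = StreetAddress::US.parse('1600 Pennsylvania Ave, Washington, DC 20500')
puts address.street # => "Pennsylvania"
puts address.city   # => "Washington"

browser.close
[/ruby]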
Approach
- Presentation – Determine how the website presents the data (table, iframe, etc.)
- Navigation – Determine how to cycle through the website's pages to gather more data (see the sketch after this list)
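As a hypothetical sketch of both steps, the loop below reads rows out of a results table (presentation) and clicks a "Next" link until it runs out (navigation). Again, the URL and selectors are assumptions you would adjust for your target site.

[ruby]
require 'watir-webdriver'
require 'nokogiri'

browser = Watir::Browser.new :chrome
browser.goto 'http://example.com/results'

rows = []
loop do
  # Presentation: pull each table row out of the rendered page.
  doc = Nokogiri::HTML(browser.html)
  doc.css('table#results tr').each do |tr|
    cells = tr.css('td').map { |td| td.text.strip }
    rows << cells unless cells.empty? # skip header rows, which use <th>
  end

  # Navigation: follow the "Next" link until it disappears.
  next_link = browser.link(:text => 'Next')
  break unless next_link.exists?
  next_link.click
end

browser.close
puts "Collected #{rows.size} rows"
[/ruby]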
In Part 2, I will focus on the presentation step and show how to write Ruby code to grab data from websites.