Austin Story

Ruby, Rails and Javascript Blog


Browser Scraping and Data Mining Part 1

October 5, 2013 By Austin Story 2 Comments

I am getting more and more requests to collect data from websites, so I wanted to share how I approach data mining and browser scraping.  Keep in mind that every website has its own terms of use and rights to its content.  Make sure you have the right to scrape the data before you do.

Tools

I am a Ruby on Rails programmer, so my natural flow is to use Active Record and Ruby to solve just about all my problems.  Here is what I use:

  1. Ruby 2.0.0+ and Rails
  2. Linux Ubuntu 12.04
  3. Chrome Dev Tools
  4. Nokogiri gem – Parses HTML into a neat, easy-to-use data structure
  5. Watir Webdriver gem – Browser scraping
  6. StreetAddress gem – Parses addresses using an old Perl algorithm

Important extra gems and versions that work together well as of October 2013

[ruby]
# Gemfile entries (straight quotes are required; curly quotes are not valid Ruby)
gem 'nokogiri', '1.5.6'
gem 'watir-webdriver', '0.6.2'
gem 'StreetAddress', '1.0.3', :require => "street_address"
[/ruby]

Approach

  1. Presentation – Determine how the website is giving the data (table, iframe, etc)
  2. Navigation – Determine how to cycle through the website to gather more pages of data

In Part 2 I will focus on the presentation step and show how to write some Ruby code to grab data from websites.

Filed Under: Programming, Ruby on Rails, Uncategorized Tagged With: Browser Scraping, Data Mining, Nokogiri, ruby, ruby on Rails, Watir-webdriver

Comments

  1. Cesar says

    January 11, 2016 at 9:33 am

    Hello,

    Recently I have been scraping with Ruby and Mechanize, but I cannot extract data from websites built with the AngularJS framework. Is there any way to do it? I read about the Watir gem but did not get any results.

    • Austin Story says

      January 11, 2016 at 8:09 pm

      Cesar, you will need to use something that allows the JavaScript to load. We use PhantomJS for testing an Angular application on a project I work on. That is where I would start.

      https://rubygems.org/gems/phantomjs/versions/1.9.8.0
