🌱 Web crawlers
Writing a Web Crawler | The Bastards Book of Ruby — http://ruby.bastardsbook.com/chapters/web-crawling/
Parsing HTML with Nokogiri | The Bastards Book of Ruby — http://ruby.bastardsbook.com/chapters/html-parsing/
wombat (Ruby DSL for web scraping) — https://github.com/felipecsl/wombat
Example crawler
require 'nokogiri'
require 'open-uri'

base_wikipedia_url = "http://en.wikipedia.org"
list_url = "#{base_wikipedia_url}/wiki/List_of_Nobel_laureates"

# Fetch and parse the list page. Use URI.open: the bare Kernel#open
# patch from open-uri was deprecated in Ruby 2.7 and removed in 3.0.
page = Nokogiri::HTML(URI.open(list_url))

# All rows of the laureates table; [1..-2] skips the header row
# and the final footer row.
rows = page.css('div.mw-content-ltr table.wikitable tr')
rows[1..-2].each do |row|
  # Collect unique internal links from the row's cells, keeping only
  # hrefs that start with /wiki/ (the original /^\wiki\// regex was a
  # typo that never matched).
  hrefs = row.css("td a").map { |a|
    a['href'] if a['href'] =~ %r{\A/wiki/}
  }.compact.uniq

  hrefs.each do |href|
    remote_url = base_wikipedia_url + href
    puts remote_url
  end
end
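The href-filtering step (keep internal /wiki/ links, drop duplicates and anchors) is plain Ruby and can be exercised on its own, without the network or Nokogiri. A minimal sketch with made-up sample hrefs:

```ruby
# Standalone sketch of the link-filtering step above. The input array
# is hypothetical sample data, not taken from a real Wikipedia page.
hrefs = [
  "/wiki/Marie_Curie",
  "/wiki/Marie_Curie",      # duplicate, removed by uniq
  "#cite_note-1",           # in-page anchor, filtered out by the regex
  "https://example.org/x",  # external link, filtered out
  "/wiki/Niels_Bohr",
]

# Same map/compact/uniq pipeline as the crawler: map yields nil for
# non-matching hrefs, compact drops the nils, uniq removes duplicates.
internal = hrefs.map { |h| h if h =~ %r{\A/wiki/} }.compact.uniq
puts internal  # prints /wiki/Marie_Curie and /wiki/Niels_Bohr
```

Using `%r{\A/wiki/}` avoids escaping the slashes and anchors at the true start of the string, so in-page fragments and external URLs never match.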
Web Scraping with Ruby and Nokogiri for Beginners | Distilled