🌱 Web crawlers
Writing a Web Crawler | The Bastards Book of Ruby — http://ruby.bastardsbook.com/chapters/web-crawling/
Parsing HTML with Nokogiri | The Bastards Book of Ruby — http://ruby.bastardsbook.com/chapters/html-parsing/
wombat (Ruby DSL for web scraping) — https://github.com/felipecsl/wombat
Example crawler
require 'nokogiri'
require 'open-uri'

base_wikipedia_url = "http://en.wikipedia.org"
list_url = "#{base_wikipedia_url}/wiki/List_of_Nobel_laureates"

# Fetch and parse the list page. Use URI.open: the bare Kernel#open
# patch from open-uri was deprecated in Ruby 2.7 and removed in 3.0.
page = Nokogiri::HTML(URI.open(list_url))

# All rows of the laureates table; [1..-2] skips the header row
# and the final footer row.
rows = page.css('div.mw-content-ltr table.wikitable tr')
rows[1..-2].each do |row|
  # Collect unique internal links from the row's cells, keeping only
  # hrefs that start with /wiki/ (the original /^\wiki\// regex was a
  # typo that never matched).
  hrefs = row.css("td a").map { |a|
    a['href'] if a['href'] =~ %r{\A/wiki/}
  }.compact.uniq

  hrefs.each do |href|
    remote_url = base_wikipedia_url + href
    puts remote_url
  end
end
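The href-filtering step (keep internal /wiki/ links, drop duplicates and anchors) is plain Ruby and can be exercised on its own, without the network or Nokogiri. A minimal sketch with made-up sample hrefs:

```ruby
# Standalone sketch of the link-filtering step above. The input array
# is hypothetical sample data, not taken from a real Wikipedia page.
hrefs = [
  "/wiki/Marie_Curie",
  "/wiki/Marie_Curie",      # duplicate, removed by uniq
  "#cite_note-1",           # in-page anchor, filtered out by the regex
  "https://example.org/x",  # external link, filtered out
  "/wiki/Niels_Bohr",
]

# Same map/compact/uniq pipeline as the crawler: map yields nil for
# non-matching hrefs, compact drops the nils, uniq removes duplicates.
internal = hrefs.map { |h| h if h =~ %r{\A/wiki/} }.compact.uniq
puts internal  # prints /wiki/Marie_Curie and /wiki/Niels_Bohr
```

Using `%r{\A/wiki/}` avoids escaping the slashes and anchors at the true start of the string, so in-page fragments and external URLs never match.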
Web Scraping with Ruby and Nokogiri for Beginners | Distilled