---
layout: post
title: Scraping Hubble (ruby http crawler)
---

Hubble takes the most beautiful images in the universe. The team is also cool enough to post the images on their site. I wanted to download them for use as backgrounds, or as art in future openFrameworks experiments. Ruby has a couple of libs that make this easier: the script I wrote for pulling the images down uses HTTParty and the standard Net::HTTP library. HubbleScrapper doesn’t take any arguments, but it does have some interesting tidbits in it.

Following 301

The script follows moved (301) responses in two ways. HTTParty follows redirects internally, and that is what fetches the index of the search. However, it did not work correctly for the images themselves: it pulled the preview rather than the full image. I found an example method that uses the standard lib and incorporated it into my fetch method. I'm not sure why HTTParty and the snippet below differ; I would have to look at the internals.

require 'net/http'

##
# fetch pulls uri_str using the standard Net::HTTP library, and recurses up to
# a limit if redirected
#
def fetch(uri_str, limit = 10)
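  # You should choose a better exception.
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0

  url = URI.parse(uri_str)
  # request_uri keeps the query string, which url.path would drop
  req = Net::HTTP::Get.new(url.request_uri, { 'User-Agent' => 'hubble-fetcher' })
  # enable TLS when the target (or a redirect target) is https
  response = Net::HTTP.start(url.host, url.port, use_ssl: url.scheme == 'https') { |http| http.request(req) }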
  case response
  when Net::HTTPSuccess     then response
  when Net::HTTPRedirection
    puts "Redirect Location: #{response['location']}"
    fetch(response['location'], limit - 1)
  else
    response.error!
  end
end
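
One thing the fetch method above takes for granted is that the Location header is always an absolute URL, since it is passed straight back into fetch. Servers sometimes send a relative path instead. A minimal sketch of resolving the header against the URL that produced it, using the standard URI.join, is below. The helper name resolve_redirect and the example.com URLs are made up for illustration; they are not part of the original script.

```ruby
require 'uri'

# Hypothetical helper (not in the original script): resolve a Location header
# against the URL that returned it, so relative redirects still work.
def resolve_redirect(base_url, location)
  URI.join(base_url, location).to_s
end

# A relative Location is resolved against the base URL...
puts resolve_redirect('http://example.com/gallery/page', '/images/full.jpg')
# ...while an absolute Location is returned unchanged.
puts resolve_redirect('http://example.com/gallery/page', 'https://cdn.example.com/full.jpg')
```

If this were wired in, the recursive call would pass resolve_redirect(uri_str, response['location']) instead of the raw header.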

Threads

I used a simple consumer model to handle threading. This isn’t producer-consumer, since production is completed before the threading starts. Eight threads are created (and later joined), and each one pops image page URLs off a shared list and fetches the image links independently.

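workers = (0...8).map do
  Thread.new do
    # each worker pops URLs until the shared list is empty (pop returns nil)
    while (url = @image_page_urls.pop)
      visit_image_page url
    end
  end
end
# wait for every worker to finish
workers.each(&:join)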
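
The loop above pops from a shared collection (presumably a plain Array), which tends to work on MRI but is not guaranteed to be thread-safe on every Ruby implementation. A hedged alternative is the built-in Queue, which is designed for sharing work between threads. This is a sketch of the same consumer model, not the script's actual code: process_in_parallel is a hypothetical name, and it assumes Ruby 2.3 or later for Queue#close.

```ruby
# Hypothetical helper (not in the original script): run the given block for
# every item, using a fixed number of worker threads and a thread-safe Queue.
def process_in_parallel(items, thread_count = 8)
  queue = Queue.new
  items.each { |item| queue << item }
  # All work is enqueued up front; closing the queue makes pop return nil
  # once it is empty instead of blocking forever.
  queue.close

  workers = Array.new(thread_count) do
    Thread.new do
      while (item = queue.pop)
        yield item
      end
    end
  end
  # Wait for every worker to drain the queue.
  workers.each(&:join)
end

# Example: stand-in work instead of real HTTP requests.
done = Queue.new
process_in_parallel(%w[page-1 page-2 page-3], 2) { |page| done << page.upcase }
puts done.size
```

In the scraper, the block would call visit_image_page with each URL.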