Using Ruby’s http library – download and process web pages – I

Ruby has excellent networking support. It offers low-level networking features such as sockets over TCP/IP, and a high-level API for protocols such as HTTP and FTP. In this post we will look at Ruby’s net/http library and at how it can be used to download and process web pages.

1. Downloading a web page using Ruby

The following code uses the net/http library to download Google’s home page. You should see the Google home page HTML in the console!

require 'net/http'

class HttpSample
  # Fetches the Google home page and prints its HTML to the console.
  def downloadGoogleHome
    http_response = Net::HTTP.get_response(URI.parse('http://www.google.com/'))
    puts http_response.body
  end
end

s = HttpSample.new
s.downloadGoogleHome

On my machine, this returns text saying “document has moved”. This is because Google is sending a redirect to www.google.co.in. The following code shows how to handle a single HTTP redirect.

require 'net/http'

class HttpSample
  # Fetches the Google home page, following one redirect if the server sends it.
  def downloadGoogleHome
    http_response = Net::HTTP.get_response(URI.parse('http://www.google.com/'))
    if http_response.kind_of?(Net::HTTPRedirection)
      # The Location header holds the URL we are being redirected to.
      new_url = http_response['Location']
      http_response = Net::HTTP.get_response(URI.parse(new_url))
    end
    puts http_response.body
  end
end

s = HttpSample.new
s.downloadGoogleHome
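
The version above follows only one redirect and assumes the Location header holds an absolute URL. As a rough sketch, a bounded loop can follow several redirects in a row. The method name fetch_following_redirects, the default limit of 5, and the fetcher parameter are assumptions made for illustration and are not part of net/http; the fetcher argument exists only so the loop can be tried without a network connection.

```ruby
require 'net/http'
require 'uri'

# Makes at most `limit` requests, following redirects, and returns the first
# non-redirect response. `fetcher` is a hypothetical injection point; by
# default it is the real Net::HTTP.get_response.
def fetch_following_redirects(url, limit = 5, fetcher = Net::HTTP.method(:get_response))
  uri = URI.parse(url)
  limit.times do
    response = fetcher.call(uri)
    return response unless response.kind_of?(Net::HTTPRedirection)

    location = response['Location']
    # Some redirection-class responses (for example 304) carry no Location.
    return response if location.nil?

    # Location may be relative, so resolve it against the current URI.
    uri = URI.join(uri.to_s, location)
  end
  raise "Too many redirects (limit: #{limit})"
end

# Example use (needs network access):
#   puts fetch_following_redirects('http://www.google.com/').body
```

Raising after a fixed number of requests keeps a misconfigured server from sending the client around a redirect loop forever.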

Now how do we rewrite this if we need to use a proxy server to connect to the internet? In Ruby it is pretty simple. Check out the new version below.

require 'net/http'

class HttpSample
  # Fetches the Google home page through an HTTP proxy and prints its HTML.
  def downloadGoogleHome
    # Replace 'ipaddress' and portnumber with the proxy's actual IP and port.
    proxy = Net::HTTP::Proxy('ipaddress', portnumber)
    url = URI.parse('http://www.google.com')
    http_response = proxy.get_response(url)
    puts http_response.body
  end
end

s = HttpSample.new
s.downloadGoogleHome
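
Net::HTTP::Proxy expects a bare host name or IP address plus a port, so passing a full URL such as 'http://proxy:8080' as the address will typically fail during name lookup. The sketch below shows one way to turn a proxy URL into the two arguments. The helper name proxy_class_from_url and the host proxy.example.com are hypothetical and used only for illustration.

```ruby
require 'net/http'
require 'uri'

# Builds a proxy-enabled Net::HTTP class from a string such as
# 'http://proxy.example.com:8080' or 'proxy.example.com:8080'.
# (Hypothetical helper; not part of net/http.)
def proxy_class_from_url(proxy)
  # Add a scheme when it is missing so URI.parse can find the host and port.
  proxy = "http://#{proxy}" unless proxy.include?('://')
  uri = URI.parse(proxy)
  # Pass only the host name and the port, never the full URL.
  Net::HTTP::Proxy(uri.host, uri.port)
end

# Example use (needs a reachable proxy):
#   proxy = proxy_class_from_url('http://proxy.example.com:8080')
#   puts proxy.get_response(URI.parse('http://www.google.com/')).body
```

If the string carries no port, URI.parse falls back to the default HTTP port 80, which is rarely what a proxy listens on, so it is safer to always spell the port out.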

3 Responses to “Using Ruby’s http library – download and process web pages – I”

  1. I have been trying to write a test program to download a webpage but it wasn’t working. So, searched and found this link. With this code as well I am getting the same error and it’s not working for me. I am behind proxy. Below are the details. Plz help me out in this.

    C:\myprograms>ruby -v
    ruby 1.8.6 (2008-08-11 patchlevel 287) [i386-mswin32]

    test2.rb :
    —————————–
    require 'net/http'
    class HttpSample
    def downloadGoogleHome
    proxy = Net::HTTP::Proxy('autocache.hp.com',8080) # use actual ip and port
    url = URI.parse('http://www.google.com')
    http_response = proxy.get_response(url)
    puts http_response.body
    end
    s = HttpSample.new
    s.downloadGoogleHome
    end
    ———–
    Error:

    C:\myprograms>ruby test2.rb
    C:/Ruby/lib/ruby/1.8/net/http.rb:560:in `initialize': A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. - connect(2) (Errno::ETIMEDOUT)
    from C:/Ruby/lib/ruby/1.8/net/http.rb:560:in `open'
    from C:/Ruby/lib/ruby/1.8/net/http.rb:560:in `connect'
    from C:/Ruby/lib/ruby/1.8/timeout.rb:53:in `timeout'
    from C:/Ruby/lib/ruby/1.8/timeout.rb:93:in `timeout'
    from C:/Ruby/lib/ruby/1.8/net/http.rb:560:in `connect'
    from C:/Ruby/lib/ruby/1.8/net/http.rb:553:in `do_start'
    from C:/Ruby/lib/ruby/1.8/net/http.rb:542:in `start'
    from C:/Ruby/lib/ruby/1.8/net/http.rb:379:in `get_response'
    from test2.rb:6:in `downloadGoogleHome'
    from test2.rb:10

    ————–
    I have tried using the proxy name with http:// also and that gives socketerror. I have tried using IP for the proxy and get same kinds of errors. Can you plz help me regarding this?

  2. i tried to use ur last code i.e
    # require 'net/http'
    # class HttpSample
    # def downloadGoogleHome
    # proxy = Net::HTTP::Proxy('ipaddress', portnumber) # use actual ip and port
    # url = URI.parse('http://www.google.com')
    # http_response = proxy.get_response(url)
    # puts http_response.body
    # end
    # s = HttpSample.new
    # s.downloadGoogleHome
    # end

    But it gives me an error :
    SocketError
    getaddrinfo: Name or service not known
    Can u help me how to fix it ?

  3. don’t forget OpenURI
    require 'open-uri'
    URI.open("http://www.ruby-lang.org/") {|f|
    f.each_line {|line| p line}
    }
