Quick and Dirty URL Validation

Posted by Trevor in Ruby/Rails on February 09, 2009

I've come across a few different ways to validate URLs in my day, but they all seem a bit more complicated than necessary. Perhaps I'll see the wisdom of these techniques soon, but for now it seems like there's an easy solution to the problem:

 
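require 'net/http'

class Link < ActiveRecord::Base

  attr_accessible :url
  validate :validate_url

private

  # Mark the URL invalid unless a HEAD request returns 200, 301 or 302.
  def validate_url
    errors.add(:url) unless %w(200 301 302).include?(Link.status_code(self.url))
  end

  # Return the HTTP status code (as a string) for a HEAD request to url,
  # or nil if the URL does not match or the request fails.
  # Note: this always connects on port 80, even for https URLs.
  def self.status_code(url)
    regexp = url.match(/https?:\/\/([^\/]+)(.*)/)
    path = regexp[2].blank? ? '/' : regexp[2]
    Net::HTTP.start(regexp[1]) {|http| http.head(path).code}
  rescue
    nil
  end

end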
 

Et Voilà.

22 Comments

Abusable? Sure HEAD isn’t *supposed* to do anything, but I’ll bet money there are sites out there that have URLs you shouldn’t be blindly hitting. I guess a simple timeout wouldn’t be the end of the world, but could be moderately annoying. Also, I could make your servers show connections to TERRORISTS or kiddie porn servers, if I knew where to connect to them (which I don’t).

All in all, I’m not sure you want to leave your server connecting to another one for a blind check.

 Trevor

Err… I’m not sure how exactly this could be a security problem. Perhaps an example would help?

My last attempt got swallowed by your spam filter, I think.

Not a problem for your server getting hacked, but more in this line

http:/alqueda.com
http:/www.kiddiepr0n.com

http:/www.wellsfargo (thanks for the loan of your fat pipes for my DDOS)

http:/www.vulnerableserver.com/troublesome_url (at least they got your IP as the one that brought them down).

Basically, while I think HEADs will be mostly harmless, this still does leave you as an anonymous proxy in at least one way, for people who may know what to actually exploit (unlike me).

Umm, I didn’t mean for those to actually be turned into links, sorry.

 Caden

Well that was completely bizarre. Try adding more foil Tim. You need more foil.

 Trevor

Yeah, I think the HEAD request is harmless. Maybe I’m wrong, though.

This technique I’m talking about isn’t a method for preventing spam or anything – it’s just a quick way to validate that URLs are accessible (e.g. not http:/sdf38830.com or something nonsensical like that).

 Daniel Berger

require 'uri'
URI.parse(url).host
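
Expanding on that suggestion, here is a minimal sketch of how URI.parse could stand in for the hand-written regexp when pulling out the host and path. The helper name host_and_path is made up for illustration and is not part of the original post; it assumes only http and https URLs should be accepted.

```ruby
require 'uri'

# Hypothetical helper (not from the original post): split a URL into the
# host and request path that Net::HTTP needs, using URI.parse instead of
# a regexp. Returns nil for anything that is not an http/https URL.
def host_and_path(url)
  uri = URI.parse(url)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
  return nil unless uri.is_a?(URI::HTTP) && uri.host

  # request_uri is the path plus query string, and "/" when the path is empty.
  [uri.host, uri.request_uri]
rescue URI::InvalidURIError
  nil
end

host_and_path('http://example.com')           # => ["example.com", "/"]
host_and_path('https://example.com/a/b?x=1')  # => ["example.com", "/a/b?x=1"]
host_and_path('not a url')                    # => nil
```

This only checks that the string parses as an HTTP URL. It says nothing about whether the host actually exists.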

Ah I forgot to add the merely annoying one

http:/serverthattimesout.com

Trevor, you may be entirely right, I do not know. My only point is that counting on everyone else on the internet to honor “this MUST NOT have side effects”, or whatever the language is, causes things like the Google Accelerator fiasco that deleted lots of people’s info.

 Jacob Harris

This does allow for a really simple denial-of-service attack on any site using it. Basically, you write a very simple server (you can do it in Sinatra if you like) that sleeps for a really long time on a HEAD request to a specific URL. Then you just submit the URL multiple times to the form validating this URL. Repeat until you preoccupy every Mongrel or cause Passenger to spawn enough Apache processes to thrash. Of course, you could add a timeout on the check, but I could also use Mechanize… ;)

I like tenderlove’s refinement of sending back a very large response (is there somewhere you can cram base64-encoded video into an HTTP response header, maybe, to additionally add in the illegal/copyrighted content problems), when you finally do respond, to help chew up memory, too.

 Trevor

Hmm… yeah, I guess you could use Timeout to help, but it does seem like making requests of arbitrary websites may have unintended consequences.

I’m not sure what you could do about receiving large header responses. I suppose a timeout could help there, too.

Here’s where I would start looking in terms of doing timeout stuff:

http://www.ruby-doc.org/core/classes/Timeout.html
http://www.slashdotdash.net/2008/02/15/ruby-tidbit-timeout-code-execution/

Although I came across these not too long ago:

http://blog.segment7.net/articles/2006/04/11/care-and-feeding-of-timeout-timeout
http://blog.headius.com/2008/02/rubys-threadraise-threadkill-timeoutrb.html

So, I’m not sure if timeout is safe to use or not :)
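
One way to sidestep the Timeout question for the simple cases might be Net::HTTP's own open_timeout and read_timeout settings rather than the Timeout module. This is a different technique from the one in the links above, so treat it as a rough sketch. The method name status_code_with_timeout and the 5-second default are arbitrary choices for illustration.

```ruby
require 'net/http'
require 'uri'

# Hypothetical variant of status_code (not from the original post) that
# uses Net::HTTP's built-in timeouts. The 5-second default is arbitrary.
def status_code_with_timeout(url, seconds = 5)
  uri = URI.parse(url)
  return nil unless uri.is_a?(URI::HTTP) && uri.host

  options = {
    use_ssl: uri.scheme == 'https',
    open_timeout: seconds,  # give up if connecting takes too long
    read_timeout: seconds   # give up if any single read stalls
  }
  Net::HTTP.start(uri.host, uri.port, options) do |http|
    http.head(uri.request_uri).code
  end
rescue StandardError
  # Timeouts, refused connections and unparseable URLs all end up here.
  nil
end
```

Note that read_timeout bounds each individual read, not the request as a whole, so a server that keeps trickling data can still hold the connection open for longer than the configured value.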

 Trevor

As an aside, found this easy way to do basic URL format validation:

http://www.ruby-doc.org/core/classes/URI.html#M004840

require 'uri'
validates_format_of :uri, :with => URI.regexp

Very nice – no need for a custom regexp :)

This is the same basic concept as the older http_url_validation_improved plugin. It’s not on github, so perhaps that is why you missed it.

We use it in Kete (http://kete.net.nz) to achieve what you are after. It does look like it could use that URI.regexp refactoring added to it.

One nice thing is that it will check for allowed content types with configuration. It also gives more fine-grained feedback when validation fails.

 Trevor

Thanks, Walter. That looks like a good plugin to consider if you’re worried about overkill but need more than quick and dirty.

 name

Btw, regexp is wrong. You should escape slashes:
regexp = url.match(/https?:\/\/([^\/]+)(.*)/)

 Trevor

Ah yes. Fixed, thanks!

 Trevor

This is my favorite URL format validator right now:

http://github.com/henrik/validates_url_format_of/tree/master

URI.regexp didn’t catch a lot of invalid stuff I tried to throw at it.
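
For comparison, a stricter check can also be sketched in plain Ruby with URI.parse. This is not how validates_url_format_of works; it is only an illustration. The method name plausible_http_url? and the "host must contain a dot" rule are assumptions made up for this example.

```ruby
require 'uri'

# Hypothetical format check (not the plugin's implementation): accept only
# http/https URLs whose host contains at least one dot.
def plausible_http_url?(url)
  uri = URI.parse(url.to_s)
  uri.is_a?(URI::HTTP) && !uri.host.nil? && uri.host.include?('.')
rescue URI::InvalidURIError
  false
end

plausible_http_url?('http://example.com/page')  # => true
plausible_http_url?('ftp://example.com/file')   # => false
plausible_http_url?('http://localhost')         # => false (no dot in host)
plausible_http_url?('just some text')           # => false
```

The dot rule rejects intranet-style hosts such as localhost, which may or may not be what you want.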

 name

There is still one slash left unescaped.

 Trevor

Hmm… yes… I think I got it now? I pasted it from running code, so I hope so :)

If I wanted to break this, I think I’d write a custom “webserver” that responded to a HEAD request with an unending stream of headers. Server never closes the connection, just keeps sending you headers until the client finally gives up, assuming it ever does.
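
A per-read timeout would not help against that, since every read succeeds. One possible mitigation is an overall deadline around the whole check. Here is a rough sketch using the standard Timeout module, keeping in mind the doubts raised earlier in this thread about whether Timeout is safe to use. The helper name with_deadline is made up for illustration.

```ruby
require 'timeout'

# Hypothetical wrapper (not from the original post): run a block under an
# overall deadline and return nil if it does not finish in time.
def with_deadline(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  nil
end

with_deadline(1)   { :finished }           # => :finished
with_deadline(0.1) { sleep 1; :finished }  # => nil
```

The block passed in would be the HEAD request itself, so a server that never stops sending headers gets cut off once the deadline passes.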
