
Link Rot Browser Extensions Bridge the Web of Today to the Past

A frustrating part of Internet research is link rot.  When sites disappear from the net, so, seemingly, does the information they contained, and you hit a wall: a 404 error.  These “page not found” messages tell you, and more importantly your web browser, that the data you’re seeking doesn’t exist.  In 2005, Frank McCown, Sheffan Chan, Michael L. Nelson, and Johan Bollen reported in “The Availability and Persistence of Web References in D-Lib Magazine,” presented at the International Web Archiving Workshop in Vienna, that “half of the URLs cited in D-Lib Magazine articles were no longer accessible 10 years after publication.”  That’s an alarming figure, particularly for the digital librarian community.

Since 1996, the Internet Archive has been making copies of web sites to preserve them.  The copies can be searched using its Wayback Machine, and although there are gaps (thankfully, some of the earliest web sites I made in the late ’90s were never cached), it’s an invaluable resource.  About a month ago I decided to create my own solution to link rot.  I first wrote a Greasemonkey script that detects when I’m on a 404 page and redirects the browser to the most recent archived copy of that page at the Internet Archive; I call it 404ward.  My idea was, and still is, that users should have a seamless experience researching on the web, whether a link works today or last worked 10 years ago.  I want to spend less time clicking and searching.  Since writing the original code and asking some friends on Twitter (my handle is davelester), I’ve ported it to a Firefox extension that’s easier for individuals to install.
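The redirect step is simple enough to sketch.  This is not the 404ward source, just a minimal Python illustration of the idea.  The function name wayback_url is made up for this example, and the URL pattern is an assumption: as far as I know, prefixing an address with https://web.archive.org/web/ sends you to the most recent capture the Wayback Machine holds.

```python
# Assumed Wayback Machine URL pattern; not taken from the 404ward code.
WAYBACK_PREFIX = "https://web.archive.org/web/"


def wayback_url(dead_url: str) -> str:
    """Build the address a browser could be redirected to for a dead link.

    wayback_url is a hypothetical helper for illustration only.
    """
    # Trim stray whitespace, then prepend the archive prefix.
    return WAYBACK_PREFIX + dead_url.strip()


if __name__ == "__main__":
    print(wayback_url("http://example.com/old-page"))
```

A real extension would only do this after deciding that the current page really is a 404, which turns out to be the harder part.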

It turns out that 404ward isn’t the first, or only, solution to link rot out there.  A colleague passed around links to Firefox extensions that do similar things, so I’ll briefly review them.  There are already four extensions that are compatible with Firefox 3 and work with the Internet Archive:

  • Passive Cache and CacheIt! both let you right-click a link on a web page and choose to see the version cached by Google or by the Internet Archive.  It’s nice, but it doesn’t help once you arrive at a 404 page, or if you typed in a URL directly and it didn’t work.
  • A third extension with the lengthy name “404: File is Not Found?  Now it will be!” is poorly implemented.  I can say this because it approaches redirection the same way I originally did with 404ward: it searches the page’s HTML for the words “404 error” and then automatically redirects the user to the Internet Archive.  This breaks when you visit a page that is about 404 errors; the extension thinks you’ve hit a dead link and redirects you anyway.  Additionally, sites that no longer exist at all return no HTML for the extension to read, so it doesn’t work for them either.
  • The most mature of the current extensions that address link rot is Resurrect Pages.  When you arrive at a 404 page, Resurrect Pages offers you a set of links to the copies held by CoralCDN, Google, Yahoo, the Internet Archive, Live Search, GigaBlast, and WebCite.  It does a great job of making these resources more available, but there’s still a bunch of clicking involved.
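The detection problem described for the third extension is easy to see in a small sketch.  The Python below is illustrative only: both function names are hypothetical, and the status-based check is my assumption about what a more reliable test could look like, not how any of these extensions actually work.  It treats “no response at all” (modeled here as None) as a dead link too.

```python
from typing import Optional


def looks_dead_by_text(html: str) -> bool:
    """Naive check: scan the page body for the phrase '404 error'."""
    return "404 error" in html.lower()


def is_dead_by_status(status_code: Optional[int]) -> bool:
    """Assumed alternative: rely on the HTTP status instead of page text.

    None stands for a site that returned no response at all.
    """
    return status_code is None or status_code == 404


if __name__ == "__main__":
    # A live article that merely talks about 404 errors (served with status 200).
    article_html = "<h1>How to design a helpful 404 error page</h1>"
    print(looks_dead_by_text(article_html), is_dead_by_status(200))

    # A site that has vanished entirely: no HTML to read, no status code.
    print(looks_dead_by_text(""), is_dead_by_status(None))
```

The text scan gets both cases wrong: it flags the live article and misses the vanished site, while the status check handles each the way you would want.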

[Screenshot: how a 404 or “page not found” error looks with Resurrect Pages installed]

Where does 404ward fit into the mix?  I had hoped to release an alpha version on my blog, but it’s not compatible with Firefox 3.  That’s OK; I can table my JavaScript experiment in favor of what’s currently out there.  Resurrect Pages makes terrific strides towards addressing link rot, and I’m throwing my support behind it as a must-have tool for researchers this fall, along with my other favorite Firefox extension, Zotero.

[link to wikipedia article where I first found the McCown reference]
