Tag Archives: bash

Spidering / Link Checking With wget

I use XENU for link checking sites and finding missing assets but I couldn’t figure out how to make sure that it was following the redirects it encountered. For example, if an inline image source is “/images/sitelogo.jpg” but that 301 redirects to “/images/sitelogo-new.jpg”, XENU will report the redirect (as an error if you prefer), but what I really want to know is whether the destination of that redirect was a 200 OK (or a 404, or something else unintended). It wasn’t clear to me if XENU was ensuring that the file existed after being redirected.

I tried out a few other free tools but none seemed even as good as XENU. It was then that I stumbled upon the “spider” option in wget. You can set it free on a URL like so:

wget --spider -l 2 -r -p -o wgetOutput.log http://somesite.net

This will spider the URL up to 2 levels deep and ensure that any inline assets on the pages within those levels are also downloaded. The “-p” option ensures that inline assets like images or css are downloaded from a page even when the maximum number of levels in the “-l” option is reached. The output is logged to wgetOutput.log

At the very end of wgetOutput.log you’ll find a list of broken links that looks something like this. You will also get a ton of other useful information about every request that it made – so you know exactly what it’s doing!

Spider mode enabled. Check if remote file exists.
--2013-08-06 20:10:40--  http://somesite.net/images/sitelogo-new.png
Reusing existing connection to somesite.net:80.
HTTP request sent, awaiting response... 200 OK
Length: 4153 (4.1K) [image/png]
Remote file exists but does not contain any link -- not retrieving.
Removing somesite.net/images/sitelogo-new.png.
unlink: No such file or directory

Other Useful Options

Specify a user agent:

-U "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"

Spider a site that forces you to log in:

  1. Get the Cookie Exporter Add-on for Firefox.
  2. Log into the site you want to spider.
  3. From Firefox, run Tools -> Export Cookies -> cookiesFile.txt
  4. Use the “–load-cookies” option:
    --load-cookies cookiesFile.txt

Complete Example:

wget --spider -l 2 -r -p -o wgetOutput.log -U "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" --load-cookies cookiesFile.txt http://somesite.net