I use XENU to check sites for broken links and missing assets, but I couldn’t figure out how to make sure that it was following the redirects it encountered. For example, if an inline image source is “/images/sitelogo.jpg” but that 301 redirects to “/images/sitelogo-new.jpg”, XENU will report the redirect (as an error if you prefer), but what I really want to know is whether the destination of that redirect returned a 200 OK (or a 404, or something else unintended). It wasn’t clear to me whether XENU was verifying that the file existed after the redirect.
I tried out a few other free tools but none seemed even as good as XENU. It was then that I stumbled upon the “spider” option in wget. You can set it free on a URL like so:
wget --spider -l 2 -r -p -o wgetOutput.log http://somesite.net
This will spider the URL up to 2 levels deep and check any inline assets on the pages within those levels. The “-p” option ensures that inline assets like images or CSS are still requested from a page even when the maximum depth set by the “-l” option has been reached. In spider mode wget only checks that each URL exists; it does not keep the files it fetches. The output is logged to wgetOutput.log.
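The same command can also be written with wget's long option names, which are easier to read back later. Here is a minimal sketch that wraps it in a shell function; the function name spider_site, its argument order, and its defaults are made up for illustration and are not part of wget.

```shell
# Hypothetical wrapper around the command above, using long option names.
# Arguments: URL, optional depth (default 2), optional log file name.
spider_site() {
  url=$1
  depth=${2:-2}
  log=${3:-wgetOutput.log}
  # --spider          = check URLs without saving them
  # --recursive       = -r
  # --level           = -l
  # --page-requisites = -p
  # --output-file     = -o
  wget --spider --recursive --level="$depth" --page-requisites \
       --output-file="$log" "$url"
}

# Usage: spider_site http://somesite.net 2 wgetOutput.log
```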
At the very end of wgetOutput.log you’ll find a list of broken links. You will also get a ton of other useful information about every request that it made, so you know exactly what it’s doing! The entry for a single request looks something like this:
Spider mode enabled. Check if remote file exists.
--2013-08-06 20:10:40-- http://somesite.net/images/sitelogo-new.png
Reusing existing connection to somesite.net:80.
HTTP request sent, awaiting response... 200 OK
Length: 4153 (4.1K) [image/png]
Remote file exists but does not contain any link -- not retrieving.
Removing somesite.net/images/sitelogo-new.png.
unlink: No such file or directory
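Since the log has an entry like this for every request, it can be filtered for anything that did not come back as 200. The following is a rough sketch, assuming the log is in English and each request is laid out as in the entry above (a “--date time-- URL” line followed by an “awaiting response...” line); the function name non_200 is made up for illustration.

```shell
# Hypothetical helper: print "STATUS URL" for every logged request
# whose HTTP status was not 200.
non_200() {
  awk '
    # Request lines start with "--YYYY-MM-DD"; the URL is the last field.
    /^--[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / { url = $NF }
    # Status lines end with the code and reason, e.g. "... 200 OK".
    /awaiting response\.\.\. / {
      sub(/.*awaiting response\.\.\. /, "")
      if ($1 != "200") print $1, url
    }
  ' "$1"
}

# Usage: non_200 wgetOutput.log
```

A redirect should show up as a 301 line for the original URL. If wget then logs a separate request for the destination, as I would expect, a missing destination appears right after it as its own 404 line, and a healthy destination is not listed at all.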
Other Useful Options
Specify a user agent:
-U "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"
Spider a site that forces you to log in:
- Get the Cookie Exporter Add-on for Firefox.
- Log into the site you want to spider.
- From Firefox, run Tools -> Export Cookies and save the file as cookiesFile.txt
- Use the “--load-cookies” option:
wget --spider -l 2 -r -p -o wgetOutput.log -U "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" --load-cookies cookiesFile.txt http://somesite.net
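If the logged-in pages still come back as login redirects, it is worth checking the exported file before blaming wget. The “--load-cookies” option expects the Netscape cookies.txt format, where each cookie is one line of 7 tab-separated fields. Here is a quick sanity check, as a sketch only; the function name check_cookie_file is made up for illustration.

```shell
# Hypothetical helper: report lines in a cookie file that do not look like
# Netscape cookies.txt entries (7 tab-separated fields per cookie).
# Returns non-zero if any such line is found.
check_cookie_file() {
  awk -F '\t' '
    # Skip comment lines and blank lines.
    /^#/ || /^[ \t]*$/ { next }
    NF != 7 {
      bad++
      printf "line %d: expected 7 tab-separated fields, found %d\n", NR, NF
    }
    END { exit (bad > 0) }
  ' "$1"
}

# Usage: check_cookie_file cookiesFile.txt && echo "cookie file looks OK"
```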