Executing Gummiworms: The trials and tribulations of a grumpy, curmudgeonly old git

21 Oct 2010

Finding Old Web Pages

While I was trying to recover as much of the old blog as possible (lost through my own carelessness), I discovered Warrick. Warrick works pretty well, but it runs into problems with Google: if you make too many requests in a short period, roughly 100-150, Google will blacklist your IP address for around 12 hours. If you were making that many requests manually this wouldn't matter, because Google asks you to prove you are human and, once you pass the CAPTCHA, lets you carry on. But Warrick is a Perl script that screen-scrapes Google, so you never see that challenge. On a site with a largish number of pages Warrick will therefore fail to retrieve any meaningful content; the saved files will just contain Google's "prove you are human" page.

So Warrick is great for automatically pulling the basic content out of the Internet Archive and the Google, Bing and Yahoo caches, but you'll need to fill in the gaps yourself. To do that you need to know how to query the search engine caches manually. Most people know how to do this for Google, using the site: and cache: operators, but other search engines use different methods. While I was looking for automated ways of recreating the blog with wget, cURL and Warrick, I stumbled across an interesting website about using search engines, and in particular this page. It's a bit out of date (2008) but it is still mostly correct and useful.
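As a rough illustration of that manual gap-filling, here is a minimal Python sketch (not part of Warrick) that looks each missing URL up via the Internet Archive's Wayback "available" API, saves the closest snapshot, and pauses between requests so the request rate stays nowhere near anything that would get an IP blacklisted. The URL list and output filenames are placeholder assumptions; only the archive.org API endpoint is real.

# Sketch: recover missing pages from the Wayback Machine, politely throttled.
import json
import time
import urllib.parse
import urllib.request

# Hypothetical list of pages Warrick failed to recover.
MISSING_URLS = [
    "http://example.com/blog/2009/05/some-lost-post/",
    "http://example.com/blog/2010/01/another-lost-post/",
]

for page in MISSING_URLS:
    # Ask the Wayback Machine for the closest archived snapshot of this URL.
    api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(page, safe="")
    with urllib.request.urlopen(api) as resp:
        info = json.load(resp)

    snapshot = info.get("archived_snapshots", {}).get("closest")
    if snapshot and snapshot.get("available"):
        # Fetch the archived copy and write it to a local file.
        with urllib.request.urlopen(snapshot["url"]) as resp:
            html = resp.read()
        filename = urllib.parse.quote(page, safe="") + ".html"
        with open(filename, "wb") as out:
            out.write(html)
        print("recovered", page, "from", snapshot["url"])
    else:
        print("no snapshot for", page)

    # A generous pause keeps the request rate far below scraping thresholds.
    time.sleep(10)

The same throttling idea applies if you script queries against the Google, Bing or Yahoo caches instead of the Internet Archive; the point is simply to space the requests out rather than firing them off as fast as a script can manage.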

Comments (2)
  1. Somehow none of this sounds like fun to me… Teach your kittens to make daily backups?

  2. or at least not keep turning off the power to the external hd

