While I was trying to recover as much of the old blog as possible (a job made necessary by my own carelessness), I discovered Warrick. Warrick works pretty well, but it causes problems when run against Google: if you make too many requests in a short period of time (approximately 100-150), Google will blacklist your IP address for around 12 hours. If you were making a lot of requests manually this wouldn't be a problem, because Google asks you to prove that you are human, and once you do (by solving the CAPTCHA) you can carry on making requests. But Warrick is a Perl script that screen-scrapes Google, so you never see the request to prove you are human. So with sites that have a largish number of pages, Warrick will fail to get any meaningful content: the file will just contain the request from Google to prove you are human.

So Warrick is great for automatically giving you the basic content from the Internet Archive and the Google, Bing and Yahoo caches, but you'll need to fill in the gaps. To do that you'll need to know how to query the search engine caches manually. Most people know how to do this for Google, with site: and cache:, but other search engines use different methods.

While I was looking for automated methods of recreating the blog using wget, cURL and Warrick, I stumbled across an interesting website about using search engines, and in particular this page. It's a bit out of date (2008) but it is still mostly correct and useful.
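To work out which gaps need filling, it helps to know which recovered files are really just Google's "prove you are human" page. Here is a rough Python sketch of how that check might look. It assumes the recovered pages sit in a local directory, and the marker phrases in `CAPTCHA_MARKERS` are my guesses rather than anything Google or Warrick documents, so compare them against a real blocked file first. The function names are made up for illustration too.

```python
from pathlib import Path

# Assumed marker phrases (hypothetical): check them against a real blocked
# page before relying on them.
CAPTCHA_MARKERS = ("captcha", "unusual traffic")


def looks_like_captcha_page(text, markers=CAPTCHA_MARKERS):
    """Return True if the text contains any marker phrase, ignoring case."""
    lowered = text.lower()
    return any(marker in lowered for marker in markers)


def find_blocked_files(root, markers=CAPTCHA_MARKERS):
    """Return the files under root whose content looks like a CAPTCHA page."""
    blocked = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Ignore undecodable bytes; we only need a rough text match.
            text = path.read_text(encoding="utf-8", errors="ignore")
            if looks_like_captcha_page(text, markers):
                blocked.append(path)
    return blocked


def manual_queries(page_url, site):
    """Build the site: and cache: query strings to type into Google by hand."""
    return {"site": f"site:{site}", "cache": f"cache:{page_url}"}
```

Calling `find_blocked_files("recovered_site")` (with whatever directory name you actually used) gives a list of files to redo by hand, and `manual_queries("example.com/some-post", "example.com")` gives the two Google queries to try for one of them. Other search engines would need their own operators.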