Executing Gummiworms
The trials and tribulations of a grumpy curmudgeonly old git

21 Oct 2010

Finding Old Web Pages

While I was trying to recover as much of the old blog as possible after my own carelessness killed it, I discovered Warrick. Warrick works pretty well, but it runs into problems with Google: if you make too many requests in a short period, roughly 100-150, Google will blacklist your IP address for around 12 hours. If you were making a lot of requests manually this wouldn't be a problem, because Google asks you to prove that you are human, and once you pass the CAPTCHA you can carry on making requests. But Warrick is a perl script that screen-scrapes Google, so you never see the prove-you-are-human request, and on sites with a largish number of pages Warrick will fail to get any meaningful content; the recovered files just contain Google's request to prove you are human.

So Warrick is great for automatically pulling the basic content out of the Internet Archive and the Google, Bing and Yahoo caches, but you'll need to fill in the gaps yourself, and to do that you need to know how to query the search engine caches manually. Most people know how to do this for Google (site: and cache:), but other search engines use different methods. While I was looking for automated ways of recreating the blog using wget, cURL and Warrick I stumbled across an interesting website about using search engines, and in particular this page. It's a bit out of date (2008) but it is still mostly correct and useful.
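
To give an idea of what filling in the gaps by hand looks like, here's a rough sketch of pulling Google's cached copy of a list of URLs with cURL. The urls.txt file, the user agent string and the 30-second delay are my own assumptions, not anything Warrick does; go much faster than this and you'll hit the CAPTCHA and the 12-hour blacklist.

  # Rough sketch only: fetch Google's cached copy of each URL listed in urls.txt.
  # The cache: query format and the delay are assumptions; adjust to taste.
  while read -r url; do
    out="$(echo "$url" | tr '/:' '__').html"
    curl -s -A "Mozilla/5.0" "http://www.google.com/search?q=cache:$url" -o "$out"
    sleep 30   # stay well under the ~100-150 requests that trigger the block
  done < urls.txt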

19 Oct 2010

Recovering a website from search engine caches

As noted earlier, I screwed up this blog by zapping the database without having a working current backup. I wasn't too bothered, as I'd only made 60 posts over the course of six months, but I was a bit annoyed with myself for not being more careful. The blog consisted of me pontificating about how the USPS sucks, links to other blogs where people had posted interesting things, and ZipitZ2-related items. Most of it really wasn't that interesting, but there were a few posts and comments that I really wanted to recover, so I started looking for ways to recover web pages from search engine caches.

Pulling a single page from a search engine cache is very easy: just search for the website and click the cached link. But I wasn't previously aware of a method of pulling a complete website from a cache automatically. I am sure that someone will tell me it's possible using wget or cURL (almost anything is possible using a combination of wget, cURL, bash and netcat), but I don't know how. I googled for a while and stumbled over a cool piece of software called Warrick. Warrick is a program written in perl (figures :) ) that runs on *nix and Windows and recovers as much of a website as possible from the Internet Archive and the Google, Bing and Yahoo caches. It works best within the first few days of a website disappearing, as over time the caches degrade and less and less of the desired data is available. Using Warrick yesterday I was able to recover pretty much all the posts from the blog, and I'll repost them over the next few weeks. I'm not sure if I'll post them as new posts with a new date or try to backdate them. If you know how to backdate posts in WordPress I'd love to know, as the only method I can think of off the top of my head is direct db manipulation.
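
For what it's worth, here's a rough sketch of what that direct-db route might look like, assuming a stock WordPress install with the default wp_ table prefix; the database name, user, post ID and date are placeholders, and you'd obviously want a db backup first (lesson learned).

  # Rough sketch, assumptions as above: backdate post 123 to June 2010
  # by rewriting its dates in wp_posts directly.
  mysql -u wpuser -p wordpress_db <<'SQL'
  UPDATE wp_posts
     SET post_date     = '2010-06-15 09:00:00',
         post_date_gmt = '2010-06-15 09:00:00'
   WHERE ID = 123;
  SQL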
