Get cached images from your visitors

Dec 12 2009

Jeff Atwood (Coding Horror fame) was in for a horror when he realized that his server crashed and his data was gone and due to some reason, the backup mechanism was not working. The complete data in Coding Horror and the StackOverflow blog disappeared.

Since his blog is very popular, many archiving systems including the Google cache have copies of the pages and I hope that they have by now recovered the complete textual data. The biggest problem in this case is getting back the images. There are not many archiving services that may have the complete backup of the images in the website.

So what should Jeff do now?

Since Coding horror is a high traffic blog, I think there is a way to get back at least some of the images. (The probability of this working depends a lot on the traffic to the website and a bit of luck)

Here are the steps:

  1. Configure the web server to return 304 for every image request. The HTTP status code 304 means that the file is not modified and this means that the browser will fetch the file from its cache if it is present there. (credit: this SuperUser answer)
  2. In every page in the website, add a small script to capture the image data and send it to the server.
  3. Save the image data in the server.
  4. Convert the pixel data to get the original images.Voila!

Capturing the image data

We are going to use the Canvas functionality in HTML 5 to get back the image data.

Here is the code you should insert into the pages of the website. It gets all the images in the current page, loads it to the HTML Canvas, gets the pixel data for the image and sends it to the server through an Ajax post.

This PHP script (Can PHP rescue Jeff? ;) To be fair, the server side code is trivial) saves the data to files in the server. Note that the files themselves will not be images, they will just contain the pixel data of the images. In addition to this, we are also saving the original file name and the image dimensions. This means that we can easily reconstruct the original images from this data. Data from every visitor is saved in a different file to just to make sure that you have enough redundancy (Watch out for his redundancy filling up your server disks)

Remember that this is a proof of concept code. You will have to modify it to use it in regular production environments and to get some real use from it. There are many limitations to this code. It goes without saying that you will get the image data back from the users only if they have the images cached in their browsers. This script will work only in the latest versions of Chrome, Firefox, Safari, Opera etc. (Don’t ever expect it to work in IE for the next decade). In addition to this, remember that the pixel data will be many times bigger than the original file size and you may have to carefully analyze the disk space usage of this script. (I guess in an emergency, none of these really matters).

You should edit the post URL in the script to match your domain name.

Finally, I have tested the code and it seems to be working (for me, at least). You need to include JQuery in the pages using this script and remember that due to security restrictions in the browsers, you will have to place all these files under the same domain name. Please tell me if there are any other flaws in the code.

[Updated: code changes to reduce the file size by 50%. The decimal numbers were converted to hex and the spaces in between the numbers removed. The file sizes can be further reduced by using the full character set.]

22 responses so far

Leave a Reply