Archive for the 'Internet' Category

The flow of PageRank

Feb 02 2010

You may be familiar with Google’s PageRank technology. Google considers lots of variables to calculate the PageRank of your website. This is a discussion of an extremely simplified version of PageRank.

Assume that we rank the websites by the number and quality of incoming links. The quality of an incoming link is defined as a function of the PageRank of the site that the link comes from.

Let us take an example. The following figure shows how a small group of websites link to each other.

Note that website F does not have any incoming link while website G does not have any outgoing link.

Now from the given graph of links we have to find out the (relative) PageRank of each of the websites. Initially we will assume that all the pages have the same PageRank. Now we count the number of incoming links to each site and change the PageRank according to the number of incoming links.

We define PageRank of site A as:

PR(A) = Σ PR(x) / L(x)

where L(x) = number of outgoing links in site x

and x denotes the sites linking to A.

When you run this algorithm for the first time, the PageRank of every page gets updated. Now the problem is that since the PageRank of all the incoming pages has been updated, we have to re-calculate the PageRank of the pages again to take the new values into consideration. (You can predict this problem just by noticing that the function is a recursive one.) The same problem surfaces in every iteration of the algorithm.

The question is: if the PageRanks change in every iteration, how do we know when to stop iterating? Do the PageRanks ever stabilize? (The proper term is convergence.)

Here is a Python script that repeats the PageRank calculation many times to find out whether the values converge or not. The output values are represented as percentages. (Google considers this value as the probability of a person visiting any particular website.)
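The original script and the link graph from the figure are not reproduced here, so what follows is only a minimal sketch of such a simulation. The LINKS graph is a made-up example (an assumption, not the graph in the figure), and rescaling the values to percentages after every step is also an assumption, made so that the numbers stay comparable between iterations.

```python
# Hypothetical link graph (NOT the one from the figure). It was chosen only
# so that F has no incoming links and G has no outgoing links.
LINKS = {
    "A": ["B", "D"],
    "B": ["C", "D"],
    "C": ["A", "D"],
    "D": ["A", "G"],
    "E": ["D", "B"],
    "F": ["A", "E"],
    "G": [],
}


def step(ranks, links):
    """Apply PR(A) = sum of PR(x) / L(x) over every site x linking to A."""
    new_ranks = {site: 0.0 for site in links}
    for site, targets in links.items():
        for target in targets:
            new_ranks[target] += ranks[site] / len(targets)
    return new_ranks


def normalise(ranks):
    """Rescale the values so that they add up to 100 (percentages)."""
    total = sum(ranks.values())
    return {site: 100.0 * value / total for site, value in ranks.items()}


def simulate(links, max_iterations=100, tolerance=1e-6):
    """Iterate until no value moves by more than `tolerance`."""
    ranks = normalise({site: 1.0 for site in links})
    for iteration in range(1, max_iterations + 1):
        new_ranks = normalise(step(ranks, links))
        change = max(abs(new_ranks[s] - ranks[s]) for s in links)
        ranks = new_ranks
        if change < tolerance:
            break
    return ranks, iteration


if __name__ == "__main__":
    final_ranks, used = simulate(LINKS)
    print("iterations:", used)
    for site in sorted(final_ranks):
        print(site, round(final_ranks[site], 2))
```

Calling simulate(LINKS) returns the final percentages together with the number of iterations used. The stopping rule (largest change below a small tolerance) is one simple way to decide that the values have converged.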

The chart below shows how the PageRank changes after each iteration:

As you can see, the PageRanks fluctuate widely in the initial iterations and then stabilize. This means that the PageRank function converges.

Another thing to note is that adding more nodes to the graph did not seem to affect the convergence. Even if you double the number of sites in the collection, the number of iterations taken to converge stays almost the same. Others have also reached the same results (ppt). The PageRank function is analogous to electric current flowing through a mesh. Even if there are a lot of nodes and sources, the current flow stabilizes (and stabilizes really fast).

Also note that site D has the highest PageRank, which is to be expected as it has the most incoming links. Site F has the lowest PageRank because it does not have incoming links.

According to this algorithm, linking out to other sites does not reduce the PageRank of your website. There is a problem though. Take the case of site G. It does not link to any other site. This means that the PageRank is not flowing out of site G to any other site. If site G linked to other sites, it would have increased the PageRank of the other sites by a tiny bit. (This case affects only the first link out of any site.) To solve this problem, Google divides the PageRank of sites like these (called sinks) among all other sites. You may also want to read about the damping factor.

Before leaving, can you explain why the PageRank of site A is greater than that of site B?
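To make the sink fix and the damping factor a bit more concrete, here is a rough sketch, not Google's actual implementation. The EXAMPLE graph is made up, 0.85 is simply a commonly cited damping value, and a sink is treated as if it linked to every other site, as described above.

```python
# Hypothetical three-site graph; C is a sink (no outgoing links).
EXAMPLE = {
    "A": ["B"],
    "B": ["C"],
    "C": [],
}


def damped_pagerank(links, damping=0.85, iterations=50):
    """PageRank with a damping factor and sink redistribution.

    Ranks are probabilities here, so they add up to 1.
    """
    sites = list(links)
    n = len(sites)
    ranks = {site: 1.0 / n for site in sites}
    for _ in range(iterations):
        # Every site gets a small base share regardless of its links.
        new_ranks = {site: (1.0 - damping) / n for site in sites}
        for site, targets in links.items():
            if not targets:
                # Sink: pretend it links to every other site.
                targets = [other for other in sites if other != site]
            share = damping * ranks[site] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    for site, rank in sorted(damped_pagerank(EXAMPLE).items()):
        print(site, round(rank, 4))
```

Because the rank of a sink is handed back to the other sites, the total stays constant from one iteration to the next instead of draining away.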

6 responses so far

Get cached images from your visitors

Dec 12 2009

Jeff Atwood (of Coding Horror fame) was in for a horror when he realized that his server had crashed, his data was gone and, for some reason, the backup mechanism had not been working. All the data of Coding Horror and the StackOverflow blog disappeared.

Since his blog is very popular, many archiving systems including the Google cache have copies of the pages and I hope that they have by now recovered the complete textual data. The biggest problem in this case is getting back the images. There are not many archiving services that may have the complete backup of the images in the website.

So what should Jeff do now?

Since Coding Horror is a high traffic blog, I think there is a way to get back at least some of the images. (The probability of this working depends a lot on the traffic to the website and a bit of luck.)

Here are the steps:

  1. Configure the web server to return 304 for every image request. The HTTP status code 304 means that the file is not modified and this means that the browser will fetch the file from its cache if it is present there. (credit: this SuperUser answer)
  2. In every page in the website, add a small script to capture the image data and send it to the server.
  3. Save the image data in the server.
  4. Convert the pixel data to get the original images. Voilà!

Capturing the image data

We are going to use the Canvas functionality in HTML 5 to get back the image data.

Here is the code you should insert into the pages of the website. It gets all the images in the current page, loads each one into the HTML Canvas, gets the pixel data for the image and sends it to the server through an Ajax post.

This PHP script (Can PHP rescue Jeff? ;) To be fair, the server side code is trivial) saves the data to files on the server. Note that the files themselves will not be images, they will just contain the pixel data of the images. In addition to this, we are also saving the original file name and the image dimensions. This means that we can easily reconstruct the original images from this data. Data from every visitor is saved in a different file just to make sure that you have enough redundancy. (Watch out for this redundancy filling up your server disks.)

Remember that this is a proof of concept code. You will have to modify it to use it in regular production environments and to get some real use from it. There are many limitations to this code. It goes without saying that you will get the image data back from the users only if they have the images cached in their browsers. This script will work only in the latest versions of Chrome, Firefox, Safari, Opera etc. (Don’t ever expect it to work in IE for the next decade). In addition to this, remember that the pixel data will be many times bigger than the original file size and you may have to carefully analyze the disk space usage of this script. (I guess in an emergency, none of these really matters).

You should edit the post URL in the script to match your domain name.

Finally, I have tested the code and it seems to be working (for me, at least). You need to include jQuery in the pages using this script and remember that due to security restrictions in the browsers, you will have to place all these files under the same domain name. Please tell me if there are any other flaws in the code.

[Updated: code changes to reduce the file size by 50%. The decimal numbers were converted to hex and the spaces in between the numbers removed. The file sizes can be further reduced by using the full character set.]
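The last step (turning the saved pixel data back into an image) could look roughly like the sketch below. It is not the code from this post. It assumes that every channel value is stored as exactly two hex digits and that the pixels are in the RGBA order the canvas returns; the function names are made up for illustration. The output is a binary PPM file, chosen only because it needs no extra libraries.

```python
def decode_pixels(hex_text, width, height):
    """Turn a hex string into raw RGBA bytes and check the size.

    Assumes two hex digits per channel and four channels per pixel.
    """
    data = bytes.fromhex(hex_text)
    expected = width * height * 4
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    return data


def to_ppm(rgba, width, height):
    """Build a binary PPM (P6) image, dropping the alpha channel."""
    rgb = bytearray()
    for i in range(0, len(rgba), 4):
        rgb += rgba[i:i + 3]
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(rgb)


if __name__ == "__main__":
    # A 2x1 image: one opaque red pixel followed by one opaque green pixel.
    pixels = decode_pixels("ff0000ff00ff00ff", 2, 1)
    with open("recovered.ppm", "wb") as handle:
        handle.write(to_ppm(pixels, 2, 1))
```

The size check is there because a visitor may send truncated or corrupted data; comparing the copies received from several visitors would be another way to catch that.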

18 responses so far

How to defend against Yahoo! Slurp

Oct 09 2009

I was going through the logs of my web server for the last month and was shocked to see that a whopping 22.93% of the total bandwidth of a particular website of mine was used by the Yahoo crawler called Slurp (I should have known better, given the revealing name).

This is just ridiculous, particularly when taking into account the fact that Yahoo sends a negligible number of visitors to the website.

Search engine market share for Yahoo is coming down anyway; it is at 6.84% currently. For most of my sites Yahoo never sends more than 4% of the total traffic. This means that I have to pull the plug on Yahoo! Slurp’s free run for the time being.

So how do I stop the Yahoo! crawler?

Create a file named robots.txt in the root folder of the website with the following lines of text in it:

User-Agent: Slurp
Disallow: /

User-Agent: *
Disallow:

If you don’t want to completely block the Yahoo crawler, you can just reduce the number of requests Slurp sends to your server. To do this use the following lines in your robots.txt file:

User-agent: Slurp
Crawl-delay: 1

This “delay value” increases the time between successive Yahoo! crawler activities, and lowers the access rate of Slurp to your server. In the official FAQ you can see the details about Yahoo! Slurp and several ways to reduce the number of requests it makes to your site. For me though, supporting the Crawler is not worth the cost.
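If you want to check that the rules do what you expect before deploying them, Python's standard urllib.robotparser module can evaluate them locally. This is only a sketch: the example.com URL and the Googlebot agent are placeholders, and how Slurp itself interprets the file is up to Yahoo.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets from this post.
BLOCK_SLURP = """\
User-Agent: Slurp
Disallow: /

User-Agent: *
Disallow:
"""

SLOW_SLURP = """\
User-agent: Slurp
Crawl-delay: 1
"""


def load(rules):
    """Parse robots.txt text without fetching anything from the network."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


if __name__ == "__main__":
    url = "http://example.com/page.html"  # placeholder URL
    blocked = load(BLOCK_SLURP)
    print("Slurp allowed:", blocked.can_fetch("Slurp", url))
    print("Googlebot allowed:", blocked.can_fetch("Googlebot", url))
    print("Slurp crawl delay:", load(SLOW_SLURP).crawl_delay("Slurp"))
```

The crawl_delay method needs Python 3.6 or later.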

3 responses so far

More on ads

Jun 02 2009

For the last year or so I have been using a desktop application that displays an ad (something like 800 x 60 pixels) in its header.

Since yesterday the ad is not showing up. Now, I cannot recall what the ad was about!

It is a real shame that ads don’t work if used without permission. You try spamming me for a whole year, and I still see only what I want to.

No responses yet

Ad Revenue on the Web?

May 26 2009

I wish more people understood this: if ads are working on your site it is because someone has figured out how to monetize your users. (At a multiple of how much you get from the ads!) Why isn’t that someone you?

- from HN

No responses yet

Welcome the new player to the game

Apr 03 2009

Yesterday while looking at the long tail referrers of my website, I found something peculiar. Can you spot it?

kumo

Let us all welcome the new player to the game!

2 responses so far

Broken windows theory & online communities

Apr 02 2009

People reflect their surroundings very much in their actions, even more than what we think they do. They react to situations based on the ambiance of their surroundings rather than according to the behavioral traits they developed over time.

The broken windows theory conveys this simple yet powerful idea:

Consider a building with a few broken windows. If the windows are not repaired, the tendency is for vandals to break a few more windows. Eventually, they may even break into the building, and if it’s unoccupied, perhaps become squatters or light fires inside.

Or consider a sidewalk. Some litter accumulates. Soon, more litter accumulates. Eventually, people even start leaving bags of trash from take-out restaurants there or breaking into cars.

In short, make the surroundings better and people start behaving better.

The same theory applies to online communities too. There are a lot of online communities which flourished by hosting high quality discussions and providing excellent services to the users. What happens when you start attracting a very large number of users? What happens when users start deciding what is best for them? What if their decisions are bad for the community as a whole?

Take the case of Reddit. Programming reddit used to be the place where smart programmers hung around and had quality discussions about the subject they care about the most, but the simple fact that reddit supported other kinds of news/content in the form of subreddits made the site a place for a lot of funny, worthless and snarky comments. Now people are more interested in taking sides in worthless arguments (about Joel Spolsky?) than in serious, productive discussions.

Remember that I am not ranting about the quality of the users of reddit, but what I am trying to say is that reddit as a community has become bloated. Of course reddit does have a lot of brilliant hackers as users, but the place is not like what it used to be. I doubt that there can be serious (programming) discussions in reddit anymore. Reddit is becoming more of a Digg than anything else.

Meanwhile Hacker News is trying very hard to prevent the same thing from happening to them. It is a low-traffic news site for programmers that has very high quality content and committed contributors. A few days ago the site was mentioned in some social websites (including reddit) and a lot of traffic came in, and guess what they did? Here is what Paul Graham suggested:

We’ve had a huge spike in traffic lately, from roughly 24k daily uniques to 33k. This is a result of being mentioned on more mainstream sites. I hope this spike will subside, like past ones have. In the meantime I may temporarily hack a few things to make the site faster, like putting fewer results on threads pages.

You can help the spike subside by making HN look extra boring. For the next couple days it would be better to have posts about the innards of Erlang than women who create sites to get hired by Twitter.

That is a very bold step to take, and worth it if you take the quality of the content seriously.

It is not that the people using programming reddit and Hacker News are different. Even if the same person visits both sites, he will be more inclined to post funny remarks on reddit while he will give serious opinions on Hacker News. Not that there is something bad in being humorous, but trying too hard to be funny is kinda annoying.

The next time you build your online community, build it right. Weed out the unwanted distractions. And the next time you notice a comment that adds no value to the discussion on your blog, delete it. Sometimes, deleting content builds reputation faster than creating content.

Whatever you do, whatever you write, whatever you say, make it count, and the people around will give you back the same quality.

2 responses so far
