Archive for the 'Internet' category

On Internet Explorer

Jun 05 2010 Published by Niyaz PK under Internet, Microsoft

Web developers always complain about the amazing ways in which different versions of the infamous Internet Explorer break their websites. Currently there are three versions of IE found in the wild to give nightmares to any decent web developer: versions 6, 7 and 8. I think the problem with IE is not just about the IE team’s reluctance to conform to the latest web standards.

The biggest problem with IE is that its release cycle is a total failure. It is obvious to everyone but Microsoft. Listed below are the years in which the three versions of IE were released:

  • IE6 – 2001
  • IE7 – 2006
  • IE8 – 2009

Compare this to the recent major point releases of Google Chrome:

  • Chrome 3 – 2009
  • Chrome 4 – 2010
  • Chrome 5 – 2010
  • Chrome 6 – 2010 (expected)

Some might argue that it is not fair to compare just these dates without knowing the details of the versioning system the browser teams use, but let me tell you that this argument would still not help turn the blame away from the IE team. Google Chrome released many small updates even between these major releases, sometimes even on a weekly basis.

The web browser should not be considered just another desktop application. It is something that billions of people use every single day. It is the most important application on your computer. It is something that should be updated at least every month or so rather than every 5 years. Currently the patches from Microsoft for IE are related only to security issues. Meanwhile Firefox too is thinking about making its update mechanism silent and automatic (i.e. without user intervention), similar to what Google Chrome does. If the IE team is not going to release updates for their browser frequently enough, why bother releasing it at all?

Now think about it in clear terms: the lion’s share of users are not diligent enough to care about the version of the browser they are using. Outside the tech world, many do not know that better browser versions are available. You have to update the software without the user taking the initiative. How hard is it to figure this out?

If you go to the IE9 site, you will see the following score for IE9 preview version in the Acid3 test:

Impressive? Barely so. The current version of Google Chrome (5.0) already passes the Acid3 test with a score of 100! On top of that there are still no reliable reports on when this priceless edition of IE9 will finally ship after all these months of working on polishing the CSS rounded corners. Yes, there are lengthy posts in the IE team blog about how closely they follow the specs of CSS rounded corners, while they don’t dare to open their mouth about the <canvas> tag!

IE9 boasts hardware accelerated graphics rendering for faster performance, but what that means is that IE9 will not be available for Windows XP users. Keep in mind that Windows XP is the most used operating system in the world. This in turn means that when IE9 is released, web developers will have to support four different versions of IE.

One part of me prays that they ship a better version of IE soon, while the other paranoid part of me prays that they stop shipping IE altogether. Keeping the history of IE in mind, I do have reasons to be paranoid.

Links:

  1. IE team blog
  2. The CSS Corner: About CSS corners – IEBlog
  3. IE9 Acid3 Test

The flow of PageRank

Feb 02 2010 Published by Niyaz PK under Internet, Math

You may be familiar with Google’s PageRank technology. Google considers lots of variables to calculate the PageRank of your website. This is a discussion on an extremely simplified version of PageRank.

Assume that we rank the websites by the number and quality of incoming links. The quality of an incoming link is defined as a function of the PageRank of the site that the link comes from.

Let us take an example. The following figure shows how a small group of websites link to each other.

Note that website F does not have any incoming link while website G does not have any outgoing link.

Now from the given graph of links we have to find out the (relative) PageRank of each of the websites. Initially we will assume that all the pages have the same PageRank. Now we count the number of incoming links to each site and change the PageRank according to the number of incoming links.

We define PageRank of site A as:

PR(A) = Σ PR(x) / L(x)

where L(x) = number of outgoing links in site x

and x denotes the sites linking to A.

When you run this algorithm for the first time, the PageRank of every page gets updated. Now the problem is that since the PageRank of all the incoming pages has been updated, we have to re-calculate the PageRank of the pages again to take the new PageRank values into consideration. (You can predict this problem just by noticing that the function is a recursive one.) The same problem surfaces in every iteration of the algorithm.

The question is: if the PageRanks change in every iteration, how do we know when to stop iterating? Do the PageRanks ever stabilize? (The proper term is convergence.)

Here is a Python script that repeats the PageRank calculation over and over to find out whether the values converge or not. The output values are represented as percentages. (Google considers this value as the probability of a person visiting any particular website.)
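A minimal sketch of such a simulation (not the original script) could look like the following. The link graph `LINKS` is a made-up example, chosen only so that F has no incoming links and G has no outgoing links; the function name `pagerank`, the tolerance and the iteration cap are assumptions as well. The ranks are rescaled to sum to 100 after every pass so they read as percentages.

```python
# Hypothetical link graph: site -> sites it links to.
# F has no incoming links, G has no outgoing links.
LINKS = {
    "A": ["B", "D"],
    "B": ["C", "D"],
    "C": ["A", "D"],
    "D": ["A", "E", "G"],
    "E": ["A", "D"],
    "F": ["A", "D"],
    "G": [],
}


def pagerank(links, tol=1e-9, max_iter=1000):
    """Repeat PR(A) = sum(PR(x) / L(x)) until the values stop changing.

    Returns the ranks (as percentages) and the number of passes used.
    """
    sites = list(links)

    # Invert the graph: for each site, the sites that link to it.
    incoming = {site: [] for site in sites}
    for source, targets in links.items():
        for target in targets:
            incoming[target].append(source)

    # Start with the same PageRank for every site.
    ranks = {site: 100.0 / len(sites) for site in sites}

    for iteration in range(1, max_iter + 1):
        new_ranks = {
            site: sum(ranks[x] / len(links[x]) for x in incoming[site])
            for site in sites
        }
        # Rescale so the values always add up to 100 percent.
        total = sum(new_ranks.values())
        new_ranks = {site: 100.0 * value / total
                     for site, value in new_ranks.items()}

        change = max(abs(new_ranks[site] - ranks[site]) for site in sites)
        ranks = new_ranks
        if change < tol:
            break

    return ranks, iteration


ranks, iterations = pagerank(LINKS)
print(f"stopped after {iterations} iterations")
for site in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{site}: {ranks[site]:.2f}%")
```

Printing `ranks` inside the loop instead of only at the end gives the per-iteration values needed to draw a convergence chart.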

The chart below shows how the PageRank changes after each iteration:

As you can see, the PageRanks fluctuate highly in the initial iterations and then they stabilize. This means that the PageRank function converges.

Another thing to note is that adding more nodes to the graph did not seem to affect the convergence. Even if you double the number of sites in the collection, the number of iterations taken to converge stays almost the same. Others have also reached the same results (ppt). The PageRank function is analogous to electric current flowing through a mesh. Even if there are a lot of nodes and sources, the current flow stabilizes (and stabilizes really fast).

Also note that site D has the highest PageRank, which is to be expected as it has the most incoming links. Site F has the lowest PageRank because it does not have incoming links.

According to this algorithm, linking out to other sites does not reduce the PageRank of your website. There is a problem though. Take the case of site G. It does not link to any other site. This means that the PageRank is not flowing out of site G to any other site. If site G linked to other sites, it would have increased the PageRank of the other sites by a tiny bit. (This case affects only the first link out of any site.) To solve this problem, Google divides the PageRank of sites like these (called sinks) among all other sites. You may also want to read about the damping factor.

Before leaving, can you explain why the PageRank of site A is greater than that of site B?
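The sink fix described above can be sketched as a small variation on the basic update. This is only an illustration of the idea under stated assumptions, not Google's actual implementation: the graph `LINKS` is a made-up example, the function name `pagerank_with_sinks` and the fixed pass count are hypothetical, and no damping factor is applied.

```python
# Hypothetical link graph: site -> sites it links to. G is a sink.
LINKS = {
    "A": ["B", "D"],
    "B": ["C", "D"],
    "C": ["A", "D"],
    "D": ["A", "E", "G"],
    "E": ["A", "D"],
    "F": ["A", "D"],
    "G": [],
}


def pagerank_with_sinks(links, passes=100):
    """PageRank update in which a sink shares its rank with all other sites."""
    sites = list(links)
    count = len(sites)
    ranks = {site: 100.0 / count for site in sites}

    for _ in range(passes):
        new_ranks = {site: 0.0 for site in sites}
        for source, targets in links.items():
            if targets:
                # Normal case: split the rank over the outgoing links.
                share = ranks[source] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Sink: split the rank equally among every other site.
                share = ranks[source] / (count - 1)
                for other in sites:
                    if other != source:
                        new_ranks[other] += share
        ranks = new_ranks

    return ranks


sink_ranks = pagerank_with_sinks(LINKS)
for site in sorted(sink_ranks, key=sink_ranks.get, reverse=True):
    print(f"{site}: {sink_ranks[site]:.2f}%")
```

Because every site now passes on all of its rank, the total stays at 100 without any rescaling, and even a site with no incoming links ends up with a small non-zero value.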

Get cached images from your visitors

Dec 12 2009 Published by Niyaz PK under Internet, Programming

Jeff Atwood (of Coding Horror fame) was in for a horror when he realized that his server had crashed, his data was gone and, for some reason, the backup mechanism had not been working. The complete data of Coding Horror and the StackOverflow blog disappeared.

Since his blog is very popular, many archiving systems including the Google cache have copies of the pages and I hope that they have by now recovered the complete textual data. The biggest problem in this case is getting back the images. There are not many archiving services that may have the complete backup of the images in the website.

So what should Jeff do now?

Since Coding Horror is a high traffic blog, I think there is a way to get back at least some of the images. (The probability of this working depends a lot on the traffic to the website and a bit of luck.)

Here are the steps:

  1. Configure the web server to return 304 for every image request. The HTTP status code 304 means that the file is not modified and this means that the browser will fetch the file from its cache if it is present there. (credit: this SuperUser answer)
  2. In every page in the website, add a small script to capture the image data and send it to the server.
  3. Save the image data in the server.
  4. Convert the pixel data to get the original images. Voilà!
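Step 1 would normally be done in the configuration of whatever web server the site runs on. Purely as an illustration of the behaviour, here is a sketch using Python's built-in `http.server`; the extension list, the class name `NotModifiedForImages` and the `serve` helper are assumptions made for this example.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler
from urllib.parse import urlsplit

# Assumed set of image extensions; adjust to what the site actually uses.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif")


def is_image_request(path):
    """True if the URL path (ignoring any query string) looks like an image."""
    return urlsplit(path).path.lower().endswith(IMAGE_EXTENSIONS)


class NotModifiedForImages(SimpleHTTPRequestHandler):
    """Answer every image request with 304; serve everything else as usual."""

    def do_GET(self):
        if is_image_request(self.path):
            # 304 Not Modified carries no body, so the browser falls back
            # to its cached copy if it has one.
            self.send_response(304)
            self.end_headers()
        else:
            super().do_GET()


def serve(port=8000):
    """Serve the current directory on the given port until interrupted."""
    HTTPServer(("", port), NotModifiedForImages).serve_forever()
```

Calling `serve()` from the site's root directory starts the server. Visitors without a cached copy simply see a broken image, which is no worse than the situation after the crash.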

Capturing the image data

We are going to use the Canvas functionality in HTML5 to get back the image data.

Here is the code you should insert into the pages of the website. It gets all the images in the current page, draws each one onto an HTML canvas, reads the pixel data for the image and sends it to the server through an Ajax POST.

This PHP script (Can PHP rescue Jeff? ;) To be fair, the server side code is trivial) saves the data to files on the server. Note that the files themselves will not be images; they will just contain the pixel data of the images. In addition to this, we are also saving the original file name and the image dimensions. This means that we can easily reconstruct the original images from this data. Data from every visitor is saved in a different file just to make sure that you have enough redundancy. (Watch out for this redundancy filling up your server disks.)
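Reconstructing an image from the saved pixel data (step 4) can be sketched as follows. The storage format here is an assumption: the pixel data is taken to be one long string of hexadecimal digits, two per channel, in the RGBA order that the canvas returns. The function name `hex_rgba_to_ppm` is hypothetical, and the output is a binary PPM file because that format needs nothing beyond the standard library.

```python
def hex_rgba_to_ppm(hex_pixels, width, height):
    """Turn a hex string of RGBA pixel values into the bytes of a PPM image.

    Assumes two hex digits per channel and four channels per pixel.
    The alpha channel is dropped because PPM has no transparency.
    """
    raw = bytes.fromhex(hex_pixels)
    if len(raw) != width * height * 4:
        raise ValueError("pixel data does not match the image dimensions")

    rgb = bytearray()
    for offset in range(0, len(raw), 4):
        rgb += raw[offset:offset + 3]  # keep R, G, B and skip A

    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(rgb)


# A 2 x 1 image: one opaque red pixel followed by one opaque green pixel.
ppm_bytes = hex_rgba_to_ppm("ff0000ff" "00ff00ff", width=2, height=1)
with open("recovered.ppm", "wb") as output:
    output.write(ppm_bytes)
```

Any image tool that reads PPM can then convert the result to the original file type, using the saved file name to pick the extension.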

Remember that this is a proof of concept code. You will have to modify it to use it in regular production environments and to get some real use from it. There are many limitations to this code. It goes without saying that you will get the image data back from the users only if they have the images cached in their browsers. This script will work only in the latest versions of Chrome, Firefox, Safari, Opera etc. (Don’t ever expect it to work in IE for the next decade). In addition to this, remember that the pixel data will be many times bigger than the original file size and you may have to carefully analyze the disk space usage of this script. (I guess in an emergency, none of these really matters).

You should edit the post URL in the script to match your domain name.

Finally, I have tested the code and it seems to be working (for me, at least). You need to include jQuery in the pages using this script, and remember that due to security restrictions in the browsers, you will have to place all these files under the same domain name. Please tell me if there are any other flaws in the code.

[Updated: code changes to reduce the file size by 50%. The decimal numbers were converted to hex and the spaces in between the numbers removed. The file sizes can be further reduced by using the full character set.]

How to defend against Yahoo! Slurp

Oct 09 2009 Published by Niyaz PK under Internet

I was going through the logs of my web server for the last month and was shocked to see that a whopping 22.93% of the total bandwidth of a particular website of mine was used by the Yahoo crawler called Slurp (I should have known better, given the revealing name).

This is just ridiculous, particularly when taking into account the fact that Yahoo sends a negligible number of visitors to the website.

Search engine market share for Yahoo is coming down anyway: it is at 6.84% currently. For most of my sites Yahoo never sends more than 4% of the total traffic. This means that I have to pull the plug on Yahoo! Slurp’s free run for the time being.

So how do I stop the Yahoo! crawler?

Create a file named robots.txt in the root folder of the website with the following lines of text in it:

User-agent: Slurp
Disallow: /

User-agent: *
Disallow:
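One way to check what these rules do before uploading the file is Python's standard `urllib.robotparser` module. The sketch below parses the same rules from a string; the test path `/any/page.html` and the second crawler name are arbitrary examples.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the robots.txt file described above.
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Slurp is blocked from the whole site, everyone else is allowed.
slurp_allowed = parser.can_fetch("Slurp", "/any/page.html")
other_allowed = parser.can_fetch("Googlebot", "/any/page.html")

print("Slurp allowed:", slurp_allowed)
print("Other crawlers allowed:", other_allowed)
```

Note that this only shows how a standards-following parser reads the file; whether a given crawler honours it is up to the crawler.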

If you don’t want to completely block the Yahoo crawler, you can just reduce the number of requests Slurp sends to your server. To do this use the following lines in your robots.txt file:

User-agent: Slurp
Crawl-delay: 1

This “delay value” increases the time between successive Yahoo! crawler requests and lowers the access rate of Slurp to your server. In the official FAQ you can see the details about Yahoo! Slurp and several ways to reduce the number of requests it makes to your site. For me though, supporting the crawler is not worth the cost.
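The delay rule can be read back in the same spirit with `urllib.robotparser`, which exposes it through `crawl_delay` (available since Python 3.6). This is only a sketch showing how a parser interprets the two lines; the second crawler name is an arbitrary example.

```python
from urllib.robotparser import RobotFileParser

# The Crawl-delay rule for Slurp, as it would appear in robots.txt.
DELAY_RULES = """\
User-agent: Slurp
Crawl-delay: 1
"""

delay_parser = RobotFileParser()
delay_parser.parse(DELAY_RULES.splitlines())

# The delay applies to Slurp only; other crawlers have no delay set.
slurp_delay = delay_parser.crawl_delay("Slurp")
other_delay = delay_parser.crawl_delay("Googlebot")

print("Slurp delay:", slurp_delay)
print("Other crawlers delay:", other_delay)
```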

More on ads

Jun 02 2009 Published by Niyaz PK under Internet

For the last year or so I have been using a desktop application that displays an ad (something like 800 x 60 pixels) in its header.

Since yesterday the ad is not showing up. Now, I cannot recall what the ad was about!

It is a real shame that ads don’t work if used without permission. You try spamming me for a whole year, and I still see only what I want to.

Ad Revenue on the Web?

May 26 2009 Published by Niyaz PK under Internet

I wish more people understood this: if ads are working on your site it is because someone has figured out how to monetize your users. (At a multiple of how much you get from the ads!) Why isn’t that someone you?

- from HN

Welcome the new player to the game

Apr 03 2009 Published by Niyaz PK under Internet, Microsoft

Yesterday while looking at the long tail referrers of my website, I found something peculiar. Can you spot it?

kumo

Let us all welcome the new player to the game!
