Saturday, 12 January 2013

How does Google rank webpages?

Each time you search on www.google.com, Google solves a very big system of linear equation to rank the webpages.
The idea of embedding links in text dates back to the middle of the last
century. As the Internet scaled up, and with the introduction of the web in 1989,
the browser in 1990, and the web portal in 1994, this vision was realized on
an unprecedented scale. The network of webpages is huge: somewhere between
40 billion and 60 billion according to various estimates. And most of them are
connected to each other in a giant component of this network. It is also sparse:
most webpages have only a few hyperlinks pointing inward from other webpages
or pointing outward to other webpages. Google search organizes this huge and
sparse network by ranking the webpages.
More important webpages should be ranked higher. But how do you quantify
how important a webpage is? Well, if there are many other important webpages
pointing towards webpage A, A is probably important. This argument implicitly
assumes two ideas:
  •  Webpages form a network, where a webpage is a node, and a hyperlink is a directed link in the network: webpage A may point to webpage B without B pointing back to A.
  •  We can turn the seemingly circular logic of \important webpages pointing to you means you are important" into a set of equations that characterize the equilibrium (a xed-point equilibrium, not a game-theoretic Nash equilibrium) in terms of a recursive de nition of \importance." This importance score will then act as an approximation of the ultimate test of search engines: how useful a user nds the search results.
A network consists of both a topology and functionalities. Topology is often represented by a graph and various matrices.
Suppose there are N webpages. Each webpage i has Oi outgoing links and
Ii incoming links. We cannot just count the number of webpages pointing to
a given webpage A, because that number, the in-degree of the node in the
hyperlinked graph, is often not the right measure of importance.
Let us denote the \importance score" of each webpage by i. If important
webpages point to webpage A, maybe webpage A should be important too, where the sum is taken over all the webpages pointing to A.
However, this is not quite right either, since node i may be pointing to many
other nodes in this graph, and that means each of these nodes receives only a
small portion of node i's importance score.
Let us assume that each node's importance score is evenly distributed across
all the outgoing links from that node, i.e., each of the outgoing neighbors of node
i receives i=Oi importance score. Now each node's importance score can also
be written as the sum of the importance scores received from all of the incoming
neighbors, indexed by j.
If this sum is indeed also 1, we have consistency of the scores. But it is not
clear whether we can readily compute these scores, or even whether there is a
consistent set of scores at all.
It turns out that, with a couple of modi cations to the basic idea above, there is always a unique set of consistent scores, denoted as  i. These scores determine the ranking of the webpages: the higher the score, the higher the webpage is ranked.
For example, consider a very small graph with just four webpages and six hyperlinks, shown in Figure 3.1. This is a directed graph where each node is a
webpage and each link a hyperlink. A consistent set of importance scores turns
out to be (0:125; 0:125; 0:375; 0:375): webpages 3 and 4 are more important than webpages 1 and 2. In this small example, it so happens that webpages 3 and 4, linking each other, push both webpages' rankings higher. Intuitively, the scores make sense. First, by symmetry of the graph, webpages 1 and 2 should have the same importance score. We can view webpages 3 and 4 as if they form one webpage rst, a supernode 3+4. Since node 3+4 has two incoming links, and each of nodes 1 and 2 has only one incoming link, node 3+4 should have a higher importance score. Since node 3 points to node 4 and vice versa, these two nodes' importance scores mix into an equal division at equilibrium. This line of reasoning qualitatively explains the actual scores we see.
But how do we calculate the exact scores? In this small example, it boils down
to two simple linear equations. Let the score of node 1 (and 2) be x, and that
of node 3 (and 4) be y. Looking at node 1's incoming links, we see that there is
only one such link, coming from node 4 that points to three nodes. So we know x = y=3. By normalization, all scores must add up to 2x + 2y = 1. So we have
x = 0:125 and y = 0:375.
Now, how do we compute this set of consistent scores in a large, sparse, general
graph of hyperlink connectivity?

No comments:

Post a Comment