Google PageRank



By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

1. Degree and Importance of Network
2. Google PageRank

1.1. Degree

Degree of Undirected Graph

  • The degree of a vertex in a graph is the number of edges connected to it
  • denote the degree of vertex $i$ by $k_i$
  • for an undirected graph of $n$ vertices


$$ k_i = \sum_{j=1}^{n} \; A_{ij} $$


  • Every edge in an undirected graph has two ends and if there are $m$ edges


$$ 2m = \sum_{i=1}^{n} \; k_i $$


  • the mean degree $c$ of a vertex


$$ c = \frac{1}{n} \sum_{i=1}^{n} \; k_i $$


  • the maximum possible number of edges in a graph is $\left( \begin{array}{c} n\\2 \end{array} \right) = \frac{n(n-1)}{2}$

  • density $\rho$ of a graph is the fraction of these edges that are actually present


$$ \rho = \frac{m}{\left( \begin{array}{c} n\\2 \end{array} \right)}$$
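These identities are easy to verify numerically. A minimal sketch on a small hypothetical undirected graph (the adjacency matrix below is made up for illustration):

```python
import numpy as np

# adjacency matrix of a small, hypothetical undirected graph
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]])

n = A.shape[0]
k = A.sum(axis=1)          # degree of each vertex: k_i = sum_j A_ij
m = k.sum() / 2            # number of edges: 2m = sum_i k_i
c = k.mean()               # mean degree
rho = m / (n*(n-1)/2)      # density: fraction of possible edges present

print(k, m, c, rho)
```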


Degree of Directed Graph

  • in-degree: the number of incoming edges connected to a vertex


$$ k_i^{\text{in}} = \sum_{j=1}^{n} \; A_{ji} $$


  • out-degree: the number of outgoing edges


$$ k_i^{\text{out}} = \sum_{j=1}^{n} \; A_{ij} $$
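A quick sketch, using the convention of the PageRank example below ($A_{ij} = 1$ for an edge from node $i$ to node $j$), so in-degrees are column sums and out-degrees are row sums:

```python
import numpy as np

# directed graph that reappears in the PageRank section
# convention: A[i, j] = 1 means an edge from node i to node j
A = np.array([[0, 0, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 1, 1, 0]])

k_in  = A.sum(axis=0)   # in-degree:  column sums -> (1, 1, 3, 1)
k_out = A.sum(axis=1)   # out-degree: row sums    -> (1, 1, 1, 3)

print(k_in, k_out)
```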


1.2. Centrality

  • Which are the most important or central vertices in a network?
  • degree centrality is a simple centrality measure
    • in a social network, it seems reasonable to suppose that individuals who have connections to many others might have more influence and more access to information

Eigenvector centrality

  • not all neighbors are equivalent
  • increase importance by having connections to other vertices that are themselves important

  • sum of the centrality (ranks, scores) of its incoming neighbors

  • iteratively update


$$ \begin{align*} r_i \leftarrow \sum_{j \, \in \, N(i)} r_j = \sum_{j} A_{ji}r_j \end{align*} $$


  • matrix form with a row vector of $r$


$$ \begin{align*} &1) \quad r \leftarrow r\,A &\text{update}\\ &2) \quad r \leftarrow \frac{r}{\parallel r \parallel_1}& \text{normalize} \end{align*} $$


  • $r$ is the left eigenvector of the adjacency matrix $A$ associated with its largest eigenvalue


$$ \lambda \, r = r\,A $$
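The update-and-normalize loop above is just power iteration. A minimal sketch on a small hypothetical graph (the adjacency matrix is made up for illustration):

```python
import numpy as np

# power iteration for eigenvector centrality (sketch)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 0]])

r = np.ones(4) / 4                # start from uniform scores
for _ in range(100):
    r = r @ A                     # 1) update:    r <- r A
    r = r / np.linalg.norm(r, 1)  # 2) normalize: r <- r / ||r||_1

print(r)  # converges to the dominant left eigenvector of A
```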


2. Google PageRank

  • Source
    • Networks: Friends, Money, and Bytes
    • by Prof. Mung Chiang at Princeton University
    • youtube playlist
  • Consider World Wide Web as a graph (or network)

  • Question: which page is more important?

  • For Google, which page among google search results should be placed on top?



2.1. PageRank



$$ A = \begin{bmatrix} 0 & 0 & 1 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\\ 1 & 1 & 1 & 0 \end{bmatrix}$$


  • In-degree
    • What makes a page important?
    • Possible measure
      • In-degree: measure of the number of incoming links a node has ($1,1,3,1$)
    • Does this tell the whole story?


  • Importance score
$$ \pi = \left[\pi_1,\, \pi_2,\, \pi_3,\, \pi_4 \right]$$
  • Importance score $\pi_i$ for node $i$

    • each node has its own importance score
    • each node spreads an equal amount of its importance to each outgoing link
  • $\pi$ is a row vector by convention

  • Out-degree: the number of outgoing links from a node $\to (1,1,1,3)$

  • Markov chain with transition probability matrix $H$


$$ H = D^{-1}A = \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & \frac{1}{3} \end{bmatrix} \begin{bmatrix} 0 & 0 & 1 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\\ 1 & 1 & 1 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 & \frac{1}{1} & 0\\ 0 & 0 & \frac{1}{1} & 0\\ 0 & 0 & 0 & \frac{1}{1}\\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} & 0 \end{bmatrix}$$


  • Simultaneous equations
    • We can write each node's score in terms of its incoming links
    • Recursion: seemingly circular logic


$$ \begin{align*} \pi_1 &= \frac{\pi_4}{3}\\ \pi_2 &= \frac{\pi_4}{3}\\ \pi_3 &= \pi_1 + \pi_2 + \frac{\pi_4}{3}\\ \pi_4 &= \pi_3 \end{align*} $$
    


$$ H = \begin{bmatrix} 0 & 0 & \frac{1}{1} & 0\\ 0 & 0 & \frac{1}{1} & 0\\ 0 & 0 & 0 & \frac{1}{1}\\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} & 0 \end{bmatrix}$$
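The system above can also be solved directly as a set of linear equations, appending the normalization constraint $\sum_i \pi_i = 1$ (a sketch):

```python
import numpy as np

# solve pi H = pi together with sum(pi) = 1 as one linear system
H = np.array([[0,   0,   1,   0],
              [0,   0,   1,   0],
              [0,   0,   0,   1],
              [1/3, 1/3, 1/3, 0]])

n = H.shape[0]
M = np.vstack([H.T - np.eye(n),   # (H^T - I) pi^T = 0
               np.ones(n)])       # normalization: sum(pi) = 1
b = np.zeros(n + 1)
b[-1] = 1

pi, *_ = np.linalg.lstsq(M, b, rcond=None)
print(pi)   # ≈ [0.125 0.125 0.375 0.375]
```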


  • or as an eigenvalue problem (a left eigenvector, i.e., an eigenvector of $H^T$)


$$ \left[\pi_1,\, \pi_2,\, \pi_3,\, \pi_4 \right] \times \begin{bmatrix} 0 & 0 & \frac{1}{1} & 0\\ 0 & 0 & \frac{1}{1} & 0\\ 0 & 0 & 0 & \frac{1}{1}\\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} & 0 \end{bmatrix} = 1\left[\pi_1,\, \pi_2,\, \pi_3,\, \pi_4 \right] $$


In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
A = np.matrix([[0, 0, 1, 0],
               [0, 0, 1, 0],
               [0, 0, 0, 1],
               [1, 1, 1, 0]])

d = np.diag([1, 1, 1, 3])
d = np.asmatrix(d)

H = d.I*A
In [3]:
# eigenvalue

[D, V] = np.linalg.eig(H.T)

print(D) # check that one of the eigenvalues equals 1
print(V[:,0]/np.sum(V[:,0]))
[ 1.0000000e+00+0.j         -5.0000000e-01+0.64549722j
 -5.0000000e-01-0.64549722j  6.2307862e-35+0.j        ]
[[0.125-0.j]
 [0.125-0.j]
 [0.375-0.j]
 [0.375-0.j]]
In [4]:
# iterative

# random initialization
r = np.random.rand(1,4)
r = r/np.sum(r)
print(r)
[[0.30686635 0.39529553 0.10359321 0.1942449 ]]
In [5]:
for _ in range(100):
    r = r*H
    
print(r)    
[[0.125 0.125 0.375 0.375]]
In [6]:
r = np.random.rand(1,4)
r = r/np.sum(r)

r_pre = 1/4*np.ones([1,4])

while np.linalg.norm(r_pre - r) > 1e-10:
    r_pre = r
    r = r*H
    
print(r)    
[[0.125 0.125 0.375 0.375]]
  • Dangling nodes
    • Does what we just presented always lead to a unique solution?
    • Not quite yet!

    • Dangling nodes
      • nodes that do not point to any others
      • lead to no meaningful solution
      • Solution: assume a dangling node points to every node in the graph ($\rightarrow$ random surfer)
  • The Random Surfer
    • Surfing the web at random
    • Two possible actions
      • pick a hyperlink randomly from the current page
      • enter a URL randomly
    • Stochastic matrix $H'$


$$H' = H + \frac{1}{N}d \mathbf{1}^T$$

$\qquad \qquad $where $d$ is the dangling (absorbing) node indicator column vector and $\mathbf{1}$ is a column vector of ones



$$ H' = \begin{bmatrix} 0 & 0 & 0 & 1\\ \frac{1}{3} & 0 & \frac{1}{3} & \frac{1}{3}\\ \frac{1}{3} & \frac{1}{3} & 0 & \frac{1}{3}\\ 0 & 0 & 0 & 0 \end{bmatrix} + \begin{bmatrix} 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \end{bmatrix} = H + \frac{1}{4} \begin{bmatrix} 0\\0\\0\\1\end{bmatrix} \begin{bmatrix} 1 & 1 & 1 &1 \end{bmatrix} $$


In [7]:
A = np.matrix([[0, 0, 0, 1],
               [1, 0, 1, 1],
               [1, 1, 0, 1],
               [0, 0, 0, 0]])

s = np.sum(A, axis = 1)

H2 = np.zeros(A.shape)

for i in range(4):
    if s[i,0] != 0:
        H2[i,:] = A[i,:]/s[i,0]
    else:
        H2[i,:] = 1/4*np.ones([1,4])
    
H2 = np.asmatrix(H2)
print(H2)
[[0.         0.         0.         1.        ]
 [0.33333333 0.         0.33333333 0.33333333]
 [0.33333333 0.33333333 0.         0.33333333]
 [0.25       0.25       0.25       0.25      ]]
In [8]:
r = np.random.rand(1,4)
r = r/np.sum(r)

r_pre = 1/4*np.ones([1,4])

while np.linalg.norm(r_pre - r) > 1e-10:
    r_pre = r
    r = r*H2
    
print(r)    
[[0.22222222 0.16666667 0.16666667 0.44444444]]
  • Disconnected graph

    • Connected component - a group of nodes which can reach each other, but none outside the group
    • More than one connected component

      • Infinitely many solutions
      • Stuck in subgraph forever
      • Solution: add the purely random surfing into the mix
        • you get bored, enter a URL randomly
    • PageRank: what is the chance a page is selected at any time?

    • Google matrix $G$




$$ G = \alpha \begin{bmatrix} \frac{1}{2} & \frac{1}{2} & 0 & 0\\ \frac{1}{2} & \frac{1}{2} & 0 & 0\\ 0 & 0 & \frac{1}{2} & \frac{1}{2}\\ 0 & 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix} + (1-\alpha) \frac{1}{4}\begin{bmatrix} 1 & 1 & 1 & 1\\ 1 & 1 & 1 & 1\\ 1 & 1 & 1 & 1\\ 1 & 1 & 1 & 1 \end{bmatrix} $$


$$G = \alpha H' + (1-\alpha)\frac{1}{N}\mathbf{1}\mathbf{1}^T$$


In [9]:
A = np.matrix([[1, 1, 0, 0],
               [1, 1, 0, 0],
               [0, 0, 1, 1],
               [0, 0, 1, 1]])

s = np.sum(A, axis = 1)

H2 = np.zeros(A.shape)

for i in range(4):
    if s[i,0] != 0:
        H2[i,:] = A[i,:]/s[i,0]
    else:
        H2[i,:] = 1/4*np.ones([1,4])
    
H2 = np.asmatrix(H2)
print(H2, '\n')

alpha = 0.85
G = alpha*H2 + (1-alpha)*1/4*np.ones([4,4])
G = np.asmatrix(G)

print(G)
[[0.5 0.5 0.  0. ]
 [0.5 0.5 0.  0. ]
 [0.  0.  0.5 0.5]
 [0.  0.  0.5 0.5]] 

[[0.4625 0.4625 0.0375 0.0375]
 [0.4625 0.4625 0.0375 0.0375]
 [0.0375 0.0375 0.4625 0.4625]
 [0.0375 0.0375 0.4625 0.4625]]
In [10]:
r = np.random.rand(1,4)
r = r/np.sum(r)

r_pre = 1/4*np.ones([1,4])

while np.linalg.norm(r_pre - r) > 1e-10:
    r_pre = r
    r = r*G
    
print(r)    
[[0.25 0.25 0.25 0.25]]

2.2. PageRank Computations

  1. Eigenvalue problem (choose left eigenvector $\pi$ with $\lambda = 1$)

    $$ \pi G = 1 \pi \implies \pi = \alpha \pi H' + (1-\alpha)\frac{1}{N}\mathbf{1}^T$$

    where $\pi$ is the stationary distribution of the Markov chain, a row vector by convention

  2. Power iterations:

$$ \pi \leftarrow \pi \left(\alpha H' + (1-\alpha)\frac{1}{N}\mathbf{1}\mathbf{1}^T \right) = \alpha \pi H' + (1-\alpha)\frac{1}{N}\mathbf{1}^T $$
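Because $\pi \mathbf{1} = 1$, the expanded form on the right means $G$ never has to be formed explicitly; each step only needs a multiplication by the (typically sparse) $H'$. A sketch using $H'$ from the disconnected two-component example above:

```python
import numpy as np

# power iteration in the expanded form: G is never formed explicitly
alpha = 0.85
H2 = np.array([[0.5, 0.5, 0,   0  ],    # H' of the disconnected example
               [0.5, 0.5, 0,   0  ],
               [0,   0,   0.5, 0.5],
               [0,   0,   0.5, 0.5]])

N = H2.shape[0]
pi = np.ones(N) / N
for _ in range(100):
    pi = alpha * (pi @ H2) + (1 - alpha) / N   # pi G, using pi 1 = 1

print(pi)   # [0.25 0.25 0.25 0.25]
```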


2.3. Summary


$$ \begin{align*} H & &\text{from webgraph}\\ H'& = H + \frac{1}{N}d \mathbf{1}^T &\text{to overcome dangling issue}\\ G &= \alpha H' + (1-\alpha)\frac{1}{N}\mathbf{1}\mathbf{1}^T= \alpha \left( H + \frac{1}{N}d \mathbf{1}^T\right) + (1-\alpha)\frac{1}{N}\mathbf{1}\mathbf{1}^T &\text{to overcome disconnected subgraph} \end{align*} $$



$$ \pi [k+1] = \pi [k] G$$
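The whole pipeline in the summary can be collected into one function. This is a sketch; the helper name `pagerank` and its default arguments are our own choices, not from the source:

```python
import numpy as np

def pagerank(A, alpha=0.85, n_iter=100):
    """Power-iteration PageRank following the summary above (a sketch)."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    k_out = A.sum(axis=1)
    d = (k_out == 0)                                   # dangling-node indicator
    H = A / np.where(d, 1, k_out)[:, None]             # row-normalize non-dangling rows
    Hp = H + np.outer(d, np.ones(N)) / N               # H' = H + (1/N) d 1^T
    pi = np.ones(N) / N
    for _ in range(n_iter):
        pi = alpha * (pi @ Hp) + (1 - alpha) / N       # pi G, without forming G
    return pi

# dangling-node example from above; alpha = 1.0 reproduces the earlier result
A = np.array([[0, 0, 0, 1],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 0, 0, 0]])
print(pagerank(A, alpha=1.0))   # ≈ [0.2222 0.1667 0.1667 0.4444]
```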

