Google PageRank

By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

1. Degree and Importance of Network

1.1. Degree

Degree of Undirected Graph

The degree of a vertex in a graph is the number of edges connected to it
denote the degree of vertex $i$ by $k_{i}$
for an undirected graph of $n$ vertices with adjacency matrix $A$

$$ k_i = \sum_{j=1}^{n} \; A_{ij} $$

Every edge in an undirected graph has two ends, so if there are $m$ edges the degrees sum to $2m$

$$ 2m = \sum_{i=1}^{n} \; k_i $$

the mean degree $c$ of a vertex

$$ c = \frac{1}{n} \sum_{i=1}^{n} \; k_i $$

the maximum possible number of edges in a simple graph (no self-loops or multi-edges) is $\binom{n}{2} = \frac{n(n-1)}{2}$
density $\rho$ of a graph is the fraction of these edges that are actually present

$$ \rho = \frac{m}{\binom{n}{2}} = \frac{2m}{n(n-1)} $$
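
As a quick numerical check of the formulas above, the following sketch computes the degrees, edge count, mean degree, and density from an adjacency matrix. The 4-vertex graph is a made-up example, not one taken from these notes.

```python
import numpy as np

# Hypothetical undirected graph on 4 vertices (symmetric adjacency matrix)
# with edges 0-1, 0-2, 1-2, 2-3
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

n = A.shape[0]
k = A.sum(axis=1)            # degree k_i = sum_j A_ij
m = k.sum() // 2             # every edge is counted twice in the degree sum
c = k.mean()                 # mean degree, equal to 2m / n
rho = m / (n * (n - 1) / 2)  # density: fraction of possible edges present

print(k, m, c, rho)
```

Here the degree sum is 8, so $m = 4$, $c = 2$, and $\rho = 4/6$.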

Degree of Directed Graph

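With the convention $A_{ij} = 1$ when there is an edge from vertex $i$ to vertex $j$ (the convention used in the PageRank examples below)

in-degree: the number of incoming edges connected to a vertex

$$ k_i^{\text{in}} = \sum_{j=1}^{n} \; A_{ji} $$

out-degree: the number of outgoing edges

$$ k_i^{\text{out}} = \sum_{j=1}^{n} \; A_{ij} $$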
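
A small sketch of both degree counts, under the assumption that `A[i, j] = 1` denotes an edge from node `i` to node `j` (the reading the PageRank example later in these notes relies on). Under the opposite convention the roles of rows and columns swap.

```python
import numpy as np

# Adjacency matrix of the 4-page example from Section 2.1,
# read as A[i, j] = 1 when page i links to page j (assumed convention)
A = np.array([[0, 0, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 1, 1, 0]])

k_out = A.sum(axis=1)   # row sums: links leaving each node
k_in = A.sum(axis=0)    # column sums: links arriving at each node

print(k_in)    # [1 1 3 1]
print(k_out)   # [1 1 1 3]
```

Both vectors sum to the number of edges, since every directed edge leaves exactly one node and arrives at exactly one node.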

1.2. Centrality

Which are the most important or central vertices in a network?
degree centrality is a simple centrality measure
- in a social network, it seems reasonable to suppose that individuals who have connections to many others might have more influence and more access to information

Eigenvector centrality

not all neighbors are equivalent
increase importance by having connections to other vertices that are themselves important
the centrality of a vertex is the sum of the centralities (ranks, scores) of its incoming neighbors
iteratively update

$$ \begin{align*} r_i \leftarrow \sum_{j \, \in \, N(i)} r_j = \sum_{j} A_{ji}r_j \end{align*} $$

matrix form with a row vector of $r$

$$ \begin{align*} &1) \quad r \leftarrow r\,A &\text{update}\\ &2) \quad r \leftarrow \frac{r}{\parallel r \parallel_1}& \text{normalize} \end{align*} $$

the iteration converges to the left eigenvector of the adjacency matrix $A$ associated with the largest eigenvalue

$$ \lambda \, r = r\,A $$
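
The update-and-normalize loop can be sketched as follows. This is a minimal illustration, not a general-purpose routine: the graph is the 4-node example used later in these notes, `A[i, j] = 1` is read as an edge from `i` to `j`, and the fixed iteration count of 200 is an arbitrary, generous choice. It uses plain NumPy arrays and the `@` operator.

```python
import numpy as np

# Example directed graph; A[i, j] = 1 is read as an edge from i to j (assumption)
A = np.array([[0, 0, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 1, 1, 0]], dtype=float)

r = np.ones(4) / 4                 # start from uniform scores
for _ in range(200):               # fixed iteration count, chosen generously
    r = r @ A                      # update: r_i <- sum_j A_ji r_j
    r = r / np.linalg.norm(r, 1)   # normalize so the scores sum to 1

lam = np.sum(r @ A)                # because sum(r) = 1, this estimates the eigenvalue
print(r, lam)
print(np.allclose(r @ A, lam * r)) # checks lambda r = r A
```

The loop settles here because this particular graph is strongly connected and aperiodic. On other graphs the plain iteration may oscillate or collapse to zero.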

online note

2. Google PageRank

Source
- Networks: Friends, Money, and Bytes
- by Prof. Mung Chiang at Princeton University
- youtube playlist

Consider the World Wide Web as a graph (or network)
Question: which page is more important?
For Google, which page among the search results should be placed on top?

Read Wikipedia
The original PageRank paper by Google’s founders Sergey Brin and Lawrence Page

2.1. PageRank

$$ A = \begin{bmatrix} 0 & 0 & 1 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\\ 1 & 1 & 1 & 0 \end{bmatrix}$$

In-degree
- What makes a page important?
- Possible measure
  - In-degree: measure of the number of incoming links a node has ($1,1,3,1$)
- Does this tell the whole story?

Importance score

$$ \pi = \left[\pi_1,\, \pi_2,\, \pi_3,\, \pi_4 \right]$$

Importance score $\pi_i$ for node $i$
- $\pi$ is a row vector by convention
- Each node
  - has its own importance score
  - spreads an equal amount of its importance to each outgoing link
- Out-degree: the number of outgoing links from a node $\to (1,1,1,3)$
- Markov chain with transition probability matrix $H = D^{-1}A$, where $D$ is the diagonal matrix of out-degrees

$$ H = D^{-1}A = \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & \frac{1}{3} \end{bmatrix} \begin{bmatrix} 0 & 0 & 1 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\\ 1 & 1 & 1 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 & \frac{1}{1} & 0\\ 0 & 0 & \frac{1}{1} & 0\\ 0 & 0 & 0 & \frac{1}{1}\\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} & 0 \end{bmatrix}$$

Simultaneous equations
- We can write each node's score in terms of its incoming links
- Recursion: seemingly circular logic

$$ \begin{align*} \pi_1 &= \frac{\pi_4}{3}\\ \pi_2 &= \frac{\pi_4}{3}\\ \pi_3 &= \pi_1 + \pi_2 + \frac{\pi_4}{3}\\ \pi_4 &= \pi_3 \end{align*} $$

$$ H = \begin{bmatrix} 0 & 0 & \frac{1}{1} & 0\\ 0 & 0 & \frac{1}{1} & 0\\ 0 & 0 & 0 & \frac{1}{1}\\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} & 0 \end{bmatrix}$$

or, equivalently, an eigenvalue problem with eigenvalue 1 ($\pi$ is a left eigenvector of $H$, i.e. an ordinary eigenvector of $H^T$)

$$ \left[\pi_1,\, \pi_2,\, \pi_3,\, \pi_4 \right] \times \begin{bmatrix} 0 & 0 & \frac{1}{1} & 0\\ 0 & 0 & \frac{1}{1} & 0\\ 0 & 0 & 0 & \frac{1}{1}\\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} & 0 \end{bmatrix} = 1\left[\pi_1,\, \pi_2,\, \pi_3,\, \pi_4 \right] $$

import numpy as np

A = np.matrix([[0, 0, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 1, 1, 0]])

d = np.diag([1, 1, 1, 3])
d = np.asmatrix(d)

H = d.I*A

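# eigenvalue decomposition of H.T: its eigenvectors are the left eigenvectors of H

[D, V] = np.linalg.eig(H.T)

print(D) # check that one of the eigenvalues equals 1
idx = np.argmin(np.abs(D - 1))      # column belonging to the eigenvalue closest to 1
print(V[:,idx]/np.sum(V[:,idx]))    # normalize so the scores sum to 1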

[ 1.0000000e+00+0.j         -5.0000000e-01+0.64549722j
 -5.0000000e-01-0.64549722j  6.2307862e-35+0.j        ]
[[0.125-0.j]
 [0.125-0.j]
 [0.375-0.j]
 [0.375-0.j]]

# iterative

# random initialization
r = np.random.rand(1,4)
r = r/np.sum(r)
print(r)

[[0.30686635 0.39529553 0.10359321 0.1942449 ]]

for _ in range(100):
    r = r*H
    
print(r)

[[0.125 0.125 0.375 0.375]]

r = np.random.rand(1,4)
r = r/np.sum(r)

r_pre = 1/4*np.ones([1,4])

while np.linalg.norm(r_pre - r) > 1e-10:
    r_pre = r
    r = r*H
    
print(r)

[[0.125 0.125 0.375 0.375]]

Dangling nodes
- Does what we just presented always lead to a unique solution?
- Not quite yet!
- Dangling nodes
  - nodes that do not point to any other node
  - the importance they receive is never passed on, so the iteration generally has no meaningful solution
  - Solution: assume a dangling node points to every node in the graph ($\rightarrow$ random surfer)
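
To illustrate the problem, the sketch below iterates $r \leftarrow rH$ on a 4-node graph whose last node is dangling (the same $H$ that appears in the $H'$ example of this section). The uniform starting vector and the 50 iterations are arbitrary choices for illustration.

```python
import numpy as np

# Row 4 is all zeros: node 4 has no outgoing links (dangling)
H = np.array([[0,   0,   0,   1  ],
              [1/3, 0,   1/3, 1/3],
              [1/3, 1/3, 0,   1/3],
              [0,   0,   0,   0  ]])

r = np.ones(4) / 4
for _ in range(50):
    r = r @ H          # plain update, no normalization

print(r.sum())         # total importance left after 50 updates
```

Whatever reaches node 4 is dropped on the next update, so the total importance shrinks toward zero instead of settling on a ranking.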

The Random Surfer
- Surfing the web at random
- Two possible actions
  - pick a hyperlink randomly from the current page
  - enter a URL randomly
- Stochastic matrix $H'$

$$H' = H + \frac{1}{N}d \mathbf{1}^T \qquad$$

where $d$ is the indicator column vector of dangling (absorbing) nodes, $\mathbf{1}$ is a column vector of all ones, and $N$ is the number of nodes

$$ H' = \begin{bmatrix} 0 & 0 & 0 & 1\\ \frac{1}{3} & 0 & \frac{1}{3} & \frac{1}{3}\\ \frac{1}{3} & \frac{1}{3} & 0 & \frac{1}{3}\\ 0 & 0 & 0 & 0 \end{bmatrix} + \begin{bmatrix} 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \end{bmatrix} = H + \frac{1}{4} \begin{bmatrix} 0\\0\\0\\1\end{bmatrix} \begin{bmatrix} 1 & 1 & 1 &1 \end{bmatrix} $$
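
The rank-one correction can also be written without a loop. The following is a sketch under the assumption that `A[i, j] = 1` means node `i` links to node `j`; the names `out_deg` and `H_prime` are illustrative.

```python
import numpy as np

A = np.array([[0, 0, 0, 1],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 0, 0, 0]], dtype=float)
N = A.shape[0]

out_deg = A.sum(axis=1)
d = (out_deg == 0).astype(float)             # dangling-node indicator vector
safe_deg = np.where(out_deg == 0, 1, out_deg)  # avoid dividing zero rows by zero
H = A / safe_deg[:, None]                    # row-normalized; dangling rows stay zero
H_prime = H + np.outer(d, np.ones(N)) / N    # H' = H + (1/N) d 1^T

print(H_prime)
print(H_prime.sum(axis=1))                   # every row now sums to 1
```

Only the rows flagged by `d` change; all other rows of `H` are left as they are.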

A = np.matrix([[0, 0, 0, 1],
               [1, 0, 1, 1],
               [1, 1, 0, 1],
               [0, 0, 0, 0]])

s = np.sum(A, axis = 1)

H2 = np.zeros(A.shape)

for i in range(4):
    if s[i,0] != 0:
        H2[i,:] = A[i,:]/s[i,0]
    else:
        H2[i,:] = 1/4*np.ones([1,4])
    
H2 = np.asmatrix(H2)
print(H2)

[[0.         0.         0.         1.        ]
 [0.33333333 0.         0.33333333 0.33333333]
 [0.33333333 0.33333333 0.         0.33333333]
 [0.25       0.25       0.25       0.25      ]]

r = np.random.rand(1,4)
r = r/np.sum(r)

r_pre = 1/4*np.ones([1,4])

while np.linalg.norm(r_pre - r) > 1e-10:
    r_pre = r
    r = r*H2
    
print(r)

[[0.22222222 0.16666667 0.16666667 0.44444444]]

Disconnected graph
- Connected component: a group of nodes that can reach each other, but none outside the group
- More than one connected component
  - Infinitely many solutions
  - The surfer is stuck in one subgraph forever
  - Solution: add purely random surfing into the mix
    - you get bored, enter a URL randomly
- PageRank: what is the chance a page is selected at any time?
- Google matrix $G$

$$ G = \alpha \begin{bmatrix} \frac{1}{2} & \frac{1}{2} & 0 & 0\\ \frac{1}{2} & \frac{1}{2} & 0 & 0\\ 0 & 0 & \frac{1}{2} & \frac{1}{2}\\ 0 & 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix} + (1-\alpha) \frac{1}{4}\begin{bmatrix} 1 & 1 & 1 & 1\\ 1 & 1 & 1 & 1\\ 1 & 1 & 1 & 1\\ 1 & 1 & 1 & 1 \end{bmatrix} $$

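$$G = \alpha H' + (1-\alpha)\frac{1}{N}\mathbf{1}\mathbf{1}^T$$

where $\alpha \in (0,1)$ is the probability of following a hyperlink and $1-\alpha$ is the probability of entering a random URL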

A = np.matrix([[1, 1, 0, 0],
               [1, 1, 0, 0],
               [0, 0, 1, 1],
               [0, 0, 1, 1]])

s = np.sum(A, axis = 1)

H2 = np.zeros(A.shape)

for i in range(4):
    if s[i,0] != 0:
        H2[i,:] = A[i,:]/s[i,0]
    else:
        H2[i,:] = 1/4*np.ones([1,4])
    
H2 = np.asmatrix(H2)
print(H2, '\n')

alpha = 0.85
G = alpha*H2 + (1-alpha)*1/4*np.ones([4,4])
G = np.asmatrix(G)

print(G)

[[0.5 0.5 0.  0. ]
 [0.5 0.5 0.  0. ]
 [0.  0.  0.5 0.5]
 [0.  0.  0.5 0.5]] 

[[0.4625 0.4625 0.0375 0.0375]
 [0.4625 0.4625 0.0375 0.0375]
 [0.0375 0.0375 0.4625 0.4625]
 [0.0375 0.0375 0.4625 0.4625]]

r = np.random.rand(1,4)
r = r/np.sum(r)

r_pre = 1/4*np.ones([1,4])

while np.linalg.norm(r_pre - r) > 1e-10:
    r_pre = r
    r = r*G
    
print(r)

[[0.25 0.25 0.25 0.25]]
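
For contrast, the sketch below runs the same iteration from two different starting vectors, once with the block-diagonal matrix alone and once with $G$. The helper `iterate`, the starting vectors, and the step count are illustrative choices.

```python
import numpy as np

# Two disconnected components: nodes {1, 2} and nodes {3, 4}
H = np.array([[0.5, 0.5, 0,   0  ],
              [0.5, 0.5, 0,   0  ],
              [0,   0,   0.5, 0.5],
              [0,   0,   0.5, 0.5]])
alpha = 0.85
G = alpha * H + (1 - alpha) / 4 * np.ones((4, 4))

def iterate(r, M, steps=200):
    """Apply r <- r M a fixed number of times."""
    for _ in range(steps):
        r = r @ M
    return r

start_a = np.array([1.0, 0, 0, 0])   # all mass in the first component
start_b = np.array([0, 0, 0, 1.0])   # all mass in the second component

print(iterate(start_a, H), iterate(start_b, H))   # two different fixed points
print(iterate(start_a, G), iterate(start_b, G))   # the same vector both times
```

Without the random jump the result depends on where the surfer starts, which is the "infinitely many solutions" problem. With $G$ both runs agree.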

2.2. PageRank Computations

Eigenvalue problem (choose left eigenvector $\pi$ with $\lambda = 1$)

$$ \pi G = 1 \pi \implies \pi = \alpha \pi H' + (1-\alpha)\frac{1}{N}\mathbf{1}^T$$

where $\pi$ is the stationary distribution of the Markov chain (a row vector by convention), and the last term simplifies because $\pi \mathbf{1} = 1$
Power iterations:

$$ \pi \leftarrow \pi \left(\alpha H' + (1-\alpha)\frac{1}{N}\mathbf{1}\mathbf{1}^T \right) = \alpha \pi H' + (1-\alpha)\frac{1}{N}\mathbf{1}^T $$
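
The right-hand side of the update suggests that $G$ never has to be formed explicitly. A minimal sketch, assuming the $H'$ from the dangling-node example, $\alpha = 0.85$, and an arbitrary 200 iterations:

```python
import numpy as np

# H' from the dangling-node example (every row already sums to 1)
H_prime = np.array([[0,   0,   0,   1  ],
                    [1/3, 0,   1/3, 1/3],
                    [1/3, 1/3, 0,   1/3],
                    [1/4, 1/4, 1/4, 1/4]])
N = H_prime.shape[0]
alpha = 0.85

pi = np.ones(N) / N
for _ in range(200):
    # pi G without building G: the teleport term is the constant (1 - alpha) / N
    pi = alpha * (pi @ H_prime) + (1 - alpha) / N

# Cross-check against the dense Google matrix
G = alpha * H_prime + (1 - alpha) / N * np.ones((N, N))
print(pi)
print(np.allclose(pi, pi @ G))
```

The shortcut relies on $\pi$ summing to 1, which the update preserves when the starting vector is normalized.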

2.3. Summary

$$ \begin{align*} H & &\text{from webgraph}\\ H'& = H + \frac{1}{N}d \mathbf{1}^T &\text{to overcome dangling issue}\\ G &= \alpha H' + (1-\alpha)\frac{1}{N}\mathbf{1}\mathbf{1}^T= \alpha \left( H + \frac{1}{N}d \mathbf{1}^T\right) + (1-\alpha)\frac{1}{N}\mathbf{1}\mathbf{1}^T &\text{to overcome disconnected subgraph} \end{align*} $$

$$ \pi [k+1] = \pi [k] G$$
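
The three steps and the final iteration can be collected into one function. This is a sketch only: `pagerank` is a hypothetical helper written for these notes, it builds the dense $G$ (fine for small examples), and the defaults `alpha=0.85`, `tol`, and `max_iter` are assumptions.

```python
import numpy as np

def pagerank(A, alpha=0.85, tol=1e-10, max_iter=1000):
    """Hypothetical helper: PageRank scores for A, where A[i, j] = 1 if i links to j."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    out_deg = A.sum(axis=1)
    d = (out_deg == 0).astype(float)                       # dangling-node indicator
    H = A / np.where(out_deg == 0, 1, out_deg)[:, None]    # H from the web graph
    H_prime = H + np.outer(d, np.ones(N)) / N              # fix dangling nodes
    G = alpha * H_prime + (1 - alpha) / N * np.ones((N, N))  # add random jumps

    pi = np.ones(N) / N
    for _ in range(max_iter):
        pi_next = pi @ G                                   # pi[k+1] = pi[k] G
        if np.linalg.norm(pi_next - pi, 1) < tol:
            return pi_next
        pi = pi_next
    return pi

# Dangling-node example from Section 2.1
A = [[0, 0, 0, 1],
     [1, 0, 1, 1],
     [1, 1, 0, 1],
     [0, 0, 0, 0]]
print(pagerank(A))
```

With `alpha=1.0` the function reduces to iterating $H'$ alone, which for this graph recovers the scores printed in the dangling-node example.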
