Google PageRank
Degree of Undirected Graph
The maximum possible number of edges in a simple undirected graph with $n$ vertices is $\binom{n}{2} = \frac{n(n-1)}{2}$
The density $\rho$ of a graph is the fraction of these edges that are actually present, i.e. $\rho = m / \binom{n}{2}$ for a graph with $m$ edges
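As a small illustration of the two quantities above, the sketch below computes the degrees and the density of a hypothetical 4-vertex undirected graph. The adjacency matrix and the variable names (`degree`, `m`, `max_edges`, `rho`) are assumptions made for this example only.

```python
import numpy as np

# Hypothetical undirected graph: symmetric adjacency matrix, no self-loops
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

n = A.shape[0]
degree = A.sum(axis=1)          # degree of each vertex
m = int(A.sum() // 2)           # every edge appears twice in a symmetric matrix
max_edges = n * (n - 1) // 2    # n choose 2
rho = m / max_edges             # density: fraction of possible edges present

print(degree, m, max_edges, rho)
```

Here 4 of the 6 possible edges are present, so the density is 2/3.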
Degree of Directed Graph
Eigenvector centrality
A vertex gains importance by having connections to other vertices that are themselves important
Its centrality is the sum of the centralities (ranks, scores) of its incoming neighbors
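One way to make this self-referential definition concrete is to repeatedly replace each score by the sum of the scores of the incoming neighbors and rescale. The sketch below does this on an illustrative four-node directed graph. The graph, the iteration count of 200, and the names `x` and `lam` are assumptions for this example, and the rescaling step is what turns "sum" into "proportional to the sum".

```python
import numpy as np

# Illustrative directed graph: A[i, j] = 1 if node i links to node j
A = np.array([[0, 0, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 1, 1, 0]], dtype=float)

x = np.ones(4) / 4              # start from equal scores
for _ in range(200):
    x_new = A.T @ x             # x_new[j] = sum of scores of nodes linking to j
    x = x_new / x_new.sum()     # rescale so the scores sum to 1

lam = (A.T @ x).sum()           # growth factor, an estimate of the leading eigenvalue
print(x, lam)
```

After convergence `A.T @ x` is approximately `lam * x`, so `x` is an eigenvector of the transposed adjacency matrix, which is where the name eigenvector centrality comes from.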
Consider the World Wide Web as a graph (or network) whose nodes are pages and whose edges are hyperlinks
Question: which page is more important?
For Google, which page among google search results should be placed on top?
Read Wikipedia
The original PageRank paper by Google’s founders Sergey Brin and Lawrence Page
Importance score, $\pi_i$ for node $i$
row vector by convention
Out-degree: the number of outgoing links from a node $\to (1,1,1,3)$
Markov chain with transition probability matrix $H$
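The Markov chain view can be checked by simulation: a random surfer who follows a uniformly chosen outgoing link at every step should visit each page with a frequency close to its importance score. The sketch below assumes the four-page link structure with out-degrees $(1,1,1,3)$; the seed, the walk length `n_steps`, and the names `visits` and `freq` are arbitrary choices for this illustration.

```python
import numpy as np

# Transition matrix for the four-page example (out-degrees 1, 1, 1, 3)
H = np.array([[0,   0,   1,   0],
              [0,   0,   1,   0],
              [0,   0,   0,   1],
              [1/3, 1/3, 1/3, 0]])

rng = np.random.default_rng(0)   # fixed seed, an arbitrary choice
n_steps = 50_000                 # hypothetical walk length
visits = np.zeros(4)
state = 0
for _ in range(n_steps):
    state = rng.choice(4, p=H[state])   # follow a random outgoing link
    visits[state] += 1

freq = visits / n_steps          # empirical visit frequencies
print(freq)
```

For this chain the stationary distribution is $(1/8, 1/8, 3/8, 3/8)$, and the empirical frequencies should land near it, up to sampling noise.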
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
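# adjacency matrix: A[i, j] = 1 if page i links to page j
A = np.matrix([[0, 0, 1, 0],
               [0, 0, 1, 0],
               [0, 0, 0, 1],
               [1, 1, 1, 0]])

# diagonal matrix of out-degrees
d = np.diag([1, 1, 1, 3])
d = np.asmatrix(d)

# row-stochastic transition matrix: each row of A divided by its out-degree
H = d.I*A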
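# eigenvalue approach: a left eigenvector of H is a right eigenvector of H.T
[D, V] = np.linalg.eig(H.T)
print(D)   # check that one of the eigenvalues equals 1

k = np.argmin(np.abs(D - 1))   # eigenvalue 1 is not guaranteed to come first
v = np.real(V[:, k])
print(v/np.sum(v))             # normalize so the scores sum to 1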
# iterative (power iteration)
# random initialization, normalized to sum to 1
r = np.random.rand(1,4)
r = r/np.sum(r)
print(r)

for _ in range(100):
    r = r*H

print(r)
# iterate until the change between two steps falls below a tolerance
r = np.random.rand(1,4)
r = r/np.sum(r)
r_pre = 1/4*np.ones([1,4])

while np.linalg.norm(r_pre - r) > 1e-10:
    r_pre = r
    r = r*H

print(r)
$$ H' = H + \frac{1}{N}\, d\, \mathbb{1}^T $$
where $d$ is the indicator column vector of dangling (absorbing) nodes, i.e. nodes with no outgoing links, and $\mathbb{1}$ is the column vector of ones
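The dangling-node correction amounts to replacing every all-zero row of $H$ by the uniform row $\frac{1}{N}\mathbb{1}^T$. A vectorized sketch of that step is given below, using plain NumPy arrays rather than `np.matrix`. The names `out_deg` and `H_fixed` are assumptions for this illustration, and the adjacency matrix is the one with a dangling fourth node.

```python
import numpy as np

A = np.array([[0, 0, 0, 1],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 0, 0, 0]], dtype=float)   # last node has no outgoing links

N = A.shape[0]
out_deg = A.sum(axis=1)
d = (out_deg == 0).astype(float)            # dangling indicator: 1 where out-degree is 0

# divide each row by its out-degree, leaving all-zero rows untouched
H = np.divide(A, out_deg[:, None], out=np.zeros_like(A),
              where=out_deg[:, None] != 0)

# add a uniform row 1/N for every dangling node
H_fixed = H + np.outer(d, np.ones(N)) / N
print(H_fixed)
```

Every row of `H_fixed` sums to 1, so it is a valid transition matrix even though the original graph has a page without outgoing links.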
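# the fourth node is dangling: it has no outgoing links
A = np.matrix([[0, 0, 0, 1],
               [1, 0, 1, 1],
               [1, 1, 0, 1],
               [0, 0, 0, 0]])

s = np.sum(A, axis = 1)   # out-degree of each node
H2 = np.zeros(A.shape)

for i in range(4):
    if s[i,0] != 0:
        H2[i,:] = A[i,:]/s[i,0]         # normalize by out-degree
    else:
        H2[i,:] = 1/4*np.ones([1,4])    # dangling node: jump to any page uniformly

H2 = np.asmatrix(H2)
print(H2)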
r = np.random.rand(1,4)
r = r/np.sum(r)
r_pre = 1/4*np.ones([1,4])

while np.linalg.norm(r_pre - r) > 1e-10:
    r_pre = r
    r = r*H2

print(r)
Disconnected graph
More than one connected component
PageRank: what is the chance a page is selected at any time?
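Google matrix $G = \alpha H' + (1-\alpha)\frac{1}{N}\mathbb{1}\mathbb{1}^T$, where $\alpha$ is the damping factor (0.85 in the code below) and $H'$ is the transition matrix with dangling rows made uniform (`H2` in the code)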
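# two disconnected components: nodes {1, 2} and nodes {3, 4}
A = np.matrix([[1, 1, 0, 0],
               [1, 1, 0, 0],
               [0, 0, 1, 1],
               [0, 0, 1, 1]])

s = np.sum(A, axis = 1)   # out-degree of each node
H2 = np.zeros(A.shape)

for i in range(4):
    if s[i,0] != 0:
        H2[i,:] = A[i,:]/s[i,0]
    else:
        H2[i,:] = 1/4*np.ones([1,4])

H2 = np.asmatrix(H2)
print(H2, '\n')

# with probability alpha follow a link, otherwise jump to a uniformly random page
alpha = 0.85
G = alpha*H2 + (1-alpha)*1/4*np.ones([4,4])
G = np.asmatrix(G)
print(G)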
r = np.random.rand(1,4)
r = r/np.sum(r)
r_pre = 1/4*np.ones([1,4])

while np.linalg.norm(r_pre - r) > 1e-10:
    r_pre = r
    r = r*G

print(r)
Eigenvalue problem (choose the left eigenvector $\pi$ with $\lambda = 1$)
$$ \pi G = 1 \cdot \pi \implies \pi = \alpha \pi H' + (1-\alpha)\frac{1}{N}\mathbb{1}^T$$
where $\pi$ is the stationary distribution of the Markov chain, written as a row vector by convention; the second form uses $\pi \mathbb{1} = 1$
Power iterations:
$$ \pi [k+1] = \pi [k] G$$
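The power iteration above can be wrapped into a small function. The sketch below is one possible arrangement, not a reference implementation: the function name `pagerank`, its parameters `tol` and `max_iter`, and the use of dense NumPy arrays are all assumptions made for this illustration, and it starts from the uniform vector rather than a random one.

```python
import numpy as np

def pagerank(A, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration pi[k+1] = pi[k] G on a dense adjacency matrix (hypothetical helper)."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    out_deg = A.sum(axis=1, keepdims=True)
    safe_deg = np.where(out_deg > 0, out_deg, 1.0)        # avoid division by zero
    # row-normalize; dangling rows (out-degree 0) become uniform 1/N
    H = np.where(out_deg > 0, A / safe_deg, 1.0 / N)
    G = alpha * H + (1 - alpha) / N * np.ones((N, N))     # Google matrix
    pi = np.ones(N) / N                                   # uniform start
    for _ in range(max_iter):
        pi_next = pi @ G
        if np.linalg.norm(pi_next - pi, 1) < tol:
            return pi_next, G
        pi = pi_next
    return pi, G

# graph with a dangling fourth node
A = np.array([[0, 0, 0, 1],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 0, 0, 0]])
pi, G = pagerank(A)
print(pi)
print(np.allclose(pi @ G, pi))   # pi should be a left eigenvector of G for eigenvalue 1
```

Because every entry of $G$ is positive for $\alpha < 1$, the iteration converges from any starting distribution, including on disconnected graphs.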