A graph is a pair \(G = (V, E)\) where:
There exist multiple types of graphs:
Google's PageRank algorithm is a good example: the importance of each webpage is given by the number of links pointing to it, not by its content.
Another “fun” example is the way Wikipedia checks the reliability of its articles: real articles have more coherent citations than fake ones. Citations in a reliable article both cite back the main article and cite each other. Check out one of the most famous hoaxes, which tricked experts into awarding a grant to study a made-up language: Balboa Creole French
Degree \(k_i\): Number of nodes adjacent to node \(n_i\).
Average Undirected Degree \(\bar k = \frac{1}{N} \sum_{i=1}^N k_i = \frac{2 E}{N}\)
Directed Degree \(k_i^{in}, k_i^{out}\): Number of edges pointing into \(n_i\) and number of edges pointing out of \(n_i\), respectively, in a directed graph.
Average Directed Degree \(\bar k^{in} = \frac{1}{N} \sum_{i=1}^N k_i^{in} = \bar k^{out} = \frac{1}{N} \sum_{i=1}^N k_i^{out} = \frac{E}{N}\), since every edge contributes exactly one in-degree and one out-degree.
Degree distribution: Sometimes it is useful to plot a histogram of node degree frequencies; in many real-world networks this results in a power-law distribution.
Notice that, with the convention that \(A_{ij} = 1\) denotes an edge from \(n_i\) to \(n_j\), you can get the out- and in-degree of a node by summing the values of its row and its column of the adjacency matrix, respectively.
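A minimal sketch of this with plain Python lists as the adjacency matrix (assuming the convention that \(A_{ij} = 1\) denotes an edge from \(n_i\) to \(n_j\); the tiny 3-node graph is made up for illustration):

```python
# Convention assumed here: A[i][j] = 1 means an edge from node i to node j,
# so the out-degree of i is the sum of row i and its in-degree the sum of column i.
A = [
    [0, 1, 1],  # node 0 points to nodes 1 and 2
    [0, 0, 1],  # node 1 points to node 2
    [0, 0, 0],  # node 2 points to no one
]

out_degree = [sum(row) for row in A]       # row sums
in_degree = [sum(col) for col in zip(*A)]  # column sums

print(out_degree)  # [2, 1, 0]
print(in_degree)   # [0, 1, 2]
```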
Notice that if a graph is directed, distance is not symmetric.
Depending on our problem definition we can choose between different centrality measures:
Takes the degree of a node as a measure of its centrality. Since the raw degree depends heavily on the size of the graph, absolute values alone are not very informative; we more often use the Normalized Degree centrality:
\[c_D^\star (i) := \frac{c_D (i)}{n - 1}\]Still, it does not take the graph topology into account.
“How close a node is to all the other nodes of the network”. Again, working with relative values is more informative, so we use the Normalized Closeness centrality:
\[c_C (i) = \frac{n - 1}{\sum_j d_{ij}}\]It is only well-defined within a connected component, since the distance between disconnected nodes is infinite.
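As an illustrative sketch, on an unweighted graph the distances \(d_{ij}\) can be found with a single BFS from the node (plain Python, adjacency-list representation; the 3-node path graph below is made up for illustration):

```python
from collections import deque

def closeness(adj, i):
    """Normalized closeness centrality c_C(i) = (n - 1) / sum_j d_ij,
    with distances d_ij found by BFS (unweighted, undirected graph).
    adj is an adjacency list: adj[u] = list of neighbors of u."""
    n = len(adj)
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # Only defined when i reaches every other node (connected-component caveat).
    if len(dist) < n:
        raise ValueError("graph is not connected from node %d" % i)
    return (n - 1) / sum(dist.values())

# Path graph 0 - 1 - 2: the middle node is closest to all others.
adj = {0: [1], 1: [0, 2], 2: [1]}
print(closeness(adj, 1))  # 1.0  (distances 1 and 1)
print(closeness(adj, 0))  # 2/3  (distances 1 and 2)
```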
Fixes the disconnected-components problem of closeness centrality (it will not be \(0\) if two nodes are not connected).
Where:
“Given a node, how many pairs of nodes have a shortest path through it”. We also have the Normalized Betweenness centrality:
\[c_B (i) = \sum_{i \neq j \neq k} \frac{\sigma_{jk} (i)}{\binom{n-1}{2}}\]Sometimes it is more interesting to look at the number of shortest paths through edges rather than through nodes. This centrality is called Edge Betweenness centrality.
A simple clustering technique based on this metric is to iteratively remove the edge with the highest betweenness centrality, hierarchically splitting the graph into its clusters (this is known as the Girvan–Newman algorithm). Notice, though, that it is very expensive: we need to re-compute the edge betweenness centrality after each deletion.
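As a sketch of this idea, the snippet below computes edge betweenness with a Brandes-style BFS accumulation and picks out the edge such a divisive clustering step would remove first (plain Python; the toy graph, two triangles joined by a bridge, is made up for illustration):

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness for an unweighted undirected graph (Brandes-style).
    adj: dict node -> list of neighbors."""
    eb = {}
    for u in adj:
        for v in adj[u]:
            eb[tuple(sorted((u, v)))] = 0.0
    for s in adj:
        # BFS from s, tracking shortest-path counts (sigma) and predecessors.
        sigma = {s: 1.0}
        dist = {s: 0}
        pred = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] = sigma.get(v, 0.0) + sigma[u]
                    pred[v].append(u)
        # Accumulate path dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in pred[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                eb[tuple(sorted((v, w)))] += c
                delta[v] += c
    for e in eb:
        eb[e] /= 2.0  # each undirected pair was counted from both endpoints
    return eb

# Two triangles {0,1,2} and {3,4,5} joined by the bridge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
eb = edge_betweenness(adj)
print(max(eb, key=eb.get))  # (2, 3): the bridge is removed first
```

All nine cross-triangle shortest paths go through the bridge, so it has the highest betweenness and the first removal splits the graph into its two natural clusters.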
“Importance of a node depends on the importance of its neighbors”. If we store the importance of each node in a vector \(v\):
Algorithm:
- Assign an importance of 1 to each node: \(v=\textrm{ones}(N)\)
- \(v_i \leftarrow \frac{1}{\lambda} \sum_j A_{ij}v_j\) (\(v\) now holds the (normalized) degree of each node)
- \(v_i \leftarrow \frac{1}{\lambda} \sum_j A_{ij}v_j\) (\(v\) now holds the (normalized) sum of the degrees of each node’s neighbors)
- Repeat this operation until convergence
Where \(\lambda\) is some normalizing constant to ensure the iteration does not diverge.
Notice that by applying this algorithm we reach an “importance” vector which satisfies \(A v = \lambda v\). This algorithm is known as power iteration, and it is a way of finding the principal eigenvector: the eigenvector associated with the largest eigenvalue.
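The steps above can be sketched in a few lines of plain Python (the small adjacency matrix is made up for illustration; normalizing by the largest entry of the iterate plays the role of \(\frac{1}{\lambda}\)):

```python
# Power iteration for eigenvector centrality on a toy undirected graph.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

n = len(A)
v = [1.0] * n  # start with importance 1 for every node
for _ in range(100):
    # v <- A v
    w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    # normalize by the largest entry so the iteration does not diverge
    lam = max(w)
    v = [x / lam for x in w]

# v has converged to the principal eigenvector (A v = lam v),
# and lam approximates the largest eigenvalue.
print([round(x, 3) for x in v])
```

Node 2 has the most (and the best-connected) neighbors, so it ends up with the largest entry of \(v\).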
This centrality is also known as Eigenvector centrality.
Similar centrality measures are Katz centrality
To compare centrality orderings we can use the Kendall rank correlation coefficient. Another common metric is the MSE between the ordered indexes.
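As a minimal sketch, the Kendall rank correlation coefficient (in its simplest tau-a form, ignoring ties) counts concordant versus discordant pairs between two score lists (the toy rankings below are made up for illustration):

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall rank correlation between two equal-length score lists:
    (concordant - discordant pairs) / total pairs (tau-a, no tie correction)."""
    assert len(a) == len(b)
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1   # the pair is ordered the same way in both lists
        elif s < 0:
            discordant += 1   # the pair is ordered oppositely
    total = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / total

# Comparing two centrality scorings of the same four nodes:
print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0  (identical order)
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0 (reversed order)
```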
Where \(e(i)\) denotes the number of links between the neighbors of node \(n_i\): “How many of your friends are friends with each other?”, or: “In how many triangles do you participate?”.
We can use this metric to compare our graph's clustering to that of a random graph with connection probability \(p = \frac{E}{\binom{N}{2}}\).
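A small sketch of both quantities, assuming the usual local clustering coefficient \(C_i = \frac{2 e(i)}{k_i (k_i - 1)}\) (the toy graph, a triangle with a pendant node, is made up for illustration):

```python
def local_clustering(adj, i):
    """Local clustering coefficient C_i = 2 e(i) / (k_i (k_i - 1)),
    where e(i) counts edges among the neighbors of node i.
    adj is an adjacency list: adj[u] = list of neighbors of u."""
    neigh = adj[i]
    k = len(neigh)
    if k < 2:
        return 0.0
    # Count each neighbor pair once (a < b) and check whether it is linked.
    e = sum(1 for a in neigh for b in neigh if a < b and b in adj[a])
    return 2 * e / (k * (k - 1))

# Triangle 0-1-2 plus a pendant node 3 attached to node 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(local_clustering(adj, 0))  # 1.0: both neighbors of 0 are connected
print(local_clustering(adj, 2))  # 1/3: only one of three neighbor pairs linked

# Connection probability of a comparable Erdos-Renyi random graph:
N, E = 4, 4
p = E / (N * (N - 1) / 2)
print(p)  # 2/3
```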