Graph Basics

Content from: mmds.org
Post author: Oleguer Canal

A graph is a pair \(G = (V, E)\) where:

  • \(V\) is a set of objects called nodes (or vertices). \(N = \vert V \vert\)
  • \(E\) is a set of edges (pairs of nodes).

There exist multiple types of graphs:

  • Undirected/Directed
  • Unweighted/Weighted
  • Bi/Tri/…partite
  • Dense/Sparse

Google’s PageRank algorithm is a good example: the importance of each webpage is derived from the links pointing to it, not from its content.

Another “fun” example is the way Wikipedia checks the reliability of its articles: real articles have more coherent citations than fake ones. The references of a reliable article tend both to cite the article back and to cite each other. Check out one of the most famous hoaxes, which tricked experts into awarding a grant to study a made-up language: Balboa Creole French

Basic definitions

  • Degree \(k_i\): Number of nodes adjacent to node \(n_i\).

  • Average Undirected Degree \(\bar k = \frac{1}{N} \sum_{i=1}^N k_i = \frac{2 E}{N}\)

  • Directed Degree \(k_i^{in}, k_i^{out}\): Edges pointing into \(n_i\), Edges pointing out of \(n_i\) in a directed graph.

  • Average Directed Degree \(\bar k = \frac{1}{N} \sum_{i=1}^N k_i^{in} = \frac{1}{N} \sum_{i=1}^N k_i^{out} = \frac{E}{N}\)

    • Source: Node such that \(k^{in} = 0\)
    • Sink: Node such that \(k^{out} = 0\)
  • Degree distribution: Sometimes it is useful to plot a histogram of node degree frequencies, which in real networks often follows a power-law distribution.

  • Adjacency Matrix: Matrix representing the connections between nodes: \(A_{ij} = 1_{(n_i, n_j) \in E}\).

Paths

  • Path: Sequence of nodes such that each consecutive pair is in edges.

Notice you can get the out- and in-degree of a node by summing, respectively, the row and the column of the adjacency matrix associated with it.
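This observation can be sketched in a few lines of Python (a minimal toy example; the graph and labels are made up):

```python
# Toy directed graph: edges 0->1, 0->2, 1->2, 2->0 (labels made up).
N = 3
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]

# Adjacency matrix: A[i][j] = 1 iff (n_i, n_j) is an edge.
A = [[0] * N for _ in range(N)]
for i, j in edges:
    A[i][j] = 1

# Row i sums to the out-degree of n_i; column j sums to the in-degree of n_j.
k_out = [sum(row) for row in A]
k_in = [sum(A[i][j] for i in range(N)) for j in range(N)]

print(k_out)  # [2, 1, 1]
print(k_in)   # [1, 1, 2]
```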

  • Distance \(d_{ij}\): Shortest path length between two nodes. If two nodes are not connected the distance is usually taken as infinite.

Notice that if a graph is directed, distance is not symmetric.

  • Graph Diameter: Can be measured in different ways:
    • Longest distance (i.e. the largest shortest-path length). Outliers can be an issue.
    • Average path length (i.e. the average distance over all pairs of nodes). Disconnected nodes can be an issue, so we usually compute the average path length within a connected component as \(\hat h = \frac{\sum_{i \neq j} h_{ij}}{2 E_{max}}\), where \(h_{ij}\) is the distance from node \(n_i\) to \(n_j\) and \(E_{max} = \binom{n}{2}\), so that \(2 E_{max} = n(n-1)\) is the number of ordered pairs. (This way we discard the infinite values.)
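Both measures can be sketched on a toy path graph, using breadth-first search for the distances (an illustrative example, not a library implementation):

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source; unreachable -> inf."""
    dist = {v: float("inf") for v in adj}
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] == float("inf"):
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Undirected path graph 0 - 1 - 2 - 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

# Diameter (longest distance) and average path length over ordered pairs.
pairs = [(i, j) for i in adj for j in adj if i != j]
dists = {i: bfs_distances(adj, i) for i in adj}
diameter = max(dists[i][j] for i, j in pairs)
h_avg = sum(dists[i][j] for i, j in pairs) / len(pairs)
print(diameter, h_avg)
```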

Centrality Measures

Depending on our problem definition we can choose between different centrality measures:

Degree centrality

\[c_D (i) := k_i\]

Takes the degree of a node as a measure of its centrality. It depends heavily on the size of the graph, so this number alone is not enough. Since absolute values are not very informative, we more often use the Normalized Degree centrality:

\[c_D^\star (i) := \frac{c_D (i)}{n - 1}\]

Still does not consider graph topology.
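As a quick illustration (toy star graph, labels made up), the hub of a star attains the maximum normalized degree centrality of \(1\):

```python
# Toy star graph: hub 0 connected to leaves 1, 2, 3 (undirected).
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
n = len(adj)

c_D = {v: len(neigh) for v, neigh in adj.items()}    # degree centrality
c_D_star = {v: k / (n - 1) for v, k in c_D.items()}  # normalized to [0, 1]

print(c_D_star[0])  # 1.0 -- the hub is adjacent to every other node
```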

Closeness centrality

\[c_C (i) = \frac{1}{\sum_j d(i, j)}\]

“How close a node is to all the others in the network”. Again, relative values are more informative, so we use the Normalized Closeness centrality:

\[c_C^\star (i) = \frac{n - 1}{\sum_j d_{ij}}\]

It only works within a connected component: any unreachable node makes the sum infinite.
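A small sketch on a path graph, computing the distances with BFS (illustrative only; the graph is assumed connected):

```python
from collections import deque

def distances(adj, s):
    """BFS shortest-path lengths from s (graph assumed connected here)."""
    d = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in d:
                d[v] = d[u] + 1
                q.append(v)
    return d

# Path graph 0 - 1 - 2: the middle node is closest to everyone.
adj = {0: [1], 1: [0, 2], 2: [1]}
n = len(adj)

# Normalized closeness centrality: (n - 1) / sum of distances to the rest.
c_C = {i: (n - 1) / sum(d for j, d in distances(adj, i).items() if j != i)
       for i in adj}
print(c_C)  # {0: 0.6666666666666666, 1: 1.0, 2: 0.6666666666666666}
```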

Harmonic centrality

\[c_H (i) = \sum_j \frac{1}{d_{ij}}\]

Fixes the disconnected-components problem of closeness centrality: if two nodes are not connected, their term contributes \(\frac{1}{\infty} = 0\) instead of breaking the whole sum.
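A sketch showing the fix on a graph with two components; conveniently, `1 / float("inf")` evaluates to `0.0` in Python, so unreachable nodes simply contribute nothing:

```python
from collections import deque

def distances(adj, s):
    """BFS shortest-path lengths from s; unreachable nodes stay at inf."""
    d = {v: float("inf") for v in adj}
    d[s] = 0
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if d[v] == float("inf"):
                d[v] = d[u] + 1
                q.append(v)
    return d

# Two components: the edge 0 - 1, plus an isolated node 2.
adj = {0: [1], 1: [0], 2: []}

# 1 / inf == 0.0, so disconnected pairs add zero to the sum.
c_H = {i: sum(1.0 / d for j, d in distances(adj, i).items() if j != i)
       for i in adj}
print(c_H)  # {0: 1.0, 1: 1.0, 2: 0.0}
```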

Betweenness centrality

\[c_B (i) = \sum_{j \neq i \neq k} \frac{\sigma_{jk} (i)}{\sigma_{jk}}\]

Where:

  • \(\sigma_{jk}\) is the number of shortest paths from \(j\) to \(k\).
  • \(\sigma_{jk} (i)\) is the number of shortest paths from \(j\) to \(k\) passing through \(i\).

“Given a node, how many pairs of nodes have a shortest path through it”. We also have the Normalized Betweenness centrality:

\[c_B^\star (i) = \frac{1}{\binom{n-1}{2}} \sum_{j \neq i \neq k} \frac{\sigma_{jk} (i)}{\sigma_{jk}}\]

Sometimes it is more interesting to look at the number of paths through edges and not through nodes. This centrality is called Edge Betweenness centrality.

A simple clustering technique using this metric is to iteratively remove the edge with the highest betweenness centrality, hierarchically splitting the graph into clusters. Notice it is very expensive though: we need to re-compute the edge betweenness centrality after each deletion.
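This removal procedure is essentially the Girvan–Newman method. A brute-force sketch of edge betweenness on a toy graph follows (fine only for tiny graphs; real implementations use Brandes’ algorithm):

```python
from collections import deque
from itertools import combinations

def all_shortest_paths(adj, s, t):
    """All shortest paths s -> t: BFS distances, then backtracking."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    paths = []
    def extend(path):
        u = path[-1]
        if u == t:
            paths.append(path)
            return
        for v in adj[u]:
            if dist.get(v) == dist[u] + 1:  # only follow shortest-path edges
                extend(path + [v])
    if t in dist:
        extend([s])
    return paths

def edge_betweenness(adj):
    """Each pair (s, t) spreads one unit of credit over its shortest paths."""
    bc = {}
    for s, t in combinations(adj, 2):
        paths = all_shortest_paths(adj, s, t)
        for p in paths:
            for u, v in zip(p, p[1:]):
                e = tuple(sorted((u, v)))
                bc[e] = bc.get(e, 0.0) + 1.0 / len(paths)
    return bc

# Two triangles joined by the bridge edge (2, 3).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
bc = edge_betweenness(adj)
bridge = max(bc, key=bc.get)
print(bridge)  # (2, 3) -- the bridge carries all cross-triangle paths
```

Removing `(2, 3)` would split the graph into its two obvious clusters, which is exactly the behavior the hierarchical procedure exploits.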

Importance centrality

“Importance of a node depends on the importance of its neighbors”. If storing the importance of each node in \(v\):

Algorithm:

  1. Assign an importance of 1 to each node: \(v=\textrm{ones}(N)\)
  2. \(v_i \leftarrow \frac{1}{\lambda} \sum_j A_{ij} v_j\) (\(v\) now holds the (normalized) degree of each node)
  3. \(v_i \leftarrow \frac{1}{\lambda} \sum_j A_{ij} v_j\) (\(v\) now holds the (normalized) sum of the degrees of each node’s neighbors)
  4. Repeat this operation until convergence

Where \(\lambda\) is some normalizing constant to ensure the iteration does not diverge.

Notice that by applying this algorithm we reach an “importance” vector which satisfies \(A v = \lambda v\). This algorithm is known as power iteration and is a way of finding the principal eigenvector: the eigenvector associated with the largest eigenvalue.

This centrality is also known as Eigenvector centrality.
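The iteration can be sketched in plain Python (toy adjacency matrix; in practice one would use a linear-algebra library):

```python
# Undirected toy graph; node 2 has the highest degree.
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
N = len(A)

v = [1.0] * N                             # step 1: all-ones importance
for _ in range(100):
    w = [sum(A[i][j] * v[j] for j in range(N)) for i in range(N)]
    lam = max(abs(x) for x in w)          # running estimate of lambda
    v = [x / lam for x in w]              # normalize so we do not diverge

# v now approximates the principal eigenvector (eigenvector centrality);
# the most central node is 2, the one with the most neighbors.
print(v.index(max(v)))  # 2
```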

A similar centrality measure is Katz centrality.

To compare centrality orderings we can use the Kendall rank correlation coefficient. Another common metric is the MSE between the ordered indexes.

Clustering coefficient

Local clustering coefficient

\[c (i) = \frac{e(i)}{\binom{k_i}{2}}\]

Where \(e(i)\) denotes the number of links between the neighbors of node \(n_i\). “How many of your friends are friends with each other”, or: “In how many triangles do you participate”.

We can use this metric to see how our graph’s clustering compares to that of a random graph with edge probability \(p = \frac{E}{\binom{N}{2}}\).
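A sketch of the local clustering coefficient on a toy graph (a triangle with one pendant node):

```python
from itertools import combinations

def local_clustering(adj, i):
    """e(i) / C(k_i, 2): fraction of neighbour pairs that are linked."""
    neigh = adj[i]
    k = len(neigh)
    if k < 2:
        return 0.0
    e_i = sum(1 for u, v in combinations(neigh, 2) if v in adj[u])
    return e_i / (k * (k - 1) / 2)

# Triangle 0 - 1 - 2 with a pendant node 3 hanging off node 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

print(local_clustering(adj, 0))  # 1.0  (its two neighbours are linked)
print(local_clustering(adj, 2))  # 1/3 of its 3 neighbour pairs are linked
```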