Graph Data

— Alex Reinhart and Christopher Genovese

A graph is a structure that represents the relationships among a set of entities. It consists of

  • A set of nodes (also called vertices) representing the entities.
  • A set of edges each connecting two nodes.

Two nodes connected by an edge are called neighbors, or adjacent.

Graphs are enormously important in statistics, in computing, and more generally. Many things can be treated as graphs:

  1. Social networks, where nodes are people and edges are relationships between them
  2. Statistical models, where nodes are variables and edges represent dependence, causality, or correlation
  3. Road maps, where nodes are intersections and buildings and edges represent connections between them via road
  4. Electric circuits, where nodes are circuit nodes and edges the resistors, inductors, and capacitors connecting them
  5. The Internet is a bunch of nodes (routers and individual computers) connected by Ethernet, WiFi, ISPs, undersea cables…

Treating something complicated, like a social network, as a graph lets us apply a range of powerful tools to work with it. By learning generally about graphs, we learn tools that can be used for many different kinds of problems.

Rather than remembering the details of various graph algorithms, it is useful to think about common graph problems and how they apply to your situation. Finding good algorithms then follows fairly easily.

Common graph problems:

  • Finding connected components
  • Finding shortest paths
  • Matching and coloring
  • Transitive closure
  • Finding a minimal spanning tree
  • Traversing the graph
  • Topological sorting
  • Finding separating edges and nodes
  • Finding cycles and “tours”
  • Maximizing flows through a network
  • Calculating useful embeddings.

But first, let’s learn about graphs.

Flavors of Graphs #

Mathematically, a graph \(G = (N, E)\), where \(N\) is a set of nodes and \(E\) is a set containing 2-sets of the form \(\{n_1, n_2\}\), that represents an edge between nodes \(n_1\) and \(n_2\). (They are 2-sets because they are not ordered: \(\{n_2, n_1\}\) represents the same edge as \(\{n_1, n_2\}\).)

A graph is connected if one can move between any two nodes by a series of steps between adjacent nodes. That is, any node is a neighbor of a neighbor of a neighbor of a… of any other node.

Graphs are often shown, er, graphically with the nodes as circles (or other shapes) and the edges as lines or curves linking the nodes.


A graph can be basic, like a single node with no edges, or complex, like:


Properties (especially Labels and Weights) #

While the nodes and edges define the relationships, we often want to encode more information. To that end, we can associate properties with the nodes and the edges. These properties can be arbitrary data (or meta-data), but most common are labels and weights, which are strings and numbers, respectively, associated with nodes and/or edges.

Here is a graph with labeled nodes:


Here is a graph with labeled nodes and weighted edges:


Weights might be used to represent many things: the strength of correlation between variables, the distance between two points on the road network, or the number of emails sent between two people, among many other potential applications.

Directed versus Undirected Graphs #

The graphs displayed above are undirected: the edges \(\{n_1, n_2\}\) represent symmetric relationships between the nodes. For example, Kai is friends with Nathan if and only if Nathan is friends with Kai.

In contrast, a directed graph, or digraph, has edges with direction that represent asymmetric relationships. For example, Joe might “like” Grace without Grace liking Joe.

Mathematically, the edges in a digraph are not sets but tuples, specifically ordered pairs \((n_1, n_2)\) representing an edge from node \(n_1\) to node \(n_2\).

Graphically, we display directed edges as arrows:


Simple versus Non-Simple #

A simple graph has

  • no edges from a node to itself,
  • at most one edge between any pair of nodes;

otherwise, the graph is non-simple. The graphs above were all simple; here’s a non-simple example.


Most graphs we work with are simple graphs.

Sparse versus Dense #

A graph is sparse if only a small fraction of the possible edges are present; otherwise it is dense. Which of the examples above are sparse? Which are dense?

Later we’ll see that the difference between a sparse and a dense graph has implications for how we choose to represent a graph in our programs.

Consider a large social network, like Facebook, where edges represent “friendship” between two members. Is such a graph sparse or dense?

Cyclic versus Acyclic #

A cycle in a digraph is a sequence of adjacent nodes (a path, respecting direction) that begin and end with the same node. (There are several different flavors of cycle.)

An acyclic graph has no cycles (else it is cyclic). A directed, acyclic graph is called a DAG.

One of the most important types of acyclic graph is a tree.

A graph G is a tree if any of the following (equivalent) properties hold:

  • G is connected, acyclic graph.
  • G is connected but deleting any edge makes it disconnected.
  • Any two distinct nodes in G are connected by exactly one simple path (a path containing distinct nodes only).
  • A finite G with \(n\) nodes is a tree if it is acyclic and has \(n - 1\) edges.
  • A finite G with \(n\) nodes is a tree if it is connected and has \(n - 1\) edges.

There are many varieties of tree. Here’s a binary tree, for example, where each node has only two children:


Representing Graphs #

There are several common data structures used to represent graphs. For a graph \(G\), let \(n = #N\) and \(e = #E\) denote the total number of nodes and edges.

When we talked about data structures, we made a distinction between the abstract operations a data structure supports and the actual implementation, in code and in memory, of the data structure. The same applies here: a graph is an abstract idea, and there are a few operations we want it to support:

  • Check if there is an edge between two nodes
  • Add and remove nodes
  • Add and remove edges between nodes
  • Traverse the graph

Let’s discuss a few common implementations, which have different tradeoffs in computational complexity and storage costs.

Adjacency Matrix #

An \(n \times n\) matrix where an entry at row \(i\) and column \(j\) indicates whether there is an edge connecting \(i\) and \(j\). In digraphs, rows represent source nodes and columns represent destination nodes. In undirected graphs, adjacency matrices are symmetric.

Adjacency matrices make it easy to look up whether an edge exists. They are useful for dense, simple graphs. Properties must be stored elsewhere.

A 0 1 1 0 0 0 0
B 1 0 0 1 1 0 0
C 1 0 0 0 0 1 1
D 0 1 0 0 0 0 0
E 0 1 0 0 0 0 0
F 0 0 1 0 0 0 0
G 0 0 1 0 0 0 0

In graphs with weighted edges, the adjacency matrix often stores the weights on each edge instead of 1. Instead of 0 for no-edge, it is sometimes convenient to use \(\infty\).

Adjacency List #

For each node, we store a list of the node’s neighbors:

Node Neighbors

We can allow self links and multi-links (via repeats in the list). We can also store data (or a pointer to it) in the list.

Incidence Matrix #

An \(n \times e\) boolean matrix, where rows represent nodes and columns represent edges, and where the $i,j$th element being true means that node \(i\) is “incident on” edge \(j\).

Operations and Performance #

Operation Adjacency Matrix Adjacency List Incidence Matrix Winner
u, v adjacent? O(1) O(n) O(e) Adjacency Matrix
degree(u) O(n) O(neighbors) O(e) Adjacency List
insert node O(n^2) O(1) O(n e) Adjacency List
insert edge O(1) O(neighbors) O(n e) Adjacency List
delete node O(n^2) O(e) O(n e) depends
delete edge O(1) O(e) O(n e) Adjacency Matrix
memory (small G) O(n^2) O(n + e) O(n e) Adjacency List
memory (big G) O(n^2) O(n + e) O(n e) Adjacency Matrix
traversal Theta(n + e) Theta(n^2) Adjacency List

Adjacency lists are a good default, but pay attention to the problem at hand.

Group Exercise #

Design two data structures/classes that represent an undirected graph in the Adjacency Matrix and List representation, respectively.

You can assume the nodes are identified by numbers from 1..n or 0..(n-1), whichever is more convenient.

Write functions to translate between the two representations.

(For later) Choose one of the operations above in the table and implement it.