Graph Traversal

— Alex Reinhart and Christopher Genovese

To traverse a graph is to visit every node and/or edge systematically.

This seems boring, but it’s actually an important part of many things we want to do with graphs: finding connected components, finding paths between nodes, calculating graph statistics, and much more. Even “finding paths between nodes” is useful for an incredible number of problems, from Google Maps to internet routing, and even tasks as plainly statistical as kernel density estimation can be phrased in terms of traversals of graphs (or trees).

There are several strategies for traversing a graph, each with different useful properties.

Start with an undirected, simple graph.

Q: If I showed you a large, undirected, simple graph and asked you to count the nodes exactly, how would you do it?

Strategies #

A traversal strategy produces a sequence of nodes and edges, with each node and edge listed exactly once. There are two, very commonly used strategies and one general strategy:

Breadth-First Search (BFS)
visit all neighbors of the current node before visiting any of their neighbors.
Depth-First Search (DFS)
visit all neighbors of the next visited node before visiting the other neighbors of the current node.
Priority Search
visit nodes in priority order, adding neighbors at each stage

These strategies differ in how the sequence of nodes is handled.

Aside: Stacks, Queues, and Priority Queues #

A brief refresher from our discussion of data structures.

Stack #

A stack is a data structure for processing objects in a Last In-First Out manner. The most recent object added is said to be at the “top” of the stack.

Stacks support two primary (and one optional) operations:

push(obj)
add object to the top of the stack
pop()
remove the object from the top of the stack and return it
peek()
return the object from the top of the stack without removing it

It is an error to pop an empty stack.

Queue #

A queue is a data structure for processing objects in a First In-First Out manner. The most recent object added is said to be at the “back” of the queue; the next object to be processed is said to be at the “front” of the queue.

Queues support three primary operations:

enqueue(obj)
add object to the back of the queue
dequeue()
remove the object at the front of the queue and return it
is_empty()
do any objects remain in the queue?

It is an error to dequeue and empty queue. (Don’t confuse dequeue with deque, a different but related data structure.) Sometimes the front() operation is provided, which like peek(), looks at the front object without removing it.

Priority Queue #

A priority queue is a data structure for processing objects in a specialized order determined by priorities attached to the objects. Objects are processed from highest to lowest priority. A priority queue generalizes stacks and queues. A priority queue with priorities increasing as objects are added gives a stack; a priority queue with priorities decreasing as objects are added gives a queue.

Priority queues support three primary operations:

extract_highest()
remove the object of highest priority and return it
insert(obj, p)
add object with specified priority p
is_empty()
do any objects remain in the queue?
peek_highest()
examine but do not remove highest priority object

Traversal States #

Let’s describe a general strategy for traversing a graph. This general strategy traverses a graph in any order, and can be adjusted to do whatever task you want to do: find a graph, calculate statistics of the graph, do coloring…

During traversal, we will assign nodes two possible states:

fresh
the node has not yet been considered or processed
processed
the node has been visited and its neighbors added to the queue/stack to consider

Because the state of an edge can be inferred from its surrounding nodes, it is usually sufficient to keep track of the node states.

We will maintain a traversal state object that contains the current state of all nodes and a record of the history of traversal, including from what node was each node visited, the “times” at which each node is processed, and an indicator (which we can set) of whether the search should stop.

As part of this state, we will also track an arbitrary accumulator object, which we will use to store any data or results that we accumulate over the course of the traversal.

Traversal Power-Ups #

To make the traversal flexible, we will pass two processing functions to the algorithms:

on_node(graph, node, state)
called on fresh nodes when they are visited
on_edge(graph, from_node, to_node, state)
called on edge when first traversed

These will be able to access the current node or edge as well as the current traversal state. By specifying difference choices of these functions, we can make our graph traversal algorithm perform many different tasks.

A Traversal Template #

from TraversalState import TraversalState

def noop(graph, node, state):
    return state

def traverse(graph, start, acc, on_node=noop, on_edge=noop, state=None):
    if state is None:
        state = TraversalState(graph.nodes(), acc)

    # We'll use a queue, stack, or priority queue, depending on the order of
    # traversal we want to use
    remaining_nodes = Queue_or_Stack()
    remaining_nodes.insert(start)

    while not remaining_nodes.is_empty() and not state.finished:
        current_node = remaining_nodes.pop()

        # Don't want to visit nodes twice, e.g. if there are several edges
        # pointing in to them
        if state.fresh(current_node):
            state.process(current_node, time)

            state = on_node(graph, current_node, state)

            for neighbor in graph.neighbors(current_node):
                if state.fresh(neighbor):
                    remaining_nodes.push(neighbor)

                    state = on_edge(graph, current_node, neighbor, state)

    return state

Breadth-First Search (BFS) #

We will maintain a queue of nodes, initialized with the start node. We successively dequeue a node and process it and enqueue all of that nodes fresh neighbors (processing all the edges along the way).

Inputs:

graph
an undirected graph
start
a node at which to start the search
acc
an accumulator object of arbitrary type
on_node(graph, node, state)
a function called when a node is visited
on_edge(graph, from_node, to_node, state)
a function called when an edge is traversed
ts
a traversal state object (newly initialized if None)

Output:

  • An updated traversal state ts'

The algorithm is as above, using a queue.

Assume we also have a functions bfs_all that calls bfs on successive fresh nodes, updating the same traversal state and returning it.

def bfs_all(graph, acc, on_node=None, on_edge=None):
    ts = None

    for node in graph.nodes():
        if ts is None or ts.fresh(node):
            ts = bfs(graph, node, acc, on_node, on_edge, ts)

    return ts

Question: What if we used this as the on_edge function, meaning we called bfs with on_edge=track_parents? Suppose that state.parents starts as an empty dictionary.

def track_parents(graph, from_node, to_node, state):
    state.parents[to_node] = from_node

    return state

Exercises on using BFS #

  1. You have a function

    def inc(graph, node, ts):
        ts.accumulator += 1

    and call tstate = bfs(start, 0, on_node=inc). What is tstate.accumulator after the call?

    1. You have a function
    def inc_if_blue(graph, node, ts):
        props = graph.get_node_properties(node)
        if  props['color'] == 'blue':
            ts.accumulator += 1

    and call tstate = bfs(start, 0, inc_if_blue). What is tstate.accumulator after the call?

  2. You have a function

    def parents(graph, node, ts):
        parent = ts.parent[node]
        my_name = graph.get_node_properties(node, 'label')
        if parent:
            p_name = graph.get_node_properties(parent, 'label')
        else:
            p_name = None
        ts.accumulator[my_name] = p_name

    and call tstate = bfs(start, {}, parents). What is tstate.accumulator after the call?

    1. You have a function
    def blue_labels(graph, node, ts):
        props = graph.get_node_properties(node)
        if  props['color'] == 'blue':
            ts.accumulator.append(props['label'])

    and call tstate = bfs(start, [], before_node=blue_labels). What is tstate.accumulator after the call?

  3. Write a function find_path(parents, start, end) that takes the BFS tree (through the parent pointers) and returns a list of node IDs giving a path from start to end, or None if there is no such path.

  4. Configure BFS to find the connected components of a graph, the sets of nodes such that within each set there is a path between any two nodes.

  5. Configure BFS to determine if the graph can be two-colored, meaning that we can assign one of two colors to every node without two nodes of the same color sharing an edge between them. A two-colorable graph is said to be bipartite. Find the two coloring or return None/null/NA if the graph is not bipartite.

Depth First Search (DFS) #

In contrast to BFS, in DFS we will maintain a stack of nodes, initialized with the start node.

For our algorithm, we take the inputs:

graph
an undirected graph
start
a node at which to start the search
acc
an accumulator object of arbitrary type
on_node(graph, node, state)
a function called when a node is visited
on_edge(graph, from_node, to_node, state)
a function called when an edge is traversed
ts
a traversal state object (newly initialized if None)

We output an updated traversal state ts'.

We successively pop a node and process it and push all of that node’s fresh neighbors (processing all the edges along the way). There is a recursive logic to DFS: for each fresh neighbor, call DFS on it (maintaining state).

DFS(start):
  for neighbor in neighbors(start):
     if neighbor is FRESH:
        DFS(neighbor)

Wait, where’s the stack?

Alternately, we can explicitly use a stack, looping until the stack is empty. We reuse the template shown above, but with a stack.

Again, suppose we have dfs_all which continues searching until no fresh nodes are found:

def dfs_all(self, acc, on_node, on_edge):
    ts = None

    for node in self.nodes():
        if ts is None or ts.fresh(node):
            ts = self.dfs(node, acc, on_node, on_edge, ts)

    return ts

Example Configuring DFS #

  1. Task: Print traversal history as DFS runs

    Basic idea: Print each node as it is being visited and processed, and print each edge as it is being traversed. Here, we will use node labels to keep track.

  2. Task: Detect cycles in a graph with DFS.

    Basic Idea: In an undirected graph, if there are no back edges, we have a tree – hence, no cycles. But any back edge creates a cycle. So, look for back edges.

DAGs and Topological Sort #

A topological sort of a DAG is a linear ordering of the DAG’s nodes such that if \((u,v)\) is a directed edge in the graph, node \(u\) comes before node \(v\) in the ordering.

Given a DAG, how do we use DFS to do a topological sort?

Other Applications and Exercises #

  1. Configure dfs to count the number of “descendants” of a node.
  2. Configure dfs to compute a path between two nodes.

Application: Directed Graphs and Strongly Connected Components #

A directed graph is strongly connected if there is a directed path between any two nodes.

The strongly connected components of a directed graph are its maximal, strongly connected subgraphs.

We can find the strongly connected components with two DFS’s. For an arbitrary node v, the graph is strongly connected if we can find a directed path from v to any other node u and a directed path from any node w to v.

Let G be a directed graph and let G’ have the same nodes and edges with all the edges reversed. Pick an arbitrary node v.

The algorithm for detecting strong connectivity is basically as follows:

  1. Do DFS(G, v).
  2. If the traversal does not contain all nodes, then there are nodes we cannot reach from v. Hence, G cannot be strongly connected.
  3. Do DFS(G', v).
  4. If this traversal does not contain all nodes, then there are nodes in G from which we cannot reach v. Hence, G cannot be strongly connected.

To find the strongly connected components, we just do a little processing.

In step 1, record the processed_times. In step 3, do DFS_ALL(G') with the nodes ordered as the reversal of the processing_times.

Priority search operates just like BFS and DFS, the only difference is that each time we visit a node, we add its neighbors to the queue with associated priorities

Exercise: How do we modify BFS/DFS to get priority search?