Graph Traversal

— Christopher Genovese and Alex Reinhart

To traverse a graph is to visit nodes and edges systematically.

This seems boring, but it’s actually an important part of many things we want to do with graphs: finding connected components, finding paths between nodes, calculating graph statistics, and much more. Even “finding paths between nodes” is useful for an incredible number of problems, from Google Maps to internet routing, and even tasks as plainly statistical as kernel density estimation can be phrased in terms of traversals of graphs (or trees).

There are several strategies for traversing a graph, each with different useful properties.

Start with an undirected, simple graph.

Q: If I showed you a large, undirected, simple graph and asked you to count the nodes exactly, how would you do it? Suppose the only operation you can perform is, given a node, to obtain its neighbor nodes; you cannot see all nodes at once.

Strategies #

A traversal strategy produces a sequence of nodes and edges, with each node and edge listed exactly once. There are two, very commonly used strategies and one general strategy:

Breadth-First Search (BFS)
visit all neighbors of the current node before visiting any of their neighbors.
Depth-First Search (DFS)
visit all neighbors of the next visited node before visiting the other neighbors of the current node.
Priority Search
visit nodes in priority order, adding neighbors at each stage

These strategies differ in how the sequence of nodes is handled.

Aside: Stacks, Queues, and Priority Queues #

Stack #

A stack is a data structure for processing objects in a Last In-First Out manner. The most recent object added is said to be at the “top” of the stack.

Stacks support two primary (and one optional) operations:

push(obj)
add object to the top of the stack
pop()
remove the object from the top of the stack and return it
peek()
return the object from the top of the stack without removing it
is_empty()
do any objects remain in the stack?

It is an error to pop an empty stack.

Queue #

A queue is a data structure for processing objects in a First In-First Out manner. The most recent object added is said to be at the “back” of the queue; the next object to be processed is said to be at the “front” of the queue.

Queues support three primary operations:

enqueue(obj)
add object to the back of the queue
dequeue()
remove the object at the front of the queue and return it
peek()
return the object from the front of the queue without removing it
is_empty()
do any objects remain in the queue?

It is an error to dequeue and empty queue. (Don’t confuse dequeue with deque, a different but related data structure.) Sometimes the front() operation is provided, which like peek(), looks at the front object without removing it.

Priority Queue #

A priority queue is a data structure for processing objects in a specialized order determined by priorities attached to the objects. Objects are processed from highest to lowest priority. A priority queue generalizes stacks and queues. A priority queue with priorities increasing as objects are added gives a stack; a priority queue with priorities decreasing as objects are added gives a queue.

Priority queues support three primary operations:

dequeue()
remove the object of highest priority and return it
enqueue(obj, p)
add object with specified priority p
is_empty()
do any objects remain in the queue?
peek()
examine but do not remove highest priority object

Team Activity: Building a Graph Object #

  • What are the operations we need on this type?
    • add edges
    • add nodes
    • remove edges and nodes
    • access or update properties of nodes and edges
    • print a represention of the graph
    • find a nodes neighbors
    • find nodes incident on an edge
    • check equality of graphs
    • simplify the graph
    • find connected components, find paths between nodes, …
  • Directed vs. Undirected: DiGraph Subtype??
  • Different storage methods?
  • What changes with multigraphs? self edges?

A Traversal Template #

A basic traversal algorithm:

from Graph          import Graph
from TraversalState import TraversalState
from NodeStore      import node_store_factory

def node_noop(graph, node, state):
    return state

def edge_noop(graph, current_node, neighbor_node, state):
    return state

def traverse(
        graph: Graph,
        start,
        acc,
        before_node=node_noop,
        after_node=node_noop,
        on_edge=edge_noop,
        state: TraversalState | None = None,
        node_store='Stack',
        node_priority='priority'
) -> TraversalState:
    if state is None:
        state = TraversalState(graph.nodes, acc)
    remaining_nodes = node_store_factory(node_store)

    time = 0  # Tick the clock with every move
    priority = graph.node_properties(start, node_priority, None)
    remaining_nodes.insert(start)

    while not remaining_nodes.is_empty() and not state.finished:
        current_node = remaining_nodes.current

        if state.fresh(current_node):
            # First visit to this node node, mark it
            state.visit(current_node, time)
            time += 1
            if (parent := state.parent_of(current_node)) is not None:
                on_edge(graph, parent, current_node, state)
                if state.finished:
                    break
            before_node(graph, current_node, state)
            if state.finished:
                break

            # Schedule visit to all fresh neighbors
            for neighbor in graph.neighbors(current_node):
                if state.fresh(neighbor):
                    state.node_has_parent(neighbor, current_node)
                    priority = graph.node_properties(neighbor, node_priority, None)
                    remaining_nodes.insert(neighbor, priority)
        elif state.visited(current_node):
            # Neighbors all visited, process this node
            state.process(current_node, time)
            time += 1
            after_node(graph, current_node, state)
            remaining_nodes.remove_current()
        else:
            # The current node is processed, move on
            remaining_nodes.remove_current()

    return state

Traversal States #

Let’s describe a general strategy for traversing a graph. This general strategy traverses a graph in any order, and can be adjusted to do whatever task you want to do: find a graph, calculate statistics of the graph, do coloring…

During traversal, we will assign nodes three possible states:

fresh
the node has not yet been considered or processed
visited
the node has been found, but some of its neighbors' remain fresh.
processed
the node has been visited and all its neighboring nodes have as well.

We track these states to avoid accidentally doing redundant work, for example by traversing the same nodes over and over again.

We will maintain a traversal state object that contains the current state of all nodes and a record of the history of traversal, including from what node was each node visited, the “times” at which each node is processed, and an indicator (which we can set) of whether the search should stop.

As part of this state, we will also track an arbitrary accumulator object, which we will use to store any data or results that we accumulate over the course of the traversal.

Traversal Power-Ups #

To make the traversal flexible, we will pass three processing functions to the algorithms:

before_node(graph, node, state)
called on fresh nodes when they are visited
after_node(graph, node, state)
called on visited nodes when they are processed
on_edge(graph, from_node, to_node, state)
called on edge when traversed

These will be able to access the current node or edge as well as the current traversal state. By specifying difference choices of these functions, we can make our graph traversal algorithm perform many different tasks.

Team Activity: NodeStores #

Depth First Search (DFS) #

In contrast to BFS, in DFS we will maintain a stack of nodes, initialized with the start node.

Function: dfs Inputs:

graph
an undirected graph
start
a node at which to start the search
acc
an accumulator object of arbitrary type
before_node(graph, node, state)
a function called when a node is first visited
after_node(graph, node, state)
a function called when a node is processed
on_edge(graph, from_node, to_node, state)
a function called when an edge is traversed
ts
a traversal state object (newly initialized if None)

We output an updated traversal state ts'.

We successively pop a node and process it and push all of that node’s fresh neighbors (processing all the edges along the way). There is a recursive logic to DFS: for each fresh neighbor, call DFS on it (maintaining state).

DFS(start):
  for neighbor in neighbors(start):
     if neighbor is FRESH:
        DFS(neighbor)

Wait, where’s the stack?

Alternately, we can explicitly use a stack, looping until the stack is empty. We reuse the template shown above, but with a stack.

Again, suppose we have dfs_all which continues searching until no fresh nodes are found:

def dfs_all(self, acc, before_node=None, after_node=None, on_edge=None):
    ts = None

    for node in self.nodes:
        if ts is None or ts.fresh(node):
            ts = self.dfs(node, acc, before_node, after_node, on_edge, ts)

    return ts

Example Configuring DFS #

  1. Task: Print traversal history as DFS runs

    Basic idea: Print each node as it is being visited and processed, and print each edge as it is being traversed. Here, we will use node labels to keep track.

  2. Task: Detect cycles in a graph with DFS.

    Basic Idea: In an undirected graph, if there are no back edges, we have a tree – hence, no cycles. But any back edge creates a cycle. So, look for back edges.

Question #

What does the time and parent information do for us?

DAGs and Topological Sort #

A topological sort of a DAG is a linear ordering of the DAG’s nodes such that if \((u,v)\) is a directed edge in the graph, node \(u\) comes before node \(v\) in the ordering.

Given a DAG, how do we use DFS to do a topological sort?

Other Applications and Exercises #

  1. Configure dfs to count the number of “descendants” of a node.
  2. Configure dfs to compute a path between two nodes.

Application: Directed Graphs and Strongly Connected Components #

A directed graph is strongly connected if there is a directed path between any two nodes.

The strongly connected components of a directed graph are its maximal, strongly connected subgraphs.

We can find the strongly connected components with two DFS’s. For an arbitrary node v, the graph is strongly connected if we can find a directed path from v to any other node u and a directed path from any node w to v.

Let G be a directed graph and let G’ have the same nodes and edges with all the edges reversed. Pick an arbitrary node v.

The algorithm for detecting strong connectivity is basically as follows:

  1. Do DFS(G, v).
  2. If the traversal does not contain all nodes, then there are nodes we cannot reach from v. Hence, G cannot be strongly connected.
  3. Do DFS(G', v).
  4. If this traversal does not contain all nodes, then there are nodes in G from which we cannot reach v. Hence, G cannot be strongly connected.

To find the strongly connected components, we just do a little processing.

In step 1, record the processed_times. In step 3, do DFS_ALL(G') with the nodes ordered as the reversal of the processing_times.

Breadth-First Search (BFS) #

We will maintain a queue of nodes, initialized with the start node. We successively dequeue a node and process it and enqueue all of that nodes fresh neighbors (processing all the edges along the way).

Function: bfs Inputs:

graph
an undirected graph
start
a node at which to start the search
acc
an accumulator object of arbitrary type
before_node(graph, node, state)
a function called when a node is first visited
after_node(graph, node, state)
a function called when a node is processed
on_edge(graph, from_node, to_node, state)
a function called when an edge is traversed
ts
a traversal state object (newly initialized if None)

Output:

  • An updated traversal state ts'

The algorithm is as above, using a queue.

Assume we also have a functions bfs_all that calls bfs on successive fresh nodes, updating the same traversal state and returning it.

def bfs_all(graph, acc, before_node=None, after_node=None, on_edge=None):
    ts = None

    for node in graph.nodes:
        if ts is None or ts.fresh(node):
            ts = bfs(graph, node, acc, on_node, on_edge, ts)

    return ts

Question: How can we use the parents information in the traversal state?

Exercises on using BFS #

  1. You have a function

    def inc(graph, node, ts):
        ts.accumulator += 1
    

    and call tstate = bfs(start, 0, on_node=inc). What is tstate.accumulator after the call?

  2. You have a function

    def inc_if_blue(graph, node, ts):
        props = graph.node_properties(node)
        if props['color'] == 'blue':
            ts.accumulator += 1
    

    and call tstate = bfs(start, 0, on_node=inc_if_blue). What is tstate.accumulator after the call?

  3. You have a function

    def parents(graph, node, ts):
        parent = ts.parent.get(node, False)
        my_name = graph.node_properties(node, 'label')
        if parent:
            p_name = graph.node_properties(parent, 'label')
        else:
            p_name = None
        ts.accumulator[my_name] = p_name
    

    and call tstate = bfs(start, {}, parents). What is tstate.accumulator after the call?

  4. You have a function

    def blue_labels(graph, node, ts):
        props = graph.node_properties(node)
        if  props['color'] == 'blue':
            ts.accumulator.append(props['label'])
    

    and call tstate = bfs(start, [], before_node=blue_labels). What is tstate.accumulator after the call?

  5. Write a function find_path(parents, start, end) that takes the BFS tree (through the parent pointers) and returns a list of node IDs giving a path from start to end, or None if there is no such path.

  6. Configure BFS to find the connected components of a graph, the sets of nodes such that within each set there is a path between any two nodes.

  7. Configure BFS to determine if the graph can be two-colored, meaning that we can assign one of two colors to every node without two nodes of the same color sharing an edge between them. A two-colorable graph is said to be bipartite. Find the two coloring or return None/null/NA if the graph is not bipartite.

Priority search operates just like BFS and DFS, the only difference is that each time we visit a node, we add its neighbors to the queue with associated priorities

Question: How do we modify BFS/DFS to get priority search?