Data Structures (Week 4 Thursday)

— Christopher Genovese and Alex Reinhart

Announcements #

  • New exercises on problem bank
  • Re-Pull the s750 documents directory for today’s class
  • Will introduce projects on Tuesday, revised schedule

Plan for Today #

  • A few more thoughts on functional traversal, script structure, and parsing
  • A consequential choice: how we organize our data
  • A menagerie of data structures: operations and performance
  • Maybe Activity

On Computational Complexity #

How do we choose among the various data structures and algorithms available to solve a problem?

Among the many practical dimensions of this decision, performance is an often critical one.

Many users of dynamic languages hold folk beliefs about performance and optimization. Many R users, for instance, think that if code can be written as matrix and vector operations, or in terms of a base function that’s implemented in C, it will be fast.

So they work to write all their algorithms to use vectorized base functions. If they can’t, then – because “Loops are slow in R” – they pull out the Rcpp package and write it in C++.

While this is not entirely misguided, it ignores the critical role that choice of algorithm can play. The difference between C and R isn’t nearly as big as the difference between an \(O(n)\) and \(O(n^2)\) algorithm.

Choice of data structures can make a similarly important difference. We must choose a structure that efficiently supports the operations we need for our problem.
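
For instance, testing membership in a Python list scans the list each time (O(n) per test), while a set uses hashing (roughly O(1) per test). A minimal sketch, with illustrative sizes only:

import time

n = 100_000
as_list = list(range(n))
as_set = set(as_list)

queries = range(n - 100, n)       # items near the end: worst case for scanning the list

start = time.perf_counter()
hits_list = sum(q in as_list for q in queries)   # each test scans the list: O(n) per query
t_list = time.perf_counter() - start

start = time.perf_counter()
hits_set = sum(q in as_set for q in queries)     # each test hashes the item: ~O(1) per query
t_set = time.perf_counter() - start

print(hits_list == hits_set)      # same answers
print(t_list, t_set)              # the set is typically orders of magnitude faster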

Big Oh and Friends #

Analogous to asymptotic error in statistical problems, we use big Oh notation to describe an algorithm’s performance.

Premises:

  1. An algorithm is composed of many low-level operations (comparison, moving data in memory, addition, multiplication, etc).
  2. To good approximation, these operations all take the same amount of time.
  3. The number of operations required to execute an algorithm scales with the input size \(n\) (the size of the problem).

If an algorithm requires \(g(n)\) operations, we say that \[ g(n) = O(f(n)) \] iff there is a constant \(c > 0\) such that \(g(n) \le c f(n)\) for all sufficiently large \(n\).

The purpose is to focus on the intrinsic performance of the algorithm, without regard to the details of the language or computer we are using.

Examples:

  • Computing the mean of a length \(n\) vector is \(O(n)\).
  • Sorting a length \(n\) vector is \(O(n\log n)\).
  • Computing the pairwise distances for a set of \(n\) points is \(O(n^2)\).
  • Computing all permutations of a string of length \(n\) is \(O(n!)\).

In practice, we start to worry once an algorithm is quadratic or worse, though sometimes we have no choice.
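
As a sketch of why the pairwise-distance computation above is quadratic: the nested loop below touches every pair exactly once, about n(n-1)/2 iterations. (A minimal illustration in Python; the helper name is hypothetical, points are (x, y) tuples.)

from math import dist    # Euclidean distance between two points (Python 3.8+)

def pairwise_distances(points):
    """Return a dict mapping index pairs (i, j), i < j, to distances."""
    n = len(points)
    out = {}
    for i in range(n):                 # n outer iterations ...
        for j in range(i + 1, n):      # ... each with up to n - 1 inner ones: ~n^2/2 total
            out[(i, j)] = dist(points[i], points[j])
    return out

print(len(pairwise_distances([(0, 0), (3, 4), (6, 8), (1, 1)])))   # 4 points -> 6 pairs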

Other considerations to keep in mind:

  • Memory use
  • CPU cache and pipeline
  • I/O

The Architecture of Data #

Computer memory is addressable. Think of memory (loosely) as a giant array: every location has an address that essentially indexes into this memory.

References and Pointers #

In “low-level” languages like C or assembly, we have data types, called pointers, that store explicit memory addresses. We can manipulate these addresses directly and access the data they point to.

/* Returns the length of a null-terminated string s */
int strlen(char* s) {
    char* p;
    for( p = s; *p; ++p ) {
        ;   /* advance p until the terminating '\0' */
    }
    return p - s;
}

/*
 * Compares two null-terminated strings
 * Returns 0 if equal, <0 if s1 < s2, >0 if s1 > s2.
 */
int strcmp(char* s1, char* s2) {
    while( *s1 && *s2 && *s1 == *s2 ) {
        ++s1;
        ++s2;
    }
    return *s1 - *s2;
}

(C++ has explicit pointers, for compatibility with C, but pointer manipulation is a less common idiom in that higher-level language.)

In most higher level languages, however, we cannot manipulate explicit pointers, but we can use references to objects. Sometimes the references must be explicitly indicated (e.g., Rust); other times (e.g., Python) all aggregate data types are references.

a = [1, 2, 3, 4]
b = a        # b refers to the same list object as a
c = a[:]     # c is a (shallow) copy
b[2] = 20
print(a)     # [1, 2, 20, 4]
print(c)     # [1, 2, 3, 4]

Here b and a are actually references to the underlying list object; c is a copy.

References can lead to accidental mutation, which in turn causes some subtle bugs, e.g., mutable default arguments:

def wrap_body(body_strings=[]):
    if not body_strings:
        body_strings.append('@')
    elif not body_strings[0].startswith('@'):
        body_strings[0] = '@' + body_strings[0]
    body_strings.append("footer")
    return body_strings
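
Because the default list is created once, when the function is defined, it is shared across calls; every call without an argument mutates the same list:

print(wrap_body())   # ['@', 'footer']
print(wrap_body())   # ['@', 'footer', 'footer']   (the shared default kept its state!)
print(wrap_body())   # ['@', 'footer', 'footer', 'footer']

The usual fix is to use None as the default and create a fresh list inside the function.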

Using references does enable us to create complex data structures. Note that under the hood, references can be more complicated than they appear, as the system may need to track multiple locations for an object (e.g., virtual functions).

R does not give references to its data. Instead, it shares data for reads but copies on writes. Environments provide one way to simulate references.

a <- c(1, 2, 3, 4)
b <- a          # b initially shares a's data
b[3] <- 20      # modifying b triggers a copy; a is unchanged

a               # 1 2 3 4

Heap, Stack, and Cache #

Memory that our program accesses is in fact more complicated than a giant array. It is managed in detail by the operating system and we get an abstracted (and somewhat illusory) view of it.

There are four main abstractions describing memory organization that will matter for us.

  1. External Storage

    Disks and other external data devices (even other servers) require I/O operations to access. They have huge amounts of storage but are orders of magnitude slower than direct memory access.

  2. Heap

    A pool of memory that processes can request pieces of. This is used for program-level data and longer-lived data used in our program.

    Heap is typically very large, but heap access is slower. Over time the heap can become fragmented, making it harder to obtain a large contiguous block.

  3. Stack

    A smaller, faster store of memory arranged linearly and accessed only at a single, current location – the stack pointer.

    This is used for local variables in functions. Each function call “pushes” a frame of local data onto the stack and “pops” it when the function finishes, leaving only the return value.

  4. Cache

    A CPU cache is a specialized, very fast (orders of magnitude faster) memory located close to the processor core. This usually stores recently-used data, so algorithms that maintain locality of access can see significant speed up. There are typically several levels of cache (L1, L2, …) that increase in size and decrease in speed of access.

    Analysis of the cache performance can be quite subtle, but in performance critical code, locality of memory access is worth considering.

Managing References and Memory #

If I create an object, the data from that object is stored on the heap. What is the life cycle of that data?

There are three common strategies for managing the memory containing our data and references to those data:

  1. Explicit allocation/deallocation

    The programmer explicitly requests memory for objects, and explicitly releases it when done. Lots of control and efficiency, but prone to serious bugs.

  2. Garbage collection

    References to allocated objects are tracked, and an object that is no longer actively referenced is released, cleaned up in periodic sweeps.

  3. Ownership model

    Explicit ownership of all data is tracked. When data goes out of its owner’s scope, it is released.

Data Structures #

A data structure is a formal organization of data that supports a pre-specified set of operations on it and is designed to allow efficient access and manipulation of the data it contains.

Basic Data #

Primitive Types #

Boolean true, false
Integer signed or unsigned integers in various ranges
Numeric fixed precision floating point numbers
Character represents a single glyph, various representations
Pointer an address in memory

There are others used, but these are the most common.

Aggregate Types #

Simple collections of values stored in a contiguous block of memory. These are the building blocks of more complex data structures.

Tuple fixed-length sequence of values, with each element having a specified type
Array contiguous, fixed-length collection of values all of the same type
Record/Struct structured group of named attributes of various types in contiguous storage
Union region of memory that can store several specified types, undifferentiated
Tagged Union region of memory that can store several specified types, identified
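
As a rough illustration in Python (which blurs some of these distinctions, since its containers hold references rather than raw values): a tuple, a homogeneous array, and a record-like struct might look like this.

from array import array                 # contiguous, fixed-type storage
from dataclasses import dataclass

point = (3.0, "label", True)            # tuple: fixed length, mixed types

values = array('d', [1.0, 2.5, 3.7])    # array: contiguous doubles, all the same type

@dataclass
class Record:                           # record/struct: named attributes of various types
    name: str
    count: int
    weight: float

r = Record(name="sample", count=3, weight=2.5)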

Object Types #

Most programming languages offer a facility for defining more complicated types, which store data and have certain behaviors.

String A sequence of characters
Enumeration A value that can take only one of a fixed set of possible values
Function A callable object; that is, a … function
File A representation of a file in a file system
Stream A stream of data elements made available sequentially
BigNum An arbitrary-precision number
Object A general type encapsulating specific data and behaviors

Abstract, Algebraic, and Recursive Data Types #

Higher level and more complex data organizations described in various ways depending on the language and computing environment.

type Algebra f a = f a -> a
data List a = Nil | Cons a (List a)
data BTree a = Leaf a | Branch (BTree a) (BTree a)

These typically support a variety of specialized operations and exhibit various performance trade-offs.

Core Data Structures: A Tour #

Several data structures are used so frequently that they play a role at some level in most of the programs we write.

Lists #

A list is a linear, finite sequence of values.

Abstract specification:

data List a = Nil | Cons a (List a)

Basic operations:

  • first :: List a -> Maybe a, returns the first item in the list in O(1) time. (What if the list is empty?)
  • rest :: List a -> List a, returns the tail of the list in O(1) time.
  • cons :: List a -> a -> List a, adds a value at the head of the list in O(1) time.

Other operations: Performance depends on implementation, from O(1) to O(n)

  • nth :: List a -> Int -> Maybe a, returns the nth element of the list.
  • insert :: List a -> Int -> a -> List a, inserts an item at a position in the list.
  • remove :: List a -> Int -> List a, removes the item at a position in the list.

Implementations:

  • Array - items stored in contiguous memory (usually on heap)

    Access at an arbitrary index is fast, O(1). Insertion at the end is fast until we run out of allocated space and need to copy the data to a larger block.

    [Figure: array-backed list in contiguous memory]
  • Linked list

    Each item is a record with data and a pointer to the next item (or nil if none); see the sketch after this list.

    [Figure: singly linked list]
  • Doubly linked list

    Each item is a record with data and pointers to next and previous items (or nil if none).

    [Figure: doubly linked list]
  • Persistent Vectors - a special tree structure with a wide branching factor gives effectively O(1) (technically O(log n)) access and update
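
Here is a minimal sketch of the linked-list idea in Python (hypothetical class names, not a production implementation): each cell holds a value and a reference to the rest of the list, which makes first, rest, and cons O(1).

class Cons:
    """One cell of a singly linked list: a value plus the rest of the list."""
    def __init__(self, value, rest=None):
        self.value = value
        self.rest = rest     # None plays the role of Nil

def cons(value, lst):
    return Cons(value, lst)                       # O(1): allocate one new cell

def first(lst):
    return None if lst is None else lst.value     # O(1)

def rest(lst):
    return None if lst is None else lst.rest      # O(1)

# Build the list 1 -> 2 -> 3
xs = cons(1, cons(2, cons(3, None)))
print(first(xs), first(rest(xs)))    # 1 2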

Stacks, Queues, and Deques #

A stack is a collection of values where only the most recently added item is available. Items are available Last In, First Out (LIFO). (Think of a stack of cafeteria trays.)

Abstract specification:

data Stack a = Nil | Top a (Stack a)

Main Operations

  • push :: Stack a -> a -> Stack a, adds an item to the top of the stack in O(1) time.
  • pop :: Stack a -> Maybe a, removes and returns the top item in O(1) time. (Popping an empty stack is an error.)
  • isEmpty :: Stack a -> Bool, indicates whether the stack is empty in O(1) time.
  • (Optional) peek :: Stack a -> Maybe a, returns the top item without removing it, in O(1) time.

Implementations: Array or Linked List, variants like circular stacks

A queue is a collection of values where only the least recently added item is available (at the front) and where items can be added only at the end. Items are available First In, First Out (FIFO). (Think of a line at the grocery store.)

Abstract specification:

data Queue a = Queue { front :: List a, back :: List a }

Main operations

  • enqueue :: Queue a -> a -> Queue a, adds an item to the end of the queue in O(1) time.
  • dequeue :: Queue a -> a, removes and returns the item at the front of the queue in O(1) time. (Dequeueing an empty queue is an error.)
  • isEmpty :: Queue a -> Bool, indicates whether the queue is empty in O(1) time.
  • (Optional) peek :: Queue a -> a, returns the front item without removing it, in O(1) time.

Implementations: Doubly-Linked List or Rolling Array

A deque (pronounced “deck”) is a generalization of stacks and queues that stands for double-ended queue.

The abstract specification states that items can be added to or removed from either end in O(1) time, with operations addFront(D,x), popFront(D), addRear(D,x), popRear(D), and isEmpty(D) at least.

Implementation is often with a doubly-linked list.
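
In Python, collections.deque covers all three cases. A minimal sketch using it as a stack (LIFO), a queue (FIFO), and a double-ended queue:

from collections import deque

# Stack: push with append, pop from the same end (LIFO)
stack = deque()
stack.extend(["a", "b", "c"])
print(stack.pop())        # 'c'

# Queue: enqueue with append, dequeue from the opposite end (FIFO)
queue = deque()
queue.extend(["a", "b", "c"])
print(queue.popleft())    # 'a'

# Deque: add or remove at either end in O(1)
d = deque([1, 2, 3])
d.appendleft(0)
d.append(4)
print(d)                  # deque([0, 1, 2, 3, 4])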

Language implementations:

  • collections.deque in Python
  • collections, rstackdeque, or dequer packages in R (don’t use liqueueR)
  • std::stack, std::queue, and std::deque in C++
  • DataStructures.jl for Julia
  • Question

    Suppose you’ve written a program to scrape Wikipedia and extract data. It starts at a specific article and extracts all the links from that page. It then puts all those links on a stack.

    To decide which article to look at next, it pops the top of the stack and loads that article, then puts all of its links onto the stack.

    Describe the order in which it will visit the articles. At what point will it have visited all the links from the first article visited?

Priority Queues #

A priority queue is a collection of items with associated scores (called priorities) in which it is efficient to remove (or view) the item with highest priority. (Think of the line at a hospital emergency room, where the most serious cases are taken first.)

Main operations:

  • insert(Q, x, p) adds item x with priority p to priority queue Q.
  • removeHighest(Q) removes and returns the highest priority item from priority queue Q. (It is an error to remove from an empty queue.)
  • findHighest(Q) returns highest priority item from priority queue Q without removing it.
  • isEmpty(Q) indicates whether Q is empty.

insert and findHighest can be implemented in O(1) time, though for naive implementations the latter can be O(n).

Implementations: arrays, heaps

Language implementations:

  • heapq in Python standard library
    • In heapq, you don’t specify a separate priority: items are sorted by their own value, not by a separate priority. But there’s a trick: if you make a tuple (priority, value), the < operator for tuples compares the first element first, then the second, and so on; so (1, "foo") < (2, "bar"). You can use this to insert items with any priority you want. (See the sketch after this list.)
    • Note that if there is a tie in priority, such as between (1, "foo") and (1, "bar"), Python compares the second element of each tuple. If the second element is a type that < is not defined for, this won’t work; you might need some way of breaking ties.
  • collections package in R
  • Question

    How could I use a priority queue to sort a bunch of items?
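
A minimal sketch of the (priority, value) trick with heapq, using an increasing counter as a tie-breaker (heapq is a min-heap, so smaller priority numbers come out first; the helper names are ours, not part of heapq):

import heapq
from itertools import count

pq = []
tie_breaker = count()    # increasing integers break ties between equal priorities

def insert(pq, item, priority):
    heapq.heappush(pq, (priority, next(tie_breaker), item))

def remove_highest(pq):
    priority, _, item = heapq.heappop(pq)   # smallest priority value first
    return item

insert(pq, "broken arm", 2)
insert(pq, "cardiac arrest", 1)
insert(pq, "sprained ankle", 3)
print(remove_highest(pq))   # 'cardiac arrest'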

Hash Tables #

A hash table (aka dictionary, associative array, hash map, or sometimes map) maps keys of one type (typically drawn from a large, sparse set of possible values) to values of another type. We look up values stored in the table by their keys. Think of a dictionary: we look up the definition of a word using the word as a key into the dictionary.

Abstract specification (basic):

  • insert(H, k, v) – Insert value v into hash table H associated with key k
  • lookup(H, k) – Find the value, if any, in H associated with key k
  • remove(H, k) – Remove a key and its associated value from the table

The idea is that these operations are fast, ideally O(1), even though the set of possible values may be large and sparse.

The trick to making this work is using a good hash function. A hash function is a function which takes an object (integer, tuple, string, anything) and returns a number. Hash functions are designed to be very fast, and so that slightly different inputs give very different outputs. The key property:

Two unequal objects are unlikely to have the same hash value

We will talk more about hash functions and their many uses later in the course.

Language Implementations:

  • Native: dict in Python, maps in Clojure, tables in Lua, Dict in Julia, …
  • std::unordered_map in C++
  • collections or hash packages in R (only string keys supported). R’s lists are not hash tables, and have O(n) lookup or worse. R does use hash tables internally (“environments”) to connect variable names to their values.

Here is a basic implementation of a hash table. Suppose a hash function gives output in the range [0, N], where N is a suitably large number. We pick a smaller number M and allocate an array of size M.

To add an item to the set, we calculate i = hash(item) mod M. Then look up the ith element in the array.

  • If it is empty, add the item as the first element of a linked list.
  • If it already contains an entry, search the linked list there. If the item is not already in the list, append it. This is called chaining.
[Figure: hash set with chaining]

When two separate items end up in the same bucket – the same array element – we call it a collision.

To determine if an item is already in the set, calculate i = hash(item) mod M again. Search the linked list at that index to see if the item is there.

If M is suitably large – much larger than the number of items in the set – there will be few collisions and looking up any element will be fast.
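
A minimal sketch of a chained hash set along these lines, using Python’s built-in hash (M is fixed for simplicity; a real implementation would also grow the array as the load factor rises):

class HashSet:
    def __init__(self, n_buckets=64):
        self.M = n_buckets
        self.buckets = [[] for _ in range(self.M)]   # each bucket is a small list (the "chain")

    def _index(self, item):
        return hash(item) % self.M

    def add(self, item):
        bucket = self.buckets[self._index(item)]
        if item not in bucket:        # search the chain; append only if absent
            bucket.append(item)

    def contains(self, item):
        return item in self.buckets[self._index(item)]

s = HashSet()
s.add("cat")
s.add("dog")
print(s.contains("cat"), s.contains("fish"))   # True False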

Hash sets have a statistic called a load factor: the average number of entries per bucket. If the load factor is high, there are many collisions, and looking up entries will require searching lists. Many hash set implementations automatically grow their backing array when the load factor gets too high.

Note: The hash set or table implementation in your chosen programming language, like Python or R, will automatically deal with collisions – you don’t have to write this chaining logic.

(There are other schemes besides chaining to handle collisions, like linear or quadratic probing and Robin Hood hashing.)

Sets #

A set is a container that behaves like a mathematical set: it is unordered and contains at most one copy of each object.

Specification:

  • elementOf(S, x)
  • insert(S,x)
  • remove(S,x)
  • subsetOf(S1,S2), union(S1,S2), …
  • many more…

A hash set is a fast (O(1) for access and insertion) implementation of a set based on hashing.

Languages like Python and Clojure provide native set types.

Languages like C++ and Java have extensive libraries supporting a variety of sets.

  • std::set, std::unordered_set, and boost libraries in C++
  • SortedSet and SortedDict in DataStructures.jl in Julia
  • sets package in R. Actually stores sets as lists when created, and converts them to hash tables for set operations that require it.

Trees #

Trees represent a hierarchical organization of data. Data is arranged in nodes which have children (and sometimes parents); we follow paths from the root of the tree out to the leaves. Trees are important both as data structures and as mathematical objects.

We will see many varieties of trees, but for now we look at a simple case where each node can have at most two children. These are called binary trees, and they are an important case.

A binary search tree is an efficient way to store ordered data so that searching for specific entries is fast. By maintaining a sorted tree of entries, we can find any entry by traversing the tree.

A node in a tree has three parts:

  • value
  • pointer to left child
  • pointer to right child

The tree is called “binary” because each node has at most two children. There are other variations, like B-trees, where nodes can have more children. PostgreSQL uses B-trees for indexing databases for fast querying.

Suppose we have the entries {8, 3, 10, 14, 6, 1, 4, 7, 13}.

[Figure: a binary search tree containing these entries]

Searching for an entry in the tree takes only O(log n) time (provided the tree stays balanced), as do insertion and deletion, since the data structure is effectively already sorted – traversing the tree is like doing a binary search of a sorted array. We can also find all nodes greater than a certain value, less than a certain value, in a certain range, and so on, by carefully traversing the tree. This is how Postgres accelerates queries with indices.
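
As a concrete illustration, here is a minimal sketch of binary search tree insertion and lookup in Python (hypothetical Node class, no balancing), using the entries above:

class Node:
    def __init__(self, value):
        self.value = value
        self.left = None      # subtree of smaller values
        self.right = None     # subtree of larger values

def insert(node, value):
    if node is None:
        return Node(value)
    if value < node.value:
        node.left = insert(node.left, value)
    elif value > node.value:
        node.right = insert(node.right, value)
    return node               # duplicates are ignored

def contains(node, value):
    if node is None:
        return False
    if value == node.value:
        return True
    return contains(node.left if value < node.value else node.right, value)

root = None
for x in [8, 3, 10, 14, 6, 1, 4, 7, 13]:
    root = insert(root, x)
print(contains(root, 7), contains(root, 5))   # True False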

Question: What happens if we accidentally insert all the entries in sorted order?

  • Binary tree traversal

    A common operation given a tree is to “visit” all the nodes in the tree (presumably doing something with the information stored in those nodes). This operation is called traversing the tree.

    There are many different orders in which one can traverse a tree, but three are particularly valuable in practice:

    Preorder : Visit root, visit left subtree, visit right subtree

    Inorder : Visit left subtree, visit root, visit right subtree

    Postorder : Visit left subtree, visit right subtree, visit root

    As you can see, the name refers to when the root is visited relative to its subtree.

  • Exercise to think about

    • If I have a binary tree, how can I produce an array or list containing its elements in sorted order?

    • Are there assertions that would be useful to use in an implementation of a binary tree?

    • How could I use a binary tree to implement a priority queue?

  • Applications

    • Sets and maps
    • Radix trees (routing tables)
    • Heaps (which lead to priority queues, which lead to graph routing algorithms like A*)
    • R-trees, k-d trees, and spatial indexing. Very useful for nearest-neighbor and other spatial algorithms
  • Language Implementations

    • None in R :(
    • bintrees package in Python

Graphs #

A graph is a collection of nodes and edges that represent pairwise relationships between various entities. Many different styles (directed, undirected, simple, multi, …) and representations (adjacency lists, adjacency matrices, …).

Graphs are important both as data structures and as mathematical objects, and are the subject of their own class.
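
One common representation, the adjacency list, is just a map from each node to the list of its neighbors. A minimal sketch for a small undirected graph in Python:

# Undirected graph on nodes A-D, stored as an adjacency list
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B"],
}

print(graph["A"])             # neighbors of A: ['B', 'C']
print("D" in graph["B"])      # is there an edge B-D?  True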

Data Frame #

A combination of several different structures providing tabular and name-based access to a regular, array-like grid of data. It also allows attribute storage (e.g., column and row names).

Persistent Hash Maps #

Efficient immutable hash tables using structure sharing to give near-constant-time lookup and modification.

Some Other Highlights #

Tries #

(Also called a prefix tree.) A tree data structure used for locating sequential keys (typically strings) in a set, grouped by their shared prefixes: keys with a common initial segment share a path from the root.
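
A minimal sketch of a trie over strings using nested Python dictionaries, with a sentinel key marking the end of a stored word (illustrative only, not optimized):

END = "$"   # sentinel marking the end of a stored word

def trie_insert(trie, word):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})   # walk/create one level per character
    node[END] = True

def trie_contains(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

trie = {}
for w in ["car", "cart", "cat"]:
    trie_insert(trie, w)
print(trie_contains(trie, "cart"), trie_contains(trie, "ca"))   # True False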

B-trees and Quad-trees #

B-trees are self-balancing tree structures that allow search and insertion of data, and are well suited to managing large blocks of data.

Quadtrees (and cousins) are tree structures with four children per node that are often used to represent spatial data, or other points in the plane.

Both of these are commonly used in database systems.

Bloom Filters #

A probabilistic representation of a set that is extremely space efficient. The basic operation is testing whether an element belongs; false positives are possible but not false negatives. Roughly speaking, fewer than 10 bits per item are required for a 1% false positive rate.

Bloom filters make heavy use of hash functions.
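
A minimal sketch of a Bloom filter in Python, using a bytearray of bits and a few salted hashes (hashlib-based hashes chosen purely for illustration; real implementations use faster hash functions and carefully tuned sizes):

import hashlib

class BloomFilter:
    def __init__(self, n_bits=8 * 1024, n_hashes=7):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 hashes of the item
        for salt in range(self.n_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("statistics")
print(bf.might_contain("statistics"))   # True
print(bf.might_contain("zebra"))        # almost certainly False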

Zippers #

An immutable representation of a tree with a focus, i.e., a distinguished current position that supports efficient local navigation and editing.

data Zipper a = Zipper { focus :: a
                       , left :: [a]
                       , right :: [a]
                       , parent :: Zipper a
                       , changed :: Bool }

Basic operations include navigating, inserting/replacing/editing nodes, and producing an unfocused tree.

2-3 Finger Trees #

A tree data structure that provides amortized O(1) access to both ends of a sequence and O(log n) concatenation and splitting, making it a good general-purpose functional sequence.

data FingerTree a = Empty
                  | Single a
                  | Deep (Digit a) (FingerTree (Node a)) (Digit a)

data Digit a = One a | Two a a | Three a a a | Four a a a a
data Node a = Node2 a a | Node3 a a a

Activity #