Hashing, Streaming, and Monoids, Oh my!

— Christopher R. Genovese

Announcements #

  • HW2 meetings
  • Questions?

Plan #

  1. Hashing
  2. Hashing + Monoids => Efficient Large Data Streaming

Hashing #

The key idea of hashing is to summarize an arbitrary object with a deterministic, fixed-length “signature” with specified properties.

Two signatures of the previous paragraph:

   d0c62a9a66f2b9752a32867d922016bf
   e65ff981ccb4255700f88dfc4e9268c2d51358b0

The function that computes this signature from data is called a hash function; its value is called a hash value, or variously, a hash code, digest, digital signature, or simply hash. A hash function must be a pure function, so the same input object is always assigned the same hash value.

Hashing refers to a class of algorithms that make essential use of hash functions. We will find all manner of interesting uses for hashing.

The properties that we require a hash function to satisfy vary with the application. Common criteria include:

  • Two unequal objects are unlikely to have the same hash value.
  • It is computationally hard to find two objects with the same hash value.
  • Given the hash value, it is computationally hard to reconstruct the object.
  • Two objects that are close together are more likely to have the same hash value than two objects that are far apart.

Hash Functions #

A hash function maps a specified “universe” of objects to a value (a hash value) that can be used as a representative of the object. The hash value is usually an arbitrary bit string (often viewed as an integer), though other types of values may be useful for particular applications.

What makes a good hash function? While it depends in part on the application, for most common uses an ideal hash function has four properties:

  1. Uniformity of hash values across the output range
  2. Every subset of output bits depends on every non-empty subset of input bits.
  3. Avalanche Effect: \[ P(\text{Output bit $j$ changes} \mid \text{Input bit $i$ changes}) = 1/2 \text{ for all } i,j. \]
  4. Fast calculation

In practice, criteria 1-3 are only partially realizable, and they trade off against criterion 4.

Basic structure of typical hash functions #

A schematic for a general hash function:

hash(initial_state, state_bits, input):
   state = initial_state
   bits_hashed = 0
   while input is available:
       chunk = next state_bits bits of input (pad with zeroes)
       optionally pre-mix chunk
       optionally mix bits of state
       state = update(state xor chunk)  # more generally update(state,chunk)
       bits_hashed += state_bits

   return finalize(state, bits_hashed)

Here, update() is a specialized hash function that takes and returns state_bits bits, and finalize is a specialized hash function that maps the state to a bit string of the desired length.

This reduces the problem of hashing general input to the problem of hashing fixed-width integers. Let’s say that state_bits is 64 or 32 for the purposes of discussion.
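As a concrete toy illustration of this schematic in R (the constants and the update rule are arbitrary choices for illustration, not a recommended hash):

toy_hash <- function(input, initial_state = 17, M = 2^31 - 1) {
    state <- initial_state
    for (chunk in utf8ToInt(input)) {       # one chunk per character code
        state <- (state * 31 + chunk) %% M  # update(state, chunk): mix and reduce
    }
    state %% 2^16                           # finalize(): truncate to 16 bits
}

toy_hash("hash me")  # a 16-bit hash value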

A common approach to update() is to chain together a sequence of simple reversible operations like

  • Adding or xoring a constant
  • Permuting bits
  • xoring state with shifted versions of state (s xor (s >> k))
  • replacing chunk with a CRC code (or similar)

Example: the update component of Jenkins's hash, which takes a 96-bit input as three 32-bit words a, b, c:

a -= b; a -= c; a ^= (c>>13);
b -= c; b -= a; b ^= (a<<8);
c -= a; c -= b; c ^= (b>>13);
a -= b; a -= c; a ^= (c>>12);
b -= c; b -= a; b ^= (a<<16);
c -= a; c -= b; c ^= (b>>5);
a -= b; a -= c; a ^= (c>>3);
b -= c; b -= a; b ^= (a<<10);
c -= a; c -= b; c ^= (b>>15);
return concat(a,b,c); // a, b, c concatenated into a 96-bit integer

Some good hash functions for general use #

The choice of hash function depends strongly on the application and is mostly heuristic. But the choice can have an important impact on performance. We will consider cases where a single hash function is used and cases where several are used at once.

  • Classical Methods

    We can start by thinking about how to hash integers, since we can represent any object as one large integer or as a sequence of smaller integers. (A small R sketch of these classical methods appears after this list.)

    • Division Method

      \( h(k) = k \bmod M \) (M typically prime)

    • Multiplication Method

      \( h(k) = \lfloor M (A k \bmod 1) \rfloor \)

      where \( 0 < A < 1 \) (e.g., \(A = (\sqrt{5} - 1)/2 \approx 0.61803\ldots\)).

    • Multiply-Shift Method

      Let \( M = 2^m \) be a power of 2 and let W be the number of bits in a machine word. If \( a < 2^W \) is an odd integer, define \[ h_a(k) = (a k \bmod 2^W) \,\mathrm{div}\, 2^{W-m} \] (reduce \(a k\) modulo \(2^W\) and then keep the high-order \(m\) bits).

      In C-like languages this is easily expressed as

      h_a(k) = (unsigned)(a * k) >> (W - m).
      
    • Multiply-Shift-Add Method

      Improve on Multiply-Shift: \[ h_{ab}(k) = ((a k + b) \bmod 2^W) \,\mathrm{div}\, 2^{W-m} \] where everything is as before except that \( 0 \le b < 2^{W-m} \) is an integer. When a, b are random integers, \(h_{ab}\) forms a universal family.

    For non-integers, we decompose our input into pieces and then combine the hash values of the individual pieces. For example, using Multiply-Shift, initialize a random vector a of odd integers \(< 2^{2W} \) and then \[ h_a(x) = \left( \left(\sum_{i=0}^{k-1} x_i a_i\right) \bmod 2^{2W} \right) \,\mathrm{div}\, 2^{2W-m}. \]

  • Modern Methods

    Modern general-purpose hash functions tend to do more thorough mixing and recombination of the inputs. These have been thoroughly tested and optimized. Reasonable choices include

    • Murmur3 (see digest package for R, mmh3 for python)
    • FarmHash
    • CityHash
    • Spooky
    • JenkinsHash

    with the top two or three particularly recommended. You can google these.
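Returning to the classical methods above, here is a small R sketch for integer keys. The constants are illustrative choices, and the multiply-shift version is scaled down to a 16-bit "word" so that all arithmetic stays exact in R doubles; a real implementation would use W = 32 or 64 with native unsigned arithmetic.

division_hash <- function(k, M = 1021) {               # M prime
    k %% M
}

multiplication_hash <- function(k, M = 1024, A = (sqrt(5) - 1) / 2) {
    floor(M * ((A * k) %% 1))
}

multiply_shift_hash <- function(k, m = 10, W = 16, a = 40503) {  # a odd, a < 2^W
    ((a * k) %% 2^W) %/% 2^(W - m)                     # keep the high-order m bits
}

c(division_hash(123456), multiplication_hash(123456), multiply_shift_hash(12345))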

Cryptographic hash functions #

A cryptographic hash function is used for cryptography, secure communications, and various security protocols (authentication, digital signatures, etc).

Cryptographic hash functions act as “one-way functions”. Given the value of the function it is very hard to invert to find the corresponding input. To be secure, it should be very hard to find two distinct inputs with the same hash value.

Cryptographic hash functions have good collision properties, but they tend to produce long bit strings and they tend to be rather slow to compute.

Commonly used cryptographic hash functions: SHA-2 and SHA-3.

Rolling Hash Function #

A rolling hash function allows easy updating of the hash value as new input arrives. It maintains a window over the input and can cheaply add a character at one end and remove one from the other.

For example: \[ h_k(c) = (c_1 a^{k-1} + c_2 a^{k-2} + \cdots + c_k a^0) \bmod M \] for a constant a and input characters \(c_1, \dotsc, c_k\). Removing and adding the end terms “shifts the window.”

Similarly, given a hash function on characters, we can do \[ h( c) = \mathrm{shift}(h(c_1), k-1) \,\mathrm{xor}\, \cdots \,\mathrm{xor}\, \mathrm{shift}(h(c_k), 0) \] with similar effect.
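A minimal R sketch of the first (polynomial) scheme, in the spirit of Rabin-Karp; the constants a, M, and k are illustrative choices.

a <- 256; M <- 1000003; k <- 5
a_k1 <- 1
for (i in seq_len(k - 1)) a_k1 <- (a_k1 * a) %% M   # a^(k-1) mod M, precomputed

hash_window <- function(chars) {                    # chars: k integer character codes
    h <- 0
    for (ch in chars) h <- (h * a + ch) %% M
    h
}

roll <- function(h, old_char, new_char) {
    h <- (h - old_char * a_k1) %% M                 # remove the outgoing character
    (h * a + new_char) %% M                         # shift the window, add the new character
}

codes <- utf8ToInt("abcdef")
h1 <- hash_window(codes[1:5])                       # hash of "abcde"
h2 <- roll(h1, codes[1], codes[6])                  # hash of "bcdef", updated in O(1)
identical(h2, hash_window(codes[2:6]))              # TRUE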

Universal Hashing and Other Guarantees #

Any single hash function can be “beaten” with the wrong inputs. One approach to mitigating this is to select a random hash function (or more than one) from a large family of functions that gives useful guarantees.

(We will need this for Locality-Sensitive Hashing and other statistical applications.)

A family \(\mathcal{H}\) of hash functions mapping to \(M\) values is said to be universal if, for \(x \ne y\), \[ P\{ h(x) = h(y) \} \le 1/M \] for \(h\) chosen uniformly from the family \(\mathcal{H}\).

The family is near-universal if 1/M is replaced by c/M for some constant c.

Example: A near-universal family.

Let \(p > M\) be prime and \(a \in \{1,\dotsc,p-1\}\). Then \[ h_a(x) = (a x \bmod p) \bmod M \] is near-universal with c = 2.

A universal family: Modifying the above, let \(a \in \{1,…,p-1\}\) and \(b \in \{0,…,p-1\}\). Then \[ h_{ab}(x) = ((a x + b) \bmod p) \bmod M \] is universal.

Note that we often want stronger properties from our family, such as pairwise or higher-order (k-wise) independence and uniformity. These can sometimes be achieved.
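A quick R sketch of drawing a random member of the universal family above. The prime p must exceed the largest key; here we keep p small (under 10^6) so that a*x stays exactly representable in R doubles.

make_universal_hash <- function(M, p = 999983) {     # p prime, keys assumed in 0..p-1
    a <- sample(1:(p - 1), 1)
    b <- sample(0:(p - 1), 1)
    function(x) ((a * x + b) %% p) %% M
}

h <- make_universal_hash(M = 1024)
h(123456)    # a bucket index in 0..1023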

Hash Tables (aka Dictionaries, Maps, Associative Arrays) #

A hash table (a.k.a. hash, hashmap, map, dictionary, associative array) is a data structure for associating arbitrary values with (almost) arbitrary keys.

We need to support three principal operations:

  • Lookup(hash-table, object)
  • Insert(hash-table, object)
  • Remove(hash-table, object)

We will use a hash function hash() in ways described below.

An Analogy #

Consider a simple method of accessing a collection of objects. We assign each an integer key in the range 0..M-1 and store the objects in an array of size M.

To find an object, we access the array at its key index; to remove it, we clear the array at that index. And so forth.

This is fine as far as it goes, but what if:

  • the number of potential keys is very large,
  • the number of stored objects is relatively small,
  • the objects are not easily mappable to integers.

Then, using an array directly like this will be impractical, inefficient, or both.

Instead, in hashing, we derive a key from the object and use that to access the object.

We start with a universe U of possible objects and a hash function h that maps U to the range 0..M-1.

For a value u, h(u) is called the hash value (or sometimes hash code, hash key, or similar).

There are various ways to store and access an object based on this key.

Chaining #

In chaining, we use the hash value as an array index, but instead of storing objects at that index, we store a list of objects. (The array index is commonly called a bucket; the list of objects is often called a chain.)

When there are no objects for a key, the list is empty. Otherwise, we “chain” the objects in a linked list:

[Figure: chained buckets in a hash table (/ox-hugo/chaining.svg)]

The operations are:

  • Lookup(hash-table, object): find the bucket, use linear search to find the object+data
  • Insert(hash-table, object, data): if not in the list, add object+data to the head of the list
  • Remove(hash-table, object): unlink from the chain.

If the hash function “randomizes” the keys sufficiently, most of the chains will be short, and lookup will be fast. But a bad hash function – one with many collisions – will lead to long chains and search that is no faster (even slower) than a simple linear search.

The hash function is the essential ingredient; we tend to use heuristics here. The performance of chaining depends on the hash function and the load factor, the average number of objects per bucket.

Exercise: Assume you have a function hash() to compute hash values. Write simple versions of lookup(), insert(), and remove() in a language of your choice for a hash table of strings.

make_hash_table <- function(size) {
    vector("list", size)    # each slot holds a bucket: a named list of key/value pairs
}

bucket_index <- function(hash_table, key) {
    (hash(key) %% length(hash_table)) + 1    # map the hash value into 1..size
}

lookup <- function(hash_table, key, default = NA) {
    bucket <- hash_table[[bucket_index(hash_table, key)]]
    if ( !is.null(bucket) && !is.null(bucket[[key]]) ) {
        bucket[[key]]
    } else {
        default
    }
}

insert <- function(hash_table, key, value) {
    index <- bucket_index(hash_table, key)
    bucket <- hash_table[[index]]
    bucket[[key]] <- value              # add (or overwrite) the entry for this key
    hash_table[[index]] <- bucket       # R copies on modify, so return the updated table
    hash_table
}

remove <- function(hash_table, key) {
    index <- bucket_index(hash_table, key)
    bucket <- hash_table[[index]]
    if ( !is.null(bucket) && !is.null(bucket[[key]]) ) {
        bucket[[key]] <- NULL                  # unlink the entry from its chain
        hash_table[index] <- list(bucket)      # single-bracket assignment keeps the slot
    }
    hash_table
}
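To try this out we need the hash() the exercise assumes; here is a hypothetical stand-in for string keys, together with a short usage example. Note that since R copies the table on modification, insert() and remove() return the updated table.

hash <- function(key) {                     # hypothetical stand-in; use a real hash in practice
    h <- 0
    for (ch in utf8ToInt(key)) h <- (h * 31 + ch) %% (2^31 - 1)
    h
}

ht <- make_hash_table(64)
ht <- insert(ht, "alice", 3)
ht <- insert(ht, "bob", 7)
lookup(ht, "alice")    # 3
lookup(ht, "carol")    # NA
ht <- remove(ht, "bob")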

(Extra) Other Methods for Organizing Hash Tables #

  • Open Addressing

    Chaining is a simple idea and is often effective, but it is not the only choice. In modern architectures, locality of reference can dominate performance, depending on whether the items referenced fit in fast cache memory.

    In open addressing, we store the objects in the hash table itself. We systematically search the table for an object, starting at the index determined by that object's hash value. We then probe the table by traversing a specific sequence of slots that eventually covers the entire table.

    If the hash function is h(x), then write the position after k probes as h(x,k).

    Methods:

    • Linear Probing: \( h(x,k) = (h(x) + k) \bmod M \)
    • Quadratic Probing: \( h(x,k) = (h(x) + a k + b k^2) \bmod M \)
    • Double Hashing: \( h(x,k) = (h(x) + g(x) k) \bmod M \) for another hash function g()

    These methods trade off the locality of their references with their tendency to cluster the full positions in the table.

    For lookup, we probe until we either find the object or an empty slot. For insertion, we do the same and put the object in the first empty slot if it is not already present. The remove operation requires some care here. (Why?)

  • Cuckoo Hashing

    Like open addressing but uses a different method to resolve collisions. Instead of probing as in open addressing, we use two hash functions to associate two indices with each object.

    • On lookup, search for the object in its two indices (based on the two hash functions).

    • On insertion, examine the first index for the object. If it is empty, store the object. Otherwise, “bump” the object that is there to the bumped object’s alternative location. This bumping continues until an empty slot is found.

      If no empty slot is found and the algorithm starts to cycle, the table is rebuilt using two new hash functions (randomly selected from a family, say).

    • Deletion is handled directly.

    It guarantees a worst-case constant time lookup because only two locations need to be checked. Insertion also performs well (on average, amortized over many operations) as long as the table is not too full (<< 50%).

  • Tabulation Hashing

    Partition the input object into a sequence of chunks of a specified size (e.g., bytes or words).

    Create a lookup table T that contains uniformly random values of the chunk size.

    If object x = x_1 x_2 … x_n, compute \[ h(x) = T[x_1] \,\mathrm{xor}\, T[x_2] \,\mathrm{xor}\, \cdots \,\mathrm{xor}\, T[x_n]. \] This generates a universal family of hash functions with constant expected time per operation.
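A small R sketch of tabulation hashing for 4-byte keys. The table entries here are random 16-bit values (rather than full words) so that R's integer xor applies directly; this is an illustration, not a tuned implementation.

set.seed(1)
T_table <- matrix(sample(0:(2^16 - 1), 4 * 256, replace = TRUE), nrow = 4)

tabulation_hash <- function(bytes) {                    # bytes: 4 integers, each in 0..255
    h <- 0L
    for (i in seq_along(bytes)) {
        h <- bitwXor(h, T_table[i, bytes[i] + 1])       # xor the table entry for chunk i
    }
    h
}

tabulation_hash(c(17, 2, 255, 64))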

Perfect Hashing #

A perfect hash function maps a set of items \( W \to [0..m) \) with \( m \ge \#W \), where the function can be computed in \(O(1)\) time.

This is useful when you have a fixed set of keys (e.g., keywords in a language).

There is an optimal algorithm that produces a minimal perfect hash function in \( O(\#W) \) time (Czech, Havas, Majewski 1992).

  1. Given an undirected, acyclic graph with \( m \) edges and \( n \) nodes, we find a function \( g :\, V \to [0..m) \) where \[ h((u,v)) = (g(u) + g(v)) \bmod m \]

  2. Choose auxiliary functions \( f_1 \) and \( f_2 \) on items (strings)

    \begin{align*} f_1(w) &= \left(\sum_{i=1}^{\#w} T_1(i, w[i])\right) \bmod n\\ f_2(w) &= \left(\sum_{i=1}^{\#w} T_2(i, w[i])\right) \bmod n, \end{align*}

    where \( T_i \)’s are tables of random integers for each position and character in a word.

  3. Choose \( n \) to make the graph acyclic.

  4. Map an item to \(h\) on the edge \((f_1(w), f_2(w))\).

Statistical Applications of Hashing #

  • Locality Sensitive Hashing
  • Feature Hashing
  • MinHash for Set Comparison
  • String Matching
  • Approximate Nearest Neighbor Search
  • Streaming (see below)

Other Applications of Hashing #

  • File Comparison and Verification
  • Cryptographic Protocols
  • Signatures/Fingerprints
  • Perfect Hashing

Data Streaming Algorithms #

A Puzzle: Missing Numbers #

Paul feeds numbers to Carole one at a time. The numbers are integers from a permutation of 1..n with one number missing. Carole needs to detect which number is missing.

Carole could just memorize all the numbers, but if n is large this is infeasible. Assume that she can use O(log n) bits of memory. How can she do it?

What if n is not known? What if there are two numbers missing?

This is a template of a data streaming problem:

  • We get information sequentially
  • The size of the whole data set is very large
  • We have limited memory

Here we will use sample problems at Twitter by P. Oscar Boykin to motivate several streaming algorithms that illustrate the possibilities.

Testing Membership: Bloom Filters #

Problem: Twitter wants to show interesting tweets for every user. Suppose there are > 10^8 users and > 10^8 tweets per day.

Storing this as a Map u (Set t) (a map from each user to a set of tweets) is challenging: it is huge, and costly to transfer (e.g., between servers) and to update.

Alternative: Keep an approximate set

Type: Bloom t

If x: Bloom t, then x.contains: t -> No | Possibly

No false negatives but probability of a false positive is > 0.

Setup #

We choose integer parameters m and k.

  • We represent the set by an m-bit array a
  • We use k (independent) hash functions \(h_1, h_2, \dotsc, h_k\) that each map items into [0..m).

To write item into the set, we compute h_1(item), h_2(item), ..., h_k(item).

a[h_i(item)] |= 1 for all i in 1..k

To read item, compute

And(a[h_1(item)], ..., a[h_k(item)])

Note: both the write (bitwise or) and read (bitwise and) operations are monoidal! This means they are parallelizable, which matters when n is large.

We can tune the false-positive probability with the choice of m and k (relative to the set size n).

False positive probability \(\approx (1 - e^{-kn/m})^k\), optimized at \(k = \ln(2) (m / n)\) giving probability \(\approx 0.6185^{m/n}\).
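A minimal Bloom filter sketch in R. The k hash functions are derived from two simple base hashes via g_j = h1 + j h2 (see the note after the variants below); the base hashes are illustrative string hashes, not production quality.

m <- 2^16; k <- 7
bloom <- logical(m)                         # the m-bit array, all FALSE

string_hash <- function(s, mult, M) {
    h <- 0
    for (ch in utf8ToInt(s)) h <- (h * mult + ch) %% M
    h
}

bloom_indices <- function(item) {
    h1 <- string_hash(item, 31, m)
    h2 <- string_hash(item, 131, m - 1) + 1   # keep h2 nonzero
    ((h1 + (1:k) * h2) %% m) + 1              # k bucket indices, 1-based
}

bloom_add <- function(bloom, item) {
    bloom[bloom_indices(item)] <- TRUE        # write: or the k bits in
    bloom
}

bloom_contains <- function(bloom, item) {
    all(bloom[bloom_indices(item)])           # read: and of the k bits
}

bloom <- bloom_add(bloom, "user42:tweet17")
bloom_contains(bloom, "user42:tweet17")       # TRUE
bloom_contains(bloom, "user42:tweet99")       # FALSE with high probability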

Many Variants and Advances #

  • Bloomier Filter (represents set of functions on set)
  • Cuckoo Filter
  • Compressed and Retouched Bloom Filters
  • State Machine Bloom Filter

Note that we can construct the k hashes from just two hash functions, via \(g_j = h_1 + j\, h_2\).

Cardinalities: HyperLogLog #

Problem: How many unique users on Twitter take each pair of possible actions?

Users: > 10^8. Actions: follow user y, look at tweet x, …

Users × actions is a huge set.

Alternative: Keep sets with approximate cardinalities

Setup #

Use a hash function h.

We divide our stream of data items into \(m = 2^p\) substreams of roughly equal size using the first p bits of the hash value.

In each substream, we look at the remaining bits of the hash (from bit p+1 on) and compute the position of the leading 1, i.e., the length of the leading 0…01 pattern. Call this L(item). (Since each extra leading zero is half as likely, the largest L observed behaves like log2 of the substream's cardinality.)

We store these numbers in an array M of m registers; specifically, for substream i we set M[i] = max(M[i], L(item)).

The cardinality estimate is a normalized, bias-adjusted harmonic mean: \[ C = \alpha_m m^2 \left( \sum_{j=1}^m 2^{-M[j]} \right)^{-1}. \]

Small adjustments:

  • Initialize registers to 0, not to the unit of the max monoid
  • Change the estimate for small cardinalities
  • Use more bits per register when the set size approaches the limit of the hash range

Note: again we have a read and a write monoid. The read monoid is the harmonic sum; the write monoid is maximum. Essential for good performance in practice.

The relative error is approximately \(1.04/\sqrt{m}\).

Many practical improvements and ready-made implementations are available.
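A rough R sketch of the update and estimate steps. It is a toy stand-in: it works with an idealized hash mapped to [0, 1) rather than explicit bits, and the string hash below is only a placeholder for a real 32-bit hash such as Murmur3.

p <- 8; m <- 2^p                                   # m = 256 registers

hash_unit <- function(s) {                         # toy stand-in, maps strings to [0, 1)
    h <- 0
    for (ch in utf8ToInt(s)) h <- (h * 69069 + ch + 1) %% (2^31 - 1)
    h / (2^31 - 1)
}

hll_add <- function(registers, item) {
    u   <- hash_unit(item)
    idx <- floor(u * m) + 1                        # "first p bits": the substream index
    rem <- u * m - floor(u * m)                    # "remaining bits", as a fraction
    L   <- floor(-log2(rem + 2^-32)) + 1           # position of the leading 1
    registers[idx] <- max(registers[idx], L)       # write monoid: max
    registers
}

hll_estimate <- function(registers) {
    alpha <- 0.7213 / (1 + 1.079 / m)              # bias correction (for m >= 128)
    alpha * m^2 / sum(2^(-registers))              # read: normalized harmonic sum
}

regs <- integer(m)
for (w in as.character(1:5000)) regs <- hll_add(regs, w)
hll_estimate(regs)    # estimate of the number of distinct items (rough, given the toy hash)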

Counters: CountMin Sketch #

Problem: How many tweets did each user make in each hour?

10s of trillions of keys in this query!

Alternative: keep an approximate Map k Count.

Type CMS k has get: CMS k -> k -> Approx[Count]

The count returned is an upper bound, but we have some control over the error.

Setup #

We have k hash functions on [0..m).

At time \(t\), add \((i, c)\) means

CMS[j, h_j(i)] += c for all 1 <= j <= k

Point query: estimate the value of an entry a_i.

\(\hat a_i = \min_{1 \le j \le k} CMS[j, h_j(i)]\)

Other queries are possible: range queries, …
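A minimal CountMin sketch in R, with k rows of m counters and one hash per row drawn from a universal-style family; keys are assumed to be integers smaller than the prime p, and the constants are illustrative.

k <- 4; m <- 2^12; p <- 999983                       # p prime, larger than any key
set.seed(2)
a <- sample(1:(p - 1), k); b <- sample(0:(p - 1), k) # one (a, b) pair per row

CMS <- matrix(0, nrow = k, ncol = m)

row_hash <- function(j, key) (((a[j] * key + b[j]) %% p) %% m) + 1   # 1-based column

cms_add <- function(CMS, key, count = 1) {
    for (j in 1:k) {
        col <- row_hash(j, key)
        CMS[j, col] <- CMS[j, col] + count           # update every row
    }
    CMS
}

cms_query <- function(CMS, key) {
    min(sapply(1:k, function(j) CMS[j, row_hash(j, key)]))   # point query: take the min
}

CMS <- cms_add(CMS, 123456, 3)
CMS <- cms_add(CMS, 123456, 2)
cms_query(CMS, 123456)    # 5 (an upper bound on the true count)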