Parallel Programming

— Alex Reinhart

In the olden days, computer processors could only do one thing at a time. A program consisted of a sequence of instructions – individual small operations for the CPU to carry out, such as adding two numbers or storing a result in memory – and the processor did these one at a time. (With various complicated exceptions; take a computer architecture class to learn more.) This, of course, is a problem if you want to do more than one thing at a time, such as playing music while also writing a report. How can the processor both decode the MP3 file and display your text in Word, apparently at the same time?

The answer is multitasking. The operating system lets programs run for a few milliseconds at a time, interrupts them, and then lets a different program run. A typical computer might have tens or even hundreds of simultaneous programs (web browsers, music player, messaging apps, various system processes, …) and, when many of them need to do work at the same time, switches between them tens or hundreds of times per second.

But modern processors often have multiple cores: multiple units that can simultaneously execute separate instructions. My laptop, for instance, has four cores, and it’s not uncommon for servers and workstations to have eight, sixteen, or even more.

How do we take advantage of this to make our programs faster? The general technique is called parallelism, because it involves running multiple things “in parallel”.

Embarrassing Parallelism #

The simplest and easiest form of parallelism is embarrassing parallelism. A problem is called embarrassingly parallel if it can be trivially split up into multiple pieces that can be done simultaneously with no communication between them.

For example:

Embarrassing: You have a database with 10 million rows. You need to apply a function to every row to get 10 million results. You use the same function for every row, and it takes that row as input and doesn’t need data from any other row. In principle, if you had a million CPU cores, each of them could do the calculation for 10 rows and you’d get the result a million times faster than if you had 1 CPU core.

Not embarrassing: You build a weather model that works by dividing up the Earth’s surface into a grid of millions of small squares, then simulating air flow, cloud formation, solar heating, and all the other things that go on to affect weather. Because events in each grid cell affect events in neighboring grid cells, you can’t just split up the calculation naively – if each grid cell is being calculated by a separate CPU, they need to communicate with each other about the results.

Another example: You want to do breadth-first search on a graph. Each node will be processed in some complicated way. You could parallelize this, but the queue of nodes to traverse must be shared among all the parallel workers.

Embarrassingly parallel problems are easier to solve. Non-embarrassing problems can be solved, and often are – but it takes more careful algorithm design and often a fundamentally different approach.

Fortunately, many statistical problems are embarrassing: simulation studies, bootstrapping, and cross-validation, for instance, all amount to repeating the same independent calculation many times.

In Practice #

The simplest embarrassingly parallel problem is mapping: that is, the map function we described in functional programming. It applies a function to every element of a collection and collects the results.

For small collections and fast functions, doing the work in parallel may not be worth the cost. Doing it in parallel requires firing up separate copies of your program, splitting up the data, sending it to those copies, waiting for them to return results, and assembling the results in the right way. That takes time, and it only pays off if the task itself is slow enough that the savings outweigh those extra few hundred milliseconds of overhead.

R #

The parallel package is built into R. It provides parallel mapping by creating variants of the apply functions. For example:

library(parallel)

cl <- makeCluster(4) # makes 4 worker processes

foo <- parSapply(cl, big_honking_data, some_function)

# there are also parLapply, parApply, clusterMap, ... functions

stopCluster(cl)

parSapply splits big_honking_data into chunks, sends each chunk to a separate worker process, and has each worker apply some_function to every element of its chunk before the results are assembled back in order.

makeCluster can also create worker processes on other machines. For example, you could log into one server and start worker processes on several other servers, so the work gets distributed among them.
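A minimal sketch, assuming two machines you can reach by SSH (the hostnames here are hypothetical):

# one worker is started on each named machine, via SSH
cl <- makeCluster(c("server1.example.com", "server2.example.com"))
foo <- parSapply(cl, big_honking_data, some_function)
stopCluster(cl)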

Python #

In Python, the multiprocessing module is included in the standard library. It can create multiple separate Python processes, each running code on a separate chunk of your data.

The module supports various advanced features, such as synchronization and shared memory between processes. But the simplest thing is just Pool.map:

import os

from multiprocessing import Pool

with Pool(os.cpu_count()) as p:
    foo = p.map(some_function, big_honking_data)

This makes a “pool” of worker processes. (The with statement uses the pool as a “context manager”, which ensures those processes are shut down once you leave the with block.) p.map applies the function to the data and gathers the results, just like map, except in parallel.
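One caveat not shown above: on platforms where multiprocessing starts workers by “spawning” a fresh Python interpreter rather than forking (Windows, and macOS in recent Python versions), each worker re-imports your script, so the pool must be created under an if __name__ == "__main__" guard. A minimal sketch, with some_function and the data as stand-ins for your own code:

from multiprocessing import Pool

def some_function(x):
    return x ** 2   # stand-in for your real per-element work

if __name__ == "__main__":
    big_honking_data = range(10_000)   # stand-in for your real data
    with Pool() as p:
        foo = p.map(some_function, big_honking_data)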

Caution! #

If your task involves using random numbers, beware: Naive ways of parallelizing it may result in each parallel process using identical streams of random numbers.

Suppose, for example, you’re doing bootstrapping. Your task is to randomly resample the data, fit your model, record the results, and repeat this 1000 times.

On Linux, multiprocessing produces worker processes by default by “forking”: by making a clone of the parent Python process. That clone has all the data, but it also has the random seed! So each worker process has the same random seed, generates the same sequence of random numbers, and hence bootstraps in identical ways. That would be bad. You should check that whatever source of randomness you’re using (NumPy’s random sampling, something from another package, Python’s built-in random module, whatever) gets a fresh seed in each worker process.
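With NumPy, for example, one way to do this (sketched below with made-up data and a made-up one_bootstrap function) is to spawn independent child seeds with SeedSequence and hand one to each bootstrap replicate:

import numpy as np
from multiprocessing import Pool

def one_bootstrap(args):
    seed, data = args
    rng = np.random.default_rng(seed)   # independent generator per task
    resample = rng.choice(data, size=len(data), replace=True)
    return resample.mean()              # stand-in for fitting your model

if __name__ == "__main__":
    data = np.random.default_rng(0).normal(size=500)      # made-up data
    seeds = np.random.SeedSequence(2024).spawn(1000)       # 1000 independent seeds
    with Pool() as p:
        boots = p.map(one_bootstrap, [(s, data) for s in seeds])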

In R, see section 6 of vignette("parallel").
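With the parallel package, one fix is clusterSetRNGStream, which gives each worker its own independent random number stream. A minimal sketch with made-up data:

library(parallel)

x <- rnorm(500)   # made-up data

cl <- makeCluster(4)
clusterSetRNGStream(cl, iseed = 2024)   # separate RNG stream for each worker
clusterExport(cl, "x")                  # copy the data to the workers
boots <- parSapply(cl, 1:1000, function(i) mean(sample(x, replace = TRUE)))
stopCluster(cl)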