In the olden days, computer processors could only do one thing at a time. A program consisted of a sequence of instructions – individual small operations for the CPU to carry out, such as adding two numbers or storing a result in memory – and the processor did these one at a time. (With various complicated exceptions; take a computer architecture class to learn more.) This, of course, is a problem if you want to do more than one thing at a time, such as playing music while also writing a report. How can the processor both decode the MP3 file and display your text in Word, apparently at the same time?
The answer is multitasking. The operating system lets programs run for a few milliseconds at a time, interrupts them, and then lets a different program run. A typical computer might have tens or even hundreds of simultaneous programs (web browsers, music player, messaging apps, various system processes, …) and, when many of them need to do work at the same time, switches between them tens or hundreds of times per second.
But modern processors often have multiple cores: multiple units that can simultaneously execute separate instructions. My laptop, for instance, has four cores, and it’s not uncommon for servers and workstations to have eight, sixteen, or even more.
How do we take advantage of this to make our programs faster? The general technique is called parallelism, because it involves running multiple things “in parallel”.
Embarrassing Parallelism #
The simplest and easiest form of parallelism is embarrassing parallelism. A problem is called embarrassingly parallel if it can be trivially split up into multiple pieces that can be done simultaneously with no communication between them.
For example:
- Embarrassing: You have a database with 10 million rows. You need to apply a function to every row to get 10 million results. You use the same function for every row, and it takes that row as input and doesn’t need data from any other row. In principle, if you had a million CPU cores, each of them could do the calculation for 10 rows and you’d get the result a million times faster than if you had 1 CPU core. (See the sketch below.)
- Not embarrassing: You build a weather model that works by dividing up the Earth’s surface into a grid of millions of small squares, then simulating air flow, cloud formation, solar heating, and all the other things that go on to affect weather. Because events in each grid cell affect events in neighboring grid cells, you can’t just split up the calculation naively: if each grid cell is being calculated by a separate CPU, they need to communicate with each other about the results.
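To make “no communication” concrete, here is a minimal Python sketch of the database-rows example. (`process_row` and the data are made up for illustration; the real function and rows would come from your problem.)

```python
# Each row's result depends only on that row, so the data can be split
# into chunks that are processed completely independently.
def process_row(row):
    return row * 2                   # hypothetical stand-in for the real work

def process_chunk(chunk):
    return [process_row(row) for row in chunk]

rows = list(range(100))              # stand-in for the 10 million rows
n_workers = 4
size = len(rows) // n_workers
chunks = [rows[i:i + size] for i in range(0, len(rows), size)]

# Each of these calls could run on a separate core; none of them needs
# any other chunk's results, so no communication is required.
results = [r for chunk in chunks for r in process_chunk(chunk)]
```

Splitting the input and concatenating the outputs is the only coordination needed, which is exactly what makes the problem embarrassing.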
Another example that is not embarrassing: You want to do breadth-first search on a graph. Each node will be processed in some complicated way. You could parallelize this, but the queue of nodes to traverse must be shared among all the parallel workers.
Embarrassingly parallel problems are easier to solve. Non-embarrassing problems can be solved, and often are – but it takes more careful algorithm design and often a fundamentally different approach.
Fortunately, many statistical problems are embarrassing.
In Practice #
The simplest embarrassingly parallel problem is mapping: that is, the `map` function we described in functional programming. It applies a function to every element of a collection and collects the results.
For small collections and fast functions, doing the work in parallel may not be worth the cost. Parallelizing requires firing up separate copies of your program, splitting up the data, sending it to those copies, waiting for them to return results, and assembling the results in the right order. That takes time, and it only pays off when the task is slow enough that the extra few hundred milliseconds of overhead are worth it.
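A cheap sanity check is to time the serial version first; if it already finishes in milliseconds, parallelism won’t help. A rough sketch, with hypothetical stand-ins for the function and data:

```python
import time

def some_function(x):                # hypothetical stand-in for the real work
    return x ** 2

big_honking_data = range(1_000_000)

# If the serial loop is already fast, the overhead of starting workers
# and shipping data back and forth will swamp any parallel speedup.
start = time.perf_counter()
results = [some_function(x) for x in big_honking_data]
print(f"serial time: {time.perf_counter() - start:.3f} s")
```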
R #
The `parallel` package is built into R. It provides parallel mapping by creating variants of the `apply` functions. For example:
```r
library(parallel)

cl <- makeCluster(4)  # makes 4 worker processes
foo <- parSapply(cl, big_honking_data, some_function)
# there are also parLapply, parApply, clusterMap, ... functions
stopCluster(cl)
```
`parSapply` splits `big_honking_data` into chunks, sends each chunk to a separate worker process, and has each process run `some_function` on its own chunk.
`makeCluster` also knows how to make worker processes on other machines. For example, you could log into one server and create worker processes on several other servers, so the work gets distributed between them.
Python #
In Python, the `multiprocessing` module is included in the standard library. It can create multiple separate Python processes, each running code on a separate chunk of your data.
The module supports various advanced features, such as synchronization and shared memory between processes. But the simplest thing is just `Pool.map`:
```python
import os
from multiprocessing import Pool

with Pool(os.cpu_count()) as p:
    foo = p.map(some_function, big_honking_data)
```
This makes a “pool” of worker processes. (`with` creates a “context manager”, i.e. it ensures those processes are shut down once you leave the `with` block.)
`p.map` applies the function to the data and gathers the results, just like `map`, except in parallel.
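One practical wrinkle: on Windows and macOS, `multiprocessing` starts each worker by launching a fresh Python interpreter that re-imports your script, so the `Pool` setup must be guarded by `if __name__ == "__main__":` or the workers will try to create pools of their own. A complete runnable sketch, with a hypothetical `some_function` and stand-in data:

```python
import os
from multiprocessing import Pool

def some_function(x):
    return x ** 2                    # hypothetical stand-in for the real work

if __name__ == "__main__":
    # This guard matters on platforms that "spawn" workers (Windows, macOS):
    # each worker re-imports this file, and the guard keeps it from
    # re-running the pool-creation code.
    big_honking_data = range(100)    # stand-in data
    with Pool(os.cpu_count()) as p:
        foo = p.map(some_function, big_honking_data)
    print(foo[:5])
```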
Caution! #
If your task involves using random numbers, beware: Naive ways of parallelizing it may result in each parallel process using identical streams of random numbers.
Suppose, for example, you’re doing bootstrapping: your task is to randomly resample the data, fit your model, record the results, and repeat 1000 times.
On Linux, `multiprocessing` produces worker processes by “forking” by default: by making a clone of the parent Python process. That clone has all the data, but it also has the random seed! So each worker process will have the same random seed and hence generate the same sequence of random numbers, and hence each worker process will bootstrap in identical ways. That would be bad. You should check that whatever source of randomness you’re using (NumPy’s random sampling, something from another package, the built-in Python functions, whatever) generates a new seed for each worker process.
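With NumPy, one way to handle this (a sketch, not the only approach) is to draw a child seed for each task from a single `SeedSequence` in the parent, so every bootstrap replicate gets its own independent generator no matter how the workers were created:

```python
import os
from multiprocessing import Pool

import numpy as np
from numpy.random import SeedSequence, default_rng

def bootstrap_once(seed):
    # Each task builds its own generator from its own seed, so forked
    # workers no longer share random state.
    rng = default_rng(seed)
    data = np.arange(100)                    # stand-in for the real dataset
    resample = rng.choice(data, size=data.size, replace=True)
    return resample.mean()                   # stand-in for fitting a model

if __name__ == "__main__":
    ss = SeedSequence(2024)                  # one master seed, for reproducibility
    child_seeds = ss.spawn(1000)             # 1000 independent child seeds
    with Pool(os.cpu_count()) as p:
        estimates = p.map(bootstrap_once, child_seeds)
```

NumPy’s documentation recommends this spawning pattern for parallel generation; the key point is that the seeds are created in the parent, so it doesn’t matter what random state the workers happened to inherit.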
In R, see section 6 of `vignette("parallel")`.