Cython is an optimizing compiler for Python. It turns Python code into C code which can be compiled into highly efficient native code, provided you do a tiny bit of extra work to annotate variable types.
Cython also makes it easy to call C or C++ libraries, so if you need Python to call an external package, Cython may be the way to go. (cffi is a simpler way, if you just need to call a few C functions.)
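To give a sense of how little boilerplate that takes, here’s a minimal sketch – not from the project below – that wraps sqrt from the C math library:

cdef extern from "math.h":
    double sqrt(double x)

def c_sqrt(double x):
    # Calls straight into libm; no Python-level overhead inside.
    return sqrt(x)

Once the module is compiled (see “Building Cython code” below), this is an ordinary import-and-call from Python.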
Cython is useful when you’ve done extensive profiling to find bottlenecks in your code. Intensive loops and calculations can be factored out into Cython and easily made fast.
Examples
Here’s some real Python code from my project:
def intensity_grid(xs, ys, xy, Ms, ts, alpha, theta, omega, sigma2, eta2,
                   T, t, min_dist2, min_t):
    tents = np.empty((xs.shape[0], ys.shape[0]))
    for ii in range(xs.shape[0]):
        for jj in range(ys.shape[0]):
            tents[ii, jj] = intensity_at(xs[ii], ys[jj], xy, Ms, ts, alpha,
                                         theta, omega, sigma2, eta2, T, t,
                                         min_dist2, min_t)
    return tents
The grid is often very big, and every intensity_at call requires a sum over every crime – so this is a very slow and very expensive function. I’d like to speed it up, and parallelize it if possible. We can move it to a separate file, intensity.pyx, and start by annotating the variable types:
import numpy as np

def intensity_grid(double[:] xs, double[:] ys, double[:,:] xy, long[:] Ms,
                   double[:] ts, double[:] alpha, double[:] theta, double omega,
                   double sigma2, double eta2, double T, double t,
                   double min_dist2, double min_t):
    cdef int ii, jj
    cdef double[:,:] tents = np.empty((xs.shape[0], ys.shape[0]))
    for ii in range(xs.shape[0]):
        for jj in range(ys.shape[0]):
            tents[ii, jj] = intensity_at(xs[ii], ys[jj], xy, Ms, ts, alpha,
                                         theta, omega, sigma2, eta2, T, t,
                                         min_dist2, min_t)
    return np.asarray(tents)
(I’ll assume intensity_at has been moved to the same file and is getting the same sort of treatment.)
Now the generated C code doesn’t need all sorts of expensive run-time type checks: the argument types are checked once, when the function is called, and the rest of the function runs as plain, highly efficient C.
We’re doing a lot of array accesses, and Python checks the bounds of every one to make sure we don’t step outside the array. But we know our indices are always in range, so we can annotate:
cimport cython  # makes the compiler directives below available

@cython.boundscheck(False)
@cython.initializedcheck(False)
@cython.wraparound(False)
def intensity_grid(double[:] xs, double[:] ys, double[:,:] xy, long[:] Ms,
                   ...):
This tells Cython not to check array bounds, not to check whether arrays are initialized before we use them, and not to support wraparound (negative) indexing, where A[-1] gets the last element of the array. This generates yet more efficient code.
One last bonus – Cython supports OpenMP, a framework for parallelizing code. This function is a prime candidate for parallelization, since each pass through the loops is independent of every other pass. It’s embarrassingly parallel. Let’s write the full parallel version:
cimport cython
from cython.parallel import prange
import numpy as np

@cython.boundscheck(False)
@cython.initializedcheck(False)
@cython.wraparound(False)
def intensity_grid(double[:] xs, double[:] ys, double[:,:] xy, long[:] Ms,
                   double[:] ts, double[:] alpha, double[:] theta, double omega,
                   double sigma2, double eta2, double T, double t,
                   double min_dist2, double min_t):
    cdef int ii, jj
    cdef double[:,:] tents = np.empty((xs.shape[0], ys.shape[0]))
    for ii in prange(xs.shape[0], nogil=True, schedule='static'):
        for jj in range(ys.shape[0]):
            tents[ii, jj] = intensity_at(xs[ii], ys[jj], xy, Ms, ts, alpha,
                                         theta, omega, sigma2, eta2, T, t,
                                         min_dist2, min_t)
    return np.asarray(tents)
prange is a Cython built-in function which acts like range, but evaluates in parallel. The nogil option tells Cython that I’m not going to use Python’s Global Interpreter Lock, which prevents multiple threads from accessing the Python interpreter simultaneously – here I’m promising to call only Cython code, so the loop body can run in parallel.
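That promise has a concrete shape: for the prange body to compile, intensity_at itself must be declared callable without the GIL. A hypothetical sketch of the signature it would need (the real body is the project’s and isn’t shown):

# Hypothetical signature only – 'nogil' marks the function as safe to call
# without holding the Global Interpreter Lock.
cdef double intensity_at(double x, double y, double[:,:] xy, long[:] Ms,
                         double[:] ts, double[:] alpha, double[:] theta,
                         double omega, double sigma2, double eta2,
                         double T, double t,
                         double min_dist2, double min_t) nogil:
    return 0.0  # placeholder body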
The schedule option chooses how work will be assigned to threads (on different CPU cores). I chose to have the work evenly divided up front, since each iteration should take about the same amount of time; other schemes split the work dynamically based on how long each iteration is taking and which threads are free.
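If the iterations did vary in cost, a dynamic schedule would be the better fit – a sketch, with a purely illustrative chunk size:

# 'dynamic' hands out chunks of iterations to threads as they free up,
# at the cost of a little scheduling overhead
for ii in prange(xs.shape[0], nogil=True, schedule='dynamic', chunksize=16):
    ...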
Now my code executes in parallel. A task which would have taken over an hour, largely inside intensity_at, now takes fifteen or twenty minutes (on a quad-core machine).
Notice I didn’t have to tell Cython that tents should be shared between threads; it deduced this automatically. It can also handle “reduction variables”:
def sum(double[:] big_array):
    cdef double s = 0.0
    cdef int ii
    for ii in prange(big_array.shape[0], nogil=True, schedule='static'):
        s += big_array[ii]
    return s
Here each thread adds up its portion of big_array, and the results are automatically summed together at the end, producing a nice parallel sum. This only works for simple operations, like +, which Cython can automatically figure out how to reduce.
Parallelization won’t work for variables that have to be shared for reading and writing by multiple threads simultaneously – that’s a much more difficult task, and one we’ll talk about more later.
Analyzing Cython performance
The only way Cython code can be fast is if Cython understands your data types well enough to generate efficient C. Cython provides a tool for inspecting how well it has done: cython -a example.pyx.
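For the file above, that looks like:

cython -a intensity.pyx    # writes intensity.c plus an annotated intensity.html

The HTML report shades each source line by how heavily it interacts with the Python interpreter – the yellower the line, the more Python API calls it makes – and clicking a line reveals the C code generated for it.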
Building Cython code
Cython code has to be compiled before you can use it. You can manually convert it to C and compile it with your favorite C compiler, like GCC:
cython example.pyx
gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing \
    -I/usr/include/python2.7 -o example.so example.c
This produces an example.so file, a shared library you can import into Python code with the normal import example statement.
Alternatively, you can use distutils, Python’s tool for building packages. You first create a file named setup.py that specifies what has to be compiled:
from distutils.core import setup
from Cython.Build import cythonize

setup(
    name = "An example module",
    ext_modules = cythonize('example.pyx')
)
Then you can just run python setup.py build_ext --inplace and your module will be Cythonized and compiled automatically for you. There are various options you can add if you need to link against other libraries or enable OpenMP; see the manual for details.
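As a sketch of what enabling OpenMP looks like with gcc or clang (MSVC uses /openmp instead of -fopenmp):

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

# pass the OpenMP flags through to the C compiler and linker
ext = Extension('intensity', sources=['intensity.pyx'],
                extra_compile_args=['-fopenmp'],
                extra_link_args=['-fopenmp'])

setup(
    name = "An example module",
    ext_modules = cythonize([ext])
)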
You can also build this step into a Makefile to automatically compile your Cython whenever necessary.
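A minimal Makefile sketch, assuming the setup.py above (recipe lines must start with a tab):

# rebuild the compiled module whenever the Cython source changes
intensity.so: intensity.pyx setup.py
	python setup.py build_ext --inplace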