Cython

— Christopher Genovese and Alex Reinhart

Cython is an optimizing compiler for Python. It turns Python code into C code which can be compiled into highly efficient native code, provided you do a tiny bit of extra work to annotate variable types.

Cython also makes it easy to call C or C++ libraries, so if you need Python to call an external package, Cython may be the way to go. (cffi is a simpler way, if you just need to call a few C functions.)

Cython is useful when you’ve done extensive profiling to find bottlenecks in your code. Intensive loops and calculations can be factored out into Cython and easily made fast.

Examples #

Here’s some real Python code from my project:

def intensity_grid(xs, ys, xy, Ms, ts, alpha, theta, omega, sigma2, eta2,
                   T, t, min_dist2, min_t):
    tents = np.empty((xs.shape[0], ys.shape[0]))

    for ii in range(xs.shape[0]):
        for jj in range(ys.shape[0]):
            tents[ii, jj] = intensity_at(xs[ii], ys[jj], xy, Ms, ts, alpha,
                                         theta, omega, sigma2, eta2, T, t,
                                         min_dist2, min_t)

    return tents

The grid is often very big, and every intensity_at call requires a sum over every crime – so this is a very slow and very expensive function. I’d like to speed it up, and parallelize it if possible. We can move it to a separate file, intensity.pyx, and start by annotating the variable types:

import numpy as np

def intensity_grid(double[:] xs, double[:] ys, double[:,:] xy, long[:] Ms,
                   double[:] ts, double[:] alpha, double[:] theta, double omega,
                   double sigma2, double eta2, double T, double t,
                   double min_dist2, double min_t):
    cdef int ii, jj
    cdef double[:,:] tents = np.empty((xs.shape[0], ys.shape[0]))

    for ii in range(xs.shape[0]):
        for jj in range(ys.shape[0]):
            tents[ii, jj] = intensity_at(xs[ii], ys[jj], xy, Ms, ts, alpha,
                                         theta, omega, sigma2, eta2, T, t,
                                         min_dist2, min_t)

    return np.asarray(tents)

(I’ll assume intensity_at has been moved to the same file and is getting the same sort of treatment.)

Now the generated C code doesn’t need all sorts of expensive type-checking operations – it checks the variable types at the beginning of the function and then generates highly efficient code for the rest.

We’re doing a lot of array accesses. Python checks the bounds on every access to make sure we don’t access out of bounds. But we know we’re not going out of bounds, so we can annotate:

@cython.boundscheck(False)
@cython.initializedcheck(False)
@cython.wraparound(False)
def intensity_grid(double[:] xs, double[:] ys, double[:,:] xy, long[:] Ms,
                   ...):

This tells Cython not to check array bounds, not to check if the array is initialized before we use it, and not to do wraparound indexing (i.e. A[-1] gets the last element of the array). This generates yet more efficient code.

One last bonus – Cython supports OpenMP, a framework for parallelization of code. This function is a prime candidate for parallelization, since each pass through the inner loop is separate from the other passes. It’s embarrassingly parallel. Let’s write the full parallel version:

@cython.boundscheck(False)
@cython.initializedcheck(False)
@cython.wraparound(False)
def intensity_grid(double[:] xs, double[:] ys, double[:,:] xy, long[:] Ms,
                   double[:] ts, double[:] alpha, double[:] theta, double omega,
                   double sigma2, double eta2, double T, double t,
                   double min_dist2, double min_t):
    cdef int ii, jj
    cdef double[:,:] tents = np.empty((xs.shape[0], ys.shape[0]))

    for ii in prange(xs.shape[0], nogil=True, schedule='static'):
        for jj in range(ys.shape[0]):
            tents[ii, jj] = intensity_at(xs[ii], ys[jj], xy, Ms, ts, alpha,
                                         theta, omega, sigma2, eta2, T, t,
                                         min_dist2, min_t)

    return np.asarray(tents)

prange is a Cython build-in function which acts like range, but evaluates in parallel. The nogil option tells Cython that I’m not going to use Python’s Global Interpreter Lock, which prevents multiple threads from accessing the Python interpreter simultaneously – here I’m promising to only call Cython code, so it can be called in parallel.

The schedule option chooses how work will be assigned to threads (on different CPU cores). I chose to have the work just evenly divided, since each iteration should take about the same amount of time; other schemes split dynamically based on how long each iteration is taking and which threads are free.

Now my code executes in parallel. A task which would have taken over an hour, largely inside intensity_at, now takes fifteen or twenty minutes (on a quad-core machine).

Notice I didn’t have to tell Cython that tents should be shared between threads; it deduced this automatically. It can also handle “reduction variables”:

def sum(double[:] big_array):
    cdef double s = 0.0

    for ii in prange(big_array.shape[0], nogil=True, schedule='static'):
        s += big_array[ii]

    return s

Here each thread adds up its portion of big_array and the results are automatically summed together at the end, producing a nice parallel sum. This only works for simple operations, like +, which Cython can automatically figure out how to reduce.

Parallelization won’t work for variables that have to be shared for reading and writing by multiple threads simultaneously – that’s a much more difficult task, and one we’ll talk about more later.

Analyzing Cython performance #

The only way Cython code can be fast is if it understands your data types well enough to generate efficient C. Cython provides tools for inspecting the generated C code with cython -a example.pyx.

Building Cython code #

Cython code has to be compiled before you can use it. You can manually convert it to C and compile it with your favorite C compiler, like GCC:

cython example.pyx

gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing \
    -I/usr/include/python2.7 -o example.so example.c

This produces an example.so file, a shared library you can import into Python code with the normal import example statement.

Alternately, you can use distutils, Python’s tool for building packages. You first create a file named setup.py that specifies what has to be compiled:

from distutils.core import setup
from Cython.Build import cythonize

setup(
    name = "An example module",
    ext_modules = cythonize('example.pyx')
)

Then you can just run python setup.py build_ext --inplace and your module will be Cythonized and compiled automatically for you. There are various options you can add if you need to call other libraries, like OpenMP; see the manual for details.

You can also build this step into a Makefile to automatically compile your Cython whenever necessary.