Build PyTorch Extensions with CUDA and CFFI

Deprecated warning: PyTorch 1.0+ replaced the old TH/THC tensor backend with ATen, and the torch.utils.ffi toolchain described here was removed along the way. If you target PyTorch 1.0 or newer, write your extensions with torch.utils.cpp_extension instead, see this GitHub repository for a working example. The article below documents the older CFFI-based approach and is kept for reference.

Python is one of the most popular languages for deep learning, but as an interpreted language it is slow at tight numerical loops. Most of the time this does not matter, because the heavy lifting happens inside libraries written in C and CUDA. It starts to matter the moment you need an operation that the framework does not provide out of the box.

Data augmentation is a good example. When you apply a custom transform to every image, every epoch, in pure Python, that work runs on the CPU and can easily become the bottleneck that leaves your GPU sitting idle, waiting to be fed. The fix is to move the hot path onto the GPU: implement the transform once in CUDA and call it from Python.

This is where the CFFI half of the title comes in. A Foreign Function Interface lets Python call functions that were compiled from C. PyTorch 0.4 shipped a helper, torch.utils.ffi (built on top of the CFFI library), that compiles a chunk of C into a module you can import like any other. By writing that C as a thin wrapper around a CUDA kernel, we end up with a Python-callable, GPU-accelerated operator. CUDA does the computation; CFFI provides the Python binding, and that combination is what this tutorial builds.

The running example is a color transform for data augmentation (per-channel color scaling plus brightness, contrast, and gamma), applied entirely on the GPU.

How the pieces fit together

Before writing any code, it helps to see how the parts relate. A CFFI CUDA extension is built in three layers, plus some build glue:

Python              __init__.py                    <- autograd-friendly wrapper
   |
   v
C interface (FFI)   augmentation_cuda.c / .h       <- unpacks tensors, calls the kernel
   |
   v
CUDA kernel         augmentation_cuda_kernel.cu / .h   <- the actual GPU computation

There are two reasons for splitting things this way. First, the FFI can only compile plain C, not CUDA, so the .cu file has to be compiled separately by nvcc and then linked in as a precompiled object; that is why the build later happens in two steps. Second, keeping the tensor handling (C) apart from the math (CUDA) keeps each file simple: the .c file never does arithmetic, and the .cu file never touches a PyTorch tensor.

The project layout looks like this:

├── build.py        # tells the FFI how to assemble the extension
├── build.sh        # compiles the .cu with nvcc, then runs build.py
├── __init__.py     # Python-facing API
├── build/          # nvcc drops the compiled .o here
└── src/
    ├── augmentation_cuda.c
    ├── augmentation_cuda.h
    ├── augmentation_cuda_kernel.cu
    └── augmentation_cuda_kernel.h

At runtime the call path is: Python → color_transform_cuda (C) → color_transform_cuda_kernel (CUDA host code) → color_augmentation (the device kernel that runs on every pixel). We will build it in that order, starting from the build script.

Create the Build Script

build.py is the recipe the FFI follows to turn our sources into an importable module. Note that it does not compile the CUDA itself; it compiles the C interface and links against the CUDA object that nvcc produced earlier (we get to that in the final step).

import os
import torch
import torch.utils.ffi

this_folder = os.path.dirname(os.path.abspath(__file__)) + '/'

Headers = []
Sources = []
Defines = []
Objects = []

if torch.cuda.is_available() == True:
    Headers += ['src/augmentation_cuda.h']
    Sources += ['src/augmentation_cuda.c']
    Defines += [('WITH_CUDA', None)]
    Objects += ['build/augmentation_cuda_kernel.o']

ffi = torch.utils.ffi.create_extension(
    name='_ext.augmentation',
    headers=Headers,
    sources=Sources,
    verbose=False,
    with_cuda=True,
    package=False,
    relative_to=this_folder,
    define_macros=Defines,
    extra_objects=[os.path.join(this_folder, Object) for Object in Objects]
)

if __name__ == '__main__':
    ffi.build()

A few things worth pointing out:

  • The Headers/Sources/Defines/Objects lists are populated only when a GPU is available, so the same extension can degrade gracefully on a CPU-only machine.
  • Adding ('WITH_CUDA', None) defines a WITH_CUDA macro, letting the C code guard its GPU paths with #ifdef WITH_CUDA.
  • extra_objects points at build/augmentation_cuda_kernel.o, the precompiled CUDA kernel. This is the link that joins the two halves of the build.
  • name='_ext.augmentation' is where the compiled module lands, which is why, later on, we import augmentation from the _ext folder.

Create the CUDA Interface

This file is the FFI boundary: the signature of color_transform_cuda here is exactly what becomes callable from Python. Its job is pure marshalling (turning PyTorch tensors into the raw pointers the kernel expects) and nothing more. We name it augmentation_cuda.c.

#include <THC/THC.h>

#include "augmentation_cuda_kernel.h"

// symbol to be automatically resolved by PyTorch libs
extern THCState *state;

int color_transform_cuda(THCudaTensor* input, THCudaTensor* output,
        float color_r, float color_g, float color_b, float brightness,
        float contrast, float gamma) {
    int nChannels = input->size[0];
    int height = input->size[1];
    int width = input->size[2];

    THCudaTensor_resize3d(state, output, nChannels, height, width);
    THCudaTensor_fill(state, output, 0);

    int success = 0;
    success = color_transform_cuda_kernel(
        THCudaTensor_data(state, input),
        THCudaTensor_data(state, output),
        nChannels, height, width,
        color_r, color_g, color_b, brightness, contrast, gamma);

    //Check for errors
    if ( !success ) {
        THError("aborting");
    }
    return 1;
}

What each part is doing:

  • extern THCState *state: in the TH/THC era, PyTorch kept a global CUDA context (streams, the allocator, and so on) in a THCState. The extern tells the linker that this symbol lives inside PyTorch’s own libraries, and PyTorch resolves it when the module is loaded. Every THCudaTensor_* call needs it.
  • The size[0..2] reads pull the channel, height, and width out of the input tensor.
  • THCudaTensor_resize3d and THCudaTensor_fill allocate and zero the output on the GPU, so the kernel has somewhere to write.
  • THCudaTensor_data hands back the raw float* device pointer. From here on, the kernel deals only in plain pointers and scalars.
  • If the kernel reports failure, THError raises it back into Python as an exception.

We also need a header, augmentation_cuda.h, so the FFI knows the prototype of color_transform_cuda(...):

int color_transform_cuda(THCudaTensor* input, THCudaTensor* output,
    float color_r, float color_g, float color_b, float brightness,
    float contrast, float gamma);

Create the CUDA Core

The file augmentation_cuda_kernel.cu holds the actual GPU code, and it comes in two halves: a host-side launcher (color_transform_cuda_kernel, ordinary CPU code that configures and starts the GPU work) and the device-side kernel (color_augmentation, the code each GPU thread runs).

We start with the launcher, which also receives the scalar parameters passed down from C:

struct ChromaticParams {
    float color[N_CHANNELS];
    float brightness;
    float contrast;
    float gamma;
};

#define CUDA_NUM_THREADS 512

inline int GET_BLOCKS(const int N) {
    return (N + CUDA_NUM_THREADS - 1) / CUDA_NUM_THREADS;
}

int color_transform_cuda_kernel(float* input, float* output, int n_channels, int height, int width,
        float color_r, float color_g, float color_b, float brightness, float contrast, float gamma) {
    struct ChromaticParams cp;
    cp.color[0]     = color_r;
    cp.color[1]     = color_g;
    cp.color[2]     = color_b;
    cp.brightness   = brightness;
    cp.contrast     = contrast;
    cp.gamma        = gamma;

    // Memory addresses in GPU
    ChromaticParams* p_cp;

    //  Alloc space for GPU copies
    cudaMalloc((void **)&p_cp, sizeof(ChromaticParams));
    // Copy data to GPU
    cudaMemcpy(p_cp, &cp, sizeof(ChromaticParams), cudaMemcpyHostToDevice);

    // Color Augmentation
    int n_threads = height * width;
    color_augmentation<<<GET_BLOCKS(n_threads), CUDA_NUM_THREADS>>>(
        n_threads, n_channels, height, width, input, output, p_cp);

    // Free alloced memory in GPU
    cudaFree(p_cp);

    // Check for errors
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("Error in color_transform_cuda_kernel: %s\n", cudaGetErrorString(err));
        return 0;
    }
    return 1;
}

Almost every CUDA launcher has the same shape:

  1. Gather the parameters. Here the scalar augmentation settings are packed into a ChromaticParams struct.
  2. Copy them to the GPU. cudaMalloc reserves device memory and cudaMemcpy(..., cudaMemcpyHostToDevice) ships the struct over. The tensors are already on the GPU; only these loose scalars need copying.
  3. Choose the launch configuration. We use one thread per pixel, so n_threads = height * width. CUDA_NUM_THREADS (512) is the block size, and GET_BLOCKS is a ceiling division that works out how many blocks of 512 are needed to cover every pixel.
  4. Launch and clean up. The <<<blocks, threads>>> syntax fires the kernel, then we free the scratch memory and check cudaGetLastError for any failure.

Next comes the kernel that actually runs on the GPU:

#define CUDA_KERNEL_LOOP(i, n) \
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; \
       i < (n); \
       i += blockDim.x * gridDim.x)
#define N_CHANNELS 3
#define MAX_MULTIPLIER 1

inline __device__ __host__ float clamp(float f, float a, float b) {
    return fmaxf(a, fminf(f, b));
}

__global__ void color_augmentation(const int nthreads, const int channels, const int height,
        const int width, float* src_data, float* dest_data, const ChromaticParams* cp) {
    CUDA_KERNEL_LOOP(index, nthreads) {
        float x = (float)(index % width); //w-pos
        float y = (float)((index / width) % height); //h-pos
        int n = (index / width / height); // num

        int data_index[N_CHANNELS];
        float rgb[N_CHANNELS];
        float mean_in = 0;
        float mean_out = 0;

        // do color change
        for ( int c = 0; c < channels; ++c ) {
            data_index[c] = width * (height * (channels * n + c) + y) + x;
            rgb[c] = src_data[data_index[c]];
            mean_in += rgb[c];
            rgb[c] *= cp->color[c];
            mean_out += rgb[c];
        }

        float brightness_coeff = mean_in / (mean_out + 0.01f);
        for ( int c = 0; c < channels; ++c ) {
            //compensate brightness
            rgb[c] = clamp(rgb[c] * brightness_coeff, 0.f, 1.f);

            // do gamma change
            rgb[c] = pow(rgb[c], cp->gamma);

            // do brightness change
            rgb[c] = rgb[c] + cp->brightness;

            // do contrast change
            rgb[c] = 0.5f + (rgb[c] - 0.5f) * cp->contrast;

            // write sample to destination
            dest_data[data_index[c]] = clamp(rgb[c], 0.f, MAX_MULTIPLIER);
        }
    }
}

The CUDA_KERNEL_LOOP macro is the standard grid-stride loop: rather than assuming each thread handles exactly one element, every thread strides across the data in steps of blockDim.x * gridDim.x. This keeps the kernel correct even when we launch fewer threads than there are pixels. Inside the loop, each thread turns its flat index back into (x, y) pixel coordinates and applies the color math (per-channel color scaling, a brightness compensation, gamma, a brightness offset, and contrast), writing the clamped result into the output buffer. Note that rgb and data_index are per-channel arrays: the first loop reads every channel and accumulates mean_in/mean_out, and only then does the second loop apply the brightness compensation and the rest, one channel at a time.

Finally, just like augmentation_cuda.h, we declare the launcher’s prototype in augmentation_cuda_kernel.h so the C interface can call it:

int color_transform_cuda_kernel(float* input, float* output, int n_channels, int height, int width,
    float color_r, float color_g, float color_b, float brightness, float contrast, float gamma);

Build & Run

With every source in place, the build is the two-step process the architecture implied: compile the CUDA kernel with nvcc into an object file, then let build.py compile the C interface and link that object in.

#!/usr/bin/bash
TORCH=$(python3 -c "import os; import torch; print(os.path.dirname(torch.__file__))")

cd src

rm ../build/augmentation_cuda_kernel.o
rm -r ../_ext

nvcc -c -o ../build/augmentation_cuda_kernel.o augmentation_cuda_kernel.cu -x cu -Xcompiler -fPIC -arch=sm_61

cd ../
python build.py

Two flags are worth a note:

  • -arch=sm_61 targets a specific GPU compute capability (Pascal, in this case). Set it to match your card.
  • -Xcompiler -fPIC makes the object position-independent, so it can be linked into the Python shared module.

To make the extension accessible from Python, we add an __init__.py. Wrapping the call in a torch.autograd.Function is what lets the operator sit inside a normal PyTorch pipeline:

import torch
from torch.autograd import Function, Variable

import sys, os
sys.path.append(os.path.join(os.path.dirname(__file__), '_ext'))
import augmentation

class Augmentation(Function):
    def color_transform(self, input, rnd):
        output = input.new()
        augmentation.color_transform_cuda(input, output, rnd['color_r'], rnd['color_g'], \
            rnd['color_b'], rnd['brightness'], rnd['contrast'], rnd['gamma'])

        return output

With everything built, we import the wrapper class from the package (the folder that holds __init__.py) and apply the transform to a GPU tensor:

from data_augmentation import Augmentation

# `rnd` holds the random augmentation parameters
output = Augmentation().color_transform(input, rnd)

References