Convolutions are essential components of many algorithms in neural networks, image processing, computer vision ... but these are also a major bottleneck in terms of computations... I will here benchmark different solutions using numpy, scipy or tensorflow. This is a work in progress, so any suggestion is welcome (for instance on StackExchange)!

Let's first initialize the notebook:

In [1]:
from __future__ import division, print_function
import numpy as np
np.set_printoptions(precision=6, suppress=True)
import os
%matplotlib inline
#%config InlineBackend.figure_format='retina'
%config InlineBackend.figure_format = 'svg'
import matplotlib.pyplot as plt
phi = (np.sqrt(5)+1)/2
fig_width = 10
figsize = (fig_width, fig_width/phi)
%load_ext autoreload
%autoreload 2

setting up the problem

The convolution operates on two 2-D matrices. We will here always consider the case which is most typical in computer vision:

  • a first matrix $A$ is the input and is typically large ($N \times N$ where $N$ is typically larger than $2^{10}=1024$),
  • a second matrix $B$ is the template and is typically smaller (say $M=128$),
  • the result of the convolution $C = A \ast B$ is padded such that it is of the same size as $A$.

Often, you need to do that with many images and many kernels. In addition, we will test for the effect of prefetching.
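Concretely, with zero padding ('same' mode), each coefficient of the result reads, up to the exact centering convention (which differs slightly between libraries):

$$C[i, j] = \sum_{k=0}^{M-1} \sum_{l=0}^{M-1} A[i - k + \lfloor M/2 \rfloor,\ j - l + \lfloor M/2 \rfloor] \cdot B[k, l],$$

where entries of $A$ outside its support are taken to be zero.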

Let's write a small function for this:

In [2]:
N, n_N, M, n_M = 512, 4, 64, 16
# DEBUG 
N, n_N, M, n_M = 256, 10, 32, 8
def get_data(N=N, n_N=n_N, M=M, n_M=n_M, seed=42, prefetching=False):
    np.random.seed(seed)
    if prefetching:
        A = np.random.rand(n_N, N, N)
        B = np.random.rand(n_M, M, M)
        C = np.zeros((n_N, n_M, N, N))
    else:
        A = np.random.rand(N, N, n_N)
        B = np.random.rand(M, M, n_M)
        C = np.zeros((N, N, n_N, n_M))
    return A, B, C

for prefetching in [True, False]:
    print ('with prefetching=', prefetching)
    A, B, C = get_data(prefetching=prefetching)
    print('Checking size of A =', A.shape, ' of B=', B.shape, ', and of C=', C.shape)
with prefetching= True
Checking size of A = (10, 256, 256)  of B= (8, 32, 32) , and of C= (10, 8, 256, 256)
with prefetching= False
Checking size of A = (256, 256, 10)  of B= (32, 32, 8) , and of C= (256, 256, 10, 8)
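Note that 'prefetching' here simply denotes the memory layout: with prefetching=True the image and kernel indices come first, so that each 2-D slice is contiguous in memory (C order) and thus friendlier to the CPU cache; without it, the slices are strided. A quick check:

A_pre, B_pre, C_pre = get_data(prefetching=True)
A_raw, B_raw, C_raw = get_data(prefetching=False)
# each slice A_pre[i_N, :, :] is contiguous, while A_raw[:, :, i_N] is strided
print(A_pre[0, :, :].flags['C_CONTIGUOUS'], A_raw[:, :, 0].flags['C_CONTIGUOUS'])  # -> True False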

Calling this function adds some initialization time, but it makes the subsequent tests cleaner:

In [3]:
def test_get_data(N=N, n_N=n_N, M=M, n_M=n_M):
    A, B, C = get_data(N, n_N, M, n_M)  
In [4]:
%%timeit
test_get_data(N, n_N, M, n_M)
18.5 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

using scipy

The scipy library offers different solutions:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.convolve2d.html

In [5]:
from scipy.signal import convolve2d
def test_scipy(A, B, C, prefetching=False):
    if prefetching:
        for i_N in np.arange(A.shape[0]):
            for i_M in np.arange(B.shape[0]):
                C[i_N, i_M, :, :] = convolve2d(A[i_N, :, :], B[i_M, :, :], mode='same', boundary='fill', fillvalue=0)
    else:
        for i_N in np.arange(A.shape[-1]):
            for i_M in np.arange(B.shape[-1]):
                C[:, :, i_N, i_M] = convolve2d(A[:, :, i_N], B[:, :, i_M], mode='same', boundary='fill', fillvalue=0)                
In [6]:
A, B, C = get_data(N, n_N, M, n_M)
In [7]:
%%timeit
test_scipy(A, B, C)
17.9 s ± 775 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]:
A, B, C = get_data(N, n_N, M, n_M, prefetching=True)
In [9]:
%%timeit
test_scipy(A, B, C, prefetching=True)
14.4 s ± 163 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]:
A, B, C = get_data(N, n_N, M, n_M)
from scipy.signal import fftconvolve
def test_scipy_fft(A, B, C, prefetching=False):
    if prefetching:
        for i_N in np.arange(A.shape[0]):
            for i_M in np.arange(B.shape[0]):
                C[i_N, i_M, :, :] = fftconvolve(A[i_N, :, :], B[i_M, :, :], mode='same')
    else:
        for i_N in np.arange(A.shape[-1]):
            for i_M in np.arange(B.shape[-1]):
                C[:, :, i_N, i_M] = fftconvolve(A[:, :, i_N], B[:, :, i_M], mode='same')                            
In [11]:
A, B, C = get_data(N, n_N, M, n_M)
In [12]:
%%timeit
test_scipy_fft(A, B, C)
576 ms ± 70 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [13]:
A, B, C = get_data(N, n_N, M, n_M, prefetching=True)
In [14]:
%%timeit
test_scipy_fft(A, B, C, prefetching=True)
394 ms ± 15.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Further profiling (with the line_profiler extension) shows that most of the computing time is divided between the three FFTs (two forward, one inverse):

%lprun -f fftconvolve test_scipy_fft(A, B, C, prefetching=True)

This shows the advantage of using the Fourier transform to perform the convolution. There is also a slight advantage in using prefetching. However, computing the convolution in Fourier space assumes periodic bounds at the borders (fftconvolve avoids this wrap-around by zero-padding the inputs internally).
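To make explicit what happens under the hood, here is a minimal sketch of a 'same'-mode convolution computed directly with numpy's FFT routines; it illustrates the principle behind fftconvolve, not scipy's exact implementation:

def np_fftconvolve_same(A, B):
    # Zero-pad both arrays to the size of the full linear convolution,
    # multiply their spectra, transform back and crop the centre ('same' mode).
    N, M = A.shape[0], B.shape[0]                  # square inputs, as in this post
    size = N + M - 1                               # size of the full linear convolution
    fA = np.fft.rfft2(A, s=(size, size))           # forward FFT of the image
    fB = np.fft.rfft2(B, s=(size, size))           # forward FFT of the kernel
    full = np.fft.irfft2(fA * fB, s=(size, size))  # one inverse FFT
    start = (M - 1) // 2                           # crop to the 'same' output size
    return full[start:start + N, start:start + N]

On the data above, this should agree with fftconvolve(A[:, :, 0], B[:, :, 0], mode='same') up to floating-point error.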

using ndimage

Within scipy, the ndimage library offers different solutions:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.correlate.html

In [15]:
from scipy.ndimage import convolve
help(convolve)
Help on function convolve in module scipy.ndimage.filters:

convolve(input, weights, output=None, mode='reflect', cval=0.0, origin=0)
    Multidimensional convolution.
    
    The array is convolved with the given kernel.
    
    Parameters
    ----------
    input : array_like
        Input array to filter.
    weights : array_like
        Array of weights, same number of dimensions as input
    output : ndarray, optional
        The `output` parameter passes an array in which to store the
        filter output. Output array should have different name as
        compared to input array to avoid aliasing errors.  
    mode : {'reflect','constant','nearest','mirror', 'wrap'}, optional
        the `mode` parameter determines how the array borders are
        handled. For 'constant' mode, values beyond borders are set to be
        `cval`. Default is 'reflect'.
    cval : scalar, optional
        Value to fill past edges of input if `mode` is 'constant'. Default
        is 0.0
    origin : array_like, optional
        The `origin` parameter controls the placement of the filter, 
        relative to the centre of the current element of the input.  
        Default of 0 is equivalent to ``(0,)*input.ndim``.
    
    Returns
    -------
    result : ndarray
        The result of convolution of `input` with `weights`.
    
    See Also
    --------
    correlate : Correlate an image with a kernel.
    
    Notes
    -----
    Each value in result is :math:`C_i = \sum_j{I_{i+k-j} W_j}`, where
    W is the `weights` kernel,
    j is the n-D spatial index over :math:`W`,
    I is the `input` and k is the coordinate of the center of
    W, specified by `origin` in the input parameters.
    
    Examples
    --------
    Perhaps the simplest case to understand is ``mode='constant', cval=0.0``,
    because in this case borders (i.e. where the `weights` kernel, centered
    on any one value, extends beyond an edge of `input`.
    
    >>> a = np.array([[1, 2, 0, 0],
    ...               [5, 3, 0, 4],
    ...               [0, 0, 0, 7],
    ...               [9, 3, 0, 0]])
    >>> k = np.array([[1,1,1],[1,1,0],[1,0,0]])
    >>> from scipy import ndimage
    >>> ndimage.convolve(a, k, mode='constant', cval=0.0)
    array([[11, 10,  7,  4],
           [10,  3, 11, 11],
           [15, 12, 14,  7],
           [12,  3,  7,  0]])
    
    Setting ``cval=1.0`` is equivalent to padding the outer edge of `input`
    with 1.0's (and then extracting only the original region of the result).
    
    >>> ndimage.convolve(a, k, mode='constant', cval=1.0)
    array([[13, 11,  8,  7],
           [11,  3, 11, 14],
           [16, 12, 14, 10],
           [15,  6, 10,  5]])
    
    With ``mode='reflect'`` (the default), outer values are reflected at the
    edge of `input` to fill in missing values.
    
    >>> b = np.array([[2, 0, 0],
    ...               [1, 0, 0],
    ...               [0, 0, 0]])
    >>> k = np.array([[0,1,0], [0,1,0], [0,1,0]])
    >>> ndimage.convolve(b, k, mode='reflect')
    array([[5, 0, 0],
           [3, 0, 0],
           [1, 0, 0]])
    
    This includes diagonally at the corners.
    
    >>> k = np.array([[1,0,0],[0,1,0],[0,0,1]])
    >>> ndimage.convolve(b, k)
    array([[4, 2, 0],
           [3, 2, 0],
           [1, 1, 0]])
    
    With ``mode='nearest'``, the single nearest value in to an edge in
    `input` is repeated as many times as needed to match the overlapping
    `weights`.
    
    >>> c = np.array([[2, 0, 1],
    ...               [1, 0, 0],
    ...               [0, 0, 0]])
    >>> k = np.array([[0, 1, 0],
    ...               [0, 1, 0],
    ...               [0, 1, 0],
    ...               [0, 1, 0],
    ...               [0, 1, 0]])
    >>> ndimage.convolve(c, k, mode='nearest')
    array([[7, 0, 3],
           [5, 0, 2],
           [3, 0, 1]])

In [16]:
from scipy.ndimage import convolve
def test_ndimage(A, B, C, prefetching=False):
    if prefetching:
        for i_N in np.arange(A.shape[0]):
            for i_M in np.arange(B.shape[0]):
                C[i_N, i_M, :, :] = convolve(A[i_N, :, :], B[i_M, :, :], mode='constant', cval=0)
    else:
        for i_N in np.arange(A.shape[-1]):
            for i_M in np.arange(B.shape[-1]):
                C[:, :, i_N, i_M] = convolve(A[:, :, i_N], B[:, :, i_M], mode='constant', cval=0)                            
                
In [17]:
A, B, C = get_data(N, n_N, M, n_M)
In [18]:
%%timeit
test_ndimage(A, B, C)
6.67 s ± 231 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [19]:
A, B, C = get_data(N, n_N, M, n_M, prefetching=True)
In [20]:
%%timeit
test_ndimage(A, B, C, prefetching=True)
6.79 s ± 1.08 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

There is also a correlate function:

In [21]:
%%timeit
from scipy.ndimage import correlate
C = correlate(A, B[::-1, ::-1], mode='constant', cval=0) 
6.94 s ± 247 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
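As a sanity check (illustrative, not part of the benchmark), convolution is correlation with a flipped kernel; with an odd-sized kernel the identity is exact, while even-sized kernels use centering conventions that differ by one pixel:

from scipy.ndimage import convolve, correlate
a = np.random.rand(16, 16)
b = np.random.rand(5, 5)   # odd-sized kernel, so the centres coincide
print(np.allclose(convolve(a, b, mode='constant', cval=0),
                  correlate(a, b[::-1, ::-1], mode='constant', cval=0)))  # -> True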

using skimage

The skimage library has recently grown in popularity and includes many advanced computer vision algorithms. Its match_template operator is quick (using the Fourier transform), while allowing for many convenient options (mode, boundaries): https://github.com/scikit-image/scikit-image/blob/master/skimage/feature/template.py

Note: this function computes a normalized cross-correlation, so you need to symmetrically flip the template to get a convolution.

In [22]:
from skimage.feature import match_template
def test_skimage(A, B, C, prefetching=False):
    if prefetching:
        for i_N in np.arange(A.shape[0]):
            for i_M in np.arange(B.shape[0]):
                C[i_N, i_M, :, :] = match_template(A[i_N, :, :], B[i_M, :, :], pad_input=True, mode='constant', constant_values=0.)
    else:
        for i_N in np.arange(A.shape[-1]):
            for i_M in np.arange(B.shape[-1]):
                C[:, :, i_N, i_M] = match_template(A[:, :, i_N], B[:, :, i_M], pad_input=True, mode='constant', constant_values=0.)                            
In [23]:
A, B, C = get_data(N, n_N, M, n_M)
In [24]:
%%timeit
test_skimage(A, B, C)
1.13 s ± 37.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [25]:
A, B, C = get_data(N, n_N, M, n_M, prefetching=True)
In [26]:
%%timeit
test_skimage(A, B, C, prefetching=True)
995 ms ± 43.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

using numba

The numba package allows access to very fast compiled routines: http://numba.pydata.org/numba-doc/0.15.1/examples.html#filterbank-correlation

In [27]:
from __future__ import print_function, division, absolute_import
import numpy as np
import numba
from numba import jit

nd4type = numba.double[:,:,:,:]
nd3type = numba.double[:,:,:]

@jit(nopython=True)
def nb_get_data(N=N, n_N=n_N, M=M, n_M=n_M, seed=42, prefetching=False):
    np.random.seed(seed)
    if prefetching:
        A = np.random.rand(n_N, N, N)
        B = np.random.rand(n_M, M, M)
        C = np.zeros((n_N, n_M, N, N))
    else:
        A = np.random.rand(N, N, n_N)
        B = np.random.rand(M, M, n_M)
        C = np.zeros((N, N, n_N, n_M))
    return A, B, C

@jit((nd3type, nd3type, nd4type))
def nbcorr_prefetching(imgs, filters, output):
    n_imgs, n_rows, n_cols = imgs.shape
    n_filters, height, width = filters.shape

    for ii in range(n_imgs):
        for rr in range(n_rows - height + 1):
            for cc in range(n_cols - width + 1):
                for hh in range(height):
                    for ww in range(width):
                        for ff in range(n_filters):
                            imgval = imgs[ii, rr + hh, cc + ww]
                            filterval = filters[ff, hh, ww]
                            output[ii, ff, rr, cc] += imgval * filterval
                            
@jit((nd3type, nd3type, nd4type))                            
def nbcorr(imgs, filters, output):
    n_rows, n_cols, n_imgs = imgs.shape
    height, width, n_filters = filters.shape

    for ii in range(n_imgs):
        for rr in range(n_rows - height + 1):
            for cc in range(n_cols - width + 1):
                for hh in range(height):
                    for ww in range(width):
                        for ff in range(n_filters):
                            imgval = imgs[rr + hh, cc + ww, ii]
                            filterval = filters[hh, ww, ff]
                            output[rr, cc, ii, ff] += imgval * filterval
                            
def test_numba(A, B, C, prefetching=False):
    if prefetching:
        nbcorr_prefetching(A, B, C)
    else:
        nbcorr(A, B, C)        
In [28]:
A, B, C = nb_get_data(N, n_N, M, n_M)
In [29]:
%%timeit
test_numba(A, B, C)
4.25 s ± 78.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [30]:
A, B, C = nb_get_data(N, n_N, M, n_M, prefetching=True)
In [31]:
%%timeit
test_numba(A, B, C, prefetching=True)
8.55 s ± 195 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

There may be an advantage when processing multiple images at once:

In [32]:
A, B, C = nb_get_data(N=256, n_N=10, M=32, n_M=10)
In [33]:
%%timeit
test_numba(A, B, C)
5.44 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [34]:
A, B, C = nb_get_data(N=256, n_N=10, M=32, n_M=10, prefetching=True)
In [35]:
%%timeit
test_numba(A, B, C, prefetching=True)
26.8 s ± 470 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Seems promising, as many numpy operations are supported: http://numba.pydata.org/numba-doc/dev/reference/numpysupported.html
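One possible further speed-up, not benchmarked here, would be to parallelize the outer loop over images with numba's prange; a minimal sketch of the prefetching variant, assuming a numba version with parallel support:

from numba import njit, prange

@njit(parallel=True)
def nbcorr_prefetching_par(imgs, filters, output):
    # Same loop nest as nbcorr_prefetching, with the loop over images run in
    # parallel and the inner product accumulated in a scalar before writing.
    n_imgs, n_rows, n_cols = imgs.shape
    n_filters, height, width = filters.shape
    for ii in prange(n_imgs):
        for ff in range(n_filters):
            for rr in range(n_rows - height + 1):
                for cc in range(n_cols - width + 1):
                    acc = 0.0
                    for hh in range(height):
                        for ww in range(width):
                            acc += imgs[ii, rr + hh, cc + ww] * filters[ff, hh, ww]
                    output[ii, ff, rr, cc] += acc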

using tensorflow

The tensorflow package is widely used in deep-learning architectures, which also rely heavily on convolutions:

https://www.tensorflow.org/api_docs/python/tf/nn/conv2d

In [36]:
import tensorflow as tf
def test_tf(img, kernel, strides=[1, 1, 1, 1], padding='SAME', prefetching=False):
    if prefetching: 
        with tf.Graph().as_default():
            img_shape = img.shape
            img_reshaped = img.reshape(1, img_shape[1], img_shape[2], img_shape[0])
            kernel_shape = kernel.shape
            kernel_reshaped = kernel.reshape(kernel_shape[1], kernel_shape[2], 1, kernel_shape[0])
            kernel_reshaped = np.repeat(kernel_reshaped, img_shape[0], axis=2)

            x = tf.placeholder('float32', [1, None, None, img_shape[0]])
            w = tf.get_variable('w', initializer=tf.to_float(kernel_reshaped))

            conv = tf.nn.conv2d(x, w, strides=strides, padding=padding)
            init = tf.global_variables_initializer()
            with tf.Session() as session:
                session.run(init)
                conv_op = session.run(conv, feed_dict={x: img_reshaped})
    else:

        with tf.Graph().as_default():
            img_shape = img.shape
            img_reshaped = img.reshape(1, img_shape[0], img_shape[1], img_shape[2])
            kernel_shape = kernel.shape
            kernel_reshaped = kernel[:, :, np.newaxis, :]
            kernel_reshaped = np.repeat(kernel_reshaped, img_shape[2], axis=2)

            x = tf.placeholder('float32', [1, None, None, img_shape[2]])
            w = tf.get_variable('w', initializer=tf.to_float(kernel_reshaped))

            conv = tf.nn.conv2d(x, w, strides=strides, padding=padding)
            init = tf.global_variables_initializer()
            with tf.Session() as session:
                session.run(init)
                conv_op = session.run(conv, feed_dict={x: img_reshaped})
In [37]:
A, B, C = get_data()
In [38]:
%%timeit
test_tf(A,B)
1.53 s ± 295 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Seems slower than scipy_fft, but the comparison is quite unfair, as there is a lot of boilerplate (graph construction, session setup) that could be run before the actual data ($A$ and $B$) arrives; this needs more work. We will exclude it from the benchmark from now on.
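A fairer timing would build the graph and the session once, outside of the timed section, and only time the session.run call. A minimal sketch with the same TF1-style API as above (names and shapes are illustrative):

import tensorflow as tf

def build_conv(n_channels, kernel_reshaped, strides=[1, 1, 1, 1], padding='SAME'):
    # build the graph and the session once; only session.run() is timed afterwards
    graph = tf.Graph()
    with graph.as_default():
        x = tf.placeholder('float32', [1, None, None, n_channels])
        w = tf.constant(kernel_reshaped.astype('float32'))
        conv = tf.nn.conv2d(x, w, strides=strides, padding=padding)
        session = tf.Session(graph=graph)
    return session, x, conv

# inside the timed loop, one would then only run:
#     conv_op = session.run(conv, feed_dict={x: img_reshaped})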

wrapping things up

Now that we have an implementation for each library, it may be worth checking how they scale with respect to the different parameters:

number of images

This should scale linearly:

In [39]:
N, n_N, M, n_M = 256, 4, 32, 16
from timeit import Timer
reps = 20
sp, sk, nb, tf = [], [], [], []
n_Ns = 2**np.arange(6)

for prefetching in [False, True]:
    for n_N in n_Ns:
        A, B, C = get_data(N, n_N, M, n_M, prefetching=prefetching)
        t = Timer(lambda: test_skimage(A, B, C, prefetching=prefetching))
        sk.append(t.timeit(number=reps))
        t = Timer(lambda: test_scipy_fft(A, B, C, prefetching=prefetching))
        sp.append(t.timeit(number=reps))
        t = Timer(lambda: test_numba(A, B, C, prefetching=prefetching))
        nb.append(t.timeit(number=reps))
        #t = Timer(lambda: test_tf(A, B, prefetching=prefetching))
        #tf.append(t.timeit(number=reps))
    
fig , ax = plt.subplots(figsize=(8, 5))
ax.loglog(n_Ns, sp[:len(n_Ns)], c='r', label='scipy')
ax.loglog(n_Ns, sk[:len(n_Ns)], c='b', label='skimage')
ax.loglog(n_Ns, nb[:len(n_Ns)], c='g', label='numba')
#ax.loglog(n_Ns, tf[:len(n_Ns)], c='g', label='tensorflow')
ax.loglog(n_Ns, sp[len(n_Ns):], '--', c='r')
ax.loglog(n_Ns, sk[len(n_Ns):], '--', c='b')
ax.loglog(n_Ns, nb[len(n_Ns):], '--', c='g')
#ax.loglog(n_Ns, tf[len(n_Ns):], '--', c='o')
ax.set_xlabel('n_N')
ax.set_ylabel('time (s)')
ax.legend();

size of images

This should scale supra-linearly in $N$ (the FFT-based solutions scale roughly as $N^2 \log N$, the direct loop as $N^2$):

In [40]:
N, n_N, M, n_M = 256, 4, 32, 16
from timeit import Timer
reps = 20
sp, sk, nb, tens = [], [], [], []
Ns = 2**np.arange(5, 10)
for prefetching in [False, True]:
    for N in Ns:
        A, B, C = get_data(N, n_N, M, n_M, prefetching=prefetching)
        t = Timer(lambda: test_skimage(A, B, C, prefetching=prefetching))
        sk.append(t.timeit(number=reps))
        t = Timer(lambda: test_scipy_fft(A, B, C, prefetching=prefetching))
        sp.append(t.timeit(number=reps))
        t = Timer(lambda: test_numba(A, B, C, prefetching=prefetching))
        nb.append(t.timeit(number=reps))

fig , ax = plt.subplots(figsize=(8, 5))
ax.loglog(Ns, sp[:len(Ns)], c='r', label='scipy')
ax.loglog(Ns, sk[:len(Ns)], c='b', label='skimage')
ax.loglog(Ns, nb[:len(Ns)], c='g', label='numba')
ax.loglog(Ns, sp[len(Ns):], '--', c='r')
ax.loglog(Ns, sk[len(Ns):], '--', c='b')
ax.loglog(Ns, nb[len(Ns):], '--', c='g')
ax.set_xlabel('N')
ax.set_ylabel('time (s)')
ax.legend();

number of kernels

This should scale linearly in the number of kernels:

In [41]:
N, n_N, M, n_M = 256, 4, 32, 16
from timeit import Timer
reps = 20
sp, sk, nb, tens = [], [], [], []
n_Ms = 2**np.arange(5)

for prefetching in [False, True]:
    for n_M in n_Ms:
        A, B, C = get_data(N, n_N, M, n_M, prefetching=prefetching)
        t = Timer(lambda: test_skimage(A, B, C, prefetching=prefetching))
        sk.append(t.timeit(number=reps))
        t = Timer(lambda: test_scipy_fft(A, B, C, prefetching=prefetching))
        sp.append(t.timeit(number=reps))
        t = Timer(lambda: test_numba(A, B, C, prefetching=prefetching))
        nb.append(t.timeit(number=reps))
    
fig , ax = plt.subplots(figsize=(8, 5))
ax.loglog(n_Ms, sp[:len(n_Ms)], c='r', label='scipy')
ax.loglog(n_Ms, sk[:len(n_Ms)], c='b', label='skimage')
ax.loglog(n_Ms, nb[:len(n_Ms)], c='g', label='numba')
ax.loglog(n_Ms, sp[len(n_Ms):], '--', c='r')
ax.loglog(n_Ms, sk[len(n_Ms):], '--', c='b')
ax.loglog(n_Ms, nb[len(n_Ms):], '--', c='g')
ax.set_xlabel('n_M')
ax.set_ylabel('time (s)')
ax.legend();

conclusion?

As a conclusion, there is not one single answer to all situations; the fastest method will depend on the task at hand. Still, the FFT-based solution in scipy seems the most rapid. Overall, there is also a slight advantage in using prefetching. In the end, knowing the relative advantages of each library is crucial for choosing the right tool for a given problem.

%%bash
cd ..
nikola build ; nikola deploy
cd posts