Convolutions are essential components of any neural network, image-processing or computer-vision pipeline... but they are also a bottleneck in terms of computation... I will benchmark here different solutions using numpy, scipy or tensorflow. This is work in progress, so any suggestion is welcome (for instance on StackExchange)!

Let's first initialize the notebook:

In [1]:
from __future__ import division, print_function
import numpy as np
np.set_printoptions(precision=6, suppress=True)
import os
%matplotlib inline
#%config InlineBackend.figure_format='retina'
%config InlineBackend.figure_format = 'svg'
import matplotlib.pyplot as plt
phi = (np.sqrt(5)+1)/2
fig_width = 10
figsize = (fig_width, fig_width/phi)
%load_ext autoreload
%autoreload 2


## setting up the problem

The convolution operates on two 2-D matrices. We will always consider the case that is most typical in computer vision:

• a first matrix $A$ is the input and is typically large ($N \times N$ where $N$ is typically larger than $2^{10}=1024$),
• a second matrix $B$ is the template and is typically smaller (say $M=128$),
• the result of the convolution $C = A \ast B$ is padded such that it is of the same size as $A$.

Often, you need to do this with many images and many kernels. In addition, we will test for the effect of prefetching.
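As a quick sanity check of this 'same'-padding convention, here is a minimal example on toy data (it uses scipy.signal.convolve2d, which is benchmarked below):

```python
import numpy as np
from scipy.signal import convolve2d

A = np.random.rand(8, 8)  # a small "image"
B = np.ones((3, 3)) / 9.  # a small averaging "template"

# mode='same' pads the result so that it keeps the shape of the first input
C = convolve2d(A, B, mode='same', boundary='fill', fillvalue=0)
assert C.shape == A.shape
```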

Let's write a small function for this:

In [2]:
N, n_N, M, n_M = 512, 4, 64, 16
# DEBUG
N, n_N, M, n_M = 256, 10, 32, 8
def get_data(N=N, n_N=n_N, M=M, n_M=n_M, seed=42, prefetching=False):
    np.random.seed(seed)
    if prefetching:
        A = np.random.rand(n_N, N, N)
        B = np.random.rand(n_M, M, M)
        C = np.zeros((n_N, n_M, N, N))
    else:
        A = np.random.rand(N, N, n_N)
        B = np.random.rand(M, M, n_M)
        C = np.zeros((N, N, n_N, n_M))
    return A, B, C

for prefetching in [True, False]:
    print('with prefetching=', prefetching)
    A, B, C = get_data(prefetching=prefetching)
    print('Checking size of A =', A.shape, ' of B=', B.shape, ', and of C=', C.shape)

with prefetching= True
Checking size of A = (10, 256, 256)  of B= (8, 32, 32) , and of C= (10, 8, 256, 256)
with prefetching= False
Checking size of A = (256, 256, 10)  of B= (32, 32, 8) , and of C= (256, 256, 10, 8)
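What the prefetching flag changes is the memory layout: with prefetching=True the image index comes first, so that each image is stored contiguously in memory. This can be checked on the contiguity flags of a slice (a small sketch using the same shapes as above):

```python
import numpy as np

A_pref = np.random.rand(10, 256, 256)    # image index first (prefetching=True)
A_nopref = np.random.rand(256, 256, 10)  # image index last (prefetching=False)

# with prefetching, a single image is one contiguous block of memory;
# without it, the pixels of one image are strided across the whole array
assert A_pref[0, :, :].flags['C_CONTIGUOUS']
assert not A_nopref[:, :, 0].flags['C_CONTIGUOUS']
```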


The call to this function will incur some initialization time, but it makes further tests cleaner:

In [3]:
def test_get_data(N=N, n_N=n_N, M=M, n_M=n_M):
    A, B, C = get_data(N, n_N, M, n_M)

In [4]:
%%timeit
test_get_data(N, n_N, M, n_M)

18.5 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


## using scipy

The scipy library offers different solutions:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.convolve2d.html

In [5]:
from scipy.signal import convolve2d
def test_scipy(A, B, C, prefetching=False):
    if prefetching:
        for i_N in np.arange(A.shape[0]):
            for i_M in np.arange(B.shape[0]):
                C[i_N, i_M, :, :] = convolve2d(A[i_N, :, :], B[i_M, :, :], mode='same', boundary='fill', fillvalue=0)
    else:
        for i_N in np.arange(A.shape[-1]):
            for i_M in np.arange(B.shape[-1]):
                C[:, :, i_N, i_M] = convolve2d(A[:, :, i_N], B[:, :, i_M], mode='same', boundary='fill', fillvalue=0)

In [6]:
A, B, C = get_data(N, n_N, M, n_M)

In [7]:
%%timeit
test_scipy(A, B, C)

17.9 s ± 775 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]:
A, B, C = get_data(N, n_N, M, n_M, prefetching=True)

In [9]:
%%timeit
test_scipy(A, B, C, prefetching=True)

14.4 s ± 163 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]:
A, B, C = get_data(N, n_N, M, n_M)
from scipy.signal import fftconvolve
def test_scipy_fft(A, B, C, prefetching=False):
    if prefetching:
        for i_N in np.arange(A.shape[0]):
            for i_M in np.arange(B.shape[0]):
                C[i_N, i_M, :, :] = fftconvolve(A[i_N, :, :], B[i_M, :, :], mode='same')
    else:
        for i_N in np.arange(A.shape[-1]):
            for i_M in np.arange(B.shape[-1]):
                C[:, :, i_N, i_M] = fftconvolve(A[:, :, i_N], B[:, :, i_M], mode='same')

In [11]:
A, B, C = get_data(N, n_N, M, n_M)

In [12]:
%%timeit
test_scipy_fft(A, B, C)

576 ms ± 70 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [13]:
A, B, C = get_data(N, n_N, M, n_M, prefetching=True)

In [14]:
%%timeit
test_scipy_fft(A, B, C, prefetching=True)

394 ms ± 15.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Further profiling shows that most of the computing time is divided between the three FFTs (two forward, one inverse):

%lprun -f fftconvolve test_scipy_fft(A, B, C, prefetching=True)

This shows the advantage of using the Fourier transform to perform the convolution. There is also a slight advantage in using prefetching. Note that while a raw FFT-based convolution would impose periodic bounds at the borders, fftconvolve avoids this by zero-padding the inputs, at the cost of some extra memory.
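As a sanity check that the FFT-based and direct implementations compute the same thing (up to numerical precision), one can compare them on toy data:

```python
import numpy as np
from scipy.signal import convolve2d, fftconvolve

np.random.seed(0)
A = np.random.rand(64, 64)
B = np.random.rand(9, 9)

# both compute the same zero-padded linear convolution, cropped to A's shape
C_direct = convolve2d(A, B, mode='same', boundary='fill', fillvalue=0)
C_fft = fftconvolve(A, B, mode='same')
assert np.allclose(C_direct, C_fft)
```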

## using ndimage

Within scipy, the ndimage library offers different solutions:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.correlate.html

In [15]:
from scipy.ndimage import convolve
help(convolve)

Help on function convolve in module scipy.ndimage.filters:

convolve(input, weights, output=None, mode='reflect', cval=0.0, origin=0)
Multidimensional convolution.

The array is convolved with the given kernel.

Parameters
----------
input : array_like
Input array to filter.
weights : array_like
Array of weights, same number of dimensions as input
output : ndarray, optional
The output parameter passes an array in which to store the
filter output. Output array should have different name as
compared to input array to avoid aliasing errors.
mode : {'reflect','constant','nearest','mirror', 'wrap'}, optional
the mode parameter determines how the array borders are
handled. For 'constant' mode, values beyond borders are set to be
cval. Default is 'reflect'.
cval : scalar, optional
Value to fill past edges of input if mode is 'constant'. Default
is 0.0
origin : array_like, optional
The origin parameter controls the placement of the filter,
relative to the centre of the current element of the input.
Default of 0 is equivalent to (0,)*input.ndim.

Returns
-------
result : ndarray
The result of convolution of input with weights.

See Also
--------
correlate : Correlate an image with a kernel.

Notes
-----
Each value in result is :math:C_i = \sum_j{I_{i+k-j} W_j}, where
W is the weights kernel,
j is the n-D spatial index over :math:W,
I is the input and k is the coordinate of the center of
W, specified by origin in the input parameters.

Examples
--------
Perhaps the simplest case to understand is mode='constant', cval=0.0,
because in this case borders (i.e. where the weights kernel, centered
on any one value, extends beyond an edge of input.

>>> a = np.array([[1, 2, 0, 0],
...               [5, 3, 0, 4],
...               [0, 0, 0, 7],
...               [9, 3, 0, 0]])
>>> k = np.array([[1,1,1],[1,1,0],[1,0,0]])
>>> from scipy import ndimage
>>> ndimage.convolve(a, k, mode='constant', cval=0.0)
array([[11, 10,  7,  4],
[10,  3, 11, 11],
[15, 12, 14,  7],
[12,  3,  7,  0]])

Setting cval=1.0 is equivalent to padding the outer edge of input
with 1.0's (and then extracting only the original region of the result).

>>> ndimage.convolve(a, k, mode='constant', cval=1.0)
array([[13, 11,  8,  7],
[11,  3, 11, 14],
[16, 12, 14, 10],
[15,  6, 10,  5]])

With mode='reflect' (the default), outer values are reflected at the
edge of input to fill in missing values.

>>> b = np.array([[2, 0, 0],
...               [1, 0, 0],
...               [0, 0, 0]])
>>> k = np.array([[0,1,0], [0,1,0], [0,1,0]])
>>> ndimage.convolve(b, k, mode='reflect')
array([[5, 0, 0],
[3, 0, 0],
[1, 0, 0]])

This includes diagonally at the corners.

>>> k = np.array([[1,0,0],[0,1,0],[0,0,1]])
>>> ndimage.convolve(b, k)
array([[4, 2, 0],
[3, 2, 0],
[1, 1, 0]])

With mode='nearest', the single nearest value in to an edge in
input is repeated as many times as needed to match the overlapping
weights.

>>> c = np.array([[2, 0, 1],
...               [1, 0, 0],
...               [0, 0, 0]])
>>> k = np.array([[0, 1, 0],
...               [0, 1, 0],
...               [0, 1, 0],
...               [0, 1, 0],
...               [0, 1, 0]])
>>> ndimage.convolve(c, k, mode='nearest')
array([[7, 0, 3],
[5, 0, 2],
[3, 0, 1]])


In [16]:
from scipy.ndimage import convolve
def test_ndimage(A, B, C, prefetching=False):
    if prefetching:
        for i_N in np.arange(A.shape[0]):
            for i_M in np.arange(B.shape[0]):
                C[i_N, i_M, :, :] = convolve(A[i_N, :, :], B[i_M, :, :], mode='constant', cval=0)
    else:
        for i_N in np.arange(A.shape[-1]):
            for i_M in np.arange(B.shape[-1]):
                C[:, :, i_N, i_M] = convolve(A[:, :, i_N], B[:, :, i_M], mode='constant', cval=0)


In [17]:
A, B, C = get_data(N, n_N, M, n_M)

In [18]:
%%timeit
test_ndimage(A, B, C)

6.67 s ± 231 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [19]:
A, B, C = get_data(N, n_N, M, n_M, prefetching=True)

In [20]:
%%timeit
test_ndimage(A, B, C, prefetching=True)

6.79 s ± 1.08 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


There is also a correlate function:

In [21]:
%%timeit
from scipy.ndimage import correlate
C = correlate(A, B[::-1, ::-1], mode='constant', cval=0)

6.94 s ± 247 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## using skimage

The skimage library has recently grown in popularity and includes many advanced computer vision algorithms. Its match_template operator is quick (it uses the Fourier transform) while allowing for many convenient options (mode, boundaries): https://github.com/scikit-image/scikit-image/blob/master/skimage/feature/template.py

Note: this function computes a normalized correlation so that you need to symmetrically invert the template to get the convolution.
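The flip relation between correlation and convolution mentioned in the note can be checked directly with scipy (this illustrates only the flip; match_template additionally normalizes its output):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

np.random.seed(1)
A = np.random.rand(16, 16)
B = np.random.rand(5, 5)  # odd-sized template, so the 'same' alignments match

# correlating with B equals convolving with the symmetrically flipped B
C_corr = correlate2d(A, B, mode='same', boundary='fill', fillvalue=0)
C_conv = convolve2d(A, B[::-1, ::-1], mode='same', boundary='fill', fillvalue=0)
assert np.allclose(C_corr, C_conv)
```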

In [22]:
from skimage.feature import match_template
def test_skimage(A, B, C, prefetching=False):
    if prefetching:
        for i_N in np.arange(A.shape[0]):
            for i_M in np.arange(B.shape[0]):
                C[i_N, i_M, :, :] = match_template(A[i_N, :, :], B[i_M, :, :], pad_input=True, mode='constant', constant_values=0.)
    else:
        for i_N in np.arange(A.shape[-1]):
            for i_M in np.arange(B.shape[-1]):
                C[:, :, i_N, i_M] = match_template(A[:, :, i_N], B[:, :, i_M], pad_input=True, mode='constant', constant_values=0.)

In [23]:
A, B, C = get_data(N, n_N, M, n_M)

In [24]:
%%timeit
test_skimage(A, B, C)

1.13 s ± 37.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [25]:
A, B, C = get_data(N, n_N, M, n_M, prefetching=True)

In [26]:
%%timeit
test_skimage(A, B, C, prefetching=True)

995 ms ± 43.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## using numba

The numba package gives access to very fast compiled routines: http://numba.pydata.org/numba-doc/0.15.1/examples.html#filterbank-correlation

In [27]:
from __future__ import print_function, division, absolute_import
import numpy as np
import numba
from numba import double, jit

nd4type = numba.double[:,:,:,:]
nd3type = numba.double[:,:,:]

@jit(nopython=True)
def nb_get_data(N=N, n_N=n_N, M=M, n_M=n_M, seed=42, prefetching=False):
    np.random.seed(seed)
    if prefetching:
        A = np.random.rand(n_N, N, N)
        B = np.random.rand(n_M, M, M)
        C = np.zeros((n_N, n_M, N, N))
    else:
        A = np.random.rand(N, N, n_N)
        B = np.random.rand(M, M, n_M)
        C = np.zeros((N, N, n_N, n_M))
    return A, B, C

@jit((nd3type, nd3type, nd4type))
def nbcorr_prefetching(imgs, filters, output):
    n_imgs, n_rows, n_cols = imgs.shape
    n_filters, height, width = filters.shape

    for ii in range(n_imgs):
        for rr in range(n_rows - height + 1):
            for cc in range(n_cols - width + 1):
                for hh in range(height):
                    for ww in range(width):
                        for ff in range(n_filters):
                            imgval = imgs[ii, rr + hh, cc + ww]
                            filterval = filters[ff, hh, ww]
                            output[ii, ff, rr, cc] += imgval * filterval

@jit((nd3type, nd3type, nd4type))
def nbcorr(imgs, filters, output):
    n_rows, n_cols, n_imgs = imgs.shape
    height, width, n_filters = filters.shape

    for ii in range(n_imgs):
        for rr in range(n_rows - height + 1):
            for cc in range(n_cols - width + 1):
                for hh in range(height):
                    for ww in range(width):
                        for ff in range(n_filters):
                            imgval = imgs[rr + hh, cc + ww, ii]
                            filterval = filters[hh, ww, ff]
                            output[rr, cc, ii, ff] += imgval * filterval

def test_numba(A, B, C, prefetching=False):
    if prefetching:
        nbcorr_prefetching(A, B, C)
    else:
        nbcorr(A, B, C)

In [28]:
A, B, C = nb_get_data(N, n_N, M, n_M)

In [29]:
%%timeit
test_numba(A, B, C)

4.25 s ± 78.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [30]:
A, B, C = nb_get_data(N, n_N, M, n_M, prefetching=True)

In [31]:
%%timeit
test_numba(A, B, C, prefetching=True)

8.55 s ± 195 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


There may be an advantage when processing multiple images at once:

In [32]:
A, B, C = nb_get_data(N=256, n_N=10, M=32, n_M=10)

In [33]:
%%timeit
test_numba(A, B, C)

5.44 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [34]:
A, B, C = nb_get_data(N=256, n_N=10, M=32, n_M=10, prefetching=True)

In [35]:
%%timeit
test_numba(A, B, C, prefetching=True)

26.8 s ± 470 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Seems promising, as many numpy operations are supported: http://numba.pydata.org/numba-doc/dev/reference/numpysupported.html

## using tensorflow

The tensorflow package is currently used in deep-learning architectures, which also often rely on convolutions:

https://www.tensorflow.org/api_docs/python/tf/nn/conv2d

In [36]:
import tensorflow as tf
def test_tf(img, kernel, strides=[1, 1, 1, 1], padding='SAME', prefetching=False):
    if prefetching:
        with tf.Graph().as_default():
            img_shape = img.shape
            img_reshaped = img.reshape(1, img_shape[1], img_shape[2], img_shape[0])
            kernel_shape = kernel.shape
            # reshape the kernel (not the image) into TF's HWIO layout
            kernel_reshaped = kernel.reshape(kernel_shape[1], kernel_shape[2], 1, kernel_shape[0])
            kernel_reshaped = np.repeat(kernel_reshaped, img_shape[0], axis=2)

            x = tf.placeholder('float32', [1, None, None, img_shape[0]])
            w = tf.get_variable('w', initializer=tf.to_float(kernel_reshaped))

            conv = tf.nn.conv2d(x, w, strides=strides, padding=padding)
            init = tf.global_variables_initializer()
            with tf.Session() as session:
                session.run(init)
                conv_op = session.run(conv, feed_dict={x: img_reshaped})
    else:
        with tf.Graph().as_default():
            img_shape = img.shape
            img_reshaped = img.reshape(1, img_shape[0], img_shape[1], img_shape[2])
            kernel_shape = kernel.shape
            kernel_reshaped = kernel[:, :, np.newaxis, :]
            kernel_reshaped = np.repeat(kernel_reshaped, img_shape[2], axis=2)

            x = tf.placeholder('float32', [1, None, None, img_shape[2]])
            w = tf.get_variable('w', initializer=tf.to_float(kernel_reshaped))

            conv = tf.nn.conv2d(x, w, strides=strides, padding=padding)
            init = tf.global_variables_initializer()
            with tf.Session() as session:
                session.run(init)
                conv_op = session.run(conv, feed_dict={x: img_reshaped})

In [37]:
A, B, C = get_data()

In [38]:
%%timeit
test_tf(A,B)

1.53 s ± 295 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Seems slower than scipy_fft, but the comparison is quite unfair, as the function includes a lot of boilerplate that could be run before getting the actual data ($A$ and $B$); this needs more work. We will exclude it from the benchmark from now on.
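One way to make such a comparison fairer would be to move the boilerplate into the setup argument of timeit.Timer, so that only the operation of interest gets timed. A minimal sketch of that pattern (on a dummy numpy operation, not the full TensorFlow graph):

```python
from timeit import Timer

# everything in `setup` runs once, outside the timed statement
setup = 'import numpy as np; A = np.random.rand(256, 256)'
t = Timer('np.fft.rfft2(A)', setup=setup)
elapsed = t.timeit(number=10)
assert elapsed > 0
```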

## wrapping things up

Now that we have an implementation for each library, it may be worth checking how they scale with respect to the different parameters:

### number of images

This should scale linearly:

In [39]:
N, n_N, M, n_M = 256, 4, 32, 16
from timeit import Timer
reps = 20
sp, sk, nb, tf = [], [], [], []
n_Ns = 2**np.arange(6)

for prefetching in [False, True]:
    for n_N in n_Ns:
        A, B, C = get_data(N, n_N, M, n_M, prefetching=prefetching)
        t = Timer(lambda: test_skimage(A, B, C, prefetching=prefetching))
        sk.append(t.timeit(number=reps))
        t = Timer(lambda: test_scipy_fft(A, B, C, prefetching=prefetching))
        sp.append(t.timeit(number=reps))
        t = Timer(lambda: test_numba(A, B, C, prefetching=prefetching))
        nb.append(t.timeit(number=reps))
        #t = Timer(lambda: test_tf(A, B, prefetching=prefetching))
        #tf.append(t.timeit(number=reps))

fig, ax = plt.subplots(figsize=(8, 5))
ax.loglog(n_Ns, sp[:len(n_Ns)], c='r', label='scipy')
ax.loglog(n_Ns, sk[:len(n_Ns)], c='b', label='skimage')
ax.loglog(n_Ns, nb[:len(n_Ns)], c='g', label='numba')
#ax.loglog(n_Ns, tf[:len(n_Ns)], c='k', label='tensorflow')
ax.loglog(n_Ns, sp[len(n_Ns):], '--', c='r')
ax.loglog(n_Ns, sk[len(n_Ns):], '--', c='b')
ax.loglog(n_Ns, nb[len(n_Ns):], '--', c='g')
#ax.loglog(n_Ns, tf[len(n_Ns):], '--', c='k')
ax.set_xlabel('n_N')
ax.set_ylabel('time (s)')
ax.legend();


### size of images

This should scale supra-linearly, since even the FFT route costs on the order of $N^2 \log N$ operations per image/kernel pair (and the direct method on the order of $N^2 M^2$):

In [40]:
N, n_N, M, n_M = 256, 4, 32, 16
from timeit import Timer
reps = 20
sp, sk, nb, tens = [], [], [], []
Ns = 2**np.arange(5, 10)
for prefetching in [False, True]:
for N in Ns:
A, B, C = get_data(N, n_N, M, n_M, prefetching=prefetching)
t = Timer(lambda: test_skimage(A, B, C, prefetching=prefetching))
sk.append(t.timeit(number=reps))
t = Timer(lambda: test_scipy_fft(A, B, C, prefetching=prefetching))
sp.append(t.timeit(number=reps))
t = Timer(lambda: test_numba(A, B, C, prefetching=prefetching))
nb.append(t.timeit(number=reps))

fig , ax = plt.subplots(figsize=(8, 5))
ax.loglog(Ns, sp[:len(Ns)], c='r', label='scipy')
ax.loglog(Ns, sk[:len(Ns)], c='b', label='skimage')
ax.loglog(Ns, nb[:len(Ns)], c='g', label='numba')
ax.loglog(Ns, sp[len(Ns):], '--', c='r')
ax.loglog(Ns, sk[len(Ns):], '--', c='b')
ax.loglog(Ns, nb[len(Ns):], '--', c='g')
ax.set_xlabel('N')
ax.set_ylabel('time (s)')
ax.legend();


### number of kernels

This should scale linearly, since each additional kernel adds a fixed amount of work:

In [41]:
N, n_N, M, n_M = 256, 4, 32, 16
from timeit import Timer
reps = 20
sp, sk, nb, tens = [], [], [], []
n_Ms = 2**np.arange(5)

for prefetching in [False, True]:
for n_M in n_Ms:
A, B, C = get_data(N, n_N, M, n_M, prefetching=prefetching)
t = Timer(lambda: test_skimage(A, B, C, prefetching=prefetching))
sk.append(t.timeit(number=reps))
t = Timer(lambda: test_scipy_fft(A, B, C, prefetching=prefetching))
sp.append(t.timeit(number=reps))
t = Timer(lambda: test_numba(A, B, C, prefetching=prefetching))
nb.append(t.timeit(number=reps))

fig , ax = plt.subplots(figsize=(8, 5))
ax.loglog(n_Ms, sp[:len(n_Ms)], c='r', label='scipy')
ax.loglog(n_Ms, sk[:len(n_Ms)], c='b', label='skimage')
ax.loglog(n_Ms, nb[:len(n_Ms)], c='g', label='numba')
ax.loglog(n_Ms, sp[len(n_Ms):], '--', c='r')
ax.loglog(n_Ms, sk[len(n_Ms):], '--', c='b')
ax.loglog(n_Ms, nb[len(n_Ms):], '--', c='g')
ax.set_xlabel('n_M')
ax.set_ylabel('time (s)')
ax.legend();


### conclusion?

As a conclusion, there is no single answer for all situations: the fastest method depends on the task at hand. Still, the FFT-based solution in scipy seems the most rapid. Overall, there is also a slight advantage in using prefetching. In the end, knowing the relative advantages of each library is crucial for getting the best performance.
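To automate that choice, scipy.signal also provides the choose_conv_method helper, which estimates from the shapes of the inputs whether a direct or an FFT-based convolution should be faster; a minimal sketch:

```python
import numpy as np
from scipy.signal import choose_conv_method

np.random.seed(2)
A = np.random.rand(256, 256)
B = np.random.rand(32, 32)

# estimate the faster strategy for these shapes without running both
method = choose_conv_method(A, B, mode='same')
assert method in ('direct', 'fft')
```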
