Convolutions are essential components of many algorithms in neural networks, image processing, computer vision... but these are also a bottleneck in terms of computations. I will here benchmark different solutions using numpy, scipy or tensorflow. This is work in progress, so any suggestion is welcome (for instance on StackExchange)!

Let's first initialize the notebook:

In [1]:
from __future__ import division, print_function
import numpy as np
np.set_printoptions(precision=6, suppress=True)
import os
%matplotlib inline
#%config InlineBackend.figure_format='retina'
%config InlineBackend.figure_format = 'svg'
import matplotlib.pyplot as plt
phi = (np.sqrt(5)+1)/2
fig_width = 10
figsize = (fig_width, fig_width/phi)


## setting up the problem

The convolution operates on two 2-D matrices. We will here always consider the case which is most typical in computer vision:

• a first matrix $A$ is the input and is typically large ($N \times N$ where $N$ is typically larger than $2^{10}=1024$),
• a second matrix $B$ is the template and is typically smaller ($M \times M$, with say $M=128$),
• the result of the convolution $C = A \ast B$ is padded such that it is of the same size as $A$.
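
With zero padding outside the support of $A$ and the output cropped back to the size of $A$, entry $(i, j)$ of the result reads (using a centered convention and assuming $M$ odd for simplicity):

$$C[i, j] = \sum_{u=0}^{M-1} \sum_{v=0}^{M-1} B[u, v] \, A\left[i - u + \tfrac{M-1}{2},\; j - v + \tfrac{M-1}{2}\right]$$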

Often, you need to do this for many images and many kernels. In addition, we will test for the effect of prefetching, that is, of the memory layout of the batch axes.
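What I call prefetching here is just the memory layout: with the batch axis first, each 2-D slice is one contiguous block of memory, which helps the per-image routines. A minimal sketch (the array names are mine):

```python
import numpy as np

A_pref = np.random.rand(4, 8, 8)  # batch axis first ("prefetching")
A_post = np.random.rand(8, 8, 4)  # batch axis last

# with the batch axis first, each 2-D image is one contiguous block
assert A_pref[0].flags['C_CONTIGUOUS']
# with the batch axis last, a 2-D image is strided across memory
assert not A_post[..., 0].flags['C_CONTIGUOUS']
```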

Let's write a small function for this:

In [2]:
N, n_N, M, n_M = 512, 4, 64, 16
# DEBUG
N, n_N, M, n_M = 256, 10, 32, 8

def get_data(N=N, n_N=n_N, M=M, n_M=n_M, seed=42, prefetching=False):
    np.random.seed(seed)
    if prefetching:
        A = np.random.rand(n_N, N, N)
        B = np.random.rand(n_M, M, M)
        C = np.zeros((n_N, n_M, N, N))
    else:
        A = np.random.rand(N, N, n_N)
        B = np.random.rand(M, M, n_M)
        C = np.zeros((N, N, n_N, n_M))
    return A, B, C

for prefetching in [True, False]:
    print('with prefetching=', prefetching)
    A, B, C = get_data(prefetching=prefetching)
    print('Checking size of A =', A.shape, ' of B=', B.shape, ', and of C=', C.shape)

with prefetching= True
Checking size of A = (10, 256, 256)  of B= (8, 32, 32) , and of C= (10, 8, 256, 256)
with prefetching= False
Checking size of A = (256, 256, 10)  of B= (32, 32, 8) , and of C= (256, 256, 10, 8)


The call to this function incurs some initialization time, but it makes the further tests cleaner:

In [3]:
def test_get_data(N=N, n_N=n_N, M=M, n_M=n_M):
    A, B, C = get_data(N, n_N, M, n_M)

In [4]:
%%timeit
test_get_data(N, n_N, M, n_M)

45.5 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## using scipy

The scipy library provides different solutions:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.convolve2d.html

In [5]:
from scipy.signal import convolve2d

def test_scipy(A, B, C, prefetching=False):
    if prefetching:
        for i_N in np.arange(A.shape[0]):
            for i_M in np.arange(B.shape[0]):
                C[i_N, i_M, :, :] = convolve2d(A[i_N, :, :], B[i_M, :, :], mode='same', boundary='fill', fillvalue=0)
    else:
        for i_N in np.arange(A.shape[-1]):
            for i_M in np.arange(B.shape[-1]):
                C[:, :, i_N, i_M] = convolve2d(A[:, :, i_N], B[:, :, i_M], mode='same', boundary='fill', fillvalue=0)

In [6]:
A, B, C = get_data(N, n_N, M, n_M)

In [7]:
%%timeit
test_scipy(A, B, C)

49.6 s ± 1.8 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]:
A, B, C = get_data(N, n_N, M, n_M, prefetching=True)

In [9]:
%%timeit
test_scipy(A, B, C, prefetching=True)

28.2 s ± 1.53 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]:
A, B, C = get_data(N, n_N, M, n_M)
from scipy.signal import fftconvolve

def test_scipy_fft(A, B, C, prefetching=False):
    if prefetching:
        for i_N in np.arange(A.shape[0]):
            for i_M in np.arange(B.shape[0]):
                C[i_N, i_M, :, :] = fftconvolve(A[i_N, :, :], B[i_M, :, :], mode='same')
    else:
        for i_N in np.arange(A.shape[-1]):
            for i_M in np.arange(B.shape[-1]):
                C[:, :, i_N, i_M] = fftconvolve(A[:, :, i_N], B[:, :, i_M], mode='same')

In [11]:
A, B, C = get_data(N, n_N, M, n_M)

In [12]:
%%timeit
test_scipy_fft(A, B, C)

928 ms ± 28.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [13]:
A, B, C = get_data(N, n_N, M, n_M, prefetching=True)

In [14]:
%%timeit
test_scipy_fft(A, B, C, prefetching=True)

772 ms ± 56.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Further profiling shows that most of the computing time is divided between the three FFTs (two forward, one inverse):

%lprun -f fftconvolve test_scipy_fft(A, B, C, prefetching=True)

This shows the advantage of using the Fourier transform to perform the convolution. There is also a slight advantage in using prefetching. Note that a raw product of FFTs would implement a circular convolution, that is, periodic bounds at the borders; fftconvolve avoids this by zero-padding both inputs before transforming them.
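To see where the gain comes from, here is a minimal sketch of the same computation done by hand with numpy's FFT (the helper name `fft_convolve2d_same` is mine): zero-pad both arrays to the full linear-convolution size, multiply the transforms, and crop the central region, which is essentially what fftconvolve does internally:

```python
import numpy as np
from scipy.signal import fftconvolve

def fft_convolve2d_same(A, B):
    """2-D linear convolution via zero-padded FFTs, cropped to A's shape."""
    s0 = A.shape[0] + B.shape[0] - 1  # full linear-convolution size
    s1 = A.shape[1] + B.shape[1] - 1
    # zero-padded transforms, multiplied point-wise
    F = np.fft.rfft2(A, s=(s0, s1)) * np.fft.rfft2(B, s=(s0, s1))
    full = np.fft.irfft2(F, s=(s0, s1))
    # keep the central region, as in mode='same'
    i0 = (B.shape[0] - 1) // 2
    i1 = (B.shape[1] - 1) // 2
    return full[i0:i0 + A.shape[0], i1:i1 + A.shape[1]]

A = np.random.rand(64, 64)
B = np.random.rand(9, 9)
agree = np.allclose(fft_convolve2d_same(A, B), fftconvolve(A, B, mode='same'))
```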

## using ndimage

Within scipy, the ndimage library provides different solutions:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.correlate.html
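As a quick sanity check before benchmarking (a sketch of my own, not from the docs): for an odd-sized kernel, ndimage's convolve with mode='constant' should agree with convolve2d in 'same' mode with zero fill:

```python
import numpy as np
from scipy.ndimage import convolve
from scipy.signal import convolve2d

rng = np.random.RandomState(42)
A = rng.rand(32, 32)
B = rng.rand(5, 5)  # odd size, so both conventions are centered identically
C_ndimage = convolve(A, B, mode='constant', cval=0.0)
C_signal = convolve2d(A, B, mode='same', boundary='fill', fillvalue=0)
agree = np.allclose(C_ndimage, C_signal)
```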

In [15]:
from scipy.ndimage import convolve
# help(convolve)

In [16]:
from scipy.ndimage import convolve

def test_ndimage(A, B, C, prefetching=False):
    if prefetching:
        for i_N in np.arange(A.shape[0]):
            for i_M in np.arange(B.shape[0]):
                C[i_N, i_M, :, :] = convolve(A[i_N, :, :], B[i_M, :, :], mode='constant', cval=0)
    else:
        for i_N in np.arange(A.shape[-1]):
            for i_M in np.arange(B.shape[-1]):
                C[:, :, i_N, i_M] = convolve(A[:, :, i_N], B[:, :, i_M], mode='constant', cval=0)


In [17]:
A, B, C = get_data(N, n_N, M, n_M)

In [18]:
%%timeit
test_ndimage(A, B, C)

14.5 s ± 914 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [19]:
A, B, C = get_data(N, n_N, M, n_M, prefetching=True)

In [20]:
%%timeit
test_ndimage(A, B, C, prefetching=True)

11.2 s ± 590 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


There is also a correlate function; note that, applied directly to the stacked arrays as below, it performs a single 3-D correlation rather than a batch of 2-D ones:

In [21]:
%%timeit
from scipy.ndimage import correlate
C = correlate(A, B[::-1, ::-1], mode='constant', cval=0)

13.3 s ± 517 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## using skimage

The skimage package has recently grown in popularity and includes many advanced computer vision algorithms. Its match_template operator is fast (it uses the Fourier transform) while allowing for many convenient options (mode, boundaries): https://github.com/scikit-image/scikit-image/blob/master/skimage/feature/template.py

Note: this function computes a normalized cross-correlation, so you need to symmetrically flip the template to get the convolution.
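The flip relation can be checked on the unnormalized operators (match_template additionally normalizes the output, so its values differ): correlating with the doubly-flipped template is the same as convolving with the original one. A minimal sketch:

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.RandomState(0)
A = rng.rand(32, 32)
B = rng.rand(5, 5)
# cross-correlation with the template flipped along both axes
# equals the convolution with the original template
C_conv = convolve2d(A, B, mode='same', boundary='fill', fillvalue=0)
C_corr = correlate2d(A, B[::-1, ::-1], mode='same', boundary='fill', fillvalue=0)
agree = np.allclose(C_conv, C_corr)
```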

In [22]:
from skimage.feature import match_template

def test_skimage(A, B, C, prefetching=False):
    if prefetching:
        for i_N in np.arange(A.shape[0]):
            for i_M in np.arange(B.shape[0]):
                C[i_N, i_M, :, :] = match_template(A[i_N, :, :], B[i_M, :, :], pad_input=True, mode='constant', constant_values=0.)
    else:
        for i_N in np.arange(A.shape[-1]):
            for i_M in np.arange(B.shape[-1]):
                C[:, :, i_N, i_M] = match_template(A[:, :, i_N], B[:, :, i_M], pad_input=True, mode='constant', constant_values=0.)

In [23]:
A, B, C = get_data(N, n_N, M, n_M)

In [24]:
%%timeit
test_skimage(A, B, C)

1.62 s ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [25]:
A, B, C = get_data(N, n_N, M, n_M, prefetching=True)

In [26]:
%%timeit
test_skimage(A, B, C, prefetching=True)

1.52 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## using numba

The numba package compiles plain Python loops into very fast routines: http://numba.pydata.org/numba-doc/0.15.1/examples.html#filterbank-correlation

In [27]:
from __future__ import print_function, division, absolute_import
import numpy as np
import numba
from numba import jit

nd4type = numba.double[:, :, :, :]
nd3type = numba.double[:, :, :]

@jit(nopython=True)
def nb_get_data(N=N, n_N=n_N, M=M, n_M=n_M, seed=42, prefetching=False):
    np.random.seed(seed)
    if prefetching:
        A = np.random.rand(n_N, N, N)
        B = np.random.rand(n_M, M, M)
        C = np.zeros((n_N, n_M, N, N))
    else:
        A = np.random.rand(N, N, n_N)
        B = np.random.rand(M, M, n_M)
        C = np.zeros((N, N, n_N, n_M))
    return A, B, C

@jit((nd3type, nd3type, nd4type))
def nbcorr_prefetching(imgs, filters, output):
    n_imgs, n_rows, n_cols = imgs.shape
    n_filters, height, width = filters.shape

    for ii in range(n_imgs):
        for rr in range(n_rows - height + 1):
            for cc in range(n_cols - width + 1):
                for hh in range(height):
                    for ww in range(width):
                        for ff in range(n_filters):
                            imgval = imgs[ii, rr + hh, cc + ww]
                            filterval = filters[ff, hh, ww]
                            output[ii, ff, rr, cc] += imgval * filterval

@jit((nd3type, nd3type, nd4type))
def nbcorr(imgs, filters, output):
    n_rows, n_cols, n_imgs = imgs.shape
    height, width, n_filters = filters.shape

    for ii in range(n_imgs):
        for rr in range(n_rows - height + 1):
            for cc in range(n_cols - width + 1):
                for hh in range(height):
                    for ww in range(width):
                        for ff in range(n_filters):
                            imgval = imgs[rr + hh, cc + ww, ii]
                            filterval = filters[hh, ww, ff]
                            output[rr, cc, ii, ff] += imgval * filterval

def test_numba(A, B, C, prefetching=False):
    if prefetching:
        nbcorr_prefetching(A, B, C)
    else:
        nbcorr(A, B, C)

In [28]:
A, B, C = nb_get_data(N, n_N, M, n_M)

In [29]:
%%timeit
test_numba(A, B, C)

8.64 s ± 386 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [30]:
A, B, C = nb_get_data(N, n_N, M, n_M, prefetching=True)

In [31]:
%%timeit
test_numba(A, B, C, prefetching=True)

13.5 s ± 1.7 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


There may be an advantage when processing multiple images at once:

In [32]:
A, B, C = nb_get_data(N=256, n_N=10, M=32, n_M=10)

In [33]:
%%timeit
test_numba(A, B, C)

7.87 s ± 504 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [34]:
A, B, C = nb_get_data(N=256, n_N=10, M=32, n_M=10, prefetching=True)

In [35]:
%%timeit
test_numba(A, B, C, prefetching=True)

26.8 s ± 345 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Seems promising, as many numpy operations are supported: http://numba.pydata.org/numba-doc/dev/reference/numpysupported.html

## using tensorflow

The tensorflow package is currently used in deep-learning architectures, which also often rely on convolutions:

https://www.tensorflow.org/api_docs/python/tf/nn/conv2d

In [36]:
import tensorflow as tf

def test_tf(img, kernel, strides=[1, 1, 1, 1], padding='SAME', prefetching=False):
    if prefetching:
        with tf.Graph().as_default():
            img_shape = img.shape
            img_reshaped = img.reshape(1, img_shape[1], img_shape[2], img_shape[0])
            kernel_shape = kernel.shape
            kernel_reshaped = kernel.reshape(kernel_shape[1], kernel_shape[2], 1, kernel_shape[0])
            kernel_reshaped = np.repeat(kernel_reshaped, img_shape[0], axis=2)

            x = tf.placeholder('float32', [1, None, None, img_shape[0]])
            w = tf.get_variable('w', initializer=tf.to_float(kernel_reshaped))
            conv = tf.nn.conv2d(x, w, strides=strides, padding=padding)

            init = tf.global_variables_initializer()
            with tf.Session() as session:
                session.run(init)
                conv_op = session.run(conv, feed_dict={x: img_reshaped})
    else:
        with tf.Graph().as_default():
            img_shape = img.shape
            img_reshaped = img.reshape(1, img_shape[0], img_shape[1], img_shape[2])
            kernel_shape = kernel.shape
            kernel_reshaped = kernel[:, :, np.newaxis, :]
            kernel_reshaped = np.repeat(kernel_reshaped, img_shape[2], axis=2)

            x = tf.placeholder('float32', [1, None, None, img_shape[2]])
            w = tf.get_variable('w', initializer=tf.to_float(kernel_reshaped))
            conv = tf.nn.conv2d(x, w, strides=strides, padding=padding)

            init = tf.global_variables_initializer()
            with tf.Session() as session:
                session.run(init)
                conv_op = session.run(conv, feed_dict={x: img_reshaped})


In [37]:
A, B, C = get_data()

In [38]:
%%timeit
test_tf(A,B)

1.71 s ± 19.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Seems slower than scipy's fftconvolve, but the comparison is quite unfair as it includes a lot of boilerplate which could be run before getting the actual data ($A$ and $B$); this needs more work. We will exclude it from the benchmark from now on.

## wrapping things up

Now that we have an implementation for each library, it may be worth checking how they scale with respect to the different parameters:

### number of images

This should scale linearly with the number of images:

In [39]:
N, n_N, M, n_M = 256, 4, 32, 16
from timeit import Timer
reps = 20
sp, sk, nb, tens = [], [], [], []
n_Ns = 2**np.arange(6)

for prefetching in [False, True]:
    for n_N in n_Ns:
        A, B, C = get_data(N, n_N, M, n_M, prefetching=prefetching)
        t = Timer(lambda: test_skimage(A, B, C, prefetching=prefetching))
        sk.append(t.timeit(number=reps))
        t = Timer(lambda: test_scipy_fft(A, B, C, prefetching=prefetching))
        sp.append(t.timeit(number=reps))
        t = Timer(lambda: test_numba(A, B, C, prefetching=prefetching))
        nb.append(t.timeit(number=reps))
        #t = Timer(lambda: test_tf(A, B, prefetching=prefetching))
        #tens.append(t.timeit(number=reps))

fig, ax = plt.subplots(figsize=(8, 5))
ax.loglog(n_Ns, sp[:len(n_Ns)], c='r', label='scipy')
ax.loglog(n_Ns, sk[:len(n_Ns)], c='b', label='skimage')
ax.loglog(n_Ns, nb[:len(n_Ns)], c='g', label='numba')
#ax.loglog(n_Ns, tens[:len(n_Ns)], c='k', label='tensorflow')
ax.loglog(n_Ns, sp[len(n_Ns):], '--', c='r')
ax.loglog(n_Ns, sk[len(n_Ns):], '--', c='b')
ax.loglog(n_Ns, nb[len(n_Ns):], '--', c='g')
#ax.loglog(n_Ns, tens[len(n_Ns):], '--', c='k')
ax.set_xlabel('n_N')
ax.set_ylabel('time (s)')
ax.legend();


### size of images

This should scale supra-linearly (roughly as $N^2 \log N$ for the FFT-based methods):

In [40]:
N, n_N, M, n_M = 256, 4, 32, 16
from timeit import Timer
reps = 20
sp, sk, nb, tens = [], [], [], []
Ns = 2**np.arange(5, 10)

for prefetching in [False, True]:
    for N in Ns:
        A, B, C = get_data(N, n_N, M, n_M, prefetching=prefetching)
        t = Timer(lambda: test_skimage(A, B, C, prefetching=prefetching))
        sk.append(t.timeit(number=reps))
        t = Timer(lambda: test_scipy_fft(A, B, C, prefetching=prefetching))
        sp.append(t.timeit(number=reps))
        t = Timer(lambda: test_numba(A, B, C, prefetching=prefetching))
        nb.append(t.timeit(number=reps))

fig, ax = plt.subplots(figsize=(8, 5))
ax.loglog(Ns, sp[:len(Ns)], c='r', label='scipy')
ax.loglog(Ns, sk[:len(Ns)], c='b', label='skimage')
ax.loglog(Ns, nb[:len(Ns)], c='g', label='numba')
ax.loglog(Ns, sp[len(Ns):], '--', c='r')
ax.loglog(Ns, sk[len(Ns):], '--', c='b')
ax.loglog(Ns, nb[len(Ns):], '--', c='g')
ax.set_xlabel('N')
ax.set_ylabel('time (s)')
ax.legend();


### number of kernels

This should scale linearly with the number of kernels:

In [41]:
N, n_N, M, n_M = 256, 4, 32, 16
from timeit import Timer
reps = 20
sp, sk, nb, tens = [], [], [], []
n_Ms = 2**np.arange(5)

for prefetching in [False, True]:
    for n_M in n_Ms:
        A, B, C = get_data(N, n_N, M, n_M, prefetching=prefetching)
        t = Timer(lambda: test_skimage(A, B, C, prefetching=prefetching))
        sk.append(t.timeit(number=reps))
        t = Timer(lambda: test_scipy_fft(A, B, C, prefetching=prefetching))
        sp.append(t.timeit(number=reps))
        t = Timer(lambda: test_numba(A, B, C, prefetching=prefetching))
        nb.append(t.timeit(number=reps))

fig, ax = plt.subplots(figsize=(8, 5))
ax.loglog(n_Ms, sp[:len(n_Ms)], c='r', label='scipy')
ax.loglog(n_Ms, sk[:len(n_Ms)], c='b', label='skimage')
ax.loglog(n_Ms, nb[:len(n_Ms)], c='g', label='numba')
ax.loglog(n_Ms, sp[len(n_Ms):], '--', c='r')
ax.loglog(n_Ms, sk[len(n_Ms):], '--', c='b')
ax.loglog(n_Ms, nb[len(n_Ms):], '--', c='g')
ax.set_xlabel('n_M')
ax.set_ylabel('time (s)')
ax.legend();


### conclusion?

In conclusion, there is no single answer for all situations: the fastest method depends on the task at hand. Still, the FFT-based solution in scipy seems the most rapid. Overall, there is also a slight advantage in using prefetching. In the end, knowing the relative strengths of each library is crucial for the final result.

%%bash
cd ..
nikola build ; nikola deploy
cd posts