Performance: Matlab vs Python

Performance: Matlab vs Python - python

I recently switched from Matlab to Python. While converting one of my lengthy codes, I was surprised to find Python being very slow. I profiled and traced the problem with one function hogging up time. This function is being called from various places in my code (being part of other functions which are recursively called). Profiler suggests that 300 calls are made to this function in both Matlab and Python.
In short, following codes summarizes the issue at hand:
MATLAB
The class containing the function:
classdef ExampleKernel1 < handle
methods (Static)
function [kernel] = kernel_2D(M,x,N,y)
kernel = zeros(M,N);
for i= 1 : M
for j= 1 : N
% Define the custom kernel function here
kernel(i , j) = sqrt((x(i , 1) - y(j , 1)) .^ 2 + ...
(x(i , 2) - y(j , 2)) .^2 );
end
end
end
end
end
and the script to call test.m:
xVec=[
49.7030 78.9590
42.6730 11.1390
23.2790 89.6720
75.6050 25.5890
81.5820 53.2920
44.9680 2.7770
38.7890 78.9050
39.1570 33.6790
33.2640 54.7200
4.8060 44.3660
49.7030 78.9590
42.6730 11.1390
23.2790 89.6720
75.6050 25.5890
81.5820 53.2920
44.9680 2.7770
38.7890 78.9050
39.1570 33.6790
33.2640 54.7200
4.8060 44.3660
];
N=size(xVec,1);
kex1=ExampleKernel1;
tic
for i=1:300
K=kex1.kernel_2D(N,xVec,N,xVec);
end
toc
Gives the output
clear all
>> test
Elapsed time is 0.022426 seconds.
>> test
Elapsed time is 0.009852 seconds.
PYTHON 3.4
Class containing the function CustomKernels.py:
from numpy import zeros
from math import sqrt
class CustomKernels:
"""Class for defining the custom kernel functions"""
#staticmethod
def exampleKernelA(M, x, N, y):
"""Example kernel function A"""
kernel = zeros([M, N])
for i in range(0, M):
for j in range(0, N):
# Define the custom kernel function here
kernel[i, j] = sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
return kernel
and the script to call test.py:
import numpy as np
from CustomKernels import CustomKernels
from time import perf_counter
xVec = np.array([
[49.7030, 78.9590],
[42.6730, 11.1390],
[23.2790, 89.6720],
[75.6050, 25.5890],
[81.5820, 53.2920],
[44.9680, 2.7770],
[38.7890, 78.9050],
[39.1570, 33.6790],
[33.2640, 54.7200],
[4.8060 , 44.3660],
[49.7030, 78.9590],
[42.6730, 11.1390],
[23.2790, 89.6720],
[75.6050, 25.5890],
[81.5820, 53.2920],
[44.9680, 2.7770],
[38.7890, 78.9050],
[39.1570, 33.6790],
[33.2640, 54.7200],
[4.8060 , 44.3660]
])
N = xVec.shape[0]
kex1 = CustomKernels.exampleKernelA
start=perf_counter()
for i in range(0,300):
K = kex1(N, xVec, N, xVec)
print(' %f secs' %(perf_counter()-start))
Gives the output
%run test.py
0.940515 secs
%run test.py
0.884418 secs
%run test.py
0.940239 secs
RESULTS
Comparing the results it seems Matlab is about 42 times faster after a "clear all" is called and then 100 times faster if script is run multiple times without calling "clear all". That is at least and order of magnitude if not two orders of magnitudes faster. This is a very surprising result for me. I was expecting the result to be the other way around.
Can someone please shed some light on this?
Can someone suggest a faster way to perform this?
SIDE NOTE
I have also tried to use numpy.sqrt which makes the performance worse, therefore I am using math.sqrt in Python.
EDIT
The for loops for calling the functions are purely fictitious. They are there just to "simulate" 300 calls to the function. As I described earlier, the kernel functions (kernel_2D in Matlab and kex1 in Python) are called from various different places in the program. To make the problem shorter, I "simulate" the 300 calls using the for loop. The for loops inside the kernel functions are essential and unavoidable because of the structure of the kernel matrix.
EDIT 2
Here is the larger problem: https://github.com/drfahdsiddiqui/bbfmm2d-python

You want to get rid of those for loops. Try this:
def exampleKernelA(M, x, N, y):
"""Example kernel function A"""
i, j = np.indices((N, M))
# Define the custom kernel function here
kernel[i, j] = np.sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
return kernel
You can also do it with broadcasting, which may be even faster, but a little less intuitive coming from MATLAB.

Upon further investigation I have found that using indices as indicated in the answer is still slower.
Solution: Use meshgrid
def exampleKernelA(M, x, N, y):
"""Example kernel function A"""
# Euclidean norm function implemented using meshgrid idea.
# Fastest
x0, y0 = meshgrid(y[:, 0], x[:, 0])
x1, y1 = meshgrid(y[:, 1], x[:, 1])
# Define custom kernel here
kernel = sqrt((x0 - y0) ** 2 + (x1 - y1) ** 2)
return kernel
Result: Very very fast, 10 times faster than indices approach. I am getting times which are closer to C.
However: Using meshgrid with Matlab beats C and Numpy by being 10 times faster than both.
Still wondering why!

Matlab uses commerical MKL library. If you use free python distribution, check whether you have MKL or other high performance blas library used in python or it is the default ones, which could be much slower.

Comparing Jit-Compilers
It has been mentioned that Matlab uses an internal Jit-compiler to get good performance on such tasks. Let's compare Matlabs jit-compiler with a Python jit-compiler (Numba).
Code
import numba as nb
import numpy as np
import math
import time
#If the arrays are somewhat larger it makes also sense to parallelize this problem
#cache ==True may also make sense
#nb.njit(fastmath=True)
def exampleKernelA(M, x, N, y):
"""Example kernel function A"""
#explicitly declaring the size of the second dim also improves performance a bit
assert x.shape[1]==2
assert y.shape[1]==2
#Works with all dtypes, zeroing isn't necessary
kernel = np.empty((M,N),dtype=x.dtype)
for i in range(M):
for j in range(N):
# Define the custom kernel function here
kernel[i, j] = np.sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
return kernel
def exampleKernelB(M, x, N, y):
"""Example kernel function A"""
# Euclidean norm function implemented using meshgrid idea.
# Fastest
x0, y0 = np.meshgrid(y[:, 0], x[:, 0])
x1, y1 = np.meshgrid(y[:, 1], x[:, 1])
# Define custom kernel here
kernel = np.sqrt((x0 - y0) ** 2 + (x1 - y1) ** 2)
return kernel
#nb.njit()
def exampleKernelC(M, x, N, y):
"""Example kernel function A"""
#explicitly declaring the size of the second dim also improves performance a bit
assert x.shape[1]==2
assert y.shape[1]==2
#Works with all dtypes, zeroing isn't necessary
kernel = np.empty((M,N),dtype=x.dtype)
for i in range(M):
for j in range(N):
# Define the custom kernel function here
kernel[i, j] = np.sqrt((x[i, 0] - y[j, 0]) ** 2 + (x[i, 1] - y[j, 1]) ** 2)
return kernel
#Your test data
xVec = np.array([
[49.7030, 78.9590],
[42.6730, 11.1390],
[23.2790, 89.6720],
[75.6050, 25.5890],
[81.5820, 53.2920],
[44.9680, 2.7770],
[38.7890, 78.9050],
[39.1570, 33.6790],
[33.2640, 54.7200],
[4.8060 , 44.3660],
[49.7030, 78.9590],
[42.6730, 11.1390],
[23.2790, 89.6720],
[75.6050, 25.5890],
[81.5820, 53.2920],
[44.9680, 2.7770],
[38.7890, 78.9050],
[39.1570, 33.6790],
[33.2640, 54.7200],
[4.8060 , 44.3660]
])
#compilation on first callable
#can be avoided with cache=True
res=exampleKernelA(xVec.shape[0], xVec, xVec.shape[0], xVec)
res=exampleKernelC(xVec.shape[0], xVec, xVec.shape[0], xVec)
t1=time.time()
for i in range(10_000):
res=exampleKernelA(xVec.shape[0], xVec, xVec.shape[0], xVec)
print(time.time()-t1)
t1=time.time()
for i in range(10_000):
res=exampleKernelC(xVec.shape[0], xVec, xVec.shape[0], xVec)
print(time.time()-t1)
t1=time.time()
for i in range(10_000):
res=exampleKernelB(xVec.shape[0], xVec, xVec.shape[0], xVec)
print(time.time()-t1)
Performance
exampleKernelA: 0.03s
exampleKernelC: 0.03s
exampleKernelB: 1.02s
Matlab_2016b (your code, but 10000 rep., after few runs): 0.165s

I got ~5x speed improvement over the meshgrid solution using only broadcasting:
def exampleKernelD(M, x, N, y):
return np.sqrt((x[:,1:] - y[:,1:].T) ** 2 + (x[:,:1] - y[:,:1].T) ** 2)

Related

Numba parallel code slower than its sequential counterpart

I'm new to Numba and I'm trying to implement an old Fortran code in Python using Numba (version 0.54.1), but when I add parallel = True the program actually slows down. My program is very simple: I change the positions x and y in a L x L grid and for each position in the grid I perform a summation
import numpy as np
import numba as nb
#nb.njit(parallel=True)
def lyapunov_grid(x_grid, y_grid, k, N):
L = len(x_grid)
lypnv = np.zeros((L, L))
for ii in nb.prange(L):
for jj in range(L):
x = x_grid[ii]
y = y_grid[jj]
beta0 = 0
sumT11 = 0
for j in range(N):
y = (y - k*np.sin(x)) % (2*np.pi)
x = (x + y) % (2*np.pi)
J = np.array([[1.0, -k*np.cos(x)], [1.0, 1.0 - k*np.cos(x)]])
beta = np.arctan((-J[1,0]*np.cos(beta0) + J[1,1]*np.sin(beta0))/(J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)))
T11 = np.cos(beta0)*(J[0,0]*np.cos(beta) - J[1,0]*np.sin(beta)) - np.sin(beta0)*(J[0,1]*np.cos(beta) - J[1,1]*np.sin(beta))
sumT11 += np.log(abs(T11))/np.log(2)
beta0 = beta
lypnv[ii, jj] = sumT11/N
return lypnv
# Compile
_ = lyapunov_grid(np.linspace(0, 1, 10), np.linspace(0, 1, 10), 1, 10)
# Parameters
N = int(1e3)
L = 128
pi = np.pi
k = 1.5
# Limits of the phase space
x0 = -pi
xf = pi
y0 = -pi
yf = pi
# Grid positions
x = np.linspace(x0, xf, L, endpoint=True)
y = np.linspace(y0, yf, L, endpoint=True)
lypnv = lyapunov_grid(x, y, k, N)
With parallel=False it takes about 8s to run, however with parallel=True it takes about 14s. I also tested with another code from https://github.com/animator/mandelbrot-numba and in this case the parallelization works.
import math
import numpy as np
import numba as nb
WIDTH = 1000
MAX_ITER = 1000
#nb.njit(parallel=True)
def mandelbrot(width, max_iter):
pixels = np.zeros((width, width, 3), dtype=np.uint8)
for y in nb.prange(width):
for x in range(width):
c0 = complex(3.0*x/width - 2, 3.0*y/width - 1.5)
c = 0
for i in range(1, max_iter):
if abs(c) > 2:
log_iter = math.log(i)
pixels[y, x, :] = np.array([int(255*(1+math.cos(3.32*log_iter))/2),
int(255*(1+math.cos(0.774*log_iter))/2),
int(255*(1+math.cos(0.412*log_iter))/2)],
dtype=np.uint8)
break
c = c * c + c0
return pixels
# compile
_ = mandelbrot(WIDTH, 10)
calcpixels = mandelbrot(WIDTH, MAX_ITER)

One main issue is that the second function call compile the function again. Indeed, the types of the provided arguments change: in the first call the third argument is an integer (int transformed to a np.int_) while in the second call the third argument (k) is a floating point number (float transformed to a np.float64). Numba recompiles the function for different parameter types because they are deduced from the type of the arguments and it does not know you want to use a np.float64 type for the third argument (since the first time the function is compiled with for a np.int_ type). One simple solution to fix the problem is to change the first call to:
_ = lyapunov_grid(np.linspace(0, 1, 10), np.linspace(0, 1, 10), 1.0, 10)
However, this is not a robust way to fix the problem. You can specify the parameter types to Numba so it will compile the function at declaration time. This also remove the need to artificially call the function (with useless parameters).
#nb.njit('float64[:,:](float64[::1], float64[::1], float64, float64)', parallel=True)
Note that (J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)) is zero the first time resulting in a division by 0.
Another main issue comes from the allocations of many small arrays in the loop causing a contention of the standard allocator (see this post for more information). While Numba could theoretically optimize it (ie. replace the array with local variables), it actually does not, resulting in a huge slowdown and a contention. Hopefully, in your case, you do not need to actually create the array. At last, you can create it only in the encompassing loop and modify it in the innermost loop. Here is the optimized code:
#nb.njit('float64[:,:](float64[::1], float64[::1], float64, float64)', parallel=True)
def lyapunov_grid(x_grid, y_grid, k, N):
L = len(x_grid)
lypnv = np.zeros((L, L))
for ii in nb.prange(L):
J = np.ones((2, 2), dtype=np.float64)
for jj in range(L):
x = x_grid[ii]
y = y_grid[jj]
beta0 = 0
sumT11 = 0
for j in range(N):
y = (y - k*np.sin(x)) % (2*np.pi)
x = (x + y) % (2*np.pi)
J[0, 1] = -k*np.cos(x)
J[1, 1] = 1.0 - k*np.cos(x)
beta = np.arctan((-J[1,0]*np.cos(beta0) + J[1,1]*np.sin(beta0))/(J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)))
T11 = np.cos(beta0)*(J[0,0]*np.cos(beta) - J[1,0]*np.sin(beta)) - np.sin(beta0)*(J[0,1]*np.cos(beta) - J[1,1]*np.sin(beta))
sumT11 += np.log(abs(T11))/np.log(2)
beta0 = beta
lypnv[ii, jj] = sumT11/N
return lypnv
Here is the results on a old 2-core machine (with 4 hardware threads):
Original sequential: 15.9 s
Original parallel: 11.9 s
Fix-build sequential: 15.7 s
Fix-build parallel: 10.1 s
Optimized sequential: 2.73 s
Optimized parallel: 0.94 s
The optimized implementation is much faster than the others. The parallel optimized version scale very well compared than the original one (2.9 times faster than the sequential one). Finally, the best version is about 12 times faster than the original parallel version. I expect a much faster computation on a recent machine with many more cores.

Machine Learning - implementing a Gradient Descent in Python from Octave code

I am trying to implement a gradient descent function in Python from scratch which I have implemented and work in GNU Octave. Unfortunately I am stuck. I fiddled with it for a while and checked the NumPy documentation but so far no luck.
I am aware of libraries such as scikit-learn, however my purpose is to learn to code such a function from scratch. Perhaps I am going about it the wrong way.
Below you will find all the code necessary to reproduce the error.
Thanks in advance for your help.
Actual result: test fails with error -> "ValueError: matmul: Input operand 0 does not have enough dimensions (has 0, gufunc core with signature (n?,k),(k,m?)->(n?,m?) requires 1)"
Expected result: and array with values [5.2148, -0.5733]
Function gradientDescent() in Octave:
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
theta = theta - (alpha/m)*X'*(X*theta-y);
J_history(iter) = computeCost(X, y, theta);
end
Function gradient_descent() in python:
from numpy import zeros
def compute_cost(X, y, theta):
m = len(y)
ans = (X.T # theta).T - y
J = (ans # ans.T) / (2 * m)
return J[0, 0]
def gradient_descent(X, y, theta, alpha, num_iters):
m = len(y)
J_history = zeros((num_iters, 1), dtype=int)
for iter in range(num_iters):
theta = theta - (alpha / m) # X.T # (X # theta - y)
J_history[iter] = compute_cost(X, y, theta)
return theta
the test file: test_ml_utils.py
import unittest
import numpy as np
from ml.ml_utils import compute_cost, gradient_descent
class TestGradientDescent(unittest.TestCase):
# TODO: implement tests for Gradient Descent function
# [theta J_hist] = gradientDescent([1 5; 1 2; 1 4; 1 5],[1 6 4 2]',[0 0]',0.01,1000);
def test_gradient_descent_00(self):
X = np.array([[1, 5], [1, 2], [1, 4], [1, 5]])
y = np.array([1, 6, 4, 2])
theta = np.zeros(2)
alpha = 0.01
num_iter = 1000
r_theta = np.array([5.2148, -0.5733])
result = gradient_descent(X, y, theta, alpha, num_iter)
self.assertEqual((round(result, 4), r_theta), 'Result is wrong!')
if __name__ == '__main__':
unittest.main()

The __matmul__ operator # in Python binds more tightly than -. That means you're trying to do matrix multiplication with the operands (alpha / m), which is a scalar, and X.T, which is actually a matrix. See operator precedence.
In the Octave code, (alpha - m) * X' is doing scalar multiplication, not matrix, so if you want that same behavior in Python, use * rather than #. This seems to be because Octave overloads the * operator to perform scalar multiplication if one operand is a scalar, but matrix multiplication if both operands are matrices.

Adding to Adam's answer (which is correct in regards to the error you're getting).
However I wanted to add more generally, this code is meaningless to a reader without some sort of hint (whether programmatically or in the form of a comment) what dimensions the different variables take.
E.g., there is a hint in the code that y is likely to be 2-dimensional, and you're using len to get its size. Just as an example of how this might fail silently, consider this:
>>> y = numpy.array([[1,2,3,4,5]])
>>> len( y )
1
whereas presumably you want
>>> numpy.shape( y )
(1, 5)
or
>>> numpy.size( y )
5
I note in your unit test that you're passing a rank 1 vector instead of rank 2, so it turns out y is 1D instead of 2D, but operates with X which is 2D due to broadcasting. Your code therefore works despite the logic implied, but in the absence of making such things explicit, this is a runtime error waiting to happen.

Imitate ode45 function from MATLAB in Python

I am wondering how to export MATLAB function ode45 to python. According to the documentation is should be as follows:
MATLAB: [t,y]=ode45(#vdp1,[0 20],[2 0]);
Python: import numpy as np
def vdp1(t,y):
dydt= np.array([y[1], (1-y[0]**2)*y[1]-y[0]])
return dydt
import scipy integrate
l=scipy.integrate.ode(vdp1([0,20],[2,0])).set_integrator("dopri5")
The results are completely different, Matlab returns different dimensions than Python.

As #LutzL mentioned, you can use the newer API, solve_ivp.
results = solve_ivp(obj_func, t_span, y0, t_eval = time_series)
If t_eval is not specified, then you won't have one record per one timestamp, which is mostly the cases I assume.
Another side note is that for odeint and often other integrators, the output array is a ndarray of a shape of [len(time), len(states)], however for solve_ivp, the output is a list(length of state vector) of 1-dimension ndarray(which length is equal to t_eval).
So you have to merge it if you want the same order. You can do so by:
Y =results
merged = np.hstack([i.reshape(-1,1) for i in Y.y])
First you need to reshape to make it a [n,1] array, and merge it horizontally.
Hope this helps!

The interface of integrate.ode is not as intuitive as of a simpler method odeint which, however, does not support choosing an ODE integrator. The main difference is that ode does not run a loop for you; if you need a solution at a bunch of points, you have to say at what points, and compute it one point at a time.
import numpy as np
from scipy import integrate
import matplotlib.pyplot as plt
def vdp1(t, y):
return np.array([y[1], (1 - y[0]**2)*y[1] - y[0]])
t0, t1 = 0, 20 # start and end
t = np.linspace(t0, t1, 100) # the points of evaluation of solution
y0 = [2, 0] # initial value
y = np.zeros((len(t), len(y0))) # array for solution
y[0, :] = y0
r = integrate.ode(vdp1).set_integrator("dopri5") # choice of method
r.set_initial_value(y0, t0) # initial values
for i in range(1, t.size):
y[i, :] = r.integrate(t[i]) # get one more value, add it to the array
if not r.successful():
raise RuntimeError("Could not integrate")
plt.plot(t, y)
plt.show()

The function scipy.integrate.solve_ivp uses the method RK45 by default, similar the method used by Matlab's function ODE45 as both use the Dormand-Pierce formulas with fourth-order method accuracy.
vdp1 = #(T,Y) [Y(2); (1 - Y(1)^2) * Y(2) - Y(1)];
[T,Y] = ode45 (vdp1, [0, 20], [2, 0]);
from scipy.integrate import solve_ivp
vdp1 = lambda T,Y: [Y[1], (1 - Y[0]**2) * Y[1] - Y[0]]
sol = solve_ivp (vdp1, [0, 20], [2, 0])
T = sol.t
Y = sol.y
Ordinary Differential Equations (solve_ivp)

Optimizing Python function with Parakeet

I need this function to be optimized as I am trying to make my OpenGL simulation run faster. I want to use Parakeet, but I can't quite understand in what way I would need to modify the code below in order to do so. Can you see what I should do?
def distanceMatrix(self,x,y,z):
" ""Computes distances between all particles and places the result in a matrix such that the ij th matrix entry corresponds to the distance between particle i and j"" "
xtemp = tile(x,(self.N,1))
dx = xtemp - xtemp.T
ytemp = tile(y,(self.N,1))
dy = ytemp - ytemp.T
ztemp = tile(z,(self.N,1))
dz = ztemp - ztemp.T
# Particles 'feel' each other across the periodic boundaries
if self.periodicX:
dx[dx>self.L/2]=dx[dx > self.L/2]-self.L
dx[dx<-self.L/2]=dx[dx < -self.L/2]+self.L
if self.periodicY:
dy[dy>self.L/2]=dy[dy>self.L/2]-self.L
dy[dy<-self.L/2]=dy[dy<-self.L/2]+self.L
if self.periodicZ:
dz[dz>self.L/2]=dz[dz>self.L/2]-self.L
dz[dz<-self.L/2]=dz[dz<-self.L/2]+self.L
# Total Distances
d = sqrt(dx**2+dy**2+dz**2)
# Mark zero entries with negative 1 to avoid divergences
d[d==0] = -1
return d, dx, dy, dz
From what I can tell, Parakeet should be able to use the above function without modifications - it only uses Numpy and math. But, I always get the following error when calling the function from the Parakeet jit wrapper:
AssertionError: Unsupported function: <bound method Particles.distanceMatrix of <particles.Particles instance at 0x04CD8E90>>

Parakeet is still young, its NumPy support is incomplete, and your code touches on several features that don't yet work.
1) You're wrapping a method, while Parakeet so far only knows how to deal with functions. The common workaround is to make a #jit wrapped helper function and have your method call into that with all of the required member data. The reason that methods don't work is that it's non-trivial to assign a meaningful type to 'self'. It's not impossible, but tricky enough that methods won't make their way into Parakeet until lower hanging fruit are plucked. Speaking of low-hanging fruit...
2) Boolean indexing. Not yet implemented but will be in the next release.
3) np.tile: Also doesn't work, will also probably be in the next release. If you want to see which builtins and NumPy library functions will work, take a look at Parakeet's mappings module.
I rewrote your code to be a little friendlier to Parakeet:
#jit
def parakeet_dist(x, y, z, L, periodicX, periodicY, periodicZ):
# perform all-pairs computations more explicitly
# instead of tile + broadcasting
def periodic_diff(x1, x2, periodic):
diff = x1 - x2
if periodic:
if diff > (L / 2): diff -= L
if diff < (-L/2): diff += L
return diff
dx = np.array([[periodic_diff(x1, x2, periodicX) for x1 in x] for x2 in x])
dy = np.array([[periodic_diff(y1, y2, periodicY) for y1 in y] for y2 in y])
dz = np.array([[periodic_diff(z1, z2, periodicZ) for z1 in z] for z2 in z])
d= np.sqrt(dx**2 + dy**2 + dz**2)
# since we can't yet use boolean indexing for masking out zero distances
# have to fall back on explicit loops instead
for i in xrange(len(x)):
for j in xrange(len(x)):
if d[i,j] == 0: d[i,j] = -1
return d, dx, dy, dz
On my machine this runs only ~3x faster than NumPy for N = 2000 (0.39s for NumPy vs. 0.14s for Parakeet). If I rewrite the array traversals to use loops more explicitly then the performance goes up to ~6x faster than NumPy (Parakeet runs in ~0.06s):
#jit
def loopy_dist(x, y, z, L, periodicX, periodicY, periodicZ):
N = len(x)
dx = np.zeros((N,N))
dy = np.zeros( (N,N) )
dz = np.zeros( (N,N) )
d = np.zeros( (N,N) )
def periodic_diff(x1, x2, periodic):
diff = x1 - x2
if periodic:
if diff > (L / 2): diff -= L
if diff < (-L/2): diff += L
return diff
for i in xrange(N):
for j in xrange(N):
dx[i,j] = periodic_diff(x[j], x[i], periodicX)
dy[i,j] = periodic_diff(y[j], y[i], periodicY)
dz[i,j] = periodic_diff(z[j], z[i], periodicZ)
d[i,j] = dx[i,j] ** 2 + dy[i,j] ** 2 + dz[i,j] ** 2
if d[i,j] == 0: d[i,j] = -1
else: d[i,j] = np.sqrt(d[i,j])
return d, dx, dy, dz
With a little creative rewriting you can also get the above code running in Numba, but it only goes ~1.5x faster than NumPy (0.25 seconds). The compile times were Parakeet w/ comprehensions: 1 second, Parakeet w/ loops: .5 seconds, Numba w/ loops: 0.9 seconds.
Hopefully the next few releases will enable more idiomatic use of NumPy library functions, but for now comprehensions or loops are often the way to go.

Artefacts from Riemann sum in scipy.signal.convolve

Short summary: How do I quickly calculate the finite convolution of two arrays?
Problem description
I am trying to obtain the finite convolution of two functions f(x), g(x) defined by
To achieve this, I have taken discrete samples of the functions and turned them into arrays of length steps:
xarray = [x * i / steps for i in range(steps)]
farray = [f(x) for x in xarray]
garray = [g(x) for x in xarray]
I then tried to calculate the convolution using the scipy.signal.convolve function. This function gives the same results as the algorithm conv suggested here. However, the results differ considerably from analytical solutions. Modifying the algorithm conv to use the trapezoidal rule gives the desired results.
To illustrate this, I let
f(x) = exp(-x)
g(x) = 2 * exp(-2 * x)
the results are:
Here Riemann represents a simple Riemann sum, trapezoidal is a modified version of the Riemann algorithm to use the trapezoidal rule, scipy.signal.convolve is the scipy function and analytical is the analytical convolution.
Now let g(x) = x^2 * exp(-x) and the results become:
Here 'ratio' is the ratio of the values obtained from scipy to the analytical values. The above demonstrates that the problem cannot be solved by renormalising the integral.
The question
Is it possible to use the speed of scipy but retain the better results of a trapezoidal rule or do I have to write a C extension to achieve the desired results?
An example
Just copy and paste the code below to see the problem I am encountering. The two results can be brought to closer agreement by increasing the steps variable. I believe that the problem is due to artefacts from right hand Riemann sums because the integral is overestimated when it is increasing and approaches the analytical solution again as it is decreasing.
EDIT: I have now included the original algorithm 2 as a comparison which gives the same results as the scipy.signal.convolve function.
import numpy as np
import scipy.signal as signal
import matplotlib.pyplot as plt
import math
def convolveoriginal(x, y):
'''
The original algorithm from http://www.physics.rutgers.edu/~masud/computing/WPark_recipes_in_python.html.
'''
P, Q, N = len(x), len(y), len(x) + len(y) - 1
z = []
for k in range(N):
t, lower, upper = 0, max(0, k - (Q - 1)), min(P - 1, k)
for i in range(lower, upper + 1):
t = t + x[i] * y[k - i]
z.append(t)
return np.array(z) #Modified to include conversion to numpy array
def convolve(y1, y2, dx = None):
'''
Compute the finite convolution of two signals of equal length.
#param y1: First signal.
#param y2: Second signal.
#param dx: [optional] Integration step width.
#note: Based on the algorithm at http://www.physics.rutgers.edu/~masud/computing/WPark_recipes_in_python.html.
'''
P = len(y1) #Determine the length of the signal
z = [] #Create a list of convolution values
for k in range(P):
t = 0
lower = max(0, k - (P - 1))
upper = min(P - 1, k)
for i in range(lower, upper):
t += (y1[i] * y2[k - i] + y1[i + 1] * y2[k - (i + 1)]) / 2
z.append(t)
z = np.array(z) #Convert to a numpy array
if dx != None: #Is a step width specified?
z *= dx
return z
steps = 50 #Number of integration steps
maxtime = 5 #Maximum time
dt = float(maxtime) / steps #Obtain the width of a time step
time = [dt * i for i in range (steps)] #Create an array of times
exp1 = [math.exp(-t) for t in time] #Create an array of function values
exp2 = [2 * math.exp(-2 * t) for t in time]
#Calculate the analytical expression
analytical = [2 * math.exp(-2 * t) * (-1 + math.exp(t)) for t in time]
#Calculate the trapezoidal convolution
trapezoidal = convolve(exp1, exp2, dt)
#Calculate the scipy convolution
sci = signal.convolve(exp1, exp2, mode = 'full')
#Slice the first half to obtain the causal convolution and multiply by dt
#to account for the step width
sci = sci[0:steps] * dt
#Calculate the convolution using the original Riemann sum algorithm
riemann = convolveoriginal(exp1, exp2)
riemann = riemann[0:steps] * dt
#Plot
plt.plot(time, analytical, label = 'analytical')
plt.plot(time, trapezoidal, 'o', label = 'trapezoidal')
plt.plot(time, riemann, 'o', label = 'Riemann')
plt.plot(time, sci, '.', label = 'scipy.signal.convolve')
plt.legend()
plt.show()
Thank you for your time!

or, for those who prefer numpy to C. It will be slower than the C implementation, but it's just a few lines.
>>> t = np.linspace(0, maxtime-dt, 50)
>>> fx = np.exp(-np.array(t))
>>> gx = 2*np.exp(-2*np.array(t))
>>> analytical = 2 * np.exp(-2 * t) * (-1 + np.exp(t))
this looks like trapezoidal in this case (but I didn't check the math)
>>> s2a = signal.convolve(fx[1:], gx, 'full')*dt
>>> s2b = signal.convolve(fx, gx[1:], 'full')*dt
>>> s = (s2a+s2b)/2
>>> s[:10]
array([ 0.17235682, 0.29706872, 0.38433313, 0.44235042, 0.47770012,
0.49564748, 0.50039326, 0.49527721, 0.48294359, 0.46547582])
>>> analytical[:10]
array([ 0. , 0.17221333, 0.29682141, 0.38401317, 0.44198216,
0.47730244, 0.49523485, 0.49997668, 0.49486489, 0.48254154])
largest absolute error:
>>> np.max(np.abs(s[:len(analytical)-1] - analytical[1:]))
0.00041657780840698155
>>> np.argmax(np.abs(s[:len(analytical)-1] - analytical[1:]))
6

Short answer: Write it in C!
Long answer
Using the cookbook about numpy arrays I rewrote the trapezoidal convolution method in C. In order to use the C code one requires three files (https://gist.github.com/1626919)
The C code (performancemodule.c).
The setup file to build the code and make it callable from python (performancemodulesetup.py).
The python file that makes use of the C extension (performancetest.py)
The code should run upon downloading by doing the following
Adjust the include path in performancemodule.c.
Run the following
python performancemodulesetup.py build
python performancetest.py
You may have to copy the library file performancemodule.so or performancemodule.dll into the same directory as performancetest.py.
Results and performance
The results agree neatly with one another as shown below:
The performance of the C method is even better than scipy's convolve method. Running 10k convolutions with array length 50 requires
convolve (seconds, microseconds) 81 349969
scipy.signal.convolve (seconds, microseconds) 1 962599
convolve in C (seconds, microseconds) 0 87024
Thus, the C implementation is about 1000 times faster than the python implementation and a bit more than 20 times as fast as the scipy implementation (admittedly, the scipy implementation is more versatile).
EDIT: This does not solve the original question exactly but is sufficient for my purposes.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Performance: Matlab vs Python - python

Matlab uses commerical MKL library. If you use free python distribution, check whether you have MKL or other high performance blas library used in python or it is the default ones, which could be much slower.

I got ~5x speed improvement over the meshgrid solution using only broadcasting: def exampleKernelD(M, x, N, y): return np.sqrt((x[:,1:] - y[:,1:].T) 2 + (x[:,:1] - y[:,:1].T) 2)

Related

Numba parallel code slower than its sequential counterpart

Machine Learning - implementing a Gradient Descent in Python from Octave code

Imitate ode45 function from MATLAB in Python

Optimizing Python function with Parakeet

Artefacts from Riemann sum in scipy.signal.convolve

Categories

Resources

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Performance: Matlab vs Python - python

Matlab uses commerical MKL library. If you use free python distribution, check whether you have MKL or other high performance blas library used in python or it is the default ones, which could be much slower.

I got ~5x speed improvement over the meshgrid solution using only broadcasting: def exampleKernelD(M, x, N, y): return np.sqrt((x[:,1:] - y[:,1:].T) ** 2 + (x[:,:1] - y[:,:1].T) ** 2)

Related

Numba parallel code slower than its sequential counterpart

Machine Learning - implementing a Gradient Descent in Python from Octave code

Imitate ode45 function from MATLAB in Python

Optimizing Python function with Parakeet

Artefacts from Riemann sum in scipy.signal.convolve

Categories

Resources

I got ~5x speed improvement over the meshgrid solution using only broadcasting: def exampleKernelD(M, x, N, y): return np.sqrt((x[:,1:] - y[:,1:].T) 2 + (x[:,:1] - y[:,:1].T) 2)