How to speed up a Python function (numpy) - python

I have a grayscale image (a matrix) and this function:
import numpy as np

sz1, sz2 = image_layer.shape
x_end, y_end = (sz1 - N1 + 1), (sz2 - N1 + 1)
Av = np.zeros((N1 * N1, x_end * y_end), dtype=complex)
index = 0
for x in range(x_end):
    for y in range(y_end):
        # FFT of the N1 x N1 window anchored at (x, y), flattened into one column
        Av[:, index] = np.fft.fft2(image_layer[x: x + N1, y: y + N1]).flatten()
        index += 1
I want to speed up this function.
I tried different methods:
multithreading
multiprocessing
Cython
Numba
I split the x and y values into blocks, but this didn't help.
I read about numpy.vectorize, but couldn't get it to work.
P.S. 100x100 matrix time: 0.23435211181640625 ms
P.P.S. 1000x1000 matrix time: 28.731552600860596 ms
I think changing the algorithm would help speed up the function.
Edit:
I want to take an 8x8 block around every element of the matrix and calculate its fft2.
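One direction that may help (a sketch, not from the original question): NumPy's sliding_window_view (available in NumPy 1.20+) builds all N1 x N1 windows as a zero-copy view, and np.fft.fft2 accepts an axes argument, so every window FFT can be computed in a single batched call. Note that this allocates the full complex result for all windows at once, so memory use grows quickly for large images.
import numpy as np

def sliding_fft2(image_layer, N1=8):
    # All N1 x N1 windows as a view of shape (x_end, y_end, N1, N1)
    windows = np.lib.stride_tricks.sliding_window_view(image_layer, (N1, N1))
    # One batched FFT over the last two axes computes every window's fft2
    spectra = np.fft.fft2(windows, axes=(-2, -1))
    x_end, y_end = windows.shape[:2]
    # Match the original layout: one flattened window spectrum per column
    return spectra.reshape(x_end * y_end, N1 * N1).T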

Related

Vectorizing python code to numpy

I have the following code snippet (for Hough circle transform):
for r in range(1, 11):
    for t in range(0, 360):
        trad = np.deg2rad(t)
        b = x - r * np.cos(trad)
        a = y - r * np.sin(trad)
        b = np.floor(b).astype('int')
        a = np.floor(a).astype('int')
        A[a, b, r-1] += 1
Where A is a 3D array of shape (height, width, 10), and
height and width represent the size of a given image.
My goal is to convert the snippet exclusively to numpy code.
My attempt is this:
arr_r = np.arange(1, 11)
arr_t = np.deg2rad(np.arange(0, 360))
arr_cos_t = np.cos(arr_t)
arr_sin_t = np.sin(arr_t)
arr_rcos = arr_r[..., np.newaxis] * arr_cos_t[np.newaxis, ...]
arr_rsin = arr_r[..., np.newaxis] * arr_sin_t[np.newaxis, ...]
arr_a = (y - arr_rsin).flatten().astype('int')
arr_b = (x - arr_rcos).flatten().astype('int')
Where x and y are two scalar values.
I am having trouble converting the increment part: A[a,b,r] += 1. My thought was this: A[a,b,r] counts the number of occurrences of the triple (a,b,r), so one idea was to use a Cartesian product (but the arrays are too large).
Any tips or tricks I can use?
Thank you very much!
Edit: after filling A, I need (a,b,r) as argmax(A). The tuple (a,b,r) identifies a circle, and its value in A represents the confidence value. So I want the tuple with the highest value in A. This is part of the voting step of the Hough circle transform: finding circle parameters with unknown radius.
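For that argmax step, a minimal sketch (assuming A has already been filled as described above):
import numpy as np

# Recover the (a, b, r) of the strongest vote; radii were stored at index r - 1
a_best, b_best, r_index = np.unravel_index(np.argmax(A), A.shape)
r_best = r_index + 1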
Method #1
Here's one way leveraging broadcasting to get the counts and update A (this assumes the a and b values computed in the intermediate steps are positive ones) -
d0,d1,d2 = A.shape
arr_r = np.arange(1, 11)
arr_t = np.deg2rad(np.arange(0, 360))
arr_b = np.floor(x - arr_r[:,None] * np.cos(arr_t)).astype('int')
arr_a = np.floor(y - arr_r[:,None] * np.sin(arr_t)).astype('int')
idx = (arr_a*d1*d2) + (arr_b * d2) + (arr_r-1)[:,None]
A.flat[:idx.max()+1] += np.bincount(idx.ravel())
# OR A.flat += np.bincount(idx.ravel(), minlength=A.size)
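A quick sanity check of this bincount-based update against the original double loop, on a small made-up example (the sizes and the (x, y) point below are hypothetical):
import numpy as np

height, width = 50, 60
x, y = 20.3, 30.7
A_loop = np.zeros((height, width, 10), dtype=int)
for r in range(1, 11):
    for t in range(0, 360):
        trad = np.deg2rad(t)
        b = int(np.floor(x - r * np.cos(trad)))
        a = int(np.floor(y - r * np.sin(trad)))
        A_loop[a, b, r - 1] += 1

A_vec = np.zeros_like(A_loop)
d0, d1, d2 = A_vec.shape
arr_r = np.arange(1, 11)
arr_t = np.deg2rad(np.arange(0, 360))
arr_b = np.floor(x - arr_r[:, None] * np.cos(arr_t)).astype(int)
arr_a = np.floor(y - arr_r[:, None] * np.sin(arr_t)).astype(int)
idx = arr_a * d1 * d2 + arr_b * d2 + (arr_r - 1)[:, None]
A_vec.flat += np.bincount(idx.ravel(), minlength=A_vec.size)

print(np.array_equal(A_loop, A_vec))  # expect True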
Method #2
Alternatively, we could avoid bincount and replace the last step of Method #1, like so -
idx.ravel().sort()
idx.shape = (-1)
grp_idx = np.flatnonzero(np.concatenate(([True], idx[1:]!=idx[:-1],[True])))
A.flat[idx[grp_idx[:-1]]] += np.diff(grp_idx)
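The same grouping can also be written with np.unique, which may read more clearly (a sketch, equivalent to the sort/diff trick above):
import numpy as np

# Count how many times each flat index occurs, then add all counts at once
uniq, counts = np.unique(idx.ravel(), return_counts=True)
A.flat[uniq] += counts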
Improvement with numexpr
We could also leverage the numexpr module for faster sine and cosine computations, like so -
import numexpr as ne
arr_r2D = arr_r[:,None]
arr_b = ne.evaluate('floor(x - arr_r2D * cos(arr_t))').astype(int)
arr_a = ne.evaluate('floor(y - arr_r2D * sin(arr_t))').astype(int)
np.add(np.array([arr_a, arr_b, 10]), 1)

Calculate gradient only in a masked area

I have a very large array with only a few small areas of interest. I need to calculate the gradient of this array, but for performance reasons I need this calculation to be restricted to these areas of interest.
I can't do something like this:
phi_grad0[mask] = np.gradient(phi[mask], axis=0)
Because of how fancy indexing works, phi[mask] just becomes a 1D array of the masked pixels, losing spatial information and making the gradient calculation worthless.
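A small illustration of that flattening (made-up 3x3 example):
import numpy as np

phi_small = np.arange(9).reshape(3, 3)
mask_small = phi_small % 2 == 0
print(phi_small[mask_small].shape)  # (5,): a flat selection, the 2D neighbourhood structure is gone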
np.gradient does handle np.ma.masked_arrays, but the performance is an order of magnitude worse:
import numpy as np
from timeit_context import timeit_context

phi = np.random.randint(low=-100, high=100, size=[100, 100])
phi_mask = np.random.randint(low=0, high=2, size=phi.shape, dtype=bool)

with timeit_context('full array'):
    for i2 in range(1000):
        phi_masked_grad1 = np.gradient(phi)

with timeit_context('masked_array'):
    phi_masked = np.ma.masked_array(phi, ~phi_mask)
    for i1 in range(1000):
        phi_masked_grad2 = np.gradient(phi_masked)
This produces the output below:
[full array] finished in 143 ms
[masked_array] finished in 1961 ms
I think it's because operations run on masked_arrays are not vectorized, but I'm not sure.
Is there any way of restricting np.gradient so as to achieve better performance?
This timeit_context is a handy timer that works like this, if anyone is interested:
from contextlib import contextmanager
import time

@contextmanager
def timeit_context(name):
    """
    Use it to time a specific code snippet
    Usage: 'with timeit_context('Testcase1'):'
    :param name: Name of the context
    """
    start_time = time.time()
    yield
    elapsed_time = time.time() - start_time
    print('[{}] finished in {} ms'.format(name, int(elapsed_time * 1000)))
Not exactly an answer, but this is what I've managed to patch together for my situation, which works pretty well:
I get 1D indices of the pixels where the condition is true (in this case the condition being < 5 for example):
def get_indices_1d(image, band_thickness):
    return np.where(image.reshape(-1) < 5)[0]
This gives me a 1D array with those indices.
Then I manually calculate the gradient at those positions, in different ways:
def gradient_at_points1(image, indices_1d):
    width = image.shape[1]
    size = image.size
    # Using this instead of ravel() is more likely to produce a view instead of a copy
    raveled_image = image.reshape(-1)
    res_x = 0.5 * (raveled_image[(indices_1d + 1) % size] - raveled_image[(indices_1d - 1) % size])
    res_y = 0.5 * (raveled_image[(indices_1d + width) % size] - raveled_image[(indices_1d - width) % size])
    return [res_y, res_x]

def gradient_at_points2(image, indices_1d):
    indices_2d = np.unravel_index(indices_1d, image.shape)
    # Even without doing the actual deltas this is already slower, and we'd have to check boundary conditions, etc.
    res_x = 0.5 * (image[indices_2d] - image[indices_2d])
    res_y = 0.5 * (image[indices_2d] - image[indices_2d])
    return [res_y, res_x]

def gradient_at_points3(image, indices_1d):
    width = image.shape[1]
    raveled_image = image.reshape(-1)
    res_x = 0.5 * (raveled_image.take(indices_1d + 1, mode='wrap') - raveled_image.take(indices_1d - 1, mode='wrap'))
    res_y = 0.5 * (raveled_image.take(indices_1d + width, mode='wrap') - raveled_image.take(indices_1d - width, mode='wrap'))
    return [res_y, res_x]

def gradient_at_points4(image, indices_1d):
    width = image.shape[1]
    raveled_image = image.ravel()
    res_x = 0.5 * (raveled_image.take(indices_1d + 1, mode='wrap') - raveled_image.take(indices_1d - 1, mode='wrap'))
    res_y = 0.5 * (raveled_image.take(indices_1d + width, mode='wrap') - raveled_image.take(indices_1d - width, mode='wrap'))
    return [res_y, res_x]
My test arrays look like this:
a = np.random.randint(-10, 10, size=[512, 512])
# Force edges to not pass the condition
a[:, 0] = 99
a[:, -1] = 99
a[0, :] = 99
a[-1, :] = 99
indices = get_indices_1d(a, 5)
mask = a < 5
Then I can run these tests:
with timeit_context('full gradient'):
    for i in range(100):
        grad1 = np.gradient(a)

with timeit_context('With masked_array'):
    for im in range(100):
        ma = np.ma.masked_array(a, mask)
        grad6 = np.gradient(ma)

with timeit_context('gradient at points 1'):
    for i1 in range(100):
        grad2 = gradient_at_points1(image=a, indices_1d=indices)

with timeit_context('gradient at points 2'):
    for i2 in range(100):
        grad3 = gradient_at_points2(image=a, indices_1d=indices)

with timeit_context('gradient at points 3'):
    for i3 in range(100):
        grad4 = gradient_at_points3(image=a, indices_1d=indices)

with timeit_context('gradient at points 4'):
    for i4 in range(100):
        grad5 = gradient_at_points4(image=a, indices_1d=indices)
Which give the following results:
[full gradient] finished in 576 ms
[With masked_array] finished in 3455 ms
[gradient at points 1] finished in 421 ms
[gradient at points 2] finished in 451 ms
[gradient at points 3] finished in 112 ms
[gradient at points 4] finished in 102 ms
As you can see, method 4 is by far the best (I don't care much about how much memory it's consuming, however).
This probably only holds because my 2D array is relatively small (512x512). Maybe with much larger arrays this won't be true.
Another caveat is that ndarray.take(indices, mode='wrap') will do some weird stuff around the image edges (one row will 'loop' into the next, etc) to maintain good performance, so if edges are ever important for your application you might want to pad the input array with 1 pixel around the edges.
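A minimal sketch of that padding idea, assuming the same helper functions and test array a as above; the indices are computed on the original image and then shifted into the padded layout, so take(..., mode='wrap') never actually needs to wrap:
import numpy as np

a_padded = np.pad(a, pad_width=1, mode='edge')       # replicate the border by one pixel
rows, cols = np.nonzero(a < 5)                       # condition evaluated on the original image
indices_padded = (rows + 1) * a_padded.shape[1] + (cols + 1)
grad_y, grad_x = gradient_at_points4(image=a_padded, indices_1d=indices_padded)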
It's still super interesting how slow masked_arrays are. Pulling the constructor ma = np.ma.masked_array(a, mask) out of the loop doesn't affect the time, since the masked_array itself just keeps references to the array and its mask.

Numpy speed up nested loop with fancy indexing

I have implemented an algorithm that calculates a quality assessment of disparity maps based on total variation.
I'm relatively new to Python, but I have already read numerous threads on speeding up numpy code: views vs. fancy indexing, using Cython, vectorization of nested loops, etc. I achieved some small speed-ups, but altogether I ended up with messier and messier code without a proper speed-up.
I wonder if someone can give me a hint as to whether there is a clean and easy way to speed up this 2D loop.
TV is a 2D array with ~15k x 15k elements.
footprint_ix and footprint_iy are two lists of arrays which contain the index offsets to the neighbor pixels of a pixel (x, y), arranged in a ring-shaped manner. With m = 1 the 8 neighbor pixels are selected, with m = 2 the next 16, and so on.
The algorithm sums up the neighbor pixels of (x, y) and increases m as long as a threshold TAU is not exceeded.
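A small illustration of those footprints: for m = 1 the offsets cover the 8-pixel ring around the center.
import numpy as np

m = 1
fp = np.ones((2 * m + 1, 2 * m + 1), dtype=int)
fp[1:-1, 1:-1] = 0
i, j = np.nonzero(fp)
print(i - m)  # [-1 -1 -1  0  0  1  1  1]
print(j - m)  # [-1  0  1 -1  1 -1  0  1]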
The best solution I have come up with so far uses row-wise multiprocessing.
m_classes = 21

# create footprints
footprint_ix = []
footprint_iy = []
for m in range(1, m_classes):
    fp = np.ones((2 * m + 1, 2 * m + 1), dtype=int)
    fp[1:-1, 1:-1] = 0
    i, j = np.nonzero(fp)
    i = i - m
    j = j - m
    footprint_ix.append(i)
    footprint_iy.append(j)

# rows, cols, disp, tv, tv_classes and TAU come from the surrounding program
for x in xrange(0, rows):
    for y in xrange(0, cols):
        if disp[x, y] == np.inf:
            continue
        else:
            tv_m = 0
            for m_i in range(0, m_classes - 1):
                m = m_i + 1
                try:
                    tv_m += np.sum(tv[footprint_ix[m_i] + x, footprint_iy[m_i] + y]) / (8 * m)
                except IndexError:
                    tv_m = np.inf
                if tv_m >= TAU:
                    tv_classes[x, y] = m
                    break
            if m == m_classes - 1:
                tv_classes[x, y] = m

fast method with numpy for 2D Heat equation

I'm looking for a method to solve the 2D heat equation with Python. I have already implemented the finite difference method, but it is slow (100,000 time steps take 30 minutes). The idea is to end up with code where I can write:
for t in TIME:
    DeltaU = f(U)
    U = U + DeltaU*DeltaT
    save(U)
How can I do that?
In the first version of my code I used the 2D finite difference method; my grid is 5000x250 (x, y). Now I would like to decrease the computation time, and the idea is to find
DeltaU = f(u)
where U is the heat function. For the implementation I used this source http://www.timteatro.net/2010/10/29/performance-python-solving-the-2d-diffusion-equation-with-numpy/ for the 2D case, but the runtime is too expensive for my needs. Are there methods to do this faster?
Maybe I need to work with the matrix

A = (1/dx^2) * [  2 -1  0  0 ...  0
                 -1  2 -1  0 ...  0
                  0 -1  2 -1 ...  0
                  ...
                  0  0 ...  -1  2 ]

but how do I set this up for the 2D problem? And how do I insert the boundary conditions into A?
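A standard construction that may help here (a sketch, not from this thread): the 1D second-difference matrix above extends to the 2D Laplacian as a Kronecker sum, for example with scipy.sparse. The sizes below are only examples, chosen to match the finite-difference code that follows.
import numpy as np
import scipy.sparse as sp

def second_difference(n, h):
    # Tridiagonal matrix with 2 on the diagonal and -1 off-diagonal, scaled by 1/h^2
    return sp.diags([-np.ones(n - 1), 2*np.ones(n), -np.ones(n - 1)], [-1, 0, 1]) / h**2

Nx, Ny, dx, dy = 100, 50, 1/100, 1/50   # example grid
Ax = second_difference(Nx, dx)          # acts along x
Ay = second_difference(Ny, dy)          # acts along y
# For u of shape (Nx, Ny) flattened in C order, A2 @ u.ravel() approximates -(u_xx + u_yy)
A2 = sp.kron(Ax, sp.identity(Ny)) + sp.kron(sp.identity(Nx), Ay)
# Boundary conditions are imposed by editing the rows of A2 that correspond to boundary grid points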
This is the code for the finite difference that I used:
import numpy as np

Lx = 5000   # physical length of the x vector in microns
Ly = 250    # physical length of the y vector in microns
Nx = 100    # number of mesh points along the x direction
Ny = 50     # number of mesh points along the y direction
a = 0.001   # diffusion coefficient
dx = 1/Nx
dy = 1/Ny
dt = (dx**2*dy**2)/(2*a*(dx**2 + dy**2))  # it is 0.04
x = np.linspace(0.1, Lx, Nx)[np.newaxis]  # vector to create the mesh
y = np.linspace(0.1, Ly, Ny)[np.newaxis]  # vector to create the mesh
I = np.sqrt(x*y.T)                        # initial data for the heat equation
u = np.ones((Nx, Ny))                     # u is the matrix of the heat function
steps = 100000
filename1 = 'data_{:08d}.txt'
for m in range(0, steps):
    du = np.zeros((Nx, Ny))
    for i in range(1, Nx-1):
        for j in range(1, Ny-1):
            dux = (u[i+1, j] - 2*u[i, j] + u[i-1, j]) / dx**2
            duy = (u[i, j+1] - 2*u[i, j] + u[i, j-1]) / dy**2
            du[i, j] = dt*a*(dux + duy)
    # Boundary Conditions
    t1 = (u[:, 0] + u[:, 1])/2
    u[:, 0] = t1
    u[:, 1] = t1
    t2 = (u[0, :] + u[1, :])/2
    u[0, :] = t2
    u[1, :] = t2
    t3 = (u[-1, :] + u[-2, :])/2
    u[-1, :] = t3
    u[-2, :] = t3
    u[:, -1] = 1
    if m % 100 == 0:
        np.savetxt(filename1.format(m), u, delimiter='\t')
Computing 100,000 steps takes about 30 minutes of runtime. I would like to optimize this code (with the idea presented in the initial lines) to get the runtime down to about 5-10 minutes or less. How can I do it?
There are some simple but tremendous improvements possible.
Just by introducing Dxx, Dyy = 1/(dx*dx), 1/(dy*dy) the runtime drops 25%. By using slices and avoiding for-loops, the code is now 400 times faster.
import numpy as np
def f_for(u):
    for m in range(0, steps):
        du = np.zeros_like(u)
        for i in range(1, Nx-1):
            for j in range(1, Ny-1):
                dux = (u[i+1, j] - 2*u[i, j] + u[i-1, j]) / dx**2
                duy = (u[i, j+1] - 2*u[i, j] + u[i, j-1]) / dy**2
                du[i, j] = dt*a*(dux + duy)
    return du

def f_slice(u):
    du = np.zeros_like(u)
    Dxx, Dyy = 1/dx**2, 1/dy**2
    i = slice(1, Nx-1)
    iw = slice(0, Nx-2)
    ie = slice(2, Nx)
    j = slice(1, Ny-1)
    js = slice(0, Ny-2)
    jn = slice(2, Ny)
    for m in range(0, steps):
        dux = Dxx * (u[ie, j] - 2*u[i, j] + u[iw, j])
        duy = Dyy * (u[i, jn] - 2*u[i, j] + u[i, js])
        du[i, j] = dt*a*(dux + duy)
    return du
Nx = 100 # number of mesh points in the x-direction
Ny = 50 # number of mesh points in the y-direction
a = 0.001 # diffusion coefficent
dx = 1/Nx
dy = 1/Ny
dt = (dx**2*dy**2)/(2*a*(dx**2 + dy**2))
steps = 10000
U = np.ones((Nx, Ny))
%timeit f_for(U)
%timeit f_slice(U)
Have you considered parallelizing your code or using GPU acceleration?
It would help if you ran your code through the Python profiler (cProfile) so that you can figure out where your runtime bottleneck is. I'm assuming it's in solving the matrix equation you get, which could be sped up by the methods listed above.
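Minimal cProfile usage for that, as a sketch (the profiled call and the stats filename are just examples):
import cProfile
import pstats

cProfile.run('f_slice(U)', 'heat_profile')   # profile one call and save the stats to a file
pstats.Stats('heat_profile').sort_stats('cumulative').print_stats(10)   # top 10 by cumulative time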
I might be wrong, but in the loop you create for the time steps, "for m in range(steps)", the line right below it is
du = np.zeros(...)
I'm not an expert in Python, but this may be creating a new matrix of zeros on every step, in this case 100k times.

Fully vectorise numpy polyfit

Overview
I am running into performance issues with polyfit because it doesn't appear to accept broadcast arrays. I am aware from this post that the dependent data y can be multidimensional if you use numpy.polynomial.polynomial.polyfit. However, the x dimension cannot be multidimensional. Is there any way around this?
Motivation
I need to compute the rate of change of some data. To match an experiment I want to use the following method: take data y and x, fit a polynomial to short sections of the data, then use the fitted coefficient as an estimate of the rate of change.
Illustration
import numpy as np
import matplotlib.pyplot as plt
n = 100
x = np.linspace(0, 10, n)
y = np.sin(x)
window_length = 10
ydot = [np.polyfit(x[j:j+window_length], y[j:j+window_length], 1)[0]
        for j in range(n - window_length)]
x_mids = [x[j+window_length//2] for j in range(n - window_length)]
plt.plot(x, y)
plt.plot(x_mids, ydot)
plt.show()
The blue line is the original data (a sine curve), while the green is the first differential (a cosine curve).
The problem
To vectorise this I did the following:
window_length = 10
vert_idx_list = np.arange(0, len(x) - window_length, 1)
hori_idx_list = np.arange(window_length)
A, B = np.meshgrid(hori_idx_list, vert_idx_list)
idx_array = A + B
x_array = x[idx_array]
y_array = y[idx_array]
This broadcasts the two 1D vectors to 2D vectors of shape (n-window_length, window_length). Now I was hoping that polyfit would have an axis argument so I could parallelise the calculation, but no such luck.
Does anyone have a suggestion for how to do this? I am open to alternatives.
The way polyfit works is by solving a least-squares problem of the form:
y = [X].a
where y are your dependent coordinates, [X] is the Vandermonde matrix of the corresponding independent coordinates, and a is the vector of fitted coefficients.
In your case you are always computing a 1st degree polynomial approximation, and are actually only interested in the coefficient of the 1st degree term. This has a well-known closed-form solution you can find in any statistics book, or produce yourself by creating a 2x2 linear system of equations, premultiplying both sides of the above equation by the transpose of [X]. This all adds up to the value you want to calculate being:
>>> n = 10
>>> x = np.random.random(n)
>>> y = np.random.random(n)
>>> np.polyfit(x, y, 1)[0]
-0.29207474654700277
>>> (n*(x*y).sum() - x.sum()*y.sum()) / (n*(x*x).sum() - x.sum()*x.sum())
-0.29207474654700216
On top of that you have a sliding window running over your data, so you can use something akin to a 1D summed area table as follows:
def sliding_fitted_slope(x, y, win):
    x = np.concatenate(([0], x))
    y = np.concatenate(([0], y))
    Sx = np.cumsum(x)
    Sy = np.cumsum(y)
    Sx2 = np.cumsum(x*x)
    Sxy = np.cumsum(x*y)
    Sx = Sx[win:] - Sx[:-win]
    Sy = Sy[win:] - Sy[:-win]
    Sx2 = Sx2[win:] - Sx2[:-win]
    Sxy = Sxy[win:] - Sxy[:-win]
    return (win*Sxy - Sx*Sy) / (win*Sx2 - Sx*Sx)
With this code you can easily check that (notice I extended the range by 1):
>>> np.allclose(sliding_fitted_slope(x, y, window_length),
...             [np.polyfit(x[j:j+window_length], y[j:j+window_length], 1)[0]
...              for j in range(n - window_length + 1)])
True
And:
%timeit sliding_fitted_slope(x, y, window_length)
10000 loops, best of 3: 34.5 us per loop
%%timeit
[np.polyfit(x[j:j+window_length], y[j:j+window_length], 1)[0]
for j in range(n - window_length + 1)]
100 loops, best of 3: 10.1 ms per loop
So it is about 300x faster for your sample data.
Sorry for answering my own question, but after 20 more minutes of trying to get to grips with it, I have the following solution:
ydot = np.polynomial.polynomial.polyfit(x_array[0], y_array.T, 1)[-1]
One confusing part is that np.polyfit returns the coefficients with the highest power first. In np.polynomial.polynomial.polyfit the highest power is last (hence the -1 instead of 0 index).
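A quick illustration of that ordering difference (the data here is made up):
import numpy as np

xs = np.array([0.0, 1.0, 2.0])
ys = 2*xs + 1
print(np.polyfit(xs, ys, 1))                         # highest power first: ~[2, 1] (slope, intercept)
print(np.polynomial.polynomial.polyfit(xs, ys, 1))   # lowest power first: ~[1, 2] (intercept, slope)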
Another confusion is that we use only the first slice of x (x_array[0]). I think that this is okay because it is not the absolute values of the independent vector x that are used, but the difference between them. Or alternatively it is like changing the reference x value.
If there is a better way to do this I am still happy to hear about it!
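And a quick check of the reasoning about reusing x_array[0] (a sketch): for equally spaced x, shifting a window's x values changes only the intercept, not the fitted slope.
import numpy as np

xw = np.linspace(0, 1, 10)
yw = np.random.random(10)
s1 = np.polyfit(xw, yw, 1)[0]
s2 = np.polyfit(xw + 5.0, yw, 1)[0]  # same spacing, shifted origin
print(np.isclose(s1, s2))            # True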
Using an alternative method for calculating the rate of change may be the solution for both a speed and an accuracy increase.
n = 1000
x = np.linspace(0, 10, n)
y = np.sin(x)
def timingPolyfit(x, y):
    window_length = 10
    vert_idx_list = np.arange(0, len(x) - window_length, 1)
    hori_idx_list = np.arange(window_length)
    A, B = np.meshgrid(hori_idx_list, vert_idx_list)
    idx_array = A + B
    x_array = x[idx_array]
    y_array = y[idx_array]
    ydot = np.polynomial.polynomial.polyfit(x_array[0], y_array.T, 1)[-1]
    x_mids = [x[j+window_length//2] for j in range(n - window_length)]
    return ydot, x_mids

def timingSimple(x, y):
    dy = (y[2:] - y[:-2])/2
    dx = x[1] - x[0]
    dydx = dy/dx
    return dydx, x[1:-1]
y1, x1 = timingPolyfit(x,y)
y2, x2 = timingSimple(x,y)
polyfitError = np.abs(y1 - np.cos(x1))
simpleError = np.abs(y2 - np.cos(x2))
print("polyfit average error: {:.2e}".format(np.average(polyfitError)))
print("simple average error: {:.2e}".format(np.average(simpleError)))
result = %timeit -o timingPolyfit(x,y)
result2 = %timeit -o timingSimple(x,y)
print("simple is {0} times faster".format(result.best / result2.best))
polyfit average error: 3.09e-03
simple average error: 1.09e-05
100 loops, best of 3: 3.2 ms per loop
100000 loops, best of 3: 9.46 µs per loop
simple is 337.995634151131 times faster
(The original answer also showed plots of the relative error and the results here.)
