Fastest algorithm for computing 3-D curl - python

I'm trying to write a section of code that computes the curl of a vector field numerically to second order with periodic boundary conditions. However, the algorithm I made is very slow and I'm wondering if anyone knows of any alternative algorithms.
To give more specific context: I'm using a 3xAxBxC numpy array as my vector field, where the first axis refers to the Cartesian direction (x, y, z) and A, B, C are the number of bins in each Cartesian direction (i.e. the resolution). So, for example, I might have a vector field F = np.zeros((3,64,64,64)), where Fx = F[0] is a 64x64x64 Cartesian lattice in its own right. So far, my solution has been to use the 3-point centered difference stencil to calculate the derivatives, with a nested loop iterating over all dimensions and modular arithmetic enforcing the periodic boundary conditions (see below for an example). However, as the resolution (the size of A, B, C) increases, this begins to take a long time (upwards of 2 minutes, which adds up if I do it several hundred times during my simulation - this is just one small part of a larger algorithm). I was wondering if anyone knows of an alternative method for doing this?
import numpy as np

F = np.array([np.ones([128,128,128]), 2*np.ones([128,128,128]),
              3*np.ones([128,128,128])])
VxF = np.array([np.zeros([128,128,128]), np.zeros([128,128,128]),
                np.zeros([128,128,128])])

for i in range(0,128):
    for j in range(0,128):
        for k in range(0,128):
            VxF[0][i,j,k] = 0.5*((F[2][i,(j+1)%128,k]-F[2][i,j-1,k])
                                 -(F[1][i,j,(k+1)%128]-F[1][i,j,k-1]))
            VxF[1][i,j,k] = 0.5*((F[0][i,j,(k+1)%128]-F[0][i,j,k-1])
                                 -(F[2][(i+1)%128,j,k]-F[2][i-1,j,k]))
            VxF[2][i,j,k] = 0.5*((F[1][(i+1)%128,j,k]-F[1][i-1,j,k])
                                 -(F[0][i,(j+1)%128,k]-F[0][i,j-1,k]))
Just to reiterate, I'm looking for an algorithm that will compute the curl of a vector field array to second order, given periodic boundary conditions, faster than the one I have. Maybe there's nothing that will do this, but I just want to check before I keep spending time running this algorithm. Thank you, everyone, in advance!

There may be better tools for this, but here is a trivial 200x speedup with numba:
import numpy as np
from numba import jit

def pure_python():
    F = np.array([np.ones([128,128,128]), 2*np.ones([128,128,128]),
                  3*np.ones([128,128,128])])
    VxF = np.array([np.zeros([128,128,128]), np.zeros([128,128,128]),
                    np.zeros([128,128,128])])
    for i in range(0,128):
        for j in range(0,128):
            for k in range(0,128):
                VxF[0][i,j,k] = 0.5*((F[2][i,(j+1)%128,k]-F[2][i,j-1,k])
                                     -(F[1][i,j,(k+1)%128]-F[1][i,j,k-1]))
                VxF[1][i,j,k] = 0.5*((F[0][i,j,(k+1)%128]-F[0][i,j,k-1])
                                     -(F[2][(i+1)%128,j,k]-F[2][i-1,j,k]))
                VxF[2][i,j,k] = 0.5*((F[1][(i+1)%128,j,k]-F[1][i-1,j,k])
                                     -(F[0][i,(j+1)%128,k]-F[0][i,j-1,k]))
    return VxF

@jit(fastmath=True)
def with_numba():
    F = np.array([np.ones([128,128,128]), 2*np.ones([128,128,128]),
                  3*np.ones([128,128,128])])
    VxF = np.array([np.zeros([128,128,128]), np.zeros([128,128,128]),
                    np.zeros([128,128,128])])
    for i in range(0,128):
        for j in range(0,128):
            for k in range(0,128):
                VxF[0][i,j,k] = 0.5*((F[2][i,(j+1)%128,k]-F[2][i,j-1,k])
                                     -(F[1][i,j,(k+1)%128]-F[1][i,j,k-1]))
                VxF[1][i,j,k] = 0.5*((F[0][i,j,(k+1)%128]-F[0][i,j,k-1])
                                     -(F[2][(i+1)%128,j,k]-F[2][i-1,j,k]))
                VxF[2][i,j,k] = 0.5*((F[1][(i+1)%128,j,k]-F[1][i-1,j,k])
                                     -(F[0][i,(j+1)%128,k]-F[0][i,j-1,k]))
    return VxF
The pure Python version takes 13 seconds on my machine, while the numba version takes 65 ms.
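For reference, the same second-order stencil with periodic boundaries can also be written as a fully vectorized numpy expression using np.roll, which avoids the Python loops entirely. This is a minimal sketch, not part of the original answer; the function name curl_periodic is illustrative, and the grid spacing d is assumed to be 1 in the question's units:
import numpy as np

def curl_periodic(F, d=1.0):
    # Second-order centered differences with periodic boundaries via np.roll.
    # F has shape (3, A, B, C); axes 1, 2, 3 of each component are x, y, z.
    def ddx(a): return (np.roll(a, -1, axis=0) - np.roll(a, 1, axis=0)) / (2.0*d)
    def ddy(a): return (np.roll(a, -1, axis=1) - np.roll(a, 1, axis=1)) / (2.0*d)
    def ddz(a): return (np.roll(a, -1, axis=2) - np.roll(a, 1, axis=2)) / (2.0*d)
    Fx, Fy, Fz = F
    return np.array([ddy(Fz) - ddz(Fy),
                     ddz(Fx) - ddx(Fz),
                     ddx(Fy) - ddy(Fx)])

VxF = curl_periodic(np.random.random((3, 64, 64, 64)))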

Related

Custom vectorized non-linear filter in Numpy

In digital image processing, many filters are non-linear, such as the Harmonic Mean Filter.
I know Numpy provides many vectorized functions which can speed up computation tremendously, but I don't currently know of any that work well with non-linear masks.
Specifically, I want to speed up my implementation of the above filter, which means removing its two ugly, snail-paced Python for loops:
import math as m
import numpy as np
import cv2 as cv

def harmonic(im, ksize):
    # Make a copy of the original image
    result = im.copy().astype(np.float32)
    # Calculate padding size, and pad the original image
    psize = m.floor(ksize/2)  # padding size
    im = cv.copyMakeBorder(im, psize, psize, psize, psize, cv.BORDER_REFLECT)
    # Perform non-linear operations
    for i in range(0, result.shape[0]):
        for j in range(0, result.shape[1]):
            # Get the neighborhood same size as kernel
            neighbor = im[i:(i+2*psize+1), j:(j+2*psize+1)].astype(np.float32)
            # Calculate the reciprocal sum
            recp_sum = np.sum(np.reciprocal(neighbor, where=neighbor != 0).astype(np.float32))
            # Harmonic mean for that neighborhood
            if recp_sum != 0:
                result[i][j] = float((ksize*ksize)/recp_sum)
    return result.astype(np.uint8)
In general, can we use Numpy to create arbitrary custom vectorized operations on an array, or only a limited set of operations, and of what kinds? If it is possible, what specifically could I do to optimize the above code?
I have been exploring Numpy vectorization recently, and np.vectorize really caught my attention. However, the examples in the documentation seem (as far as I can tell) irrelevant to the problem I am trying to solve. (English is not my native language, so I may have missed something; I'd be happy to have it explained!)
Related to np.vectorize, I don't really understand the pyfunc parameter. Does it actually eliminate the traditional Python loops wrapped in that pyfunc, or is it there just to define a specific mapping at a specific pixel in the array?
The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals. That is,
tmp = 1 / im.astype(np.float32)
tmp = cv2.blur(tmp, (ksize, ksize))
out = 1 / tmp
You might want to add a bit of code there to avoid division by zero. The simplest way is to replace zeros with very small values.
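For instance, here is a minimal sketch of that approach with a small epsilon guard; harmonic_fast and eps are illustrative names, and clamping zeros to eps is just one way of handling the zero pixels:
import numpy as np
import cv2

def harmonic_fast(im, ksize, eps=1e-6):
    # Harmonic mean = reciprocal of the box-filtered reciprocal.
    img = im.astype(np.float32)
    tmp = 1.0 / np.maximum(img, eps)    # guard against zero pixels
    tmp = cv2.blur(tmp, (ksize, ksize), borderType=cv2.BORDER_REFLECT)
    out = 1.0 / np.maximum(tmp, eps)    # guard against zero means
    return np.clip(out, 0, 255).astype(np.uint8)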

problem with cooley-tukey FFT algorithm in python

I've recently learned about the Cooley-Tukey FFT algorithm. I want to gain a deeper understanding of it, so I decided to write my own (non-recursive) implementation. However, I can't get it to work. I've been messing with it for a few days, but it just won't give a good output.
The algorithm splits the DFT into even and odd DFTs and does this recursively until the DFTs consist of just a single data point.
I then combine the N DFTs from the ground up with twiddle factors, for every frequency, to get the complete DFT.
import math
import matplotlib.pyplot as plt
#Using numpy to work with complex numbers
import numpy as np

def twiddle(k,bits):
    #Generate twiddle factors for a frequency
    N=2**bits
    T=[]
    sign=1
    for i in range(bits):
        #Check if the frequency is in the upper or lower half of the range
        if k>=N//2:
            k-=N//2
            sign=-1
        #Generate complex twiddle factor for every stage of the algorithm
        temp=sign*np.exp(-1j*math.tau*k/N)
        T.append(temp)
        N=N//2
        sign=1
    return T

def FFT(data,bits):
    #Slice data to ensure its length is always a power of 2
    N=2**bits
    data=data[:N]
    F=[]
    #Calculate Fourier coefficient for every frequency
    for k in range(N):
        #Obtain twiddle factors for frequency
        T=twiddle(k,bits)
        #Copy input data into temporary array
        temp=[x for x in data]
        #Run through all stages
        for i in range(bits):
            #Combine even and odd partial DFT's with twiddle factor
            temp=[temp[2*j]+T[bits-i-1]*temp[2*j+1] for j in range(2**(bits-i-1))]
        F.append(temp[0])
    return F

#Generate some input data
bits=10
t=range(0,2**bits)
f=300
samplerate=5000
v=[10*math.sin(math.tau*f*x/samplerate) for x in t]
f=[samplerate*i*2**(-bits) for i in range(2**bits)]
#Run function and plot
F=FFT(v,bits)
F=np.array(F)
plt.plot(f,abs(F))
To give an idea, here is the plot this code yields. Obviously, since the input is a single 300 Hz sine wave, it should only return one peak at 300, which is then mirrored about the Nyquist frequency.
Any help would be greatly appreciated, I'm sure I've overlooked something or am just not using the right method.
I think you forgot about the bit-reversal permutation. The radix-2/4/8 FFT algorithm is supposed to operate in place, and to do so it requires the values to be in bit-reversed order.
Also, if you are going to dig deeper and implement a mixed-radix algorithm, which is a generalization of the Cooley-Tukey algorithm, then you will need to implement a mixed-radix digit reversal as well.
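For reference, here is a minimal sketch, not from the answer above, of how a bit-reversal permutation could be generated; the function name bit_reverse_indices is illustrative:
import numpy as np

def bit_reverse_indices(bits):
    # Index permutation used by an in-place radix-2 FFT: each index is
    # replaced by the value of its bits read in reverse order.
    n = 1 << bits
    rev = np.zeros(n, dtype=int)
    for i in range(n):
        b = 0
        x = i
        for _ in range(bits):
            b = (b << 1) | (x & 1)
            x >>= 1
        rev[i] = b
    return rev

print(bit_reverse_indices(3))   # [0 4 2 6 1 5 3 7]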

Using pairwise_distances_chunked to compute nearest neighbor search

I have a long skinny data matrix (size: 250,000 x 10), which I will denote X. I also have a vector p measuring the quality of my data points. My goal is to compute the following function for each row x in my data matrix X:
r(x) = min{ ||x-y|| | p[y]>p[x], y in X }
On a smaller dataset, I would use sklearn.metrics.pairwise_distances to precompute the distances, like so:
import numpy as np
from sklearn import metrics

n = len(X)
D_full = metrics.pairwise_distances(X)
r = np.zeros((n, 1))
for i in range(n):
    r[i] = (D_full[i, p > p[i]]).min()
However, the above approach is memory-expensive, since I need to store D_full: a full n x n matrix. It seems like sklearn.metrics.pairwise_distances_chunked could be a good tool for this sort of problem since the distance matrix is only stored one chunk at a time. I was hoping to get some assistance in how to use it though, as I'm currently unfamiliar with generator objects. Suppose I call the following:
from sklearn import metrics

D = metrics.pairwise_distances_chunked(X)
D_chunk = next(D)
The above code yields D (a generator object) and D_chunk (a 536 x n array). Does D_chunk correspond to the first 536 rows of the matrix D_full from my earlier approach? If so, does next(D) correspond to the next 536 rows?
Thank you very much for your help.
This is an outline of a possible solution, but details are missing. In short, I would do the following:
Create a BallTree to query, and initialise min_quality_distance of size 250,000 with, say, zeros.
Start with k=2.
For each vector, find the k closest neighbours (including itself).
If the most distant vector among those k has sufficient quality, update min_quality_distance for that point.
For the remaining points, repeat with k=k+1.
In each iteration, we have to query fewer vectors. The idea is that with every iteration you nibble away a few nearest neighbours that meet the condition, so the problem gets easier with every step (50% easier?). I will show how to do the first iteration, and from this it should be possible to build the loop.
You can do:
import numpy as np
size = 250000
X = np.random.random( size=(size,10))
p = np.random.random( size=size)
And create a BallTree with
from sklearn.neighbors import BallTree
tree = BallTree(X, leaf_size=10, metric='minkowski')
and query it for the first iteration with (this will take about 5 minutes):
k_nearest = 2
distances, indici = tree.query(X, k=k_nearest, return_distance=True, dualtree=False, sort_results=True)
For each vector, the index of the most distant point among its k nearest neighbours is
most_far_away_indici = indici[:,-1:]
and its quality is
p[most_far_away_indici]
So we can set
quality_closeby = p[most_far_away_indici]
and check whether it is sufficient with
indici_sufficient_quality = quality_closeby > np.expand_dims(p, axis=1)
And we have
found_closeby = np.all( indici_sufficient_quality, axis=1 )
which is True if we have found a point of sufficient quality nearby.
We can update the vector with
distances_nearby = distances[:,-1:]
rx = np.zeros(size)
rx[found_closeby] = distances_nearby[found_closeby][:,0]
We now need to take care of the remaining points, where we were unlucky; these are
~found_closeby
so
indici_not_found = ~found_closeby
and
distances, indici = tree.query(X[indici_not_found], k=3, return_distance=True, dualtree=False, sort_results=True)
etc..
I am sure the first few loops will take minutes, but after a few iterations each pass will take only seconds.
It is a little exercise with np.argwhere() etc. to make sure the right indices get updated.
It might not be the fastest, but it is a workable approach.
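Putting the steps above together, here is one hedged sketch of the full loop. It deviates from the description above in two places: it doubles k instead of incrementing it, and it takes the closest higher-quality neighbour within each batch rather than only the most distant one. All variable names and the reduced problem size are illustrative:
import numpy as np
from sklearn.neighbors import BallTree

size = 10000                      # smaller than 250,000 so the sketch runs quickly
X = np.random.random((size, 10))
p = np.random.random(size)

tree = BallTree(X, leaf_size=10)
r = np.full(size, np.nan)         # min_quality_distance; NaN = not found yet
remaining = np.arange(size)       # points still without an answer
k = 2
while remaining.size and k <= size:
    dist, ind = tree.query(X[remaining], k=k)       # sorted by distance
    better = p[ind] > p[remaining][:, None]         # higher-quality neighbours among the k
    found = better.any(axis=1)
    rows = np.flatnonzero(found)                    # rows that got a hit this round
    first = better[rows].argmax(axis=1)             # column of the closest such neighbour
    r[remaining[rows]] = dist[rows, first]
    remaining = remaining[~found]
    k *= 2
# Points whose quality is the global maximum keep NaN, since no better point exists.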
Since one cannot know the dimensions of a given chunk in advance, I suggest using np.ones_like instead of np.zeros.
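For completeness, here is a minimal sketch, not taken from the answers above, of how the chunked generator itself could be consumed to compute r; X and p are random stand-ins, and the running row offset is needed because consecutive chunks cover consecutive blocks of rows of the full distance matrix:
import numpy as np
from sklearn import metrics

X = np.random.random((10000, 10))
p = np.random.random(10000)

r = np.full(len(X), np.inf)       # stays inf for rows with no higher-quality point
row_offset = 0
for D_chunk in metrics.pairwise_distances_chunked(X):
    # Each chunk holds the distances for a consecutive block of rows.
    for local_i, row in enumerate(D_chunk):
        i = row_offset + local_i
        mask = p > p[i]
        if mask.any():
            r[i] = row[mask].min()
    row_offset += D_chunk.shape[0]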

How can I speed up nearest neighbor search with python?

I have some code which, for each unassigned voxel, finds the nearest assigned voxel. That is, I have an array of voxels; a few voxels already have a scalar value assigned (1, 2, 3, 4, ..., etc.), and the rest are empty (let's say with a value of '0'). The code below finds the nearest assigned voxel to each unassigned voxel and gives that unassigned voxel the same scalar. So a voxel with scalar '0' will be assigned a value (1 or 2 or 3, ...) based on its nearest voxel. This code works, but it takes too much time.
Is there an alternative to this? Or do you have any feedback on how to improve it further?
""" #self.voxels is a 3D numpy array"""
def fill_empty_voxel1(self,argx, argy, argz):
""" where # argx, argy, argz are the voxel location where the voxel is zero"""
argx1, argy1, argz1 = np.where(self.voxels!=0) # find the non zero voxels
a = np.column_stack((argx1, argy1, argz1))
b = np.column_stack((argx, argy, argz))
tree = cKDTree(a, leafsize=a.shape[0]+1)
distances, ndx = tree.query(b, k=1, distance_upper_bound= self.mean) # self.mean is a mean radius search value
argx2, argy2, argz2 = a[ndx][:][:,0],a[ndx][:][:,1],a[ndx][:][:,2]
self.voxels[argx,argy,argz] = self.voxels[argx2,argy2,argz2] # update the voxel array
Example
""" Here is a small example with small dataset:"""
import numpy as np
from scipy.spatial import cKDTree
import timeit
voxels = np.zeros((10,10,5), dtype=np.uint8)
voxels[1:2,:,:] = 5.
voxels[5:6,:,:] = 2.
voxels[:,3:4,:] = 1.
voxels[:,8:9,:] = 4.
argx, argy, argz = np.where(voxels==0)
tic=timeit.default_timer()
argx1, argy1, argz1 = np.where(voxels!=0) # non zero voxels
a = np.column_stack((argx1, argy1, argz1))
b = np.column_stack((argx, argy, argz))
tree = cKDTree(a, leafsize=a.shape[0]+1)
distances, ndx = tree.query(b, k=1, distance_upper_bound= 5.)
argx2, argy2, argz2 = a[ndx][:][:,0],a[ndx][:][:,1],a[ndx][:][:,2]
voxels[argx,argy,argz] = voxels[argx2,argy2,argz2]
toc=timeit.default_timer()
timetaken = toc - tic #elapsed time in seconds
print('\nTime to fill empty voxels', timetaken)
for visualization:
from mayavi import mlab
data = voxels.astype('float')
scalar_field = mlab.pipeline.scalar_field(data)
iso_surf = mlab.pipeline.iso_surface(scalar_field)
surf = mlab.pipeline.surface(scalar_field)
vol = mlab.pipeline.volume(scalar_field,vmin=0,vmax=data.max())
mlab.outline()
mlab.show()
Now, if the voxel array has dimensions of something like (500,500,500), the time it takes to perform the nearest-neighbour search is no longer acceptable. How can I overcome this? Could parallel computation reduce the time (I have no idea whether I can parallelize the code; if you do, please let me know)?
A potential fix:
I could substantially improve the computation time by adding the n_jobs = -1 parameter in the cKDTree query.
distances, ndx = tree.query(b, k=1, distance_upper_bound= 5., n_jobs=-1)
I was able to compute the distances in less than an hour for an array of (400,100,100) on a 13-core CPU. With 1 processor, the same array took around 18 hours to complete.
Thanks to @gsamaras for the answer!
You can switch to approximate nearest neighbors (ANN) algorithms which usually take advantage of sophisticated hashing or proximity graph techniques to index your data quickly and perform faster queries. One example is Spotify's Annoy. Annoy's README includes a plot which shows precision-performance tradeoff comparison of various ANN algorithms published in recent years. The top-performing algorithm (at the time this comment was posted), hnsw, has a Python implementation under Non-Metric Space Library (NMSLIB).
It would be interesting to try sklearn.neighbors.NearestNeighbors, which offers n_jobs parameter:
The number of parallel jobs to run for neighbors search.
This package also provides the Ball Tree algorithm, which you can test against the kd-tree one; however, my hunch is that the kd-tree will be better (but that again depends on your data, so research that!).
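To illustrate, here is a minimal sketch using sklearn.neighbors.NearestNeighbors with n_jobs; the arrays a and b stand in for the assigned and unassigned voxel coordinates built in the question, and the sizes are made up:
import numpy as np
from sklearn.neighbors import NearestNeighbors

a = np.random.randint(0, 500, size=(10000, 3))    # coordinates of assigned voxels
b = np.random.randint(0, 500, size=(50000, 3))    # coordinates of empty voxels

# algorithm can be 'kd_tree' or 'ball_tree'; n_jobs=-1 uses all cores.
nn = NearestNeighbors(n_neighbors=1, algorithm='kd_tree', n_jobs=-1)
nn.fit(a)
distances, ndx = nn.kneighbors(b)   # both have shape (len(b), 1)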
You might also want to use dimensionality reduction, which is easy to apply. The idea is that you reduce your dimensions, so your data contain less information, and tackling the Nearest Neighbour Problem becomes much faster. Of course, there is a trade-off here: accuracy!
You might (or will) get less accuracy with dimensionality reduction, but it might be worth a try. However, this usually applies in high-dimensional spaces, and you are only in 3D, so I don't know whether it would make sense for your specific case to use sklearn.decomposition.PCA.
A remark:
If you really want high performance, though, you won't get it with Python; you could switch to C++ and use CGAL, for example.

Cython and numpy speed

I'm using cython for a correlation calculation in my python program. I have two audio data sets and I need to know the time difference between them. The second set is cut based on onset times and then slid across the first set. There are two for-loops: one slides the set and the inner loop calculates correlation at that point. This method works very well and it's accurate enough.
The problem is that with pure Python this takes more than a minute. With my Cython code, it takes about 17 seconds. This is still too much. Do you have any hints on how to speed up this code:
import numpy as np
cimport numpy as np
cimport cython

FTYPE = np.float64
ctypedef np.float64_t FTYPE_t

@cython.boundscheck(False)
def delay(np.ndarray[FTYPE_t, ndim=1] f, np.ndarray[FTYPE_t, ndim=1] g):
    cdef int size1 = f.shape[0]
    cdef int size2 = g.shape[0]
    cdef int max_correlation = 0
    cdef int delay = 0
    cdef int current_correlation, i, j

    # Move second data set frame by frame
    for i in range(0, size1 - size2):
        current_correlation = 0
        # Calculate correlation at that point
        for j in range(size2):
            current_correlation += f[<unsigned int>(i+j)] * g[j]
        # Check if current correlation is highest so far
        if current_correlation > max_correlation:
            max_correlation = current_correlation
            delay = i
    return delay
Edit:
There's now scipy.signal.fftconvolve, which would be the preferred way to do the FFT-based convolution I describe below. I'll leave the original answer to explain the speed issue, but in practice use scipy.signal.fftconvolve.
Original answer:
Using FFTs and the convolution theorem will give you dramatic speed gains by converting the problem from O(n^2) to O(n log n). This is particularly useful for long data sets, like yours, and can give speed gains of 1000s or much more, depending on length. It's also easy to do: just FFT both signals, multiply, and inverse FFT the product. numpy.correlate doesn't use the FFT method in the cross-correlation routine and is better used with very small kernels.
Here's an example
from timeit import Timer
from numpy import *

times = arange(0, 100, .001)
xdata = 1.*sin(2*pi*1.*times) + .5*sin(2*pi*1.1*times + 1.)
ydata = .5*sin(2*pi*1.1*times)

def xcorr(x, y):
    return correlate(x, y, mode='same')

def fftxcorr(x, y):
    fx, fy = fft.fft(x), fft.fft(y[::-1])
    fxfy = fx*fy
    xy = fft.ifft(fxfy)
    return xy

if __name__ == "__main__":
    N = 10
    t = Timer("xcorr(xdata, ydata)", "from __main__ import xcorr, xdata, ydata")
    print('xcorr', t.timeit(number=N)/N)
    t = Timer("fftxcorr(xdata, ydata)", "from __main__ import fftxcorr, xdata, ydata")
    print('fftxcorr', t.timeit(number=N)/N)
Which gives the running times per cycle (in seconds, for a 10,000 long waveform)
xcorr 34.3761689901
fftxcorr 0.0768054962158
It's clear the fftxcorr method is much faster.
If you plot the results, you'll see that they are very similar near zero time shift. Note, though, that as you get further away the xcorr result will decrease and the fftxcorr result won't. This is because it's a bit ambiguous what to do with the parts of the waveform that don't overlap when the waveforms are shifted: xcorr treats them as zero, while the FFT treats the waveforms as periodic. If that's an issue, it can be fixed by zero padding.
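As a usage note, here is a minimal sketch of recovering a delay with scipy.signal.fftconvolve, as recommended in the edit above; the signals are synthetic and the variable names are illustrative:
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
f = rng.standard_normal(100000)
true_delay = 12345
g = f[true_delay:true_delay + 20000]      # g is a delayed segment of f

# Cross-correlation via FFT: correlate(f, g) is fftconvolve(f, g[::-1]).
corr = fftconvolve(f, g[::-1], mode='valid')
delay = int(np.argmax(corr))
print(delay)                              # should recover true_delay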
The trick with this sort of thing is to find a way to divide and conquer.
Currently, you're sliding to every position and checking every point at every position -- effectively an O(n^2) operation.
You need to reduce the check of every point and the comparison of every position to something that does less work to determine a non-match.
For example, you could have a shorter "is this even close?" filter that checks the first few positions. If the correlation is above some threshold, then keep going otherwise give up and move on.
You could have a "check every 8th position" that you multiply by 8. If this is too low, skip it and move on. If this is high enough, then check all of the values to see if you've found the maxima.
The issue is the time required to do all these multiplies -- (f[<unsigned int>(i+j)] * g[j]). In effect, you're filling a big matrix with all these products and picking the row with the maximum sum. You don't want to compute all the products, just enough of them to be sure you've found the maximum sum.
The issue with finding maxima is that you have to sum everything to see if it's the biggest. If you can turn this into a minimization problem, it's easier to abandon computing products and sums once an intermediate result exceeds a threshold.
(I think this might work. I haven't tried it.)
If you used max(g)-g[j] to work with non-negative numbers, you'd be looking for the smallest sum, not the biggest. You could compute the correlation for the first position; anything that sums to a bigger value could be stopped immediately -- no more multiplies or adds for that offset, shift to another.
You can extract range(size2) from the outer loop.
You can use sum() instead of a loop to compute current_correlation.
You can store the correlations and delays in a list and then use max() to get the biggest one.
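As an illustration of the second suggestion, here is a hedged numpy sketch that replaces the inner correlation loop with a vectorized dot product; delay_np is an illustrative name, and this keeps the O(n^2) outer loop, unlike the FFT approach above:
import numpy as np

def delay_np(f, g):
    # Same brute-force search as the Cython code, but the inner loop
    # is a single dot product per offset.
    n = len(f) - len(g)
    corr = np.array([np.dot(f[i:i+len(g)], g) for i in range(n)])
    return int(np.argmax(corr))

rng = np.random.default_rng(1)
f = rng.standard_normal(5000)
g = f[700:1200]           # g is a slice of f starting at offset 700
print(delay_np(f, g))     # expected to print 700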
