Cython and numpy speed

Cython and numpy speed - python

I'm using cython for a correlation calculation in my python program. I have two audio data sets and I need to know the time difference between them. The second set is cut based on onset times and then slid across the first set. There are two for-loops: one slides the set and the inner loop calculates correlation at that point. This method works very well and it's accurate enough.
The problem is that with pure python this takes more than one minute. With my cython code, it takes about 17 seconds. This still is too much. Do you have any hints how to speed-up this code:
import numpy as np
cimport numpy as np
cimport cython
FTYPE = np.float
ctypedef np.float_t FTYPE_t
#cython.boundscheck(False)
def delay(np.ndarray[FTYPE_t, ndim=1] f, np.ndarray[FTYPE_t, ndim=1] g):
cdef int size1 = f.shape[0]
cdef int size2 = g.shape[0]
cdef int max_correlation = 0
cdef int delay = 0
cdef int current_correlation, i, j
# Move second data set frame by frame
for i in range(0, size1 - size2):
current_correlation = 0
# Calculate correlation at that point
for j in range(size2):
current_correlation += f[<unsigned int>(i+j)] * g[j]
# Check if current correlation is highest so far
if current_correlation > max_correlation:
max_correlation = current_correlation
delay = i
return delay

Edit:
There's now scipy.signal.fftconvolve which would be the preferred approach to doing the FFT based convolution approach that I describe below. I'll leave the original answer to explain the speed issue, but in practice use scipy.signal.fftconvolve.
Original answer:
Using FFTs and the convolution theorem will give you dramatic speed gains by converting the problem from O(n^2) to O(n log n). This is particularly useful for long data sets, like yours, and can give speed gains of 1000s or much more, depending on length. It's also easy to do: just FFT both signals, multiply, and inverse FFT the product. numpy.correlate doesn't use the FFT method in the cross-correlation routine and is better used with very small kernels.
Here's an example
from timeit import Timer
from numpy import *
times = arange(0, 100, .001)
xdata = 1.*sin(2*pi*1.*times) + .5*sin(2*pi*1.1*times + 1.)
ydata = .5*sin(2*pi*1.1*times)
def xcorr(x, y):
return correlate(x, y, mode='same')
def fftxcorr(x, y):
fx, fy = fft.fft(x), fft.fft(y[::-1])
fxfy = fx*fy
xy = fft.ifft(fxfy)
return xy
if __name__ == "__main__":
N = 10
t = Timer("xcorr(xdata, ydata)", "from __main__ import xcorr, xdata, ydata")
print 'xcorr', t.timeit(number=N)/N
t = Timer("fftxcorr(xdata, ydata)", "from __main__ import fftxcorr, xdata, ydata")
print 'fftxcorr', t.timeit(number=N)/N
Which gives the running times per cycle (in seconds, for a 10,000 long waveform)
xcorr 34.3761689901
fftxcorr 0.0768054962158
It's clear the fftxcorr method is much faster.
If you plot out the results, you'll see that they are very similar near zero time shift. Note, though, as you get further away the xcorr will decrease and the fftxcorr won't. This is because it's a bit ambiguous what to do with the parts of the waveform that don't overlap when the waveforms are shifted. xcorr treats it as zero and the FFT treats the waveforms as periodic, but if it's an issue it can be fixed by zero padding.

The trick with this sort of thing is to find a way to divide and conquer.
Currently, you're sliding to every position and check every point at every position -- effectively an O( n ^ 2 ) operation.
You need to reduce the check of every point and the comparison of every position to something that does less work to determine a non-match.
For example, you could have a shorter "is this even close?" filter that checks the first few positions. If the correlation is above some threshold, then keep going otherwise give up and move on.
You could have a "check every 8th position" that you multiply by 8. If this is too low, skip it and move on. If this is high enough, then check all of the values to see if you've found the maxima.
The issue is the time required to do all these multiplies -- (f[<unsigned int>(i+j)] * g[j]) In effect, you're filling a big matrix with all these products and picking the row with the maximum sum. You don't want to compute "all" the products. Just enough of the products to be sure you've found the maximum sum.
The issue with finding maxima is that you have to sum everything to see if it's biggest. If you can turn this into a minimization problem, it's easier to abandon computing products and sums once an intermediate result exceeds a threshold.
(I think this might work. I have't tried it.)
If you used max(g)-g[j] to work with negative numbers, you'd be looking for the smallest, not the biggest. You could compute the correlation for the first position. Anything that summed to a bigger value could be stopped immediately -- no more multiplies or adds for that offset, shift to another.

you can extract range(size2) from the external loop
you can use sum() instead of a loop to compute current_correlation
you can store correlations and delays in a list and then use max() to get the biggest one

Related

Is there a way to speed up or vectorize a nested for loop?

I am fairly new to coding in general and to Python in particular. I am trying to apply a weighted average scheme into a big dataset, which at the moment is taking hours to complete and I would love to speed up the process also because this has to be repeated several times.
The weighted average represents a method used in marine biogeochemistry that includes the history of gas transfer velocities (k) in-between sampling dates, where k is weighted according to the fraction of water column (f) ventilated by the atmosphere as a function of the history of k and assigning more importance to values that are closer to sampling time (so the weight at sampling time step = 1 and then it decreases moving away in time):
Weight average equation extracted from (https://doi.org/10.1029/2017GB005874) pp. 1168
In my attempt I used a nested for loop where at each time step t I calculated the weighted average:
def kw_omega (k, depth, window, samples_day):
"""
calculate the scheme weights for gas transfer velocity of oxygen
over the previous window of time, where the most recent gas transfer velocity
has a weight of 1, and the weighting decreases going back in time. The rate of decrease
depends on the wind history and MLD.
Parameters
----------
k: ndarray
instantaneous O2 gas transfer velocity
depth: ndarray
Water depth
window: integer
weighting period in days which equals the residence time of oxygen at sampling day
samples_day: integer
number of samples in each day composing window
Returns
---------
weighted_kw: ndarray
Notes
---------
n = the weighting period / the time resolution of the wind data
samples_day = the time resolution of the wind data
omega = is the weighting coefficient at each time step within the weighting window
f = the fraction of the water column (mixed layer, photic zone or full water column) ventilated at each time
"""
Dt = 1./samples_day
f = (k*Dt)/depth
f = np.flip(f)
k = np.flip(k)
n = window*samples_day
weighted_kw = np.zeros(len(k))
for t in np.arange(len(k) - n):
omega = np.zeros((n))
omega[0] = 1.
for i in np.arange(1,len(omega)):
omega[i] = omega[i-1]*(1-f[t+(i-1)])
weighted_kw[t] = sum(k[t:t+n]*omega)/sum(omega)
print(f"t = {t}")
return np.flip(weighted_kw)
This should be used on model simulation data which was set to run for almost 2 years where the model time step was set to 60 seconds, and sampling is done at intervals of 7 days. Therefore k has shape (927360) and n, representing the number of minutes in 7 days has shape (10080). At the moment it is taking several hours to run. Is there a way to make this calculation faster?

I would recommend to use the package numba to speed up your calculation.
import numpy as np
from numba import njit
from numpy.lib.stride_tricks import sliding_window_view
#njit
def k_omega(k_win, f_win):
delta_t = len(k_win)
omega_sum = omega = 1.0
k_omega_sum = k_win[0]
for t in range(1, delta_t):
omega *= (1 - f_win[t])
omega_sum += omega
k_omega_sum = k_win[t] * omega
return k_omega_sum / omega_sum
#njit
def windows_k_omega(k_wins, f_wins):
size = len(k_wins)
result = np.empty(size)
for i in range(size):
result[i] = k_omega(k_wins[i], f_wins[i])
return result
def kw_omega(k, depth, window, samples_day):
n = window * samples_day # delta_t
f = k / depth / samples_day
k_wins = sliding_window_view(k, n)
f_wins = sliding_window_view(f, n)
k_omegas = windows_k_omega(k_wins, f_wins)
weighted_kw = np.pad(weighted_kw, (len(k)-len(k_omegas), 0))
return weighted_kw
Here, I have split up the function into three in order to make it more comprehensible. The function k_omega is basically applying your weighted mean function to a k and f window. The function windows_k_omega is just to speed up the loop to apply the function element wise on the windows. Finally, the outer function kw_omega implements your original function interface. It uses the numpy function sliding_window_view to create the moving windows (note that this is a fancy numpy indexing under the hood, so this is not creating a copy of the original array) and performs the calculation with the helper functions and takes care of the padding of the result array (initial zeros).
A short test with your original function showed some different results, which is likely due to your np.flip calls reverse the arrays for your indexing. I just implemented the formula which you provided without checking your indexing in depth, so I leave this task to you. You should maybe call it with some dummy inputs which you can check manually.
As an additional note to your code: If you want to loop on index, you should use the build in range instead of using np.arange. Internally, python uses a generator for range instead of creating the array of indexes first to iterate over each individually. Furthermore, you should try to reduce the number of arrays which you need to create, but instead re-use them, e.g. the omega = np.zeros(n) could be created outside the outer for loop using omega = np.empty(n) and internally only initialized on each new iteration omega[:] = 0.0. Note, that all kind of memory management which is typically the speed penalty, beside array element access by index, is something which you need to do with numpy yourself, because there is no compiler which helps you, therefore I recommend using numba, which compiles your python code and helps you in many ways to make your number crunching faster.

problem with cooley-tukey FFT algorithm in python

I've recently learned about the Cooley-Tukey FFT algorithm. I want to gain a deeper understanding of this algorithm and thus decided to write my own (non-recursive) implementation of it. However I can't get it to work. I've been messing with it for a few days but it just won't give a good output.
The output splits the DFT into even and odd DFTs and does this recursively until the DFTs consist of just a single data point.
I combine the N DFTs from the ground up with twiddle factors, for every frequency to get the complete DFT.
import math
import matplotlib.pyplot as plt
#Using numpy to work with complex numbers
import numpy as np
def twiddle(k,bits):
#Generate twiddle factors for a frequency
N=2**bits
T=[]
sign=1
for i in range(bits):
#Check if the frequency is in the upper or lower half of the range
if k>=N//2:
k-=N//2
sign=-1
#Generate complex twiddle factor for every stage of the algorithm
temp=sign*np.exp(-1j*math.tau*k/N)
T.append(temp)
N=N//2
sign=1
return T
def FFT(data,bits):
#Slice data to ensure its length is always a power of 2
N=2**bits
data=data[:N]
F=[]
#Calculate Fourier coefficient for every frequency
for k in range(N):
#Obtain twiddle factors for frequency
T=twiddle(k,bits)
#Copy input data into temporary array
temp=[x for x in data]
#Run through all stages
for i in range(bits):
#Combine even and odd partial DFT's with twiddle factor
temp=[temp[2*j]+T[bits-i-1]*temp[2*j+1] for j in range(2**(bits-i-1))]
F.append(temp[0])
return F
#Generate some input data
bits=10
t=range(0,2**bits)
f=300
samplerate=5000
v=[10*math.sin(math.tau*f*x/samplerate) for x in t]
f=[samplerate*i*2**(-bits) for i in range(2**bits)]
#Run function and plot
F=FFT(v,bits)
F=np.array(F)
plt.plot(f,abs(F))
To give an idea here is the the plot this code yields. Obviously since the input is a single 300Hz sinewave it should only return one peak at 300, which is then mirrored in the Nyquist frequency.
Any help would be greatly appreciated, I'm sure I've overlooked something or am just not using the right method.

I think you forgot about a bit-reversal permutaion. Radix-2|4|8 FFT algorithm is supposed to operate in-place and to do so it requires the values to be in a bit-reversed order.
Also, if you gonna dig deeper and to implement mixed-radix algorithm which is a generalization of Cooley-Tukey algorithm then you will need to implement a mixed-radix reversal as well

Fastest algorithm for computing 3-D curl

I'm trying to write a section of code that computes the curl of a vector field numerically to second order with periodic boundary conditions. However, the algorithm I made is very slow and I'm wondering if anyone knows of any alternative algorithms.
To give more specific context: I'm using a 3xAxBxC numpy array as my vector field where the first axis refers to the Cartesian direction (x,y,z) and A,B,C refer to the number of bins in that Cartesian direction (i.e the resolution). So for example, I might have a vector field F = np.zeros((3,64,64,64)) where Fx = F[0] is a 64x64x64 Cartesian lattice in its own right. So far, my solution was to use the 3-point centered difference stencil to calculate the derivatives and used a nested loop to iterate over all the different dimensions using modular arithmetic to enforce the periodic boundary conditions (see below for example). However, as my resolution increases (the size of A,B,C) this begins to take a long time (upwards 2 minutes, which adds up if I do this several hundred times for my simulation - this is just one small part of a larger algorithm). I was wondering if anyone know of an alternative method for doing this?
import numpy as np
F =np.array([np.ones([128,128,128]),2*np.ones([128,128,128]),
3*np.ones([128,128,128])])
VxF =np.array([np.zeros([128,128,128]),np.zeros([128,128,128]),
np.zeros([128,128,128])])
for i in range(0,128):
for j in range(0,128):
for k in range(0,128):
VxF[0][i,j,k] = 0.5*((F[2][i,(j+1)%128,k]-
F[2][i,j-1,k])-(F[1][i,j,(k+1)%128]-F[1][i,j,k-1]))
VxF[1][i,j,k] = 0.5*((F[0][i,j,(k+1)%128]-
F[0][i,j,k-1])-(F[2][(i+1)%128,j,k]-F[2][i-1,j,k]))
VxF[2][i,j,k] = 0.5*((F[1][(i+1)%128,j,k]-
F[1][i-1,j,k])-(F[0][i,(j+1)%128,k]-F[0][i,j-1,k]))
Just to re-iterate, I'm looking for an algorithm that'll compute the curl of a vector field array to second order given periodic boundary conditions faster than the one I have. Maybe there's nothing that will do this, but I just want to check before I keep spending time running this algorithm. Thank. you everyone in advance!

There may be better tools for this, but here is a trivial 200x speedup with numba:
import numpy as np
from numba import jit
def pure_python():
F =np.array([np.ones([128,128,128]),2*np.ones([128,128,128]),
3*np.ones([128,128,128])])
VxF =np.array([np.zeros([128,128,128]),np.zeros([128,128,128]),
np.zeros([128,128,128])])
for i in range(0,128):
for j in range(0,128):
for k in range(0,128):
VxF[0][i,j,k] = 0.5*((F[2][i,(j+1)%128,k]-
F[2][i,j-1,k])-(F[1][i,j,(k+1)%128]-F[1][i,j,k-1]))
VxF[1][i,j,k] = 0.5*((F[0][i,j,(k+1)%128]-
F[0][i,j,k-1])-(F[2][(i+1)%128,j,k]-F[2][i-1,j,k]))
VxF[2][i,j,k] = 0.5*((F[1][(i+1)%128,j,k]-
F[1][i-1,j,k])-(F[0][i,(j+1)%128,k]-F[0][i,j-1,k]))
return VxF
#jit(fastmath=True)
def with_numba():
F =np.array([np.ones([128,128,128]),2*np.ones([128,128,128]),
3*np.ones([128,128,128])])
VxF =np.array([np.zeros([128,128,128]),np.zeros([128,128,128]),
np.zeros([128,128,128])])
for i in range(0,128):
for j in range(0,128):
for k in range(0,128):
VxF[0][i,j,k] = 0.5*((F[2][i,(j+1)%128,k]-
F[2][i,j-1,k])-(F[1][i,j,(k+1)%128]-F[1][i,j,k-1]))
VxF[1][i,j,k] = 0.5*((F[0][i,j,(k+1)%128]-
F[0][i,j,k-1])-(F[2][(i+1)%128,j,k]-F[2][i-1,j,k]))
VxF[2][i,j,k] = 0.5*((F[1][(i+1)%128,j,k]-
F[1][i-1,j,k])-(F[0][i,(j+1)%128,k]-F[0][i,j-1,k]))
return VxF
The pure Python version takes 13 seconds on my machine, while the numba version takes 65 ms.

Find plateau in Numpy array

I am looking for an efficient way to detect plateaus in otherwise very noisy data. The plateaus are always relatively broad A simple example of what this data could look like:
test=np.random.uniform(0.9,1,100)
test[10:20]=0
plt.plot(test)
Note that there can be multiple plateaus (which should all be detected) which can have different values.
I've tried using scipy.signal.argrelextrema, but it doesn't seem to be doing what I want it to:
peaks=argrelextrema(test,np.less,order=25)
plt.vlines(peaks,ymin=0, ymax=1)
I don't need the exact interval of the plateau- a rough range estimate would be enough, as long as that estimate is bigger or equal than the actual plateau range. It should be relatively efficient however.

There is a method scipy.signal.find_peaks that you can try, here is an exmple
import numpy
from scipy.signal import find_peaks
test = numpy.random.uniform(0.9, 1.0, 100)
test[10 : 20] = 0
peaks, peak_plateaus = find_peaks(- test, plateau_size = 1)
although find_peaks only finds peaks, it can be used to find valleys if the array is negated, then you do the following
for i in range(len(peak_plateaus['plateau_sizes'])):
if peak_plateaus['plateau_sizes'][i] > 1:
print('a plateau of size %d is found' % peak_plateaus['plateau_sizes'][i])
print('its left index is %d and right index is %d' % (peak_plateaus['left_edges'][i], peak_plateaus['right_edges'][i]))
it will print
a plateau of size 10 is found
its left index is 10 and right index is 19

This is really just a "dumb" machine learning task. You'll want to code a custom function to screen for them. You have two key characteristics to a plateau:
They're consecutive occurrences of the same value (or very nearly so).
The first and last points deviate strongly from a forward and backward moving average, respectively. (Try quantifying this based on the standard deviation if you expect additive noise, for geometric noise you'll have to take the magnitude of your signal into account too.)
A simple loop should then be sufficient to calculate a forward moving average, stdev of points in that forward moving average, reverse moving average, and stdev of points in that reverse moving average.
Read until you find a point well outside the regular noise (compare to variance). Start buffering those indices into a list.
Keep reading and buffering indices into that list while they have the same value (or nearly the same, if your plateaus can be a little rough; you'll want to use some tolerance plus the standard deviation of your plateaus, or just some tolerance if you expect them all to behave similarly).
If the variance of the points in your buffer gets too high, it's not a plateau, too rough; throw it out and start scanning again from your current position.
If the last value was very different from the previous (on the order of the change that triggered your code to start buffering indices) and in the opposite direction of the original impulse, cap your buffer here; you've got a plateau there.
Now do whatever you want with the points at those indices. Delete them, replace them with a linear interpolation between the two boundary points, whatever.
I could generate some noise and give you some sample code, but this is really something you're going to have to adapt to your application. (For example, there's a shortcoming in this method that a plateau which captures a point on the middle of the "cliff edge" may leave that point when it removes the rest of the plateau. If that's something you're worried about, you'll have to do a little more exploring after you ID the plateau.) You should be able to do this in a single pass over the data, but it might be wise to get some statistics on the whole set first to intelligently tweak your thresholds.
If you have an exact definition of what constitutes a plateau, you can make this a lot less hand-wavey and ML-looking, but so long as you're trying to identify fuzzy pattern, you're gonna have to take a statistics-based approach.

I had a similar problem, and found a simple heuristic solution shared below. I find plateaus as ranges of constant gradient of the signal. You could change the code to also check that the gradient is (close to) 0.
I apply a moving average (uniform_filter_1d) to filter out noise. Also, I calculate the first and second derivative of the signal numerically, so I'm not sure it matches the requirement of efficiency. But it worked perfectly for my signal and might be a good starting point for others.
def find_plateaus(F, min_length=200, tolerance = 0.75, smoothing=25):
'''
Finds plateaus of signal using second derivative of F.
Parameters
----------
F : Signal.
min_length: Minimum length of plateau.
tolerance: Number between 0 and 1 indicating how tolerant
the requirement of constant slope of the plateau is.
smoothing: Size of uniform filter 1D applied to F and its derivatives.
Returns
-------
plateaus: array of plateau left and right edges pairs
dF: (smoothed) derivative of F
d2F: (smoothed) Second Derivative of F
'''
import numpy as np
from scipy.ndimage.filters import uniform_filter1d
# calculate smooth gradients
smoothF = uniform_filter1d(F, size = smoothing)
dF = uniform_filter1d(np.gradient(smoothF),size = smoothing)
d2F = uniform_filter1d(np.gradient(dF),size = smoothing)
def zero_runs(x):
'''
Helper function for finding sequences of 0s in a signal
https://stackoverflow.com/questions/24885092/finding-the-consecutive-zeros-in-a-numpy-array/24892274#24892274
'''
iszero = np.concatenate(([0], np.equal(x, 0).view(np.int8), [0]))
absdiff = np.abs(np.diff(iszero))
ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
return ranges
# Find ranges where second derivative is zero
# Values under eps are assumed to be zero.
eps = np.quantile(abs(d2F),tolerance)
smalld2F = (abs(d2F) <= eps)
# Find repititions in the mask "smalld2F" (i.e. ranges where d2F is constantly zero)
p = zero_runs(np.diff(smalld2F))
# np.diff(p) gives the length of each range found.
# only accept plateaus of min_length
plateaus = p[(np.diff(p) > min_length).flatten()]
return (plateaus, dF, d2F)

How can I speed up nearest neighbor search with python?

I have a code, which calculates the nearest voxel (which is unassigned) to a voxel ( which is assigned). That is i have an array of voxels, few voxels already have a scalar (1,2,3,4....etc) values assigned, and few voxels are empty (lets say a value of '0'). This code below finds the nearest assigned voxel to an unassigned voxel and assigns that voxel the same scalar. So, a voxel with a scalar '0' will be assigned a value (1 or 2 or 3,...) based on the nearest voxel. This code below works, but it takes too much time.
Is there an alternative to this ? or if you have any feedback on how to improve it further?
""" #self.voxels is a 3D numpy array"""
def fill_empty_voxel1(self,argx, argy, argz):
""" where # argx, argy, argz are the voxel location where the voxel is zero"""
argx1, argy1, argz1 = np.where(self.voxels!=0) # find the non zero voxels
a = np.column_stack((argx1, argy1, argz1))
b = np.column_stack((argx, argy, argz))
tree = cKDTree(a, leafsize=a.shape[0]+1)
distances, ndx = tree.query(b, k=1, distance_upper_bound= self.mean) # self.mean is a mean radius search value
argx2, argy2, argz2 = a[ndx][:][:,0],a[ndx][:][:,1],a[ndx][:][:,2]
self.voxels[argx,argy,argz] = self.voxels[argx2,argy2,argz2] # update the voxel array
Example
""" Here is a small example with small dataset:"""
import numpy as np
from scipy.spatial import cKDTree
import timeit
voxels = np.zeros((10,10,5), dtype=np.uint8)
voxels[1:2,:,:] = 5.
voxels[5:6,:,:] = 2.
voxels[:,3:4,:] = 1.
voxels[:,8:9,:] = 4.
argx, argy, argz = np.where(voxels==0)
tic=timeit.default_timer()
argx1, argy1, argz1 = np.where(voxels!=0) # non zero voxels
a = np.column_stack((argx1, argy1, argz1))
b = np.column_stack((argx, argy, argz))
tree = cKDTree(a, leafsize=a.shape[0]+1)
distances, ndx = tree.query(b, k=1, distance_upper_bound= 5.)
argx2, argy2, argz2 = a[ndx][:][:,0],a[ndx][:][:,1],a[ndx][:][:,2]
voxels[argx,argy,argz] = voxels[argx2,argy2,argz2]
toc=timeit.default_timer()
timetaken = toc - tic #elapsed time in seconds
print '\nTime to fill empty voxels', timetaken
for visualization:
from mayavi import mlab
data = voxels.astype('float')
scalar_field = mlab.pipeline.scalar_field(data)
iso_surf = mlab.pipeline.iso_surface(scalar_field)
surf = mlab.pipeline.surface(scalar_field)
vol = mlab.pipeline.volume(scalar_field,vmin=0,vmax=data.max())
mlab.outline()
mlab.show()
Now, if I have the dimension of the voxels array as something like (500,500,500), then the time it takes to compute the nearest search is no longer efficient. How can I overcome this? Could parallel computation reduce the time (I have no idea whether I can parallelize the code, if you do, please let me know)?
A potential fix:
I could substantially improve the computation time by adding the n_jobs = -1 parameter in the cKDTree query.
distances, ndx = tree.query(b, k=1, distance_upper_bound= 5., n_jobs=-1)
I was able to compute the distances in less than a hour for an array of (400,100,100) on a 13 core CPU. I tried with 1 processor and it takes around 18 hours to complete the same array.
Thanks to #gsamaras for the answer!

You can switch to approximate nearest neighbors (ANN) algorithms which usually take advantage of sophisticated hashing or proximity graph techniques to index your data quickly and perform faster queries. One example is Spotify's Annoy. Annoy's README includes a plot which shows precision-performance tradeoff comparison of various ANN algorithms published in recent years. The top-performing algorithm (at the time this comment was posted), hnsw, has a Python implementation under Non-Metric Space Library (NMSLIB).

It would be interesting to try sklearn.neighbors.NearestNeighbors, which offers n_jobs parameter:
The number of parallel jobs to run for neighbors search.
This package also provides the Ball Tree algorithm, which you can test versus the kd-tree one, however my hunch is that the kd-tree will be better (but that again does depend on your data, so research that!).
You might also want to use dimensionality reduction, which is easy. The idea is that you reduce your dimensions, thus your data contain less info, so that tackling the Nearest Neighbour Problem can be done much faster. Of course, there is a trade off here, accuracy!
You might/will get less accuracy with dimensionality reduction, but it might worth the try. However, this usually applies in a high dimensional space, and you are just in 3D. So I don't know if for your specific case it would make sense to use sklearn.decomposition.PCA.
A remark:
If you really want high performance though, you won't get it with python, you could switch to c++, and use CGAL for example.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.