Problem with Cooley-Tukey FFT algorithm in Python

I've recently learned about the Cooley-Tukey FFT algorithm. I want to gain a deeper understanding of this algorithm and thus decided to write my own (non-recursive) implementation of it. However, I can't get it to work. I've been messing with it for a few days, but it just won't give a good output.
The algorithm splits the DFT into even and odd DFTs and does this recursively until each DFT consists of just a single data point.
I then combine the N single-point DFTs from the ground up with twiddle factors, for every frequency, to get the complete DFT.
import math
import matplotlib.pyplot as plt
# Using numpy to work with complex numbers
import numpy as np

def twiddle(k, bits):
    # Generate twiddle factors for a frequency
    N = 2**bits
    T = []
    sign = 1
    for i in range(bits):
        # Check if the frequency is in the upper or lower half of the range
        if k >= N//2:
            k -= N//2
            sign = -1
        # Generate complex twiddle factor for every stage of the algorithm
        temp = sign*np.exp(-1j*math.tau*k/N)
        T.append(temp)
        N = N//2
        sign = 1
    return T
def FFT(data, bits):
    # Slice data to ensure its length is always a power of 2
    N = 2**bits
    data = data[:N]
    F = []
    # Calculate Fourier coefficient for every frequency
    for k in range(N):
        # Obtain twiddle factors for frequency
        T = twiddle(k, bits)
        # Copy input data into temporary array
        temp = [x for x in data]
        # Run through all stages
        for i in range(bits):
            # Combine even and odd partial DFTs with twiddle factor
            temp = [temp[2*j] + T[bits-i-1]*temp[2*j+1] for j in range(2**(bits-i-1))]
        F.append(temp[0])
    return F
# Generate some input data
bits = 10
t = range(0, 2**bits)
f = 300
samplerate = 5000
v = [10*math.sin(math.tau*f*x/samplerate) for x in t]
f = [samplerate*i*2**(-bits) for i in range(2**bits)]
# Run function and plot
F = FFT(v, bits)
F = np.array(F)
plt.plot(f, abs(F))
To give an idea, here is the plot this code yields. Obviously, since the input is a single 300 Hz sine wave, it should only return one peak at 300, which is then mirrored in the Nyquist frequency.
Any help would be greatly appreciated, I'm sure I've overlooked something or am just not using the right method.
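One quick sanity check for an implementation like this (not part of the original post) is to compare its output against numpy's reference FFT on the same input:
print(np.allclose(FFT(v, bits), np.fft.fft(v[:2**bits])))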

I think you forgot about the bit-reversal permutation. The radix-2/4/8 FFT algorithm is meant to operate in place, and to do so it requires the input values to be in bit-reversed order.
Also, if you are going to dig deeper and implement a mixed-radix algorithm, which is a generalization of the Cooley-Tukey algorithm, you will need to implement a mixed-radix digit reversal as well.
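For reference, here is a minimal sketch (not from the original answer) of the bit-reversal permutation that in-place radix-2 FFTs apply to the input before the butterfly stages; the helper name bit_reverse_permute is just for illustration:
def bit_reverse_permute(data, bits):
    # Move the value at index i to the index obtained by reversing
    # the lowest `bits` bits of i.
    N = 2**bits
    out = [0]*N
    for i in range(N):
        rev = 0
        x = i
        for _ in range(bits):
            rev = (rev << 1) | (x & 1)
            x >>= 1
        out[rev] = data[i]
    return out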

Related

Fastest algorithm for computing 3-D curl

I'm trying to write a section of code that computes the curl of a vector field numerically to second order with periodic boundary conditions. However, the algorithm I made is very slow and I'm wondering if anyone knows of any alternative algorithms.
To give more specific context: I'm using a 3xAxBxC numpy array as my vector field, where the first axis refers to the Cartesian direction (x, y, z) and A, B, C refer to the number of bins in that Cartesian direction (i.e. the resolution). So, for example, I might have a vector field F = np.zeros((3,64,64,64)) where Fx = F[0] is a 64x64x64 Cartesian lattice in its own right. So far, my solution has been to use the 3-point centered difference stencil to calculate the derivatives, with a nested loop to iterate over all the dimensions and modular arithmetic to enforce the periodic boundary conditions (see below for an example). However, as my resolution increases (the size of A, B, C) this begins to take a long time (upwards of 2 minutes, which adds up if I do this several hundred times for my simulation - this is just one small part of a larger algorithm). I was wondering if anyone knows of an alternative method for doing this?
import numpy as np

F = np.array([np.ones([128,128,128]), 2*np.ones([128,128,128]),
              3*np.ones([128,128,128])])
VxF = np.array([np.zeros([128,128,128]), np.zeros([128,128,128]),
                np.zeros([128,128,128])])
for i in range(0,128):
    for j in range(0,128):
        for k in range(0,128):
            VxF[0][i,j,k] = 0.5*((F[2][i,(j+1)%128,k] -
                F[2][i,j-1,k]) - (F[1][i,j,(k+1)%128] - F[1][i,j,k-1]))
            VxF[1][i,j,k] = 0.5*((F[0][i,j,(k+1)%128] -
                F[0][i,j,k-1]) - (F[2][(i+1)%128,j,k] - F[2][i-1,j,k]))
            VxF[2][i,j,k] = 0.5*((F[1][(i+1)%128,j,k] -
                F[1][i-1,j,k]) - (F[0][i,(j+1)%128,k] - F[0][i,j-1,k]))
Just to reiterate: I'm looking for an algorithm that will compute the curl of a vector field array to second order, given periodic boundary conditions, faster than the one I have. Maybe there's nothing that will do this, but I just want to check before I keep spending time running this algorithm. Thank you, everyone, in advance!
There may be better tools for this, but here is a trivial 200x speedup with numba:
import numpy as np
from numba import jit

def pure_python():
    F = np.array([np.ones([128,128,128]), 2*np.ones([128,128,128]),
                  3*np.ones([128,128,128])])
    VxF = np.array([np.zeros([128,128,128]), np.zeros([128,128,128]),
                    np.zeros([128,128,128])])
    for i in range(0,128):
        for j in range(0,128):
            for k in range(0,128):
                VxF[0][i,j,k] = 0.5*((F[2][i,(j+1)%128,k] -
                    F[2][i,j-1,k]) - (F[1][i,j,(k+1)%128] - F[1][i,j,k-1]))
                VxF[1][i,j,k] = 0.5*((F[0][i,j,(k+1)%128] -
                    F[0][i,j,k-1]) - (F[2][(i+1)%128,j,k] - F[2][i-1,j,k]))
                VxF[2][i,j,k] = 0.5*((F[1][(i+1)%128,j,k] -
                    F[1][i-1,j,k]) - (F[0][i,(j+1)%128,k] - F[0][i,j-1,k]))
    return VxF

@jit(fastmath=True)
def with_numba():
    F = np.array([np.ones([128,128,128]), 2*np.ones([128,128,128]),
                  3*np.ones([128,128,128])])
    VxF = np.array([np.zeros([128,128,128]), np.zeros([128,128,128]),
                    np.zeros([128,128,128])])
    for i in range(0,128):
        for j in range(0,128):
            for k in range(0,128):
                VxF[0][i,j,k] = 0.5*((F[2][i,(j+1)%128,k] -
                    F[2][i,j-1,k]) - (F[1][i,j,(k+1)%128] - F[1][i,j,k-1]))
                VxF[1][i,j,k] = 0.5*((F[0][i,j,(k+1)%128] -
                    F[0][i,j,k-1]) - (F[2][(i+1)%128,j,k] - F[2][i-1,j,k]))
                VxF[2][i,j,k] = 0.5*((F[1][(i+1)%128,j,k] -
                    F[1][i-1,j,k]) - (F[0][i,(j+1)%128,k] - F[0][i,j-1,k]))
    return VxF
The pure Python version takes 13 seconds on my machine, while the numba version takes 65 ms.
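For comparison, and not from the original answer, the same stencil can also be fully vectorized in plain NumPy with np.roll, which gives the periodic boundary conditions for free; a minimal sketch:
import numpy as np

def curl_periodic(F):
    # F has shape (3, A, B, C): component first, then the lattice axes.
    def d(comp, axis):
        # Second-order centered difference of component `comp` along lattice axis `axis`
        return 0.5*(np.roll(F[comp], -1, axis=axis) - np.roll(F[comp], 1, axis=axis))
    VxF = np.empty_like(F)
    VxF[0] = d(2, 1) - d(1, 2)   # dFz/dy - dFy/dz
    VxF[1] = d(0, 2) - d(2, 0)   # dFx/dz - dFz/dx
    VxF[2] = d(1, 0) - d(0, 1)   # dFy/dx - dFx/dy
    return VxF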

How can I speed up nearest neighbor search with python?

I have code which calculates the nearest assigned voxel to each unassigned voxel. That is, I have an array of voxels; a few voxels already have a scalar value (1, 2, 3, 4, ... etc.) assigned, and a few voxels are empty (let's say a value of '0'). The code below finds the nearest assigned voxel to an unassigned voxel and assigns it that same scalar. So, a voxel with a scalar '0' will be assigned a value (1 or 2 or 3, ...) based on its nearest voxel. The code below works, but it takes too much time.
Is there an alternative to this? Or do you have any feedback on how to improve it further?
""" #self.voxels is a 3D numpy array"""
def fill_empty_voxel1(self,argx, argy, argz):
""" where # argx, argy, argz are the voxel location where the voxel is zero"""
argx1, argy1, argz1 = np.where(self.voxels!=0) # find the non zero voxels
a = np.column_stack((argx1, argy1, argz1))
b = np.column_stack((argx, argy, argz))
tree = cKDTree(a, leafsize=a.shape[0]+1)
distances, ndx = tree.query(b, k=1, distance_upper_bound= self.mean) # self.mean is a mean radius search value
argx2, argy2, argz2 = a[ndx][:][:,0],a[ndx][:][:,1],a[ndx][:][:,2]
self.voxels[argx,argy,argz] = self.voxels[argx2,argy2,argz2] # update the voxel array
Example
""" Here is a small example with small dataset:"""
import numpy as np
from scipy.spatial import cKDTree
import timeit

voxels = np.zeros((10,10,5), dtype=np.uint8)
voxels[1:2,:,:] = 5.
voxels[5:6,:,:] = 2.
voxels[:,3:4,:] = 1.
voxels[:,8:9,:] = 4.
argx, argy, argz = np.where(voxels==0)

tic = timeit.default_timer()
argx1, argy1, argz1 = np.where(voxels!=0) # non-zero voxels
a = np.column_stack((argx1, argy1, argz1))
b = np.column_stack((argx, argy, argz))
tree = cKDTree(a, leafsize=a.shape[0]+1)
distances, ndx = tree.query(b, k=1, distance_upper_bound=5.)
argx2, argy2, argz2 = a[ndx][:,0], a[ndx][:,1], a[ndx][:,2]
voxels[argx, argy, argz] = voxels[argx2, argy2, argz2]
toc = timeit.default_timer()
timetaken = toc - tic # elapsed time in seconds
print('\nTime to fill empty voxels', timetaken)
For visualization:
from mayavi import mlab
data = voxels.astype('float')
scalar_field = mlab.pipeline.scalar_field(data)
iso_surf = mlab.pipeline.iso_surface(scalar_field)
surf = mlab.pipeline.surface(scalar_field)
vol = mlab.pipeline.volume(scalar_field,vmin=0,vmax=data.max())
mlab.outline()
mlab.show()
Now, if the voxels array has dimensions of something like (500,500,500), the time it takes to compute the nearest-neighbour search is no longer acceptable. How can I overcome this? Could parallel computation reduce the time? (I have no idea whether the code can be parallelized; if you do, please let me know.)
A potential fix:
I could substantially improve the computation time by adding the n_jobs = -1 parameter in the cKDTree query.
distances, ndx = tree.query(b, k=1, distance_upper_bound= 5., n_jobs=-1)
I was able to compute the distances in less than an hour for an array of (400,100,100) on a 13-core CPU. With 1 processor it takes around 18 hours to complete the same array.
Thanks to @gsamaras for the answer!
You can switch to approximate nearest neighbour (ANN) algorithms, which usually take advantage of sophisticated hashing or proximity-graph techniques to index your data quickly and perform faster queries. One example is Spotify's Annoy. Annoy's README includes a plot showing the precision-performance tradeoff of various ANN algorithms published in recent years. The top-performing algorithm (at the time this comment was posted), hnsw, has a Python implementation under the Non-Metric Space Library (NMSLIB).
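A minimal sketch of how Annoy could be applied to the voxel coordinates from the question (the parameters here are illustrative, not tuned):
from annoy import AnnoyIndex

index = AnnoyIndex(3, 'euclidean')    # 3-D voxel coordinates
for i, point in enumerate(a):         # 'a' = coordinates of the assigned voxels, as in the question
    index.add_item(i, point.tolist())
index.build(10)                       # 10 trees; more trees -> better recall, slower build
ndx = [index.get_nns_by_vector(p.tolist(), 1)[0] for p in b]   # nearest assigned voxel per empty voxel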
It would be interesting to try sklearn.neighbors.NearestNeighbors, which offers the n_jobs parameter:
The number of parallel jobs to run for neighbors search.
This package also provides the Ball Tree algorithm, which you can test against the kd-tree one; my hunch, however, is that the kd-tree will be better (but that again depends on your data, so test it!). A minimal usage sketch follows below.
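Here is a minimal sketch, assuming the same a and b arrays as in the question (not code from this answer):
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=1, algorithm='kd_tree', n_jobs=-1).fit(a)
distances, ndx = nn.kneighbors(b)   # index into 'a' of the nearest assigned voxel for each empty voxel
ndx = ndx.ravel()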
You might also want to use dimensionality reduction, which is easy to do. The idea is that you reduce your dimensions, so your data contain less information, which makes tackling the Nearest Neighbour problem much faster. Of course, there is a trade-off here: accuracy!
You might (or will) get less accuracy with dimensionality reduction, but it might be worth a try. However, this usually applies in a high-dimensional space, and you are just in 3D, so I don't know whether it makes sense in your specific case to use sklearn.decomposition.PCA.
A remark:
If you really want high performance though, you won't get it with Python; you could switch to C++ and use CGAL, for example.

Python summing values including "nan" and finding clustering error

I am fairly new to Python. In a clustering problem, suppose we have k clusters. I want to find the sum of errors between the points of each cluster and its center.
Here is my distance function, which finds the sum of distances of a group of points (i.e. g) from a reference point p.
import numpy as np
import scipy.spatial.distance

def distgp(g, p):
    # Sum of distances from the points in g to the reference point p
    dist = sum(scipy.spatial.distance.cdist(g, p))
    return dist

# Now I am going to find the sum of errors of the clusters.
f = 0
for i in range(k):
    ix = LabelX == i
    if any(ix):
        f += distgp(X[ix,:], X[ix,:].mean(axis=0)[:,None].T)
Here X is a dataset with almost 500,000 observations and LabelX holds the cluster labels of the points. Sometimes X[ix,:] is empty, so X[ix,:].mean(axis=0) is empty too (RuntimeWarning: Mean of empty slice); that is why I used "if any(ix)", to only consider the non-empty clusters.
Time is very important to me. The code works, but I think there should be a more efficient way to write the distgp function and the loop that sums the errors. I would appreciate your comments on improving the code in terms of speed.
Thanks,
Sam
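As a rough sketch only (not from the thread), one possible direction is to drop cdist and compute the per-cluster distance sums with vectorized NumPy operations:
import numpy as np

f = 0.0
for i in range(k):
    pts = X[LabelX == i]
    if pts.size:
        center = pts.mean(axis=0)
        # sum of Euclidean distances of the cluster's points from its center
        f += np.linalg.norm(pts - center, axis=1).sum()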

Python scipy.fftpack.rfft frequency bin mapping

I'm trying to get the correct FFT bin index based on a given frequency. The audio is sampled at 44.1 kHz and the FFT size is 1024. Given that the signal is real (captured from PyAudio, decoded through numpy.fromstring, windowed by scipy.signal.hann), I then perform the FFT through scipy.fftpack.rfft and compute the decibels of the whole result: magnitude = 20 * scipy.log10(abs(rfft(audio_sample)))
Based on this, and this, I originally had my mapping from the FFT bin index, k, to any frequency, F, as:
F = k*Fs/N for k = 0 ... N/2-1 where Fs is the sampling rate, and N is the FFT bin size, in this case, 1024. And the reverse as:
k = F*N/Fs for F = 0Hz ... Fs/2-Fs/N
However, I realized that rfft's result is not symmetric like fft's, and it provides the result in an N-sized array. I now have some questions regarding the mapping and the function. The documentation unfortunately did not provide much information, as I'm a novice in this area.
My questions:
To me, the result of rfft on an audio sample can be used directly from the first bin to the last bin, as no symmetry occurs in the output. Is that correct?
Given the lack of symmetry from the above, the frequency resolution appears to have increased. Is this interpretation correct?
Because I'm using rfft, my mapping function from bin index k to frequency F is now F = k*Fs/(2N) for k = 0 ... N-1. Is this correct?
Conversely, the reverse mapping from frequency F to bin index k now becomes k = 2*F*N/Fs for F = 0 Hz ... Fs/2-(Fs/2/N). What about the correctness of this?
My general confusion arises from how rfft is related to fft, and how the mapping can be done correctly while using rfft. I believe my mapping is offset by a small amount, and that is crucial in my application. Please point out the mistake or advise on the matter if possible, thank you very much.
First to clear up a few things for you:
A quick reference to the fftpack documentation reveals that rfft only gives you an output vector from 0..512 (in your case). The reason for this is exactly because of the symmetry present when calculating the discrete Fourier transform of a real-valued input:
y[k] = y*[N-k] (see Wikipedia page on DFTs). Therefore, the rfft function only calculates and stores N/2+1 values since you can calculate the other half by just taking the complex conjugates (should you really want it for plotting (say)). The fft function makes no assumption on the input values (they can have both a real and imaginary part) and therefore no symmetry can be assumed in the output and it gives you a full output vector with N values. Admittedly, most applications use a real input, so people tend to assume the symmetry is always there. Note that the Fast Fourier Transform (FFT) is an (efficient) algorithm to calculate the Discrete Fourier Transform (DFT) and the rfft function also uses the FFT to do the calculation.
In light of the above, your indices for accessing the output vector are out of bounds, i.e. > 512. The reasons why/how you can do this depend on your code. You should clearly distinguish between the 'logical N' (that you use to map the bin frequencies, define the DFT, etc.) and the 'computational N' (the actual number of values in your output vector); then all your problems should disappear.
To concretely answer your questions:
No. There is symmetry and you need to use this to calculate the last bins (but they give you no extra information).
No. The only way to increase resolution of a DFT is to increase your sample length.
No, but almost. F = k*Fs/N for k = 0..N/2
For an output vector with N bins you get frequencies from 0 to (N-1)/N*Fs. Using the rfft you will have an output vector with N/2+1 bins. You do the maths, but I get 0..Fs/2
Hope things are clearer now.
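As a sanity check (not part of the original answer, and using numpy.fft rather than scipy.fftpack), numpy can produce the bin-centre frequencies directly, so the mapping never has to be hand-rolled:
import numpy as np

Fs = 44100
N = 1024
freqs = np.fft.rfftfreq(N, d=1.0/Fs)   # N/2+1 frequencies: 0, Fs/N, 2*Fs/N, ..., Fs/2
k = int(round(300.0 * N / Fs))         # bin index closest to 300 Hz, i.e. k = F*N/Fs
print(freqs[k])                        # ~301 Hz, the centre of that bin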

Cython and numpy speed

I'm using cython for a correlation calculation in my python program. I have two audio data sets and I need to know the time difference between them. The second set is cut based on onset times and then slid across the first set. There are two for-loops: one slides the set and the inner loop calculates correlation at that point. This method works very well and it's accurate enough.
The problem is that with pure Python this takes more than a minute. With my Cython code, it takes about 17 seconds, which is still too much. Do you have any hints on how to speed up this code?
import numpy as np
cimport numpy as np
cimport cython

FTYPE = np.float
ctypedef np.float_t FTYPE_t

@cython.boundscheck(False)
def delay(np.ndarray[FTYPE_t, ndim=1] f, np.ndarray[FTYPE_t, ndim=1] g):
    cdef int size1 = f.shape[0]
    cdef int size2 = g.shape[0]
    cdef int max_correlation = 0
    cdef int delay = 0
    cdef int current_correlation, i, j
    # Move second data set frame by frame
    for i in range(0, size1 - size2):
        current_correlation = 0
        # Calculate correlation at that point
        for j in range(size2):
            current_correlation += f[<unsigned int>(i+j)] * g[j]
        # Check if current correlation is highest so far
        if current_correlation > max_correlation:
            max_correlation = current_correlation
            delay = i
    return delay
Edit:
There's now scipy.signal.fftconvolve which would be the preferred approach to doing the FFT based convolution approach that I describe below. I'll leave the original answer to explain the speed issue, but in practice use scipy.signal.fftconvolve.
Original answer:
Using FFTs and the convolution theorem will give you dramatic speed gains by converting the problem from O(n^2) to O(n log n). This is particularly useful for long data sets, like yours, and can give speed gains of 1000s or much more, depending on length. It's also easy to do: just FFT both signals, multiply, and inverse FFT the product. numpy.correlate doesn't use the FFT method in the cross-correlation routine and is better used with very small kernels.
Here's an example
from timeit import Timer
from numpy import *

times = arange(0, 100, .001)
xdata = 1.*sin(2*pi*1.*times) + .5*sin(2*pi*1.1*times + 1.)
ydata = .5*sin(2*pi*1.1*times)

def xcorr(x, y):
    return correlate(x, y, mode='same')

def fftxcorr(x, y):
    fx, fy = fft.fft(x), fft.fft(y[::-1])
    fxfy = fx*fy
    xy = fft.ifft(fxfy)
    return xy

if __name__ == "__main__":
    N = 10
    t = Timer("xcorr(xdata, ydata)", "from __main__ import xcorr, xdata, ydata")
    print('xcorr', t.timeit(number=N)/N)
    t = Timer("fftxcorr(xdata, ydata)", "from __main__ import fftxcorr, xdata, ydata")
    print('fftxcorr', t.timeit(number=N)/N)
Which gives the running times per cycle (in seconds, for a 10,000 long waveform)
xcorr 34.3761689901
fftxcorr 0.0768054962158
It's clear the fftxcorr method is much faster.
If you plot out the results, you'll see that they are very similar near zero time shift. Note, though, that as you get further away the xcorr will decrease and the fftxcorr won't. This is because it's a bit ambiguous what to do with the parts of the waveform that don't overlap when the waveforms are shifted: xcorr treats them as zero, while the FFT treats the waveforms as periodic. If that's an issue, it can be fixed by zero padding.
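Following the edit at the top of this answer, here is a minimal sketch of the same idea using scipy.signal.fftconvolve (not from the original answer; f and g are the arrays from the question, with f the longer one):
import numpy as np
from scipy.signal import fftconvolve

def delay_fft(f, g):
    # Cross-correlation of f with g, computed as FFT-based convolution of f with reversed g.
    corr = fftconvolve(f, g[::-1], mode='valid')
    return int(np.argmax(corr))   # offset of g within f that maximizes the correlation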
The trick with this sort of thing is to find a way to divide and conquer.
Currently, you're sliding to every position and checking every point at every position -- effectively an O(n^2) operation.
You need to reduce the check of every point and the comparison of every position to something that does less work to determine a non-match.
For example, you could have a shorter "is this even close?" filter that checks the first few positions. If the correlation is above some threshold, then keep going otherwise give up and move on.
You could have a "check every 8th position" that you multiply by 8. If this is too low, skip it and move on. If this is high enough, then check all of the values to see if you've found the maxima.
The issue is the time required to do all these multiplies -- (f[<unsigned int>(i+j)] * g[j]) In effect, you're filling a big matrix with all these products and picking the row with the maximum sum. You don't want to compute "all" the products. Just enough of the products to be sure you've found the maximum sum.
The issue with finding maxima is that you have to sum everything to see if it's biggest. If you can turn this into a minimization problem, it's easier to abandon computing products and sums once an intermediate result exceeds a threshold.
(I think this might work. I haven't tried it.)
If you used max(g)-g[j] to work with negative numbers, you'd be looking for the smallest, not the biggest. You could compute the correlation for the first position. Anything that summed to a bigger value could be stopped immediately -- no more multiplies or adds for that offset, shift to another.
- You can extract range(size2) from the outer loop (precompute it once).
- You can use sum() instead of a loop to compute current_correlation (see the sketch after this list).
- You can store the correlations and delays in a list and then use max() to get the biggest one.
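A minimal sketch of that second suggestion (not from the original answer), replacing the inner loop with a vectorized dot product:
import numpy as np

def delay_numpy(f, g):
    size1, size2 = len(f), len(g)
    # Correlation at each offset i is the dot product of g with the matching slice of f.
    correlations = [np.dot(f[i:i+size2], g) for i in range(size1 - size2)]
    return int(np.argmax(correlations))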
