I am using Python to perform a Fast Fourier Transform on some data. I then need to extract the locations of the peaks in the transform in the form of the x-values. Right now I am using SciPy's fft tool to perform the transform, which seems to be working. However, when I use SciPy's find_peaks I only get the y-values, not the x-positions that I need. I also get the warning:
ComplexWarning: Casting complex values to real discards the imaginary part
Is there a better way for me to do this? Here is my code at the moment:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.fft import fft
from scipy.signal import find_peaks
headers = ["X","Y"]
original_data = pd.read_csv("testdata.csv",names=headers)
x = original_data["X"]
y = original_data["Y"]
a = fft(y)
peaks = find_peaks(a)
print(peaks)
plt.plot(x,a)
plt.title("Fast Fourier transform")
plt.xlabel("Frequency")
plt.ylabel("Amplitude")
plt.show()
There seem to be two points of confusion here:
What find_peaks is returning.
How to interpret complex values that the FFT is returning.
I will answer them separately.
Point #1
find_peaks returns a tuple: the indices in "a" that correspond to peaks, together with a dictionary of their properties. Those indices ARE the values you seek; you just have to plot them differently. See the first example in the find_peaks documentation. Basically, with peaks, properties = find_peaks(a), "peaks" holds the index (i.e. x) values and a[peaks] gives the y-values. So to plot all your frequencies and mark the peaks you could do:
peaks, properties = find_peaks(a)
plt.plot(a)
plt.plot(peaks, a[peaks], "x")
Point #2
As for the second point, you should probably read up more on the output of FFTs; this post is a short summary, but you may need more background to understand it.
But basically, an FFT returns an array of complex numbers, which carry both phase and magnitude information. What you are currently doing implicitly looks only at the real part of the solution (hence the warning that the imaginary portion is being discarded). What you probably want instead is to take the magnitude of your "a" array, but without more information on your application it is impossible to say.
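For instance, a minimal sketch of that idea, reusing the question's variables (this also needs import numpy as np):
magnitude = np.abs(a)  # complex spectrum -> real-valued magnitude
peaks, properties = find_peaks(magnitude)  # indices of the peaks, plus their properties
plt.plot(magnitude)
plt.plot(peaks, magnitude[peaks], "x")  # mark the peaks
plt.show()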
I tried to put in as many details as possible:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.fft import fft, fftfreq
from scipy.signal import find_peaks
# First: Let's generate a dummy dataframe with X,Y
# The signal consists in 3 cosine signals with noise added. We terminate by creating
# a pandas dataframe.
import numpy as np
X=np.arange(start=0,stop=20,step=0.01) # 20 seconds long signal sampled every 0.01[s]
# Signal components given by [frequency, phase shift, Amplitude]
GeneratedSignal=np.array([[5.50, 1.60, 1.0], [10.2, 0.25, 0.5], [18.3, 0.70, 0.2]])
Y=np.zeros(len(X))
# Let's add the components one by one
for P in GeneratedSignal:
    Y+=np.cos(2*np.pi*P[0]*X-P[1])*P[2]
# Let's add some gaussian random noise (mu=0, sigma=noise):
noise=0.5
Y+=np.random.randn(len(X))*noise
# Let's build the dataframe:
dummy_data=pd.DataFrame({'X':X,'Y':Y})
print('Dummy dataframe: ')
print(dummy_data.head())
# Figure-1: The dummy data
plt.plot(X,Y)
plt.title('Dummy data')
plt.xlabel('time [s]')
plt.ylabel('Amplitude')
plt.show()
# ----------------------------------------------------
# Processing:
headers = ["X","Y"]
#original_data = pd.read_csv("testdata.csv",names=headers)
# Let's take our dummy data:
original_data = dummy_data
x = np.array(original_data["X"])
y = np.array(original_data["Y"])
# Assuming the time step is constant:
# (otherwise you'll need to resample the data at a constant rate).
dt=x[1]-x[0] # time step of the data
# The fourier transform of y:
yf=fft(y, norm='forward')
# Note: see help(fft) --> norm. I chose 'forward' because it gives the amplitudes we put in.
# Otherwise, by default, yf will be scaled by a factor of n: the number of points
# The frequency scale
n = x.size # The number of points in the data
freq = fftfreq(n, d=dt)
# Let's find the peaks with height_threshold >=0.05
# Note: We use the magnitude (i.e the absolute value) of the Fourier transform
height_threshold=0.05 # We need a threshold.
# peaks_index contains the indices in yf (and freq) that correspond to peaks:
peaks_index, properties = find_peaks(np.abs(yf), height=height_threshold)
# Notes:
# 1) peaks_index does not contain the frequency values but indices
# 2) In this case, properties will contain only one property: 'peak_heights'
# for each element in peaks_index (See help(find_peaks) )
# Let's first output the result to the terminal window:
print('Positions and magnitude of frequency peaks:')
for i in range(len(peaks_index)):
    print("%4.4f \t %3.4f" % (freq[peaks_index[i]], properties['peak_heights'][i]))
# Figure-2: The frequencies
plt.plot(freq, np.abs(yf),'-', freq[peaks_index],properties['peak_heights'],'x')
plt.xlabel("Frequency")
plt.ylabel("Amplitude")
plt.show()
The terminal output:
Dummy dataframe:
X Y
0 0.00 0.611829
1 0.01 0.723775
2 0.02 0.768813
3 0.03 0.798328
Positions and magnitude of frequency peaks:
5.5000 0.4980
10.2000 0.2575
18.3000 0.0999
-18.3000 0.0999
-10.2000 0.2575
-5.5000 0.4980
NOTE: Since the signal is real-valued, each frequency component will have a "double" that is negative (this is a property of the Fourier transform). This also explains why the amplitudes are half of those we gave at the beginning. But if, for a particular frequency, we add the amplitudes for the negative and positive components, we get the original amplitude of the real-valued signal.
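As a quick check of that bookkeeping (reusing freq, peaks_index and properties from the script above), doubling the positive-frequency peak heights recovers the generated amplitudes:
positive = freq[peaks_index] > 0
print(2 * properties['peak_heights'][positive])  # ~[1.0, 0.5, 0.2], as generated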
For further exploration: You can change the length of the signal to 1 [s] (at the beginning of the script):
X=np.arange(start=0,stop=1,step=0.01) # 1 second long signal sampled every 0.01[s]
Since the length of the signal is now reduced, the frequencies are less well defined (the peaks now have a width).
So, add: width=0 to the line containing the find_peaks instruction:
peaks_index, properties = find_peaks(np.abs(yf), height=height_threshold, width=0)
Then look at what is contained inside properties:
print(properties)
You'll see that find_peaks gives you much more information than just the peak positions. For more about what is inside properties:
help(find_peaks)
I have vibration data in the time domain and want to convert it to the frequency domain with fft. However, the plot of the FFT only shows a big spike at zero and nothing else.
This is my vibration data: https://pastebin.com/7RK57kJW
My code:
import numpy as np
import matplotlib.pyplot as plt
t = np.arange(3000)
a1_fft = np.fft.fft(a1, axis=0)  # a1 is the vibration data loaded from the file
freq = np.fft.fftfreq(t.shape[-1])
plt.plot(freq, a1_fft)
My FFT Plot:
What am I doing wrong here? I am pretty sure my data is uniformly sampled; in other cases, non-uniform data provokes a similar problem with fft.
The bins of the FFT correspond to the frequencies 0, df, 2df, 3df, ..., F-2df, F-df, where df is determined by the number of bins (df = F/N for N bins) and F is the sampling rate.
Notice the zero frequency at the beginning. This is called the DC offset. It's the mean of your data. In the data that you show, the mean is ~1.32, while the amplitude of the sine wave is around 0.04. It's not surprising that you can't see a peak that's 33x smaller than the DC term.
There are some common ways to visualize the data that help you get around this. One common method is to keep the DC offset but use a log scale, at least for the y-axis (plotting the magnitude, since the FFT output is complex):
plt.semilogy(freq, np.abs(a1_fft))
OR
plt.loglog(freq, np.abs(a1_fft))
Another thing you can do is zoom in on the bottom 1/33rd or so of the plot. You can do this manually, or by adjusting the span of the displayed Y-axis:
p = np.abs(a1_fft[1:]).max() * np.array([-1.1, 1.1])
plt.ylim(p)
If you are plotting the absolute values already, use
p = np.abs(a1_fft[1:]).max() * np.array([-0.1, 1.1])
Another method is to remove the DC offset. A more elegant way of doing this than what @J. Schmidt suggests is to simply not display the DC term:
plt.plot(freq[1:], np.abs(a1_fft[1:]))
Or for the positive frequencies only:
n = freq.size
plt.plot(freq[1:n//2], np.abs(a1_fft[1:n//2]))
The cutoff at n // 2 is only approximate. The correct cutoff depends on whether the FFT has an even or odd number of elements. For an even number, the middle bin actually holds energy from both sides of the spectrum and often gets special treatment.
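If you only care about the non-negative half anyway, a sketch using np.fft.rfft sidesteps that bookkeeping entirely, since for real input it returns only DC plus the positive frequencies:
a1_rfft = np.fft.rfft(a1)  # DC + positive frequencies only
rfreq = np.fft.rfftfreq(t.shape[-1])  # matching frequency axis
plt.plot(rfreq[1:], np.abs(a1_rfft)[1:])  # skip the DC term as before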
The peak at 0 is the DC gain, which is very high since you didn't remove the mean of your data. Also, the Fourier transform is complex-valued, so you should plot the absolute value and phase separately. In this code I also plotted only the positive frequencies:
import numpy as np
import matplotlib.pyplot as plt
#Import data
a1 = np.loadtxt('a1.txt')
plt.plot(a1)
# Remove the DC offset by subtracting the mean
a1 -= np.mean(a1)
#Your code
t = np.arange(3000)
a1_fft= np.fft.fft(a1, axis=0)
freq = np.fft.fftfreq(t.shape[-1])
#Only plot positive frequencies
plt.figure()
plt.plot(freq[freq>=0], np.abs(a1_fft)[freq>=0])
I have a data frame containing ~900 rows; I'm trying to plot KDE plots for some of the columns. In some columns, a majority of the values are the same minimum value. When I include too many of the minimum values, the KDE plot abruptly stops showing the minimums. For example, the following includes 600 values, of which 450 are the minimum, and the plot looks fine:
y = df.sort_values(by='col1', ascending=False)['col1'].values[:600]
sb.kdeplot(y)
But including 451 of the minimum values gives a very different output:
y = df.sort_values(by='col1', ascending=False)['col1'].values[:601]
sb.kdeplot(y)
Eventually I would like to plot bivariate KDEPlots of different columns against each other, but I'd like to understand this first.
The problem is the default algorithm that is chosen for the "bandwidth" of the kde. The default method is 'scott', which isn't very helpful when there are many equal values.
The bandwidth is the width of the Gaussians that are positioned at every sample point and summed up. Lower bandwidths stay closer to the data; higher bandwidths smooth everything out. The sweet spot is somewhere in the middle. In this case bw=0.3 could be a good option. In order to compare different KDEs it is recommended to choose exactly the same bandwidth each time.
Here is some sample code to show the difference between bw='scott' and bw=0.3. The example data are 150 values from a standard normal distribution together with either 400, 450 or 500 fixed values.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns; sns.set()
fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(10, 5), gridspec_kw={'hspace': 0.3})
for i, bw in enumerate(['scott', 0.3]):
    for j, num_same in enumerate([400, 450, 500]):
        y = np.concatenate([np.random.normal(0, 1, 150), np.repeat(-3, num_same)])
        sns.kdeplot(y, bw=bw, ax=axs[i, j])
        axs[i, j].set_title(f'bw:{bw}; fixed values:{num_same}')
plt.show()
The third plot gives a warning that the KDE cannot be drawn using Scott's suggested bandwidth.
PS: As mentioned by @mwascom in the comments, in this case statsmodels.nonparametric.kde is used (not scipy.stats.gaussian_kde). There the default is "scott": 1.059 * A * nobs ** (-1/5.), where A is min(std(X), IQR/1.34). The min() explains the abrupt change in behavior. IQR is the "interquartile range", the difference between the 75th and 25th percentiles.
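A quick numeric check of that min() (a sketch using the example data from above): once more than half of the sample is the repeated value, both quartiles sit on it, the IQR becomes 0, and Scott's bandwidth collapses with it.
import numpy as np
y = np.concatenate([np.random.normal(0, 1, 150), np.repeat(-3, 500)])
A = min(np.std(y), (np.percentile(y, 75) - np.percentile(y, 25)) / 1.34)
print(1.059 * A * len(y) ** (-1/5.))  # 0.0, since the IQR (and hence A) is 0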
Edit: Since Seaborn 0.11, the statsmodels backend has been dropped, so KDEs are only calculated via scipy.stats.gaussian_kde.
If the sample has repeated values, this implies that the underlying distribution is not continuous. In the data that you show to illustrate the issue, we can see a Dirac distribution on the left. Kernel smoothing might be applied to such data, but with care. Indeed, to approximate such data, we might use a kernel smoothing where the bandwidth associated with the Dirac is zero. However, in most KDE methods, there is only a single bandwidth for all kernel atoms. Moreover, the various rules used to compute the bandwidth are based on some estimate of the roughness of the second derivative of the PDF of the distribution. This cannot be applied to a discontinuous distribution.
We can, however, try to separate the sample into two sub-samples:
the sub-sample(s) with replications,
the sub-sample with unique realizations.
(This idea has already been mentioned by @johanc.)
Below is an attempt to perform this classification. The np.unique method is used to count the occurrences of the replicated realizations. The replicated values are associated with Diracs, and the weight in the mixture is estimated from the fraction of these replicated values in the sample. The remaining realizations, the uniques, are then used to estimate the continuous distribution with KDE.
The following function will be useful in order to overcome a limitation with the current implementation of the draw method of Mixtures with OpenTURNS.
def DrawMixtureWithDiracs(distribution):
    """Draw a distribution which has Diracs.
    https://github.com/openturns/openturns/issues/1489"""
    graph = distribution.drawPDF()
    graph.setLegends(["Mixture"])
    for atom in distribution.getDistributionCollection():
        if atom.getName() == "Dirac":
            curve = atom.drawPDF()
            curve.setLegends(["Dirac"])
            graph.add(curve)
    return graph
The following script creates a use case with a Mixture containing a Dirac and a Gaussian distribution.
import openturns as ot
import numpy as np
distribution = ot.Mixture([ot.Dirac(-3.0), ot.Normal()], [0.5, 0.5])
DrawMixtureWithDiracs(distribution)
This is the result.
Then we create a sample.
sample = distribution.getSample(100)
This is where your problem begins. We count the number of occurrences of each realization.
array = np.array(sample)
unique, index, count = np.unique(array, axis=0, return_index=True, return_counts=True)
For all realizations, replicated values are associated with Diracs and unique values are put in a separate list.
sampleSize = sample.getSize()
listOfDiracs = []
listOfWeights = []
uniqueValues = []
for i in range(len(unique)):
    if count[i] == 1:
        uniqueValues.append(unique[i][0])
    else:
        atom = ot.Dirac(unique[i])
        listOfDiracs.append(atom)
        w = count[i] / sampleSize
        print("New Dirac =", unique[i], " with weight =", w)
        listOfWeights.append(w)
The weight of the continuous atom is the complement of the sum of the weights of the Diracs. This way, the sum of the weights is equal to 1.
complementaryWeight = 1.0 - sum(listOfWeights)
weights = list(listOfWeights)
weights.append(complementaryWeight)
Now comes the easy part: the unique realizations can be used to fit a kernel smoothing. The KDE is then added to the list of atoms.
sampleUniques = ot.Sample(uniqueValues, 1)
factory = ot.KernelSmoothing()
kde = factory.build(sampleUniques)
atoms = list(listOfDiracs)
atoms.append(kde)
Et voilà: the Mixture is ready.
mixture_estimated = ot.Mixture(atoms, weights)
The following script compares the initial Mixture and the estimated one.
graph = DrawMixtureWithDiracs(distribution)
graph.setColors(["dodgerblue3", "dodgerblue3"])
curve = DrawMixtureWithDiracs(mixture_estimated)
curve.setColors(["darkorange1", "darkorange1"])
curve.setLegends(["Est. Mixture", "Est. Dirac"])
graph.add(curve)
graph
The figure seems satisfactory, since the continuous distribution is estimated from a sub-sample whose size is only 50, i.e. one half of the full sample.
I'm going to compare STFT frequency data with other STFT frequency data. I can use the stft method, but I don't know how to extract the frequency data from it. Here is my code:
from scipy import signal
import matplotlib.pyplot as plt
import numpy as np
# Data load
data = open('data.txt', 'r').read().split('\n')
time = []
temperature = []
for i in range(0, len(data)):
    time.append(float(data[i][0:8]))
    temperature.append(float(data[i][9:len(data[i])]))
fs = len(time)/(max(time)-min(time)) # Sampling frequency
# STFT
f, t, Zxx = signal.stft(temperature, fs)
plt.pcolormesh(t, 2*np.pi*1.8*f/1e3, np.abs(Zxx), vmin=0, vmax=100)
(spectrogram figure omitted)
How can I extract yellow line data? (x axis is time / y axis is frequency)
This is not perfect, but it should work: it gives you the maxima of your FFT. The trick is to use np.where.
import numpy as np
my_rand_fft = np.random.rand(20, 80)  # stand-in for the STFT magnitude, np.abs(Zxx)
The next step is to model the fact that your STFT contains a lot of constant values at the low frequencies. If I am wrong, change the later code accordingly:
my_rand_fft[-1, :] = 1
The brute force approach:
pos_of_max = []
for n in range(np.shape(my_rand_fft)[1]):
    pos_of_max.append(np.where(my_rand_fft[0:-1, n] == np.max(my_rand_fft[0:-1, n]))[0])
The more elegant solution:
pos_of_max = np.where(my_rand_fft == np.max(my_rand_fft[0:-1, :], axis=0))
Make sure that the rows holding the constant maxima are excluded (here via the 0:-1 slice). If they are at the zero position instead, keep in mind that you need to add back whatever was skipped when interpreting the indices.
I am new to Python.
I intend to do Fourier Transform to an array of discrete points, (time, acceleration), and plot the result out.
I copied and pasted the sample FFT code and modified it accordingly.
Please see the code:
import numpy as np
import matplotlib.pyplot as plt
# Load the .txt file in
myData = np.loadtxt('twenty_z_up.txt')
# Extract the time column
time = np.copy(myData[:,0])
# Extract the acceleration column
zAcc = np.copy(myData[:,3])
t = np.arange(10080)
sp = np.fft.fft(zAcc)
freq = np.fft.fftfreq(t.shape[-1])
plt.plot(freq, sp.real)
myData is a rectangular matrix with 10080 rows and 10 columns.
Thus, zAcc is column 3 extracted from the matrix.
In the plot drawn by Spyder, most of the harmonics are concentrated around 0, and they are all extremely small. But my data are actually the accelerations of a phone carried by a walking person (including gravity), so I expect the most significant harmonic to be around 2 Hz.
Why does the graph look like nonsense?
Thanks in advance!
==============UPDATES: My Graphs======================
The first time domain one:
x-axis is in millisecond.
y-axis is in m/s^2; due to Earth's gravity it has a DC offset of ~10.
You do get two spikes at (approximately) 2 Hz. Your sampling period is around 2.8 ms (as best I can infer from your first plot), giving +/-2 Hz the normalized frequency of +/-0.056, which is about where your spikes are. np.fft.fftfreq by default returns the normalized frequency (i.e. it assumes a sampling period of 1). You can set the d argument to the sampling period, and you'll get a vector containing the actual frequencies.
Your huge spike in the middle is obviously the DC offset (which you can trivially remove by subtracting the mean).
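For example (a sketch; the 2.8 ms period is only my guess from your plot):
dt = 2.8e-3  # assumed sampling period in seconds
freq = np.fft.fftfreq(t.shape[-1], d=dt)  # frequency axis now in Hz
plt.plot(freq[1:], np.abs(a1_fft)[1:])  # DC term dropped, peaks near +/-2 Hz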
As others said, we need to see the data; post it somewhere. Just to check, first fix the timestep size in fftfreq, then plot this synthetic signal, and then plot your signal to see how they compare:
timestep = 1./50.  # Assume sampling at 50 Hz. Change this accordingly.
N = 10080  # the number of samples
T = N*timestep
t = np.linspace(0, T, N)  # needed only to generate xAcc_synthetic
freq = 2.  # put a peak at a frequency of 2 Hz
# generate a synthetic signal at 2 Hz and add some noise to it
xAcc_synthetic = np.sin((2*np.pi)*freq*t) + np.random.rand(N)*0.2
sp_synthetic = np.fft.fft(xAcc_synthetic)
freq = np.fft.fftfreq(t.size, d=timestep)
print(max(abs(freq)) == (1/timestep)/2.)  # simple check of the highest frequency
plt.plot(freq, abs(sp_synthetic))
plt.xlabel('Hz')
Now, at the x-axis value of 2 you actually have a physical frequency of 2 Hz, and you may spot the more pronounced peak you are looking for. Moreover, you may want to have a look at yAcc and zAcc as well.
Audio processing is pretty new to me, and I am currently using Python and NumPy to process wave files. After calculating the FFT matrix I am getting noisy power values for non-existent frequencies. I am interested in visualizing the data, and accuracy is not a high priority. Is there a safe way to calculate a clipping value to remove these values, or should I use all the FFT matrices for each sample set to come up with an average number?
regards
Edit:
from numpy import *
import wave
import struct
from pylab import plot, show
fp = wave.open("500-200f.wav", "rb")
sample_rate = fp.getframerate()
total_num_samps = fp.getnframes()
fft_length = 2048
num_fft = (total_num_samps // fft_length) - 2
temp = zeros((num_fft, fft_length), float)
for i in range(num_fft):
    tempb = fp.readframes(fft_length)
    data = struct.unpack("%dH" % (fft_length), tempb)
    temp[i, :] = array(data, short)
pts = fft_length//2 + 1
data = (abs(fft.rfft(temp, fft_length)) / (pts))[:pts]
x_axis = arange(pts)*sample_rate*.5/pts
spec_range = pts
plot(x_axis, data[0])
show()
Here is the plot on a non-logarithmic scale, for a synthetic wave file containing a 500 Hz (fading out) + 200 Hz sine wave created using GoldWave.
Simulated waveforms shouldn't show FFTs like your figure, so something is very wrong, and probably not with the FFT, but with the input waveform. The main problem in your plot is not the ripples, but the harmonics around 1000 Hz, and the subharmonic at 500 Hz. A simulated waveform shouldn't show any of this (for example, see my plot below).
First, you probably want to just plot the raw waveform; this will likely point to an obvious problem. Also, it seems odd to unpack the wave data as unsigned shorts (i.e. "H"), and especially odd that there is then no large zero-frequency component.
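To see why "H" is suspect, here is a small check with made-up sample values (real WAV audio is normally signed 16-bit, i.e. "h"):
import struct
raw = struct.pack("4h", -2, -1, 0, 1)  # signed samples, as a WAV file stores them
print(struct.unpack("4h", raw))  # (-2, -1, 0, 1): read back correctly
print(struct.unpack("4H", raw))  # (65534, 65535, 0, 1): wrapped, adds a huge offset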
I was able to get a pretty close duplicate to your FFT by applying clipping to the waveform, as was suggested by both the subharmonic and higher harmonics (and Trevor). You could be introducing clipping either in the simulation or the unpacking. Either way, I bypassed this by creating the waveforms in numpy to start with.
Here's what the proper FFT should look like (i.e. basically perfect, except for the broadening of the peaks due to the windowing)
Here's one from a waveform that's been clipped (and is very similar to your FFT, from the subharmonic to the precise pattern of the three higher harmonics around 1000 Hz)
Here's the code I used to generate these
from numpy import *
from pylab import plot, show, xlabel, ylabel
sample_rate = 20000.
times = arange(0, 10., 1./sample_rate)
wfm0 = sin(2*pi*200.*times)
wfm1 = sin(2*pi*500.*times)*(10.-times)/10.
wfm = wfm0 + wfm1
# int test
#wfm *= 2**8
#wfm = wfm.astype(int16)
#wfm = wfm.astype(float)
# abs test
#wfm = abs(wfm)
# clip test
#wfm = clip(wfm, -1.2, 1.2)
fft_length = 5*2048
total_num_samps = len(times)
num_fft = (total_num_samps // fft_length) - 2
temp = zeros((num_fft, fft_length), float)
for i in range(num_fft):
    temp[i, :] = wfm[i*fft_length:(i+1)*fft_length]
pts = fft_length//2 + 1
data = (abs(fft.rfft(temp, fft_length)) / (pts))[:pts]
x_axis = arange(pts)*sample_rate*.5/pts
spec_range = pts
plot(x_axis, data[2], linewidth=3)
xlabel("freq (Hz)")
ylabel('abs(FFT)')
show()
Because FFTs are windowed and sampled, they cause aliasing and sampling in the frequency domain as well. Filtering in the time domain is just multiplication in the frequency domain, so you may want to apply a filter by multiplying each frequency by the filter's value at that frequency: for example, multiply by 1 in the passband and by 0 everywhere else. The unexpected values are probably caused by aliasing, where higher frequencies are folded down onto the ones you are seeing. The original signal needs to be band-limited to half your sampling rate, or you will get aliasing. Of more concern is aliasing that distorts the band of interest, because there you want to be confident that each frequency really comes from the expected one.
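A minimal sketch of that frequency-domain filtering idea (the band edges here are made up):
import numpy as np
fs = 20000.0
x = np.random.randn(4096)  # stand-in signal
X = np.fft.rfft(x)
f = np.fft.rfftfreq(len(x), d=1/fs)
mask = (f >= 150) & (f <= 550)  # 1 in the passband, 0 everywhere else
y = np.fft.irfft(X * mask, n=len(x))  # filtered signal back in the time domain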
The other thing to keep in mind is that when you grab a piece of data from a wave file, you are mathematically multiplying it by a square wave. This convolves a sin(x)/x shape with the frequency response; to minimize this, you can multiply the original signal by a window function such as a Hanning window.
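A minimal sketch of that windowing step (assuming one grabbed block of n samples):
import numpy as np
n = 2048
block = np.random.randn(n)  # stand-in for one block read from the wave file
window = np.hanning(n)  # Hanning window tapers the block edges
spectrum = np.fft.rfft(block * window)  # less spectral leakage than the raw block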
It's worth mentioning for a 1D FFT that the first element (index [0]) contains the DC (zero-frequency) term, the elements [1:N/2] contain the positive frequencies, and the elements [N/2+1:N-1] contain the negative frequencies. Since you didn't provide a code sample or additional information about the output of your FFT, I can't rule out the possibility that the "noisy power values at non-existent frequencies" are just the negative frequencies of your spectrum.
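A small demonstration of that layout, using the same rectangular pulse as the example below (np.fft.fftshift reorders the output to ascending frequency, matching the printed list):
import numpy as np
X = np.fft.fft([1., 1., 1., 1., 0., 0., 0., 0.])
# X[0] is DC, X[1:4] the positive frequencies, X[5:] the negative ones
print(np.fft.fftshift(X))  # ascending frequency order, DC term in the middle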
EDIT: Here is an example of a radix-2 FFT implemented in pure Python with a simple test routine that finds the FFT of a rectangular pulse, [1.,1.,1.,1.,0.,0.,0.,0.]. You can run the example on codepad and see that the FFT of that sequence is
[0j, Negative frequencies
(1+0.414213562373j), ^
0j, |
(1+2.41421356237j), |
(4+0j), <= DC term
(1-2.41421356237j), |
0j, v
(1-0.414213562373j)] Positive frequencies
Note that the code prints out the Fourier coefficients in order of ascending frequency, i.e. from the highest negative frequency up to DC, and then up to the highest positive frequency.
I don't know enough from your question to actually answer anything specific.
But here are a couple of things to try from my own experience writing FFTs:
Make sure you are following the Nyquist rule.
If you are viewing the linear output of the FFT, you will have trouble seeing your own signal and may think everything is broken. Make sure you are looking at the dB of your FFT magnitude (i.e. "plot(10*log10(abs(fft(x))))").
Create a unit test for your FFT() function by feeding it generated data like a pure tone. Then feed the same generated data to Matlab's FFT(). Do an absolute-value diff between the two output data series and make sure the maximum absolute difference is something like 10^-6 (i.e. the only difference is caused by small floating-point errors).
Make sure you are windowing your data
If all of those three things work, then your fft is fine. And your input data is probably the issue.
Check the input data to see if there is clipping http://www.users.globalnet.co.uk/~bunce/clip.gif
Time domain clipping shows up as mirror images of the signal in the frequency domain at specific regular intervals with reduced amplitude.