I am using Python to perform a Fast Fourier Transform on some data. I then need to extract the locations of the peaks in the transform in the form of the x-values. Right now I am using Scipy's fft tool to perform the transform, which seems to be working. However, when I use Scipy's find_peaks I only get the y-values, not the x-positions that I need. I also get the warning:
ComplexWarning: Casting complex values to real discards the imaginary part
Is there a better way for me to do this? Here is my code at the moment:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.fft import fft
from scipy.signal import find_peaks
headers = ["X","Y"]
original_data = pd.read_csv("testdata.csv",names=headers)
x = original_data["X"]
y = original_data["Y"]
a = fft(y)
peaks = find_peaks(a)
print(peaks)
plt.plot(x,a)
plt.title("Fast Fourier transform")
plt.xlabel("Frequency")
plt.ylabel("Amplitude")
plt.show()
There seem to be two points of confusion here:
What find_peaks is returning.
How to interpret complex values that the FFT is returning.
I will answer them separately.
Point #1
find_peaks returns the indices in "a" that correspond to peaks (along with a dict of properties), so I believe they ARE the values you seek; however, you must plot them differently. You can see the first example from the documentation here. But basically "peaks" holds the indices, i.e. the x-values, and a[peaks] gives the corresponding y-values. So to plot all your frequencies and mark the peaks you could do:
peaks, _ = find_peaks(a)  # find_peaks returns a (indices, properties) tuple
plt.plot(a)
plt.plot(peaks, a[peaks], 'x')
Point #2
As for the second point, you should probably read up more on the output of FFTs, this post is a short summary but you may need more background to understand it.
But basically, an FFT returns an array of complex numbers, which contains both phase and magnitude information. What you are currently doing is implicitly looking only at the real part of the result (hence the warning that the imaginary portion is being discarded); what you probably want instead is to take the magnitude of your "a" array, but without more information on your application it is impossible to say.
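For example, a minimal sketch combining both points (assuming x is a uniformly sampled time axis and y the signal, as in the question):
import numpy as np
from scipy.fft import fft, fftfreq
from scipy.signal import find_peaks
a = fft(y)                               # y as read from the CSV in the question
freq = fftfreq(len(y), d=x[1] - x[0])    # frequency axis, assuming a constant time step
peaks, _ = find_peaks(np.abs(a))         # peaks of the magnitude spectrum
print(freq[peaks])                       # the x-positions (frequencies) of the peaks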
I tried to put in as much detail as possible:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.fft import fft, fftfreq
from scipy.signal import find_peaks
# First: Let's generate a dummy dataframe with X,Y
# The signal consists of 3 cosine signals with added noise. We finish by creating
# a pandas dataframe.
import numpy as np
X=np.arange(start=0,stop=20,step=0.01) # 20 seconds long signal sampled every 0.01[s]
# Signal components given by [frequency, phase shift, Amplitude]
GeneratedSignal=np.array([[5.50, 1.60, 1.0], [10.2, 0.25, 0.5], [18.3, 0.70, 0.2]])
Y=np.zeros(len(X))
# Let's add the components one by one
for P in GeneratedSignal:
    Y+=np.cos(2*np.pi*P[0]*X-P[1])*P[2]
# Let's add some gaussian random noise (mu=0, sigma=noise):
noise=0.5
Y+=np.random.randn(len(X))*noise
# Let's build the dataframe:
dummy_data=pd.DataFrame({'X':X,'Y':Y})
print('Dummy dataframe: ')
print(dummy_data.head())
# Figure-1: The dummy data
plt.plot(X,Y)
plt.title('Dummy data')
plt.xlabel('time [s]')
plt.ylabel('Amplitude')
plt.show()
# ----------------------------------------------------
# Processing:
headers = ["X","Y"]
#original_data = pd.read_csv("testdata.csv",names=headers)
# Let's take our dummy data:
original_data = dummy_data
x = np.array(original_data["X"])
y = np.array(original_data["Y"])
# Assuming the time step is constant:
# (otherwise you'll need to resample the data at a constant rate).
dt=x[1]-x[0] # time step of the data
# The fourier transform of y:
yf=fft(y, norm='forward')
# Note: see help(fft) --> norm. I chose 'forward' because it gives the amplitudes we put in.
# Otherwise, by default, yf will be scaled by a factor of n: the number of points
# The frequency scale
n = x.size # The number of points in the data
freq = fftfreq(n, d=dt)
# Let's find the peaks with height_threshold >=0.05
# Note: We use the magnitude (i.e the absolute value) of the Fourier transform
height_threshold=0.05 # We need a threshold.
# peaks_index contains the indices in x that correspond to peaks:
peaks_index, properties = find_peaks(np.abs(yf), height=height_threshold)
# Notes:
# 1) peaks_index does not contain the frequency values but indices
# 2) In this case, properties will contain only one property: 'peak_heights'
# for each element in peaks_index (See help(find_peaks) )
# Let's first output the result to the terminal window:
print('Positions and magnitude of frequency peaks:')
[print("%4.4f \t %3.4f" %(freq[peaks_index[i]], properties['peak_heights'][i])) for i in range(len(peaks_index))]
# Figure-2: The frequencies
plt.plot(freq, np.abs(yf),'-', freq[peaks_index],properties['peak_heights'],'x')
plt.xlabel("Frequency")
plt.ylabel("Amplitude")
plt.show()
The terminal output:
Dummy dataframe:
X Y
0 0.00 0.611829
1 0.01 0.723775
2 0.02 0.768813
3 0.03 0.798328
Positions and magnitude of frequency peaks:
5.5000 0.4980
10.2000 0.2575
18.3000 0.0999
-18.3000 0.0999
-10.2000 0.2575
-5.5000 0.4980
NOTE: Since the signal is real-valued, each frequency component will have a "double" that is negative (this is a property of the Fourier transform). This also explains why the amplitudes are half of those we gave at the beginning. But if, for a particular frequency, we add the amplitudes for the negative and positive components, we get the original amplitude of the real-valued signal.
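If only the positive frequencies are of interest, a minimal sketch (assuming y, n and dt as defined in the script above, and a scipy recent enough to accept the norm argument) using the real-input transform:
from scipy.fft import rfft, rfftfreq
yf_pos = rfft(y, norm='forward')   # one-sided spectrum of the real-valued signal
freq_pos = rfftfreq(n, d=dt)       # non-negative frequencies only
amplitude = 2*np.abs(yf_pos)       # double to recover the original cosine amplitudes...
amplitude[0] = np.abs(yf_pos[0])   # ...except the DC term (and the Nyquist bin, if present)
Running find_peaks on this amplitude array would then report each frequency once, with roughly the amplitude it was given at the beginning.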
For further exploration: You can change the length of the signal to 1 [s] (at the beginning of the script):
X=np.arange(start=0,stop=1,step=0.01) # 1 seconds long signal sampled every 0.01[s]
Since the length of the signal is now reduced, the frequencies are less well defined (the peaks now have a width).
So, add width=0 to the line containing the find_peaks instruction:
peaks_index, properties = find_peaks(np.abs(yf), height=height_threshold, width=0)
Then look at what is contained inside properties:
print(properties)
You'll see that find_peaks gives you much more information than just
the peak positions. For more info about what is inside properties:
help(find_peaks)
I can compute the autocorrelation using numpy's built in functionality:
numpy.correlate(x,x,mode='same')
However the resulting correlation is naturally noisy. I can partition my data, compute the correlation on each resulting window, and then average them all together to obtain a cleaner autocorrelation, similar to what signal.welch does. Is there a handy function in either numpy or scipy that does this, possibly faster than I would get if I were to partition and loop through the data myself?
UPDATE
This is motivated by @kazemakase's answer. I have tried to show what I mean with some code used to generate the figure below.
One can see that @kazemakase is correct that the AC function naturally averages out the noise. However, averaging the AC has the advantage that it is much faster! np.correlate seems to scale as the slow O(n^2) rather than the O(n log n) that I would expect if the correlation were calculated using circular convolution via the FFT...
from statsmodels.tsa.arima_process import arma_generate_sample
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(12345)
arparams = np.array([.75, -.25, 0.2, -0.15])
maparams = np.array([.65, .35])
ar = np.r_[1, -arparams] # add zero-lag and negate
ma = np.r_[1, maparams] # add zero-lag
x = arma_generate_sample(ar, ma, 10000)
def calc_rxx(x):
    x = x - x.mean()
    N = len(x)
    Rxx = np.correlate(x, x, mode="same")[N//2:]/N
    #Rxx = np.correlate(x, x, mode="same")[N//2:]/np.arange(N, N//2, -1)
    return Rxx/x.var()
def avg_rxx(x, nperseg=1024):
    rxx_windows = []
    Nw = int(np.floor(len(x)/nperseg))
    print(Nw)
    first = True
    for i in range(Nw-1):
        xw = x[i*nperseg:nperseg*(i+1)]
        y = calc_rxx(xw)
        if i % 1 == 0:
            if first:
                plt.semilogx(y, "k", alpha=0.2, label="Short AC")
                first = False
            else:
                plt.semilogx(y, "k", alpha=0.2)
        rxx_windows.append(y)
    print(np.shape(rxx_windows))
    return np.mean(rxx_windows, axis=0)
plt.figure()
r_avg = avg_rxx(x,nperseg=300)
r = calc_rxx(x)
plt.semilogx(r_avg,label="Average AC")
plt.semilogx(r,label="Long AC")
plt.xlabel("Lag")
plt.ylabel("Auto-correlation")
plt.legend()
plt.xlim([0,150])
plt.show()
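As a side note on the O(n log n) point above: a minimal sketch (not part of the code above, and assuming a scipy version that accepts the method argument) of computing the same quantity with scipy's FFT-based correlation:
from scipy import signal
def calc_rxx_fft(x):
    # same normalization as calc_rxx above, but scipy takes the FFT path: O(n log n) instead of O(n^2)
    x = x - x.mean()
    N = len(x)
    Rxx = signal.correlate(x, x, mode="same", method="fft")[N//2:]/N
    return Rxx/x.var()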
TL;DR: To decrease noise in the autocorrelation function, increase the length of your signal x.
Partitioning the data and averaging like in spectral estimation is an interesting idea. I wish it would work...
The autocorrelation of a signal x of length n is defined as
Rxx(l) = sum_{k=0}^{n-1} x[k] * x[k+l]
Let's say we partition the data into two windows. Their autocorrelations become
Rxx1(l) = sum_{k=0}^{n/2-1} x[k] * x[k+l]
Rxx2(l) = sum_{k=n/2}^{n-1} x[k] * x[k+l]
Note how they only differ in the limits of the summations. Basically, we split the summation of the autocorrelation into two parts. When we add these back together we are back at the original autocorrelation! So we did not gain anything.
The conclusion is: there is no such thing implemented in numpy/scipy because there is no point in doing so.
Remarks:
I hope it's easy to see that this extends to any number of partitions.
to keep it simple I left the normalization out. If you divide Rxx by n and the partial Rxx by n/2 you get Rxx / n == (Rxx1 * 2/n + Rxx2 * 2/n) / 2, i.e. the mean of the normalized partial autocorrelations is equal to the complete normalized autocorrelation (a quick numerical check of this is sketched after these remarks).
to keep it even simpler I assumed the signal x could be indexed beyond the limits of 0 and n-1. In practice, if the signal is stored in an array this is often not possible. In this case there is a small difference between the full and the partial autocorrelations that increases with the lag l. Unfortunately, this is merely a loss of precision and does not reduce noise.
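A quick numerical check of that normalization remark (a sketch, with the partial sums allowed to index into the full array so the edge effect from the last remark does not appear):
import numpy as np
n = 1024
x = np.random.randn(n)
lags = np.arange(1, 50)
rxx_full = np.array([x[:n - l] @ x[l:] for l in lags]) / n
rxx1 = np.array([x[:n//2] @ x[l:n//2 + l] for l in lags]) / (n//2)
rxx2 = np.array([x[n//2:n - l] @ x[n//2 + l:] for l in lags]) / (n//2)
print(np.allclose(rxx_full, (rxx1 + rxx2) / 2))   # True: the mean of the partial ACFs equals the full ACF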
Code heretic! I don't believe your evil math!
Of course we can try things out and see:
import matplotlib.pyplot as plt
import numpy as np
n = 2**16
n_segments = 8
x = np.random.randn(n) # data
rx = np.correlate(x, x, mode='same') / n # ACF
l1 = np.arange(-n//2, n//2) # Lags
segments = x.reshape(n_segments, -1)
m = segments.shape[1]
rs = []
for y in segments:
    ry = np.correlate(y, y, mode='same') / m # partial ACF
    rs.append(ry)
l2 = np.arange(-m//2, m//2) # lags of partial ACFs
plt.plot(l1, rx, label='full ACF')
plt.plot(l2, np.mean(rs, axis=0), label='partial ACF')
plt.xlim(-m, m)
plt.legend()
plt.show()
Although we used 8 segments to average the ACF, the noise level visually stays the same.
Okay, so that's why it does not work but what is the solution?
Here is the good news: Autocorrelation is already a noise reduction technique! Well, in some way at least: one application of the ACF is to find periodic signals hidden by noise.
Since noise (ideally) has zero mean, its influence diminishes the more elements we sum up. In other words, you can reduce noise in the autocorrelation by using longer signals. (I guess this is probably not true for every type of noise, but should hold for the usual Gaussian white noise and its relatives.)
Behold the noise getting lower with more data samples:
import matplotlib.pyplot as plt
import numpy as np
for n in [2**6, 2**8, 2**12]:
    x = np.random.randn(n)
    rx = np.correlate(x, x, mode='same') / n # ACF
    l1 = np.arange(-n//2, n//2) # Lags
    plt.plot(l1, rx, label='n={}'.format(n))
plt.legend()
plt.xlim(-20, 20)
plt.show()
Despite having searched related questions for two days, I have not really found an answer to this problem yet...
In the following code, I generate n normally distributed random variables, which are then represented in a histogram:
import numpy as np
import matplotlib.pyplot as plt
n = 10000 # number of generated random variables
x = np.random.normal(0,1,n) # generate n random variables
# plot this in a non-normalized histogram:
plt.hist(x, bins='auto', density=False)
# get the arrays containing the bin counts and the bin edges:
histo, bin_edges = np.histogram(x, bins='auto', density=False)
number_of_bins = len(bin_edges)-1
After that, a fitting function and its parameters are determined.
It is a normal distribution with the parameters a1 and b1, scaled with scaling_factor to account for the fact that the sample is not normalized.
It indeed fits the histogram quite well:
import scipy as sp
import scipy.stats  # needed so that sp.stats is available
a1, b1 = sp.stats.norm.fit(x)
scaling_factor = n*(x.max()-x.min())/number_of_bins
x_achse = np.linspace(x.min(), x.max(), 1000)  # x-values at which to evaluate the fitted pdf
plt.plot(x_achse, scaling_factor*sp.stats.norm.pdf(x_achse,a1,b1), 'b')
Here's the plot of the histogram with the fitting function in red.
After that, I want to test how well this function fits the histogram using the chi-squared test.
This test uses the observed values and the expected values at those points. To calculate the expected values, I first calculate the location of the middle of each bin; this information is contained in the array x_middle. I then calculate the value of the fitting function at the middle point of each bin, which gives the expected_values array:
observed_values = histo
bin_width = bin_edges[1] - bin_edges[0]
# array containing the middle point of each bin:
x_middle = np.linspace( bin_edges[0] + 0.5*bin_width,
                        bin_edges[0] + (0.5 + number_of_bins)*bin_width,
                        num = number_of_bins)
expected_values = scaling_factor*sp.stats.norm.pdf(x_middle,a1,b1)
Plugging this into the chisquare function of Scipy, I get p-values on the order of 1e-5 to 1e-15, which tells me the fitting function does not describe the histogram:
print(sp.stats.chisquare(observed_values,expected_values,ddof=2))
But this can't be right; the function fits the histogram very well!
Does anybody know where I made a mistake?
Thanks a lot!!
Charles
p.s.: I set the number of delta degrees of freedom to 2, because the 2 parameters a1 and b1 are estimated from the sample. I tried other values of ddof, but the results were just as poor!
Your calculation of the end-point of the array x_middle is off by one; it should be:
x_middle = np.linspace(bin_edges[0] + 0.5*bin_width,
                       bin_edges[0] + (0.5 + number_of_bins - 1)*bin_width,
                       num=number_of_bins)
Note the extra - 1 in the second argument of linspace().
A more concise version is
x_middle = 0.5*(bin_edges[1:] + bin_edges[:-1])
A different (and possibly more accurate) approach to computing expected_values is to use the differences of the CDF, instead of approximating those differences using the PDF in the middle of each interval:
In [75]: from scipy import stats
In [76]: cdf = stats.norm.cdf(bin_edges, a1, b1)
In [77]: expected_values = n * np.diff(cdf)
With that calculation, I get the following result from the chi-squared test:
In [85]: stats.chisquare(observed_values, expected_values, ddof=2)
Out[85]: Power_divergenceResult(statistic=61.168393496775181, pvalue=0.36292223875686402)
How can I calculate the Cumulative Distribution Function (CDF) in Python?
I want to calculate it from an array of points I have (discrete distribution), not with the continuous distributions that, for example, scipy has.
(It is possible that my interpretation of the question is wrong. If the question is how to get from a discrete PDF into a discrete CDF, then np.cumsum divided by a suitable constant will do if the samples are equispaced. If the array is not equispaced, then np.cumsum of the array multiplied by the distances between the points will do.)
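As a rough sketch of that parenthetical remark (with a made-up PDF; pdf holds the discrete PDF values sampled at the points in x):
import numpy as np
x = np.linspace(-4, 4, 801)                      # equispaced sample points
pdf = np.exp(-x**2 / 2) / np.sqrt(2*np.pi)       # example PDF values at those points
cdf_equi = np.cumsum(pdf) * (x[1] - x[0])        # cumsum scaled by the constant spacing
# non-equispaced case: weight each PDF value by the local spacing before summing
xu = np.sort(np.random.uniform(-4, 4, 500))
pdfu = np.exp(-xu**2 / 2) / np.sqrt(2*np.pi)
cdf_uneq = np.cumsum(pdfu * np.gradient(xu))     # np.gradient(xu) approximates the distances between points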
If you have a discrete array of samples and you would like to know the CDF of the sample, then you can just sort the array. If you look at the sorted result, you'll realize that the smallest value represents 0% and the largest value represents 100%. If you want to know the value at 50% of the distribution, just look at the array element in the middle of the sorted array.
Let us have a closer look at this with a simple example:
import matplotlib.pyplot as plt
import numpy as np
# create some randomly distributed data:
data = np.random.randn(10000)
# sort the data:
data_sorted = np.sort(data)
# calculate the proportional values of samples
p = 1. * np.arange(len(data)) / (len(data) - 1)
# plot the sorted data:
fig = plt.figure()
ax1 = fig.add_subplot(121)
ax1.plot(p, data_sorted)
ax1.set_xlabel('$p$')
ax1.set_ylabel('$x$')
ax2 = fig.add_subplot(122)
ax2.plot(data_sorted, p)
ax2.set_xlabel('$x$')
ax2.set_ylabel('$p$')
This gives the following plot, where the right-hand-side plot is the traditional cumulative distribution function. It should reflect the CDF of the process behind the points, but naturally it is only an approximation as long as the number of points is finite.
This function is easy to invert, and which form you need depends on your application.
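For instance, inverting it with np.interp (a sketch, reusing p and data_sorted from the snippet above):
median = np.interp(0.5, p, data_sorted)                         # value at the 50% point
new_samples = np.interp(np.random.rand(1000), p, data_sorted)   # inverse-transform sampling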
Assuming you know how your data is distributed (i.e. you know the pdf of your data), scipy does support discrete data when calculating CDFs:
import numpy as np
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
x = np.random.randn(10000) # generate samples from normal distribution (discrete data)
norm_cdf = scipy.stats.norm.cdf(x) # calculate the cdf - also discrete
# plot the cdf
sns.lineplot(x=x, y=norm_cdf)
plt.show()
We can even print the first few values of the cdf to show they are discrete
print(norm_cdf[:10])
>>> array([0.39216484, 0.09554546, 0.71268696, 0.5007396 , 0.76484329,
0.37920836, 0.86010018, 0.9191937 , 0.46374527, 0.4576634 ])
The same method to calculate the cdf also works for multiple dimensions: we use 2d data below to illustrate
mu = np.zeros(2) # mean vector
cov = np.array([[1,0.6],[0.6,1]]) # covariance matrix
# generate 2d normally distributed samples using 0 mean and the covariance matrix above
x = np.random.multivariate_normal(mean=mu, cov=cov, size=1000) # 1000 samples
norm_cdf = scipy.stats.norm.cdf(x)
print(norm_cdf.shape)
>>> (1000, 2)
In the above examples, I had prior knowledge that my data was normally distributed, which is why I used scipy.stats.norm() - there are multiple distributions scipy supports. But again, you need to know how your data is distributed beforehand to use such functions. If you don't know how your data is distributed and you just use any distribution to calculate the cdf, you most likely will get incorrect results.
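If the parameters of that distribution are not known in advance, one rough sketch (assuming a normal model is appropriate, and using the 1-D sample x from the first snippet) is to estimate them first and then evaluate the CDF:
from scipy import stats
loc, scale = stats.norm.fit(x)             # estimate mean and standard deviation from the sample
norm_cdf = stats.norm.cdf(x, loc, scale)   # CDF evaluated at the data points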
The empirical cumulative distribution function is a CDF that jumps exactly at the values in your data set. It is the CDF for a discrete distribution that places a mass at each of your values, where the mass is proportional to the frequency of the value. Since the sum of the masses must be 1, these constraints determine the location and height of each jump in the empirical CDF.
Given an array a of values, you compute the empirical CDF by first obtaining the frequencies of the values. The numpy function unique() is helpful here because it returns not only the frequencies, but also the values in sorted order. To calculate the cumulative distribution, use the cumsum() function, and divide by the total sum. The following function returns the values in sorted order and the corresponding cumulative distribution:
import numpy as np
def ecdf(a):
    x, counts = np.unique(a, return_counts=True)
    cusum = np.cumsum(counts)
    return x, cusum / cusum[-1]
To plot the empirical CDF you can use matplotlib's plot() function. The option drawstyle='steps-post' ensures that jumps occur at the right place. However, you need to force a jump at the smallest data value, so it's necessary to insert an additional element in front of x and y.
import matplotlib.pyplot as plt
def plot_ecdf(a):
    x, y = ecdf(a)
    x = np.insert(x, 0, x[0])
    y = np.insert(y, 0, 0.)
    plt.plot(x, y, drawstyle='steps-post')
    plt.grid(True)
    plt.savefig('ecdf.png')
Example usages:
xvec = np.array([7,1,2,2,7,4,4,4,5.5,7])
plot_ecdf(xvec)
df = pd.DataFrame({'x':[7,1,2,2,7,4,4,4,5.5,7]})
plot_ecdf(df['x'])
For calculating the CDF of an array of discrete numbers:
import numpy as np
pdf, bin_edges = np.histogram(
    data,          # array of data
    bins=500,      # specify the number of bins for the distribution function
    density=True   # True to return the probability density function (pdf) instead of counts
)
cdf = np.cumsum(pdf*np.diff(bin_edges))
Note that the returned array pdf has the length of bins (500 here) and bin_edges has the length of bins+1 (501 here).
So, to calculate the CDF, which is nothing but the area below the PDF curve, we can simply compute the cumulative sum of the bin widths (np.diff(bin_edges)) times pdf using Numpy's cumsum function.
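A short usage sketch of the above (with made-up data):
import numpy as np
import matplotlib.pyplot as plt
data = np.random.randn(10000)                    # example data
pdf, bin_edges = np.histogram(data, bins=500, density=True)
cdf = np.cumsum(pdf*np.diff(bin_edges))
print(cdf[-1])                                   # ~1.0, since the PDF integrates to one
plt.plot(bin_edges[1:], cdf)                     # CDF plotted against the right-hand bin edges
plt.show()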
Here's an alternative pandas solution to calculating the empirical CDF, using pd.cut to sort the data into evenly spaced bins first, and then cumsum to compute the distribution.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def empirical_cdf(s: pd.Series, n_bins: int = 100):
    # Sort the data into `n_bins` evenly spaced bins:
    discretized = pd.cut(s, n_bins)
    # Count the number of datapoints in each bin:
    bin_counts = discretized.value_counts().sort_index().reset_index()
    # Calculate the locations of each bin as just the mean of the bin start and end:
    bin_counts["loc"] = (pd.IntervalIndex(bin_counts["index"]).left + pd.IntervalIndex(bin_counts["index"]).right) / 2
    # Compute the CDF with cumsum:
    return bin_counts.set_index("loc").iloc[:, -1].cumsum()
Below is an example use of the function to discretize the distribution of 10000 datapoints into 100 evenly spaced bins:
s = pd.Series(np.random.randn(10000))
cdf = empirical_cdf(s, n_bins=100)
fig, ax = plt.subplots()
ax.scatter(cdf.index, cdf.values)
import random
import numpy as np
import matplotlib.pyplot as plt
def get_discrete_cdf(values):
    # rescale the values to [0, 1]
    values = (values - np.min(values)) / (np.max(values) - np.min(values))
    values_sort = np.sort(values)
    values_sum = np.sum(values)
    # running sum of the sorted (rescaled) values
    values_sums = []
    cur_sum = 0
    for it in values_sort:
        cur_sum += it
        values_sums.append(cur_sum)
    # for each original value, look up its cumulative share of the total
    cdf = [values_sums[np.searchsorted(values_sort, it)]/values_sum for it in values]
    return cdf
rand_values = [np.random.normal(loc=0.0) for _ in range(1000)]
_ = plt.hist(rand_values, bins=20)
_ = plt.xlabel("rand_values")
_ = plt.ylabel("nums")
cdf = get_discrete_cdf(rand_values)
x_p = list(zip(rand_values, cdf))
x_p.sort(key=lambda it: it[0])
x = [it[0] for it in x_p]
y = [it[1] for it in x_p]
_ = plt.plot(x, y)
_ = plt.xlabel("rand_values")
_ = plt.ylabel("prob")
I have a question concerning fitting and getting random numbers.
The situation is as follows:
First, I have a histogram built from data points.
import numpy as np
"""create random data points """
mu = 10
sigma = 5
n = 1000
datapoints = np.random.normal(mu,sigma,n)
""" create normalized histrogram of the data """
bins = np.linspace(0,20,21)
H, bins = np.histogram(data,bins,density=True)
I would like to interpret this histogram as a probability density function (with e.g. 2 free parameters) so that I can use it to produce random numbers, AND I would also like to use that function to fit another histogram.
Thanks for your help
You can use a cumulative density function to generate random numbers from an arbitrary distribution, as described here.
Using a histogram to produce a smooth cumulative density function is not entirely trivial; you can use interpolation, for example scipy.interpolate.interp1d(), for values in between the centers of your bins, and that will work fine for a histogram with a reasonably large number of bins and items. However, you have to decide on the form of the tails of the probability function, i.e. for values less than the smallest bin or greater than the largest bin. You could give your distribution Gaussian tails (based on, for example, fitting a Gaussian to your histogram), or any other form of tail appropriate to your problem, or simply truncate the distribution.
Example:
import numpy
import scipy.interpolate
import random
import matplotlib.pyplot as pyplot
# create some normally distributed values and make a histogram
a = numpy.random.normal(size=10000)
counts, bins = numpy.histogram(a, bins=100, density=True)
cum_counts = numpy.cumsum(counts)
bin_widths = (bins[1:] - bins[:-1])
# generate more values with same distribution
x = cum_counts*bin_widths
y = bins[1:]
inverse_density_function = scipy.interpolate.interp1d(x, y)
b = numpy.zeros(10000)
for i in range(len(b)):
    u = random.uniform(x[0], x[-1])
    b[i] = inverse_density_function(u)
# plot both
pyplot.hist(a, 100)
pyplot.hist(b, 100)
pyplot.show()
This doesn't handle tails, and it could handle bin edges better, but it would get you started on using a histogram to generate more values with the same distribution.
P.S. You could also try to fit a specific known distribution described by a few parameters (which I think is what you had mentioned in the question), but the above non-parametric approach is more general-purpose.
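For that parametric route, a minimal sketch (assuming a Gaussian is an appropriate model for the data):
import numpy
from scipy import stats
a = numpy.random.normal(size=10000)         # the data to model
mu, sigma = stats.norm.fit(a)               # fit the two free parameters
b = stats.norm.rvs(mu, sigma, size=10000)   # draw new values from the fitted distribution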