I am trying to convert a MATLAB program to Python. I am having problems setting up the cross power spectral density function and obtaining results that match MATLAB's.
The function used in the MATLAB code is written as follows:
[Pxy,f] = cpsd(x,y,M,round(M/2),M,fs);
In the documentation accompanying the code, I read that M = 128 (the number of FFT points) and fs = 25.0 (the sampling frequency [Hz]). x and y are 1x751 row vectors of acceleration data.
The call has six arguments, so I assume that [pxy,f] = cpsd(x,y,window,noverlap,f,fs) is the signature the programmer intended to use from the MATLAB library, as it is the only six-argument form in the documentation (see here).
This function returns the cross power spectral density estimates at the frequencies specified in f.
(It bugs me that f should be a frequency vector, yet the variable M, the number of FFT points, is passed in that position; but let's assume this was not a mistake.)
Now, I would like to use scipy.signal.csd to convert this function, but there are two problems:
1. The window is defined in MATLAB as an integer, but csd from scipy allows windows only as tuples, strings, or array_like objects.
2. csd from scipy has no argument for returning the cross power spectral density estimates at specific frequencies.
For number 1 I defined a window as follows:
window = hamming(M, sym=False)
I chose the Hamming window because it is the default when an integer window is passed to cpsd in MATLAB ("If window is an integer, then cpsd divides x and y into segments of length window and windows each segment with a Hamming window of that length."), and I did not make it symmetric because for spectral analysis it makes sense to use a periodic window.
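For reference, here is how both window variants can be built in SciPy (just a sketch): MATLAB's own hamming(M) defaults to a symmetric window, so sym=True might be the closer match if the window turns out to be part of the mismatch.
from scipy.signal.windows import hamming

M = 128
win_periodic = hamming(M, sym=False)   # periodic ("DFT-even") variant
win_symmetric = hamming(M, sym=True)   # symmetric variant, the default of MATLAB's hamming(M)
print(win_periodic[:3], win_symmetric[:3])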
For number 2 I have no solution.
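The closest workaround I can think of (sketched below on placeholder random data; f_target is a hypothetical frequency vector, and this is not equivalent to MATLAB's frequency-vector form of cpsd) is to compute the full CSD and then interpolate it onto the frequencies of interest afterwards:
import numpy as np
from scipy import signal
from scipy.signal.windows import hamming

rng = np.random.default_rng(0)
x = rng.standard_normal(751)              # placeholder stand-ins for the acceleration records
y = rng.standard_normal(751)
M, fs = 128, 25.0

f, Pxy = signal.csd(x, y, fs=fs, window=hamming(M, sym=False), noverlap=M // 2)

f_target = np.array([0.5, 1.0, 2.0])      # hypothetical frequencies of interest
Pxy_target = np.interp(f_target, f, Pxy.real) + 1j * np.interp(f_target, f, Pxy.imag)
print(Pxy_target)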
This is the call I set up in my Python code:
import numpy as np
from scipy import signal
from scipy.signal.windows import hamming

M = 128
fs = 25.0
window = hamming(M, sym=False)
noverlap = int(np.round(M / 2))
f, fxx = signal.csd(x, y, fs=fs, window=window, noverlap=noverlap)
The results do not match for Pxy (the cross power spectral density), but the frequency vectors match perfectly.
These are the first elements of the MATLAB results:
1.3590e-05
3.4354e-05
4.5282e-05
6.2549e-05
5.7965e-05
4.9697e-05
5.5413e-05
While this is what I get from Python:
2.04688576e-06
3.37540142e-05
4.51821900e-05
6.19997501e-05
5.78926181e-05
5.00058106e-05
5.53681683e-05
I also tried the simple cross spectral density function in matplotlib (documented here):
from matplotlib import mlab
fxx, f = mlab.csd(x, y, NFFT=M, Fs=fs, noverlap=noverlap)
This gives results that are closer, but still not a perfect match.
1.18939608e-05
3.45206717e-05
4.56902859e-05
6.45083475e-05
5.73594952e-05
5.01539145e-05
5.34534933e-05
The objective is not to chase down a possible numerical error in the conversion, but to call the cross power spectral density with inputs that match the MATLAB call.
Can anybody help?
Thanks a lot in advance!!!
During a project I've been working on, I collected experimental temperature data as a function of time from a metal casting at 4 locations in the system. This temperature profile is complex: a period of initial cooling, followed by an enormous, sudden rise in temperature as the alloy arrives, and then a final period of cooling.
To understand the temperature environment between the measurement locations, I'm trying to use Python to solve the heat equation, which requires a combination of symbolic derivatives and integrals (for which I've used SymPy) and numerical calculations (for which I've used lambdify and NumPy).
The issue comes when I want to use the collected temperature data as boundary conditions in the calculation. I've used SciPy to interpolate between the data points to obtain a complete temperature data set for all times (and to obtain a new spline representing the derivative of the original spline), but I cannot obtain this interpolation in a format that SymPy will understand for the derivatives and integrals in the calculus.
Any advice or suggestions?
########################################################################################################
The code I've used for the opening step of defining the interpolation is detailed below. I appreciate it would be more efficiently written as a matrix, but I find it easier to see it all when it's written out longhand (I generally simplify later if needed):
Note: The time (the x-axis parameter in the interpolation) is a strictly increasing parameter
import numpy as np #import the relevant packages/items
import pandas as pd
from scipy import signal
import sympy as sp
from scipy.interpolate import UnivariateSpline
x=sp.Symbol("x", real=True, positive=True) #define the sympy symbols
L=sp.Symbol("L", real=True, positive=True)
filename='.../Temperature Data for Code.csv'
data = np.array(pd.read_csv(filename,skiprows=1, header=None)) #reading in the datasets from file
Time_exp=np.array(data[:,0]) #assign the time (x-axis)
T_Alloy_centre_orig=np.array(data[:,1]) +273 #assign 4 temperature (y-axis) and convert to K
T_Alloy_edge_orig=np.array(data[:,2]) +273
T_Inner_orig=np.array(data[:,9]) +273
T_Outer_orig=np.array(data[:,11]) +273
T_Alloy_centre=T_Alloy_centre_orig #create copy of the original data before manipulation
T_Alloy_edge=T_Alloy_edge_orig
T_Inner=T_Inner_orig
T_Outer=T_Outer_orig
T_Alloy_centre=signal.savgol_filter(T_Alloy_centre,3,1) #basic filter to smooth experimental noise
T_Alloy_edge=signal.savgol_filter(T_Alloy_edge,3,1)
T_Inner=signal.savgol_filter(T_Inner,3,1)
T_Outer=signal.savgol_filter(T_Outer,3,1)
T_Alloy_centre_xt = UnivariateSpline(Time_exp, T_Alloy_centre,k=3) #perform spline interpolation
T_Alloy_edge_xt = UnivariateSpline(Time_exp, T_Alloy_edge,k=3)
T_Inner_xt = UnivariateSpline(Time_exp, T_Inner,k=3)
T_Outer_xt = UnivariateSpline(Time_exp, T_Outer,k=3)
diff_T_Alloy_centre_xt = T_Alloy_centre_xt.derivative(n=1) #new spline for derivative of previous
diff_T_Alloy_edge_xt = T_Alloy_edge_xt.derivative(n=1)
diff_T_Inner_xt = T_Inner_xt.derivative(n=1)
diff_T_Outer_xt = T_Outer_xt.derivative(n=1)
#############################################################################
Here is where the speculation begins. I've tried several things to convert the resulting interpolation into something that SymPy can use, but without success.
First, I tried a sympy.implemented_function approach and also just using a plain lambda, although this just gave a string as far as I can tell:
T_Alloy_centre_f = implemented_function('T_Alloy_centre_f', lambda t: T_Alloy_centre_xt)
T_Alloy_centre_f = lambda t: T_Alloy_centre_xt
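For reference, a minimal sketch of how implemented_function can at least wrap the spline so it evaluates numerically through lambdify (T_centre_f and dT_centre_f are placeholder names; the splines are the ones defined above). I have not verified that this resolves the symbolic-integration part:
import sympy as sp
from sympy.utilities.lambdify import implemented_function, lambdify

t = sp.Symbol("t", real=True, positive=True)

# the lambda must actually evaluate the spline at t, not just return the spline object
T_centre_f = implemented_function('T_centre_f', lambda tt: float(T_Alloy_centre_xt(tt)))
dT_centre_f = implemented_function('dT_centre_f', lambda tt: float(diff_T_Alloy_centre_xt(tt)))

# SymPy cannot differentiate the wrapped spline symbolically, so the derivative spline
# is wrapped separately and used wherever sp.diff(T, t) would appear
expr = dT_centre_f(t)
expr_num = lambdify(t, expr)   # numeric function of t
print(expr_num(10.0))          # e.g. evaluate the temperature derivative at t = 10 s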
Secondly, I tried the interpolation function available within SymPy (interpolating_spline), but it ran for 15 minutes without producing a result for even one of the 4 measurements. It would be convenient if this worked, since it stays within SymPy, but the calculation time is extreme, possibly because the data is not smooth, featuring a sudden, massive increase in temperature on arrival of the molten alloy.
T_Alloy_centre_xt = sp.interpolating_spline(3, x, Time_exp, T_Alloy_centre)
Finally, I pulled the spline coefficients and knots out of the interpolation, with the aim of building the piecewise function manually before converting, but I could not come up with a convenient way of assembling it. Nor did the previous implemented_function approach seem to work here either.
spline_coeffs = T_Alloy_centre_xt.get_coeffs()
spline_knots = T_Alloy_centre_xt.get_knots()
I'm not sure how to proceed from here. I need something from this interpolation that can be passed through sp.diff and sp.integrate.
###########################################################################
#if I can get past the above conversion, the next step in the code is evaluating the derivative at a specified value and performing an integral as below:
F_A=-1*sp.diff(T_Inner_f, t) - x/L*(sp.diff(T_Outer_f,t)-sp.diff(T_Inner_f,t))
f_n_A=(2/L)*sp.integrate(F_A*sp.sin(lamda*x),(x,0,L))
Any assistance would be hugely appreciated.
I am trying to understand the implementation that is used in
scipy.stats.wasserstein_distance
for p=1 and no weights, with u_values and v_values the two 1-D distributions, the code comes down to:
u_sorter = np.argsort(u_values)                                             # (1)
v_sorter = np.argsort(v_values)
all_values = np.concatenate((u_values, v_values))                           # (2)
all_values.sort(kind='mergesort')
deltas = np.diff(all_values)                                                # (3)
u_cdf_indices = u_values[u_sorter].searchsorted(all_values[:-1], 'right')   # (4)
v_cdf_indices = v_values[v_sorter].searchsorted(all_values[:-1], 'right')
v_cdf = v_cdf_indices / v_values.size                                       # (5)
u_cdf = u_cdf_indices / u_values.size
return np.sum(np.multiply(np.abs(u_cdf - v_cdf), deltas))                   # (6)
What is the reasoning behind this implementation? Is there some literature on it?
I did look at the cited paper, which I believe explains why calculating the Wasserstein distance in its general definition is, in 1D, equivalent to evaluating the integral
\int_{-\infty}^{+\infty} |U - V|,
with U and V the cumulative distribution functions of u_values and v_values,
but I don't understand how this integral is evaluated in the scipy implementation.
In particular,
a) why are they multiplying by the deltas in (6) to solve the integral?
b) how are v_cdf and u_cdf in (5) the cumulative distribution functions U and V?
Also, with this implementation the element order of u_values and v_values is not preserved. Shouldn't the order be preserved under the general Wasserstein distance definition?
Thank you for your help!
The order of the PDF, histogram, or KDE is preserved and is important in the Wasserstein distance. If you only pass u_values and v_values, then the function has to calculate something like a PDF, KDE, or histogram itself. Normally you would provide the PDF and the range of U and V as the 4 arguments to wasserstein_distance. So in the case where samples are provided, you are not passing a real datapoint, simply a collection of repeated "experiments". Steps (1) and (4) in your listing basically bin your data by the number of discrete values. A CDF is the number of discrete values up to that point, or P(x < X); the CDF is basically the cumulative sum of a PDF, histogram, or KDE. Step (5) normalizes the CDF to between 0.0 and 1.0, or, said another way, divides each count by the number of values.
So the order of the discrete values is preserved, not the original order in the datapoint.
b) It may make more sense if you plot the CDFs of a datapoint, such as an image file, using the code above.
The transportation problem, however, may not need a PDF but rather a datapoint of ordered features, or some way to measure distance between features, in which case you would calculate it differently.
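A tiny numeric sketch of the steps listed in the question (an addition here, assuming SciPy is installed) shows how the piecewise-constant integral of |U - V| reproduces wasserstein_distance:
import numpy as np
from scipy.stats import wasserstein_distance

u_values = np.array([0.0, 1.0, 3.0])
v_values = np.array([5.0, 6.0, 8.0])

all_values = np.sort(np.concatenate((u_values, v_values)))
deltas = np.diff(all_values)                                   # widths of the intervals (step 3)

# empirical CDFs on each interval, i.e. fraction of samples <= the left endpoint (steps 4-5)
u_cdf = np.searchsorted(np.sort(u_values), all_values[:-1], 'right') / u_values.size
v_cdf = np.searchsorted(np.sort(v_values), all_values[:-1], 'right') / v_values.size

manual = np.sum(np.abs(u_cdf - v_cdf) * deltas)                # piecewise-constant integral (step 6)
print(manual, wasserstein_distance(u_values, v_values))        # both print 5.0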
Let's say I have a column x with uniformly distributed values.
To these values I applied a CDF function.
Now I want to calculate the Gaussian copula, but I can't find a function for it in Python. I have already read that the Gaussian copula is something like the "inverse of the CDF function".
The reason why I'm doing it comes from this paragraph:
A visual depiction of applying the Gaussian copula process to normalize an observation by applying n = Phi^-1(F(x)). Calculating F(x) yields a value u ∈ [0, 1] representing the proportion of shaded area at the left. Then Phi^-1(u) yields a value n by matching the shaded area in a Gaussian distribution.
I need your help: does anyone have an idea how to calculate that?
I have 2 ideas so far:
1) gauss = 1/(sqrt(2*pi)*s)*e**(-0.5*(float(x-m)/s)**2)
--> so transform all the values with this into new values
2) norm.ppf(array, loc, scale)
--> so give the ppf function the array, the mean, and the std, and it will calculate the inverse of the CDF for me... but I doubt #2
The thing is
n.cdf(n.ppf(0.95))
is not what I want. The reason I'm doing this is to transform a non-normal/non-Gaussian distribution into a normal distribution.
Like here:
Transform from a non gaussian distribution to a gaussian distribution with Gaussian Copula
Any other ideas or tips?
Thank you very much :)
EDIT:
I found 2 links which are quite useful:
1. https://stats.stackexchange.com/questions/197283/how-to-transform-an-arcsine-distribution-to-a-normal-distribution
2. https://stats.stackexchange.com/questions/125648/transformation-chi-squared-to-normal-distribution/125653#125653
In these posts it's said that you have to:
All the details are in the answer already - you take your random variable, and transform it by its own cdf ..... yielding a uniform result.
That's true for me. If I take a random distribution and apply the norm.cdf(data, mean, std) function, I get a uniformly distributed CDF.
Compare:
from scipy.stats import norm as n
import numpy as np

data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
cdf = n.cdf(data, np.mean(data), np.std(data))
print(cdf)
But how can I do the following:
You then transform again, applying the quantile function (inverse cdf) of the desired distribution (in this case by the standard normal quantile function /inverse of the normal cdf, producing a variable with a standard normal distribution).
Because when I use e.g. the norm.ppf function, the values are not reasonable.
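A minimal sketch of the two-step transform described in the quoted answers (rank-based empirical CDF, then the standard normal quantile function); I am not certain this is the intended recipe:
import numpy as np
from scipy.stats import norm, rankdata

data = np.array([1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8)

# step 1: empirical CDF via ranks, scaled into (0, 1) so that ppf never returns +/- inf
u = rankdata(data) / (len(data) + 1)

# step 2: standard normal quantile function (inverse CDF)
z = norm.ppf(u)
print(z[:5])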
I am attempting to calculate the MTF from a test target. I calculate the spread function easily enough, but the FFT results do not quite make sense to me. To summarize, the values seem to alternate in sign, giving me a reflection of what I would expect. To test, I used a simple square wave and numpy:
from numpy import fft

data = []
for x in range(0, 20):
    data.append(0)
data[9] = 10
data[10] = 10
data[11] = 10
dataFFT = fft.fft(data)
The results look correct, with the exception of the sign... I am seeing the following for the first 4 values as an example:
30.00000000 +0.00000000e+00j
-29.02113033 +7.10542736e-15j
26.18033989 -1.24344979e-14j
-21.75570505 +1.24344979e-14j
So my question is: why positive -> negative -> positive -> negative in the real part? This is not what I would expect... If I plot it, it almost appears that the correct function is mirrored around the x axis.
Note: I included plots of what I was expecting and of what I am actually getting (images omitted here).
Your pulse is symmetric and positioned in the center of your FFT window (around N/2). Symmetric real data corresponds to only the cosine or "real" components of an FFT result. Note that the cosine function alternates between being -1 and 1 at the center of the FFT window, depending on the frequency bin index (representing cosine periods per FFT width). So the correlation of these FFT basis functions with a positive going pulse will also alternate as long as the pulse is narrower than half the cosine period.
If you want the largest FFT coefficients to be mostly positive, try centering your narrow rectangular pulse around time 0 (or circularly, time N), where the cosine function is always 1 for any frequency.
It works if you shift the data around 0 instead of half your array, with:
dataFFT = fft.fft(np.fft.fftshift(data))
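A quick check with the example data from the question (just a sketch):
import numpy as np

data = np.zeros(20)
data[9:12] = 10                                    # pulse centred in the window, as in the question

print(np.fft.fft(data).real[:4])                   # alternating signs: 30, -29.02, 26.18, -21.76
print(np.fft.fft(np.fft.fftshift(data)).real[:4])  # all positive: 30, 29.02, 26.18, 21.76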
This isn't all that unexpected. If you want to check against conventional plots, make sure you convert that info to magnitude and phase before coming to any conclusions.
I did a quick check using your code with numpy.abs for magnitude and numpy.angle for phase. It sure looks like a sinc() function to me, which is what would be expected if the time-domain signal is a square pulse. If you do this, you'll find a pretty wide sinc, as would be expected for a short-duration pulse on so few samples.
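For example (a sketch, not part of the original check): shifting the pulse changes only the phase, while the magnitude stays the same sinc-like curve.
import numpy as np

data = np.zeros(20)
data[9:12] = 10

dataFFT = np.fft.fft(data)
shiftedFFT = np.fft.fft(np.fft.fftshift(data))           # pulse centred on index 0 instead

print(np.allclose(np.abs(dataFFT), np.abs(shiftedFFT)))  # True: identical magnitude spectra
print(np.angle(dataFFT[:4]))                             # phases alternate between ~0 and ~pi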
You forgot to specify whether your data is real or complex.
Not everyone codes in Python/NumPy (including me), and if you do not know this then you are probably handling the data to/from the FFT the wrong way.
FFT input can be either real or complex domain.
FFT output is complex domain.
So check the docs for your FFT implementation, specify which case you have, and repair your data handling accordingly. Complex-domain data usually stores the Re value first and the Im value second, but that depends on the FFT implementation/configuration.
Here is an example of an impulse response through an FFT (image omitted): the first trace is the input real-domain signal (Im = 0), a single finite nonzero-width pulse; the second is the Re part of the FFT output; the third is the Im part of the FFT output.
Do not forget that different FFT implementations can have different normalization constants, which will change the amplitude of the signal. If you want magnitude and phase, convert like this:
mag=sqrt(Re*Re+Im*Im); // magnitude
ang=atanxy(Re,Im);     // phase angle
atanxy(dx,dy) is a 4-quadrant arctangent, also called atan2, but be careful to match the operand order your atanxy/atan2 implementation expects. You can also use my C++ atanxy implementation.
[Notes]
If your input signal is real domain then the FFT output is conjugate-symmetric. The Re part mirrors around the middle like
{ a0,a1,a2,a3,...,a(n-1),a(n-1),...,a3,a2,a1,a0 }
while the Im part is the same mirror with its sign flipped. On the left are the low frequencies and in the middle is the top frequency. If your input signal is complex domain then the output can be anything.
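A small NumPy check of this symmetry (a sketch added here, not part of the original answer):
import numpy as np

x = np.random.rand(16)                               # real-domain input
X = np.fft.fft(x)

print(np.allclose(X[1:], np.conj(X[1:][::-1])))      # True: conjugate-symmetric spectrum
print(np.allclose(X.imag[1:], -X.imag[1:][::-1]))    # True: Im part mirrored with flipped sign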
I have a power-law distribution of energies and I want to pick n random energies based on the distribution. I tried doing this manually using random numbers, but it is too inefficient for what I want to do. I'm wondering whether there is a method in numpy (or elsewhere) that works like numpy.random.normal, except that instead of using a normal distribution, the distribution may be specified. So in my mind an example might look like this (similar to numpy.random.normal):
import numpy as np

# Energies from within which I want values drawn
eMin = 50.
eMax = 2500.

# Amount of energies to be drawn
n = 10000

photons = []
for i in range(n):
    # Method that I just made up which would work like random.normal,
    # i.e. return an energy on the distribution based on its probability,
    # but take a distribution other than a normal distribution
    photons.append(np.random.distro(eMin, eMax, lambda e: e**(-1.)))

print(photons)
Printing photons should give me a list of length 10000 populated by energies in this distribution. If I were to histogram this it would have much greater bin values at lower energies.
I am not sure if such a method exists but it seems like it should. I hope it is clear what I want to do.
EDIT:
I have seen numpy.random.power but my exponent is -1 so I don't think this will work.
Sampling from arbitrary PDFs well is actually quite hard. There are large and dense books just about how to efficiently and accurately sample from the standard families of distributions.
It looks like you could probably get by with a custom inversion method for the example that you gave.
If you want to sample from an arbitrary distribution you need the inverse of the cumulative distribution function (not the pdf).
You then sample a probability uniformly from range [0,1] and feed this into the inverse of the cdf to get the corresponding value.
It is often not possible to obtain the cdf from the pdf analytically.
However, if you're happy to approximate the distribution, you could do so by calculating f(x) at many points over its domain, taking a cumulative sum of f(x)·Δx to get an approximation of the CDF, and from this approximating the inverse.
Rough code snippet:
import matplotlib.pyplot as plt
import numpy as np
import scipy.interpolate

def f(x):
    """
    substitute this function with your arbitrary distribution
    must be positive over domain
    """
    return 1/float(x)

#you should vary inputVals to cover the domain of f (for better accuracy you can
#be clever about spacing of values as well). Here I space them logarithmically
#up to 1 and then at regular intervals, but you could definitely do better
inputVals = np.hstack([10.0**np.arange(-6, 0, 0.01), np.arange(1, 10000)])

#everything else should just work
funcVals = np.array([f(x) for x in inputVals])
cdf = np.zeros(len(funcVals))
dx = np.diff(inputVals)                        # spacing between sample points of the domain
for i in range(1, len(funcVals)):
    cdf[i] = cdf[i-1] + funcVals[i-1]*dx[i-1]  # running integral of f -> unnormalised CDF
cdf /= cdf[-1]                                 # normalise so the CDF ends at 1

#you could also improve the approximation by choosing an appropriate interpolator
inverseCdf = scipy.interpolate.interp1d(cdf, inputVals)

#draw samples from the distribution by feeding uniform values through the inverse CDF
samples = [inverseCdf(u) for u in np.random.uniform(0, 1, size=100000)]
plt.hist(samples, bins=500)
plt.show()
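For the specific p(e) ∝ 1/e case in the question, the inverse CDF even has a closed form, so a direct sketch (an addition here, using the question's eMin and eMax) could be:
import numpy as np

# For p(e) proportional to 1/e on [eMin, eMax], the CDF is F(e) = ln(e/eMin) / ln(eMax/eMin),
# so the inverse is F^-1(u) = eMin * (eMax/eMin)**u.
eMin, eMax = 50.0, 2500.0
n = 10000

u = np.random.uniform(0.0, 1.0, size=n)
photons = eMin * (eMax / eMin) ** u        # samples follow the 1/e power law

counts, edges = np.histogram(photons, bins=50)
print(counts[:5], counts[-5:])             # far more counts in the low-energy bins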
Why don't you use eval and put the distribution in a string?
>>> import numpy
>>> cmd = "numpy.random.normal(500)"
>>> eval(cmd)
you can manipulate the string as you wish to set the distribution.