So I've got a signal, and I've tried fitting a curve to it using two methods that I thought should have been numerically equivalent, but apparently are not.
Method 1: Explicit fitting of sinusoids by least squares:
def curve(x, a0, a1, b1, a2, b2):
return a0 + a1*np.cos(x/720*2*math.pi) + b1*np.sin(x/720*2*math.pi) + a2*np.cos(x/720*2*math.pi*2) + b2*np.sin(x/720*2*math.pi*2)
def fit_curve(xdata, ydata):
guess = [10, 0, 0, 0, 0]
params, params_covariance = optimize.curve_fit(curve, xdata, ydata, guess)
return params, params_covariance
Method 2: Use of inbuilt FFT algorithm to do the same thing:
f = np.fft.rfft(y,3)
curve = np.fft.irfft(f, width)
I have two problems. The first one is minor, the FFT is 'out of scale', so I apply a scaling factor mean(y)/mean(curve) to fix it, which is a bit of a hack. I'm not sure why this is the case.
The main problem I have is that I believe these should produce almost identical results, but they don't. The explicit fitting produces a tighter fit every time than the FFT results- my question is, should it?
One can find discrete fourier transform coefficients using linear algebra, though I imagine it's only useful to understand the DFT better. The code below demonstrates this. To find coefficients and phases for a sine series will take a little bit more work, but shouldn't be too hard. The Wikipedia article cited in the code comments might help.
Note that one doesn't need scipy.optimize.curve_fit, or even linear least-squares. In fact, although I've used numpy.linalg.solve below, that is unnecessary since basis is a unitary matrix times a scale factor.
from __future__ import division, print_function
import numpy
# points in time series
n= 101
# final time (initial time is 0)
tfin= 10
# *end of changeable parameters*
# stepsize
dt= tfin/(n-1)
# sample count
s= numpy.arange(n)
# signal; somewhat arbitrary
y= numpy.sinc(dt*s)
# DFT
fy= numpy.fft.fft(y)
# frequency spectrum in rad/sample
wps= numpy.linspace(0,2*numpy.pi,n+1)[:-1]
# basis for DFT
# see, e.g., http://en.wikipedia.org/wiki/Discrete_Fourier_transform#equation_Eq.2
# and section "Properties -> Orthogonality"; the columns of 'basis' are the u_k vectors
# described there
basis= 1.0/n*numpy.exp(1.0j * wps * s[:,numpy.newaxis])
# reconstruct signal from DFT coeffs and basis
recon_y= numpy.dot(basis,fy)
# expect yerr to be "small"
yerr= numpy.max(numpy.abs(y-recon_y))
print('yerr:',yerr)
# find coefficients by fitting to basis
lin_fy= numpy.linalg.solve(basis,y)
# fyerr should also be "small"
fyerr= numpy.max(numpy.abs(fy-lin_fy))
print('fyerr',fyerr)
On my system this gives
yerr: 2.20721480995e-14
fyerr 1.76885950227e-13
Tested on Ubuntu 14.04 with Python 2.7 and 3.4.
Take a look at the docstring for np.fft.rfft. In particular, this: "If n is smaller than the length of the input, the input is cropped." When you do this:
f = np.fft.rfft(y,3)
you are computing the FFT of the first three data points in y, not the first three Fourier coefficients of y.
Related
I would like to implement a function in python (using numpy) that takes a mathematical function (for ex. p(x) = e^(-x) like below) as input and generates random numbers, that are distributed according to that mathematical-function's probability distribution. And I need to plot them, so we can see the distribution.
I need actually exactly a random number generator function for exactly the following 2 mathematical functions as input, but if it could take other functions, why not:
1) p(x) = e^(-x)
2) g(x) = (1/sqrt(2*pi)) * e^(-(x^2)/2)
Does anyone have any idea how this is doable in python?
For simple distributions like the ones you need, or if you have an easy to invert in closed form CDF, you can find plenty of samplers in NumPy as correctly pointed out in Olivier's answer.
For arbitrary distributions you could use Markov-Chain Montecarlo sampling methods.
The simplest and maybe easier to understand variant of these algorithms is Metropolis sampling.
The basic idea goes like this:
start from a random point x and take a random step xnew = x + delta
evaluate the desired probability distribution in the starting point p(x) and in the new one p(xnew)
if the new point is more probable p(xnew)/p(x) >= 1 accept the move
if the new point is less probable randomly decide whether to accept or reject depending on how probable1 the new point is
new step from this point and repeat the cycle
It can be shown, see e.g. Sokal2, that points sampled with this method follow the acceptance probability distribution.
An extensive implementation of Montecarlo methods in Python can be found in the PyMC3 package.
Example implementation
Here's a toy example just to show you the basic idea, not meant in any way as a reference implementation. Please refer to mature packages for any serious work.
def uniform_proposal(x, delta=2.0):
return np.random.uniform(x - delta, x + delta)
def metropolis_sampler(p, nsamples, proposal=uniform_proposal):
x = 1 # start somewhere
for i in range(nsamples):
trial = proposal(x) # random neighbour from the proposal distribution
acceptance = p(trial)/p(x)
# accept the move conditionally
if np.random.uniform() < acceptance:
x = trial
yield x
Let's see if it works with some simple distributions
Gaussian mixture
def gaussian(x, mu, sigma):
return 1./sigma/np.sqrt(2*np.pi)*np.exp(-((x-mu)**2)/2./sigma/sigma)
p = lambda x: gaussian(x, 1, 0.3) + gaussian(x, -1, 0.1) + gaussian(x, 3, 0.2)
samples = list(metropolis_sampler(p, 100000))
Cauchy
def cauchy(x, mu, gamma):
return 1./(np.pi*gamma*(1.+((x-mu)/gamma)**2))
p = lambda x: cauchy(x, -2, 0.5)
samples = list(metropolis_sampler(p, 100000))
Arbitrary functions
You don't really have to sample from proper probability distributions. You might just have to enforce a limited domain where to sample your random steps3
p = lambda x: np.sqrt(x)
samples = list(metropolis_sampler(p, 100000, domain=(0, 10)))
p = lambda x: (np.sin(x)/x)**2
samples = list(metropolis_sampler(p, 100000, domain=(-4*np.pi, 4*np.pi)))
Conclusions
There is still way too much to say, about proposal distributions, convergence, correlation, efficiency, applications, Bayesian formalism, other MCMC samplers, etc.
I don't think this is the proper place and there is plenty of much better material than what I could write here available online.
The idea here is to favor exploration where the probability is higher but still look at low probability regions as they might lead to other peaks. Fundamental is the choice of the proposal distribution, i.e. how you pick new points to explore. Too small steps might constrain you to a limited area of your distribution, too big could lead to a very inefficient exploration.
Physics oriented. Bayesian formalism (Metropolis-Hastings) is preferred these days but IMHO it's a little harder to grasp for beginners. There are plenty of tutorials available online, see e.g. this one from Duke university.
Implementation not shown not to add too much confusion, but it's straightforward you just have to wrap trial steps at the domain edges or make the desired function go to zero outside the domain.
NumPy offers a wide range of probability distributions.
The first function is an exponential distribution with parameter 1.
np.random.exponential(1)
The second one is a normal distribution with mean 0 and variance 1.
np.random.normal(0, 1)
Note that in both case, the arguments are optional as these are the default values for these distributions.
As a sidenote, you can also find those distributions in the random module as random.expovariate and random.gauss respectively.
More general distributions
While NumPy will likely cover all your needs, remember that you can always compute the inverse cumulative distribution function of your distribution and input values from a uniform distribution.
inverse_cdf(np.random.uniform())
By example if NumPy did not provide the exponential distribution, you could do this.
def exponential():
return -np.log(-np.random.uniform())
If you encounter distributions which CDF is not easy to compute, then consider filippo's great answer.
Suppose 'h' is a function of x,y,z and t and it gives us a graph line (t,h) (simulated). At the same time we also have observed graph (observed values of h against t). How can I reduce the difference between observed (t,h) and simulated (t,h) graph by optimizing values of x,y and z? I want to change the simulated graph so that it imitates closer and closer to the observed graph in MATLAB/Python. In literature I have read that people have done same thing by Lavenberg-marquardt algorithm but don't know how to do it?
You are actually trying to fit the parameters x,y,z of the parametrized function h(x,y,z;t).
MATLAB
You're right that in MATLAB you should either use lsqcurvefit of the Optimization toolbox, or fit of the Curve Fitting Toolbox (I prefer the latter).
Looking at the documentation of lsqcurvefit:
x = lsqcurvefit(fun,x0,xdata,ydata);
It says in the documentation that you have a model F(x,xdata) with coefficients x and sample points xdata, and a set of measured values ydata. The function returns the least-squares parameter set x, with which your function is closest to the measured values.
Fitting algorithms usually need starting points, some implementations can choose randomly, in case of lsqcurvefit this is what x0 is for. If you have
h = #(x,y,z,t) ... %// actual function here
t_meas = ... %// actual measured times here
h_meas = ... %// actual measured data here
then in the conventions of lsqcurvefit,
fun <--> #(params,t) h(params(1),params(2),params(3),t)
x0 <--> starting guess for [x,y,z]: [x0,y0,z0]
xdata <--> t_meas
ydata <--> h_meas
Your function h(x,y,z,t) should be vectorized in t, such that for vector input in t the return value is the same size as t. Then the call to lsqcurvefit will give you the optimal set of parameters:
x = lsqcurvefit(#(params,t) h(params(1),params(2),params(3),t),[x0,y0,z0],t_meas,h_meas);
h_fit = h(x(1),x(2),x(3),t_meas); %// best guess from curve fitting
Python
In python, you'd have to use the scipy.optimize module, and something like scipy.optimize.curve_fit in particular. With the above conventions you need something along the lines of this:
import scipy.optimize as opt
popt,pcov = opt.curve_fit(lambda t,x,y,z: h(x,y,z,t), t_meas, y_meas, p0=[x0,y0,z0])
Note that the p0 starting array is optional, but all parameters will be set to 1 if it's missing. The result you need is the popt array, containing the optimal values for [x,y,z]:
x,y,z = popt
h_fit = h(x,y,z,t_meas)
The data that i have is stored in a 2D list where one column represents a frequency and the other column is its corresponding dB. I would like to programmatically identify the frequency of the 3db points on either end of the passband. I have two ideas on how to do this but they both have drawbacks.
Find maximum point then the average of points in the passband then find points about 3dB lower
Use the sympy library to perform numerical differentiation and identify the critical points/inflection points
use a histogram/bin function to find the amplitude of the passband.
drawbacks
sensitive to spikes, not quite sure how to do this
i don't under stand the math involved and the data is noisy which could lead to a lot of false positives
correlating the amplitude values with list index values could be tricky
Can you think of better ideas and/or ways to implement what I have described?
Assuming that you've loaded multiple readings of the PSD from the signal analyzer, try averaging them before attempting to find the bandedges. If the signal isn't changing too dramatically, the averaging process might smooth away any peaks and valleys and noise within the passband, making it easier to find the edges. This is what many spectrum analyzers can do to make for a smoother PSD.
In case that wasn't clear, assume that each reading gives you 128 tuples of the frequency and power and that you capture 100 of these buffers of data. Now average the 100 samples from bin 0, then samples from 1, 2, ..., 128. Now try and locate the bandpass on this data. It should be easier than on any single buffer. Note I used 100 as an example. If your data is very noisy, it may require more. If there isn't much noise, fewer.
Be careful when doing the averaging. Your data is in dB. To add the samples together in order to find an average, you must first convert the dB data back to decimal, do the adds, do the divide to find the average, and then convert the averaged power back into dB.
Ok it seems this has to be solved by data analysis. I would propose these steps:
Preprocess you data if you suspect it to bee too noisy. I'd suggest either moving-average filter (sp.convolve(data, sp.ones(n)/n, "same")) or better a savitzky-golay-filter (sp.signal.savgol_filter(data, n, polyorder=3)) because you will be interested in extrema of the data, which will be unnecessarily distorted by the ma filter. You might also want to get rid of artifacts like 60Hz noise at this stage.
If the signal you are interested in lives in a narrow band, the spectrum will be a single pronounced peak. In that case you could just fit a curve to your data, a gaussian would be appropriate in that case.
import scipy as sp
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
freq, pow = read_in_your_data_here()
freq, pow = sp.asarray(freq), sp.asarray(pow)
def gauss(x, a, mu, sig):
return a**sp.exp(-(x-mu)**2/(2.*sig**2))
(a, mu, sig), _ = curve_fit(gauss, freq, pow)
fitted_curve = gauss(freq, a, mu, sig)
plt.plot(freq, pow)
plt.plot(freq, fitted_curve)
plt.vlines(mu, min(pow)-2, max(pow)+2)
plt.show()
center_idx = sp.absolute(freq-mu).argmin()
pow_center = pow[center_idx]
pow_3db = pow_center - 3.
def interv_from_binvec(data):
indicator = sp.convolve(data, [-1,1], "same")
return indicator.argmin(), indicator.argmax()
passband_idx = interv_from_binvec(pow > pow_3db)
passband = freq[passband_idx[0]], freq[passband_idx[1]]
This is more an example than a solution, and relies heavily on the assumption the you are searching and finding a high SNR signal with a narrow band. It could be extended to handle more than one signal by use of a mixture model.
You can use scipy's UnivariateSpline and leastsq methods:
Create a spline of y-(np.max(y)-3)
Find the roots of it.
Calculate the difference between the two roots.
from scipy.interpolate import UnivariateSpline
from scipy.optimize import leastsq
x = df["Wavelength / nm"]
y = df["Power / dBm"]
#create spline
spline = UnivariateSpline(x, y-(np.max(y)-3), s=0)
# find the roots
r1, r2 = spline.roots()
# calculate the difference
threedB_bandwidth = abs(r2-r1)
I am following this link to do a smoothing of my data set.
The technique is based on the principle of removing the higher order terms of the Fourier Transform of the signal, and so obtaining a smoothed function.
This is part of my code:
N = len(y)
y = y.astype(float) # fix issue, see below
yfft = fft(y, N)
yfft[31:] = 0.0 # set higher harmonics to zero
y_smooth = fft(yfft, N)
ax.errorbar(phase, y, yerr = err, fmt='b.', capsize=0, elinewidth=1.0)
ax.plot(phase, y_smooth/30, color='black') #arbitrary normalization, see below
However some things do not work properly.
Indeed, you can check the resulting plot :
The blue points are my data, while the black line should be the smoothed curve.
First of all I had to convert my array of data y by following this discussion.
Second, I just normalized arbitrarily to compare the curve with data, since I don't know why the original curve had values much higher than the data points.
Most importantly, the curve is like "specular" to the data point, and I don't know why this happens.
It would be great to have some advices especially to the third point, and more generally how to optimize the smoothing with this technique for my particular data set shape.
Your problem is probably due to the shifting that the standard FFT does. You can read about it here.
Your data is real, so you can take advantage of symmetries in the FT and use the special function np.fft.rfft
import numpy as np
x = np.arange(40)
y = np.log(x + 1) * np.exp(-x/8.) * x**2 + np.random.random(40) * 15
rft = np.fft.rfft(y)
rft[5:] = 0 # Note, rft.shape = 21
y_smooth = np.fft.irfft(rft)
plt.plot(x, y, label='Original')
plt.plot(x, y_smooth, label='Smoothed')
plt.legend(loc=0)
plt.show()
If you plot the absolute value of rft, you will see that there is almost no information in frequencies beyond 5, so that is why I choose that threshold (and a bit of playing around, too).
Here the results:
From what I can gather you want to build a low pass filter by doing the following:
Move to the frequency domain. (Fourier transform)
Remove undesired frequencies.
Move back to the time domain. (Inverse fourier transform)
Looking at your code, instead of doing 3) you're just doing another fourier transform. Instead, try doing an actual inverse fourier transform to move back to the time domain:
y_smooth = ifft(yfft, N)
Have a look at scipy signal to see a bunch of already available filters.
(Edit: I'd be curious to see the results, do share!)
I would be very cautious in using this technique. By zeroing out frequency components of the FFT you are effectively constructing a brick wall filter in the frequency domain. This will result in convolution with a sinc in the time domain and likely distort the information you want to process. Look up "Gibbs phenomenon" for more information.
You're probably better off designing a low pass filter or using a simple N-point moving average (which is itself a LPF) to accomplish the smoothing.
I have two lists .
import numpy
x = numpy.array([7250, ... list of 600 ints ... ,7849])
y = numpy.array([2.4*10**-16, ... list of 600 floats ... , 4.3*10**-16])
They make a U shaped curve.
Now I want to fit a gaussian to that curve.
from scipy.optimize import curve_fit
n = len(x)
mean = sum(y)/n
sigma = sum(y - mean)**2/n
def gaus(x,a,x0,sigma,c):
return a*numpy.exp(-(x-x0)**2/(2*sigma**2))+c
popt, pcov = curve_fit(gaus,x,y,p0=[-1,mean,sigma,-5])
pylab.plot(x,y,'r-')
pylab.plot(x,gaus(x,*popt),'k-')
pylab.show()
I just end up with the noisy original U-shaped curve and a straight horizontal line running through the curve.
I am not sure what the -1 and the -5 represent in the above code but I am sure that I need to adjust them or something else to get the gaussian curve. I have been playing around with possible values but to no avail.
Any ideas?
First of all, your variable sigma is actually variance, i.e. sigma squared --- http://en.wikipedia.org/wiki/Variance#Definition.
This confuses the curve_fit by giving it a suboptimal starting estimate.
Then, your fitting ansatz, gaus, includes an amplitude a and an offset, is this what you actually need? And the starting values are a=-1 (negated bell shape) and offset c=-5. Where do they come from?
Here's what I'd do:
fix your fitting model. Do you want just a gaussian, does it need to be normalized. If it does, then the amplitude a is fixed by sigma etc.
Have a look at the actual data. What's the tail (offset), what's the sign (amplitude sign).
If you're actually want just a gaussian without any bells and whistles, you might not actually need curve_fit: a gaussian is fully defined by two first moments, mean and sigma. Calculate them as you do, plot them over the data and see if you're not all set.
p0 in your call to curve_fit gives the initial guesses for the additional parameters of you function in addition to x. In the above code you are saying that I want the curve_fit function to use -1 as the initial guess for a, -5 as the initial guess for c, mean as the initial guess for x0, and sigma as the guess for sigma. The curve_fit function will then adjust these parameters to try and get a better fit. The problem is your initial guesses at your function parameters are really bad given the order of (x,y)s.
Think a little bit about the order of magnitude of your different parameters for the Gaussian. a should be around the size of your y values (10**-16) as at the peak of the Gaussian the exponential part will never be larger than 1. x0 will give the position within your x values at which the exponential part of your Gaussian will be 1, so x0 should be around 7500, probably somewhere in the centre of your data. Sigma indicates the width, or spread of your Gaussian, so perhaps something in the 100's just a guess. Finally c is just an offset to shift the whole Gaussian up and down.
What I would recommend doing, is before fitting the curve, pick some values for a, x0, sigma, and c that seem reasonable and just plot the data with the Gaussian, and play with a, x0, sigma, and c until you get something that looks at least some what the way you want the Gaussian to fit, then use those as the starting points for curve_fit p0 values. The values I gave should get you started, but may not do exactly what you want. For instance a probably needs to be negative if you want to flip the Gaussian to get a "U" shape.
Also printing out the values that curve_fit thinks are good for your a,x0,sigma, and c might help you see what it is doing and if that function is on the right track to minimizing the residual of the fit.
I have had similar problems doing curve fitting with gnuplot, if the initial values are too far from what you want to fit it goes in completely the wrong direction with the parameters to minimize the residuals, and you could probably do better by eye. Think of these functions as a way to fine tune your by eye estimates of these parameters.
hope that helps
I don't think you are estimating your initial guesses for mean and sigma correctly.
Take a look at the SciPy Cookbook here
I think it should look like this.
x = numpy.array([7250, ... list of 600 ints ... ,7849])
y = numpy.array([2.4*10**-16, ... list of 600 floats ... , 4.3*10**-16])
n = len(x)
mean = sum(x*y)/sum(y)
sigma = sqrt(abs(sum((x-mean)**2*y)/sum(y)))
def gaus(x,a,x0,sigma,c):
return a*numpy.exp(-(x-x0)**2/(2*sigma**2))+c
popy, pcov = curve_fit(gaus,x,y,p0=[-max(y),mean,sigma,min(x)+((max(x)-min(x)))/2])
pylab.plot(x,gaus(x,*popt))
If anyone has a link to a simple explanation why these are the correct moments I would appreciate it. I am going on faith that SciPy Cookbook got it right.
Here is the solution thanks to everyone .
x = numpy.array([7250, ... list of 600 ints ... ,7849])
y = numpy.array([2.4*10**-16, ... list of 600 floats ... , 4.3*10**-16])
n = len(x)
mean = sum(x)/n
sigma = math.sqrt(sum((x-mean)**2)/n)
def gaus(x,a,x0,sigma,c):
return a*numpy.exp(-(x-x0)**2/(2*sigma**2))+c
popy, pcov = curve_fit(gaus,x,y,p0=[-max(y),mean,sigma,min(x)+((max(x)-min(x)))/2])
pylab.plot(x,gaus(x,*popt))
Maybe it is because I use matlab and fminsearch or my fits have to work on much fewer datapoints (~ 5-10), I have much better results with the following starter values (as simple as they are):
a = max(y)-min(y);
imax= find(y==max(y),1);
mean = x(imax);
avg = sum(x.*y)./sum(y);
sigma = sqrt(abs(sum((x-avg).^2.*y) ./ sum(y)));
c = min(y);
The sigma works fine.