I have two datasets from two sources covering one signal and want to find the factor between the two.
They have different resolutions in x and y, and one set is noisier than the other.
The following gives a simple approximation, though the actual data does not follow an easy-to-fit polynomial.
import numpy as np
import matplotlib.pyplot as plt
datax1 = np.linspace(0,100,1000)
datay1 = np.around(datax1,-1)**2
datax2 = np.linspace(0,100,80)+np.random.normal(0,0.2,80)
datay2 = (datax2**2)*np.random.normal(5,0.5)+np.random.normal(0,500,80)
plt.title('Data 1 VS Data 2')
plt.plot(datax1,datay1,'b',label='Data 1')
plt.plot(datax2,datay2,'r',label='Data 2')
plt.legend()
plt.savefig('img.png', bbox_inches='tight', dpi=72)
Figure: similar data, different noise and resolution
I need to automate finding this factor since I have more datasets to analyse, but SciPy's curve_fit does not play nice with interpolate as
import scipy.optimize as opt
import scipy.interpolate as interp
def func(x,k):
    fun=interp(datax1,datay1*k)
    return fun(x)
print opt.curve_fit(func,datax2,datay2)
only returns TypeError: 'module' object is not callable at the definition of fun
Is there any way to do this with numpy or scipy or do I have to build my own least-squares function to find the scaling of the data?
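One way to sidestep the TypeError is to call an interpolation function rather than the scipy.interpolate module itself. The following is only a minimal sketch under assumptions that are not in the question: it uses np.interp for the interpolation, a fixed "true" factor of 5, and a seeded random generator so the result is repeatable. Because the model is linear in k, the least-squares factor also has a closed form, which is computed for comparison.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# Reference data: fine x grid, coarse (stepped) y values
datax1 = np.linspace(0, 100, 1000)
datay1 = np.around(datax1, -1) ** 2

# Second data set: fewer, noisier points, scaled by an assumed factor of 5
true_k = 5.0
datax2 = np.linspace(0, 100, 80) + rng.normal(0, 0.2, 80)
datay2 = true_k * datax2 ** 2 + rng.normal(0, 500, 80)

def model(x, k):
    # np.interp evaluates data set 1 at arbitrary x (clamped outside 0..100)
    return k * np.interp(x, datax1, datay1)

popt, pcov = curve_fit(model, datax2, datay2, p0=[1.0])
k_fit = popt[0]

# Closed-form least-squares solution for a pure scale factor
f = np.interp(datax2, datax1, datay1)
k_closed = np.dot(f, datay2) / np.dot(f, f)

print(k_fit, k_closed)
```

Both estimates should agree closely; the closed form avoids the optimizer entirely when only a multiplicative factor is needed.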
savgol_filter gives me the smoothed series.
I want to get the underlying polynomial function,
i.e. the function of the red line in the picture below,
so that I can extrapolate a point beyond the given x range
or find the slope of the function at the two extreme data points.
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
yhat = savgol_filter(y, 51, 3) # window size 51, polynomial order 3
plt.plot(x,y)
plt.plot(x,yhat, color='red')
plt.show()
Since the filter uses least-squares regression to fit the data in a small window to a polynomial of a given degree, you can probably only extrapolate from the ends. I think the fitted curve is a piecewise function of these 'fits', and no single one of them would be a good representation of the data as a whole. What you could do is take the end windows of your data and fit them to the same polynomial degree as the Savitzky-Golay filter (using NumPy's polyfit). The fit will likely not be accurate very far from the window, though.
You can also use scipy.signal.savgol_coeffs() to get the coefficients of the filter. I think you take the dot product of the coefficient array with a window of your data to get the value at a point (the use='dot' option orders the coefficients for this). You can include the deriv argument to get the slope at the ends of your data.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.savgol_coeffs.html#scipy.signal.savgol_coeffs
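The end-window idea can be sketched as follows. This is a minimal example, not a definitive implementation: the seeded generator and the 0.1 extrapolation step are assumptions made here for illustration. As far as I understand, savgol_filter's default mode='interp' treats the edges in the same way, so the fitted polynomial should reproduce the filter output at the last sample.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x) + rng.random(100) * 0.2

window, order = 51, 3
yhat = savgol_filter(y, window, order)

# Fit a cubic to the last `window` samples, as the filter does at the edge
right = np.poly1d(np.polyfit(x[-window:], y[-window:], order))

# Slope of that polynomial at the last data point
slope_right = right.deriv()(x[-1])

# Extrapolate a short distance beyond the data (0.1 is an arbitrary step)
x_new = x[-1] + 0.1
y_extrap = right(x_new)

print(right(x[-1]), yhat[-1], slope_right, y_extrap)
```

The same can be done with x[:window] and y[:window] for the left-hand end. The further x_new moves from the window, the less the cubic should be trusted.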
I have the following question:
Write a Python program to generate data that uses the sum of a random variable (which has a Gaussian distribution) and a 4th-degree polynomial equation (3x^4 + x^3 + 3x^2 + 4x + 5). Using least squares polynomial fit, curve the generated data using a model until your model can accurately predict all values
with the following start on the question:
import random
import numpy as np
import matplotlib.pyplot as plt
def mainFunc():
    poly_coeff=[3,1,3,4,5]
    poly=np.poly1d(poly_coeff)
    print(poly)
    y = poly(random.randint(0,10)) + min(10,max(0,random.gauss(2,3)))
    x=np.arange(-10,10)
    curvefit=np.polyfit(x,y,4)
    y_new=np.polyfit(curvefit,x)
    plt.plot(x,y, '-or')
    plt.plot(x,y_new, '-b')
    plt.show()
mainFunc()
Can anyone help with the array error that is being generated?
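Two things in the snippet above look like the cause of the array error: y is a single number while x is an array of 20 values, and the fitted coefficients are passed to np.polyfit again instead of being evaluated with np.polyval. A minimal sketch of a corrected version follows; the noise level (standard deviation 3), the number of points, and the seed are assumptions chosen here, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

poly = np.poly1d([3, 1, 3, 4, 5])  # 3x^4 + x^3 + 3x^2 + 4x + 5
x = np.linspace(-10, 10, 50)

# One noisy y value per x value: polynomial plus Gaussian noise
y = poly(x) + rng.normal(0, 3, size=x.shape)

# Least-squares fit of a 4th-degree polynomial
coeffs = np.polyfit(x, y, 4)

# Evaluate the fitted polynomial at the same x values
y_fit = np.polyval(coeffs, x)

print(coeffs)
```

Plotting x against y and x against y_fit, as in the original attempt, then shows the noisy data and the fitted curve.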
I measured the fluorescence intensity of thousands of particles and made a histogram, which showed two adjacent Gaussian curves. How can I use Python or one of its packages to separate them into two Gaussian curves and make two new plots?
Thank you.
Basically, you need to infer parameters for your Gaussian mixture. I will generate a similar dataset for the illustration.
Generating mixtures with known parameters
from itertools import starmap
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import mlab
sns.set(color_codes=True)
# inline plots in jupyter notebook
%matplotlib inline
# generate synthetic data from a mixture of two Gaussians with equal weights
# the solution below readily generalises to more components
nsamples = 10000
means = [30, 120]
sds = [10, 50]
weights = [0.5, 0.5]
draws = np.random.multinomial(nsamples, weights)
samples = np.concatenate(
list(starmap(np.random.normal, zip(means, sds, draws)))
)
Plot the distribution
sns.histplot(samples, kde=True, stat='density')  # distplot is deprecated in recent seaborn releases
Infer parameters
from sklearn.mixture import GaussianMixture
mixture = GaussianMixture(n_components=2).fit(samples.reshape(-1, 1))
means_hat = mixture.means_.flatten()
weights_hat = mixture.weights_.flatten()
sds_hat = np.sqrt(mixture.covariances_).flatten()
print(mixture.converged_)
print(means_hat)
print(sds_hat)
print(weights_hat)
We get:
True
[ 122.57524745 29.97741112]
[ 48.18013893 10.44561398]
[ 0.48559771 0.51440229]
You can tweak GaussianMixture's hyper-parameters to improve fit, but this looks fine enough. Now we can plot each component (I'm only plotting the first one):
from scipy.stats import norm  # mlab.normpdf was removed from recent Matplotlib releases
mu1_h, sd1_h = means_hat[0], sds_hat[0]
x_axis = np.linspace(mu1_h-3*sd1_h, mu1_h+3*sd1_h, 1000)
plt.plot(x_axis, norm.pdf(x_axis, mu1_h, sd1_h))
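To get both components rather than only the first, each Gaussian can be scaled by its mixture weight so that the curves add up to the overall density. The sketch below is illustrative only: it hard-codes the estimates printed above instead of refitting, and the x range of -50 to 300 is an assumption chosen to cover both components.

```python
import numpy as np
from scipy.stats import norm

# Estimates copied from the printed output above
means_hat = np.array([122.57524745, 29.97741112])
sds_hat = np.array([48.18013893, 10.44561398])
weights_hat = np.array([0.48559771, 0.51440229])

x_axis = np.linspace(-50, 300, 1000)

# One weighted density curve per component
components = [
    w * norm.pdf(x_axis, mu, sd)
    for w, mu, sd in zip(weights_hat, means_hat, sds_hat)
]

# Their sum approximates the density of the whole sample
mixture_pdf = np.sum(components, axis=0)

# Rough check: the mixture density should integrate to about 1
dx = x_axis[1] - x_axis[0]
total_area = mixture_pdf.sum() * dx
print(total_area)
```

Each entry of components can be passed to plt.plot(x_axis, ...) on its own axes to make the two separate plots.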
P.S.
On a side note: it seems like you are dealing with constrained data, and your observations are pretty close to the left constraint (zero). While Gaussians might approximate your data well enough, you should tread carefully, because Gaussians assume unconstrained geometry.
I have the following graph that I want to digitize to a high-quality publication grade figure using Python and Matplotlib:
I used a digitizer program to grab a few samples from one of the 3 data sets:
x_data = np.array([
1,
1.2371,
1.6809,
2.89151,
5.13304,
9.23238,
])
y_data = np.array([
0.0688824,
0.0490012,
0.0332843,
0.0235889,
0.0222304,
0.0245952,
])
I have already tried 3 different methods of fitting a curve through these data points. The first method was to draw a spline through the points using spline from scipy.interpolate.
This results in (with the actual data points drawn as blue markers):
This is obviously no good.
My second attempt was to fit a curve using a series of polynomials of different orders with curve_fit from scipy.optimize. Even up to a fourth-order polynomial the answer is useless (the lower-order ones were even more useless):
Finally, I used interp1d from scipy.interpolate to try to interpolate between the data points. Linear interpolation obviously yields the expected results, but the lines are straight and the whole purpose of this exercise is to get a nice smooth curve:
If I then use cubic interpolation I get a rubbish result; quadratic interpolation, however, yields a slightly better result:
But it's not quite there yet, and I don't think interp1d can do higher-order interpolation.
Is there anyone out there who has a good method of doing this? Maybe I would be better off trying to do it in IPE or something?
Thank you!
A standard cubic spline is not very good at producing reasonable-looking interpolations between data points that are very unevenly spaced. Fortunately, there are plenty of other interpolation algorithms, and SciPy provides a number of them. Here are a few applied to your data:
import numpy as np
# scipy.interpolate.spline was removed from recent SciPy releases and was unused here
from scipy.interpolate import UnivariateSpline, Akima1DInterpolator, PchipInterpolator
import matplotlib.pyplot as plt
x_data = np.array([1, 1.2371, 1.6809, 2.89151, 5.13304, 9.23238])
y_data = np.array([0.0688824, 0.0490012, 0.0332843, 0.0235889, 0.0222304, 0.0245952])
x_data_smooth = np.linspace(min(x_data), max(x_data), 1000)
fig, ax = plt.subplots(1,1)
# quadratic spline; s=0 forces it through every data point
spl = UnivariateSpline(x_data, y_data, s=0, k=2)
y_data_smooth = spl(x_data_smooth)
ax.plot(x_data_smooth, y_data_smooth, 'b', label='UnivariateSpline (k=2)')
# Akima interpolation
akima = Akima1DInterpolator(x_data, y_data)
y_data_smooth = akima(x_data_smooth)
ax.plot(x_data_smooth, y_data_smooth, 'g', label='Akima')
# PCHIP interpolation
pchip = PchipInterpolator(x_data, y_data)
y_data_smooth = pchip(x_data_smooth)
ax.plot(x_data_smooth, y_data_smooth, 'k', label='PCHIP')
# original digitized points
ax.scatter(x_data, y_data)
ax.legend()
plt.show()
I suggest looking through these, and also a few others, and finding one that matches what you think looks right. Also, though, you may want to sample a few more points. For example, I think the PCHIP algorithm wants to keep the fit monotonic between data points, so digitizing your minimum point would be useful (and probably a good idea regardless of the algorithm you use).
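The point about PCHIP staying monotonic between data points can be checked numerically. The sketch below compares the value range of a PCHIP curve with that of a standard cubic spline on the digitized samples; using CubicSpline as the comparison is a choice made here for illustration. Because PCHIP never overshoots between neighbouring points, its curve cannot dip below the lowest digitized sample, which is why digitizing the true minimum matters.

```python
import numpy as np
from scipy.interpolate import CubicSpline, PchipInterpolator

x_data = np.array([1, 1.2371, 1.6809, 2.89151, 5.13304, 9.23238])
y_data = np.array([0.0688824, 0.0490012, 0.0332843, 0.0235889, 0.0222304, 0.0245952])

xs = np.linspace(x_data.min(), x_data.max(), 1000)

pchip_y = PchipInterpolator(x_data, y_data)(xs)
cubic_y = CubicSpline(x_data, y_data)(xs)

# Compare the range of each curve with the range of the samples
print("data :", y_data.min(), y_data.max())
print("PCHIP:", pchip_y.min(), pchip_y.max())
print("cubic:", cubic_y.min(), cubic_y.max())
```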
How can I plot the following noisy data with a smooth, continuous line without considering each individual value? I would like to only show the behavior in a nicer way, without caring about noisy and extreme values. This is the code I am using:
import numpy
import sys
import matplotlib.pyplot as plt
from scipy.interpolate import spline
dataset = numpy.genfromtxt(fname='data', delimiter=",")
dic = {}
for d in dataset:
    dic[d[0]] = d[1]
plt.plot(range(len(dic)), dic.values(),linestyle='-', linewidth=2)
plt.savefig('plot.png')
plt.show()
In a previous answer, I was introduced to the Savitzky Golay filter, a particular type of low-pass filter, well adapted for data smoothing. How "smooth" you want your resulting curve to be is a matter of preference, and this can be adjusted by both the window-size and the order of the interpolating polynomial. Using the cookbook example for sg_filter:
import numpy as np
import sg_filter
import matplotlib.pyplot as plt
# Generate some sample data similar to your post
X = np.arange(1,1000,1)
Y = np.log(X**3) + 10*np.random.random(X.shape)
Y2 = sg_filter.savitzky_golay(Y, 101, 3)
plt.plot(X,Y,linestyle='-', linewidth=2,alpha=.5)
plt.plot(X,Y2,color='r')
plt.show()
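The sg_filter module above comes from the cookbook and has to be saved locally. If SciPy is available, scipy.signal.savgol_filter should give an equivalent result with the same window size and polynomial order. A minimal sketch, with a seeded generator added here only for repeatability:

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# Same kind of sample data as above
X = np.arange(1, 1000, 1)
Y = np.log(X ** 3) + 10 * rng.random(X.shape)

# Window size 101, polynomial order 3
Y2 = savgol_filter(Y, 101, 3)

# The smoothed series has much smaller point-to-point jumps
print(np.std(np.diff(Y)), np.std(np.diff(Y2)))
```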
There is more than one way to do it!
Here I show how to reduce noise using a variety of techniques:
Moving average
LOWESS regression
Low pass filter
Interpolation
Sticking with @Hooked's example data for consistency:
import numpy as np
import matplotlib.pyplot as plt
X = np.arange(1, 1000, 1)
Y = np.log(X ** 3) + 10 * np.random.random(X.shape)
plt.plot(X, Y, alpha = .5)
plt.show()
Moving average
Sometimes all you need is a moving average.
For example, using pandas with a window size of 100:
import pandas as pd
df = pd.DataFrame(Y, X)
df_mva = df.rolling(100).mean() # moving average with a window size of 100
df_mva.plot(legend = False);
You will probably have to try several window sizes with your data. Note that the first 99 values of df_mva will be NaN (a window of 100 needs 100 samples before it produces a value), but these can be removed with the dropna method.
Usage details for the pandas rolling function.
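If pandas is not available, the same moving average can be sketched with plain NumPy. This is an alternative shown for illustration, not the method used above; mode="valid" returns only the fully covered windows, so there are no NaN values to drop.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
X = np.arange(1, 1000, 1)
Y = np.log(X ** 3) + 10 * rng.random(X.shape)

n = 100                    # window size
kernel = np.ones(n) / n    # equal weights that sum to 1

# One average per fully covered window: len(Y) - n + 1 values
Y_mva = np.convolve(Y, kernel, mode="valid")

# Align each average with the last sample of its window, like pandas rolling
X_mva = X[n - 1:]

print(len(Y), len(Y_mva))
```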
LOWESS regression
I've used LOWESS (Locally Weighted Scatterplot Smoothing) successfully to remove noise from repeated measures datasets. More information on local regression methods, including LOWESS and LOESS, here. It's a simple method with only one parameter to tune which in my experience gives good results.
Here is how to apply the LOWESS technique using the statsmodels implementation:
import statsmodels.api as sm
y_lowess = sm.nonparametric.lowess(Y, X, frac = 0.3) # 30 % lowess smoothing
plt.plot(y_lowess[:, 0], y_lowess[:, 1]) # some noise removed
plt.show()
It may be necessary to vary the frac parameter, which is the fraction of the data used when estimating each y value. Increase the frac value to increase the amount of smoothing. The frac value must be between 0 and 1.
Further details on statsmodels lowess usage.
Low pass filter
Scipy provides a set of low pass filters which may be appropriate.
Applying lfilter:
from scipy.signal import lfilter
n = 50 # larger n gives smoother curves
b = [1.0 / n] * n # numerator coefficients
a = 1 # denominator coefficient
y_lf = lfilter(b, a, Y)
plt.plot(X, y_lf)
plt.show()
Check scipy lfilter documentation for implementation details regarding how numerator and denominator coefficients are used in the difference equations.
There are other filters in the scipy.signal package.
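As one example of those other filters, a Butterworth low-pass design applied with filtfilt runs the filter forwards and backwards, which avoids the delay that lfilter introduces. The filter order (3) and cutoff (0.05 of the Nyquist frequency) below are assumptions picked for this sample data and would need tuning for real measurements.

```python
import numpy as np
from scipy.signal import butter, filtfilt

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
X = np.arange(1, 1000, 1)
Y = np.log(X ** 3) + 10 * rng.random(X.shape)

# Noise-free trend; the uniform noise has mean 5, hence the offset
trend = np.log(X ** 3) + 5

# 3rd-order low-pass Butterworth filter, cutoff at 0.05 x Nyquist
b, a = butter(3, 0.05)

# Forward-backward filtering: zero phase shift
y_ff = filtfilt(b, a, Y)

print(np.std(Y - trend), np.std(y_ff - trend))
```

A lower cutoff gives a smoother curve, at the cost of flattening genuine fast changes such as the steep rise at the start of this data.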
Interpolation
Finally, here is an example of radial basis function interpolation:
from scipy.interpolate import Rbf
rbf = Rbf(X, Y, function = 'multiquadric', smooth = 500)
y_rbf = rbf(X)
plt.plot(X, y_rbf)
plt.show()
Smoother approximation can be achieved by increasing the smooth parameter. Alternative function parameters to consider include 'cubic' and 'thin_plate'. When considering the function value, I usually try 'thin_plate' first followed by 'cubic'; however both 'thin_plate' and 'cubic' seemed to struggle with the noise in this dataset.
Check other Rbf options in the scipy docs. Scipy provides other univariate and multivariate interpolation techniques (see this tutorial).