I am trying to fit a curve with lmfit but the data set I'm working with does not contain a lot of points and this makes the resulting fit look jagged instead of curved.
I'm simply using the line:
out = mod.fit(SV, pars, x=VR)
where VR and SV are the coordinates of the points I'm trying to fit.
I've tried using scipy.interpolate.UnivariateSpline and then fitting the resulting data, but I want to know if there is a built-in or faster way to do this.
Thank you
There is not a built-in way to automatically interpolate with lmfit. With an lmfit Model, you provide the array of independent values at which the Model should be evaluated and an array of data to compare to that model.
You're free to interpolate or smooth the data or perform some other transformation (I sometimes Fourier transform data and model to emphasize some frequencies), but you'll have to include that as part of the model.
While you might be able to do the job with scipy.interpolate.UnivariateSpline, you would basically be fitting to the fit you already did.
Instead, you can use the components already given to you by your original fit. It's trivial once you know how, but the lmfit documentation does not provide a clear example.
import numpy as np
from lmfit.models import GaussianModel
import matplotlib.pyplot as plt
y, _ = np.histogram(np.random.normal(size=1000), bins=10, density=True)
x = np.linspace(0, 1, y.size)
# Replace with whatever model you are using (with the caveat that the above dataset is gaussian).
model = GaussianModel()
params = model.guess(y, x=x)
result = model.fit(y, params, x=x)
x_interp = np.linspace(0, 1, 100*y.size)
# The model is attached to the result, which makes it easier if you're sending it somewhere.
y_interp = result.model.func(x_interp, **result.best_values)
plt.plot(x, y, label='original')
plt.plot(x_interp, y_interp, label='interpolated')
plt.legend()
plt.show()
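If I recall the lmfit API correctly, the same dense evaluation can also be done directly through the result object, which avoids reaching into model.func:
# assumed equivalent: evaluate the best-fit parameters at the new x values
y_interp = result.eval(x=x_interp)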
I have daily data like the result below.
And my daily prediction looks like the one below on most days.
I found a mild trend: the Y-axis values of my prediction where the X-axis is between 0 and 3000 always need to be amplified much more.
How could I find a function that brings the prediction data closer to the result data? I imagine I could do it if I Fourier transformed both arrays, but I also guess there is a simpler way to do it.
I assume what you want as output is a kind of 'best fit' scaling function to bring your prediction in line with your result. One straightforward approach would be to calculate the difference between your prediction and your result and then apply a smoothing algorithm of your choice to obtain that scaling function (e.g. a Savitzky-Golay filter).
Minimal example below:
import numpy as np
from scipy.signal import savgol_filter
import matplotlib.pyplot as plt
x = np.linspace(0,2*np.pi,100)
data = np.sin(x) + np.random.normal(0, 0.2, 100) # a noisy sine function
prediction = x # line with slope 1, a bad initial fit
deviation = prediction - data
fit = savgol_filter(deviation, window_length=71, polyorder=2)
plt.plot(x, prediction-fit)
plt.plot(x, data)
Play around with window_length and polyorder to find a suitable degree of smoothing for your dataset.
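For example, a quick way to compare a few settings visually (the values below are arbitrary placeholders, reusing the arrays from the example above):
for wl in (31, 51, 71):
    fit = savgol_filter(deviation, window_length=wl, polyorder=2)
    plt.plot(x, prediction - fit, label='window_length=%d' % wl)
plt.plot(x, data, label='data')
plt.legend()
plt.show()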
I am relatively new to python. I am trying to do a multivariate linear regression and plot scatter plots and the line of best fit using one feature at a time.
This is my code:
import matplotlib.pyplot as plt
from sklearn import linear_model
Train=df.loc[:650]
valid=df.loc[651:]
x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]
x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()
regr=linear_model.LinearRegression()
regr.fit(x_train,y_train)
y_pred=regr.predict(x_test)
plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)
plt.show()
And this is the graph that I'm getting-
I have tried searching a lot but to no avail. I wanted to understand why this is not showing a line of best-fit and why instead it is connecting all the points on the scatter plot.
Thank you!
Linear regression means that you are predicting the value linearly, which will always give you a best-fit line; anything else is not possible. In your code:
Train=df.loc[:650]
valid=df.loc[651:]
x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]
x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()
regr=linear_model.LinearRegression()
regr.fit(x_train,y_train)
y_pred=regr.predict(x_test)
plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)
plt.show()
Use the right variables to plot the line, i.e.:
plt.plot(x_test,y_pred)
Plot the graph between the values that you use for testing and the predictions that you get from them, i.e.:
y_pred=regr.predict(x_test)
Also, your model must be trained on the same feature, otherwise you will get a straight line but the results will be unexpected.
This is multivariate data, so you need to get the pairwise lines:
http://www.sthda.com/english/articles/32-r-graphics-essentials/130-plot-multivariate-continuous-data/
Alternatively, change the model to use a single feature, which changes the model completely:
Train=df.loc[:650]
valid=df.loc[651:]
x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]
x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()
regr=linear_model.LinearRegression()
regr.fit(x_train[['lag_7']],y_train)  # double brackets keep the feature as a 2D array
y_pred=regr.predict(x_test[['lag_7']])
plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)
plt.show()
Assuming your graphical library is matplotlib, imported with import matplotlib.pyplot as plt, the problem is that you passed the same data to both plt.scatter and plt.plot. The former draws the scatter plot, while the latter passes a line through all points in the order given (it first draws a straight line between (x_test['lag_7'][0], y_pred[0]) and (x_test['lag_7'][1], y_pred[1]), then one between (x_test['lag_7'][1], y_pred[1]) and (x_test['lag_7'][2], y_pred[2]), etc.)
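For illustration, here is a rough sketch (reusing the variables from your code) of how sorting by the x-variable before plotting makes the line run left to right instead of zig-zagging; note that this only fixes the drawing order and, as discussed below, still does not give a one-feature line of best fit:
import numpy as np
import matplotlib.pyplot as plt
# sort the test points by 'lag_7' so the connecting line is drawn left to right
order = np.argsort(x_test['lag_7'].values)
plt.scatter(x_test['lag_7'], y_pred, color='black')
plt.plot(x_test['lag_7'].values[order], y_pred[order], color='blue', linewidth=3)
plt.show()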
Concerning the more general question about how to do multivariate regression and plot the results, I have two remarks:
Finding the line of best fit one feature at a time amounts to performing 1D regression on that feature: it is an altogether different model from the multivariate linear regression you want to perform.
I don't think it makes much sense to split your data into train and test samples, because linear regression is a very simple model with little risk of overfitting. In the following, I consider the whole data set df.
I like to use OpenTURNS because it has built-in linear regression viewing facilities. The downside is that to use it, we need to convert your pandas tables (DataFrame or Series) to OpenTURNS objects of the class Sample.
import pandas as pd
import numpy as np
import openturns as ot
from openturns.viewer import View
# convert pandas DataFrames to numpy arrays and then to OpenTURNS Samples
X = ot.Sample(np.array(df[['lag_7','rolling_mean', 'expanding_mean']]))
X.setDescription(['lag_7','rolling_mean', 'expanding_mean']) # keep labels
Y = ot.Sample(np.array(df[['sales']]))
Y.setDescription(['sales'])
You did not provide your data, so I need to generate some:
func = ot.SymbolicFunction(['x1', 'x2', 'x3'], ['4*x1 + 0.05*x2 - 2*x3'])
inputs_distribution = ot.ComposedDistribution([ot.Uniform(0, 3.0e6)]*3)
residuals_distribution = ot.Normal(0.0, 2.0e6)
ot.RandomGenerator.SetSeed(0)
X = inputs_distribution.getSample(30)
X.setDescription(['lag_7','rolling_mean', 'expanding_mean'])
Y = func(X) + residuals_distribution.getSample(30)
Y.setDescription(['sales'])
Now, let us find the best-fitting line one feature at a time (1D linear regression):
linear_regression_1 = ot.LinearModelAlgorithm(X[:, 0], Y)
linear_regression_1.run()
linear_regression_1_result = linear_regression_1.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 0], Y, linear_regression_1_result))
linear_regression_2 = ot.LinearModelAlgorithm(X[:, 1], Y)
linear_regression_2.run()
linear_regression_2_result = linear_regression_2.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 1], Y, linear_regression_2_result))
linear_regression_3 = ot.LinearModelAlgorithm(X[:, 2], Y)
linear_regression_3.run()
linear_regression_3_result = linear_regression_3.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 2], Y, linear_regression_3_result))
As you can see, in this example, none of the one-feature linear regressions are able to very accurately predict the output.
Now let us do multivariate linear regression. To plot the result, it is best to view the actual vs. predicted values.
full_linear_regression = ot.LinearModelAlgorithm(X, Y)
full_linear_regression.run()
full_linear_regression_result = full_linear_regression.getResult()
full_linear_regression_analysis = ot.LinearModelAnalysis(full_linear_regression_result)
View(full_linear_regression_analysis.drawModelVsFitted())
As you can see, in this example, the fit is much better with multivariate linear regression than with 1D regressions one feature at a time.
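For reference, a similar actual-vs.-predicted plot can be sketched with scikit-learn, the library used in the question (this assumes a df with the same columns and ignores the NaN handling from the original code):
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
X = df[['lag_7', 'rolling_mean', 'expanding_mean']]
y = df['sales']
reg = LinearRegression().fit(X, y)
y_hat = reg.predict(X)
plt.scatter(y, y_hat, color='black')        # actual vs. predicted
lims = [y.min(), y.max()]
plt.plot(lims, lims, 'b--')                 # perfect-prediction reference line
plt.xlabel('actual sales')
plt.ylabel('predicted sales')
plt.show()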
There are two ways to specify the noise level for Gaussian Process Regression (GPR) in scikit-learn.
The first way is to specify the parameter alpha in the constructor of the class GaussianProcessRegressor which just adds values to the diagonal as expected.
The second way is incorporate the noise level in the kernel with WhiteKernel.
The documentation of GaussianProcessRegressor (see documentation here) says that specifying alpha is "equivalent to adding a WhiteKernel with c=alpha". However, I am experiencing a different behavior and want to find out what the reason is for that (and, of course, what the "correct" way or "truth" is).
Here is a code snippet plotting two different regression fits for a perturbed version of the function f(x)=x^2, although they should show the same result:
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as rnd
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as C, RBF, WhiteKernel
rnd.seed(0)
n = 40
xs = np.linspace(-1, 1, num=n)
noise = 0.1
kernel1 = C()*RBF() + WhiteKernel(noise_level=noise)
kernel2 = C()*RBF()
data = xs**2 + rnd.multivariate_normal(mean=np.zeros(n), cov=noise*np.eye(n))
gpr1 = GaussianProcessRegressor(kernel=kernel1, alpha=0.0, optimizer=None)
gpr1.fit(xs[:, np.newaxis], data)
gpr2 = GaussianProcessRegressor(kernel=kernel2, alpha=noise, optimizer=None)
gpr2.fit(xs[:, np.newaxis], data)
xs_plt = np.linspace(-1., 1., num=100)
for gpr in [gpr1, gpr2]:
    pred, pred_std = gpr.predict(xs_plt[:, np.newaxis], return_std=True)
    plt.figure()
    plt.plot(xs_plt, pred, 'C0', lw=2)
    plt.scatter(xs, data, c='C1', s=20)
    plt.fill_between(xs_plt, pred - 1.96*pred_std, pred + 1.96*pred_std,
                     alpha=0.2, color='C0')
    plt.title("Kernel: %s\n Log-Likelihood: %.3f"
              % (gpr.kernel_, gpr.log_marginal_likelihood(gpr.kernel_.theta)),
              fontsize=12)
    plt.ylim(-1.2, 1.2)
    plt.tight_layout()
plt.show()
I have already looked into the implementation in the scikit-learn package, but was not able to find out what is going wrong. Or maybe I am just overlooking something and the outputs make perfect sense.
Does anyone have an idea of what is going on here or had a similar experience?
Thanks a lot!
I might be wrong here, but I believe the claim 'specifying alpha is "equivalent to adding a WhiteKernel with c=alpha"' is subtly incorrect.
When setting the GP-Regression noise, the noise is added only to K, the covariance between the training points. When adding a Whitenoise-Kernel, the noise is also added to K**, the covariance between test points.
In your case, the test points and training points are identical. However, the three different matrices are likely still created. This could lead to the discrepancy observed here.
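A quick way to check this (reusing gpr1 and gpr2 from the question) is to compare the predictive standard deviations at the training inputs; if the explanation above is right, the WhiteKernel model should report larger values because the noise term also enters the test covariance:
_, std_white = gpr1.predict(xs[:, np.newaxis], return_std=True)
_, std_alpha = gpr2.predict(xs[:, np.newaxis], return_std=True)
print(std_white.mean(), std_alpha.mean())  # expect the WhiteKernel std to be larger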
I argue that the documentation is incorrect. See GitHub issue #13267 (which I opened) about this.
In practice, what I do is fit a GP with the WhiteKernel and then take that noise level. I then add that value to alpha and recompute the necessary variables. An easier alternative is to make a new GP with alpha set and the same length scales, but not fit it; a rough sketch is below.
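This sketch assumes training data X, y and the kernel classes imported in the question; all names are placeholders:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as C, RBF, WhiteKernel
# step 1: fit with a WhiteKernel to estimate the noise level
gp_with_noise = GaussianProcessRegressor(kernel=C()*RBF() + WhiteKernel())
gp_with_noise.fit(X, y)
noise_level = gp_with_noise.kernel_.k2.noise_level   # fitted WhiteKernel noise
# step 2: move the noise into alpha, keep the fitted C()*RBF() part, and
# freeze the hyperparameters (optimizer=None) so fit() only redoes the algebra
gp_alpha = GaussianProcessRegressor(kernel=gp_with_noise.kernel_.k1,
                                    alpha=noise_level,
                                    optimizer=None)
gp_alpha.fit(X, y)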
I should note that it is not universally accepted whether this is the right approach. I had this discussion with a colleague, and we came to the following conclusion (this pertains to the data being noise from experimental error):
If you want to sample the GP to predict what a new experiment with more independent measurements would look like, you want the WhiteKernel.
If you want to sample the possible underlying truth, you do not want the WhiteKernel, since you want a smooth response.
Maybe you can use the GPflow package (https://gpflow.readthedocs.io/en/awav-documentation/notebooks/regression.html), which makes separate predictions for the latent function f and the observation y (f + noise).
m.predict_f returns the mean and variance of the latent function (f) at the points Xnew.
m.predict_y returns the mean and variance of a new data point (i.e. includes the noise variance).
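A minimal sketch of that workflow (assuming the GPflow 2.x API; X, Y and Xnew are placeholder arrays):
import gpflow
m = gpflow.models.GPR(data=(X, Y), kernel=gpflow.kernels.SquaredExponential())
gpflow.optimizers.Scipy().minimize(m.training_loss, m.trainable_variables)
f_mean, f_var = m.predict_f(Xnew)   # latent function, without observation noise
y_mean, y_var = m.predict_y(Xnew)   # new data points, noise variance included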
I have the following graph that I want to digitize to a high-quality publication grade figure using Python and Matplotlib:
I used a digitizer program to grab a few samples from one of the 3 data sets:
x_data = np.array([
1,
1.2371,
1.6809,
2.89151,
5.13304,
9.23238,
])
y_data = np.array([
0.0688824,
0.0490012,
0.0332843,
0.0235889,
0.0222304,
0.0245952,
])
I have already tried 3 different methods of fitting a curve through these data points. The first was to draw a spline through the points using scipy.interpolate's spline.
This results in (with the actual data points drawn as blue markers):
This is obviously no good.
My second attempt was to draw a curve fit using a series of different-order polynomials with scipy.optimize's curve_fit. Even up to a fourth-order polynomial the answer is useless (the lower-order ones were even more useless):
Finally, I used scipy.interpolate's interp1d to try and interpolate between the data points. Linear interpolation obviously yields expected results, but the lines are straight, and the whole purpose of this exercise is to get a nice smooth curve:
If I then use cubic interpolation I get a rubbish result; however, quadratic interpolation yields a slightly better result:
But it's not quite there yet, and I don't think interp1d can do higher order interpolation.
Is there anyone out there who has a good method of doing this? Maybe I would be better off trying to do it in IPE or something?
Thank you!
A standard cubic spline is not very good at producing reasonable-looking interpolations between data points that are very unevenly spaced. Fortunately, there are plenty of other interpolation algorithms and SciPy provides a number of them. Here are a few applied to your data:
import numpy as np
from scipy.interpolate import UnivariateSpline, Akima1DInterpolator, PchipInterpolator
import matplotlib.pyplot as plt
x_data = np.array([1, 1.2371, 1.6809, 2.89151, 5.13304, 9.23238])
y_data = np.array([0.0688824, 0.0490012, 0.0332843, 0.0235889, 0.0222304, 0.0245952])
x_data_smooth = np.linspace(min(x_data), max(x_data), 1000)
fig, ax = plt.subplots(1,1)
# quadratic spline through all points (s=0 forces exact interpolation)
spl = UnivariateSpline(x_data, y_data, s=0, k=2)
y_data_smooth = spl(x_data_smooth)
ax.plot(x_data_smooth, y_data_smooth, 'b')
# Akima interpolation
bi = Akima1DInterpolator(x_data, y_data)
y_data_smooth = bi(x_data_smooth)
ax.plot(x_data_smooth, y_data_smooth, 'g')
# PCHIP (monotonic piecewise cubic) interpolation
bi = PchipInterpolator(x_data, y_data)
y_data_smooth = bi(x_data_smooth)
ax.plot(x_data_smooth, y_data_smooth, 'k')
ax.scatter(x_data, y_data)
plt.show()
I suggest looking through these, and also a few others, and finding one that matches what you think looks right. Also, though, you may want to sample a few more points. For example, I think the PCHIP algorithm wants to keep the fit monotonic between data points, so digitizing your minimum point would be useful (and probably a good idea regardless of the algorithm you use).
I want to fit histograms with a skewed gaussian.
I take my data from a text file:
rate, err = loadtxt('hist.dat', unpack = True)
and then plot them as a histogram:
plt.hist(rate, bins= 128)
This histogram has a skewed gaussian shape, that I would like to fit.
I can do it with a simple Gaussian, because SciPy has that function included, but not with a skewed one. How can I proceed?
Ideally, a goodness-of-fit measure would also be returned.
You might find lmfit (http://lmfit.github.io/lmfit-py/) useful. This has a Skewed Gaussian model built in. Your problem might be as simple as
import pylab
from lmfit.models import SkewedGaussianModel
xvals, yvals = read_your_histogram()
model = SkewedGaussianModel()
# set initial parameter values
params = model.make_params(amplitude=10, center=0, sigma=1, gamma=0)
# adjust parameters to best fit data.
result = model.fit(yvals, params, x=xvals)
print(result.fit_report())
pylab.plot(xvals, yvals)
pylab.plot(xvals, result.best_fit)
This will report the values and uncertainties for the parameters amplitude, center, sigma (for the normal Gaussian), and gamma, the skewness factor.
There are several answers out there for using the .fit() method of scipy.stats.skewnorm, but that method doesn't allow for initial parameters and is not robust. This lmfit package is better, but I will add that a non-zero baseline may still throw it off. To get it to work on my particular dataset, I used scipy.optimize.curve_fit first with an ordinary Gaussian, which was the quickest way to get the baseline, then subtracted it and refit with lmfit to get the skew. A rough sketch of that workflow is below.
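This is only a sketch of that two-step approach (reusing the xvals/yvals names from the answer above; the initial guesses are placeholders):
import numpy as np
from scipy.optimize import curve_fit
from lmfit.models import SkewedGaussianModel
# step 1: quick curve_fit with an ordinary Gaussian plus a constant baseline c0
def gauss_with_baseline(x, a, mu, sig, c0):
    return a * np.exp(-(x - mu)**2 / (2 * sig**2)) + c0
popt, _ = curve_fit(gauss_with_baseline, xvals, yvals, p0=[1, 0, 1, 0])
baseline = popt[3]
# step 2: subtract the baseline and fit the skewed Gaussian with lmfit
model = SkewedGaussianModel()
params = model.make_params(amplitude=popt[0], center=popt[1], sigma=popt[2], gamma=0)
result = model.fit(yvals - baseline, params, x=xvals)
print(result.fit_report())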