I have run into a problem using the Python tool "statsmodels.api.GLM" that I cannot understand, and I'm asking here for help.
I'm working through the "Cubic and Natural Cubic Splines" section of this page: https://www.analyticsvidhya.com/blog/2018/03/introduction-regression-splines-python-codes/ (a data link is included in the page).
The problem is this: after fitting the data, I try to predict values at given positions of x (e.g. xp00 and xp01 in the code below). I find that as soon as the requested positions have a different min and max (i.e. xp01) from the first set of positions (i.e. xp), the result becomes something else entirely. My intuitive expectation was that, at the same position, the prediction should be exactly the same value no matter how the request is made, because the fit to the data is already done and fixed. I expected pred01 to overlap pred00, only cut a bit shorter at the left end.
# import modules
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline
# read data_set
data = pd.read_csv("Wage.csv")
data.head()
data_x = data['age']
data_y = data['wage']
# Dividing data into train and validation datasets
from sklearn.model_selection import train_test_split
train_x, valid_x, train_y, valid_y = train_test_split(data_x, data_y, test_size=0.33, random_state = 1)
from patsy import dmatrix
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error
from math import sqrt
# Generating cubic spline with 3 knots at 25, 40 and 60
transformed_x = dmatrix("bs(train, knots=(25,40,60), degree=3, include_intercept=False)", {"train": train_x},return_type='dataframe')
# Fitting Generalised linear model on transformed dataset
fit1 = sm.GLM(train_y, transformed_x).fit()
# Prediction on splines
pred1 = fit1.predict(dmatrix("bs(valid, knots=(25,40,60), include_intercept=False)", {"valid": valid_x}, return_type='dataframe'))
# Calculating RMSE values
rms1 = sqrt(mean_squared_error(valid_y, pred1))
print(rms1)
#-> 39.4
# We will plot the graph for 70 observations only
xp = np.linspace(valid_x.min(),valid_x.max(),70)
xp00 = np.linspace(valid_x.min(),valid_x.max(),170)
xp01 = np.linspace(valid_x.min()+4,valid_x.max(),170) # just shift the lower bound a bit
# Make some predictions
pred1 = fit1.predict(dmatrix("bs(xp, knots=(25,40,60), include_intercept=False)", {"xp": xp}, return_type='dataframe'))
pred00 = fit1.predict(dmatrix("bs(xp, knots=(25,40,60), include_intercept=False)", {"xp": xp00}, return_type='dataframe'))
pred01 = fit1.predict(dmatrix("bs(xp, knots=(25,40,60), include_intercept=False)", {"xp": xp01}, return_type='dataframe'))
SMALL_SIZE = 4
gamma=0.4
plt.rc('font', size=SMALL_SIZE)
plt.rc('axes', titlesize=SMALL_SIZE)
plt.figure(figsize=(5,2),dpi=300)
# Plot the splines and error bands
plt.scatter(data.age, data.wage, facecolor='None', edgecolor='k', alpha=0.1)
#plt.plot(xp, pred1, label='Specifying degree =3 with 3 knots')
plt.plot(xp, pred1, color='r', label='Specifying degree =3 with 3 knots, xp')
plt.plot(xp00, pred00, color='g', label='Specifying degree =3 with 3 knots, xp00')
plt.plot(xp01, pred01, color='b', label='Specifying degree =3 with 3 knots, xp01')
plt.legend()
plt.xlim(15,85)
plt.ylim(0,350)
plt.xlabel('age')
plt.ylabel('wage')
plt.show()
I don't have enough reputation to embed the figure in the post, so please click the link below and check the strange results. Perhaps they are not strange at all and it is just me not knowing how to use the tool; I'm ready to hear that.
The strange results (URL: https://i.stack.imgur.com/uFkGH.jpg)
Thanks!!
Yanbin
Splines are a stateful transformation. That means that computing the splines requires parameters, such as the knot locations, that are based on the data. This is similar to standardization, which depends on the mean and standard deviation of the sample.
Using formulas in statsmodels keeps track of those stateful transformations for transformations like splines that are provided by patsy. So the original parameters of the stateful transformation are reused when computing the transformed design matrix for new prediction points.
In the example code, the spline basis is computed separately for the training and test samples. However, it specifies the same interior knots in both cases.
My guess about what happens in the example is that patsy adjusts the boundary knots to the data used in the transformation. In that case, even if the interior knots are the same, the boundary knots differ.
As a consequence, the B-spline bases will agree in the interior of the data range, but not for points close to the boundary.
A second source of differences is that removing the intercept from the spline basis can be a "global" transformation that affects all spline basis columns and not just a single column. (I do not remember what patsy's default for removing the intercept is for B-splines.)
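A minimal sketch (mine, not from the question) of how to apply the training transformation explicitly to new points with patsy's build_design_matrices, so the boundary knots stay fixed at the training values; it assumes the new points lie within the range of the training data:
from patsy import dmatrix, build_design_matrices
train_basis = dmatrix("bs(train, knots=(25,40,60), degree=3, include_intercept=False)",
                      {"train": train_x}, return_type='dataframe')
fit1 = sm.GLM(train_y, train_basis).fit()
# Reuse the *training* design_info for the new points instead of calling dmatrix
# on xp01 directly; patsy will complain if points fall outside the training range.
new_basis = build_design_matrices([train_basis.design_info], {"train": xp01})[0]
pred01_consistent = fit1.predict(new_basis)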
Related
I am trying to fit a curve with the curve_fit function in SciPy. Changing the initial values of the model changes the quality of the fit, but I am not able to find the best fit through my data. This is what my fit currently looks like.
My question is: how can I improve this fit, and what is the best way of selecting the initial values of the model?
I have attached the raw data which I want to fit an exponential curve to it.
This is the data which I am using
y = [338.52656636, 337.43934446, 348.25434126, 308.42768639, 279.24436171,
     269.85992004, 279.24436171, 249.25992615, 239.53215125, 219.96215705,
     220.41993469, 220.30549028, 220.30549028, 195.07049776, 180.364391,
     171.20883816, 180.24994659, 180.13550218, 180.47883541, 209.89104892,
     220.19104587, 180.02105777, 595.45426801, 324.50712607, 150.60884426,
     170.97994934, 171.20883816, 170.75106052, 170.75106052, 159.76439711,
     140.88106937, 150.37995544, 140.88106937, 1620.70451979, 140.42329173,
     150.37995544, 140.53773614, 284.68047121, 1146.84743797, 170.97994934,
     150.60884426, 145.74495682, 141.10995819, 121.53996399, 121.19663076,
     131.38218329, 170.40772729, 140.42329173, 140.82384716, 145.5732902,
     140.30884732, 121.53996399, 700.39979247, 2783.74584185, 131.26773888,
     140.76662496, 140.53773614, 121.76885281, 126.23218482, 130.69551683]
and here is my code:
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

# Exponential decay model
def expDecay(t, Amax, tau):
    return Amax / tau * np.exp(-t / tau)

Amax = []
Tau = []
ydata = np.array(y)
xdata = np.arange(len(y))
popt, pcov = curve_fit(expDecay, xdata, ydata,
                       p0=(10000, 5),
                       bounds=([0., 2.], [10000., 30.]))
Amax.append(popt[0])
Tau.append(popt[1])
plt.plot(xdata, expDecay(xdata, *popt), 'k-', label='Pred.')
plt.plot(ydata)
plt.ylim([0, 500])
plt.show()
The deviation is due to the outliers. After eliminating them, the fit improves considerably.
Note about eliminating the outliers.
Since the definition of an outlier is subjective, software able to do this will probably be more or less interactive. I built my own very rudimentary tool. The principle is:
A first nonlinear regression is done with all the points. With the function and parameters obtained, the value of y is computed for each point. The absolute differences between the computed y and the y values from the given data file are compared. This identifies the point that lies furthest away, which is then eliminated.
Another nonlinear regression is done with the remaining points. The same procedure eliminates a second point.
And so on, until a specified stopping criterion is reached. That is the subjective part.
With your data (60 points), point no. 54 was eliminated first, then point no. 34, then no. 39, and so on. The process stops after eliminating 6 points; eliminating more points doesn't improve the LMSE much.
The curve above is the result of the last nonlinear regression with the 54 remaining points.
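For what it's worth, here is a rough sketch of that elimination loop in Python (not my actual software), reusing expDecay, and the x and y arrays from the question:
import numpy as np
from scipy.optimize import curve_fit

x_kept = np.arange(len(y), dtype=float)
y_kept = np.array(y, dtype=float)

n_to_remove = 6  # stopping criterion; chosen by watching the error, so subjective
for _ in range(n_to_remove):
    popt, _ = curve_fit(expDecay, x_kept, y_kept,
                        p0=(10000, 5), bounds=([0., 2.], [10000., 30.]))
    residuals = np.abs(y_kept - expDecay(x_kept, *popt))
    worst = np.argmax(residuals)          # the point furthest from the current fit
    x_kept = np.delete(x_kept, worst)
    y_kept = np.delete(y_kept, worst)

# Final regression on the remaining points
popt, _ = curve_fit(expDecay, x_kept, y_kept,
                    p0=(10000, 5), bounds=([0., 2.], [10000., 30.]))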
I have added an Excel plot from which I get the exponential equation, and I am trying to reproduce the curve fit in Python.
My fitted equation is not close to the empirical data I have provided when I use it to predict the y values: the prediction gives f(-25) = 5.30e-11, while the empirical data give f(-25) = 5.3e-13.
How can I improve the code so that it predicts close to the empirical data, or have I made mistakes in my code?
[Figure: Python fitted plot]
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import scipy.optimize as optimize
import scipy.stats as stats
pd.set_option('display.precision', 14)

# Exponential model
def f(x, A, B):
    return A * np.exp(-B * x)

y_data = [2.156e-05, 1.85e-07, 1.02e-10, 1.268e-11, 5.352e-13]
x = np.array([-28.8, -27.4, -26, -25.5, -25])
p, pcov = optimize.curve_fit(f, x, y_data, p0=[10**(-59), 4], maxfev=5000)
plt.figure()
plt.plot(x, y_data, 'ko', label="Empirical BER")
plt.plot(x, f(x, *p), 'g-', label="Fitted BER")
plt.title(" BER ")
plt.xlabel('Power Rx (dB)')
plt.ylabel('')
plt.legend()
plt.grid()
plt.yscale("log")
plt.show()
Since you are plotting the data with a log-plot, your view of the data and fit is emphasizing the "tiny" compared to the "small". Fitting uses the sum of the squares of the misfit to determine the best fit. A misfit of a few percent of the data with a y-value of ~2e-5 would completely swamp a misfit of a factor of 10 or even 100 for the data with a y-value of 1.e-11. Your plot is consistent with that.
There are two possible routes to a better fit:
a) if you have uncertainties in the y-values, use those. It's quite possible that the uncertainty in the data with y~2e-5 is much larger than the uncertainty in the data with y~1.e-11, and scaling by the uncertainty so that the minimization is of the sum-of-squares of (data-model)/uncertainty will help fit the low-value data better. OTOH, if the errors are constant, plotting those uncertainties might show that the fit you have is actually not that bad -- the misfit where y~1.e-11 is only 1.e-10.
b) realize that you are assessing the fit quality by plotting the log of the data, and embrace that observation so that you fit the log(data) to log(model). Conveniently for a simple exponential function, the log of that model is linear, so you could do linear regression of the log of your data.
Bonus round: recognize that options a) and b) are related. Since a fit minimizes Sum[((data-model)/uncertainty)**2], not providing values for the uncertainty is effectively saying that it has the same value (=1.0, in fact) for all values of x and y. Fitting the log of the model to the log of the data, as with Sum[(log(data) - log(model))**2], is effectively saying that the uncertainty in log(data) is the same for all values of x and y.
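To make option b) concrete, here is a minimal sketch (mine, not part of the original answer) that fits a straight line to log(y) and maps the slope and intercept back to A and B:
import numpy as np

x = np.array([-28.8, -27.4, -26, -25.5, -25])
y_data = np.array([2.156e-05, 1.85e-07, 1.02e-10, 1.268e-11, 5.352e-13])

# log(A * exp(-B*x)) = log(A) - B*x, which is linear in x
slope, intercept = np.polyfit(x, np.log(y_data), 1)
A_est, B_est = np.exp(intercept), -slope
print(A_est * np.exp(-B_est * (-25)))   # compare with the empirical value 5.3e-13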
There are two ways to specify the noise level for Gaussian Process Regression (GPR) in scikit-learn.
The first way is to specify the parameter alpha in the constructor of the class GaussianProcessRegressor which just adds values to the diagonal as expected.
The second way is to incorporate the noise level into the kernel with a WhiteKernel.
The documentation of GaussianProcessRegressor (see documentation here) says that specifying alpha is "equivalent to adding a WhiteKernel with c=alpha". However, I am experiencing a different behavior and want to find out what the reason is for that (and, of course, what the "correct" way or "truth" is).
Here is a code snippet plotting two different regression fits for a perturbed version of the function f(x) = x^2, although they should show the same result:
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as rnd
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as C, RBF, WhiteKernel
rnd.seed(0)
n = 40
xs = np.linspace(-1, 1, num=n)
noise = 0.1
kernel1 = C()*RBF() + WhiteKernel(noise_level=noise)
kernel2 = C()*RBF()
data = xs**2 + rnd.multivariate_normal(mean=np.zeros(n), cov=noise*np.eye(n))
gpr1 = GaussianProcessRegressor(kernel=kernel1, alpha=0.0, optimizer=None)
gpr1.fit(xs[:, np.newaxis], data)
gpr2 = GaussianProcessRegressor(kernel=kernel2, alpha=noise, optimizer=None)
gpr2.fit(xs[:, np.newaxis], data)
xs_plt = np.linspace(-1., 1., num=100)
for gpr in [gpr1, gpr2]:
    pred, pred_std = gpr.predict(xs_plt[:, np.newaxis], return_std=True)

    plt.figure()
    plt.plot(xs_plt, pred, 'C0', lw=2)
    plt.scatter(xs, data, c='C1', s=20)
    plt.fill_between(xs_plt, pred - 1.96*pred_std, pred + 1.96*pred_std,
                     alpha=0.2, color='C0')
    plt.title("Kernel: %s\n Log-Likelihood: %.3f"
              % (gpr.kernel_, gpr.log_marginal_likelihood(gpr.kernel_.theta)),
              fontsize=12)
    plt.ylim(-1.2, 1.2)
    plt.tight_layout()
plt.show()
I already looked into the implementation in the scikit-learn package, but I was not able to find out what is going wrong. Or maybe I am just overlooking something and the outputs make perfect sense.
Does anyone have an idea of what is going on here or had a similar experience?
Thanks a lot!
I might be wrong here, but I believe the claim 'specifying alpha is "equivalent to adding a WhiteKernel with c=alpha"' is subtly incorrect.
When setting the GP regression noise via alpha, the noise is added only to K, the covariance between the training points. When adding a WhiteKernel, the noise is also added to K**, the covariance between the test points.
In your case, the test points and training points are identical. However, the three different matrices (K, K*, K**) are likely still created internally. This could lead to the discrepancy observed here.
I argue that the documentation is incorrect. See the GitHub issue #13267 (which I opened) about this.
In practice, what I do is fit a GP with the WhiteKernel, then take that noise level. I then add that value to alpha and recompute the necessary variables. An easier alternative is to make a new GP with alpha set to that noise level and the same length scales, but not fit it.
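A minimal sketch of that workaround, using the variable names from the question; reading the fitted noise through the Sum kernel's k1/k2 attributes is my assumption about the kernel layout here:
# kernel1 = C()*RBF() + WhiteKernel(...), gpr1, xs and data are from the question.
# For a Sum kernel, k1 is the C()*RBF() part and k2 is the WhiteKernel.
fitted_noise = gpr1.kernel_.k2.noise_level
smooth_kernel = gpr1.kernel_.k1           # the smooth part, without the noise term

gpr_alpha = GaussianProcessRegressor(kernel=smooth_kernel,
                                      alpha=fitted_noise,
                                      optimizer=None)   # keep hyperparameters fixed
gpr_alpha.fit(xs[:, np.newaxis], data)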
I should note that it is not universally accepted whether this is the right approach. I had this discussion with a colleague, and we came to the following conclusion (this pertains to the data noise coming from experimental error):
If you want to sample the GP to predict what a new experiment with more independent measurements would give, you want the WhiteKernel.
If you want to sample the possible underlying truth, you do not want the WhiteKernel, since you want a smooth response.
https://gpflow.readthedocs.io/en/awav-documentation/notebooks/regression.html
Maybe you can use the GPflow package, which makes separate predictions for the latent function f and the observation y (f + noise).
m.predict_f returns the mean and variance of the latent function (f) at the points Xnew.
m.predict_y returns the mean and variance of a new data point (i.e. includes the noise variance).
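A minimal GPflow sketch of that distinction; the exact API depends on the GPflow version (this assumes the 2.x interface), and the data here is just illustrative:
import numpy as np
import gpflow

X = np.linspace(-1., 1., 40)[:, None]
Y = X**2 + 0.1 * np.random.randn(*X.shape)

model = gpflow.models.GPR((X, Y), kernel=gpflow.kernels.SquaredExponential())
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)

Xnew = np.linspace(-1., 1., 100)[:, None]
f_mean, f_var = model.predict_f(Xnew)   # latent function f, without observation noise
y_mean, y_var = model.predict_y(Xnew)   # new observations y, includes the noise variance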
This seems like a sklearn question, but it's not (at least not directly). I just use sklearn here to generate the data points, since that makes the problem fully reproducible. Some background:
I use sklearn to predict some points in a small interval. First I build a synthetic domain X of 2D vectors (rows of a matrix).
Then I calculate image points y = x_1 + x_2 + noise from those rows x = (x_1, x_2), adding some noise to mimic real data.
To do the regression (aka interpolation), I randomly pick vectors/points (rows, in matrix form) from the domain X using train_test_split. I will skip the details, but the resulting arrays are random subsets of the space (x_1, x_2, y) for all (x_1, x_2) in my compact support.
Then I do the regression using sklearn; so far so good, everything works as expected. I get the predictions in y_pred_test_sine and they look fine. But the predictions are completely shuffled, since the method picks random points from the domain as the test set.
Here comes the problem...
Since I want to plot this as a continuous function (interpolated by matplotlib, which is fine; I will play with my own interpolation tests later), I do two things:
Create a new vector, X_test_sort, with the sorted test domain points.
Create a new vector, y_pred_test_sine_sort, with the sorted predicted image points.
These (1) and (2) should match each data point of the predicted model (they are only sorted so they can easily be plotted with plt.plot as lines rather than markers).
Then I plot them and they do not match (AT ALL) the expected points in my solution space.
Here we can see that the solid black line (the sorted prediction line) does not follow the orange dots (the predicted points). That was not what I expected at all.
Here follow the code to reproduce the issue.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
plt.close('all')
rng = np.random.RandomState(42)
regressor = LinearRegression()
# Synthetic dataset
x_1 = np.linspace(-3, 3, 300)
x_2 = np.sin(4*x_1)
noise = rng.uniform(size=len(x_1))
y = x_1 + x_2 + noise
X = np.vstack((x_1, x_2)).T
# Data splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Regression 2 features data
fit_sine = regressor.fit(X_train, y_train)
y_pred_test_sine = regressor.predict(X_test)
# Here I have sorted the X values and its image points Y = f(x)
# Why those are not correctly placed over the 'prediction' points
X_test_sort = np.sort(X_test[:,0].ravel())
y_pred_test_sine_sort = np.sort(y_pred_test_sine.ravel())
# DO THE PLOTTING
plt.plot(X_test[:,0], y_test, 'o', alpha=.5, label='data')
plt.plot(X_test[:,0], y_pred_test_sine, 'o', alpha=.5, label='prediction')
plt.plot(X_test_sort, y_pred_test_sine_sort, 'k', label='prediction line')
plt.plot(x_1, np.sin(4*x_1) + x_1 + .5, 'k:', alpha=0.3, label='trend')
plt.legend()
As you mentioned in your comments, by sorting y on its own you break the element-wise correspondence between X and y. Instead, use argsort to get the sorting order of X, and then reorder X_test and y_pred_test_sine with it:
argsort_X_test = np.argsort((X_test[:,0].ravel()))
X_test_sort = X_test[argsort_X_test, 0]
y_pred_test_sine_sort = y_pred_test_sine[argsort_X_test]
This will give you the desired graph.
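With that ordering in place, the plotting line from the question can stay as it was, for example:
plt.plot(X_test_sort, y_pred_test_sine_sort, 'k', label='prediction line')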
I've been trying to implement a time series prediction tool using support vector regression in Python. I use the SVR module from scikit-learn for non-linear support vector regression. But I have a serious problem with the prediction of future events. The regression line fits the original function well (on known data), but as soon as I want to predict future steps, it just returns the value from the last known step.
My code looks like this:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.svm import SVR
X = np.arange(0,100)
Y = np.sin(X)
svr_rbf = SVR(kernel='rbf', C=1e5, gamma=1e5)
y_rbf = svr_rbf.fit(X[:-10, np.newaxis], Y[:-10]).predict(X[:, np.newaxis])
figure = plt.figure()
tick_plot = figure.add_subplot(1, 1, 1)
tick_plot.plot(X, Y, label='data', color='green', linestyle='-')
tick_plot.axvline(x=X[-10], alpha=0.2, color='gray')
tick_plot.plot(X, y_rbf, label='data', color='blue', linestyle='--')
plt.show()
Any ideas?
thanks in advance,
Tom
You are not really doing time-series prediction. You are trying to predict each element of Y from a single element of X, which means that you are just solving a standard kernelized regression problem.
Another problem is that when computing the RBF kernel over a range of vectors [[0],[1],[2],...], you get a band of positive values along the diagonal of the kernel matrix, while values far from the diagonal are close to zero. The test-set portion of your kernel matrix is far from the diagonal and is therefore very close to zero, which causes all of the SVR predictions to be close to the bias term.
For time series prediction I suggest building the training set as
x[0]=Y[0:K]; y[0]=Y[K]
x[1]=Y[1:K+1]; y[1]=Y[K+1]
...
that is, try to predict future elements of the sequence from a window of previous elements.
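A minimal sketch of that windowed setup on the sine data from the question; the window length K and the SVR hyperparameters here are arbitrary illustrative choices:
import numpy as np
from sklearn.svm import SVR

Y = np.sin(np.arange(0, 100))
K = 10                                    # window length, chosen arbitrarily

# Each input row is a window of K past values; the target is the next value.
X_win = np.array([Y[i:i + K] for i in range(len(Y) - K)])
y_win = Y[K:]

svr = SVR(kernel='rbf', C=10.0, gamma=0.1)
svr.fit(X_win[:-10], y_win[:-10])         # hold out the last 10 steps
pred = svr.predict(X_win[-10:])           # one-step-ahead predictions on the held-out part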