I've been trying to implement a time series prediction tool using support vector regression in Python. I use the SVR module from scikit-learn for non-linear support vector regression. But I have a serious problem with predicting future events: the regression line fits the original function well (on the known data), but as soon as I want to predict future steps, it returns the value from the last known step.
My code looks like this:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.svm import SVR
X = np.arange(0,100)
Y = np.sin(X)
svr_rbf = SVR(kernel='rbf', C=1e5, gamma=1e5)
y_rbf = svr_rbf.fit(X[:-10, np.newaxis], Y[:-10]).predict(X[:, np.newaxis])
figure = plt.figure()
tick_plot = figure.add_subplot(1, 1, 1)
tick_plot.plot(X, Y, label='data', color='green', linestyle='-')
tick_plot.axvline(x=X[-10], alpha=0.2, color='gray')
tick_plot.plot(X, y_rbf, label='data', color='blue', linestyle='--')
plt.show()
Any ideas?
thanks in advance,
Tom
You are not really doing time-series prediction. You are trying to predict each element of Y from a single element of X, which means that you are just solving a standard kernelized regression problem.
Another problem is that when computing the RBF kernel over a range of vectors [[0],[1],[2],...], you get a band of positive values along the diagonal of the kernel matrix, while values far from the diagonal are close to zero. The test-set portion of your kernel matrix is far from the diagonal and is therefore very close to zero, which causes all of the SVR predictions to be close to the bias term.
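To see this concretely, here is a small illustration (my own, not part of the original answer) of how the kernel matrix collapses with the gamma from the question:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.arange(0, 100)[:, np.newaxis]
K = rbf_kernel(X, X, gamma=1e5)
print(K[0, 0], K[0, 1])      # 1.0 and exp(-1e5), i.e. essentially 0
print(K[95:, :90].max())     # kernel of "future" rows vs. training columns: ~0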
For time series prediction I suggest building the training set as
x[0]=Y[0:K]; y[0]=Y[K]
x[1]=Y[1:K+1]; y[1]=Y[K+1]
...
that is, try to predict future elements of the sequence from a window of previous elements.
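For illustration, here is a minimal sketch of that windowed setup (my own code, with an assumed window length and a toy sine series rather than your data):

import numpy as np
from sklearn.svm import SVR

Y = np.sin(0.1 * np.arange(200))        # toy series (assumption)
K = 20                                  # window length (assumption)
X_win = np.array([Y[i:i + K] for i in range(len(Y) - K)])
y_win = Y[K:]

# Train on everything except the last 20 targets
model = SVR(kernel='rbf', C=10.0, gamma=0.1).fit(X_win[:-20], y_win[:-20])

# Iterated one-step-ahead forecast of the last 20 points,
# feeding each prediction back into the window
history = list(Y[:len(Y) - 20])
preds = []
for _ in range(20):
    nxt = model.predict(np.array(history[-K:])[np.newaxis, :])[0]
    preds.append(nxt)
    history.append(nxt)
print(preds[:5])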
I am trying to fit a curve with the curve_fit function in SciPy. By changing the initial values of the model the quality of the fit changes, but I am not able to find the best fit through my data. Here is what my fit looks like.
My question is: how can I improve this fit, and what is the best way of selecting the initial values of the model?
I have attached the raw data to which I want to fit an exponential curve.
This is the data I am using:
y = [338.52656636, 337.43934446, 348.25434126, 308.42768639, 279.24436171,
     269.85992004, 279.24436171, 249.25992615, 239.53215125, 219.96215705,
     220.41993469, 220.30549028, 220.30549028, 195.07049776, 180.364391,
     171.20883816, 180.24994659, 180.13550218, 180.47883541, 209.89104892,
     220.19104587, 180.02105777, 595.45426801, 324.50712607, 150.60884426,
     170.97994934, 171.20883816, 170.75106052, 170.75106052, 159.76439711,
     140.88106937, 150.37995544, 140.88106937, 1620.70451979, 140.42329173,
     150.37995544, 140.53773614, 284.68047121, 1146.84743797, 170.97994934,
     150.60884426, 145.74495682, 141.10995819, 121.53996399, 121.19663076,
     131.38218329, 170.40772729, 140.42329173, 140.82384716, 145.5732902,
     140.30884732, 121.53996399, 700.39979247, 2783.74584185, 131.26773888,
     140.76662496, 140.53773614, 121.76885281, 126.23218482, 130.69551683]
and here is my code:
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

# Exponential decay model
def expDecay(t, Amax, tau):
    return Amax/tau*np.exp(-t/tau)

Amax = []
Tau = []
ydata = np.array(y)
xdata = np.arange(len(y), dtype=float)
popt, pcov = curve_fit(expDecay, xdata, ydata,
                       p0=(10000, 5),
                       bounds=([0., 2.], [10000., 30.]))
Amax.append(popt[0])
Tau.append(popt[1])
plt.plot(xdata, expDecay(xdata, *popt), 'k-', label='Pred.')
plt.plot(ydata)
plt.ylim([0, 500])
plt.show()
The deviation is due to the outliers. After eliminating them:
A note about eliminating the outliers.
Since the definition of an outlier is subjective, software able to do this will probably be more or less interactive. I built my own very rudimentary software. The principle is:
A first nonlinear regression is done with all the points. With the function and parameters obtained, a y value is computed for each point. The absolute differences between the computed y values and the y values from the given data file are compared. This allows the point furthest away from the fit to be eliminated.
Another nonlinear regression is done with the remaining points. The same procedure eliminates a second point.
And so on, until a specified stopping criterion is reached. That is the subjective part.
With your data (60 points), point no. 54 was eliminated first, then point no. 34, then no. 39, and so on. The process stops after eliminating 6 points; eliminating more points doesn't improve the LMSE much.
The curve above is the result of the last nonlinear regression with the 54 remaining points.
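For reference, here is a rough sketch (my own code, not the software described above) of that iterative trimming loop, assuming the y list from the question and an arbitrary stopping rule of 6 removed points:

import numpy as np
from scipy.optimize import curve_fit

def expDecay(t, Amax, tau):
    return Amax / tau * np.exp(-t / tau)

x_keep = np.arange(len(y), dtype=float)
y_keep = np.array(y, dtype=float)
for _ in range(6):                            # stopping rule (assumption)
    popt, _ = curve_fit(expDecay, x_keep, y_keep, p0=(10000, 5),
                        bounds=([0., 2.], [10000., 30.]))
    resid = np.abs(y_keep - expDecay(x_keep, *popt))
    worst = np.argmax(resid)                  # point furthest from the fit
    x_keep = np.delete(x_keep, worst)
    y_keep = np.delete(y_keep, worst)
print(popt)                                   # parameters of the trimmed fit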
I have encountered a problem using the Python tool statsmodels.api.GLM which I cannot understand, so I am here asking for help.
I'm working on an example from the section "Cubic and Natural Cubic Splines"
on this page: https://www.analyticsvidhya.com/blog/2018/03/introduction-regression-splines-python-codes/ (the data link is included in the page).
The problem is this: after fitting the data, I try to predict values at given positions of x (e.g. xp00 and xp01 in the following code). I find that once the requested positions have a different min and max (i.e. xp01) from the training x-set (i.e. xp), the result becomes something else entirely. This goes against my intuitive expectation that, at the same position, the prediction should be exactly the same value no matter how the request is made, because the fit to the data is already done and fixed. I expect pred01 to overlap with pred00, just cut shorter at the left end.
# import modules
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline
# read data_set
data = pd.read_csv("Wage.csv")
data.head()
data_x = data['age']
data_y = data['wage']
# Dividing data into train and validation datasets
from sklearn.model_selection import train_test_split
train_x, valid_x, train_y, valid_y = train_test_split(data_x, data_y, test_size=0.33, random_state = 1)
from patsy import dmatrix
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error
from math import sqrt
# Generating cubic spline with 3 knots at 25, 40 and 60
transformed_x = dmatrix("bs(train, knots=(25,40,60), degree=3, include_intercept=False)", {"train": train_x},return_type='dataframe')
# Fitting Generalised linear model on transformed dataset
fit1 = sm.GLM(train_y, transformed_x).fit()
# Prediction on splines
pred1 = fit1.predict(dmatrix("bs(valid, knots=(25,40,60), include_intercept=False)", {"valid": valid_x}, return_type='dataframe'))
# Calculating RMSE values
rms1 = sqrt(mean_squared_error(valid_y, pred1))
print(rms1)
#-> 39.4
# We will plot the graph for 70 observations only
xp = np.linspace(valid_x.min(),valid_x.max(),70)
xp00 = np.linspace(valid_x.min(),valid_x.max(),170)
xp01 = np.linspace(valid_x.min()+4,valid_x.max(),170) # just shift the lower bound a bit
# Make some predictions
pred1 = fit1.predict(dmatrix("bs(xp, knots=(25,40,60), include_intercept=False)", {"xp": xp}, return_type='dataframe'))
pred00 = fit1.predict(dmatrix("bs(xp, knots=(25,40,60), include_intercept=False)", {"xp": xp00}, return_type='dataframe'))
pred01 = fit1.predict(dmatrix("bs(xp, knots=(25,40,60), include_intercept=False)", {"xp": xp01}, return_type='dataframe'))
SMALL_SIZE = 4
gamma=0.4
plt.rc('font', size=SMALL_SIZE)
plt.rc('axes', titlesize=SMALL_SIZE)
plt.figure(figsize=(5,2),dpi=300)
# Plot the splines and error bands
plt.scatter(data.age, data.wage, facecolor='None', edgecolor='k', alpha=0.1)
#plt.plot(xp, pred1, label='Specifying degree =3 with 3 knots')
plt.plot(xp, pred1, color='r', label='Specifying degree =3 with 4 knots xp')
plt.plot(xp00, pred00, color='g', label='Specifying degree =3 with 4 knots xp00')
plt.plot(xp01, pred01, color='b', label='Specifying degree =3 with 4 knots xp01')
plt.legend()
plt.xlim(15,85)
plt.ylim(0,350)
plt.xlabel('age')
plt.ylabel('wage')
plt.show()
I don't have enough reputation to embed the figure in the post, so please click the link below and check the strange results. Perhaps they are not strange and I just don't know how to use the tool; I'm ready to learn.
The strange results: https://i.stack.imgur.com/uFkGH.jpg
Thanks!!
Yanbin
Splines are a stateful transformation. That means that computing the splines needs parameters, such as the knot locations, that are based on the data. This is similar to standardization, which depends on the mean and standard deviation of the sample.
Using formulas in statsmodels keeps track of those stateful transformations for transformations like splines that are provided by patsy. So the original parameters of the stateful transformation are used when computing the transformed design matrix for new prediction points.
In the example code, however, the spline basis is computed separately for the training and prediction sets. It specifies the interior knots to be the same in both cases.
My guess about what happens in the example is that patsy adjusts the boundary knots to the data being transformed. In that case, even if the interior knots are the same, the boundary knots differ.
As a consequence, the B-spline basis will agree in the interior of the data space, but not for points close to the boundary.
A second source of differences is that removing the intercept from the spline basis can be a "global" transformation that affects all spline basis columns, not just a single column. (I do not remember what patsy's default for removing the intercept is for B-splines.)
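One way to get consistent predictions, sketched below (my own example, not part of the original answer), is to reuse the design information of the training design matrix when transforming new points, so the boundary knots and intercept handling come from the training data rather than being recomputed:

from patsy import dmatrix, build_design_matrices

# Fit on the training basis as before, keeping its design_info around
transformed_x = dmatrix("bs(train, knots=(25,40,60), degree=3, include_intercept=False)",
                        {"train": train_x}, return_type='dataframe')
fit1 = sm.GLM(train_y, transformed_x).fit()

# Transform the new grids with the *same* design_info
# (this assumes the new points lie within the training data's range;
# otherwise pass explicit lower_bound/upper_bound to bs())
design_info = transformed_x.design_info
pred00 = fit1.predict(build_design_matrices([design_info], {"train": xp00})[0])
pred01 = fit1.predict(build_design_matrices([design_info], {"train": xp01})[0])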
I have added the Excel plot from which I get the exponential equation; I am trying to curve-fit this in Python.
My fitted equation is not close to the empirical data I have provided when I use it to predict the y data: the prediction gives f(-25) = 5.30e-11, while the empirical data give f(-25) = 5.3e-13.
How can I improve the code so it predicts closer to the empirical data, or have I made mistakes in my code?
[Figure: Python fitted plot]
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as optimize

# Exponential model
def f(x, A, B):
    return A * np.exp(-B * x)

y_data = np.array([2.156e-05, 1.85e-07, 1.02e-10, 1.268e-11, 5.352e-13])
x = np.array([-28.8, -27.4, -26, -25.5, -25])

p, pcov = optimize.curve_fit(f, x, y_data, p0=[10**(-59), 4], maxfev=5000)

plt.figure()
plt.plot(x, y_data, 'ko', label="Empirical BER")
plt.plot(x, f(x, *p), 'g-', label="Fitted BER")
plt.title("BER")
plt.xlabel('Power Rx (dB)')
plt.ylabel('')
plt.legend()
plt.grid()
plt.yscale("log")
plt.show()
Since you are plotting the data with a log-plot, your view of the data and fit is emphasizing the "tiny" compared to the "small". Fitting uses the sum of the squares of the misfit to determine the best fit. A misfit of a few percent of the data with a y-value of ~2e-5 would completely swamp a misfit of a factor of 10 or even 100 for the data with a y-value of 1.e-11. Your plot is consistent with that.
There are two possible routes to a better fit:
a) If you have uncertainties in the y-values, use them. It's quite possible that the uncertainty in the data with y ~ 2e-5 is much larger than the uncertainty in the data with y ~ 1.e-11, and scaling by the uncertainty so that the minimization is of the sum of squares of (data-model)/uncertainty will help fit the low-value data better. OTOH, if the errors are constant, plotting those uncertainties might show that the fit you have is actually not that bad -- the misfit where y ~ 1.e-11 is only 1.e-10.
b) Realize that you are assessing the fit quality by plotting the log of the data, and embrace that observation: fit log(data) to log(model). Conveniently, for a simple exponential function the log of the model is linear, so you could do linear regression on the log of your data.
Bonus round: recognize that options a) and b) are related. Since a fit minimizes Sum[((data-model)/uncertainty)**2], not providing values for the uncertainty is effectively saying that the uncertainty is the same (=1.0, in fact) for all values of x and y. Fitting the log of the model to the log of the data, as with Sum[(log(data) - log(model))**2], is effectively saying that the uncertainty in log(data) is the same for all values of x and y.
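As an illustration of route (b), here is a minimal sketch (my own, reusing the x and y_data values from the question); route (a) would instead amount to passing per-point uncertainties via curve_fit's sigma argument:

import numpy as np

x = np.array([-28.8, -27.4, -26.0, -25.5, -25.0])
y_data = np.array([2.156e-05, 1.85e-07, 1.02e-10, 1.268e-11, 5.352e-13])

# For y = A*exp(-B*x), log(y) = log(A) - B*x, so a linear fit in log space works
slope, intercept = np.polyfit(x, np.log(y_data), 1)
B = -slope
A = np.exp(intercept)

# The prediction at x = -25 is now within roughly a factor of two of the
# empirical 5.3e-13, instead of two orders of magnitude off
print(A * np.exp(-B * -25.0))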
I want to have an input (x) of 5 points and an output of the same size (y). After that, I should fit a curved line to the dataset. Finally, I should use matplotlib to draw the curved line and the points in order to show a non-linear regression.
I want to fit a single curve to my dataset of 5 points, but it does not seem to work. It is simple, but I'm new to sklearn. Do you know what is wrong with my code?
Here is the code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
#here is the dataset of 5 points
x=np.random.normal(size=5)
y=2.2*x-1.1
y=y+np.random.normal(scale=3,size=y.shape)
x=x.reshape(-1,1)
#I use the PolynomialFeatures module because I want extra dimensions
preproc=PolynomialFeatures(degree=4)
x_poly=preproc.fit_transform(x)
#in this part I want to make 100 points to feed to the polynomial so that I can draw a curve
x_line=np.linspace(-2,2,100)
x_line=x_line.reshape(-1,1)
#at this point I made y_hat in order to have the predicted y values
poly_line=PolynomialFeatures(degree=4)
x_feats=poly_line.fit_transform(x_line)
y_hat=LinearRegression().fit(x_feats,y).predict(x_feats)
plt.plot(y_hat,y_line,"r")
plt.plot(x,y,"b.")
First of all, you are having a LinearRegression problem. As joostblack and Arya commented, your equation is y = 2.2x - 1.1, which is linear. Why do you need polynomial features?
Anyway, if you need to do this task because you have been asked to, here is code that can work:
import numpy as np
import matplotlib.pyplot as plt

x = np.random.normal(size=5)
y = 2.2*x - 1.1
mymodel = np.poly1d(np.polyfit(x, y, 4))
myline = np.linspace(-2, 2, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
As we commented, it is "silly" to fit a linear problem with a degree-4 polynomial because you will always get a linear regression as the solution. It can be useful if you have another relation, like y = x**3 + x - 2 (which is not linear, as you can see):
np.random.seed(0)
x = np.random.normal(size=5)
y = x**3 + x - 2
mymodel = np.poly1d(np.polyfit(x, y, 4))
myline = np.linspace(-2, 3, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
Two final comments. First, you have to understand the difference between LinearRegression and polynomial regression, and in which cases each is useful. Second, I used numpy to solve your problem, not sklearn; it is simpler for this problem, so be aware of that.
I am relatively new to Python. I am trying to do a multivariate linear regression and plot scatter plots and the line of best fit using one feature at a time.
This is my code:
Train=df.loc[:650]
valid=df.loc[651:]
x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]
x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()
regr=linear_model.LinearRegression()
regr.fit(x_train,y_train)
y_pred=regr.predict(x_test)
plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)
plt.show()
And this is the graph that I'm getting:
I have tried searching a lot, but to no avail. I want to understand why this is not showing a line of best fit and why instead it is connecting all the points on the scatter plot.
Thank you!
Linear regression means that you are predicting the value linearly, which will always give you a best-fit line. Anything else is not possible. In your code:
Train=df.loc[:650]
valid=df.loc[651:]
x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]
x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()
regr=linear_model.LinearRegression()
regr.fit(x_train,y_train)
y_pred=regr.predict(x_test)
plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)
plt.show()
Use the right variables to plot the line, i.e.:
plt.plot(x_test,y_pred)
Plot the graph between the values that you put in for the test and the predictions that you get from them, i.e.:
y_pred=regr.predict(x_test)
Also, your model must be trained for the same; otherwise you will get the straight line but the results will be unexpected.
This is multivariate data, so you need to plot the pairwise lines:
http://www.sthda.com/english/articles/32-r-graphics-essentials/130-plot-multivariate-continuous-data/#:~:text=wiki%2F3d%2Dgraphics-,Create%20a%20scatter%20plot%20matrix,pairwise%20comparison%20of%20multivariate%20data.&text=Create%20a%20simple%20scatter%20plot%20matrix.
or change the model so it depends on a single, linearly related feature, which will change the model completely:
Train=df.loc[:650]
valid=df.loc[651:]
x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]
x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()
regr=linear_model.LinearRegression()
# fit and predict on a single feature (note the double brackets to keep a 2D input)
regr.fit(x_train[['lag_7']],y_train)
y_pred=regr.predict(x_test[['lag_7']])
plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)
plt.show()
Assuming your graphical library is matplotlib, imported with import matplotlib.pyplot as plt, the problem is that you passed the same data to both plt.scatter and plt.plot. The former draws the scatter plot, while the latter passes a line through all points in the order given (it first draws a straight line between (x_test['lag_7'][0], y_pred[0]) and (x_test['lag_7'][1], y_pred[1]), then one between (x_test['lag_7'][1], y_pred[1]) and (x_test['lag_7'][2], y_pred[2]), etc.)
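As a side note, here is a tiny standalone sketch (my own, with made-up data) showing that sorting by the x-variable before calling plt.plot yields a clean line through the same scatter:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 2 * x + rng.normal(0, 1, 30)

order = np.argsort(x)                 # sort by x before drawing the line
plt.scatter(x, y, color='black')
plt.plot(x[order], y[order], color='blue', linewidth=3)
plt.show()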
Concerning the more general question about how to do multivariate regression and plot the results, I have two remarks:
Finding the line of best fit one feature at a time amounts to performing 1D regression on that feature: it is an altogether different model from the multivariate linear regression you want to perform.
I don't think it makes much sense to split your data into train and test samples, because linear regression is a very simple model with little risk of overfitting. In the following, I consider the whole data set df.
I like to use OpenTURNS because it has built-in linear regression viewing facilities. The downside is that to use it, we need to convert your pandas tables (DataFrame or Series) to OpenTURNS objects of the class Sample.
import pandas as pd
import numpy as np
import openturns as ot
from openturns.viewer import View
# convert pandas DataFrames to numpy arrays and then to OpenTURNS Samples
X = ot.Sample(np.array(df[['lag_7','rolling_mean', 'expanding_mean']]))
X.setDescription(['lag_7','rolling_mean', 'expanding_mean']) # keep labels
Y = ot.Sample(np.array(df[['sales']]))
Y.setDescription(['sales'])
You did not provide your data, so I need to generate some:
func = ot.SymbolicFunction(['x1', 'x2', 'x3'], ['4*x1 + 0.05*x2 - 2*x3'])
inputs_distribution = ot.ComposedDistribution([ot.Uniform(0, 3.0e6)]*3)
residuals_distribution = ot.Normal(0.0, 2.0e6)
ot.RandomGenerator.SetSeed(0)
X = inputs_distribution.getSample(30)
X.setDescription(['lag_7','rolling_mean', 'expanding_mean'])
Y = func(X) + residuals_distribution.getSample(30)
Y.setDescription(['sales'])
Now, let us find the best-fitting line one feature at a time (1D linear regression):
linear_regression_1 = ot.LinearModelAlgorithm(X[:, 0], Y)
linear_regression_1.run()
linear_regression_1_result = linear_regression_1.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 0], Y, linear_regression_1_result))
linear_regression_2 = ot.LinearModelAlgorithm(X[:, 1], Y)
linear_regression_2.run()
linear_regression_2_result = linear_regression_2.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 1], Y, linear_regression_2_result))
linear_regression_3 = ot.LinearModelAlgorithm(X[:, 2], Y)
linear_regression_3.run()
linear_regression_3_result = linear_regression_3.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 2], Y, linear_regression_3_result))
As you can see, in this example, none of the one-feature linear regressions are able to very accurately predict the output.
Now let us do multivariate linear regression. To plot the result, it is best to view the actual vs. predicted values.
full_linear_regression = ot.LinearModelAlgorithm(X, Y)
full_linear_regression.run()
full_linear_regression_result = full_linear_regression.getResult()
full_linear_regression_analysis = ot.LinearModelAnalysis(full_linear_regression_result)
View(full_linear_regression_analysis.drawModelVsFitted())
As you can see, in this example, the fit is much better with multivariate linear regression than with 1D regressions one feature at a time.