Extending regressions beyond data in Matplotlib (Python)

I'm using Matplotlib and Numpy to plot linear regressions on time series plots in order to predict the trends in the future.
Generating the regressions doesn't seem to be particularly difficult, but getting the regression line to extend past the last data point is proving challenging. How can I extend the regressions?

When you evaluate your regression model, you're predicting a value of submissions for the input date. To predict a wider range, you need to increase the range of dates that you're evaluating the model on. I'd also use np.polyval instead of the list comprehension, just because it's more compact:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generate data like the question
observed_dates = pd.date_range("jan 2004", "april 2013", freq="M")
submissions = np.random.normal(5000, 100, len(observed_dates))
submissions += np.arange(len(observed_dates)) * 10
submissions[::12] += 800

# Plot the observed data
plt.plot(observed_dates, submissions, marker="o")

# Fit a model and predict over a wider range of dates
predict_dates = pd.date_range("jan 2004", "jan 2020", freq="M")
model = np.polyfit(observed_dates.asi8, submissions, 1)
predicted = np.polyval(model, predict_dates.asi8)

# Plot the model
plt.plot(predict_dates, predicted, lw=3)

If you want to extend the regression line beyond the data, for example to cover the entire x range, you can do the following (just change the last three lines):
import numpy as np
# xmin, xmax and the beta coefficients come from your own axis limits and
# your fitted (here quadratic) model.
X = np.arange(xmin, xmax, 50)
line = beta1 * X**2 + beta2 * X + beta3
plt.plot(X, line, 'r-', lw=5.)

Related

How to Fourier transform two graphs and find a function that makes one similar to the other in Python

I have daily data like the result below, and my daily prediction looks like the one below for most of the day.
I found a mild trend: the Y-axis values of my prediction where the X-axis is between 0 and 3000 always need to be amplified much more.
How could I find a function that brings the prediction data closer to the result data? I'd imagine I could do it if I Fourier transformed both arrays, but I also guess there would be a simpler way to do it.
I assume what you want to get as output is a kind of 'best fit' scaling function to match your output to your prediction. One straightforward approach would be to calculate the difference between your output and your prediction and then apply a smoothing algorithm of your choice (e.g. a Savitzky–Golay filter) to get a scaling function.
Minimal example below:
import numpy as np
from scipy.signal import savgol_filter
import matplotlib.pyplot as plt
x = np.linspace(0,2*np.pi,100)
data = np.sin(x) + np.random.normal(0, 0.2, 100) # a noisy sine function
prediction = x # line with slope 1, a bad initial fit
deviation = prediction - data
fit = savgol_filter(deviation, window_length=71, polyorder=2)
plt.plot(x, prediction-fit)
plt.plot(x, data)
Play around with window_length and polyorder to find a suitable degree of smoothing for your dataset.
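To get a feel for those two parameters, a quick comparison might look like the sketch below (not from the original answer; the specific values are arbitrary examples, and x and deviation are the arrays from the snippet above):
# Hedged illustration: compare a few (window_length, polyorder) settings to see
# how strongly each one smooths the deviation.
for wl, po in [(31, 2), (71, 2), (71, 4)]:
    plt.plot(x, savgol_filter(deviation, window_length=wl, polyorder=po),
             label='window_length={}, polyorder={}'.format(wl, po))
plt.plot(x, deviation, '.', alpha=0.3, label='raw deviation')
plt.legend()
plt.show()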

Python linear regression: plt.plot() not showing a straight line; instead it connects every point on the scatter plot

I am relatively new to python. I am trying to do a multivariate linear regression and plot scatter plots and the line of best fit using one feature at a time.
This is my code:
Train=df.loc[:650]
valid=df.loc[651:]
x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]
x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()
regr=linear_model.LinearRegression()
regr.fit(x_train,y_train)
y_pred=regr.predict(x_test)
plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)
plt.show()
And this is the graph that I'm getting:
I have tried searching a lot but to no avail. I wanted to understand why this is not showing a line of best-fit and why instead it is connecting all the points on the scatter plot.
Thank you!
Linear regression means that you are predicting the value linearly, which will always give you a best-fit line. Anything else is not possible. In your code:
Train=df.loc[:650]
valid=df.loc[651:]
x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]
x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()
regr=linear_model.LinearRegression()
regr.fit(x_train,y_train)
y_pred=regr.predict(x_test)
plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)
plt.show()
Use the right variables to plot the line, i.e.:
plt.plot(x_test, y_pred)
Plot the graph between the values that you use for testing and the predictions that you get from them, i.e.:
y_pred = regr.predict(x_test)
Your model must also be trained accordingly; otherwise you will get a straight line, but the results will be unexpected.
This is multivariate data, so you need to plot the pairwise lines
http://www.sthda.com/english/articles/32-r-graphics-essentials/130-plot-multivariate-continuous-data/#:~:text=wiki%2F3d%2Dgraphics-,Create%20a%20scatter%20plot%20matrix,pairwise%20comparison%20of%20multivariate%20data.&text=Create%20a%20simple%20scatter%20plot%20matrix.
or change the model to use a single, linearly dependent feature, which changes the model completely:
Train = df.loc[:650]
valid = df.loc[651:]
x_train = Train[['lag_7', 'rolling_mean', 'expanding_mean']].dropna()
y_train = Train['sales'].dropna()
y_train = y_train.loc[7:]
x_test = valid[['lag_7', 'rolling_mean', 'expanding_mean']].dropna()
y_test = valid['sales'].dropna()
regr = linear_model.LinearRegression()
# Train and predict on the single feature (double brackets keep the input 2D,
# as scikit-learn expects)
regr.fit(x_train[['lag_7']], y_train)
y_pred = regr.predict(x_test[['lag_7']])
plt.scatter(x_test['lag_7'], y_pred, color='black')
plt.plot(x_test['lag_7'], y_pred, color='blue', linewidth=3)
plt.show()
Assuming your graphical library is matplotlib, imported with import matplotlib.pyplot as plt, the problem is that you passed the same data to both plt.scatter and plt.plot. The former draws the scatter plot, while the latter passes a line through all points in the order given (it first draws a straight line between (x_test['lag_7'][0], y_pred[0]) and (x_test['lag_7'][1], y_pred[1]), then one between (x_test['lag_7'][1], y_pred[1]) and (x_test['lag_7'][2], y_pred[2]), etc.)
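As an illustrative aside (not part of the original answer): if you want plt.plot to draw the line cleanly from left to right instead of jumping back and forth, sort the points by the x value before plotting, for example:
# Hedged sketch: sort by the x feature so the line is drawn left to right,
# using the x_test and y_pred from the question's code.
import numpy as np
order = np.argsort(x_test['lag_7'].values)
plt.scatter(x_test['lag_7'], y_pred, color='black')
plt.plot(x_test['lag_7'].values[order], y_pred[order], color='blue', linewidth=3)
plt.show()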
Concerning the more general question about how to do multivariate regression and plot the results, I have two remarks:
Finding the line of best fit one feature at a time amounts to performing 1D regression on that feature: it is an altogether different model from the multivariate linear regression you want to perform.
I don't think it makes much sense to split your data into train and test samples, because linear regression is a very simple model with little risk of overfitting. In the following, I consider the whole data set df.
I like to use OpenTURNS because it has built-in linear regression viewing facilities. The downside is that to use it, we need to convert your pandas tables (DataFrame or Series) to OpenTURNS objects of the class Sample.
import pandas as pd
import numpy as np
import openturns as ot
from openturns.viewer import View
# convert pandas DataFrames to numpy arrays and then to OpenTURNS Samples
X = ot.Sample(np.array(df[['lag_7','rolling_mean', 'expanding_mean']]))
X.setDescription(['lag_7','rolling_mean', 'expanding_mean']) # keep labels
Y = ot.Sample(np.array(df[['sales']]))
Y.setDescription(['sales'])
You did not provide your data, so I need to generate some:
func = ot.SymbolicFunction(['x1', 'x2', 'x3'], ['4*x1 + 0.05*x2 - 2*x3'])
inputs_distribution = ot.ComposedDistribution([ot.Uniform(0, 3.0e6)]*3)
residuals_distribution = ot.Normal(0.0, 2.0e6)
ot.RandomGenerator.SetSeed(0)
X = inputs_distribution.getSample(30)
X.setDescription(['lag_7','rolling_mean', 'expanding_mean'])
Y = func(X) + residuals_distribution.getSample(30)
Y.setDescription(['sales'])
Now, let us find the best-fitting line one feature at a time (1D linear regression):
linear_regression_1 = ot.LinearModelAlgorithm(X[:, 0], Y)
linear_regression_1.run()
linear_regression_1_result = linear_regression_1.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 0], Y, linear_regression_1_result))
linear_regression_2 = ot.LinearModelAlgorithm(X[:, 1], Y)
linear_regression_2.run()
linear_regression_2_result = linear_regression_2.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 1], Y, linear_regression_2_result))
linear_regression_3 = ot.LinearModelAlgorithm(X[:, 2], Y)
linear_regression_3.run()
linear_regression_3_result = linear_regression_3.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 2], Y, linear_regression_3_result))
As you can see, in this example, none of the one-feature linear regressions are able to very accurately predict the output.
Now let us do multivariate linear regression. To plot the result, it is best to view the actual vs. predicted values.
full_linear_regression = ot.LinearModelAlgorithm(X, Y)
full_linear_regression.run()
full_linear_regression_result = full_linear_regression.getResult()
full_linear_regression_analysis = ot.LinearModelAnalysis(full_linear_regression_result)
View(full_linear_regression_analysis.drawModelVsFitted())
As you can see, in this example, the fit is much better with multivariate linear regression than with 1D regressions one feature at a time.

Strange plot after linear regression using Numpy's least squares

I am doing linear regression with multiple variables. To get thetas (coefficients) I used Numpy's least-squares numpy.linalg.lstsq tool. In my data I have n = 143 features and m = 13000 training examples. I want to plot house prices against area and show fitting line for this feature.
Data preparation code (Python):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
path = 'DB2.csv'
data = pd.read_csv(path, header=None, delimiter=";")
data.insert(0, 'Ones', 1)
cols = data.shape[1]
X = data.iloc[:,0:cols-1]
y = data.iloc[:,cols-1:cols]
Getting theta coefficients with numpy.linalg.lstsq:
thetas = np.linalg.lstsq(X, y)[0]
Prediction part:
allAreasData = X.iloc[:,120] #Used as argument to scatter all training data
areasTestValues = X.iloc[0:100,120] #Used as argument for plot function
testingExamples = X.iloc[0:100,:] #Used to make predictions
predictions = testingExamples.dot(thetas)
Note: 120 in the above code is the index of the Area column in my dataset.
Visualization part:
fig, ax = plt.subplots(figsize=(18,10))
ax.scatter(allAreasData, y, label='Traning Data', color='r')
ax.plot(areasTestValues, predictions, 'b', label='Prediction')
ax.legend(loc=2)
ax.set_xlabel('Area')
ax.set_ylabel('Price')
ax.set_title('Predicted Price vs. House Area')
Output plot:
I expected to get a single regression line that fits the data, but instead I got this strange polyline (broken line). What am I doing wrong? The scatter works fine, but the plot does not. To the plot function I pass two arguments:
1) Test area data (100 area examples)
2) Predictions of price based on those 100 examples, which include the area data
Update:
After sorting x I got this plot, with a curve.
I was expecting to get a straight line fitting all my data with least-squares errors, but instead I got a curve. Aren't linear regression and the numpy.linalg.lstsq tool supposed to return a straight fitted line instead of a curve?
Your result is linear in a 143-dimensional space. ;) Since your X contains many more features than just the area, the prediction will also (linearly) depend on those features.
If you redo your training with X = data.iloc[:,120] (only considering the area feature), you should get a straight line when you plot the results.
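A minimal sketch of that single-feature refit (not from the original answer; it assumes the 'Ones' column added earlier sits at index 0, so the model keeps an intercept, and that Area is still column 120):
# Hedged sketch: refit using only the intercept column and the Area column,
# then sort by area so the fitted line is drawn left to right.
X_area = data.iloc[:, [0, 120]]                  # 'Ones' + Area (assumed indices)
thetas_area = np.linalg.lstsq(X_area, y, rcond=None)[0]
area = data.iloc[:, 120]
order = np.argsort(area.values)
fig, ax = plt.subplots(figsize=(18, 10))
ax.scatter(area, y, label='Training Data', color='r')
ax.plot(area.values[order], X_area.values[order].dot(thetas_area), 'b', label='Prediction')
ax.legend(loc=2)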

Prediction plots for statsmodels OLS fit, taking out categorical effects

I have some data for about 500 galaxies in a pandas DataFrame (a few hundred measurements per galaxy), and I'm trying to perform OLS regression on a few variables, one of which is categorical (each galaxy is its own category). Basically, once I have finished fitting the model, I want to plot the data, the fit, and some error bounds, taking out the presumed effects of the categories themselves.
In effect, I want to produce a plot much like the one in "OLS non-linear curve but linear in parameters" section of this tutorial (replicated here).
Instead, I have this (I've just picked two galaxies here, for ease of reading, but it gets really ugly with all 500):
Since there seem to be two "clusters" here, I have concluded that each must correspond to a galaxy. What I really want, though, is to collapse them down into a single line that takes out the inter-category effects, and imagines that they were all one galaxy.
For reference, the code that I'm using to fit and plot is:
m = sm.ols(
    formula='{} ~ Rdeproj + NSAMstar + \
             NSASersicN + C(plateifu)'.format(qty),
    data=dfr)
f = m.fit()
# print dir(f)
ypred = f.predict()
prstd, iv_l, iv_u = wls_prediction_std(f)

plt.close('all')
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(dfr['Rdeproj'], dfr[qty], '.', label='data')
ax.plot(dfr['Rdeproj'], f.fittedvalues, 'r--.', label='pred')
ax.plot(dfr['Rdeproj'], iv_u, 'r--', label='bound')
ax.plot(dfr['Rdeproj'], iv_l, 'r--')
legend = ax.legend(loc='best')
ax.set_xlabel(r'$R_{deproj}$ [Mpc]')
ax.set_ylabel(qty)
plt.tight_layout()
plt.savefig('fits/' + qty + '_fit.png')
I found one similar question asked here, but it seems to only address predicting observations for specific categories, rather than taking out those effects entirely.
Any further advice would be very much appreciated.
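No answer was recorded here, but as a hedged sketch of one possible approach (assuming the fitted result f above, and that statsmodels/patsy names the category terms C(plateifu)[T.<value>], with the reference galaxy absorbed into the intercept): you could subtract each galaxy's fitted category coefficient from its observations and plot the adjusted values as if they were a single galaxy.
# Hypothetical sketch, not from the original thread: remove the fitted
# per-galaxy offsets so all galaxies collapse onto one relation.
offsets = dfr['plateifu'].map(
    lambda g: f.params.get('C(plateifu)[T.{}]'.format(g), 0.0))
adjusted = dfr[qty] - offsets
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(dfr['Rdeproj'], adjusted, '.', label='data (category effect removed)')
ax.plot(dfr['Rdeproj'], f.fittedvalues - offsets, 'r--.', label='pred')
ax.legend(loc='best')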

Prediction intervals for ARMA.predict

The Summary of an ARMA prediction for time series (print arma_mod.summary()) shows some numbers about the confidence interval. Is it possible to use these numbers as prediction intervals in the plot which shows predicted values?
ax = indexed_df.ix[:].plot(figsize=(12,8))
ax = predict_price.plot(ax=ax, style='rx', label='Dynamic Prediction');
ax.legend();
I guess the code:
from statsmodels.sandbox.regression.predstd import wls_prediction_std
prstd, iv_l, iv_u = wls_prediction_std(results)
found here: Confidence intervals for model prediction
...does not apply here, as it is made for OLS rather than for ARMA forecasting. I also checked GitHub but did not find anything new that might relate to time series prediction.
(Making forecasts requires forecast intervals, I guess, especially when it comes to an out-of-sample forecast.)
Help appreciated.
I suppose that, for out-of-sample ARMA prediction, you can use ARMA.forecast from statsmodels.tsa.
It returns three arrays: predicted values, standard error and confidence interval for the prediction.
Example with ARMA(1,1), time series y and prediction 1 step ahead:
import statsmodels.api as sm

# Fit an ARMA(1, 1) model to the series y and forecast one step ahead
arma_res = sm.tsa.ARMA(y, order=(1, 1)).fit()
preds, stderr, ci = arma_res.forecast(1)
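If you want to turn that confidence interval into the shaded band on the plot the question asks about, a rough sketch might look like this (the step count and styling are illustrative, and y is assumed to be a plain array or Series):
# Hedged sketch: plot the forecast with its confidence interval as a band.
import matplotlib.pyplot as plt
import numpy as np

steps = 10
preds, stderr, ci = arma_res.forecast(steps)
forecast_index = np.arange(len(y), len(y) + steps)

plt.plot(np.arange(len(y)), y, label='observed')
plt.plot(forecast_index, preds, 'rx', label='forecast')
plt.fill_between(forecast_index, ci[:, 0], ci[:, 1],
                 color='r', alpha=0.2, label='95% interval')
plt.legend()
plt.show()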
