Predict future values after using polynomial regression in python - python

I'm currently using TensorFlow and SkLearn to to try to make a model that can predict the amount of sales for a certain product, X, based on the outdoor temperature in celcius.
I took my datasets for the temperature and set it equal to the x variable, and the amount of sales to as a y variable. As seen on the picture below, there is some sort of correlation between the temperature and the amount of sales:
First and foremost, I tried to do linear regression to see how well it'd fit. This is the code for that:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train) #fit tries to fit the x variable and y variable.
#Let's try to plot it out.
y_pred = model.predict(x_train)
plt.scatter(x_train,y_train)
plt.plot(x_train,y_pred,'r')
plt.legend(['Predicted Line', 'Observed data'])
plt.show()
This resulted in a predicted line that had a pretty poor fit:
A very nice feature from sklearn however is that you can try to predict an value based on a temperature, so if I were to write
model.predict(15)
i'd get the output
array([6949.05567873])
This is exactly what I want, I just wanted to line to fit better so instead I tried polynoimal regression with sklearn by doing following:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=8, include_bias=False) #the bias is avoiding the need to intercept
x_new = poly.fit_transform(x_train)
new_model = LinearRegression()
new_model.fit(x_new,y_train)
#plotting
y_prediction = new_model.predict(x_new) #this actually predicts x...?
plt.scatter(x_train,y_train)
plt.plot(x_new[:,0], y_prediction, 'r')
plt.legend(['Predicted line', 'Observed data'])
plt.show()
The line seems to fit better now:
My problem is not that I can't use new_model.predict(x) since it'll result in "ValueError: shapes (1,1) and (8,) not aligned: 1 (dim 1) != 8 (dim 0)". I understand that this is because I'm using a 8-degree polynomium, but is there any way for me to predict the y-axsis based on ONE temperature using the polynomial regression model?

Try using new_model.predict([x**a for a in range(1,9)])
or according to your previously used code, you can do new_model.predict(poly.fit_transform(x))
Since you fit a line
y = ax^1 + bx^2 + ... + h*x^8
you, need to transform your input in the same manner i.e. turn it into a polynomial without the intercept and slope terms. This was what you passed into Linear Regression training function. It learns the slope terms for that polynomial. The plot you've shown only contains the x^1 term you indexed into (x_new[:,0]) which means that the data you're using has more columns.
One last note: always make sure your training data and future/validation data undergo the same preprocessing steps to ensure your model works.
Here's some detail :
Let's start by running your code, on synthetic data.
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from numpy.random import rand
x_train = rand(1000,1)
y_train = rand(1000,1)
poly = PolynomialFeatures(degree=8, include_bias=False) #the bias is avoiding the need to intercept
x_new = poly.fit_transform(x_train)
new_model = LinearRegression()
new_model.fit(x_new,y_train)
#plotting
y_prediction = new_model.predict(x_new) #this predicts y
plt.scatter(x_train,y_train)
plt.plot(x_new[:,0], y_prediction, 'r')
plt.legend(['Predicted line', 'Observed data'])
plt.show()
Now we can predict y value by transforming an x-value into a polynomial of degree 8 without an intercept
print(new_model.predict(poly.fit_transform(0.25)))
[[0.47974408]]

Related

My train/test model is returning an error and is train/test model and normal linear regression model two separate models?

I recently attending a class where the instructor was teaching us how to create a linear regression model using Python. Here is my linear regression model:
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
import numpy as np
from sklearn.metrics import r2_score
#Define the path for the file
path=r"C:\Users\H\Desktop\Files\Data.xlsx"
#Read the file into a dataframe ensuring to group by weeks
df=pd.read_excel(path, sheet_name = 0)
df=df.groupby(['Week']).sum()
df = df.reset_index()
#Define x and y
x=df['Week']
y=df['Payment Amount Total']
#Draw the scatter plot
plt.scatter(x, y)
plt.show()
#Now we draw the line of linear regression
#First we want to look for these values
slope, intercept, r, p, std_err = stats.linregress(x, y)
#We then create a function
def myfunc(x):
#Below is y = mx + c
return slope * x + intercept
#Run each value of the x array through the function. This will result in a new array with new values for the y-axis:
mymodel = list(map(myfunc, x))
#We plot the scatter plot and line
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
#We print the value of r
print(r)
#We predict what the cost will be in week 23
print(myfunc(23))
The instructor said we now must use the train/test model to determine how accurate the model above is. This confused me a little as I understood it to mean we will further refine the model above. Or, does it simply mean we will use:
a normal linear regression model
a train/test model
and compare the r values the two different models yield as well as the predicted values they yield?. Is the train/test model considered a regression model?
I tried to create the train/test model but I'm not sure if it's correct (the packages were imported from the above example). When I run the train/test code I get the following error:
ValueError: Found array with 0 sample(s) (shape=(0,)) while a minimum of 1 is required.
Here is the full code:
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
#I display the training set:
plt.scatter(train_x, train_y)
plt.show()
#I display the testing set:
plt.scatter(test_x, test_y)
plt.show()
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
myline = np.linspace(0, 6, 100)
plt.scatter(train_x, train_y)
plt.plot(myline, mymodel(myline))
plt.show()
#Let's look at how well my training data fit in a polynomial regression?
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
r2 = r2_score(train_y, mymodel(train_x))
print(r2)
#Now we want to test the model with the testing data as well
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
r2 = r2_score(test_y, mymodel(test_x))
print(r2)
#Now we can use this model to predict new values:
#We predict what the total amount would be on the 23rd week:
print(mymodel(23))
You better split to train and test using sklearn method:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Where X is your features dataframe and y is the column of your labels. 0.2 stands for 80% train and 20% test.
BTW - the error you are describing could be because you dataframe has only 80 rows, leaving x[80:] empty

Multiple Linear Regression. Coeffs don't match

So I have this small dataset and ı want to perform multiple linear regression on it.
first I drop the deliveries column for it's high correlation with miles. Although gasprice is supposed to be removed, I don't remove it so that I can perform multiple linear regression and not simple linear regression.
finally I removed the outliers and did the following:
Dataset
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# Split X and y into X_
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
#lets find out what are our coeffs of the multiple linear regression and olso find intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the dictionary and print the data
for coef in zip(X.columns, regression_model.coef_[0]):
print("The Coefficient for {} is {}".format(coef[0],coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create a OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
odel_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calulcate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
So the code ends here. I found different coeffs every time I printed them out. what did I do wrong and is any of them correct?
I see you are trying 3 different things here, so let me summarize:
sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only using 80% of the data (but the split should be the same every time you run it since you fixed the random state)
statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not cut up into train-test)
sklearn.linear_model.LinearRegression() with the full dataset, as in n2.
I tried to reproduce with the iris dataset, and I am getting identical results for cases #2 and #3 (which are trained on the same exact data), and only slightly different coefficients for case 1.
In order to evaluate if any of them are "correct", you will need to evaluate the model on unseen data and look at adjusted R^2 score, etc (hence you need the holdout (test) set). If you want to further improve the model you can try to understand better the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula way to specify your model: https://www.statsmodels.org/dev/example_formulas.html

Python: statsmodels - what does .predict(X) actually predict?

I'm a bit confused as to what the line model.predict(X) actually predicts. I can't find anything on it with a Google search.
import statsmodels.api as sm
# Step 1) Load data into dataframe
df = pd.read_csv('my_data.csv')
# Step 2) Separate dependent and independent variables
X = df['independent_variable']
y = df["dependent_variable"]
# Step 3) using OLS -fit a linear regression
model = sm.OLS(y, X).fit()
predictions = model.predict(X) # make predictions
predictions
I'm not sure what predictions is showing? Is it predicting the next x amount of rows or something? Aren't I just passing in my independent variables?
You are fitting an OLS model from your data, which is most likely interpreted as an array. The predict method will returns an array of fitted values given the trained model.
In other words, from statsmodels documentation:
Return linear predicted values from a design matrix.
Similar to the sk-learn. After model = sm.OLS(y, X).fit(), you will have a model, then predictions = model.predict(X) is not predict next x amount of rows, it will predict from your X, the training dataset. The model using ordinary least squares will be a function of "x" and the output should be:
$$ \hat{y}=f(x) $$
If you want to predict the new X, you need to split X into training and testing dataset.
Actually you are doing it wrong
The predict method is use to predict next values
After separating dependent and I dependent values
You can split the data in two part train and test
From sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,0.2)
This will make X_train ur 80% of total data with only independent variable
And you can put your y_test in predict method to check how well the model is performing

Sklearn Linear Regression giving me inaccurate accurate reading?

I have an excel file with contains 3 simple columns: Unit Price, Quantity Sold, and Total. The Total is simply obtained by multiplying unit price with quantity. So i set up a simple sklearn linear regression algorithm code to predict this value:
import numpy as np
import pandas as pd
from sklearn import *
data = pd.read_excel("Sample.xls")[["Units", "Unit Cost", "Total"]]
predict = "Total"
X = np.array(data.drop([predict], 1))
y = np.array(data[predict])
x_train, x_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.1)
linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
acc = linear.score(x_test, y_test)
print(acc)
I ran the print function 3 times and it gave me the following results:
-1.517292267629657
0.9167778505657976
0.15292336331892453
Why am i getting this and not a 100% accuracy? The model should recognise that the prediction is simply multiplication of the frist column with the second
Linear regression doesn't work that way. Plot a graph using matplotlib and see if you get straight line for training data. If your inputs are X1 and X2 your output is X1*X2 and not of the form aX1+bX2. Model is predicting correctly, What is wrong is the model application itself.

Why can't I predict new data using SVM and KNN?

I'm new to machine learning and I just learned KNN and SVM with sklearn. How do I make a prediction for new data using SVM or KNN? I have tried both to make prediction. They make good prediction only when the data is already known. But when I try to predict new data, they give an incorrect prediction.
Here is my code:
import numpy as np
from sklearn import svm
x=np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11]], dtype=np.float64)
y=np.array([2,3,4,5,6,7,8,9,10,11,12], dtype=np.float64)
clf = svm.SVC(kernel='linear')
clf.fit(x, y)
print(clf.predict([[20]]))
print(clf.score(x, y))
0utput:
[12.]
1.0
This code will make a good prediction as long as the data to predict is within the range x_train. But when I try to predict for example 20, or anything above the range x_train, the output will always be 12 which is the last element of y. I don't know what I do wrong in the code.
The code is behaving as mathematically described by a support vector machine.
You must understand how your data are being interpreted by the algorithm. You have 11 data points, and you are giving each one a different class. The SVM ends up basically dividing the number line into 11 segments (for the 11 classes you defined):
data = [(x, clf.predict([[x]])[0]) for x in np.linspace(1, 20, 300)]
plt.scatter([p[0] for p in data], [p[1] for p in data])
plt.show()
The answer by AILearning tells you how to fit your given toy problem, but make sure you also understand why your code wasn't doing what you thought it was. For any finite set of examples there are infinitely many functions that fit the data. Your fundamental issue is you are confusing regression and classification. From the sounds of it, you want a simple regression model to extrapolate a fit function from the data points, but your code is for a classification model.
You have to use a regression model rather than a classification model. For svm based regression use svm.SVR()
import numpy as np
from sklearn import svm
x=np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11]], dtype=np.float64)
y=np.array([2,3,4,5,6,7,8,9,10,11,12], dtype=np.float64)
clf = svm.SVR(kernel='linear')
clf.fit(x, y)
print(clf.predict([[50]]))
print(clf.score(x, y))
output:
[50.12]
0.9996

Categories