I wrote some code to predict Y values; X and Y are arrays of the same length.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
plt.scatter(X, Y, 1)
regr2 = make_pipeline(PolynomialFeatures(10), Ridge())
regr2 = regr2.fit(X[:, np.newaxis], Y)
y_pred = regr2.predict(X[:, np.newaxis])
plt.plot(X, y_pred, color='red')
plt.show()
It works, and it is a good approximation.
But when I do it with train and test values, the plot shows an exponential-looking curve, which it is not supposed to do.
In fact, y_pred1 is roughly X_test plus a small decimal number.
X_train = X[0:int(0.8*len(X))]
X_test = X[int(0.8*len(X)):]
Y_train = Y[0:int(0.8*len(X))]
Y_test = Y[int(0.8*len(X)):]
plt.scatter(X_test, Y_test, 1)
regr3 = make_pipeline(PolynomialFeatures(10), Ridge())
regr3 = regr3.fit(X_train[:, np.newaxis], Y_train)
y_pred1 = regr3.predict(X_test[:, np.newaxis])
plt.plot(X_test, y_pred1, color='red')
plt.show()
I tried several things, even testing the prediction on the train values, and in that case too it plots an exponential instead of an approximation of the points.
Thanks in advance!
Fix Y_train:
Y_train = Y[0:int(0.8*len(X))]
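For what it's worth, a contiguous 80/20 split also means every test point lies outside the training range, and a degree-10 polynomial extrapolates wildly out there. A minimal sketch of a shuffled split instead, assuming X and Y are the 1-D NumPy arrays from the question (it reuses the imports from the snippet above):
from sklearn.model_selection import train_test_split
# Shuffled split: the test points now fall inside the training range,
# so the degree-10 polynomial interpolates instead of extrapolating.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
regr_shuffled = make_pipeline(PolynomialFeatures(10), Ridge())
regr_shuffled = regr_shuffled.fit(X_train[:, np.newaxis], Y_train)
# Sort the test x-values so plt.plot draws one smooth curve.
order = np.argsort(X_test)
X_sorted = X_test[order]
plt.scatter(X_test, Y_test, 1)
plt.plot(X_sorted, regr_shuffled.predict(X_sorted[:, np.newaxis]), color='red')
plt.show()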
I am running the following code; the graph for the training dataset gives an error.
import pandas as pd
import numpy as np
df = pd.read_csv('11.csv')
df.head()
AT V AP RH PE
0 8.34 40.77 1010.84 90.01 480.48
1 23.64 58.49 1011.40 74.20 445.75
2 29.74 56.90 1007.15 41.91 438.76
3 19.07 49.69 1007.22 76.79 453.09
4 11.80 40.66 1017.13 97.20 464.43
x = df.drop(['PE'], axis = 1).values
y = df['PE'].values
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state=0)
from sklearn.linear_model import LinearRegression
ml = LinearRegression()
ml.fit(x_train, y_train)
y_pred = ml.predict(x_test)
print(y_pred)
import matplotlib.pyplot as plt
plt.scatter(x_train, y_train, color = 'red')
plt.plot(x_train, ml.predict(x_test), color = 'green')
plt.show()
Please help me reshape the 2D array to 1D for plotting the graphs.
**ValueError: x and y must be the same size**
EDIT: Now that your question has its formatting fixed, I'm spotting a few errors, with a common theme: you're using 1-D linear regression code to plot a multiple-regression problem.
plt.scatter(x_train, y_train, color='red'): you're trying to plot several variables (AT, V, AP, RH) on one axis using x_train. You cannot do this, since this is multiple linear regression. (For example, one can't put pressure and volume together on the x-axis against temperature on the y-axis; what would the x-axis represent? It doesn't make sense.) I can't suggest exactly what to plot instead, since I don't know what you're trying to show. You can try one variable at a time, e.g. plt.scatter(x_train[:, 0], y_train, color='red') (column 0 is AT; note that x_train here is a NumPy array, not a DataFrame, because x was built with .values). Or you could plot each variable in a different color on the same graph, though I don't recommend this, since your x-axis would mix different units.
plt.plot(x_train, ml.predict(x_test)): the sizes don't match because the prediction comes from the test split; you could plot it against y_test instead, e.g. plt.plot(y_test, ml.predict(x_test)). This is a problem with the length of your data, not with the width/columns as in the error above. If that isn't what you wanted (plotting y_test against your y predictions is a little unusual), you may be carrying 1-D linear regression assumptions into a multiple-regression setting, a recurring theme in these errors. A sketch of both options follows.
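For illustration, a minimal sketch of both suggestions, reusing the splits from the question (the feature order AT, V, AP, RH is how the columns sit after dropping PE):
import matplotlib.pyplot as plt
# One panel per feature: each predictor against the target PE.
feature_names = ['AT', 'V', 'AP', 'RH']
fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharey=True)
for i, name in enumerate(feature_names):
    axes[i].scatter(x_train[:, i], y_train, s=5, color='red')
    axes[i].set_xlabel(name)
axes[0].set_ylabel('PE')
# Predicted vs. actual on the test set: points near the diagonal are good fits.
plt.figure()
plt.scatter(y_test, ml.predict(x_test), s=5, color='green')
plt.xlabel('actual PE')
plt.ylabel('predicted PE')
plt.show()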
I would like to plot y_test and prediction in a scatter plot.
I am using logistic regression as the model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['Spam'])
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=27)
lr = LogisticRegression(solver='liblinear').fit(X_train, y_train)
pred_log = lr.predict(X_test)
I have tried the following:
## Plot the model
plt.scatter(y_test, pred_log)
plt.xlabel("True Values")
plt.ylabel("Predictions")
and I got this (a plot showing only four points), which I do not think is what I should expect.
y_test has shape (250,), and pred_log is likewise (250,).
Am I considering the wrong variables to plot, or are they right?
I have no idea what the plot with those four values means. I would have expected more dots in the plot, but maybe I am wrong.
Please let me know if you need more info. Thanks
I think you know that LogisticRegression is a classification algorithm. In binary classification it predicts whether the class is 0 or 1. If you want to visualize how the model performs, you should consider a confusion matrix; you can't use a scatter plot to visualize classification results.
import seaborn as sns
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, pred_log)
sns.heatmap(cm, annot=True)
The confusion matrix shows how many labels were predicted correctly and how many were wrong. Looking at the confusion matrix, you can calculate how accurate the model is, and you can use further metrics like precision, recall, and F1 score.
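For those extra metrics, a minimal sketch (reusing y_test and pred_log from the question): sklearn's classification_report bundles precision, recall, and F1 per class.
from sklearn.metrics import classification_report
# Per-class precision, recall, and F1 score, plus overall accuracy.
print(classification_report(y_test, pred_log))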
So I have this small dataset and I want to perform multiple linear regression on it.
First I drop the deliveries column for its high correlation with miles. Although gasprice is supposed to be removed, I don't remove it, so that I can perform multiple linear regression rather than simple linear regression.
Finally, I removed the outliers and did the following:
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# Split X and y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
# let's find the coefficients of the multiple linear regression, and also the intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the dictionary and print the data
for coef in zip(X.columns, regression_model.coef_[0]):
    print("The Coefficient for {} is {}".format(coef[0], coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create a OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
model_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calculate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
So the code ends here. I got different coefficients each time I printed them. What did I do wrong, and are any of them correct?
I see you are trying 3 different things here, so let me summarize:
sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only using 80% of the data (but the split should be the same every time you run it since you fixed the random state)
statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not cut up into train-test)
sklearn.linear_model.LinearRegression() with the full dataset, as in #2.
I tried to reproduce this with the iris dataset, and I am getting identical results for cases #2 and #3 (which are trained on exactly the same data), and only slightly different coefficients for case #1.
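For reference, a minimal sketch of that check, using sklearn's built-in iris data purely as a stand-in for your dataset: plain OLS and sklearn's LinearRegression fitted on the same full data should give matching coefficients.
import statsmodels.api as sm
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
iris = load_iris()
X_full, y_full = iris.data, iris.target
# Case #2: statsmodels OLS on the full data (needs an explicit constant).
est_sm = sm.OLS(y_full, sm.add_constant(X_full)).fit()
print(est_sm.params[1:])   # coefficients, skipping the intercept
# Case #3: sklearn on the same full data.
lr = LinearRegression().fit(X_full, y_full)
print(lr.coef_)            # should match the statsmodels coefficients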
In order to evaluate whether any of them is "correct", you will need to evaluate the model on unseen data and look at the adjusted R^2 score, etc. (hence the holdout/test set). If you want to improve the model further, you can try to better understand the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula way to specify your model: https://www.statsmodels.org/dev/example_formulas.html
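As a sketch of that formula interface, using the miles/gasprice/hours column names and the df from your last snippet:
import statsmodels.formula.api as smf
# Fit hours on both predictors; adding 'miles:gasprice' to the formula
# would include their interaction term.
est_formula = smf.ols('hours ~ miles + gasprice', data=df).fit()
print(est_formula.summary())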
I have been battling this problem with my MSE while predicting with regression. I have encountered the same problem with different regression models I have tried to build.
The problem is, my MSE is humongous: 83661743.99, to be exact. My R-squared is 0.91, which does not seem problematic.
I manually implemented the cost function and gradient descent while doing the coursework in Andrew Ng's Stanford ML classes, and I have a reasonable cost function; but when I try to implement it with the sklearn library, the MSE is something else. I don't know what I have done wrong and I need some help checking it out.
Here is the link to the dataset I used: https://www.kaggle.com/farhanmd29/50-startups
My code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
df = pd.read_csv('50_Startups.csv')
#checking the level of correlations between the predictors and response
sns.heatmap(df.corr(), annot=True)
#Splitting the predictors from the response
X = df.iloc[:,:-1].values
y = df.iloc[:,4].values
#Encoding the Categorical values
label_encoder_X = LabelEncoder()
X[:,3] = label_encoder_X.fit_transform(X[:,3])
#Feature Scaling
scaler = StandardScaler()
X = scaler.fit_transform(X)
#splitting train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
#Linear Regression
model = LinearRegression()
model.fit(X_train,y_train)
pred = model.predict(X_test)
#Cost Function
mse = mean_squared_error(y_test,pred)
mse
As you used standard normalization for scaling, the values of the dataset can still be humongous. As desertnaut said, MSE is expressed in the (squared) units of the target, so it can be huge when the target takes big values. You can try normalizing the data with a MinMaxScaler to get the input into [0, 1].
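A minimal sketch of that suggestion, reusing the splits from the question (the scaler is fitted on the training split only, so no test information leaks in):
from sklearn.preprocessing import MinMaxScaler
min_max = MinMaxScaler()                     # maps each feature into [0, 1]
X_train_s = min_max.fit_transform(X_train)   # learn min/max on train only
X_test_s = min_max.transform(X_test)         # reuse the train statistics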
I have come to understand the error of my ways. The MSE is 1/n (n = number of samples) times the sum of the squared differences between the actual and predicted responses, so it is expressed in squared units of the target and will look squared relative to the error value you expect. What I should have looked at is the RMSE, which takes the square root of the MSE. My predictions were off as well, and that was because I scaled my values: unscaled X values gave me much better predictions. This I will have to look into more, as I do not understand why.
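For completeness, a short sketch of that correction, reusing y_test and pred from the question:
import numpy as np
from sklearn.metrics import mean_squared_error
# RMSE is the square root of MSE, so it is in the same units as y.
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(rmse)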
I'm teaching myself some more tricks with python and scikit, and I'm trying to plot a linear regression model. My code can be seen below. But my program and console give the following error: x and y must be the same size. Additionally, my program makes it to the end of my code, but nothing gets plotted.
To fix the size error, the first thing that came to mind was testing the length of x and y with something like len(x) == len(y). But as far as I can tell, my data seems to be the same length. Maybe the error is referring to something other than length (if so, I'm not sure what). Would really appreciate any help.
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn import linear_model
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Create linear regression object
regr = linear_model.LinearRegression()
#load csv file with pandas
df = pd.read_csv("pokemon.csv")
#remove all string columns
df = df.drop(['Name','Type_1','Type_2','isLegendary','Color','Pr_Male','hasGender','Egg_Group_1','Egg_Group_2','hasMegaEvolution','Body_Style'], axis=1)
y= df.Catch_Rate
x_train, x_test, y_train, y_test = train_test_split(df, y, test_size=0.25, random_state=0)
# Train the model using the training sets
regr.fit(x_train, y_train)
# Make predictions using the testing set
pokemon_y_pred = regr.predict(x_test)
print (pokemon_y_pred)
# Plot outputs
plt.title("Linear Regression Model of Catch Rate")
plt.scatter(x_test, y_test, color='black')
plt.plot(x_test, pokemon_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
This is referring to the fact that your x-variable has more than one dimension; plot and scatter only work for 2D plots, and it seems that your x_test has multiple features while y_test and pokemon_y_pred are one-dimensional.
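One common workaround (a sketch, reusing the names from the question) is to plot predictions against the true values, which stays one-dimensional no matter how many features x has:
import matplotlib.pyplot as plt
# Each point is one test Pokemon: actual catch rate vs. predicted.
plt.title("Linear Regression Model of Catch Rate")
plt.scatter(y_test, pokemon_y_pred, color='black')
plt.xlabel("Actual catch rate")
plt.ylabel("Predicted catch rate")
plt.show()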
This error is generated when you have more values of x than of y: x_test has comparatively more columns than y_test, which is why there is a size problem. A 2-D plot needs exactly one x value for each y value, which is the basic mathematical requirement here.