Mean squared error is enormous when using Scikit Learn - python

I have been battling this problem with my MSE while predicting with regression. I have encountered the same problem with different regression models I have tried to build.
The problem is, my MSE is humongous. 83661743.99 to be exact. My R squared is 0.91 which does not seem problematic.
I manually implemented the cost function and gradient descent while doing the coursework in Andrew Ng's Stanford ML classes and I have a reasonable cost function; but when I try to implement it with SKLearn lib the MSE is something else. I don't know what I have done wrong and I need some help checking it out.
Here is the link to the dataset I used: https://www.kaggle.com/farhanmd29/50-startups
My code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
df = pd.read_csv('50_Startups.csv')
#checking the level of correlations between the predictors and response
sns.heatmap(df.corr(), annot=True)
#Splitting the predictors from the response
X = df.iloc[:,:-1].values
y = df.iloc[:,4].values
#Encoding the Categorical values
label_encoder_X = LabelEncoder()
X[:,3] = label_encoder_X.fit_transform(X[:,3])
#Feature Scaling
scaler = StandardScaler()
X = scaler.fit_transform(X)
#splitting train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
#Linear Regression
model = LinearRegression()
model.fit(X_train,y_train)
pred = model.predict(X_test)
#Cost Function
mse = mean_squared_error(y_test,pred)
mse

StandardScaler standardizes the features but does not bound them to a fixed range, and the target is left on its original scale. As desertnaut said, MSE is not scaled, so it can be huge when the values in the dataset are large. You can try normalizing the data with a MinMaxScaler to bring the input into the range [0, 1].

I have come to understand the error of my ways. The MSE is 1/n (n being the number of samples) multiplied by the sum of the squared differences between the actual and the predicted responses. Hence the error is reported in squared units of the response. What I should have looked at was the RMSE, which is the square root of the MSE. My predictions were off as well, and that was because I scaled my values. Un-scaled X values gave me much better predictions. I will have to look into this more, as I do not understand why.
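To make the MSE versus RMSE point concrete, here is a minimal sketch. The numbers are made up for illustration (they are not taken from the 50 Startups dataset) and only assume a target on the scale of tens of thousands:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions on a large scale (e.g. profit in dollars).
y_true = np.array([100000.0, 120000.0, 90000.0, 150000.0])
y_pred = np.array([110000.0, 115000.0, 95000.0, 140000.0])

mse = mean_squared_error(y_true, y_pred)  # squared units of the target
rmse = np.sqrt(mse)                       # same units as the target
r2 = r2_score(y_true, y_pred)             # unitless, so unaffected by the target's scale

# Expressing the same target in thousands divides the MSE by 1000**2.
mse_thousands = mean_squared_error(y_true / 1000, y_pred / 1000)

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.3f}")
print(f"MSE with the target in thousands: {mse_thousands:.2f}")
```

The RMSE is the number to compare against the typical size of the target; a large MSE next to a good R squared is expected when the target itself is large.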

Related

how to build an artificial neural network with scikit-learn?

I am trying to run an artificial neural network with scikit-learn.
I want to run the regression, get the model fit results, and generate out-of-sample forecasts.
This is my code below. Any help will be greatly appreciated.
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn.neural_network import MLPClassifier
from sklearn.neural_network import MLPRegressor
#import the data
df=pd.read_excel(r"C:\Users\Action\Downloads\Python\Practice_Data\sorted_data v2.xlsx")
#view the data
df.head(5)
#to drop a column of data type
df2=df.drop('Unnamed: 13', axis=1)
#view the data
df2.head(5)
Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.metrics import r2_score
describe the data
df.describe().transpose()
target_column = ['public health care services']
predictors = list(set(list(df.columns))-set(target_column))
df[predictors] = df[predictors]/df[predictors].max()
df.describe().transpose()
set the X and Y
X = df[predictors].values
y = df[target_column].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape); print(X_test.shape)
import MLP Classifier and fit the network
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(8,8,8), activation='relu', solver='adam', max_iter=500)
mlp.fit(X_train,y_train)
predict_train = mlp.predict(X_train)
set up the MLP Classifier
mlp = MLPClassifier(
hidden_layer_sizes=(50, 8),
max_iter=15,
alpha=1e-4,
solver="sgd",
verbose=True,
random_state=1,
learning_rate_init=0.1)
import the warnings
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning, module="sklearn")
predict_test = mlp.predict(X_test)
to train on the data I use the MLPClassifier to call the fit function on the training data.
mlp.fit(X_train, y_train)
after this, the neural network is done training.
after the neural network is trained, the next step is to test it.
print out the model scores
print(f"Training set score: {mlp.score(X_train, y_train)}")
print(f"Test set score: {mlp.score(X_test, y_test)}")
y_predict = mlp.predict(X_train)
I am getting an error from the lines below:
x_ann = y_predict[:, 0]
y_ann = y_predict[:, 1]
The error message is
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
Any help will be greatly appreciated.
The predict function gives you the actual class, and since a sample can belong to one and only one class (except in the multi-label case), the output is supposed to be one-dimensional.
What is the shape of your Y_true_labels? It might be the case that your labels are sparse with 2 classes, i.e. 0 and 1, since the model minimises log loss, as described in the documentation:
This model optimizes the log-loss function using LBFGS or stochastic gradient descent.
Also, the documentation for predict_log_proba() says:
log_y_prob : ndarray of shape (n_samples, n_classes)
The predicted log-probability of the sample for each class in the model, where classes are ordered as they are in self.classes_. Equivalent to log(predict_proba(X))
So if the probability of class 1 is 0.3, the sample belongs to class 0, and if it is 0.7, it belongs to class 1, ASSUMING binary classification with a threshold set at 0.5.
What you might be confusing it with is the predict_proba() function, which gives you the probabilities for each class.
Please post the shape and type of your X and Y data so that we can understand the problem better.
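The difference in output shapes can be checked with a small sketch. The data here is synthetic (make_classification), and the network size and max_iter are arbitrary choices for illustration, not the asker's settings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic binary classification data, used only to illustrate the shapes.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0)
clf.fit(X, y)

labels = clf.predict(X)       # 1-D: one class label per sample
proba = clf.predict_proba(X)  # 2-D: one column per class

print(labels.shape)
print(proba.shape)
print(np.allclose(proba.sum(axis=1), 1.0))

# Indexing a column only works on the 2-D probability array.
p_class1 = proba[:, 1]
```

Indexing labels[:, 0] would raise the IndexError from the question, because predict returns a 1-D array. If the goal is regression rather than classification, MLPRegressor is probably the estimator to use; its predict also returns a 1-D array for a single target.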

Multiple Linear Regression. Coeffs don't match

So I have this small dataset and I want to perform multiple linear regression on it.
First I drop the deliveries column because of its high correlation with miles. Although gasprice is supposed to be removed, I keep it so that I can perform multiple linear regression and not simple linear regression.
Finally I removed the outliers and did the following:
Dataset
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# Split X and y into X_
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
#let's find out the coefficients of the multiple linear regression and also find the intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the dictionary and print the data
for coef in zip(X.columns, regression_model.coef_[0]):
    print("The Coefficient for {} is {}".format(coef[0],coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create a OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
model_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calculate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
So the code ends here. I got different coeffs every time I printed them out. What did I do wrong, and are any of them correct?
I see you are trying 3 different things here, so let me summarize:
1. sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only using 80% of the data (but the split should be the same every time you run it, since you fixed the random state)
2. statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not split into train and test)
3. sklearn.linear_model.LinearRegression() with the full dataset, as in #2.
I tried to reproduce with the iris dataset, and I am getting identical results for cases #2 and #3 (which are trained on the same exact data), and only slightly different coefficients for case 1.
In order to evaluate if any of them are "correct", you will need to evaluate the model on unseen data and look at adjusted R^2 score, etc (hence you need the holdout (test) set). If you want to further improve the model you can try to understand better the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula way to specify your model: https://www.statsmodels.org/dev/example_formulas.html
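The effect of the training data on the coefficients can be reproduced with a short sketch. It uses synthetic data rather than the asker's dataset, and it uses NumPy's least-squares solver as a stand-in for statsmodels OLS (an assumption made here to keep the example dependency-free; both solve the same ordinary least squares problem):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3 + 1.5*x1 - 2*x2 + noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# sklearn on the full dataset.
full = LinearRegression().fit(X, y)

# Ordinary least squares on the full dataset with an explicit constant column,
# which is what sm.add_constant followed by sm.OLS sets up.
X_const = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_const, y, rcond=None)

# sklearn on an 80% training split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
part = LinearRegression().fit(X_train, y_train)

print("full data (sklearn):", full.intercept_, full.coef_)
print("full data (lstsq):  ", beta)
print("80% split (sklearn):", part.intercept_, part.coef_)
```

The two full-data fits agree to numerical precision, while the fit on the 80% split differs slightly because it sees different rows.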

My accuracy is at 0.0 and I don't know why?

I am getting an accuracy of 0.0. I am using the boston housing dataset.
Here is my code:
import sklearn
from sklearn import datasets
from sklearn import svm, metrics
from sklearn import linear_model, preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
boston = datasets.load_boston()
x = boston.data
y = boston.target
train_data, test_data, train_label, test_label = sklearn.model_selection.train_test_split(x, y, test_size=0.2)
model = KNeighborsClassifier()
lab_enc = preprocessing.LabelEncoder()
train_label_encoded = lab_enc.fit_transform(train_label)
test_label_encoded = lab_enc.fit_transform(test_label)
model.fit(train_data, train_label_encoded)
predicted = model.predict(test_data)
accuracy = model.score(test_data, test_label_encoded)
print(accuracy)
How can I increase the accuracy on this dataset?
Boston dataset is for regression problems. Definition in the docs:
Load and return the boston house-prices dataset (regression).
So it does not make sense to apply an ordinary label encoding, which treats the targets as if they were not samples from continuous data. For example, you encode 12.3 and 12.4 as completely different labels even though they are very close to each other, and you count the result as wrong if the classifier predicts 12.4 when the real target is 12.3, although this is not a right-or-wrong situation. In classification a prediction is either correct or not, but in regression the error is measured differently, for example with the mean squared error.
This part is not necessary, but I would like to give you an example with the same dataset and source code. The simple idea of rounding the labels towards zero (to the nearest integer in the direction of zero) will give you some intuition.
5.0-5.9 -> 5
6.0-6.9 -> 6
...
50.0-50.9 -> 50
Let's change your code a little bit.
import numpy as np
def encode_func(labels):
    return np.array([int(l) for l in labels])
...
train_label_encoded = encode_func(train_label)
test_label_encoded = encode_func(test_label)
The output will be around 10%.
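For comparison, treating the task as regression might look like the sketch below. It uses synthetic data from make_regression instead of the Boston dataset (load_boston has been removed from recent scikit-learn releases), and the choice of KNeighborsRegressor with 5 neighbors is just an illustrative assumption:

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic continuous target, standing in for house prices.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A regressor predicts continuous values, so no label encoding is needed.
model = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
pred = model.predict(X_test)

# Regression metrics measure how far off the predictions are,
# instead of counting exact matches.
mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)

print(f"MSE: {mse:.2f}")
print(f"R^2: {r2:.3f}")
```

For a regressor, model.score(X_test, y_test) returns R squared rather than accuracy.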

How to set a value for a specific threshold in SVC model and generate a confusion matrix?

I need to set a specific threshold value and generate a confusion matrix. The data is in a CSV file (11.1 MB); the download link is: https://drive.google.com/file/d/1cQFp7HteaaL37CefsbMNuHqPzkINCVzs/view?usp=sharing
First, I received an error message: "AttributeError: predict_proba is not available when probability=False"
So i used this for correction:
svc = SVC(C=1e9,gamma= 1e-07)
scv_calibrated = CalibratedClassifierCV(svc)
svc_model = scv_calibrated.fit(X_train, y_train)
I have read a lot on the internet, but I did not quite understand how to set a custom threshold value. It sounds pretty hard.
Now I see a wrong output:
array([[ 0, 0],
[5359, 65]])
I have no idea what is wrong.
I need help, and I'm new to this.
Thanks.
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('fraud_data.csv')
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.model_selection import train_test_split
    svc = SVC(C=1e9,gamma= 1e-07)
    scv_calibrated = CalibratedClassifierCV(svc)
    svc_model = scv_calibrated.fit(X_train, y_train)
    # set threshold as -220
    y_pred = (svc_model.predict_proba(X_test)[:,1] >= -220)
    conf_matrix = confusion_matrix(y_pred, svc_model.predict(X_test))
    return conf_matrix
answer_four()
This function should return a confusion matrix, a 2x2 numpy array with 4 integers.
This code produces the expected output. Besides using the confusion matrix incorrectly in the previous code, I should also have used decision_function and filtered its output with the -220 threshold.
import numpy as np
def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.model_selection import train_test_split
    #SVC with no kernel specified; the default is rbf
    svc = SVC(C=1e9, gamma=1e-07).fit(X_train, y_train)
    #decision_function scores: predict confidence scores for samples
    y_score = svc.decision_function(X_test)
    #Set a threshold of -220
    y_score = np.where(y_score > -220, 1, 0)
    conf_matrix = confusion_matrix(y_test, y_score)
    ####threshold###
    #the threshold is applied to the scores after the model has been trained
    #the threshold is the boundary that separates the classes
    return conf_matrix
answer_four()
#output:
array([[5320, 24],
[ 14, 66]])
You are using the confusion matrix in a wrong way.
The idea behind the confusion matrix is to have a picture as to how good our predictions y_pred are compared with the ground truth y_true, usually in a test set.
What you actually do here is computing a "confusion matrix" between your predictions with the custom threshold of -220 (y_pred), compared to some other predictions with the default threshold (the output of svc_model.predict(X_test)), which does not make any sense.
Your ground truth for the test set is y_test; so, to get the confusion matrix with the default threshold, you should use
confusion_matrix(y_test, svc_model.predict(X_test))
To get the confusion matrix with your custom threshold of -220, you should use
confusion_matrix(y_test, y_pred)
See the documentation for more details on its usage (the documentation is your best friend and should always be the first place to look when you have issues or doubts).
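Putting the two points together (compare against y_test, and threshold the decision_function scores), a minimal sketch could look like this. The data is synthetic and imbalanced, the SVC uses default hyperparameters, and the threshold of -0.5 is a hypothetical value chosen for illustration. Note that predict_proba returns values in [0, 1], so comparing them with -220 is always true and every sample ends up in the positive class, whereas decision_function returns unbounded scores where a negative threshold is meaningful.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced data standing in for the fraud dataset.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

svc = SVC().fit(X_train, y_train)

# Signed scores: larger values mean more confidence in the positive class.
scores = svc.decision_function(X_test)

# Hypothetical custom threshold; lowering it flags more samples as positive.
threshold = -0.5
y_pred_custom = (scores > threshold).astype(int)

# Ground truth goes first, predictions second.
cm_default = confusion_matrix(y_test, svc.predict(X_test))
cm_custom = confusion_matrix(y_test, y_pred_custom)

print(cm_default)
print(cm_custom)
```

Rows of each matrix correspond to the true classes and columns to the predicted classes.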

Logistic Regression - Machine Learning

Logistic Regression with inputs of "Machine Learning.csv" file.
#Import Libraries
import pandas as pd
#Import Dataset
dataset = pd.read_csv('Machine Learning Data Set.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 10]
#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0)
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
#Fitting Logistic Regression to the Training Set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train,y_train)
#Predicting the Test set results
y_pred = classifier.predict(X_test)
#Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
I have a machine learning / logistic regression code (python) as above. It has properly trained my model and gives a really good match with the test data. But unfortunately it is only giving me 0/1 (binary) results when I test with some other random values. (the training set has only 0/1 - as in failed/succeeded)
How can I get a probability result instead of a binary result from this algorithm? I have tried very different sets of numbers and would like to find out the probability of failing, instead of a 0 or a 1.
Any help is strongly appreciated :) Thanks a lot!
Just replace
y_pred = classifier.predict(X_test)
with
y_pred = classifier.predict_proba(X_test)
For details, refer to Logistic Regression Probability.
predict_proba(X_test) will give you the probability of each class for each sample, i.e. if X_test contains n_samples rows and you have 2 classes, the output of the above function will be an n_samples x 2 matrix, and the two class probabilities in each row sum to 1. For more details, have a look at the documentation.
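A small sketch of what that looks like in practice, using synthetic data in place of the asker's CSV file and assuming that label 0 means "failed" (the column order of predict_proba follows classifier.classes_, so it is looked up rather than hard-coded):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary data standing in for the asker's dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training set only, then apply it to the test set.
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# One row per sample, one column per class, each row summing to 1.
proba = classifier.predict_proba(X_test)

# Assumption: label 0 means "failed"; find its column via classes_.
fail_column = list(classifier.classes_).index(0)
p_fail = proba[:, fail_column]

print(proba.shape)
print(p_fail[:5])
```

New samples must go through the same sc_X.transform call before being passed to predict_proba.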
