How to calculate r-squared with python? - python

I have fitted a model from which I'd like to know the scores (r-squared).
The data is split into a training and testing set. Although the model is only trained using the training set, how is it possible that my r-squared for my testing data is higher? I mean the model has never seen the testing set, but is more accurate than with the training set... Am I interpreting something wrong?
My code:
import pandas as pd
import numpy
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy
import sklearn
from sklearn.linear_model import LinearRegression
from scipy import stats
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
df=pd.read_csv("https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-
data/CognitiveClass/DA0101EN/module_5_auto.csv")
df=df._get_numeric_data()
y_data = df['price']
x_data=df.drop('price',axis=1)
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data,
test_size=0.15, random_state=1)
lr=LinearRegression()
lr.fit(x_train[['horsepower']], y_train)
h=lr.score(x_train[['horsepower']], y_train).mean()
h2=lr.score(x_test[['horsepower']], y_test).mean()
print(h,h2)

To put it simply, R-Squared is used to find the 'difference in percent' or calculate the accuracy of two time-series datasets.
Formula
Note: squaring Pearsons-r, squaring pandas corr(), or r^2 have slightly different results than R^2 formula shown above, this is due to 'statistic round up' reasons... refer to Max Pierini's answer
SciKit Learn R-squared is very different from square of Pearson's Correlation R
Method 1: function
def r_squared(y, y_hat):
y_bar = y.mean()
ss_tot = ((y-y_bar)**2).sum()
ss_res = ((y-y_hat)**2).sum()
return 1 - (ss_res/ss_tot)
Method 2: sklearn library
from sklearn.metrics import r2_score
r2 = r2_score(actual, predicted)
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html

It looks like you're using scikit-learn. If so, you can use the r2_score metric.

Related

How to make knn faster?

I have a dataset of shape(700000,20) and I want to apply KNN to it.
However on testing it takes really huge time,can someone expert please help to let me know how can I reduce the KNN predicting time.
Is there something like GPU-KNN or something.Please help to let me know.
Below is the code I am using.
import os
os.chdir(os.path.dirname(os.path.realpath(__file__)))
import tensorflow as tf
import pandas as pd
import numpy as np
from joblib import load, dump
import numpy as np
from scipy.spatial import distance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from dtaidistance import dtw
window_length = 20
n = 5
X_train = load('X_train.pth').reshape(-1,20)
y_train = load('y_train.pth').reshape(-1)
X_test = load('X_test.pth').reshape(-1,20)
y_test = load('y_test.pth').reshape(-1)
#custom metric
def DTW(a, b):
return dtw.distance(a, b)
clf = KNeighborsClassifier(metric=DTW)
clf.fit(X_train, y_train)
#evaluate
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
I can suggest reducing the number of features which i think its 20 features from your dataset shape, Which mean you have 20 dimensions.
You can reduce the number of features by using PCA (Principal Component Analysis) like the following:
from sklearn.decomposition import PCA
train_data_pca = PCA(n_components=10)
reduced_train_data = train_data_pca.fit_transform(train_data)
this code will reduce deminisions for example to 10 instead of 20
Second issue in your code, that I see that you are not using th K neighboors value in the classifier, It should be as the following:
clf = KNeighborsClassifier(n_neighbors=n, metric=DTW)
The metric dtw is taking too much time while simple knn is working fine.

How to calculate the accuracy?

I'm trying to calculate the accuracy for a twitter sentiment analysis project. However, I get this error, and I was wondering if anyone could help me calculate the accuracy? Thanks
Error: ValueError: Classification metrics can't handle a mix of continuous and multiclass targets
My code:
import re
import pickle
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_csv("updated_tweet_info.csv")
data = df.fillna(' ')
train,test = train_test_split(data, test_size = 0.2, random_state = 42)
train_clean_tweet=[]
for tweet in train['tweet']:
train_clean_tweet.append(tweet)
test_clean_tweet=[]
for tweet in test['tweet']:
test_clean_tweet.append(tweet)
v = CountVectorizer(analyzer = "word")
train_features= v.fit_transform(train_clean_tweet)
test_features=v.transform(test_clean_tweet)
lr = RandomForestRegressor(n_estimators=200)
fit1 = lr.fit(train_features, train['clean_polarity'])
pred = fit1.predict(test_features)
accuracy = accuracy_score(pred, test['clean_polarity'])`
You are trying to use the accuracy_score method, but accuracy is a classification metric.
In your case, try using a regression metric method like: mean_squared_error() and then applying np.sqrt(). This will return you the Root Mean Squared Error. The lower the number, the better. You can also look here for more details.
Try this:
import numpy as np
rmse = np.sqrt(mean_squared_error(test['clean_polarity'], pred))
This guy also had the same problem

Mean squared error is enormous when using Scikit Learn

I have been battling this problem with my MSE while predicting with regression. I have encountered the same problem with different regression models I have tried to build.
The problem is, my MSE is humongous. 83661743.99 to be exact. My R squared is 0.91 which does not seem problematic.
I manually implemented the cost function and gradient descent while doing the coursework in Andrew Ng's Stanford ML classes and I have a reasonable cost function; but when I try to implement it with SKLearn lib the MSE is something else. I don't know what I have done wrong and I need some help checking it out.
Here is the link to the dataset I used: https://www.kaggle.com/farhanmd29/50-startups
My code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
df = pd.read_csv('50_Startups.csv')
#checking the level of correlations between the predictors and response
sns.heatmap(df.corr(), annot=True)
#Splitting the predictors from the response
X = df.iloc[:,:-1].values
y = df.iloc[:,4].values
#Encoding the Categorical values
label_encoder_X = LabelEncoder()
X[:,3] = label_encoder_X.fit_transform(X[:,3])
#Feature Scaling
scaler = StandardScaler()
X = scaler.fit_transform(X)
#splitting train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
#Linear Regression
model = LinearRegression()
model.fit(X_train,y_train)
pred = model.predict(X_test)
#Cost Function
mse = mean_squared_error(y_test,pred)
mse
As you used standard normalization for scaling, the values of the dataset can be humongous. As desertnaut said, MSE is not scaled so it can be huge due to the big values of the dataset. You can try to normalize data using a MinMaxScaler to get the iput between [0-1]
I have gotten to understand the error of my ways. The MSE is 1/n (No of Samples) multiplied by the summation of the actual response subtracted by the predicted response SQUARED. Hence the error given will be SQUARED the expected error value. what I should have looked out for was the RMSE which will find the sqrt of the MSE. my predictions were off as well and that was because I scaled my values. Un-scaled X values gave me much better predictions. This I will have to look into more as I do not understand why.

How to know whether i am overfitting/underfitting my data?

So i have to build a regression model to predict wine quality based on 11 inputs. Currently i am evaluating the Mean Squared Error, Mean absolute error and R2 scores of various algorithms. I want to make a decision on which algorithm to use, but before i do, i want to make sure my data is not being overfitted/underfitted. Below is the link to the dataset i use (its a bit different but the data is exactly the same) as well as my entire code.
Any help is greatly appreciated!
Data:
https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/
Also, the kagggle link where i copied most of my code from:
https://www.kaggle.com/jhansia/regression-models-analysis-on-the-wine-quality
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
wine = pd.read_csv('wineQualityReds.csv', usecols=lambda x: 'Unnamed' not in x,)
wine.head()
y = wine.quality
X = wine.drop('quality',axis = 1)
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(X,y,random_state = 0, stratify = y)
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(train_x)
train_x_scaled = scaler.transform(train_x)
test_x_scaled = scaler.transform(test_x)
from sklearn import model_selection
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error
models = []
models.append(('DecisionTree', DecisionTreeRegressor()))
models.append(('RandomForest', RandomForestRegressor()))
models.append(('GradienBoost', GradientBoostingRegressor()))
models.append(('SVR', SVR()))
names = []
for name,model in models:
kfold = model_selection.KFold(n_splits=5,random_state=2)
cv_results = model_selection.cross_val_score(model,train_x_scaled,train_y, cv= kfold, scoring = 'neg_mean_absolute_error')
names.append(name)
msg = "%s: %f" % (name, -1*(cv_results).mean())
print(msg)
model = RandomForestRegressor()
model.fit(train_x_scaled,train_y)
pred_y = model.predict(test_x_scaled)
from sklearn import metrics
print('Mean Squared Error:', metrics.mean_squared_error(test_y, pred_y))
print('Mean Absolute Error:', metrics.mean_absolute_error(test_y, pred_y))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(test_y, pred_y)))
print('R2:', metrics.r2_score(test_y, pred_y))
You can use cross validation on the data sets to find whether it is over fitting or under fitting.

ROC curve for multi-class classification without one vs all in python

I have a multi-class classification problem with 9 different classes. I am using the AdaBoostClassifier class from scikit-learn to train my model without using the one vs all technique, as the number of classes is very high and it might be inefficient.
I have tried using the tips from the documentation in scikit learn [1], but there the one vs all technique is used, which is substantially different. In my approach I only get one prediction per event, i.e. if I have n classes, the outcome of the prediction is a single value within the n classes. For the one vs all approach, on the other hand, the outcome of the prediction is an array of size n with a sort of likelihood value per class.
[1]
https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
The code is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # Matplotlib plotting library for basic visualisation
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc
from sklearn import preprocessing
# Read data
df = pd.read_pickle('data.pkl')
# Create the dependent variable class
# This will substitute each of the n classes from
# text to number
factor = pd.factorize(df['target_var'])
df.target_var= factor[0]
definitions = factor[1]
X = df.drop('target_var', axis=1)
y = df['target_var]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
bdt_clf = AdaBoostClassifier(
DecisionTreeClassifier(max_depth=2),
n_estimators=250,
learning_rate=0.3)
bdt_clf.fit(X_train, y_train)
y_pred = bdt_clf.predict(X_test)
#Reverse factorize (converting y_pred from 0s,1s, 2s, etc. to their original values
reversefactor = dict(zip(range(9),definitions))
y_test_rev = np.vectorize(reversefactor.get)(y_test)
y_pred_rev = np.vectorize(reversefactor.get)(y_pred)
I tried directly with the roc curve function, and also binarising the labels, but I always get the same error message.
def multiclass_roc_auc(y_test, y_pred):
lb = preprocessing.LabelBinarizer()
lb.fit(y_test)
y_test = lb.transform(y_test)
y_pred = lb.transform(y_pred)
return roc_curve(y_test, y_pred)
multiclass_roc_auc(y_test, y_pred_test)
The error message is:
ValueError: multilabel-indicator format is not supported
How could this be sorted out? Am I missing some important concept?

Categories