r2 score turns out to be negative - python

I am studying support vector regression, but I ran into a problem: my r2 score comes out negative. Is that normal, or is there something in my code I can change to fix it?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVR
df = pd.read_csv('Position_Salaries.csv')
df.head()
X = df.iloc[:, 1:2].values
y = df.iloc[:, -1].values
from sklearn.preprocessing import StandardScaler
y = y.reshape(len(y),1)
x_scaler = StandardScaler()
y_scaler = StandardScaler()
X = x_scaler.fit_transform(X)
y = y_scaler.fit_transform(y)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)
regressor = SVR(kernel="rbf")
regressor.fit(x_train,y_train.ravel())
y_pred = y_scaler.inverse_transform(regressor.predict(x_scaler.transform(x_test)))
from sklearn.metrics import r2_score
r2_score(y_scaler.inverse_transform(y_test), y_pred)
My output is -0.5313206322807349

Here, X is replaced by its scaled version:
X = x_scaler.fit_transform(X)
Because train_test_split is then applied to the scaled X, x_test is already scaled as well:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)
So when predicting, you shouldn't transform the input again. This line scales x_test a second time, which pushes the inputs far outside the range the model was trained on:
y_pred = y_scaler.inverse_transform(regressor.predict(x_scaler.transform(x_test)))
Pass x_test to regressor.predict() directly instead.
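A minimal sketch of that fix, assuming synthetic data in place of Position_Salaries.csv (the names X_scaled, y_scaled, r2_correct and r2_double are illustrative and not from the question). It predicts once on the already-scaled x_test and once on the double-scaled input, so the two scores can be compared side by side:

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Assumption: a smooth synthetic target stands in for the salary column.
X = np.linspace(1, 10, 50).reshape(-1, 1)
y = 1000.0 * X ** 2  # shape (50, 1), already 2D for the scaler

x_scaler = StandardScaler()
y_scaler = StandardScaler()
X_scaled = x_scaler.fit_transform(X)
y_scaled = y_scaler.fit_transform(y)

# The split is done on scaled data, so x_test is already scaled.
x_train, x_test, y_train, y_test = train_test_split(
    X_scaled, y_scaled, test_size=0.4, random_state=42
)

regressor = SVR(kernel="rbf")
regressor.fit(x_train, y_train.ravel())

y_true = y_scaler.inverse_transform(y_test)

# Correct: predict on x_test as-is, then undo the target scaling.
# reshape(-1, 1) keeps the predictions 2D for inverse_transform.
y_pred = y_scaler.inverse_transform(regressor.predict(x_test).reshape(-1, 1))
r2_correct = r2_score(y_true, y_pred)

# Bug from the question: x_test is scaled a second time before predicting.
y_pred_double = y_scaler.inverse_transform(
    regressor.predict(x_scaler.transform(x_test)).reshape(-1, 1)
)
r2_double = r2_score(y_true, y_pred_double)

print("single scaling:", r2_correct)
print("double scaling:", r2_double)
```

The exact numbers depend on the data, so none are quoted here; the point is only that the double-scaled prediction scores worse.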

From the documentation of sklearn.metrics.r2_score:
Best possible score is 1.0 and it can be negative (because the model
can be arbitrarily worse). A constant model that always predicts the
expected value of y, disregarding the input features, would get a R^2
score of 0.0.
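As a small illustration of that statement (the arrays below are made up for the example), a constant prediction equal to the mean of y scores exactly 0.0, and a prediction that is worse than the mean scores below zero:

```python
import numpy as np
from sklearn.metrics import r2_score

# R^2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true.
y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Constant model: always predict the mean of y_true.
mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, mean_pred)

# A model that gets the trend backwards is worse than the constant model.
bad_pred = y_true[::-1].copy()
r2_bad = r2_score(y_true, bad_pred)

print(r2_mean)  # 0.0
print(r2_bad)   # -3.0
```

So a negative score is not an error in itself; it says the predictions fit the test data worse than a flat line at the mean would.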

Related

What does the error mean and how to fix it - "ValueError: query data dimension must match training data dimension"

I am trying to write the code for K-NN.
Below is my code. I know that the issue is in `predict()`, but I am not able to figure out how to fix it.
# Importing the libraries
import numpy as np
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('UniversalBank.csv')
X = dataset.iloc[:, [1, 2, 3, 5, 6, 7, 8, 10, 11, 12, 13]].values
y = dataset.iloc[:,9].values
#Splitting the dataset to training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state= 0)
#Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Fitting the classifier to training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train,y_train)
#Predicting the test results
y_pred = classifier.predict(X_test)

`setCalculateVarImportance` changes result in `cv2.ml.RTrees` model

Looking at the documentation for cv2.ml.RTrees, it says
calcVarImportance – If true then variable importance will be calculated and then it can be retrieved by RTrees::getVarImportance.
It sounds like this parameter should only change whether the variable importance is calculated or not. It should not change the model's output.
However, as the MCVE below shows, it does. Why?
import cv2
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import numpy as np
data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1729)
forest = cv2.ml.RTrees_create()
forest.setTermCriteria((cv2.TERM_CRITERIA_MAX_ITER,10,1))
forest.setCalculateVarImportance(True)
forest.train(cv2.ml.TrainData_create(np.float32(X_train), 0, y_train))
preds = forest.predict(np.float32(X_test), 0)[1]
print(sum(preds))
# output: [94.]
forest = cv2.ml.RTrees_create()
forest.setTermCriteria((cv2.TERM_CRITERIA_MAX_ITER,10,1))
forest.setCalculateVarImportance(False)
forest.train(cv2.ml.TrainData_create(np.float32(X_train), 0, y_train))
preds_new = forest.predict(np.float32(X_test), 0)[1]
print(sum(preds_new))
# output: [95.]

Error code with Preprocessor Scaling?

I am using KNN and wanted to experiment with different normalizers (Normalizer(), MinMaxScaler(), StandardScaler(), etc.).
I have loaded the data into a variable called X:
X = pd.read_csv('C:/Users/rmahesh/documents/parkinson.csv')
After doing some data wrangling, I try to run this code:
from sklearn import preprocessing
from sklearn.decomposition import PCA
T = preprocessing.Normalizer().fit(X)
from sklearn.cross_validation import train_test_split
T_train, T_test, y_train, y_test = train_test_split(T, y, test_size = 0.3, random_state = 7)
from sklearn.svm import SVC
model = SVC()
model = model.fit(T_train, y_train)
score = model.score(T_test, y_test)
print(score)
The specific error I am getting is this:
TypeError: Singleton array array(Normalizer(copy=True, norm='l2'), dtype=object) cannot be considered a valid collection.
The error is raised on this line:
T_train, T_test, y_train, y_test = train_test_split(T, y, test_size = 0.3, random_state = 7)
Any help would be greatly appreciated!
You're fitting your normalizer and then treating the fitted object as if it were an array. Replace
T = preprocessing.Normalizer().fit(X)
with
T = preprocessing.Normalizer().fit_transform(X)
so that the actual output of the normalization is used instead. .fit() returns the Normalizer object itself, whereas .fit_transform() returns the normalized data, which is what train_test_split needs.
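To make the difference concrete, here is a short sketch on a made-up two-row array (the names fitted and T_demo are illustrative):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X_demo = np.array([[3.0, 4.0], [1.0, 0.0]])

# .fit() returns the estimator, not data.
fitted = Normalizer().fit(X_demo)
print(type(fitted).__name__)  # Normalizer

# .fit_transform() returns the normalized array (each row scaled to unit L2 norm).
T_demo = Normalizer().fit_transform(X_demo)
print(T_demo)
```

Passing fitted to train_test_split is what produces the "Singleton array" error, because the function receives a single object rather than a collection of samples.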

Why does the accuracy differ each time I run the classifier?

Each time I run this code, the accuracy comes out different. Can anyone please explain why? Am I missing something here? Thanks in advance :)
Below is my code:
import scipy
import numpy
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .5)
# Use a K-nearest-neighbors classifier
from sklearn.neighbors import KNeighborsClassifier
my_classifier = KNeighborsClassifier()
my_classifier.fit(X_train,y_train)
predictions = my_classifier.predict(X_test)
print(predictions)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,predictions))
train_test_split splits the data into training and test sets at random, so you get a different split, and therefore a different accuracy, each time you run the script. To get the same split on every run, set the random_state parameter to a fixed number:
X_train, X_test, y_train,y_test = train_test_split(X,y, test_size = .5, random_state = 0)
This should give you an accuracy of 0.96 every time.
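A quick way to check the reproducibility yourself is to run the same pipeline twice with the same seed. This is a sketch; the helper run_once is a hypothetical name introduced only for the example:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

def run_once(seed):
    """Split with a fixed seed, fit KNN, and return the test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=seed
    )
    clf = KNeighborsClassifier().fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))

acc_a = run_once(0)
acc_b = run_once(0)
print(acc_a == acc_b)  # True: same seed, same split, same accuracy
```

Leaving random_state unset (the default, None) draws a fresh split on every call, which is where the run-to-run variation comes from.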

How to split datasets - Number of labels=150 does not match number of samples=600

I have a dataset of 750 rows and 256 columns.
Rows = 750
Columns = 256
If I use a test size of 20%, I end up with 600 samples in X_train but only 150 in y_train.
The problem then occurs when fitting the decision tree model:
it says Number of labels=150 does not match number of samples=600
But if I set test_size to 50%, it works.
Is there a way around this? I don't want to use 50% for test_size.
Any help would be great!
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import graphviz
#Load the data
dataset = pd.read_csv('new_york.csv')
dataset['Higher'] = dataset['2016-12'].gt(dataset['2016-11']).astype(int)
X = dataset.iloc[:, 6:254].values
y = dataset.iloc[:, 255].values
#Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, :248])
X[:, :248] = imputer.transform(X[:, :248])
#Split the data into train and test sets
from sklearn.cross_validation import train_test_split
X_train, X_test, y_test, y_train = train_test_split(X, y, test_size = .2, random_state = 0)
#let's build our first model
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, export_graphviz
clf = DecisionTreeClassifier(max_depth=6)
clf.fit(X_train, y_train)
clf.score(X_train, y_train)
train_test_split() returns X_train, X_test, y_train, y_test, in that order; you have y_train and y_test swapped.
A 50% split doesn't raise an error because y_train and y_test then have the same size (but the labels are still the wrong ones, so the model is trained on mismatched data).
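A sketch of the corrected unpacking, using placeholder arrays with the row count from the question (the zeros stand in for the real data, which is not available here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 750 samples, 248 feature columns as selected in the question.
X = np.zeros((750, 248))
y = np.zeros(750)

# Return order is: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, y_train.shape)  # (600, 248) (600,)
print(X_test.shape, y_test.shape)    # (150, 248) (150,)
```

With the names in the right order, the training features and labels have matching lengths for any test_size.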
