Looking at the documentation for cv2.ml.RTrees, it says
calcVarImportance – If true then variable importance will be calculated and then it can be retrieved by RTrees::getVarImportance.
It sounds like this parameter should only control whether the variable importance is calculated; it should not change the model's output.
However, as the MCVE below shows, it does. Why?
import cv2
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import numpy as np
data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1729)
forest = cv2.ml.RTrees_create()
forest.setTermCriteria((cv2.TERM_CRITERIA_MAX_ITER, 10, 1))
forest.setCalculateVarImportance(True)
# layout 0 == cv2.ml.ROW_SAMPLE: each row of X_train is one sample
forest.train(cv2.ml.TrainData_create(np.float32(X_train), 0, y_train))
preds = forest.predict(np.float32(X_test), 0)[1]
print(sum(preds))
# output: [94.]
forest = cv2.ml.RTrees_create()
forest.setTermCriteria((cv2.TERM_CRITERIA_MAX_ITER, 10, 1))
forest.setCalculateVarImportance(False)
# identical setup, except the importance flag is off
forest.train(cv2.ml.TrainData_create(np.float32(X_train), 0, y_train))
preds_new = forest.predict(np.float32(X_test), 0)[1]
print(sum(preds_new))
# output: [95.]
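Not part of the original post, but one way to probe this (assuming RTrees draws on OpenCV's default RNG, which cv2.setRNGSeed controls) is to pin the seed immediately before each train() call; if the two configurations still disagree under an identical seed, the importance calculation itself is consuming random draws during training and thereby changing the trees:
cv2.setRNGSeed(0)  # hypothetical diagnostic: pin OpenCV's global RNG before training
forest.train(cv2.ml.TrainData_create(np.float32(X_train), 0, y_train))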
I wanted to find an optimal model for the assigned classification problem. Everything went smoothly until I applied the pd.get_dummies() function to preprocess the data. The experiment then showed an impossibly perfect result. I know that is unlikely to happen, but I do not know why. Any help would be highly appreciated.
The code for preprocessing the data is below:
import pandas as pd  # df is assumed to be loaded already
# Encoding Booking Status
status_dict = {'Not_Canceled':1, 'Canceled':0}
df.booking_status = df.booking_status.map(status_dict)
df.drop('Booking_ID',axis=1, inplace=True)
df = df.dropna()
df = pd.get_dummies(df)
# Standardizing Data
from sklearn.preprocessing import StandardScaler
import numpy as np
X = df.iloc[:,0:-1]
y = df.iloc[:,-1]  # caution: see the note after this block on column order
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
np.set_printoptions(precision=3)
print(rescaledX[0:5,:])
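One thing worth checking here (a note added for this writeup, not from the original post): pd.get_dummies appends the new dummy columns at the end of the frame, so after encoding, df.iloc[:, -1] may no longer be booking_status; if the target column then stays inside X, it leaks into the features and can produce exactly this kind of impossibly perfect score. Selecting by name avoids the ambiguity:
print(df.columns[-1])  # quick check: is the last column really the target?
y = df['booking_status']
X = df.drop(columns=['booking_status'])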
And I split my data into training and test sets with a test proportion of 0.3:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(rescaledX, y, test_size=0.3, random_state=15)
I used several models, and the amazing result is:
(screenshot: every model scores almost perfectly)
Simple code, stupid me. By the way, I am just a beginner in the ML field. Any advice on how to master it?
It was caused by data leakage. You must split your data before any data pre-processing step. For example:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=15)
Then do the scaling separately: fit the scaler on the training data only, and apply it to both sets.
scaler = StandardScaler().fit(X_train)
rescaledX_train = scaler.transform(X_train)
rescaledX_test = scaler.transform(X_test)
You could use a Pipeline as well to avoid data leakage:
# correct data preparation for model evaluation with k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the pipeline
steps = list()
steps.append(('scaler', MinMaxScaler()))
steps.append(('model', LogisticRegression()))
pipeline = Pipeline(steps=steps)
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))
Ref: https://machinelearningmastery.com/data-preparation-without-data-leakage/
I am studying support vector regression, but I ran into a problem: my R^2 score comes out negative. Is that normal, or is there some part of my code I can change to fix it?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVR
df = pd.read_csv('Position_Salaries.csv')
df.head()
X = df.iloc[:, 1:2].values
y = df.iloc[:, -1].values
from sklearn.preprocessing import StandardScaler
y = y.reshape(len(y),1)
x_scaler = StandardScaler()
y_scaler = StandardScaler()
X = x_scaler.fit_transform(X)
y = y_scaler.fit_transform(y)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)
regressor = SVR(kernel="rbf")
regressor.fit(x_train,y_train.ravel())
y_pred = y_scaler.inverse_transform(regressor.predict(x_scaler.transform(x_test)))
from sklearn.metrics import r2_score
r2_score(y_scaler.inverse_transform(y_test), y_pred)
My output is -0.5313206322807349
At this point, your X is already scaled:
X = x_scaler.fit_transform(X)
so the x_test produced by this split is also scaled:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)
When predicting, you shouldn't transform the input again, since x_test is already in scaled form:
y_pred = y_scaler.inverse_transform(regressor.predict(x_scaler.transform(x_test)))
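A corrected version of that line would be (a sketch; note that predict returns a 1-D array, and inverse_transform in recent scikit-learn versions expects 2-D input, hence the reshape):
# predict on the already-scaled x_test, then undo only the y-scaling
y_pred = y_scaler.inverse_transform(regressor.predict(x_test).reshape(-1, 1))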
From the documentation of sklearn.metrics.r2_score:
Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
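A tiny constructed illustration (not from the thread) of how R^2 goes negative:
from sklearn.metrics import r2_score
y_true = [1.0, 2.0, 3.0]
print(r2_score(y_true, [2.0, 2.0, 2.0]))  # 0.0: always predicting the mean
print(r2_score(y_true, [3.0, 2.0, 1.0]))  # -3.0: worse than predicting the mean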
I am trying to write the code for k-NN.
Below is my code. I know that the issue is in predict(), but I am not able to figure out how to fix it.
# Importing the libraries
import numpy as np
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('UniversalBank.csv')
X = dataset.iloc[:, [1,2,3,5,6,7,8,10,11,12,13]].values
y = dataset.iloc[:,9].values
#Splitting the dataset to training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state= 0)
#Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Fitting the classifier to training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train,y_train)
#Predicting the test results
y_pred = classifier.predict(X_test)
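The actual error is not shown in the thread, but a quick way to sanity-check the prediction step (a sketch using standard scikit-learn metrics):
from sklearn.metrics import accuracy_score, confusion_matrix
print(accuracy_score(y_test, y_pred))    # overall fraction of correct predictions
print(confusion_matrix(y_test, y_pred))  # per-class breakdown of hits and misses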
This is the custom code
#Custom model for multiple linear regression
import numpy as np
import pandas as pd
dataset = pd.read_csv("50s.csv")
x = dataset.iloc[:,:-1].values
y = dataset.iloc[:,4:5].values
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
x[:,3] = lb.fit_transform(x[:,3])
from sklearn.preprocessing import OneHotEncoder
# note: categorical_features was removed in scikit-learn 0.22; newer versions use ColumnTransformer instead
on = OneHotEncoder(categorical_features=[3])
x = on.fit_transform(x).toarray()
x = x[:,1:]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=1/5, random_state=0)
con = np.matrix(X_train)
z = np.matrix(y_train)
# training model: normal equation f = (X^T X)^(-1) X^T y
result1 = con.transpose() * con
result1 = np.linalg.inv(result1)
p = con.transpose() * z
f = result1 * p
# predict each test sample as a dot product with the coefficients
l = []
for i in range(len(X_test)):
    temp = f[0]*X_test[i][0] + f[1]*X_test[i][1] + f[2]*X_test[i][2] + f[3]*X_test[i][3] + f[4]*X_test[i][4]
    l.append(temp)
import matplotlib.pyplot as plt
plt.scatter(y_test,l)
plt.show()
Then I created a model with scikit-learn and compared its results with y_test and l (the predicted values from the code above).
The comparisons are as follows:
for i in range(len(prediction)):
    print(y_test[i], prediction[i], l[i], sep=' ')
103282.38 103015.20159795816 [[116862.44205399]]
144259.4 132582.27760816005 [[118661.40080974]]
146121.95 132447.73845175043 [[124952.97891882]]
77798.83 71976.09851258533 [[60680.01036438]]
These are the comparisons between y_test, the scikit-learn model's predictions, and the custom code's predictions.
Please help me improve the accuracy of the model.
In the plot: blue = custom model predictions, yellow = scikit-learn model predictions.
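One gap worth noting (an observation added here, not from the original thread): the custom normal-equation model has no intercept term, while scikit-learn's LinearRegression fits one by default, which alone can explain much of the offset in the predictions above. A minimal sketch with a bias column prepended:
import numpy as np
# normal equation f = (X^T X)^(-1) X^T y, with an explicit bias column so the
# result matches LinearRegression(fit_intercept=True)
Xb_train = np.hstack([np.ones((len(X_train), 1)), X_train])
Xb_test = np.hstack([np.ones((len(X_test), 1)), X_test])
f = np.linalg.inv(Xb_train.T @ Xb_train) @ Xb_train.T @ y_train
preds = Xb_test @ f  # vectorized prediction; no per-feature loop needed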
I am using KNN and wanted to experiment with different normalizers (Normalizer(), MinMaxScaler(), StandardScaler(), etc.).
I have loaded the data into a variable called X:
X = pd.read_csv('C:/Users/rmahesh/documents/parkinson.csv')
After doing some data wrangling, I try to run this code:
from sklearn import preprocessing
from sklearn.decomposition import PCA
T = preprocessing.Normalizer().fit(X)
from sklearn.cross_validation import train_test_split
T_train, T_test, y_train, y_test = train_test_split(T, y, test_size = 0.3, random_state = 7)
from sklearn.svm import SVC
model = SVC()
model = model.fit(T_train, y_train)
score = model.score(T_test, y_test)
print(score)
The specific error code I am getting is this:
TypeError: Singleton array array(Normalizer(copy=True, norm='l2'), dtype=object) cannot be considered a valid collection.
The code in which the error is appearing is this line:
T_train, T_test, y_train, y_test = train_test_split(T, y,
test_size = 0.3, random_state = 7)
Any help would be greatly appreciated!
You're fitting your normalizer and then treating the fitted object as if it were the transformed array. Replace
T = preprocessing.Normalizer().fit(X)
with
T = preprocessing.Normalizer().fit_transform(X)
so that the actual output of the normalization is used. .fit() returns the Normalizer object itself, not the transformed data.
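For clarity, the difference between the two calls (a small sketch; note also that train_test_split now lives in sklearn.model_selection in current scikit-learn versions, not sklearn.cross_validation):
from sklearn.preprocessing import Normalizer
norm = Normalizer().fit(X)          # returns the fitted Normalizer object, not data
T = norm.transform(X)               # the normalized feature array
T = Normalizer().fit_transform(X)   # equivalent one-liner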