Error code with Preprocessor Scaling? - python

Using KNN and I wanted to experiment with different normalizers (Normalizer(), MinMaxScaler(), StandardScaler() etc).
I have loaded the data into a variable called X:
X = pd.read_csv('C:/Users/rmahesh/documents/parkinson.csv')
After doing some data wrangling, I try and run this code:
from sklearn import preprocessing
from sklearn.decomposition import PCA
T = preprocessing.Normalizer().fit(X)
from sklearn.cross_validation import train_test_split
T_train, T_test, y_train, y_test = train_test_split(T, y, test_size = 0.3, random_state = 7)
from sklearn.svm import SVC
model = SVC()
model = model.fit(T_train, y_train)
score = model.score(T_test, y_test)
print(score)
The specific error code I am getting is this:
TypeError: Singleton array array(Normalizer(copy=True, norm='l2'), dtype=object) cannot be considered a valid collection.
The code in which the error is appearing is this line:
T_train, T_test, y_train, y_test = train_test_split(T, y,
test_size = 0.3, random_state = 7)
Any help would be greatly appreciated!

You're fitting your normalizer and then treating it as an array directly. Replace
T = preprocessing.Normalizer().fit(X)
With
T = preprocessing.Normalizer().fit_transform(X)
So that the actual output of the normalization is used instead. .fit() returns the Normalizer object itself.

Related

What's wrong with these seemingly perfect ML model?

I wanted to find an optimal model to solve the assigned classification problem. Everything went smooth before I applied pd.get_dummies() function to preprocess the data. The experiment showed a impossibly perfect result. I know it is unlikely to happen but I do not know why. Any help would be highly appreciated.
Code for preprocessing data is as below
# Encoding Booking Status
status_dict = {'Not_Canceled':1, 'Canceled':0}
df.booking_status = df.booking_status.map(status_dict)
df.drop('Booking_ID',axis=1, inplace=True)
df = df.dropna()
df = pd.get_dummies(df)
# Standardizing Data
from sklearn.preprocessing import StandardScaler
import numpy as np
X = df.iloc[:,0:-1]
y = df.iloc[:,-1]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
np.set_printoptions(precision=3)
print(rescaledX[0:5,:])
And I split my data into training and testing with a proportion of 0.3
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(rescaledX, y, test_size=0.3, random_state=15)
I used several models and the amazing result is
enter image description here
Simple code, stupid me. By the way, just a beginner in ML field. Any advice to master it well?
It was caused by data leaks. You must split your data first before any data pre-processing step. For example,
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(rescaledX, y, test_size=0.3, random_state=15)
Then do your data scaling part on the training and test data separately.
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
You could try to use Pipe line as well to avoid data leaks.
# correct data preparation for model evaluation with k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the pipeline
steps = list()
steps.append(('scaler', MinMaxScaler()))
steps.append(('model', LogisticRegression()))
pipeline = Pipeline(steps=steps)
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))
Ref: https://machinelearningmastery.com/data-preparation-without-data-leakage/

Getting 100% Accuracy on my DecisionTree Model

Here is my code, and it always returns 100% accuracy, regardless of how big the test size is. I used the train_test_split method, so I do not believe there should be any duplicates of data. Could someone inspect my code?
from sklearn.tree import DecisionTreeClassifier
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv('housing.csv')
prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)
prices.shape
(20640,)
features.shape
(20640, 8)
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
X_train = X_train.dropna()
y_train = y_train.dropna()
X_test = X_test.dropna()
y_test = X_test.dropna()
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_train.shape
(16512,)
X_train.shape
(16512, 8)
predictions = model.predict(X_test)
score = model.score(y_test, predictions)
score
EDIT: I have reworked my answer since I found multiple issues. Please copy-paste the below code to ensure no bugs are left.
Issues -
You are using DecisionTreeClassifier instead of DecisionTreeRegressor for a regression problem.
You are removing nans after doing the test train split which will mess up the count of samples. Do the data.dropna() before the split.
You are using the model.score(X_test, y_test) incorrectly by passing it (X_test, predictions). Please use accuracy_score(X_test, predictions) with those parameters instead, or fix the syntax.
from sklearn.tree import DecisionTreeRegressor #<---- FIRST ISSUE
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv('housing.csv')
data = data.dropna() #<--- SECOND ISSUE
prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
model = DecisionTreeRegressor()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions) #<----- THIRD ISSUE
score

What does the error mean and how to fix it - "ValueError: query data dimension must match training data dimension"

I am trying to write the code for K-NN
Below is my code. - I know that issue is in `predict() but I am not able to figure out how o fix it.
# Importing the libraries
import numpy as np
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('UniversalBank.csv')
X = dataset.iloc[:,[ 1,2,3,5,6,7,8,10,11,12,13]].values #,
y = dataset.iloc[:,9].values
#Splitting the dataset to training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state= 0)
#Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Fitting the classifier to training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train,y_train)
#Predicting the test results
y_pred = classifier.predict(X_test)

`setCalculateVarImportance` changes result in `cv2.ml.RTrees` model

Looking at the documentation for cv2.ml.RTrees, it says
calcVarImportance – If true then variable importance will be calculated and then it can be retrieved by RTrees::getVarImportance.
It sounds like this parameter should only change whether the variable importance is calculated or not. It should not change the model's output.
However, as the MCVE below shows, it does. Why?
import cv2
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import numpy as np
data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1729)
forest = cv2.ml.RTrees_create()
forest.setTermCriteria((cv2.TERM_CRITERIA_MAX_ITER,10,1))
forest.setCalculateVarImportance(True)
forest.train(cv2.ml.TrainData_create(np.float32(X_train), 0, y_train))
preds = forest.predict(np.float32(X_test), 0)[1]
print(sum(preds))
# output: [94.]
forest = cv2.ml.RTrees_create()
forest.setTermCriteria((cv2.TERM_CRITERIA_MAX_ITER,10,1))
forest.setCalculateVarImportance(False)
forest.train(cv2.ml.TrainData_create(np.float32(X_train), 0, y_train))
preds_new = forest.predict(np.float32(X_test), 0)[1]
print(sum(preds_new))
# output: [95.]

Each time accuracy differences with classifier?

Each time when I run this code, accuracy comes out different. Can anyone please explain why? Am I missing something here ? Thanks in advance :)
Below is my code:
import scipy
import numpy
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train,y_test = train_test_split(X,y, test_size = .5)
# Use a classifier of K-nearestNeibour
from sklearn.neighbors import KNeighborsClassifier
my_classifier = KNeighborsClassifier()
my_classifier.fit(X_train,y_train)
predictions = my_classifier.predict(X_test)
print(predictions)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,predictions))
train_test_split randomly splits the data into training and test sets, and so you will get different splits each time you run the script. If you want, there's a random_state parameter that you can set to some number and it will ensure that you get the same split each time you run the script:
X_train, X_test, y_train,y_test = train_test_split(X,y, test_size = .5, random_state = 0)
This should give you an accuracy of 0.96 every time.

Categories