Performing regression while dividing the sample - Python

Let's take some data:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
And consider the following code:
# Defining X, y - independent variables and dependent variable
X = df.drop(df.columns[[1]], axis=1)
y = (df[1] == 'B').astype(int)

clf = LogisticRegression(solver="lbfgs")
kfold = StratifiedKFold(n_splits=10, shuffle=True)
for train, validation in kfold.split(X, y):
    # Fit the model
    clf.fit(X[train], y[train])
And the following error occurs:
Do you have any idea why it occurs? I don't think I did anything complicated, so I'm not sure what exactly I did wrong.

X is a DataFrame, so you need to use .iloc to select rows by their positional indices:
for train_index, validation_index in kfold.split(X, y):
    # Fit the model
    X_train = X.iloc[train_index]
    y_train = y[train_index]
    clf.fit(X_train, y_train)
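For completeness, a minimal sketch of the full loop that also scores each validation fold; accuracy_score is an extra import beyond the ones in the question, and X, y, clf, and kfold are assumed to be as defined above:
from sklearn.metrics import accuracy_score

scores = []
for train_index, validation_index in kfold.split(X, y):
    # Select rows by position for both the features and the target
    X_train, X_val = X.iloc[train_index], X.iloc[validation_index]
    y_train, y_val = y.iloc[train_index], y.iloc[validation_index]
    clf.fit(X_train, y_train)
    scores.append(accuracy_score(y_val, clf.predict(X_val)))
print(sum(scores) / len(scores))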

Related

How to do K-fold cross validation without using python libraries?

I am trying to do cross validation; however, I am only allowed to use the libraries below (as the professor demanded):
import numpy as np
from sklearn import svm
from sklearn.datasets import load_iris
Therefore, I am not able to use KFold, for example, to split the data. How should I go about it? Any suggestions? This is roughly what I would write if KFold were allowed:
k = 10
kf = KFold(n_splits=k, random_state=None)
acc_score = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index, :], X[test_index, :]
    y_train, y_test = y[train_index], y[test_index]
    SVC = train_model(X_train, y_train)
    y_pred = make_predictions(SVC, X_test)
    acc = accuracy_score(y_pred, y_test)
    acc_score.append(acc)
mean = sum(acc_score)/k
I was thinking of hard-coding the split, but there could be a better solution.
You can use the np.random module to generate random samples for the train split indices. Example:
import numpy as np
np.random.seed(42)
train_indices = np.random.choice(len(train), int(0.8*len(train)), replace=False)
Then you can take the validation/test indices to be the complement of those. You can use a different random seed for each fold and still get reproducible results.
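If you need true k-fold behaviour (every sample appears in exactly one validation fold), a minimal sketch using only numpy might look like the following; X and y are assumed to be numpy arrays (e.g. iris.data and iris.target from the allowed load_iris), and train_model / make_predictions are the helpers from the question:
import numpy as np

def kfold_indices(n_samples, k, seed=42):
    # Shuffle the row indices once, then split them into k roughly equal folds
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

k = 10
folds = kfold_indices(len(X), k)
acc_score = []
for i in range(k):
    test_index = folds[i]
    # The training indices are everything not in the current fold
    train_index = np.concatenate([folds[j] for j in range(k) if j != i])
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = train_model(X_train, y_train)        # assumed helper from the question
    y_pred = make_predictions(model, X_test)     # assumed helper from the question
    acc_score.append(np.mean(y_pred == y_test))  # plain accuracy, no sklearn.metrics needed
mean_acc = sum(acc_score) / k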

What's wrong with this seemingly perfect ML model?

I wanted to find an optimal model to solve the assigned classification problem. Everything went smoothly before I applied the pd.get_dummies() function to preprocess the data. The experiment showed an impossibly perfect result. I know it is unlikely to happen, but I do not know why. Any help would be highly appreciated.
The code for preprocessing the data is below:
# Encoding Booking Status
status_dict = {'Not_Canceled': 1, 'Canceled': 0}
df.booking_status = df.booking_status.map(status_dict)
df.drop('Booking_ID', axis=1, inplace=True)
df = df.dropna()
df = pd.get_dummies(df)

# Standardizing Data
from sklearn.preprocessing import StandardScaler
import numpy as np

X = df.iloc[:, 0:-1]
y = df.iloc[:, -1]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
np.set_printoptions(precision=3)
print(rescaledX[0:5, :])
And I split my data into training and testing sets with a test proportion of 0.3:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(rescaledX, y, test_size=0.3, random_state=15)
I used several models and the amazing result is:
(screenshot omitted: every model scores near-perfect accuracy)
Simple code, stupid me. By the way, I'm just a beginner in the ML field. Any advice on how to master it?
It was caused by data leakage. You must split your data before any data pre-processing step. For example:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=15)
Then do the scaling separately: fit the scaler on the training data only and apply it to both sets.
scaler = StandardScaler().fit(X_train)          # fit on the training data only
rescaledX_train = scaler.transform(X_train)
rescaledX_test = scaler.transform(X_test)       # reuse the same scaler for the test data
You could use a Pipeline as well to avoid data leakage:
# correct data preparation for model evaluation with k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the pipeline
steps = [('scaler', MinMaxScaler()), ('model', LogisticRegression())]
pipeline = Pipeline(steps=steps)
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))
Ref: https://machinelearningmastery.com/data-preparation-without-data-leakage/
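Applied to the data from the question, the same idea might look like this; a minimal sketch assuming X holds the unscaled features and y the booking_status target from above, with LogisticRegression standing in for whichever model is being evaluated:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Split the raw, unscaled features first; the pipeline scales inside fit()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=15)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('model', LogisticRegression(max_iter=1000))])
pipe.fit(X_train, y_train)           # the scaler is fitted on the training data only
print(pipe.score(X_test, y_test))    # evaluated on data the scaler has never seen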

Cannot perform StratifiedKFold

I want to divide my sample into train/test sets with an 80/20 split, and after that I want to perform StratifiedKFold.
So let's take some data and divide it 80/20 using train_test_split:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
X = df.drop(df.columns[[1]], axis=1)
y = np.array(df[1])
y[y=='M'] = 0
y[y=='B'] = 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Now if I want to see the result of my division, I see an error:
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
for train, validation in kfold.split(X, y):
    print(X[train].shape, X[validation].shape)
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'unknown' instead.
I've read about it and it's a common error connected to this function; however, I'm not sure how to solve the issue.
I saw that we can perform this on the iris data:
iris = load_iris()
X = iris.data
y = iris.target
print(X.shape) # initial dataset size
# (150, 4)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
for train, validation in kfold.split(X, y):
    print(X[train].shape, X[validation].shape)
And we will see the result. What am I doing differently so that this function doesn't work?
You need to recast your y as an array of integers:
y = y.astype(int)
I'm not really sure of the internals, but I guess that since y started as an array of strings and its elements were replaced one by one (first where y=='M', later where y=='B'), numpy never converts the array itself to an integer dtype; it stays an object array, which StratifiedKFold reports as an 'unknown' target type.
When you convert df[1] to zeros and ones, the type is still a string (or object in numpy):
y = np.array(df[1])
y[y=='M'] = 0
y[y=='B'] = 1
y.dtype
# dtype('O')
You can either pass the original string labels directly:
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
kfold.get_n_splits(X,df[1])
5
Or:
y = (df[1] == 'B').astype(int)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
kfold.get_n_splits(X,y)
When you pass something in between (integer values stored in an object array), it gets confused.
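Putting it together, a minimal sketch of the workflow the question asks for (an 80/20 split, then StratifiedKFold on the training portion), assuming the df and X from above; stratify=y is an optional extra:
from sklearn.model_selection import train_test_split, StratifiedKFold

y = (df[1] == 'B').astype(int)   # integer labels, so the target type is 'binary'
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=8)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
for train_idx, val_idx in kfold.split(X_train, y_train):
    # X_train is a DataFrame, so select rows by position with .iloc
    print(X_train.iloc[train_idx].shape, X_train.iloc[val_idx].shape)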

Getting two totally different results when I run the Decision Tree algorithm using normal accuracy and K-fold cross validation

The problem is that I am getting two totally different results when I run the DTC algorithm. I just want to make sure that I am writing the K-fold cross validation correctly, or to understand why the K-fold result is so much lower than the normal one.
I've tried to run the code for getting the result from both the normal accuracy and the K-fold accuracy; the code is below:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

xx = pd.read_csv("data1.dat", delimiter=",")
y = pd.read_csv("label.dat", delim_whitespace=True)
x = xx.to_numpy()
y = np.array(y).astype(int)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

clf2 = DecisionTreeClassifier(random_state=42)
clf2.fit(X_train, y_train)
y_predict_2 = clf2.predict(X_test)
print("DTC Accuracy : ")
print(accuracy_score(y_test, y_predict_2)*100)
DTC Accuracy :
97.6302083333333
from sklearn.model_selection import cross_val_score

DTC = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(DTC, x, y, cv=10, scoring='accuracy')
print(scores.mean()*100)
35.331452470904985
from sklearn.model_selection import cross_val_score

DTC = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(DTC, X_train, y_train, cv=10, scoring='accuracy')
print(scores.mean()*100)
97.34356
However, in the cross validation part, when I put X_train instead of x and y_train instead of y, the accuracy again rises to 97.
I am wondering which one I need to use: will (x and y) or (X_train and y_train) be the correct and common-sense cross validation?
Try shuffling your data and reducing the number of cross-validation folds. Most likely the rows in your files have some ordering, so the contiguous, unshuffled folds that cross_val_score builds by default end up looking very different from the data the model was trained on, which is why the score collapses when you pass the raw x and y.
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
from sklearn.tree import DecisionTreeClassifier

xx = pd.read_csv("data1.dat", delimiter=",")
y = pd.read_csv("label.dat", delim_whitespace=True)
x = xx.to_numpy()
y = y.to_numpy().astype(np.int32).ravel()   # 1-D labels avoid the column-vector warning

x, y = shuffle(x, y, random_state=42)

DTC = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(DTC, x, y, cv=3, scoring='accuracy')
print(scores.mean()*100)
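As for whether to cross-validate on (x, y) or (X_train, y_train): one common pattern is to cross-validate on all the data while keeping any preprocessing (such as the StandardScaler) inside a Pipeline, so it is re-fitted on each training fold and nothing leaks; a minimal sketch, assuming the shuffled x and y from above:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([
    ('scaler', StandardScaler()),   # not needed for trees, but shows how to keep preprocessing leak-free
    ('model', DecisionTreeClassifier(random_state=42)),
])

# Shuffling inside the splitter is an alternative to shuffling x and y beforehand
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipe, x, y, cv=cv, scoring='accuracy')
print(scores.mean() * 100)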

I am getting a "Not Fitted" error in random forest classifier?

I have 4 features and one target variable. I am using RandomForestRegressor instead of RandomForestClassifier, as my target variable is a float. When I try to fit my model and then output the feature importances in sorted order, I get a "Not fitted" error. How do I fix it?
Code:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn import datasets
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
# Split the data into 30% test and 70% training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
feat_labels = data.columns[:4]
regr = RandomForestRegressor(max_depth=2, random_state=0)
#clf = RandomForestClassifier(n_estimators=100, random_state=0)
# Train the classifier
#clf.fit(X_train, y_train)
regr.fit(X, y)
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))
You are fitting to regr but calling the feature importances on clf. Try calling this instead:
importances = regr.feature_importances_
I noticed that previously your classifier was being fit with the training data you set up, but the regressor is now being fit with X and y.
However, I don't see where you're setting X and y in the first place, or even where you actually load in a dataset. Could it be you forgot this step, as well as what Harpal mentioned in the other answer?
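For reference, a minimal end-to-end sketch that actually defines X and y (here with the synthetic make_regression helper already imported in the question, standing in for the real 4-feature data), fits the regressor, and prints the sorted importances; feat_labels here is a hypothetical list of column names:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real 4-feature dataset
X, y = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=0)
feat_labels = ['feat_1', 'feat_2', 'feat_3', 'feat_4']   # hypothetical column names

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

regr = RandomForestRegressor(max_depth=2, random_state=0)
regr.fit(X_train, y_train)                 # fit the same estimator you query below

importances = regr.feature_importances_    # query the fitted regressor, not clf
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))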
