I am trying to use a decision tree for classification and I get 100% accuracy.
It is a common problem, described here and here, and in many other questions.
Data is here.
My two best guesses:
I split data incorrectly
My dataset is too imbalanced
What is wrong with my code?
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Split data (starbucks is the DataFrame loaded from the data linked above)
Y = starbucks.iloc[:, 4]  # column 4 is intended to be 'offer_completed', the target
X = starbucks.loc[:, starbucks.columns != 'offer_completed']
# Splitting the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    test_size=0.3,
                                                    random_state=100)
# Creating the classifier object
clf_gini = DecisionTreeClassifier(criterion="gini",
                                  random_state=100,
                                  max_depth=3,
                                  min_samples_leaf=5)
# Train the classifier
clf_gini.fit(X_train, y_train)
# Predict on the test set (Gini criterion)
y_pred = clf_gini.predict(X_test)
print("Predicted values:")
print(y_pred)
print("Confusion Matrix: ", confusion_matrix(y_test, y_pred))
print ("Accuracy : ", accuracy_score(y_test, y_pred)*100)
print("Report : ", classification_report(y_test, y_pred))
# NOTE: prediction() and cal_accuracy() are not defined anywhere in this
# script; the predict/report calls above already cover what they would do
# y_pred_gini = prediction(X_test, clf_gini)
# cal_accuracy(y_test, y_pred_gini)
Predicted values:
[0. 0. 0. ... 0. 0. 0.]
Confusion Matrix: [[36095 0]
[ 0 8158]]
Accuracy : 100.0
When I print X, it shows me that offer_completed was removed.
X.dtypes
offer_received int64
offer_viewed float64
time_viewed_received float64
time_completed_received float64
time_completed_viewed float64
transaction float64
amount float64
total_reward float64
age float64
income float64
male int64
membership_days float64
reward_each_time float64
difficulty float64
duration float64
email float64
mobile float64
social float64
web float64
bogo float64
discount float64
informational float64
Fitting the model and checking the feature importances, you can see that they are all zero except for total_reward (a sketch of this check follows below). Investigating that column, you get:
df.groupby(target)['total_reward'].describe()
count mean std min 25% 50% 75% max
0 119995 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 27513 5.74 4.07 2.0 3.0 5.0 10.0 40.0
You can see that for target 0, total_reward is always zero, while for target 1 it is always positive (min 2.0). Here's your leak.
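For reference, the importance check mentioned above can be done like this (a minimal sketch; it assumes the fitted clf_gini and the feature matrix X from the question):
import pandas as pd

# Rank features by the fitted tree's impurity-based importances; with a
# leak, one feature typically absorbs (nearly) all of the importance mass.
importances = pd.Series(clf_gini.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))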
As there could be other leaks and it is tedious to check each column, we can use a sort of "predictive power" of each feature alone:
acc_df = pd.DataFrame(columns=['col', 'acc'], index=range(len(X.columns)))
for i, c in enumerate(X.columns):
    clf = DecisionTreeClassifier(criterion="gini",
                                 random_state=100,
                                 max_depth=3,
                                 min_samples_leaf=5)
    # Fit and score on the single column c (reshaped to 2-D for sklearn)
    clf.fit(X_train[c].to_numpy()[:, None], y_train)
    y_pred = clf.predict(X_test[c].to_numpy()[:, None])
    acc_df.iloc[i] = [c, accuracy_score(y_test, y_pred) * 100]
acc_df.sort_values('acc', ascending=False)
col acc
8 total_reward 100
4 completed_time 99.8848
13 reward_each_time 89.3205
14 difficulty 89.3205
15 duration 89.3205
21 discount 86.4054
19 web 85.088
20 bogo 84.4801
3 viewed_time 84.4056
2 offer_viewed 84.3491
18 social 83.3525
1 received_time 83.0497
7 amount 82.5436
0 offer_received 81.7526
16 email 81.7526
17 mobile 81.6464
11 male 81.5651
10 income 81.5651
9 age 81.5651
6 transaction_time 81.5651
5 transaction 81.5651
22 informational 81.5651
12 membership_days 81.5561
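A natural follow-up, not shown in the original answer: drop the columns whose single-feature accuracy is suspiciously close to 100% and refit. The names below come from the ranking above and are illustrative:
# Remove the leaking columns identified above, then refit on clean features
leaky = ['total_reward', 'completed_time']
X_clean = X.drop(columns=leaky, errors='ignore')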
Related question:
I am using a random forest regression model on the following data.
However, the R-squared value I get is -0.22, which seems very low.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# correcteddf is the DataFrame whose two columns are shown below
rsquaredscores = []
x = correcteddf[['sleepiness_resolution_index']]
y = correcteddf[['distance']]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, train_size=0.7, random_state=0)
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(x_train, y_train.values.ravel())  # ravel() avoids a column-vector warning
y_pred = regressor.predict(x_test)
rsquaredscore = r2_score(y_test, y_pred)
rsquaredscores.append(rsquaredscore)
print(rsquaredscores)
When I use the same data in a linear regression model:
import statsmodels.api as sm

# Note: sm.OLS fits without an intercept unless you add one via sm.add_constant(x)
resultmodeldistancevariation2sleep = sm.OLS(y, x).fit()
resultmodeldistancevariation2sleepsummary = resultmodeldistancevariation2sleep.summary()
print(resultmodeldistancevariation2sleepsummary)
I get a much higher R-squared value: 0.027.
Is it normal for one of these values to be positive and one to be negative? Am I right in saying that the negative value basically indicates very poor model fit? Or am I going wrong somewhere?
I would be so grateful for a helping hand!
correcteddf['sleepiness_resolution_index'].head(10)
sleepiness_resolution_index
0 0.555556
1 1.500000
3 1.000000
6 0.900000
7 0.857143
8 0.875000
9 0.300000
10 0.750000
12 1.400000
13 2.333333
correcteddf[['distance']].head(10)
distance
0 -0.251893
1 0.072180
3 -0.555438
6 0.199593
7 -0.378295
8 -0.162809
9 -0.108010
10 0.162275
12 -0.007996
13 4.637838
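For context, two things are worth noting here, neither from the original post: the OLS summary reports in-sample R-squared on all of the data, while r2_score above is computed on the held-out test set, so the two numbers are not directly comparable; and sklearn's r2_score is negative whenever the predictions are worse than a constant prediction at the mean of the test targets. A tiny illustration:
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0]                  # mean is 2.0
print(r2_score(y_true, [2.0, 2.0, 2.0]))  # 0.0  -> exactly as good as the mean
print(r2_score(y_true, [3.0, 1.0, 2.0]))  # -2.0 -> worse than predicting the mean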
I am running into an issue after transferring a grid-search workflow for decision tree classifiers from Windows to our Linux cluster. While it works every time on Windows, on the cluster it generates an error/warning more often than not.
The code is as follows:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import make_scorer, accuracy_score, recall_score, f1_score, precision_score
from sklearn.model_selection import GridSearchCV

data = pd.read_csv("Dtree_test_data.csv")
scoring = {'accuracy': make_scorer(accuracy_score),
           'precision': make_scorer(precision_score, average='weighted'),
           'recall': make_scorer(recall_score, average='weighted'),
           'f1_score': make_scorer(f1_score, average='weighted')}
features = data[["H-Bonds", "Mass"]]
labels = data["Class"]
parameters = {'max_depth': np.arange(1, 3, 1),
              'min_samples_leaf': np.arange(0.01, 0.05, 0.01),
              'min_impurity_decrease': np.arange(0.01, 0.04, 0.01),
              'criterion': ['gini', 'entropy']}
dtree = DecisionTreeClassifier()
clf = GridSearchCV(dtree, parameters, verbose=1, n_jobs=12, scoring=scoring,
                   refit=False, return_train_score=True, cv=2)
clf.fit(features, labels)
The data looks like this:
H-Bonds Mass Class
0 2 123 0
1 3 45 1
2 1 153 2
3 4 90 0
4 6 300 1
5 1 40 2
6 2 200 0
7 3 245 1
8 4 87 2
9 1 126 1
The error/warning is the following (repeated multiple times):
Fitting 2 folds for each of 48 candidates, totalling 96 fits
/../miniconda3/envs/ml-debug-current/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1334: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
Any idea why this is?
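One hedged observation, not from the original thread: the warning text itself points at the likely fix. With only 10 samples and cv=2, a fold can easily contain a class that the tree never predicts, which makes precision undefined for that class; passing zero_division to the metrics makes that case explicit and silences the warning:
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score

# Define 0.0 as the score for classes that receive no predicted samples
scoring = {'precision': make_scorer(precision_score, average='weighted', zero_division=0),
           'recall': make_scorer(recall_score, average='weighted', zero_division=0),
           'f1_score': make_scorer(f1_score, average='weighted', zero_division=0)}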
I made up an Excel sheet of random numbers (3000 rows and 6 columns) and set it so that any row with a B column >= 50, a C column of 0, and an E column of 1 gets a final y value of 1; otherwise it gets 0. I ran this through the RandomForestClassifier code below, and it doesn't work: it either returns 0 for all new test data or doesn't take the B column into account when predicting. How can I solve this?
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import pickle

data_crd = pd.read_csv(r'C:\Users\Rada1\.spyder-py3\new_created_data.csv')
data_crd.head()

X = data_crd.iloc[:, 1:5]
y = data_crd.iloc[:, 5]
#print(X)
#print(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

classifier = RandomForestClassifier(n_estimators=500, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

with open('model_wcd', 'wb') as f:
    pickle.dump(classifier, f)
I get a 100% accuracy rate as my result, which already feels wrong. What do I need to adjust?
precision recall f1-score support
0 1.00 1.00 1.00 515
1 1.00 1.00 1.00 85
accuracy 1.00 600
macro avg 1.00 1.00 1.00 600
weighted avg 1.00 1.00 1.00 600
[[515 0]
[ 0 85]]
1.0
Try passing stratify=y so the class balance is preserved in the split; it might help:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
Also consider using MinMaxScaler for the numerical features, reshaping each column to (-1, 1):
x_train_num = num_feature.transform(x_train[column_name].values.reshape(-1, 1))
x_test_num = num_feature.transform(x_test[column_name].values.reshape(-1, 1))
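The snippet above leaves num_feature undefined; a minimal sketch of what it presumably is, with column_name as a hypothetical placeholder for one of your numeric column names:
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training column only, so no test-set statistics
# leak into the transform; column_name is a placeholder.
num_feature = MinMaxScaler()
num_feature.fit(x_train[column_name].values.reshape(-1, 1))
x_train_num = num_feature.transform(x_train[column_name].values.reshape(-1, 1))
x_test_num = num_feature.transform(x_test[column_name].values.reshape(-1, 1))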
Dataset
Columns 0-9: float features (parameters of a product)
Column 10: int labels (products)
Goal
Calculate a 0-1 classification certainty score for the labels (this is what my current code should do)
Calculate the same certainty score for each "product_name" (300 columns) at each row (22,000 rows)
Error
I use sklearn.tree.DecisionTreeClassifier and am trying to use predict_proba, but it gives an error.
Python code
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

data_train = pd.read_csv('data.csv')
features = data_train.columns[:-1]
labels = data_train.columns[-1]
x_features = data_train[features]
x_label = data_train[labels]
X_train, X_test, y_train, y_test = train_test_split(x_features, x_label, random_state=0)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
class_probabilitiesDec = clf.predict_proba(y_train)
# ERROR: ValueError: Number of features of the model must match the input.
# Model n_features is 10 and input n_features is 16722
print('Decision Tree Classification Accuracy Training Score (max_depth=3): {:.2f}%'.format(clf.score(X_train, y_train) * 100))
print('Decision Tree Classification Accuracy Test Score (max_depth=3): {:.2f}%'.format(clf.score(X_test, y_test) * 100))
print(class_probabilitiesDec[:10])
# If I use X_train instead, it just prints out a bunch of 41-element vectors:
[[ 0.00490808 0.00765327 0.01123035 0.00332751 0.00665502 0.00357707
0.05182597 0.03169453 0.04267532 0.02761833 0.01988187 0.01281091
0.02936528 0.03934781 0.02329257 0.02961484 0.0353548 0.02503951
0.03577073 0.04700108 0.07661592 0.04433907 0.03019715 0.02196157
0.0108976 0.0074869 0.0291989 0.03951418 0.01372598 0.0176358
0.02345895 0.0169703 0.02487314 0.01813493 0.0482489 0.01988187
0.03252641 0.01572249 0.01455786 0.00457533 0.00083188]
[....
FEATURES (COLUMNS)
(the last column is the label)
0 1 1 1 1.0 1462293561 1462293561 0 0 0.0 0.0 1
1 2 2 2 8.0 1460211580 1461091152 1 1 0.0 0.0 2
2 3 3 3 1.0 1469869039 1470560880 1 1 0.0 0.0 3
3 4 4 4 1.0 1461482675 1461482675 0 0 0.0 0.0 4
4 5 5 5 5.0 1462173043 1462386863 1 1 0.0 0.0 5
CLASSES COLUMNS (300 COLUMNS OF ITEMS)
HEADER ROW: apple gameboy battery ....
SCORE in 1st row: 0.763 0.346 0.345 ....
SCORE in 2nd row: 0.256 0.732 0.935 ....
e.g. similar to the confidence scores you get when classifying images of cats vs. dogs.
You cannot predict the probability of your labels.
predict_proba predicts the probability for each label from your X Data, thus:
class_probabilitiesDec = clf.predict_proba(X_test)
What you posted as "when I use X_train":
[[ 0.00490808 0.00765327 0.01123035 0.00332751 0.00665502 0.00357707
0.05182597 0.03169453 0.04267532 0.02761833 0.01988187 0.01281091
0.02936528 0.03934781 0.02329257 0.02961484 0.0353548 0.02503951
0.03577073 0.04700108 0.07661592 0.04433907 0.03019715 0.02196157
0.0108976 0.0074869 0.0291989 0.03951418 0.01372598 0.0176358
0.02345895 0.0169703 0.02487314 0.01813493 0.0482489 0.01988187
0.03252641 0.01572249 0.01455786 0.00457533 0.00083188]
is a list of the predicted probability for every possible label.
EDIT
After reading your comments predict proba is exactly what you want.
Let's make an example. In the following code we have a classifier with 3 classes: 11, 12, or 13.
If the input is 1 the classifier should predict 11
If the input is 2 the classifier should predict 12
...
If the input is 7 the classifier should predict 13
clf = DecisionTreeClassifier()
clf.fit([[1],[2],[3],[4],[5],[6],[7]], [[11],[12],[13],[13],[12],[11],[13]])
Now if you have test data with a single row, e.g. 5, the classifier should predict 12. So let's try that.
clf.predict([[5]])
And voila: the result is array([12])
If we want a probability, then predict_proba is the way to go:
clf.predict_proba([[5]])
and we get [array([0., 1., 0.])]
In that case the array [0., 1., 0.] means:
0% probability for class 11
100% probability for class 12
0% probability for class 13
If I'm correct, that's exactly what you want.
You can even map that to the names of your classes with:
probabilities = clf.predict_proba([[5]])[0]
{clf.classes_[i] : probabilities[i] for i in range(len(probabilities))}
which gives you a dictionary with probabilities for class names:
{11: 0.0, 12: 1.0, 13: 0.0}
Now in your case you have far more classes than just [11, 12, 13], so the array gets longer. And for every row in your dataset, predict_proba creates such an array, so for more than a single row of data your output becomes a matrix.
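To get the per-product score table described in the question (one column per product name, one row per sample), here is a minimal sketch, assuming the fitted clf and the scaled X_test from the question:
import pandas as pd

# One row per sample, one column per class label, values are probabilities
proba = clf.predict_proba(X_test)
scores = pd.DataFrame(proba, columns=clf.classes_)
print(scores.head())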
I just finished a grid search CV on a tree-based model, and after looking into the results I managed to access the results of each iteration from GridSearchCV.
I need each run in a separate row and each parameter in a separate column.
I can run a loop or list comprehension over the runs, but I am unable to split each run into columns.
df = grid.grid_scores_
df[0]
mean: 0.57114, std: 0.00907, params: {'criterion': 'gini', 'max_depth': 10,
'max_features': 8, 'min_samples_leaf': 2, 'min_samples_split': 2, 'splitter': 'best'}
I tried with tuple and dict accessors but ended up with errors. Essentially I need every parameter in a new column, like below.
mean  | std   | criterion | ..... | splitter
0.57    0.009   'gini'      .....   'best'
0.58    0.029   'entropy'   .....   'random'
...
You could use a pre-made class to generate a DataFrame with a report of the parameters (see the Stack Overflow post using this code here).
Imports and settings
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from gridsearchcv_helper import EstimatorSelectionHelper
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
Generate some data
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
Define model and hyper-parameter grid
models = {'RandomForestClassifier': RandomForestClassifier()}
params = {'RandomForestClassifier': {'n_estimators': [16, 32],
                                     'max_features': ['auto', 'sqrt', 'log2'],
                                     'criterion': ['gini', 'entropy']}}
Perform gridsearch (with CV) and report results
helper = EstimatorSelectionHelper(models, params)
helper.fit(X_iris, y_iris, n_jobs=4)
df_gridsearchcv_summary = helper.score_summary()
Here is the output
print(type(df_gridsearchcv_summary))
print(df_gridsearchcv_summary.iloc[:,1:])
RandomForestClassifier
<class 'pandas.core.frame.DataFrame'>
min_score mean_score max_score std_score criterion max_features n_estimators
0 0.941176 0.973856 1 0.0244553 gini auto 16
1 0.921569 0.96732 1 0.0333269 gini auto 32
8 0.921569 0.96732 1 0.0333269 entropy sqrt 16
10 0.921569 0.96732 1 0.0333269 entropy log2 16
2 0.941176 0.966912 0.980392 0.0182045 gini sqrt 16
3 0.941176 0.966912 0.980392 0.0182045 gini sqrt 32
4 0.941176 0.966912 0.980392 0.0182045 gini log2 16
5 0.901961 0.960784 1 0.0423578 gini log2 32
6 0.921569 0.960376 0.980392 0.0274454 entropy auto 16
7 0.921569 0.960376 0.980392 0.0274454 entropy auto 32
11 0.901961 0.95384 0.980392 0.0366875 entropy log2 32
9 0.921569 0.953431 0.980392 0.0242635 entropy sqrt 32
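Side note, not from the original answer: in scikit-learn 0.20+ the grid_scores_ attribute was removed; its replacement, cv_results_, already stores one entry per run with one param_* key per parameter, so it converts directly into the layout asked for. A minimal sketch, assuming grid is the fitted GridSearchCV object from the question:
import pandas as pd

# cv_results_ is a dict of parallel arrays: one entry per candidate run
df = pd.DataFrame(grid.cv_results_)
cols = ['mean_test_score', 'std_test_score'] + [c for c in df.columns if c.startswith('param_')]
print(df[cols])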