I just finished a grid search CV on a tree-based model, and after looking into the results I managed to access the results of each iteration from GridSearchCV.
I need each run in a separate row and each parameter in a separate column.
I can run a loop or list comprehension to produce one row per run, but I am unable to separate each run into columns.
df = grid.grid_scores_
df[0]
mean: 0.57114, std: 0.00907, params: {'criterion': 'gini', 'max_depth': 10,
'max_features': 8, 'min_samples_leaf': 2, 'min_samples_split': 2, 'splitter': 'best'}
I tried tuple and dict accessors but ended up with errors. Essentially I need every parameter in a new column, like below.
mean | std   | criterion | ..... | splitter
0.57 | 0.009 | 'gini'    | ..... | 'best'
0.58 | 0.029 | 'entropy' | ..... | 'random'
...
You could use the pre-made EstimatorSelectionHelper class to generate a DataFrame with a report of the parameters (see the Stack Overflow post that uses this code here).
Imports and settings
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from gridsearchcv_helper import EstimatorSelectionHelper
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
Generate some data
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
Define model and hyper-parameter grid
models = {'RandomForestClassifier': RandomForestClassifier()}
params = {'RandomForestClassifier': { 'n_estimators': [16, 32],
'max_features': ['auto', 'sqrt', 'log2'],
'criterion' : ['gini', 'entropy'] }}
Perform grid search (with CV) and report results
helper = EstimatorSelectionHelper(models, params)
helper.fit(X_iris, y_iris, n_jobs=4)
df_gridsearchcv_summary = helper.score_summary()
Here is the output
print(type(df_gridsearchcv_summary))
print(df_gridsearchcv_summary.iloc[:,1:])
RandomForestClassifier
<class 'pandas.core.frame.DataFrame'>
min_score mean_score max_score std_score criterion max_features n_estimators
0 0.941176 0.973856 1 0.0244553 gini auto 16
1 0.921569 0.96732 1 0.0333269 gini auto 32
8 0.921569 0.96732 1 0.0333269 entropy sqrt 16
10 0.921569 0.96732 1 0.0333269 entropy log2 16
2 0.941176 0.966912 0.980392 0.0182045 gini sqrt 16
3 0.941176 0.966912 0.980392 0.0182045 gini sqrt 32
4 0.941176 0.966912 0.980392 0.0182045 gini log2 16
5 0.901961 0.960784 1 0.0423578 gini log2 32
6 0.921569 0.960376 0.980392 0.0274454 entropy auto 16
7 0.921569 0.960376 0.980392 0.0274454 entropy auto 32
11 0.901961 0.95384 0.980392 0.0366875 entropy log2 32
9 0.921569 0.953431 0.980392 0.0242635 entropy sqrt 32
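As an aside, on scikit-learn 0.18 and newer, GridSearchCV exposes cv_results_, which already stores every parameter in its own param_* column, so no helper class is strictly needed. A minimal sketch, assuming grid is the fitted GridSearchCV from the question:
import pandas as pd

# cv_results_ is a dict of arrays with one entry per parameter combination,
# so it converts directly into a DataFrame with one row per run
results = pd.DataFrame(grid.cv_results_)

# keep the aggregate scores plus the individual parameter columns
param_cols = [c for c in results.columns if c.startswith('param_')]
print(results[['mean_test_score', 'std_test_score'] + param_cols])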
Related
I am using a random forest regression model on the following data.
However, the R-squared value I get is -0.22, which seems very low.
rsquaredscores = []
x = correcteddf[['sleepiness_resolution_index']]
y = correcteddf[['distance']]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3,train_size=0.7,random_state=0)
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 100,random_state = 0)
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
from sklearn.metrics import r2_score
rsquaredscore = r2_score(y_test,y_pred)
rsquaredscores.append(rsquaredscore)
print(rsquaredscores)
When I use the same data in a linear regression model:
import statsmodels.api as sm

resultmodeldistancevariation2sleep = sm.OLS(y, x).fit()
resultmodeldistancevariation2sleepsummary = resultmodeldistancevariation2sleep.summary()
print(resultmodeldistancevariation2sleepsummary)
I get a much higher R-squared value: 0.027.
Is it normal for one of these values to be positive and one to be negative? Am I right in saying that the negative value basically indicates very poor model fit? Or am I going wrong somewhere?
I would be so grateful for a helping hand!
correcteddf['sleepiness_resolution_index'].head(10)
sleepiness_resolution_index
0 0.555556
1 1.500000
3 1.000000
6 0.900000
7 0.857143
8 0.875000
9 0.300000
10 0.750000
12 1.400000
13 2.333333
correcteddf[['distance']].head(10)
distance
0 -0.251893
1 0.072180
3 -0.555438
6 0.199593
7 -0.378295
8 -0.162809
9 -0.108010
10 0.162275
12 -0.007996
13 4.637838
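For what it's worth, a negative R² on held-out data simply means the model predicts worse than a constant equal to the mean target; a quick sanity check is to score a mean-only baseline on the same split. A minimal sketch, reusing x_train/x_test/y_train/y_test from the code above:
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# a "model" that always predicts the mean of y_train
baseline = DummyRegressor(strategy='mean')
baseline.fit(x_train, y_train)

# this is close to 0 by construction (slightly negative if the test mean
# differs from the train mean); a score like -0.22 is worse than this baseline
print(r2_score(y_test, baseline.predict(x_test)))
Note also that sm.OLS(y, x) without sm.add_constant(x) fits a model with no intercept, so the (uncentered, in-sample) R² statsmodels reports is not directly comparable to scikit-learn's out-of-sample r2_score.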
I am running into an issue after transferring a grid search workflow for decision tree classifiers from Windows to our Linux cluster. While it works every time on Windows, on the Linux cluster it generates an error/warning more often than not.
The code is as follows:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import make_scorer,accuracy_score,recall_score,f1_score, precision_score, roc_auc_score
import numpy as np
from sklearn.model_selection import GridSearchCV
data = pd.read_csv("Dtree_test_data.csv")
d_tree = DecisionTreeClassifier()
scoring = {'accuracy' : make_scorer(accuracy_score),
'precision' : make_scorer(precision_score, average = 'weighted'),
'recall' : make_scorer(recall_score, average = 'weighted'),
'f1_score' : make_scorer(f1_score, average = 'weighted')}
features = data[["H-Bonds","Mass"]]
labels = data["Class"]
parameters = {'max_depth':np.arange(1,3,1),
'min_samples_leaf': np.arange(0.01,0.05,0.01),
'min_impurity_decrease':np.arange(0.01,0.04,0.01),
'criterion':['gini','entropy']}
dtree = DecisionTreeClassifier()
clf = GridSearchCV(dtree,parameters,verbose=1,n_jobs=12,scoring=scoring,refit=False,return_train_score=True,cv=2)
clf.fit(features,labels)
The data looks like this:
H-Bonds Mass Class
0 2 123 0
1 3 45 1
2 1 153 2
3 4 90 0
4 6 300 1
5 1 40 2
6 2 200 0
7 3 245 1
8 4 87 2
9 1 126 1
The error/warning is the following (repeated multiple times):
Fitting 2 folds for each of 48 candidates, totalling 96 fits
/../miniconda3/envs/ml-debug-current/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1334: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
Any idea why this is?
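The warning itself just means that, in some CV fold, at least one class is never predicted, so its precision (and the weighted average built from it) is undefined; with only 10 rows, 3 classes, cv=2 and very shallow trees that can easily happen. One way to control it is to pass zero_division to the scorers, since make_scorer forwards extra keyword arguments to the metric. A minimal sketch, assuming scikit-learn >= 0.22 where the parameter exists:
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score

# zero_division=0 returns 0 instead of warning when a class gets no predicted samples
scoring = {'accuracy':  make_scorer(accuracy_score),
           'precision': make_scorer(precision_score, average='weighted', zero_division=0),
           'recall':    make_scorer(recall_score, average='weighted', zero_division=0),
           'f1_score':  make_scorer(f1_score, average='weighted', zero_division=0)}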
I made up an Excel sheet of random numbers (3000 rows and 6 columns) and set it so that any row with a B column >= 50, a C column of 0, and an E column of 1 gets a final 'y' value of 1; otherwise it gets 0. I ran this through the RandomForestClassifier code below and it doesn't work: it either returns 0 for all new test data or doesn't even take the B column into account when predicting. How can I solve this?
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import pickle
data_crd = pd.read_csv(r'C:\Users\Rada1\.spyder-py3\new_created_data.csv')
#C:\Users\Rada1\.spyder-py3\new_created_data.csv
data_crd.head()
X = data_crd.iloc[:,1:5]
y = data_crd.iloc[:,5]
#print (X)
#print (y)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2, random_state=0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
classifier = RandomForestClassifier (n_estimators = 500, random_state = 0)
classifier.fit (X_train, y_train)
y_pred = classifier.predict(X_test)
print (classification_report(y_test,y_pred))
print (confusion_matrix(y_test,y_pred))
print (accuracy_score(y_test,y_pred))
with open('model_wcd', 'wb') as f:
    pickle.dump(classifier, f)
I get a 100% accuracy rate as my result, which already feels wrong. What do I need to adjust?
precision recall f1-score support
0 1.00 1.00 1.00 515
1 1.00 1.00 1.00 85
accuracy 1.00 600
macro avg 1.00 1.00 1.00 600
weighted avg 1.00 1.00 1.00 600
[[515 0]
[ 0 85]]
1.0
Hopefully it will work if you use stratify=y:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2, random_state=0,stratify=y)
and also use MinMaxScaler for the numerical features, reshaping them to (-1, 1):
x_train_num=num_feature.transform(x_train[column_name].values.reshape(-1,1))
x_test_num=num_feature.transform(x_test[column_name].values.reshape(-1,1))
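Putting both suggestions together, a minimal self-contained sketch (num_feature above is assumed to be a MinMaxScaler fitted on the training column; 'col_b' below is just a placeholder for one of your numeric columns):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# stratify=y keeps the 0/1 class ratio the same in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

# fit the scaler on the training data only, then apply it to both splits
num_feature = MinMaxScaler()
num_feature.fit(X_train['col_b'].values.reshape(-1, 1))
x_train_num = num_feature.transform(X_train['col_b'].values.reshape(-1, 1))
x_test_num = num_feature.transform(X_test['col_b'].values.reshape(-1, 1))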
I am trying to use a decision tree for classification and get 100% accuracy.
It is a common problem, described here and here, and in many other questions.
Data is here.
Two best guesses:
I split data incorrectly
My dataset is too imbalanced
What is wrong with my code?
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
import sklearn.model_selection as cv
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
# Split data
Y = starbucks.iloc[:, 4]
X = starbucks.loc[:, starbucks.columns != 'offer_completed']
# Splitting the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, Y,
test_size=0.3,
random_state=100)
# Creating the classifier object
clf_gini = DecisionTreeClassifier(criterion = "gini",
random_state = 100,
max_depth = 3,
min_samples_leaf = 5)
# Performing training
clf_gini.fit(X_train, y_train)
# Predicton on test with giniIndex
y_pred = clf_gini.predict(X_test)
print("Predicted values:")
print(y_pred)
print("Confusion Matrix: ", confusion_matrix(y_test, y_pred))
print ("Accuracy : ", accuracy_score(y_test, y_pred)*100)
print("Report : ", classification_report(y_test, y_pred))
y_pred_gini = prediction(X_test, clf_gini)
cal_accuracy(y_test, y_pred_gini)
Predicted values:
[0. 0. 0. ... 0. 0. 0.]
Confusion Matrix: [[36095 0]
[ 0 8158]]
Accuracy : 100.0
When I print X, it shows me that offer_completed was removed.
X.dtypes
offer_received int64
offer_viewed float64
time_viewed_received float64
time_completed_received float64
time_completed_viewed float64
transaction float64
amount float64
total_reward float64
age float64
income float64
male int64
membership_days float64
reward_each_time float64
difficulty float64
duration float64
email float64
mobile float64
social float64
web float64
bogo float64
discount float64
informational float64
Fitting the model and checking the feature importances, you can see that they are all zero except for total_reward. Then, investigating that column, you get:
df.groupby(target)['total_reward'].describe()
count mean std min 25% 50% 75% max
0 119995 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 27513 5.74 4.07 2.0 3.0 5.0 10.0 40.0
You can see that for target 0, total_reward is always zero, while for target 1 it is always greater than zero. Here's your leak.
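In case it is useful, a quick way to reproduce that feature-importance check, assuming clf_gini has been fit on X_train as in the question:
import pandas as pd

# every importance comes out zero except total_reward, which alone determines the target
print(pd.Series(clf_gini.feature_importances_, index=X.columns).sort_values(ascending=False))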
As there could be other leaks and it is tedious to check each column by hand, we can measure a sort of "predictive power" of each feature on its own:
# train a small tree on each feature in isolation and record its test accuracy
acc_df = pd.DataFrame(columns=['col', 'acc'], index=range(len(X.columns)))
for i, c in enumerate(X.columns):
    clf = DecisionTreeClassifier(criterion="gini",
                                 random_state=100,
                                 max_depth=3,
                                 min_samples_leaf=5)
    clf.fit(X_train[c].to_numpy()[:, None], y_train)
    y_pred = clf.predict(X_test[c].to_numpy()[:, None])
    acc_df.iloc[i] = [c, accuracy_score(y_test, y_pred) * 100]

acc_df.sort_values('acc', ascending=False)
col acc
8 total_reward 100
4 completed_time 99.8848
13 reward_each_time 89.3205
14 difficulty 89.3205
15 duration 89.3205
21 discount 86.4054
19 web 85.088
20 bogo 84.4801
3 viewed_time 84.4056
2 offer_viewed 84.3491
18 social 83.3525
1 received_time 83.0497
7 amount 82.5436
0 offer_received 81.7526
16 email 81.7526
17 mobile 81.6464
11 male 81.5651
10 income 81.5651
9 age 81.5651
6 transaction_time 81.5651
5 transaction 81.5651
22 informational 81.5651
12 membership_days 81.5561
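If you want to go one step further, the obvious remedy is to drop the leaking columns and refit; a rough sketch (the column names are assumed from the table above, adjust them to your frame):
# drop the columns that trivially encode the target, then retrain and re-evaluate
leaky_cols = ['total_reward', 'completed_time']  # assumed names, adjust to your columns
X_clean = X.drop(columns=leaky_cols)

X_train, X_test, y_train, y_test = train_test_split(X_clean, Y, test_size=0.3, random_state=100)
clf_gini.fit(X_train, y_train)
print(accuracy_score(y_test, clf_gini.predict(X_test)) * 100)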
I'm learning how to use Lasso and Ridge with sklearn in Python. I am given the folds in a column. I want to find the best parameter based on a 5-fold cross-validation.
My data looks like the following:
mpg cylinders displacement horsepower weight acceleration origin fold
0 18 8 307 130 3504 12.0 1 3
1 15 8 350 165 3693 11.5 1 0
2 18 8 318 150 3436 11.0 1 2
3 16 8 304 150 3433 12.0 1 2
4 17 8 302 140 3449 10.5 1 3
reg_para = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]
mpg is the y/target variable and the other columns are the predictors; the last column contains the folds. I want to run Lasso and Ridge and find the best parameter. The problem I am having is incorporating the specified folds into the cross-validation. Here is what I have so far (for Lasso):
from sklearn.linear_model import Lasso, LassoCV
lasso_model = LassoCV(cv=5, alphas=reg_para)
lasso_fit = lasso_model.fit(X,y)
Is there a simple way to incorporate the fold splits? Any help is greatly appreciated
If your data are in a pandas dataframe, then all you need to do is access that column
fold_labels = df["fold"]
from sklearn.cross_validation import LeaveOneLabelOut
cv = LeaveOneLabelOut(fold_labels)
lasso_model = LassoCV(cv=cv, alphas=reg_para)
So if you obtain the fold labels in an array fold_labels, you can just use LeaveOneLabelOut (sorry for the non-functional code; it should be sufficient to elucidate the idea though).
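On current scikit-learn versions, sklearn.cross_validation and LeaveOneLabelOut no longer exist; the closest drop-in for a pre-assigned fold column is PredefinedSplit. A minimal sketch along those lines, reusing reg_para, X, y, and the df holding the fold column from above:
from sklearn.linear_model import LassoCV
from sklearn.model_selection import PredefinedSplit

# each entry of test_fold says which fold that sample belongs to,
# so the 'fold' column can be passed through directly
cv = PredefinedSplit(test_fold=df["fold"].values)

lasso_model = LassoCV(alphas=reg_para, cv=cv)
lasso_fit = lasso_model.fit(X, y)
print(lasso_fit.alpha_)  # best alpha found across the predefined folds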