I'm currently working on a Kaggle dataset regarding Human Resources Analytics.
I've cleaned the dataset, benchmark some models. The best one is the RandomForestClassifier, which predict if a employee left the company or not with a good accuracy (around 99%).
Now, I would like to find the most probable employee still in the company who may leave. I used the predict_proba method on the train model but this gives me the probability that the employee left or not. It's not the probability for the employee to leave. Moreover, the dataset is the one used for the training.
I have no idea, how to predict this kind of information. In a linear regression for example, I'd have look for the closest point to the estimator but with an ensemble, I don't know.
I attached below a piece of code if you want to try it:
dataset = pd.read_csv("HR.csv")
# Cleanup/Preparation datas
convert_dict = {"high" : 3, "medium": 2, "low": 1}
dataset = dataset.replace({"salary": convert_dict})
dataset = pd.get_dummies(dataset)
X = dataset.drop("left", axis=1)
y = dataset["left"]
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# training best model (I pass the benchmark part)
model = RandomForestClassifier(bootstrap=False, n_estimators=50)
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(acc)
# Eval
eval_dataset = dataset[dataset["left"] == 0]
X = eval_dataset.drop("left", axis=1)
y = eval_dataset["left"]
X = scaler.transform(X)
y_pred = model.predict_proba(X) # => This is wrong
Thanks for your support,
You say your model is ~99% accurate, but is that in test? If so great! Now image you have new data coming in that contains all of your data fields, you would be able to use the predict_proba method on each obersavtion/s to predict whether or not they left. In this sense you can use this as a simple proxy for will leave as this is the best you have right now.
I will give you a quick hypothesis to test though. Say all things remained the same for an employee, but time continues to pass. You could update the amount of time an employee has spent at a company and see how the probability of them leaving changes over time. Granted this wouldn't be a great method for predicting several years out (as hopefully people grow and the other parameters change), but it would give you a good idea of how long someone would put up with their current status quo, based on the knowledge learned from the training data.
There are several issues with your question...
I used the predict_proba method on the train model but this gives me the probability that the employee left or not. It's not the probability for the employee to leave.
This is wrong on many levels:
philosophically, since the employee has already either left or not, there is not any actual probability involved here, and that is why the respective data column left is actually binary (0/1) and not in the range [0,1]
computationally, you indeed get what the model would have guessed as a probability of leaving, after training
On close inspection, the data also seem to suffer from class imbalance (in simple words, your 1's are much more than your 0's), which calls for more caution and specialised techniques (vanilla accuracy may be misinformative here).
It is not clear what your code does after #Eval, why you seem to keep only records with left==0, or what exactly best_1 is (your "best" model, perhaps?). But applying predict_proba on your test set X_test will indeed give you the model's probability guess regarding leaving for these (unseen during training) employees.
Related
I'm a person who doesnt get used to loocv yet.
I've been curious of the title problem.
I did leave one out cross validation for my random forest model.
My codes are like this.
for train_index, test_index in loo.split(x):
x_train, x_test = x.iloc[train_index], x.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
model = RandomForestRegressor()
model.fit(x_train, y_train.values.ravel()) #y_train.values.ravel()
y_pred = model.predict(x_test)
#y_pred = [np.round(x) for x in y_pred]
y_tests += y_test.values.tolist()[0]
y_preds += list(y_pred)
rr = metrics.r2_score(y_tests, y_preds)
ms_error = metrics.mean_squared_error(y_tests, y_preds)**0.5
After that, I wanted to get a feature importance of my model like this.
features = x.columns
sorted_idx = model.feature_importances_.argsort()
It's pretty different to what i've expected.
In loocv process, my computer made many different models with using different test and datasets from my original data, which has a length of literally same to original data.
So I'm thinking that feature importances should be multiple as the length of original data, because the test set of each loocv epoch is just one.. (I don't know which word is best for explaining this in English, english is not my mother tongue)
It was not multiple results, just one though.
It was only one feature importance, like calculated for only one sets (as if loocv hadnt added in my codes)
Then why should i had gotten only one importance? I want to understand the reason of it.
Thank you for reading my question.
want to know the reason why I got only one feature importance even though loocv was added in my codes
I'm running a random forest regressor using scikit learn, but all the predictions end up being the same.
I realized that when I fit the data, all the feature importance are zero which is probably why all the predictions are the same.
This is the code that I'm using:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
merged_df = pd.read_csv("/home/jovyan/efs/vliu/combined_data.csv")
target = merged_df["400kmDensity"]
merged_df.drop("400kmDensity", axis = 1, inplace = True)
features_list = list(merged_df.columns)
#Set training and testing groups
train_features, test_features, train_target, test_target = train_test_split(merged_df, target, random_state = 16)
#Train model
rf = RandomForestRegressor(n_estimators = 150, random_state = 16)
ran = rf.fit(train_features, train_target)
print("Feature importances: ", rf.feature_importances_)
#Make predictions and calculate error
predictions = ran.predict(test_features)
print("Predictions: ", predictions)
Here's a link to the data file:
https://drive.google.com/file/d/1ECgKAH82wxIvt2OCv4W5ir1te_Vr3N_r/view?usp=sharing
If anybody can see what I did wrong before fitting the data that would result in the feature importances all being zero, that would be much appreciated.
Both your variables "400kmDensity" and "410kmDensity" share a correlation coefficient of >0.99:
np.corrcoef(merged_df["400kmDensity"], merged_df["410kmDensity"])
This practically means that you can predict "400kmDensity" almost exclusively with "410kmDensity". On a scatter plot they form an almost perfect line:
In order to actually explore what affects the values of "400kmDensity", you should exclude "410kmDensity" as a regressor (an explanatory variable). The feature importance can help to identify explanatory variables afterward.
Note that feature importance may not be a perfect metric to determine actual feature importance. Maybe you want to take a look into other available methods like Boruta Algorithm/Permutation Importance/...
In regard to the initial question: I'm not really sure why, but RandomForestRegressor seems to have a problem with your very low target variable(?). I was able to get feature importances after I scaled train_target and train_features in rf.fit(). However, this should not actually be necessary at all in order to apply Random Forest! You maybe want to take a look into the respective documentation or extend your search in this direction. Hope this serves as a hint.
fitted.rf = rf.fit(scale(train_features), scale(train_target))
As mentioned before, the feature importances after this change unsurprisingly look like this:
Also, the column "second" holds only the value zero, which does not explain anything! Your first step should be always EDA (Explanatory Data Analysis) to get a feeling for the data, like checking correlations between columns or generating histograms in order to explore data distributions [...].
There is much more to it, but I hope this gives you a leg-up!
I have a dataset with 1400 obs and 19 columns. The Target variable has values 1 (value that I am most interested in) and 0. The distribution of classes shows imbalance (70:30).
Using the code below I am getting weird values (all 1s). I am not figuring out if this is due to a problem of overfitting/imbalance data or to feature selection (I used Pearson correlation since all values are numeric/boolean).
I am thinking that the steps followed are wrong.
import numpy as np
import math
import sklearn.metrics as metrics
from sklearn.metrics import f1_score
y = df['Label']
X = df.drop('Label',axis=1)
def create_cv(X,y):
if type(X)!=np.ndarray:
X=X.values
y=y.values
test_size=1/5
proportion_of_true=y[y==1].shape[0]/y.shape[0]
num_test_samples=math.ceil(y.shape[0]*test_size)
num_test_true_labels=math.floor(num_test_samples*proportion_of_true)
num_test_false_labels=math.floor(num_test_samples-num_test_true_labels)
y_test=np.concatenate([y[y==0][:num_test_false_labels],y[y==1][:num_test_true_labels]])
y_train=np.concatenate([y[y==0][num_test_false_labels:],y[y==1][num_test_true_labels:]])
X_test=np.concatenate([X[y==0][:num_test_false_labels] ,X[y==1][:num_test_true_labels]],axis=0)
X_train=np.concatenate([X[y==0][num_test_false_labels:],X[y==1][num_test_true_labels:]],axis=0)
return X_train,X_test,y_train,y_test
X_train,X_test,y_train,y_test=create_cv(X,y)
X_train,X_crossv,y_train,y_crossv=create_cv(X_train,y_train)
tree = DecisionTreeClassifier(max_depth = 5)
tree.fit(X_train, y_train)
y_predict_test = tree.predict(X_test)
print(classification_report(y_test, y_predict_test))
f1_score(y_test, y_predict_test)
Output:
precision recall f1-score support
0 1.00 1.00 1.00 24
1 1.00 1.00 1.00 70
accuracy 1.00 94
macro avg 1.00 1.00 1.00 94
weighted avg 1.00 1.00 1.00 94
Has anyone experienced similar issues in building a classifier when data has imbalance, using CV and/or under sampling? Happy to share the whole dataset, in case you might want to replicate the output.
What I would like to ask you for some clear answer to follow that can show me the steps and what I am doing wrong.
I know that, to reduce overfitting and work with balance data, there are some methods such as random sampling (over/under), SMOTE, CV. My idea is
Split the data on train/test taking into account imbalance
Perform CV on trains set
Apply undersampling only on a test fold
After the model has been chosen with the help of CV, undersample the train set and train the classifier
Estimate the performance on the untouched test set
(f1-score)
as also outlined in this question: CV and under sampling on a test fold .
I think the steps above should make sense, but happy to receive any feedback that you might have on this.
When you have imbalanced data you have to perform stratification. The usual way is to oversample the class that has less values.
Another option is to train your algorithm with less data. If you have a good dataset that should not be a problem. In this case you grab first the samples from the less represented class use the size of the set to compute how many samples to get from the other class:
This code may help you split your dataset that way:
def split_dataset(dataset: pd.DataFrame, train_share=0.8):
"""Splits the dataset into training and test sets"""
all_idx = range(len(dataset))
train_count = int(len(all_idx) * train_share)
train_idx = random.sample(all_idx, train_count)
test_idx = list(set(all_idx).difference(set(train_idx)))
train = dataset.iloc[train_idx]
test = dataset.iloc[test_idx]
return train, test
def split_dataset_stratified(dataset, target_attr, positive_class, train_share=0.8):
"""Splits the dataset as in `split_dataset` but with stratification"""
data_pos = dataset[dataset[target_attr] == positive_class]
data_neg = dataset[dataset[target_attr] != positive_class]
if len(data_pos) < len(data_neg):
train_pos, test_pos = split_dataset(data_pos, train_share)
train_neg, test_neg = split_dataset(data_neg, len(train_pos)/len(data_neg))
# set.difference makes the test set larger
test_neg = test_neg.iloc[0:len(test_pos)]
else:
train_neg, test_neg = split_dataset(data_neg, train_share)
train_pos, test_pos = split_dataset(data_pos, len(train_neg)/len(data_pos))
# set.difference makes the test set larger
test_pos = test_pos.iloc[0:len(test_neg)]
return train_pos.append(train_neg).sample(frac = 1).reset_index(drop = True), \
test_pos.append(test_neg).sample(frac = 1).reset_index(drop = True)
Usage:
train_ds, test_ds = split_dataset_stratified(data, target_attr, positive_class)
You can now perform cross validation on train_ds and evaluate your model in test_ds.
There is another solution that is in the model-level - using models that support weights of samples, such as Gradient Boosted Trees. Of those, CatBoost is usually the best as its training method leads to less leakage (as described in their article).
Example code:
from catboost import CatBoostClassifier
y = df['Label']
X = df.drop('Label',axis=1)
label_ratio = (y==1).sum() / (y==0).sum()
model = CatBoostClassifier(scale_pos_weight = label_ratio)
model.fit(X, y)
And so forth.
This works because Catboost treats each sample with a weight, so you can determine class weights in advance (scale_pos_weight).
This is better than downsampling, and is technically equal to oversampling (but requires less memory).
Also, a major part of treating imbalanced data, is making sure your metrics are weighted as well, or at least well-defined, as you might want equal performance (or skewed performance) on these metrics.
And if you want a more visual output than sklearn's classification_report, you can use one of the Deepchecks built-in checks (disclosure - I'm one of the maintainers):
from deepchecks.checks import PerformanceReport
from deepchecks import Dataset
PerformanceReport().run(Dataset(train_df, label='Label'), Dataset(test_df, label='Label'), model)
your implementation of stratified train/test creation is not optimal, as it lacks randomness. Very often data comes in batches, so it is not a good practice to take sequences of data as is, without shuffling.
as #sturgemeister mentioned, classes ratio 3:7 is not critical, so you should not worry too much of class imbalance. When you artificially change data balance in training you will need to compensate it by multiplication by prior for some algorithms.
as for your "perfect" results either your model overtrained or the model is indeed classifies the data perfectly. Use different train/test split to check this.
another point: your test set is only 94 data points. It is definitely not 1/5 of 1400. Check your numbers.
to get realistic estimates, you need lots of test data. This is the reason why you need to apply Cross Validation strategy.
as for general strategy for 5-fold CV I suggest following:
split your data to 5 folds with respect to labels (this is called stratified split and you can use StratifiedShuffleSplit function)
take 4 splits and train your model. If you want to use under/oversampling, modify the data in those 4 training splits.
apply the model to the remaining part. Do not under/over sample data in the test part. This way you get realistic performance estimate. Save the results.
repeat 2. and 3. for all test splits (totally 5 times obviously). Important: do not change parameters (e.g. tree depth) of the model when training - they should be the same for all splits.
now you have all your data points tested without being trained on them. This is the core idea of cross validation. Concatenate all the saved results, and estimate the performance .
Cross-validation or held-out set
First of all, you are not doing cross-validation. You are splitting your data in a train/validation/test set, which is good, and often sufficient when the number of training samples is large (say, >2e4). However, when the number of samples is small, which is your case, cross-validation becomes useful.
It is explained in depth in scikit-learn's documentation. You will start by taking out a test set from your data, as your create_cv function does. Then, you split the rest of the training data in e.g. 3 splits. Then, you do, for i in {1, 2, 3}: train on data j != i, evaluate on data i. The documentation explains it with prettier and colorful figures, you should have a look! It can be quite cumbersome to implement, but hopefully scikit does it out of the box.
As for the dataset being unbalanced, it is a very good idea to keep the same ratio of labels in each set. But again, you can let scikit handle it for you!
Purpose
Also, the purpose of cross-validation is to choose the right values for the hyper-parameters. You want the right amount of regularization, not too big (under-fitting) nor too small (over-fitting). If you're using a decision tree, the maximum depth (or the minimum number of samples per leaf) is the right metric to consider to estimate the regularization of your method.
Conclusion
Simply use GridSearchCV. You will have cross-validation and label balance done for you.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/5, stratified=True)
tree = DecisionTreeClassifier()
parameters = {'min_samples_leaf': [1, 5, 10]}
clf = GridSearchCV(svc, parameters, cv=5) # Specifying cv does StratifiedShuffleSplit, see documentation
clf.fit(iris.data, iris.target)
sorted(clf.cv_results_.keys())
You can also replace the cv variable by a fancier shuffler, such as StratifiedGroupKFold (no intersection between groups).
I would also advise looking towards random trees, which are less interpretable but said to have better performances in practice.
Just wanted to add thresholding and cost sensitive learning to the list of possible approaches mentioned by the others. The former is well described here and consists in finding a new threshold for classifying positive vs negative classes (generally is 0.5 but it can be treated as an hyper parameter). The latter consists on weighting the classes to cope with their unbalancedness. This article was really useful to me to understand how to deal with unbalanced data sets. In it, you can find also cost sensitive learning with a specific explanation using decision tree as a model. Also all other approaches are really nicely reviewed including: Adaptive Synthetic Sampling, informed undersampling etc.
I have 15 features with a binary response variable and I am interested in predicting probabilities than 0 or 1 class labels. When I trained and tested the RF model with 500 trees, CV, balanced class weight, and balanced samples in the data frame, I achieved a good amount of accuracy and also good Brier score. As you can see in the image, the predicted probabilities values of class 1 on test data are in between 0 to 1.
Here is the Histogram of predicted probabilities on test data:
with majority values at 0 - 0.2 and 0.9 to 1, which is much accurate.
But when I try to predict the probability values for unseen data or let's say all data points for which value of 0 or 1 is unknown, the predicted probabilities values are between 0 to 0.5 only for class 1. Why is that so? Aren't the values should be from 0.5 to 1?
Here is the histogram of predicted probabilities on unseen data:
I am using sklearn RandomforestClassifier in python. The code is below:
#Read the CSV
df=pd.read_csv('path/df_all.csv')
#Change the type of the variable as needed
df=df.astype({'probabilities': 'int32', 'CPZ_CI_new.tif' : 'category'})
#Response variable is between 0 and 1 having actual probabilities values
y = df['probabilities']
# Separate majority and minority classes
df_majority = df[y == 0]
df_minority = df[y == 1]
# Upsample minority class
df_minority_upsampled = resample(df_minority,
replace=True, # sample with replacement
n_samples=100387, # to match majority class
random_state=42) # reproducible results
# Combine majority class with upsampled minority class
df1 = pd.concat([df_majority, df_minority_upsampled])
y = df1['probabilities']
X = df1.iloc[:,1:138]
#Change interfere values to category
y_01=y.astype('category')
#Split training and testing
X_train, X_valid, y_train, y_valid = train_test_split(X, y_01, test_size = 0.30, random_state = 42,stratify=y)
#Model
model=RandomForestClassifier(n_estimators = 500,
max_features= 'sqrt',
n_jobs = -1,
oob_score = True,
bootstrap = True,
random_state=0,class_weight='balanced',)
#I had 137 variable, to select the optimum one, I used RFECV
rfecv = RFECV(model, step=1, min_features_to_select=1, cv=10, scoring='neg_brier_score')
rfecv.fit(X_train, y_train)
#Retrained the model with only 15 variables selected
rf=RandomForestClassifier(n_estimators = 500,
max_features= 'sqrt',
n_jobs = -1,
oob_score = True,
bootstrap = True,
random_state=0,class_weight='balanced',)
#X1_train is same dataframe with but with only 15 varible
rf.fit(X1_train,y_train)
#Printed ROC metric
print('roc_auc_score_testing:', metrics.roc_auc_score(y_valid,rf.predict(X1_valid)))
#Predicted probabilties on test data
predv=rf.predict_proba(X1_valid)
predv = predv[:, 1]
print('brier_score_training:', metrics.brier_score_loss(y_train, predt))
print('brier_score_testing:', metrics.brier_score_loss(y_valid, predv))
#Output is,
roc_auc_score_testing: 0.9832652130944419
brier_score_training: 0.002380976369884945
brier_score_testing: 0.01669848089917487
#Later, I have images of that 15 variables, I created a data frame out(sample_img) of it and use the same function to predict probabilities.
IMG_pred=rf.predict_proba(sample_img)
IMG_pred=IMG_pred[:,1]
The results shown for your test data are not valid; you perform a mistaken procedure that has two serious consequences, which invalidate them.
The mistake here is that you perform the minority class upsampling before splitting to train & test sets, which should not be the case; you should first split into training and test sets, and then perform the upsampling only to the training data and not to the test ones.
The first reason why such a procedure is invalid is that, this way, some of the duplicates due to upsampling will end up both to the training and the test splits; the result being that the algorithm is tested with some samples that have already been seen during training, which invalidates the very fundamental requirement of a test set. For more details, see own answer in Process for oversampling data for imbalanced binary classification; quoting from there:
I once witnessed a case where the modeller was struggling to understand why he was getting a ~ 100% test accuracy, much higher than his training one; turned out his initial dataset was full of duplicates -no class imbalance here, but the idea is similar- and several of these duplicates naturally ended up in his test set after the split, without of course being new or unseen data...
The second reason is that this procedure shows biased performance measures in a test set that is no longer representative of reality: remember, we want our test set to be representative of the real unseen data, which of course will be imbalanced; artificially balancing our test set and claiming that it has X% accuracy when a great part of this accuracy will be due to the artificially upsampled minority class makes no sense, and gives misleading impressions. For details, see own answer in Balance classes in cross validation (the rationale is identical for the case of train-test split, as here).
The second reason is why your procedure would still be wrong even if you had not performed the first mistake, and you had proceeded to upsample the training and test sets separately after splitting.
I short, you should remedy the procedure, so that you first split into training & test sets, and then upsample your training set only.
I was working on a knearest neighbours problem set. I couldn't understand why are they performing K fold cross validation on test set?? Cant we directly test how well our best parameter K performed on the entire test data? rather than doing a cross validation?
iris = sklearn.datasets.load_iris()
X = iris.data
Y = iris.target
X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(
X, Y, test_size=0.33, random_state=42)
k = np.arange(20)+1
parameters = {'n_neighbors': k}
knn = sklearn.neighbors.KNeighborsClassifier()
clf = sklearn.grid_search.GridSearchCV(knn, parameters, cv=10)
clf.fit(X_train, Y_train)
def computeTestScores(test_x, test_y, clf, cv):
kFolds = sklearn.cross_validation.KFold(test_x.shape[0], n_folds=cv)
scores = []
for _, test_index in kFolds:
test_data = test_x[test_index]
test_labels = test_y[test_index]
scores.append(sklearn.metrics.accuracy_score(test_labels, clf.predict(test_data)))
return scores
scores = computeTestScores(test_x = X_test, test_y = Y_test, clf=clf, cv=5)
TL;DR
Did you ever have a science teacher who said, 'any measurement without error bounds is meaningless?'
You might worry that the score on using your fitted, hyperparameter optimized, estimator on your test set is a fluke. By doing a number of tests on a randomly chosen subsample of the test set you get a range of scores; you can report their mean and standard deviation etc. This is, hopefully, a better proxy for how the estimator will perform on new data from the wild.
The following conceptual model may not apply to all estimators but it is a useful to bear in mind. You end up needing 3 subsets of your data. You can skip to the final paragraph if the numbered points are things you are already happy with.
Training your estimator will fit some internal parameters that you need not ever see directly. You optimize these by training on the training set.
Most estimators also have hyperparameters (number of neighbours, alpha for Ridge, ...). Hyperparameters also need to be optimized. You need to fit them to a different subset of your data; call it the validation set.
Finally, when you are happy with the fit of both the estimator's internal parameters and the hyperparmeters, you want to see how well the fitted estimator predicts on new data. You need a final subset (the test set) of your data to figure out how well the training and hyperparameter optimization went.
In lots of cases the partitioning your data into 3 means you don't have enough samples in each subset. One way around this is to randomly split the training set a number of times, fit hyperparameters and aggregate the results. This also helps stop your hyperparameters being over-fit to a particular validation set. K-fold cross-validation is one strategy.
Another use for this splitting a data set at random is to get a range of results for how your final estimator did. By splitting the test set and computing the score you get a range of answers to 'how might we do on new data'. The hope is that this is more representative of what you might see as real-world novel data performance. You can also get a standard deviation for you final score. This appears to be what the Harvard cs109 gist is doing.
If you make a program that adapts to input, then it will be optimal for the input you adapted it to.
This leads to a problem known as overfitting.
In order to see if you have made a good or a bad model, you need to test it on some other data that is not what you used to make the model. This is why you separate your data into 2 parts.