Feature selection (embedded method) showing wrong features - Python

In feature selection (embedded method) I'm getting the wrong features.
Feature selection code:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# create the random forest model
model = RandomForestRegressor(n_estimators=120)
# fit the model to start training
model.fit(X_train[_columns], X_train['delay_in_days'])
# get the importance of the resulting features
importances = model.feature_importances_
# create a data frame for visualization
final_df = pd.DataFrame({"Features": X_train[_columns].columns, "Importances": importances})
# set_index returns a new frame, so assign it back
final_df = final_df.set_index('Features')
# sort in descending order
final_df = final_df.sort_values('Importances', ascending=False)
# visualise feature importance
pd.Series(model.feature_importances_, index=X_train[_columns].columns).nlargest(10).plot(kind='barh')
_columns  # some of my selected features
Here is the feature importance plot; as you can see, total_open_amount is a very important feature.
But when I put the top 3 features in my model I get a negative R2 score, whereas if I remove
total_open_amount from my model I get a decent R2 score.
My question is: what is causing this? (All the train and test data are randomly selected from a dataset of size 100,000.)
clf = RandomForestRegressor()
clf.fit(x_train, y_train)
# Predicting the Test Set Results
predicted = clf.predict(x_test)

This is an educated guess, since you did not provide the data itself. Looking at the names of your features, the most important ones are the customer name and the total open amount. I suppose these are features with a lot of unique values.
If you check the help page for random forest, it does mention:
Warning: impurity-based feature importances can be misleading for high
cardinality features (many unique values). See
sklearn.inspection.permutation_importance as an alternative.
This is also mentioned in a publication by Strobl et al:
We show that random forest variable importance measures are a sensible
means for variable selection in many applications, but are not
reliable in situations where potential predictor variables vary in
their scale of measurement or their number of categories.
I would try permutation importance and see whether you get the same results.
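A minimal sketch of what that check could look like, assuming the model and _columns from the question plus a held-out X_test / y_test split that was not used for fitting (those names are assumptions, not code from the question):
from sklearn.inspection import permutation_importance

# permutation importance is computed on held-out data, so it is not biased
# towards high-cardinality features the way impurity-based importance is
result = permutation_importance(
    model, X_test[_columns], y_test,
    n_repeats=10, random_state=0, scoring="r2"
)
perm_importances = pd.Series(result.importances_mean, index=_columns).sort_values(ascending=False)
print(perm_importances)
If total_open_amount drops sharply in this ranking, the impurity-based importance was most likely inflated by its cardinality.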

Related

Random Forest Regressor Feature Importance all zero

I'm running a random forest regressor using scikit learn, but all the predictions end up being the same.
I realized that when I fit the data, all the feature importances are zero, which is probably why all the predictions are the same.
This is the code that I'm using:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
merged_df = pd.read_csv("/home/jovyan/efs/vliu/combined_data.csv")
target = merged_df["400kmDensity"]
merged_df.drop("400kmDensity", axis = 1, inplace = True)
features_list = list(merged_df.columns)
#Set training and testing groups
train_features, test_features, train_target, test_target = train_test_split(merged_df, target, random_state = 16)
#Train model
rf = RandomForestRegressor(n_estimators = 150, random_state = 16)
ran = rf.fit(train_features, train_target)
print("Feature importances: ", rf.feature_importances_)
#Make predictions and calculate error
predictions = ran.predict(test_features)
print("Predictions: ", predictions)
Here's a link to the data file:
https://drive.google.com/file/d/1ECgKAH82wxIvt2OCv4W5ir1te_Vr3N_r/view?usp=sharing
If anybody can see what I did wrong before fitting the data that would result in the feature importances all being zero, that would be much appreciated.
Both your variables "400kmDensity" and "410kmDensity" share a correlation coefficient of >0.99:
np.corrcoef(merged_df["400kmDensity"], merged_df["410kmDensity"])
This practically means that you can predict "400kmDensity" almost exclusively with "410kmDensity". On a scatter plot they form an almost perfect line.
In order to actually explore what affects the values of "400kmDensity", you should exclude "410kmDensity" as a regressor (an explanatory variable). The feature importance can help to identify explanatory variables afterward.
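A small sketch of that diagnostic and fix, assuming a fresh read of the linked CSV from a hypothetical local path (the column names come from the question):
import numpy as np
import pandas as pd

df = pd.read_csv("combined_data.csv")  # hypothetical local copy of the linked file

# the two density columns are almost perfectly correlated
print(np.corrcoef(df["400kmDensity"], df["410kmDensity"])[0, 1])

# drop the near-duplicate column before building the feature matrix,
# so the remaining features get a chance to show up in the importances
target = df["400kmDensity"]
features = df.drop(columns=["400kmDensity", "410kmDensity"])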
Note that feature importance may not be a perfect metric to determine actual feature importance. Maybe you want to take a look into other available methods like Boruta Algorithm/Permutation Importance/...
In regard to the initial question: I'm not really sure why, but RandomForestRegressor seems to have a problem with your very small target values. I was able to get feature importances after I scaled train_target and train_features in rf.fit(). However, this should not actually be necessary in order to apply a random forest! You may want to take a look into the respective documentation or extend your search in this direction. Hope this serves as a hint.
from sklearn.preprocessing import scale
fitted_rf = rf.fit(scale(train_features), scale(train_target))
As mentioned before, the feature importances are no longer all zero after this change.
Also, the column "second" holds only the value zero, which does not explain anything! Your first step should always be EDA (Exploratory Data Analysis) to get a feeling for the data, like checking correlations between columns or generating histograms in order to explore the data distributions [...].
There is much more to it, but I hope this gives you a leg-up!

While doing feature selection using SelectKBest, can we use the same code for the training and testing set? Will it give the same features?

x_tr = SelectKBest(chi2, k=25).fit_transform(x_tr,y_tr)
x_ts = SelectKBest(chi2, k=25).fit_transform(x_ts, y_ts)
This is the code I have. I'm worried that it will select different features for the training and testing data. Should I change the code or will it give the same features?
Short answer: You will obtain different features (unless you are lucky).
Why? Basically, because you are obtaining information from different data:
x_tr = SelectKBest(chi2, k=25).fit_transform(x_tr,y_tr)
x_ts = SelectKBest(chi2, k=25).fit_transform(x_ts, y_ts)
In the first line you obtain the features from x_tr and y_tr; in the second line, you obtain the features from x_ts and y_ts. So it makes sense that the outputs you obtain are different. To sum up, if the input is different, the output has a high probability of being different too.
The only case in which you will obtain the same features is when the training and test data are extremely homogeneous and carry exactly the same information. In your case, you are asking for 25 features, and it would be quite difficult to obtain exactly the same 25 features from each line of code.
If you want to apply the transformation use this code:
select = SelectKBest(score_func=chi2, k=25)        # define the selector using the SelectKBest class
x_tr_selected = select.fit_transform(x_tr, y_tr)   # fit on x_tr and y_tr, then transform x_tr
x_ts_selected = select.transform(x_ts)             # only transform x_ts, using the information from the previous fit
To get the same features, you have to fit on training data, and then transform both training and testing data.
select = SelectKBest(chi2, k=25).fit(x_tr, y_tr)
X_train_new = select.transform(x_tr)
X_test_new = select.transform(x_ts)
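Equivalently, a hedged sketch using a Pipeline keeps the selector fitted on the training data only and reapplies the same 25 features at prediction time (the LogisticRegression here is just a placeholder final estimator, not part of the original question):
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("select", SelectKBest(chi2, k=25)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(x_tr, y_tr)           # SelectKBest is fitted on the training data only
print(pipe.score(x_ts, y_ts))  # the same 25 features are applied to the test data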

How is it that the accuracy score for 10-fold cross validation is worse than for a 90-10 train_test_split using sklearn?

The task is binary classification via a neural network. The data is present in a dictionary that contains composite names (as keys) for each entry and the labels (0 or 1, as the third element in the value vector). The first and second elements are the two parts of the composite name, which are used later to extract the corresponding features.
In both cases, the dictionary is transformed into two arrays for the purpose of performing a balanced undersampling of the majority class (which makes up 66% of the data):
data_for_sampling = np.asarray([key for key in list(data.keys())])
labels_for_sampling = [element[2] for element in list(data.values())]
sampler = RandomUnderSampler(sampling_strategy = 'majority')
data_sampled, label_sampled = sampler.fit_resample(data_for_sampling.reshape(-1, 1), labels_for_sampling)
Then the resampled arrays of names and labels are used to create train and test sets via the Kfold method:
kfolder = KFold(n_splits = 10, shuffle = True)
kfolder.get_n_splits(data_sampled)
for train_index, test_index in kfolder.split(data_sampled):
    data_train, data_test = data_sampled[train_index], data_sampled[test_index]
Or the train_test_split method:
data_train, data_test, label_train, label_test = train_test_split(data_sampled, label_sampled, test_size = 0.1, shuffle = True)
Finally, the names from data_train and data_test are used to re-extract the relevant entries (by key) from the original dictionary, which is then used to gather the features of those entries.
As far as I can tell, a single split of the 10-fold sets should provide a similar train-test data distribution as the 90-10 train_test_split, and that seems to be true during training, where both training sets result in ~0.82 accuracy after only one epoch, run separately with model.fit(). However, when the corresponding models are evaluated using model.evaluate() on the test sets after said epoch, the set from train_test_split gives a score of ~0.86, while the set from KFold is ~0.72.
I have done numerous tests to see if it's just a bad random seed (the seed is not fixed), but the results stayed the same. The sets also have correctly balanced label distributions and seemingly well-shuffled entries.
As it turns out, the problem originates from a combination of sources:
While shuffle = True in the train_test_split() method properly shuffles the provided data first and then splits it into the desired parts, shuffle = True in the KFold method only results in randomly built folds; the data within each fold remains ordered.
This is something the documentation points out, thanks to this post:
https://github.com/scikit-learn/scikit-learn/issues/16068
Now, during learning, my custom train function applies shuffle again on the train data, just to be sure, but it does not shuffle the test data. Moreover, model.evaluate() defaults to batch_size = 32 if no parameter is given, which, paired with the ordered test data, resulted in the discrepancy in the validation accuracy. The test data is indeed flawed in the sense that it contains a large portion of "hard-to-predict" entries, which were clustered together thanks to the ordering and seem to have dragged down the average accuracy in the results. A completed run across all N folds, as pointed out by TC Arlen, may indeed have given a more precise estimate in the end, but I expected closer results after only one fold, which led to the discovery of this problem.
Depending on the amount of noise in the data and on the size of the dataset, it can be expected behavior for scores on out-of-sample data to deviate by this amount. One split is not guaranteed to be just like any other split, which is why you have 10 of them in the first place and then average across all the results.
What you should trust to be the most generalizable is not any one given split (whether that comes from one of the 10 folds or from train_test_split()); what is far more trustworthy is the average result across all N folds.
Digging deeper into the data could reveal whether there is some reason why one or more splits deviate so much from the others. For example, perhaps there is some feature in your data (e.g. "date the sample was collected", where the collection methodology changed from month to month) that makes the splits differ from one another in a biased way. If that is the case, you should use a stratified test split (in your CV as well; see the scikit-learn documentation) so you can get a more unbiased grouping of your data.
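A rough sketch of that advice, assuming a hypothetical build_model() factory for the Keras network and a make_features() helper that maps the sampled keys back to feature arrays (both are assumptions, not code from the question):
import numpy as np
from sklearn.model_selection import StratifiedKFold

# stratified folds keep the label balance in every split; the number to
# report is the average across all folds, not any single fold
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
labels = np.asarray(label_sampled)
scores = []
for train_idx, test_idx in skf.split(data_sampled, labels):
    model = build_model()  # hypothetical factory returning a compiled Keras model
    model.fit(make_features(data_sampled[train_idx]), labels[train_idx],
              epochs=1, shuffle=True)
    # assumes the model was compiled with an accuracy metric
    _, acc = model.evaluate(make_features(data_sampled[test_idx]), labels[test_idx])
    scores.append(acc)
print("mean accuracy over folds:", np.mean(scores))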

SelectFromModel from sklearn gives significantly different features on random forest and gradient boosting classifier

As mentioned in the title, I'm using SelectFromModel from sklearn to select features for both my random forest and gradient boosting classification models.
#feature selection performed on training dataset to prevent overfitting
sel = SelectFromModel(GradientBoostingClassifier(n_estimators=10, learning_rate=0.25,max_depth=1, max_features = 15, random_state=0).fit(X_train_bin, y_train))
sel.fit(X_train_bin, y_train)
#returns a boolean array to indicate which features are of importance (above the mean threshold)
sel.get_support()
#shows the names of the selected features
selected_feat= X_train_bin.columns[(sel.get_support())]
selected_feat
The boolean arrays returned for the random forest and gradient boosting models are COMPLETELY different. The random forest feature selection tells me to drop an additional 4 columns (out of 25 features), while the feature selection on the gradient boosting model is telling me to drop nearly everything. What is happening here?
EDIT: I'm trying to compare the performance of these 2 models on my dataset. Should I move the threshold so I at least have approximately the same number of features to train on?
There's no reason for them to select the same variables. GradientBoostingClassifier builds each tree to improve on the error of the previous step, while RandomForestClassifier trains independent trees that have nothing to do with each other's errors.
Another reason why they may select different features is the split criterion, which by default is Gini impurity for random forests and Friedman MSE for gradient boosting. Finally, it could be because both algorithms select random subsets of features when making each split, so they do not compare the same variables in the same order, which will naturally yield different importances.
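If you want a like-for-like comparison, one option (a sketch, assuming the X_train_bin / y_train from the question) is to fix the number of selected features with max_features instead of relying on the default mean-importance threshold:
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

n_keep = 15  # arbitrary choice for illustration

# threshold=-np.inf disables the importance threshold, so exactly
# max_features features are kept, ranked by importance
rf_sel = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0),
                         max_features=n_keep, threshold=-np.inf)
gb_sel = SelectFromModel(GradientBoostingClassifier(n_estimators=10, learning_rate=0.25,
                                                    max_depth=1, max_features=15, random_state=0),
                         max_features=n_keep, threshold=-np.inf)

rf_feat = X_train_bin.columns[rf_sel.fit(X_train_bin, y_train).get_support()]
gb_feat = X_train_bin.columns[gb_sel.fit(X_train_bin, y_train).get_support()]
print("features selected by both:", set(rf_feat) & set(gb_feat))
The two lists can still differ, but at least they are the same size, which makes the downstream performance comparison fairer.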

RandomForest classification - closest point to change class

I'm currently working on a Kaggle dataset regarding Human Resources Analytics.
I've cleaned the dataset and benchmarked some models. The best one is the RandomForestClassifier, which predicts whether an employee left the company or not with good accuracy (around 99%).
Now, I would like to find the employees still in the company who are most likely to leave. I used the predict_proba method on the trained model, but this gives me the probability that the employee has left or not; it's not the probability of the employee leaving in the future. Moreover, the dataset is the one used for training.
I have no idea how to predict this kind of information. In a linear regression, for example, I'd have looked for the closest point to the estimator, but with an ensemble I don't know.
I attached below a piece of code if you want to try it:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

dataset = pd.read_csv("HR.csv")
# Cleanup/preparation of the data
convert_dict = {"high": 3, "medium": 2, "low": 1}
dataset = dataset.replace({"salary": convert_dict})
dataset = pd.get_dummies(dataset)
X = dataset.drop("left", axis=1)
y = dataset["left"]
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# training the best model (I skip the benchmark part)
model = RandomForestClassifier(bootstrap=False, n_estimators=50)
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(acc)
# Eval
eval_dataset = dataset[dataset["left"] == 0]
X = eval_dataset.drop("left", axis=1)
y = eval_dataset["left"]
X = scaler.transform(X)
y_pred = model.predict_proba(X)  # => This is wrong
Thanks for your support,
You say your model is ~99% accurate, but is that on the test set? If so, great! Now imagine you have new data coming in that contains all of your data fields; you would be able to use the predict_proba method on each observation to predict whether or not they left. In this sense you can use it as a simple proxy for "will leave", as this is the best you have right now.
I will give you a quick hypothesis to test, though. Say all things remained the same for an employee, but time continues to pass. You could update the amount of time an employee has spent at the company and see how the probability of them leaving changes over time. Granted, this wouldn't be a great method for predicting several years out (as hopefully people grow and the other parameters change), but it would give you a good idea of how long someone would put up with their current status quo, based on the knowledge learned from the training data.
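A rough sketch of that hypothesis test, using the column name time_spend_company from the Kaggle HR data and the dataset, scaler and model objects from the question (treat it as an illustration, not a drop-in solution):
import numpy as np

# take the employees still at the company, add extra years of tenure,
# and watch how the predicted probability of leaving (class 1) evolves
stayers = dataset[dataset["left"] == 0].drop("left", axis=1)

for extra_years in range(0, 4):
    future = stayers.copy()
    future["time_spend_company"] += extra_years
    proba_leave = model.predict_proba(scaler.transform(future))[:, 1]
    print(extra_years, "extra year(s): mean P(leave) =", round(proba_leave.mean(), 3))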
There are several issues with your question...
I used the predict_proba method on the train model but this gives me the probability that the employee left or not. It's not the probability for the employee to leave.
This is wrong on many levels:
philosophically, since the employee has already either left or not, there is no actual probability involved here; that is why the respective data column left is binary (0/1) and not in the range [0, 1]
computationally, you do indeed get what the model estimates, after training, as the probability of leaving
On closer inspection, the data also seem to suffer from class imbalance (in simple words, one class is much more frequent than the other), which calls for more caution and specialised techniques (vanilla accuracy may be misleading here).
It is not clear what your code does after # Eval, or why you seem to keep only records with left == 0. But applying predict_proba to your test set X_test will indeed give you the model's probability estimate of leaving for these (unseen during training) employees.
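A minimal sketch of that last point, using the X_test / y_test split and model from the question (column [:, 1] of predict_proba is the probability of class 1, i.e. left):
import numpy as np

# probability of leaving for the held-out employees
p_leave = model.predict_proba(X_test)[:, 1]

# among test-set employees who have not left, rank by predicted risk of leaving
still_here = (y_test.values == 0)
ranking = np.argsort(p_leave[still_here])[::-1]  # indices of highest-risk stayers first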
