I have an XGBoost model and the Gini feature importance that I get if I pass in standardized data vs non-standardized data is completely different. I thought xgboost is immune to standardization. Can you please help me understand?
from xgboost import XGBClassifier
import pandas as pd
import matplotlib.pyplot as plt
def train_model(x_train, y_train, x_test, y_test):
xg = XGBClassifier(n_estimators=350, subsample=1.0, scale_pos_weight=25, min_child_weight=8, max_depth=21,
learning_rate=0.1, gamma=0.2, colsample_bytree=0.5)
xg.fit(x_train, y_train)
return xg
def find_feat_imp(model, cols):
feat_importance = pd.Series(model.feature_importances_, index=cols)
sort_feat_imp = feat_importance.sort_values(ascending=False)
sort_feat_imp
sort_feat_imp[:20].plot(kind='bar')
plt.show()
x_train, y_train, x_test, y_test = read_data()
xg = train_model(x_train, y_train, x_test, y_test)
find_feat_imp(xg,x_train.columns)
The output of standardized data is (https://i.stack.imgur.com/i6sW0.png) and the output of raw data is (https://i.stack.imgur.com/If5u8.png)
The important features are completely different. Can you please help me understand?
I was expecting the feature importance to be the same since xgboost is not affected by standardizing the data.
You haven't set a random seed, so the models may be different just due to random chance. In particular, because you set colsample_bytree<1, you get a random effect in which columns are available per tree. Since the first trees are usually the most impactful, if a feature happens to be left out of them, its importance score will suffer. (Note however that xgboost supports several feature importance types; this effect might be more or less noticeable depending on which type you are using.)
XGBoost, like many other tree-based models, is not affected by feature scaling because the algorithm splits the data based on the values of the features, rather than using the scale of the features. However, the feature importance scores generated by XGBoost can be affected by feature scaling because they are based on the feature's contribution to the reduction of the impurity criterion (e.g. Gini impurity) at each split. Standardizing the data can change the relative scale of the features, which in turn can change their relative importance as determined by the impurity reduction criterion. Therefore, it is recommended to use the same data preprocessing method when comparing feature importance values.
Related
I'm running a random forest regressor using scikit learn, but all the predictions end up being the same.
I realized that when I fit the data, all the feature importance are zero which is probably why all the predictions are the same.
This is the code that I'm using:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
merged_df = pd.read_csv("/home/jovyan/efs/vliu/combined_data.csv")
target = merged_df["400kmDensity"]
merged_df.drop("400kmDensity", axis = 1, inplace = True)
features_list = list(merged_df.columns)
#Set training and testing groups
train_features, test_features, train_target, test_target = train_test_split(merged_df, target, random_state = 16)
#Train model
rf = RandomForestRegressor(n_estimators = 150, random_state = 16)
ran = rf.fit(train_features, train_target)
print("Feature importances: ", rf.feature_importances_)
#Make predictions and calculate error
predictions = ran.predict(test_features)
print("Predictions: ", predictions)
Here's a link to the data file:
https://drive.google.com/file/d/1ECgKAH82wxIvt2OCv4W5ir1te_Vr3N_r/view?usp=sharing
If anybody can see what I did wrong before fitting the data that would result in the feature importances all being zero, that would be much appreciated.
Both your variables "400kmDensity" and "410kmDensity" share a correlation coefficient of >0.99:
np.corrcoef(merged_df["400kmDensity"], merged_df["410kmDensity"])
This practically means that you can predict "400kmDensity" almost exclusively with "410kmDensity". On a scatter plot they form an almost perfect line:
In order to actually explore what affects the values of "400kmDensity", you should exclude "410kmDensity" as a regressor (an explanatory variable). The feature importance can help to identify explanatory variables afterward.
Note that feature importance may not be a perfect metric to determine actual feature importance. Maybe you want to take a look into other available methods like Boruta Algorithm/Permutation Importance/...
In regard to the initial question: I'm not really sure why, but RandomForestRegressor seems to have a problem with your very low target variable(?). I was able to get feature importances after I scaled train_target and train_features in rf.fit(). However, this should not actually be necessary at all in order to apply Random Forest! You maybe want to take a look into the respective documentation or extend your search in this direction. Hope this serves as a hint.
fitted.rf = rf.fit(scale(train_features), scale(train_target))
As mentioned before, the feature importances after this change unsurprisingly look like this:
Also, the column "second" holds only the value zero, which does not explain anything! Your first step should be always EDA (Explanatory Data Analysis) to get a feeling for the data, like checking correlations between columns or generating histograms in order to explore data distributions [...].
There is much more to it, but I hope this gives you a leg-up!
For some reasons, I have base dataframes of the following structure
print(df1.shape)
display(df1.head())
print(df2.shape)
display(df2.head())
Where the top dataframe is my features set and my bottom is the output set. To turn this into a problem that is amenable to data modeling I first do:
x_train, x_test, y_train, y_test = train_test_split(df1, df2, train_size = 0.8)
I then have a split for 80% training and 20% testing.
Since the output set (df2; y_test/y_train) is individual measurements with no inherent meaning on their own, I calculate pairwise distances between the labels to generate a single output value denoting the pairwise distances between observations using (the distances are computed after z-scoring; the z-scoring code isn't described here but it is done):
y_train = pdist(y_train, 'euclidean')
y_test = pdist(y_test, 'euclidean')
Similarly I then apply this strategy to the features set to generate pairwise distances between individual observations of each of the instances of each feature.
def feature_distances(input_vector):
modified_vector = np.array(input_vector).reshape(-1,1)
vector_distances = pdist(modified_vector, 'euclidean')
vector_distances = pd.Series(vector_distances)
return vector_distances
x_train = x_train.apply(feature_distances, axis = 0)
x_test = x_test.apply(feature_distances, axis = 0)
I then proceed to train & test all of my models.
For now I am trying linear regression , random forest, xgboost.
Is there any easy way to implement a cross validation scheme in my dataset?
Since my problem requires calculating pairwise distances between observations, I am struggling to identify an easy way to do cross validation schemes to optimize parameter tuning.
GridsearchCV doesn't quite work here since in each instance of the test/train split, distances have to be recomputed to avoid contamination of test with train.
Hope it's clear!
First, what I understood from the shape of your data frames that you have 42 samples and 1643 features in the input, and each output vector consists of 392 values.
Huge Input: In case, you are sure that your problem has 1643 features, you might need to use PCA to reduce the dimensionality instead of pairwise distance. You should collect more samples instead of 42 samples to avoid overfitting because it is not enough data to train and test your model.
Huge Output: you could use sampled_softmax_loss to speed up the training process as mentioned in TensorFlow documentation . You could also read this here. In case, you do not want to follow this approach, you can continue training with this output but it takes some time.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=n)
here X is independent feature, y is dependent feature means what you actually want to predict - it could be label or continuous value. We used train_test_split on train dataset and we are using (x_train, y_train) to train model and (x_test, y_test) to test model to ensure performance of model on unknown data(x_test, y_test). In your case you have given y as df2 which is wrong just figure out your target feature and give it as y and there is no need to split test data.
I am using xgboost for a classification problem with an imbalanced dataset. I plan on using some combination of an f1-score or roc-auc as my primary criteria for judging the model.
Currently the default value returned from the score method is accuracy, but I would really like to have a specific evaluation metric returned instead. My big motivation for doing this is that I presume the feature_importances_ attribute from the model is determined from what's affecting the score method, and the columns that impact predictive accuracy might very well be different from the columns that impact roc-auc. Right now I am passing in values to eval_metric but it does not seem to be making a difference.
Here is some sample code:
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
data = load_breast_cancer()
X = data['data']
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2, stratify=y)
mod.fit(X_train, y_train)
Now at this point, mod.score(X_test, y_test) will return a value of ~ 0.96, and the roc_auc_score is ~ 0.99.
I was hoping the following snippet:
mod.fit(X_train, y_train, eval_metric='auc')
Would then allow mod.score(X_test, y_test) to return the roc_auc_score value, but it is still returning predictive accuracy, not roc_auc.
The purpose of this exercise is estimating the influence of different columns on the outcome, so if I could get feature_importances_ returned using f1 or roc_auc as the measure of impact this would be a huge boon, but I do not seem to be on the right path as of now.
Thank you.
There are two parts to your question, to use eval_metric, you need to provide data to evaluate using eval_set = :
mod = XGBClassifier()
mod.fit(X_train, y_train,eval_set=[(X_test,y_test)],eval_metric="auc")
You can check the auc using evals_result(), and it gives the auc for every iteration:
mod.evals_result()
{'validation_0': OrderedDict([('auc',
[0.965939,
0.9833,
0.984788,
[...]
0.991402,
0.991071,
0.991402,
0.991733])])}
The importance score is calculated based on the average gain across all splits the feature is used in see help page. From your question, I suppose you need the mdoel to maximize auc, like in cross-validation, but you cannot use the auc as an objective in xgboost. Gradient boosting methods require a differentiable loss function.
With imbalanced dataset, you can try to adjust the parameter scale_pos_weight, to adjust the balance of positive and negative weights. This is discussed in xgboost website
I'm running a feature selection using sns.heatmap and one using sklearn feature_importances.
When using the same data I get two difference values.
Here is the heatmap
and heatmap code
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
training_data = pd.read_csv(
"/Users/aus10/NFL/Data/Betting_Data/CBB/Training_Data_Betting_CBB.csv")
df_model = training_data.copy()
df_model = df_model.dropna()
df_model = df_model.drop(['Money_Line', 'Money_Line_Percentage', 'Money_Line_Money', 'Money_Line_Move', 'Money_Line_Direction', "Spread", 'Spread_Percentage', 'Spread_Money', 'Spread_Move', 'Spread_Direction',
"Win", "Money_Line_Percentage", 'Cover'], axis=1)
X = df_model.loc[:, ['Total', 'Total_Move', 'Over_Percentage', 'Over_Money',
'Under_Percentage', 'Under_Money']] # independent columns
y = df_model['Over_Under'] # target column
# get correlations of each features in dataset
corrmat = df_model.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20, 20))
# plot heat map
g = sns.heatmap(
df_model[top_corr_features].corr(), annot=True, cmap='hot')
plt.xticks(rotation=90)
plt.yticks(rotation=45)
plt.show()
Here is the feature_importances bar graph
and the code
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold
from sklearn.inspection import permutation_importance
training_data = pd.read_csv(
"/Users/aus10/NFL/Data/Betting_Data/CBB/Training_Data_Betting_CBB.csv", index_col=False)
df_model = training_data.copy()
df_model = df_model.dropna()
X = df_model.loc[:, ['Total', 'Total_Move', 'Over_Percentage', 'Over_Money',
'Under_Percentage', 'Under_Money']] # independent columns
y = df_model['Over_Under'] # target column
model = RandomForestClassifier(
random_state=1, n_estimators=100, min_samples_split=100, max_depth=5, min_samples_leaf=2)
skf = StratifiedKFold(n_splits=2)
skf.get_n_splits(X, y)
StratifiedKFold(n_splits=2, random_state=None, shuffle=False)
for train_index, test_index in skf.split(X, y):
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
model.fit(X_train, y_train)
# use inbuilt class feature_importances of tree based classifiers
print(model.feature_importances_)
# plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
perm_importance = permutation_importance(model, X_test, y_test)
feat_importances.nlargest(5).plot(kind='barh')
print(perm_importance)
plt.show()
I'm not sure which one is more accurate or if I'm using them in the correct way? Should I being using the heatmap to eliminate collinearity and the feature importances to actually selection my group of features?
You are comparing two different things, why would you expect them to be the same? And what would it even mean in this case?
Feature importances in tree based models are computed based on how many times given feature was used for splitting. Feature that is used more often for a split is more important (for a particular model fitted with particular dataset) than a feature that is used less often.
Correlation on the other hand is a measure of linear relationship between 2 features.
I'm not sure which one is more accurate
What do you mean by accuracy? Both of these are accurate in what they are measuring. It is just that none of these directly tells you which feature/s to throw away.
Note that just because 2 features are correlated, it doesn't mean that you can automatically throw one of them away. Collinearity can cause issues with interpretability of the model. If you have highly correlated features, then you can't say which one is more important based on the weights associated with these features. Collinearity should not affect the prediction power of the model. More often, you will find that by throwing away one of the correlated features, your model's prediction power decreases.
Collinearity in a dataset can therefore make feature importances of your random forrest model less interpretable in a sense that you can't rely on their strict ordering. But again, it should not affect the predictive power of the model (except that the model is more prone to overfitting due to having more degrees of freedom).
Should I being using the heatmap to eliminate collinearity and the feature importances to actually selection my group of features?
Feature engineering/selection is more of an art than science (outside of end-to-end deep learning). There is no correct answer here and you will need to develop your own heuristics and try different things to see which one works better in which scenario.
Example of a simple heuristic based on feature importances and correlation can be (assuming that you have large number of features):
fit the random forrest model and measure the feature importances
throw away those that seem to have no impact on the model (close to 0 importance)
refit the model with the new subset of your original data and see whether the metric of your interest (accuracy, MSE, ...) stays approximately the same as in the step 1.
if you still have a lot of features, you can repeat the step 1-3, increasing the throw-away threshold until your metric of interest starts worsening
measure the correlation of the features that you are left with and select the most correlated pairs (based on some threshold, e.g. (|c| > 0.8))
pick one pair; drop a feature from this pair; measure model performance; return the dropped feature; repeat for each each pair
drop the feature that seems to have the least negative effect on the model's performance based on the results from step 6.
repeat steps 6-7 until the model's performance starts dropping
I was working on a knearest neighbours problem set. I couldn't understand why are they performing K fold cross validation on test set?? Cant we directly test how well our best parameter K performed on the entire test data? rather than doing a cross validation?
iris = sklearn.datasets.load_iris()
X = iris.data
Y = iris.target
X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(
X, Y, test_size=0.33, random_state=42)
k = np.arange(20)+1
parameters = {'n_neighbors': k}
knn = sklearn.neighbors.KNeighborsClassifier()
clf = sklearn.grid_search.GridSearchCV(knn, parameters, cv=10)
clf.fit(X_train, Y_train)
def computeTestScores(test_x, test_y, clf, cv):
kFolds = sklearn.cross_validation.KFold(test_x.shape[0], n_folds=cv)
scores = []
for _, test_index in kFolds:
test_data = test_x[test_index]
test_labels = test_y[test_index]
scores.append(sklearn.metrics.accuracy_score(test_labels, clf.predict(test_data)))
return scores
scores = computeTestScores(test_x = X_test, test_y = Y_test, clf=clf, cv=5)
TL;DR
Did you ever have a science teacher who said, 'any measurement without error bounds is meaningless?'
You might worry that the score on using your fitted, hyperparameter optimized, estimator on your test set is a fluke. By doing a number of tests on a randomly chosen subsample of the test set you get a range of scores; you can report their mean and standard deviation etc. This is, hopefully, a better proxy for how the estimator will perform on new data from the wild.
The following conceptual model may not apply to all estimators but it is a useful to bear in mind. You end up needing 3 subsets of your data. You can skip to the final paragraph if the numbered points are things you are already happy with.
Training your estimator will fit some internal parameters that you need not ever see directly. You optimize these by training on the training set.
Most estimators also have hyperparameters (number of neighbours, alpha for Ridge, ...). Hyperparameters also need to be optimized. You need to fit them to a different subset of your data; call it the validation set.
Finally, when you are happy with the fit of both the estimator's internal parameters and the hyperparmeters, you want to see how well the fitted estimator predicts on new data. You need a final subset (the test set) of your data to figure out how well the training and hyperparameter optimization went.
In lots of cases the partitioning your data into 3 means you don't have enough samples in each subset. One way around this is to randomly split the training set a number of times, fit hyperparameters and aggregate the results. This also helps stop your hyperparameters being over-fit to a particular validation set. K-fold cross-validation is one strategy.
Another use for this splitting a data set at random is to get a range of results for how your final estimator did. By splitting the test set and computing the score you get a range of answers to 'how might we do on new data'. The hope is that this is more representative of what you might see as real-world novel data performance. You can also get a standard deviation for you final score. This appears to be what the Harvard cs109 gist is doing.
If you make a program that adapts to input, then it will be optimal for the input you adapted it to.
This leads to a problem known as overfitting.
In order to see if you have made a good or a bad model, you need to test it on some other data that is not what you used to make the model. This is why you separate your data into 2 parts.