I used LightGBM for feature importance. However, the output is a plot of scores by some metric. My questions are:
What is the metric on the x-axis? Is it an F-score or something else?
How can I get an output of the features showing how much each feature accounts for of the model's variance (similar to PCA)?
How do I extract the importance metric for all features in a DataFrame format?
This is my code:
import lightgbm as lgb
import matplotlib.pyplot as plt
lgb_params = {
'boosting_type': 'gbdt',
'objective': 'binary',
'num_leaves': 30,
'num_round': 360,
'max_depth':8,
'learning_rate': 0.01,
'feature_fraction': 0.5,
'bagging_fraction': 0.8,
'bagging_freq': 12
}
lgb_train = lgb.Dataset(X, y)
model = lgb.train(lgb_params, lgb_train)
plt.figure(figsize=(12,6))
lgb.plot_importance(model, max_num_features=30)
plt.title("Feature importances")
plt.show()
1) The metric on the x-axis, in your case, is the feature importance obtained with the "split" type (the default). As you can see in the LightGBM docs, the importance can be calculated with the "split" or "gain" method. If "split", the result contains the number of times the feature is used in the model. If "gain", the result contains the total gain of the splits that use the feature.
The first measure is split-based; it doesn't take the number of samples into account.
The second measure is gain-based. It's essentially the same as the method in scikit-learn, with Gini impurity replaced by the objective used by the gradient boosting model.
These measures are calculated purely on the training data, so there's a chance that a split creates no improvement on the objective in the test set.
2) The measure most similar to sklearn PCA's explained_variance_ratio_ (not in meaning, but in the way it can be used) is exactly the feature importance of tree-based methods. If you prefer, you can scale these numbers to fractions that sum to 1, as sklearn's random forest does (importances / importances.sum()); note that since you trained with lgb.train, model is a Booster, so the importances come from model.feature_importance() rather than a feature_importances_ attribute. Otherwise, there are other related methods such as permutation importance.
3) To store all the importances in a DataFrame you can do: pd.DataFrame({'name': model.feature_name(), 'importance': model.feature_importance()}) (with the sklearn wrapper LGBMClassifier it would be model.feature_name_ and model.feature_importances_).
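Since the model above was built with lgb.train, here is a minimal sketch for question 3 using only the Booster API (the DataFrame column names are just illustrative):
import pandas as pd
imp_df = pd.DataFrame({
    'feature': model.feature_name(),
    'split': model.feature_importance(importance_type='split'),
    'gain': model.feature_importance(importance_type='gain'),
})
# scale the gain importances so they sum to 1, similar in spirit to
# sklearn's feature_importances_ or PCA's explained_variance_ratio_
imp_df['gain_pct'] = imp_df['gain'] / imp_df['gain'].sum()
print(imp_df.sort_values('gain', ascending=False).head(30))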
Related
I need to plot feature_importances for DecisionTreeClassifier. Features are already found and target results are achieved, but my teacher tells me to plot feature_importances to see weights of contributing factors.
I have no idea how to do it.
model = DecisionTreeClassifier(random_state=12345, max_depth=8,class_weight='balanced')
model.fit(features_train,target_train)
model.feature_importances_
It gives me:
array([0.02927077, 0.3551379 , 0.01647181, ..., 0.03705096, 0. ,
0.01626676])
Why is it not attached to anything (the way max_depth is), and is just an array of some numbers?
Feature importances represent the effect of each factor on the outcome variable. The greater the importance, the more that feature affects the outcome. That's why you received an array: one number per feature, in the same order as the columns of features_train.
For plotting, you can do:
import pandas as pd
import matplotlib.pyplot as plt

feat_importances = pd.DataFrame(model.feature_importances_, index=features_train.columns, columns=["Importance"])
feat_importances.sort_values(by='Importance', ascending=False, inplace=True)
feat_importances.plot(kind='bar', figsize=(8,6))
plt.show()
Feature importance refers to a class of techniques for assigning scores to the input features of a predictive model that indicate the relative importance of each feature when making a prediction.
Feature importance scores can be calculated for problems that involve predicting a numerical value (regression) and for problems that involve predicting a class label (classification).
Load the feature importances into a pandas series indexed by your dataframe column names, then use its plot method.
From Scikit Learn
Feature importances are provided by the fitted attribute feature_importances_ and they are computed as the mean and standard deviation of accumulation of the impurity decrease within each tree.
How are feature_importances in RandomForestClassifier determined?
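As a side note, for a forest (rather than a single tree) you can also look at how much the importances vary across trees; a minimal sketch, assuming a fitted RandomForestClassifier called forest and the column names in feature_names (both hypothetical names):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])  # one row per tree
imp = pd.DataFrame({'importance': forest.feature_importances_,  # mean impurity decrease
                    'std': per_tree.std(axis=0)},               # spread across the trees
                   index=feature_names).sort_values('importance', ascending=False)
imp['importance'].plot(kind='bar', yerr=imp['std'], figsize=(8, 6))
plt.show()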
For your example:
feat_importances = pd.Series(model.feature_importances_, index=features_train.columns)
feat_importances.nlargest(5).plot(kind='barh')
More ways to plot feature importances: Random Forest Feature Importance Chart using Python
I'm running feature selection once using an sns.heatmap of the correlations and once using sklearn feature_importances.
When using the same data I get two different results.
Here is the heatmap
and heatmap code
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
training_data = pd.read_csv(
"/Users/aus10/NFL/Data/Betting_Data/CBB/Training_Data_Betting_CBB.csv")
df_model = training_data.copy()
df_model = df_model.dropna()
df_model = df_model.drop(['Money_Line', 'Money_Line_Percentage', 'Money_Line_Money', 'Money_Line_Move', 'Money_Line_Direction', "Spread", 'Spread_Percentage', 'Spread_Money', 'Spread_Move', 'Spread_Direction',
"Win", "Money_Line_Percentage", 'Cover'], axis=1)
X = df_model.loc[:, ['Total', 'Total_Move', 'Over_Percentage', 'Over_Money',
'Under_Percentage', 'Under_Money']] # independent columns
y = df_model['Over_Under'] # target column
# get correlations of each features in dataset
corrmat = df_model.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20, 20))
# plot heat map
g = sns.heatmap(
df_model[top_corr_features].corr(), annot=True, cmap='hot')
plt.xticks(rotation=90)
plt.yticks(rotation=45)
plt.show()
Here is the feature_importances bar graph
and the code
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold
from sklearn.inspection import permutation_importance
training_data = pd.read_csv(
"/Users/aus10/NFL/Data/Betting_Data/CBB/Training_Data_Betting_CBB.csv", index_col=False)
df_model = training_data.copy()
df_model = df_model.dropna()
X = df_model.loc[:, ['Total', 'Total_Move', 'Over_Percentage', 'Over_Money',
'Under_Percentage', 'Under_Money']] # independent columns
y = df_model['Over_Under'] # target column
model = RandomForestClassifier(
random_state=1, n_estimators=100, min_samples_split=100, max_depth=5, min_samples_leaf=2)
skf = StratifiedKFold(n_splits=2)
skf.get_n_splits(X, y)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    model.fit(X_train, y_train)
# use inbuilt feature_importances_ of tree-based classifiers
print(model.feature_importances_)
# plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
perm_importance = permutation_importance(model, X_test, y_test)
feat_importances.nlargest(5).plot(kind='barh')
print(perm_importance)
plt.show()
I'm not sure which one is more accurate, or whether I'm using them in the correct way. Should I be using the heatmap to eliminate collinearity and the feature importances to actually select my group of features?
You are comparing two different things, why would you expect them to be the same? And what would it even mean in this case?
Feature importances in tree-based models are computed from how the features are used for splitting: a feature that contributes more to the splits (it is chosen more often, or yields larger impurity decreases) is more important (for a particular model fitted on a particular dataset) than a feature that contributes less.
Correlation, on the other hand, is a measure of the linear relationship between two features.
I'm not sure which one is more accurate
What do you mean by accuracy? Both of these are accurate in what they measure; it is just that neither of them directly tells you which feature(s) to throw away.
Note that just because two features are correlated, it doesn't mean that you can automatically throw one of them away. Collinearity can cause issues with the interpretability of the model: if you have highly correlated features, you can't say which one is more important based on the weights associated with these features. Collinearity should not affect the predictive power of the model; more often, you will find that throwing away one of the correlated features decreases your model's predictive power.
Collinearity in a dataset can therefore make the feature importances of your random forest model less interpretable, in the sense that you can't rely on their strict ordering. But again, it should not affect the predictive power of the model (except that the model is more prone to overfitting due to having more degrees of freedom).
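A tiny synthetic illustration (made-up data, not your dataset) of how two collinear features share importance without hurting predictions:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
x1 = rng.normal(size=1000)
x2 = x1.copy()                        # exact duplicate of x1 -> perfect collinearity
noise = rng.normal(size=1000)
y_demo = (x1 + 0.1 * rng.normal(size=1000) > 0).astype(int)
X_demo = np.column_stack([x1, x2, noise])

rf_demo = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_demo, y_demo)
print(rf_demo.feature_importances_)   # x1 and x2 split the importance between them;
                                      # dropping either one barely changes the predictions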
Should I be using the heatmap to eliminate collinearity and the feature importances to actually select my group of features?
Feature engineering/selection is more of an art than a science (outside of end-to-end deep learning). There is no correct answer here; you will need to develop your own heuristics and try different things to see which one works better in which scenario.
An example of a simple heuristic based on feature importances and correlation could be the following (assuming you have a large number of features; a rough sketch of steps 1-3 follows after the list):
1. Fit the random forest model and measure the feature importances.
2. Throw away those that seem to have no impact on the model (close to 0 importance).
3. Refit the model with the new subset of your original data and see whether the metric of your interest (accuracy, MSE, ...) stays approximately the same as in step 1.
4. If you still have a lot of features, you can repeat steps 1-3, increasing the throw-away threshold until your metric of interest starts worsening.
5. Measure the correlation of the features that you are left with and select the most correlated pairs (based on some threshold, e.g. |c| > 0.8).
6. Pick one pair; drop a feature from this pair; measure model performance; return the dropped feature; repeat for each pair.
7. Drop the feature that seems to have the least negative effect on the model's performance based on the results from step 6.
8. Repeat steps 6-7 until the model's performance starts dropping.
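A rough sketch of steps 1-3 (assuming X is a pandas DataFrame of features and y the target; the 0.01 threshold is just a placeholder):
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=200, random_state=0)
baseline = cross_val_score(rf, X, y, cv=5, scoring='accuracy').mean()   # metric with all features

rf.fit(X, y)
keep = X.columns[rf.feature_importances_ > 0.01]                        # throw-away threshold (step 2)
reduced = cross_val_score(rf, X[keep], y, cv=5, scoring='accuracy').mean()

print('baseline=%.3f  reduced=%.3f  kept %d/%d features'
      % (baseline, reduced, len(keep), X.shape[1]))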
I have a binary classification problem where I use the following code to get my weighted average precision, weighted average recall, weighted average F-measure and roc_auc.
df = pd.read_csv(input_path+input_file)
X = df[features]
y = df[["gold_standard"]]
clf = RandomForestClassifier(random_state = 42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(clf, X, y, cv=k_fold, scoring = ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'))
print("accuracy")
print(np.mean(scores['test_accuracy'].tolist()))
print("precision_weighted")
print(np.mean(scores['test_precision_weighted'].tolist()))
print("recall_weighted")
print(np.mean(scores['test_recall_weighted'].tolist()))
print("f1_weighted")
print(np.mean(scores['test_f1_weighted'].tolist()))
print("roc_auc")
print(np.mean(scores['test_roc_auc'].tolist()))
I got the following results for the same dataset with 2 different feature settings.
Feature setting 1 ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'):
0.6920, 0.6888, 0.6920, 0.6752, 0.7120
Feature setting 2 ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'):
0.6806, 0.6754, 0.6806, 0.6643, 0.7233
So we can see that with feature setting 1 we get better results for 'accuracy', 'precision_weighted', 'recall_weighted' and 'f1_weighted' than with feature setting 2.
However, when it comes to 'roc_auc', feature setting 2 is better than feature setting 1. I found this weird because every other metric was better with feature setting 1.
On one hand, I suspect this happens because I am using weighted scores for precision, recall and F-measure, but not for roc_auc. Is it possible to do a weighted roc_auc for binary classification in sklearn?
What is the real reason for these seemingly odd roc_auc results?
It is not weird, because comparing all these other metrics with AUC is like comparing apples to oranges.
Here is a high-level description of the whole process:
Probabilistic classifiers (like RF here) produce probability outputs p in [0, 1].
To get hard class predictions (0/1), we apply a threshold to these probabilities; if not set explicitly (like here), this threshold is implicitly taken to be 0.5, i.e. if p>0.5 then class=1, else class=0.
Metrics like accuracy, precision, recall, and F1-score are calculated over the hard class predictions 0/1, i.e. after the threshold has been applied.
In contrast, AUC measures the performance of a binary classifier averaged over the range of all possible thresholds, and not for a particular threshold.
So, it can certainly happen, and it can indeed lead to confusion among new practitioners.
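A quick illustration with made-up probabilities (nothing to do with your model): the threshold-based metric moves around while the AUC is a single, threshold-free number.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])
p = np.array([0.1, 0.4, 0.6, 0.55, 0.7, 0.9])         # predicted probabilities for class 1

print('AUC:', roc_auc_score(y_true, p))                # no threshold involved
for t in (0.3, 0.5, 0.7):
    print('accuracy @ threshold', t, '=', accuracy_score(y_true, (p > t).astype(int)))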
The second part of my answer in this similar question might be helpful for more details. Quoting:
According to my experience at least, most ML practitioners think that the AUC score measures something different from what it actually does: the common (and unfortunate) use is just like any other the-higher-the-better metric, like accuracy, which may naturally lead to puzzles like the one you express yourself.
I'm trying to create a regression model that predicts an author's age. I'm using (Nguyen et al., 2011) as my basis.
Using a bag-of-words model, I count the occurrences of words per document (the documents are posts from boards) and create the vector for every post.
I limit the size of each vector by using as features the top k (k = some number) most frequently used words (stop words are not used).
Vectorexample_with_k_8 = [0,0,0,1,0,3,0,0]
My data is generally sparse, as in the example.
When I test the model on my test data I get a very low r² score (0.00-0.1), sometimes even a negative score. The model always predicts the same age, which happens to be the average age of my dataset, as can be seen in the
distribution of my data (age/amount):
I used different regression models: Linear Regression, Lasso and
SGDRegressor from scikit-learn, with no improvement.
So the questions are:
1. How do I improve the r² score?
2. Do I have to change the data to fit the regression better? If yes, with what method?
3. Which regressors/methods should I use for text classification?
To my knowledge, bag-of-words models usually use Naive Bayes as the classifier to fit the document-by-term sparse matrix.
None of your regressors handle a large sparse matrix well. Lasso may work well if you have groups of highly correlated features.
I think that for your problem, Latent Semantic Analysis may provide better results. Essentially, use TfidfVectorizer to normalize the word count matrix, then use TruncatedSVD to reduce the dimensionality and retain the first N components, which capture most of the variance. Most regressors should work well with the matrix in the lower-dimensional space. In my experience, SVM works pretty well for this problem.
Here I show an example script:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search is deprecated

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svd', TruncatedSVD()),
    ('clf', svm.SVR())
])

# You can tune hyperparameters using grid search
params = {
    'tfidf__max_df': (0.5, 0.75, 1.0),
    'tfidf__ngram_range': ((1, 1), (1, 2)),
    'svd__n_components': (50, 100, 150, 200),
    'clf__C': (0.1, 1, 10),
}

grid_search = GridSearchCV(pipeline, params, scoring='r2',
                           n_jobs=-1, verbose=10)

# fit your documents (should be a list/array of strings)
grid_search.fit(documents, y)

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(params.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
I'm trying to predict a binary variable with both random forests and logistic regression. I've got heavily unbalanced classes (approx 1.5% of Y=1).
The default feature importance techniques in random forests are based on classification accuracy (error rate) - which has been shown to be a bad measure for unbalanced classes (see here and here).
The two standard VIMs for feature selection with RF are the Gini VIM and the permutation VIM. Roughly speaking, the Gini VIM of a predictor of interest is the sum, over the forest, of the decreases in Gini impurity generated by this predictor whenever it was selected for splitting, scaled by the number of trees.
My question is: is that kind of method implemented in scikit-learn (as it is in the R package party)? Or is there maybe a workaround?
PS: This question is kind of linked with another one.
scoring is just a performance evaluation tool used on the test sample; it does not enter into the internal DecisionTreeClassifier algorithm at each split node. You can only specify the criterion (a kind of internal loss function at each split node) to be either Gini impurity or information entropy for the tree algorithm.
scoring can be used in a cross-validation context where the goal is to tune some hyperparameters (like max_depth). In your case, you can use GridSearchCV to tune some of your hyperparameters with the scoring function roc_auc, for example as sketched below.
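A minimal sketch of such a search (X, y are your training data; the grid values are placeholders, not recommendations):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [3, 5, 8], 'max_features': ['sqrt', 0.5]}
search = GridSearchCV(RandomForestClassifier(n_estimators=200, class_weight='balanced',
                                             random_state=0),
                      param_grid, scoring='roc_auc', cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)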
After doing some research, this is what I came up with:
from collections import defaultdict

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

names = db_train.iloc[:, 1:].columns.tolist()

# -- Gridsearched parameters
model_rf = RandomForestClassifier(n_estimators=500,
                                  class_weight="balanced",  # "auto" is deprecated
                                  criterion='gini',
                                  bootstrap=True,
                                  max_features=10,
                                  min_samples_split=2,      # must be >= 2 in current sklearn
                                  min_samples_leaf=6,
                                  max_depth=3,
                                  n_jobs=-1)

scores = defaultdict(list)

# -- Fit the model (could be cross-validated)
rf = model_rf.fit(X_train, Y_train)
acc = roc_auc_score(Y_test, rf.predict(X_test))

# -- Permutation VIM: shuffle one column at a time and measure the drop in AUC
#    (assumes X_test is a NumPy array; with a DataFrame use X_test.values)
for i in range(X_train.shape[1]):
    X_t = X_test.copy()
    np.random.shuffle(X_t[:, i])
    shuff_acc = roc_auc_score(Y_test, rf.predict(X_t))
    scores[names[i]].append((acc - shuff_acc) / acc)

print("Features sorted by their score:")
print(sorted([(round(np.mean(score), 4), feat) for
              feat, score in scores.items()], reverse=True))
Features sorted by their score:
[(0.0028999999999999998, 'Var1'), (0.0027000000000000001, 'Var2'), (0.0023999999999999998, 'Var3'), (0.0022000000000000001, 'Var4'), (0.0022000000000000001, 'Var5'), (0.0022000000000000001, 'Var6'), (0.002, 'Var7'), (0.002, 'Var8'), ...]
The output is not very pretty, but you get the idea. The weakness of this approach is that the feature importances seem to be very parameter-dependent. I ran it using different params (max_depth, max_features, ...) and got quite different results. So I decided to run a grid search on the parameters (scoring='roc_auc') and then apply this VIM (Variable Importance Measure) to the best model.
I took my inspiration from this (great) notebook.
All suggestions/comments are most welcome!
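As a side note, newer versions of scikit-learn (0.22+) ship a permutation importance out of the box (sklearn.inspection.permutation_importance); a minimal sketch reusing the variables above (rf, names, X_test, Y_test):
from sklearn.inspection import permutation_importance

result = permutation_importance(rf, X_test, Y_test, scoring='roc_auc',
                                n_repeats=10, random_state=0, n_jobs=-1)
for name, mean_imp in sorted(zip(names, result.importances_mean),
                             key=lambda t: t[1], reverse=True):
    print('%s: %.4f' % (name, mean_imp))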