scikit learn - feature importance calculation in decision trees - python

I'm trying to understand how feature importance is calculated for decision trees in scikit-learn. This question has been asked before, but I am unable to reproduce the results the algorithm provides.
For example:
from StringIO import StringIO
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree.export import export_graphviz
from sklearn.feature_selection import mutual_info_classif
X = [[1,0,0], [0,0,0], [0,0,1], [0,1,0]]
y = [1,0,1,1]
clf = DecisionTreeClassifier()
clf.fit(X, y)
feat_importance = clf.tree_.compute_feature_importances(normalize=False)
print("feat importance = " + str(feat_importance))
out = StringIO()
out = export_graphviz(clf, out_file='test/tree.dot')
results in feature importance:
feat importance = [0.25 0.08333333 0.04166667]
and gives the following decision tree:
Now, this answer to a similar question suggests the importance is calculated as
Where G is the node impurity, in this case the gini impurity. This is the impurity reduction as far as I understood it. However, for feature 1 this should be:
This answer suggests the importance is weighted by the probability of reaching the node (which is approximated by the proportion of samples reaching that node). Again, for feature 1 this should be:
Both formulas provide the wrong result. How is the feature importance calculated correctly?

I think feature importance depends on the implementation, so we need to look at the scikit-learn documentation.
The feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance
That reduction, or weighted information gain, is defined as follows:
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity
- N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
Since each feature is used only once in your case, its importance equals the expression above evaluated at the node where it is used.
For X[2] :
feature_importance = (4 / 4) * (0.375 - (0.75 * 0.444)) = 0.042
For X[1] :
feature_importance = (3 / 4) * (0.444 - (2/3 * 0.5)) = 0.083
For X[0] :
feature_importance = (2 / 4) * (0.5) = 0.25
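As a quick cross-check, the same quantity can be read off the fitted tree directly. This is a minimal sketch that evaluates the quoted formula at the root node only, assuming the same fitted clf as in the question (note that without a fixed random_state the tie-breaking between equally good splits can vary, so the root feature may differ between runs):
import numpy as np

t = clf.tree_
N = t.weighted_n_node_samples[0]                      # total number of samples
left, right = t.children_left[0], t.children_right[0]

# N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity) at the root
root_gain = (t.weighted_n_node_samples[0] / N) * (
    t.impurity[0]
    - t.weighted_n_node_samples[right] / t.weighted_n_node_samples[0] * t.impurity[right]
    - t.weighted_n_node_samples[left] / t.weighted_n_node_samples[0] * t.impurity[left]
)
print(t.feature[0], root_gain)   # the feature the root splits on and its (unnormalized) importance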

A single feature can be used in different branches of the tree; its feature importance is then its total contribution to reducing the impurity:
feature_importance += number_of_samples_at_parent_where_feature_is_used * impurity_at_parent - left_child_samples * impurity_left - right_child_samples * impurity_right
impurity is the gini/entropy value
normalized_importance = feature_importance/number_of_samples_root_node(total num of samples)
In the above eg:
feature_2_importance = 0.375*4 - 0.444*3 - 0*1 = 0.168
normalized = 0.168/4 (total_num_of_samples) = 0.042
If feature_2 was used in other branches, calculate its importance at each such parent node and sum up the values.
There is a difference between the feature importance calculated here and the one returned by the library because we are using the truncated impurity values shown in the graph.
Instead, we can access all the required data through the classifier's tree_ attribute, which can be used to probe the features used, threshold values, impurity, number of samples at each node, etc.
E.g. clf.tree_.feature gives the feature used at each node; a negative value indicates a leaf node.
Similarly, clf.tree_.children_left/right gives, for each node, the index of its left/right child node.
Using the above, traverse the tree and use the same node indices in clf.tree_.impurity and clf.tree_.weighted_n_node_samples to get the gini/entropy value and the number of samples at each node and at its children.
import numpy as np

def dt_feature_importance(model, normalize=True):
    left_c = model.tree_.children_left
    right_c = model.tree_.children_right
    impurity = model.tree_.impurity
    node_samples = model.tree_.weighted_n_node_samples

    # Initialize the feature importances; features not used in any split stay zero
    feature_importance = np.zeros((model.tree_.n_features,))

    for idx, node in enumerate(model.tree_.feature):
        if node >= 0:
            # Accumulate the importance over all the nodes where this feature is used
            feature_importance[node] += impurity[idx] * node_samples[idx] \
                - impurity[left_c[idx]] * node_samples[left_c[idx]] \
                - impurity[right_c[idx]] * node_samples[right_c[idx]]

    # Divide by the number of samples at the root node
    feature_importance /= node_samples[0]

    if normalize:
        normalizer = feature_importance.sum()
        if normalizer > 0:
            feature_importance /= normalizer

    return feature_importance
This function will return the exact same values as returned by clf.tree_.compute_feature_importances(normalize=...)
To sort the features based on their importance
features = clf.tree_.feature[clf.tree_.feature >= 0]   # negative values indicate leaf nodes
sorted(zip(features, dt_feature_importance(clf, False)[features]), key=lambda x: x[1], reverse=True)
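As a quick sanity check on the toy data from the question (assuming the same fitted clf as above):
print(dt_feature_importance(clf, normalize=False))
# should match clf.tree_.compute_feature_importances(normalize=False),
# i.e. roughly [0.25, 0.0833, 0.0417] for the tree shown in the question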

Related

Interpreting logistic regression coefficients of scaled features

I'm using a logistic regression to estimate the probability of scoring a goal in soccer/football. I've got 5 features. My target values are 1 (goal) or 0 (no goal).
As is always a must, I've scaled my features before fitting my model. I've used the MinMaxScaler, which scales all features into the range [0, 1] as follows:
X_scaled = (x - x_min)/(x_max - x_min)
The coefficients of my logistic regression model are the following:
coef = [[-2.26286643 4.05722387 0.74869811 0.20538172 -0.49969841]]
My first thought is that the second feature is the most important, followed by the first. Is this always true?
I read that "In other words, for a one-unit increase in the second feature, the expected change in log odds is 4.05722387." on this site, but there the features were normalized with a mean of 50 and some standard deviation.
If I do not scale my features, the coefficients of the model are the following:
coef = [[-0.04743728 0.04394143 -0.00247654 0.23769469 -0.55051824]]
And now it seems that the first feature is more important than the second one. I read in the literature on my topic that this is indeed true, so this confuses me, of course.
My questions are:
Which of my features is the most important and what/why is the best methodology to find it?
How can I interpret the meaning of the scaled coefficients? E.g. what does an increase of 1 meter in feature 1 mean? Can I throw 1 meter into the MinMaxScaler, see what comes out, and use that as 'the one unit increase'?
Is it true that the final probability will be computed as y = 1/(1 + exp(-fx)) with fx = intercept + feature1*coef1 + feature2*coef2 + ... (with all features scaled)?
Which of my features is the most important and what/why is the best methodology to find it?
Look at the several versions of marginal effects calculations. For example, see the overview/discussion in the blog post Stata's example resources for R.
How can I interpret the meaning of the scaled coefficients? E.g. what does an increase of 1 meter in feature 1 mean? Can I throw 1 meter into the MinMaxScaler, see what comes out, and use that as 'the one unit increase'?
The interpretation depends on which marginal effects you calculate. You just need to account for scaling when you talk about one unit of X increasing/decreasing the change in probability or odds ratio etc.
Is it true that the final probability will be computed as y = 1/(1 + exp(-fx)) with fx = intercept + feature1*coef1 + feature2*coef2 + ... (with all features scaled)?
Yes, it's just that features x are in scaled measures.
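To make the unit conversion concrete, here is a minimal sketch with hypothetical x_min/x_max values (not the asker's data): a physical increase of 1 meter in feature 1 corresponds to an increase of 1 / (x_max - x_min) in the scaled feature, so the change in log-odds is the scaled coefficient times that amount.
import numpy as np

# Hypothetical range for feature 1; with a fitted MinMaxScaler you would use
# scaler.data_min_[0] and scaler.data_max_[0] instead.
x_min, x_max = 0.0, 40.0            # e.g. distance to goal in meters
coef1_scaled = -2.26286643          # coefficient reported for the scaled feature

delta_scaled = 1.0 / (x_max - x_min)          # +1 meter expressed in scaled units
delta_log_odds = coef1_scaled * delta_scaled  # change in log-odds per meter
odds_ratio = np.exp(delta_log_odds)           # multiplicative change in the odds per meter
print(delta_log_odds, odds_ratio)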

What Vectorizer should I use when I'm doing clustering of text data?

I'm doing clustering of text data with KMeans in Python's scikit-learn.
I have a problem with vectorizing the data because I get very different results when I use different vectorizers.
I want to cluster text data (the data are Instagram comments about US politics) and find the key words for every cluster, but I do not know which vectorizer I should use.
For example, when I'm using:
cv = CountVectorizer(analyzer = 'word', max_features = 8000, preprocessor=None, lowercase=True, tokenizer=None, stop_words = 'english')
x = cv.fit_transform(x)
#should I scale x value?
#x = scale(x, with_mean=False)
#If I do this I get the graph just one dot and silhouette_score less than 0.01
I get that my optimal number of clusters is 2, based on the silhouette_score, which gives me a score of 0.87. And my graph looks like this:
And when I'm using:
cv = TfidfVectorizer(analyzer = 'word',max_features = 8000, preprocessor=None, lowercase=True, tokenizer=None, stop_words = 'english')
x = cv.fit_transform(x)
I get that my optimal number of clusters is 13, based on the silhouette_score, which gives me a score of 0.0159. And my graph looks like this:
This is how I'm doing the clustering:
my_list = []
list_of_clusters = []
for i in range(2, 15):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    my_list.append(kmeans.inertia_)
    cluster_labels = kmeans.fit_predict(x)
    silhouette_avg = silhouette_score(x, cluster_labels)
    print(round(silhouette_avg, 2))
    list_of_clusters.append(round(silhouette_avg, 1))
plt.plot(range(2, 15), my_list)
plt.show()
number_of_clusters = max(list_of_clusters)
number_of_clusters = list_of_clusters.index(number_of_clusters) + 2
print('Number of clusters: ', number_of_clusters)
kmeans = KMeans(n_clusters=number_of_clusters, init='k-means++', random_state=42)
kmeans.fit(x)
And this is how I plot the data:
# reduce the features to 2D
pca = PCA(n_components=2, random_state=0)
reduced_features = pca.fit_transform(x.toarray())
# reduce the cluster centers to 2D
reduced_cluster_centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(reduced_features[:,0], reduced_features[:,1], c=kmeans.predict(x), s=3)
plt.scatter(reduced_cluster_centers[:, 0], reduced_cluster_centers[:,1], marker='x', s=50, c='r')
plt.show()
I think that this is a very big difference, so I'm sure that I'm doing something wrong, but I do not know what.
Thanks for your help :)
The Silhouette score, from sklearn:
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that the Silhouette Coefficient is only defined if the number of labels is 2 <= n_labels <= n_samples - 1.
This function returns the mean Silhouette Coefficient over all samples. To obtain the values for each sample, use silhouette_samples.
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
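To make the scale concrete, here is a minimal sketch on synthetic blobs (not the asker's data): a well-separated clustering scores high, an overlapping one scores substantially lower.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Well-separated clusters -> high silhouette
X_sep, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
labels_sep = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_sep)
print(silhouette_score(X_sep, labels_sep))

# Heavily overlapping clusters -> substantially lower silhouette
X_ovl, _ = make_blobs(n_samples=300, centers=3, cluster_std=8.0, random_state=0)
labels_ovl = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_ovl)
print(silhouette_score(X_ovl, labels_ovl))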
Your best value is near 0, which means you don't have a good clustering.
A good clustering is a clustering in which:
the intra-class similarity is high (the documents in your cluster are similar).
the inter-class similarity is low (the documents from two clusters are not similar).
And of course, the clusters should tell you something interesting. For example, one cluster could contain all the words linked to a specific domain. But this is work that you do after the clusters are made.
Changing the vectorizer means that you are changing the features on which you are doing the clustering.
From the Python sklearn docs, CountVectorizer:
CountVectorizer implements both tokenization and occurrence counting in a single class
Basically you have the token counts as features.
Instead for TfidfVectorizer:
Convert a collection of raw documents to a matrix of TF-IDF features.
Which means that you are using the Term Frequency - Inverse Document Frequency formula as features for your documents, which is calculated as:
TF(t) = (number of times term t appears in a document) / (total number of terms in the document)
IDF(t) = log_e(total number of documents / number of documents with term t in it)
TF-IDF(t) = TF(t) * IDF(t)
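To see the difference concretely, here is a minimal sketch on a toy corpus (not the asker's Instagram data) showing that the two vectorizers assign different values to the same tokens:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the president visited the state",
    "the election results surprised the state",
    "great food and great music",
]

cv = CountVectorizer(stop_words='english')
tfidf = TfidfVectorizer(stop_words='english')

counts = cv.fit_transform(docs)
weights = tfidf.fit_transform(docs)

print(cv.get_feature_names_out())   # use get_feature_names() on older scikit-learn versions
print(counts.toarray())             # raw token counts
print(weights.toarray().round(2))   # TF-IDF weights: terms in many documents get a lower IDF
Also note that TfidfVectorizer L2-normalizes each document row by default while raw counts are not normalized, so Euclidean-distance KMeans sees quite different geometries in the two cases, which is a large part of why the clusterings and silhouette scores differ so much.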

Python Clustering 'purity' metric

I'm using a Gaussian Mixture Model (GMM) from sklearn.mixture to perform clustering of my data set.
I could use the function score() to compute the log probability under the model.
However, I am looking for a metric called 'purity' which is defined in this article.
How can I implement it in Python? My current implementation looks like this:
import numpy as np
from sklearn.mixture import GMM
# X is a 1000 x 2 array (1000 samples of 2 coordinates).
# It is actually a 2 dimensional PCA projection of data
# extracted from the MNIST dataset, but this random array
# is equivalent as far as the code is concerned.
X = np.random.rand(1000, 2)
clusterer = GMM(3, 'diag')
clusterer.fit(X)
cluster_labels = clusterer.predict(X)
# Now I can count the labels for each cluster..
count0 = list(cluster_labels).count(0)
count1 = list(cluster_labels).count(1)
count2 = list(cluster_labels).count(2)
But I cannot loop through each cluster in order to compute the confusion matrix (according to this question).
David's answer works but here is another way to do it.
import numpy as np
from sklearn import metrics

def purity_score(y_true, y_pred):
    # compute the contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # purity: for each cluster, take the count of its most frequent true label, then normalize
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)
Also if you need to compute Inverse Purity, all you need to do is replace "axis=0" by "axis=1".
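For example, with a small hand-made labelling (hypothetical values, just to show the call on the function above):
import numpy as np

y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2])   # cluster ids from any clusterer

print(purity_score(y_true, y_pred))   # 8 majority-label points out of 9 -> 0.888...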
sklearn doesn't implement a cluster purity metric. You have 2 options:
Implement the measurement using sklearn data structures yourself. This and this have some python source for measuring purity, but either your data or the function bodies need to be adapted for compatibility with each other.
Use the (much less mature) PML library, which does implement cluster purity.
A very late contribution.
You can try to implement it like this, pretty much like in this gist
import numpy as np
from sklearn.metrics import accuracy_score

def purity_score(y_true, y_pred):
    """Purity score
    Args:
        y_true(np.ndarray): n*1 matrix of ground truth labels
        y_pred(np.ndarray): n*1 matrix of predicted clusters
    Returns:
        float: Purity score
    """
    # matrix which will hold the majority-voted labels
    y_voted_labels = np.zeros(y_true.shape)
    # Ordering labels
    ## Labels might be missing, e.g. with a set like {0, 2} where 1 is missing
    ## First find the unique labels, then map them to an ordered set, so 0, 2 becomes 0, 1
    labels = np.unique(y_true)
    ordered_labels = np.arange(labels.shape[0])
    for k in range(labels.shape[0]):
        y_true[y_true == labels[k]] = ordered_labels[k]
    # Update unique labels
    labels = np.unique(y_true)
    # Use n_classes + 1 bin edges so that each label value gets its own bin,
    # counting occurrences in [bin_i, bin_i+1), the upper edge being excluded
    bins = np.concatenate((labels, [np.max(labels) + 1]), axis=0)

    for cluster in np.unique(y_pred):
        hist, _ = np.histogram(y_true[y_pred == cluster], bins=bins)
        # Find the most frequent label in the cluster
        winner = np.argmax(hist)
        y_voted_labels[y_pred == cluster] = winner

    return accuracy_score(y_true, y_voted_labels)
The currently top voted answer correctly implements the purity metric, but may not be the most appropriate metric in all cases, because it does not ensure that each predicted cluster label is assigned only once to a true label.
For example, consider a dataset that is very imbalanced, with 99 examples of one label and 1 example of another label. Then any clustering (e.g: having two equal clusters of size 50) will achieve purity of at least 0.99, rendering it a useless metric.
Instead, in cases where the number of clusters is the same as the number of labels, cluster accuracy may be more appropriate. This has the advantage of mirroring classification accuracy in an unsupervised setting. To compute cluster accuracy, we need to use the Hungarian algorithm to find the optimal matching between cluster labels and true labels. The SciPy function linear_sum_assignment does this:
import numpy as np
from sklearn import metrics
from scipy.optimize import linear_sum_assignment

def cluster_accuracy(y_true, y_pred):
    # compute the contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # Find the optimal one-to-one mapping between cluster labels and true labels
    row_ind, col_ind = linear_sum_assignment(-contingency_matrix)
    # Return cluster accuracy
    return contingency_matrix[row_ind, col_ind].sum() / np.sum(contingency_matrix)
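To illustrate the difference on the imbalanced example described above (using either purity_score implementation from the earlier answers together with cluster_accuracy):
import numpy as np

# 99 points of one class, 1 of the other, split into two arbitrary clusters of 50
y_true = np.array([0] * 99 + [1])
y_pred = np.array([0] * 50 + [1] * 50)

print(purity_score(y_true, y_pred))       # 0.99 -- looks excellent despite the arbitrary split
print(cluster_accuracy(y_true, y_pred))   # 0.51 -- reflects that the clustering is uninformative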

Standard errors for multivariate regression coefficients

I've done a multivariate regression using sklearn.linear_model.LinearRegression and obtained the regression coefficients doing this:
import numpy as np
from sklearn import linear_model
clf = linear_model.LinearRegression()
TST = np.vstack([x1,x2,x3,x4])
TST = TST.transpose()
clf.fit (TST,y)
clf.coef_
Now, I need the standard errors for these same coefficients. How can I do that?
Thanks a lot.
Based on this stats question and wikipedia, my best guess is:
MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)
However, my linear algebra and stats are both quite poor, so I could be missing something important. Another option might be to bootstrap the variance estimate.
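Since bootstrapping is mentioned as an alternative, here is a minimal sketch of a non-parametric bootstrap of the coefficient standard errors, assuming TST and y are array-like as in the question (resample rows with replacement, refit, and take the spread of the refitted coefficients):
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
n_boot = 1000
n_samples = TST.shape[0]
boot_coefs = np.empty((n_boot, TST.shape[1]))

for b in range(n_boot):
    idx = rng.integers(0, n_samples, size=n_samples)   # resample rows with replacement
    model = linear_model.LinearRegression().fit(TST[idx], np.asarray(y)[idx])
    boot_coefs[b] = np.ravel(model.coef_)

SE_boot = boot_coefs.std(axis=0, ddof=1)   # bootstrap standard errors of the coefficients
print(SE_boot)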
MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)
I guess that this answer is not entirely correct. In particular, if I am not wrong, according to your code sklearn is adding the constant term by default in order to compute your coefficients.
Then, you need to include a column of ones in your matrix TST. After that, the code is correct and it will give you an array with all the standard errors.
This code has been tested with data; it is correct.
Find the X matrix for each data set; n is the length of the dataset and m is the number of variables.
X, n, m = arrays(data)
y = ***.reshape((n, 1))
linear = linear_model.LinearRegression()
linear.fit(X, y, n_jobs=-1)   # delete n_jobs=-1 if it's one variable only
Sum of squares:
s = np.sum((linear.predict(X) - y) ** 2) / (n - (m - 1) - 1)
Standard deviation, the square root of the diagonal of the variance-covariance matrix (the pseudo-inverse uses singular value decomposition):
sd_alpha = np.sqrt(s * (np.diag(np.linalg.pinv(np.dot(X.T, X)))))
t-statistic (use linear.intercept_ for one variable):
t_stat_alpha = linear.intercept_[0] / sd_alpha[0]
I found that the accepted answer had some mathematical glitches that in total would require edits beyond the recommended etiquette for modifying posts. So here is a solution to compute the standard error estimate for the coefficients obtained through the linear model (using an unbiased estimate as suggested here):
# preparation
X = np.concatenate((np.ones((TST.shape[0], 1)), TST), axis=1)
y_hat = clf.predict(TST).T
m, n = X.shape
# computation
MSE = np.sum((y_hat - y)**2) / (m - n)
coef_var_est = MSE * np.diag(np.linalg.pinv(np.dot(X.T, X)))
coef_SE_est = np.sqrt(coef_var_est)
Note that we have to add a column of ones to TST as the original post used the linear_model.LinearRegression in a way that will fit the intercept term. Furthermore, we need to compute the mean squared error (MSE) as in ANOVA. That is, we need to divide the sum of squared errors (SSE) by the degrees of freedom for the error, i.e., df_error = df_observations - df_features.
The resulting array coef_SE_est contains the standard error estimates of the intercept and all other coefficients in coef_SE_est[0] and coef_SE_est[1:], respectively. To print them out you could use
print('intercept: coef={:.4f} / std_err={:.4f}'.format(clf.intercept_[0], coef_SE_est[0]))
for i, coef in enumerate(clf.coef_[0, :]):
    print('x{}: coef={:.4f} / std_err={:.4f}'.format(i + 1, coef, coef_SE_est[i + 1]))
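If you also want t-statistics (as in one of the other answers) or p-values, they follow directly from these standard errors; a short sketch, mirroring the indexing used in the print loop above and reusing m and n from the computation:
import numpy as np
from scipy import stats

# Parameter estimates in the same order as coef_SE_est: intercept first, then the coefficients
params = np.concatenate(([clf.intercept_[0]], clf.coef_[0, :]))
t_stats = params / coef_SE_est

# Two-sided p-values with the same error degrees of freedom used for the MSE
p_values = 2 * stats.t.sf(np.abs(t_stats), m - n)
print(t_stats, p_values)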
The example from the documentation shows how to get the mean square error and explained variance score:
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
% np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))
Does this cover what you need?

scikit feature importance selection experiences

Scikit-learn has a mechanism to rank features (classification) using extreme randomized trees.
forest = ExtraTreesClassifier(n_estimators=250,
                              compute_importances=True,
                              random_state=0)
I have a question about whether this method is doing a "univariate" or "multivariate" feature ranking. The univariate case is where individual features are compared to each other. I would appreciate some clarification here. Are there any other parameters that I should try to fiddle with? Any experiences and pitfalls with this ranking method are also appreciated.
The output of this ranking identifies feature numbers (5, 20, 7, ...). I would like to check whether the feature number really corresponds to the row in the feature matrix, that is, whether feature number 5 corresponds to the sixth row in the feature matrix (counting starts at 0).
I'm not an expert, but this is not univariate. In fact, the total feature importance is computed from the feature importance of each tree (taking the mean value, I think).
For each tree, the importances are computed from the impurity of the split.
I used this method and it seems to give good results; better, from my point of view, than the univariate method. But I don't know of any technique to test the results other than knowledge of the dataset.
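Regarding the indexing question: forest.feature_importances_ is aligned with the columns of the training matrix X in the order they were passed to fit, 0-indexed, so importance index 5 refers to the sixth column. A minimal sketch on synthetic data (hypothetical shapes) to confirm this:
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = (X[:, 5] > 0.5).astype(int)   # the target depends only on column 5

forest = ExtraTreesClassifier(n_estimators=250, random_state=0).fit(X, y)
print(np.argmax(forest.feature_importances_))   # prints 5, i.e. the sixth column of X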
To order the features correctly, you should follow this example and modify it a bit, like so, to use a pandas.DataFrame with its proper column names:
import numpy as np
import pandas
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

X = pandas.DataFrame(...)
y = pandas.Series(...)

# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
forest.fit(X, y)

feature_importance = forest.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)[::-1]

print("Feature importance:")
for i, (f, w) in enumerate(zip(X.columns[sorted_idx], feature_importance[sorted_idx]), start=1):
    print("%d) %s : %d" % (i, f, w))

pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 2)
nb_to_display = 30
plt.barh(pos[:nb_to_display], feature_importance[sorted_idx][:nb_to_display], align='center')
plt.yticks(pos[:nb_to_display], X.columns[sorted_idx][:nb_to_display])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()
