scikit-learn feature importance selection experiences - Python

Scikit-learn has a mechanism to rank features (classification) using extremely randomized trees.
forest = ExtraTreesClassifier(n_estimators=250,
                              compute_importances=True,  # removed in later scikit-learn versions; importances are always available now
                              random_state=0)
I have a question about whether this method does a "univariate" or "multivariate" feature ranking. The univariate case is where individual features are compared to each other. I would appreciate some clarification here. Are there any other parameters I should try to tweak? Any experiences and pitfalls with this ranking method are also appreciated.
The output of this ranking identifies feature numbers (5, 20, 7, ...). I would like to check whether a feature number really corresponds to the row in the feature matrix; that is, that feature number 5 corresponds to the sixth row in the feature matrix (counting starts at 0).

I'm not an expert, but this is not univariate. In fact, the total feature importance is computed from the feature importances of each tree (by taking the mean value, I think).
For each tree, the importances are computed from the impurity of the splits.
I used this method and it seems to give good results, better from my point of view than univariate methods. But I don't know of any technique to validate the results other than knowledge of the dataset.
To order the features correctly, you can follow this example and modify it a bit, like so, to use a pandas.DataFrame and its proper column names:
import numpy as np
import pandas
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

X = pandas.DataFrame(...)  # your feature matrix, with named columns
y = pandas.Series(...)     # your target labels

# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)
forest.fit(X, y)

feature_importance = forest.feature_importances_
# Scale the importances relative to the largest one
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)[::-1]

print("Feature importance:")
for i, (f, w) in enumerate(zip(X.columns[sorted_idx],
                               feature_importance[sorted_idx]), start=1):
    print("%d) %s : %.2f" % (i, f, w))

# Plot the top features as a horizontal bar chart
pos = np.arange(sorted_idx.shape[0]) + .5
nb_to_display = 30
plt.barh(pos[:nb_to_display], feature_importance[sorted_idx][:nb_to_display], align='center')
plt.yticks(pos[:nb_to_display], X.columns[sorted_idx][:nb_to_display])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()

Related

How to calculate which independent variable influences the dependent variable the most?

I have a dataframe with 5 independent variables and 1 dependent variable. All my variables are continuous, including the dependent variable. Is there a way I can calculate which of my independent variables influences my dependent variable the most in Python? Is there an algorithm I could run to do this for me?
I tried the information gain method, but that is a classification method, so I had to use a LabelEncoder to transform my dependent variable. I used the following code after splitting my dataset into a train and a test set:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Encoding the dependent variable
lab_enc = preprocessing.LabelEncoder()
training_scores_encoded = lab_enc.fit_transform(y_train)

# SelectFromModel will select those features whose importance is greater than
# the mean importance of all the features by default, but we can alter this
# threshold if we want.
# First, I specify the random forest instance, indicating the number of trees.
# Then I use the SelectFromModel object from sklearn to automatically select the features.
sel = SelectFromModel(RandomForestClassifier(n_estimators=100))
sel.fit(X_train, training_scores_encoded)

# We can now make a list and count the selected features.
selected_feat = X_train.columns[sel.get_support()]
len(selected_feat)

# Viewing the importances
importances = sel.estimator_.feature_importances_
indices = np.argsort(importances)[::-1]
# X_train is the train data used to fit the model
plt.figure()
plt.title("Feature importances")
plt.bar(range(X_train.shape[1]), importances[indices],
        color="r", align="center")
plt.xticks(range(X_train.shape[1]), indices)
plt.xlim([-1, X_train.shape[1]])
Although I got a result, I'm not sure about this because I had to encode my (continuous) dependent variable. Is this the correct way to go? If not, what can I do?
Thank you in advance for the assistance.
You can use the SelectKBest class from the scikit-learn module.
Check the original documentation here.
This technique is called feature selection.
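Since your dependent variable is continuous, a regression scoring function such as f_regression avoids the label-encoding workaround entirely. A minimal sketch with synthetic data (the frame, column names, and target are made up for illustration):

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical data: 5 continuous predictors, 1 continuous target
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(100, 5), columns=[f"x{i}" for i in range(5)])
y = 3 * X["x0"] + 0.5 * X["x3"] + 0.1 * rng.rand(100)

# f_regression scores each feature against the continuous target,
# so no label encoding of y is needed
selector = SelectKBest(score_func=f_regression, k=3)
selector.fit(X, y)

# Higher score = stronger linear association with the target
for name, score in sorted(zip(X.columns, selector.scores_),
                          key=lambda t: t[1], reverse=True):
    print(f"{name}: {score:.1f}")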
You can also pick the features with the highest absolute correlation to the response:
print([(feature, abs(df[response].corr(df[feature]))) for feature in features])
This uses values from Tamarie's comment:
for feature in feature_cols:
    print(f'feature: {feature} correlation: {abs(target_v.corr(df[feature]))}')

What Vectorizer should I use when I'm doing clustering of text data?

I'm doing clustering of text data with KMeans in Python's scikit-learn.
I have a problem with vectorizing the data, because I get very different results when I use different vectorizers.
I want to cluster text data (the data are Instagram comments about US politics) and find the key words for every cluster, but I do not know which vectorizer I should use.
For example, when I use:
cv = CountVectorizer(analyzer='word', max_features=8000, preprocessor=None,
                     lowercase=True, tokenizer=None, stop_words='english')
x = cv.fit_transform(x)
# Should I scale the x values?
# x = scale(x, with_mean=False)
# If I do this, the graph shows just one dot and the silhouette_score is below 0.01
I get that my optimal number of clusters is 2, based on a silhouette_score of 0.87 (plot not shown here).
And when I use:
cv = TfidfVectorizer(analyzer='word', max_features=8000, preprocessor=None,
                     lowercase=True, tokenizer=None, stop_words='english')
x = cv.fit_transform(x)
I get that my optimal number of clusters is 13, based on a silhouette_score of 0.0159 (plot not shown here).
This is how I'm doing the clustering:
my_list = []
list_of_clusters = []
for i in range(2, 15):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    cluster_labels = kmeans.fit_predict(x)  # fit once and keep the labels
    my_list.append(kmeans.inertia_)
    silhouette_avg = silhouette_score(x, cluster_labels)
    print(round(silhouette_avg, 2))
    list_of_clusters.append(silhouette_avg)  # keep full precision for the argmax
plt.plot(range(2, 15), my_list)
plt.show()
# Pick the cluster count with the highest silhouette score
best_score = max(list_of_clusters)
number_of_clusters = list_of_clusters.index(best_score) + 2
print('Number of clusters: ', number_of_clusters)
kmeans = KMeans(n_clusters=number_of_clusters, init='k-means++', random_state=42)
kmeans.fit(x)
And this is how I plot the data:
# reduce the features to 2D
pca = PCA(n_components=2, random_state=0)
reduced_features = pca.fit_transform(x.toarray())
# reduce the cluster centers to 2D
reduced_cluster_centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(reduced_features[:,0], reduced_features[:,1], c=kmeans.predict(x), s=3)
plt.scatter(reduced_cluster_centers[:, 0], reduced_cluster_centers[:,1], marker='x', s=50, c='r')
plt.show()
I think this is a very big difference, so I'm sure that I'm doing something wrong, but I do not know what.
Thanks for your help :)
The silhouette score, from the sklearn documentation:
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that the Silhouette Coefficient is only defined if 2 <= n_labels <= n_samples - 1.
This function returns the mean Silhouette Coefficient over all samples. To obtain the values for each sample, use silhouette_samples.
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
Your best value is near 0, which means you don't have a good clustering.
A good clustering is one in which:
the intra-class similarity is high (the documents in a cluster are similar);
the inter-class similarity is low (the documents from two different clusters are not similar).
And of course, the clusters should tell you something interesting. For example, one cluster could contain all the words linked to a specific domain. But that is work you do after the clusters are built.
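If you want to see where a given clustering stands on that scale, you can also inspect per-sample values with silhouette_samples. A minimal self-contained sketch (synthetic blobs, not your Instagram data):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Two well-separated blobs -> silhouette close to 1
rng = np.random.RandomState(42)
x = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 10])

labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(x)
print(silhouette_score(x, labels))        # mean over all samples
print(silhouette_samples(x, labels)[:5])  # per-sample coefficients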
Changing the vectorizer means changing the features on which you do the clustering.
From the sklearn documentation for CountVectorizer:
CountVectorizer implements both tokenization and occurrence counting in a single class
Basically, you have the token counts as features.
Instead, for TfidfVectorizer:
Convert a collection of raw documents to a matrix of TF-IDF features.
This means that you are using the Term Frequency-Inverse Document Frequency formula as the features for your documents, which is calculated as:

TF(t) = (number of times term t appears in a document) / (total number of terms in the document)
IDF(t) = log_e(total number of documents / number of documents containing term t)
TF-IDF(t) = TF(t) * IDF(t)
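To make the difference concrete, here is a small sketch (toy documents, not your data) showing how the same corpus yields different feature values under the two vectorizers; get_feature_names_out assumes scikit-learn >= 1.0:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the president spoke today",
        "the senate voted today",
        "voters like the president"]

for Vec in (CountVectorizer, TfidfVectorizer):
    vec = Vec(stop_words='english')
    m = vec.fit_transform(docs)
    print(Vec.__name__)
    print(vec.get_feature_names_out())
    # CountVectorizer gives raw counts; TfidfVectorizer down-weights
    # terms that appear in many documents
    print(m.toarray().round(2))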

Apply KNN from small supervised dataset to large unsupervised dataset in Python

I have trained and tested a KNN model on a small supervised dataset of about 200 samples in Python. I would like to apply these results to a much larger unsupervised dataset of several thousand samples.
My question is: is there a way to fit the KNN model using the small supervised dataset, and then change the K-value for the large unsupervised dataset? I do not want to overfit the model by using the low K-value from the smaller dataset but am unsure how to fit the model and then change the K-value in Python.
Is this possible using KNN? Is there some other way to apply KNN to a much larger unsupervised dataset?
I would recommend actually fitting a KNN model on your larger dataset several times, each time using a different value for k. For each of those models you can then calculate the silhouette score.
Compare the various silhouette scores, and for your final value of k (the number of clusters) choose the value you used for your highest-scoring model.
As an example, here's some code I used to do this for myself last year:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import mixture
from sklearn.metrics import silhouette_score

## A list of the different numbers of clusters (the 'n_components' parameter)
## with which we will run GMM.
number_of_clusters = list(range(2, 21))

## Graph plotting method
def makePlot(number_of_clusters, silhouette_scores):
    # Plot each value of 'number of clusters' vs. the silhouette score at that value
    fig, ax = plt.subplots(figsize=(16, 6))
    ax.set_xlabel('GMM - number of clusters')
    ax.set_ylabel('Silhouette Score (higher is better)')
    ax.plot(number_of_clusters, silhouette_scores)
    # Ticks and grid
    xticks = np.arange(min(number_of_clusters), max(number_of_clusters) + 1, 1.0)
    ax.set_xticks(xticks, minor=False)
    ax.set_xticks(xticks, minor=True)
    ax.xaxis.grid(True, which='both')
    yticks = np.arange(round(min(silhouette_scores), 2), max(silhouette_scores), .02)
    ax.set_yticks(yticks, minor=False)
    ax.set_yticks(yticks, minor=True)
    ax.yaxis.grid(True, which='both')

## Graph the mean silhouette score of each cluster amount.
## Print out the number of clusters that results in the highest
## silhouette score for GMM.
def findBestClusterer(number_of_clusters):
    silhouette_scores = []
    for i in number_of_clusters:
        # The older mixture.GMM API has been removed; GaussianMixture replaces it.
        clusterer = mixture.GaussianMixture(n_components=i)  # Use the model of your choice here
        clusterer.fit(<your data set>)  # enter your data set's variable name here
        preds = clusterer.predict(<your data set>)
        score = silhouette_score(<your data set>, preds)
        silhouette_scores.append(score)

    ## Print a table of all the silhouette scores; pad the cluster count to
    ## two characters so one- and two-digit values stay aligned.
    print("")
    print("| Number of clusters | Silhouette score |")
    print("| ------------------ | ---------------- |")
    for i in range(len(number_of_clusters)):
        print("| {number:>2} | {score:.4f} |".format(number=number_of_clusters[i],
                                                     score=silhouette_scores[i]))

    ## Graph the plot of silhouette scores for each amount of clusters
    makePlot(number_of_clusters, silhouette_scores)

    ## Find and print out the cluster amount that gives the highest
    ## silhouette score.
    best_silhouette_score = max(silhouette_scores)
    index_of_best_score = silhouette_scores.index(best_silhouette_score)
    ideal_number_of_clusters = number_of_clusters[index_of_best_score]
    print("")
    print("Having {} clusters gives the highest silhouette score of {}.".format(
        ideal_number_of_clusters, round(best_silhouette_score, 4)))

findBestClusterer(number_of_clusters)
Please note that in my example, I used a GMM model instead of KNN, but you should be able to slightly modify the findBestClusterer() method to use whatever clustering algorithm you wish. In this method you will also specify your dataset.
In machine learning there are two broad types of learners: eager learners (decision trees, neural nets, SVMs, ...) and lazy learners such as KNN. In fact, KNN doesn't do any learning at all. It just stores the "labeled" data you have and then uses it to perform inference: it computes how similar the new (unlabeled) sample is to all of the samples in the data it has stored (the labeled data). Then, based on majority voting among the K nearest instances of the new sample (the K nearest neighbours, hence the name), it infers the sample's class/value.
Now, to get to your question: "training" KNN has nothing to do with K itself, so when performing inference feel free to use whatever K gives the best result for you.
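As a minimal sketch of that point (the data here is synthetic, purely for illustration): fitting a KNeighborsClassifier just stores the labeled samples, so you can fit once on the small labeled set and re-run predictions on the large set with whatever K you like.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X_small, y_small = rng.randn(200, 5), rng.randint(0, 2, 200)  # small labeled set
X_large = rng.randn(5000, 5)                                  # large unlabeled set

knn = KNeighborsClassifier()
knn.fit(X_small, y_small)  # "fitting" only stores the 200 labeled samples

# Changing K does not require re-learning anything; just update the parameter
for k in (3, 5, 11):
    knn.set_params(n_neighbors=k)
    labels = knn.predict(X_large)
    print(k, np.bincount(labels))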

scikit learn - feature importance calculation in decision trees

I'm trying to understand how feature importance is calculated for decision trees in scikit-learn. This question has been asked before, but I am unable to reproduce the results the algorithm provides.
For example:
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X = [[1,0,0], [0,0,0], [0,0,1], [0,1,0]]
y = [1,0,1,1]
clf = DecisionTreeClassifier()
clf.fit(X, y)
feat_importance = clf.tree_.compute_feature_importances(normalize=False)
print("feat importance = " + str(feat_importance))
export_graphviz(clf, out_file='test/tree.dot')
results in feature importance:
feat importance = [0.25 0.08333333 0.04166667]
and gives the following decision tree (tree image not shown here).
Now, this answer to a similar question suggests the importance is calculated as follows (equation image not shown here), where G is the node impurity, in this case the Gini impurity. This is the impurity reduction, as far as I understand it. However, for feature 1 this should be: (equation image not shown here)
This answer suggests the importance is weighted by the probability of reaching the node (which is approximated by the proportion of samples reaching that node). Again, for feature 1 this gives: (equation image not shown here)
Both formulas provide the wrong result. How is the feature importance calculated correctly?
I think feature importance depends on the implementation, so we need to look at the scikit-learn documentation:

The feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

That reduction, or weighted information gain, is defined as follows. The weighted impurity decrease equation is:

N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
Since each feature is used only once in your case, the feature importance of each feature equals the equation above.
For X[2] :
feature_importance = (4 / 4) * (0.375 - (0.75 * 0.444)) = 0.042
For X[1] :
feature_importance = (3 / 4) * (0.444 - (2/3 * 0.5)) = 0.083
For X[0] :
feature_importance = (2 / 4) * (0.5) = 0.25
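As a sanity check, you can reproduce those three computations with the untruncated fractions and compare them against what the fitted tree reports (a small sketch reusing the clf fitted above):

# Reproduce the hand computations with exact fractions
imp_X2 = (4/4) * (0.375 - (3/4) * (4/9))   # root split on X[2]
imp_X1 = (3/4) * ((4/9) - (2/3) * 0.5)     # second split on X[1]
imp_X0 = (2/4) * (0.5 - 0)                 # third split on X[0], pure children
print(imp_X2, imp_X1, imp_X0)
# 0.0416..., 0.0833..., 0.25

# Matches the library's unnormalized importances
print(clf.tree_.compute_feature_importances(normalize=False))
# [0.25       0.08333333 0.04166667]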
A single feature can be used in different branches of the tree; its feature importance is then its total contribution to reducing the impurity:

feature_importance += number_of_samples_at_parent_where_feature_is_used * impurity_at_parent
                      - left_child_samples * impurity_left
                      - right_child_samples * impurity_right

where impurity is the gini/entropy value, and

normalized_importance = feature_importance / number_of_samples_at_root_node (total number of samples)
In the above example:
feature_2_importance = 0.375*4 - 0.444*3 - 0*1 = 0.168
normalized = 0.168 / 4 (total number of samples) = 0.042
If feature_2 were used in other branches, you would calculate its importance at each such parent node and sum up the values.
There is a difference between the feature importance calculated here and the ones returned by the library, because here we are using the truncated impurity values seen in the graph.
Instead, we can access all the required data using the tree_ attribute of the classifier, which can be used to probe the features used, threshold values, impurities, number of samples at each node, etc.
For example, clf.tree_.feature gives the list of features used at each node; a negative value indicates a leaf node.
Similarly, clf.tree_.children_left/right gives the indices into clf.tree_.feature for the left and right children.
Using the above, traverse the tree and use the same indices in clf.tree_.impurity and clf.tree_.weighted_n_node_samples to get the gini/entropy value and the number of samples at each node and at its children.
import numpy as np

def dt_feature_importance(model, normalize=True):
    left_c = model.tree_.children_left
    right_c = model.tree_.children_right
    impurity = model.tree_.impurity
    node_samples = model.tree_.weighted_n_node_samples

    # Initialize the feature importances; features not used remain zero
    feature_importance = np.zeros((model.tree_.n_features,))

    for idx, node in enumerate(model.tree_.feature):
        if node >= 0:
            # Accumulate the feature importance over all the nodes where it's used
            feature_importance[node] += impurity[idx] * node_samples[idx] - \
                impurity[left_c[idx]] * node_samples[left_c[idx]] - \
                impurity[right_c[idx]] * node_samples[right_c[idx]]

    # Divide by the number of samples at the root node
    feature_importance /= node_samples[0]

    if normalize:
        normalizer = feature_importance.sum()
        if normalizer > 0:
            feature_importance /= normalizer

    return feature_importance
This function will return the exact same values as returned by clf.tree_.compute_feature_importances(normalize=...)
To sort the features based on their importance
features = clf.tree_.feature[clf.tree_.feature >= 0]  # negative values indicate leaf nodes
sorted(zip(features, dt_feature_importance(clf, False)[features]),
       key=lambda x: x[1], reverse=True)

Python Clustering 'purity' metric

I'm using a Gaussian Mixture Model (GMM) from sklearn.mixture to perform clustering of my data set.
I could use the function score() to compute the log probability under the model.
However, I am looking for a metric called 'purity', which is defined in this article.
How can I implement it in Python? My current implementation looks like this:
import numpy as np
from sklearn.mixture import GMM  # note: replaced by GaussianMixture in newer scikit-learn

# X is a 1000 x 2 array (1000 samples of 2 coordinates).
# It is actually a 2-dimensional PCA projection of data
# extracted from the MNIST dataset, but this random array
# is equivalent as far as the code is concerned.
X = np.random.rand(1000, 2)

clusterer = GMM(3, 'diag')
clusterer.fit(X)
cluster_labels = clusterer.predict(X)

# Now I can count the labels for each cluster...
count0 = list(cluster_labels).count(0)
count1 = list(cluster_labels).count(1)
count2 = list(cluster_labels).count(2)
But I cannot loop through each cluster in order to compute the confusion matrix (as per this question).
David's answer works but here is another way to do it.
import numpy as np
from sklearn import metrics
def purity_score(y_true, y_pred):
    # Compute the contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # Return purity: the fraction of samples falling in each cluster's majority class
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)
Also, if you need to compute inverse purity, all you need to do is replace axis=0 with axis=1.
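For a quick sense of what this returns, here is a worked example (the labels are made up for illustration):

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]  # one sample of class 0 ends up in cluster 1

# Contingency matrix (rows = true classes, columns = clusters):
#   [[2, 1],
#    [0, 3]]
# Purity = (2 + 3) / 6 = 0.833...
print(purity_score(y_true, y_pred))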
sklearn doesn't implement a cluster purity metric. You have 2 options:
Implement the measurement using sklearn data structures yourself. This and this have some python source for measuring purity, but either your data or the function bodies need to be adapted for compatibility with each other.
Use the (much less mature) PML library, which does implement cluster purity.
A very late contribution.
You can try to implement it like this, pretty much like in this gist
import numpy as np
from sklearn.metrics import accuracy_score

def purity_score(y_true, y_pred):
    """Purity score

    Args:
        y_true(np.ndarray): n*1 matrix of ground-truth labels
        y_pred(np.ndarray): n*1 matrix of predicted clusters

    Returns:
        float: Purity score
    """
    # Work on copies so the caller's labels are not modified in place
    y_true = np.asarray(y_true).copy()
    y_pred = np.asarray(y_pred)
    # Matrix which will hold the majority-voted labels
    y_voted_labels = np.zeros(y_true.shape)
    # Ordering labels:
    ## Labels might be missing, e.g. with a set like {0, 2} where 1 is missing.
    ## First find the unique labels, then map the labels to an ordered set:
    ## {0, 2} should become {0, 1}.
    labels = np.unique(y_true)
    ordered_labels = np.arange(labels.shape[0])
    for k in range(labels.shape[0]):
        y_true[y_true == labels[k]] = ordered_labels[k]
    # Update the unique labels
    labels = np.unique(y_true)
    # We set the number of bins to be n_classes+2 so that
    # we count the actual occurrence of classes between two consecutive bins,
    # the larger bin edge being excluded: [bin_i, bin_i+1)
    bins = np.concatenate((labels, [np.max(labels) + 1]), axis=0)

    for cluster in np.unique(y_pred):
        hist, _ = np.histogram(y_true[y_pred == cluster], bins=bins)
        # Find the most frequent label in the cluster
        winner = np.argmax(hist)
        y_voted_labels[y_pred == cluster] = winner

    return accuracy_score(y_true, y_voted_labels)
The currently top-voted answer correctly implements the purity metric, but it may not be the most appropriate metric in all cases, because it does not ensure that each predicted cluster label is assigned to only one true label.
For example, consider a dataset that is very imbalanced, with 99 examples of one label and 1 example of another. Then any clustering (e.g. two equal clusters of size 50) will achieve a purity of at least 0.99, rendering it a useless metric.
Instead, in cases where the number of clusters is the same as the number of labels, cluster accuracy may be more appropriate. It has the advantage of mirroring classification accuracy in an unsupervised setting. To compute cluster accuracy, we need to use the Hungarian algorithm to find the optimal matching between cluster labels and true labels. The SciPy function linear_sum_assignment does this:
import numpy as np
from sklearn import metrics
from scipy.optimize import linear_sum_assignment
def cluster_accuracy(y_true, y_pred):
    # Compute the contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # Find the optimal one-to-one mapping between cluster labels and true labels
    row_ind, col_ind = linear_sum_assignment(-contingency_matrix)
    # Return cluster accuracy
    return contingency_matrix[row_ind, col_ind].sum() / np.sum(contingency_matrix)
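To see the difference on the imbalanced example above (99 samples of one label, 1 of the other, split into two equal clusters), a quick check using either purity_score definition from earlier:

y_true = [0] * 99 + [1]
y_pred = [0] * 50 + [1] * 50  # two arbitrary, equal-sized clusters

print(purity_score(y_true, y_pred))      # 0.99 -- deceptively good
print(cluster_accuracy(y_true, y_pred))  # 0.51 -- reflects the poor clustering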
