I am trying to find the features that are important for a class and contribute positively (i.e. that have red points on the positive side of the SHAP summary plot).
I can get the shap_values and plot the shap summary for each class (e.g. class 2 here) using the following code:
import shap
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values[2], X)
From the plot I can understand which features are important to that class. In the plot below, I can say alcohol and sulphates are the main features (which are the ones I am most interested in).
However, I want to automate this process, so the code can rank the features (which are important on the positive side) and return the top N. Any idea on how to automate this interpretation?
I need to automatically identify those important features for each class. Any method other than SHAP that can handle this process would also be fine.
You can do the following: keep only the SHAP values that push the classification towards the class (shap_values > 0); when shap_values < 0 the feature pushes away from the class. Then take the mean per feature and sort the results.
If you prefer global importance, use .abs() instead of [shap_df > 0], and for the whole model use shap_values directly instead of selecting a single class.
import shap
import pandas as pd

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X)

class_idx = 2  # the class you are interested in (shap_values is a list with one array per class)
n = 5          # number of top features to return

shap_df = pd.DataFrame(shap_values[class_idx], columns=X.columns)
feature_importance = (shap_df
                      [shap_df > 0]                 # keep only positive contributions
                      .mean()                       # mean positive SHAP value per feature
                      .sort_values(ascending=False)
                      .reset_index()
                      .rename(columns={'index': 'feature', 0: 'weight'})
                      .head(n)
                      )
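For example, a minimal sketch (assuming clf is an already-fitted multi-class tree model and X is a pandas DataFrame; the helper name top_positive_features is just for illustration) that collects the top-N positively contributing features for every class:
import shap
import pandas as pd

def top_positive_features(clf, X, n=5):
    # Returns {class index: top-n features ranked by mean positive SHAP value}
    explainer = shap.TreeExplainer(clf)
    shap_values = explainer.shap_values(X)  # list with one array per class
    result = {}
    for class_idx, class_shap in enumerate(shap_values):
        shap_df = pd.DataFrame(class_shap, columns=X.columns)
        result[class_idx] = (shap_df[shap_df > 0]   # positive contributions only
                             .mean()
                             .sort_values(ascending=False)
                             .head(n))
    return result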
I have a data set with 10,000 rows; each row has 248 values, and these values determine whether that row is a zero or a one. I am trying to figure out why that is, so I want to plot the logistic regression line from
LR = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr',fit_intercept=True).fit(X, Y)
so I can see why the rows are classified the way they are. But I can't figure out how to do this; I can't use a scatter plot since the x data has many more values than the label data.
My question is: how would I go about plotting this?
I could suggest plotting the logistic regression using
import seaborn as sns
sns.regplot(x='variable', y='target', data=data, logistic=True)  # y must be the binary label
But that takes a single input variable. Since you are trying to find relationships across a large number of inputs, I would look at feature importance first, by running this:
from sklearn.linear_model import LogisticRegression

m = LogisticRegression()
m.fit(X, y)
print(m.coef_)  # one coefficient per feature; larger magnitude means stronger influence
The next steps would be applying PCA to either eliminate some features or condense them into fewer variables, and computing a correlation matrix.
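A minimal sketch of those two steps (assuming X is a pandas DataFrame of the 248 features and y is the 0/1 label; the choice of 10 components is arbitrary):
import pandas as pd
from sklearn.decomposition import PCA

# Correlation of each feature with the label
corr_with_label = X.corrwith(pd.Series(y, index=X.index)).abs().sort_values(ascending=False)
print(corr_with_label.head(10))  # features most correlated with the 0/1 outcome

# Condense the 248 features into a handful of principal components
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # variance captured by each component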
P.S. what does a zero or one represent?
I'm running a deep learning model which requires me to scale my dataset. I'm using scikit-learn's MinMaxScaler. After I make the prediction, if I compare the prediction with the target column I get a certain relative error. But if I rescale the dataset and the prediction, the relative error increases massively.
For reference, it's not a good model: the error when using the scaled dataset is around 40%, and when I re-scale, the error jumps to over 60%. I'm calculating the relative error this way:
import numpy as np

def calculate_error(prediction, y):
    # symmetric relative error between prediction and target
    rel_error = 2 * np.absolute(y - prediction) / (np.absolute(y) + np.absolute(prediction))
    return rel_error
From this I get the mean and the standard deviation using numpy's mean() and std() functions. An example is the following
predicted_scaled = [0.26652822, 0.2384195, 0.26829958, 0.25697553, 0.28840747]
real_scaled = [0.16201117, 0.37243948, 0.42085661, 0.49534451, 0.23649907]
rel_error.mean() = 44.02%
rel_error.std() = 14.03%
---
predicted_rescaled = [12.012565, 10.503127, 12.107687, 11.499586, 13.187481]
real_rescaled = [6.4, 17.7, 20.3, 24.3, 10.4]
rel_error.mean() = 51.54%
rel_error.std() = 17.8%
Why does this happen and how can I prevent it? Furthermore, what's the correct error: the one that compares prediction and target while scaled or the one I get after scaling?
It's because the min value in your min/max scaler shifts the shape of the modelled distribution. Let us, for example, take a single data point, pred = 0.6, true = 0.8.
First calculate the error for this point without rescaling:
error = 2 * |0.6 - 0.8| / (0.6 + 0.8)
error = 0.4 / 1.4 = 2/7 ≈ 0.29
Now undo the scaling with a (randomly chosen) scaler that had a min of 2.2 and a max of 10.1, which maps 0.6 to 6.94 and 0.8 to 8.52:
error = 2 * |6.94 - 8.52| / (6.94 + 8.52)
error = 3.16 / 15.46 ≈ 0.20
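A quick numeric check of the example above (the min of 2.2 and max of 10.1 are arbitrary, as noted):
import numpy as np

def rel_error(pred, true):
    return 2 * np.abs(true - pred) / (np.abs(true) + np.abs(pred))

data_min, data_max = 2.2, 10.1              # hypothetical MinMaxScaler range
pred_scaled, true_scaled = 0.6, 0.8

# undo the min/max scaling: x_original = x_scaled * (max - min) + min
pred_orig = pred_scaled * (data_max - data_min) + data_min   # 6.94
true_orig = true_scaled * (data_max - data_min) + data_min   # 8.52

print(rel_error(pred_scaled, true_scaled))  # ~0.286
print(rel_error(pred_orig, true_orig))      # ~0.204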
So this is not an error in the code, but rather a consequence of calculating a relative error between two different distributions, which gives a different value.
As for which one is the 'correct' result to report, I would say it depends on what you're discussing. If you're conveying the real results, then I would use the re-scaled (original-unit) values. If you're conveying model performance, either will suffice.
Also, I think it is important to scale your inputs/outputs, as a model will generally learn better with scaled inputs/outputs and an activated output (i.e. a sigmoid or tanh activation at the output layer).
I have a dataframe with 5 independent variables and 1 dependent variable. All my variables are continuous, including the dependent variable. Is there a way I can calculate, in Python, which of my independent variables influences my dependent variable the most? Is there an algorithm I could run to do this for me?
I tried the information gain method, but that is a classification method, so I had to use a LabelEncoder to transform my dependent variable. I used the following code after splitting my dataset into a train and test set:
#encoding the dependent variable
from sklearn import preprocessing
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
import numpy as np

lab_enc = preprocessing.LabelEncoder()
training_scores_encoded = lab_enc.fit_transform(y_train)

#SelectFromModel will select those features whose importance is greater than the mean importance of all the features by default, but we can alter this threshold if we want.
#First, I specify the random forest instance, indicating the number of trees.
#Then I use the SelectFromModel object from sklearn to automatically select the features.
sel = SelectFromModel(RandomForestClassifier(n_estimators=100))
sel.fit(X_train, training_scores_encoded)

#We can now make a list and count the selected features.
selected_feat = X_train.columns[sel.get_support()]
len(selected_feat)

#viewing the importances
import matplotlib.pyplot as plt
importances = sel.estimator_.feature_importances_
indices = np.argsort(importances)[::-1]
# X is the train data used to fit the model
plt.figure()
plt.title("Feature importances")
plt.bar(range(X_train.shape[1]), importances[indices],
        color="r", align="center")
plt.xticks(range(X_train.shape[1]), indices)
plt.xlim([-1, X_train.shape[1]])
Although I got a result, I'm not sure about it because I had to encode my (continuous) dependent variable. Is this the correct way to go? If not, what can I do?
Thank you in advance for the assistance.
You can use the SelectKBest class from the scikit-learn module.
Check the original documentation here.
This technique is called feature selection.
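For example, a minimal sketch (assuming X holds your 5 independent variables as a DataFrame and y is the continuous dependent variable; f_regression is used here because the target is continuous, so no label encoding is needed):
from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(score_func=f_regression, k=3)  # keep the 3 highest-scoring features
selector.fit(X, y)

print(dict(zip(X.columns, selector.scores_)))  # score per feature
print(X.columns[selector.get_support()])       # names of the selected features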
You can also pick features with the highest correlation to the response.
print([(feature, abs(df[response].corr(df[feature]))) for feature in features])
This uses values from Tamarie's comment.
for feature in feature_cols:
    print(f'feature: {feature} correlation: {abs(target_v.corr(df[feature]))}')
I am working on an anomaly detection project on a call detail record for a telephone operator. I have prepared a sample of 10,000 observations with 80 dimensions, which represents all of the observations for one day of traffic. The data are represented as follows:
(This is a small part of the whole dataset.)
I decided to use the PyOD library, which offers many unsupervised learning algorithms, and I started with KNN:
from pyod.models.knn import KNN

knn = KNN(contamination=0.1)
result = knn.fit_predict(conso)
Then, to visualize the result, I reduced the sample to 2 dimensions and displayed it as a scatter plot, with the observations KNN predicted as inliers in blue and the predicted outliers in red.
from sklearn.manifold import TSNE
result_f = TSNE(n_components = 2).fit_transform(df_final_2)
result_f = pd.DataFrame(result_f)
color= ['red' if row == 1 else 'blue' for row in result_list]
'df_final_2' is the dataframe version of 'conso'.
Then I plot everything with the right colors:
import matplotlib.pyplot as plt
plt.scatter(result_f[0],result_f[1], s=1, c=color)
The thing that bothers me about the graph is that the observations predicted as outliers are not really outliers: normally the outliers should sit at the extremities of the graph, not be grouped with the normal behaviour. Even when I analyse these supposedly aberrant observations, they behave normally in the original dataset. I have tried other PyOD algorithms and modified the parameters of each one, but I get more or less the same result. I must have made a mistake somewhere, but I cannot spot it.
Thanks.
There are several things to check:
when using kNN, LOF, and similar models that rely on distance measures, the data should first be standardized (using sklearn's StandardScaler); see the sketch after the code below
t-SNE may not work well in this case, so the dimensionality reduction could be off
maybe do not use fit_predict, but do this instead (use y_train_pred):
# train kNN detector
clf_name = 'KNN'
clf = KNN(contamination=0.1)
clf.fit(X)
# get the prediction labels and outlier scores of the training data
y_train_pred = clf.labels_ # binary labels (0: inliers, 1: outliers)
y_train_scores = clf.decision_scores_ # raw outlier scores
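A minimal sketch of the standardization step from the first point (assuming conso is the feature matrix; the contamination value is just carried over from the question):
from sklearn.preprocessing import StandardScaler
from pyod.models.knn import KNN

# Standardize so no single feature dominates the distance computation
X_scaled = StandardScaler().fit_transform(conso)

clf = KNN(contamination=0.1)
clf.fit(X_scaled)
y_train_pred = clf.labels_              # 0: inliers, 1: outliers
y_train_scores = clf.decision_scores_   # raw outlier scores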
If none of these work, feel free to open an issue report on GitHub and we will take a further investigation.
I'm using a Gaussian Mixture Model (GMM) from sklearn.mixture to perform clustering of my data set.
I could use the function score() to compute the log probability under the model.
However, I am looking for a metric called 'purity' which is defined in this article.
How can I implement it in Python? My current implementation looks like this:
import numpy as np
from sklearn.mixture import GMM
# X is a 1000 x 2 array (1000 samples of 2 coordinates).
# It is actually a 2 dimensional PCA projection of data
# extracted from the MNIST dataset, but this random array
# is equivalent as far as the code is concerned.
X = np.random.rand(1000, 2)
clusterer = GMM(3, 'diag')
clusterer.fit(X)
cluster_labels = clusterer.predict(X)
# Now I can count the labels for each cluster..
count0 = list(cluster_labels).count(0)
count1 = list(cluster_labels).count(1)
count2 = list(cluster_labels).count(2)
But I cannot work out how to loop through each cluster in order to compute the confusion matrix (according to this question).
David's answer works but here is another way to do it.
import numpy as np
from sklearn import metrics
def purity_score(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # return purity
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)
Also if you need to compute Inverse Purity, all you need to do is replace "axis=0" by "axis=1".
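A small usage example with made-up labels (the values are purely illustrative):
y_true = [0, 0, 1, 1]
y_pred = [0, 0, 1, 0]   # hypothetical cluster assignments
print(purity_score(y_true, y_pred))  # 0.75: the majority label covers 3 of the 4 points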
sklearn doesn't implement a cluster purity metric. You have 2 options:
Implement the measurement using sklearn data structures yourself. This and this have some python source for measuring purity, but either your data or the function bodies need to be adapted for compatibility with each other.
Use the (much less mature) PML library, which does implement cluster purity.
A very late contribution.
You can try to implement it like this, pretty much as in this gist:
import numpy as np
from sklearn.metrics import accuracy_score

def purity_score(y_true, y_pred):
    """Purity score
    Args:
        y_true(np.ndarray): n*1 matrix Ground truth labels
        y_pred(np.ndarray): n*1 matrix Predicted clusters
    Returns:
        float: Purity score
    """
    # work on copies so the caller's label arrays are not modified in place
    y_true = np.asarray(y_true).copy()
    y_pred = np.asarray(y_pred)
    # matrix which will hold the majority-voted labels
    y_voted_labels = np.zeros(y_true.shape)
    # Ordering labels
    ## Labels might be missing, e.g. with a set like 0,2 where 1 is missing
    ## First find the unique labels, then map the labels to an ordered set
    ## 0,2 should become 0,1
    labels = np.unique(y_true)
    ordered_labels = np.arange(labels.shape[0])
    for k in range(labels.shape[0]):
        y_true[y_true == labels[k]] = ordered_labels[k]
    # Update unique labels
    labels = np.unique(y_true)
    # Add one extra bin edge (max label + 1) so that each class is counted
    # in its own histogram bin [bin_i, bin_i+1)
    bins = np.concatenate((labels, [np.max(labels) + 1]), axis=0)
    for cluster in np.unique(y_pred):
        hist, _ = np.histogram(y_true[y_pred == cluster], bins=bins)
        # Find the most present label in the cluster
        winner = np.argmax(hist)
        y_voted_labels[y_pred == cluster] = winner
    return accuracy_score(y_true, y_voted_labels)
The currently top voted answer correctly implements the purity metric, but may not be the most appropriate metric in all cases, because it does not ensure that each predicted cluster label is assigned only once to a true label.
For example, consider a dataset that is very imbalanced, with 99 examples of one label and 1 example of another label. Then any clustering (e.g. two equal clusters of size 50) will achieve a purity of at least 0.99, rendering it a useless metric.
Instead, in cases where the number of clusters is the same as the number of labels, cluster accuracy may be more appropriate. This has the advantage of mirroring classification accuracy in an unsupervised setting. To compute cluster accuracy, we need to use the Hungarian algorithm to find the optimal matching between cluster labels and true labels. The SciPy function linear_sum_assignment does this:
import numpy as np
from sklearn import metrics
from scipy.optimize import linear_sum_assignment

def cluster_accuracy(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # Find optimal one-to-one mapping between cluster labels and true labels
    row_ind, col_ind = linear_sum_assignment(-contingency_matrix)
    # Return cluster accuracy
    return contingency_matrix[row_ind, col_ind].sum() / np.sum(contingency_matrix)
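As a small check on the imbalanced example from above (99 points of one label, 1 point of another, split into two arbitrary clusters of 50), comparing this with the purity_score from the earlier answers:
y_true = [0] * 99 + [1]
y_pred = [0] * 50 + [1] * 50   # two equal-sized, essentially uninformative clusters

print(purity_score(y_true, y_pred))      # 0.99, despite the clustering telling us little
print(cluster_accuracy(y_true, y_pred))  # 0.51, which better reflects the poor clustering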