I calculated the permutation importance using eli5, but I only get a subset of the values.
import eli5
eli5.show_weights(perm, feature_names = X.columns.tolist())
At the end of the original output, "... 10 more" is shown. How can I get all the values?
The show_weights method has a top argument; when it is set to None, there is no limit on the number of features shown (see the documentation). That should fix your problem:
eli5.show_weights(perm, feature_names = X.columns.tolist(), top=None)
I am working on a binary classification problem using a random forest model and neural networks, and I am using SHAP to explain the model predictions. I followed the tutorial and wrote the code below to get the waterfall plot shown below.
row_to_show = 20
data_for_prediction = ord_test_t.iloc[row_to_show] # use 1 row of data here. Could use multiple rows if desired
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)
rf_boruta.predict_proba(data_for_prediction_array)
explainer = shap.TreeExplainer(rf_boruta)
# Calculate Shap values
shap_values = explainer.shap_values(data_for_prediction)
shap.plots._waterfall.waterfall_legacy(explainer.expected_value[0], shap_values[0],ord_test_t.iloc[row_to_show])
This generated the plot as shown below
However, I want to export this to a dataframe. How can I do that?
I expect my output to look like the one shown below. I want to export this for the full dataframe. Can you help me, please?
Let's do a small experiment:
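import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer
# as_frame=True returns X as a DataFrame, so X.columns is available later
X, y = load_breast_cancer(return_X_y=True, as_frame=True)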
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
What is explainer here? If you do dir(explainer) you'll find out it has some methods and attributes among which is:
explainer.expected_value
which is of interest to you because it is the base value to which the SHAP values are added.
Furthermore:
sv = explainer.shap_values(X)
len(sv)
will give a hint that sv is a list of 2 objects, most probably the SHAP values for classes 0 and 1, which must be symmetric (whatever moves the prediction towards 1 moves it by exactly the same amount, with the opposite sign, towards 0).
Hence:
sv1 = sv[1]
Now you have everything to pack it to the desired format:
df = pd.DataFrame(sv1, columns=X.columns)
df.insert(0, 'bv', explainer.expected_value[1])
Q: How do I know?
A: Read docs and source code.
If I recall correctly, you can do something like this with pandas
import pandas as pd
shap_values = explainer.shap_values(data_for_prediction)
shap_values_df = pd.DataFrame(shap_values)
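To get the feature names as column headers, you should do something like this (if data_for_prediction is a dataframe):
feature_names = data_for_prediction.columns.tolist()
# explainer.shap_values returns NumPy arrays (for some classifiers, a list with one array per class), so there is no .values attribute to call
shap_df = pd.DataFrame(shap_values, columns=feature_names)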
I'm currently using this:
import pandas as pd
import shap
def getShapReport(classifier, X_test):
    shap_values = shap.TreeExplainer(classifier).shap_values(X_test)
    shap.summary_plot(shap_values, X_test)
    shap.summary_plot(shap_values[1], X_test)
    return pd.DataFrame(shap_values[1])
It first displays the SHAP values for the model, then for each prediction, and finally it returns the dataframe for the positive class (I'm in an imbalanced-data context).
It is for a tree explainer and not a waterfall plot, but it is basically the same.
I want to standardize 'x_train'.
The first 'x_train' in the picture is the original data set, and the next 'x_train' below the previous one is standardized.
I just want to standardize the first six columns, so I wrote x_train[:,0:6] during standardization.
However, the result of the standardization is obviously unreasonable. Moreover, when I use the mean and standard deviation of 'x_train' to standardize x_test, the result looks right. It's weird. I have no idea what's wrong with my code.
Below is my code for standardizing.
Try:
from sklearn import preprocessing
# fit the scaler on the first six columns of the training set only
scaler = preprocessing.StandardScaler().fit(x_train.iloc[:, 0:6])
# return the scaled values in new variables
X_train_first_six = scaler.transform(x_train.iloc[:, 0:6])
X_test_first_six = scaler.transform(x_test.iloc[:, 0:6])
Reference: pandas iloc
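If the scaled columns need to go back next to the untouched ones, something like the following may help. It is only a sketch on synthetic data: the column names f0 to f7 and the *_scaled variables are made up for the example, and it assumes x_train and x_test are pandas DataFrames, as the iloc call above does.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for x_train / x_test with eight numeric columns
rng = np.random.default_rng(0)
cols = [f"f{i}" for i in range(8)]
x_train = pd.DataFrame(rng.normal(5.0, 2.0, size=(100, 8)), columns=cols)
x_test = pd.DataFrame(rng.normal(5.0, 2.0, size=(20, 8)), columns=cols)

first_six = cols[:6]

# Fit on the training data only, then reuse the same mean/std for the test data
scaler = StandardScaler().fit(x_train[first_six])

# Work on copies so the original frames stay intact
x_train_scaled = x_train.copy()
x_test_scaled = x_test.copy()
x_train_scaled[first_six] = scaler.transform(x_train[first_six])
x_test_scaled[first_six] = scaler.transform(x_test[first_six])
```

After this, the first six training columns have mean 0 and unit standard deviation, while the last two columns are unchanged.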
I wrote code for a confusion matrix to compare two lists of numbers, following documentation online. Just when I thought I had good results, I noticed that the values were positioned in a weird way. First, this is the code I am using:
## Classification report and confusion matrix
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
def evaluate_pred(y_true, y_pred):
    y_test = np.array(y_true)
    y_predict = np.array(y_pred)
    target_names = ['Empty', 'Human', 'Dog', 'Dog&Human']
    labels_names = [0,1,2,3]
    print(classification_report(y_test, y_predict,labels=labels_names, target_names=target_names))
    cm = confusion_matrix(y_test, y_predict,labels=labels_names, normalize='pred')
    cm2 = confusion_matrix(y_test, y_predict,labels=labels_names)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm,display_labels=target_names)
    disp = disp.plot(cmap=plt.cm.Blues,values_format='g')
    disp2 = ConfusionMatrixDisplay(confusion_matrix=cm2,display_labels=target_names)
    disp2 = disp2.plot(cmap=plt.cm.Blues,values_format='g')
    plt.show()
and after giving it two lists (labels and prediction) I get the following result (below is the normalized matrix), but as you can see, the rows for each class are supposed to add up to the total, but instead, it's the columns that do. I tried different things but I still cannot get it fixed. There is something I am missing but I cannot figure it out. Thanks a lot for any help.
I simply had to use normalize='true' instead of normalize='pred' to solve the issue. It seems that setting the value to 'pred' divides by the total of each column (the predicted labels), whereas 'true' divides by the total of each row (the true labels).
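A small check of that behaviour, using made-up labels for the four classes (the y_true and y_pred lists below are invented for the illustration, not taken from the question):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels: every class appears at least once in both lists
y_true = [0, 0, 1, 1, 1, 2, 2, 3, 3, 3]
y_pred = [0, 1, 1, 1, 2, 2, 3, 3, 3, 0]
labels = [0, 1, 2, 3]

# normalize='true' divides each row by the number of true samples of that class
cm_true = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
# normalize='pred' divides each column by the number of predictions of that class
cm_pred = confusion_matrix(y_true, y_pred, labels=labels, normalize="pred")

row_sums_true = cm_true.sum(axis=1)  # one sum per true class
col_sums_pred = cm_pred.sum(axis=0)  # one sum per predicted class
```

With normalize='true' the rows add up to 1, and with normalize='pred' the columns do, which matches what the plot in the question showed.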
I'm trying to plot a probability distribution using a pandas.Series and I'm struggling to set different yerr for each bar. In summary, I'm plotting the following distribution:
It comes from a Series and it works fine, except for the yerr: the error bars must not go above 1 or below 0. So I'd like to set different errors for each bar. I therefore went to the documentation, which is available here and here.
According to them, I have 3 options for either yerr or xerr:
scalar: Symmetric +/- values for all data points.
shape(N,): Symmetric +/- values for each bar.
shape(2,N): Separate - and + values for each bar. The first row contains the lower errors, the second row contains the upper errors.
The case I need is the last one. In this case, I can use a DataFrame, Series, array-like, dict and str. Thus, I set the arrays for each yerr bar, however it's not working as expected. Just to replicate what's happening, I prepared the following examples:
First I set a pandas.Series:
import pandas as pd
se = pd.Series(data=[0.1,0.2,0.3,0.4,0.4,0.5,0.2,0.1,0.1],
index=list('abcdefghi'))
Then, I'm replicating each case:
This works as expected:
err1 = [0.2]*9
se.plot(kind="bar", width=1.0, yerr=err1)
This also works as expected:
err2 = list(err1)  # copy, so that err1 is left unchanged
err2[3] = 0.5
se.plot(kind="bar", width=1.0, yerr=err2)
Now the problem: this doesn't work as expected!
err_up = [0.3]*9
err_low = [0.1]*9
err3 = [err_low, err_up]
se.plot(kind="bar", width=1.0, yerr=err3)
It's not setting different errors for low and up. I found an example here and a similar SO question here; although they use matplotlib instead of pandas, the same approach should work here too.
I'd be glad to hear any solution.
Thank you.
Strangely, plt.bar works as expected:
import matplotlib.pyplot as plt
err_up = [0.3]*9
err_low = [0.1]*9
err3 = [err_low, err_up]
fig, ax = plt.subplots()
ax.bar(se.index, se, width=1.0, yerr=err3)
plt.show()
Output:
A bug/feature/design-decision of pandas maybe?
Based on #Quanghoang's comment, I started to think it was a bug. So I tried changing the yerr shape and, surprisingly, the following code worked:
err_up = [0.3]*9
err_low = [0.1]*9
err3 = [[err_low, err_up]]
print (err3)
se.plot(kind="bar", width=1.0, yerr=err3)
Observe I included a new axis in err3. Now it's a (1,2,N) array. However, the documentation says it should be (2,N).
In addition, a possible workaround that I found is to call ax.set_ylim(0, 1). It doesn't solve the problem, but it plots the graph correctly.
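Since the original goal was to keep the error bars inside [0, 1], another option is to compute the lower and upper errors per bar and pass them to matplotlib's ax.bar as a (2, N) array, which the answer above reports as working. This is only a sketch: the nominal error of 0.2 and the clipping rule are assumptions made for the illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

se = pd.Series(data=[0.1, 0.2, 0.3, 0.4, 0.4, 0.5, 0.2, 0.1, 0.1],
               index=list("abcdefghi"))
values = se.to_numpy()
err = 0.2  # assumed nominal symmetric error

# Clip each side so that value - low >= 0 and value + up <= 1
err_low = np.minimum(err, values)
err_up = np.minimum(err, 1.0 - values)
yerr = np.vstack([err_low, err_up])  # shape (2, N): first row lower, second row upper

fig, ax = plt.subplots()
ax.bar(se.index, values, width=1.0, yerr=yerr)
ax.set_ylim(0, 1)
# call plt.show() here in an interactive session
```

Bars with a value below 0.2 get a shorter lower whisker, so no error bar drops below zero.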
I am trying to read in the complete Titanic dataset, which can be found here:
biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls
import pandas as pd
# Importing the dataset
dataset = pd.read_excel('titanic3.xls')
y = dataset.iloc[:, 1].values
x = dataset.iloc[:, 2:14].values
# Create Dataset for Men
men_on_board = dataset[dataset['sex'] == 'male']
male_fatalities = men_on_board[men_on_board['survived'] ==0]
X_male = male_fatalities.iloc[:, [4,8]].values
# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X_male[:,0])
X_male[:,0] = imputer.transform(X_male[:,0])
When I run all but the last line, I get the following warning:
/Users/<username>/anaconda/lib/python3.6/site-packages/sklearn/utils/validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
DeprecationWarning)
When I run the last line, it throws the following error:
File "<ipython-input-2-07afef05ee1c>", line 1, in <module>
X_male[:,0] = imputer.transform(X_male[:,0])
ValueError: could not broadcast input array from shape (523) into shape (682)
I've used the above code snippet for imputation on other projects, not sure why it's not working.
A quick solution is to change axis = 0 to axis = 1. This will make it work, though I'm not sure it's what you want, so here is an explanation of what happened.
The warning tells you that sklearn estimators now require 2D data arrays rather than 1D ones, because it matters whether the data is interpreted as samples (rows) or as features (columns). During the deprecation period, this requirement is enforced by np.atleast_2d, which assumes your data has a single sample (row). Meanwhile, you passed axis = 0 to the Imputer, which imputes along columns with strategy = 'mean'. However, you now have only 1 row, so when the Imputer comes across a missing value, there is no mean to replace it with. The entire column (which contains just this missing value) is therefore discarded. As you can see, the result is equal to
X_male[:,0][~np.isnan(X_male[:,0])].reshape(1, -1)
That's why the assignment X_male[:,0] = imputer.transform(X_male[:,0]) failed: X_male[:,0] has shape (682) while imputer.transform(X_male[:,0]) has shape (523). My quick solution changes the behaviour to "impute along rows", where you do have a mean to replace the missing values with. Nothing is dropped this time, so imputer.transform(X_male[:,0]) has shape (682) and can be assigned to X_male[:,0].
I don't know why your imputation snippet works in other projects. For your specific case, a (logically) better way to address the deprecation warning is to use X.reshape(-1, 1), since your data has a single feature and 682 samples. However, you need to reshape the transformed data back before it can be assigned to X_male[:,0]:
imputer = imputer.fit(X_male[:,0].reshape(-1, 1))
X_male[:,0] = imputer.transform(X_male[:,0].reshape(-1, 1)).reshape(-1)
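On more recent scikit-learn versions, sklearn.preprocessing.Imputer is no longer available; its replacement is sklearn.impute.SimpleImputer, which has no axis argument and always imputes column-wise. The same reshape idea carries over. Below is a minimal sketch on a small made-up array standing in for X_male (the numbers are invented, not taken from the Titanic file):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up stand-in for X_male: two numeric columns, NaNs in the first one
X_male = np.array([
    [22.0, 7.25],
    [np.nan, 8.05],
    [35.0, 53.10],
    [np.nan, 13.00],
    [54.0, 51.86],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Reshape the single feature to 2-D (n_samples x 1), impute, then flatten back
column = X_male[:, 0].reshape(-1, 1)
X_male[:, 0] = imputer.fit_transform(column).reshape(-1)
```

The two missing entries are replaced by the mean of the observed values in that column, and the array keeps its original shape.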