I am trying to plot the SHAP waterfall plot for my dataset using the code below. I am working on a binary classification problem.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer, Explanation
from shap.plots import waterfall
import shap
print(shap.__version__)
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
idx = 9
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
sv = explainer(X.iloc[[idx]]) # pass the row of interest as df
exp = Explanation(
    sv.values[:, :, 1],  # class to explain
    sv.base_values[:, 1],
    data=X.iloc[[idx]].values,  # pass the row of interest as df
    feature_names=X.columns,
)
waterfall(exp[0])
But I get an error like the one below:
AttributeError: 'Index' object has no attribute 'items'
but when I manually inspect the values passed to the Explanation object, they produce output successfully.
I guess the Explanation function is expecting something more that I am missing.
I verified that my X is a dataframe and my y is a Series, as shown in the sample code here. The shape of X is (2986, 29) and y is (2986,).
Can you help me understand what the issue is here?
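One thing worth checking first, in case the pandas Index is what Explanation trips over (an assumption on my part; some shap versions seem to want a plain list for feature_names, and the answer below runs on shap 0.40.0):
exp = Explanation(
    sv.values[:, :, 1],
    sv.base_values[:, 1],
    data=X.iloc[[idx]].values,
    feature_names=X.columns.tolist(),  # plain list instead of a pandas Index
)
waterfall(exp[0])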
Update: exporting the waterfall data to a dataframe. This works.
import pandas as pd

explainer = TreeExplainer(model)
bv = explainer.expected_value[1]
sv = explainer(X, check_additivity=False)
sdf = pd.DataFrame({
    'row_id': X.index.values.repeat(X.shape[1]),
    'feature': X.columns.to_list() * X.shape[0],
    'feature_value': X.values.flatten(),
    'base_value': bv,
    'shap_values': sv.values[:, :, 1].flatten()
})
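A usage sketch, assuming the frame above: the long-format sdf can then be filtered per observation, which yields the same numbers the waterfall plot draws.
# pull the waterfall data for one row, e.g. the row with index label 9
row_df = sdf[sdf['row_id'] == 9].copy()
# order features by the magnitude of their SHAP values, as the plot does
row_df = row_df.loc[row_df['shap_values'].abs().sort_values(ascending=False).index]
print(row_df.head(10))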
Related
I am trying to run t-distributed Stochastic Neighbor Embedding (t-SNE) in Jupyter but always face an issue with
ValueError: could not convert string to float: '<Null>'
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
# Reading the data using pandas (a raw string avoids backslash-escape issues in the path)
df = pd.read_csv(r"E:\Field data\Output\Pixel values7.csv")
# print the first few rows of df
print(df.head(9))
# save the labels into a variable l.
l = df['label']
# Drop the label feature and store the pixel data in d.
d = df.drop("label", axis = 1)
I got the error after this line:
# Data-preprocessing: Standardizing the data (using d, the pixel data without the label)
standardized_data = StandardScaler().fit_transform(d)
print(standardized_data.shape)
# TSNE
# Picking the top 1000 points as TSNE
# takes a lot of time for 15K points
data_1000 = standardized_data[0:1000, :]
labels_1000 = l[0:1000]  # the label Series saved above
model = TSNE(n_components = 2, random_state = 0)
# configuring the parameters
# the number of components = 2
# default perplexity = 30
# default learning rate = 200
# default Maximum number of iterations
# for the optimization = 1000
tsne_data = model.fit_transform(data_1000)
# creating a new data frame which
# help us in plotting the result data
tsne_data = np.vstack((tsne_data.T, labels_1000)).T
tsne_df = pd.DataFrame(data=tsne_data,
                       columns=("Dim_1", "Dim_2", "label"))
# Plotting the result of tsne
sn.FacetGrid(tsne_df, hue="label", height=6).map(  # 'height' replaces the deprecated 'size' argument
    plt.scatter, 'Dim_1', 'Dim_2').add_legend()
plt.show()
I got this code from a link somewhere; I am not an expert in Python, so I request you to kindly help me out.
I am trying to run this program on my data but always get an error:
ValueError: could not convert string to float: '<Null>'
If there is any other code for t-distributed Stochastic Neighbor Embedding (t-SNE), please let me know.
My data looks like this:
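From the error message, the CSV seems to contain the literal string '<Null>' where numbers are expected, so pandas reads those columns as text. A minimal cleaning sketch (assuming the markers should become missing values that are then dropped):
import pandas as pd

# treat the '<Null>' marker as a missing value while reading the file
df = pd.read_csv(r"E:\Field data\Output\Pixel values7.csv", na_values=['<Null>'])
df = df.dropna()  # or impute, if dropping loses too much data
l = df['label']
d = df.drop("label", axis=1).astype(float)  # the cast to float now succeeds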
I am working on a binary classification using the random forest algorithm.
Currently, I am trying to explain the model predictions using SHAP values.
So, I referred to this useful post here and tried the below.
from shap import TreeExplainer, Explanation
from shap.plots import waterfall
sv = explainer(ord_test_t)
exp = Explanation(sv.values[:, :, 1],
                  sv.base_values[:, 1],
                  data=ord_test_t.values,
                  feature_names=ord_test_t.columns)
idx = 20
waterfall(exp[idx])
I like the above approach, as it allows displaying the feature values along with the waterfall plot. So, I wish to use this approach.
However, this doesn't help me get the waterfall plot for a specific row of ord_test_t (test data).
For example, let's say that ord_test_t.index.tolist() returns 3, 5, 8, 9, etc.
Now, I want to plot the waterfall plot for the row whose index label is 9, but when I pass exp[9], it just gets the 9th positional row, not the row labelled 9.
When I try exp.iloc[[9]], it throws an error, as the Explanation object doesn't have iloc.
Can you help me with this, please?
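For reference, a minimal sketch of translating an index label into a position, assuming the rows of exp line up with the rows of ord_test_t:
# map the dataframe index label 9 to its positional index
pos = ord_test_t.index.get_loc(9)
waterfall(exp[pos])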
My suggestion is as follows:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer, Explanation
from shap.plots import waterfall
import shap
print(shap.__version__)
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
idx = 9
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
sv = explainer(X.loc[[idx]])  # corrected, pass the row of interest as df
exp = Explanation(
    sv.values[:, :, 1],  # class to explain
    sv.base_values[:, 1],
    data=X.loc[[idx]].values,  # corrected, pass the row of interest as df
    feature_names=X.columns,
)
waterfall(exp[0]) # pretend you have only 1 data point which is 0th
The shap version printed at the top is 0.40.0.
Proof:
model.predict_proba(X.loc[[idx]]) # corrected
array([[0.95752656, 0.04247344]])
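As a further check, relying on the additivity of SHAP values and the objects defined above, the base value plus the row's SHAP values should reproduce that predicted probability for class 1:
# base value + sum of the row's SHAP values ≈ predict_proba for class 1
print(sv.base_values[0, 1] + sv.values[0, :, 1].sum())
# ≈ 0.0425, matching the second column above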
I am working on a binary classification using a random forest model and neural networks, in which I am using SHAP to explain the model predictions. I followed the tutorial and wrote the below code to get the waterfall plot shown below.
row_to_show = 20
data_for_prediction = ord_test_t.iloc[row_to_show] # use 1 row of data here. Could use multiple rows if desired
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)
rf_boruta.predict_proba(data_for_prediction_array)
explainer = shap.TreeExplainer(rf_boruta)
# Calculate Shap values
shap_values = explainer.shap_values(data_for_prediction)
shap.plots._waterfall.waterfall_legacy(explainer.expected_value[0], shap_values[0],ord_test_t.iloc[row_to_show])
This generated the plot as shown below
However, I want to export this to a dataframe. How can I do it?
I expect my output to be like the one shown below. I want to export this for the full dataframe. Can you help me, please?
Let's do a small experiment:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer
X, y = load_breast_cancer(return_X_y=True, as_frame=True)  # as_frame so X.columns exists below
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
What is explainer here? If you do dir(explainer) you'll find out it has some methods and attributes among which is:
explainer.expected_value
which is of interest to you, because this is the base on which SHAP values add up.
Furthermore:
sv = explainer.shap_values(X)
len(sv)
will give a hint that sv is a list consisting of 2 objects, which are most probably the SHAP values for class 1 and class 0. These must be symmetric, because whatever moves a prediction towards 1 moves it by exactly the same amount, but with the opposite sign, towards 0.
Hence:
sv1 = sv[1]
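A quick check of that symmetry claim, following the experiment above:
import numpy as np

# the SHAP values for the two classes mirror each other
print(np.allclose(sv[0], -sv[1]))  # expected: True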
Now you have everything to pack it to the desired format:
df = pd.DataFrame(sv1, columns=X.columns)
df.insert(0, 'bv', explainer.expected_value[1])
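And a sanity check on the packed frame (a sketch, assuming additivity holds for this explainer): bv plus each row's SHAP values should approximate the model's predicted probability for class 1.
import numpy as np

# reconstruct the predicted probability from base value + SHAP values
recon = df['bv'] + df.drop(columns='bv').sum(axis=1)
print(np.allclose(recon, model.predict_proba(X)[:, 1], atol=1e-4))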
Q: How do I know?
A: Read docs and source code.
If I recall correctly, you can do something like this with pandas
import pandas as pd
shap_values = explainer.shap_values(data_for_prediction)
shap_values_df = pd.DataFrame(shap_values[1])  # shap_values is a list for classifiers; pick the class to explain
To get the feature names, you should do something like this (if data_for_prediction is a dataframe):
feature_names = data_for_prediction.columns.tolist()
shap_df = pd.DataFrame(shap_values[1], columns=feature_names)
I'm currently using this:
import shap
import pandas as pd

def getShapReport(classifier, X_test):
    shap_values = shap.TreeExplainer(classifier).shap_values(X_test)
    shap.summary_plot(shap_values, X_test)
    shap.summary_plot(shap_values[1], X_test)
    return pd.DataFrame(shap_values[1])
It first displays the SHAP values for the model, then for each prediction, and finally it returns the dataframe for the positive class (I'm in an imbalanced context).
It uses a TreeExplainer rather than a waterfall plot, but it is basically the same.
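Usage would look something like this (a sketch, with the classifier and test-set names taken from the question above):
# rf_boruta and ord_test_t are the fitted model and test frame from the question
shap_df = getShapReport(rf_boruta, ord_test_t)
shap_df.columns = ord_test_t.columns  # attach feature names to the SHAP columns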
I'm working on a classification problem using a dataset containing 39 attributes (38 independent features + the class attribute). When I try to calculate the correlation matrix, the class attribute is not taken into consideration. To my knowledge, it should be included in the matrix as well.
len(heartdata.columns)
39
Since the number of columns in my dataframe is 39, the correlation matrix should be of shape (39, 39), but what I get is:
cor = heartdata.corr()
cor.shape
(38, 38)
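For context: DataFrame.corr() only considers numeric columns and silently drops the rest, so a quick way to confirm the cause is to inspect the dtypes (a sketch, assuming heartdata as above):
# any non-numeric column listed here will be ignored by corr()
print(heartdata.dtypes[heartdata.dtypes == 'object'])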
If your features are categorical, you should use LabelEncoder:
from sklearn.preprocessing import LabelEncoder

train = train_df
label_encoder = LabelEncoder()
for i in range(len(train.columns)):
    column = train_df.columns[i]
    train[column] = label_encoder.fit_transform(train_df[column])
    print(f"train {column} uniques: {len(train[column].unique())}")
x = train
y = train_df['gender'].to_frame(name='gender')
Then you can get Correlation Matrix:
cor = x.corr()
print(cor)
And if you want to use a plot to show the correlation between features, I suggest a heatmap:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 8), linewidth=10, edgecolor="#04253a")
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
Output: a heatmap of the correlation matrix.
My class attribute had a categorical type, which is why the corr() function didn't take it into consideration. A simple encoding solved the problem.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
heartdata['class'] = le.fit_transform(heartdata['class'])
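After the encoding, the class column is numeric and the matrix has the expected shape (a quick check following the code above):
cor = heartdata.corr()
print(cor.shape)  # now (39, 39)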
I have a KMeans function I made that takes the input def kmeans(x, k, no_of_iterations): and returns points, centroids. It gets plotted perfectly; the code for that isn't very relevant. But I want to calculate the silhouette score for it and graph this for each value.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Load Data
data = load_digits().data
pca = PCA(2)
# Transform the data
df = pca.fit_transform(data)
X = df
#y = kmeans.fit_predict(X)
# Applying our function
label, centroids = kmeans(df, 10, 1000)  # returns points value and centroids
y = label.fit_predict(data)  # this fails: label is an array of cluster ids, not an estimator
#Visualize the results
u_labels = np.unique(label)
for i in u_labels:
    plt.scatter(df[label == i, 0], df[label == i, 1], label=i)
plt.scatter(centroids[:, 0], centroids[:, 1], s=80, color='k')
plt.legend()
plt.show()
The above is the code for running the KMeans plot.
Below is my attempt to calculate the silhouette score. It is from an example that imports KMeans from sklearn, but I don't really want to do that, nor did it work with my code.
silhouette_avg = silhouette_score(X, y)
print("The average silhouette_score is :", silhouette_avg)
# Compute the silhouette scores for each sample
sample_silhouette_values = silhouette_samples(X, y)
You may notice that there is no value here for y. I have found that y is supposed to be the number of clusters, I think? So I had it as 10 at first, and it gave an error message. Could anyone tell me from this code what I should do next to get this value?
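For what it's worth, silhouette_score expects the fitted cluster labels (one per sample), not the number of clusters, so with the custom kmeans above something like this should work (a sketch, assuming label holds one cluster id per row of df):
from sklearn.metrics import silhouette_score, silhouette_samples

# y must be the per-sample cluster assignments, not the cluster count
silhouette_avg = silhouette_score(df, label)
print("The average silhouette_score is :", silhouette_avg)
sample_silhouette_values = silhouette_samples(df, label)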
Try this:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
mpl.rcParams["figure.figsize"] = (9,6)
# Generate synthetic dataset with 8 blobs
X, y = make_blobs(n_samples=1000, n_features=12, centers=8, shuffle=True, random_state=42)
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof()
# Instantiate the clustering model and visualizer
model = KMeans(8)
visualizer = SilhouetteVisualizer(model)
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof() # Draw/show/poof the data
Also, see this.
https://www.scikit-yb.org/en/latest/api/cluster/silhouette.html