I am working on a binary classification problem using a random forest model and neural networks, and I am using SHAP to explain the model predictions. I followed the tutorial and wrote the code below to get a waterfall plot:
row_to_show = 20
data_for_prediction = ord_test_t.iloc[row_to_show] # use 1 row of data here. Could use multiple rows if desired
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)
rf_boruta.predict_proba(data_for_prediction_array)
explainer = shap.TreeExplainer(rf_boruta)
# Calculate Shap values
shap_values = explainer.shap_values(data_for_prediction)
shap.plots._waterfall.waterfall_legacy(explainer.expected_value[0], shap_values[0],ord_test_t.iloc[row_to_show])
This generated the plot as shown below
However, I want to export these values to a dataframe. How can I do that?
I expect my output to look like the example shown below, and I want to export it for the full dataframe. Can you help me, please?
Let's do a small experiment:
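import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer
# as_frame=True returns X as a DataFrame, so X.columns is available later
X, y = load_breast_cancer(return_X_y=True, as_frame=True)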
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
What is explainer here? If you do dir(explainer) you'll find out it has some methods and attributes among which is:
explainer.expected_value
which is of interest to you because it is the base value to which the SHAP values of each row are added to reach the model output.
Furthermore:
sv = explainer.shap_values(X)
len(sv)
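will give a hint: in older SHAP releases sv is a list of 2 arrays, which are most probably the SHAP values for classes 0 and 1. They must be symmetric (whatever moves the prediction towards 1 moves it by exactly the same amount, with opposite sign, towards 0). Newer SHAP releases may instead return a single array of shape (n_samples, n_features, n_classes), in which case the class 1 values are sv[:, :, 1].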
Hence:
sv1 = sv[1]
Now you have everything to pack it to the desired format:
df = pd.DataFrame(sv1, columns=X.columns)
df.insert(0, 'bv', explainer.expected_value[1])
Q: How do I know?
A: Read docs and source code.
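Since the return type of shap_values differs between SHAP releases, the packing step can be written so that it accepts either layout. The helper below is a hypothetical sketch (shap_values_to_frame is not part of SHAP), and the arrays are made-up stand-ins for real SHAP output so that it runs without a fitted model.

```python
import numpy as np
import pandas as pd


def shap_values_to_frame(sv, expected_value, feature_names, class_index=1):
    """Hypothetical helper: pack the SHAP values of one class into a DataFrame."""
    if isinstance(sv, list):
        # List layout: one (n_samples, n_features) array per class
        values = np.asarray(sv[class_index])
    else:
        values = np.asarray(sv)
        if values.ndim == 3:
            # Array layout: (n_samples, n_features, n_classes)
            values = values[:, :, class_index]
    base = np.ravel(expected_value)
    base_value = base[class_index] if base.size > 1 else base[0]
    df = pd.DataFrame(values, columns=list(feature_names))
    df.insert(0, "bv", base_value)  # base value as the first column
    return df


# Made-up stand-ins for explainer.shap_values(X) and explainer.expected_value
rng = np.random.default_rng(0)
sv_class1 = rng.normal(size=(4, 3))
sv_list = [-sv_class1, sv_class1]      # symmetric values for class 0 and 1
sv_3d = np.stack(sv_list, axis=-1)     # same values in the 3-D layout
expected = np.array([0.4, 0.6])
names = ["f0", "f1", "f2"]

df_from_list = shap_values_to_frame(sv_list, expected, names)
df_from_array = shap_values_to_frame(sv_3d, expected, names)
print(df_from_list)
```

With real SHAP output, the base value plus the sum of a row's SHAP values should reproduce the predicted probability for that row, which is a useful sanity check on the export.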
If I recall correctly, you can do something like this with pandas
import pandas as pd
shap_values = explainer.shap_values(data_for_prediction)
shap_values_df = pd.DataFrame(shap_values)
To get the feature names, you can do something like this (if data_for_prediction is a dataframe and shap_values is a single 2D array):
feature_names = data_for_prediction.columns.tolist()
shap_df = pd.DataFrame(shap_values, columns=feature_names)
I'm currently using this:
def getShapReport(classifier, X_test):
    shap_values = shap.TreeExplainer(classifier).shap_values(X_test)
    shap.summary_plot(shap_values, X_test)
    shap.summary_plot(shap_values[1], X_test)
    return pd.DataFrame(shap_values[1])
It first displays the SHAP values for the model, then for each prediction, and finally returns the dataframe for the positive class (I'm in an imbalanced context).
It is for a Tree explainer and not a waterfall, but it is basically the same.
I am running some regression models to predict performance.
After running the models I created a variable to see the predictions (y_pred_* are lists with 2567 values):
y_pred_LR = regressor.predict(X_test)
y_pred_SVR = regressor2.predict(X_test)
y_pred_RF = regressor3.predict(X_test)
The prediction variables are arrays of float64, while y_test is a DataFrame.
I wanted to create a table with the results. I tried several approaches (calling them as lists, converting them, selecting the values) and have not succeeded so far. Could anyone help?
My last trial was like below:
comparison = pd.DataFrame({'Real': y_test, 'LR': y_pred_LR, 'RF': y_pred_RF, 'SVM': y_pred_SVR})
In this case the DataFrame is created but the values don't appear.
Additionally, I would like to add two new rows with the mean and standard deviation of the results, placed at the beginning (the first rows) of the DataFrame.
Thanks
import pandas as pd
import numpy as np
real = np.array([2] * 10).reshape(-1,1)
y_pred_LR = np.array([0] * 10)
y_pred_SVR = np.array([1] * 10)
y_pred_RF = np.array([5] * 10)
real = real.flatten()
comparison = pd.DataFrame({'real':real,'y_pred_LR':y_pred_LR,'y_pred_SVR':y_pred_SVR,"y_pred_RF":y_pred_RF})
Mean = comparison.mean(axis=0)
StD = comparison.std(axis=0)
Mean_StD = pd.concat([Mean,StD],axis=1).T
result = pd.concat([Mean_StD,comparison],ignore_index=True)
print(result)
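A variant of the same approach, as a minimal sketch: it assumes y_test is a single-column DataFrame (the question does not show it, so the small frames below are made-up stand-ins) and keeps the labels "mean" and "std" on the two summary rows instead of discarding them with ignore_index.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the question's objects (assumption: y_test has one column)
y_test = pd.DataFrame({"target": [2.0, 4.0, 6.0]}, index=[10, 11, 12])
y_pred_LR = np.array([1.0, 3.0, 5.0])
y_pred_RF = np.array([2.0, 2.0, 2.0])

# Flatten the DataFrame to a 1-D array so every column has the same shape
comparison = pd.DataFrame({
    "Real": y_test.to_numpy().ravel(),
    "LR": y_pred_LR,
    "RF": y_pred_RF,
})

# agg returns one row per statistic, indexed by the statistic name
summary = comparison.agg(["mean", "std"])

# Put the summary rows first, followed by the per-sample rows
result = pd.concat([summary, comparison])
print(result)
```

The summary rows can then be read back by name, e.g. result.loc["mean"].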
For a regression problem, I have a training data set with :
- 3 variables with a gaussian distribution
- 20 variables with a uniform distribution.
All my variables are continuous, in [0;1].
The problem is that the test data used to score my regression model has a uniform distribution for all the variables.
As a result, I get bad results at the tails of the distribution, so I want to oversample my training set in order to duplicate the rarest rows.
My idea is to bootstrap (sampling with replacement) my training set in order to get a data set with the same distribution as the test set.
To do that, my idea (I don't know if it's a good one!) is to add 3 columns with intervals for my 3 variables and use these columns to stratify the resampling.
Example :
First, generating the data
from scipy.stats import truncnorm
def get_truncated_normal(mean=0.5, sd=0.15, min_value=0, max_value=1):
    return truncnorm(
        (min_value - mean) / sd, (max_value - mean) / sd, loc=mean, scale=sd)
generator = get_truncated_normal()
import numpy as np
from sklearn.preprocessing import MinMaxScaler
S1 = generator.rvs(1000)
S2 = generator.rvs(1000)
S3 = generator.rvs(1000)
u = np.random.uniform(0, 1, 1000)
Then check the distribution :
import seaborn as sns
sns.distplot(u);
sns.distplot(S2);
It's OK, so I'll add category columns:
import pandas as pd
df = pd.DataFrame({'S1':S1,'S2':S2,'S3':S3,'Unif':u})
BINS_NUMBER = 10
df['S1_range'] = pd.cut(df.S1,
bins=BINS_NUMBER,
precision=6,
right=True,
include_lowest=True)
df['S2_range'] = pd.cut(df.S2,
bins=BINS_NUMBER,
precision=6,
right=True,
include_lowest=True)
df['S3_range'] = pd.cut(df.S3,
bins=BINS_NUMBER,
precision=6,
right=True,
include_lowest=True)
A quick check:
df.groupby('S1_range').size()
S1_range
(0.022025899999999998, 0.116709] 3
(0.116709, 0.210454] 15
(0.210454, 0.304199] 64
(0.304199, 0.397944] 152
(0.397944, 0.491689] 254
(0.491689, 0.585434] 217
(0.585434, 0.679179] 173
(0.679179, 0.772924] 86
(0.772924, 0.866669] 30
(0.866669, 0.960414] 6
dtype: int64
It's good for me.
So now I'll try to resample but it's not working as intended
from sklearn.utils import resample
df_resampled = resample(df,replace=True,n_samples=1000, stratify=df['S1_range'])
df_resampled.groupby('S1_range').size()
S1_range
(0.022025899999999998, 0.116709] 3
(0.116709, 0.210454] 15
(0.210454, 0.304199] 64
(0.304199, 0.397944] 152
(0.397944, 0.491689] 254
(0.491689, 0.585434] 217
(0.585434, 0.679179] 173
(0.679179, 0.772924] 86
(0.772924, 0.866669] 30
(0.866669, 0.960414] 6
dtype: int64
So it's not working: I get the same distribution in the output as in the input...
Can you help me?
Perhaps this is not the right way to do it?
Thanks !!
Rather than writing code from scratch to resample your continuous data, you should take advantage of a library for resampling regression data.
Whereas the popular libraries (imbalanced-learn, etc.) focus on classification (categorical) variables, there is a recent Python library called resreg (RESampling for REGression) that allows you to resample your continuous data (resreg GitHub page).
Also, rather than bootstrapping, you may want to generate synthetic data points at the tail ends of your normally distributed variables, as doing this will likely lead to much better results (see this paper). Similar to SMOTE for classification, which interpolates between features, you can use SMOTER (SMOTE for regression) in the resreg package to generate synthetic values in regression/continuous data.
Here is an example of how you would use resreg to achieve resampling with a few lines of code:
import numpy as np
import resreg
cl = np.percentile(y,10) # Oversample values less than the 10th percentile
ch = np.percentile(y,90) # Oversample values greater than the 90th percentile
# Assign relevance scores to indicate which samples in your dataset are
# to be resampled. Values below cl and above ch are assigned a relevance
# value above 0.5, other values are assigned a relevance value below 0.5
relevance = resreg.sigmoid_relevance(X, y, cl=cl, ch=ch)
# Resample the relevant values (i.e relevance >= 0.5) by interpolating
# between nearest k-neighbors (k=5). By setting over='balance', the
# relevant values are oversampled so that the number of relevant and
# irrelevant values are equal
X_res, y_res = resreg.smoter(X, y, relevance=relevance, relevance_threshold=0.5, k=5, over='balance', random_state=0)
My solution:
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

def create_sampled_data_set(n_samples_by_bin=1000,
                            n_bins=10,
                            replace=True,
                            save_csv=True):
    """In order to have the same distribution for S1..S3 between training
    set and test set, this function will generate a new
    training set resampled
    Return: (X_train, y_train)
    """
    def stratified_sample_df_(df, col, n_samples, replace=True):
        # draw the same number of rows from every class of `col`
        if replace:
            n = n_samples
        else:
            n = min(n_samples, df[col].value_counts().min())
        df_ = df.groupby(col).apply(lambda x: x.sample(n, replace=replace))
        df_.index = df_.index.droplevel(0)
        return df_

    # load_data_for_train() is my own loader, not shown here
    X_train, y_train = load_data_for_train()
    # merge the dataframe for the sampling. Target will be removed after
    X_train = pd.merge(
        X_train, y_train[['Target']], left_index=True, right_index=True)
    del y_train
    # build a categorical feature, from S1..S3 distribution
    disc = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='kmeans')
    disc.fit(X_train[['S1', 'S2', 'S3']])
    # np.int was removed from recent NumPy versions; astype(int) does the same
    y_bin = disc.transform(X_train[['S1', 'S2', 'S3']]).astype(int)
    del disc
    # one class label per row, e.g. "3;7;1" for the bins of S1, S2, S3
    X_train['S_Class'] = [';'.join(map(str, row)) for row in y_bin]
    del y_bin
    X_train_resampled = stratified_sample_df_(
        X_train, 'S_Class', n_samples_by_bin, replace=replace)
    del X_train
    y_train_resampled = X_train_resampled[['Target']].copy()
    y_train_resampled.rename(
        columns={y_train_resampled.columns[0]: 'Target'}, inplace=True)
    X_train_resampled = X_train_resampled.drop(['S_Class', 'Target'], axis=1)
    # save in file for further usage
    if save_csv:
        X_train_resampled.to_csv(
            "./data/training_input_resampled.csv", sep=",")
        y_train_resampled.to_csv(
            "./data/training_output_resampled.csv", sep=",")
    return (X_train_resampled,
            y_train_resampled)
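For context on why the attempt in the question kept the input distribution: the stratify argument of sklearn.utils.resample preserves the class proportions of the column it is given, so the bin counts come out unchanged. Drawing the same number of rows from every bin flattens the distribution instead. Below is a compact sketch of that idea on synthetic data; the column names, bin count and per-bin sample size are illustrative assumptions, and it uses a single variable rather than the three combined above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training set: one bell-shaped and one uniform column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "S1": rng.normal(0.5, 0.15, 1000).clip(0, 1),
    "Unif": rng.uniform(0, 1, 1000),
})

# Equal-width bins over the observed range of S1
df["S1_range"] = pd.cut(df["S1"], bins=10, include_lowest=True)

# Draw the same number of rows (with replacement) from every non-empty bin,
# so rare tail bins are duplicated and dense central bins are thinned out
df_flat = (
    df.groupby("S1_range", observed=True)
      .sample(n=100, replace=True, random_state=0)
)

counts = df_flat.groupby("S1_range", observed=True).size()
print(counts)
```

Bins with very few original rows end up as many copies of the same points, which is the trade-off the synthetic-sampling answer above is trying to avoid.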
I have been struggling with this one for a while.
My goal is to take a text feature that I have and find the best 5-10 words in it to help me classify. Hence, I am running a TfidfVectorizer and choosing the ~90 best for now. However, after I reduce the number of features, I am unable to see which features were actually chosen.
Here is what I have:
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
train=pandas.read_csv("train.tsv", sep='\t')
labels_train = train["label"]
documents = []
for i, row in train.iterrows():
    documents.append((row['boilerplate'][1:-1].lower()))
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
features_train_transformed = vectorizer.fit_transform(documents)
selector = SelectPercentile(f_classif, percentile=0.1)
selector.fit(features_train_transformed, labels_train)
features_train_transformed = selector.transform(features_train_transformed).toarray()
The result is that features_train_transformed contains a matrix of all the tf-idf scores per word per document for the selected words. However, I have no idea which words were chosen, and methods like get_feature_names() are unavailable on the SelectPercentile class.
This is necessary because I need to add these features to a set of numeric features and only then do my training and predictions.
Use selector.get_support() to get a boolean array marking the columns that fall within the percentile range you specified.
train.columns.values gives the complete list of column names of the original dataframe (for a TfidfVectorizer, the names come from the fitted vectorizer instead, e.g. vectorizer.get_feature_names_out() in recent scikit-learn versions).
Filtering the latter with the former gives the names of the columns that make up your chosen percentile range.
The code below (cut and pasted from working code) is similar enough to yours that it is hopefully helpful:
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression
selection = SelectPercentile(f_regression, percentile=2)
train_minus_target = train.drop("y", axis=1)
x_features = selection.fit_transform(train_minus_target, y_train)
columns = np.asarray(train_minus_target.columns.values)
support = np.asarray(selection.get_support())
columns_with_support = columns[support]
Reference:
about get_support
I use scikit-learn linear regression, and if I change the order of the features, the coefficients are still printed in the same order. Hence, I would like to know the mapping between the features and the coefficients.
#training the model
model_1_features = ['sqft_living', 'bathrooms', 'bedrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
model_2 = linear_model.LinearRegression()
model_2.fit(train_data[model_2_features], train_data['price'])
model_3 = linear_model.LinearRegression()
model_3.fit(train_data[model_3_features], train_data['price'])
# extracting the coef
print(model_1.coef_)
print(model_2.coef_)
print(model_3.coef_)
The trick is that right after you have trained your model, you know the order of the coefficients:
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
print(list(zip(model_1.coef_, model_1_features)))
This will print the coefficients and the correct feature. (Tested with pandas DataFrame)
If you want to reuse the coefficients later you can also put them in a dictionary:
coef_dict = {}
for coef, feat in zip(model_1.coef_,model_1_features):
    coef_dict[feat] = coef
(You can test it for yourself by training two models with the same features but, as you said, shuffled order of features.)
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
coef_table = pd.DataFrame(list(X_train.columns)).copy()
coef_table.insert(len(coef_table.columns),"Coefs",regressor.coef_.transpose())
Robin posted a great answer, but I had to make one tweak for it to work the way I wanted: index into the dimension of the coef_ array that I needed, i.e. use model_1.coef_[0,:], as below:
coef_dict = {}
for coef, feat in zip(model_1.coef_[0,:],model_1_features):
    coef_dict[feat] = coef
Then the dict was created as I pictured it, with {'feature_name' : coefficient_value} pairs.
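One situation where this extra index is needed is when the target is passed as a 2-D array or a single-column DataFrame: LinearRegression then stores coef_ with shape (n_targets, n_features) rather than (n_features,). The sketch below illustrates this on synthetic, noise-free data; the feature names "a", "b", "c" are placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients (1.0, -2.0, 0.5) and intercept 3.0
rng = np.random.default_rng(0)
X = rng.random((20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0

# 1-D target: coef_ is 1-D
coef_1d = LinearRegression().fit(X, y).coef_

# 2-D target with one column: coef_ gains a leading "target" axis
coef_2d = LinearRegression().fit(X, y.reshape(-1, 1)).coef_

print(coef_1d.shape, coef_2d.shape)

# np.ravel flattens either shape, so the zip works in both cases
coef_dict = dict(zip(["a", "b", "c"], np.ravel(coef_2d)))
print(coef_dict)
```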
Here is what I use for pretty printing of coefficients in Jupyter. I'm not sure I follow why order is an issue - as far as I know the order of the coefficients should match the order of the input data that you gave it.
Note that the first line assumes you have a Pandas data frame called df in which you originally stored the data prior to turning it into a numpy array for regression:
fieldList = np.array(list(df)).reshape(-1,1)
coeffs = np.reshape(np.round(clf.coef_,5),(-1,1))
coeffs=np.concatenate((fieldList,coeffs),axis=1)
print(pd.DataFrame(coeffs,columns=['Field','Coeff']))
Borrowing from Robin, but simplifying the syntax:
coef_dict = dict(zip(model_1_features, model_1.coef_))
Important note about zip: zip assumes its inputs are of equal length, making it especially important to confirm that the lengths of the features and coefficients match (which in more complicated models might not be the case). If one input is longer than the other, the longer input will have the values in its extra index positions cut off. Notice the missing 7 in the following example:
In [1]: [i for i in zip([1, 2, 3], [4, 5, 6, 7])]
Out[1]: [(1, 4), (2, 5), (3, 6)]
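To guard against this silent truncation, one option is to check the lengths before zipping. The helper below is a hypothetical sketch (paired_coefficients is not part of scikit-learn, and the numbers are made up). On Python 3.10 and later, zip(features, coefs, strict=True) raises a ValueError for the same situation.

```python
def paired_coefficients(feature_names, coefs):
    """Map feature names to coefficients, refusing mismatched lengths."""
    if len(feature_names) != len(coefs):
        raise ValueError(
            f"{len(feature_names)} feature names but {len(coefs)} coefficients"
        )
    return dict(zip(feature_names, coefs))


features = ["sqft_living", "bathrooms", "bedrooms"]

# Matching lengths: a plain name -> coefficient mapping
good = paired_coefficients(features, [310.2, -18.4, 52.7])

# One coefficient too many: fail loudly instead of dropping the last value
try:
    paired_coefficients(features, [310.2, -18.4, 52.7, 9.9])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

print(good, mismatch_detected)
```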
pd.DataFrame(data=regression.coef_, index=X_train.columns)
All of these answers were great, but what personally worked for me was this, as the feature names I needed were the columns of my train_data dataframe:
pd.DataFrame(data=model_1.coef_,columns=train_data.columns)
Right after training the model, the coefficient values are stored in the variable model.coef_[0]. We can iterate over the column names and store the column name and their coefficient value in a dictionary.
model.fit(X_train,y)
# assuming all the columns except the last one are used in training
columns = data.iloc[:, :-1].columns
coef_dict = {}
for i in range(0,len(columns)):
    coef_dict[columns[i]] = model.coef_[0][i]
Hope this helps!
As of scikit-learn version 1.0, the LinearRegression estimator has a feature_names_in_ attribute. From the docs:
feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
Assuming you're fitting on a pandas.DataFrame (train_data), your estimators (model_1, model_2, and model_3) will have the attribute. You can line up your coefficients using any of the methods listed in previous answers, but I'm in favor of this one:
coef_series = pd.Series(
data=model_1.coef_,
index=model_1.feature_names_in_
)
A minimally reproducible example
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# for repeatability
np.random.seed(0)
# random data
Xy = pd.DataFrame(
data=np.random.random((10, 3)),
columns=["x0", "x1", "y"]
)
# separate X and y
X = Xy.drop(columns="y")
y = Xy.y
# initialize estimator
lr = LinearRegression()
# fit to pandas.DataFrame
lr.fit(X, y)
# get coefficients and their respective feature names
coef_series = pd.Series(
data=lr.coef_,
index=lr.feature_names_in_
)
print(coef_series)
x0 0.230524
x1 -0.275611
dtype: float64