SHAP: Combine results across validation splits - python

I am hoping to combine (and plot) SHAP results across validation splits for my XGBoost model. The closest I have found online is this with k-fold CV, but when I try both k-fold and train_test_split, I'm thrown this error:
AssertionError: The shape of the shap_values matrix does not match the shape of the provided data matrix.
For reproducibility, I fetched the data from here.
Below is my code, adapted slightly to work with my own data. A couple of notes:
shap.summary_plot(shap_values[1], X_test) is changed to shap.summary_plot(shap_values, X_test), as otherwise I was given this error: AssertionError: Summary plots need a matrix of shap_values, not a vector.
I used Explainer rather than TreeExplainer, as that was what I was able to get running.
import warnings

import numpy as np
import pandas as pd
import shap
from sklearn.model_selection import KFold
from tqdm import tqdm
from xgboost import XGBClassifier

warnings.simplefilter(action='ignore', category=FutureWarning)

mod_dir = 'C:/Users/User/OneDrive - UHN/ML examples/'
df = pd.read_csv('{}heart_disease.csv'.format(mod_dir))
model_vars = pd.read_csv('{}hd_model_vars.csv'.format(mod_dir))

cat_vars = model_vars[model_vars['data_type'] == "category"]
cat_vars = cat_vars['variable'].to_list()
df[cat_vars] = df[cat_vars].astype("category")

ids_outcome = df[['id', 'out']]
df = df.drop('out', axis=1)

xgb = XGBClassifier(enable_categorical=True, eval_metric="logloss",
                    use_label_encoder=False, tree_method="hist")

x = df.copy()
y = ids_outcome.copy()
y['out'] = y['out'].astype(int)

ls_shap_values = []
ls_x_val = []

for i in tqdm(range(1, 4)):
    kf = KFold(n_splits=3, shuffle=True, random_state=i)
    for train_index, val_index in kf.split(x):
        pass  # only the last fold's indices are kept
    x_train = x.iloc[train_index]
    y_train = y.iloc[train_index]
    x_val = x.iloc[val_index]
    y_val = y.iloc[val_index]
    # Save IDs for merging later
    train_ids = x_train[['id']]
    val_ids = x_val[['id']]
    # Set ID column as index for modelling
    x_train = x_train.set_index('id')
    y_train = y_train.set_index('id')
    x_val = x_val.set_index('id')
    y_val = y_val.set_index('id')
    xgb.fit(x_train, y_train)
    ls_x_val.append(val_index)
    explainer = shap.Explainer(xgb.predict, x_val)
    shap_values = explainer(x_val)
    ls_shap_values.append(shap_values)

val_set = ls_x_val[0]
shap_values = np.array(ls_shap_values[0])
for i in range(1, 3):
    test_set = np.concatenate((val_set, ls_x_val[i]), axis=0)
    shap_values = np.concatenate((shap_values, np.array(ls_shap_values[i])), axis=1)

# bringing back variable names
X_val = pd.DataFrame(x.iloc[test_set], columns=x.columns)
shap.summary_plot(shap_values, X_val)
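For what it's worth, here is a minimal sketch of one way the fold results could be combined so the shapes stay consistent: stack the SHAP values along the row axis (axis 0) and keep the validation indices in the same order, so each SHAP row still lines up with its feature row. It reuses the ls_shap_values, ls_x_val, and x defined above and should be treated as an illustration rather than a verified fix.

import numpy as np
import pandas as pd
import shap

# Assumes ls_shap_values and ls_x_val were filled as in the loop above.
all_idx = np.concatenate(ls_x_val, axis=0)                  # row positions of every validation sample
all_shap = np.concatenate([sv.values for sv in ls_shap_values], axis=0)  # stack SHAP rows, not columns

# Rebuild the feature matrix in the same row order, with 'id' as index like during fitting.
X_val_all = x.iloc[all_idx].set_index('id')
shap.summary_plot(all_shap, X_val_all)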

Related

'numpy.ndarray' object has no attribute 'columns'

I was following a machine learning tutorial on YouTube and using this dataset. However, while the person in the video had no problem running the code, I received an error that the numpy.ndarray object has no attribute 'columns'.
Below is the code I ran:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler

cols = ['integrated_mean', 'integrated_standard_deviation', 'integrated_excess_kurtosis',
        'integrated_skewness', 'DM_mean', 'DM_standard_deviation', 'DM_excess_kurtosis',
        'DM_skewness', 'class']
df = pd.read_csv("HTRU_2.data", names=cols)

train, valid, test = np.split(df.sample(frac=1), [int(0.6 * len(df)), int(0.8 * len(df))])

def scale_dataset(dataframe, oversample=False):
    X = dataframe[dataframe.columns[:-1]].values
    y = dataframe[dataframe.columns[-1]].values
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    if oversample:
        ros = RandomOverSampler()
        X, y = ros.fit_resample(X, y)
    data = np.hstack((X, np.reshape(y, (-1, 1))))
    return data, X, y

train, X_train, y_train = scale_dataset(train, oversample=True)
valid, X_train, y_train = scale_dataset(train, oversample=False)
test, X_train, y_train = scale_dataset(train, oversample=False)
I do not know what is happening or how to fix it; I've tried searching elsewhere but had no luck. If anyone can help it would be much appreciated.
I couldn't find the exact minute in the tutorial, but maybe it's just a consequence of copy-paste.
In the function scale_dataset you turn data into a numpy array and then assign that value to the train variable. When you call scale_dataset again for the valid data set, you try to use this train data set as a pandas DataFrame, but at that point it is a numpy array.
My common sense tells me you want to use the valid data set instead of train, and so on, like this:
train, X_train, y_train = scale_dataset(train, oversample = True)
valid, X_train, y_train = scale_dataset(valid, oversample = False)
test, X_train, y_train = scale_dataset(test, oversample = False)
Instead of
X = dataframe[dataframe.columns[:-1]].values
y = dataframe[dataframe.columns[-1]].values
I did
X = dataframe[:, :-1]
y = dataframe[:, -1]
and now all the code works fine.
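For completeness, a minimal sketch (assuming the same scale_dataset function as above) that both passes the right split to each call and keeps separate X/y variables per split, instead of overwriting X_train and y_train three times:

# Each call scales (and optionally oversamples) one split and returns that split's own arrays.
train, X_train, y_train = scale_dataset(train, oversample=True)
valid, X_valid, y_valid = scale_dataset(valid, oversample=False)
test, X_test, y_test = scale_dataset(test, oversample=False)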

Why is my machine learning SVM code not converging to a predicted value?

I have been writing this code and I have gotten to the point where it runs, but unfortunately it does not converge. Could someone please have a look, because I have checked many things and am not sure why it isn't converging. The data set is from here: https://github.com/nshomron/covidpred/blob/master/data/corona_tested_individuals_ver_006.english.csv.zip
I have split the code up to make it a bit clearer:
#---------- IMPORTS ----------
import numpy as np
import matplotlib as plt
from numpy.core.defchararray import index
import pandas as pd
from pandas.core.tools.datetimes import Scalar
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn import svm
#---------- PREPROCESSING ----------
#---------- Import data ----------
data = pd.read_csv(r'C:\Users\Saaqib\Documents\Python\PythonProjects\Covidproject\corona_tested_individuals.csv', )
X = data.loc[:, data.columns != 'corona_result']
X = X.loc[:, X.columns != 'test_date']
y = data.iloc[:,6]
#---------- Encode data ----------
Le_X = LabelEncoder()
X['age_60_and_above'] = Le_X.fit_transform(X['age_60_and_above'])
X['gender'] = Le_X.fit_transform(X['gender'])
X['test_indication'] = Le_X.fit_transform(X['test_indication'])
# print('data=',X)
y = Le_X.fit_transform(y)
y = np.array(y)
Hot_enc_X = OneHotEncoder()
enc_X = pd.DataFrame(Hot_enc_X.fit_transform(X[['gender','test_indication']]).toarray())
X = X.join(enc_X)
X = X.drop(columns=['gender','test_indication'])
X = X.replace("None", float('nan'))
X["cough"] = X["cough"].fillna(0)
X["fever"] = X["fever"].fillna(0)
X["sore_throat"] = X["sore_throat"].fillna(0)
X["shortness_of_breath"] = X["shortness_of_breath"].fillna(0)
X["head_ache"] = X["head_ache"].fillna(0)
X["age_60_and_above"] = X["age_60_and_above"].fillna(0)
X['cough'] = X['cough'].astype(float)
X['fever'] = X['fever'].astype(float)
X['sore_throat'] = X['sore_throat'].astype(float)
X['shortness_of_breath'] = X['shortness_of_breath'].astype(float)
X['head_ache'] = X['head_ache'].astype(float)
X['age_60_and_above'] = X['age_60_and_above'].astype(float)
#---------- Split data set ----------
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
#---------- Train Model ----------
covid_model = svm.SVC(kernel='linear')
covid_model.fit(X_train, y_train)
predictions = covid_model.predict(X_test)
acc = accuracy_score(y_test,predictions)
print("pred:", predictions)
print("acc:", acc)

Training loop for XGBoost on different datasets

I have developed several different datasets and I want to write a for loop to train on each of them and, at the end, report the RMSE for each dataset. I tried a for loop, but it does not work: it gives back the same value for every dataset, while I know the values should be different. The code I have written is below:
for i in NEW_middle_index:
    DF = df1.iloc[i-100:i+100, :]
    # Append an empty sublist inside the list
    FINAL_DF.append(DF)
    y = DF.iloc[:, 3]
    X = DF.drop(columns='Target')
    index_train = int(0.7 * len(X))
    X_train = X[:index_train]
    y_train = y[:index_train]
    X_test = X[index_train:]
    y_test = y[index_train:]
    scaler_x = MinMaxScaler().fit(X_train)
    X_train = scaler_x.transform(X_train)
    X_test = scaler_x.transform(X_test)

xgb_r = xg.XGBRegressor(objective='reg:linear',
                        n_estimators=20, seed=123)

for i in range(len(NEW_middle_index)):
    # print(i)
    # Fitting the model
    xgb_r.fit(X_train, y_train)
    # Predict the model
    pred = xgb_r.predict(X_test)
    # RMSE Computation
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    # print(rmse)
    RMSE.append(rmse)
I'm not sure if you indented it correctly. You are overwriting X_train and X_test, so when you fit your model it is always on the same dataset, hence you get the same results.
One option is to fit the model as soon as you create each train/test split. Alternatively, if you want to keep the train/test sets, you could do something like below and store them in a list of dictionaries, without changing too much of your code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import xgboost as xg

df1 = pd.DataFrame(np.random.normal(0, 1, (600, 3)))
df1['Target'] = np.random.uniform(0, 1, 600)
NEW_middle_index = [100, 300, 500]

NEWDF = []
for i in NEW_middle_index:
    y = df1.iloc[i-100:i+100, 3]
    X = df1.iloc[i-100:i+100, :].drop(columns='Target')
    index_train = int(0.7 * len(X))
    scaler_x = MinMaxScaler().fit(X)
    X_train = scaler_x.transform(X[:index_train])
    y_train = y[:index_train]
    X_test = scaler_x.transform(X[index_train:])
    y_test = y[index_train:]
    NEWDF.append({'X_train': X_train, 'y_train': y_train, 'X_test': X_test, 'y_test': y_test})
Then we fit and calculate RMSE:
RMSE = []
xgb_r = xg.XGBRegressor(objective='reg:linear', n_estimators=20, seed=123)
for i in range(len(NEW_middle_index)):
    xgb_r.fit(NEWDF[i]['X_train'], NEWDF[i]['y_train'])
    pred = xgb_r.predict(NEWDF[i]['X_test'])
    rmse = np.sqrt(mean_squared_error(NEWDF[i]['y_test'], pred))
    RMSE.append(rmse)

RMSE
[0.3524827559800294, 0.3098101362502435, 0.3843173269966071]
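The first option mentioned above (fitting as soon as each split is created) might look roughly like this sketch, reusing the imports, df1, and NEW_middle_index defined just above; it fits the scaler on the training slice only, which is a minor change from the snippet above:

RMSE = []
xgb_r = xg.XGBRegressor(objective='reg:linear', n_estimators=20, seed=123)

for i in NEW_middle_index:
    y = df1.iloc[i-100:i+100, 3]
    X = df1.iloc[i-100:i+100, :].drop(columns='Target')
    index_train = int(0.7 * len(X))
    # Scale using the training slice, then transform both portions
    scaler_x = MinMaxScaler().fit(X[:index_train])
    X_train = scaler_x.transform(X[:index_train])
    y_train = y[:index_train]
    X_test = scaler_x.transform(X[index_train:])
    y_test = y[index_train:]
    # Fit and score immediately, before the split is overwritten on the next iteration
    xgb_r.fit(X_train, y_train)
    pred = xgb_r.predict(X_test)
    RMSE.append(np.sqrt(mean_squared_error(y_test, pred)))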

Is there any way I can optimize the code for a Logistic Regression model?

X_train = np.asarray(X_train)
y_train = np.asarray(y_train)
X_test = np.asarray(X_test)
y_test = np.asarray(y_test)
X_val = np.asarray(X_valid)
y_val = np.asarray(y_valid)

import cv2
X_train_full = []
X_test_full = []
X_valid_full = []
for i in X_train:
    res = cv2.resize(i, dsize=(28, 28), interpolation=cv2.INTER_CUBIC)
    X_train_full.append(res)
for i in X_test:
    res = cv2.resize(i, dsize=(28, 28), interpolation=cv2.INTER_CUBIC)
    X_test_full.append(res)
for i in X_val:
    res = cv2.resize(i, dsize=(28, 28), interpolation=cv2.INTER_CUBIC)
    X_valid_full.append(res)
4.
X_train_full = np.asarray(X_train_full)
X_test_full = np.asarray(X_test_full)
X_valid_full = np.asarray(X_valid_full)
5.
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train_full)
X_train_full = scaler.transform(X_train_full)
X_test_full = scaler.transform(X_test_full)
6.
from sklearn.linear_model import LogisticRegression

models = list()
accuracy = list()
save = 'svm/'
name = 'svm'
for i in range(len(dataset)):
    name = 'model' + str(i)
    data = dataset[i]
    X_train = data[0][0]
    y_train = data[0][1]
    X_test = data[1][0]
    y_test = data[1][1]
    logisticRegr = LogisticRegression(solver='lbfgs', multi_class='multinomial')
    logisticRegr.fit(X_train, y_train)
    prediction = logisticRegr.predict(X_test)
    accuracy.append(accuracy_score(y_test, prediction))
    print('Accuracy ', str(i), ': ', accuracy_score(y_test, prediction))
7.
logisticRegr = LogisticRegression(solver='lbfgs', multi_class='multinomial')
logisticRegr.fit(X_train, y_train)
predictions = logisticRegr.predict(X_test)
After I run the StandardScaler() part in my Jupyter/Colab notebook, it crashes because the memory is over-allocated. Is there a way I can fix the code for the LogisticRegression model?
First, I load the datasets, which consist of 161 folders with 500 samples in each folder.
Then I run random() to shuffle the data across all of them,
and create X_train_full to hold the resized images.
Then I perform the StandardScaler on my latest X_train_full, but it crashes. Is there any solution, given that I have already resized my image dimensions to 28x28 from 192x256?
Well, you have a lot of data, and loading all of it at once while performing operations on it is leading to the memory crash.
In such a scenario, we should use dataset pipelines like TensorFlow's tf.data.Dataset. This will load your data in batches rather than all at once and is very memory efficient.
If you are using PyTorch, then you can use torch.utils.data.DataLoader, which is also a data loader.
For more information visit
TensorFlow
https://www.tensorflow.org/api_docs/python/tf/data/Dataset
PyTorch
https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
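As an illustration of the tf.data approach, here is a minimal sketch of building a batched dataset from image arrays shaped like the question's resized data; the arrays below are hypothetical placeholders, not the asker's actual data:

import numpy as np
import tensorflow as tf

# Hypothetical placeholder arrays with the question's resized shape (N, 28, 28)
X_train_full = np.random.rand(1000, 28, 28).astype("float32")
y_train = np.random.randint(0, 161, size=1000)

# Build a pipeline that shuffles and yields small batches instead of the whole array at once
train_ds = (
    tf.data.Dataset.from_tensor_slices((X_train_full, y_train))
    .shuffle(buffer_size=1000)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

for batch_x, batch_y in train_ds.take(1):
    print(batch_x.shape, batch_y.shape)  # (32, 28, 28) (32,)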

Why is my output dataframe shape not 1459 x 2 but 1460 x 2

Below is what I have done so far.
#importing the necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.linear_model import ElasticNetCV
from sklearn.ensemble import RandomForestRegressor
filepath = r"C:\Users...Kaggle data\house prediction iowa\house_predtrain (3).csv"
train = pd.read_csv(filepath)
print(train.shape)
filepath2 = r"C:\Users...Kaggle data\house prediction iowa\house_predtest (1).csv"
test = pd.read_csv (filepath2)
print(test.shape)
#first we replace all the NaNs by 0 in both the train and test data
train = train.fillna(0)
test = test.fillna(0) #error one
train.dtypes.value_counts()
#isolating all the object/categorical feature and converting them to numeric features
encode_cols = train.dtypes[train.dtypes == np.object]
encode_cols2 = test.dtypes[test.dtypes == np.object]
#print(encode_cols)
encode_cols = encode_cols.index.tolist()
encode_cols2 = encode_cols2.index.tolist()
print(encode_cols2)
# Do the one hot encoding
train_dummies = pd.get_dummies(train, columns=encode_cols)
test_dummies = pd.get_dummies(test, columns=encode_cols2)
#align your test and train data (error2)
train, test = train_dummies.align(test_dummies, join = 'left', axis = 1)
print(train.shape)
print(test.shape)
#Now working with Floats features
numericals_floats = train.dtypes == np.float
numericals = train.columns[numericals_floats]
print(numericals)
#we check for skewness in the float data
skew_limit = 0.35
skew_vals = train[numericals].skew()
skew_cols = (skew_vals
             .sort_values(ascending=False)
             .to_frame()
             .rename(columns={0: 'Skewness'}))
skew_cols
#Visualising them above data before and after log transforming
%matplotlib inline
field = 'GarageYrBlt'
fig, (ax_before, ax_after) = plt.subplots(1, 2, figsize=(10,5))
train[field].hist(ax=ax_before)
train[field].apply(np.log1p).hist(ax=ax_after)
ax_before.set (title = 'Before np.log1p', ylabel = 'frequency', xlabel = 'Value')
ax_after.set (title = 'After np.log1p', ylabel = 'frequency', xlabel = 'Value')
fig.suptitle('Field: "{}"'.format (field));
#note how applying log transformation on GarageYrBuilt does not do much
print(skew_cols.index.tolist()) #returns a list of the values
for i in skew_cols.index.tolist():
    if i == "SalePrice":  # we do not want to transform the feature to be predicted
        continue
    train[i] = train[i].apply(np.log1p)
    test[i] = test[i].apply(np.log1p)
feature_cols = [x for x in train.columns if x != ('SalePrice')]
X_train = train[feature_cols]
y_train = train['SalePrice']
X_test = test[feature_cols]
y_test = train['SalePrice']
print(X_test.shape)
print(y_train.shape)
print(X_train.shape)
#now to the most fun part. Feature engineering is over!!!
#i am going to use linear regression, L1 regularization, L2 regularization and ElasticNet(blend of L1 and L2)
#first up, Linear Regression
alphas = [0.00005, 0.0005, 0.005, 0.05, 0.5, 0.1, 0.3, 1, 3, 5, 10, 25, 50, 100] #i chose these
l1_ratios = np.linspace(0.1, 0.9, 9)
#LinearRegression
linearRegression = LinearRegression().fit(X_train, y_train)
prediction1 = linearRegression.predict(X_test)
LR_score = linearRegression.score(X_train, y_train)
print(LR_score)
#ridge
ridgeCV = RidgeCV(alphas=alphas).fit(X_train, y_train)
prediction2 = ridgeCV.predict(X_test)
R_score = ridgeCV.score(X_train, y_train)
print(R_score)
#lasso
lassoCV = LassoCV(alphas=alphas, max_iter=1e2).fit(X_train, y_train)
prediction3 = lassoCV.predict(X_test)
L_score = lassoCV.score(X_train, y_train)
print(L_score)
#elasticNetCV
elasticnetCV = ElasticNetCV(alphas=alphas, l1_ratio=l1_ratios, max_iter=1e2).fit(X_train, y_train)
prediction4 = elasticnetCV.predict(X_test)
EN_score = elasticnetCV.score(X_train, y_train)
print(EN_score)
from sklearn.ensemble import RandomForestRegressor
randfr = RandomForestRegressor()
randfr = randfr.fit(X_train, y_train)
prediction5 = randfr.predict(X_test)
print(prediction5.shape)
RF_score = randfr.score(X_train, y_train)
print(RF_score)
#putting it all together
rmse_vals = [LR_score, R_score, L_score, EN_score, RF_score]
labels = ['Linear', 'Ridge', 'Lasso', 'ElasticNet', 'RandomForest']
rmse_df = pd.Series(rmse_vals, index=labels).to_frame()
rmse_df.rename(columns={0: 'SCORES'}, inplace=1)
rmse_df
KaggleHouse_submission_1 = pd.DataFrame({'Id': test.Id, 'SalePrice': prediction5})
KaggleHouse_submission_1 = KaggleHouse_submission_1
print(KaggleHouse_submission_1.shape)
In the Kaggle house prediction challenge there is a train dataset and a test dataset; here is the link to the actual data: link. The output dataframe size should be 1459 x 2, but mine is 1460 x 2 for some reason. I am not sure why this is happening. Any feedback is highly appreciated.
In the following line:
test = train.fillna(0)
you are assigning (overwriting) the test variable with the train data.
Scikit-learn is also very sensitive to the ordering of columns, so if your train data set and the test data set are misaligned, you may have a problem similar to the one above. You therefore need to first ensure that the test data is encoded the same as the train data by using the following align command:
train, test = train_dummies.align(test_dummies, join='left', axis=1)
See the changes in my code above.
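Putting those two points together, a minimal sketch of the corrected preprocessing steps, using the question's variable names; note that a left-join align can leave NaNs in test for dummy columns it never saw, which would then need filling:

# Fill missing values in each frame separately (do not overwrite test with train)
train = train.fillna(0)
test = test.fillna(0)

# One-hot encode, then align test to the train columns so ordering matches
train_dummies = pd.get_dummies(train, columns=encode_cols)
test_dummies = pd.get_dummies(test, columns=encode_cols2)
train, test = train_dummies.align(test_dummies, join='left', axis=1)

# Columns present in train but absent from test come back as NaN after the align
test = test.fillna(0)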
