I was following a machine learning tutorial on YouTube and using this dataset. However, while the person in the video had no problem running the code, I received an error saying that the 'numpy.ndarray' object has no attribute 'columns'.
Below is the code I ran:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler
cols = ['integrated_mean','integrated_standard_deviation','integrated_excess_kurtosis','integrated_skewness','DM_mean','DM_standard_deviation','DM_excess_kurtosis','DM_skewness','class']
df = pd.read_csv("HTRU_2.data", names = cols)
train, valid, test = np.split(df.sample(frac = 1), [int(0.6*len(df)), int(0.8*len(df))])
def scale_dataset(dataframe, oversample = False):
    X = dataframe[dataframe.columns[:-1]].values
    y = dataframe[dataframe.columns[-1]].values
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    if oversample:
        ros = RandomOverSampler()
        X, y = ros.fit_resample(X, y)
    data = np.hstack((X, np.reshape(y, (-1, 1))))
    return data, X, y
train, X_train, y_train = scale_dataset(train, oversample = True)
valid, X_train, y_train = scale_dataset(train, oversample = False)
test, X_train, y_train = scale_dataset(train, oversample = False)
I do not know what is happening or how to fix it. I've tried searching elsewhere, but I have no idea. If anyone can help, it would be much appreciated.
I couldn't find the exact minute in the tutorial, but maybe it's just a consequence of copy-paste.
In the function scale_dataset you build data as a NumPy array, and then you assign that return value to the train variable. When you call scale_dataset again for the validation set, you pass this `train` variable as if it were still a pandas DataFrame, but at that point it is a NumPy array.
My common sense tells me you want to pass the valid and test data sets instead of train, like this:
train, X_train, y_train = scale_dataset(train, oversample = True)
valid, X_valid, y_valid = scale_dataset(valid, oversample = False)
test, X_test, y_test = scale_dataset(test, oversample = False)
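The rebinding described above can be reproduced on a toy frame. This is only a sketch: the three-column DataFrame and the helper to_arrays are hypothetical stand-ins for the HTRU_2 data and for scale_dataset (scaling and oversampling are left out so that only the type change is visible).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the HTRU_2 data: two features and a label.
df = pd.DataFrame({"f1": [0.1, 0.4, 0.9, 0.3],
                   "f2": [1.0, 2.0, 3.0, 4.0],
                   "class": [0, 1, 0, 1]})

def to_arrays(dataframe):
    # Simplified stand-in for scale_dataset: no scaling, no oversampling.
    X = dataframe[dataframe.columns[:-1]].values
    y = dataframe[dataframe.columns[-1]].values
    data = np.hstack((X, np.reshape(y, (-1, 1))))
    return data, X, y

train_df, valid_df = df.iloc[:2], df.iloc[2:]

# The name `train` refers to a DataFrame before the call and to an ndarray after it.
train = train_df
train, X_train, y_train = to_arrays(train)
print(type(train).__name__)

# Passing that ndarray back in reproduces the error from the question.
try:
    to_arrays(train)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print(error_message)

# Passing the untouched validation DataFrame works.
valid, X_valid, y_valid = to_arrays(valid_df)
print(valid.shape, X_valid.shape, y_valid.shape)
```

Using separate names for the raw DataFrames and the returned arrays, as in this sketch, makes this kind of accidental reuse harder.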
Instead of
X = dataframe[dataframe.columns[:-1]].values
y = dataframe[dataframe.columns[-1]].values
I did
X = dataframe[:, :-1]
y = dataframe[:, -1]
And now all of the code works fine.
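One caveat about the slicing version: dataframe[:, :-1] is NumPy-style indexing, so it works once the argument is an array but raises an error if a pandas DataFrame is passed in. A minimal sketch of a helper that accepts either type, assuming the label is the last column (the name split_features_label is hypothetical):

```python
import numpy as np
import pandas as pd

def split_features_label(data):
    """Return (X, y) from a DataFrame or 2-D array whose last column is the label."""
    arr = np.asarray(data)  # DataFrame -> ndarray; an ndarray passes through unchanged
    return arr[:, :-1], arr[:, -1]

frame = pd.DataFrame({"f1": [0.1, 0.4], "f2": [1.0, 2.0], "class": [0, 1]})

# The same call works for the DataFrame and for its underlying array.
X_from_frame, y_from_frame = split_features_label(frame)
X_from_array, y_from_array = split_features_label(frame.values)
print(X_from_frame.shape, y_from_frame.shape)
```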
I am hoping to combine (and plot) SHAP results across validation splits for my xgboost model. The closest I have found online is this with k-fold CV, but when I try both k-fold and train_test_split, I get this error:
AssertionError: The shape of the shap_values matrix does not match the shape of the provided data matrix.
For reproducibility, I fetched the data from here
Below is my code, adapted a little to work with my own data. A couple of notes:
shap.summary_plot(shap_values[1], X_test) is changed to shap.summary_plot(shap_values, X_test) as otherwise I was given this error: AssertionError: Summary plots need a matrix of shap_values, not a vector.
I used Explainer rather than TreeExplainer as that was what I was able to run
import numpy as np,warnings,shap
from sklearn.model_selection import KFold
from xgboost import XGBClassifier
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
from tqdm import tqdm
mod_dir = 'C:/Users/User/OneDrive - UHN/ML examples/'
df = pd.read_csv('{}heart_disease.csv'.format(mod_dir))
model_vars = pd.read_csv('{}hd_model_vars.csv'.format(mod_dir))
cat_vars = model_vars[model_vars['data_type']=="category"]
cat_vars = cat_vars['variable'].to_list()
df[cat_vars] = df[cat_vars].astype("category")
ids_outcome = df[['id','out']]
df = df.drop('out',axis=1)
xgb = XGBClassifier(enable_categorical = True,eval_metric="logloss",use_label_encoder=False,tree_method = "hist")
x = df.copy()
y = ids_outcome.copy()
y['out'] = y['out'].astype(int)
ls_shap_values = []
ls_x_val = []
for i in tqdm(range(1,4)):
    kf = KFold(n_splits=3,shuffle=True,random_state=i)
    for train_index, val_index in kf.split(x):
        pass
    x_train = x.iloc[train_index]
    y_train = y.iloc[train_index]
    x_val = x.iloc[val_index]
    y_val = y.iloc[val_index]
    # Save IDs for merging later
    train_ids = x_train[['id']]
    val_ids = x_val[['id']]
    # Set ID column as index for modelling
    x_train = x_train.set_index('id')
    y_train = y_train.set_index('id')
    x_val = x_val.set_index('id')
    y_val = y_val.set_index('id')
    xgb.fit(x_train,y_train)
    ls_x_val.append(val_index)
    explainer = shap.Explainer(xgb.predict,x_val)
    shap_values = explainer(x_val)
    ls_shap_values.append(shap_values)
val_set = ls_x_val[0]
shap_values = np.array(ls_shap_values[0])
for i in range(1,3):
    test_set = np.concatenate((val_set,ls_x_val[i]),axis=0)
    shap_values = np.concatenate((shap_values,np.array(ls_shap_values[i])),axis=1)
#bringing back variable names
X_val = pd.DataFrame(x.iloc[test_set],columns=x.columns)
shap.summary_plot(shap_values,X_val)
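The assertion in the question says that the SHAP matrix and the data matrix passed to summary_plot do not have the same shape. Two things in the code above look likely to cause that: the per-fold values are concatenated along axis=1 (which adds columns, not rows), and test_set is rebuilt from val_set on every pass and never accumulated. The sketch below shows only the shape bookkeeping, using random arrays as stand-ins for the per-fold SHAP values; it is an assumption that each fold yields one row per validation sample and one column per feature (for Explanation objects, that array is the .values attribute).

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features, n_folds = 9, 4, 3

# Hypothetical stand-in for the feature table.
X = rng.normal(size=(n_rows, n_features))

# Simple split of shuffled row positions (a stand-in for KFold).
folds = np.array_split(rng.permutation(n_rows), n_folds)

fold_values, fold_rows = [], []
for val_index in folds:
    # Stand-in for the per-fold SHAP array: one row per validation sample,
    # one column per feature.
    fold_values.append(rng.normal(size=(len(val_index), n_features)))
    fold_rows.append(val_index)

# Stack the folds row-wise (axis=0) and keep the matching row positions,
# so that row k of all_values describes row k of X_matched.
all_values = np.concatenate(fold_values, axis=0)
all_rows = np.concatenate(fold_rows, axis=0)
X_matched = X[all_rows]

print(all_values.shape, X_matched.shape)
```

With both arrays built this way, the two shapes agree by construction, which is the condition the assertion checks.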
I have developed several different datasets, and I want to write a for loop that trains a model on each of them so that, at the end, I have an RMSE for each dataset. I tried a for loop, but it does not work: it returns the same value for every dataset, while I know the values should be different. The code I have written is below:
for i in NEW_middle_index:
    DF = df1.iloc[i-100:i+100,:]
    # Append an empty sublist inside the list
    FINAL_DF.append(DF)
    y = DF.iloc[:,3]
    X = DF.drop(columns='Target')
    index_train = int(0.7 * len(X))
    X_train = X[:index_train]
    y_train = y[:index_train]
    X_test = X[index_train:]
    y_test = y[index_train:]
    scaler_x = MinMaxScaler().fit(X_train)
    X_train = scaler_x.transform(X_train)
    X_test = scaler_x.transform(X_test)
xgb_r = xg.XGBRegressor(objective ='reg:linear',
                        n_estimators = 20, seed = 123)
for i in range(len(NEW_middle_index)):
    # print(i)
    # Fitting the model
    xgb_r.fit(X_train,y_train)
    # Predict the model
    pred = xgb_r.predict(X_test)
    # RMSE Computation
    rmse = np.sqrt(mean_squared_error(y_test,pred))
    # print(rmse)
    RMSE.append(rmse)
I'm not sure whether you indented it correctly. You are overwriting X_train and X_test on every pass of the first loop, so when you fit your model it's always on the same (last) dataset, hence you get the same results.
One option is to fit the model as soon as you create the train/test dataframes. Otherwise, if you want to keep the train/test sets, you could store them in a list of dictionaries, something like below, without changing too much of your code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import xgboost as xg
df1 = pd.DataFrame(np.random.normal(0,1,(600,3)))
df1['Target'] = np.random.uniform(0,1,600)
NEW_middle_index = [100,300,500]
NEWDF = []
for i in NEW_middle_index:
    y = df1.iloc[i-100:i+100, 3]
    X = df1.iloc[i-100:i+100, :].drop(columns='Target')
    index_train = int(0.7 * len(X))
    # Fit the scaler on the training rows only, so the test rows do not leak into it
    scaler_x = MinMaxScaler().fit(X[:index_train])
    X_train = scaler_x.transform(X[:index_train])
    y_train = y[:index_train]
    X_test = scaler_x.transform(X[index_train:])
    y_test = y[index_train:]
    # Keep every window's split so the model can be fitted on each one later
    NEWDF.append({'X_train':X_train,'y_train':y_train,'X_test':X_test,'y_test':y_test})
Then we fit and calculate RMSE:
RMSE = []
# 'reg:squarederror' replaces 'reg:linear', which is deprecated in recent XGBoost releases
xgb_r = xg.XGBRegressor(objective='reg:squarederror', n_estimators=20, seed=123)
for i in range(len(NEW_middle_index)):
    xgb_r.fit(NEWDF[i]['X_train'],NEWDF[i]['y_train'])
    pred = xgb_r.predict(NEWDF[i]['X_test'])
    rmse = np.sqrt(mean_squared_error(NEWDF[i]['y_test'],pred))
    RMSE.append(rmse)
RMSE
[0.3524827559800294, 0.3098101362502435, 0.3843173269966071]
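The first option mentioned above, fitting as soon as each train/test split exists, can be sketched as a single loop. To keep the sketch dependency-free, ordinary least squares via NumPy stands in for XGBRegressor, and the data, the coefficients, and the name rmse_per_window are all made up for illustration; only the loop structure is the point.

```python
import numpy as np

rng = np.random.default_rng(123)
data = rng.normal(size=(600, 3))  # hypothetical features
target = data @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=600)

middle_index = [100, 300, 500]
rmse_per_window = []
for i in middle_index:
    X = data[i - 100:i + 100]
    y = target[i - 100:i + 100]
    split = int(0.7 * len(X))
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]

    # Fit inside the loop, while X_train / y_train still belong to window i.
    # Ordinary least squares (with an intercept column) stands in for XGBRegressor.
    A_train = np.column_stack([X_train, np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

    # Score on the same window's held-out rows before moving to the next window.
    A_test = np.column_stack([X_test, np.ones(len(X_test))])
    pred = A_test @ coef
    rmse_per_window.append(float(np.sqrt(np.mean((y_test - pred) ** 2))))

print(rmse_per_window)
```

Because the fit and the score happen before the next iteration overwrites the split, each window gets its own RMSE without storing the splits.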
Let's take data
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
And consider the following code:
#Defining X,y - independent variable and dependent variables
X=df.drop(df.columns[[1]], axis=1)
y = (df[1] == 'B').astype(int)
clf=LogisticRegression(solver="lbfgs")
kfold = StratifiedKFold(n_splits=10, shuffle=True)
for train, validation in kfold.split(X, y):
    # Fit the model
    clf.fit(X[train], y[train])
And the following error occurs:
Do you have any idea why it occurs? I don't think I did anything complicated, so I'm not sure what exactly I did wrong.
X is a DataFrame so you need to use .iloc to select the indices:
for train_index, validation_index in kfold.split(X, y):
    # Fit the model
    X_train = X.iloc[train_index]
    y_train = y[train_index]
    clf.fit(X_train, y_train)
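The difference between the two indexing styles can be seen on a small frame. This is a sketch with a hypothetical 4x3 DataFrame whose columns are labelled 0, 1, 2 (as header=None produces): plain square brackets with an array of integers look up column labels, while .iloc selects rows by position.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with integer column labels, like header=None gives.
X = pd.DataFrame(np.arange(12).reshape(4, 3))
y = pd.Series([0, 1, 0, 1])
train_index = np.array([0, 3])

# X[train_index] looks up *columns* labelled 0 and 3; label 3 does not exist.
try:
    X[train_index]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# .iloc selects rows by position, which is what the fold indices mean.
X_train = X.iloc[train_index]
y_train = y.iloc[train_index]
print(lookup_failed, X_train.shape)
```

In the answer's code, y[train_index] works because y has the default integer index, so labels and positions coincide; y.iloc[train_index] stays correct even if the index is later shuffled or filtered.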
I think I'm missing something in the code below.
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
# Split into training and test sets
# Testing Count Vectorizer
X = df[['Spam']]
y = df['Value']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)
sm = pd.concat([X_resampled, y_resampled], axis=1)
However, I'm getting this error:
ValueError: could not convert string to float:
---> 19 X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)
An example of the data is:
Spam                                     Value
Your microsoft account was compromised   1
Manchester United lost against PSG       0
I like cooking                           0
I think I need to transform both the train and test sets to fix the issue that is causing the error, but I don't know how to apply the transformation to both. I've tried some examples I found on Google, but they haven't fixed the issue.
Convert the text data to numeric features before applying SMOTE, as below.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(X_train.values.ravel())
X_train=vectorizer.transform(X_train.values.ravel())
X_test=vectorizer.transform(X_test.values.ravel())
X_train=X_train.toarray()
X_test=X_test.toarray()
and then add your SMOTE code:
X_train = pd.DataFrame(X_train)
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)
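On the question of how to apply the transformation to both sets: the vocabulary is learned from the training texts only and then reused unchanged for the test texts, which is what fit followed by two transform calls does. The sketch below is a hand-rolled, pure-Python stand-in that only roughly mimics CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the function names and the extra test sentence are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(texts):
    # Learn the word -> column mapping from the training texts only.
    words = sorted({w for t in texts for w in tokenize(t)})
    return {w: i for i, w in enumerate(words)}

def transform(texts, vocab):
    # Count occurrences of each vocabulary word; unseen words are ignored.
    rows = []
    for t in texts:
        counts = Counter(tokenize(t))
        rows.append([counts.get(w, 0) for w in vocab])
    return rows

train_texts = ["Your microsoft account was compromised", "I like cooking"]
test_texts = ["Manchester United lost against PSG", "I like my account"]

vocab = fit_vocabulary(train_texts)
X_train_counts = transform(train_texts, vocab)
X_test_counts = transform(test_texts, vocab)

print(list(vocab))
print(X_test_counts)
```

Both matrices end up with the same number of columns, so a model (or SMOTE, on the training part) sees a consistent numeric layout. Words that appear only in the test texts are simply dropped.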
You can use SMOTENC instead of SMOTE. SMOTENC handles categorical variables directly; you tell it which columns are categorical through its categorical_features parameter.
https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTENC.html#imblearn.over_sampling.SMOTENC
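Tokenizing your string data before feeding it into SMOTE is an option. You can use any tokenizer; a torch implementation would look something like this:
import torch
from imblearn.over_sampling import SMOTE

# `dataset` is assumed to be an already-tokenized dataset whose items are dicts
# holding fixed-length 'input_ids' and 'labels' tensors.
dataloader = torch.utils.data.DataLoader(dataset, batch_size=64)
X, y = [], []
for batch in dataloader:
    input_ids = batch['input_ids']
    labels = batch['labels']
    X.append(input_ids)
    y.append(labels)
# Stack all batches into single tensors, then convert to NumPy for imblearn
X_tensor = torch.cat(X, dim=0)
y_tensor = torch.cat(y, dim=0)
X = X_tensor.numpy()
y = y_tensor.numpy()
# A float sampling_strategy is only valid for binary targets
smote = SMOTE(random_state=42, sampling_strategy=0.6)
X_resampled, y_resampled = smote.fit_resample(X, y)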
I am attempting to train models with GradientBoostingClassifier using categorical variables.
The following is a primitive code sample, just for trying to input categorical variables into GradientBoostingClassifier.
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
import pandas
iris = datasets.load_iris()
# Use only data for 2 classes.
X = iris.data[(iris.target==0) | (iris.target==1)]
Y = iris.target[(iris.target==0) | (iris.target==1)]
# Class 0 has indices 0-49. Class 1 has indices 50-99.
# Divide data into 80% training, 20% testing.
train_indices = list(range(40)) + list(range(50,90))
test_indices = list(range(40,50)) + list(range(90,100))
X_train = X[train_indices]
X_test = X[test_indices]
y_train = Y[train_indices]
y_test = Y[test_indices]
X_train = pandas.DataFrame(X_train)
# Insert fake categorical variable.
# Just for testing in GradientBoostingClassifier.
X_train[0] = ['a']*40 + ['b']*40
# Model.
clf = GradientBoostingClassifier(learning_rate=0.01,max_depth=8,n_estimators=50).fit(X_train, y_train)
The following error appears:
ValueError: could not convert string to float: 'b'
From what I gather, it seems that One Hot Encoding on categorical variables is required before GradientBoostingClassifier can build the model.
Can GradientBoostingClassifier build models using categorical variables without having to do one hot encoding?
R gbm package is capable of handling the sample data above. I'm looking for a Python library with equivalent capability.
pandas.get_dummies or statsmodels.tools.tools.categorical can be used to convert categorical variables to a dummy matrix. We can then merge the dummy matrix back to the training data.
Below is the example code from the question with the above procedure carried out.
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve,auc
from statsmodels.tools import categorical
import numpy as np
iris = datasets.load_iris()
# Use only data for 2 classes.
X = iris.data[(iris.target==0) | (iris.target==1)]
Y = iris.target[(iris.target==0) | (iris.target==1)]
# Class 0 has indices 0-49. Class 1 has indices 50-99.
# Divide data into 80% training, 20% testing.
train_indices = list(range(40)) + list(range(50,90))
test_indices = list(range(40,50)) + list(range(90,100))
X_train = X[train_indices]
X_test = X[test_indices]
y_train = Y[train_indices]
y_test = Y[test_indices]
###########################################################################
###### Convert categorical variable to matrix and merge back with training
###### data.
# Fake categorical variable.
catVar = np.array(['a']*40 + ['b']*40)
catVar = categorical(catVar, drop=True)
X_train = np.concatenate((X_train, catVar), axis = 1)
catVar = np.array(['a']*10 + ['b']*10)
catVar = categorical(catVar, drop=True)
X_test = np.concatenate((X_test, catVar), axis = 1)
###########################################################################
# Model and test.
clf = GradientBoostingClassifier(learning_rate=0.01,max_depth=8,n_estimators=50).fit(X_train, y_train)
prob = clf.predict_proba(X_test)[:,1] # Only look at P(y==1).
fpr, tpr, thresholds = roc_curve(y_test, prob)
roc_auc_prob = auc(fpr, tpr)
print(prob)
print(y_test)
print(roc_auc_prob)
Thanks to Andreas Muller for pointing out that a pandas DataFrame should not be used with scikit-learn estimators.
Sure it can handle it; you just have to encode the categorical variables as a separate step in the pipeline. Sklearn is as capable of handling categorical variables as R or any other ML package. The R package is still (presumably) doing one-hot encoding behind the scenes; it just doesn't separate the concerns of encoding and fitting in this case (as it arguably should).
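As a minimal sketch of encoding as a separate step, here is the pandas.get_dummies route on two tiny, made-up frames. One detail worth showing is that the test frame may not contain every category, so its dummy columns are aligned to the training columns before being passed to a model. The column names x and cat are hypothetical.

```python
import pandas as pd

# Hypothetical train/test frames: one numeric feature and one categorical feature.
train = pd.DataFrame({"x": [5.1, 4.9, 7.0, 6.4], "cat": ["a", "a", "b", "b"]})
test = pd.DataFrame({"x": [5.0, 6.9], "cat": ["a", "a"]})  # category 'b' is absent

# One-hot encode the categorical column in each frame.
train_encoded = pd.get_dummies(train, columns=["cat"])
test_encoded = pd.get_dummies(test, columns=["cat"])

# Align the test columns to the training columns; missing dummies become 0.
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(train_encoded.columns))
print(list(test_encoded.columns))
```

After this step both frames are fully numeric and share the same columns, so they can be passed to GradientBoostingClassifier's fit and predict_proba.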