Training loop for XGBoost on different datasets - python

I have built several different datasets and I want to write a for loop that trains a model on each of them; at the end, I want the RMSE for each dataset. I tried doing this with a for loop, but it does not work: it returns the same value for every dataset, while I know the values should differ. The code I have written is below:
for i in NEW_middle_index:
    DF = df1.iloc[i-100:i+100,:]
    # Append an empty sublist inside the list
    FINAL_DF.append(DF)
    y = DF.iloc[:,3]
    X = DF.drop(columns='Target')
    index_train = int(0.7 * len(X))
    X_train = X[:index_train]
    y_train = y[:index_train]
    X_test = X[index_train:]
    y_test = y[index_train:]
    scaler_x = MinMaxScaler().fit(X_train)
    X_train = scaler_x.transform(X_train)
    X_test = scaler_x.transform(X_test)

xgb_r = xg.XGBRegressor(objective='reg:linear',
                        n_estimators=20, seed=123)

for i in range(len(NEW_middle_index)):
    # print(i)
    # Fitting the model
    xgb_r.fit(X_train, y_train)
    # Predict the model
    pred = xgb_r.predict(X_test)
    # RMSE Computation
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    # print(rmse)
    RMSE.append(rmse)

It's not clear whether the indentation is correct here. You are overwriting X_train and X_test on every iteration of the first loop, so when you fit your model it is always on the same (last) dataset, hence you get the same results.
One option is to fit the model as soon as you create each train/test split. Alternatively, if you want to keep the train/test sets around, you can store them in a list of dictionaries, without changing too much of your code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import xgboost as xg
df1 = pd.DataFrame(np.random.normal(0,1,(600,3)))
df1['Target'] = np.random.uniform(0,1,600)
NEW_middle_index = [100,300,500]
NEWDF = []
for i in NEW_middle_index:
    y = df1.iloc[i-100:i+100, 3]
    X = df1.iloc[i-100:i+100, :].drop(columns='Target')
    index_train = int(0.7 * len(X))
    scaler_x = MinMaxScaler().fit(X)
    X_train = scaler_x.transform(X[:index_train])
    y_train = y[:index_train]
    X_test = scaler_x.transform(X[index_train:])
    y_test = y[index_train:]
    NEWDF.append({'X_train': X_train, 'y_train': y_train, 'X_test': X_test, 'y_test': y_test})
Then we fit and calculate RMSE:
RMSE = []
xgb_r = xg.XGBRegressor(objective='reg:linear', n_estimators=20, seed=123)
for i in range(len(NEW_middle_index)):
    xgb_r.fit(NEWDF[i]['X_train'], NEWDF[i]['y_train'])
    pred = xgb_r.predict(NEWDF[i]['X_test'])
    rmse = np.sqrt(mean_squared_error(NEWDF[i]['y_test'], pred))
    RMSE.append(rmse)
RMSE
[0.3524827559800294, 0.3098101362502435, 0.3843173269966071]
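For completeness, here is a minimal sketch of the first option mentioned above: fitting as soon as each split is created, so nothing is overwritten between iterations. It reuses df1 and NEW_middle_index from the snippet above; fitting the scaler on the training slice only and using a fresh regressor per window are choices of this sketch, not requirements ('reg:squarederror' replaces the deprecated 'reg:linear' objective):

RMSE = []
for i in NEW_middle_index:
    y = df1.iloc[i-100:i+100, 3]
    X = df1.iloc[i-100:i+100, :].drop(columns='Target')
    index_train = int(0.7 * len(X))
    # fit the scaler on the training slice only, to avoid leaking test data
    scaler_x = MinMaxScaler().fit(X[:index_train])
    X_train = scaler_x.transform(X[:index_train])
    X_test = scaler_x.transform(X[index_train:])
    y_train, y_test = y[:index_train], y[index_train:]
    # a fresh model per window keeps the fits independent
    xgb_r = xg.XGBRegressor(objective='reg:squarederror', n_estimators=20, seed=123)
    xgb_r.fit(X_train, y_train)
    pred = xgb_r.predict(X_test)
    RMSE.append(np.sqrt(mean_squared_error(y_test, pred)))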

Related

SHAP: Combine results across validation splits

I am hoping to combine (and plot) SHAP results across validation splits for my xgboost model. The closest I have found online is this approach with k-fold CV, but when I try both k-fold and train_test_split, I'm given this error:
AssertionError: The shape of the shap_values matrix does not match the shape of the provided data matrix.
For reproducibility, I fetched the data from here
Below is my code, adapted a little to work with my own data. A couple of notes:
- shap.summary_plot(shap_values[1], X_test) is changed to shap.summary_plot(shap_values, X_test), as otherwise I was given this error: AssertionError: Summary plots need a matrix of shap_values, not a vector.
- I used Explainer rather than TreeExplainer, as that was what I was able to run.
import numpy as np,warnings,shap
from sklearn.model_selection import KFold
from xgboost import XGBClassifier
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
from tqdm import tqdm
mod_dir = 'C:/Users/User/OneDrive - UHN/ML examples/'
df = pd.read_csv('{}heart_disease.csv'.format(mod_dir))
model_vars = pd.read_csv('{}hd_model_vars.csv'.format(mod_dir))
cat_vars = model_vars[model_vars['data_type']=="category"]
cat_vars = cat_vars['variable'].to_list()
df[cat_vars] = df[cat_vars].astype("category")
ids_outcome = df[['id','out']]
df = df.drop('out',axis=1)
xgb = XGBClassifier(enable_categorical = True,eval_metric="logloss",use_label_encoder=False,tree_method = "hist")
x = df.copy()
y = ids_outcome.copy()
y['out'] = y['out'].astype(int)
ls_shap_values = []
ls_x_val = []
for i in tqdm(range(1,4)):
    kf = KFold(n_splits=3, shuffle=True, random_state=i)
    for train_index, val_index in kf.split(x):
        pass  # note: after this loop, only the last fold's indices remain
    x_train = x.iloc[train_index]
    y_train = y.iloc[train_index]
    x_val = x.iloc[val_index]
    y_val = y.iloc[val_index]
    # Save IDs for merging later
    train_ids = x_train[['id']]
    val_ids = x_val[['id']]
    # Set ID column as index for modelling
    x_train = x_train.set_index('id')
    y_train = y_train.set_index('id')
    x_val = x_val.set_index('id')
    y_val = y_val.set_index('id')
    xgb.fit(x_train, y_train)
    ls_x_val.append(val_index)
    explainer = shap.Explainer(xgb.predict, x_val)
    shap_values = explainer(x_val)
    ls_shap_values.append(shap_values)
val_set = ls_x_val[0]
shap_values = np.array(ls_shap_values[0])
for i in range(1,3):
    test_set = np.concatenate((val_set, ls_x_val[i]), axis=0)
    shap_values = np.concatenate((shap_values, np.array(ls_shap_values[i])), axis=1)
# bringing back variable names
X_val = pd.DataFrame(x.iloc[test_set], columns=x.columns)
shap.summary_plot(shap_values, X_val)
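A hedged note on the shape error above: shap.Explainer returns Explanation objects, whose numeric matrix lives in the .values attribute with shape (n_samples, n_features). If the goal is one summary plot over all folds, the per-fold matrices should be stacked along the sample axis (axis=0) rather than axis=1, so that the row count matches the combined data matrix. A minimal sketch of that idea, assuming the ls_shap_values and ls_x_val lists built in the loop above:

# stack the per-fold SHAP matrices row-wise so samples line up with rows
all_vals = np.concatenate([sv.values for sv in ls_shap_values], axis=0)
# stack the matching validation indices in the same order
all_idx = np.concatenate(ls_x_val, axis=0)
# rebuild the feature frame exactly as it was passed to the explainer
X_all = x.iloc[all_idx].set_index('id')
shap.summary_plot(all_vals, X_all)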

Predicting 100 values of function f(x) using LSTM

I am trying to predict 100 future values of the function f(x). I found a script online that predicted a stock price into the future and decided to modify it for the purposes of my research. Unfortunately, the prediction is quite a bit off. Here is the script:
#import packages
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
import tensorflow as tf
#for normalizing data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
#read the file
df = pd.read_csv('MyDataSet')
#print the head
df.head()
look_back = 60
num_periods = 20
NUM_NEURONS_FirstLayer = 128
NUM_NEURONS_SecondLayer = 64
EPOCHS = 1
num_prediction = 100
training_size = 8000 #initial
#creating dataframe
data = df.sort_index(ascending=True, axis=0)
new_data = pd.DataFrame(index=range(0,len(df)),columns=['XValue', 'YValue'])
for i in range(0,len(data)):
    new_data['XValue'][i] = data['XValue'][i]
    new_data['YValue'][i] = data['YValue'][i]
#setting index
new_data.index = new_data.XValue
new_data.drop('XValue', axis=1, inplace=True)
#creating train and test sets
dataset = new_data.values
train = dataset[0:training_size,:]
valid = dataset[training_size:,:]
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(dataset)
x_train, y_train = [], []
for i in range(look_back,len(train)):
    x_train.append(scaled_data[i-look_back:i,0])
    y_train.append(scaled_data[i,0])
x_train, y_train = np.array(x_train), np.array(y_train)
x_train = np.reshape(x_train, (x_train.shape[0],x_train.shape[1],1))
#Custom Loss Function
def my_loss_fn(y_true, y_pred):
    squared_difference = tf.square(y_true - y_pred)
    return tf.reduce_mean(squared_difference, axis=-1)
#Training Phase
model = Sequential()
model.add(LSTM(NUM_NEURONS_FirstLayer, input_shape=(look_back,1),
               return_sequences=True))
# input_shape on a non-first layer is ignored by Keras; only the first layer needs it
model.add(LSTM(NUM_NEURONS_SecondLayer, input_shape=(NUM_NEURONS_FirstLayer,1)))
model.add(Dense(1))
model.compile(loss=my_loss_fn, optimizer='adam')
model.fit(x_train,y_train,epochs=EPOCHS,batch_size=2, verbose=2)
inputs = dataset[(len(dataset) - len(valid) - look_back):]
inputs = inputs.reshape(-1,1)
inputs = scaler.transform(inputs)
X_test = []
for i in range(look_back,inputs.shape[0]):
    X_test.append(inputs[i-look_back:i,0])
X_test = np.array(X_test)
#Validation Phase
X_test = np.reshape(X_test, (X_test.shape[0],X_test.shape[1],1))
PredictValue = model.predict(X_test)
PredictValue = scaler.inverse_transform(PredictValue)
#Measures the RMSD between the validation data and the predicted data
rms=np.sqrt(np.mean(np.power((valid-PredictValue ),2)))
train = new_data[:int(new_data.index[training_size])]
valid = new_data[int(new_data.index[training_size]):]
valid['Predictions'] = PredictValue
valid = np.asarray(valid).astype('float32')
valid = valid.reshape(-1,1)
#Prediction Phase
def predict(num_prediction, model):
    prediction_list = valid[-look_back:]
    for _ in range(num_prediction):
        x = prediction_list[-look_back:]
        x = x.reshape((1, look_back, 1))
        out = model.predict(x)[0][0]
        prediction_list = np.append(prediction_list, out)
    prediction_list = prediction_list[look_back-1:]
    return prediction_list

#Get the predicted x-value
def predict_x(num_prediction):
    last_x = df['XValue'].values[-1]
    XValue = range(int(last_x),int(last_x)+num_prediction+1)
    prediction_x = list(XValue)
    return prediction_x
#Predict the y-value and the x-value
forecast = predict(num_prediction, model)
forecast_x = predict_x(num_prediction)
The file I was using was just for the function f(x) = sin(2πx/10), where x is all natural numbers from 0 to 10,000. Following the above, I get the following plot:
[plot: Prediction-1]
Granted, I am only using 1 epoch, and I am busy running a new job with a value of 50, but I was hoping for something a lot better. Any advice on improving the prediction here?
I also decided to modify the script to predict only one value at a time, add it to the dataset, and then re-run the script, 100 times over. It takes long and it obviously has its own issues (such as feeding in the predicted value from the previous step), but it's the best solution I can think of for now. This is the plot I got for it. I can attach this file if you want, but it's not the one I want to keep focusing on for this project:
[plot: Prediction-2]
Any help in getting a better prediction would be appreciated.
Much appreciated
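Since the post itself notes that only one epoch was used, one low-risk change is simply to train longer with early stopping, so the run ends once the loss stops improving. A minimal sketch, assuming the model, x_train, and y_train from the script above; the patience, batch size, and validation split are arbitrary choices, not values from the original post:

from keras.callbacks import EarlyStopping

# stop once the validation loss has not improved for 5 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
model.fit(x_train, y_train,
          epochs=100,            # an upper bound; early stopping usually ends sooner
          batch_size=32,         # larger batches also train much faster than batch_size=2
          validation_split=0.1,  # hold out 10% of the training windows for monitoring
          callbacks=[early_stop],
          verbose=2)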

Scikit-learn machine learning script for 2 datasets

Not a lot of wisdom here... But I have a script that compiles and tests the algorithm two times with the for i in range loop, to see if there is any variation in root mean squared error.
Is it possible to modify the code so that the loop tests two different datasets? I.e., df would run first and compute its RMSE, then df2 would run and compute its RMSE, and then I could compare/print the RMSE between the two. Both datasets have the same ['Demand'] response variable.
#Test random Forest
import numpy as np
from sklearn import preprocessing, neighbors
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.externals import joblib
import math

rmses = []
for i in range(2):
    X = np.array(df2.drop(['Demand'],1))
    y = np.array(df2['Demand'])
    offset = int(X.shape[0] * 0.7)
    X_train, y_train = X[:offset], y[:offset]
    X_test, y_test = X[offset:], y[offset:]
    clf = RandomForestRegressor(n_estimators=60, min_samples_split=6)
    clf.fit(X_train, y_train)
    mse = mean_squared_error(y_test, clf.predict(X_test))
    rmse = math.sqrt(mse)
    print("rmse: %.4f" % rmse)
    rmses.append(rmse)
print(sum(rmses)/len(rmses))
You can create a list of dfs and iterate over that:
rmses = []
df_lst = [df1, df2]
for df in df_lst:
    X = np.array(df.drop(['Demand'],1))
    y = np.array(df['Demand'])
    offset = int(X.shape[0] * 0.7)
    X_train, y_train = X[:offset], y[:offset]
    X_test, y_test = X[offset:], y[offset:]
    clf = RandomForestRegressor(n_estimators=60, min_samples_split=6)
    clf.fit(X_train, y_train)
    mse = mean_squared_error(y_test, clf.predict(X_test))
    rmse = math.sqrt(mse)
    print("rmse: %.4f" % rmse)
    rmses.append(rmse)
print(sum(rmses)/len(rmses))
You could use an auxiliary df and assign the dataframe you want to fit on each iteration using a condition:
for i in range(2):
    if i == 0:
        aux_df = df
    else:
        aux_df = df2
    .
    .
    .
That way you use the first df in the first iteration and df2 in the second iteration, as in the complete sketch below.
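A complete version of that idea might look like the following sketch; the loop body is the same RMSE computation as in the question, just run against aux_df:

rmses = []
for i in range(2):
    # pick the dataset for this iteration
    aux_df = df if i == 0 else df2
    X = np.array(aux_df.drop(['Demand'], axis=1))
    y = np.array(aux_df['Demand'])
    offset = int(X.shape[0] * 0.7)
    X_train, y_train = X[:offset], y[:offset]
    X_test, y_test = X[offset:], y[offset:]
    clf = RandomForestRegressor(n_estimators=60, min_samples_split=6)
    clf.fit(X_train, y_train)
    rmses.append(math.sqrt(mean_squared_error(y_test, clf.predict(X_test))))
print(rmses)  # one RMSE per dataset, in order [df, df2]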

How to implement a model on a new data set

I'm new to machine learning using Python. I'm trying to predict a factor, let's say the price of a house, but I'm using higher-degree polynomial features to create the model.
So I have 2 datasets. I've prepared my model using one dataset.
How do I apply this model to an entirely new dataset?
I'm attaching my code below:
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

data1 = pd.read_csv(r"C:\Users\DELL\Desktop\experimental data/xyz1.csv", engine='c', dtype=float, delimiter=",")
data2 = pd.read_csv(r"C:\Users\DELL\Desktop\experimental data/xyz2.csv", engine='c', dtype=float, delimiter=",")
#I have to do this step, otherwise I get an error about NaN or infinite values
data1.fillna(0.000, inplace=True)
data2.fillna(0.000, inplace=True)
X_train = data1.drop('result', axis = 1)
y_train = data1.result
X_test = data2.drop('result', axis = 1)
y_test = data2.result
x2_ = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_train)
x3_ = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X_train)
model2 = LinearRegression().fit(x2_, y_train)
model3 = LinearRegression().fit(x3_, y_train)
r_sq2 = model2.score(x2_, y_train)
r_sq3 = model3.score(x3_, y_train)
y_pred2 = model2.predict(x2_)
y_pred3 = model3.predict(x3_)
So basically I'm stuck after this.
How do I apply this same model to my test data to predict the y_test values and compute the score?
To reproduce the effect of PolynomialFeatures on new data, you need to store the transformer object itself (once for degree=2 and again for degree=3). Otherwise, you have no way to apply the fitted transform to the test dataset.
X_train = data1.drop('result', axis = 1)
y_train = data1.result
X_test = data2.drop('result', axis = 1)
y_test = data2.result
# store these data transform objects
pf2 = PolynomialFeatures(degree=2, include_bias=False)
pf3 = PolynomialFeatures(degree=3, include_bias=False)
# then apply the transform to the training set
x2_ = pf2.fit_transform(X_train)
x3_ = pf3.fit_transform(X_train)
model2 = LinearRegression().fit(x2_, y_train)
model3 = LinearRegression().fit(x3_, y_train)
r_sq2 = model2.score(x2_, y_train)
r_sq3 = model3.score(x3_, y_train)
y_pred2 = model2.predict(x2_)
y_pred3 = model3.predict(x3_)
# now apply the fitted transform to the test set
x2_test = pf2.transform(X_test)
x3_test = pf3.transform(X_test)
# apply trained model to transformed test data
y2_test_pred = model2.predict(x2_test)
y3_test_pred = model3.predict(x3_test)
# compute the model accuracy for the test data
r_sq2_test = model2.score(x2_test, y_test)
r_sq3_test = model3.score(x3_test, y_test)
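As a side note, the transform-then-fit pairing can also be bundled so it cannot get out of sync between train and test. This is a sketch using sklearn's Pipeline, reusing the variable names from the answer above:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# the pipeline fits the polynomial expansion and the regression together
model2 = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                       LinearRegression())
model2.fit(X_train, y_train)

# predict/score on raw features; the pipeline re-applies the fitted transform
y2_test_pred = model2.predict(X_test)
r_sq2_test = model2.score(X_test, y_test)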

Why is my output dataframe shape not 1459 x 2 but 1460 x 2

Below is what I have done so far.
#importing the necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.linear_model import ElasticNetCV
from sklearn.ensemble import RandomForestRegressor
filepath = r"C:\Users...Kaggle data\house prediction iowa\house_predtrain (3).csv"
train = pd.read_csv(filepath)
print(train.shape)
filepath2 = r"C:\Users...Kaggle data\house prediction iowa\house_predtest (1).csv"
test = pd.read_csv (filepath2)
print(test.shape)
#first we replace all the NaNs with 0 in both the train and test data
train = train.fillna(0)
test = test.fillna(0) #error one
train.dtypes.value_counts()
#isolating all the object/categorical feature and converting them to numeric features
encode_cols = train.dtypes[train.dtypes == np.object]
encode_cols2 = test.dtypes[test.dtypes == np.object]
#print(encode_cols)
encode_cols = encode_cols.index.tolist()
encode_cols2 = encode_cols2.index.tolist()
print(encode_cols2)
# Do the one hot encoding
train_dummies = pd.get_dummies(train, columns=encode_cols)
test_dummies = pd.get_dummies(test, columns=encode_cols2)
#align your test and train data (error2)
train, test = train_dummies.align(test_dummies, join = 'left', axis = 1)
print(train.shape)
print(test.shape)
#Now working with Floats features
numericals_floats = train.dtypes == np.float
numericals = train.columns[numericals_floats]
print(numericals)
#we check for skewness in the float data
skew_limit = 0.35
skew_vals = train[numericals].skew()
skew_cols = (skew_vals
             .sort_values(ascending=False)
             .to_frame()
             .rename(columns={0:'Skewness'}))
skew_cols
#Visualising them above data before and after log transforming
%matplotlib inline
field = 'GarageYrBlt'
fig, (ax_before, ax_after) = plt.subplots(1, 2, figsize=(10,5))
train[field].hist(ax=ax_before)
train[field].apply(np.log1p).hist(ax=ax_after)
ax_before.set (title = 'Before np.log1p', ylabel = 'frequency', xlabel = 'Value')
ax_after.set (title = 'After np.log1p', ylabel = 'frequency', xlabel = 'Value')
fig.suptitle('Field: "{}"'.format (field));
#note how applying log transformation on GarageYrBuilt does not do much
print(skew_cols.index.tolist()) #returns a list of the values
for i in skew_cols.index.tolist():
    if i == "SalePrice": #we do not want to transform the feature to be predicted
        continue
    train[i] = train[i].apply(np.log1p)
    test[i] = test[i].apply(np.log1p)
feature_cols = [x for x in train.columns if x != ('SalePrice')]
X_train = train[feature_cols]
y_train = train['SalePrice']
X_test = test[feature_cols]
y_test = train['SalePrice']
print(X_test.shape)
print(y_train.shape)
print(X_train.shape)
#now to the most fun part. Feature engineering is over!!!
#i am going to use linear regression, L1 regularization, L2 regularization and ElasticNet(blend of L1 and L2)
#first up, Linear Regression
alphas = [0.00005, 0.0005, 0.005, 0.05, 0.5, 0.1, 0.3, 1, 3, 5, 10, 25, 50, 100] #i chose these
l1_ratios = np.linspace(0.1, 0.9, 9)
#LinearRegression
linearRegression = LinearRegression().fit(X_train, y_train)
prediction1 = linearRegression.predict(X_test)
LR_score = linearRegression.score(X_train, y_train)
print(LR_score)
#ridge
ridgeCV = RidgeCV(alphas=alphas).fit(X_train, y_train)
prediction2 = ridgeCV.predict(X_test)
R_score = ridgeCV.score(X_train, y_train)
print(R_score)
#lasso
lassoCV = LassoCV(alphas=alphas, max_iter=1e2).fit(X_train, y_train)
prediction3 = lassoCV.predict(X_test)
L_score = lassoCV.score(X_train, y_train)
print(L_score)
#elasticNetCV
elasticnetCV = ElasticNetCV(alphas=alphas, l1_ratio=l1_ratios, max_iter=1e2).fit(X_train, y_train)
prediction4 = elasticnetCV.predict(X_test)
EN_score = elasticnetCV.score(X_train, y_train)
print(EN_score)
from sklearn.ensemble import RandomForestRegressor
randfr = RandomForestRegressor()
randfr = randfr.fit(X_train, y_train)
prediction5 = randfr.predict(X_test)
print(prediction5.shape)
RF_score = randfr.score(X_train, y_train)
print(RF_score)
#putting it all together
rmse_vals = [LR_score, R_score, L_score, EN_score, RF_score]
labels = ['Linear', 'Ridge', 'Lasso', 'ElasticNet', 'RandomForest']
rmse_df = pd.Series(rmse_vals, index=labels).to_frame()
rmse_df.rename(columns={0: 'SCORES'}, inplace=1)
rmse_df
KaggleHouse_submission_1 = pd.DataFrame({'Id': test.Id, 'SalePrice': prediction5})
print(KaggleHouse_submission_1.shape)
In the Kaggle house-price prediction challenge there is a train dataset and a test dataset (here is the link to the actual data: link). The output dataframe shape should be 1459 x 2, but mine is 1460 x 2 for some reason. I am not sure why this is happening. Any feedback is highly appreciated.
In the following line (as originally posted):
test = train.fillna(0)
you are assigning (overwriting) the test variable with the train data, which is exactly why your output has the train set's 1460 rows instead of the test set's 1459.
Scikit-learn is also very sensitive to the ordering of columns, so if your train and test datasets are misaligned you may run into a similar problem. You therefore need to ensure that the test data is encoded the same way as the train data, using the following align command:
train, test = train_dummies.align(test_dummies, join='left', axis=1)
See the changes in my code above.
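One caveat about the align step worth adding: with join='left', any dummy column present in train but absent from test is created in test and filled with NaN (as is SalePrice, which only exists in train), so those columns need filling or dropping before you fit. A minimal sketch of that follow-up:

# align test to train's columns; columns missing from test become all-NaN
train, test = train_dummies.align(test_dummies, join='left', axis=1)
# fill the NaN columns the left join introduced before fitting
test = test.fillna(0)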
