The problem is that my training data is too large to fit into RAM. So I need a method that first builds one tree on the whole training set, calculates the residuals, builds another tree, and so on (as gradient boosted trees do). Obviously, calling model = xgb.train(param, batch_dtrain, 2) in a loop will not help, because in that case the whole model is simply rebuilt for each batch.
Try saving your model after you train on the first batch. Then, on successive runs, provide the xgb.train method with the filepath of the saved model.
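In code, the idea looks roughly like this (a minimal sketch with tiny random batches just to show the mechanics; the parameters and file name are placeholders, not taken from the answer above):

import numpy as np
import xgboost as xgb

# two tiny random "batches" that stand in for data arriving in chunks
rng = np.random.RandomState(0)
batch_1 = xgb.DMatrix(rng.rand(100, 5), label=rng.rand(100))
batch_2 = xgb.DMatrix(rng.rand(100, 5), label=rng.rand(100))

params = {'objective': 'reg:squarederror'}  # 'reg:linear' on very old releases

# first run: train on the first batch and save the booster
model = xgb.train(params, batch_1, num_boost_round=10)
model.save_model('incremental.model')

# later run: continue boosting from the saved file on the next batch
model = xgb.train(params, batch_2, num_boost_round=10,
                  xgb_model='incremental.model')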
Here's a small experiment that I ran to convince myself that it works:
First, split the Boston dataset into training and testing sets.
Then split the training set into halves.
Fit a model with the first half and get a score that will serve as a benchmark.
Then fit two models with the second half; one model will have the additional parameter xgb_model. If passing in the extra parameter didn't make a difference, then we would expect their scores to be similar.
But, fortunately, the model that continued from the saved one seems to perform much better than the one trained on the second half alone.
import xgboost as xgb
from sklearn.model_selection import train_test_split as ttsplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse
X = load_boston()['data']
y = load_boston()['target']
# split data into training and testing sets
# then split training set in half
X_train, X_test, y_train, y_test = ttsplit(X, y, test_size=0.1, random_state=0)
X_train_1, X_train_2, y_train_1, y_train_2 = ttsplit(X_train,
y_train,
test_size=0.5,
random_state=0)
xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)
params = {'objective': 'reg:linear', 'verbose': False}
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')
# ================= train two versions of the model =====================#
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model='model_1.model')
print(mse(model_1.predict(xg_test), y_test)) # benchmark
print(mse(model_2_v1.predict(xg_test), y_test)) # "before"
print(mse(model_2_v2.predict(xg_test), y_test)) # "after"
# 23.0475232194
# 39.6776876084
# 27.2053239482
reference: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/training.py
There is now (since version 0.6?) a process_type parameter that might help. Here's an experiment with it:
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import ShuffleSplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse
boston = load_boston()
features = boston.feature_names
X = boston.data
y = boston.target
X = pd.DataFrame(X, columns=features)
y = pd.Series(y, index=X.index)
# split data into training and testing sets
rs = ShuffleSplit(test_size=0.3, n_splits=1, random_state=0)
for train_idx, test_idx in rs.split(X):  # this looks silly
    pass
train_split = round(len(train_idx) / 2)
train1_idx = train_idx[:train_split]
train2_idx = train_idx[train_split:]
X_train = X.loc[train_idx]
X_train_1 = X.loc[train1_idx]
X_train_2 = X.loc[train2_idx]
X_test = X.loc[test_idx]
y_train = y.loc[train_idx]
y_train_1 = y.loc[train1_idx]
y_train_2 = y.loc[train2_idx]
y_test = y.loc[test_idx]
xg_train_0 = xgb.DMatrix(X_train, label=y_train)
xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)
params = {'objective': 'reg:linear', 'verbose': False}
model_0 = xgb.train(params, xg_train_0, 30)
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model=model_1)
params.update({'process_type': 'update',
'updater' : 'refresh',
'refresh_leaf': True})
model_2_v2_update = xgb.train(params, xg_train_2, 30, xgb_model=model_1)
print('full train\t',mse(model_0.predict(xg_test), y_test)) # benchmark
print('model 1 \t',mse(model_1.predict(xg_test), y_test))
print('model 2 \t',mse(model_2_v1.predict(xg_test), y_test)) # "before"
print('model 1+2\t',mse(model_2_v2.predict(xg_test), y_test)) # "after"
print('model 1+update2\t',mse(model_2_v2_update.predict(xg_test), y_test)) # "after"
Output:
full train 17.8364309709
model 1 24.2542132108
model 2 25.6967017352
model 1+2 22.8846455135
model 1+update2 14.2816257268
I created a gist of jupyter notebook to demonstrate that xgboost model can be trained incrementally. I used boston dataset to train the model. I did 3 experiments - one shot learning, iterative one shot learning, iterative incremental learning. In incremental training, I passed the boston data to the model in batches of size 50.
The gist of the gist is that you'll have to iterate over the data multiple times for the model to converge to the accuracy attained by one shot (all data) learning.
Here is the corresponding code for doing iterative incremental learning with xgboost.
import sklearn.metrics
import xgboost as xgb

# x_tr, y_tr, x_te, y_te are the train/test splits prepared earlier in the notebook
batch_size = 50
iterations = 25
model = None
for i in range(iterations):
    for start in range(0, len(x_tr), batch_size):
        model = xgb.train({
            'learning_rate': 0.007,
            'updater': 'refresh',
            'process_type': 'update',
            'refresh_leaf': True,
            #'reg_lambda': 3,  # L2
            'reg_alpha': 3,  # L1
            'silent': False,
        }, dtrain=xgb.DMatrix(x_tr[start:start+batch_size], y_tr[start:start+batch_size]), xgb_model=model)

        y_pr = model.predict(xgb.DMatrix(x_te))
        #print('    MSE itr#{}: {}'.format(int(start/batch_size), sklearn.metrics.mean_squared_error(y_te, y_pr)))
    print('MSE itr#{}: {}'.format(i, sklearn.metrics.mean_squared_error(y_te, y_pr)))

y_pr = model.predict(xgb.DMatrix(x_te))
print('MSE at the end: {}'.format(sklearn.metrics.mean_squared_error(y_te, y_pr)))
XGBoost version: 0.6
It looks like you don't need anything other than to call xgb.train(....) again, providing the model result from the previous batch:
# python
params = {} # your params here
ith_batch = 0
n_batches = 100
model = None
while ith_batch < n_batches:
    d_train = getBatchData(ith_batch)
    model = xgb.train(params, d_train, xgb_model=model)
    ith_batch += 1
this is based on https://xgboost.readthedocs.io/en/latest/python/python_api.html
If your problem is about dataset size and you do not really need incremental learning (you are not dealing with a streaming app, for instance), then you should check out Spark or Flink.
These two frameworks can train on very large datasets with little RAM by leveraging disk memory. Both frameworks handle memory issues internally. While Flink solved it first, Spark has caught up in recent releases.
Take a look at:
"XGBoost4J: Portable Distributed XGBoost in Spark, Flink and Dataflow": http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html
Spark Integration: http://dmlc.ml/2016/10/26/a-full-integration-of-xgboost-and-spark.html
Regarding paulperry's code: if you change one line from "train_split = round(len(train_idx) / 2)" to "train_split = len(train_idx) - 50", then model 1+update2 changes from 14.2816257268 to 45.60806270012028, and a lot of "leaf=0" entries appear in the dump file.
The updated model is not good when the update sample set is relatively small.
For binary:logistic, the updated model is unusable when the update sample set contains only one class.
One possible solution that I have not tested is to use a dask dataframe, which should act the same as a pandas dataframe but (I assume) utilize disk and read in and out of RAM. Here are some helpful links:
This link mentions how to use it with xgboost; also see this and this.
Further, there is an experimental option from XGBoost as well (here), but it is "not ready for production".
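For what it's worth, here is a rough sketch of the dask route (this assumes xgboost >= 1.0 with its dask module, the dask.distributed package, and a made-up CSV path and 'target' column; it is not tested on the asker's data):

import xgboost as xgb
import dask.dataframe as dd
from dask.distributed import Client

client = Client()                      # local cluster; workers can spill to disk
df = dd.read_csv('big_train.csv')      # lazy, chunked read, never fully in RAM
X = df.drop(columns=['target'])
y = df['target']

dtrain = xgb.dask.DaskDMatrix(client, X, y)
result = xgb.dask.train(client,
                        {'objective': 'reg:squarederror'},
                        dtrain,
                        num_boost_round=100)
booster = result['booster']            # an ordinary xgboost Booster

xgb.dask.train returns a plain Booster, so saving and predicting work the same as in the single-machine API.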
It's not based on xgboost, but there is a C++ incremental decision tree.
see gaenari.
Continuous chunks of data can be inserted and updated, and rebuilds can be run if concept drift reduces accuracy.
I agree with #desertnaut's solution.
I have a dataset that I split into 4 batches. I have to do an initial fit without the xgb_model parameter first; then the next fits have the xgb_model parameter, like this (I'm using the sklearn API):
for i, (X_batch, y_batch) in enumerate(zip(self.X_train_batched, self.y_train_batched)):
    print(f'Step: {i}', end=' ')
    if i == 0:
        model_xgbc.fit(X_batch, y_batch, eval_set=[(self.X_valid, self.y_valid)],
                       verbose=False, eval_metric=['logloss'],
                       early_stopping_rounds=400)
    else:
        model_xgbc.fit(X_batch, y_batch, eval_set=[(self.X_valid, self.y_valid)],
                       verbose=False, eval_metric=['logloss'],
                       early_stopping_rounds=400, xgb_model=model_xgbc)

    preds = model_xgbc.predict(self.X_valid)
    rmse = metrics.mean_squared_error(self.y_valid, preds, squared=False)
You can use my simple code for incremental model training with the xgb base class:
import xgboost as xgb

batch_size = 10000000
X_train = "your pandas training DataFrame"
y_train = "your labels"
n = len(X_train)  # number of training rows

# Store eval results
evals_result = {}
Deval = xgb.DMatrix(X_valid, y_valid)  # X_valid, y_valid: your validation split

model = None
for start in range(0, n, batch_size):
    Dtrain = xgb.DMatrix(X_train[start:start + batch_size], y_train[start:start + batch_size])
    eval_sets = [(Dtrain, 'train'), (Deval, 'eval')]
    model = xgb.train({'refresh_leaf': True,
                       'process_type': 'default',
                       'max_depth': 5,
                       'objective': 'reg:squarederror',
                       'num_parallel_tree': 2,
                       'learning_rate': 0.05,
                       'n_jobs': -1},
                      dtrain=Dtrain, evals=eval_sets, early_stopping_rounds=5,
                      num_boost_round=100, evals_result=evals_result, xgb_model=model)
Related
I'm currently working on a XGBoost regression model to predict ticket bookings.
My issue is that my model has good accuracy on the training set (around 96%) and on the testing set (around 94%), but when I try to use the model to predict bookings on another held-out dataset, the accuracy drops to 82%.
I tried moving some data from my testing set to this held-out set, and the accuracy is still pretty bad, even though the model can efficiently predict those rows when they're inside my testing set.
I assume I'm doing something wrong but I can't figure out what.
Any help would be appreciated, thanks
Here's the XGBoost model part of my code:
import xgboost as xgb
from sklearn.metrics import mean_squared_error
X_conso, y_conso = data_conso2.iloc[:,:-1],data_conso2.iloc[:,-1]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_conso, y_conso, test_size=0.3, random_state=20)
d_train = xgb.DMatrix(X_train, label = y_train)
d_test = xgb.DMatrix(X_test, label = y_test)
d_fcst_held_out = xgb.DMatrix(X_fcst_held_out)
params = {'p_colsample_bytree_conso' : 0.9,
'p_colsample_bylevel_conso': 0.9,
'p_colsample_bynode_conso': 0.9,
'p_learning_rate_conso': 0.3,
'p_max_depth_conso': 10,
'p_alpha_conso': 3,
'p_n_estimators_conso': 10,
'p_gamma_conso': 0.8}
steps = 100
watchlist = [(d_train, 'train'), (d_test, 'test')]
model = xgb.train(params, d_train, steps, watchlist, early_stopping_rounds = 50)
preds_train = model.predict(d_train)
preds_test = model.predict(d_test)
preds_fcst = model.predict(d_fcst_held_out)
And my accuracy levels:
Error train: 4.524787%
Error test: 5.978759%
Error fcst: 18.008451%
This is generally normal; unseen data usually has lower accuracy.
To improve accuracy on new data, you may optimize your hyperparameters using, for example, Optuna.
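A minimal sketch of what that could look like with Optuna, reusing d_train, d_test and y_test from the question's snippet (the search ranges are illustrative, not recommendations):

import optuna
import xgboost as xgb
from sklearn.metrics import mean_squared_error

def objective(trial):
    params = {
        'objective': 'reg:squarederror',
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
    }
    # train with early stopping and return the test MSE to be minimized
    booster = xgb.train(params, d_train, num_boost_round=100,
                        evals=[(d_test, 'test')], early_stopping_rounds=20,
                        verbose_eval=False)
    return mean_squared_error(y_test, booster.predict(d_test))

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)
print(study.best_params)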
I have limited memory, and training this model is taking up too much of it:
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
clf = RandomForestClassifier(n_estimators=10)
print("Created Random Forest classifier\n")
data = pd.read_csv("House_2_ALL.csv")
print("Finished reading data\n")
data = data.drop("UnixTimeStamp", axis=1)
predict = "Aggregate_Power"
print("Dropped UnixTimeStamp\n")
X = np.array(data.drop([predict], axis=1))
Y = np.array(data[predict])
print("Created numpy Arrays\n")
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, test_size = 0.1)
print("Assigned Testing/Training Variables\n")
clf.fit(X_train, Y_train)
print("Fit model\n")
print("Attempting to predict\n")
print(clf.predict(X_test))
When I run this program, my computer states that it has run out of memory and that I need to quit some applications.
Any ideas on how to manage memory better or is the only solution to reduce the size of my training dataset?
I have learned that the program runs smoothly until it gets to the "clf.fit(X_train, Y_train)" line, so I don't know if this is a problem with pandas' memory-hungry dataframes or with sklearn.
In my opinion, your dataset is quite large. You should therefore load the dataset in parts for training your model. I will share an example:
df = pd.read_csv(dataset_path, chunksize=10000)
# This will load only 10000 rows at a time (you can tune this for your RAM)

# Now df is an iterator, and hence you can do something like this
for part_df in df:
    '''
    Here you just treat "part_df" as your original df and do all the
    preprocessing and training on it. After training the model on this
    part_df you save the model and reload it in the next iteration.
    '''
    part_df = preprocess_df(part_df)  # some preprocessing function
    xtrain, xvalid, ytrain, yvalid = train_test_split(part_df)  # some split
    model = None
    if os.path.exists(model_path):  # you won't have a model for the first iteration
        model = ...  # here you load the saved model
    else:
        model = ...  # define the model for the first iteration
    model.fit(...)  # train the model
    # Now save the model for the next iteration
There are two possible scenarios here that could cause a memory error.
1. pandas.read_csv() with chunksize
You could use the chunksize parameter and load the data a smaller chunk at a time (it returns an object we can iterate over).
chunk_size = 50000
reader = pd.read_csv('big_file.csv', chunksize=chunk_size)
for data_chunk in reader:
    # process chunk
    ...
2. Random Forest Classifier/Regressor
It has the default parameters max_depth=None and min_samples_leaf=1, which means full trees are grown. If the dataset is large, the RandomForest can grow fully deep trees with many nodes, leading to rapid memory consumption.
Let clf = RandomForestClassifier(),
clf.fit(X_train, y_train)
then you could check a few things like
print(clf.estimators_[0].tree_.max_depth) # max_depth on a chunk of data.
joblib.dump(clf.estimators_[0], "first_tree_clf.joblib") # get the size of a tree.
Now you can try a definite value for the hyperparameter max_depth and fit the model again. Tuning the RandomForest hyperparameters will create shallower trees per chunk and avoid excessive memory consumption; see the sketch below.
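A minimal sketch of such a configuration, reusing the question's X_train and Y_train (the specific values are illustrative, not tuned):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=10,
    max_depth=15,         # cap tree depth instead of growing full trees
    min_samples_leaf=5,   # larger leaves -> fewer nodes per tree
)
clf.fit(X_train, Y_train)

Smaller trees trade a little accuracy for a much smaller memory footprint; dumping a single estimator with joblib before and after the change can show the difference in size.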
I am using ElasticNet to fit my data. To determine the hyperparameters (l1_ratio, alpha), I am using ElasticNetCV. With the obtained hyperparameters, I refit the model to the whole dataset for production use. I am unsure whether this is correct, both from the machine learning point of view and, if so, in how I do it. The code "works" and presumably does what it should, but I wanted to be certain that it is also correct.
My procedure is:
X_tr, X_te, y_tr, y_te = train_test_split(X,y)
optimizer = ElasticNetCV(l1_ratio = [.1,.5,.7,.9,.99,1], n_alphas=400, cv=5, normalize=True)
optimizer.fit(X_tr, y_tr)
best = ElasticNet(alpha=optimizer.alpha_, l1_ratio=optimizer.l1_ratio_, normalize=True)
best.fit(X,y)
Thank you in advance
I am a beginner at this, but I would love to share my approach to ElasticNet hyperparameter tuning. I would suggest using RandomizedSearchCV instead. Here is part of the code I am writing now:
#-----------------------------------------------
# input:
# X_train, X_test, Y_train, Y_test: datasets
# Returns:
# R² and RMSE Scores
#-----------------------------------------------
# Standardize data before
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# define grid
params = dict()
# values for alpha: 100 values between 1e-5 and 1e5
params['alpha'] = np.logspace(-5, 5, 100, endpoint=True)
# values for l1_ratio: 100 values between 0 and 1
params['l1_ratio'] = np.arange(0, 1, 0.01)
Warning: you are testing 100 x 100 = 10 000 possible combinations.
# Create an instance of the Elastic Net Regressor
regressor = ElasticNet()
# Call RandomizedSearchCV with cross-validation using the chosen regressor
rs_cv = RandomizedSearchCV(regressor, params, n_iter=100, scoring=None, cv=5, verbose=0, refit=True)
rs_cv.fit(X_train, Y_train.values.ravel())
# Results
Y_pred = rs_cv.predict(X_test)
R2_score = rs_cv.score(X_test, Y_test)
RMSE_score = np.sqrt(mean_squared_error(Y_test, Y_pred))
return R2_score, RMSE_score, rs_cv.best_params_
The advantage is that with RandomizedSearchCV the number of iterations can be set in advance. The points to be tested are chosen at random, which in some cases makes it around 90% faster than GridSearchCV (which tests all possible combinations).
I am using this same approach for other regressors like RandomForests and GradientBoosting, whose parameter grids are far more complicated and demand much more computing power to run; a sketch of that follows below.
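As an example of that pattern, here is a rough sketch for a RandomForestRegressor, reusing X_train and Y_train from above (the grid values are arbitrary):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rf_params = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 5, 10, 20],
    'min_samples_leaf': [1, 2, 5, 10],
    'max_features': ['sqrt', 0.5, 1.0],
}

rf_cv = RandomizedSearchCV(RandomForestRegressor(), rf_params,
                           n_iter=30, cv=5, refit=True)
rf_cv.fit(X_train, Y_train.values.ravel())
print(rf_cv.best_params_)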
As I said at the beginning I am new to this field, so any constructive comment will be welcomed.
Johnny
I have the following lines of code:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Setting the values for the number of folds
num_folds = 10
seed = 7

# Separating data into folds
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)

# Create the unit model (weak classifier)
cart = DecisionTreeClassifier()

# Setting the number of trees
num_trees = 100

# Creating the bagging model
model = BaggingClassifier(base_estimator=cart, n_estimators=num_trees, random_state=seed)

# Cross validation
resultado = cross_val_score(model, X, Y, cv=kfold)

# Print the result
print("Accuracy: %.3f" % (resultado.mean() * 100))
This is ready-made code that I got from the internet, which is obviously set up for cross-validating my TRAINING data and reporting the accuracy of the bagging algorithm.
I would like to know if I can apply it to my TEST data (data without the output 'Y').
The code is a bit confusing and I can't adapt it.
I'm looking for something like:
# Training the model
model.fit(X, Y)
# Making predictions
Y_pred = model.predict(X_test)
I want to train the bagging model on the training data and then use it on the test data to make predictions, but I don't know how to modify the code.
You have everything to predict new data already. I am providing a small example with toy data and comments to make it clear.
from sklearn.ensemble import BaggingClassifier
cart = BaggingClassifier()
X_train = [[0, 0], [1, 1]] # training data
Y_train = [0, 1] # training labels
cart.fit(X_train, Y_train) # model is trained
y_pred = cart.predict([ [0,1] ]) # new data
print(y_pred)
# prints [0], so it predicts the new sample (0,1) as 0 class
I am trying to run LightGBM for feature selection as below.
Initialization:
# Initialize an empty array to hold feature importances
feature_importances = np.zeros(features_sample.shape[1])
# Create the model with several hyperparameters
model = lgb.LGBMClassifier(objective='binary',
boosting_type = 'goss',
n_estimators = 10000, class_weight ='balanced')
Then I fit the model as below:
# Fit the model twice to avoid overfitting
for i in range(2):

    # Split into training and validation set
    train_features, valid_features, train_y, valid_y = train_test_split(train_X, train_Y, test_size=0.25, random_state=i)

    # Train using early stopping
    model.fit(train_features, train_y, early_stopping_rounds=100, eval_set=[(valid_features, valid_y)],
              eval_metric='auc', verbose=200)

    # Record the feature importances
    feature_importances += model.feature_importances_
But I get the error below:
Training until validation scores don't improve for 100 rounds.
Early stopping, best iteration is: [6] valid_0's auc: 0.88648
ValueError: operands could not be broadcast together with shapes (87,) (83,) (87,)
An example of getting feature importance in LightGBM when the model was trained with lgb.train:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

def plotImp(model, X, num=20, fig_size=(40, 20)):
    feature_imp = pd.DataFrame({'Value': model.feature_importance(), 'Feature': X.columns})
    plt.figure(figsize=fig_size)
    sns.set(font_scale=5)
    sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value",
                ascending=False)[0:num])
    plt.title('LightGBM Features (avg over folds)')
    plt.tight_layout()
    plt.savefig('lgbm_importances-01.png')
    plt.show()
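A possible call, assuming model was returned by lgb.train and X is the training DataFrame whose columns match the model's features:

plotImp(model, X, num=20)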
Depending on whether we trained the model using scikit-learn or lightgbm methods, to get importance we should choose respectively the feature_importances_ property or the feature_importance() function, as in this example (where model is the result of lgbm.fit() / lgbm.train(), and train_columns = x_train_df.columns):
import pandas as pd

def get_lgbm_varimp(model, train_columns, max_vars=50):

    if "basic.Booster" in str(model.__class__):
        # lightgbm.basic.Booster was trained directly, so using feature_importance() function
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importance()]).T
    else:
        # Scikit-learn API LGBMClassifier or LGBMRegressor was fitted,
        # so using feature_importances_ property
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importances_]).T

    cv_varimp_df.columns = ['feature_name', 'varimp']
    cv_varimp_df.sort_values(by='varimp', ascending=False, inplace=True)
    cv_varimp_df = cv_varimp_df.iloc[0:max_vars]

    return cv_varimp_df
Note that we rely on the assumption that feature importance values are ordered just like the model matrix columns were ordered during training (incl. one-hot dummy cols), see LightGBM #209.
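A possible call, reusing the x_train_df naming from above:

varimp_df = get_lgbm_varimp(model, x_train_df.columns, max_vars=20)
print(varimp_df)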
For LightGBM version 3.1.1, extending the answer of #user3067175:
pd.DataFrame({'Value': model.feature_importance(), 'Feature': features}).sort_values(by="Value", ascending=False)
where features is a list of feature names, in the same order as your dataset; it can be replaced by features = df_train.columns.tolist().
This should return the feature importances in the same order as the plot.
Note: If you use LGBMRegressor or LGBMClassifier, you should use
pd.DataFrame({'Value':model.feature_importances_,'Feature':features}).sort_values(by="Value",ascending=False)
If you want to examine a loaded model for which you don't have the training data, you can get the feature importance and the feature names with:
df_feature_importance = (
    pd.DataFrame({
        'feature': model.feature_name(),
        'importance': model.feature_importance(),
    })
    .sort_values('importance', ascending=False)
)