Running out of memory while training machine learning model

I have limited memory and training this model is taking too much:
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
clf = RandomForestClassifier(n_estimators=10)
print("Created Random Forest classifier\n")
data = pd.read_csv("House_2_ALL.csv")
print("Finished reading data\n")
data = data.drop(columns="UnixTimeStamp")  # drop returns a new frame, so assign it back
predict = "Aggregate_Power"
print("Dropped UnixTimeStamp\n")
X = np.array(data.drop(columns=[predict]))
Y = np.array(data[predict])
print("Created numpy Arrays\n")
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, test_size = 0.1)
print("Assigned Testing/Training Variables\n")
clf.fit(X_train, Y_train)
print("Fit model\n")
print("Attempting to predict\n")
print(clf.predict(X_test))
When I run this program, my computer states that it has run out of memory and that I need to quit some applications.
Any ideas on how to manage memory better or is the only solution to reduce the size of my training dataset?
I have learned that the program runs smoothly until it gets to the "clf.fit(X_train, Y_train)" line, so I don't know whether this is a problem with pandas' memory-hungry dataframes or with sklearn.

In my opinion, your dataset is quite large, so you should load it in parts when training your model. Here is an example:
reader = pd.read_csv(dataset_path, chunksize=10000)
# This loads only 10000 rows at a time (you can tune this for your RAM).
# reader is now an iterator over DataFrames, so you can do something like this:
for part_df in reader:
    # Treat part_df as your original df: do all the preprocessing on it and
    # train the model on it. After training on this part_df, save the model
    # and reload it in the next iteration.
    part_df = preprocess_df(part_df)  # some preprocessing function
    xtrain, xvalid, ytrain, yvalid = train_test_split(...)  # some split of part_df
    if os.path.exists(model_path):  # you won't have a model in the first iteration
        model = load_model(model_path)  # placeholder: load the saved model here
    else:
        model = build_model()  # placeholder: define the model for the first chunk
    model.fit(xtrain, ytrain)  # train the model
    save_model(model, model_path)  # placeholder: save the model for the next iteration
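One caveat with the loop above: calling fit again on most scikit-learn estimators, RandomForestClassifier included, retrains from scratch instead of continuing from the previous chunk. Here is a minimal sketch of chunked training, assuming you can switch to an estimator that supports partial_fit. SGDClassifier is a swapped-in model, not the one from the question, and the CSV file, its column names and the chunk size are made up for illustration:

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Build a small synthetic CSV so the sketch is self-contained (hypothetical data).
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["a", "b", "c"])
frame["label"] = (frame["a"] + frame["b"] > 0).astype(int)
csv_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
frame.to_csv(csv_path, index=False)

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit needs the full set of class labels

rows_seen = 0
# Only one chunk of 200 rows is held in memory at a time.
for chunk in pd.read_csv(csv_path, chunksize=200):
    X_chunk = chunk[["a", "b", "c"]].to_numpy()
    y_chunk = chunk["label"].to_numpy()
    model.partial_fit(X_chunk, y_chunk, classes=classes)  # updates the same model
    rows_seen += len(chunk)

print("rows seen:", rows_seen)
print("coefficient shape:", model.coef_.shape)
```

Because the model object is updated in place, there is no need to save and reload it between chunks unless the process itself is restarted.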

There are two possible causes of the memory error here.
1. pandas.read_csv() with chunksize
You can use the chunksize parameter to load the data one smaller chunk at a time (read_csv then returns an object you can iterate over).
chunk_size = 50000
reader = pd.read_csv('big_file.csv', chunksize=chunk_size)
for data_chunk in reader:
    ...  # process chunk
2. Random Forest Classifier/Regressor
It has the default parameters max_depth=None and min_samples_leaf=1, which means fully grown trees. If the dataset is large, the random forest can grow very deep trees with many nodes, which consumes memory quickly.
After fitting a classifier,
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
you can check a few things, such as
print(clf.estimators_[0].tree_.max_depth)  # max_depth reached on a chunk of data
joblib.dump(clf.estimators_[0], "first_tree_clf.joblib")  # then check the file size to see how big one tree is
Now you can set a definite value for the max_depth hyperparameter and fit the model again. Tuning the RandomForest hyperparameters this way produces shallower trees per chunk and avoids excessive memory consumption.
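To make the effect of max_depth concrete, here is a small sketch on synthetic data (the dataset and the chosen limits are assumptions for illustration, not values tuned for the question's data). It uses the total node count as a rough proxy for how much memory a forest needs:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)


def total_nodes(forest):
    """Sum the node counts of all trees in a fitted forest."""
    return sum(tree.tree_.node_count for tree in forest.estimators_)


# Default settings: trees grow until the leaves are pure.
full = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Capped settings: limit depth and require several samples per leaf.
capped = RandomForestClassifier(
    n_estimators=10, max_depth=5, min_samples_leaf=5, random_state=0
).fit(X, y)

deepest_capped = max(tree.tree_.max_depth for tree in capped.estimators_)
print("full forest nodes:  ", total_nodes(full))
print("capped forest nodes:", total_nodes(capped))
print("deepest capped tree:", deepest_capped)
```

A tree limited to depth 5 can have at most 63 nodes, so the capped forest stays small no matter how many rows are added.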

Related

Keras prediction incorrect with scaler and feature selection

I build an application that trains a Keras binary classifier model (0 or 1) every x time (hourly,daily) given the new data. The data preparation, training and testing works well, or at least as expected. It tests different features and scales it with MinMaxScaler (some values are negative).
On live data predictions with one single data point, the values are unrealistic (around 0.9987 to 1 most of the time, which is inaccurate). Since the result should be how close to "1" the prediction is, getting such high numbers constantly raises alerts.
Code for live prediction is as follows
current_df is a pandas dataframe that contains one row with the data pulled live, plus the column headers. We select the "features" from it; the feature list is loaded from the db because we implement dynamic feature selection when training the model, so each model may use different features.
Get the features as a list:
# Convert literal str to list
features = ast.literal_eval(features)
Then select only the features that I need in the dataframe:
# Select the features
selected_df = current_df[features]
Get the values as a list:
# Get the values of the df
current_list = selected_df.values.tolist()[0]
Then I reshape it:
# Reshape to allow scaling and predicting
current_list = np.reshape(current_list, (-1, 1))
If I call "transform" instead of "fit_transform" in the line above, I get the following error: This MinMaxScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
Reshape again:
# Reshape to be able to scale
current_list = np.reshape(current_list, (1, -1))
Load the model using Keras (model_location is a Path) and predict:
# Loads the model from the local folder
reconstructed_model = keras.models.load_model(model_location)
prediction = reconstructed_model.predict(current_list)
prediction = prediction.flat[0]
Updated
The data gets scaled using fit_transform and transform (MinMaxScaler although it can be Standard Scaler):
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
And this is run when training the model (the "model" config is not shown):
# Compile the model
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=['binary_accuracy'])
# build the model
model.fit(X_train, y_train, epochs=epochs, verbose=0)
# Evaluate using Keras built-in function
scores = model.evaluate(X_test, y_test, verbose=0)
testing_accuracy = scores[1]
# create model with sklearn KerasClassifier for evaluation
eval_model = KerasClassifier(model=model, epochs=epochs, batch_size=10, verbose=0)
# Evaluate model using RepeatedStratifiedKFold
accuracy = ML.evaluate_model_KFold(eval_model, X_test, y_test)
# Predict testing data
pred_test= model.predict(X_test, verbose=0)
pred_test = pred_test.flatten()
# extract the predicted class labels
y_predicted_test = np.where(pred_test > 0.5, 1, 0)
Regarding feature selection, the features are not always the same: I use either SelectKBest (10 or 15 features) or RFECV, and select the trained model with the highest accuracy, meaning the features can differ between models.
Is there anything I'm doing wrong here? I'm thinking maybe the scaling should be done before the feature selection or there's some issue with the scaling (since maybe some values might be 0 when training and 100 when using it and the features are not necessarily the same when scaling).

The issue seems to stem from the StandardScaler / MinMaxScaler. The following example shows how to apply the former. However, if separate scripts handle training and prediction, then the scaler will also need to be serialized and loaded at prediction time.
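Set up a classification problem:
# Imports assumed for the snippets in this answer (not shown in the original)
import pickle
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras
from tensorflow.keras import layers
X, y = make_classification(n_samples=10_000)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
Fit a StandardScaler instance on the training set and use the same parameters to .transform the test set:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
# Train time: Serialize the scaler to a pickle file.
with open("scaler.pkl", "wb") as fh:
    pickle.dump(scaler, fh)
# Test time: Load the scaler and apply to the test set.
with open("scaler.pkl", "rb") as fh:
    new_scaler = pickle.load(fh)
X_test = new_scaler.transform(X_test)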
Which means that the model should be fit on features with similar distributions:
model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    layers.Dense(100),
    layers.Dropout(0.1),
    layers.Dense(1, activation="sigmoid")])  # sigmoid keeps the output in [0, 1] for binary_crossentropy
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["binary_accuracy"])
model.fit(X_train, y_train, epochs=25)
y_pred = np.where(model.predict(X_test)[:, 0] > 0.5, 1, 0)
print(accuracy_score(y_test, y_pred))
# e.g. 0.8708 (exact value varies between runs)

Alexander's answer is correct, I think there is just some confusion between testing and live prediction. What he said regarding the testing step is equally applicable to live prediction step. After you've called scaler.fit_transform on your training set, add the following code to save the scaler:
with open("scaler.pkl", "wb") as fh:
    pickle.dump(scaler, fh)
Then, during the live prediction step, you don't call fit_transform. Instead, you load the scaler saved during training and call transform:
with open("scaler.pkl", "rb") as fh:
    new_scaler = pickle.load(fh)
# Load features, reshape them, etc
# Scaling step
current_list = new_scaler.transform(current_list)
# Features are scaled properly now, put the rest of your prediction code here
You always call fit_transform only once per model, during the training step on your training pool. After that (during testing or calculating predictions after model deployment) you never call it, only call transform. Treat scaler as part of the model. Naturally, you fit the model on the training set and then during testing and live prediction you use the same model, never refitting it. The same should be true for the scaler.
If you call scaler.fit_transform on live prediction features it creates a new scaler that has no prior knowledge of feature distribution on training set.
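To see why refitting on live data goes wrong, here is a small sketch with made-up numbers. The training matrix and the live row are synthetic, and an in-memory pickle stands in for the scaler.pkl file:

```python
import pickle

import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_train = rng.uniform(low=-50.0, high=100.0, size=(200, 4))  # synthetic training features
live_row = np.array([[10.0, -20.0, 55.0, 0.0]])  # one live data point, shape (1, n_features)

# Training time: fit the scaler once and serialize it.
scaler = MinMaxScaler().fit(X_train)
blob = pickle.dumps(scaler)

# Prediction time: load the fitted scaler and only call transform.
loaded = pickle.loads(blob)
correct = loaded.transform(live_row)

# Refitting on the single live row throws away the training min/max.
refit = MinMaxScaler().fit_transform(live_row)

print("transform with training scaler:", correct)
print("fit_transform on the live row: ", refit)
```

With a single row, each feature's minimum equals its maximum, so the refitted scaler maps every value to 0 whatever the input is. The model then receives features that have nothing to do with the scale it was trained on.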

Scaling Features For Prediction in Scikit Learn

I have been working on a machine learning model and I'm currently using a Pipeline with GridSearchCV. My data is scaled with MinMaxScaler and I'm using an SVR with an RBF kernel. My question is: now that my model is complete, fitted, and has a decent evaluation score, do I also need to scale new data for predictions with MinMaxScaler, or can I just make predictions with the data as is? I've read 3 books on scikit-learn, but they all focus on feature engineering and fitting. They don't really cover any additional steps in the prediction step other than using the predict method.
This is the code:
pipe = Pipeline([('scaler', MinMaxScaler()), ('clf', SVR())])
time_split = TimeSeriesSplit(n_splits=5)
param_grid = {'clf__kernel': ['rbf'],
              'clf__C': [0.0001, 0.001],
              'clf__gamma': [0.0001, 0.001]}
grid = GridSearchCV(pipe, param_grid, cv=time_split,
                    scoring='neg_mean_squared_error', n_jobs=-1)
grid.fit(X_train, y_train)

Sure, if you get new (in the sense of unprocessed) data, you need to apply the same preparation steps as you did when training the model. For example, if you use MinMaxScaler with default properties, the model is used to seeing each feature scaled to the range [0, 1] based on the training minimum and maximum; if you don't preprocess new data the same way, the model can't produce accurate results.
Keep in mind to use exactly the same fitted MinMaxScaler object you used for the training data. So if you save your model to a file, also save your preprocessing objects.
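One way to keep the scaler and the model together is to save them as a single Pipeline. The sketch below uses synthetic data and default SVR settings (both are assumptions for illustration). It checks that a pickled pipeline, given raw new data, returns the same predictions as scaling by hand with the fitted scaler step:

```python
import pickle

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 100, size=(150, 3))
y_train = X_train @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=150)
new_data = rng.uniform(0, 100, size=(5, 3))  # raw, unscaled new samples

pipe = Pipeline([("scaler", MinMaxScaler()), ("clf", SVR())])
pipe.fit(X_train, y_train)  # fits the scaler on X_train, then the SVR on the scaled data

# Round-trip through pickle, as if saving to and loading from a file.
restored = pickle.loads(pickle.dumps(pipe))

# The pipeline scales new_data with the stored scaler before predicting.
from_pipeline = restored.predict(new_data)

# Equivalent manual route: reuse the fitted scaler step, then the fitted SVR step.
scaled = restored.named_steps["scaler"].transform(new_data)
manual = restored.named_steps["clf"].predict(scaled)

print(from_pipeline)
print(np.allclose(from_pipeline, manual))
```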

I wanted to follow up my question with a solution thanks to pythonic833's answer. Because the MinMaxScaler is a step of the pipeline, the pipeline that GridSearchCV refits on the training data already contains the scaler fitted on X_train, and it applies that scaler automatically inside predict. So new data should be passed to predict unscaled; scaling it manually first would scale it twice. The thing to save is the whole fitted grid search (scaler plus SVR), for example with pickle. Below is my code based on pythonic833's answer and some of the other comments.
import pickle
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
pipe = Pipeline([('scaler', MinMaxScaler()), ('clf', SVR())])
time_split = TimeSeriesSplit(n_splits=5)
param_grid = {'clf__kernel': ['rbf'],
              'clf__C': [0.0001, 0.001],
              'clf__gamma': [0.0001, 0.001]}
grid = GridSearchCV(pipe, param_grid, cv=time_split,
                    scoring='neg_mean_squared_error', n_jobs=-1)
grid.fit(X_train, y_train)
# Pickle the fitted grid search (scaler + SVR) with a context manager
with open('Pickles/{}.pkl'.format(file_name), 'wb') as file:
    pickle.dump(grid, file)
# Load the pickle with a context manager
with open('Pickles/{}.pkl'.format(file_name), 'rb') as file:
    model = pickle.load(file)
# The best pipeline scales new_data with the MinMaxScaler fitted on X_train
model.predict(new_data)

Logistic regression sklearn - train and apply model

I'm new to machine learning and trying Sklearn for the first time. I have two dataframes, one with data to train a logistic regression model (with 10-fold cross-validation) and another one to predict classes ('0,1') using that model.
Here's my code so far using bits of tutorials I found on Sklearn docs and on the Web:
import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import normalize
from sklearn.preprocessing import scale
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
# Import dataframe with training data
df = pd.read_csv('summary_44.csv')
cols = df.columns.drop('num_class') # Data to use (num_class is the column with the classes)
# Import dataframe with data to predict
df_pred = pd.read_csv('new_predictions.csv')
# Scores
df_data = df.iloc[:,:-1].values
# Target
df_target = df.iloc[:,-1].values
# Values to predict
df_test = df_pred.iloc[:,:-1].values
# Scores' names
df_data_names = cols.values
# Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
# Define number of folds
kf = KFold(n_splits=10)
kf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator
# Logistic regression normalizing variables
LogReg = LogisticRegression()
# 10-fold cross-validation
scores = [LogReg.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf.split(X)]
print(scores)
# Predict new
novel = LogReg.predict(X_pred)
Is this the correct way to implement a Logistic Regression?
I know that the fit() method should be used after cross-validation in order to train the model and use it for predictions. However, since I called fit() inside a list comprehension I really don't know if my model was "fitted" and can be used to make predictions.

In general things are okay, but there are some problems.
Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
You scale the training and test data independently, which isn't correct. Both datasets must be scaled with the same scaler. scale is a simple function that cannot remember the training statistics, so it is better to use a transformer object such as StandardScaler:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df_data)
X = scaler.transform(df_data)
X_pred = scaler.transform(df_test)
Cross-validation and predicting
How does your code work? You split the data 10 times into a training set and a hold-out set; 10 times you fit the model on the training set and calculate the score on the hold-out set. This gives you cross-validation scores, but the final model is fitted on only part of the data. So it is better to fit the model on the whole dataset and then make predictions:
LogReg.fit(X, y)
novel = LogReg.predict(X_pred)
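Putting both points together, here is a minimal sketch on synthetic data (the dataset and the "new" rows are made up for illustration). Wrapping the scaler and the model in one pipeline is my own choice here: it means the scaler is refitted inside each fold as well as in the final fit:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_new = X[:10] + 0.01  # stand-in for the rows to predict

model = make_pipeline(StandardScaler(), LogisticRegression())

# Step 1: estimate performance. cross_val_score fits clones of the model,
# so `model` itself is still unfitted afterwards.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print("mean CV accuracy:", scores.mean())

# Step 2: fit once on all the training data, then predict the new rows.
model.fit(X, y)
novel = model.predict(X_new)
print(novel)
```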
Note that there are advanced techniques such as stacking and boosting, but if you are just learning sklearn, it is better to stick to the basics.

Load and predict new data sklearn

I trained a Logistic model, cross-validated and saved it to file using joblib module. Now I want to load this model and predict new data with it.
Is this the correct way to do this? Especially the standardization. Should I use scaler.fit() on my new data too? In the tutorials I followed, scaler.fit was only used on the training set, so I'm a bit lost here.
Here is my code:
#Loading the saved model with joblib
model = joblib.load('model.pkl')
# New data to predict
pr = pd.read_csv('set_to_predict.csv')
pred_cols = list(pr.columns.values)[:-1]
# Standardize new data
scaler = StandardScaler()
X_pred = scaler.fit(pr[pred_cols]).transform(pr[pred_cols])
pred = pd.Series(model.predict(X_pred))
print(pred)

No, it's incorrect. All the data preparation steps should be fitted on the training data only. Otherwise you risk applying the wrong transformations, because the means and variances that StandardScaler estimates will probably differ between the training data and the new data.
The easiest way to train, save, load and apply all the steps simultaneously is to use Pipelines:
At training:
# prepare the pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import joblib  # sklearn.externals.joblib was removed from scikit-learn
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
joblib.dump(pipe, 'model.pkl')
At prediction:
#Loading the saved model with joblib
pipe = joblib.load('model.pkl')
# New data to predict
pr = pd.read_csv('set_to_predict.csv')
pred_cols = list(pr.columns.values)[:-1]
# apply the whole pipeline to data
pred = pd.Series(pipe.predict(pr[pred_cols]))
print(pred)

How can I implement incremental training for xgboost?

The problem is that my training data does not fit into RAM. So I need a method that first builds one tree on the whole training set, calculates residuals, builds another tree, and so on (as gradient boosted trees do). Obviously, calling model = xgb.train(param, batch_dtrain, 2) in a loop will not help, because it just rebuilds the whole model for each batch.

Try saving your model after you train on the first batch. Then, on successive runs, provide the xgb.train method with the filepath of the saved model.
Here's a small experiment that I ran to convince myself that it works:
First, split the boston dataset into training and testing sets.
Then split the training set into halves.
Fit a model with the first half and get a score that will serve as a benchmark.
Then fit two models with the second half; one model will have the additional parameter xgb_model. If passing in the extra parameter didn't make a difference, then we would expect their scores to be similar.
But, fortunately, the model that continues from model_1 seems to perform much better than the one trained on the second half alone.
import xgboost as xgb
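from sklearn.model_selection import train_test_split as ttsplit  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older version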
from sklearn.metrics import mean_squared_error as mse
X = load_boston()['data']
y = load_boston()['target']
# split data into training and testing sets
# then split training set in half
X_train, X_test, y_train, y_test = ttsplit(X, y, test_size=0.1, random_state=0)
X_train_1, X_train_2, y_train_1, y_train_2 = ttsplit(X_train,
                                                     y_train,
                                                     test_size=0.5,
                                                     random_state=0)
xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)
params = {'objective': 'reg:linear', 'verbose': False}
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')
# ================= train two versions of the model =====================#
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model='model_1.model')
print(mse(model_1.predict(xg_test), y_test)) # benchmark
print(mse(model_2_v1.predict(xg_test), y_test)) # "before"
print(mse(model_2_v2.predict(xg_test), y_test)) # "after"
# 23.0475232194
# 39.6776876084
# 27.2053239482
reference: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/training.py

There is now (version 0.6?) a process_type parameter (set to 'update') that might help. Here's an experiment with it:
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import ShuffleSplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse
boston = load_boston()
features = boston.feature_names
X = boston.data
y = boston.target
X=pd.DataFrame(X,columns=features)
y = pd.Series(y,index=X.index)
# split data into training and testing sets
rs = ShuffleSplit(test_size=0.3, n_splits=1, random_state=0)
for train_idx, test_idx in rs.split(X):  # this looks silly
    pass
train_split = round(len(train_idx) / 2)
train1_idx = train_idx[:train_split]
train2_idx = train_idx[train_split:]
X_train = X.loc[train_idx]
X_train_1 = X.loc[train1_idx]
X_train_2 = X.loc[train2_idx]
X_test = X.loc[test_idx]
y_train = y.loc[train_idx]
y_train_1 = y.loc[train1_idx]
y_train_2 = y.loc[train2_idx]
y_test = y.loc[test_idx]
xg_train_0 = xgb.DMatrix(X_train, label=y_train)
xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)
params = {'objective': 'reg:linear', 'verbose': False}
model_0 = xgb.train(params, xg_train_0, 30)
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model=model_1)
params.update({'process_type': 'update',
               'updater': 'refresh',
               'refresh_leaf': True})
model_2_v2_update = xgb.train(params, xg_train_2, 30, xgb_model=model_1)
print('full train\t',mse(model_0.predict(xg_test), y_test)) # benchmark
print('model 1 \t',mse(model_1.predict(xg_test), y_test))
print('model 2 \t',mse(model_2_v1.predict(xg_test), y_test)) # "before"
print('model 1+2\t',mse(model_2_v2.predict(xg_test), y_test)) # "after"
print('model 1+update2\t',mse(model_2_v2_update.predict(xg_test), y_test)) # "after"
Output:
full train 17.8364309709
model 1 24.2542132108
model 2 25.6967017352
model 1+2 22.8846455135
model 1+update2 14.2816257268

I created a gist of jupyter notebook to demonstrate that xgboost model can be trained incrementally. I used boston dataset to train the model. I did 3 experiments - one shot learning, iterative one shot learning, iterative incremental learning. In incremental training, I passed the boston data to the model in batches of size 50.
The gist of the gist is that you'll have to iterate over the data multiple times for the model to converge to the accuracy attained by one shot (all data) learning.
Here is the corresponding code for doing iterative incremental learning with xgboost.
batch_size = 50
iterations = 25
model = None
for i in range(iterations):
    for start in range(0, len(x_tr), batch_size):
        model = xgb.train({
            'learning_rate': 0.007,
            'updater': 'refresh',
            'process_type': 'update',
            'refresh_leaf': True,
            #'reg_lambda': 3,  # L2
            'reg_alpha': 3,  # L1
            'silent': False,
        }, dtrain=xgb.DMatrix(x_tr[start:start+batch_size], y_tr[start:start+batch_size]), xgb_model=model)
        y_pr = model.predict(xgb.DMatrix(x_te))
        #print(' MSE itr#{}: {}'.format(int(start/batch_size), sklearn.metrics.mean_squared_error(y_te, y_pr)))
    print('MSE itr#{}: {}'.format(i, sklearn.metrics.mean_squared_error(y_te, y_pr)))
y_pr = model.predict(xgb.DMatrix(x_te))
print('MSE at the end: {}'.format(sklearn.metrics.mean_squared_error(y_te, y_pr)))
XGBoost version: 0.6

It looks like you don't need anything other than calling xgb.train(...) again while providing the model from the previous batch:
# python
params = {}  # your params here
ith_batch = 0
n_batches = 100
model = None
while ith_batch < n_batches:
    d_train = getBatchData(ith_batch)
    model = xgb.train(params, d_train, xgb_model=model)
    ith_batch += 1
This is based on https://xgboost.readthedocs.io/en/latest/python/python_api.html

If your problem is the dataset size and you do not really need incremental learning (you are not dealing with a streaming app, for instance), then you should check out Spark or Flink.
These two frameworks can train on very large datasets with little RAM by leveraging disk. Both frameworks deal with memory issues internally. While Flink solved this first, Spark has caught up in recent releases.
Take a look at:
"XGBoost4J: Portable Distributed XGBoost in Spark, Flink and Dataflow": http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html
Spark Integration: http://dmlc.ml/2016/10/26/a-full-integration-of-xgboost-and-spark.html

Regarding paulperry's code: if you change one line from "train_split = round(len(train_idx) / 2)" to "train_split = len(train_idx) - 50", the MSE of model 1+update2 changes from 14.2816257268 to 45.60806270012028, and the dump file contains a lot of "leaf=0" entries.
The updated model is not good when the update sample set is relatively small.
For binary:logistic, the updated model is unusable when the update sample set contains only one class.

One possible solution that I have not tested is to use a dask dataframe, which should act the same as a pandas dataframe but (I assume) utilizes disk and reads in and out of RAM.
Dask can also be used together with xgboost.
Further, there is an experimental option from XGBoost as well, but it is "not ready for production".

It's not based on xgboost, but there is a C++ incremental decision tree.
see gaenari.
Continuous chunking data can be inserted and updated, and rebuilds can be run if concept drift reduces accuracy.

I agree with #desertnaut in his solution.
I have a dataset where I split it into 4 batches. I have to do an initial fit without the xgb_model parameter first, then the next fits will have the xgb_model parameter, like in this (I'm using the Sklearn API):
for i, (X_batch, y_batch) in enumerate(zip(self.X_train_batched, self.y_train_batched)):
    print(f'Step: {i}', end=' ')
    if i == 0:
        model_xgbc.fit(X_batch, y_batch, eval_set=[(self.X_valid, self.y_valid)],
                       verbose=False, eval_metric=['logloss'],
                       early_stopping_rounds=400)
    else:
        model_xgbc.fit(X_batch, y_batch, eval_set=[(self.X_valid, self.y_valid)],
                       verbose=False, eval_metric=['logloss'],
                       early_stopping_rounds=400, xgb_model=model_xgbc)
    preds = model_xgbc.predict(self.X_valid)
    rmse = metrics.mean_squared_error(self.y_valid, preds, squared=False)

Hey guys, you can use my simple code for incremental model training with the xgb base class:
batch_size = 10000000
X_train = "your pandas training DataFrame"
y_train = "Your labels"
n = len(X_train)
model = None  # no model exists before the first batch
# Store eval results
evals_result = {}
Deval = xgb.DMatrix(X_valid, y_valid)
for start in range(0, n, batch_size):
    # Build the DMatrix from the current batch only
    Dtrain = xgb.DMatrix(X_train.iloc[start:start + batch_size], y_train.iloc[start:start + batch_size])
    eval_sets = [(Dtrain, 'train'), (Deval, 'eval')]
    model = xgb.train({'refresh_leaf': True,
                       'process_type': 'default',
                       'max_depth': 5,
                       'objective': 'reg:squarederror',
                       'num_parallel_tree': 2,
                       'learning_rate': 0.05,
                       'n_jobs': -1},
                      dtrain=Dtrain, evals=eval_sets, early_stopping_rounds=5, num_boost_round=100, evals_result=evals_result, xgb_model=model)
