Context:
I am using the Passive Aggressive algorithms from the scikit-learn library and am confused about whether to use warm_start or partial_fit.
Efforts hitherto:
Referred to this thread discussion:
https://github.com/scikit-learn/scikit-learn/issues/1585
Gone through the scikit-learn code for _fit and _partial_fit.
My observations:
_fit in turn calls _partial_fit.
When warm_start is set, _fit calls _partial_fit, passing self.coef_ as coef_init.
When _partial_fit is called without the coef_init parameter and self.coef_ is already set, it continues to use self.coef_.
Question:
I feel both ultimately provide the same functionality. What, then, is the basic difference between them? In which contexts is each of them used?
Am I missing something evident? Any help is appreciated!
I don't know about the Passive Aggressive models specifically, but at least when using the SGDRegressor, partial_fit will only fit for 1 epoch, whereas fit will fit for multiple epochs (until the loss converges or max_iter is reached). Therefore, when fitting new data to your model, partial_fit will only correct the model one step towards the new data, whereas with fit and warm_start it will act as if you had combined your old data and your new data and fit the model once until convergence.
Example:
from sklearn.linear_model import SGDRegressor
import numpy as np
np.random.seed(0)
X = np.linspace(-1, 1, num=50).reshape(-1, 1)
Y = (X * 1.5 + 2).reshape(50,)
modelFit = SGDRegressor(learning_rate="adaptive", eta0=0.01, random_state=0, verbose=1,
                        shuffle=True, max_iter=2000, tol=1e-3, warm_start=True)
modelPartialFit = SGDRegressor(learning_rate="adaptive", eta0=0.01, random_state=0, verbose=1,
                               shuffle=True, max_iter=2000, tol=1e-3, warm_start=False)
# first fit some data
modelFit.fit(X, Y)
modelPartialFit.fit(X, Y)
# for both: Convergence after 50 epochs, Norm: 1.46, NNZs: 1, Bias: 2.000027, T: 2500, Avg. loss: 0.000237
print(modelFit.coef_, modelPartialFit.coef_) # for both: [1.46303288]
# now fit new data (zeros)
newX = X
newY = 0 * Y
# fits only for 1 epoch, Norm: 1.23, NNZs: 1, Bias: 1.208630, T: 50, Avg. loss: 1.595492:
modelPartialFit.partial_fit(newX, newY)
# Convergence after 49 epochs, Norm: 0.04, NNZs: 1, Bias: 0.000077, T: 2450, Avg. loss: 0.000313:
modelFit.fit(newX, newY)
print(modelFit.coef_, modelPartialFit.coef_) # [0.04245779] vs. [1.22919864]
newX = np.reshape([2], (-1, 1))
print(modelFit.predict(newX), modelPartialFit.predict(newX)) # [0.08499296] vs. [3.66702685]
If warm_start = False, each subsequent call to .fit() (after an initial call to .fit() or partial_fit()) will re-initialise the model's trainable parameters. If warm_start = True, each subsequent call to .fit() (after an initial call to .fit() or partial_fit()) will retain the values of the model's trainable parameters from the previous run and start from those.
Regardless of the value of warm_start, each call to partial_fit() will retain the previous run's model parameters and use those initially.
Example using MLPRegressor:
import sklearn.neural_network
import numpy as np
np.random.seed(0)
x = np.linspace(-1, 1, num=50).reshape(-1, 1)
y = (x * 1.5 + 2).reshape(50,)
cold_model = sklearn.neural_network.MLPRegressor(hidden_layer_sizes=(), warm_start=False, max_iter=1)
warm_model = sklearn.neural_network.MLPRegressor(hidden_layer_sizes=(), warm_start=True, max_iter=1)
cold_model.fit(x,y)
print(cold_model.coefs_, cold_model.intercepts_)
#[array([[0.17009494]])] [array([0.74643783])]
cold_model.fit(x,y)
print(cold_model.coefs_, cold_model.intercepts_)
#[array([[-0.60819342]])] [array([-1.21256186])]
#after second run of .fit(), values are completely different
#because they were re-initialised before doing the second run for the cold model
warm_model.fit(x,y)
print(warm_model.coefs_, warm_model.intercepts_)
#[array([[-1.39815616]])] [array([1.651504])]
warm_model.fit(x,y)
print(warm_model.coefs_, warm_model.intercepts_)
#[array([[-1.39715616]])] [array([1.652504])]
#this time with the warm model, params change relatively little, as params were
#not re-initialised during second call to .fit()
cold_model.partial_fit(x,y)
print(cold_model.coefs_, cold_model.intercepts_)
#[array([[-0.60719343]])] [array([-1.21156187])]
cold_model.partial_fit(x,y)
print(cold_model.coefs_, cold_model.intercepts_)
#[array([[-0.60619347]])] [array([-1.21056189])]
#with partial_fit(), params barely change even for cold model,
#as no re-initialisation occurs
warm_model.partial_fit(x,y)
print(warm_model.coefs_, warm_model.intercepts_)
#[array([[-1.39615617]])] [array([1.65350392])]
warm_model.partial_fit(x,y)
print(warm_model.coefs_, warm_model.intercepts_)
#[array([[-1.39515619]])] [array([1.65450372])]
#and of course the same goes for the warm model
First, let us look at the difference between .fit() and .partial_fit().
.fit() lets you train from scratch. Hence, you can think of it as an option that is used only once for a model. If you call .fit() again with a new set of data, the model will be built on the new data and will carry no influence from the previous dataset.
.partial_fit() lets you update the model with incremental data. Hence, this option can be used more than once for a model. This is useful when the whole dataset cannot be loaded into memory; refer here.
If .fit() or .partial_fit() is only going to be used once, then it makes no difference which one you use.
warm_start can only be used with .fit(); it lets you start learning from the coefficients of the previous fit(). It may sound similar in purpose to partial_fit(), but the recommended way is partial_fit(). You can also call partial_fit() with the same incremental data a few times to improve the learning.
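For illustration, here is a minimal sketch (not from the question; the data, chunk size and hyperparameters are made up) of incremental learning with PassiveAggressiveRegressor, where partial_fit is called once per chunk and the coefficients carry over between calls:
import numpy as np
from sklearn.linear_model import PassiveAggressiveRegressor

np.random.seed(0)
X = np.random.randn(1000, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * np.random.randn(1000)

model = PassiveAggressiveRegressor(random_state=0)

# feed the data in chunks, as if it were streaming in
for start in range(0, len(X), 100):
    X_chunk, y_chunk = X[start:start + 100], y[start:start + 100]
    model.partial_fit(X_chunk, y_chunk)  # continues from the current coef_
    print(model.coef_)  # drifts towards [1.5, -2.0, 0.5] chunk by chunk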
About the difference: warm_start is just an attribute of the class, whereas partial_fit is a method of the class, so they are fundamentally different things.
About the same functionality: yes, partial_fit will use self.coef_ because it still needs some values to update from during training. For an empty coef_init we simply put zero values into self.coef_ and go on to the next step of training.
Description.
For the first call:
However you start (with or without warm start), we train on zero coefficients, and at the end we save the average of our coefficients.
For the N+1th call:
With warm start: via the method _allocate_parameter_mem we pick up our previous coefficients and use them for training. At the end we save our average coefficients.
Without warm start: we put in zero coefficients (as on the first call) and go to the training step. At the end we still write the average coefficients to memory.
Related
My model performs a multi-class (3) classification task.
I would like to change the way the model "fits". Instead of calculating a metric such as accuracy or log loss, I would like to run a simulation on the whole data set to see how the model performs after each fitting step, in real time.
Please note that simulation != loss/error. The simulation takes into consideration the time component of the data, i.e. the sequence in which events occur, whereas the loss function simply calculates the error based on the true values.
Currently I do the simulation after the whole "fitting" process has been done:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

all_ds = lgb.Dataset(X, label=y)
train_ds = lgb.Dataset(X_train, label=y_train)
test_ds = lgb.Dataset(X_test, label=y_test)

params = {
    'device_type': "gpu",
    'objective': 'multiclass',
    'metric': 'multi_logloss',
    "boosting_type": "gbdt",
    "num_class": 3,
    'random_state': 123
}

# fit
model = lgb.train(
    params,
    train_ds,
    num_boost_round=20,
    valid_sets=[test_ds]
)

# make prediction on the whole data set (predict() takes the raw features, not a Dataset)
y_pred = model.predict(X)

# simulate
simulation_result = simulate(X, y_pred)  # float value
The current process is:
fit step 1 - error x
fit step 2 - error y
..
fit step 20 - error z
simulate - see how the model performs
I would like to change the process to
fit step 1 - simulate - use result of simulation as an error
fit step 2 - simulate - use result of simulation as an error
..
fit step 20 - simulate - use result of simulation as an error
Is there a way to achieve this through a custom callback, a custom evaluation metric, or some other way?
I tried creating a custom eval metric; unfortunately, I cannot invoke predict() from within the function. Moreover, I find the preds parameter value to be something I cannot simply use without transformations of some sort. It contains some kind of multidimensional array that I have no idea how to convert to actual predictions.
def customEvalMetric(preds, eval_data):
    # how to invoke predict() method on a whole dataset here?
    # OR how to convert preds to one-hot encoded values?
    # simulation_result = simulate(all_ds, ..?..)
    return 'simulation_result', simulation_result, True
and using it as:
model = lgb.train(
    params,
    train_ds,
    num_boost_round=20,
    valid_sets=[all_ds],
    feval=customEvalMetric,
)
p.s. Now that I think about it, I could in theory fit once per loop iteration, then use init_model to load the existing model weights. Is this the only way?
I suppose this question is applicable to other tree boosting libraries, since the APIs are similar (xgboost, for example).
The custom eval function should work. As per the docs, preds is:
The predicted values. Predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task.
So if this is a classification problem, you might need to apply the softmax transformation to each row. For a regression problem, you should be able to use this output as-is.
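As a rough sketch of that (hedged: the exact layout of preds depends on your LightGBM version; older versions pass the multiclass predictions as a flat array grouped by class, newer ones may already pass a 2-D array), a custom feval that turns the raw scores into class probabilities and hard predictions could look like this, where simulate and X are assumed to be your own function and feature matrix:
import numpy as np

def customEvalMetric(preds, eval_data):
    num_class = 3
    y_true = eval_data.get_label()  # true labels, in case your simulation needs them

    # reshape the flat raw scores to (num_data, num_class) if necessary
    if preds.ndim == 1:
        preds = preds.reshape(num_class, -1).T  # older layout: grouped by class first

    # softmax to get probabilities, argmax to get hard class predictions
    exp = np.exp(preds - preds.max(axis=1, keepdims=True))
    proba = exp / exp.sum(axis=1, keepdims=True)
    y_pred = proba.argmax(axis=1)

    simulation_result = simulate(X, y_pred)  # your own simulation on the predictions
    return 'simulation_result', simulation_result, True
As for the p.s.: training one boosting round at a time in a loop and passing the previous booster back in via init_model is indeed another workable way to run the simulation after every round, at the cost of some overhead per call.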
I am trying to use Keras to fit a CNN model to classify 2 classes of data. I have an imbalanced dataset and I want to balance the data. I don't know whether I can use class_weight in model.fit_generator, and I wonder what happens if I use class_weight="balanced" in model.fit_generator.
The main code:
def generate_arrays_for_training(indexPat, paths, start=0, end=100):
    while True:
        from_ = int(len(paths)/100*start)
        to_ = int(len(paths)/100*end)
        for i in range(from_, int(to_)):
            f = paths[i]
            x = np.load(PathSpectogramFolder+f)
            x = np.expand_dims(x, axis=0)
            if 'P' in f:
                y = np.repeat([[0,1]], x.shape[0], axis=0)
            else:
                y = np.repeat([[1,0]], x.shape[0], axis=0)
            yield (x, y)

history = model.fit_generator(generate_arrays_for_training(indexPat, filesPath, end=75),
                              validation_data=generate_arrays_for_training(indexPat, filesPath, start=75),
                              steps_per_epoch=int((len(filesPath)-int(len(filesPath)/100*25))),
                              validation_steps=int((len(filesPath)-int(len(filesPath)/100*75))),
                              verbose=2,
                              epochs=15, max_queue_size=2, shuffle=True, callbacks=[callback])
If you don't want to change your data creation process, you can use class_weight in your fit_generator call. You can pass a dictionary as class_weight and tune it by observation. For instance, when class_weight is not used and you have 50 examples for class 0 and 100 examples for class 1, the loss function weighs every example uniformly, so the under-represented class 0 tends to be neglected. But when you set:
class_weight = {0: 2, 1: 1}
the loss function will now give 2 times the weight to class 0. Misclassification of the under-represented data therefore receives 2 times more punishment than before, and the model can handle the imbalanced data better.
Note that Keras does not accept the string class_weight='balanced' the way scikit-learn does; class_weight has to be a dictionary such as class_weight = {0: a1, 1: a2}. Try different values for a1 and a2 so you can see the difference, or compute balanced weights yourself as in the sketch below.
Also, instead of using class_weight you can use undersampling methods for imbalanced data. Check out bootstrapping/resampling methods for that purpose.
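As a minimal sketch of that (assuming the class labels can be derived from the file names, as in the generator above; everything else is taken from the question's code), balanced weights can be computed with scikit-learn and passed as a dictionary:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# assumption: files containing 'P' belong to class 1, as in the generator above
train_labels = np.array([1 if 'P' in f else 0 for f in filesPath])

weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(train_labels),
                               y=train_labels)
class_weight = dict(enumerate(weights))  # e.g. {0: 1.5, 1: 0.75}

history = model.fit_generator(generate_arrays_for_training(indexPat, filesPath, end=75),
                              validation_data=generate_arrays_for_training(indexPat, filesPath, start=75),
                              steps_per_epoch=int(len(filesPath) - int(len(filesPath)/100*25)),
                              validation_steps=int(len(filesPath) - int(len(filesPath)/100*75)),
                              class_weight=class_weight,
                              verbose=2, epochs=15, max_queue_size=2,
                              shuffle=True, callbacks=[callback])
Keras should then weight each sample's loss by its class; with one-hot targets like the ones yielded above, the class index is taken from the argmax of each label row.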
I use TensorFlow to train a DNN. I learned that batch normalization is very helpful for DNNs, so I used it in my DNN.
I use tf.layers.batch_normalization and follow the instructions of the API document to build the network: when training, set its parameter training=True, and when validating, set training=False. And add the tf.get_collection(tf.GraphKeys.UPDATE_OPS) dependency.
Here is my code:
# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np
input_node_num=257*7
output_node_num=257
tf_X = tf.placeholder(tf.float32,[None,input_node_num])
tf_Y = tf.placeholder(tf.float32,[None,output_node_num])
dropout_rate=tf.placeholder(tf.float32)
flag_training=tf.placeholder(tf.bool)
hid_node_num=2048
h1=tf.contrib.layers.fully_connected(tf_X, hid_node_num, activation_fn=None)
h1_2=tf.nn.relu(tf.layers.batch_normalization(h1,training=flag_training))
h1_3=tf.nn.dropout(h1_2,dropout_rate)
h2=tf.contrib.layers.fully_connected(h1_3, hid_node_num, activation_fn=None)
h2_2=tf.nn.relu(tf.layers.batch_normalization(h2,training=flag_training))
h2_3=tf.nn.dropout(h2_2,dropout_rate)
h3=tf.contrib.layers.fully_connected(h2_3, hid_node_num, activation_fn=None)
h3_2=tf.nn.relu(tf.layers.batch_normalization(h3,training=flag_training))
h3_3=tf.nn.dropout(h3_2,dropout_rate)
tf_Y_pre=tf.contrib.layers.fully_connected(h3_3, output_node_num, activation_fn=None)
loss=tf.reduce_mean(tf.square(tf_Y-tf_Y_pre))
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i1 in range(3000*num_batch):
        train_feature = ...  # Some processing
        train_label = ...    # Some processing
        # When I train with training=True and validate with training=False, I get a bad result.
        # However, when I set training=False for both training and validation, I get a better result.
        sess.run(train_step, feed_dict={tf_X: train_feature, tf_Y: train_label, flag_training: True, dropout_rate: 1})
        if (i1+1) % 277200 == 0:  # print validation loss every 0.1 epoch
            validate_feature = ...  # Some processing
            validate_label = ...    # Some processing
            validate_loss = sess.run(loss, feed_dict={tf_X: validate_feature, tf_Y: validate_label, flag_training: False, dropout_rate: 1})
            print(validate_loss)
Is there any error in my code?
If my code is right, I think I am getting a strange result:
When training I set training=True, and when validating I set training=False; the result is not good. I print the validation loss every 0.1 epoch, and the validation loss over the 1st to 3rd epoch is
0.929624
0.992692
0.814033
0.858562
1.042705
0.665418
0.753507
0.700503
0.508338
0.761886
0.787044
0.817034
0.726586
0.901634
0.633383
0.783920
0.528140
0.847496
0.804937
0.828761
0.802314
0.855557
0.702335
0.764318
0.776465
0.719034
0.678497
0.596230
0.739280
0.970555
However, when I change the line sess.run(train_step, feed_dict={tf_X: train_feature, tf_Y: train_label, flag_training: True, dropout_rate: 1}) so that training=False both when training and when validating, the result is good. The validation loss in the 1st epoch is
0.474313
0.391002
0.369357
0.366732
0.383477
0.346027
0.336518
0.368153
0.330749
0.322070
0.335551
Why does this result appear? Is it necessary to set training=True when training and training=False when validating?
TL;DR: Use a smaller-than-default momentum for the normalization layers, like this:
tf.layers.batch_normalization( h1, momentum = 0.9, training=flag_training )
TS;WM:
When you set training = False that means the batch normalization layer will use its internally stored average of mean and variance to normalize the batch, not the batch's own mean and variance. When training = False, those internal variables also don't get updated. Since they are initialized to mean = 0 and variance = 1 it means that batch normalization is effectively turned off - the layer subtracts zero and divides the result by 1.
So if you train with training = False and evaluate like that, that just means you're training your network without any batch normalization whatsoever. It will still yield reasonable results, because hey, there was life before batch normalization, albeit admittedly not that glamorous...
If you turn on batch normalization with training = True, it will start to normalize the batches within themselves and collect a moving average of the mean and variance of each batch. Now here's the tricky part. The moving average is an exponential moving average, with a default momentum of 0.99 for tf.layers.batch_normalization(). The mean starts at 0 and the variance at 1 again. But since each update is applied with a weight of (1 - momentum), the stored values only asymptotically approach the actual mean and variance. For example, after 100 steps they will have covered only about 63.4% of the distance to the real values, because 0.99^100 ≈ 0.366. If you have numerically large values, the difference can be enormous.
So if you have only processed a relatively small number of batches, the internally stored mean and variance can still be significantly off by the time you run the test. Then your network is trained on properly normalized data but tested on mis-normalized data.
In order to speed up the convergence of the internal batch normalization values, you can apply a smaller momentum, like 0.9:
tf.layers.batch_normalization( h1, momentum = 0.9, training=flag_training )
(Repeat for all batch normalization layers.) Please note that there is a downside to this, however. Random fluctuations in your data will "tug" on your stored mean and variance a lot more with a small momentum like this, and the resulting values (later used in inference) can be greatly influenced by where exactly you stop the training, which is clearly not optimal. It is useful to have as large a momentum as possible. Depending on the number of training steps, we generally use 0.9, 0.99 and 0.999 for 100, 1,000 and 10,000 training steps respectively; there is no point in going over 0.999.
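To make that trade-off concrete, here is a small standalone sketch (not part of the original answer) that computes what fraction of the true batch statistic the exponential moving average has absorbed after a given number of updates:
# with v_new = momentum * v_old + (1 - momentum) * x, the fraction of the true
# statistic x absorbed after n updates (starting from the initial value) is 1 - momentum**n
def ema_fraction(momentum, steps):
    return 1.0 - momentum ** steps

for momentum in (0.9, 0.99, 0.999):
    for steps in (100, 1000, 10000):
        print("momentum=%s, steps=%s: %.3f" % (momentum, steps, ema_fraction(momentum, steps)))
# e.g. momentum=0.99 after 100 steps gives 0.634, i.e. the stored mean/variance are still ~37% short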
Another important thing is proper randomization of the training data. If you train first with, say, the smaller numeric values of your whole data set, the normalization will converge even more slowly. It is best to completely randomize the order of the training data and to make sure you use a batch size of at least 14 (rule of thumb).
Side note: it is known that zero debiasing the values can speed up convergence significantly, and the ExponentialMovingAverage class has this feature. But the batch normalization layers don't have this feature, save for tf.slim's batch_norm, if you're willing to restructure your code for slim.
The reason that setting training = False improves performance is that batch normalization has four variables (beta, gamma, mean, variance). It is true that mean and variance don't get updated when training = False. However, gamma and beta still get updated. So your model has two extra trainable variables and thus gets better performance.
Also, I guess that your model has relatively good performance even without batch normalization.
I'm working on a Kaggle competition (https://www.kaggle.com/c/house-prices-advanced-regression-techniques#evaluation) and it states that my model will be evaluated by:
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
I couldn't find this in the docs (it's basically RMSE(log(truth), log(prediction))), so I went about writing a custom scorer:
def custom_loss(truth, preds):
    truth_logs = np.log(truth)
    print(truth_logs)
    preds_logs = np.log(preds)
    numerator = np.sum(np.square(truth_logs - preds_logs))
    return np.sum(np.sqrt(numerator / len(truth)))

custom_scorer = make_scorer(custom_loss, greater_is_better=False)
Two questions:
1) Should my custom loss function return a numpy array of scores (one for each (truth, prediction) pair)? Or should it be the total loss over those (truth, prediction) pairs, returning a single number?
I looked into the docs but they weren't super helpful re: what my custom loss function should return.
2) When I run:
xgb_model = xgb.XGBRegressor()
params = {"max_depth": [3, 4], "learning_rate": [0.05],
          "n_estimators": [1000, 2000], "n_jobs": [8], "subsample": [0.8], "random_state": [42]}
grid_search_cv = GridSearchCV(xgb_model, params, scoring=custom_scorer,
                              n_jobs=8, cv=KFold(n_splits=10, shuffle=True, random_state=42), verbose=2)
grid_search_cv.fit(X, y)
grid_search_cv.best_score_
I get back:
-0.12137097567803554
which is very surprising. Given that my loss function is taking RMSE(log(truth) - log(prediction)), I shouldn't be able to have a negative best_score_.
Any idea why it's negative?
Thanks!
1) You should return a single number as the loss, not an array. GridSearchCV will rank the parameter combinations according to the results of this scorer.
By the way, instead of defining a custom metric you can use mean_squared_log_error, which does what you want (a minimal sketch follows below).
2) Why does it return a negative value? Without your actual data and complete code we can't say.
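As a hedged sketch of that suggestion (assuming all targets and predictions are positive, since logs are taken), an RMSLE scorer built on sklearn's mean_squared_log_error could look like this:
import numpy as np
from sklearn.metrics import mean_squared_log_error, make_scorer

def rmsle(truth, preds):
    # mean_squared_log_error uses log1p, i.e. log(1 + x), a slight variation
    # on plain log(x) that behaves essentially the same for large sale prices
    return np.sqrt(mean_squared_log_error(truth, preds))

# greater_is_better=False makes scikit-learn negate the value internally,
# which is why best_score_ comes back as -RMSLE rather than +RMSLE
rmsle_scorer = make_scorer(rmsle, greater_is_better=False)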
You should be careful with the notation.
There are 2 levels of optimization here:
The loss function optimized when the XGBRegressor is fitted to the data.
The scoring function that is optimized during the grid search.
I prefer calling the second scoring function instead of loss function, since loss function usually refers to a term that is subject to optimization during the model fitting process itself.
However, your custom function only specifies 2., leaving 1. untouched. In case you want to change the loss function of XGBRegressor, see here. Most regression models offer several criteria to choose from, such as mean squared error or mean absolute error.
Note that passing customized loss functions is not supported at the moment (see the reasons here and here).
The make_scorer function flips the sign when greater_is_better is False, so the reported best_score_ of -0.121 simply corresponds to an RMSLE of 0.121.
Using machine learning (as libraries I've tried TensorFlow and TFLearn, which, I know, is just a wrapper around TensorFlow), I'm trying to predict the congestion in an area for the next week (see my previous questions if you want more backstory on it). My training set is composed of 400K tagged entries (with a date and a congestion value for each minute).
My problem is that I now have a time gap between predictions and reality.
If I drew a chart with the reality and the prediction, you would see that my prediction, while having the same shape as the reality, is ahead of it: it increases/decreases before the reality does. That started to make me think that maybe my training had a problem; it seems like my prediction doesn't start where my training ended.
Both of my data sets (training/testing) are in 2 different files. First I train on my training set (for convenience's sake let's say it ends at the 100th minute and my testing set starts at the 101st minute); once my model is saved I make my predictions. It should then normally start predicting at minute 101, or am I wrong somewhere? Because it seems like it starts predicting way after my training stopped (to keep my example, it would start predicting value 107, for instance).
For now, one bad fix was to remove from the training set as many values as I had of delay (in this example, 7), and it worked: no more delay. But I don't understand why I have this problem or how to fix it so it doesn't happen later.
Following some advice found on different websites, it seems like having gaps in my training dataset (missing timestamps in this case) could be a problem. Seeing that there were indeed some (in total around 7 to 9% of the whole dataset was missing), I used pandas to add the missing timestamps (giving them the congestion value of the last known timestamp, roughly along the lines of the sketch below). While I do think it may have helped a little (the gap is smaller), it hasn't fixed the problem.
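A minimal sketch of that kind of gap filling with pandas (hypothetical file and column names, not the original code), reindexing to a regular 1-minute grid and forward-filling the last known congestion value:
import pandas as pd

# assumption: the training file has a 'timestamp' column and a 'congestion' column
df = pd.read_csv("training.csv", parse_dates=["timestamp"])
df = df.set_index("timestamp").sort_index()

# build a complete minute-by-minute index and forward-fill the missing values
full_index = pd.date_range(df.index.min(), df.index.max(), freq="1min")
df = df.reindex(full_index).ffill()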
I tried multi-step forecasting, multivariate forecasting, LSTM, GRU, MLP, TensorFlow and TFLearn, but it changes nothing, which makes me think the problem could come from my training.
Here is my model training.
from keras.models import Sequential
from keras.layers import Dense, LSTM

def fit_lstm(train, batch_size, nb_epoch, neurons):
    # last column is the target, the rest are the features
    X, y = train[:, 0:-1], train[:, -1]
    X = X.reshape(X.shape[0], 1, X.shape[1])  # [samples, timesteps, features]
    print(X.shape)
    print(y.shape)
    model = Sequential()
    model.add(LSTM(neurons, batch_input_shape=(None, X.shape[1], X.shape[2]), stateful=False))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    for i in range(nb_epoch):
        model.fit(X, y, epochs=1, batch_size=batch_size, verbose=0, shuffle=False)
        model.reset_states()
    return model
The 2 shapes are:
(80485, 1, 1)
(80485,)
(In this example I'm using only 80K rows of data for training, for speed purposes.)
As parameters I'm using 1 neuron, a batch_size of 64 and 5 epochs.
My dataset is made of 2 files. The first is the training file, with 2 columns:
timestamp | values
The second has the same shape but is the testing set (kept separate to avoid any influence of it on my prediction); the file is only used once every prediction has been made, to compare reality and prediction. The testing set starts where the training set stops.
Do you have an idea of what could be the reason for this problem?
Edit:
In my code I have this function:
# invert differencing
yhat = inverse_difference(raw_values, yhat, len(test_scaled)+1-i)

# invert differenced value
def inverse_difference(history, yhat, interval=1):
    return yhat + history[-interval]
It's supposed to invert the differencing (to go from a scaled value back to the real one).
When using it as in the pasted example (i.e. using the testing set) I get perfection: accuracy above 95% and no gap.
Since in reality we wouldn't know these values, I had to change that.
I first tried to use the training set instead, but got the problem explained in this post:
Why is this happening? Is there an explanation for this problem?
Found it. It was a problem with the def inverse_difference(history, yhat, interval=1): function. In fact it made my results look like the last lines of my training data. That is why I had a gap: since there is a pattern in my data (a peak at more or less the same moment every day), I thought the model was making predictions when it was really just giving me back values from the training set.
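For anyone hitting the same issue, here is a rough conceptual sketch (hypothetical variable names, assuming a one-step differenced series) of how to invert the differencing without leaking test values: keep track of the last reconstructed value yourself and add each newly predicted difference onto it, instead of indexing into the true series:
# last_value starts as the last real observation from the training set
last_value = raw_train_values[-1]

predictions = []
for diff_pred in predicted_differences:   # the model's predicted differences, in time order
    current = last_value + diff_pred      # invert the differencing using our own history
    predictions.append(current)
    last_value = current                  # the next step builds on the prediction, not on test data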