Mixture of experts - Train best model only at each iteration - python

I am trying to implement a crude method based on the Mixture-of-Experts paper in tensorflow - https://arxiv.org/abs/1701.06538
There would be n models defined:
model_1:
    var_11
    var_12
    loss_1
    optimizer_1
model_2:
    var_21
    var_22
    loss_2
    optimizer_2
model_3:
    var_31
    var_32
    loss_3
    optimizer_3
At every iteration, I want to train only the model with the least loss while keeping the other variables constant. Is it possible to place a switch that executes only one of the optimizers?
P.S.: The basis of this problem is similar to a question I asked previously: http://stackoverflow.com/questions/42073239/tf-get-collection-to-extract-variables-of-one-scope/42074009?noredirect=1#comment71359330_42074009
Since the suggestion there did not work, I am trying to approach the problem differently.
Thanks in advance!

This seems to be doable with tf.cond:
import tensorflow as tf

def make_conditional_train_op(
    should_update, optimizers, variable_lists, losses):
  """Conditionally trains variables.

  Each argument is a Python list of Tensors, and each list must have the same
  length. Variables are updated based on their optimizer only if the
  corresponding `should_update` boolean Tensor is True at a given step.

  Returns a single train op which performs the conditional updates.
  """
  assert len(optimizers) == len(variable_lists)
  assert len(variable_lists) == len(losses)
  assert len(should_update) == len(variable_lists)
  conditional_updates = []
  for model_number, (update_boolean, optimizer, variables, loss) in enumerate(
      zip(should_update, optimizers, variable_lists, losses)):
    conditional_updates.append(
        tf.cond(update_boolean,
                lambda: tf.group(
                    optimizer.minimize(loss, var_list=variables),
                    tf.Print(0, ["Model {} updating".format(model_number),
                                 loss])),
                lambda: tf.no_op()))
  return tf.group(*conditional_updates)
The basic strategy is to make sure the optimizer's variable updates are defined inside the lambda of one of the cond branches. This gives true conditional op execution: the assignments to the variables (and to the optimizer's accumulators) happen only if that branch of the cond is triggered.
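To illustrate that point, here is a small standalone sketch (not part of the original answer) showing that ops created inside a branch lambda run only when that branch is taken, whereas anything created outside tf.cond and merely referenced by it would run unconditionally:
import tensorflow as tf

flag = tf.placeholder(tf.bool, shape=[])
counter = tf.get_variable("counter", shape=[], initializer=tf.zeros_initializer())

# The assign op is created inside the lambda, so the variable is modified
# only when `flag` is True; the False branch just reads the current value.
maybe_increment = tf.cond(flag,
                          lambda: tf.assign_add(counter, 1.0),
                          lambda: tf.identity(counter))

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  sess.run(maybe_increment, feed_dict={flag: False})
  sess.run(maybe_increment, feed_dict={flag: True})
  print(sess.run(counter))  # 1.0 -- only the True-branch run incremented it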
As an example, we can construct some models:
def make_model_and_optimizer():
  scalar_variable = tf.get_variable("scalar", shape=[])
  vector_variable = tf.get_variable("vector", shape=[3])
  loss = tf.reduce_sum(scalar_variable * vector_variable)
  optimizer = tf.train.AdamOptimizer(0.1)
  return optimizer, [scalar_variable, vector_variable], loss

# Construct each model
optimizers = []
variable_lists = []
losses = []
for i in range(10):
  with tf.variable_scope("model_{}".format(i)):
    optimizer, variables, loss = make_model_and_optimizer()
  optimizers.append(optimizer)
  variable_lists.append(variables)
  losses.append(loss)
Then determine a conditional update strategy, in this case only training the model with the maximum loss (just because that results in more switching; the output is rather boring if only one model ever updates):
# Determine which model should be updated (in this case, the one with the
# maximum loss)
integer_one_hot = tf.one_hot(
    tf.argmax(tf.stack(losses),
              axis=0),
    depth=len(losses))
is_max = tf.equal(
    integer_one_hot,
    tf.ones_like(integer_one_hot))
Finally, we can call the make_conditional_train_op function to create the train op, then do some training iterations:
train_op = make_conditional_train_op(
    tf.unstack(is_max), optimizers, variable_lists, losses)

# Repeatedly call the conditional train op
with tf.Session():
  tf.global_variables_initializer().run()
  for i in range(20):
    print("Iteration {}".format(i))
    train_op.run()
This prints the index of the model being updated and its loss at each iteration, confirming the conditional execution:
Iteration 0
I tensorflow/core/kernels/logging_ops.cc:79] [Model 6 updating][2.7271919]
Iteration 1
I tensorflow/core/kernels/logging_ops.cc:79] [Model 6 updating][2.1755948]
Iteration 2
I tensorflow/core/kernels/logging_ops.cc:79] [Model 2 updating][1.9858969]
Iteration 3
I tensorflow/core/kernels/logging_ops.cc:79] [Model 6 updating][1.6859927]

Related

Keras loss value significant jump

I am working on a simple neural network in Keras with Tensorflow. There is a significant jump in loss value from the last mini-batch of epoch L-1 to the first mini-batch of epoch L.
I am aware that the loss should decrease as the number of iterations increases, but such a significant jump in loss after each epoch does look strange. Here is the code snippet:
import tensorflow as tf
from tensorflow.keras import backend as K

tf.keras.initializers.he_uniform(seed=None)
initializer = tf.keras.initializers.he_uniform()

def my_loss(y_true, y_pred):
    epsilon = 1e-30  # epsilon is added to avoid inf/nan
    y_pred = K.cast(y_pred, K.floatx())
    y_true = K.cast(y_true, K.floatx())
    loss = y_true * K.log(y_pred + epsilon) + (1 - y_true) * K.log(1 - y_pred + epsilon)
    loss = K.mean(loss, axis=-1)
    loss = K.mean(loss)
    loss = -1 * loss
    return loss

inputs = tf.keras.Input(shape=(140,))
x = tf.keras.layers.Dense(1000, kernel_initializer=initializer)(inputs)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dense(1000, kernel_initializer=initializer)(x)
x = tf.keras.layers.ReLU()(x)
x = tf.keras.layers.Dense(1000, kernel_initializer=initializer)(x)
x = tf.keras.layers.ReLU()(x)
x = tf.keras.layers.Dense(100, kernel_initializer=initializer)(x)
outputs = tf.keras.activations.sigmoid(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)

opt = tf.keras.optimizers.Adam()
recall1 = tf.keras.metrics.Recall(top_k=8)
c_entropy = tf.keras.losses.BinaryCrossentropy()

model.compile(loss=c_entropy, optimizer=opt, metrics=[recall1, my_loss], run_eagerly=True)
model.fit(X_train_test, Y_train_test, epochs=epochs, batch_size=batch, shuffle=True, verbose=1)
When I searched online, I found this article, which suggests that Keras calculates a moving average over the mini-batches. I also found that the array used for calculating the moving average is reset after each epoch, which is why we obtain a very smooth curve within an epoch but a jump after each epoch.
To avoid the moving average, I implemented my own loss function, which should output the loss value of the current mini-batch instead of the moving average over batches. Since each mini-batch is different, the corresponding losses should also differ, so I was expecting an arbitrary loss value for each mini-batch from my implementation. Instead, I obtain exactly the same values as the built-in Keras loss.
I am unclear about:
1. Is Keras calculating a moving average over the mini-batches, with the array reset after each epoch, causing the jump? If not, what is causing the jump in the loss value?
2. Is my implementation of the per-mini-batch loss correct? If not, how can I obtain the loss value of each mini-batch during training?
Keras in fact shows the moving average instead of the "raw" loss values. The moving-average array is reset after each epoch, which is why we see a huge jump after each epoch. To acquire the raw loss values, one should implement a callback as shown below:
import keras

class LossHistory(keras.callbacks.Callback):
    def on_train_begin(self, logs={}):
        # initialize a list at the beginning of training
        self.losses = []

    def on_batch_end(self, batch, logs={}):
        self.losses.append(logs.get('loss'))

mycallback = LossHistory()
Then call it in model.fit
model.fit(X, Y, epochs=epochs, batch_size=batch, shuffle=True, verbose = 0, callbacks=[mycallback])
print(mycallback.losses)
I tested with the following configuration
Keras 2.3.1
Tensorflow 2.1.0
Python 3.7.9
For some reason, it didn't work with the following configuration
Keras 2.4.3
Tensorflow 2.2.0
Python 3.8.5
To answer the second question: the implementation of the loss function my_loss is correct, and the values it produces are very close to those generated by the built-in function tf.keras.losses.BinaryCrossentropy().
In TensorFlow version 2.2 and newer, the loss provided to on_train_batch_end is now the average loss of all batches up until the current batch within the given epoch. This is also the case for additional metrics, and applies to the built-in losses/metrics as well as any custom losses/metrics.
Fortunately, the loss for the current batch can be calculated from the average loss as follows:
from tensorflow.keras.callbacks import Callback

class CustomCallback(Callback):
    ''' This callback converts the average loss (default behavior in TF>=2.2)
    into the loss for only the current batch.
    '''
    def on_epoch_begin(self, epoch, logs={}):
        self.previous_loss_sum = 0

    def on_train_batch_end(self, batch, logs={}):
        # calculate loss of current batch:
        current_loss_sum = (batch + 1) * logs['loss']
        current_loss = current_loss_sum - self.previous_loss_sum
        self.previous_loss_sum = current_loss_sum

        # use current_loss:
        # ...
This code can be added to any custom callback that needs the loss for the current batch instead of the average loss, including the LossHistory callback provided in Doc Jazzy's answer.
Also, if you are using Tensorflow 1 or TensorFlow 2 version <= 2.1, then do not include this code in your callback, as in those versions the current loss is already provided, instead of the average loss.
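For completeness, here is a minimal usage sketch (not from the original answer; it assumes model, X_train_test, Y_train_test, epochs, and batch are defined as in the question) that attaches the callback to model.fit:
batch_loss_callback = CustomCallback()
model.fit(X_train_test, Y_train_test,
          epochs=epochs, batch_size=batch,
          shuffle=True, verbose=1,
          callbacks=[batch_loss_callback])
# Inside on_train_batch_end, current_loss could for example be appended to a
# list on the callback, mirroring the LossHistory callback above.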

Updating based on two different loss functions, but with a different optimizer learning rate after each one (pytorch)?

I have a setup as follows, where I have an outer for loop iterating over epochs and an inner for loop iterating over batches.
In the inner for loop, over batches, I'm using a cross-entropy loss and an Adam optimizer with a certain learning rate.
After the inner for loop (after all batches are evaluated), I'm then calculating another loss function based on the output (a custom loss function) and optimizing.
However, I notice that when I define a different optimizer with a different learning rate, it doesn't seem to be training. When I keep the same optimizer, things change, but when I replace it, they don't. Example as follows:
net = <my defined model, from another function>
optimizer_1 = torch.optim.Adam(net.parameters(), lr=0.1)
optimizer_2 = torch.optim.Adam(net.parameters(), lr=0.01)

for epoch in range(num_epochs):
    for data in training_data:  # these are the batches
        <get output here>
        loss1 = <compute loss function>
        optimizer_1.zero_grad()
        loss1.backward()
        optimizer_1.step()

    loss2 = <compute a different loss function here>
    optimizer_2.zero_grad()  # use a second optimizer with a different learning rate
    loss2.backward()
    optimizer_2.step()
When I do this, it seems like it doesn't actually carry through with the second optimization on the second loss function. Why is this? I want the second optimization to have a different learning rate than the first one, yet only continuing to use the first optimizer, optimizer_1, with its respective learning rate seems to work.
First, you can accumulate loss1 in the inner loop.
Next, you might want to consider merging the two loss functions:
(sum(accumulated_loss1) + loss2).backward()
This ensures both losses are considered during training and all the gradients are propagated in the backward pass.
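A minimal sketch of that suggestion (assumptions, not from the question: a single optimizer, plus placeholder names net, training_data, criterion for the per-batch loss, and custom_loss for the epoch-level loss) might look like this:
import torch

optimizer = torch.optim.Adam(net.parameters(), lr=0.01)

for epoch in range(num_epochs):
    accumulated_loss1 = []
    optimizer.zero_grad()
    for inputs, targets in training_data:
        outputs = net(inputs)
        loss1 = criterion(outputs, targets)
        accumulated_loss1.append(loss1)  # keep the tensors (and their graphs); do not call .item()

    loss2 = custom_loss(outputs)         # placeholder for the custom, epoch-level loss
    total_loss = sum(accumulated_loss1) + loss2
    total_loss.backward()                # gradients flow from both losses
    optimizer.step()
Note that keeping every per-batch graph alive until the end of the epoch can be memory-hungry, which is the main trade-off of this approach.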

what is the first value returned by adam optimizer in tensorflow

I am using TensorFlow to create a two-layer DNN. I used AdamOptimizer as my optimizer. Below is my code:
pred_raw = create_feedforward_nn_model(x, weights, biases)
pred = tf.round(tf.nn.sigmoid(pred_raw))
cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=pred_raw))
train_op = tf.train.AdamOptimizer(learning_rate).minimize(cost)
When I do session.run on train_op it returns two values. I know that the second value is the loss, but what is the first value here? I have seen many tutorials, but most of them ignore the first value by using _, as shown below:
_, loss = sess.run([train_op, cost], feed_dict={x: batch_features, y: batch_labels})
In mnist/mechanics under Check the Status you can find the following line:
Since train_op is an Operation with no output value, the corresponding element in the returned tuple is None and, thus, discarded. However, the value of the loss tensor may become NaN if the model diverges during training, so we capture this value for logging.
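A quick way to see this for yourself (a minimal sketch; it assumes a running tf.Session named sess and the placeholders and ops from the question) is to inspect the returned list directly:
result = sess.run([train_op, cost], feed_dict={x: batch_features, y: batch_labels})
print(result[0])  # None -- train_op is an Operation with no output value
print(result[1])  # the scalar loss for this batch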

How to gradually train with more and more classes?

I'm trying to create an incremental classifier that will get trained on data containing n classes for some set number of epochs, then n+m classes for a set number of epochs, then n+m+k, etc, where each successive set of classes contains the previous set as a subset.
In order to do this without having to train the model, save it, manually edit the graph, re-train, repeat, I'm simply defining all the weights I will need to classify the entire set of classes, but keeping the weights corresponding to unseen classes frozen at 0 until the classifier is introduced to those classes.
My strategy for this is to define a placeholder that is fed an array of Boolean values defining whether or not a given set of weights is trainable.
Relevant code below:
output_train = tf.placeholder(tf.int32, shape=(num_incremental_grps), name="output_train")
.
.
.
weights = []
biases = []
for i in range(num_incremental_grps):
    W = tf.Variable(tf.zeros([batch_size, classes_per_grp]),
                    trainable=tf.cond(tf.equal(output_train[i], tf.constant(1)),
                                      lambda: tf.constant(True), lambda: tf.constant(False)))
    weights.append(W)
    b = tf.Variable(tf.zeros([classes_per_grp]),
                    trainable=tf.cond(tf.equal(output_train[i], tf.constant(1)),
                                      lambda: tf.constant(True), lambda: tf.constant(False)))
    biases.append(b)

out_weights = tf.stack(weights, axis=1).reshape((batch_size, -1))
out_biases = tf.stack(biases, axis=1).reshape((batch_size, -1))
outputs = tf.identity(tf.matmul(inputs, out_weights) + out_biases, name='values')
.
.
.
# Will change this to an array that progressively updates as classes are added.
output_trainable = np.ones(num_incremental_grps, dtype=bool)
.
.
.
with tf.Session() as sess:
    init.run()
    for epoch in range(epochs):
        for iteration in range(iterations):
            X_batch, y_batch = batch.getBatch()
            fd = {X: X_batch, y: y_batch, training: True, output_train: output_trainable}
            _, loss_val = sess.run([training_op, loss], feed_dict=fd)
This returns the error message:
Using a `tf.Tensor` as a Python `bool` is not allowed. Use `if t is not None:` instead of
`if t:` to test if a tensor is defined, and use TensorFlow ops such as tf.cond to execute
subgraphs conditioned on the value of a tensor.
I've tried tinkering around with this, such as making the initial placeholder datatype tf.bool instead of tf.int32. I've also tried feeding a slice of the tensor into the trainable argument of the weights/biases, like this:
W = tf.Variable(tf.zeros([batch_size, classes_per_grp]), trainable=output_variable[i])
but I get the same error message. I'm not sure how to proceed from here, aside from trying a completely different approach to updating the number of predictable classes. Any help would be much appreciated.
The error occurs because tf.cond makes a decision based on a single boolean, much like an if statement. What you want here is to make a choice per element of your tensor.
You could use tf.where to fix that problem, but then you would run into another one: trainable is not a property that you can set at runtime; it is part of the definition of a variable. If a variable will be trained at some point, perhaps not at the beginning but definitely later, then it must be trainable.
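Just to illustrate the per-element point (a small sketch, not part of the suggested solution): tf.where selects element-wise between two tensors based on a boolean tensor of the same shape, which is exactly the per-element choice tf.cond cannot make:
import tensorflow as tf

mask = tf.constant([1, 0, 1])
a = tf.constant([10.0, 20.0, 30.0])
b = tf.zeros([3])
selected = tf.where(tf.equal(mask, 1), a, b)

with tf.Session() as sess:
    print(sess.run(selected))  # [10.  0. 30.] -- chosen element by element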
I would suggest taking a much simpler route: define output_train as an array of tf.float32:
output_train = tf.placeholder(tf.float32, shape=(num_incremental_grps), name="output_train")
then later simply multiply your weights and biases by this vector:
W = tf.Variable(...)
W = W * output_train
...
Provide values of 1 to output_train where you want training to happen, 0 otherwise.
Be careful to also mask your loss to ignore the output from unwanted channels, because even though they now always output 0, that may still affect your loss. For example,
logits = ...
logits = tf.matrix_transpose(tf.boolean_mask(
    tf.matrix_transpose(logits),
    tf.equal(output_train, 1)))
loss = tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=labels)
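As a usage sketch (an assumption on my part, reusing names from the question such as sess, X_batch, y_batch, and training_op), the mask would then be fed alongside the data, with entries flipped from 0 to 1 as new class groups are introduced:
# Start with only the first class group trainable; flip later entries to 1.0
# as new groups of classes are introduced.
output_mask = np.zeros(num_incremental_grps, dtype=np.float32)
output_mask[0] = 1.0

fd = {X: X_batch, y: y_batch, training: True, output_train: output_mask}
_, loss_val = sess.run([training_op, loss], feed_dict=fd)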

Global Step for Differential Learning Rate

Based on this question, I am trying to implement differential learning rates as follows:
var_list1 = [variables from first 5 layers]
var_list2 = [the rest of variables]

# Create two separate optimizers
opt1 = tf.train.AdamOptimizer(0.00001)
opt2 = tf.train.AdamOptimizer(0.0001)

# Compute gradients for each set of variables
grads1, variables1 = zip(*opt1.compute_gradients(loss, var_list1))
grads2, variables2 = zip(*opt2.compute_gradients(loss, var_list2))

# Apply gradients
train_op1 = opt1.apply_gradients(zip(grads1, variables1))
train_op2 = opt2.apply_gradients(zip(grads2, variables2), global_step=global_step)
train_op = tf.group(train_op1, train_op2)
I am unsure whether global_step should be included in each apply_gradients call or only in one of them. My understanding is that when apply_gradients is called, global_step is incremented by 1 if it is supplied (code here). Based on this, I believe I should include global_step in only one of my apply_gradients() calls. Can anybody confirm that this is the correct approach?
The alternative to what I have above would be to do the following:
train_op1 = opt1.apply_gradients(zip(grads1, variables1), global_step=global_step)
train_op2 = opt2.apply_gradients(zip(grads2, variables2), global_step=global_step)
While technically each call to apply_gradients is a step, my understanding is that global_step should represent the number of mini-batches that have been completed, so if I were to reference it in both apply_gradients() calls, the global step would increase twice per mini-batch. Based on this, I believe the more accurate implementation is the first one, where it is passed only once. Would others agree this is the correct implementation? Does it matter which apply_gradients() call the global_step is included in?
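One quick way to check the increment behaviour (a minimal sketch, assuming loss, global_step, and the variable lists above are defined) is to run a single step of the grouped op and read global_step back; with global_step passed to only one apply_gradients call it should read 1 after the first mini-batch, and 2 if it were passed to both:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)            # one mini-batch
    print(sess.run(global_step))  # 1 with the first implementation, 2 with the second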
