Tensorflow: Multiple loss functions vs Multiple training ops - python

I am creating a Tensorflow model which predicts multiple outputs (with different activations). I think there are two ways to do this:
Method 1: Create multiple loss functions (one for each output), merge them (using tf.reduce_mean or tf.reduce_sum) and pass it to the training op like so:
final_loss = tf.reduce_mean(loss1 + loss2)
train_op = tf.train.AdamOptimizer().minimize(final_loss)
Method 2: Create multiple training operations and then group them like so:
train_op1 = tf.train.AdamOptimizer().minimize(loss1)
train_op2 = tf.train.AdamOptimizer().minimize(loss2)
final_train_op = tf.group(train_op1, train_op2)
My question is whether one method is advantageous over the other? Is there a third method I don't know?
Thanks

I want to make a subtle point that I don't think was made in previous answers.
If you were using something like GradientDescentOptimizer, these would be very similar operations. That's because taking gradients is a linear operation, and the gradient of a sum is the same as the sum of the gradients.
But ADAM does something special: regardless of the scale of your loss, it scales the gradients so that they're always on the order of your learning rate. If you multiplied your loss by 1000, it wouldn't affect ADAM, because the change would be normalized away.
So, if your two losses are roughly the same magnitude, then it shouldn't make a difference. If one is much larger than the other, then keep in mind that summing before the minimization will essentially ignore the small one, while making two ops will spend equal effort minimizing both.
I personally like dividing them up, which gives you more control over how much to focus on one loss or the other. For example, if it was multi-task learning, and one task was more important to get right than the other, two ops with different learning rates roughly accomplish this.
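To make that weighting concrete, here is a minimal sketch in the question's TF 1.x graph-mode style: two toy losses, each with its own AdamOptimizer and its own (arbitrary, illustrative) learning rate, grouped into one op. The losses and learning rates are placeholders, not anyone's real model.
import tensorflow.compat.v1 as tf  # TF 1.x-style API
tf.disable_v2_behavior()

x = tf.placeholder(tf.float32, [None, 4])
w = tf.get_variable("w", [4, 2])
y1, y2 = tf.unstack(tf.matmul(x, w), axis=1)

loss1 = tf.reduce_mean(tf.square(y1 - 1.0))   # "important" task
loss2 = tf.reduce_mean(tf.square(y2 + 1.0))   # secondary task

# Separate optimizers let each task get its own learning rate
# (and its own Adam moment accumulators).
train_op1 = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss1)
train_op2 = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss2)
train_op = tf.group(train_op1, train_op2)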

The difference between the two methods is demonstrated clearly in this post on multi-task learning in tensorflow.
In short:
Method 1:
This is called joint training: since it directly adds the losses together, all the gradients and updates are done with respect to both losses at the same time. Generally this is used when training multiple outputs using the same set of input features.
Method 2:
This creates two separate optimizers and is called alternate training. This is used when you use a subset of input features for each of the outputs. Therefore, when feeding in the feature subset for train_op1, the sub-graph for train_op2 is untouched. Each optimizer can be called in an alternating order using different input features.
If you run both optimizers concurrently with the same input data, then the difference from method 1 is probably very minor.
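As an illustration of the alternating pattern (my own hedged sketch, not from the linked post): each optimizer is driven by its own feature subset, and the two train ops are run on alternating steps. All names and data below are toy placeholders.
import numpy as np
import tensorflow.compat.v1 as tf  # TF 1.x-style API
tf.disable_v2_behavior()

x_a = tf.placeholder(tf.float32, [None, 3])   # features used by loss1
x_b = tf.placeholder(tf.float32, [None, 2])   # features used by loss2
shared = tf.get_variable("shared", [1])       # a trivially shared parameter
loss1 = tf.reduce_mean(tf.square(tf.reduce_sum(x_a, axis=1) * shared - 1.0))
loss2 = tf.reduce_mean(tf.square(tf.reduce_sum(x_b, axis=1) * shared + 1.0))

train_op1 = tf.train.AdamOptimizer().minimize(loss1)
train_op2 = tf.train.AdamOptimizer().minimize(loss2)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        if step % 2 == 0:   # only the sub-graph of loss1 is touched
            sess.run(train_op1, {x_a: np.random.rand(8, 3)})
        else:               # only the sub-graph of loss2 is touched
            sess.run(train_op2, {x_b: np.random.rand(8, 2)})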

Method 1 is the correct one, because you define the gradient graph (for computing the backpropagation) only once. In this way, you use a single loss function with a single graph to do a single update of the same parameters (the update takes into account both terms of the loss).
The second method, instead, defines two different graphs for computing the gradients, and is wrong.
When you execute the training op, you're executing the training operations in parallel (because you used tf.group / tf.tuple / tf.control_dependencies).
The operations will compute two different losses and two different sets of updated variables.
When the moment of updating the variables comes, you have a problem:
which update operation executes first, the one defined by the first graph or the other one?
In any case, you're discarding one computation, because one will overwrite the other. There's no synchronization in the updates and there's no relation between the computed losses.

Both of the methods you suggested are correct. The difference is quite subtle. The main difference is that in the second solution AdamOptimizer keeps separate gradient accumulators for each loss. Which one works better needs an experiment.

I will showcase how to implement a regression model using Tensorflow's functional API.
In multi-task learning, we need a base network that is shared between tasks and a network head for each individual task:
from tensorflow.keras import layers, models, Model

def create_base_cnn(input_shape):
    model = models.Sequential()
    model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), padding="same", activation="relu", input_shape=input_shape))
    model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), padding="same", activation="relu"))
    # put more layers if you like
    model.add(layers.Flatten())  # flatten the feature maps so the next Dense layer yields a flat 128-dim feature vector
    model.add(layers.Dense(128, activation="relu"))
    return model

def create_head(input_shape, name):
    model = models.Sequential(name=name)
    model.add(layers.Dense(128, activation="relu", input_shape=input_shape))
    model.add(layers.Dense(64, activation="relu"))
    # put more layers if you like
    model.add(layers.Dense(1, activation="linear"))
    return model
We can now combine the base model with the heads.
# Create the model.
input_shape = (240, 180, 1)
base_model = create_base_cnn(input_shape)
head_model1 = create_head((128,), name="head1")
head_model2 = create_head((128,), name="head2")
model_input = layers.Input(shape=input_shape)
# Combine base with heads (using TF's functional API)
features = base_model(model_input)
model_output1 = head_model1(features)
model_output2 = head_model2(features)
model = Model(inputs=model_input, outputs=[model_output1, model_output2])
Finally to train the model we can refer to the different outputs by name (in my case: "head1" and "head2"). We can define a hyperparameter for the weight of each head in the loss function:
HEAD1_WEIGHT = 0.4
HEAD2_WEIGHT = 0.6
model.compile(
    optimizer="Adam",
    loss={"head1": "mse", "head2": "mse"},
    loss_weights={"head1": HEAD1_WEIGHT, "head2": HEAD2_WEIGHT},
    metrics={"head1": ["mae"], "head2": ["mae"]}
)
model.fit(dataset_training, validation_data=validation_data, epochs=epochs)
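As a usage note: if the training data were plain NumPy arrays rather than a tf.data dataset, the targets could be passed as a dict keyed by head name (x_train, y_head1, y_head2 and the epoch/batch values below are hypothetical placeholders):
model.fit(
    x_train,
    {"head1": y_head1, "head2": y_head2},   # one target array per named output
    validation_split=0.2,
    epochs=20,
    batch_size=32
)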

Related

Neural network with linear activation output. Calculate output range for each of the output neurons

Let's assume I have a neural network like the following:
model = keras.models.Sequential()
model.add(keras.layers.Dense(10, input_shape=(5,), activation='relu'))
model.add(keras.layers.Dense(4, activation='linear'))
With n output neurons with a linear activation function.
The training process is not important here, so we can take a look at the random weights that keras initialized using:
model.weights
Of course, in a real example, these weights should be adjusted in the training process.
Depending on these model.weights, each of the output neurons returns values in a range.
I would like to calculate this exact range.
Does keras offer any function to calculate it?
I built a flawed piece of code to make an approximation of it, using a loop and predicting random inputs. But this would not be really useful in a real example with many more inputs/neurons/weights.
Here are a few examples trying to clarify my question (all of them assume that the input values are between 0 and 1):
model = keras.models.Sequential()
model.add(keras.layers.Dense(1, input_shape=(2,),
activation='linear', use_bias=False))
model.set_weights([np.array([1, 1]).reshape(2, 1)])
For the previous example the output neuron results would be between 0 and 2
model.set_weights([np.array([-0.5, 1]).reshape(2, 1)])
For the previous example the output neuron results would be between -0.5 and 1
model = keras.models.Sequential()
model.add(keras.layers.Dense(2, input_shape=(2,), activation='linear', use_bias=False))
model.add(keras.layers.Dense(1, activation='linear', use_bias=False))
model.set_weights([np.array([1, 1, 1, 1]).reshape(2,2), np.array([1, 1]).reshape(2,1)])
For the previous example, the output neuron results would be between 0 and 4
These are simplified examples. In a real scenario with a much more complex network structure, activation functions, biases, etc., these ranges are not obvious to calculate.
It sounds like you are roughly interested in what is referred to as neural network verification. This field broadly consists of answering the question: given a range of possible inputs, what is the range of possible outputs from a neural network with a given set of weights? A few things to note:
A neural network is essentially a complex, non-linear function. That is, it maps the input space to the output space. Defining an output range does not make sense except with respect to an input range. In your question you make no reference to the inputs, so your examples are flawed/incomplete.
In general, neural network verification is an emerging field with most published works being fairly recent (last 5-7 years). That being said, there are exact and approximate methods for fully connected networks with a variety of activation functions. I'll list a few such methods here:
https://arxiv.org/abs/2004.05519 - MATLAB toolbox, but you could export your neural network in ONNX format and then use MATLAB for the verification/output range analysis.
https://arxiv.org/abs/1804.10829 - specifically for ReLU activation function.
https://anwu1219.github.io/download/Marabou.pdf with python API available here: https://github.com/NeuralNetworkVerification/Marabou
The field is still evolving, so you may have to do some of the coding yourself rather than using pre-existing libraries in some cases, but these papers / a search query for "neural network verification" should at least give you some ideas of where to start.
IMO, as far as I know, there is no such function to estimate the output value's range (without imposing your restrictions).
For example, a Dense layer without bias is just a plain linear function a = bx; in your case, you are restricting x to the 0-1 range and explicitly setting b to your desired values.
You will always get values in the ranges you've cited in your question because of those restrictions. If, hypothetically, you chose b randomly, the ranges in your question would no longer hold.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
model = keras.models.Sequential()
model.add(keras.layers.Dense(1, input_shape=(2,), activation='linear', use_bias=False))
import matplotlib.pyplot as plt
#model.set_weights([np.array([1, 1]).reshape(2, 1)])
eval_func = keras.backend.function([model.input], model.layers[-1].output)
outputs = eval_func(np.array([[2,1]]))
counts, bins = np.histogram(outputs)
plt.hist(bins[:-1], bins, weights=counts)
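For the purely linear, bias-free examples in the question (with inputs in [0, 1]), the exact range can actually be computed with simple interval arithmetic rather than by sampling: each output's minimum comes from its negative incoming weights and its maximum from its positive ones, propagated layer by layer. A hedged NumPy sketch (not a Keras API; with ReLU or other nonlinearities this only gives bounds, not exact ranges):
import numpy as np

def linear_output_range(weight_matrices, input_low=0.0, input_high=1.0):
    # Propagate an input interval through linear (no-bias, no-activation) layers.
    low = np.full(weight_matrices[0].shape[0], input_low)
    high = np.full(weight_matrices[0].shape[0], input_high)
    for W in weight_matrices:  # W has shape (in_dim, out_dim), as in model.get_weights()
        W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
        new_low = low @ W_pos + high @ W_neg
        new_high = high @ W_pos + low @ W_neg
        low, high = new_low, new_high
    return low, high

# Reproduces the question's examples:
print(linear_output_range([np.array([[1.], [1.]])]))            # (array([0.]), array([2.]))
print(linear_output_range([np.array([[-0.5], [1.]])]))          # (array([-0.5]), array([1.]))
print(linear_output_range([np.ones((2, 2)), np.ones((2, 1))]))  # (array([0.]), array([4.]))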

Access output of intermediate layers in TensorFlow 2.0 in eager mode

I have a CNN that I have built using TensorFlow 2.0. I need to access the outputs of the intermediate layers. I was going over other Stack Overflow questions that were similar, but they all had solutions involving a Keras sequential model.
I have tried using model.layers[index].output but I get
Layer conv2d has no inbound nodes.
I can post my code here (which is super long), but I am sure that even without it someone can point me to how this can be done using just TensorFlow 2.0 in eager mode.
I stumbled onto this question while looking for an answer and it took me some time to figure out as I use the model subclassing API in TF 2.0 by default (as in here https://www.tensorflow.org/tutorials/quickstart/advanced).
If somebody is in a similar situation, all you need to do is assign the intermediate output you want as an attribute of the class. Then keep test_step without the @tf.function decorator and create a decorated copy of it, say val_step, for efficient internal computation of validation performance during training. As a short example, I have modified a few functions of the tutorial from the link accordingly. I'm assuming we need to access the output after flattening.
def call(self, x):
    x = self.conv1(x)
    x = self.flatten(x)
    self.intermediate = x  # assign it as an object attribute for accessing later
    x = self.d1(x)
    return self.d2(x)

# Remove the @tf.function decorator from test_step for prediction
def test_step(images, labels):
    predictions = model(images, training=False)
    t_loss = loss_object(labels, predictions)
    test_loss(t_loss)
    test_accuracy(labels, predictions)
    return

# Create a decorated val_step for the object's internal use during training
@tf.function
def val_step(images, labels):
    return test_step(images, labels)
Now when you run model.predict() after training, using the un-decorated test_step, you can access the intermediate output using model.intermediate, which will be an EagerTensor whose value is obtained simply by model.intermediate.numpy(). However, if you don't remove the @tf.function decorator from test_step, this would return a Tensor whose value is not so straightforward to obtain.
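For completeness, here is a hedged, self-contained sketch of the same attribute trick with a small subclassed model (SmallNet and its layer sizes are made up, not from the tutorial):
import tensorflow as tf

class SmallNet(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.conv1 = tf.keras.layers.Conv2D(8, 3, activation="relu")
        self.flatten = tf.keras.layers.Flatten()
        self.d1 = tf.keras.layers.Dense(16, activation="relu")
        self.d2 = tf.keras.layers.Dense(10)

    def call(self, x):
        x = self.conv1(x)
        x = self.flatten(x)
        self.intermediate = x          # stash the flattened features for later access
        x = self.d1(x)
        return self.d2(x)

model = SmallNet()
_ = model(tf.random.normal([2, 28, 28, 1]))   # eager forward pass
print(model.intermediate.numpy().shape)        # (2, 5408) -- the flattened conv features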
Thanks for answering my earlier question. I wrote this simple example to illustrate how what you're trying to do might be done in TensorFlow 2.x, using the MNIST dataset as the example problem.
The gist of the approach:
Build an auxiliary model (aux_model in the example below), which is so-called "functional model" with multiple outputs. The first output is the output of the original model and will be used for loss calculation and backprop, while the remaining output(s) are the intermediate-layer outputs that you want to access.
Use tf.GradientTape() to write a custom training loop and expose the detailed gradient values on each individual variable of the model. Then you can pick out the gradients that are of interest to you. This requires that you know the ordering of the model's variables. But that should be relatively easy for a sequential model.
import tensorflow as tf
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
# This is the original model.
model = tf.keras.Sequential([
tf.keras.layers.Flatten(input_shape=[28, 28, 1]),
tf.keras.layers.Dense(100, activation="relu"),
tf.keras.layers.Dense(10, activation="softmax")])
# Make an auxiliary model that exposes the output from the intermediate layer
# of interest, which is the first Dense layer in this case.
aux_model = tf.keras.Model(inputs=model.inputs,
outputs=model.outputs + [model.layers[1].output])
# Define a custom training loop using `tf.GradientTape()`, to make it easier
# to access gradients on specific variables (the kernel and bias of the first
# Dense layer in this case).
cce = tf.keras.losses.CategoricalCrossentropy()
optimizer = tf.optimizers.Adam()
with tf.GradientTape() as tape:
    # Do a forward pass on the model, retrieving the intermediate layer's output.
    y_pred, intermediate_output = aux_model(x_train)
    print(intermediate_output)  # Now you can access the intermediate layer's output.
    # Compute loss, to enable backprop.
    loss = cce(tf.one_hot(y_train, 10), y_pred)

# Do backprop. `gradients` here are for all variables of the model.
# But we know we want the gradients on the kernel and bias of the first
# Dense layer, which happen to be the first two variables of the model.
gradients = tape.gradient(loss, aux_model.variables)

# This is the gradient on the first Dense layer's kernel.
intermediate_layer_kernel_gradients = gradients[0]
print(intermediate_layer_kernel_gradients)

# This is the gradient on the first Dense layer's bias.
intermediate_layer_bias_gradients = gradients[1]
print(intermediate_layer_bias_gradients)
# Update the variables of the model.
optimizer.apply_gradients(zip(gradients, aux_model.variables))
The most straightforward solution would go like this (get_layer returns a Layer object, so wrap it in a Model to be able to call predict on it):
mid_layer_model = tf.keras.Model(inputs=model.input, outputs=model.get_layer("layer_name").output)
You can now treat mid_layer_model as a model in its own right, and for instance:
mid_layer_model.predict(X)
Oh, also, to get the name of a hidden layer, you can use this:
model.summary()
this will give you some insights about the layer input/output as well.

Experiment shows that LSTM does worse than Random Forest... Why?

LSTM is supposed to be the right tool to capture path-dependency in time-series data.
I decided to run a simple experiment (simulation) to assess the extent to which LSTM is better able to understand path-dependency.
The setting is very simple. I just simulate a bunch (N = 100) of paths coming from 4 different data generating processes. Two of these processes represent a real increase and a real decrease, while the other two represent fake trends that eventually revert to zero.
The following plot shows the simulated paths for each category:
The candidate machine learning algorithm will be given the first 8 values of the path ( t in [1,8] ) and will be trained to predict the subsequent movement over the last 2 steps.
In other words:
the feature vector is X = (p1, p2, p3, p4, p5, p6, p7, p8)
the target is y = p10 - p8
I compared LSTM with a simple Random Forest model with 20 estimators. Here are the definitions and the training of the two models, using Keras and scikit-learn:
# LSTM
model = Sequential()
model.add(LSTM((1), batch_input_shape=(None, H, 1), return_sequences=True))
model.add(LSTM((1), return_sequences=False))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
history = model.fit(train_X_LS, train_y_LS, epochs=100, validation_data=(vali_X_LS, vali_y_LS), verbose=0)
# Random Forest
RF = RandomForestRegressor(random_state=0, n_estimators=20)
RF.fit(train_X_RF, train_y_RF);
The out-of-sample results are summarized by the following scatter plots:
As you can see, the Random Forest model is clearly outperforming the LSTM. The latter seems unable to distinguish between the real and the fake trends.
Do you have any idea to explain why this is happening?
How would you modify the LSTM model to make it better at this problem?
Some remarks:
The data points are divided by 100 to make sure gradients do not explode
I tried to increase the sample size, but I noticed no differences
I tried to increase the number of epochs over which the LSTM is trained, but I noticed no differences (the loss becomes stagnant after a bunch of epochs)
You can find the code I used to run the experiment here
Update:
Thanks to SaTa's reply, I changed the model and obtained much better results:
# Updated LSTM Model
model = Sequential()
model.add(LSTM((8), batch_input_shape=(None, H, 1), return_sequences=False))
model.add(Dense(4))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
Still, the Random Forest model does better. The point is that RF seems to understand that, conditional on the class, a higher p8 predicts a lower outcome p10 - p8 and vice versa, because of the way the noise is added. The LSTM seems to fail at that, so it predicts the class rather well, but we see that within-class downward-sloping pattern in the final scatter plot.
Any suggestion to improve on that?
I wouldn't expect LSTM to win all the battles against traditional methods, but I do expect it to perform well for the problem you have posed. Here are a couple of things you can try:
1) Increase the number of hidden units in the first layer.
model.add(LSTM((32), batch_input_shape=(None, H, 1), return_sequences=True))
2) The output activation of an LSTM layer is tanh by default, which limits the output to (-1, 1), as you can see in the right plot. I recommend either adding a Dense layer or using an LSTM with linear activation on the output. Like this:
model.add(LSTM((1), return_sequences=False, activation='linear'))
Or
model.add(LSTM((16), return_sequences=False))
model.add(Dense(1))
Try the above with 10K samples that you have.
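Putting both suggestions together, a hedged sketch of one possible model (H = 8 timesteps as in the question; the layer sizes are illustrative):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

H = 8
model = Sequential([
    LSTM(32, batch_input_shape=(None, H, 1), return_sequences=False),
    Dense(16, activation="relu"),
    Dense(1, activation="linear"),   # unconstrained regression output
])
model.compile(loss="mean_squared_error", optimizer="adam")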

Regressor Neural Network built with Keras only ever predicts one value

I'm trying to build a NN with Keras and Tensorflow to predict the final chart position of a song, given a set of 5 features.
After playing around with it for a few days I realised that although my MAE was getting lower, this was because the model had just learned to predict the mean value of my training set for all input, and this was the optimal solution. (This is illustrated in the scatter plot below)
This is a random sample of 50 data points from my testing set vs what the network thinks they should be
At first I realised this was probably because my network was too complicated. I had one input layer with shape (5,) and a single node in the output layer, but then 3 hidden layers with over 32 nodes each.
I then stripped back the excess layers and moved to just a single hidden layer with a couple nodes, as shown here:
self.model = keras.Sequential([
keras.layers.Dense(4,
activation='relu',
input_dim=num_features,
kernel_initializer='random_uniform',
bias_initializer='random_uniform'
),
keras.layers.Dense(1)
])
Training this with a gradient descent optimiser still results in exactly the same prediction being made the whole time.
Then it occurred to me that perhaps the actual problem I'm trying to solve isn't hard enough for the network, that maybe it's linearly separable. Since this would respond better to not having a hidden layer at all, essentially just doing regular linear regression, I tried that. I changed my model to:
inp = keras.Input(shape=(num_features,))
out = keras.layers.Dense(1, activation='relu')(inp)
self.model = keras.Model(inp,out)
This also changed nothing. My MAE and the predicted values are all still the same.
I've tried so many different things, different permutations of optimisation functions, learning rates, network configurations, and nothing can help. I'm pretty sure the data is good, but I've included a sample of it just in case.
chartposition,tagcount,dow,artistscore,timeinchart,finalpos
121,3925,5,35128,7,227
131,4453,3,85545,25,130
69,2583,4,17594,24,523
145,1165,3,292874,151,187
96,1679,5,102593,111,540
134,3494,5,1252058,37,370
6,34895,7,6824048,22,5
A sample of my dataset, finalpos is the value I'm trying to predict. Dataset contains ~40,000 records, split 80/20 - training/testing
def __init__(self, validation_split, num_features, should_log):
    self.should_log = should_log
    self.validation_split = validation_split
    inp = keras.Input(shape=(num_features,))
    out = keras.layers.Dense(1, activation='relu')(inp)
    self.model = keras.Model(inp, out)
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    self.model.compile(loss='mae',
                       optimizer=optimizer,
                       metrics=['mae'])

def train(self, data, labels, plot=False):
    early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
    history = self.model.fit(data,
                             labels,
                             epochs=self.epochs,
                             validation_split=self.validation_split,
                             verbose=0,
                             callbacks=[PrintDot(), early_stop])
    if plot: self.plot_history(history)
All code relevant to constructing and training the network
def normalise_dataset(df, mini, maxi):
    return (df - mini) / (maxi - mini)
Normalisation of the input data. Both my testing and training data are normalised to the max and min of the testing set
Graph of my loss vs validation curves with the one hidden layer network with an adamoptimiser, learning rate 0.01
Same graph but with linear regression and a gradient descent optimiser.
So I am pretty sure that your normalization is the issue: you are not normalizing by feature (as is the de facto industry standard), but across all data.
That means that if you have two features with very different orders of magnitude/ranges (in your case, compare timeinchart with artistscore), the smaller-range feature gets squashed into a tiny interval and effectively loses its information.
Instead, you might want to normalize using something like scikit-learn's StandardScaler. Not only does this normalize per column (so you can pass all features at once), but it also scales to unit variance (which makes some assumptions about your data, but can potentially help, too).
To transform your data, use something along these lines
from sklearn.preprocessing import StandardScaler
import numpy as np
raw_data = np.array([[1,40], [2, 80]])
scaler = StandardScaler()
processed_data = scaler.fit_transform(raw_data)
# fit() calculates mean etc, transform() puts it to the new range.
print(processed_data) # returns [[-1, -1], [1,1]]
Note that you have two possibilities for normalizing/standardizing your test data:
either scale it together with your training data and then split afterwards,
or fit the scaler on the training data only and then use the same scaler to transform your test data (see the sketch below).
Never fit_transform your test set separately from your training data!
Since you could have potentially different mean/min/max values, you can end up with totally wrong predictions! In a sense, the StandardScaler is your definition of your "data source distribution", which is inherently still the same for your test set, even though it might be a subset not exactly following the same properties (due to small sample size etc.).
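A hedged sketch of the second option with toy numbers: fit the scaler on the training data only, then reuse it for the test data.
from sklearn.preprocessing import StandardScaler
import numpy as np

x_train = np.array([[1., 40.], [2., 80.], [3., 120.]])   # toy training features
x_test = np.array([[1.5, 60.]])                          # toy test features

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)   # learn mean/std from training data only
x_test_scaled = scaler.transform(x_test)         # apply the same transformation to the test set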
Additionally, you might want to use a more advanced optimizer, like Adam, or specify some momentum property (0.9 is a good choice in practice, as a rule of thumb) for your SGD.
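For illustration only, swapping the optimizer in your compile call might look like this (modern tf.keras syntax; the learning-rate values are just rule-of-thumb defaults, and self.model refers to the model from the question):
from tensorflow import keras

optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
# or: optimizer = keras.optimizers.Adam(learning_rate=0.001)
self.model.compile(loss='mae', optimizer=optimizer, metrics=['mae'])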
It turns out the error was a really stupid and easy-to-miss bug.
When I was importing my dataset, I shuffled it; however, I was accidentally applying the shuffle only to the labels, not to the dataset as a whole.
As a result, each label was being assigned to a completely random feature set, and of course the model didn't know what to do with this.
Thanks to @dennlinger for suggesting that I look in the place where I eventually found this bug.
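For anyone hitting the same bug, a hedged sketch of shuffling features and labels with one shared permutation so the pairs stay aligned (toy arrays, not the poster's data):
import numpy as np

features = np.arange(10).reshape(5, 2)   # toy feature matrix
labels = np.arange(5)                    # toy labels

perm = np.random.permutation(len(features))   # one shared permutation
features_shuffled = features[perm]
labels_shuffled = labels[perm]                # rows and labels stay paired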

Q-values exploding when training DQN

I'm training a DQN to play OpenAI's Atari environment, but the Q-values of my network quickly explode far above what is realistic.
Here's the relevant portion of the code:
for state, action, reward, next_state, done in minibatch:
    if not done:
        # To save on memory, next_state is just one frame
        # So we have to add it to the current state to get the actual input for the network
        next_4_states = np.array(state)
        next_4_states = np.roll(next_4_states, 1, axis=3)
        next_4_states[:, :, :, 0] = next_state
        target = reward + self.gamma * \
            np.amax(self.target_model.predict(next_4_states))
    else:
        target = reward
    target_f = self.target_model.predict(state)
    target_f[0][action] = target
    self.target_model.fit(state, target_f, epochs=1, verbose=0)
The discount factor is 0.99 (it doesn't happen with discount factor 0.9, but also doesn't converge because it can't think far enough ahead).
Stepping through the code, the reason it's happening is that all the Q values that aren't meant to be updated (the ones for actions we didn't take) increase slightly. It's my understanding that passing the network's own output to the network during training should keep the output the same, not increase or decrease it. Is there something wrong with my model? Is there some way I can mask the update so it only updates the relevant Q value?
EDIT: My model creation code is here:
def create_model(self, input_shape, num_actions, learning_rate):
    model = Sequential()
    model.add(Convolution2D(32, 8, strides=(4, 4),
                            activation='relu', input_shape=(input_shape)))
    model.add(Convolution2D(64, 4, strides=(2, 2), activation='relu'))
    model.add(Convolution2D(64, 3, strides=(1, 1), activation='relu'))
    model.add(Flatten())
    model.add(Dense(512, activation='relu'))
    model.add(Dense(num_actions))
    model.compile(loss='mse', optimizer=Adam(lr=learning_rate))
    return model
I create two of these. One for the online network and one for the target.
Which predictions get updated?
Stepping through the code, the reason it's happening is that all the Q values that aren't meant to be updated (the ones for actions we didn't take) increase slightly. It's my understanding that passing the network's own output to the network during training should keep the output the same, not increase or decrease it.
Below I have drawn a very simple neural network with 3 input nodes, 3 hidden nodes, and 3 output nodes. Suppose that you have only set a new target for the first action, and simply use the existing predictions as targets again for the other actions. This results in only a non-zero (for simplicity I'll just assume greater than zero) error (denoted by delta in the image) for the first action/output, and errors of 0 for the others.
I have drawn the connections through which this error will be propagated from output to hidden layer in bold. Note how each of the nodes in the hidden layer still gets an error. When these nodes then propagate their errors back to the input layer, they'll do this through all of the connections between input and hidden layer, so all of those weights can be modified.
So, imagine all those weights got updated, and now imagine doing a new forward pass with the original inputs. Do you expect output nodes 2 and 3 to have exactly the same outputs as before? No, probably not; the connections from hidden nodes to the last two outputs may still have the same weights, but all three hidden nodes will have different activation levels. So no, the other outputs are not guaranteed to remain the same.
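A tiny numeric check of this point (my own hedged sketch, not from the original answer): fit a small dense net on a target that changes only the first output and observe that the other outputs typically move too, because the shared hidden-layer weights change.
import numpy as np
import tensorflow as tf

tf.random.set_seed(0)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(3, activation="tanh", input_shape=(3,)),
    tf.keras.layers.Dense(3),
])
model.compile(loss="mse", optimizer=tf.keras.optimizers.SGD(0.1))

x = np.array([[1.0, 2.0, 3.0]])
before = model.predict(x)
target = before.copy()
target[0, 0] += 1.0                       # new target only for the first output
model.fit(x, target, epochs=5, verbose=0)
after = model.predict(x)
print(before - after)                     # outputs 2 and 3 generally change as well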
Is there some way I can mask the update so it only updates the relevant Q value?
Not easily, no, if at all. The problem is that the connections between pairs of layers, other than those between the final pair, are not action-specific, and I don't think you want them to be either.
Target Network
Is there something wrong with my model?
One thing I'm seeing is that you seem to be updating the same network that is used to generate targets:
target_f = self.target_model.predict(state)
and
self.target_model.fit(state, target_f, epochs=1, verbose=0)
both use self.target_model. You should use separate copies of the network for those two lines, and only after longer periods of time copy the updated network's weights into the network used to compute targets. For a bit more on this, see Addition 3 in this post.
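A hedged sketch of what that separation could look like in the question's code (names like online_model and the sync interval C are illustrative, not prescriptive):
# Two copies of the same architecture; create_model is the question's own builder.
online_model = self.create_model(input_shape, num_actions, learning_rate)
target_model = self.create_model(input_shape, num_actions, learning_rate)
target_model.set_weights(online_model.get_weights())   # start in sync

# ... inside the minibatch loop: targets come from target_model,
# but training is done on online_model only.
target = reward + self.gamma * np.amax(target_model.predict(next_4_states))
target_f = online_model.predict(state)
target_f[0][action] = target
online_model.fit(state, target_f, epochs=1, verbose=0)

# Every C training steps (C is a tuning choice, e.g. a few thousand), re-sync:
if step % C == 0:
    target_model.set_weights(online_model.get_weights())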
Double DQN
Apart from that, it is well known that DQN can still have a tendency to overestimate Q values (though it generally shouldn't completely explode). This can be addressed by using Double DQN (note: this is an improvement that was added later on top of DQN).
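For reference, a hedged sketch of the Double DQN target on top of the two-network setup above: the online network selects the next action, the target network evaluates it (variable names follow the question's code and are otherwise illustrative):
# Double DQN target (sketch, for a single transition):
next_q_online = online_model.predict(next_4_states)
best_action = np.argmax(next_q_online[0])            # online network picks the action...
next_q_target = target_model.predict(next_4_states)
target = reward + self.gamma * next_q_target[0][best_action]   # ...target network evaluates it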
I encountered a similar problem with Double DQN. Lowering the learning rate resolved it.
