I am attempting to train an LSTM that reads a variable-length input sequence and has a custom loss function applied to it. In order to be able to train on batches, I pad all my inputs to the maximum length.
My input data is a float tensor of shape (7789, 491, 11) where the form is (num_samples, max_sequence_length, dimension).
Any sample shorter than the maximum length is padded with -float('inf'), so a sequence with 10 values would start with 481 rows of 11 '-inf' values, followed by the real values at the end.
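For illustration, roughly how my padding works (a simplified sketch, not my actual preprocessing code):
import numpy as np

max_len, dim = 491, 11
seq = np.random.rand(10, dim).astype('float32')                 # a real sequence of length 10
pad = np.full((max_len - len(seq), dim), -np.inf, dtype='float32')
padded = np.concatenate([pad, seq], axis=0)                     # shape (491, 11)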
The way I am attempting to evaluate this model doesn't fit into any standard loss functions, so I had to make my own. I've tested it and it performs as expected on sample tensors. I don't believe this is the source of the issue so I won't go into details, but I could be wrong.
The problem I'm having comes from the model itself. Here is how I define and train it:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Masking(mask_value=-float('inf'),
                                  input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(tf.keras.layers.LSTM(32))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(30,
                                kernel_initializer=tf.keras.initializers.zeros()))
model.add(tf.keras.layers.Reshape((3, 10)))
model.compile(loss=batched_custom_loss, optimizer='rmsprop', run_eagerly=True)
model.fit(x=train_X, y=train_y, validation_data=val, epochs=5, batch_size=32)
No errors are thrown when I try to fit the model, but it only works on the first batch of training. As soon as the second batch starts, the loss becomes 'nan'. Upon closer inspection, it seems like the LSTM layer is outputting 'nan' after the first epoch of training.
My two guesses for what is going on are:
I set up the masking layer wrong, and for some reason it fails to mask out all of the -inf values after the first training iteration. Thus, -inf gets passed through the LSTM and it goes haywire.
I did something wrong with the format of my loss function, and when the optimizer applies my calculated loss to the model it ruins the weights of the LSTM. For reference, my loss function outputs a 1D tensor with length equal to the number of samples in the batch. Each item in the output is a float with the loss of that sample.
I know that the math in my loss function is good since I've tested it on sample data, but maybe the output format is wrong even though it seems to match what I've found online.
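For illustration, a dummy loss with the same output format I describe (this is not my actual loss function, just the shape):
import tensorflow as tf

def batched_custom_loss(y_true, y_pred):
    # y_true, y_pred: (batch_size, 3, 10) -> returns one loss value per sample, shape (batch_size,)
    return tf.reduce_mean(tf.square(y_true - y_pred), axis=[1, 2])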
Let me know if the problem is obvious from what I've shown or if you need more information.
I can't find the info in the documentation so I am asking here.
I have a multioutput model with 3 different outputs:
model = tf.keras.Model(inputs=[input], outputs=[output1, output2, output3])
The predicted labels for validation are constructed from these 3 outputs to form a single one; this is a post-processing step. The dataset used for training consists of those 3 intermediate outputs, while for validation I evaluate against a dataset of final labels instead of the 3 kinds of intermediate data.
I would like to evaluate my model using a custom metric that handles the post-processing and the comparison with the ground truth.
My question is, in the code of the custom metric, will y_pred be a list of the 3 outputs of the model?
class MyCustomMetric(tf.keras.metrics.Metric):
    def __init__(self, name='my_custom_metric', **kwargs):
        super(MyCustomMetric, self).__init__(name=name, **kwargs)

    def update_state(self, y_true, y_pred, sample_weight=None):
        # ? is y_pred a list [batch_output_1, batch_output_2, batch_output_3] ?
        pass

    def result(self):
        pass
# one single metric handling the 3 outputs?
model.compile(optimizer=tf.compat.v1.train.RMSPropOptimizer(0.01),
              loss=tf.keras.losses.categorical_crossentropy,
              metrics=[MyCustomMetric()])
With your given model definition, this is a standard multi-output Model.
model = tf.keras.Model(inputs=[input], outputs=[output_1, output_2, output_3])
In general, all (custom) Metrics as well as (custom) Losses will be called on every output separately (as y_pred)! Within the loss/metric function you will only see one output together with the one corresponding target tensor.
By passing a list of loss functions (length == number of outputs of your model) you can specify which loss will be used for which output:
model.compile(optimizer=Adam(), loss=[loss_for_output_1, loss_for_output_2, loss_for_output_3], loss_weights=[1, 4, 8])
The total loss (which is the objective function to minimize) will be the additive combination of all losses multiplied with the given loss weights.
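With the weights from the example above, the objective minimized during training would be:
total_loss = 1 * loss_for_output_1 + 4 * loss_for_output_2 + 8 * loss_for_output_3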
It is almost the same for the metrics! Here you can pass (as for the loss) a list (length == number of outputs) of metrics and tell Keras which metric to use for which of your model outputs.
model.compile(optimizer=Adam(), loss='mse', metrics=[metrics_for_output_1, metrics_for_output_2, metrics_for_output_3])
Here metrics_for_output_X can be either a function or a list of functions, which will all be called with the one corresponding output_X as y_pred.
This is explained in detail in the documentation of Multi-Output Models in Keras. They also show examples for using dictionaries (to map loss/metric functions to a specific output) instead of lists.
https://keras.io/getting-started/functional-api-guide/#multi-input-and-multi-output-models
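For example, the dictionary form could look roughly like this (the keys are placeholders for the actual names of your output layers):
model.compile(optimizer=Adam(),
              loss={'output_1': loss_for_output_1,
                    'output_2': loss_for_output_2,
                    'output_3': loss_for_output_3},
              loss_weights={'output_1': 1, 'output_2': 4, 'output_3': 8},
              metrics={'output_1': metrics_for_output_1,
                       'output_2': metrics_for_output_2,
                       'output_3': metrics_for_output_3})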
Further information:
If I understand you correctly, you want to train your model using a loss function that compares the three model outputs with three ground truth values, and you want to do some sort of performance evaluation by comparing a value derived from the three model outputs with a single ground truth value.
Usually the model gets trained on the same objective it is evaluated on, otherwise you might get poorer results when evaluating your model!
Anyways... for evaluating your model on a single label I suggest you either:
1. (The clean solution)
Rewrite your model and incorporate the post-processing steps: add all the necessary operations (as layers) and map them to an auxiliary output. For training your model you can set the loss_weight of the auxiliary output to zero.
Merge your datasets so you can feed your model the model input, the intermediate target outputs as well as the labels.
As explained above you can then define a metric comparing the auxiliary model output with the given target labels.
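A rough sketch of this first option (post_process_fn, the individual losses and the dummy loss for the auxiliary output are placeholders; it assumes your post-processing can be written with TensorFlow ops):
# post-processing expressed as a layer, mapped to an auxiliary output
aux = tf.keras.layers.Lambda(post_process_fn, name='aux_output')([output_1, output_2, output_3])
model = tf.keras.Model(inputs=[input], outputs=[output_1, output_2, output_3, aux])

model.compile(optimizer='rmsprop',
              loss=[loss_1, loss_2, loss_3, dummy_loss],   # dummy_loss is ignored because of the zero weight
              loss_weights=[1, 1, 1, 0],                    # auxiliary output does not influence training
              metrics={'aux_output': [MyCustomMetric()]})   # metric compares aux output with your labels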
2.
Or you train your model as-is and derive the metric, e.g. in a custom Callback, by running your post-processing steps on the three outputs of model.predict(input).
This will make it necessary to write custom summaries if you want to track those values in your TensorBoard, which is why I would not recommend this solution.
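A sketch of that callback idea (post_process and my_comparison are placeholders for your own functions):
class PostProcessedMetric(tf.keras.callbacks.Callback):
    def __init__(self, inputs, labels):
        super().__init__()
        self.inputs, self.labels = inputs, labels

    def on_epoch_end(self, epoch, logs=None):
        out_1, out_2, out_3 = self.model.predict(self.inputs)   # three outputs of the model
        derived = post_process(out_1, out_2, out_3)             # your post-processing step
        score = my_comparison(derived, self.labels)             # compare with the ground truth labels
        print(f'epoch {epoch}: post-processed metric = {score:.4f}')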
I've recently been trying to implement a multi-class classification LSTM architecture, based on this example: biLSTM example
After I changed
self.label = tf.placeholder(tf.int32, [None])
to
self.label = tf.placeholder(tf.int32, [None, self.n_class])
The model seems to train normally, yet I am having trouble with this step:
self.loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(logits=y_hat, labels=self.label))
# prediction
self.prediction = tf.argmax(tf.nn.softmax(y_hat), 1)
As, even though the model learns normally, the predictions do not seem to work for multiple variables. I was wondering how one should code the self.prediction object, so that it emits a vector of predictions for individual instances?
Thank you very much.
I was wondering how should one code the self.prediction object, so that it emits a vector of predictions for individual instances?
In general tf.nn.softmax returns a vector of probabilities. You just can't see them because you are using tf.argmax, which returns the index of the largest value. Therefore you will just get one number. Just remove tf.argmax and you should be fine.
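For illustration (using the y_hat from your snippet):
probs = tf.nn.softmax(y_hat)            # shape (batch_size, n_class): per-class probabilities
self.prediction = probs                  # keep the full probability vector per instance
# self.prediction = tf.argmax(probs, 1)  # this instead collapses it to a single class index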
I'm trying to build a NN with Keras and Tensorflow to predict the final chart position of a song, given a set of 5 features.
After playing around with it for a few days I realised that although my MAE was getting lower, this was because the model had just learned to predict the mean value of my training set for all inputs, and this was the optimal solution. (This is illustrated in the scatter plot below.)
This is a random sample of 50 data points from my testing set vs what the network thinks they should be
At first I thought this was probably because my network was too complicated. I had one input layer with shape (5,) and a single node in the output layer, but then 3 hidden layers with over 32 nodes each.
I then stripped back the excess layers and moved to just a single hidden layer with a couple nodes, as shown here:
self.model = keras.Sequential([
    keras.layers.Dense(4,
                       activation='relu',
                       input_dim=num_features,
                       kernel_initializer='random_uniform',
                       bias_initializer='random_uniform'),
    keras.layers.Dense(1)
])
Training this with a gradient descent optimiser still results in exactly the same prediction being made the whole time.
Then it occurred to me that perhaps the actual problem I'm trying to solve isn't hard enough for the network, that maybe it's linearly separable. Since this would respond better to not having a hidden layer at all, essentially just doing regular linear regression, I tried that. I changed my model to:
inp = keras.Input(shape=(num_features,))
out = keras.layers.Dense(1, activation='relu')(inp)
self.model = keras.Model(inp,out)
This also changed nothing. My MAE and the predicted values are still all the same.
I've tried so many different things, different permutations of optimisation functions, learning rates, network configurations, and nothing can help. I'm pretty sure the data is good, but I've included a sample of it just in case.
chartposition,tagcount,dow,artistscore,timeinchart,finalpos
121,3925,5,35128,7,227
131,4453,3,85545,25,130
69,2583,4,17594,24,523
145,1165,3,292874,151,187
96,1679,5,102593,111,540
134,3494,5,1252058,37,370
6,34895,7,6824048,22,5
A sample of my dataset; finalpos is the value I'm trying to predict. The dataset contains ~40,000 records, split 80/20 into training/testing.
def __init__(self, validation_split, num_features, should_log):
    self.should_log = should_log
    self.validation_split = validation_split
    inp = keras.Input(shape=(num_features,))
    out = keras.layers.Dense(1, activation='relu')(inp)
    self.model = keras.Model(inp, out)
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    self.model.compile(loss='mae',
                       optimizer=optimizer,
                       metrics=['mae'])

def train(self, data, labels, plot=False):
    early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
    history = self.model.fit(data,
                             labels,
                             epochs=self.epochs,
                             validation_split=self.validation_split,
                             verbose=0,
                             callbacks=[PrintDot(), early_stop])
    if plot: self.plot_history(history)
All the code relevant to constructing and training the network.
def normalise_dataset(df, mini, maxi):
    return (df - mini) / (maxi - mini)
Normalisation of the input data. Both my testing and training data are normalised to the max and min of the testing set
Graph of my training loss vs validation loss curves for the one-hidden-layer network with an Adam optimiser, learning rate 0.01.
Same graph but with linear regression and a gradient descent optimiser.
So I am pretty sure that your normalization is the issue: you are not normalizing by feature (as is the de-facto industry standard), but across all the data.
That means that if you have two different features with very different orders of magnitude/ranges (in your case, compare timeinchart with artistscore), the global min/max are dominated by the larger feature, and the smaller one gets squashed towards a near-constant value.
Instead, you might want to normalize using something like scikit-learn's StandardScaler. Not only does this normalize per column (so you can pass all features at once), but it also does unit variance (which is some assumption about your data, but can potentially help, too).
To transform your data, use something along these lines
from sklearn.preprocessing import StandardScaler
import numpy as np
raw_data = np.array([[1,40], [2, 80]])
scaler = StandardScaler()
processed_data = scaler.fit_transform(raw_data)
# fit() calculates mean etc, transform() puts it to the new range.
print(processed_data) # returns [[-1, -1], [1,1]]
Note that you have two possibilities to normalize/standardize your data:
Either scale training and test data together and then split afterwards,
or fit the scaler on the training data only, and then use the same scaler to transform your test data.
Never fit_transform your test set separately from your training data!
Since you would have potentially different mean/min/max values, you can end up with totally wrong predictions! In a sense, the StandardScaler is your definition of your "data source distribution", which is inherently still the same for your test set, even though it might be a subset not exactly following the same properties (due to small sample size, etc.).
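A sketch of the second option, continuing from the import above (X_train/X_test are placeholders for your split data):
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std from the training data only
X_test_scaled = scaler.transform(X_test)         # reuse the same mean/std for the test data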
Additionally, you might want to use a more advanced optimizer, like Adam, or specify some momentum property (0.9 is a good choice in practice, as a rule of thumb) for your SGD.
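For example (a sketch; the exact argument names vary slightly between TF/Keras versions):
optimizer = tf.keras.optimizers.SGD(0.01, momentum=0.9)   # plain SGD with momentum
# or
optimizer = tf.keras.optimizers.Adam()                     # adaptive learning rates per parameter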
Turns out the error was a really stupid and easy to miss bug.
When I was importing my dataset, I shuffled it; however, when I performed the shuffling, I was accidentally applying it only to the labels, not to the dataset as a whole.
As a result, each label was being assigned to a completely random feature set, so of course the model didn't know what to do with this.
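For reference, a sketch of what the fix looks like (variable names are mine, not the actual code): shuffle with a single shared permutation so features and labels stay aligned.
import numpy as np

perm = np.random.permutation(len(features))        # one shared shuffle order
features, labels = features[perm], labels[perm]     # applied to both arrays together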
Thanks to @dennlinger for suggesting that I look in the place where I eventually found this bug.
I cannot find anywhere how exactly backpropagation is done in Keras. Let me explain:
Let's say I have this network:
from keras.layers import Input, Conv2D, Flatten, Dense
from keras.models import Model
import keras

input = Input(shape=(X, X, Y))
x = Conv2D(32, (3, 3), padding="same")(input)
x = Conv2D(64, (3, 3), padding="same")(x)
x = Conv2D(128, (3, 3), padding="same")(x)
x = Conv2D(64, (3, 3), padding="same")(x)
x = Flatten()(x)                  # flatten, then fully-connected layers
Output = Dense(1024)(x)
Output = Dense(6)(Output)
model = Model(input, Output)
model.compile(loss="mean_squared_error", optimizer=keras.optimizers.Adam(), metrics=['accuracy'])
model.fit(trainingData, trainingLabels)
The output of the last layer is compared to trainingLabels, the mean squared error is computed, and backpropagation happens based on the value of the mean squared error.
However, what if I wanted to do something more? For example, I want to try every permutation of the output vector, and the one that results in the minimal mean squared error should be treated as the output, so that backpropagation happens based on the permutation with the least error.
Is something like this possible in Keras? If so, how can I achieve it?
The loss argument of the model.compile method accepts a Python function. You could compute the minimum over the set of permutations in a custom loss function:
def custom_loss(y_true, y_predicted):
    ''' code here '''
and then pass it to model.compile:
model.compile(loss=custom_loss,
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])
See here and here for reference.
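A rough sketch of such a loss for a 6-dimensional output (the dimension and the use of squared error are assumptions on my part; note the number of permutations grows factorially, so this only works for small output vectors):
import itertools
import tensorflow as tf

def permutation_min_mse(y_true, y_pred):
    dim = 6   # assumed output size; adjust to your model
    per_perm_losses = []
    for perm in itertools.permutations(range(dim)):
        permuted = tf.gather(y_pred, list(perm), axis=-1)
        # per-sample squared error for this particular ordering
        per_perm_losses.append(tf.reduce_mean(tf.square(y_true - permuted), axis=-1))
    # keep the smallest error over all permutations, per sample
    return tf.reduce_min(tf.stack(per_perm_losses, axis=0), axis=0)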