Keras custom loss function doesn't update my layer - python

I'm trying to get an embedding of some data using Keras. My model consists of only an embedding layer, and my custom loss function uses the weights of that layer to calculate the loss. However, the network won't train at all: the loss value is always the same.
My loss function looks like this
def custom_loss(a,b):
    layer = model.layers[0].get_weights()[0]
    distances = [distance(layer[i],layer[j]) for i in range(len(layer)-1) for j in range(i+1, len(layer))]
    loss = [-math.log10(sims[i]*distances[i]) for i in range(len(sims))]
    return K.sum(loss)+0*K.sum(b)
model.compile(optimizer="rmsprop",loss=custom_loss)
I get the distances between the different n-dimensional embedding values for all different points I pass to the embedding layer. I have the +0*K.sum(b) because otherwise, I get a no gradient exception. sims is a list of values that I multiply the distances with. loss ends up being a list with floats.
I don't get any exceptions, however, the loss is just constant. What might be wrong?
I must say I'm not really familiar with creating custom loss functions like this in Keras, so what I've done might be totally wrong.
My training data consists of simply passing ones as input values, as I'm, again, only interested in getting the embedding wrt my custom loss function. I know I'm basically abusing Keras in order to not write my own optimization function.
Does my K.sum(loss) simply give no gradient? My intuition tells me that layer = model.layers[0].get_weights()[0] is simply not updated between batches, but I don't know if this is correct, or if it is, how to fix it. Is there some other way to access the weights that would work?
Network summary looks like this
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 400, 3) 1200
=================================================================
Total params: 1,200
Trainable params: 1,200
Non-trainable params: 0
The distance function used looks like this
def distance(a,b):
    powers = [(a[i]-b[i])**2 for i in range(len(a))]
    return math.sqrt(sum(powers))

Related

What is the meaning of "trainable_weights" in Keras?

If I freeze my base_model with trainable=false, I get strange numbers with trainable_weights.
Before freezing my model has 162 trainable_weights. After freezing, the model only has 2. I tied 2 layers to the pre-trained network. Does trainable_weights show me the layers to train? I find the number weird, when I see 2,253,335 Trainable params.
Trainable weights are the weights that will be learnt during the training process. If you set trainable=False, those weights are kept as they are and are not changed, because they are not learnt. You might see some "strange numbers" because either you are using a pre-trained network whose weights have already been learnt, or you are using random initialization when defining the model. When using transfer learning with pre-trained models, a common practice is to freeze the weights of the base (pre-trained) model and only train the extra layers that you add at the end.
Trainable weights are the same as trainable parameters.
A trainable layer often has multiple trainable weights.
Let's view this example:
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, None, 501) 0
_________________________________________________________________
lstm_1 (LSTM) (None, None, 40) 86720
_________________________________________________________________
SoftDense (TimeDistributed) (None, None, 501) 20541
=================================================================
Total params: 107,261
Trainable params: 107,261
Non-trainable params: 0
__________________________
The first layer is just an input layer; it receives the data as-is, so it does not have any trainable weights.
The next layer has 542*4 *40=86720 trainable weights.
40 due to its output dim, 4 because as an LSTM it actually has 4 trainable layers inside it, and 542 for 501+40+1... due to reasons that are probably beyond the scope of this answer.
The last layer has 41*501=20541 trainable weights
(40 from the hidden dimension of its input, the LSTM layer, +1 for bias, times 501 for its output).
Total trainable parameters are 107,261.
If I were to freeze the last layer I would have only 86,720 trainable weights.
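To make the arithmetic above explicit, here is a small sketch in plain Python. The helper names lstm_params and dense_params are made up for this illustration and are not Keras functions. It assumes the standard LSTM layout: 4 gates, each with one weight per input feature, one per recurrent unit, and one bias, which is where 501+40+1 comes from.

```python
def lstm_params(input_dim, units):
    # 4 gates; each gate has weights for the input (input_dim),
    # for the recurrent state (units), and one bias per unit.
    return 4 * units * (input_dim + units + 1)


def dense_params(input_dim, units):
    # One weight per input plus one bias, for every output unit.
    return (input_dim + 1) * units


lstm = lstm_params(input_dim=501, units=40)    # 86720
dense = dense_params(input_dim=40, units=501)  # 20541
total = lstm + dense                           # 107261
print(lstm, dense, total)
```

The three numbers match the Param # column and the total in the summary above.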
Late to the party, but maybe this answer can be useful to others that might be googling this.
First, it is useful to distinguish the quantity "Trainable params" one sees at the end of my_model.summary() from the output of len(my_model.trainable_weights).
Maybe an example helps: let's say I have a model with VGG16 architecture.
my_model = keras.applications.vgg16.VGG16(
    weights="imagenet",
    include_top=False
)
# take a look at model summary
my_model.summary()
You will see that there are 13 conv. layers that have trainable parameters. Pooling and input layers do not have trainable parameters, i.e. no learning is needed for them. On the other hand, each of those 13 layers has "weights" and "biases" that need to be learned; think of them as variables.
What len(my_model.trainable_weights) will give you is the number of trainable layers (if you will) multiplied by 2 (weights + bias).
In this case, if you print len(my_model.trainable_weights), you will get 26 as the answer. Maybe we can think of 26 as the number of variables for the optimization, variables that can of course differ in shape.
Now to connect trainable_weights to the total number of trainable parameters, one can try:
trainable_params = 0
for layer in my_model.trainable_weights:
    trainable_params += layer.numpy().size
print(F"#{trainable_params = }")
You will get this number: 14714688. Which must be the "Trainable params" number you see at the end of my_model.summary().
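If you don't have the model at hand, both numbers can be reproduced from the variable shapes alone. The sketch below is plain Python and assumes the usual VGG16 layout of 13 conv layers with 3x3 kernels; the list of channel pairs is written out by hand for illustration and is not read from Keras.

```python
import math

# (in_channels, out_channels) of the 13 conv layers, all with 3x3 kernels.
VGG16_CONVS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]

# Each conv layer contributes two variables: a kernel and a bias.
variable_shapes = []
for c_in, c_out in VGG16_CONVS:
    variable_shapes.append((3, 3, c_in, c_out))  # kernel
    variable_shapes.append((c_out,))             # bias

num_variables = len(variable_shapes)                          # 26
num_params = sum(math.prod(s) for s in variable_shapes)       # 14714688
print(num_variables, num_params)
```

So len(trainable_weights) counts variables (tensors), while "Trainable params" counts the scalar entries inside them.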

How to extract Keras layer weights as trainable parameter?

I'm training a GAN-like model, but not exactly the same. I'm using Keras with the TensorFlow backend.
I have two Keras models G and D. I want to output the weights of a target layer in G as the input of model D, and use the result of D.predict(G.weights) as part of the loss function for G, i.e. D is not trainable, but the argument G.weights is trainable. In this way I want to further train G.weights.
I tried to use
def custom_loss(ytrue, ypred):
    ### Something to do with ytrue and ypred
    weight = self.G.get_layer('target').get_weights()
    loss += self.D.predict(weight)
    return loss
but apparently it does not work since weight is just a numpy array and is not trainable.
Is there a way to get the weights of a model so that they are still trainable in Keras? I'm new to Keras and know very little about TensorFlow. I would really appreciate it if someone could help!
As you mention, layer.get_weights() will return the current values of the weights as NumPy arrays. What you want to feed for prediction is the node in the computation graph representing those weights. You can use layer.trainable_weights instead, which will return two tf.Variable objects which you can feed to another layer/model.
Note that there is one variable for the unit to unit connections and another one for the bias. If you want to get a flattened tensor from it you could do something like:
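from keras import backend as K
from keras.layers import Flatten
...
ww, bias = self.G.get_layer('target').trainable_weights
# The backend function is K.concatenate; the (5, 1) reshape assumes ww has 5 rows and bias has 5 entries.
flattened_weights = Flatten()(K.concatenate([ww, K.reshape(bias, (5, 1))], axis=1))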

LSTM having a systematic offset between predictions and ground truth

I think I'm currently experiencing a systematic offset in an LSTM model between the predictions and the ground truth values. What's the best approach to continue from here?
The model architecture, along with the predictions & ground truth values are shown below. This is a regression problem where the historical data of the target plus 5 other correlated features X are used to predict the target y. Currently the input sequence n_input is of length 256, where the output sequence n_out is one. Simplified, the previous 256 points are used to predict the next target value.
X is normalized. The mean squared error is used as the loss function. Adam with a cosine annealing learning rate is used as the optimizer (min_lr=1e-7, max_lr=6e-2).
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
cu_dnnlstm_8 (CuDNNLSTM) (None, 256) 270336
_________________________________________________________________
batch_normalization_11 (Batc (None, 256) 1024
_________________________________________________________________
leaky_re_lu_11 (LeakyReLU) (None, 256) 0
_________________________________________________________________
dropout_11 (Dropout) (None, 256) 0
_________________________________________________________________
dense_11 (Dense) (None, 1) 257
=================================================================
Total params: 271,617
Trainable params: 271,105
Non-trainable params: 512
_________________________________________________________________
Increasing the node size in the LSTM layer, adding more LSTM layers (with return_sequences=True) or adding dense layers after the LSTM layer(s) only seems to lower the accuracy. Any advice would be appreciated.
Additional information on the image. The y-axis is a value, x-axis is the time (in days). NaNs have been replaced with zero, because the ground truth value in this case can never reach zero. That's why the odd outliers are in the data.
Edit:
I made some changes to the model, which increased accuracy. The architecture is the same, but the features used have changed. Currently only the historical data of the target sequence itself is used as a feature. Along with this, n_input was changed to 128. I also replaced Adam with SGD and the mean squared error with the mean absolute error, and finally the NaNs are now interpolated instead of being replaced with 0.
One step ahead predictions on the validation set look fine:
However, the offset on the validation set remains:
It might be worth noting that this offset also appears on the train set for x < ~430:
It looks like your model is overfitting and is simply always returning the value from the last timestep as a prediction. Your dataset is probably too small for a model with this many parameters to converge. You'll need to resort to techniques that combat overfitting: aggressive dropout, adding more data, or trying simpler, less overparameterized methods.
This phenomenon (LSTMs returning a shifted version of the input) has been a recurring theme in many stackoverflow questions. The answers there might contain some useful information:
LSTM Sequence Prediction in Keras just outputs last step in the input
LSTM model just repeats the past in forecasting time series
LSTM NN produces “shifted” forecast (low quality result)
Keras network producing inverse predictions
Stock price predictions of keras multilayer LSTM model converge to a constant value
Keras LSTM predicted timeseries squashed and shifted
LSTM Time series shifted predictions on stock market close price
Interesting results from LSTM RNN : lagged results for train and validation data
Finally, be aware that, depending on the nature of your dataset, there simply might be no pattern to be discovered in your data at all. You see this a lot with people trying to predict the stock market with LSTMs (there is a question on stackoverflow on how to predict the lottery numbers).
The answer is much simpler than we thought...
I saw multiple people saying this is due to overfitting and dataset size. Some other people stated this is due to rescaling.
After several tries, I found the solution: detrend the data before feeding it to the RNN.
For example, you can fit a simple degree-2 polynomial to the data, which gives you a formula for the trend. Subtract the value of that formula from each data point to get a new, detrended dataset, and feed that to the LSTM. After prediction, add the trend back to the result, and the results should look better.
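A minimal sketch of that detrending step, using only the standard library so it runs anywhere. The names fit_quadratic and trend_at and the toy series are made up for illustration; in practice a library routine such as NumPy's polyfit would usually do the fitting.

```python
def fit_quadratic(ys):
    """Least-squares fit of y ~ a + b*t + c*t**2 for t = 0..len(ys)-1."""
    ts = range(len(ys))
    # Normal equations (X^T X) coef = X^T y for the columns 1, t, t**2.
    power_sums = [sum(t ** k for t in ts) for k in range(5)]
    A = [[float(power_sums[i + j]) for j in range(3)] for i in range(3)]
    rhs = [float(sum((t ** i) * y for t, y in zip(ts, ys))) for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, 3):
            factor = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= factor * A[col][c]
            rhs[r] -= factor * rhs[col]
    # Back substitution.
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(A[i][j] * coef[j] for j in range(i + 1, 3))
        coef[i] = (rhs[i] - tail) / A[i][i]
    return tuple(coef)


def trend_at(coef, t):
    a, b, c = coef
    return a + b * t + c * t * t


# Toy series with a purely quadratic trend (illustration only).
series = [2.0 + 0.5 * t + 0.1 * t * t for t in range(10)]
coef = fit_quadratic(series)

# 1) Remove the trend; the residuals are what the LSTM would be trained on.
residuals = [y - trend_at(coef, t) for t, y in enumerate(series)]
# 2) After prediction, add the trend back at the same time index.
restored = [r + trend_at(coef, t) for t, r in enumerate(residuals)]
```

For a forecast one step past the data, the trend would be evaluated at t = len(series) and added to the model's prediction of the residual.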

CNN Keras: How many weights will be trained?

I have a little comprehension problem with CNN. And I'm not quite sure how many filters and thus weights are trained.
Example: I have an input layer with 32x32 pixels and 3 channels (i.e. shape of (32,32,3)). Now I use a 2D-convolution layer with 10 filters of shape (4,4). So I end up with 10 channels, each with shape of (29,29), but do I now train a separate filter for each input channel or are they shared? Do I train 3x10x4x4 weights or do I train 10x4x4 weights?
You can find out the number of (non-)trainable parameters of a model in Keras using the summary function:
from keras import models, layers
model = models.Sequential()
model.add(layers.Conv2D(10, (4,4), input_shape=(32, 32, 3)))
model.summary()
Here is the output:
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 29, 29, 10) 490
=================================================================
Total params: 490
Trainable params: 490
Non-trainable params: 0
In general, for a 2D-convolution layer with k filters with size of w*w applied on an input with c channels the number of trainable parameters (considering one bias parameter for each filter, in the default case) is equal to k*w*w*c+k or k*(w*w*c+1). In the example above we have: k=10, w=4, c=3 therefore we have 10*(4*4*3+1) = 490 trainable parameters. As you can infer, for each channel there are separate weights and they are not shared. Furthermore, the number of parameters of a 2D-convolution layer does not depend on the width or height of the previous layer.
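The formula can be checked with a few lines of plain Python. The helper names conv2d_params and conv2d_output_size are hypothetical and only mirror the formula above; the output-size helper assumes 'valid' padding, which is the Conv2D default.

```python
def conv2d_params(filters, kernel, in_channels, use_bias=True):
    # k * (w * w * c + 1): every filter has its own weights for each
    # input channel, plus one bias per filter.
    return filters * (kernel * kernel * in_channels + (1 if use_bias else 0))


def conv2d_output_size(input_size, kernel, stride=1):
    # Spatial output size with 'valid' padding.
    return (input_size - kernel) // stride + 1


params = conv2d_params(filters=10, kernel=4, in_channels=3)   # 490
weights_only = conv2d_params(10, 4, 3, use_bias=False)        # 480 = 3x10x4x4
out_size = conv2d_output_size(32, 4)                          # 29
print(params, weights_only, out_size)
```

So the answer to the question is 3x10x4x4 = 480 weights, plus 10 biases.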
Update:
A convolution layer with depth-wise shared weights: I am not aware of such a layer and could not find a built-in implementation of that in Keras or Tensorflow either. But after thinking about it, you realize that it is essentially equivalent to summing all the channels together and then applying a 2D-convolution on the result. For example in case of a 32*32*3 image, first all the three channels are summed together resulting in a 32*32*1 tensor and then a 2D-convolution can be applied on that tensor. Therefore at least one way of achieving a 2D-convolution with shared weights across channels could be like this in Keras (which may or may not be efficient):
from keras import models, layers
from keras import backend as K
model = models.Sequential()
model.add(layers.Lambda(lambda x: K.expand_dims(K.sum(x, axis=-1)), input_shape=(32, 32, 3)))
model.add(layers.Conv2D(10, (4,4)))
model.summary()
Output:
Layer (type) Output Shape Param #
=================================================================
lambda_1 (Lambda) (None, 32, 32, 1) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 29, 29, 10) 170
=================================================================
Total params: 170
Trainable params: 170
Non-trainable params: 0
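The claim that sharing one kernel across channels is the same as summing the channels first follows from linearity, and it can be checked on a toy example. This is a plain-Python sketch with made-up 3x3 channels and a made-up 2x2 kernel; correlate2d_valid is a hypothetical helper, not a Keras function.

```python
def correlate2d_valid(image, kernel):
    """Sliding-window sum of products without padding (what a conv layer computes)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


channels = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[0, 1, 0], [1, 0, 1], [0, 1, 0]],
    [[2, 2, 2], [1, 1, 1], [0, 0, 0]],
]
kernel = [[1, -1], [2, 0]]

# Option A: apply the same (shared) kernel to every channel, then add the results.
per_channel = [correlate2d_valid(ch, kernel) for ch in channels]
shared = [[sum(m[i][j] for m in per_channel) for j in range(2)] for i in range(2)]

# Option B: sum the channels first, then apply the kernel once.
summed = [[sum(ch[i][j] for ch in channels) for j in range(3)] for i in range(3)]
summed_first = correlate2d_valid(summed, kernel)

print(shared == summed_first)
```

Both options give the same feature map, which is why the Lambda layer that sums over the channel axis is enough.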
One good thing about that Lambda layer is that it could be added in any place (e.g. after the convolution layer). But I think the most important question to ask here is: "Why would a 2D-conv layer with depth-wise shared weights be beneficial?" One obvious answer is that the network size (i.e. the total number of trainable parameters) is reduced and therefore there might be a decrease in training time, which I suspect would be negligible. Further, using shared weights across channels implies that the patterns present in different channels are more or less similar. But this is not always the case, for example in RGB images, and therefore by using shared weights across channels I guess you might observe a (noticeable) decrease in network accuracy. So, at the very least, you should keep this trade-off in mind and experiment with it.
However, there is another kind of convolution layer which you might be interested in, called "Depth-wise Separable Convolution", which has been implemented in Tensorflow, and Keras supports it as well. The idea is that a separate 2D-conv filter is applied on each channel and afterwards the resulting feature maps are aggregated using k 1*1 convolutions (k here is the number of output channels). It basically separates the learning of spatial features and depth-wise features. In his paper, "Xception: Deep Learning with Depthwise Separable Convolutions", Francois Chollet (the creator of Keras) shows that using depth-wise separable convolutions improves both the performance and accuracy of the network. And here you can read more about different kinds of convolution layers used in deep learning.
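To get a feeling for the sizes involved, here is a rough parameter count for a depth-wise separable convolution on the same example (k=10, w=4, c=3). The helper separable_conv2d_params is a hypothetical name; it assumes one w*w depth-wise filter per input channel (depth multiplier 1), followed by k 1*1 point-wise filters and one bias per output channel.

```python
def separable_conv2d_params(filters, kernel, in_channels,
                            depth_multiplier=1, use_bias=True):
    # Depth-wise step: one kernel x kernel filter per input channel.
    depthwise = kernel * kernel * in_channels * depth_multiplier
    # Point-wise step: 1x1 convolution mixing the channels into `filters` outputs.
    pointwise = in_channels * depth_multiplier * filters
    bias = filters if use_bias else 0
    return depthwise + pointwise + bias


separable = separable_conv2d_params(filters=10, kernel=4, in_channels=3)  # 88
regular = 10 * (4 * 4 * 3 + 1)                                            # 490
print(separable, regular)
```

Under these assumptions the separable layer needs 88 parameters instead of 490 for the regular convolution.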

How to train a network in Keras for varying output size

I have basic neural network created with Keras. I train the network successfully with vectors of data and corresponding output data that is a vector with two elements. It represents a coordinate (x, y). So in goes an array, out comes an array.
Problem is that I am unable to use training data where a single input vector should correspond to many coordinates. Effectively, I desire a vector of coordinates as output, without prior knowledge of the number of coordinates.
Network is created by
model = Sequential()
model.add(Dense(20, input_shape=(196608,)))
model.add(Dense(10))
model.add(Dense(2))
and model summary shows the output dimensions for each layer
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 20) 3932180
_________________________________________________________________
dense_2 (Dense) (None, 10) 210
_________________________________________________________________
dense_3 (Dense) (None, 2) 22
=================================================================
I realize the network structure only allows a length 2 vector as output. Dense layers also do not accept None as their size. How do I modify the network so that it can train on and output a vector of vectors (list of coordinates)?
A recurrent neural network (RNN) would be much more appropriate. These models are typically called seq2seq, that is, sequence to sequence. Recurrent nets use layers like LSTM and GRU, and can take and produce variable-length sequences. Just look at things like machine translation done with RNNs.
This can be done directly with keras, and there are many examples lying around the internet, for example this one.
An RNN is not what you want for predicting coordinates. Instead, I would recommend using a model that predicts coordinates and associated confidences. So you would have 100 coordinate predictions for every forward pass through your model. Each of those predictions would have another associated prediction that determines if it is correct or not. Only predictions that are above a certain confidence threshold would count. That confidence threshold is what allows the model to choose how many points it wants to use each time (with a maximum number set by the number of outputs, which in this example is 100).
R-CNN is a model that does just that. Here is the first Keras implementation I found on GitHub: https://github.com/yhenon/keras-frcnn.
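The thresholding step described above is just post-processing on a fixed-size output. A minimal sketch in plain Python, assuming the model returns (x, y, confidence) triples; the function name select_points, the 0.5 threshold, and the sample values are made up for illustration.

```python
def select_points(predictions, threshold=0.5):
    """Keep only the coordinates whose confidence reaches the threshold.

    predictions: iterable of (x, y, confidence) triples, one per output slot.
    """
    return [(x, y) for x, y, confidence in predictions if confidence >= threshold]


# A fixed number of output slots (4 here instead of 100 to keep it short).
raw_output = [
    (0.10, 0.20, 0.95),
    (0.55, 0.40, 0.12),
    (0.80, 0.75, 0.61),
    (0.30, 0.90, 0.03),
]
points = select_points(raw_output)
print(points)  # [(0.1, 0.2), (0.8, 0.75)]
```

The network itself always outputs the same number of slots; the variable-length list of coordinates only appears after this filtering.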
