If I freeze my base_model with trainable=False, I get strange numbers from trainable_weights.
Before freezing my model has 162 trainable_weights. After freezing, the model only has 2. I tied 2 layers to the pre-trained network. Does trainable_weights show me the layers to train? I find the number weird, when I see 2,253,335 Trainable params.
Trainable weights are the weights that are learned during training. If you set trainable=False, those weights are kept as they are and are not updated, because they are no longer learned. You might see some "strange numbers" because you are either using a pre-trained network whose weights have already been learned, or using random initialization when defining the model. When doing transfer learning with pre-trained models, a common practice is to freeze the weights of the (pre-trained) base model and train only the extra layers that you add at the end.
Trainable weights are the same as trainable parameters.
A trainable layer often has multiple trainable weights.
Let's view this example:
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, None, 501) 0
_________________________________________________________________
lstm_1 (LSTM) (None, None, 40) 86720
_________________________________________________________________
SoftDense (TimeDistributed) (None, None, 501) 20541
=================================================================
Total params: 107,261
Trainable params: 107,261
Non-trainable params: 0
__________________________
The first layer is just an input layer; it receives the data as-is, so it does not have any trainable weights.
The next layer has 542*4*40 = 86720 trainable weights:
40 for its output dimension, 4 because an LSTM has four internal weight sets (three gates plus the cell candidate), and 542 = 501 + 40 + 1 (input dimension, recurrent hidden dimension, and bias).
The last layer has 41*501=20541 trainable weights
(40 from the hidden dimension of its input, the LSTM layer, +1 for bias, times 501 for its output).
Total trainable parameters are 107,261.
If I were to freeze the last layer I would have only 86,720 trainable weights.
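The arithmetic above can be checked with a small sketch. The helper names lstm_params and dense_params are made up for this illustration (they are not Keras functions), and the formulas assume a standard LSTM and a Dense layer, both with biases enabled.

```python
def lstm_params(input_dim, units):
    # 4 internal weight sets, each with input weights, recurrent weights and a bias
    return 4 * units * (input_dim + units + 1)


def dense_params(input_dim, units):
    # one weight per input plus one bias, for every output unit
    return (input_dim + 1) * units


lstm = lstm_params(input_dim=501, units=40)
dense = dense_params(input_dim=40, units=501)
total = lstm + dense
print(lstm, dense, total)
```

Freezing the last layer corresponds to dropping the dense term from the total.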
Late to the party, but maybe this answer can be useful to others that might be googling this.
First, it is useful to distinguish the quantity "Trainable params" that one sees at the end of my_model.summary() from the output of len(my_model.trainable_weights).
Maybe an example helps: let's say I have a model with VGG16 architecture.
from tensorflow import keras  # or: import keras

my_model = keras.applications.vgg16.VGG16(
    weights="imagenet",
    include_top=False
)
# take a look at model summary
my_model.summary()
You will see that there are 13 conv. layers that have trainable parameters. Pooling and input layers do not have trainable parameters, i.e. no learning is needed for them. In each of those 13 layers, on the other hand, there are "weights" and "biases" that need to be learned; think of them as variables.
len(my_model.trainable_weights) gives you the number of trainable weight tensors: here, the number of trainable layers multiplied by 2 (one kernel and one bias per layer).
In this case, if you print len(my_model.trainable_weights), you will get 26 as the answer. We can think of 26 as the number of variables for the optimization, variables that can of course differ in shape.
Now to connect trainable_weights to the total number of trainable parameters, one can try:
trainable_params = 0
for weight in my_model.trainable_weights:
    # each entry is one tensor (a kernel or a bias); count its scalar entries
    trainable_params += weight.numpy().size
print(f"#{trainable_params = }")
You will get this number: 14714688, which matches the "Trainable params" number you see at the end of my_model.summary().
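The same two numbers can be reproduced without downloading the model. The sketch below is only an illustration: it hard-codes the filter counts of the 13 conv layers of the standard VGG16 architecture (an assumption written out by hand, not read from Keras) and treats a plain list of shapes as a stand-in for my_model.trainable_weights.

```python
from math import prod

# Output channels of the 13 conv layers in VGG16 (all kernels are 3x3);
# the input image has 3 channels. Written out by hand for this sketch.
conv_filters = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]

weight_shapes = []  # stand-in for the shapes in my_model.trainable_weights
in_channels = 3
for out_channels in conv_filters:
    weight_shapes.append((3, 3, in_channels, out_channels))  # kernel tensor
    weight_shapes.append((out_channels,))                    # bias tensor
    in_channels = out_channels

num_tensors = len(weight_shapes)                          # what len(...) reports
num_params = sum(prod(shape) for shape in weight_shapes)  # what summary() reports
print(num_tensors, num_params)
```

The first number counts tensors, the second counts the scalar entries inside them, which is why the two can differ by many orders of magnitude.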
I understand that it is a long post, but help in any of the sections is appreciated.
I have some queries about the prediction method of my LSTM model. Here is a general summary of my approach:
I used a dataset having 50 time series for training. They start with a value of 1.09 all the way up to 0.82, with each time series having between 570 to 2000 datapoints (i.e, each time series has a different length, but similar trend).
I converted them to the dataset accepted by keras' LSTM/Bi-LSTM layers in the format:
[1, 0.99, 0.98, 0.97] ==Output==> [0.96]
[0.99, 0.98, 0.97, 0.96] ==Output==> [0.95]
and so on..
Shapes of the input and output containers (arrays): input(39832, 5, 1) and output(39832, )
Error-free training
Prediction on an initial window of data with shape (1, 5, 1), taken from the actual data.
The predicted output is one value, which is appended to a separate list (for plotting), as well as appended to the window, and the first value of the window dropped out. This window is then fed as input to the model to generate the next prediction point.
Continue this until I get the whole curve for both models (LSTM and Bi-LSTM)
However, the prediction is not even close to the actual data. It flatlines to a fixed value, whereas it should be somewhat like the black curve (which is the actual data)
Figure:https://i.stack.imgur.com/Ofw7m.png
Model (similar code goes for Bi-LSTM model):
model_lstm = Sequential()
model_lstm.add(LSTM(128, input_shape=(timesteps, 1), return_sequences= True))
model_lstm.add(Dropout(0.2))
model_lstm.add(LSTM(128, return_sequences= False))
model_lstm.add(Dropout(0.2))
model_lstm.add(Dense(1))
model_lstm.compile(loss = 'mean_squared_error', optimizer = optimizers.Adam(0.001))
Curve prediction initialize:
start = cell_to_test[0:timesteps].reshape(1, timesteps, 1)
y_curve_lstm = list(start.flatten())
y_window = start
Curve prediction:
while len(y_curve_lstm) <= len(cell_to_test):
    yhat = model_lstm.predict(y_window)
    yhat = float(yhat)
    y_curve_lstm.append(yhat)
    y_window = list(y_window.flatten())
    y_window.append(yhat)
    y_window.remove(y_window[0])
    y_window = np.array(y_window).reshape(1, timesteps, 1)
    #print(yhat)
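For readers following the procedure, here is a minimal, framework-free sketch of the windowing and the feed-the-prediction-back loop described above. The names make_windows, rolling_forecast and linear_step are hypothetical, and linear_step is only a stand-in for the trained model; it is not meant to reproduce the LSTM's behaviour.

```python
from collections import deque


def make_windows(series, timesteps):
    """Split a series into (window, next value) training pairs."""
    inputs, targets = [], []
    for i in range(len(series) - timesteps):
        inputs.append(series[i:i + timesteps])
        targets.append(series[i + timesteps])
    return inputs, targets


def rolling_forecast(predict_next, start_window, n_steps):
    """Feed each prediction back into the window to forecast n_steps ahead."""
    window = deque(start_window, maxlen=len(start_window))
    curve = list(start_window)
    for _ in range(n_steps):
        yhat = predict_next(list(window))
        curve.append(yhat)
        window.append(yhat)  # maxlen drops the oldest value automatically
    return curve


def linear_step(window):
    # Hypothetical stand-in for the model: continue the slope of the last step.
    return window[-1] + (window[-1] - window[-2])


series = [1.00, 0.99, 0.98, 0.97, 0.96, 0.95]
X, y = make_windows(series, timesteps=4)
curve = rolling_forecast(linear_step, series[:4], n_steps=2)
print(X[0], y[0], len(curve))
```

With the Keras model, predict_next would reshape the window to (1, timesteps, 1) and call model_lstm.predict. Every prediction after the first is computed from earlier predictions, so any error is fed back into later inputs.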
Model summary:
Model: "sequential_5"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_5 (LSTM) (None, 5, 128) 66560
_________________________________________________________________
dropout_5 (Dropout) (None, 5, 128) 0
_________________________________________________________________
lstm_6 (LSTM) (None, 128) 131584
_________________________________________________________________
dropout_6 (Dropout) (None, 128) 0
_________________________________________________________________
dense_5 (Dense) (None, 1) 129
=================================================================
Total params: 198,273
Trainable params: 198,273
Non-trainable params: 0
_________________________________________________________________
And in addition to diagnosing the problem, I am really trying to find the answers to the following questions (I looked up other sources, but in vain):
Is my data enough to train the LSTM model? I have been told that it requires thousands of data points, so I feel that my current dataset more than suffices the condition.
Is my model less/more complex than it needs to be?
Does increasing the number of epochs, layers, and neurons per layer always lead to a 'better' model, or are there optimal values for these? If the latter, is there a method to find this optimal point, or is trial and error the only way?
I trained with the number of epochs=25, which gave me a loss of 1.25 * 10e-4. Should the loss be lower for the model to predict the trend? (I am focused on getting the shape first, accuracy later, because the training takes too long with higher epochs)
In continuation to the previous question, does loss have the same unit as the data? The reason why I am asking this is because the data has a resolution of up to 10e-7.
Once again, I understand that it has been a long post, but help in any of the sections is appreciated.
I'm trying to get an embedding of some data I have using Keras. I have a model consisting of only an embedding layer, and then I have my own custom loss function that uses the weights of the embedding layer to calculate a loss function, however, the network won't train at all, meaning the loss value is always the same.
My loss function looks like this
def custom_loss(a,b):
    layer = model.layers[0].get_weights()[0]
    distances = [distance(layer[i],layer[j]) for i in range(len(layer)-1) for j in range(i+1, len(layer))]
    loss = [-math.log10(sims[i]*distances[i])for i in range(len(sims))]
    return K.sum(loss)+0*K.sum(b)

model.compile(optimizer="rmsprop",loss=custom_loss)
I get the distances between the different n-dimensional embedding values for all different points I pass to the embedding layer. I have the +0*K.sum(b) because otherwise, I get a no gradient exception. sims is a list of values that I multiply the distances with. loss ends up being a list with floats.
I don't get any exceptions, however, the loss is just constant. What might be wrong?
I must say I'm not really familiar with creating custom loss functions like this in Keras, so what I've done might be totally wrong.
My training data consists of simply passing ones as input values, as I'm, again, only interested in getting the embedding wrt my custom loss function. I know I'm basically abusing Keras in order to not write my own optimization function.
Does my K.sum(loss) simply give no gradient? My intuition tells me that layer = model.layers[0].get_weights()[0] is simply not updated between batches, but I don't know if this is correct, or if it is, how to fix it. Is there some other way to access the weights that would work?
Network summary looks like this
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 400, 3) 1200
=================================================================
Total params: 1,200
Trainable params: 1,200
Non-trainable params: 0
The distance function used looks like this
def distance(a,b):
    powers = [(a[i]-b[i])**2 for i in range(len(a))]
    return math.sqrt(sum(powers))
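As a reference for what the loss is meant to compute, here is a plain-Python sketch of the same pairwise-distance quantity. The names pairwise_distances and loss_value and the sample values are made up for illustration. It works on ordinary Python floats, so it produces only a number and carries no gradient; the same is true of anything computed from get_weights(), which returns NumPy arrays rather than backend tensors.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    # combinations() yields pairs (i, j) with i < j, in the same order as
    # the nested comprehension in custom_loss
    return [math.dist(a, b) for a, b in combinations(points, 2)]


def loss_value(points, sims):
    distances = pairwise_distances(points)
    return sum(-math.log10(s * d) for s, d in zip(sims, distances))


# Made-up 3-dimensional embeddings and one made-up similarity per pair
points = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 1.0)]
sims = [2.0, 10.0, 1.0]

distances = pairwise_distances(points)
value = loss_value(points, sims)
print(distances, value)
```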
I have a little comprehension problem with CNN. And I'm not quite sure how many filters and thus weights are trained.
Example: I have an input layer with 32x32 pixels and 3 channels (i.e. a shape of (32,32,3)). Now I use a 2D-convolution layer with 10 filters of shape (4,4). So I end up with 10 channels, each of shape (29,29), but do I now train a separate filter for each input channel or are they shared? Do I train 3x10x4x4 weights or do I train 10x4x4 weights?
You can find out the number of (non-)trainable parameters of a model in Keras using the summary function:
from keras import models, layers
model = models.Sequential()
model.add(layers.Conv2D(10, (4,4), input_shape=(32, 32, 3)))
model.summary()
Here is the output:
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 29, 29, 10) 490
=================================================================
Total params: 490
Trainable params: 490
Non-trainable params: 0
In general, for a 2D-convolution layer with k filters with size of w*w applied on an input with c channels the number of trainable parameters (considering one bias parameter for each filter, in the default case) is equal to k*w*w*c+k or k*(w*w*c+1). In the example above we have: k=10, w=4, c=3 therefore we have 10*(4*4*3+1) = 490 trainable parameters. As you can infer, for each channel there are separate weights and they are not shared. Furthermore, the number of parameters of a 2D-convolution layer does not depend on the width or height of the previous layer.
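The formula can be written as a short sketch. The helper names conv2d_params and conv2d_output_size are invented for this example, and the output-size helper assumes 'valid' padding with stride 1, which is the Keras default for Conv2D.

```python
def conv2d_params(filters, kernel_size, in_channels, use_bias=True):
    # each filter spans all input channels, plus one optional bias per filter
    per_filter = kernel_size * kernel_size * in_channels + (1 if use_bias else 0)
    return filters * per_filter


def conv2d_output_size(input_size, kernel_size):
    # assumes 'valid' padding and stride 1
    return input_size - kernel_size + 1


rgb_params = conv2d_params(filters=10, kernel_size=4, in_channels=3)
single_channel_params = conv2d_params(filters=10, kernel_size=4, in_channels=1)
side = conv2d_output_size(input_size=32, kernel_size=4)
print(rgb_params, single_channel_params, side)
```

Note that conv2d_params never takes the image width or height as an argument, which mirrors the point that the parameter count does not depend on the spatial size of the input.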
Update:
A convolution layer with depth-wise shared weights: I am not aware of such a layer and could not find a built-in implementation of that in Keras or Tensorflow either. But after thinking about it, you realize that it is essentially equivalent to summing all the channels together and then applying a 2D-convolution on the result. For example in case of a 32*32*3 image, first all the three channels are summed together resulting in a 32*32*1 tensor and then a 2D-convolution can be applied on that tensor. Therefore at least one way of achieving a 2D-convolution with shared weights across channels could be like this in Keras (which may or may not be efficient):
from keras import models, layers
from keras import backend as K
model = models.Sequential()
model.add(layers.Lambda(lambda x: K.expand_dims(K.sum(x, axis=-1)), input_shape=(32, 32, 3)))
model.add(layers.Conv2D(10, (4,4)))
model.summary()
Output:
Layer (type) Output Shape Param #
=================================================================
lambda_1 (Lambda) (None, 32, 32, 1) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 29, 29, 10) 170
=================================================================
Total params: 170
Trainable params: 170
Non-trainable params: 0
One good thing about that Lambda layer is that it could be added in any place (e.g. after the convolution layer). But I think the most important question to ask here is: "Why would using a 2D-conv layer with depth-wise shared weights be beneficial?" One obvious answer is that the network size (i.e. the total number of trainable parameters) is reduced and therefore there might be a decrease in training time, which I suspect would be negligible. Further, using shared weights across channels implies that the patterns present in different channels are more or less similar. But this is not always the case, for example in RGB images, and therefore by using shared weights across channels I guess you might observe a (noticeable) decrease in network accuracy. So, at least, you should keep this trade-off in mind and experiment with it.
However, there is another kind of convolution layer, which you might be interested in, called "Depth-wise Separable Convolution", which has been implemented in Tensorflow, and Keras supports it as well. The idea is that a separate 2D-conv filter is applied on each channel and afterwards the resulting feature maps are aggregated using k 1*1 convolutions (k here is the number of output channels). It basically separates the learning of spatial features and depth-wise features. In his paper, "Xception: Deep Learning with Depthwise Separable Convolutions", Francois Chollet (the creator of Keras) shows that using depth-wise separable convolutions improves both the performance and accuracy of the network. And here you can read more about different kinds of convolution layers used in deep learning.
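To compare sizes, the parameter count of a depth-wise separable convolution can be sketched as follows. The function name is made up, and the count assumes a depth multiplier of 1 and a single bias per output channel; other configurations would give different numbers.

```python
def separable_conv2d_params(filters, kernel_size, in_channels):
    # Assumes depth multiplier 1 and one bias per output channel.
    depthwise = kernel_size * kernel_size * in_channels  # one spatial filter per input channel
    pointwise = in_channels * filters                    # k 1x1 convolutions across channels
    bias = filters
    return depthwise + pointwise + bias


separable = separable_conv2d_params(filters=10, kernel_size=4, in_channels=3)
standard = 10 * (4 * 4 * 3 + 1)  # the regular Conv2D count from the formula above
print(separable, standard)
```

For the 32*32*3 example with ten 4*4 filters, the separable variant needs far fewer parameters than the regular convolution under these assumptions.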
I'm trying to create a simple stateful neural network in keras to wrap my head around how to connect Embedding layers and LSTMs. I have a piece of text where I have mapped every character to an integer and would like to send in one character at a time to predict the next character. I have done this earlier where I sent in 8 characters at a time and got that to work well (using return_sequences=True and TimeDistributed(Dense)). But this time I want to send in only 1 character at a time and this is where my problem arises.
The code I use to set up my model:
n_fac = 32
vocab_size = len(chars)
n_hidden = 256
batch_size=64
model = Sequential()
model.add(Embedding(vocab_size,n_fac,input_length=1,batch_input_shape=(batch_size,1)))
model.add(BatchNormalization())
model.add(LSTM(n_hidden,stateful=True))
model.add(Dense(vocab_size,activation='softmax'))
model.summary() gives me the following:
Layer (type) Output Shape Param # Connected to
embedding_1 (Embedding) (64, 1, 32) 992 embedding_input_1[0][0]
batchnormalization_1 (BatchNorma (64, 1, 32) 128 embedding_1[0][0]
lstm_1 (LSTM) (64, 256) 295936 batchnormalization_1[0][0]
dense_1 (Dense) (64, 31) 7967 lstm_1[0][0]
Total params: 305,023
Trainable params: 304,959
Non-trainable params: 64
The code I use to set up my training data:
text = ... #Omitted for simplicity. Just setting text to some kind of literature work
text = text.lower() #Simple model, therefore only using lower case characters
idx2char = list(set(list(text)))
char2idx = {char:idx for idx,char in enumerate(idx2char)}
text_in_idx = [char2idx[char] for char in text]
x = text_in_idx[:-1]
y = text_in_idx[1:]
Compiling and training my network:
model.compile(optimizer=Adam(lr=1e-4),loss='sparse_categorical_crossentropy')
nb_epoch = 10
for i in range(nb_epoch):
    model.reset_states()
    model.fit(x,y,nb_epoch=1,batch_size=batch_size,shuffle=False)
Training works as it should, the loss is reduced with each epoch.
Now I want to try out my trained network but have no idea how to give it a character to predict the next. I start out by resetting its states and then want to start feeding it one char at a time.
I tried a couple of different inputs but all of them failed. These are not qualified guesses.
#The model uses integers for characters, therefore integers are sent as input
model.predict([1]) #Type error
model.predict(np.array([1])) #Value error
model.predict(np.array([1])[np.newaxis,:]) #Value error
model.predict(np.array([1])[:,np.newaxis]) #Value error
Am I forced to send in something of length batch_size or how am I supposed to send in data for the model to predict something?
The error text for Value error is very long and obscure so I omitted it. I can supply it if needed.
Using theano backend with keras.
I have a basic neural network created with Keras. I train the network successfully with vectors of data and corresponding output data that is a vector with two elements. It represents a coordinate (x, y). So in goes an array, out comes an array.
Problem is that I am unable to use training data where a single input vector should correspond to many coordinates. Effectively, I desire a vector of coordinates as output, without prior knowledge of the number of coordinates.
Network is created by
model = Sequential()
model.add(Dense(20, input_shape=(196608,)))
model.add(Dense(10))
model.add(Dense(2))
and model summary shows the output dimensions for each layer
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 20) 3932180
_________________________________________________________________
dense_2 (Dense) (None, 10) 210
_________________________________________________________________
dense_3 (Dense) (None, 2) 22
=================================================================
I realize the network structure only allows a length 2 vector as output. Dense layers also do not accept None as their size. How do I modify the network so that it can train on and output a vector of vectors (list of coordinates)?
A recurrent neural network (RNN) would be much more appropriate. These models are typically called seq2seq, that is, sequence to sequence. Recurrent nets use layers like LSTM and GRU, and can input and output variable-length sequences. Just look at things like Machine Translation done with RNNs.
This can be done directly with keras, and there are many examples lying around the internet, for example this one.
An RNN is not what you want for predicting coordinates. Instead, I would recommend using a model that predicts coordinates and associated confidences. So you would have, say, 100 coordinate predictions for every forward pass through your model. Each of those predictions would have another associated prediction that determines whether it is correct or not. Only predictions that are above a certain confidence threshold would count. That confidence threshold is what allows the model to choose how many points it wants to use each time (with a maximum number set by the number of outputs, which in this example is 100).
R-CNN is a model that does just that. Here is the first keras implementation I found on github https://github.com/yhenon/keras-frcnn.
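The thresholding step described above can be sketched in a few lines. This is only an illustration of the post-processing idea: the function name select_points, the 0.5 threshold, and the sample numbers are all made up, and a real model would produce the (x, y, confidence) triples.

```python
def select_points(predictions, threshold=0.5):
    """Keep the (x, y) pairs whose confidence exceeds the threshold."""
    return [(x, y) for x, y, confidence in predictions if confidence > threshold]


# Made-up model output: a fixed number of (x, y, confidence) triples
raw_output = [
    (12.0, 40.5, 0.93),
    (88.1, 10.2, 0.12),
    (55.0, 61.3, 0.71),
    (5.4, 5.9, 0.05),
]

points = select_points(raw_output, threshold=0.5)
print(points)
```

The network itself always emits the same number of outputs; the variable-length result comes entirely from this filtering.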