I have a little comprehension problem with CNNs, and I'm not quite sure how many filters, and thus weights, are trained.
Example: I have an input layer of 32x32 pixels with 3 channels (i.e. shape (32,32,3)). Now I use a 2D-convolution layer with 10 filters of shape (4,4). So I end up with 10 channels, each of shape (29,29), but do I now train a separate filter for each input channel, or are they shared? Do I train 3x10x4x4 weights or 10x4x4 weights?
You can find out the number of (non-)trainable parameters of a model in Keras using the summary function:
from keras import models, layers
model = models.Sequential()
model.add(layers.Conv2D(10, (4,4), input_shape=(32, 32, 3)))
model.summary()
Here is the output:
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 29, 29, 10) 490
=================================================================
Total params: 490
Trainable params: 490
Non-trainable params: 0
In general, for a 2D-convolution layer with k filters of size w*w applied to an input with c channels, the number of trainable parameters (considering one bias parameter per filter, as in the default case) is equal to k*w*w*c+k, or k*(w*w*c+1). In the example above we have k=10, w=4, c=3, therefore there are 10*(4*4*3+1) = 490 trainable parameters. As you can infer, there are separate weights for each channel, and they are not shared. Furthermore, the number of parameters of a 2D-convolution layer does not depend on the width or height of the previous layer.
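As a quick sanity check, the formula can be evaluated in a few lines of plain Python (the function name is mine, not a Keras API):

```python
# Trainable parameters of a regular 2D-convolution layer:
# k filters of size w*w applied over c input channels,
# plus one bias per filter (the Keras default).
def conv2d_params(k, w, c):
    return k * (w * w * c + 1)

print(conv2d_params(10, 4, 3))  # 490, matching model.summary() above
```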
Update:
A convolution layer with depth-wise shared weights: I am not aware of such a layer and could not find a built-in implementation of it in Keras or Tensorflow either. But after thinking about it, you realize that it is essentially equivalent to summing all the channels together and then applying a 2D-convolution to the result. For example, in the case of a 32*32*3 image, first all three channels are summed together, resulting in a 32*32*1 tensor, and then a 2D-convolution can be applied to that tensor. Therefore, at least one way of achieving a 2D-convolution with shared weights across channels in Keras is the following (which may or may not be efficient):
from keras import models, layers
from keras import backend as K
model = models.Sequential()
model.add(layers.Lambda(lambda x: K.expand_dims(K.sum(x, axis=-1)), input_shape=(32, 32, 3)))
model.add(layers.Conv2D(10, (4,4)))
model.summary()
Output:
Layer (type) Output Shape Param #
=================================================================
lambda_1 (Lambda) (None, 32, 32, 1) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 29, 29, 10) 170
=================================================================
Total params: 170
Trainable params: 170
Non-trainable params: 0
One good thing about that Lambda layer is that it could be added anywhere (e.g. after the convolution layer). But I think the most important question to ask here is: "Why would using a 2D-conv layer with depth-wise shared weights be beneficial?" One obvious answer is that the network size (i.e. the total number of trainable parameters) is reduced, and therefore there might be a decrease in training time, which I suspect would be negligible. Further, using shared weights across channels implies that the patterns present in different channels are more or less similar. But this is not always the case, for example with RGB images, so by using shared weights across channels I guess you might observe a (noticeable) decrease in network accuracy. At the least, you should keep this trade-off in mind and experiment with it.
However, there is another kind of convolution layer which you might be interested in, called "depth-wise separable convolution", which has been implemented in Tensorflow, and Keras supports it as well. The idea is that a separate 2D-conv filter is applied to each channel, and afterwards the resulting feature maps are aggregated using k 1*1 convolutions (k here is the number of output channels). It basically separates the learning of spatial features from the learning of depth-wise features. In his paper "Xception: Deep Learning with Depthwise Separable Convolutions", Francois Chollet (the creator of Keras) shows that using depth-wise separable convolutions improves both the performance and the accuracy of the network. And here you can read more about different kinds of convolution layers used in deep learning.
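To make the parameter saving concrete, here is the arithmetic for a depth-wise separable convolution, assuming depth_multiplier=1 and one bias per output filter (which I believe matches Keras's SeparableConv2D defaults; the function name is mine):

```python
# Parameter count for a depth-wise separable 2D convolution:
# first a separate w*w filter per input channel (spatial features),
# then k 1x1 convolutions across the c channels (depth-wise features),
# plus one bias per output filter.
def separable_conv2d_params(k, w, c):
    depthwise = w * w * c   # one w*w filter per input channel
    pointwise = c * k       # k 1x1 convolutions over c channels
    biases = k              # one bias per output filter
    return depthwise + pointwise + biases

# For the 32x32x3 example with k=10 filters of size 4x4:
print(separable_conv2d_params(10, 4, 3))  # 88, versus 490 for a regular Conv2D
```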
Related
I'm trying to study machine learning in a hands-on way. I set myself the exercise of creating a neural network that solves "Japanese crosswords" for fixed-size images (128*128).
A very simple example (4*4) demonstrates the concept: a black & white picture is encoded by top and left matrices. Each number in a matrix is the length of a contiguous run of black cells. It is easy to prove that the left and top matrices have dimensions of at most (N*(N/2)) and ((N/2)*N) respectively.
I have a Python generator that creates random b&w images and the 2 reduced matrices. The top and left matrices are fed as input (the left one is transposed to match the top) and the b&w image is the expected output. The input is treated as 3-dimensional (128 * 64 * 2), where the 2 corresponds to top and left respectively.
The following is my current topology, which tries to learn the function (128 * 64 * 2) -> (128, 128, 1):
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
interlaced_reduce (InputLaye [(None, 128, 64, 2)]      0
_________________________________________________________________
small_conv (Conv2D)          (None, 128, 64, 32)       288
_________________________________________________________________
leaky_re_lu (LeakyReLU)      (None, 128, 64, 32)       0
_________________________________________________________________
medium_conv (Conv2D)         (None, 128, 64, 64)       8256
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU)    (None, 128, 64, 64)       0
_________________________________________________________________
large_conv (Conv2D)          (None, 128, 64, 128)      32896
_________________________________________________________________
leaky_re_lu_2 (LeakyReLU)    (None, 128, 64, 128)      0
_________________________________________________________________
up_sampling2d (UpSampling2D) (None, 128, 128, 128)     0
_________________________________________________________________
dropout (Dropout)            (None, 128, 128, 128)     0
_________________________________________________________________
dense (Dense)                (None, 128, 128, 1)       129
=================================================================
Total params: 41,569
Trainable params: 41,569
Non-trainable params: 0
After training on 50 images I got the following statistics (please note, I tried normalizing the input matrices to [0,1] without any success; the statistics below are for the non-normalized case):
...
Epoch 50/50 2/2 [==============================] - 1s 687ms/step - loss: 18427.2871 - mae: 124.9277
Then prediction produces following:
On the left you can see the expected random image, and on the right the result of the prediction. For the prediction I intentionally use a grey-scale image to understand how close my result is to the target. But as you can see, the prediction is far from the expected output and is close to the raw form of the top/left reduced matrices.
So my questions:
1) What layers am I missing?
2) What should be improved in the existing topology?
p.s. this is a cross-post from Cross Validated Stack Exchange, because nobody even viewed the question on that site
It's hard to say what model would work best without actually training and testing it, but from the results you've gotten so far, here are a few options you could try.
Try adding a fully connected hidden layer
From the model you posted, it seems that you have a few convolution layers, followed by an up-sampling and a dropout layer, and finally a single dense layer for your output nodes. Adding additional dense layers (with, e.g., 128 nodes, more or fewer) before your final output layer might help. While the multiple convolution layers help the neural net build up a sort of hierarchical understanding of the image, the hypothesis class might not be complex enough; adding one or more dense layers might help with that.
Try using a multilayer perceptron
Convolution layers are often used to process images because they help build up a hierarchical understanding of the image that is somewhat scale/shift/rotation invariant. However, considering the problem that you're solving, a global understanding of the input might be more beneficial than identifying shift-invariant features.
As such, one possible option would be to remove the convolution layers and to use a multilayer perceptron (MLP).
Let us think of the input as two matrices of numbers, and the output is a matrix of 1s and 0s that correspond to 'black' and 'white'. You could then try a model with the following layers:
A Flatten layer that takes in your two reduced matrices as inputs and flattens them
A hidden dense layer, maybe with something like 128 nodes and relu activation. You should experiment with the number of layers, nodes, and activations.
An output dense layer with 16384 (128x128) nodes. Applying a sigmoid activation to this layer (so each node outputs an independent probability) could help the optimiser during the training process. Then, when creating your final image, set values < 0.5 to 0 and values >= 0.5 to 1, and reshape the vector into a square image.
Of course, there are no guarantees that an MLP will work well, but it often does, especially when given sufficient amounts of data (perhaps thousands of training examples or more).
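As a sketch of the final thresholding/reshaping step described above (the helper name vector_to_image is hypothetical, not a library function):

```python
import numpy as np

# Turn a length-16384 vector of per-pixel probabilities into a
# 128x128 black & white image by thresholding at 0.5.
def vector_to_image(pred, side=128):
    pixels = (np.asarray(pred) >= 0.5).astype(np.uint8)  # 1s and 0s
    return pixels.reshape(side, side)

probs = np.random.rand(128 * 128)    # stand-in for a model prediction
print(vector_to_image(probs).shape)  # (128, 128)
```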
Try using a deterministic algorithm
Looking at the structure of this problem, it seems that it could be solved more appropriately with a deterministic algorithm, which falls more under the branch of traditional artificial intelligence than deep learning. This is another potential route to explore.
The model you built is a conventional one (as seen by the use of Conv2D). These layers are good at analyzing a value given its neighbors, making them very powerful for image classification and segmentation.
In your case, however, the result for a pixel depends on its whole line and column.
Neural networks seem to be unsuited to your problem, but if you want to continue, look into replacing the conv layers with Conv(1xN) and Conv(Nx1) layers. It will still be very hard to make it work.
The hard way: these puzzles are solved by a strongly recurrent process. At each step, the cells that can be determined are filled in with a zero or a one, and based on those, the next cells are filled in. So a recurrent neural network would make the most sense to me, where convolution is used to let the predictions for the neighbors influence the current prediction.
I'm trying to get an embedding of some data I have using Keras. I have a model consisting of only an embedding layer, and then I have my own custom loss function that uses the weights of the embedding layer to calculate a loss function, however, the network won't train at all, meaning the loss value is always the same.
My loss function looks like this
def custom_loss(a, b):
    layer = model.layers[0].get_weights()[0]
    distances = [distance(layer[i], layer[j]) for i in range(len(layer)-1) for j in range(i+1, len(layer))]
    loss = [-math.log10(sims[i]*distances[i]) for i in range(len(sims))]
    return K.sum(loss) + 0*K.sum(b)

model.compile(optimizer="rmsprop", loss=custom_loss)
I get the distances between the different n-dimensional embedding values for all pairs of points I pass to the embedding layer. I have the +0*K.sum(b) because otherwise I get a no-gradient exception. sims is a list of values that I multiply the distances with, and loss ends up being a list of floats.
I don't get any exceptions, however, the loss is just constant. What might be wrong?
I must say I'm not really familiar with creating custom loss functions like this in Keras, so what I've done might be totally wrong.
My training data consists of simply passing ones as input values, as, again, I'm only interested in getting the embedding with respect to my custom loss function. I know I'm basically abusing Keras in order to avoid writing my own optimization function.
Does my K.sum(loss) simply give no gradient? My intuition tells me that layer = model.layers[0].get_weights()[0] is simply not updated between batches, but I don't know if this is correct, or if it is, how to fix it. Is there some other way to access the weights that would work?
Network summary looks like this
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 400, 3) 1200
=================================================================
Total params: 1,200
Trainable params: 1,200
Non-trainable params: 0
The distance function used looks like this
def distance(a, b):
    powers = [(a[i]-b[i])**2 for i in range(len(a))]
    return math.sqrt(sum(powers))
If I freeze my base_model with trainable=False, I get strange numbers with trainable_weights.
Before freezing, my model has 162 trainable_weights. After freezing, the model only has 2. I added 2 layers on top of the pre-trained network. Does trainable_weights show me the layers to train? I find the number weird when I see 2,253,335 trainable params.
Trainable weights are the weights that will be learnt during the training process. If you set trainable=False then those weights are kept as they are and are not changed, because they are not learnt. The "strange numbers" you see arise because either you are using a pre-trained network whose weights have already been learnt, or you are using random initialization when defining the model. When using transfer learning with pre-trained models, a common practice is to freeze the weights of the base model (pre-trained) and only train the extra layers that you add at the end.
Trainable weights are the same as trainable parameters.
A trainable layer often has multiple trainable weights.
Let's view this example:
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, None, 501) 0
_________________________________________________________________
lstm_1 (LSTM) (None, None, 40) 86720
_________________________________________________________________
SoftDense (TimeDistributed) (None, None, 501) 20541
=================================================================
Total params: 107,261
Trainable params: 107,261
Non-trainable params: 0
__________________________
The first layer is just an input layer; it receives the data as-is, so it does not have any trainable weights.
The next layer has 542*4*40 = 86720 trainable weights.
40 due to its output dimension, 4 because an LSTM actually has 4 trainable layers (gates) inside it, and 542 for 501+40+1 (input dimension + hidden dimension + 1 for the bias).
The last layer has 41*501=20541 trainable weights
(40 from the hidden dimension of its input, the LSTM layer, +1 for bias, times 501 for its output).
Total trainable parameters are 107,261.
If I were to freeze the last layer I would have only 86,720 trainable weights.
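The arithmetic above can be collected into two small helper functions (the names are mine, not a Keras API):

```python
# LSTM: 4 gates, each with an input kernel (input_dim weights), a
# recurrent kernel (units weights) and a bias, per output unit.
def lstm_params(input_dim, units):
    return 4 * units * (input_dim + units + 1)

# Dense: one weight per input plus one bias, for each output unit.
def dense_params(input_dim, units):
    return units * (input_dim + 1)

print(lstm_params(501, 40))   # 86720
print(dense_params(40, 501))  # 20541
```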
Late to the party, but maybe this answer can be useful to others that might be googling this.
First, it is useful to distinguish between the quantity "Trainable params" one sees at the end of my_model.summary(), with the output of len(my_model.trainable_weights).
Maybe an example helps: let's say I have a model with VGG16 architecture.
my_model = keras.applications.vgg16.VGG16(
    weights="imagenet",
    include_top=False
)
# take a look at model summary
my_model.summary()
You will see that there are 13 conv layers with trainable parameters; pooling and input layers do not have trainable parameters, i.e. no learning is needed for them. In each of those 13 layers, on the other hand, there are "weights" and "biases" that need to be learned; think of them as variables.
What len(my_model.trainable_weights) gives you is the number of trainable layers (if you will) multiplied by 2 (one weight tensor plus one bias tensor each).
In this case, if you print len(my_model.trainable_weights), you will get 26 as the answer. We can think of 26 as the number of variables for the optimization, variables that can of course differ in shape.
Now to connect trainable_weights to the total number of trainable parameters, one can try:
trainable_params = 0
for layer in my_model.trainable_weights:
    trainable_params += layer.numpy().size
print(f"{trainable_params = }")
You will get 14714688, which matches the "Trainable params" number you see at the end of my_model.summary().
Currently I think I'm experiencing a systematic offset in an LSTM model between the predictions and the ground-truth values. What's the best approach to continue from here?
The model architecture, along with the predictions & ground truth values are shown below. This is a regression problem where the historical data of the target plus 5 other correlated features X are used to predict the target y. Currently the input sequence n_input is of length 256, where the output sequence n_out is one. Simplified, the previous 256 points are used to predict the next target value.
X is normalized. The mean squared error is used as the loss function. Adam with a cosine annealing learning rate is used as the optimizer (min_lr=1e-7, max_lr=6e-2).
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
cu_dnnlstm_8 (CuDNNLSTM) (None, 256) 270336
_________________________________________________________________
batch_normalization_11 (Batc (None, 256) 1024
_________________________________________________________________
leaky_re_lu_11 (LeakyReLU) (None, 256) 0
_________________________________________________________________
dropout_11 (Dropout) (None, 256) 0
_________________________________________________________________
dense_11 (Dense) (None, 1) 257
=================================================================
Total params: 271,617
Trainable params: 271,105
Non-trainable params: 512
_________________________________________________________________
Increasing the node size in the LSTM layer, adding more LSTM layers (with return_sequences=True) or adding dense layers after the LSTM layer(s) only seems to lower the accuracy. Any advice would be appreciated.
Additional information on the image: the y-axis is the value, and the x-axis is time (in days). NaNs have been replaced with zero, because the ground-truth value in this case can never reach zero; that's why the odd outliers are in the data.
Edit:
I made some changes to the model, which increased accuracy. The architecture is the same, but the features used have changed: currently only the historical data of the target sequence itself is used as a feature. Along with this, n_input was changed to 128, Adam was swapped for SGD, mean squared error for mean absolute error, and finally the NaNs have been interpolated instead of being replaced with 0.
One step ahead predictions on the validation set look fine:
However, the offset on the validation set remains:
It might be worth noting that this offset also appears on the train set for x < ~430:
It looks like your model is overfitting and is simply always returning the value from the last timestep as its prediction. Your dataset is probably too small for a model with this number of parameters to converge. You'll need to resort to techniques that combat overfitting: aggressive dropout, adding more data, or trying simpler, less over-parameterized methods.
This phenomenon (LSTMs returning a shifted version of the input) has been a recurring theme in many stackoverflow questions. The answers there might contain some useful information:
LSTM Sequence Prediction in Keras just outputs last step in the input
LSTM model just repeats the past in forecasting time series
LSTM NN produces “shifted” forecast (low quality result)
Keras network producing inverse predictions
Stock price predictions of keras multilayer LSTM model converge to a constant value
Keras LSTM predicted timeseries squashed and shifted
LSTM Time series shifted predictions on stock market close price
Interesting results from LSTM RNN : lagged results for train and validation data
Finally, be aware that, depending on the nature of your dataset, there simply might be no pattern to discover in your data at all. You see this a lot with people trying to predict the stock market with LSTMs (there is even a question on stackoverflow about how to predict lottery numbers).
The answer is much simpler than we thought...
I saw multiple people saying this is due to overfitting and data size; some others stated it is due to rescaling.
After several tries, I found the solution: detrend the data before feeding it to the RNN.
For example, you can do a simple degree-2 polynomial fit of the data, which gives you a polynomial formula. You can then subtract the corresponding fitted value from each data point. This yields a new dataset which we can feed to the LSTM, and after prediction we can simply add the trend back to the result; the results should then look better.
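A minimal sketch of that idea with NumPy (detrend here is a hypothetical helper, not a library call):

```python
import numpy as np

# Fit a degree-2 polynomial to the series, subtract the fitted trend
# before training, and add it back to the predictions afterwards.
def detrend(y, degree=2):
    x = np.arange(len(y))
    coeffs = np.polyfit(x, y, degree)
    trend = np.polyval(coeffs, x)
    return y - trend, trend

y = 0.5 * np.arange(50) ** 2 + 3.0            # a purely quadratic "series"
residual, trend = detrend(y)
print(np.allclose(residual, 0.0, atol=1e-6))  # the quadratic trend is removed
```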
I have a basic neural network created with Keras. I train the network successfully with vectors of data and corresponding output data that is a vector with two elements, representing a coordinate (x, y). So in goes an array, out comes an array.
The problem is that I am unable to use training data where a single input vector should correspond to many coordinates. Effectively, I want a vector of coordinates as output, without prior knowledge of the number of coordinates.
The network is created by:
model = Sequential()
model.add(Dense(20, input_shape=(196608,)))
model.add(Dense(10))
model.add(Dense(2))
and model summary shows the output dimensions for each layer
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 20) 3932180
_________________________________________________________________
dense_2 (Dense) (None, 10) 210
_________________________________________________________________
dense_3 (Dense) (None, 2) 22
=================================================================
I realize the network structure only allows a length-2 vector as output. Dense layers also do not accept None as their size. How do I modify the network so that it can train on, and output, a vector of vectors (a list of coordinates)?
A recurrent neural network (RNN) would be much more appropriate; these models are typically called seq2seq, that is, sequence to sequence. Recurrent nets use layers like LSTM and GRU, and can take variable-length sequences as both input and output. Just look at things like machine translation done with RNNs.
This can be done directly with keras, and there are many examples lying around the internet, for example this one.
An RNN is not what you want for predicting coordinates. Instead, I would recommend using a model that predicts coordinates and associated confidences. You would then have, say, 100 coordinate predictions for every forward pass through your model. Each of those predictions would have an associated confidence that determines whether it counts or not: only predictions above a certain confidence threshold are kept. That confidence threshold is what allows the model to choose how many points it wants to use each time (with a maximum set by the number of outputs, which in this example is 100).
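A sketch of the selection step under those assumptions (the array shapes and helper name are illustrative, not from any library):

```python
import numpy as np

# Keep only the coordinate predictions whose confidence clears the
# threshold; the number of surviving points varies per input.
def select_coordinates(coords, confidences, threshold=0.5):
    mask = confidences >= threshold
    return coords[mask]

coords = np.array([[10, 20], [30, 40], [50, 60]])  # N fixed output slots
confs = np.array([0.9, 0.2, 0.7])                  # one confidence per slot
print(select_coordinates(coords, confs))           # keeps the 1st and 3rd points
```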
R-CNN is a family of models that does just that. Here is the first Keras implementation I found on GitHub: https://github.com/yhenon/keras-frcnn.