In a multilayer CNN, which inputs does the 2nd layer take? - python

I'm uncertain about the following question; everything I've found on the internet seemed vague and fuzzy.
Consider this CNN:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
# 1st conv layer
model.add(Conv2D(10, (4,4), activation="relu", input_shape=(28,28,1)))
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
# 2nd conv layer
model.add(Conv2D(20, (4,4), activation="relu"))
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
Now, when the input image is passed to the first conv layer, we get 10 feature maps, each of shape (25, 25, 1). Hence, we end up with the shape (25, 25, 1, 10), correct? Applying the pooling leads us to (12, 12, 1, 10).
My question arises when it comes to the second conv layer. A conv layer always takes one picture/matrix as input, like the first layer took (28, 28, 1), which is one picture.
But conv layer 1 gave us 10 pictures (or feature maps). So, which of these 10 is used as the input? I'd assume every single one.
Suppose that is correct: then we have the input shape (12, 12, 1) for the second conv layer. Applying it results in (9, 9, 1), and the pooling layer then gives (4, 4, 1). Since we have 20 filters specified, we end up with (4, 4, 1, 20).
But that's only for one of the 10 possible inputs! Therefore, if we apply all of them, we'd have the final shape (4, 4, 1, 20, 10). Correct?
Edit:
The weight calculation makes me think it's correct because it fits.
On the other hand, the flatten layer only has 320 = 4*4*20 neurons, not 3200 = 4*4*20*10 like I would expect. So that makes me think it's not correct.
This is the output of the model summary:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_13 (Conv2D) (None, 25, 25, 10) 170
_________________________________________________________________
max_pooling2d_10 (MaxPooling (None, 12, 12, 10) 0
_________________________________________________________________
conv2d_14 (Conv2D) (None, 9, 9, 20) 3220
_________________________________________________________________
max_pooling2d_11 (MaxPooling (None, 4, 4, 20) 0
_________________________________________________________________
flatten_6 (Flatten) (None, 320) 0
_________________________________________________________________
dense_12 (Dense) (None, 128) 41088
_________________________________________________________________
dense_13 (Dense) (None, 10) 1290
=================================================================
Total params: 45,768
Trainable params: 45,768
Non-trainable params: 0
And if the initial input had been an RGB picture (e.g. (28, 28, 3)), would we end up with (4, 4, 3, 20, 10)?

Your confusion comes from the fact that even though you provide 2 numbers to the filter (4 for width and 4 for height in your example), the filter is actually 3D. This 3rd dimension represents the number of input channels.
Let's go through the first convolution layer: Conv2D(10, (4,4), activation="relu", input_shape=(28,28,1)).
We have an input shape of (28, 28, 1), and filter shape of (4, 4, 1). Even though you specified the shape to be (4, 4) in that line above, remember that the third dimension will be the number of input channels, which for this first convolution layer, is 1. If you were feeding RGB images into your model, for example, both the input shape and filter would have the third dimension be 3 instead of 1.
Our output shape, given our input shape and filter shape, should be (input_shape[0] - filter_shape[0] + 1, input_shape[1] - filter_shape[1] + 1, output_channels) (assuming the stride is 1, which it is in your model). Substituting values, we get (28 - 4 + 1, 28 - 4 + 1, 10), or (25, 25, 10). This confirms what we see in model.summary().
As for how we go from input to output under the hood, first we need to move the filter across the input, both horizontally and vertically. An input of shape (28, 28, 1), with a filter of shape (4, 4, 1), would yield a chunked input of shape (25, 25, 4, 4, 1). In other words, we have 25 x 25 "views" of our original image, with each of these views having shape (4, 4, 1) representing the pixel values we see in the image.
We have 10 (4, 4, 1) filters (10 being the number of output channels). Let's take the first of these filters. Let's also take the first "view" of our original image (remember, we have 25 x 25 in total). We multiply the filter by this "view" element-wise, which works out nicely because both the filter and the "view" have the same shape (4, 4, 1). This multiplication gives us an output "view" of shape (4, 4, 1). We then add all these values (4 x 4 x 1 = 16 values total) to give our "signal". A larger sum of these values means a stronger detection of whatever the filter is looking for. I've overlooked some things, like bias, but that doesn't change the dimensionality of anything.
The above walk through only dealt with the first filter and first "view" of our image, and resulted in a single scalar "signal". We have 10 filters, and 25 x 25 views, yielding a final output shape of (25, 25, 10) (as expected).
Note how the entire process operated in 3D space. Both the filters and views were 3D, in this case with a last dimension of 1. It is able to operate in 3D space because the element-wise multiplication will work out, as long as both the filter and "view" have the same 3rd dimension (1 in this case).
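Here is a minimal numpy sketch of that process (illustrative names, bias omitted, stride 1), just to make the "views" and the per-filter sum concrete:
import numpy as np
image = np.random.randn(28, 28, 1)       # one grayscale input image
filters = np.random.randn(10, 4, 4, 1)   # 10 filters, each of shape (4, 4, 1)
output = np.zeros((25, 25, 10))
for f in range(10):                       # one output channel per filter
    for i in range(25):
        for j in range(25):
            view = image[i:i+4, j:j+4, :]                # a (4, 4, 1) "view"
            output[i, j, f] = np.sum(view * filters[f])  # element-wise multiply, then sum
print(output.shape)  # (25, 25, 10)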
If we went through the second convolution layer (Conv2D(20, (4,4), activation="relu")), that last dimension of both the filter and "view" would be 10 instead of 1. This is because the output channels of the previous convolution layer are the same as the input channels of the current one.
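To tie this back to model.summary() above, here is the parameter and shape arithmetic spelled out in plain Python (it also explains the 320 from the edit in the question):
params_conv1 = 4 * 4 * 1 * 10 + 10    # 170: 10 filters of shape (4, 4, 1), plus biases
params_conv2 = 4 * 4 * 10 * 20 + 20   # 3220: 20 filters of shape (4, 4, 10), plus biases
flatten_units = 4 * 4 * 20            # 320, not 3200: the 10 channels were already
                                      # summed away inside each second-layer filter
params_dense1 = 320 * 128 + 128       # 41088
params_dense2 = 128 * 10 + 10         # 1290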

Related

ValueError: Negative dimension size caused by subtracting 12 from 1 for 'max_pooling2d_1/MaxPool' (op: 'MaxPool') with input shapes: [?,1,1,64]

I'm using Keras with TensorFlow as backend, here is my code:
from keras import backend as K
from keras import layers
from keras.models import Sequential
from keras.layers import Activation, Flatten

K.clear_session()
model=Sequential()
model.add(layers.Conv2D(32,(3,3),padding='same',input_shape=(49,43,1),activation='relu'))
model.add(layers.Conv2D(64,(3,3),activation='relu'))
model.add(layers.MaxPool2D(pool_size=(8,8)))
model.add(layers.Conv2D(32,(3,3),padding='same',activation='relu'))
model.add(layers.Conv2D(64,(3,3),activation='relu'))
model.add(layers.MaxPool2D(pool_size=(8,8)))
model.add(Activation('relu'))
model.add(Flatten())
model.add(layers.Dense(256,activation='relu'))
model.add(layers.Dense(4,activation='softmax'))
When I run your code, I get a slightly different error (different numbers):
ValueError: Negative dimension size caused by subtracting 8 from 3 for '{{node max_pooling2d_1/MaxPool}} = MaxPool[T=DT_FLOAT, data_format="NHWC", ksize=[1, 8, 8, 1], padding="VALID", strides=[1, 8, 8, 1]]' with input shapes: [?,3,3,64].
Your input is too small to go through all the layers. I added the output of each layer in the code below. When it reaches the second MaxPool layer, it's already too small to be divided by 8.
model=Sequential()
model.add(layers.Conv2D(32,(3,3),padding='same',input_shape=(49,43,1),activation='relu'))
# Output: (49, 43, 32)
model.add(layers.Conv2D(64,(3,3),activation='relu'))
# Output: (47, 41, 64)
model.add(layers.MaxPool2D(pool_size=(8,8)))
# Output: (5, 5, 64)
model.add(layers.Conv2D(32,(3,3),padding='same',activation='relu'))
# Output: (5, 5, 32)
model.add(layers.Conv2D(64,(3,3),activation='relu'))
# Output: (3, 3, 64)
model.add(layers.MaxPool2D(pool_size=(8,8)))
# Output: How can I divide 3 by 8???
model.add(Activation('relu'))
model.add(Flatten())
model.add(layers.Dense(256,activation='relu'))
model.add(layers.Dense(4,activation='softmax'))
There are a lot of questions and blog posts about computing a layer's output shape.
So either increase your input size or remove some layers from your model.
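For reference, a small sketch of the output-size arithmetic used in the comments above (assuming the Keras defaults in your model: stride 1 for the conv layers, pool-sized strides and 'valid' padding for the pooling layers):
def conv_output(size, kernel, padding):
    # 'same' keeps the spatial size (with stride 1); 'valid' shrinks it by kernel - 1
    return size if padding == 'same' else size - kernel + 1
def pool_output(size, pool):
    # MaxPool2D with 'valid' padding floors the division
    return size // pool
h, w = 49, 43
h, w = conv_output(h, 3, 'same'), conv_output(w, 3, 'same')    # 49, 43
h, w = conv_output(h, 3, 'valid'), conv_output(w, 3, 'valid')  # 47, 41
h, w = pool_output(h, 8), pool_output(w, 8)                    # 5, 5
h, w = conv_output(h, 3, 'same'), conv_output(w, 3, 'same')    # 5, 5
h, w = conv_output(h, 3, 'valid'), conv_output(w, 3, 'valid')  # 3, 3
# the next MaxPool2D(pool_size=(8, 8)) cannot fit an 8x8 window into a 3x3 map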

CNN Back-propagation on a 3d image

So, I am trying to write my own code for CNN using CIFAR-10 dataset. I have completed the feed forward algorithm and started with the back-propagation. I had no problem back propagating through the fully connected layer as well as going through unpooling. I am now stuck at back propagating through a convolution layer. My image input size is (32, 32, 3) rgb input.
This is my network so far.
Conv Layer + Relu -> Maxpool -> Conv Layer + Relu -> Maxpool -> Dense + Relu -> Dense + Softmax
While propagating forward, I took a (3, 3, 3) kernel and convolved with the image. I used 12 such filters, which gave me (32, 32, 12) tensor as an output. Maxpool made the dimensions (16, 16, 12) which was again convolved with (3, 3, 12) for 8 different filters to get (16, 16, 8).
While back-propagating, I have reached this part: I have a (16, 16, 8) tensor, using which I have to update a filter matrix of (3, 3, 12) for 8 different filters. I also have to propagate my error to calculate the change in the input of the previous layer, which has dimensions (16, 16, 12).
I understood how this works with a 1-channel image, but I'm confused when three channels are concerned.
I also know that back-propagating through a convolution layer is also a convolution, but I'm stuck here. How is it done? Please help.
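For concreteness, a minimal numpy sketch of the two gradients at this layer (assuming stride 1, 'same' padding and no activation; all names are made up):
import numpy as np
H, W, C_in, C_out, K = 16, 16, 12, 8, 3
x = np.random.randn(H, W, C_in)             # input to the conv layer (16, 16, 12)
kern = np.random.randn(K, K, C_in, C_out)   # 8 filters, each of shape (3, 3, 12)
d_out = np.random.randn(H, W, C_out)        # upstream gradient (16, 16, 8)
pad = K // 2
x_pad = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
d_out_pad = np.pad(d_out, ((pad, pad), (pad, pad), (0, 0)))
# Filter gradient: correlate each input channel with each output-gradient map;
# every (3, 3) slice of a filter gets its own gradient, channel by channel.
d_kern = np.zeros_like(kern)
for f in range(C_out):
    for c in range(C_in):
        for di in range(K):
            for dj in range(K):
                d_kern[di, dj, c, f] = np.sum(x_pad[di:di+H, dj:dj+W, c] * d_out[:, :, f])
# Input gradient: a "full" convolution of the upstream gradient with the
# spatially flipped filters, summed over the output channels.
d_x = np.zeros_like(x)
kern_flip = kern[::-1, ::-1, :, :]
for c in range(C_in):
    for f in range(C_out):
        for di in range(K):
            for dj in range(K):
                d_x[:, :, c] += kern_flip[di, dj, c, f] * d_out_pad[di:di+H, dj:dj+W, f]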

how to configure data labels in a numpy array for training a Keras model?

I'm trying to implement Keras for the first time (so sorry for the dumb question) as part of a wider project to make an AI that learns to play Connect 4. As part of this, I pass a NN a 6*7 grid and it outputs an array of 7 values giving the probabilities to pick each column in the game. Here is the output of the model.summary() method for a bit more detail:
______________________________________________________________
Layer (type) Output Shape Param #
=================================================================
flatten (Flatten) (None, 42) 0
_________________________________________________________________
dense (Dense) (None, 20) 860
_________________________________________________________________
dense_1 (Dense) (None, 20) 420
_________________________________________________________________
dense_2 (Dense) (None, 7) 147
=================================================================
Total params: 1,427
Trainable params: 1,427
Non-trainable params: 0
_________________________________________________________________
The model will give (at the moment random) predictions when I pass it numpy arrays of shape (1, 6, 7). However, when I try to train the model with an array of shape (221, 6, 7) for the data and an array of shape (221, 7) for the labels, I get this error:
ValueError: Error when checking target: expected dense_2 to have shape (1,) but got array with shape (7,)
This is the code I use to train the model (which outputs (221, 6, 7) and (221, 7)):
board_tensor = np.array(full_board_list)
print(board_tensor.shape)
label_tensor = np.array(full_label_list)
print(label_tensor.shape)
self.model.fit(board_tensor, label_tensor)
this is the code I use to define the model:
self.model = keras.Sequential([
    keras.layers.Flatten(input_shape=(6, 7)),
    keras.layers.Dense(20, activation=tf.nn.relu),
    keras.layers.Dense(20, activation=tf.nn.relu),
    keras.layers.Dense(7, activation=tf.nn.softmax)])
self.model.compile(optimizer=tf.train.AdamOptimizer(),
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])
(the model is part of an AI object so that it could be compared to other types of AI objects)
This is the code which successfully predicts a batch of size 1, generated from a two-dimensional list representing the board (it outputs (1, 6, 7) and (1, 7)):
input_tensor = np.array(board.board)
input_tensor = np.expand_dims(input_tensor, 0)
print(input_tensor.shape)
probability_distribution = self.model.predict(input_tensor)
print(probability_distribution.shape)
I realise that the error is probably due to a lack of understanding on my part as to what the methods in Keras expect to be given. So, as a little side-note, does anyone have any good, thorough learning resources that really get you to understand what each method is doing (i.e. not just telling you which code to type in to make an image recogniser) and that would be understandable to people new to Keras and TensorFlow like me?
thanks a lot in advance!
You are using the sparse_categorical_crossentropy loss, which takes integer labels (not one-hot encoded ones), while your labels are one-hot encoded. This is why you get an error.
The easiest way to fix it is to change loss to categorical_crossentropy.
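A minimal sketch of both options, reusing the variables from the question (board_tensor of shape (221, 6, 7) and the one-hot label_tensor of shape (221, 7)):
import numpy as np
# Option 1: keep the one-hot labels and switch the loss.
self.model.compile(optimizer=tf.train.AdamOptimizer(),
                   loss='categorical_crossentropy',
                   metrics=['accuracy'])
self.model.fit(board_tensor, label_tensor)
# Option 2: keep sparse_categorical_crossentropy and convert the one-hot
# labels to integer class indices instead.
integer_labels = np.argmax(label_tensor, axis=1)   # shape (221,)
self.model.fit(board_tensor, integer_labels)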

Stacking convolutional network and recurrent layer

I'm working on a classifier for video sequences. It should take several video frames on input and output a label, either 0 or 1. So, it is a many-to-one network.
I already have a classifier for single frames. This classifier performs several convolutions with Conv2D, then applies GlobalAveragePooling2D. This results in a 1D vector of length 64. The original per-frame classifier then has a Dense layer with softmax activation.
Now I would like to extend this classifier to work with sequences. Ideally, sequences should be of varying length, but for now I fix the length to 4.
To extend my classifier, I'm going to replace the Dense layer with an LSTM layer with 1 unit. So, my goal is to have the LSTM layer take several 1D vectors of length 64, one by one, and output a label.
Schematically, what I have now:
input(99, 99, 3) - [convolutions] - features(1, 64) - [Dense] - [softmax] - label(1, 2)
Desired architecture:
4x { input(99, 99, 3) - [convolutions] - features(1, 64) } - [LSTM] - label(1, 2)
I cannot figure out how to do it with Keras.
Here is my code for the convolutions:
from keras.models import Sequential
from keras.layers import Conv2D, BatchNormalization, GlobalAveragePooling2D, \
    LSTM, TimeDistributed

IMAGE_WIDTH = 99
IMAGE_HEIGHT = 99
IMAGE_CHANNELS = 3

convolutional_layers = Sequential([
    Conv2D(input_shape=(IMAGE_WIDTH, IMAGE_HEIGHT, IMAGE_CHANNELS),
           filters=6, kernel_size=(3, 3), strides=(2, 2), activation='relu',
           name='conv1'),
    BatchNormalization(),
    Conv2D(filters=64, kernel_size=(1, 1), strides=(1, 1), activation='relu',
           name='conv5_pixel'),
    BatchNormalization(),
    GlobalAveragePooling2D(name='avg_pool6'),
])
Here is the summary:
In [24]: convolutional_layers.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1 (Conv2D) (None, 49, 49, 6) 168
_________________________________________________________________
batch_normalization_3 (Batch (None, 49, 49, 6) 24
_________________________________________________________________
conv5_pixel (Conv2D) (None, 49, 49, 64) 448
_________________________________________________________________
batch_normalization_4 (Batch (None, 49, 49, 64) 256
_________________________________________________________________
avg_pool6 (GlobalAveragePool (None, 64) 0
=================================================================
Total params: 896
Trainable params: 756
Non-trainable params: 140
Now I want a recurrent layer to process sequences of these 64-dimensional vectors and output a label for each sequence.
I've read in the manuals that the TimeDistributed layer applies its inner layer to every time slice of the input data.
I continue my code:
FRAME_NUMBER = 4
td = TimeDistributed(convolutional_layers, input_shape=(FRAME_NUMBER, 64))
model = Sequential([
    td,
    LSTM(units=1)
])
The result is the exception IndexError: list index out of range.
Same exception for
td = TimeDistributed(convolutional_layers, input_shape=(None, FRAME_NUMBER, 64))
What am I doing wrong?
Expanding on the comments into an answer: the TimeDistributed layer applies the given layer to every time step of the input. Hence, your TimeDistributed wraps the convolutional layers and applies them to every frame, so its input shape must be (FRAME_NUMBER, W, H, C), not (FRAME_NUMBER, 64). After applying the convolutions to every image, you get back (FRAME_NUMBER, 64), which are the features for every frame.
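A minimal sketch of that wiring, reusing convolutional_layers, FRAME_NUMBER and the IMAGE_* constants from the question (the LSTM size and the final Dense layer are illustrative assumptions, not part of the original code):
from keras.layers import Dense   # not imported in the question's snippet
model = Sequential([
    # TimeDistributed gets the per-frame image shape, not the feature shape
    TimeDistributed(convolutional_layers,
                    input_shape=(FRAME_NUMBER, IMAGE_WIDTH, IMAGE_HEIGHT,
                                 IMAGE_CHANNELS)),
    # output here is (FRAME_NUMBER, 64): one 64-dimensional feature vector per frame
    LSTM(units=16),                    # illustrative hidden size
    Dense(2, activation='softmax')     # one label per sequence
])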

What do Keras convolution layers do with color channels?

Below is a piece of example code from the Keras documentation. It looks like the first convolution accepts a 256x256 image with 3 color channels. It has 64 output filters (I think these are the same as the feature maps I have read about elsewhere; can someone confirm this for me?). What confuses me is that the output size is (None, 64, 256, 256). I would expect it to be (None, 64 * 3, 256, 256), since it would need to do convolutions for each of the color channels. What I am wondering is how Keras handles the color channels. Do the values get averaged together (converted to grey scale) before passing through the convolution?
# apply a 3x3 convolution with 64 output filters on a 256x256 image:
model = Sequential()
model.add(Convolution2D(64, 3, 3, border_mode='same', input_shape=(3, 256, 256)))
# now model.output_shape == (None, 64, 256, 256)
# add a 3x3 convolution on top, with 32 output filters:
model.add(Convolution2D(32, 3, 3, border_mode='same'))
# now model.output_shape == (None, 32, 256, 256)
A filter of size 3*3 with 3 input channels consists of 3*3*3 parameters, so the weights of the convolution kernel for each channel are different.
It sums up the convolution results of each channel (probably together with a bias term) to get the output, so the output shape is independent of the number of input channels: for example, (None, 64, 256, 256) rather than (None, 64 * 3, 256, 256).
I'm not 100% sure, but I think a feature map refers to the output of applying one such filter to the input (for example, a 256*256 matrix).
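To make the channel handling concrete, here is a small numpy sketch of one filter applied to a 3-channel input (illustrative sizes; 'same' padding and stride 1 assumed):
import numpy as np
h, w, c_in = 8, 8, 3
image = np.random.randn(h, w, c_in)
kernel = np.random.randn(3, 3, c_in)    # a different 3x3 weight set per channel
bias = 0.1
padded = np.pad(image, ((1, 1), (1, 1), (0, 0)))
feature_map = np.zeros((h, w))          # ONE map per filter, not one per channel
for i in range(h):
    for j in range(w):
        # multiply each channel by its own 3x3 kernel, then sum over all
        # 3*3*3 values: the channels collapse into a single number
        feature_map[i, j] = np.sum(padded[i:i+3, j:j+3, :] * kernel) + bias
print(feature_map.shape)   # (8, 8); with 64 such filters you get 64 feature maps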
