Recently I have been trying to understand TensorFlow's tf.nn.conv2d_transpose, but I have a hard time understanding its input parameters. It's defined as:
tf.nn.conv2d_transpose(value, filter, output_shape, strides, padding='SAME')
For example, let's say I have an image of size [batch_size, 7, 7, 128] and want to transform it to [batch_size, 14, 14, 64]. Then output_shape=[batch_size, 14, 14, 64] and strides=[2,2], but I can't figure out how to get the shape of the filter. Any thoughts?
Furthermore, how does padding="SAME" work for conv2d_transpose? Is it applied to the output image or the input?
For the first question on filter shapes, I'd use the object-oriented version tf.layers.Conv2DTranspose and look at its kernel property to figure out the filter shape:
>>> import tensorflow as tf
>>> l = tf.layers.Conv2DTranspose(filters=64, kernel_size=1, padding='SAME', strides=[2, 2])
>>> l(tf.ones([12, 7, 7, 128]))
<tf.Tensor 'conv2d_transpose/BiasAdd:0' shape=(12, 14, 14, 64) dtype=float32>
>>> l.kernel
<tf.Variable 'conv2d_transpose/kernel:0' shape=(1, 1, 64, 128) dtype=float32_ref>
>>>
On the second question about padding: conv2d_transpose computes the gradient of conv2d. Since conv2d pads its inputs, conv2d_transpose needs to pad its output to fit that gradient.
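If you want to call tf.nn.conv2d_transpose directly, the filter argument has shape [height, width, output_channels, input_channels] (note the last two dimensions are swapped relative to a forward conv2d kernel). Here is a minimal sketch in the TF 1.x graph API; the 3x3 kernel size and batch size of 12 are arbitrary choices for illustration:
import tensorflow as tf

batch_size = 12                                  # arbitrary example value
value = tf.ones([batch_size, 7, 7, 128])
# filter shape: [height, width, output_channels, input_channels]
kernel = tf.get_variable('kernel', shape=[3, 3, 64, 128])

out = tf.nn.conv2d_transpose(value, kernel,
                             output_shape=[batch_size, 14, 14, 64],
                             strides=[1, 2, 2, 1],  # [1, stride_h, stride_w, 1] for NHWC
                             padding='SAME')
print(out)  # shape=(12, 14, 14, 64)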
I currently have a tensor of torch.Size([1, 3, 256, 224]) but I need it to be input shape [32, 3, 256, 224]. I am capturing data in real-time so dataloader doesn't seem to be a good option. Is there any easy way to take 32 of size torch.Size([1, 3, 256, 224]) and combine them to create 1 tensor of size [32, 3, 256, 224]?
You are probably using a JIT model, and the batch size must be exactly the one the model was trained on.
import torch

t = torch.rand(1, 3, 256, 224)
t.size()   # torch.Size([1, 3, 256, 224])
t2 = t.expand(32, -1, -1, -1)
t2.size()  # torch.Size([32, 3, 256, 224])
Expanding a tensor does not allocate new memory; it only creates a new view on the existing tensor, which gives you what you need. Only the tensor's strides change.
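A quick way to convince yourself that expand only creates a view (a minimal check; the stride values assume the contiguous layout above):
import torch

t = torch.rand(1, 3, 256, 224)
t2 = t.expand(32, -1, -1, -1)

print(t2.data_ptr() == t.data_ptr())  # True: both tensors share the same storage
print(t2.stride())                    # (0, 57344, 224, 1): the new batch dim has stride 0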
I'm uncertain about the following question; everything I've found on the internet seemed vague and fuzzy.
Consider this CNN:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

num_classes = 10  # inferred from the model summary below

model = Sequential()
# 1st conv layer
model.add(Conv2D(10, (4,4), activation="relu", input_shape=(28,28,1)))
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
# 2nd conv layer
model.add(Conv2D(20, (4,4), activation="relu"))
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
Now, when the input image is passed to the first conv layer, we get 10 feature maps, each of shape (25, 25, 1). Hence, we end up with the shape (25, 25, 1, 10), correct? Applying the pooling leads us to (12, 12, 1, 10).
My question appears when it comes to the second conv layer. A conv layer always takes one picture/matrix as input. Like the first layer took (28, 28, 1), which is one picture.
But conv layer 1 gave us 10 pictures (or feature maps). So, which of these 10 is used as the input? I'd assume every single one.
Suppose that is correct: then we have the input shape (12, 12, 1) for the second conv layer. Applying it results in (9, 9, 1), and the pooling layer then gives (4, 4, 1). Since we have 20 features specified, we end up with (4, 4, 1, 20).
But that's only for one of the 10 possible inputs! Therefore, if we apply all of them, we'd have the final shape (4, 4, 1, 20, 10). Correct?
Edit:
The weight calculation makes me think it's correct because it fits.
On the other hand, the flatten layer only has 320 = 4*4*20 neurons, not 3200 = 4*4*20*10 as I would expect. So that would make me think it's not correct.
This is the output of the model summary:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_13 (Conv2D)           (None, 25, 25, 10)        170
_________________________________________________________________
max_pooling2d_10 (MaxPooling (None, 12, 12, 10)        0
_________________________________________________________________
conv2d_14 (Conv2D)           (None, 9, 9, 20)          3220
_________________________________________________________________
max_pooling2d_11 (MaxPooling (None, 4, 4, 20)          0
_________________________________________________________________
flatten_6 (Flatten)          (None, 320)               0
_________________________________________________________________
dense_12 (Dense)             (None, 128)               41088
_________________________________________________________________
dense_13 (Dense)             (None, 10)                1290
=================================================================
Total params: 45,768
Trainable params: 45,768
Non-trainable params: 0
And if the initial input had been an RGB picture (e.g. (28, 28, 3)), would we end up with (4, 4, 3, 20, 10)?
Your confusion comes from the fact that even though you provide 2 numbers to the filter (4 for width and 4 for height in your example), the filter is actually 3D. This 3rd dimension represents the number of input channels.
Let's go through the first convolution layer: Conv2D(10, (4,4), activation="relu", input_shape=(28,28,1)).
We have an input shape of (28, 28, 1), and filter shape of (4, 4, 1). Even though you specified the shape to be (4, 4) in that line above, remember that the third dimension will be the number of input channels, which for this first convolution layer, is 1. If you were feeding RGB images into your model, for example, both the input shape and filter would have the third dimension be 3 instead of 1.
Our output shape, given our input shape and filter shape, should be (input_shape[0] - filter_shape[0] + 1, input_shape[1] - filter_shape[1] + 1, output_channels) (assuming the stride is 1, which it is in your model). Substituting values, we get (28 - 4 + 1, 28 - 4 + 1, 10), or (25, 25, 10). This confirms what we see in model.summary().
As for how we go from input to output under the hood, first we need to move the filter across the input, both horizontally and vertically. An input of shape (28, 28, 1) with a filter of shape (4, 4, 1) would yield a chunked input of shape (25, 25, 4, 4, 1). In other words, we have 25 x 25 "views" of our original image, with each of these views having shape (4, 4, 1), representing the pixel values we see in the image.
We have 10 filters of shape (4, 4, 1) (10 being the number of output channels). Let's take the first of these filters. Let's also take the first "view" of our original image (remember, we have 25 x 25 in total). We multiply the filter by this "view" element-wise, which works out because both the filter and the "view" have the same shape (4, 4, 1). This multiplication gives us an output "view" of shape (4, 4, 1). We then add up all these values (4 x 4 x 1 = 16 values total) to give our "signal". A larger sum means a stronger detection of whatever the filter is looking for. I've overlooked some things, like bias, but that doesn't change the dimensionality of anything.
The above walkthrough only dealt with the first filter and the first "view" of our image, and it resulted in a single scalar "signal". We have 10 filters and 25 x 25 views, yielding a final output shape of (25, 25, 10) (as expected).
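For concreteness, here is a minimal numpy sketch of that walkthrough (stride 1, no padding, bias and activation omitted; purely illustrative):
import numpy as np

image = np.random.rand(28, 28, 1)        # one grayscale input
filters = np.random.rand(10, 4, 4, 1)    # 10 filters, each of shape (4, 4, 1)

output = np.zeros((25, 25, 10))
for i in range(25):
    for j in range(25):
        view = image[i:i + 4, j:j + 4, :]               # one (4, 4, 1) "view"
        for k in range(10):
            # element-wise multiply with the k-th filter, then sum -> one "signal"
            output[i, j, k] = np.sum(view * filters[k])

print(output.shape)  # (25, 25, 10)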
Note how the entire process operated in 3D space. Both the filters and views were 3D, in this case with a last dimension of 1. It is able to operate in 3D space because the element-wise multiplication will work out, as long as both the filter and "view" have the same 3rd dimension (1 in this case).
If we went through the second convolution layer (Conv2D(20, (4,4), activation="relu")), the last dimension of both the filter and "view" would be 10 instead of 1. This is because the output channels of the previous convolution layer are the input channels of the current one.
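To tie this back to the parameter counts in model.summary() above, each filter carries one extra dimension for the input channels, so the weight counts work out as annotated below (same model as in the question; num_classes is taken to be 10, as in the summary):
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(10, (4,4), activation="relu", input_shape=(28,28,1)))  # (25, 25, 10): 4*4*1*10 + 10 = 170 params
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))                 # (12, 12, 10)
model.add(Conv2D(20, (4,4), activation="relu"))                         # (9, 9, 20): 4*4*10*20 + 20 = 3220 params
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))                 # (4, 4, 20)
model.add(Flatten())                                                    # (320,)
model.add(Dense(128, activation='relu'))                                # 320*128 + 128 = 41088 params
model.add(Dense(10, activation='softmax'))                              # 128*10 + 10 = 1290 params
model.summary()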
I'm building a model in TensorFlow using tf.layers objects. When I run the following code using tf.layers.MaxPooling2D, my output does not reduce in size. I've only recently switched from Keras to TensorFlow directly, so I presume I'm misunderstanding the usage.
import tensorflow as tf
import numpy as np
features = tf.constant(np.random.random((20,128,128,3)), dtype=tf.float32)
y_true = tf.constant(np.random.random((20,1)), dtype=tf.float32)
print('features = %s' % features)
conv = tf.layers.Conv2D(32,(2,2),padding='same')(features)
print('conv = %s' % conv)
pool = tf.layers.MaxPooling2D((2,2),(1,1),padding='same')(conv)
print('pool = %s' % pool)
# and so on ...
I see this output:
features = Tensor("Const:0", shape=(20, 128, 128, 3), dtype=float32)
conv = Tensor("conv2d/BiasAdd:0", shape=(20, 128, 128, 32), dtype=float32)
pool = Tensor("max_pooling2d/MaxPool:0", shape=(20, 128, 128, 32), dtype=float32)
I was expecting the output from the MaxPool layer to have a shape of (20, 64, 64, 32).
Am I using this correctly?
If you want to downsample your feature map by a factor of 2, you should use a stride of 2.
In [1]: tf.layers.MaxPooling2D(2, 2, padding='same')(conv)
Out[1]: <tf.Tensor 'max_pooling2d/MaxPool:0' shape=(20, 64, 64, 32) dtype=float32>
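The underlying arithmetic: with padding='same', the output spatial size of a pooling (or convolution) layer is ceil(input_size / stride), regardless of the window size. A tiny sketch of that calculation:
import math

def same_padding_output_size(size, stride):
    # With padding='same', the output size is ceil(size / stride),
    # independent of the pooling window size.
    return math.ceil(size / stride)

print(same_padding_output_size(128, 1))  # 128 -> what the original code produced
print(same_padding_output_size(128, 2))  # 64  -> the expected downsampled size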
I would like to convert this Lasagne code:
net = {}
net['input'] = lasagne.layers.InputLayer((100, 1, 24, 113))
net['conv1/5x1'] = lasagne.layers.Conv2DLayer(net['input'], 64, (5, 1))
net['shuff'] = lasagne.layers.DimshuffleLayer(net['conv1/5x1'], (0, 2, 1, 3))
net['lstm1'] = lasagne.layers.LSTMLayer(net['shuff'], 128)
to Keras code. So far I have come up with this:
multi_input = Input(shape=(1, 24, 113), name='multi_input')
y = Conv2D(64, (5, 1), activation='relu', data_format='channels_first')(multi_input)
y = LSTM(128)(y)
But I get the error: Input 0 is incompatible with layer lstm_1: expected ndim=3, found ndim=4
Solution
from keras.layers import Input, Conv2D, LSTM, Permute, Reshape
multi_input = Input(shape=(1, 24, 113), name='multi_input')
print(multi_input.shape) # (?, 1, 24, 113)
y = Conv2D(64, (5, 1), activation='relu', data_format='channels_first')(multi_input)
print(y.shape) # (?, 64, 20, 113)
y = Permute((2, 1, 3))(y)
print(y.shape) # (?, 20, 64, 113)
# This line is what you missed
# ==================================================================
y = Reshape((int(y.shape[1]), int(y.shape[2]) * int(y.shape[3])))(y)
# ==================================================================
print(y.shape) # (?, 20, 7232)
y = LSTM(128)(y)
print(y.shape) # (?, 128)
Explanations
I put the documentation of Lasagne and Keras here so you can cross-reference:
Lasagne
Recurrent layers can be used similarly to feed-forward layers except
that the input shape is expected to be (batch_size, sequence_length, num_inputs)
Keras
Input shape
3D tensor with shape (batch_size, timesteps, input_dim).
Basically the API is the same, but Lasagne probably does the reshape for you (I need to check the source code later). That's why you got this error:
Input 0 is incompatible with layer lstm_1: expected ndim=3, found ndim=4
This is because the tensor shape after Conv2D is (?, 64, 20, 113), which has ndim=4.
Therefore, the solution is to reshape it to (?, 20, 7232).
Edit
I confirmed with the Lasagne source code that it does the trick for you:
num_inputs = np.prod(input_shape[2:])
So the correct tensor shape as input for the LSTM is (?, 20, 64 * 113) = (?, 20, 7232).
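Illustratively (shapes copied from the prints above; this is just the arithmetic, not the actual Lasagne code path):
import numpy as np

# shape after DimshuffleLayer((0, 2, 1, 3)): (batch, 20, 64, 113)
shuffled_shape = (None, 20, 64, 113)

# Lasagne's LSTMLayer flattens every dimension after the time axis:
num_inputs = int(np.prod(shuffled_shape[2:]))  # 64 * 113 = 7232
print(num_inputs)                              # matches the Reshape target in Keras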
Note
Permute is redundant here in Keras since you have to reshape anyway. The reason I put it in is to have a "full translation" from Lasagne to Keras; it does what DimshuffleLayer does in Lasagne.
DimshuffleLayer is, however, needed in Lasagne for the reason mentioned in the Edit: the new dimension consumed by the Lasagne LSTM comes from multiplying "the last two" dimensions together.
Below is a piece of example code from the Keras documentation. It looks like the first convolution accepts a 256x256 image with 3 color channels. It has 64 output filters (I think these are the same as the feature maps I have read about elsewhere; can someone confirm this for me?). What confuses me is that the output size is (None, 64, 256, 256). I would expect it to be (None, 64 * 3, 256, 256), since it would need to do convolutions for each of the color channels. What I am wondering is how Keras handles the color channels. Do the values get averaged together (converted to greyscale) before passing through the convolution?
# apply a 3x3 convolution with 64 output filters on a 256x256 image:
model = Sequential()
model.add(Convolution2D(64, 3, 3, border_mode='same', input_shape=(3, 256, 256)))
# now model.output_shape == (None, 64, 256, 256)
# add a 3x3 convolution on top, with 32 output filters:
model.add(Convolution2D(32, 3, 3, border_mode='same'))
# now model.output_shape == (None, 32, 256, 256)
A filter of size 3x3 with 3 input channels consists of 3*3*3 parameters, so the convolution weights for each channel are different.
It sums up the convolution results of each channel (probably together with a bias term) to get the output, so the output shape is independent of the number of input channels: for example, (None, 64, 256, 256) rather than (None, 64 * 3, 256, 256).
I'm not 100% sure, but I think a feature map refers to the output of applying one such filter to the input (for example, a 256*256 matrix).
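A small numpy sketch of how one such 3x3x3 filter collapses the 3 color channels into a single value per position (channels-last ordering here for simplicity; bias and padding ignored, purely illustrative):
import numpy as np

image = np.random.rand(256, 256, 3)   # height x width x channels
kernel = np.random.rand(3, 3, 3)      # one filter: kh x kw x input_channels

patch = image[100:103, 50:53, :]      # one 3x3x3 patch of the image
value = np.sum(patch * kernel)        # multiply element-wise, then sum over
                                      # height, width AND channels -> one scalar

# Sliding this over every position gives one 256x256 feature map (with 'same'
# padding); 64 independent filters give the (None, 64, 256, 256) output.
print(value)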