In TensorFlow's TimeDistributed documentation, there is an example:
inputs = tf.keras.Input(shape=(10, 128, 128, 3))
conv_2d_layer = tf.keras.layers.Conv2D(64, (3, 3))
outputs = tf.keras.layers.TimeDistributed(conv_2d_layer)(inputs)
outputs.shape
And the output is TensorShape([None, 10, 126, 126, 64]). How does it end up with this shape?
Consider a batch of 32 video samples, where each sample is a 128x128 RGB image, across 10 timesteps. The batch input shape is (32, 10, 128, 128, 3).
You can then use TimeDistributed to apply the same Conv2D layer to each of the 10 timesteps independently, as shown below:
inputs = tf.keras.Input(shape=(10, 128, 128, 3))
conv_2d_layer = tf.keras.layers.Conv2D(64, (3, 3))
outputs = tf.keras.layers.TimeDistributed(conv_2d_layer)(inputs)
outputs.shape
Output:
TensorShape([None, 10, 126, 126, 64])
Here is the formula to calculate the output size of a convolution layer along one spatial dimension:
output_size = (input_size + 2 * padding - kernel_size) / stride + 1
In this case the input is 128 pixels along each spatial dimension, the kernel size is 3, the padding is 0 and the stride is 1 (the Conv2D defaults when not provided), so the output size is (128 + 2*0 - 3)/1 + 1 = 126.
The same applies to the other spatial dimension, so one filter produces a 126x126 feature map. Applying 64 filters adds a channel dimension of 64, giving 126x126x64 per timestep, so the output shape across the 10 timesteps is [None, 10, 126, 126, 64], where None is the batch size.
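As a quick sanity check (a minimal sketch reusing the layer from the example), applying the same Conv2D to a single 128x128x3 frame without the time axis already yields the 126x126x64 shape; TimeDistributed simply repeats this over the 10 timesteps:
import tensorflow as tf
frame = tf.keras.Input(shape=(128, 128, 3))  # a single 128x128 RGB frame, no time axis
per_frame = tf.keras.layers.Conv2D(64, (3, 3))(frame)  # valid padding and stride 1 by default
print(per_frame.shape)  # (None, 126, 126, 64)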
I'm trying to perform image classification using a CNN with an EfficientNet backbone in TensorFlow. The original image is very large and contains a significant amount of whitespace, so it has been split into tiles. For example, the first image has 30 tiles (also known as patches), each 256 x 256 in size.
However, not all images have exactly 30 tiles; some have significantly more and some significantly fewer, and all tiles are important. I intend to explore LSTMs (TimeDistributed layers) on this dataset to handle the variable-length input. The data produced by the TensorFlow Dataset has the shape (None, 256, 256, 3): the None reflects the varying number of patches/tiles, and the last 3 dimensions come from each tile having shape (256, 256, 3), as it is 256 x 256 RGB.
Given that the TensorFlow Dataset produces output of shape (None, 256, 256, 3), I assumed this would be the input as well. However, the model's input_shape is limited to the three per-tile dimensions (height, width, channels). Therefore, I set input_shape to (256, 256, 3) as shown below:
B0 = tf.keras.applications.efficientnet_v2.EfficientNetV2B0(
    weights='imagenet',
    include_top=False,
    pooling='avg',
    input_shape=(256, 256, 3),
)
In another example that used the same dataset I am using, someone also applied TimeDistributed layers, but since he always had exactly 10 tiles per image he used the following:
bottleneck = efn.EfficientNetB1(
    weights='../input/effnetweights/efficientnet-b1_weights_tf_dim_ordering_tf_kernels_autoaugment_notop.h5',
    include_top=False,
    pooling='avg'
)
bottleneck = Model(inputs=bottleneck.inputs, outputs=bottleneck.layers[-2].output)
model = Sequential()
model.add(TimeDistributed(bottleneck, input_shape=(SEQ_LEN, IMG_SIZE, IMG_SIZE, 3)))
where SEQ_LEN is 10 and IMG_SIZE is 120.
While this example uses a static 10 tiles per image, articles on the internet show that TensorFlow's TimeDistributed wrapper allows handling of variable-length input, which I interpret to mean a variable number of tiles.
I therefore attempted to use the following as my model:
B0 = tf.keras.applications.efficientnet_v2.EfficientNetV2B0(
    weights='imagenet',
    include_top=False,
    pooling='avg',
    input_shape=(TILE_SIZE, TILE_SIZE, 3),
)
B0 = Model(inputs=B0.inputs, outputs=B0.layers[-2].output)
model = Sequential()
model.add(TimeDistributed(B0, input_shape=(None, TILE_SIZE, TILE_SIZE, 3)))
model.add(GlobalMaxPooling3D())
model.add(Dense(6, activation='softmax'))
model.compile(
    loss='categorical_crossentropy',
    optimizer=Adam(),
    metrics=['categorical_accuracy', tfa.metrics.CohenKappa(num_classes=6, sparse_labels=False, weightage="quadratic")]
)
This didn't work and produced the following error:
WARNING:tensorflow:Model was constructed with shape (None, None, 256, 256, 3) for input KerasTensor(type_spec=TensorSpec(shape=(None, None, 256, 256, 3), dtype=tf.float32, name='time_distributed_input'), name='time_distributed_input', description="created by layer 'time_distributed_input'"), but it was called on an input with incompatible shape (None, None, None, None).
Traceback (most recent call last):
File "C:\Users\tom50\OneDrive - Flinders\Honours\TimeDistributed\Main.py", line 59, in <module>
history = model.fit(
File "C:\Users\tom50\AppData\Roaming\Python\Python39\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\Users\tom50\AppData\Local\Temp\__autograph_generated_filekcf5gc2z.py", line 15, in tf__train_function
retval_ = ag__.converted_call(ag__.ld(step_function), (ag__.ld(self), ag__.ld(iterator)), None, fscope)
ValueError: in user code:
File "C:\Users\tom50\AppData\Roaming\Python\Python39\site-packages\keras\engine\training.py", line 1160, in train_function *
return step_function(self, iterator)
File "C:\Users\tom50\AppData\Roaming\Python\Python39\site-packages\keras\engine\training.py", line 1146, in step_function **
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "C:\Users\tom50\AppData\Roaming\Python\Python39\site-packages\keras\engine\training.py", line 1135, in run_step **
outputs = model.train_step(data)
File "C:\Users\tom50\AppData\Roaming\Python\Python39\site-packages\keras\engine\training.py", line 993, in train_step
y_pred = self(x, training=True)
File "C:\Users\tom50\AppData\Roaming\Python\Python39\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\Users\tom50\AppData\Roaming\Python\Python39\site-packages\keras\engine\input_spec.py", line 232, in assert_input_compatibility
raise ValueError(
ValueError: Exception encountered when calling layer "sequential" " f"(type Sequential).
Input 0 of layer "time_distributed" is incompatible with the layer: expected ndim=5, found ndim=4. Full shape received: (None, None, None, None)
Call arguments received by layer "sequential" " f"(type Sequential):
• inputs=tf.Tensor(shape=(None, None, None, None), dtype=float32)
• training=True
• mask=None
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from efficientnet.tfkeras import EfficientNetB0, preprocess_input
# Define the input shape and number of classes
input_shape = (256, 256, 3)
num_classes = 5
# Create the model with the EfficientNetB0 backbone
model = keras.Sequential([
    EfficientNetB0(include_top=False, input_shape=input_shape),
    layers.GlobalAveragePooling2D(),
    layers.Dense(num_classes, activation='softmax')
])
# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Split the image into tiles
def split_image(image, tile_size=(256, 256)):
    tiles = []
    h, w = image.shape[:2]
    for i in range(0, h, tile_size[0]):
        for j in range(0, w, tile_size[1]):
            tiles.append(image[i:i+tile_size[0], j:j+tile_size[1]])
    return tiles
# Load the image
img = keras.preprocessing.image.load_img("image.jpg")
img = keras.preprocessing.image.img_to_array(img)
tiles = split_image(img)
# Predict the class of each tile
predictions = []
for tile in tiles:
    x = preprocess_input(tile)  # preprocessing that matches the EfficientNet backbone
    x = np.expand_dims(x, axis=0)  # add a batch dimension: (1, 256, 256, 3)
    prediction = model.predict(x)
    predictions.append(prediction)
This code first creates a model with the EfficientNetB0 backbone and compiles it. It then loads the image and splits it into tiles using the split_image() function. Finally, it iterates through each tile and uses the model to predict its class.
Please note that this is just a sample and you should modify the code according to your requirements; also note that this code is not tested, so there may be some errors.
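If you then need a single label for the whole image, one possible aggregation (my own assumption, not part of the sample above) is to average the per-tile probabilities:
tile_probs = np.concatenate(predictions, axis=0)  # stack the per-tile outputs: (num_tiles, num_classes)
image_probs = tile_probs.mean(axis=0)  # average class probabilities over all tiles
image_class = int(np.argmax(image_probs))  # most likely class for the whole image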
I have a ResNet9 model, implemented in PyTorch, which I am using for multi-class image classification. My total number of classes is 6. Using the following code from the torchsummary library, I am able to show the summary of the model, seen in the attached image:
INPUT_SHAPE = (3, 256, 256)  # input shape of my image
print(summary(model.cuda(), INPUT_SHAPE))
However, I am quite confused about the -1 values in all layers of the ResNet9 model. Also, for the Conv2d-1 layer, I am confused about the 64 value in the output shape [-1, 64, 256, 256], as I believe the n_channels value of the input image is 3. Can anyone please help me with the explanation of the output shape values? Thanks!
Yes.
Your INPUT_SHAPE is torch.Size([3, 256, 256]) in channels-first format and (256, 256, 3) in channels-last format.
Since PyTorch models accept input in channels-first format, it is shown as torch.Size([3, 256, 256]).
As for the output shape [-1, 64, 256, 256]: this is the output of your first conv layer, which has 64 filters, each producing a 256x256 feature map; it is not your input shape.
The -1 represents the variable batch size, which can be fixed in the DataLoader.
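As a quick check (a minimal sketch, assuming the first ResNet9 layer is a 3x3 convolution with 64 filters and padding 1, which matches the reported output shape):
import torch
import torch.nn as nn
conv1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
x = torch.randn(8, 3, 256, 256)  # a batch of 8 RGB images, channels-first
print(conv1(x).shape)  # torch.Size([8, 64, 256, 256]): 64 feature maps, spatial size preserved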
I am having trouble understanding how 2D conv calculations are done on 4D inputs. Basically, this is the situation: I have an image with height, width, channels = 128, 128, 103. I want each of these 103 channels to be processed individually, as if I were inputting them to the network one by one. Would the following line work?
import tensorflow.keras
from tensorflow.keras.layers import Conv2D
model1 = tensorflow.keras.models.Sequential()
model1.add(Conv2D(1, kernel_size=(3, 3), input_shape=(128, 128, 103, 1), padding='same'))
I want to avoid splitting the image and inputting it into the network as 103 batches of (128,128,1)
As explained in the documentation: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D?version=nightly
4+D tensor with shape: batch_shape + (channels, rows, cols) if data_format='channels_first' or
4+D tensor with shape: batch_shape + (rows, cols, channels) if data_format='channels_last'.
(by default: data_format='channels_last'.)
You are passing a 5D tensor of shape (batch_shape, 128, 128, 103, 1).
I suggest you reshape your tensor into something that will yield a shape like (None, 128, 128, 103).
Also, please change input_shape=(128, 128, 103, 1) to input_shape=(128, 128, 103).
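A minimal sketch of the suggested change (assuming the data is then fed as batches of shape (None, 128, 128, 103)):
import tensorflow as tf
from tensorflow.keras.layers import Conv2D
model1 = tf.keras.models.Sequential()
model1.add(Conv2D(1, kernel_size=(3, 3), input_shape=(128, 128, 103), padding='same'))
model1.summary()  # single feature map per position: output shape (None, 128, 128, 1)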
I currently have a tensor of torch.Size([1, 3, 256, 224]) but I need it to be input shape [32, 3, 256, 224]. I am capturing data in real-time so dataloader doesn't seem to be a good option. Is there any easy way to take 32 of size torch.Size([1, 3, 256, 224]) and combine them to create 1 tensor of size [32, 3, 256, 224]?
You are probably using a JIT (TorchScript) model, and the batch size must be exactly the one the model was trained on.
t = torch.rand(1, 3, 256, 224)
t.size() # torch.Size([1, 3, 256, 224])
t2 = t.expand(32, -1, -1, -1)
t2.size() # torch.Size([32, 3, 256, 224])
Expanding a tensor does not allocate new memory; it only creates a new view on the existing tensor, and you get what you need. Only the tensor's stride is changed.
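A quick, self-contained way to verify that expand only creates a view:
import torch
t = torch.rand(1, 3, 256, 224)
t2 = t.expand(32, -1, -1, -1)
print(t2.data_ptr() == t.data_ptr())  # True: the expanded tensor shares the same underlying memory
print(t.stride())   # (172032, 57344, 224, 1)
print(t2.stride())  # (0, 57344, 224, 1): the new batch axis has stride 0, so nothing is copied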
Below is a piece of example code from the Keras documentation. It looks like the first convolution accepts a 256x256 image with 3 color channels. It has 64 output filters (I think these are the same as the feature maps I have read about elsewhere; can someone confirm this for me?). What confuses me is that the output size is (None, 64, 256, 256). I would expect it to be (None, 64 * 3, 256, 256), since it would need to do convolutions for each of the color channels. What I am wondering is how Keras handles the color channels. Do the values get averaged together (converted to greyscale) before passing through the convolution?
# apply a 3x3 convolution with 64 output filters on a 256x256 image:
model = Sequential()
model.add(Convolution2D(64, 3, 3, border_mode='same', input_shape=(3, 256, 256)))
# now model.output_shape == (None, 64, 256, 256)
# add a 3x3 convolution on top, with 32 output filters:
model.add(Convolution2D(32, 3, 3, border_mode='same'))
# now model.output_shape == (None, 32, 256, 256)
A filter of size 3x3 with 3 input channels consists of 3x3x3 = 27 weights, so the convolution kernel applied to each channel has its own weights.
The layer sums up the convolution results of each channel (together with a bias term) to get each output feature map, so the output shape is independent of the number of input channels: (None, 64, 256, 256) rather than (None, 64 * 3, 256, 256).
I'm not 100% sure, but I think a feature map refers to the output of applying one such filter to the input (for example, a 256x256 matrix).
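A quick way to check this (a minimal sketch using the current tf.keras Conv2D API rather than the older Convolution2D syntax): the parameter count works out to 64 * (3*3*3 + 1) = 1792, i.e. one 3x3 kernel per input channel for each of the 64 filters, whose per-channel results are summed into a single output feature map.
import tensorflow as tf
layer = tf.keras.layers.Conv2D(64, (3, 3), padding='same')
layer.build(input_shape=(None, 256, 256, 3))  # channels-last equivalent of input_shape=(3, 256, 256)
print(layer.count_params())  # 1792 = 64 * (3*3*3 + 1)
print(layer.kernel.shape)    # (3, 3, 3, 64): kernel height, kernel width, input channels, filters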