Unexpected behaviour of from_logits in BinaryCrossentropy? - python

I am playing with a naive U-net that I'm deploying on MNIST as a toy dataset.
I am seeing a strange behaviour in the way the from_logits argument works in tf.keras.losses.BinaryCrossentropy.
From what I understand, if the last layer of a network uses activation='sigmoid', then tf.keras.losses.BinaryCrossentropy must be given from_logits=False; if instead activation=None, you need from_logits=True. Either setup should work and give equivalent results in practice, although from_logits=True appears more numerically stable (e.g., Why does sigmoid & crossentropy of Keras/tensorflow have low precision?). That is not what happens in the following example.
So, my unet goes as follows (the full code is at the end of this post):
def unet(input, init_depth, activation):
    # do stuff that defines layers
    # last layer is a 1x1 convolution
    output = tf.keras.layers.Conv2D(1, (1,1), activation=activation)(previous_layer) # shape = (28, 28, 1)
    return tf.keras.Model(input, output)
Now I define two models, one with the activation in the last layer:
input = Layers.Input((28,28,1))
model_withProbs = unet(input,4,activation='sigmoid')
model_withProbs.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
                        optimizer=tf.keras.optimizers.Adam()) # from_logits=False since the sigmoid is already present
and one without
model_withLogits = unet(input,4,activation=None)
model_withLogits.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                         optimizer=tf.keras.optimizers.Adam()) # from_logits=True since there is no activation
If I'm right, they should have exactly the same behaviour.
Instead, the prediction for model_withLogits has pixel values up to 2500 or so (which is wrong), while for model_withProbs I get values between 0 and 1 (which is right). You can check out the figures I get here
I thought about the stability issue (from_logits=True is more stable), but this problem appears even before training (see here). Moreover, the problem appears exactly when I pass from_logits=True (that is, for model_withLogits), so I don't think stability is relevant.
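For reference, here is a tiny standalone check, separate from my U-net, of the equivalence I expect between the two settings (the tensor values are made up purely for illustration):
import tensorflow as tf

# made-up logits and binary targets, just for illustration
logits = tf.constant([[2.0, -1.0, 0.5]])
labels = tf.constant([[1.0, 0.0, 1.0]])

bce_from_logits = tf.keras.losses.BinaryCrossentropy(from_logits=True)
bce_from_probs = tf.keras.losses.BinaryCrossentropy(from_logits=False)

# passing raw logits with from_logits=True should match passing
# sigmoid(logits) with from_logits=False, up to numerical error
print(bce_from_logits(labels, logits).numpy())
print(bce_from_probs(labels, tf.sigmoid(logits)).numpy())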
Does anybody have any clue of why this is happening? Am I missing anything fundamental here?
Post Scriptum: Codes
Re-purposing MNIST for segmentation.
I load MNIST:
(x_train, labels_train), (x_test, labels_test) = tf.keras.datasets.mnist.load_data()
I am re-purposing MNIST for a segmentation task by setting all the non-zero values of x_train to one:
x_train = x_train/255 #normalisation
x_test = x_test/255
Y_train = np.zeros(x_train.shape) #create segmentation map
Y_train[x_train>0] = 1 #Y_train is zero everywhere but where the digit is drawn
Full unet network:
def unet(input, init_depth, activation):
    conv1 = Layers.Conv2D(init_depth, (2,2), activation='relu', padding='same')(input)
    pool1 = Layers.MaxPool2D((2,2))(conv1)
    drop1 = Layers.Dropout(0.2)(pool1)

    conv2 = Layers.Conv2D(init_depth*2, (2,2), activation='relu', padding='same')(drop1)
    pool2 = Layers.MaxPool2D((2,2))(conv2)
    drop2 = Layers.Dropout(0.2)(pool2)

    conv3 = Layers.Conv2D(init_depth*4, (2,2), activation='relu', padding='same')(drop2)
    #pool3 = Layers.MaxPool2D((2,2))(conv3)
    #drop3 = Layers.Dropout(0.2)(conv3)

    # upsampling
    up1 = Layers.Conv2DTranspose(init_depth*2, (2,2), strides=(2,2))(conv3)
    up1 = Layers.concatenate([conv2, up1])
    conv4 = Layers.Conv2D(init_depth*2, (2,2), padding='same')(up1)

    up2 = Layers.Conv2DTranspose(init_depth, (2,2), strides=(2,2), padding='same')(conv4)
    up2 = Layers.concatenate([conv1, up2])
    conv5 = Layers.Conv2D(init_depth, (2,2), padding='same')(up2)

    last = Layers.Conv2D(1, (1,1), activation=activation)(conv5)
    return tf.keras.Model(inputs=input, outputs=last)

Related

Understand and Implement Element-Wise Attention Module

Please add a brief comment with your thoughts so that I can improve my query. Thank you. :)
I'm trying to understand and implement a research work on Triple Attention Learning, which consists of
- channel-wise attention (a)
- element-wise attention (b)
- scale-wise attention (c)
The mechanism is integrated experimentally inside the DenseNet model. The architecture diagram of the whole model is here. The channel-wise attention module is nothing but a squeeze-and-excitation block, which passes a sigmoid output on to the element-wise attention module. Below is a more precise feature-flow diagram of these modules (a, b, and c).
Theory
For the most part, I was able to understand and implement it, but I got a bit lost in the element-wise attention section (part b in the above diagram). This is where I need your assistance. :)
Here is a little theory on this topic to give you a rough idea of what this is about. Please note that the paper is not openly accessible now, but at its early release stage on the publisher's page it was free and I saved it at that time. To be fair to all, I'm sharing it with you: Link. Anyway, the paper (Section 4.3) shows:
First of all, the f(att) function (in the first diagram above, left-middle part, or b) consists of three convolution layers: 512 kernels of 1 x 1, 512 kernels of 3 x 3, and C kernels of 1 x 1, where C is the number of categories. And it uses a Softmax activation!
Next comes the channel-wise attention module, which, as mentioned, is simply an SENet module that gives a sigmoid probability score, i.e. X(CA). So from f(att) we get C softmax probability scores, each of which gets multiplied by the sigmoid output to finally produce the feature maps A (according to equation 4 in the above diagram).
Second, there are C linear classifiers, implemented as a convolution layer with C kernels of 1 x 1. This layer is also applied to the SENet module's output, i.e. X(CA), pixel-wise to each feature vector. In the end it gives the feature maps S (equation 5, shown in the diagram below).
And third, they element-wise multiply each confidence score (of S) with the corresponding attention element of A. This multiplication is deliberate: it prevents unnecessary attention on the feature maps. To make it effective, they also minimize a weighted cross-entropy loss here, between the classification ground truth and the score vector.
My Query
Mostly, I don't properly get the minimization strategy in the middle of the network. I would like a proper understanding and implementation of the element-wise attention mechanism proposed in the mentioned paper (Section 4.3).
Implement
Here is some minimal code to get started; it should be enough, I guess. It is a shallow implementation and far from the original element-wise module, and I'm not sure how to implement it properly. For now, I want it as a layer that can be plugged into any model. I was trying it with MNIST and a simple conv net.
In summary, for MNIST we should have a network that contains both the channel-wise and element-wise attention modules, followed by a final 10-unit softmax layer. So, for example:
Net: Conv2D - Attention-Module - GAP - Softmax(10)
The Attention-Module consists of two parts, channel-wise and element-wise, and the element-wise part is supposed to have a Softmax too, minimizing a weighted CE loss between the ground truth and the score vector coming from this module (according to the paper, as already described above). The module also passes the weighted feature maps on to the subsequent layers. For more clarity, here is a simple schematic diagram of what we're looking for.
OK, for the channel-wise attention, which should give us a sigmoid probability score, let's use a fake layer for now for simplicity:
class FakeSE(tf.keras.layers.Layer):
    def __init__(self):
        super(FakeSE, self).__init__()
        # conv layer
        self.conv = tf.keras.layers.Conv2D(10, padding='same', kernel_size=3)

    def call(self, input_tensor, training=False):
        x = self.conv(input_tensor)
        return tf.math.sigmoid(x)
And for the element-wise attention part, the following is my failed attempt so far:
class ElementWiseAttention(tf.keras.layers.Layer):
    def __init__(self):
        super(ElementWiseAttention, self).__init__()
        # for simplicity the f(attn) function here has 2 convolutions instead of 3
        # self.conv1 and self.conv2
        self.conv1 = tf.keras.layers.Conv2D(16,
                                            kernel_size=1,
                                            strides=1, padding='same',
                                            use_bias=True, activation=tf.nn.silu)
        self.conv2 = tf.keras.layers.Conv2D(10,
                                            kernel_size=1,
                                            strides=1, padding='same',
                                            use_bias=False, activation=tf.keras.activations.softmax)
        # fake SENet or channel-wise attention module
        self.cam = FakeSE()
        # a linear layer
        self.linear = tf.keras.layers.Conv2D(10,
                                             kernel_size=1,
                                             strides=1, padding='same',
                                             use_bias=True, activation=None)

    def call(self, inputs):
        # 2 stacked conv layers (in the paper it's 3; we use 2 for simplicity)
        # this is the f(att)
        x = self.conv1(inputs)
        x = self.conv2(x)
        # this is the A = f(att)*X(CA)
        camx = self.cam(x) * x
        # this is S = X(CA)*Linear_Classifier
        linx = self.cam(self.linear(inputs))
        # element-wise multiply to prevent unnecessary attention
        # supposed to be minimized with a weighted cross-entropy loss
        out = tf.multiply(camx, linx)
        return out
The above is the layer of interest. If I understand the paper correctly, this layer should not only minimize the weighted loss function between the ground truth and the score vector but also produce some weighted feature maps (2D).
Run
Here is the toy data
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = np.expand_dims(x_train, axis=-1)
x_train = x_train.astype('float32') / 255
x_train = tf.image.resize(x_train, [32,32]) # if we want to resize
y_train = tf.keras.utils.to_categorical(y_train, num_classes=10)

# Model
input = tf.keras.Input(shape=(32,32,1))
efnet = tf.keras.applications.DenseNet121(weights=None,
                                          include_top=False,
                                          input_tensor=input)
em = ElementWiseAttention()(efnet.output)
# Now we apply global max pooling.
gap = tf.keras.layers.GlobalMaxPooling2D()(em)
# classification layer.
output = tf.keras.layers.Dense(10, activation='softmax')(gap)
# bind all
func_model = tf.keras.Model(efnet.input, output)
func_model.compile(
    loss=tf.keras.losses.CategoricalCrossentropy(),
    metrics=tf.keras.metrics.CategoricalAccuracy(),
    optimizer=tf.keras.optimizers.Adam())
# fit
func_model.fit(x_train, y_train, batch_size=32, epochs=3, verbose=1)
Understanding the element-wise attention
When the paper introduces the method, it says:
The attention modules aim to exploit the relationship between disease
labels and (1) diagnosis-specific feature channels, (2)
diagnosis-specific locations on images (i.e. the regions of thoracic
abnormalities), and (3) diagnosis-specific scales of the feature maps.
(1), (2) and (3) correspond to channel-wise attention, element-wise attention and scale-wise attention respectively.
We can tell that element-wise attention deals with disease location & weight info, i.e. at each location on the image, how likely there is a disease. This is mentioned again when the paper introduces the element-wise attention:
The element-wise attention learning aims to enhance the sensitivity of feature
representations to thoracic abnormal regions, while suppressing the activations when there is no abnormality.
OK, we can easily get location & weight info for one disease, but we have multiple diseases:
Since there are multiple thoracic diseases, we choose to estimate an
element-wise attention map for each category in this work.
We can store the location & weight info for multiple diseases by using a tensor A with shape (height, width, number of diseases):
The all-category attention map is denoted by A ∈ R^(H×W×C), where each element a_ijc is expected to represent the relative importance at location (i, j) for identifying the c-th category of thoracic abnormalities.
And we have linear classifiers to produce a tensor S with the same shape as A, which can be interpreted as:
At each location on the feature maps X(CA), how confident those linear classifiers are that there is a certain disease at that location.
Now we element-wise multiply S and A to get M, i.e. we are:
prevent the attention maps from paying unnecessary attention to those
location with non-existent labels
So after all that, we get a tensor M which tells us:
the location & weight info about the diseases that the linear classifiers are confident about
Then, if we do global average pooling over M, we get a predicted weight for each disease; adding another softmax (or sigmoid) gives us a predicted probability for each disease.
Now that we have labels and predictions, we can naturally minimize the loss function to optimize the model.
Implementation
The following code has been tested on Colab and shows how to implement channel-wise and element-wise attention, and how to build and train a simple model based on your code with DenseNet121, without scale-wise attention:
import tensorflow as tf
import numpy as np
ALPHA = 1/16
C = 10
D = 128
class ChannelWiseAttention(tf.keras.layers.Layer):
    def __init__(self):
        super(ChannelWiseAttention, self).__init__()
        # squeeze
        self.gap = tf.keras.layers.GlobalAveragePooling2D()
        # excitation
        self.fc0 = tf.keras.layers.Dense(int(ALPHA * D), use_bias=False, activation=tf.nn.relu)
        self.fc1 = tf.keras.layers.Dense(D, use_bias=False, activation=tf.nn.sigmoid)
        # reshape so we can do channel-wise multiplication
        self.rs = tf.keras.layers.Reshape((1, 1, D))

    def call(self, inputs):
        # calculate channel-wise attention vector
        z = self.gap(inputs)
        u = self.fc0(z)
        u = self.fc1(u)
        u = self.rs(u)
        return u * inputs
class ElementWiseAttention(tf.keras.layers.Layer):
    def __init__(self):
        super(ElementWiseAttention, self).__init__()
        # f(att)
        self.conv0 = tf.keras.layers.Conv2D(512,
                                            kernel_size=1,
                                            strides=1, padding='same',
                                            use_bias=True, activation=tf.nn.relu)
        self.conv1 = tf.keras.layers.Conv2D(512,
                                            kernel_size=3,
                                            strides=1, padding='same',
                                            use_bias=True, activation=tf.nn.relu)
        self.conv2 = tf.keras.layers.Conv2D(C,
                                            kernel_size=1,
                                            strides=1, padding='same',
                                            use_bias=False, activation=tf.keras.activations.softmax)
        # linear classifier
        self.linear = tf.keras.layers.Conv2D(C,
                                             kernel_size=1,
                                             strides=1, padding='same',
                                             use_bias=True, activation=None)
        # for calculating the score vector used to train the element-wise attention module
        self.gap = tf.keras.layers.GlobalAveragePooling2D()
        self.sfm = tf.keras.layers.Softmax()

    def call(self, inputs):
        # f(att)
        a = self.conv0(inputs)
        a = self.conv1(a)
        a = self.conv2(a)
        # confidence score
        s = self.linear(inputs)
        # element-wise multiply to prevent unnecessary attention
        m = s * a
        # used to minimize the weighted cross-entropy loss
        y_hat = self.gap(m)
        # could also use sigmoid as in the paper
        out = self.sfm(y_hat)
        return m, out
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = np.expand_dims(x_train, axis=-1)
x_train = x_train.astype('float32') / 255
x_train = tf.image.resize(x_train, [32,32]) # if we want to resize
y_train = tf.keras.utils.to_categorical(y_train, num_classes=10)

# Model
input = tf.keras.Input(shape=(32,32,1))
efnet = tf.keras.applications.DenseNet121(weights=None,
                                          include_top=False,
                                          input_tensor=input)
xca = ChannelWiseAttention()(efnet.get_layer("conv3_block1_0_bn").output)
m, output = ElementWiseAttention()(xca)
# bind all
func_model = tf.keras.Model(efnet.input, output)
func_model.compile(
    loss=tf.keras.losses.CategoricalCrossentropy(),
    metrics=tf.keras.metrics.CategoricalAccuracy(),
    optimizer=tf.keras.optimizers.Adam())
# fit
func_model.fit(x_train, y_train, batch_size=64, epochs=3, verbose=1)
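As a possible extension (my own assumption, not something taken from the paper), if you also want to keep passing the weighted feature maps m on to later layers while still supervising the score vector out, you could expose both as model outputs; the tiny conv backbone and the 0.3 auxiliary loss weight below are just placeholders:
inp = tf.keras.Input(shape=(32, 32, 1))
feat = tf.keras.layers.Conv2D(D, 3, padding='same', activation='relu')(inp)  # placeholder backbone
xca = ChannelWiseAttention()(feat)
m, score = ElementWiseAttention()(xca)            # m: weighted maps, score: (batch, C)
x = tf.keras.layers.GlobalAveragePooling2D()(m)   # keep using m downstream
main_out = tf.keras.layers.Dense(10, activation='softmax')(x)
aux_model = tf.keras.Model(inp, [main_out, score])
aux_model.compile(
    loss=['categorical_crossentropy', 'categorical_crossentropy'],
    loss_weights=[1.0, 0.3],                      # auxiliary weight is a guess
    optimizer='adam')
# the same one-hot labels supervise both heads
aux_model.fit(x_train, [y_train, y_train], batch_size=64, epochs=1)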
PS: Serendipitously, I also answered another of your questions related to this paper a few months back:
How to place custom layer inside a in-built pre trained model?

Correcting input dimensions for CNN, LSTM based classifier using Keras, Python

I'm working on implementing a 2D (perhaps 1D) CNN + LSTM classifier for network traffic classification. The CNN will essentially be used as a feature extractor and the LSTM will handle the classification.
I have used the TimeDistributed layer to help combine the CNN and LSTM layers together (Code attached.)
Since the input size varies dynamically, the number of data points has been indicated with None.
no_rows=20 (Number of packets considered per flow for classification)
no_cols=7 (Number of features considered for each packet)
Despite using the TimeDistributed layer wrapper, I am facing some input dimension issues and am not quite sure how to resolve them.
Using Reshape as a layer to resolve this was one of the many fixes I came across but didn't work. Kindly let me know how to build this structure and how to fix my code.
Thanks !
(Using a Linux based AWS instance, Ubuntu 16.04 and Tensorflow backend to implement the code)
Used Reshape layer from Keras core layers to fix the output of the CNN but did not resolve the issue.
Had to remove the Flatten layer and replace it with GlobalMaxPooling2D layer due to the presence of dynamically changing input size.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.30, random_state = 36)
model = Sequential()
# Adding CNN Model layers
model.add(TimeDistributed(Conv2D(32, kernel_size = (4 , 2), strides = 1, padding='valid', activation = 'relu', input_shape = (None,20,7,1))))
model.add(TimeDistributed(BatchNormalization()))
model.add(TimeDistributed(Conv2D(64, kernel_size = (4 , 2), strides = 1, padding='valid', activation = 'relu')))
model.add(TimeDistributed(BatchNormalization()))
#model.add(TimeDistributed(Reshape((-1,1))))
model.add(TimeDistributed(GlobalMaxPooling2D()))
#model.add(Reshape((1,1)))
# Adding LSTM layers
model.add(LSTM(128, recurrent_dropout=0.2))
model.add(Dropout(rate = 0.2))
model.add(Dense(100))
model.add(Dropout(rate = 0.4))
model.add(Dense(108))
model.add(Dense(num_classes,activation='softmax'))
# Compiling this model
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
#print(model.summary())
#Training the Network
history = model.fit(X_train, Y_train, batch_size=32, epochs = 1, validation_data=(X_test, Y_test))
Running the code snippet mentioned above I face the following error message:
"Input tensor must be of rank 3, 4 or 5 but was {}.".format(n + 2))
ValueError: Input tensor must be of rank 3, 4 or 5 but was 2.
One thing you can do is to make a batch of a fixed input (number of frames) from your video source and process that. The code for doing that would be:
import cv2
import numpy as np

def get_data(video_source, batch_size, frame_to_process):
    x_data = []
    # Reading the video from the file path
    cap = cv2.VideoCapture(video_source)
    for i in range(batch_size):
        # to store frames
        frames = []
        for j in range(frame_to_process):
            ret, frame = cap.read()
            if not ret:
                # print('No frames found!')
                break
            # converting the frame to grayscale
            # frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # resizing the frame to a particular input shape
            # frame = cv2.resize(frame, (30,30), interpolation=cv2.INTER_AREA)
            frames.append(frame)
        # appending each batch of frames
        x_data.append(frames)
    return x_data
# number of frames in each batch
frame_to_process = 30
# size of each batch
batch_size = 32
# make batch of video inputs
X_data = np.array(get_data(video_source, batch_size, frame_to_process))
Tip: also, instead of using Conv2D with TimeDistributed you can use ConvLSTM2D, which can give a small performance improvement.
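A minimal sketch of that idea, assuming a fixed clip of 30 frames of 30x30 grayscale images and 10 classes (all of these sizes are placeholders):
import tensorflow as tf

num_classes = 10  # placeholder

model = tf.keras.Sequential([
    # input: (batch, time, height, width, channels)
    tf.keras.layers.Input(shape=(30, 30, 30, 1)),
    tf.keras.layers.ConvLSTM2D(32, kernel_size=(3, 3), padding='same', return_sequences=True),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.ConvLSTM2D(64, kernel_size=(3, 3), padding='same', return_sequences=False),
    tf.keras.layers.GlobalMaxPooling2D(),
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])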
Anyway, if you want to process frames dynamically, you can port the code to PyTorch, which has dynamic graphs and lets you feed inputs with variable batch sizes.
Dynamic Computation Graph
Difference between Static and Dynamic Graphs

Get decoder from trained autoencoder model in Keras

I am training a deep autoencoder to map human faces to a 128-dimensional latent space and then decode them back to their original 128x128x3 format.
I was hoping that after training the autoencoder, I would somehow be able to 'slice' the second half of the autoencoder, i.e. the decoder network responsible for mapping the latent space (128,) to the image space (128, 128, 3) by using the functional Keras API and autoenc_model.get_layer()
Here are the relevant layers of my model:
INPUT_SHAPE=(128,128,3)
input_img = Input(shape=INPUT_SHAPE, name='enc_input')
#1
x = Conv2D(64, (3, 3), padding='same', activation='relu')(input_img)
x = BatchNormalization()(x)
# Many Conv2D, BatchNormalization(), MaxPooling() layers
.
.
.
#Flatten
fc_input = Flatten(name='enc_output')(x)
y = Dropout(DROP_RATE)(fc_input)
y = Dense(128, activation='relu')(y)
y = Dropout(DROP_RATE)(y)
fc_output = Dense(128, activation='linear')(y)
#Reshape
decoder_input = Reshape((8, 8, 2), name='decoder_input')(fc_output)
#Decoder part
#UnPooling-1
z = UpSampling2D()(decoder_input)
# Many Conv2D, BatchNormalization, UpSampling2D layers
.
.
.
#16
decoder_output = Conv2D(3, (3, 3), padding='same', activation='linear', name='decoder_output')(z)
autoenc_model = Model(input_img, decoder_output)
Here is the notebook containing the entire model architecture.
To get the decoder network from the trained autoencoder, I have tried using:
dec_model = Model(inputs=autoenc_model.get_layer('decoder_input').input, outputs=autoenc_model.get_layer('decoder_output').output)
and
dec_model = Model(autoenc_model.get_layer('decoder_input'), autoenc_model.get_layer('decoder_output'))
neither of which seem to work.
I need to extract the decoder layers out of the autoencoder as I want to train the entire autoencoder model first, then use the encoder and the decoder independently.
I could not find a satisfactory answer anywhere else. The Keras blog article on building autoencoders only covers how to extract the decoder for 2 layered autoencoders.
The decoder input/output shape should be: (128, ) and (128, 128, 3), which is the input shape of the 'decoder_input' and output shape of the 'decoder_output' layers respectively.
A couple of changes are needed:
z = UpSampling2D()(decoder_input)
to
direct_input = Input(shape=(8,8,2), name='d_input')
#UnPooling-1
z = UpSampling2D()(direct_input)
and
autoenc_model = Model(input_img, decoder_output)
to
dec_model = Model(direct_input, decoder_output)
autoenc_model = Model(input_img, dec_model(decoder_input))
Now you can train the autoencoder and predict using the decoder.
import numpy as np
autoenc_model.fit(np.ones((5,128,128,3)), np.ones((5,128,128,3)))
dec_model.predict(np.ones((1,8,8,2)))
You can also refer this self-contained example:
https://github.com/keras-team/keras/blob/master/examples/variational_autoencoder.py
My solution isn't very elegant, and there are probably better solutions out there, but since no one has replied yet, I'll post it (I was actually hoping someone would, so I could improve my own implementation, as you'll see below).
So what I did was build a network that can take a secondary input directly into the latent space.
Unfortunately, both inputs are obligatory, so I end up with a network that requires dummy arrays full of zeros for the 'unwanted' input (you'll see in a second).
Using Keras functional API:
image_input = Input(shape=image_shape)
conv1 = Conv2D(...,activation='relu')(image_input)
...
dense_encoder = Dense(...)(<layer>)
z_input = Input(shape=n_latent)
decoder_entry = Dense(...,activation='relu')(Add()([dense_encoder,z_input]))
...
decoder_output = Conv2DTranspose(...)
model = Model(inputs=[image_input,z_input], outputs=decoder_output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
encoder = Model(inputs=image_input,outputs=dense_encoder)
decoder = Model(inputs=[z_input,image_input], outputs=decoder_output)
Note that you shouldn't compile the encoder and decoder.
(some code is either omitted or left with ... for you to fill in your specific needs).
Finally, to train you'll have to provide one empty array. So, to train the entire autoencoder:
# images is X in this context
model.fit([images, np.zeros((len(n_latent),...))], images)
And then you can get the latent features using:
latent_features = encoder.predict(images)
Or use the decoder with latent input and dummy variables (note the order of inputs above):
decoder.predict([Z_inputs,np.zeros(shape=images.shape)])
Finally, another solution I haven't tried is to build two parallel models with the same architecture, one being the autoencoder and the second only the decoder part, and then use:
decoder_layer.set_weights(model_layer.get_weights())
It should work, but I haven't confirmed it. It does have the disadvantage of having to copy the weights again every time you train the autoencoder model.
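A rough, self-contained sketch of that weight-copying idea (the toy layer sizes and names here are made up; the only requirement is that corresponding layers share names in both models):
import tensorflow as tf
from tensorflow.keras import layers, Model

# toy stand-ins: 'ae' plays the trained autoencoder, 'dec' the standalone decoder;
# the layer names ('dec_dense', 'dec_out') are hypothetical but must match in both models
latent = layers.Input(shape=(8,))
h = layers.Dense(16, activation='relu', name='dec_dense')(latent)
out = layers.Dense(32, name='dec_out')(h)
dec = Model(latent, out)

img = layers.Input(shape=(32,))
z = layers.Dense(8, activation='relu', name='enc_dense')(img)
h2 = layers.Dense(16, activation='relu', name='dec_dense')(z)
rec = layers.Dense(32, name='dec_out')(h2)
ae = Model(img, rec)

# copy trained weights from the autoencoder into the standalone decoder by layer name
for layer in dec.layers:
    if layer.weights:  # skip layers with no weights (Input, UpSampling2D, ...)
        layer.set_weights(ae.get_layer(layer.name).get_weights())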
So to conclude, I am aware of the many problems here, but again, I only posted this because I saw no one else had replied, and was hoping it would still be of some use to you.
Please comment if something is not clear.
An option is to define a function which uses get_layer and reconstructs the decoder part in there. For example, consider a simple autoencoder with the following architecture: [n_inputs, 500, 100, 500, n_outputs]. To be able to run some inputs through the second half (i.e. run a 100-dimensional input through the 500-unit and n_outputs layers):
# Function to get outputs from a given set of bottleneck inputs
def bottleneck_to_outputs(bottleneck_inputs, autoencoder):
    # Run bottleneck_inputs (e.g. 100 units) through the decoder layer (e.g. 500 units)
    x = autoencoder.get_layer('decoder')(bottleneck_inputs)
    # Run x (e.g. 500 units) through the output layer (n units = n features)
    x = autoencoder.get_layer('output')(x)
    return x
For your example, this function should work (assuming you have given your layers the names referenced here).
def decoder_part(autoenc_model, image):
    #UnPooling-1
    z = autoenc_model.get_layer('upsampling1')(image)
    #9
    z = autoenc_model.get_layer('conv2d1')(z)
    z = autoenc_model.get_layer('batchnorm1')(z)
    #10
    z = autoenc_model.get_layer('conv2d2')(z)
    z = autoenc_model.get_layer('batchnorm2')(z)
    #UnPooling-2
    z = autoenc_model.get_layer('upsampling2')(z)
    #11
    z = autoenc_model.get_layer('conv2d3')(z)
    z = autoenc_model.get_layer('batchnorm3')(z)
    #12
    z = autoenc_model.get_layer('conv2d4')(z)
    z = autoenc_model.get_layer('batchnorm4')(z)
    #UnPooling-3
    z = autoenc_model.get_layer('upsampling3')(z)
    #13
    z = autoenc_model.get_layer('conv2d5')(z)
    z = autoenc_model.get_layer('batchnorm5')(z)
    #14
    z = autoenc_model.get_layer('conv2d6')(z)
    z = autoenc_model.get_layer('batchnorm6')(z)
    #UnPooling-4
    z = autoenc_model.get_layer('upsampling4')(z)
    #15
    z = autoenc_model.get_layer('conv2d7')(z)
    z = autoenc_model.get_layer('batchnorm7')(z)
    #16
    decoder_output = autoenc_model.get_layer('decoder_output')(z)
    return decoder_output
Given this function, it would make sense to also have a way to test if it is working correctly. In order to do this, define another model which gets you from inputs to the bottleneck (latent space), such as:
bottleneck_layer = Model(inputs=input_img, outputs=decoder_input)
Then, as a test, run a vector of ones through the first part of the model and obtain the latent space:
import numpy as np
ones_image = np.ones((128,128,3))
bottleneck_ones = bottleneck_layer(ones_image.reshape(1,128,128,3))
And then run that latent space through the function defined above to create a variable which you will test against the output of full network:
decoded_test = decoder_part(autoenc_model, bottleneck_ones)
Now, run the ones_image through the whole network and verify that you get the same results:
model_test = autoenc_model.predict(ones_image.reshape(1,128,128,3))
tf.debugging.assert_equal(model_test, decoded_test, message='Tensors are not equivalent')
If the assert_equal line does not throw an error, your decoder is working correctly.

Make fixed timestep length LSTM Keras model free timestep length

I have a Keras LSTM multitask model that performs two tasks. One is a sequence tagging task (so I predict a label per token). The other is a global classification task over the whole sequence using a CNN that is stacked on the hidden states of the LSTM.
In my setup (don't ask why) I only need the CNN task during training; the labels it predicts have no use in the final product. Now, in Keras one can train an LSTM model without specifying the input sequence length, like this:
l_input = Input(shape=(None,), dtype="int32", name=input_name)
However, if I add the CNN stacked on the LSTM hidden states I need to set a fixed sequence length for the model.
l_input = Input(shape=(timesteps_size,), dtype="int32", name=input_name)
The problem is that once I have trained the model with a fixed timesteps_size I can no longer use it to predict longer sequences.
In other frameworks this is not a problem. But in Keras, I cannot get rid of the CNN and change the expected input shape of the model once it has been trained.
Here is a simplified version of the model
l_input = Input(shape=(timesteps_size,), dtype="int32")
l_embs = Embedding(len(input.keys()), 100)(l_input)
l_blstm = Bidirectional(GRU(300, return_sequences=True))(l_embs)

# Sequential output
l_out1 = TimeDistributed(Dense(len(labels.keys()),
                               activation="softmax"))(l_blstm)

# Global output
conv1 = Conv1D(filters=5, kernel_size=10)(l_embs)
conv1 = Flatten()(MaxPooling1D(pool_size=2)(conv1))

conv2 = Conv1D(filters=5, kernel_size=8)(l_embs)
conv2 = Flatten()(MaxPooling1D(pool_size=2)(conv2))

conv = Concatenate()([conv1, conv2])
conv = Dense(50, activation="relu")(conv)
l_out2 = Dense(len(global_labels.keys()), activation='softmax')(conv)

model = Model(inputs=l_input, outputs=[l_out1, l_out2])

optimizer = Adam()
model.compile(optimizer=optimizer,
              loss="categorical_crossentropy",
              metrics=["accuracy"])
I would like to know if anyone here has faced this issue, whether there are any solutions for deleting layers from a model after training and, more importantly, how to reshape the input layer sizes after training.
Thanks
Variable timestep length is a problem not because of the convolution layers (in fact, the nice thing about convolution layers is that they do not depend on the input size), but because of the Flatten layers, which need an input of a specified size. You can use Global Pooling layers instead. Further, I think stacking convolution and pooling layers on top of each other might give a better result than using two separate convolution layers and merging them (although this depends on the specific problem and dataset you are working on). Considering these two points, it might be better to write your model like this:
# Global output
conv1 = Conv1D(filters=16, kernel_size=5)(l_embs)
conv1 = MaxPooling1D(pool_size=2)(conv1)
conv2 = Conv1D(filters=32, kernel_size=5)(conv1)
conv2 = MaxPooling1D(pool_size=2)(conv2)
gpool = GlobalAveragePooling1D()(conv2)
x = Dense(50, activation="relu")(gpool)
l_out2 = Dense(len(global_labels.keys()), activation='softmax')(x)
model = Model(inputs=l_input, outputs=[l_out1, l_out2])
You may need to tune the number of conv+maxpool layers, number of filters, kernel size and even add dropout or batch normalization layers.
As a side note, using TimeDistributed on a Dense layer is redundant as the Dense layer is applied on the last axis.
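A quick way to see this, with made-up shapes:
import tensorflow as tf

x = tf.random.normal((2, 7, 16))  # (batch, timesteps, features)
dense = tf.keras.layers.Dense(4)
td_dense = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(4))

# both apply the same per-timestep transformation: output shape (2, 7, 4)
print(dense(x).shape, td_dense(x).shape)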

Multi-input models using Keras (Model API)

I've been trying to construct a multiple input model using Keras. I am coming from using the sequential model and having only one input which was fairly straight-forward. I have been looking at the documentation (https://keras.io/getting-started/functional-api-guide/) and some answers here on StackOverflow (How to "Merge" Sequential models in Keras 2.0?). Basically what I want is to have two inputs train one model. One input is a piece of text and the other is a set of hand-picked features that were extracted from that text. The hand-picked feature vectors are of a constant length. Below is what I've tried so far:
left = Input(shape=(7801,), dtype='float32', name='left_input')
left = Embedding(7801, self.embedding_vector_length, weights=[self.embeddings],
                 input_length=self.max_document_length, trainable=False)(left)
right = Input(shape=(len(self.z_train), len(self.z_train[0])), dtype='float32', name='right_input')

for i, filter_len in enumerate(filter_sizes):
    left = Conv1D(filters=128, kernel_size=filter_len, padding='same', activation=c_activation)(left)
    left = MaxPooling1D(pool_size=2)(left)

left = CuDNNLSTM(100, unit_forget_bias=1)(left)
right = CuDNNLSTM(100, unit_forget_bias=1)(right)

left_out = Dense(3, activation=activation, kernel_regularizer=l2(l_2), activity_regularizer=l1(l_1))(left)
right_out = Dense(3, activation=activation, kernel_regularizer=l2(l_2), activity_regularizer=l1(l_1))(right)

for i in range(self.num_outputs):
    left_out = Dense(3, activation=activation, kernel_regularizer=l2(l_2), activity_regularizer=l1(l_1))(left_out)
    right_out = Dense(3, activation=activation, kernel_regularizer=l2(l_2), activity_regularizer=l1(l_1))(right_out)

left_model = Model(left, left_out)
right_model = Model(right, right_out)

concatenated = merge([left_model, right_model], mode="concat")
out = Dense(3, activation=activation, kernel_regularizer=l2(l_2), activity_regularizer=l1(l_1), name='output_layer')(concatenated)

self.model = Model([left_model, right_model], out)
self.model.compile(loss=loss, optimizer=optimizer, metrics=[cosine, mse, categorical_accuracy])
This gives the error:
TypeError: Input layers to a `Model` must be `InputLayer` objects. Received inputs: Tensor("cu_dnnlstm_1/strided_slice_16:0", shape=(?, 100), dtype=float32). Input 0 (0-based) originates from layer type `CuDNNLSTM`.
The error is clear (and you're almost there). The code is currently attempting to set the inputs as the models [left_model, right_model]; instead, the inputs must be the Input layers [left, right]. The relevant part of the code sample above should read:
self.model = Model([left, right], out)
See my answer here as a reference: Merging layers, especially the second example.
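For illustration, here is a minimal self-contained sketch of a two-input model in the same spirit (the layer sizes and names are placeholders, and the Input layers are kept in their own variables so they can be passed to Model):
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Conv1D, MaxPooling1D, LSTM, Dense, Concatenate
from tensorflow.keras.models import Model

# text branch: token ids -> embedding -> conv -> LSTM
left_input = Input(shape=(100,), dtype='int32', name='left_input')
x = Embedding(input_dim=5000, output_dim=64)(left_input)
x = Conv1D(filters=128, kernel_size=3, padding='same', activation='relu')(x)
x = MaxPooling1D(pool_size=2)(x)
x = LSTM(100)(x)

# hand-picked feature branch: fixed-length vector -> dense
right_input = Input(shape=(20,), dtype='float32', name='right_input')
y = Dense(32, activation='relu')(right_input)

# merge the two branches and classify
merged = Concatenate()([x, y])
out = Dense(3, activation='softmax', name='output_layer')(merged)

model = Model(inputs=[left_input, right_input], outputs=out)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])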
