Extract learned NN posterior weight distribution parameters from DenseVariational layer - python

I also posted this question in the tensorflow probability Github issues:
https://github.com/tensorflow/probability/issues/892
I'm using Tensorflow 2.1.0 and tensorflow-probability 0.9.0 in python 3.6.8.
I'm working with a Tensorflow Probability Keras model that has a DenseVariational layer defined as follows (lifted from examples found online):
def posterior_mean_field(kernel_size, bias_size=0, dtype=None):
    n = kernel_size + bias_size
    c = np.log(np.expm1(1.))
    return tf.keras.Sequential([
        tfp.layers.VariableLayer(2 * n, dtype=dtype),
        tfp.layers.DistributionLambda(lambda t: tfd.Independent(
            tfd.Normal(loc=t[..., :n], scale=1e-5 + tf.nn.softplus(c + t[..., n:])),
            reinterpreted_batch_ndims=1)),
    ])

def prior_trainable(kernel_size, bias_size=0, dtype=None):
    n = kernel_size + bias_size
    return tf.keras.Sequential([
        tfp.layers.VariableLayer(n, dtype=dtype),
        tfp.layers.DistributionLambda(lambda t: tfd.Independent(tfd.Normal(loc=t, scale=1),
                                                                reinterpreted_batch_ndims=1)),
    ])

dense = tfp.layers.DenseVariational(units=units, make_posterior_fn=posterior_mean_field,
                                    make_prior_fn=prior_trainable,
                                    )(prev_layer)
If I train my model and then remove the layers following this layer, the remaining model will output random variables from the learned posterior weight distributions. Something like this:
from tensorflow.keras import Model
# DenseVariational layer is 3rd to last layer in this case
cropped_model = Model(inputs, model.layers[-3].output)
cropped_model.predict(test_data)
Most of the time this is fine (e.g. training, sampling, etc.). However, is there a direct way to get the learned loc and scale posterior values returned for a given input (e.g. test_data) to this cropped_model, instead of a sample draw from the distribution they define?

You may refer to the 'Train model and Inspect' section of this webpage.
I will briefly introduce the solution mentioned in the website here.
Assuming the DenseVariational layer is the first layer of your trained model, you can get the trained prior and posterior distributions, and then their mean and variance, in this way (since the weight distributions of a DenseVariational layer are not affected by the input, the dummy input can be any array):
dummy_input = np.array([[0]])
model_prior = model.layers[0]._prior(dummy_input)
model_posterior = model.layers[0]._posterior(dummy_input)
print('Prior Variance: ', model_prior.variance().numpy())
print('Posterior mean: ', model_posterior.mean().numpy())
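As a rough illustration of what those posterior parameters are, the sketch below reproduces in NumPy the mapping used by posterior_mean_field in the question: the first n entries of the raw variable are the loc, and the remaining n entries pass through a shifted softplus to give the scale. The names split_posterior_params and raw are hypothetical; raw stands in for the trained VariableLayer value, which you would have to pull out of the layer yourself (for instance from its weights, whose ordering is an assumption you should verify).

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def split_posterior_params(raw, n):
    # Mirrors posterior_mean_field: loc = t[..., :n],
    # scale = 1e-5 + softplus(c + t[..., n:]).
    c = np.log(np.expm1(1.0))
    loc = raw[..., :n]
    scale = 1e-5 + softplus(c + raw[..., n:])
    return loc, scale

# Hypothetical raw variable of length 2 * n (all zeros, as if untrained).
raw = np.zeros(6)
loc, scale = split_posterior_params(raw, n=3)
print(loc, scale)
```

With an all-zero raw vector the softplus shift c makes the scale start at roughly 1, which is why that constant appears in the original function.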

Related

How to get the latent vector as an output from a cnn model before training to the fully connected layer?

I am working on a CNN model using the TensorFlow framework in Google Colab. I am unable to extract the latent vectors from the convolutional layers. I want to extract the output of the convolutional layers, i.e. the layers before the fully connected layer.
I have tried with the following code
a = dropout()(classifier_model.output)
print(a)
I am unable to understand the solution suggested on the link Stackoverflow solution to print the value of tensorflow object after applying a-conv-pool-layer
Anyone with any suggestion?
You can use the get_layer method of the Model class to get a layer by its name. Find below an example with a dummy 1D CNN and a binary classifier:
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Conv1D, ReLU, MaxPool1D, Flatten, Dense
timesteps = 100
nfeatures = 2
# build the model using the functional API
# example of a 1D CNN inspired by your Stack Overflow link, but using a model instead of successive *raw* layers
# the values of the Conv1D filters and kernels are different
input = Input((timesteps, nfeatures))
p = Conv1D(filters=16, kernel_size=10)(input)
p = ReLU()(p)
p = MaxPool1D(pool_size=2)(p)
p = Conv1D(filters=32, kernel_size=10)(p)
p = ReLU()(p)
p = MaxPool1D(pool_size=2)(p)
p = Conv1D(filters=64, kernel_size=10)(p)
p = ReLU()(p)
p = MaxPool1D(pool_size=2, name='conv1Dfeat')(p) # give a name to the CNN output
# fully connected part
p = Flatten()(p)
p = Dense(10)(p)
# could add a dropout layer to ease optimization
finaloutput = Dense(1, activation='sigmoid')(p)
# full model
model = Model(inputs=input, outputs=finaloutput)
# compile network, i.e. define optimizer, loss and metrics
model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
You need to train the model using the fit method with some data. Then you can get the output of the layer named conv1Dfeat (the last layer of the convolutional part) by defining a second model:
modelCNN = Model(inputs=input, outputs=model.get_layer('conv1Dfeat').output)
modelCNN.summary()
If you want the output of the convolutional part for, say, a single NumPy input array of shape (timesteps, nfeatures), you can use the predict method of the Model class on batched data:
data = np.random.normal(size=(timesteps, nfeatures)) # dummy data
data_tf = tf.expand_dims(data, axis=0) # convert to TF tensor and add batch dimension at the same time
cnn_out_np = modelCNN.predict(data_tf)
cnn_out_np = np.squeeze(cnn_out_np, axis=0) # remove batch dimension
print(cnn_out_np.shape)
(4, 64)
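To see where the printed shape comes from, here is a small sketch that tracks the sequence length through the three Conv1D and MaxPool1D stages of the model above. It assumes the Keras defaults used there ('valid' padding, stride 1 for the convolutions, pooling stride equal to the pool size); the helper names are made up for this illustration.

```python
def conv1d_valid_len(length, kernel_size, stride=1):
    # Output length of a 1D convolution with 'valid' padding.
    return (length - kernel_size) // stride + 1

def maxpool1d_len(length, pool_size):
    # Output length of 1D max pooling with 'valid' padding and stride = pool_size.
    return (length - pool_size) // pool_size + 1

length = 100  # timesteps
for filters in (16, 32, 64):
    length = conv1d_valid_len(length, kernel_size=10)
    length = maxpool1d_len(length, pool_size=2)

feature_shape = (length, 64)  # 64 filters in the last Conv1D
print(feature_shape)  # (4, 64)
```

The length goes 100, 91, 45, 36, 18, 9, 4, which matches the (4, 64) output after the batch dimension is squeezed away.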

Understand and Implement Element-Wise Attention Module

Please add a minimum comment on your thoughts so that I can improve my query. Thank you. -)
I'm trying to understand and implement a research work on Triple Attention Learning, which consists of
- channel-wise attention (a)
- element-wise attention (b)
- scale-wise attention (c)
The mechanism is integrated experimentally inside the DenseNet model. The arch of the whole model's diagram is here. The channel-wise attention module is simply nothing but the squeeze and excitation block. That gives a sigmoid output further to the element-wise attention module. Below is the more precise feature flow diagram of these modules (a, b, and c).
Theory
For the most part, I was able to understand and implement it but was a bit lost in the Element-Wise attention section (part b from the above diagram). This is where I need your assistance. -)
Here is a little theory on this topic to give you a rough idea of what all this is about. Please note, The paper is not openly accessible now but at its early stage of release on the publisher page it was free to get and I saved it at that time. And to be fair to all, I'm sharing it with you, Link. Anyway, from the paper (Section 4.3) it shows:
So first of all, the f(att) function (part b, in the left-middle of the first diagram) consists of three convolution layers: 512 kernels of size 1 x 1, 512 kernels of size 3 x 3, and C kernels of size 1 x 1, followed by a softmax activation. Here C is the number of classifiers.
Next, it is applied to the output of the channel-wise attention module, which as mentioned is simply an SENet module that gives sigmoid probability scores, i.e. X(CA). So from f(att) we get C softmax probability scores, each of which is multiplied with the sigmoid output, finally producing the feature maps A (according to equation 4 of the above diagram).
Second, there are C linear classifiers, implemented as a convolution layer with C kernels of size 1 x 1. This layer is also applied to the SENet module's output, i.e. X(CA), at each feature-vector pixel. In the end it outputs the feature maps S (equation 5 in the diagram below).
Third, they element-wise multiply each confidence score (of S) with the corresponding attention element of A. This multiplication is deliberate: it prevents unnecessary attention on the feature maps. To make it effective, they also minimize a weighted cross-entropy loss between the classification ground truth and the score vector.
My Query
Mostly, I don't properly understand the minimization strategy in the middle of the network. I'd like someone to give me a proper understanding and implementation of the element-wise attention mechanism proposed in the paper (Section 4.3).
Implement
Here is minimal code to get started; it should be enough, I guess. This is a shallow implementation and still far from the original element-wise module. I'm not sure how to implement it properly. For now, I want it as a layer that can be plugged into any model. I was trying it with MNIST and a simple conv net.
In a summary, for MNIST, we should have a network that contains both the channel-wise and element-wise attention model followed by the last 10 unit softmax layer. So for example:
Net: Conv2D - Attentions-Module - GAP - Softmax(10)
The Attention-Module consists of two parts: channel-wise and element-wise. The element-wise part is supposed to have a softmax too, and minimizes a weighted CE loss between the ground truth and the score vector coming from this module (according to the paper, as described above). The module also passes weighted feature maps to the subsequent layers. For more clarity, here is a simple schematic diagram of what we're looking for.
Ok, for the channel-wise attention which should give us a single probability score (sigmoid), let's use a fake layer for now for simplicity:
class FakeSE(tf.keras.layers.Layer):
    def __init__(self):
        super(FakeSE, self).__init__()
        # conv layer
        self.conv = tf.keras.layers.Conv2D(10, padding='same',
                                           kernel_size=3)

    def call(self, input_tensor, training=False):
        x = self.conv(input_tensor)
        return tf.math.sigmoid(x)
And for the element-wise attention part, following is the failed attempt so far:
class ElementWiseAttention(tf.keras.layers.Layer):
    def __init__(self):
        # call the parent constructor before assigning sub-layers
        super(ElementWiseAttention, self).__init__()
        # for simplicity the f(attn) function here has 2 convolutions instead of 3
        # self.conv1, and self.conv2
        self.conv1 = tf.keras.layers.Conv2D(16,
                                            kernel_size=1,
                                            strides=1, padding='same',
                                            use_bias=True, activation=tf.nn.silu)
        self.conv2 = tf.keras.layers.Conv2D(10,
                                            kernel_size=1,
                                            strides=1, padding='same',
                                            use_bias=False, activation=tf.keras.activations.softmax)
        # fake SENet or channel-wise attention module
        self.cam = FakeSE()
        # a linear layer
        self.linear = tf.keras.layers.Conv2D(10,
                                             kernel_size=1,
                                             strides=1, padding='same',
                                             use_bias=True, activation=None)

    def call(self, inputs):
        # 2 stacked conv layers (in the paper, it's 3; we use 2 for simplicity)
        # this is the f(att)
        x = self.conv1(inputs)
        x = self.conv2(x)
        # this is the A = f(att)*X(CA)
        camx = self.cam(x)*x
        # this is S = X(CA)*Linear_Classifier
        linx = self.cam(self.linear(inputs))
        # element-wise multiply to prevent unnecessary attention
        # supposed to be minimized with a weighted cross-entropy loss
        out = tf.multiply(camx, linx)
        return out
The above one is the Layer of Interest. If I understand the paper words correctly, this layer should not only minimize the weighted loss function to gt and score_vector but also produce some weighted feature maps (2D).
Run
Here is the toy data
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = np.expand_dims(x_train, axis=-1)
x_train = x_train.astype('float32') / 255
x_train = tf.image.resize(x_train, [32,32]) # if we want to resize
y_train = tf.keras.utils.to_categorical(y_train , num_classes=10)
# Model
input = tf.keras.Input(shape=(32,32,1))
efnet = tf.keras.applications.DenseNet121(weights=None,
include_top = False,
input_tensor = input)
em = ElementWiseAttention()(efnet.output)
# Now that we apply global max pooling.
gap = tf.keras.layers.GlobalMaxPooling2D()(em)
# classification layer.
output = tf.keras.layers.Dense(10, activation='softmax')(gap)
# bind all
func_model = tf.keras.Model(efnet.input, output)
func_model.compile(
loss = tf.keras.losses.CategoricalCrossentropy(),
metrics = tf.keras.metrics.CategoricalAccuracy(),
optimizer = tf.keras.optimizers.Adam())
# fit
func_model.fit(x_train, y_train, batch_size=32, epochs=3, verbose = 1)
Understanding the element-wise attention
When the paper introduces its method, it says:
The attention modules aim to exploit the relationship between disease
labels and (1) diagnosis-specific feature channels, (2)
diagnosis-specific locations on images (i.e. the regions of thoracic
abnormalities), and (3) diagnosis-specific scales of the feature maps.
(1), (2) and (3) correspond to channel-wise attention, element-wise attention and scale-wise attention.
We can tell that element-wise attention deals with disease location and weight information, i.e. at each location in the image, how likely it is that there is a disease. This is mentioned again when the paper introduces the element-wise attention:
The element-wise attention learning aims to enhance the sensitivity of feature
representations to thoracic abnormal regions, while suppressing the activations when there is no abnormality.
OK, we could easily get the location and weight information for one disease, but we have multiple diseases:
Since there are multiple thoracic diseases, we choose to estimate an
element-wise attention map for each category in this work.
We can store the location and weight information for multiple diseases in a tensor A with shape (height, width, number of diseases):
The all-category attention map is denoted by A ∈ R^(H×W×C), where each
element a_ijc is expected to represent the relative importance at location (i, j) for
identifying the c-th category of thoracic abnormalities.
And we have linear classifiers that produce a tensor S with the same shape as A. This can be interpreted as:
At each location on feature maps X(CA), how confident those linear classifiers think there is certain disease at that location
Now we element-wise multiply S and A to get M, i.e. we:
prevent the attention maps from paying unnecessary attention to those
location with non-existent labels
So after all that, we get a tensor M which tells us:
location & weight info about certain disease that linear classifiers are confident about it
Then, if we do global average pooling over M, we get a predicted weight for each disease; adding another softmax (or sigmoid) gives a predicted probability for each disease.
Now that we have both labels and predictions, we can naturally minimize a loss function to optimize the model.
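The flow just described (A from a softmax, S from linear classifiers, M = S * A, then global average pooling and a final softmax) can be sketched in plain NumPy as follows. This is only an illustration under assumptions: the random arrays att_logits and S are stand-ins for the outputs of f(att) and the linear classifiers, and the softmax over A is taken along the class axis, which is what a Conv2D softmax activation does in Keras but may not be exactly what the paper does.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
H, W, C = 4, 4, 10                       # height, width, number of categories

att_logits = rng.normal(size=(H, W, C))  # stand-in for f(att) output before softmax
S = rng.normal(size=(H, W, C))           # stand-in for linear classifier scores

A = softmax(att_logits, axis=-1)         # attention map, one channel per category
M = S * A                                # element-wise product of scores and attention
y_hat = M.mean(axis=(0, 1))              # global average pooling over locations
probs = softmax(y_hat)                   # per-category prediction fed to the loss

print(M.shape, probs.shape)
```

M keeps the spatial layout, so it can be passed on as weighted feature maps, while probs is the score vector that gets compared with the labels.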
Implementation
The following code was tested on Colab. It shows how to implement channel-wise attention and element-wise attention, and how to build and train a simple model based on your code, with DenseNet121 and without scale-wise attention:
import tensorflow as tf
import numpy as np
ALPHA = 1/16
C = 10
D = 128
class ChannelWiseAttention(tf.keras.layers.Layer):
    def __init__(self):
        super(ChannelWiseAttention, self).__init__()
        # squeeze
        self.gap = tf.keras.layers.GlobalAveragePooling2D()
        # excitation
        self.fc0 = tf.keras.layers.Dense(int(ALPHA * D), use_bias=False, activation=tf.nn.relu)
        self.fc1 = tf.keras.layers.Dense(D, use_bias=False, activation=tf.nn.sigmoid)
        # reshape so we can do channel-wise multiplication
        self.rs = tf.keras.layers.Reshape((1, 1, D))

    def call(self, inputs):
        # calculate channel-wise attention vector
        z = self.gap(inputs)
        u = self.fc0(z)
        u = self.fc1(u)
        u = self.rs(u)
        return u * inputs

class ElementWiseAttention(tf.keras.layers.Layer):
    def __init__(self):
        super(ElementWiseAttention, self).__init__()
        # f(att)
        self.conv0 = tf.keras.layers.Conv2D(512,
                                            kernel_size=1,
                                            strides=1, padding='same',
                                            use_bias=True, activation=tf.nn.relu)
        self.conv1 = tf.keras.layers.Conv2D(512,
                                            kernel_size=3,
                                            strides=1, padding='same',
                                            use_bias=True, activation=tf.nn.relu)
        self.conv2 = tf.keras.layers.Conv2D(C,
                                            kernel_size=1,
                                            strides=1, padding='same',
                                            use_bias=False, activation=tf.keras.activations.softmax)
        # linear classifier
        self.linear = tf.keras.layers.Conv2D(C,
                                             kernel_size=1,
                                             strides=1, padding='same',
                                             use_bias=True, activation=None)
        # used to compute the score vector for training the element-wise attention module
        self.gap = tf.keras.layers.GlobalAveragePooling2D()
        self.sfm = tf.keras.layers.Softmax()

    def call(self, inputs):
        # f(att)
        a = self.conv0(inputs)
        a = self.conv1(a)
        a = self.conv2(a)
        # confidence score
        s = self.linear(inputs)
        # element-wise multiply to prevent unnecessary attention
        m = s * a
        # used to minimize the weighted cross-entropy loss
        y_hat = self.gap(m)
        # could also use sigmoid as in the paper
        out = self.sfm(y_hat)
        return m, out

(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = np.expand_dims(x_train, axis=-1)
x_train = x_train.astype('float32') / 255
x_train = tf.image.resize(x_train, [32,32]) # if we want to resize
y_train = tf.keras.utils.to_categorical(y_train , num_classes=10)
# Model
input = tf.keras.Input(shape=(32,32,1))
efnet = tf.keras.applications.DenseNet121(weights=None,
include_top = False,
input_tensor = input)
xca = ChannelWiseAttention()(efnet.get_layer("conv3_block1_0_bn").output)
m, output = ElementWiseAttention()(xca)
# bind all
func_model = tf.keras.Model(efnet.input, output)
func_model.compile(
loss = tf.keras.losses.CategoricalCrossentropy(),
metrics = tf.keras.metrics.CategoricalAccuracy(),
optimizer = tf.keras.optimizers.Adam())
# fit
func_model.fit(x_train, y_train, batch_size=64, epochs=3, verbose = 1)
PS: Serendipitously, I also answered another question of yours related to this paper a few months back:
How to place custom layer inside a in-built pre trained model?

Question About Dropout Layer and Batch Normalization Layer in DNN model

I have some questions about the Dropout layer and the BatchNormalization layer. Basically, I have made a simple DNN structure with Dropout and BatchNormalization layers and trained it, and that works fine.
The simple structure of DNN model for example:
from tensorflow import keras
from tensorflow.keras import layers
model = keras.Sequential([
    layers.Dense(10, activation='relu', input_shape=[11]),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(8, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(6, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1,activation='softmax'),
])
model.compile(
    optimizer='adam',
    loss='mae',
)
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=100,
    verbose=0,
)
But now I would like to use the trained model's weights and biases from all layers in my custom prediction model (forget about the other way).
# Predictions for test
test_logits_1 = tf.matmul(tf_test_dataset, weights_1) + biases_1
test_relu_1 = tf.nn.relu(test_logits_1)
test_logits_2 = tf.matmul(test_relu_1, weights_2) + biases_2
test_relu_2 = tf.nn.relu(test_logits_2)
test_logits_3 = tf.matmul(test_relu_2, weights_3) + biases_3
test_relu_3 = tf.nn.relu(test_logits_3)
test_logits_4 = tf.matmul(test_relu_3, weights_4) + biases_4
test_prediction = tf.nn.softmax(test_logits_4)
Now the question is: do I need to add the dropout layer, the batch normalization layer and the batch size to the prediction model? If yes, why, and how do I extract all the details of the layers and use them in my custom prediction model?
@Dr. Snoopy, thanks for pointing out that BatchNormalization has parameters, but to my knowledge they are not the normalization weights (weights being normalized), based on what I was able to deduce from the docs and a little research.
The doc says the following(quoted text below) and based on the description it is clear that beta and gamma values are trainable variables which tallies with the output from tensorflow.
During training (i.e. when using fit() or when calling the layer/model with the argument training=True), the layer normalizes its output using the mean and standard deviation of the current batch of inputs. That is to say, for each channel being normalized, the layer returns (batch - mean(batch)) / (var(batch) + epsilon) * gamma + beta, where:
epsilon is small constant (configurable as part of the constructor arguments)
gamma is a learned scaling factor (initialized as 1), which can be disabled by passing scale=False to the constructor.
beta is a learned offset factor (initialized as 0), which can be disabled by passing center=False to the constructor.
But that is not the end of the story as the model summary indicates more parameters than the number of parameters beta and gamma comprise of.
A factor of 4 can be observed here i.e. the number of parameters in a BatchNormalization layer are 4 times the input shape the layer operates on.
These additional parameters are moving_mean and moving_variance values which can be seen in the following output
Coming back to the original question and concern of OP, "What parameters should i worry about?", the parameters that are needed for inference are moving_mean, moving_variance, beta, and gamma values.
The way to use these values/parameters is again easily deducible from the docs which I quote here again-
During inference (i.e. when using evaluate() or predict() or when calling the layer/model with the argument training=False (which is the default), the layer normalizes its output using a moving average of the mean and standard deviation of the batches it has seen during training. That is to say, it returns (batch - self.moving_mean) / (self.moving_var + epsilon) * gamma + beta.
self.moving_mean and self.moving_var are non-trainable variables that are updated each time the layer in called in training mode, as such:
moving_mean = moving_mean * momentum + mean(batch) * (1 - momentum)
moving_var = moving_var * momentum + var(batch) * (1 - momentum)
As such, the layer will only normalize its inputs during inference after having been trained on data that has similar statistics as the inference data.
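The two update rules quoted above can be tried out in NumPy. The sketch below is only an illustration: the batches are synthetic, the momentum value 0.99 is the Keras default as far as I know, and the starting values (zeros for the mean, ones for the variance) are assumed to match the layer's default initializers.

```python
import numpy as np

momentum = 0.99
moving_mean = np.zeros(3)   # assumed initial value
moving_var = np.ones(3)     # assumed initial value

rng = np.random.default_rng(0)
for _ in range(5):
    # Synthetic training batch: 256 samples, 3 features, mean 2.0, std 0.5.
    batch = rng.normal(loc=2.0, scale=0.5, size=(256, 3))
    moving_mean = moving_mean * momentum + batch.mean(axis=0) * (1 - momentum)
    moving_var = moving_var * momentum + batch.var(axis=0) * (1 - momentum)

print(moving_mean, moving_var)
```

After only a few batches the moving statistics are still close to their initial values, which is why the layer needs enough training steps before its inference-time normalization is meaningful.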
So assuming the moving_mean, moving_variance, beta, and gamma values are available for every BatchNormalization layer, I think the following piece of code needs to be added after the first activation-
# epsilon is just to avoid ZeroDivisionError, so the default value should be okay
test_BN_1 = (test_relu_1 - moving_mean_1) / (moving_var_1 + epsilon_1) * gamma_1 + beta_1
EDIT:
Turns out that the documentation seems to be wrong but the implementation seems to be right based on what I could deduce from the source code on github.
If you follow these links, you'll see that in the call method of the BatchNormalization class here https://github.com/keras-team/keras/blob/master/keras/layers/normalization.py#L1227 the calculation is actually done by the keras backend normalization function batch_normalization here https://github.com/keras-team/keras/blob/35146d00b44ca645fbf4ad0b007faa07632c6f9e/keras/backend.py#L2963. The backend function's docstring seems to agree with what is mentioned in the reference paper and the picture you've posted.
So that means you should divide by the square root of (variance + epsilon), not by the variance itself.
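Putting the correction together, a custom prediction path could look like the NumPy sketch below. It assumes the four arrays have already been read from each trained BatchNormalization layer (in Keras, layer.get_weights() usually returns gamma, beta, moving_mean and moving_variance in that order, but check the variable names to be sure), and that epsilon is the layer's configured value (1e-3 by default, as far as I know). The function names are made up for this example. Dropout needs no parameters at all, because it is only active during training.

```python
import numpy as np

def batchnorm_inference(x, gamma, beta, moving_mean, moving_var, epsilon=1e-3):
    # Inference-time batch normalization: divide by sqrt(var + epsilon).
    return (x - moving_mean) / np.sqrt(moving_var + epsilon) * gamma + beta

def dropout_inference(x):
    # Dropout is the identity at inference time.
    return x

# Toy values, chosen so the result is easy to check by hand.
x = np.array([[1.0, 2.0], [3.0, 4.0]])
gamma = np.array([1.0, 2.0])
beta = np.array([0.0, 1.0])
moving_mean = np.array([2.0, 3.0])
moving_var = np.array([1.0, 4.0])

out = batchnorm_inference(dropout_inference(x), gamma, beta,
                          moving_mean, moving_var, epsilon=0.0)
print(out)
```

In the question's network each BatchNormalization follows a Dropout, so in the custom prediction code this normalization step would go right after each ReLU output. The batch size used for training does not enter the prediction at all.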

Keras: Share a layer of weights across Training Examples (Not between layers)

The problem is the following. I have a categorical prediction task of vocabulary size 25K. On one of them (input vocab 10K, output dim i.e. embedding 50), I want to introduce a trainable weight matrix for a matrix multiplication between the input embedding (shape 1,50) and the weights (shape(50,128)) (no bias) and the resulting vector score is an input for a prediction task along with other features.
The crux is, I think that the trainable weight matrix varies for each input, if I simply add it in. I want this weight matrix to be common across all inputs.
I should clarify - by input here I mean training examples. So all examples would learn some example specific embedding and be multiplied by a shared weight matrix.
After every so many epochs, I intend to do a batch update to learn these common weights (or use other target variables to do multiple output prediction)
LSTM? Is that something I should look into here?
With the exception of an Embedding layer, layers apply to all examples in the batch.
Take as an example a very simple network:
inp = Input(shape=(4,))
h1 = Dense(2, activation='relu', use_bias=False)(inp)
out = Dense(1)(h1)
model = Model(inp, out)
This is a simple network with 1 input layer, 1 hidden layer and an output layer. Take the hidden layer as an example: this layer has a weight matrix of shape (4, 2). At each iteration the input data, which is a matrix of shape (batch_size, 4), is multiplied by the hidden layer weights (feed-forward phase). Thus the h1 activation depends on all samples. The loss is also computed on a per-batch basis. The output of the output layer has shape (batch_size, 1). Given that in the forward phase all the batch samples were multiplied by the same weights, the same is true for backprop and the gradient updates.
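A quick way to convince yourself that one weight matrix serves every example is to do the hidden-layer computation by hand. The NumPy sketch below uses random stand-in values for the inputs and for the (4, 2) weight matrix, and compares the batched product with a row-by-row loop.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 8

X = rng.normal(size=(batch_size, 4))  # a batch of training examples
W = rng.normal(size=(4, 2))           # the single, shared hidden-layer weight matrix

# Batched forward pass of Dense(2, activation='relu', use_bias=False).
H = np.maximum(X @ W, 0.0)

# Same computation, one example at a time, always using the same W.
row_by_row = np.stack([np.maximum(x @ W, 0.0) for x in X])

print(H.shape, np.allclose(H, row_by_row))
```

There is no per-example copy of W anywhere: the batch dimension only appears in the data, never in the layer's parameters.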
When dealing with text, the problem is often specified as predicting a specific label from a sequence of words. This is modelled as a shape of (batch_size, sequence_length, word_index). Let's take a very basic example:
from tensorflow import keras
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model
sequence_length = 80
emb_vec_size = 100
vocab_size = 10_000
def make_model():
    inp = Input(shape=(sequence_length, 1))
    emb = Embedding(vocab_size, emb_vec_size)(inp)
    emb = Reshape((sequence_length, emb_vec_size))(emb)
    h1 = Dense(64)(emb)
    recurrent = LSTM(32)(h1)
    output = Dense(1)(recurrent)
    model = Model(inp, output)
    model.compile('adam', 'mse')
    return model
model = make_model()
model.summary()
You can copy and paste this into colab and see the summary.
What this example is doing is:
- Transforming a sequence of word indices into a sequence of word embedding vectors.
- Applying a Dense layer called h1 to all the batches (and all the elements in the sequence); this layer reduces the dimensions of the embedding vector. It is not a typical element of a network to process text (in isolation). But this seemed to match your question.
- Using a recurrent layer to reduce the sequence into a single vector per example.
- Predicting a single label from the "sentence" vector.
If I get the problem correctly you can reuse layers or even models inside another model.
Example with a Dense layer. Let's say you have 10 Inputs
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
# defining 10 inputs in a list, each with shape (X,); X is a placeholder for your feature dimension
inputs = [Input(shape = (X,),name='input_{}'.format(k)) for k in range(10)]
# defining a common Dense layer
D = Dense(64, name='one_layer_to_rule_them_all')
nets = [D(inp) for inp in inputs]
model = Model(inputs = inputs, outputs = nets)
model.compile(optimizer='adam', loss='categorical_crossentropy')
This code is not going to work if the inputs have different shapes. The first call to D defines its properties. In this example, outputs are set directly to nets. But of course you can concatenate, stack, or whatever you want.
Now if you have some trainable model you can use it instead of the D:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
# defining 10 inputs in a list, each with shape (X,); X is a placeholder for your feature dimension
inputs = [Input(shape = (X,),name='input_{}'.format(k)) for k in range(10)]
# defining a shared model with the same weights for all inputs
nets = [special_model(inp) for inp in inputs]
model = Model(inputs = inputs, outputs = nets)
model.compile(optimizer='adam', loss='categorical_crossentropy')
The weights of this model are shared among all inputs.

Keras: zero division error

I'm trying to get the activation values for each layer in this baseline autoencoder built using Keras, since I want to add a sparsity penalty to the loss function based on the Kullback-Leibler (KL) divergence, as shown here, p. 14.
In this scenario, I'm going to calculate the KL divergence for each layer and then sum all of them with the main loss function, e.g. mse.
I therefore made a script in Jupyter where I do that but all the time, when I try to compile I get ZeroDivisionError: integer division or modulo by zero.
This is the code
import numpy as np
from keras.layers import Conv2D, Activation
from keras.models import Sequential
from keras import backend as K
from keras import losses
x_train = np.random.rand(128,128).astype('float32')
kl = K.placeholder(dtype='float32')
beta = K.constant(value=5e-1)
p = K.constant(value=5e-2)
# encoder
model = Sequential()
model.add(Conv2D(filters=16,kernel_size=(4,4),padding='same',
name='encoder',input_shape=(128,128,1)))
model.add(Activation('relu'))
# get the average activation
A = K.mean(x=model.output)
# calculate the value for the KL divergence
kl = K.concatenate([kl, losses.kullback_leibler_divergence(p, A)],axis=0)
# decoder
model.add(Conv2D(filters=1,kernel_size=(4,4),padding='same', name='encoder'))
model.add(Activation('relu'))
B = K.mean(x=model.output)
kl = K.concatenate([kl, losses.kullback_leibler_divergence(p, B)],axis=0)
Here seems the cause
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py in _normalize_axis(axis, ndim)
989 else:
990 if axis is not None and axis < 0:
991 axis %= ndim <----------
992 return axis
993
so there might be something wrong in the mean calculation. If I print the value I get
Tensor("Mean_10:0", shape=(), dtype=float32)
that is quite strange because the weights and the biases are non-zero initialised. Thus, there might be something wrong in the way of getting the activation values either.
I really would not know how to fix it; I'm not much of a skilled programmer.
Could anyone help me in understanding where I'm wrong?
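For reference, the two lines highlighted in the traceback can be reproduced in isolation. The function below is a stand-alone copy of that snippet, not the Keras source itself. A plausible reading (an assumption, not something confirmed in this thread) is that K.mean without an axis argument returns a 0-dimensional tensor, which matches the printed shape=(), and that a later reduction with a negative axis on that scalar ends up computing axis %= 0.

```python
def normalize_axis(axis, ndim):
    # Stand-alone copy of the lines shown in the traceback.
    if axis is not None and axis < 0:
        axis %= ndim
    return axis

# A negative axis on a 4-D tensor is mapped to a valid index.
ok = normalize_axis(-1, 4)

# The same call on a scalar (ndim == 0) divides by zero.
try:
    normalize_axis(-1, 0)
    raised = False
except ZeroDivisionError:
    raised = True

print(ok, raised)
```

So the error says nothing about zero-valued weights or activations; it is about the number of dimensions of the tensor being reduced.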
First, you shouldn't be doing calculations outside layers. The model must keep track of all calculations.
If you need a specific calculation to be done in the middle of the model, you should use a Lambda layer.
If you need that a specific output be used in the loss function, you should split your model for that output and do calculations inside a custom loss function.
Here, I used Lambda layer to calculate the mean, and a customLoss to calculate the kullback-leibler divergence.
import numpy as np
from keras.layers import *
from keras.models import Model
from keras import backend as K
from keras import losses
x_train = np.random.rand(128,128).astype('float32')
kl = K.placeholder(dtype='float32') #you'll probably not need this anymore, since losses will be treated individually in each output.
beta = K.constant(value=5e-1)
p = K.constant(value=5e-2)
# encoder
inp = Input((128,128,1))
lay = Convolution2D(filters=16,kernel_size=(4,4),padding='same', name='encoder',activation='relu')(inp)
#apply the mean using a lambda layer:
intermediateOut = Lambda(lambda x: K.mean(x),output_shape=(1,))(lay)
# decoder
finalOut = Convolution2D(filters=1,kernel_size=(4,4),padding='same', name='decoder',activation='relu')(lay)
#but from that, let's also calculate a mean output for loss:
meanFinalOut = Lambda(lambda x: K.mean(x),output_shape=(1,))(finalOut)
#Now, you have to create a model taking one input and those three outputs:
splitModel = Model(inp,[intermediateOut,meanFinalOut,finalOut])
And finally, compile your model with your custom loss function (we will define that later). But since I don't know if you're actually using the final output (not mean) for training, I'll suggest creating one model for training and another for predicting:
trainingModel = Model(inp,[intermediateOut,meanFinalOut])
trainingModel.compile(...,loss=customLoss)
predictingModel = Model(inp,finalOut)
#you don't need to compile the predicting model since you're only training the trainingModel
#both will share the same weights, you train one, and predict in the other
Our custom loss function should then deal with the kullback.
def customLoss(p,mean):
    return  # your own Kullback-Leibler expression (I don't know how it works, but maybe Keras' one can be used with single values?)
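As one possible way to fill in that expression, the sketch below implements the KL divergence between two Bernoulli distributions, which is the form commonly used as a sparsity penalty in sparse autoencoders. Whether this is exactly the formula on the page the question links to is an assumption, and the function name kl_sparsity is made up. It is written in NumPy to show the arithmetic; inside a Keras loss the same expression would have to be written with backend ops.

```python
import numpy as np

def kl_sparsity(p, p_hat, eps=1e-7):
    # KL(p || p_hat) for Bernoulli distributions with means p and p_hat.
    # p is the target sparsity, p_hat the observed mean activation.
    p_hat = np.clip(p_hat, eps, 1.0 - eps)  # avoid log(0) and division by zero
    return p * np.log(p / p_hat) + (1.0 - p) * np.log((1.0 - p) / (1.0 - p_hat))

same = kl_sparsity(0.05, 0.05)  # zero penalty when the target is met
far = kl_sparsity(0.05, 0.5)    # positive penalty when activations are too dense
print(same, far)
```

The penalty is zero when the mean activation equals the target p and grows as it moves away, which is the behaviour a sparsity term needs; it would typically be multiplied by the weight beta before being added to the reconstruction loss.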
Alternatively, if you want a single loss function to be called instead of two:
summedMeans = Add()([intermediateOut,meanFinalOut])
trainingModel = Model(inp, summedMeans)
