Beta-Variational AutoEncoder can't disentangle - python

I am working on a dummy example with generated heartbeats. I want to first use a VAE to encode the heartbeats and then train a simple classifier on the encodings.
The problem is that when I increase beta above 0.01, the reconstructions become nonsense (see the first image).
When beta is low, I get a normal autoencoder output with no disentanglement (second image).
I believe the problem may be in my KL divergence or VAE loss function, but I can't seem to find it.
In my encoder I do the reparameterization as follows:
enc = self.encoder(x,batch_size, x_lenghts)
mu = self.enc2mean(enc)
logv = self.enc2logv(enc)
std = torch.exp(0.5*logv)
z = torch.randn([batch_size,1, self.encoder_hidden_sizes[-1] * (int(self.bidirectional)+1)]).to(self.device)
z = z * std + mu
And I define the VAE loss as:
def VAE_loss(x, reconstruction, mu, logvar, batch_size, latent_dim, beta=0):
    mse = F.mse_loss(x, reconstruction)
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    KLD /= (batch_size * latent_dim)
    return mse + beta*KLD
Full standalone code to reproduce the results is here.
Any insights are appreciated!
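One thing that may be worth checking is how the two loss terms are scaled relative to each other. The sketch below uses NumPy rather than PyTorch, with made-up shapes and a hypothetical helper `kl_per_sample` that is not part of the question's code. It computes the closed-form KL term per sample (summed over the latent dimensions) and compares the batch average with the value normalized by `batch_size * latent_dim`, as in `VAE_loss`.

```python
import numpy as np

def kl_per_sample(mu, logvar):
    """Closed-form KL(N(mu, exp(logvar)) || N(0, I)), summed over the latent axis."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=-1)

rng = np.random.default_rng(0)
batch_size, latent_dim = 4, 8          # illustrative sizes, not from the question
mu = rng.normal(size=(batch_size, latent_dim))
logvar = rng.normal(scale=0.1, size=(batch_size, latent_dim))

kl = kl_per_sample(mu, logvar)                             # shape (batch_size,)
kl_batch_mean = kl.mean()                                  # averaged over the batch only
kl_as_in_question = kl.sum() / (batch_size * latent_dim)   # also averaged over latent dims

print(kl_batch_mean, kl_as_in_question)
```

The two values differ by a factor of `latent_dim`. Since `F.mse_loss` averages over every element by default, the effective weight of beta depends on how many elements each term is averaged over, so a beta that looks small may still be a strong or weak constraint depending on these sizes.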


Why is softmax classifier gradient divided by batch size (CS231n)?

In the CS231n notes, in the section Computing the Analytic Gradient with Backpropagation, which first implements a softmax classifier, the gradient from (softmax + log loss) is divided by the batch size (the number of examples used in one cycle of forward cost calculation and backward propagation during training).
Please help me understand why it needs to be divided by the batch size.
The chain rule to get the gradient should be below. Where should I incorporate the division?
[Image: derivative of the softmax loss function]
N = 100 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8') # class labels
#Train a Linear Classifier
# initialize parameters randomly
W = 0.01 * np.random.randn(D,K)
b = np.zeros((1,K))
# some hyperparameters
step_size = 1e-0
reg = 1e-3 # regularization strength
# gradient descent loop
num_examples = X.shape[0]
for i in range(200):
    # evaluate class scores, [N x K]
    scores = np.dot(X, W) + b
    # compute the class probabilities
    exp_scores = np.exp(scores)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True) # [N x K]
    # compute the loss: average cross-entropy loss and regularization
    correct_logprobs = -np.log(probs[range(num_examples),y])
    data_loss = np.sum(correct_logprobs)/num_examples
    reg_loss = 0.5*reg*np.sum(W*W)
    loss = data_loss + reg_loss
    if i % 10 == 0:
        print("iteration %d: loss %f" % (i, loss))
    # compute the gradient on scores
    dscores = probs
    dscores[range(num_examples),y] -= 1
    dscores /= num_examples # <---------------------- Why?
    # backpropagate the gradient to the parameters (W,b)
    dW = np.dot(X.T, dscores)
    db = np.sum(dscores, axis=0, keepdims=True)
    dW += reg*W # regularization gradient
    # perform a parameter update
    W += -step_size * dW
    b += -step_size * db
It's because you are averaging the gradients instead of taking the sum of all the gradients directly.
You could of course skip the division by the batch size, but dividing has several advantages. The main reason is that it acts as a sort of regularization (to avoid overfitting): with smaller gradients the weights cannot grow out of proportion.
The normalization also makes experiments with different batch sizes comparable (how could I compare the performance of two batches if the gradient magnitude depended on the batch size?).
If you divide the summed gradients by the batch size, it can be useful to work with larger learning rates to make training faster.
This answer in the Cross Validated community is quite useful.
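A quick way to see where the division comes from: the loss in the question is the average of the per-example losses, so its gradient carries the same 1/N factor. The sketch below is a minimal check on made-up toy data (the helper names `mean_softmax_loss`, `analytic_dW` and `numeric_dW` are hypothetical, and regularization is left out). It compares the analytic gradient, including the division, with a finite-difference gradient of the average loss.

```python
import numpy as np

def mean_softmax_loss(W, b, X, y):
    """Average cross-entropy loss over the batch (no regularization)."""
    scores = X @ W + b
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(X.shape[0]), y].mean()

def analytic_dW(W, b, X, y):
    """Gradient of the average loss w.r.t. W, including the division by N."""
    n = X.shape[0]
    scores = X @ W + b
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    dscores = probs.copy()
    dscores[np.arange(n), y] -= 1
    dscores /= n                      # the step the question asks about
    return X.T @ dscores

def numeric_dW(W, b, X, y, eps=1e-6):
    """Central finite differences of the average loss for each entry of W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (mean_softmax_loss(W_plus, b, X, y)
                     - mean_softmax_loss(W_minus, b, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))          # 12 toy examples, 2 features
y = rng.integers(0, 3, size=12)       # 3 classes
W = 0.01 * rng.normal(size=(2, 3))
b = np.zeros((1, 3))

max_diff = np.abs(analytic_dW(W, b, X, y) - numeric_dW(W, b, X, y)).max()
print(max_diff)
```

Without the `dscores /= n` line the analytic result would be the gradient of the summed loss, which is N times larger than the gradient of the loss that is actually reported.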
I came to notice that the dot in dW = np.dot(X.T, dscores), the gradient with respect to W, is a sum (Σ) over the num_examples instances. Because dscores, which starts out as the probabilities (softmax output), is divided by num_examples earlier, I did not realize that this was the normalization for the dot and sum steps later in the code. I now understand that dividing by num_examples is required (although it may still work without the normalization if the learning rate is tuned accordingly).
I believe the code below explains it better.
# compute the gradient on scores
dscores = probs
dscores[range(num_examples),y] -= 1
# backpropagate the gradient to the parameters (W,b)
dW = np.dot(X.T, dscores) / num_examples
db = np.sum(dscores, axis=0, keepdims=True) / num_examples
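Dividing `dscores` before the dot product and dividing `dW` and `db` afterwards should give the same result, because the division is a constant factor that commutes with the sums. A small sketch on random toy values (shapes and seed are arbitrary) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(1)
num_examples, D, K = 6, 2, 3
X = rng.normal(size=(num_examples, D))
y = rng.integers(0, K, size=num_examples)
probs = rng.random(size=(num_examples, K))
probs /= probs.sum(axis=1, keepdims=True)     # rows sum to 1, like softmax output

dscores = probs.copy()
dscores[np.arange(num_examples), y] -= 1

# Variant 1: divide dscores first, as in the course notes.
dW_before = np.dot(X.T, dscores / num_examples)
db_before = np.sum(dscores / num_examples, axis=0, keepdims=True)

# Variant 2: divide after the dot product and the sum.
dW_after = np.dot(X.T, dscores) / num_examples
db_after = np.sum(dscores, axis=0, keepdims=True) / num_examples

print(np.allclose(dW_before, dW_after), np.allclose(db_before, db_after))
```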

Variational Auto Encoder produces the same picture as the input

I'm trying to train a CVAE, convolutional variational auto encoder, to generate new pictures of human faces.
I'm using the same loss function, training step function, generating function etc. as in the official TensorFlow CVAE Tutorial. I'll post them below.
The only thing I've changed is the encoder and decoder network and the input and output shapes (because I want 128x128 RGB pictures, while the tutorial deals with 28x28 MNIST pictures).
The problem I encounter is that every picture generated (by generate_and_save_images()) is not a new (variational) one, but a one-to-one copy of a picture that exists in the dataset. (My training dataset has 518 pictures of faces.)
So generating new pictures doesn't work; the same pictures are simply recreated one-to-one.
Why? And how to fix it?
def sample(self, eps=None):
    if eps is None:
        eps = tf.random.normal(shape=(100, self.latent_dim))
    return self.decode(eps, apply_sigmoid=True)
def encode(self, x):
    mean, logvar = tf.split(self.encoder(x), num_or_size_splits=2, axis=1)
    return mean, logvar
def reparameterize(self, mean, logvar):
    eps = tf.random.normal(shape=mean.shape)
    return eps * tf.exp(logvar * .5) + mean
def decode(self, z, apply_sigmoid=False):
    logits = self.decoder(z)
    if apply_sigmoid:
        probs = tf.sigmoid(logits)
        return probs
    return logits
def log_normal_pdf(sample, mean, logvar, raxis=1):
    log2pi = tf.math.log(2. * np.pi)
    return tf.reduce_sum(
        -.5 * ((sample - mean) ** 2. * tf.exp(-logvar) + logvar + log2pi),
        axis=raxis)
def compute_loss(model, x):
    mean, logvar = model.encode(x)
    z = model.reparameterize(mean, logvar)
    x_logit = model.decode(z)
    cross_ent = tf.nn.sigmoid_cross_entropy_with_logits(logits=x_logit, labels=x)
    logpx_z = -tf.reduce_sum(cross_ent, axis=[1, 2, 3])
    logpz = log_normal_pdf(z, 0., 0.)
    logqz_x = log_normal_pdf(z, mean, logvar)
    return -tf.reduce_mean(logpx_z + logpz - logqz_x)
def train_step(model, x, optimizer):
    """Executes one training step and returns the loss.
    This function computes the loss and gradients, and uses the latter to
    update the model's parameters.
    """
    with tf.GradientTape() as tape:
        loss = compute_loss(model, x)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
# keeping the random vector constant for generation (prediction) so
# it will be easier to see the improvement.
random_vector_for_generation = tf.random.normal(
    shape=[num_examples_to_generate, latent_dim])
def generate_and_save_images(model, epoch, test_sample):
    mean, logvar = model.encode(test_sample)
    z = model.reparameterize(mean, logvar)
    predictions = model.sample(z)
    _ = plt.figure(figsize=(20, 20))
    for i in range(predictions.shape[0]):
        plt.subplot(4, 4, i + 1)
        plt.imshow(predictions[i, :, :, :])
    # tight_layout minimizes the overlap between 2 sub-plots
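One observation that may help: in the code shown, generate_and_save_images() encodes test_sample and decodes the resulting z, so its output is a reconstruction of existing images, while random_vector_for_generation is never passed to the model. The NumPy sketch below contrasts the two paths. It uses hypothetical fixed linear maps (`W_enc`, `W_dec`) as stand-ins for the trained networks, so it only illustrates the data flow, not the actual model. In the TensorFlow code, the second path would roughly correspond to calling model.sample() with no argument or with a vector drawn from the prior.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 5
# Hypothetical stand-ins for the trained encoder and decoder networks.
W_enc = rng.normal(size=(data_dim, 2 * latent_dim))
W_dec = rng.normal(size=(latent_dim, data_dim))

def encode(x):
    mean, logvar = np.split(x @ W_enc, 2, axis=1)
    return mean, logvar

def reparameterize(mean, logvar):
    eps = rng.normal(size=mean.shape)
    return eps * np.exp(0.5 * logvar) + mean

def decode(z):
    return 1.0 / (1.0 + np.exp(-(z @ W_dec)))   # sigmoid over the logits

test_sample = rng.random(size=(4, data_dim))

# Reconstruction path: z depends on the input images.
z_posterior = reparameterize(*encode(test_sample))
reconstructions = decode(z_posterior)

# Generation path: z is drawn from the prior N(0, I); no input image is involved.
z_prior = rng.normal(size=(4, latent_dim))
new_samples = decode(z_prior)

print(reconstructions.shape, new_samples.shape)
```

This may not be the whole explanation. With only 518 training images, a large decoder could also memorize the dataset, which would be a separate issue.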

keras variational loss function scale

I am very new to neural networks and TensorFlow. Recently I have been reading up on Keras implementations of the variational autoencoder, and I found two versions of the loss function:
def vae_loss(x, x_decoded_mean):
    recon_loss = original_dim * objectives.mse(x, x_decoded_mean)
    kl_loss = - 0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
    return recon_loss + kl_loss
def vae_loss(x, x_decoded_mean):
    recon_loss = objectives.mse(x, x_decoded_mean)
    kl_loss = - 0.5 * K.mean(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
    return recon_loss + kl_loss
If my understanding is correct, version 1 is a sum of the loss and version 2 is the mean loss across all samples in the same batch. So does the scale of the loss affect the learning result? I tried testing them out, and it largely affects the scale of my latent variables. Why is this, and which form of the loss function is correct?
Update to my question:
If I multiply the KL loss by original_dim,
def vae_loss(x, x_decoded_mean):
    xent_loss = original_dim * objectives.binary_crossentropy(x, x_decoded_mean)
    kl_loss = - 0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1) * original_dim
    return xent_loss + kl_loss
the latent distribution looks like this:
[image]
and the decoded output looks like this:
[image]
It looks like the encoder output does not contain any information. I am using the MNIST dataset, and the example from
Summing versus averaging the loss for each example in a batch will simply scale all loss terms proportionally. An equivalent change would be adjusting the learning rate. The important thing is that your typical loss magnitude multiplied by your learning rate does not lead to unstable learning.
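For the two specific versions in the question, it may also help to check how the reconstruction and KL terms are weighted against each other, since the first version sums over feature and latent dimensions while the second averages over them. The NumPy sketch below works on a single made-up sample and assumes that objectives.mse averages the squared error over the last axis; the latent size of 2 is an assumption as well.

```python
import numpy as np

rng = np.random.default_rng(0)
original_dim, latent_dim = 784, 2      # MNIST-sized input; latent size is assumed
x = rng.random(original_dim)
x_decoded = rng.random(original_dim)
z_mean = rng.normal(size=latent_dim)
z_log_var = rng.normal(size=latent_dim)

mse = np.mean((x - x_decoded) ** 2)               # mean over the feature axis
kl_terms = 1 + z_log_var - z_mean**2 - np.exp(z_log_var)
kl_sum = -0.5 * np.sum(kl_terms)

loss_v1 = original_dim * mse + kl_sum             # version 1: sums
loss_v2 = mse + (-0.5 * np.mean(kl_terms))        # version 2: means

# Rescale version 2 so that its reconstruction term matches version 1.
loss_v2_rescaled = original_dim * loss_v2
kl_weight_ratio = original_dim / latent_dim       # relative weight of KL in version 2

print(loss_v1, loss_v2_rescaled, kl_weight_ratio)
```

Under these assumptions, version 2 is version 1 with the KL term weighted `original_dim / latent_dim` times more heavily, on top of an overall scale factor. A much stronger KL weight would be consistent with the latent codes collapsing toward the prior, as in the update where the KL term is multiplied by original_dim.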

Large WGAN-GP train loss

This is the loss function of WGAN-GP
gen_sample = model.generator(input_gen)
disc_real = model.discriminator(real_image, reuse=False)
disc_fake = model.discriminator(gen_sample, reuse=True)
disc_concat = tf.concat([disc_real, disc_fake], axis=0)
# Gradient penalty
alpha = tf.random_uniform(
    shape=[BATCH_SIZE, 1, 1, 1])
differences = gen_sample - real_image
interpolates = real_image + (alpha * differences)
gradients = tf.gradients(model.discriminator(interpolates, reuse=True), [interpolates])[0] # why [0]
slopes = tf.sqrt(tf.reduce_sum(tf.square(gradients), reduction_indices=[1]))
gradient_penalty = tf.reduce_mean((slopes-1.)**2)
d_loss_real = tf.reduce_mean(disc_real)
d_loss_fake = tf.reduce_mean(disc_fake)
disc_loss = -(d_loss_real - d_loss_fake) + LAMBDA * gradient_penalty
gen_loss = - d_loss_fake
This is the training loss
The generator loss oscillates, and its value is very large.
My question is:
Is this generator loss normal or abnormal?
One thing to note is that your gradient penalty calculation is wrong. The following line:
slopes = tf.sqrt(tf.reduce_sum(tf.square(gradients), reduction_indices=[1]))
should actually be:
slopes = tf.sqrt(tf.reduce_sum(tf.square(gradients), reduction_indices=[1,2,3]))
You are reducing over axis 1 only, but the gradient has the shape of an image batch (as the shape of alpha shows), so you have to reduce over the axes [1,2,3] to get one norm per example.
Another issue in your code is the generator loss, which should be:
gen_loss = d_loss_real - d_loss_fake
For the gradient calculation this makes no difference, because the generator's parameters only appear in d_loss_fake. For the value of the generator loss, however, it makes all the difference in the world, and it is the reason why the loss oscillates this much.
At the end of the day you should look at your actual performance metric you care about to determine the quality of your GAN like the inception score or the Fréchet Inception Distance (FID), because the loss of discriminator and generator are only mildly descriptive.
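The effect of the reduction axes can be seen from the shapes alone. The following NumPy sketch uses a random array in place of the real gradients and assumes an NHWC layout with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, height, width, channels = 4, 8, 8, 3   # illustrative NHWC shape
gradients = rng.normal(size=(batch_size, height, width, channels))

# Reducing over axis 1 only leaves one value per (example, column, channel).
slopes_axis1 = np.sqrt(np.sum(np.square(gradients), axis=1))

# Reducing over axes 1, 2, 3 gives one gradient norm per example.
slopes_per_example = np.sqrt(np.sum(np.square(gradients), axis=(1, 2, 3)))

gradient_penalty = np.mean((slopes_per_example - 1.0) ** 2)

print(slopes_axis1.shape, slopes_per_example.shape)
```

The penalty is meant to push the norm of the whole per-example gradient toward 1, which is what the second reduction computes.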

CS231n: How to calculate gradient for Softmax loss function?

I am watching some videos for Stanford CS231n: Convolutional Neural Networks for Visual Recognition, but I do not quite understand how to calculate the analytical gradient of the softmax loss function using numpy.
From this stackexchange answer, the softmax gradient is calculated as:
$\nabla_{w_j} L = \frac{1}{N} \sum_i x_i \left( p(y_i = j) - \mathbb{1}\{y_i = j\} \right)$
The Python implementation of the above is:
num_classes = W.shape[0]
num_train = X.shape[1]
for i in range(num_train):
    for j in range(num_classes):
        p = np.exp(f_i[j])/sum_i
        dW[j, :] += (p-(j == y[i])) * X[:, i]
Could anyone explain how the above snippet works? The detailed implementation of softmax is also included below.
def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops)
    Inputs:
    - W: C x D array of weights
    - X: D x N array of data. Data are D-dimensional columns
    - y: 1-dimensional array of length N with labels 0...K-1, for K classes
    - reg: (float) regularization strength
    Returns:
    a tuple of:
    - loss as single float
    - gradient with respect to weights W, an array of same size as W
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)
    # Compute the softmax loss and its gradient using explicit loops. #
    # Store the loss in loss and the gradient in dW. If you are not careful #
    # here, it is easy to run into numeric instability. Don't forget the #
    # regularization! #
    # Get shapes
    num_classes = W.shape[0]
    num_train = X.shape[1]
    for i in range(num_train):
        # Compute vector of scores
        f_i = W.dot(X[:, i]) # in R^{num_classes}
        # Normalization trick to avoid numerical instability, per
        log_c = np.max(f_i)
        f_i -= log_c
        # Compute loss (and add to it, divided later)
        # L_i = - f(x_i)_{y_i} + log \sum_j e^{f(x_i)_j}
        sum_i = 0.0
        for f_i_j in f_i:
            sum_i += np.exp(f_i_j)
        loss += -f_i[y[i]] + np.log(sum_i)
        # Compute gradient
        # dw_j = 1/num_train * \sum_i[x_i * (p(y_i = j)-Ind{y_i = j} )]
        # Here we are computing the contribution to the inner sum for a given i.
        for j in range(num_classes):
            p = np.exp(f_i[j])/sum_i
            dW[j, :] += (p-(j == y[i])) * X[:, i]
    # Compute average
    loss /= num_train
    dW /= num_train
    # Regularization
    loss += 0.5 * reg * np.sum(W * W)
    dW += reg*W
    return loss, dW
Not sure if this helps, but:
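The term subtracted from the probability is really the indicator function $\mathbb{1}\{y_i = j\}$, as described here. This forms the expression (j == y[i]) in the code.
Also, the gradient of the loss with respect to the weights is:
$\nabla_{w_j} L_i = \left( p_j - \mathbb{1}\{y_i = j\} \right) x_i$
which is the origin of the X[:,i] in the code.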
I know this is late but here's my answer:
I'm assuming you are familiar with the cs231n Softmax loss function.
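We know that:
$L_i = -f_{y_i} + \log \sum_j e^{f_j}$
So just as we did with the SVM loss function, the gradients are as follows:
$\nabla_{w_{y_i}} L_i = (p_{y_i} - 1)\, x_i$ for the correct class, and $\nabla_{w_j} L_i = p_j\, x_i$ for $j \neq y_i$, where $p_j = e^{f_j} / \sum_k e^{f_k}$.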
Hope that helped.
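To connect the formula with the loops, the sketch below compares the double loop from the question with an equivalent matrix form, (P - Y) X^T / N, where P holds the class probabilities and Y the one-hot labels. The function names and the random test data are made up for illustration, and the regularization term is left out of both versions.

```python
import numpy as np

def softmax_grad_loops(W, X, y):
    """Data-term gradient using explicit loops (W: C x D, X: D x N)."""
    num_classes, num_train = W.shape[0], X.shape[1]
    dW = np.zeros_like(W)
    for i in range(num_train):
        f_i = W.dot(X[:, i])
        f_i -= np.max(f_i)                         # stability shift
        sum_i = np.sum(np.exp(f_i))
        for j in range(num_classes):
            p = np.exp(f_i[j]) / sum_i
            dW[j, :] += (p - (j == y[i])) * X[:, i]
    return dW / num_train

def softmax_grad_vectorized(W, X, y):
    """Same gradient as a single matrix product: (P - Y) X^T / N."""
    num_train = X.shape[1]
    scores = W.dot(X)                              # C x N
    scores -= scores.max(axis=0, keepdims=True)
    P = np.exp(scores)
    P /= P.sum(axis=0, keepdims=True)              # column i holds p(. | x_i)
    P[y, np.arange(num_train)] -= 1                # subtract the indicator
    return P.dot(X.T) / num_train

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))        # 3 classes, 4 features
X = rng.normal(size=(4, 10))       # 10 examples as columns
y = rng.integers(0, 3, size=10)

grad_loops = softmax_grad_loops(W, X, y)
grad_vec = softmax_grad_vectorized(W, X, y)
print(np.allclose(grad_loops, grad_vec))
```

Row j of the result accumulates x_i weighted by (p_j - 1) when j is the correct class of example i and by p_j otherwise, which is what the inner loop over j does one row at a time.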
A supplement to this answer with a small example.
I came across this post and still was not 100% clear how to arrive at the partial derivatives.
For that reason I took another approach to get to the same results - maybe it is helpful to others too.
