I'm trying to optimize two models in an alternating fashion using PyTorch. The first is a neural network that changes the representation of my data, i.e. a map f(x) on my input data x, parameterized by some weights W. The second is a Gaussian mixture model operating on the f(x) points, i.e. clustering in the neural-network representation space rather than in the input space. I am optimizing the GMM using expectation-maximization, so its parameter updates are analytically derived rather than obtained by gradient descent.
I have two loss functions here: the first is a function of the distances ||f(x) - f(y)||, and the second is the loss function of the Gaussian mixture model (i.e. how 'clustered' everything looks in the NN representation space). What I want to do is take a step in the NN optimization using both of the above loss functions (since it depends on both), and then do an expectation-maximization step for the GMM. The code looks like this (I have removed a lot since there is a ton of code):
data, labels = load_dataset()
net = NeuralNetwork()
net_optim = torch.optim.Adam(net.parameters(), lr=0.05, weight_decay=1)
# initialize weights, means, and covariances for the Gaussian clusters
concentrations, means, covariances, precisions = initialization(net.forward_one(data))

for i in range(1000):
    net_optim.zero_grad()

    pairs, pair_labels = pairGenerator(data, labels)  # samples some pairs of datapoints
    outputs = net(pairs[:, 0, :], pairs[:, 1, :])  # computes pairwise distances
    net_loss = NeuralNetworkLoss(outputs, pair_labels)  # loss function based on pairwise dist.

    embedding = net.forward_one(data)  # embeds all data in the NN space
    log_prob, log_likelihoods = expectation_step(embedding, means, precisions, concentrations)
    concentrations, means, covariances, precisions = maximization_step(embedding, log_likelihoods)
    gmm_loss = GMMLoss(log_likelihoods, log_prob, precisions, concentrations)

    net_loss.backward(retain_graph=True)
    gmm_loss.backward(retain_graph=True)
    net_optim.step()
Essentially, this is what is happening:
1. Sample some pairs of points from the dataset
2. Push the pairs of points through the NN and compute the network loss based on those outputs
3. Embed all datapoints using the NN and perform a clustering EM step in that embedding space
4. Compute the variational loss (ELBO) based on the clustering parameters
5. Update the neural network parameters using both the variational loss and the network loss
However, to perform step (5), I am required to add the flag retain_graph=True, otherwise I get the error:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
It seems like having two loss functions means that I need to retain the computational graph?
I am not sure how to work around this: with retain_graph=True, by around iteration 400 each iteration is taking ~30 minutes to complete. Does anyone know how I might fix this? I apologize in advance, I am still very new to automatic differentiation.
I would recommend doing
total_loss = net_loss + gmm_loss
total_loss.backward()
Note that the gradient of net_loss with respect to the GMM weights is 0, so summing the losses won't have any unintended effect.
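Plugged into your loop, it would look roughly like this (a minimal sketch reusing the helpers from your own snippet; the EM step and the GMM updates are unchanged):

for i in range(1000):
    net_optim.zero_grad()

    pairs, pair_labels = pairGenerator(data, labels)
    outputs = net(pairs[:, 0, :], pairs[:, 1, :])
    net_loss = NeuralNetworkLoss(outputs, pair_labels)

    embedding = net.forward_one(data)
    log_prob, log_likelihoods = expectation_step(embedding, means, precisions, concentrations)
    concentrations, means, covariances, precisions = maximization_step(embedding, log_likelihoods)
    gmm_loss = GMMLoss(log_likelihoods, log_prob, precisions, concentrations)

    # single backward pass over the summed loss; no retain_graph needed
    total_loss = net_loss + gmm_loss
    total_loss.backward()
    net_optim.step()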
Here is a good thread on the PyTorch forum regarding retain_graph: https://discuss.pytorch.org/t/what-exactly-does-retain-variables-true-in-loss-backward-do/3508/24
Related
I need to calculate the Aitchison distance as a loss function between input and output datasets.
While calculating this metric I need to compute a geometric mean over each row (where the tensor inside the loss has size [batches x features]).
In the simplest case, imagine there is only one batch, so I just need to compute one geometric mean for the input and one for the output dataset.
How could this be done in TensorFlow? I didn't find any suitable metric or reduce functions.
You can easily calculate the geometric mean of a tensor as a loss function (or, in your case, as part of the loss function) in TensorFlow using the numerically stable formula highlighted here. The code fragment below closely resembles the PyTorch solution posted here, which follows the above-mentioned formula (and the SciPy implementation).
from tensorflow.python.keras import backend as K

def gmean_loss(y_true, y_pred, dim=1):
    # geometric mean via the numerically stable identity exp(mean(log(x)))
    error = y_pred - y_true
    logx = K.log(error)
    return K.exp(K.mean(logx, axis=dim))
You can define dim according to your needs or integrate it into your code.
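For completeness, here is a rough sketch of how a row-wise geometric mean could feed into an Aitchison-style distance via the centered log-ratio (clr) transform. This assumes strictly positive compositional inputs, and the function names are illustrative rather than part of any library:

from tensorflow.python.keras import backend as K

def clr(x, dim=1):
    # centered log-ratio: log(x) minus the row-wise mean of log(x),
    # i.e. log(x / geometric_mean(x))
    logx = K.log(x)
    return logx - K.mean(logx, axis=dim, keepdims=True)

def aitchison_distance(y_true, y_pred, dim=1):
    # Euclidean distance between the clr-transformed rows
    return K.sqrt(K.sum(K.square(clr(y_true, dim) - clr(y_pred, dim)), axis=dim))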
I am currently working on a continuous state-action space problem using policy gradient methods.
The environment's action space is defined as ratios that have to sum to 1 at each timestep. Hence, a Gaussian policy doesn't seem suitable in this case.
What I did instead is tweak the softmax policy (to make sure the policy network's output sums to 1), but I had a hard time determining the loss function to use, and eventually its gradient, in order to update the network parameters.
So far, I have tried a discounted return-weighted Mean Squared Error, but the results aren't satisfactory.
Are there any other policies that can be used in this particular case? Or are there any ideas on which loss function to use?
Here is the implementation of my policy network (inside my agent class) in TensorFlow.
def policy_network(self):
    self.input = tf.placeholder(tf.float32,
                                shape=[None, self.input_dims],
                                name='input')
    self.label = tf.placeholder(tf.float32, shape=[None, self.n_actions], name='label')
    # discounted return
    self.G = tf.placeholder(tf.float32, shape=[None, ], name='G')

    with tf.variable_scope('layers'):
        l1 = tf.layers.dense(
            inputs=self.input,
            units=self.l1_size,
            activation=tf.nn.relu,
            kernel_initializer=tf.contrib.layers.xavier_initializer())
        l2 = tf.layers.dense(
            inputs=l1,
            units=self.l2_size,
            activation=tf.nn.relu,
            kernel_initializer=tf.contrib.layers.xavier_initializer())
        l3 = tf.layers.dense(
            inputs=l2,
            units=self.n_actions,
            activation=None,
            kernel_initializer=tf.contrib.layers.xavier_initializer())
        self.actions = tf.nn.softmax(l3, name='actions')

    with tf.variable_scope('loss'):
        base_loss = tf.reduce_sum(tf.square(self.actions - self.label))
        loss = base_loss * self.G

    with tf.variable_scope('train'):
        self.train_op = tf.train.AdamOptimizer(self.lr).minimize(loss)
Off the top of my head, you may want to try a 2D Gaussian, i.e. a multivariate Gaussian. https://en.wikipedia.org/wiki/Gaussian_function
For example, you could predict the 4 parameters (x_0, x_1, sigma_0, sigma_1) of a 2D Gaussian, from which you could sample a pair of numbers on the 2D Gaussian plane, say (2, 1.5); then you could use softmax to produce the desired action, softmax([2, 1.5]) = [0.62245933, 0.37754067].
Then you could calculate the probability of that pair of numbers under the 2D Gaussian, which you could then use to compute the negative log probability, advantage, etc., to build the loss function and update the gradient.
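A minimal PyTorch sketch of that idea (the tiny linear policy head and the state size are placeholders, not from the question; note the log-probability here is taken for the raw Gaussian sample, before the softmax is applied):

import torch
import torch.nn.functional as F

policy_net = torch.nn.Linear(8, 4)   # hypothetical head producing [x_0, x_1, s_0, s_1]
state = torch.randn(8)

params = policy_net(state)
mu, sigma = params[:2], F.softplus(params[2:])   # keep the sigmas positive
dist = torch.distributions.Normal(mu, sigma)

raw = dist.sample()                   # a pair of numbers on the 2D-Gaussian plane
action = F.softmax(raw, dim=-1)       # ratios that sum to 1
log_prob = dist.log_prob(raw).sum()   # plug into -log_prob * advantage, etc.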
Have you thought of using the Dirichlet distribution? Your network can output concentration parameters alpha > 0, and you can then use them to generate a sample that sums to one. Both PyTorch and TF support this distribution, and you can both sample from it and get the log-probability of a sample. In this case, in addition to getting your sample, since it is a probability distribution, you can also get a sense of its variance, which can be a measure of the agent's confidence. For a 3-dimensional action, alpha = {1, 1, 1} basically means your agent doesn't have any preference, whereas alpha = {100, 1, 1} would imply that it is very certain most of the weight should go to the first dimension.
Edit based on the comment:
Vanilla REINFORCE will have a hard time optimizing the policy when you use a Dirichlet distribution. The problem is that, in vanilla policy gradient, you control how fast your policy changes in the network parameter space through gradient clipping, adaptive learning rates, etc. However, what matters most is controlling the rate of change in probability space: some network parameters change the probabilities much more than others. Therefore, even though you limit the delta of your network parameters via the learning rate, you may still change the variance of your Dirichlet distribution a lot, which makes sense from the network's point of view: in order to maximize the log-probability of your actions, the network might focus more on reducing the variance than on shifting the mode of the distribution, which later hurts both exploration and learning a meaningful policy. One way to alleviate this problem is to limit the rate of change of the policy distribution by constraining the KL divergence between the new and old policy distributions. TRPO and PPO are two ways to address this issue and solve the resulting constrained optimization problem.
It is also probably good to make sure that in practice alpha > 1. You can achieve this easily by applying softplus, ln(1 + exp(x)) + 1, to your neural network outputs before feeding them into your Dirichlet distribution. Also monitor the gradients reaching your layers and make sure they exist.
You may also want to add the entropy of the distribution to your objective function to ensure enough exploration and to prevent distributions with very low variance (very high alphas).
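A minimal sketch of such a Dirichlet policy head, shown in PyTorch for brevity (the layer sizes, state dimension, and the REINFORCE-style loss are illustrative assumptions, not taken from the question):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DirichletPolicy(nn.Module):
    def __init__(self, input_dims, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(input_dims, hidden), nn.ReLU(),
                                  nn.Linear(hidden, n_actions))

    def forward(self, state):
        # softplus(x) + 1 keeps every concentration parameter alpha > 1
        alpha = F.softplus(self.body(state)) + 1.0
        return torch.distributions.Dirichlet(alpha)

# usage in a REINFORCE-style update; G stands in for the discounted return
policy = DirichletPolicy(input_dims=8, n_actions=3)
state = torch.randn(1, 8)
G = torch.tensor([1.0])

dist = policy(state)
action = dist.sample()  # ratios that sum to 1 along the last dimension
# negative log-prob weighted by the return, plus an entropy bonus for exploration
loss = -(dist.log_prob(action) * G).mean() - 0.01 * dist.entropy().mean()
loss.backward()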
I'm working on a binary semantic segmentation task where one class covers only a very small part of any input image, so only a few pixels carry that label. When using sparse_softmax_cross_entropy, the overall error is easily decreased by ignoring this class. I'm now looking for a way to weight the classes by a coefficient that penalizes misclassifications of this specific class more heavily than misclassifications of the other class.
The doc of the loss function states:
weights acts as a coefficient for the loss. If a scalar is provided, then the loss is simply scaled by the given value. If weights is a tensor of shape [batch_size], then the loss weights apply to each corresponding sample.
If I understand this correctly, it says that specific samples in a batch get weighted differently from others. But this is not what I'm looking for. Does anyone know how to implement a weighted version of this loss function where the weights scale the importance of a specific class rather than of individual samples?
To answer my own question:
The authors of the U-Net paper used a pre-computed weight map to handle imbalanced classes.
The Institute for Astronomy at ETH Zurich provides a TensorFlow-based U-Net package which contains a weighted version of the softmax cross entropy (not the sparse one; they flatten their labels and logits first):
# one coefficient per class, e.g. [0.1, 0.9] for background/foreground
class_weights = tf.constant(np.array(class_weights, dtype=np.float32))
# per-pixel weight = weight of that pixel's ground-truth class
weight_map = tf.multiply(flat_labels, class_weights)
weight_map = tf.reduce_sum(weight_map, axis=1)
# unweighted per-pixel cross entropy
loss_map = tf.nn.softmax_cross_entropy_with_logits_v2(logits=flat_logits, labels=flat_labels)
# scale each pixel's loss by its class weight, then average
weighted_loss = tf.multiply(loss_map, weight_map)
loss = tf.reduce_mean(weighted_loss)
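If you would rather keep the sparse variant, a minimal sketch of the same idea could look like this. The placeholder shapes and the weight values are arbitrary examples, not part of the package above: labels is assumed to hold integer class ids of shape [batch, height, width] and logits to have shape [batch, height, width, n_classes].

import tensorflow as tf

# hypothetical inputs: integer class ids and per-pixel logits for 2 classes
labels = tf.placeholder(tf.int32, shape=[None, None, None])
logits = tf.placeholder(tf.float32, shape=[None, None, None, 2])

# one coefficient per class, e.g. down-weight the dominant background class
class_weights = tf.constant([0.1, 0.9], dtype=tf.float32)

# per-pixel weight = weight of that pixel's ground-truth class
pixel_weights = tf.gather(class_weights, labels)

# tf.losses.sparse_softmax_cross_entropy scales each pixel's loss by `weights`
loss = tf.losses.sparse_softmax_cross_entropy(
    labels=labels, logits=logits, weights=pixel_weights)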
I've recently had to tweak a neural network. Here's how it works:
Given an image as input, several layers turn it into a mean matrix mu and a covariance matrix sigma.
Then a sample z is taken from the Gaussian distribution with parameters mu, sigma.
Several layers turn this sample into an output.
This output is compared to a given image, which gives a cost.
What I want to do is keep mu and sigma, take multiple samples z, propagate them through the rest of the NN, and compare the multiple images I get to a given image.
Note that the step z -> image output calls another package; I'd rather not have to dig into that...
What I did so far:
At first, I thought I did not need to go through all this hassle: with a batch size of one, it is as if I'm doing Monte Carlo by running the NN multiple times. But I actually need the network to try several images before updating the weights, and thus before changing mu and sigma.
So I simply sampled multiple z and propagated them through the net. But I soon discovered that I was duplicating all the layers, making the code terribly slow and, above all, preventing me from taking enough samples for the MC estimate I'm aiming at.
Of course, I updated the loss and data input classes to take that into account.
Do you have any ideas? Basically, I'd like an efficient, cost-effective way to run z -> output multiple times. I still have a lot to learn about TensorFlow and Keras, so I'm a little bit lost on how to do that. As usual, please forgive me if an answer already exists somewhere; I did my best to look for one myself!
OK, my question was a bit stupid. To avoid duplicating layers, I created multiple slice layers and then simply propagated each slice through the previously declared layers. Here's my code:
# First declare the layers once
a = layer_A()
b = layer_B()
# And so on ...

# Generate all Monte Carlo samples as one tensor
samples = generate_samples()([mu, sigma])

# For each Monte Carlo sample, slice it out and reuse the same layers
all_output = []
for i in range(mc_samples):
    # bind i as a default argument so each Lambda keeps its own slice index
    cur_sample = Lambda(lambda x, i=i: K.slice(x, (0, 0, 0, 2 * i), (-1, -1, -1, 2)),
                        name="slice-%i" % i)(samples)
    cur_output = a(cur_sample)
    cur_output = b(cur_output)
    all_output.append(cur_output)

output_of_net = keras.layers.concatenate(all_output)
return Model(inputs=inputs, outputs=output_of_net)
Simply loop over the last dimension in the loss function, average, and you're done ! A glimpse at my loss :
loss = 0
for i in range(mc_samples):
    loss += f(y_true[..., i], y_pred[..., i])
return loss / mc_samples
I have two datasets that contain 40000 samples. I want to calculate the Kullback-Leibler divergence between these two datasets in Python. Is there an efficient way of doing this?
Edit:
OK. I figured out it doesn't work in the input space. So the old explanation is probably wrong but I'll keep it anyway.
Here are my new thoughts:
In my senior project, I'm using an algorithm called AugMix. In that algorithm they calculate the Jensen-Shannon divergence between two augmented images, which is a symmetrized form of the KL divergence.
They used the model output as the probability distribution of the dataset. The idea is to fit a model to a dataset, then interpret the output of the model as the probability density function.
For example, suppose you fitted a model to a dataset without overfitting. Then (assuming this is a classification problem) you feed your logits (the output of the last layer) through the softmax function to get a probability for each class (sometimes the softmax is added as a layer at the end of the network, so be careful). The output of the softmax function (or layer) can be interpreted as P(Y|X_{1}), where X_{1} is the input sample and Y is the ground-truth class. Then you make a prediction for another sample X_{2}, giving P(Y|X_{2}), where X_{1} and X_{2} come from different datasets (say dataset_1 and dataset_2) and the model is not trained on either of them.
Then the KL divergence between dataset_1 and dataset_2 can be calculated by KL(dataset_1 || dataset_2) = P(Y|X_{1}) * log(P(Y|X_{1}) / P(Y|X_{2}))
Make sure that X_{1} and X_{2} belong to the same class.
I'm not sure if this is the correct way. Alternatively, you can train two different models (model_1 and model_2) on different datasets (dataset_1 and dataset_2) and then calculate the KL divergence between the predictions of those two models on the samples of a third dataset, dataset_3. In other words:
KL(dataset_1 || dataset_2) = sum x in dataset_3 model_1(x) * log(model_1(x) / model_2(x))
where model_1(x) is the softmax output of model_1 (trained on dataset_1 without overfitting) for the correct label.
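A rough sketch of that second idea (model_1, model_2, and dataset_3 are placeholders; both models are assumed to return softmax probability vectors):

import numpy as np

def model_based_kl(model_1, model_2, dataset_3, eps=1e-10):
    # sum of per-sample KL between the two models' predictive distributions,
    # following KL(dataset_1 || dataset_2) = sum over x in dataset_3 of
    # model_1(x) * log(model_1(x) / model_2(x))
    kl = 0.0
    for x in dataset_3:
        p = np.asarray(model_1(x))  # softmax output of the model trained on dataset_1
        q = np.asarray(model_2(x))  # softmax output of the model trained on dataset_2
        kl += np.sum(p * np.log((p + eps) / (q + eps)))
    return kl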
The latter sounds more reasonable to me, but I'm not sure about either of them. I could not find a proper answer on my own.
The things I'm going to explain below are adapted from Jason Brownlee's blog post on the KL divergence at machinelearningmastery.com.
As far as I understand, you first have to convert your datasets into probability distributions so that you can calculate the probability of each sample from the union (or intersection?) of the two datasets.
KL(P || Q) = sum x in X P(x) * log(P(x) / Q(x))
However, most of the time the intersection of the datasets is empty. For example, if you want to measure the divergence between CIFAR10 and ImageNet, there are no samples in common. The only way to calculate this metric directly is to sample from the same dataset to create two different datasets, so that there are samples present in both and the KL divergence can be calculated.
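For low-dimensional (say 1-D) data, one common workaround is to estimate P and Q by histogramming both datasets over a shared set of bins and then applying the formula above. A minimal sketch (the bin count and the epsilon used for empty bins are arbitrary choices, not from the original answer):

import numpy as np

def kl_divergence_1d(data_p, data_q, bins=50, eps=1e-10):
    # histogram both datasets over the same range so the bins line up
    lo = min(data_p.min(), data_q.min())
    hi = max(data_p.max(), data_q.max())
    p, _ = np.histogram(data_p, bins=bins, range=(lo, hi))
    q, _ = np.histogram(data_q, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    # KL(P || Q) = sum over x of P(x) * log(P(x) / Q(x))
    return np.sum(p * np.log((p + eps) / (q + eps)))

# example with two synthetic datasets of 40000 samples each
kl = kl_divergence_1d(np.random.normal(0, 1, 40000), np.random.normal(0.5, 1, 40000))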
Lastly, you may want to check out the Wasserstein distance, which is used in GANs to compare the source distribution and the target distribution.