How to calculate the Kullback-Leibler divergence of two datasets - python

I have two datasets that contain 40000 samples. I want to calculate the Kullback-Leibler divergence between these two datasets in python. Is there any efficient way of doing this in python?

Edit:
OK. I figured out it doesn't work in the input space. So the old explanation is probably wrong but I'll keep it anyway.
Here are my new thoughts:
In my senior project, I'm using an algorithm called AugMix. It calculates the Jensen-Shannon divergence, which is a symmetrized form of the KL divergence, between the model outputs for an image and its augmented versions.
They used the model output as the probability distribution of the dataset. The idea is to fit a model to a dataset, then interpret the output of the model as the probability density function.
For example, suppose you have fitted a model to a dataset without overfitting. Then (assuming this is a classification problem) you feed the logits (the output of the last layer) to the softmax function (careful: sometimes the softmax is already added as the last layer of the network). The output of the softmax function (or layer) can be interpreted as P(Y|X_{1}), where X_{1} is the input sample and Y is the ground-truth class. Then you make a prediction for another sample X_{2}, P(Y|X_{2}), where X_{1} and X_{2} come from different datasets (say dataset_1 and dataset_2) and the model was not trained on either of those datasets.
Then the KL divergence between dataset_1 and dataset_2 can be calculated by KL(dataset_1 || dataset_2) = P(Y|X_{1}) * log(P(Y|X_{1}) / P(Y|X_{2}))
Make sure that X_{1} and X_{2} belong to the same class.
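A minimal sketch of this idea, using only the standard library. The logits are made-up values, and summing over all classes (rather than using a single term) is my assumption about how the formula above is meant to be read. The Jensen-Shannon helper shows the symmetrized form mentioned for AugMix, restricted here to two distributions.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), summed over all classes; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical logits for one sample from each dataset (same class)
logits_x1 = [2.0, 0.5, -1.0]
logits_x2 = [1.0, 1.0, 0.0]

p = softmax(logits_x1)  # P(Y | X_1)
q = softmax(logits_x2)  # P(Y | X_2)

kl_pq = kl_divergence(p, q)  # KL is not symmetric: kl_pq != kl_qp in general
kl_qp = kl_divergence(q, p)
js = js_divergence(p, q)     # symmetric in p and q
```

To compare whole datasets rather than single samples, one option is to average these per-pair values over many same-class pairs.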
I'm not sure if this is the correct way. Alternatively, you can train two different models (model_1 and model_2) using different datasets (dataset_1 and dataset_2) and then calculate the KL divergence on the predictions of those two models using the samples of another dataset called dataset_3. In other words:
KL(dataset_1 || dataset_2) = sum x in dataset_3 model_1(x) * log(model_1(x) / model_2(x))
where model_1(x) is the softmax output of model_1, which is trained using dataset_1 without overfitting, for the correct label.
The latter sounds more reasonable to me, but I'm not sure about either of them. I could not find a proper answer on my own.
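The two-model variant can be sketched as follows. All predictions and labels here are hypothetical numbers, not real model outputs. Note that the literal formula (correct-label probability only) can come out negative, because those probabilities do not form a distribution over the samples; averaging the full per-sample KL over classes is an alternative I am assuming, and it is always non-negative.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over the classes of one sample."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical softmax outputs of model_1 and model_2 on three dataset_3 samples
preds_1 = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
preds_2 = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
labels = [0, 1, 2]  # hypothetical correct classes

# Literal reading of the formula: use only the correct-label probability
literal = sum(
    p1[y] * math.log(p1[y] / p2[y])
    for p1, p2, y in zip(preds_1, preds_2, labels)
)

# Alternative: average the full per-sample KL over dataset_3
mean_kl = sum(kl_divergence(p1, p2) for p1, p2 in zip(preds_1, preds_2)) / len(labels)
```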
The things I'm going to explain are adapted from Jason Brownlee's blog post on KL divergence at machinelearningmastery.com.
As far as I understand, you first have to convert your datasets into probability distributions, so that you can calculate the probability of each sample from the union (or intersection?) of both datasets.
KL(P || Q) = sum x in X P(x) * log(P(x) / Q(x))
However, most of the time the intersection of the datasets is empty. For example, if you want to measure the divergence between CIFAR10 and ImageNet, there are no samples in common. The only way you can calculate this metric is to sample from the same dataset to create two different datasets. That way you have samples that are present in both datasets, and you can calculate the KL divergence.
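For low-dimensional numeric data (not raw images), one way around the empty-intersection problem is to bin both datasets on a shared grid and compare the bin frequencies. The sketch below assumes 1-D samples; the bin count, the additive smoothing constant and the Gaussian test data are all choices I made for illustration.

```python
import math
import random

def histogram(samples, lo, hi, bins):
    """Count samples per equal-width bin on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for s in samples:
        idx = min(int((s - lo) / width), bins - 1)  # put s == hi in the last bin
        counts[idx] += 1
    return counts

def kl_from_samples(a, b, bins=50, alpha=1.0):
    """Estimate KL(a || b) from two 1-D samples using shared bins."""
    lo, hi = min(min(a), min(b)), max(max(a), max(b))
    ca, cb = histogram(a, lo, hi, bins), histogram(b, lo, hi, bins)
    # Additive smoothing so that no bin has zero probability
    pa = [(c + alpha) / (len(a) + alpha * bins) for c in ca]
    pb = [(c + alpha) / (len(b) + alpha * bins) for c in cb]
    return sum(p * math.log(p / q) for p, q in zip(pa, pb))

random.seed(0)
data_1 = [random.gauss(0.0, 1.0) for _ in range(40000)]
data_2 = [random.gauss(0.5, 1.0) for _ in range(40000)]

kl_same = kl_from_samples(data_1, data_1)  # identical data gives 0
kl_diff = kl_from_samples(data_1, data_2)  # shifted data gives a positive value
```

The estimate depends on the number of bins and on the smoothing, so it is only comparable between runs that use the same settings.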
Lastly, you may want to check the Wasserstein distance, which is used in GANs to compare the source distribution and the target distribution.
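As a rough illustration of the Wasserstein idea: for two 1-D samples of equal size, the 1-Wasserstein distance between their empirical distributions is the mean absolute difference of the sorted values. The function below is a sketch limited to that special case, not a general implementation.

```python
def wasserstein_1d(a, b):
    """1-Wasserstein distance between two equal-size 1-D samples."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles samples of equal size")
    # Sorting pairs up the i-th smallest value of a with that of b
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

shifted = wasserstein_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0])
identical = wasserstein_1d([0.0, 1.0, 2.0], [2.0, 0.0, 1.0])
```

Unlike KL divergence, this stays finite and meaningful even when the two samples share no values.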

Related

geometric mean while calculating tensorflow loss

I need to calculate Aitchison distance as a loss function between input and output datasets.
While calculating this metric I need to calculate the geometric mean of each row (where [batches x features] is the size of a dataset during the loss computation).
In the simple case we can imagine that there is only 1 batch, so I just need to calculate one geometric mean for the input and one for the output dataset.
So how can this be done in tensorflow? I didn't find any dedicated metrics or reduce functions.
You can easily calculate the geometric mean of a tensor as a loss function (or in your case as part of the loss function) with tensorflow using a numerically stable formula highlighted here. The provided code fragment closely resembles the pytorch solution posted here, which follows the abovementioned formula (and the scipy implementation).
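import tensorflow as tf
def gmean_loss(y_true, y_pred, dim=1):
    # The geometric mean needs positive inputs, so this uses the absolute error (an assumption);
    # the small constant keeps log() finite when the error is exactly 0.
    error = tf.abs(y_pred - y_true) + 1e-7
    logx = tf.math.log(error)
    # exp(mean(log(x))) is the numerically stable form of the geometric mean
    return tf.exp(tf.reduce_mean(logx, axis=dim))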
You can define dim according to your needs or integrate it into your code.
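To make the formula concrete outside of TensorFlow, here is a plain-Python sketch of the log-based geometric mean and of the Aitchison distance built on it. It uses the standard definition of the Aitchison distance as the Euclidean distance between centered log-ratio (clr) transforms; the function names are mine and all inputs are assumed to be strictly positive.

```python
import math

def geometric_mean(values):
    """exp(mean(log(x))): avoids the overflow of multiplying many values."""
    logs = [math.log(v) for v in values]
    return math.exp(sum(logs) / len(logs))

def clr(values):
    """Centered log-ratio transform of one composition (one row)."""
    g = geometric_mean(values)
    return [math.log(v / g) for v in values]

def aitchison_distance(x, y):
    """Euclidean distance between the clr transforms of x and y."""
    cx, cy = clr(x), clr(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cy)))

gm = geometric_mean([1.0, 4.0, 16.0])
d_scaled = aitchison_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
d_other = aitchison_distance([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Because the clr transform divides by the geometric mean, rescaling a row does not change the distance, which is why d_scaled is zero up to rounding.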

PyTorch: Is retain_graph=True necessary in alternating optimization?

I'm trying to optimize two models in an alternating fashion using PyTorch. The first is a neural network that is changing the representation of my data (i.e. a map f(x) on my input data x, parameterized by some weights W). The second is a Gaussian mixture model that is operating on the f(x) points, i.e. in the neural network space (rather than clustering points in the input space). I am optimizing the GMM using expectation maximization, so the parameter updates are analytically derived, rather than using gradient descent.
I have two loss functions here: the first is a function of the distances ||f(x) - f(y)||, and the second is the loss function of the Gaussian mixture model (ie how 'clustered' everything looks in the NN representation space). What I want to do is take a step in the NN optimization using both of the above loss functions (since it depends on both), and then do an expectation-maximization step for the GMM. The code looks like this (I have removed a lot since there is a ton of code):
data, labels = load_dataset()
net = NeuralNetwork()
net_optim = torch.optim.Adam(net.parameters(), lr=0.05, weight_decay=1)
# initialize weights, means, and covariances for the Gaussian clusters
concentrations, means, covariances, precisions = initialization(net.forward_one(data))
for i in range(1000):
    net_optim.zero_grad()
    pairs, pair_labels = pairGenerator(data, labels)  # samples some pairs of datapoints
    outputs = net(pairs[:, 0, :], pairs[:, 1, :])  # computes pairwise distances
    net_loss = NeuralNetworkLoss(outputs, pair_labels)  # loss function based on pairwise dist.
    embedding = net.forward_one(data)  # embeds all data in the NN space
    log_prob, log_likelihoods = expectation_step(embedding, means, precisions, concentrations)
    concentrations, means, covariances, precisions = maximization_step(embedding, log_likelihoods)
    gmm_loss = GMMLoss(log_likelihoods, log_prob, precisions, concentrations)
    net_loss.backward(retain_graph=True)
    gmm_loss.backward(retain_graph=True)
    net_optim.step()
Essentially, this is what is happening:
1. Sample some pairs of points from the dataset
2. Push pairs of points through the NN and compute network loss based on those outputs
3. Embed all datapoints using the NN and perform a clustering EM step in that embedding space
4. Compute variational loss (ELBO) based on clustering parameters
5. Update neural network parameters using both the variational loss and the network loss
However, to perform (5), I am required to add the flag retain_graph=True, otherwise I get the error:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
It seems like having two loss functions means that I need to retain the computational graph?
I am not sure how to work around this, as with retain_graph=True, around iteration 400, each iteration is taking ~30 minutes to complete. Does anyone know how I might fix this? I apologize in advance – I am still very new to automatic differentiation.
I would recommend doing
total_loss = net_loss + gmm_loss
total_loss.backward()
Note that the gradient of net_loss w.r.t. the GMM weights is 0, so summing the losses gives the same gradients as calling backward on each loss separately.
Here is a good thread on pytorch regarding the retain_graph. https://discuss.pytorch.org/t/what-exactly-does-retain-variables-true-in-loss-backward-do/3508/24

Tensorflow: Weighted sparse_softmax_cross_entropy for imbalanced classes across a single image

I'm working on a binary semantic segmentation task where one class is very rare across any input image, hence there are only a few pixels which are labeled. When using sparse_softmax_cross_entropy, the overall error is easily decreased by ignoring this class. Now I'm looking for a way to weight the classes by a coefficient which penalizes misclassifications of that specific class more heavily than those of the other class.
The doc of the loss function states:
weights acts as a coefficient for the loss. If a scalar is provided, then the loss is simply scaled by the given value. If weights is a tensor of shape [batch_size], then the loss weights apply to each corresponding sample.
If I understand this correctly, it says that specific samples in a batch get weighted differently compared to others. But this is actually not what I'm looking for. Does anyone know how to implement a weighted version of this loss function where the weights scale the importance of a specific class rather than of samples?
To answer my own question:
The authors of the U-Net paper used a pre-computed weight-map to handle imbalanced classes.
The Institute for Astronomy of ETH Zurich provided a Tensorflow-based U-Net package which contains a weighted version of the softmax function (not sparse, but they flatten their labels and logits first):
class_weights = tf.constant(np.array(class_weights, dtype=np.float32))
weight_map = tf.multiply(flat_labels, class_weights)
weight_map = tf.reduce_sum(weight_map, axis=1)
loss_map = tf.nn.softmax_cross_entropy_with_logits_v2(logits=flat_logits, labels=flat_labels)
weighted_loss = tf.multiply(loss_map, weight_map)
loss = tf.reduce_mean(weighted_loss)
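To see what the weight map does, here is a plain-Python sketch of the same computation on a handful of pixels. The probabilities, labels and class weights are made-up values, and integer labels are used instead of one-hot vectors to keep it short.

```python
import math

def weighted_cross_entropy(probs, labels, class_weights):
    """Mean cross-entropy where each pixel is weighted by its true class."""
    losses = []
    for p, y in zip(probs, labels):
        pixel_loss = -math.log(p[y])  # cross-entropy for this pixel
        losses.append(class_weights[y] * pixel_loss)  # weight chosen by true class
    return sum(losses) / len(losses)

# Hypothetical flattened softmax outputs for 4 pixels and 2 classes
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
labels = [0, 0, 0, 1]  # class 1 is the rare class

unweighted = weighted_cross_entropy(probs, labels, [1.0, 1.0])
weighted = weighted_cross_entropy(probs, labels, [1.0, 5.0])
```

With the larger weight on class 1, the single poorly predicted rare pixel contributes five times as much to the mean, so ignoring that class no longer gives a low loss.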

How to use multilayer perceptron and make predictions given a skew-distributed feature

I'm fairly new to deep learning and Keras, and this problem has bothered me for weeks. Hope I can get some hints from here.
Features:
I simulated two variables; each has 10k samples and follows a standard normal distribution: A ~ Norm(0, 1); B ~ Norm(0, 1).
Labels
And I derived two labels from the simulated variables: y1 = A * B; y2 = A / B.
Model
Input dimension: 2
Hidden layers: 4 dense layers, all of them were 32 neurons wide
Output layers: a dense layer with 1 neuron
Activation functions: ReLU for all the activation functions
Compiler: 'MSE' as the loss function, 'Adam' as the optimizer with learning rate at 1e-05
Tasks
Finally, I set up three tasks for MLP to learn:
(1) Use A, B to predict y1;
(2) Use A, B to predict y2;
(3) Use A, 1/B to predict y2
Validation
Use 'validation_split = 0.2' to verify the model
Results and Inference
It can reach MSE below 1 easily for both training and validation set after 10~15 epochs in task 1. However, I'll always get a very high loss like 30k+ on training loss for the other two tasks.
[update] I also evaluated the results by Pearson correlation coefficient, which returned ~0.7 for task 1 and <0.01 for task 2 and 3.
It's weird to me since the ideas of multiplication (y1) and division (y2) are mathematically the same. So I then tried to look into the distribution of 1/B, and I found that it has extremely long tails on each side. I suppose it might be the source of the difficulty, but I couldn't figure out any strategy for it. I also tried to normalize 1/B before the training but had no luck with it.
Any advice or comment is welcome. Can't find discussion on this either on web or books, really want to make some progress on it. Thank you.
y2 values have a much different distribution from y1 values, specifically, it will have values with much larger absolute values. This means that comparing the loss directly isn't really fair.
It's kinda like estimating the mass of a person vs. estimating the mass of a planet, and being upset that you're off by millions of pounds.
For an illustration, try calculating the loss on all three problems, but with an estimator that only ever guesses 0.0. I suspect that problem 1 will have much lower loss than the other two.
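The suggested experiment is quick to run without any model. The sketch below regenerates the data as described in the question (the seed is arbitrary) and computes the MSE of an estimator that always guesses 0.0.

```python
import random

random.seed(0)  # arbitrary seed, only for reproducibility
n = 10_000
a = [random.gauss(0.0, 1.0) for _ in range(n)]
b = [random.gauss(0.0, 1.0) for _ in range(n)]

y1 = [ai * bi for ai, bi in zip(a, b)]  # product of two standard normals
y2 = [ai / bi for ai, bi in zip(a, b)]  # ratio of two standard normals

def mse_of_zero_guess(targets):
    """MSE of an estimator that always predicts 0.0."""
    return sum(t * t for t in targets) / len(targets)

mse_y1 = mse_of_zero_guess(y1)
mse_y2 = mse_of_zero_guess(y2)
print(f"zero-guess MSE for y1: {mse_y1:.3f}")
print(f"zero-guess MSE for y2: {mse_y2:.3f}")
```

The ratio of two independent standard normals follows a Cauchy distribution, which has no finite variance, so the MSE for y2 is dominated by a few extreme values and is not comparable to the MSE for y1.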

Practical determination of anomaly threshold in (variational) autoencoders

Although not strictly a programming question, I haven't found anything about this topic on this site. I am currently dealing with (variational) autoencoders ((V)AE), and plan to deploy them to detect anomalies. For testing purposes, I've implemented a VAE in tensorflow for detecting handwritten digits.
The training went well and the reconstructed images are very similar to the originals. But for actually using the autoencoder, I have to use some kind of measure to determine if a new image fed to the autoencoder is a digit or not by comparing it to a threshold value.
At this point, I have two major questions:
1.) For training, I used a loss consisting of two components. First one is the reconstruction error, which is a crossentropy function:
# x: actual input
# x_hat: reconstructed input
epsilon = 1e-10 # <-- small number for numeric stability within log
recons_loss = -tf.reduce_sum(x * tf.log(epsilon + x_hat) + (1 - x) * tf.log(epsilon + 1 - x_hat),
                             axis=1)
The second one is KL-divergence, which is a measure of how similar two probability distributions are, as we are demanding that the latent variable space is a distribution similar to a Gaussian.
# z_mean: vector representing the means of the latent distribution
# z_log_var: vector representing the log-variances of the latent distribution
KL_div = -0.5 * tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var),
                              axis=1)
For determining the reconstruction error of a new image, do I have to use both parts of the training loss? Intuitively, I would say no and just go with recons_loss.
2.) How do I determine the threshold value? Is there already a tf functionality implemented that I can use?
If you have some good source for anything related, please share the link!
Thanks!
I had a similar problem recently. VAEs are very good at projecting high-dimensional data into a lower-dimensional latent space. Altering the latent vector and feeding it to the decoder part creates new samples.
I hope I understand your question correctly: you are trying to do anomaly detection with the encoder part, on the lower-dimensional latent space?
I guess you have trained your VAE on MNIST. What you can do is get the latent vectors of all MNIST digits and compare the latent vector of your new digit to them via Euclidean distance. The threshold would be a maximum distance set by you.
The code would be something like this:
from scipy.spatial import distance
x_mnist_encoded = encoder.predict(x_mnist, batch_size=batch_size)  # array of MNIST latent vectors
test_digit_encoded = encoder.predict(x_testdigit, batch_size=1)  # latent vector of your test digit
threshold = 0.3  # max Euclidean distance to still count as a digit
# the test image counts as a digit if it is close enough to any MNIST latent vector
is_digit = any(distance.euclidean(vector, test_digit_encoded[0]) <= threshold
               for vector in x_mnist_encoded)
VAE code is from https://blog.keras.io/building-autoencoders-in-keras.html
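On the question of how to pick the threshold value: one common heuristic (my assumption, not something the answer above proposes) is to score a held-out set of normal images and use a high percentile of those scores as the cut-off. The sketch below uses made-up reconstruction errors; the same idea works for latent-space distances.

```python
def percentile(values, q):
    """Linear-interpolation percentile, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1 - frac) + ordered[upper] * frac

def is_anomaly(error, threshold):
    """Flag a sample whose score exceeds the threshold."""
    return error > threshold

# Hypothetical per-image reconstruction errors on held-out normal digits
normal_errors = [0.8, 1.1, 0.9, 1.3, 1.0, 1.2, 0.7, 1.4, 1.05, 0.95]

# Accept roughly 95% of normal images; the 95 is a tunable choice
threshold = percentile(normal_errors, 95)

flag_far = is_anomaly(5.0, threshold)
flag_typical = is_anomaly(1.0, threshold)
```

The chosen percentile directly sets the expected false-alarm rate on normal data, so it should be picked according to how costly false alarms are in the application.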
