Although not strictly a programming question, I haven't found anything about this topic on this site. I'm currently dealing with (variational) autoencoders ((V)AE) and plan to deploy them to detect anomalies. For testing purposes, I've implemented a VAE in TensorFlow for detecting handwritten digits.
The training went well and the reconstructed images are very similar to the originals. But to actually use the autoencoder, I need some kind of measure that tells me whether a new image fed to the autoencoder is a digit or not, by comparing it to a threshold value.
At this point, I have two major questions:
1.) For training, I used a loss consisting of two components. The first one is the reconstruction error, which is a cross-entropy function:
# x: actual input
# x_hat: reconstructed input
epsilon = 1e-10 # <-- small number for numeric stability within log
recons_loss = -tf.reduce_sum(x * tf.log(epsilon + x_hat) + (1 - x) * tf.log(epsilon + 1 - x_hat),
                             axis=1)
The second one is the KL divergence, which measures how similar two probability distributions are, since we demand that the latent space follows a distribution close to a Gaussian.
# z_mean: vector representing the means of the latent distribution
# z_log_var: vector representing the variances of the latent distribution
KL_div = -0.5 * tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var),
                              axis=1)
For determining the reconstruction error of a new image, do I have to use both parts of the training loss? Intuitively, I would say no and just go with the recons_loss.
2.) How do I determine the threshold value? Is there already a tf functionality implemented that I can use?
If you have some good source for anything related, please share the link!
Thanks!
I had a similar problem recently. VAEs are very good at projecting high-dimensional data into a lower-dimensional latent space. Altering the latent vector and feeding it to the decoder part creates new samples.
I hope I got your question right: you are trying to do anomaly detection with the encoder part in the lower-dimensional latent space?
I guess you have trained your VAE on MNIST. What you can do is get the latent vectors of all MNIST digits and compare the latent vector of your new digit to them via Euclidean distance. The threshold would be a maximum distance set by you.
The code would be something like this:
from scipy.spatial import distance

x_mnist_encoded = encoder.predict(x_mnist, batch_size=batch_size)  # array of MNIST latent vectors
test_digit_encoded = encoder.predict(x_testdigit, batch_size=1)    # your test digit's latent vector

threshold = 0.3  # maximum Euclidean distance you allow

def is_known_digit(test_vector, reference_vectors, threshold):
    # the test sample counts as a digit if it lies within the threshold
    # distance of at least one MNIST latent vector
    for vector in reference_vectors:
        if distance.euclidean(vector, test_vector) <= threshold:
            return True
    return False

is_digit = is_known_digit(test_digit_encoded[0], x_mnist_encoded, threshold)
VAE code is from https://blog.keras.io/building-autoencoders-in-keras.html
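One possible way to pick the threshold instead of hand-tuning it (a sketch, not part of the original answer): look at the nearest-neighbour latent distances of held-out MNIST digits and take a high percentile as the cutoff. Here x_val_encoded is a placeholder name for such a held-out encoded set.

import numpy as np
from scipy.spatial import distance

# Sketch: derive the threshold from held-out MNIST latent vectors (x_val_encoded is hypothetical)
nn_dists = [min(distance.euclidean(v, w) for w in x_mnist_encoded) for v in x_val_encoded]
threshold = np.percentile(nn_dists, 95)  # distances above this are treated as anomalies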
Related
I am making a deep learning network that finds several points in 3d space.
The input is a stack of grayscale 1024 x 1024 images (the number of images varies from 5 to 20), and the output is a 64 x 64 x 64 space. Each voxel of the output is 0 or 1, but in my dataset there are only about 2000 ones, so it is hard to tell whether my network is being trained well by observing the training losses.
For example, if my network only spits out np.zeros((64, 64, 64)) as output, the accuracy would still be 1 - 2000/(64 x 64 x 64) ≈ 99.2%.
So I want to ask which deep learning network I should choose for finding a very small number of positives in a 3D space. The input size is (1024 x 1024 x #img) and the output size is (64 x 64 x 64). I am currently experimenting with a 2D U-Net-like net and a 3D U-Net-like net, with a ReLU-with-ceiling output activation.
Could somebody recommend anything to refer to? Thank you very much.
U-Net-like networks seem to be a good idea. Your problem does not come from the network itself, but from the loss and metrics you are using.
Indeed, if you use a binary cross-entropy loss and accuracy as the metric, your score will still be near 100% because of the imbalanced character of your classes.
I suggest that you use the Dice or Jaccard coefficient as the metric and/or loss (in that case the loss is 1 - Dice coefficient), and that you compute it only on the items of interest, not on the background.
Depending on the framework you are using, you should easily find an existing implementation of these metrics. Then modify the code to avoid calculation on the background.
For example for python/tensorflow, using your volumes:
from tensorflow.keras import backend as K

def dice_coef(y_true, y_pred, smooth=1):
    # flatten the 64 x 64 x 64 volumes into 1-D vectors
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    # one-hot encode into (background, foreground) channels
    y_true_f = K.one_hot(K.cast(y_true_f, 'uint8'), 2)
    y_pred_f = K.one_hot(K.cast(y_pred_f, 'uint8'), 2)
    # keep only the foreground channel so the background is ignored
    intersection = K.sum(y_true_f[:, 1:] * y_pred_f[:, 1:], axis=[-1])
    union = K.sum(y_true_f[:, 1:], axis=[-1]) + K.sum(y_pred_f[:, 1:], axis=[-1])
    dice = K.mean((2. * intersection + smooth) / (union + smooth), axis=0)
    return dice
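A short usage sketch of the "loss is 1 - Dice coefficient" idea above, assuming a Keras model object named model (the names here are illustrative, not from the original answer):

# Sketch: use the Dice coefficient both as the loss (1 - Dice) and as a monitoring metric
def dice_loss(y_true, y_pred):
    return 1.0 - dice_coef(y_true, y_pred)

model.compile(optimizer='adam', loss=dice_loss, metrics=[dice_coef])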
I need to calculate Aitchison distance as a loss function between input and output datasets.
While calculating this metric I need to compute the geometric mean of each row (the tensor has size [batches x features] during loss computation).
In the simplest case we could imagine there is only one batch, so I just need to calculate one geometric mean for the input dataset and one for the output dataset.
So how could this be done in TensorFlow? I didn't find any dedicated metric or reduce function for it.
You can easily calculate the geometric mean of a tensor as a loss function (or, in your case, as part of the loss function) with TensorFlow using a numerically stable formula highlighted here. The provided code fragment closely resembles the PyTorch solution posted here, which follows the above-mentioned formula (and the SciPy implementation).
from tensorflow.keras import backend as K

def gmean_loss(y_true, y_pred, dim=1):
    # geometric mean of the absolute errors via the stable exp(mean(log(x))) identity
    error = K.abs(y_pred - y_true) + K.epsilon()  # keep the argument of the log strictly positive
    return K.exp(K.mean(K.log(error), axis=dim))
You can define dim according to your needs or integrate it into your code.
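If you prefer plain TensorFlow ops, a minimal sketch of the same exp(mean(log(x))) identity computed per row of a [batches x features] tensor (assuming strictly positive entries; the epsilon guard is my addition):

import tensorflow as tf

# Sketch: numerically stable per-row geometric mean of a [batches x features] tensor
def row_geomean(x, epsilon=1e-10):
    return tf.exp(tf.reduce_mean(tf.math.log(x + epsilon), axis=1))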
I reused code from others to do head-pose prediction in Euler angles. The author trained a classification network that returns bin classification results for the three angles, i.e. yaw, roll and pitch. The number of bins is 66. They somehow convert the probabilities to the corresponding angle, as written from line 150 to 152 here. Could someone help explain the formula?
These are the relevant lines of code in the above file:
[56] model = hopenet.Hopenet(torchvision.models.resnet.Bottleneck, [3, 4, 6, 3], 66) # a variant of ResNet50
[80] idx_tensor = [idx for idx in xrange(66)]
[81] idx_tensor = torch.FloatTensor(idx_tensor).cuda(gpu)
[144] yaw, pitch, roll = model(img)
[146] yaw_predicted = F.softmax(yaw)
[150] yaw_predicted = torch.sum(yaw_predicted.data[0] * idx_tensor) * 3 - 99
If we look at the training code, and the authors' paper,* we see that the loss function is a sum of two losses:
a classification (cross-entropy) loss on the raw model output (a vector of scores, one for each bin category):
[144] yaw, pitch, roll = model(img)
a regression (mean-squared-error) loss on a linear combination of the bin predictions, i.e. the predicted continuous angle:
[146] yaw_predicted = F.softmax(yaw)
[150] yaw_predicted = torch.sum(yaw_predicted.data[0] * idx_tensor) * 3 - 99
Since 3 * sum_i(softmax(output)_i * i) - 99, i.e. the expectation over the 66 bin indices rescaled to degrees, is the final step used when training the regression loss (but is not explicitly part of the model's forward pass), the same transformation must be applied to the raw output at inference time to convert the vector of bin probabilities into a single angle prediction.
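As a minimal sketch of that conversion for a single yaw prediction (the random logits are only a stand-in for one row of the model output; the constants follow the repo's 66 bins of 3 degrees covering roughly -99 to +99 degrees):

import torch
import torch.nn.functional as F

idx_tensor = torch.arange(66, dtype=torch.float32)  # bin indices 0..65
yaw_logits = torch.randn(66)                        # stand-in for one row of the model's yaw output
probs = F.softmax(yaw_logits, dim=0)                # probability per bin
expected_bin = torch.sum(probs * idx_tensor)        # expectation over bin indices, a value in [0, 65]
yaw_degrees = expected_bin * 3 - 99                 # map bin index to degrees (3-degree bins, offset -99)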
*
3.2. The Multi-Loss Approach
All previous work which predicted head pose using convolutional networks regressed all three Euler angles directly using a mean squared error loss. We notice that this approach does not achieve the best results on our large-scale synthetic training data.
We propose to use three separate losses, one for each angle. Each loss is a combination of two components: a binned pose classification and a regression component. Any backbone network can be used and augmented with three fully-connected layers which predict the angles. These three fully-connected layers share the previous convolutional layers of the network.
The idea behind this approach is that by performing bin classification we use the very stable softmax layer and cross-entropy, thus the network learns to predict the neighbourhood of the pose in a robust fashion. By having three cross-entropy losses, one for each Euler angle, we have three signals which are backpropagated into the network, which improves learning. In order to obtain fine-grained predictions we compute the expectation of each output angle for the binned output. The detailed architecture is shown in Figure 2.
We then add a regression loss to the network, namely a mean-squared error loss, in order to improve fine-grained predictions. We have three final losses, one for each angle, and each is a linear combination of both the respective classification and the regression losses. We vary the weight of the regression loss in Section 4.4 and we hold the weight of the classification loss constant at 1. The final loss for each Euler angle is the following:
L = H(y, ŷ) + α · MSE(y, ŷ)
where H and MSE respectively designate the cross-entropy and mean squared error loss functions, and α is the weight of the regression loss.
I'm trying to optimize two models in an alternating fashion using PyTorch. The first is a neural network that changes the representation of my data (i.e. a map f(x) on my input data x, parameterized by some weights W). The second is a Gaussian mixture model that operates on the f(x) points, i.e. in the neural-network representation space (rather than clustering points in the input space). I am optimizing the GMM using expectation maximization, so the parameter updates are derived analytically rather than via gradient descent.
I have two loss functions here: the first is a function of the distances ||f(x) - f(y)||, and the second is the loss function of the Gaussian mixture model (i.e. how 'clustered' everything looks in the NN representation space). What I want to do is take a step in the NN optimization using both of the above loss functions (since it depends on both), and then do an expectation-maximization step for the GMM. The code looks like this (I have removed a lot since there is a ton of code):
data, labels = load_dataset()
net = NeuralNetwork()
net_optim = torch.optim.Adam(net.parameters(), lr=0.05, weight_decay=1)
# initialize weights, means, and covariances for the Gaussian clusters
concentrations, means, covariances, precisions = initialization(net.forward_one(data))
for i in range(1000):
    net_optim.zero_grad()
    pairs, pair_labels = pairGenerator(data, labels)  # samples some pairs of datapoints
    outputs = net(pairs[:, 0, :], pairs[:, 1, :])  # computes pairwise distances
    net_loss = NeuralNetworkLoss(outputs, pair_labels)  # loss function based on pairwise dist.
    embedding = net.forward_one(data)  # embeds all data in the NN space
    log_prob, log_likelihoods = expectation_step(embedding, means, precisions, concentrations)
    concentrations, means, covariances, precisions = maximization_step(embedding, log_likelihoods)
    gmm_loss = GMMLoss(log_likelihoods, log_prob, precisions, concentrations)
    net_loss.backward(retain_graph=True)
    gmm_loss.backward(retain_graph=True)
    net_optim.step()
Essentially, this is what is happening:
1. Sample some pairs of points from the dataset
2. Push pairs of points through the NN and compute network loss based on those outputs
3. Embed all datapoints using the NN and perform a clustering EM step in that embedding space
4. Compute variational loss (ELBO) based on clustering parameters
5. Update neural network parameters using both the variational loss and the network loss
However, to perform (5), I am required to add the flag retain_graph=True, otherwise I get the error:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
It seems like having two loss functions means that I need to retain the computational graph?
I am not sure how to work around this, as with retain_graph=True, around iteration 400, each iteration is taking ~30 minutes to complete. Does anyone know how I might fix this? I apologize in advance – I am still very new to automatic differentiation.
I would recommend doing
total_loss = net_loss + gmm_loss
total_loss.backward()
Note that the gradient of net_loss w.r.t. the GMM parameters is 0, so summing the losses won't have any unwanted effect.
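Concretely, the last three lines of the loop body from the question would become (a sketch, reusing the question's own variable names):

    total_loss = net_loss + gmm_loss  # gradients from both losses flow into net's parameters
    total_loss.backward()             # single backward pass; retain_graph=True is no longer needed
    net_optim.step()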
Here is a good thread on the PyTorch forums regarding retain_graph: https://discuss.pytorch.org/t/what-exactly-does-retain-variables-true-in-loss-backward-do/3508/24
I have two datasets that contain 40000 samples. I want to calculate the Kullback-Leibler divergence between these two datasets in Python. Is there any efficient way of doing this?
Edit:
OK. I figured out it doesn't work in the input space. So the old explanation is probably wrong but I'll keep it anyway.
Here are my new thoughts:
In my senior project, I'm using an algorithm called AugMix. In this algorithm they calculate the Jensen-Shannon divergence between two augmented images, which is a symmetrized form of the KL divergence.
They used the model output as the probability distribution of the dataset. The idea is to fit a model to a dataset and then interpret the output of the model as a probability density function.
For example, suppose you fitted a model to a dataset without overfitting. Then (assuming this is a classification problem) you feed your logits (the output of the last layer) to the softmax function for each class (sometimes the softmax is added as a layer at the end of the network, so be careful). The output of the softmax function (or layer) can be interpreted as P(Y|X_{1}), where X_{1} is the input sample and Y is the ground-truth class. Then you make a prediction for another sample X_{2}, giving P(Y|X_{2}), where X_{1} and X_{2} come from different datasets (say dataset_1 and dataset_2) and the model is not trained on either of those datasets.
Then the KL divergence between dataset_1 and dataset_2 can be calculated by KL(dataset_1 || dataset_2) = P(Y|X_{1}) * log(P(Y|X_{1}) / P(Y|X_{2}))
Make sure that X_{1} and X_{2} belong to the same class.
I'm not sure if this is the correct way. Alternatively, you can train two different models (model_1 and model_2) using different datasets (dataset_1 and dataset_2) and then calculate the KL divergence on the predictions of those two models using the samples of another dataset called dataset_3. In other words:
KL(dataset_1 || dataset_2) = sum x in dataset_3 model_1(x) * log(model_1(x) / model_2(x))
where model_1(x) is the softmax output of model_1, which is trained using dataset_1 without overfitting, for the correct label.
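As a rough sketch of this second idea (model_1, model_2 and loader_3 are placeholder names for the two trained classifiers and a data loader over dataset_3; it follows the formula above, using the probability of the correct label):

import torch
import torch.nn.functional as F

kl_total = 0.0
with torch.no_grad():
    for x, y in loader_3:
        p = F.softmax(model_1(x), dim=1).gather(1, y.unsqueeze(1)).squeeze(1)  # P(correct label) under model_1
        q = F.softmax(model_2(x), dim=1).gather(1, y.unsqueeze(1)).squeeze(1)  # P(correct label) under model_2
        kl_total += torch.sum(p * torch.log(p / q)).item()  # sum over x in dataset_3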
The latter sounds more reasonable to me, but I'm not sure about either of them. I could not find a proper answer on my own.
The things I'm going to explain are adapted from Jason Brownlee's blog post on the KL divergence at machinelearningmastery.com.
As far as I understood, you first have to convert your datasets into probability distributions so that you can calculate the probability of each sample from the union (or intersection?) of the two datasets.
KL(P || Q) = sum x in X P(x) * log(P(x) / Q(x))
However, most of the time the intersection of the datasets is empty. For example, if you want to measure the divergence between CIFAR10 and ImageNet, there are no samples in common. The only way you can calculate this metric is to sample from the same dataset to create two different datasets, so that there are samples present in both and the KL divergence can be calculated.
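If the samples are low-dimensional, a minimal sketch of the formula above using shared histogram bins (scipy.stats.entropy computes sum P(x) * log(P(x) / Q(x)) for two discrete distributions; samples_p and samples_q are placeholder names for the two datasets' values):

import numpy as np
from scipy.stats import entropy

# Sketch: estimate KL(P || Q) from two 1-D sample arrays via histograms over shared bins
def kl_from_samples(samples_p, samples_q, bins=50, eps=1e-10):
    lo = min(samples_p.min(), samples_q.min())
    hi = max(samples_p.max(), samples_q.max())
    p_hist, edges = np.histogram(samples_p, bins=bins, range=(lo, hi), density=True)
    q_hist, _ = np.histogram(samples_q, bins=edges, density=True)
    return entropy(p_hist + eps, q_hist + eps)  # sum p * log(p / q) after normalization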
Lastly, you may want to check out the Wasserstein distance, which is used in GANs to compare the source distribution and the target distribution.