I ask myself this question after reading about Variational Autoencoders, where the bottleneck of the model produces a mean m and a standard deviation u. Then, from a standard normal sample X ~ N(0, 1), the VAE computes the latent vector v = m + u*X, which follows an N(m, u²) distribution and still lets the gradient propagate (the reparameterization trick).
I want to do the same with a beta distribution (so with parameters a and b). How is it possible to sample from a beta distribution while allowing the gradient to propagate (because otherwise I could simply use the tfp.distributions.Beta function but the gradient wouldn't propagate ...)?
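For reference, here is a minimal sketch of the Gaussian reparameterization trick described above, in TensorFlow 2; m and u stand in for the mean and standard deviation produced by the encoder, and the loss is a dummy used only to check gradient flow:

import tensorflow as tf

m = tf.Variable([0.0, 0.0])  # mean produced by the encoder (placeholder values)
u = tf.Variable([1.0, 1.0])  # standard deviation produced by the encoder (placeholder values)

with tf.GradientTape() as tape:
    eps = tf.random.normal(tf.shape(m))  # eps ~ N(0, 1), involves no trainable parameters
    v = m + u * eps                      # v ~ N(m, u^2), differentiable w.r.t. m and u
    loss = tf.reduce_sum(tf.square(v))   # dummy loss

print(tape.gradient(loss, [m, u]))       # both gradients are defined

As for the Beta case, whether a given TensorFlow Probability version provides reparameterized gradients for tfp.distributions.Beta can be checked via its reparameterization_type attribute; treat that as something to verify for your installed version rather than a given.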
I need to calculate the Aitchison distance as a loss function between the input and output datasets.
While calculating this metric I need to compute the geometric mean of each row (the data has shape [batches x features] during loss computation).
In the simplest case we could imagine there is only one batch, so I just need to calculate one geometric mean for the input and one for the output dataset.
So how can this be done in TensorFlow? I didn't find any dedicated metric or reduce function for it.
You can easily calculate the geometric mean of a tensor as a loss function (or, in your case, as part of the loss function) with TensorFlow, using a numerically stable formula highlighted here. The code fragment below closely resembles the PyTorch solution posted here, which follows the above-mentioned formula (and the SciPy implementation).
from tensorflow.keras import backend as K

def geometric_mean(x, axis=1):
    # numerically stable geometric mean per row: exp(mean(log(x)))
    return K.exp(K.mean(K.log(x), axis=axis))
You can set axis according to your needs and integrate the function into your loss.
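For completeness, here is a hedged sketch of how such a geometric mean could feed into an Aitchison-distance loss via the centered log-ratio (clr) transform; the function names and the batch-averaging choice are my additions, not part of the original answer, and the inputs are assumed to be strictly positive compositions:

from tensorflow.keras import backend as K

def clr(x, axis=1):
    # centered log-ratio transform: log(x) minus the row-wise mean of log(x),
    # i.e. log(x / geometric_mean(x)) computed in log space for stability
    logx = K.log(x)
    return logx - K.mean(logx, axis=axis, keepdims=True)

def aitchison_loss(y_true, y_pred):
    # Aitchison distance per row, averaged over the batch
    diff = clr(y_true) - clr(y_pred)
    return K.mean(K.sqrt(K.sum(K.square(diff), axis=1)))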
I am getting into machine learning, and recently I have studied classification of linearly separable data using Linear Discriminant Analysis (LDA). To do so I have used the scikit-learn package and the class
.discriminant_analysis.LinearDiscriminantAnalysis
on data from the MNIST database of handwritten digits. I have used the database to fit the model and make predictions on test data, like this:
lda = LinearDiscriminantAnalysis(n_components=2)
lda.fit(data, labels)
lda.predict(testdata)
This works just fine: I get a nice accuracy of 95%. However, the predict function uses data from all 784 dimensions (corresponding to 28x28-pixel images). I don't understand why all dimensions are used for the prediction.
I thought the purpose of Linear Discriminant Analysis is to find a projection onto a low-dimensional space that maximizes class separation, such that ideally the data becomes linearly separable and classification is easy.
What’s the point of LDA and determining the projection matrix if all 784 dimensions are used for prediction anyway?
From documentation:
discriminant_analysis.LinearDiscriminantAnalysis can be used to perform supervised dimensionality reduction, by projecting the input data to a linear subspace consisting of the directions which maximize the separation between classes (in a precise sense discussed in the mathematics section below). The dimension of the output is necessarily less than the number of classes, so this is, in general, a rather strong dimensionality reduction, and only makes sense in a multiclass setting.
This is implemented in discriminant_analysis.LinearDiscriminantAnalysis.transform. The desired dimensionality can be set using the n_components constructor parameter. This parameter has no influence on discriminant_analysis.LinearDiscriminantAnalysis.fit or discriminant_analysis.LinearDiscriminantAnalysis.predict.
This means n_components is used only for transform (or fit_transform). You can use the dimensionality reduction to remove noise from your data or for visualization.
The low dimension you mentioned is actually n_classes in terms of classification.
If you use it as a dimensionality-reduction technique you can choose n_components dimensions, provided you specify it (it must be < n_classes). This has no impact on prediction, as mentioned in the documentation and illustrated by the short check below.
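As a quick sanity check of that documented behavior (a sketch using scikit-learn's small built-in digits set rather than the full MNIST; the variable names are mine):

from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)

# n_components changes what transform() returns, but not what predict() returns
lda_full = LinearDiscriminantAnalysis().fit(X, y)
lda_2d = LinearDiscriminantAnalysis(n_components=2).fit(X, y)

print((lda_full.predict(X) == lda_2d.predict(X)).all())  # True: no effect on predict
print(lda_2d.transform(X).shape)                         # (n_samples, 2): effect on transform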
Hence, once you pass in input data, it will transform the data into the n_classes-dimensional space and then use this space for training/prediction. Reference: _decision_function() is used for prediction.
You can use transform(X) to view the new lower-dimensional space learned by the model.
Applying LDA to MNIST data with reduced dimensions:
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# data_1000 / labels_1000: a (presumably 1000-sample) subset of MNIST
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(data_1000, labels_1000).transform(data_1000)

# plot the 2-D LDA projection (before t-SNE)
colors = ['brown', 'black', 'deepskyblue', 'red', 'yellow', 'darkslategrey',
          'navy', 'darkorange', 'deeppink', 'lawngreen']
target_names = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
y = labels_1000

plt.figure()
for color, i, target_name in zip(colors, range(10), target_names):
    plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], alpha=.8, color=color,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('LDA of MNIST dataset before t-SNE')
plt.show()
Although not strictly a programming question, I haven't found anything about this topic on this site. I'm currently dealing with (variational) autoencoders ((V)AEs) and plan to deploy them to detect anomalies. For testing purposes, I've implemented a VAE in TensorFlow for detecting handwritten digits.
The training went well and the reconstructed images are very similar to the originals. But to actually use the autoencoder, I need some kind of measure that I can compare against a threshold value to determine whether a new image fed to the autoencoder is a digit or not.
At this point, I have two major questions:
1.) For training, I used a loss consisting of two components. The first one is the reconstruction error, which is a cross-entropy function:
# x: actual input
# x_hat: reconstructed input
epsilon = 1e-10 # <-- small number for numeric stability within log
recons_loss = -tf.reduce_sum(x * tf.log(epsilon + x_hat) + (1 - x) * tf.log(epsilon + 1 - x_hat),
                             axis=1)
The second one is the KL divergence, which measures how much two probability distributions differ; it is used here because we demand that the latent space follow a distribution close to a Gaussian.
# z_mean: vector representing the means of the latent distribution
# z_log_var: vector representing the variances of the latent distribution
KL_div = -0.5 * tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var),
                              axis=1)
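For reference, that line is the closed-form KL divergence between the diagonal Gaussian defined by z_mean and z_log_var and a standard normal prior:
$$
D_{KL}\!\left(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1)\right) = -\frac{1}{2} \sum_j \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right)
$$
with $\mu$ corresponding to z_mean and $\log \sigma^2$ to z_log_var.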
For determining the reconstruction error of a new image, do I have to use both parts of the training loss? Intuitively, I would say no and just go with recons_loss.
2.) How do I determine the threshold value? Is there existing TensorFlow functionality for this that I can use?
If you have some good source for anything related, please share the link!
Thanks!
I had a similar problem recently. VAEs are very good at projecting high-dimensional data into a lower-dimensional latent space. Altering the latent vector and feeding it to the decoder part creates new samples.
I hope I understand your question correctly: you are trying to do anomaly detection with the encoder part, in the lower-dimensional latent space?
I guess you have trained your VAE on MNIST. What you can do is get the latent vectors of all MNIST digits and compare the latent vector of your new digit to them via Euclidean distance. The threshold would be a maximum distance set by you.
The code would be something like this:
from scipy.spatial import distance

x_mnist_encoded = encoder.predict(x_mnist, batch_size=batch_size)  # array of MNIST latent vectors
test_digit_encoded = encoder.predict(x_testdigit, batch_size=1)    # your test digit's latent vector

# the new sample counts as "known" if any MNIST latent vector is within the threshold
threshold = 0.3  # maximum Euclidean distance, set by you
is_known = any(distance.euclidean(vector, test_digit_encoded[0]) <= threshold
               for vector in x_mnist_encoded)
VAE code is from https://blog.keras.io/building-autoencoders-in-keras.html
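Regarding the two numbered questions in the post, another common approach (a sketch under my own assumptions, not part of the answer above) is to score new images with the reconstruction term alone and to pick the threshold from the distribution of reconstruction errors on the training data, e.g. a high percentile; vae, x_train and x_new are placeholder names:

import numpy as np

def recon_error(x, x_hat, epsilon=1e-10):
    # per-image cross-entropy reconstruction error, same form as recons_loss above
    return -np.sum(x * np.log(epsilon + x_hat) + (1 - x) * np.log(epsilon + 1 - x_hat), axis=1)

# score the training set and take a high percentile as the anomaly threshold
train_errors = recon_error(x_train, vae.predict(x_train))
threshold = np.percentile(train_errors, 99)

# flag new images whose reconstruction error exceeds the threshold
new_errors = recon_error(x_new, vae.predict(x_new))
is_anomaly = new_errors > threshold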
I'm using Scikit-Learn for text classification in Python. My classifier currently predicts False for everything (I was fooled for a while because it reported "75% accuracy" when 75% of the labels were False), so I'm trying to figure out what's wrong.
Currently, I'm using SVC(kernel='precomputed') and computing the Gram matrix manually before passing it to fit() and predict(). The entry $G_{ij}$ of the Gram matrix is the kernel $K(d_i, d_j)$, where $K$ denotes the kernel function and $d_i$ is the $i$-th document.
For my kernel function, the Gram matrix entries are not normalized, i.e. some are greater than 1. Do I need to apply kernel normalization
$$
K'(d_i, d_j) = \frac{K(d_i, d_j)}{\sqrt{K(d_i, d_i) \times K(d_j, d_j)}}
$$
to get it between 0 and 1? Or do SVMs not care?
No, you should not need to pre-scale the vectors. An SVM modelling process should be invariant to linear transforms of the data.
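If you nonetheless want to try it, the normalization from the question can be applied directly to the precomputed Gram matrix; a minimal sketch (the helper name normalize_gram is mine):

import numpy as np

def normalize_gram(G):
    # cosine-style kernel normalization:
    # K'(d_i, d_j) = K(d_i, d_j) / sqrt(K(d_i, d_i) * K(d_j, d_j))
    d = np.sqrt(np.diag(G))
    return G / np.outer(d, d)

Note that for the rectangular test-versus-training Gram matrix you would also need the self-kernel values K(d_i, d_i) of the test documents, since they do not appear on its diagonal.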