stop gradient with n/a label in tensorflow - python

I'm implementing a convolutional neural network in TensorFlow with Python.
I'm in the following scenario: I've got a tensor of labels y (batch labels) like this:
y = [[0,1,0],
     [0,0,1],
     [1,0,0]]
where each row is a one-hot vector that represents the label of the corresponding example. Now, during training, I want to stop (set to 0) the loss gradient of the examples with this label (the third one):
[1,0,0]
which represents the n/a label,
while the loss of the other examples in the batch is still computed.
For my loss computation I use a method like this:
self.y_loss = kl_divergence(self.pred_y, self.y)
I found this function that stops gradients, but how can I apply it conditionally to the batch elements?

If you don't want some samples to contribute to the gradients you could just avoid feeding them to the network during training at all. Simply remove the samples with that label from your training set.
Alternatively, since the loss is computed by summing the KL-divergences over the samples, you could multiply the KL-divergence for each sample by 1 if the sample should be taken into account and by 0 otherwise before summing over them.
You can get the vector of values you need to multiply the individual KL-divergences with by subtracting the first column of the tensor of labels from 1: 1 - y[:,0]
For the kl_divergence function from the answer to your previous question it might look like this:
def kl_divergence(p, q):
    return tf.reduce_sum(tf.reduce_sum(p * tf.log(p/q), axis=1) * (1 - p[:,0]))
where p is the ground-truth tensor and q is the predictions tensor.
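For completeness, a minimal sketch of how this could be wired into the loss from the question (assuming TF 1.x, where tf.log exists; the helper and variable names just follow the question and are not from the original answer):

import tensorflow as tf

def kl_divergence(p, q):
    # per-sample KL divergence, shape (batch,)
    per_sample_kl = tf.reduce_sum(p * tf.log(p / q), axis=1)
    # mask is 0 for rows whose label is [1,0,0] (the n/a class) and 1 otherwise
    mask = 1.0 - p[:, 0]
    return tf.reduce_sum(per_sample_kl * mask)

# with this definition p is the ground truth, so the call could look like
# self.y_loss = kl_divergence(self.y, self.pred_y)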

Related

Loss function for comparing two vectors for categorization

I am performing an NLP task where I analyze a document and classify it into one of six categories. However, I do this operation at three different time periods. So the final output is an array of three integers (sparse), where each integer is the category 0-5. So a label looks like this: [1, 4, 5].
I am using BERT and am trying to decide what type of head I should attach to it, as well as what type of loss function I should use. Would it make sense to use BERT's output of size 1024 and run it through a Dense layer with 18 neurons, then reshape into something of size (3,6)?
Finally, I assume I would use Sparse Categorical Cross-Entropy as my loss function?
The BERT final hidden state is (512, 1024). You can either take the first token, which is the CLS token, or take the average pooling. Either way your final output has shape (1024,). Now simply put 3 linear layers of shape (1024, 6), as in nn.Linear(1024, 6), and pass their outputs into the loss function below (you can make it more complex if you want to).
Simply add up the losses and call backward. Remember you can call loss.backward() on any scalar tensor (PyTorch).
def loss(time1output, time2output, time3output, time1label, time2label, time3label):
    loss1 = nn.CrossEntropyLoss()(time1output, time1label)
    loss2 = nn.CrossEntropyLoss()(time2output, time2label)
    loss3 = nn.CrossEntropyLoss()(time3output, time3label)
    return loss1 + loss2 + loss3
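As a rough sketch of what such a head could look like in PyTorch (the class and variable names here are illustrative, not from the original answer), assuming a pooled BERT output of size 1024:

import torch.nn as nn

class ThreeTimeHeads(nn.Module):
    # three independent classification heads, one per time period
    def __init__(self, hidden_size=1024, num_classes=6):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(hidden_size, num_classes) for _ in range(3)])

    def forward(self, pooled):  # pooled: (batch, hidden_size), e.g. the CLS vector
        return [head(pooled) for head in self.heads]

Each of the three outputs has shape (batch, 6) and is paired with an integer label tensor of shape (batch,) in nn.CrossEntropyLoss, as in the loss function above.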
In a typical setup you take the CLS output of BERT (a vector of length 768 in case of bert-base and 1024 in case of bert-large) and add a classification head (it may be a simple Dense layer with dropout). In this case the inputs are word tokens and the output of the classification head is a vector of logits for each class, and usually a regular Cross-Entropy loss function is used. Then you apply softmax to it and get probability-like scores for each class, or if you apply argmax you will get the winning class. So the result might be either a vector of classification scores [1x6] or the dominant class index (an integer).
(Image omitted; taken from d2l.ai.)
You can simply concatenate 3 such networks (for each time period) to get the desired result.
Obviously, I have described only one possible solution. But as it usually provides good results, I suggest you try it before moving on to more complex ones.
Finally, Sparse Categorical Cross-Entropy loss is used when the labels are given as class indices (say [4]) and regular Categorical Cross-Entropy loss is used when the labels are one-hot encoded (say [0 0 0 0 1 0]). Otherwise they are absolutely the same.
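A quick way to convince yourself of that equivalence (a small tf.keras check, not from the original answer; the numbers are made up):

import tensorflow as tf

probs = tf.constant([[0.1, 0.1, 0.1, 0.1, 0.5, 0.1]])  # predicted class probabilities
sparse = tf.keras.losses.sparse_categorical_crossentropy(tf.constant([4]), probs)
one_hot = tf.keras.losses.categorical_crossentropy(tf.constant([[0., 0., 0., 0., 1., 0.]]), probs)
# both evaluate to [-log(0.5)] ~= [0.693]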

Tensorflow with Keras: sparse_categorical_crossentropy

I'm new on StackOverflow and I also recently started to work with Tensorflow and Keras. Currently I'm developing an architecture using LSTM units. My question was partially discussed here:
What does the implementation of keras.losses.sparse_categorical_crossentropy look like?
However, in my model I have a predicted tensor, y_hat, of size (batch_size, seq_length, vocabulary_dimension) and the true labels, y, of size (batch_size, seq_length).
I would like to know how the value of the loss is computed when I call
loss = sparse_categorical_crossentropy(y, y_hat): how does the sparse_categorical_crossentropy function calculate the loss value starting from two tensors of different dimensions?
The cross entropy is a way to compare two probability distributions. That is, it says how different or similar the two are. It is a mathematical function defined on two arrays or continuous distributions as shown here.
The 'sparse' part in 'sparse_categorical_crossentropy' indicates that the y_true value must have a single value per row, e.g. [0, 2, ...], that indicates which outcome (category) was the right choice. The model then outputs the y_pred, which must be something like [[.99, .01, 0], [.01, .5, .49], ...]. Here, the model predicts that the 0th category has a chance of .99 in the first row. This is very close to the true value, that is [1,0,0]. The sparse_categorical_crossentropy would then calculate a single number from the two distributions using the above-mentioned formula and return that number.
If you used a 'categorical_crossentropy' it would expect the y_true to be a one-hot encoded vector, like [[0,0,1], [0,1,0], ...].
If you would like to know the details in depth, you can take a look at the source.
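To make the shapes concrete, here is a tiny sketch (the numbers are made up) of how the two tensors line up: the integer in y selects which probability along the last axis of y_hat goes into -log(p):

import tensorflow as tf

y = tf.constant([[0, 2]])                  # (batch=1, seq_length=2), integer class ids
y_hat = tf.constant([[[0.9, 0.05, 0.05],   # (batch=1, seq_length=2, vocabulary=3)
                      [0.1, 0.2, 0.7]]])
loss = tf.keras.losses.sparse_categorical_crossentropy(y, y_hat)
# loss has shape (1, 2): [-log(0.9), -log(0.7)], one value per sequence position;
# Keras then reduces (averages) these values to report a single scalar loss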

PyTorch: Is retain_graph=True necessary in alternating optimization?

I'm trying to optimize two models in an alternating fashion using PyTorch. The first is a neural network that is changing the representation of my data (i.e. a map f(x) on my input data x, parameterized by some weights W). The second is a Gaussian mixture model that is operating on the f(x) points, i.e. in the neural network space (rather than clustering points in the input space). I am optimizing the GMM using expectation maximization, so the parameter updates are analytically derived, rather than using gradient descent.
I have two loss functions here: the first is a function of the distances ||f(x) - f(y)||, and the second is the loss function of the Gaussian mixture model (i.e. how 'clustered' everything looks in the NN representation space). What I want to do is take a step in the NN optimization using both of the above loss functions (since it depends on both), and then do an expectation-maximization step for the GMM. The code looks like this (I have removed a lot since there is a ton of code):
data, labels = load_dataset()
net = NeuralNetwork()
net_optim = torch.optim.Adam(net.parameters(), lr=0.05, weight_decay=1)
# initialize weights, means, and covariances for the Gaussian clusters
concentrations, means, covariances, precisions = initialization(net.forward_one(data))

for i in range(1000):
    net_optim.zero_grad()

    pairs, pair_labels = pairGenerator(data, labels)  # samples some pairs of datapoints
    outputs = net(pairs[:, 0, :], pairs[:, 1, :])  # computes pairwise distances
    net_loss = NeuralNetworkLoss(outputs, pair_labels)  # loss function based on pairwise dist.

    embedding = net.forward_one(data)  # embeds all data in the NN space
    log_prob, log_likelihoods = expectation_step(embedding, means, precisions, concentrations)
    concentrations, means, covariances, precisions = maximization_step(embedding, log_likelihoods)
    gmm_loss = GMMLoss(log_likelihoods, log_prob, precisions, concentrations)

    net_loss.backward(retain_graph=True)
    gmm_loss.backward(retain_graph=True)
    net_optim.step()
Essentially, this is what is happening:
1. Sample some pairs of points from the dataset
2. Push pairs of points through the NN and compute network loss based on those outputs
3. Embed all datapoints using the NN and perform a clustering EM step in that embedding space
4. Compute variational loss (ELBO) based on clustering parameters
5. Update neural network parameters using both the variational loss and the network loss
However, to perform (5), I am required to add the flag retain_graph=True, otherwise I get the error:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
It seems like having two loss functions means that I need to retain the computational graph?
I am not sure how to work around this, as with retain_graph=True, around iteration 400, each iteration is taking ~30 minutes to complete. Does anyone know how I might fix this? I apologize in advance – I am still very new to automatic differentiation.
I would recommend doing
total_loss = net_loss + gmm_loss
total_loss.backward()
Note that the gradient of net_loss w.r.t. the GMM weights is 0, so summing the losses won't have any unwanted effect.
Here is a good thread on pytorch regarding the retain_graph. https://discuss.pytorch.org/t/what-exactly-does-retain-variables-true-in-loss-backward-do/3508/24
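Applied inside the training loop from the question, the last few lines would then look roughly like this (a sketch reusing the original variable names):

net_optim.zero_grad()
# ... compute net_loss and gmm_loss exactly as before ...
total_loss = net_loss + gmm_loss
total_loss.backward()  # one backward pass over the shared graph, no retain_graph needed
net_optim.step()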

Keras tensor has an additional dimension and causes wrong results for net.evaluate()

I'd like to train a neural network in Python and Keras using a metric learning custom loss function. The loss minimizes the distances of the outputs for similar inputs and maximizes the distances between dissimilar ones. The part considering similar inputs is:
# function to create a pairwise similarity matrix, i.e
# L[i,j] == 1 for similar samples i, j and 0 otherwise
def build_indicator_matrix(y_, thr=0.1):
    # y_: contains the labels of the samples,
    # samples are similar in case of same label
    # prevent checking equality of floats --> check if absolute
    # differences are below threshold
    lbls_diff = K.expand_dims(y_, axis=0) - K.expand_dims(y_, axis=1)
    lbls_thr = K.less(K.abs(lbls_diff), thr)
    # cast bool tensor back to float32
    L = K.cast(lbls_thr, 'float32')
    # POSSIBLE WORKAROUND
    # L = K.sum(L, axis=2)
    return L

# function to compute the (squared) Euclidean distances between all pairs
# of samples, store in DIST[i,j] the distance between output y_pred[i,:] and y_pred[j,:]
def compute_pairwise_distances(y_pred):
    DIFF = K.expand_dims(y_pred, axis=0) - K.expand_dims(y_pred, axis=1)
    DIST = K.sum(K.square(DIFF), axis=-1)
    return DIST

# function to compute the average distance between all similar samples
def my_loss(y_true, y_pred):
    # y_true: contains true labels of the samples
    # y_pred: contains network outputs
    L = build_indicator_matrix(y_true)
    DIST = compute_pairwise_distances(y_pred)
    return K.mean(DIST * L, axis=1)
For training, I pass a numpy array y of shape (n,) as target variable to my_loss. However, I found (using the computational graph in TensorBoard) that the tensorflow backend creates a 2D variable out of y (displayed shape ? x ?), and hence L in build_indicator_matrix is not 2- but 3-dimensional (shape ? x ? x ? in TensorBoard). This causes net.evaluate() and net.fit() to compute wrong results.
Why does tensorflow create a 2D rather than a 1D array? And how does this affect net.evaluate() and net.fit()?
As quick workarounds I found that either replacing build_indicator_matrix() with static numpy code for computing L, or collapsing the "fake" dimension with the line L = K.sum(L, axis=2), solves the problem. In the latter case, however, the output of K.eval(build_indicator_matrix(y)) is only of shape (n,) and not (n,n), so I do not understand why this workaround still yields correct results. Why does tensorflow introduce an additional dimension?
My library versions are:
keras: 2.2.4
tensorflow: 1.8.0
numpy: 1.15.0
This is because evaluate and fit work in batches.
The first dimension you see in tensorboard is the batch dimension, unknown in advance and therefore denoted ?.
When using custom metrics, remember the tensors (y_true and y_pred) you get are the ones corresponding to the batch.
For more info, show us how you call both those functions.
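Not part of the original answer, but one workaround consistent with it: since y_true arrives per batch (often with a trailing singleton dimension such as (batch_size, 1)), you could flatten it at the top of build_indicator_matrix, e.g.:

from keras import backend as K

def build_indicator_matrix(y_, thr=0.1):
    y_ = K.flatten(y_)                  # collapse e.g. (batch, 1) to (batch,)
    lbls_diff = K.expand_dims(y_, axis=0) - K.expand_dims(y_, axis=1)
    lbls_thr = K.less(K.abs(lbls_diff), thr)
    return K.cast(lbls_thr, 'float32')  # now a 2-D (batch, batch) indicator matrix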

3darray training/testing TensorFlow RNN LSTM

(I am testing my abilities to write short but effective questions so let me know how I do here)
I am trying to train/test a TensorFlow recurrent neural network, specifically an LSTM, with some trials of time-series data in the following ndarray format:
[[[time_step_trial_0, feature, feature, ...],
  [time_step_trial_0, feature, feature, ...]],
 [[time_step_trial_1, feature, feature, ...],
  [time_step_trial_1, feature, feature, ...]],
 [[time_step_trial_2, feature, feature, ...],
  [time_step_trial_2, feature, feature, ...]]]
The 1D portion of this 3D array holds a time step and all feature values that were observed at that time step. The 2D block contains all 1D arrays (time steps) that were observed in one trial. The 3D block contains all 2D blocks (trials) recorded for the time-series dataset. For each trial, the time step frequency is constant and the window interval is the same across all trials (0 to 50 seconds, 0 to 50 seconds, etc.).
For example, I am given data for Formula 1 race cars such as torque, speed, acceleration, rotational velocity, etc. Over a certain time interval, recording time steps every 0.5 seconds, I form 1D arrays with each time step versus the features recorded at that time step. Then I form a 2D array around all time steps corresponding to one Formula 1 race car's run on the track. I create a final 3D array holding all F1 cars and their time-series data. I want to train and test a model to detect anomalies in the common F1 trajectories on the course for new cars.
I am currently aware that TensorFlow models support 2D arrays for training and testing. I was wondering what procedures I would have to go through in order to be able to train and test the model on all of the independent trials (2D) contained in this 3D array. In addition, I will be adding more trials in the future. So what are the proper procedures to go through in order to constantly update my model with the new data/trials to strengthen my LSTM?
Here is the model I was trying to initially replicate for a different purpose other than human activity: https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition. Another more feasible model would be this, which I would much rather look at for anomaly detection in the time-series data: https://arxiv.org/abs/1607.00148. I want to build an anomaly detection model where, given a set of non-anomalous time-series training data, we can detect anomalies in the test data where parts of the data over time are defined as "out of family."
I think for most LSTMs you're going to want to think of your data in this way (as it will be easy to use as input for the networks).
You'll have 3 dimensions:
feature_size = the number of different features (torque, velocity, etc.)
number_of_time_steps = the number of time steps collected for a single car
number_of_cars = the number of cars
It will most likely be easiest to read your data in as a set of matrices, where each matrix corresponds to one full sample (all the time steps for a single car).
You can arrange these matrices so that each row is an observation and each column is a different parameter (or the opposite, you may have to transpose the matrices, look at how your network input is formatted).
So each matrix is of size:
number_of_time_steps x feature_size (#rows x #columns). You will have number_of_cars different matrices. Each matrix is a sample.
To convert your array to this format, you can use this block of code (note, you can already access a single sample in your array with A[n], but this makes it so the shape of the accessed elements is what you expect):
import numpy as np
A = [[['car1', 'timefeatures1'], ['car1', 'timefeatures2']],
     [['car2', 'timefeatures1'], ['car2', 'timefeatures2']],
     [['car3', 'timefeatures1'], ['car3', 'timefeatures2']]]
easy_format = np.array(A)
Now you can get an individual sample with easy_format[n], where n is the sample you want.
easy_format[1] prints
array([['car2', 'timefeatures1'],
       ['car2', 'timefeatures2']],
      dtype='|S12')
easy_format[1].shape = (2,2)
Now that you can do that, you can format them however you need for the network you're using (transposing rows and columns if necessary, presenting a single sample at a time or all of them at once, etc.)
What you're looking to do (if I'm reading that second paper correctly) most likely requires a sequence to sequence lstm or rnn. Your original sequence is your time series for a given trial, and you're generating an intermediate set of weights (an embedding) that can recreate that original sequence with a low amount of error. You're doing this for all the trials. You will train this lstm on a series of reasonably normal trials and get it to perform well (reconstruct the sequence accurately). You can then use this same set of embeddings to try to reconstruct a new sequence, and if it has a high reconstruction error, you can assume it's anomalous.
Check this repo for a sample of what you'd want along with explanations of how to use it and what the code is doing (it only maps a sequence of integers to another sequence of integers, but can easily be extended to map a sequence of vectors to a sequence of vectors): https://github.com/ichuang/tflearn_seq2seq The pattern you'd define is just your original sequence. You might also take a look at autoencoders for this problem.
Final Edit: Check this repository: https://github.com/beld/Tensorflow-seq2seq-autoencoder/blob/master/simple_seq2seq_autoencoder.py
I have modified the code in it very slightly to work on the newest version of tensorflow and to make some of the variable names clearer. You should be able to modify it to run on your dataset. Right now I'm just having it autoencode a randomly generated array of 1's and 0's. You would do this for a large subset of your data and then see if other data was reconstructed accurately or not (much higher error than average might imply an anomaly).
import numpy as np
import tensorflow as tf

learning_rate = 0.001
training_epochs = 30000
display_step = 100
hidden_state_size = 100
samples = 10
time_steps = 20
step_dims = 5

test_data = np.random.choice([0, 1], size=(time_steps, samples, step_dims))

initializer = tf.random_uniform_initializer(-1, 1)

seq_input = tf.placeholder(tf.float32, [time_steps, samples, step_dims])

encoder_inputs = [tf.reshape(seq_input, [-1, step_dims])]
decoder_inputs = ([tf.zeros_like(encoder_inputs[0], name="GO")]
                  + encoder_inputs[:-1])
targets = encoder_inputs
weights = [tf.ones_like(targets_t, dtype=tf.float32) for targets_t in targets]

cell = tf.contrib.rnn.BasicLSTMCell(hidden_state_size)
_, enc_state = tf.contrib.rnn.static_rnn(cell, encoder_inputs, dtype=tf.float32)
cell = tf.contrib.rnn.OutputProjectionWrapper(cell, step_dims)
dec_outputs, dec_state = tf.contrib.legacy_seq2seq.rnn_decoder(decoder_inputs, enc_state, cell)

y_true = [tf.reshape(encoder_input, [-1]) for encoder_input in encoder_inputs]
y_pred = [tf.reshape(dec_output, [-1]) for dec_output in dec_outputs]

loss = 0
for i in range(len(y_true)):
    loss += tf.reduce_sum(tf.square(tf.subtract(y_pred[i], y_true[i])))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)

init = tf.initialize_all_variables()
with tf.Session() as sess:
    sess.run(init)
    x = test_data
    for epoch in range(training_epochs):
        #x = np.arange(time_steps * samples * step_dims)
        #x = x.reshape((time_steps, samples, step_dims))
        feed = {seq_input: x}
        _, cost_value = sess.run([optimizer, loss], feed_dict=feed)
        if epoch % display_step == 0:
            print("logits")
            a = sess.run(y_pred, feed_dict=feed)
            print(a)
            print("labels")
            b = sess.run(y_true, feed_dict=feed)
            print(b)
            print("Epoch:", '%04d' % (epoch + 1), "cost=", "{:.9f}".format(cost_value))
    print("Optimization Finished!")
Your input shape and the corresponding model depend on what type of anomaly you want to detect. You can consider:
1. Feature-only anomaly:
Here you consider individual features and decide whether any of them is anomalous, without considering when it is measured. In your example, the feature vector [torque, speed, acceleration, ...] is an anomaly if one or more of its values is an outlier with respect to the other features. In this case your inputs should be of the form [batch, features].
2. Time-feature anomaly:
Here your inputs depend on when you measure the feature. Your current feature may depend on the previous features measured over time. For example, there may be a feature whose value is an outlier if it appears at time 0 but not an outlier if it appears further in time. In this case you divide each of your trials into overlapping time windows and form a feature set of the form [batch, time_window, features] (a sketch of this windowing follows below).
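An illustrative numpy sketch of that windowing step (the helper name and window size are made up):

import numpy as np

def make_windows(trial, window, stride=1):
    # trial: (time_steps, features) for one car/run
    starts = range(0, trial.shape[0] - window + 1, stride)
    return np.stack([trial[s:s + window] for s in starts])  # (num_windows, window, features)

# trials: array of shape (number_of_cars, time_steps, features)
# batch = np.concatenate([make_windows(t, window=10) for t in trials])  # [batch, time_window, features]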
It should be very simple to start with (1) using an autoencoder: you train an autoencoder and, based on the error between input and output, you choose a threshold, like 2 standard deviations from the mean, to determine whether a sample is an outlier or not.
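A minimal sketch of that thresholding rule, assuming per-sample reconstruction errors have already been computed:

import numpy as np

def flag_anomalies(train_errors, test_errors, k=2.0):
    # train_errors: reconstruction errors of the autoencoder on (normal) training data
    # test_errors:  reconstruction errors on new data
    threshold = train_errors.mean() + k * train_errors.std()
    return test_errors > threshold  # True = flagged as anomaly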
For (2), you can follow the second paper you mentioned using a seq2seq model, where your decoder error will determine which features are outliers. You can check on this for the implementation of such a model.
