I am performing an NLP task where I analyze a document and classify it into one of six categories. However, I do this operation at three different time periods. So the final output is an array of three integers (sparse), where each integer is a category from 0 to 5. A label looks like this: [1, 4, 5].
I am using BERT and am trying to decide what type of head I should attach to it, as well as what type of loss function I should use. Would it make sense to use BERT's output of size 1024 and run it through a Dense layer with 18 neurons, then reshape into something of size (3,6)?
Finally, I assume I would use Sparse Categorical Cross-Entropy as my loss function?

BERT's final hidden state has shape (512, 1024). You can either take the first token, which is the CLS token, or average-pool over the tokens. Either way, you end up with a vector of shape (1024,). Then simply put 3 linear layers of shape (1024, 6) on top of it, as in nn.Linear(1024, 6), one per time period, and pass their outputs into the loss function below. (You can make the heads more complex if you want to.)
Simply add up the three losses and call backward. Remember that in PyTorch you can call loss.backward() on any scalar tensor.
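import torch.nn as nn

def loss(time1output, time2output, time3output, time1label, time2label, time3label):
    # One cross-entropy term per time period; each output holds (batch, 6) logits
    # and each label holds (batch,) integer class ids.
    criterion = nn.CrossEntropyLoss()
    loss1 = criterion(time1output, time1label)
    loss2 = criterion(time2output, time2label)
    loss3 = criterion(time3output, time3label)
    return loss1 + loss2 + loss3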
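To make the shapes concrete, here is a minimal sketch of the three-head idea above. The class name ThreePeriodHead, the helper summed_loss and the random pooled tensor are illustrative assumptions; in a real model pooled would be the CLS or mean-pooled BERT output.

```python
import torch
import torch.nn as nn


class ThreePeriodHead(nn.Module):
    """Hypothetical head: one independent linear layer per time period."""

    def __init__(self, hidden_size=1024, num_classes=6, num_periods=3):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, num_classes) for _ in range(num_periods)]
        )

    def forward(self, pooled):
        # pooled: (batch, hidden_size) -> logits: (batch, num_periods, num_classes)
        return torch.stack([head(pooled) for head in self.heads], dim=1)


def summed_loss(logits, labels):
    # logits: (batch, num_periods, num_classes); labels: (batch, num_periods) class ids
    criterion = nn.CrossEntropyLoss()
    return sum(criterion(logits[:, t], labels[:, t]) for t in range(logits.shape[1]))


torch.manual_seed(0)
pooled = torch.randn(4, 1024)  # stand-in for the pooled BERT output (assumption)
labels = torch.tensor([[1, 4, 5], [0, 2, 3], [5, 5, 1], [2, 0, 4]])

model = ThreePeriodHead()
logits = model(pooled)
loss = summed_loss(logits, labels)
loss.backward()                      # gradients flow into all three heads
predictions = logits.argmax(dim=-1)  # (batch, 3), each entry in 0..5
```

Stacking the three heads' outputs gives the (3, 6) layout asked about in the question, while the labels stay as plain integer triples such as [1, 4, 5].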

In a typical setup you take the CLS output of BERT (a vector of length 768 in the case of bert-base and 1024 in the case of bert-large) and add a classification head (it may be a simple Dense layer with dropout). In this case the inputs are word tokens and the output of the classification head is a vector of logits, one per class, and usually a regular Cross-Entropy loss function is used. Then you apply softmax to it and get probability-like scores for each class, or if you apply argmax you get the winning class. So the result might be either a vector of classification scores [1x6] or the dominant class index (an integer).
You can simply concatenate 3 such networks (for each time period) to get the desired result.
Obviously, I have described only one possible solution. But as it usually provides good results, I suggest you try it before moving on to more complex ones.
Finally, Sparse Categorical Cross-Entropy loss is used when the labels are integer class indices (say [4]) and regular Categorical Cross-Entropy loss is used when the labels are one-hot encoded (say [0 0 0 0 1 0]). Otherwise they are absolutely the same.
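The equivalence of the two losses can be checked numerically. The following is a small NumPy sketch, not the Keras implementation; the function names and the example logits are made up for illustration.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def categorical_ce(one_hot, probs):
    # Cross entropy against one-hot targets: -sum_k y_k * log(p_k)
    return -(one_hot * np.log(probs)).sum(axis=-1)


def sparse_categorical_ce(class_ids, probs):
    # Same quantity, but the target is the integer index of the true class
    return -np.log(probs[np.arange(len(class_ids)), class_ids])


logits = np.array([[2.0, 0.5, 0.1, 0.0, 3.0, -1.0],
                   [0.2, 1.5, 0.3, 0.0, 0.1, 0.4]])
probs = softmax(logits)

sparse_labels = np.array([4, 1])
one_hot_labels = np.eye(6)[sparse_labels]  # rows [0 0 0 0 1 0] and [0 1 0 0 0 0]

loss_sparse = sparse_categorical_ce(sparse_labels, probs)
loss_dense = categorical_ce(one_hot_labels, probs)
```

Both arrays hold the same per-sample values; only the label encoding differs.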


I'm creating a model using the Keras functional API.
The layer architecture is as follows:
n = tf.keras.layers.Dense(1)(input)
for i in tf.range(n):
    output = tf.keras.layers.Dense(4)(input)
I then concatenate the outputs and return a tensor with shape [1, None, 4], where [1] is the batch dimension, [None] is n, and [4] is the output from the second dense layer.
My loss function compares the shape of the output with the shape of the expected output, and also compares the output values.
loss = tf.convert_to_tensor(abs(tf.shape(logits)[1] - tf.shape(expected)[1])) * 100.
When running this on a custom training loop, I'm getting the error
ValueError: No gradients provided for any variable: (['while/dense/kernel:0',
'while/dense/bias:0', 'while/while/dense_1/kernel:0', 'while/while/dense_1/bias:0'],).
Provided `grads_and_vars` is ((None, <tf.Variable 'while/dense/kernel:0' shape=(786432, 1)
Shape is not differentiable, so you cannot do things like this with gradient-based learning. Problems like this need to be tackled with more powerful tools, e.g. reinforcement learning, where one treats n as an action and uses a policy gradient for it.
A rule of thumb to remember is that you cannot really backprop through discrete objects. You need to produce floats, as gradients require smooth functions. In your case n has to be an integer (what would a loop over a float mean?), so this should be your first warning sign. The other is the shape itself, which is also an integer. A target can be discrete, but not the prediction. Note that even in classification we do not output a class; we output a probability, because a probability is smooth.
You could build your model by assuming some maximum number of N and treat it more like a classification where you supervise N directly, and use some form of masking to keep all the results around.
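One way to read the masking suggestion is sketched below in NumPy. Everything here is an assumption made for illustration: the bound N_MAX, the function masked_loss, and the choice of a softmax cross entropy on the count plus a masked squared error on the rows. The point is only that the loss is a smooth function of float outputs and never of a tensor shape.

```python
import numpy as np

N_MAX = 5  # assumed upper bound on the number of rows


def masked_loss(count_logits, rows_pred, n_true, rows_true):
    """count_logits: (N_MAX + 1,) scores for n = 0..N_MAX
    rows_pred:    (N_MAX, 4), the model always emits N_MAX rows
    n_true:       integer number of valid rows in the target
    rows_true:    (N_MAX, 4), target rows padded past n_true
    """
    # Classification loss on the count (softmax cross entropy)
    shifted = count_logits - count_logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    count_loss = -log_probs[n_true]

    # Regression loss only on the rows that really exist
    mask = (np.arange(N_MAX) < n_true).astype(float)[:, None]  # (N_MAX, 1)
    squared_error = (rows_pred - rows_true) ** 2
    row_loss = (mask * squared_error).sum() / max(mask.sum() * rows_pred.shape[1], 1.0)
    return count_loss + row_loss


rng = np.random.default_rng(0)
count_logits = rng.normal(size=N_MAX + 1)
rows_pred = rng.normal(size=(N_MAX, 4))
rows_true = rng.normal(size=(N_MAX, 4))
n_true = 3

loss_a = masked_loss(count_logits, rows_pred, n_true, rows_true)

# Overwrite the padded rows: the loss does not change because they are masked out
rows_true_other = rows_true.copy()
rows_true_other[n_true:] = 1000.0
loss_b = masked_loss(count_logits, rows_pred, n_true, rows_true_other)
```

At inference time the predicted count would be the argmax of count_logits, and only that many rows would be kept.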

I'm new on StackOverflow and I also recently started to work with Tensorflow and Keras. Currently I'm developing an architecture using LSTM units. My question was partially discussed here:
What does the implementation of keras.losses.sparse_categorical_crossentropy look like?
However, in my model I have a predicted tensor, y_hat, of size (batch_size, seq_length, vocabulary_dimension) and the true labels, y, of size (batch_size, seq_length).
I would like to know how the value of the loss is computed when I call
loss = sparse_categorical_crossentropy(y,y_hat): how does the sparse_crossentropy function calculate the loss value starting from two tensors of different dimensions?
The cross entropy is a way to compare two probability distributions. That is, it says how different or similar the two are. It is a mathematical function defined on two arrays or continuous distributions; for discrete distributions p (true) and q (predicted) it is H(p, q) = -sum_i p_i * log(q_i).
The 'sparse' part in 'sparse_categorical_crossentropy' indicates that the y_true value must have a single value per row, e.g. [0, 2, ...], that indicates which outcome (category) was the right choice. The model then outputs the y_pred, which must look like [[.99, .01, 0], [.01, .5, .49], ...]. Here, the model predicts that the 0th category has a chance of .99 in the first row. This is very close to the true value, which is [1,0,0]. For each row, sparse_categorical_crossentropy then computes a single number from the two distributions using the formula above and returns it. With y of shape (batch_size, seq_length) and y_hat of shape (batch_size, seq_length, vocabulary_dimension), every position in the sequence counts as one such row.
If you used a 'categorical_crossentropy' it would expect the y_true to be a one-hot encoded vector, like [[0,0,1], [0,1,0], ...].
If you would like to know the details in depth, you can take a look at the source.
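To show how tensors of different rank are matched up, here is a NumPy sketch of the computation for the shapes in the question. It mirrors the idea only and is not the Keras source; the sizes and random scores are arbitrary.

```python
import numpy as np

batch_size, seq_length, vocab_size = 2, 3, 5
rng = np.random.default_rng(0)

# y_hat: a probability distribution over the vocabulary at every position
scores = rng.normal(size=(batch_size, seq_length, vocab_size))
y_hat = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# y: one integer class id per position
y = np.array([[0, 3, 1],
              [4, 4, 2]])

# Pick the predicted probability of the true class at each position...
true_class_prob = np.take_along_axis(y_hat, y[..., None], axis=-1).squeeze(-1)
# ...and take its negative log: one loss value per (batch, position)
per_position_loss = -np.log(true_class_prob)  # shape (batch_size, seq_length)

mean_loss = per_position_loss.mean()  # a common reduction to a scalar
```

The integer label acts as an index into the last axis of y_hat, which is why y needs one dimension fewer than y_hat.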

I've set up a neural network regression model using Keras with one target. This works fine.
Now I'd like to include multiple targets. The dataset includes a total of 30 targets, and I'd rather train one neural network instead of 30 different ones.
My problem is that in the preprocessing of the data I have to remove some target values, for a given example, as they represent unphysical values that are not to be predicted.
This creates the issue that I have a varying number of targets/outputs.
For example:
Targets =
None, 0.007798, 0.012522
0.261140, 2110.000000, 2440.000000
0.048799, None, None
How would I go about creating a keras.Sequential model (or a functional one) with a varying number of outputs for a given input?
edit: Could I perhaps first train a classification model that predicts the number of outputs given some test inputs, and then vary the number of outputs in the output layer according to this prediction? I guess I would have to use the functional API for something like that.
The "classification" edit here is unnecessary, i.e. ignore it. The number of outputs of the test targets is a known quantity.
First, do you know up front whether some of the output values will be invalid or is part of the problem predicting which outputs will actually be valid?
If you don't know up front which outputs to disregard, you could go with something like the 2-step approach you described in your comment.
If it is deterministic (and you know how so) which outputs will be valid for any given input and your problem is just how to set up a proper model, here's how I would do that in keras:
Use the functional API
Create 30 named output layers (e.g. out_0, out_1, ... out_29)
When creating the model, just use the outputs argument to list all 30 outputs
When compiling the model, specify a loss for each separate output. You can do this by passing a dictionary to the loss argument where the keys are the names of your output layers and the values are the respective losses
Assuming you'll use mean-squared error for all outputs, the dictionary will look something like {'out_0': 'mse', 'out_1': 'mse', ..., 'out_29': 'mse'}
When fitting the model, pass three things per input: x, y and per-sample weights (in Keras this corresponds to the sample_weight argument of fit, not to loss_weights in compile, which is one scalar per output)
y has to be a dictionary where the key is the output layer name and the value is the target output value
The sample weights are also a dictionary in the same format as y. The weights in your case can just be binary: 1 for each output that corresponds to a real value, 0 for each output that corresponds to an unphysical value (so it is disregarded during training) for any given sample
Don't pass Nones for the unphysical target values; use some kind of finite numeric filler, otherwise you'll get issues. It is irrelevant which filler you use, as its weight is zero and it will not affect gradients during training
This will give you a trainable model. BUT once you move on from training and try to predict on new data, YOU will have to decide which outputs to disregard for each sample, because the network will likely still give you "valid"-looking outputs for those inputs.
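The effect of the binary weights described above can be illustrated outside Keras. The NumPy sketch below pools all targets into one weighted mean-squared error, which is a simplification of the per-output losses in the recipe; weighted_mse, FILLER and the constant predictions are assumptions for illustration.

```python
import numpy as np


def weighted_mse(y_true, y_pred, weights):
    # weights are 1.0 for physical targets and 0.0 for filler values
    squared_error = weights * (y_true - y_pred) ** 2
    return squared_error.sum() / max(weights.sum(), 1.0)


targets = [[None, 0.007798, 0.012522],
           [0.261140, 2110.0, 2440.0],
           [0.048799, None, None]]

FILLER = 0.0  # arbitrary finite placeholder
weights = np.array([[0.0 if v is None else 1.0 for v in row] for row in targets])
y_true = np.array([[FILLER if v is None else v for v in row] for row in targets])

y_pred = np.full_like(y_true, 0.5)  # stand-in for network outputs
loss = weighted_mse(y_true, y_pred, weights)

# Changing the filler leaves the loss untouched because its weight is zero
y_true_other = np.where(weights == 0.0, 123.0, y_true)
loss_other = weighted_mse(y_true_other, y_pred, weights)
```

A filler of NaN or infinity would not work here, since multiplying it by a zero weight still yields NaN.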
One possible solution would be to have a separate output of "validity flags" which takes values in range from zero to one. For example, your first target will be
y=[0.0, 0.007798, 0.012522]
yf=[0.0, 1.0, 1.0]
where zeros indicate invalid values.
Use sigmoid activation function for yf.
Loss function can be the sum of losses for y and yf.
During inference, analyze the network output for yf and only consider a y value valid if the corresponding yf exceeds the 0.5 threshold.
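A minimal NumPy sketch of the validity-flag idea follows. The numbers standing in for network outputs are invented, and combined_loss is just one plausible reading of "sum of losses" (mean-squared error for y plus binary cross entropy for yf).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def combined_loss(y_true, yf_true, y_pred, yf_logits):
    # Regression loss on the values plus binary cross entropy on the flags
    yf_prob = sigmoid(yf_logits)
    bce = -(yf_true * np.log(yf_prob) + (1.0 - yf_true) * np.log(1.0 - yf_prob)).mean()
    mse = ((y_true - y_pred) ** 2).mean()
    return mse + bce


# Targets for the first sample in the question
y_true = np.array([0.0, 0.007798, 0.012522])
yf_true = np.array([0.0, 1.0, 1.0])

# Hypothetical raw network outputs for that sample
y_pred = np.array([0.31, 0.0081, 0.0119])  # regression outputs
yf_logits = np.array([-2.0, 3.1, 1.4])     # validity scores before the sigmoid

loss = combined_loss(y_true, yf_true, y_pred, yf_logits)

# Inference: keep a value only if its flag probability exceeds 0.5
valid = sigmoid(yf_logits) > 0.5
result = np.where(valid, y_pred, np.nan)   # invalid targets reported as NaN
```

The two approaches can be combined, for example by also masking the regression term so that filler values in y do not pull the predictions toward zero.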

I am trying to build two neural networks for classification: one for binary and one for multi-class classification. I am trying to use torch.nn.CrossEntropyLoss() as the loss function, but when I try to train my first neural network I get the following error:
multi-target not supported at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THNN/generic/ClassNLLCriterion.c:22
From my analysis, I found that my dataset has two problems that caused the error.
My data set is one-hot encoded. I used one-hot encoding to preprocess my dataset. The first target variable, Y_binary, has the shape torch.Size([125973, 1]) and is full of 0s and 1s indicating the classes 'No' and 'Yes'.
My data has the wrong dimensions? I found that I can't use a simple vector with the cross entropy loss function. Some people used the following code to reshape their output before feeding it to the loss function.
out = out.permute(0, 2, 3, 1).contiguous().view(-1, class_number)
But I didn't really understand the reasoning behind this code. It seems to me that I need to keep track of the following variables: Class_Number, Batch_size, Dimension_Output. For my code, here are the dimensions:
X_train.shape: (125973, 122)
Y_train2.shape: (125973, 1)
batch_size = 64
K = len(set(Y_train2)) # Binary classification For multi class classification use K = len(set(Y_train5))
Should the target value be one-hot encoded? If not, how can I feed a nominal feature to the loss function?
If I need to reshape the output, can you help me do this for my code?
I am trying to use this loss function for both my neural networks.
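The error comes from how torch.nn.CrossEntropyLoss() is being fed. It is the right loss when you want to predict 1 class out of N classes (binary or multi-class), but it expects the target to be a 1-D tensor of integer class indices with shape (batch,), not a one-hot matrix or a column of shape (batch, 1); squeezing the extra dimension, e.g. with target.squeeze(1), removes the "multi-target not supported" message. For multi-label classification, where several classes can be present in one sample, you should use torch.nn.BCEWithLogitsLoss(), which combines a Sigmoid layer and the BCELoss in one single class.
In the multi-label case, whether you use BCEWithLogitsLoss or Sigmoid + BCELoss, the target needs to be a multi-hot vector, i.e. something like this per sample: [0 1 0 0 0 1 0 0 1 0], where 1 marks the classes that are present.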
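The target shapes the two losses expect can be compared side by side. This is a small PyTorch sketch with random logits and invented labels; the batch size and class count are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, num_classes = 4, 5

logits = torch.randn(batch_size, num_classes)  # stand-in for model outputs

# Single-label case: CrossEntropyLoss wants class indices of shape (batch,), dtype long
target_column = torch.tensor([[0], [3], [1], [4]])  # shape (batch, 1), as in the question
class_ids = target_column.squeeze(1).long()         # shape (batch,)
ce_loss = nn.CrossEntropyLoss()(logits, class_ids)

# Multi-label case: BCEWithLogitsLoss wants a float multi-hot target
# with the same shape as the logits, (batch, num_classes)
multi_hot = torch.tensor([[0., 1., 0., 0., 1.],
                          [1., 0., 0., 0., 0.],
                          [0., 0., 1., 1., 0.],
                          [0., 0., 0., 0., 1.]])
bce_loss = nn.BCEWithLogitsLoss()(logits, multi_hot)
```

For the binary network, the same CrossEntropyLoss call works with two output units and class ids 0 and 1.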

I want to weight the training data based on a column in the training data set. Thereby giving more importance to certain training items than others. The weighting column should not be included as a feature for the input layer.
The Tensorflow documentation holds an example how to use the label of the item to assign a custom loss and thereby assigning weight:
# Ensures that the loss for examples whose ground truth class is `3` is 5x
# higher than the loss for all other examples.
weight = tf.multiply(4, tf.cast(tf.equal(labels, 3), tf.float32)) + 1
onehot_labels = tf.one_hot(labels, num_classes=5)
tf.contrib.losses.softmax_cross_entropy(logits, onehot_labels, weight=weight)
I am using this in a custom DNN with three hidden layers. In theory I simply need to replace labels in the example above with a tensor containing the weight column.
I am aware that there are several threads that already discuss similar problems e.g. defined loss function in tensorflow?
For some reason I am running into a lot of problems trying to bring my weight column in. It's probably two easy lines of code, or maybe there is an easier way to achieve the same result.
I believe I found the answer:
weight_tf = features[:, -1]  # last column of features holds the per-example weight
loss = tf.losses.softmax_cross_entropy(target, logits, weights=weight_tf)
The weight is the last column in the features tensor.
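As a rough illustration of per-example weighting, here is a NumPy sketch that splits the weight column off the features and applies it to a softmax cross entropy. The data, labels and logits are invented, and normalising by the sum of the weights is an assumption that may differ from the reduction TensorFlow applies.

```python
import numpy as np


def weighted_softmax_cross_entropy(onehot_labels, logits, weights):
    # Per-example softmax cross entropy, scaled by a per-example weight
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_example = -(onehot_labels * log_probs).sum(axis=1)
    return (weights * per_example).sum() / weights.sum()


# The last column is the weight and must not be fed to the network as a feature
data = np.array([[0.2, 1.3, 5.0],
                 [0.7, 0.1, 1.0],
                 [1.5, 0.4, 1.0]])
features, weights = data[:, :-1], data[:, -1]

labels = np.array([2, 0, 1])
onehot_labels = np.eye(3)[labels]
logits = np.array([[0.1, 0.2, 1.5],   # stand-in for network outputs
                   [2.0, 0.3, 0.1],
                   [0.4, 0.9, 0.2]])

loss = weighted_softmax_cross_entropy(onehot_labels, logits, weights)
unweighted = weighted_softmax_cross_entropy(onehot_labels, logits, np.ones(3))
```

Here the first example counts five times as much as each of the other two, while features keeps only the two real input columns.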
