I want to weight the training data based on a column in the training data set, thereby giving more importance to certain training items than to others. The weighting column should not be included as a feature in the input layer.
The TensorFlow documentation contains an example of how to use the label of an item to assign a custom loss and thereby a weight:
# Ensures that the loss for examples whose ground truth class is `3` is 5x
# higher than the loss for all other examples.
weight = tf.multiply(4, tf.cast(tf.equal(labels, 3), tf.float32)) + 1
onehot_labels = tf.one_hot(labels, num_classes=5)
tf.contrib.losses.softmax_cross_entropy(logits, onehot_labels, weight=weight)
I am using this in a custom DNN with three hidden layers. In theory I simply need to replace labels in the example above with a tensor containing the weight column.
I am aware that there are several threads that already discuss similar problems, e.g. "Defined loss function in tensorflow?".
For some reason I am running into a lot of problems trying to bring my weight column in. It is probably just two easy lines of code, or maybe there is an easier way to achieve the same result.
I believe I found the answer:
# Slice the weight column (the last column) off the features tensor
weight_tf = features[:, -1]
loss = tf.losses.softmax_cross_entropy(target, logits, weights=weight_tf)
The weight is the last column of the features tensor.
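For clarity, here is a minimal sketch of how this fits together; my_dnn and onehot_labels are illustrative placeholders, not part of the original answer:

import tensorflow as tf

# features: [batch_size, num_features + 1]; the last column holds the per-example weight
inputs = features[:, :-1]    # fed to the input layer, weight column excluded
weights = features[:, -1]    # per-example weights, shape [batch_size]

logits = my_dnn(inputs)      # hypothetical DNN with three hidden layers
loss = tf.losses.softmax_cross_entropy(onehot_labels, logits, weights=weights)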
I am performing an NLP task where I analyze a document and classify it into one of six categories. However, I do this operation at three different time periods, so the final output is an array of three integers (sparse), where each integer is a category 0-5. A label therefore looks like this: [1, 4, 5].
I am using BERT and am trying to decide what type of head I should attach to it, as well as what type of loss function I should use. Would it make sense to use BERT's output of size 1024 and run it through a Dense layer with 18 neurons, then reshape into something of size (3,6)?
Finally, I assume I would use Sparse Categorical Cross-Entropy as my loss function?
The BERT final hidden state is (512, 1024). You can either take the first token, which is the CLS token, or take the average pooling. Either way your final output has shape (1024,). Now simply add three linear layers of shape (1024, 6), as in nn.Linear(1024, 6), and pass their outputs into the loss function below. (You can make it more complex if you want to.)
Simply add up the losses and call backward. Remember you can call loss.backward() on any scalar tensor. (PyTorch)
import torch.nn as nn

def loss(time1output, time2output, time3output, time1label, time2label, time3label):
    loss1 = nn.CrossEntropyLoss()(time1output, time1label)
    loss2 = nn.CrossEntropyLoss()(time2output, time2label)
    loss3 = nn.CrossEntropyLoss()(time3output, time3label)
    return loss1 + loss2 + loss3
In a typical setup you take the CLS output of BERT (a vector of length 768 in the case of bert-base and 1024 in the case of bert-large) and add a classification head (it may be a simple Dense layer with dropout). In this case the inputs are word tokens, the output of the classification head is a vector of logits for each class, and usually a regular Cross-Entropy loss function is used. You then apply softmax to it and get probability-like scores for each class, or if you apply argmax you get the winning class. So the result is either a vector of classification scores [1x6] or the dominant class index (an integer).
[Figure: BERT with a classification head; image taken from d2l.ai]
You can simply concatenate 3 such networks (for each time period) to get the desired result.
Obviously, I have described only one possible solution, but as it usually provides good results I suggest you try it before moving on to more complex ones.
Finally, Sparse Categorical Cross-Entropy loss is used when the target is given as a class index (say [4]) and regular Categorical Cross-Entropy loss is used when the target is one-hot encoded (say [0 0 0 0 1 0]). Otherwise they are exactly the same.
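To illustrate the Dense(18) + reshape idea from the question, here is a minimal Keras sketch; the 1024-dimensional pooled BERT output is assumed to be available as an input tensor, and the layer sizes follow the question:

from tensorflow import keras

# pooled BERT representation, e.g. the CLS token or average pooling (assumed given)
pooled = keras.Input(shape=(1024,), name="pooled_output")

x = keras.layers.Dense(18)(pooled)          # 3 time periods * 6 classes
logits = keras.layers.Reshape((3, 6))(x)    # (batch_size, 3, 6)

model = keras.Model(pooled, logits)
model.compile(
    optimizer="adam",
    # labels have shape (batch_size, 3) with integer classes 0-5, e.g. [1, 4, 5]
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)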
I've set up a neural network regression model using Keras with one target. This works fine; now I'd like to include multiple targets. The dataset includes a total of 30 targets, and I'd rather train one neural network than 30 different ones.
My problem is that in the preprocessing of the data I have to remove some target values for a given example, as they represent unphysical values that are not to be predicted.
This creates the issue that I have a varying number of targets/outputs.
For example:
Targets =
None, 0.007798, 0.012522
0.261140, 2110.000000, 2440.000000
0.048799, None, None
How would I go about creating a keras.Sequential model (or functional model) with a varying number of outputs for a given input?
edit: Could I perhaps first train a classification model that predicts the number of outputs given some test inputs, and then vary the number of outputs in the output layer according to this prediction? I guess I would have to use the functional API for something like that.
The "classification" edit here is unnecessary, i.e. ignore it. The number of outputs of the test targets is a known quantity.
First, do you know up front whether some of the output values will be invalid, or is predicting which outputs will actually be valid part of the problem?
If you don't know up front which outputs to disregard, you could go with something like the 2-step approach you described in your edit.
If it is deterministic (and you know how) which outputs will be valid for any given input, and your problem is just how to set up a proper model, here's how I would do that in Keras:
Use the functional API
Create 30 named output layers (e.g. out_0, out_1, ... out_29)
When creating the model, just use the outputs argument to list all 30 outputs
When compiling the model, specify a loss for each separate output, you can do this by passing a dictionary to the loss argument where the keys are the names of your output layers and the values are the respective losses
Assuming you'll use mean-squared error for all outputs, the dictionary will look something like {'out_0': 'mse', 'out_1': 'mse', ..., 'out_29': 'mse'}
When passing inputs to the model, pass three things per sample: x, y, and loss weights
y has to be a dictionary where the key is the output layer name and the value is the target output value
The loss-weights are also a dictionary in the same format as y. The weights in your case can just be binary, 1 for each output that corresponds to a real value, 0 for each output that corresponds to unphysical values (so they are disregarded during training) for any given sample
Don't pass Nones for the unphysical value targets; use some kind of numeric filler, otherwise you'll get issues. It is completely irrelevant what you use as your filler, as it will not affect the gradients during training
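Here is a minimal sketch of this setup, shown with 3 outputs instead of 30 and made-up data; it assumes model.fit accepts per-output dictionaries for y and sample_weight (names are illustrative):

import numpy as np
from tensorflow import keras

num_features = 8     # illustrative
num_targets = 3      # use 30 in the real setup

inp = keras.Input(shape=(num_features,))
h = keras.layers.Dense(64, activation="relu")(inp)
outputs = [keras.layers.Dense(1, name=f"out_{i}")(h) for i in range(num_targets)]

model = keras.Model(inp, outputs)
model.compile(optimizer="adam",
              loss={f"out_{i}": "mse" for i in range(num_targets)})

# y: one target array per output; unphysical targets filled with a numeric dummy (here 0.0)
# sample weights: 1 where the target is real, 0 where it is unphysical
x = np.random.rand(4, num_features)
y = {"out_0": np.array([0.0, 0.26, 0.05, 0.10]),
     "out_1": np.array([0.007798, 2110.0, 0.0, 0.20]),
     "out_2": np.array([0.012522, 2440.0, 0.0, 0.30])}
w = {"out_0": np.array([0.0, 1.0, 1.0, 1.0]),
     "out_1": np.array([1.0, 1.0, 0.0, 1.0]),
     "out_2": np.array([1.0, 1.0, 0.0, 1.0])}

model.fit(x, y, sample_weight=w, epochs=1)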
This will give you a trainable model. BUT once you move on from training and try to predict on new data, YOU will have to decide which outputs to disregard for each sample; the network will likely still give you "valid"-looking outputs for those inputs.
One possible solution would be to have a separate output of "validity flags" which takes values in the range from zero to one. For example, your first target will be
y=[0.0, 0.007798, 0.012522]
yf=[0.0, 1.0, 1.0]
where zeros indicate invalid values.
Use a sigmoid activation function for yf.
The loss function can be the sum of the losses for y and yf.
During inference, analyze the network's output for yf and only consider a y value valid if the corresponding yf exceeds a 0.5 threshold.
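A minimal sketch of this two-head idea in Keras (layer sizes and names are illustrative assumptions):

from tensorflow import keras

num_features = 8      # illustrative
num_targets = 30

inp = keras.Input(shape=(num_features,))
h = keras.layers.Dense(64, activation="relu")(inp)
y_out = keras.layers.Dense(num_targets, name="y")(h)                          # regressed values
yf_out = keras.layers.Dense(num_targets, activation="sigmoid", name="yf")(h)  # validity flags

model = keras.Model(inp, [y_out, yf_out])
# total loss = mse on the values + binary cross-entropy on the validity flags
model.compile(optimizer="adam",
              loss={"y": "mse", "yf": "binary_crossentropy"})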
I can't find this info in the documentation, so I am asking here.
I have a multioutput model with 3 different outputs:
model = tf.keras.Model(inputs=[input], outputs=[output1, output2, output3])
The predicted labels for validation are constructed from these 3 outputs to form only one; it's a post-processing step. The dataset used for training consists of those 3 intermediate outputs; for validation I evaluate on a dataset of final labels instead of the 3 kinds of intermediate data.
I would like to evaluate my model using a custom metric that handles the post-processing and the comparison with the ground truth.
My question is, in the code of the custom metric, will y_pred be a list of the 3 outputs of the model?
class MyCustomMetric(tf.keras.metrics.Metric):

    def __init__(self, name='my_custom_metric', **kwargs):
        super(MyCustomMetric, self).__init__(name=name, **kwargs)

    def update_state(self, y_true, y_pred, sample_weight=None):
        # ? is y_pred a list [batch_output_1, batch_output_2, batch_output_3] ?
        pass

    def result(self):
        pass

# one single metric handling the 3 outputs?
model.compile(optimizer=tf.compat.v1.train.RMSPropOptimizer(0.01),
              loss=tf.keras.losses.categorical_crossentropy,
              metrics=[MyCustomMetric()])
With your given model definition, this is a standard multi-output Model.
model = tf.keras.Model(inputs=[input], outputs=[output_1, output_2, output_3])
In general, all (custom) metrics as well as (custom) losses will be called on every output separately (as y_pred)! Within the loss/metric function you will only see one output together with the corresponding target tensor.
By passing a list of loss functions (length == number of outputs of your model) you can specify which loss will be used for which output:
model.compile(optimizer=Adam(), loss=[loss_for_output_1, loss_for_output_2, loss_for_output_3], loss_weights=[1, 4, 8])
The total loss (which is the objective function to minimize) will be the additive combination of all losses multiplied with the given loss weights.
It is almost the same for the metrics! Here you can pass (as for the loss) a list (length == number of outputs) of metrics and tell Keras which metric to use for which of your model's outputs.
model.compile(optimizer=Adam(), loss='mse', metrics=[metrics_for_output_1, metrics_for_output_2, metrics_for_output_3])
Here metrics_for_output_X can be either a function or a list of functions, which will all be called with the corresponding output_X as y_pred.
This is explained in detail in the documentation of multi-input and multi-output models in Keras. They also show examples of using dictionaries (to map loss/metric functions to a specific output) instead of lists.
https://keras.io/getting-started/functional-api-guide/#multi-input-and-multi-output-models
Further information:
If I understand you correctly, you want to train your model using a loss function that compares the three model outputs with three ground-truth values, and you want to do some sort of performance evaluation by comparing a value derived from the three model outputs with a single ground-truth value.
Usually the model is trained on the same objective it is evaluated on; otherwise you might get poorer results when evaluating your model!
Anyways... for evaluating your model on a single label I suggest you either:
1. (The clean solution)
Rewrite your model and incorporate the post-processing steps. Add all the necessary operations (as layers) and map them to an auxiliary output. For training your model you can set the loss_weight of the auxiliary output to zero.
Merge your datasets so you can feed your model the model input, the intermediate target outputs, as well as the labels.
As explained above, you can now define a metric comparing the auxiliary model output with the given target labels (see the sketch after option 2 below).
2.
Or you train your model as-is and derive the metric, e.g. in a custom Callback, by running your post-processing steps on the three outputs of model.predict(input).
This makes it necessary to write custom summaries if you want to track those values in TensorBoard; that's why I would not recommend this solution.
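Here is a minimal sketch of option 1; post_process, the layer sizes and the losses are illustrative placeholders, and MyCustomMetric is the metric class from the question:

from tensorflow import keras

inp = keras.Input(shape=(16,))                        # illustrative input shape
h = keras.layers.Dense(32, activation="relu")(inp)
output_1 = keras.layers.Dense(4, activation="softmax", name="out_1")(h)
output_2 = keras.layers.Dense(4, activation="softmax", name="out_2")(h)
output_3 = keras.layers.Dense(4, activation="softmax", name="out_3")(h)

def post_process(outs):
    o1, o2, o3 = outs
    return o1 + o2 + o3                               # placeholder for the real post-processing

aux = keras.layers.Lambda(post_process, name="aux")([output_1, output_2, output_3])

model = keras.Model(inp, [output_1, output_2, output_3, aux])
model.compile(
    optimizer="adam",
    loss={"out_1": "categorical_crossentropy",
          "out_2": "categorical_crossentropy",
          "out_3": "categorical_crossentropy",
          "aux": "mse"},                              # any loss; it is switched off below
    loss_weights={"out_1": 1.0, "out_2": 1.0, "out_3": 1.0, "aux": 0.0},
    metrics={"aux": [MyCustomMetric()]},              # the custom metric only sees the aux output
)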
I'm working on a binary semantic segmentation task where the distribution of one class is very small across any input image, hence only a few pixels are labeled with it. When using sparse_softmax_cross_entropy, the overall error is easily decreased by ignoring this class. Now I'm looking for a way to weight the classes by a coefficient which penalizes misclassifications of the specific class more heavily than those of the other class.
The doc of the loss function states:
weights acts as a coefficient for the loss. If a scalar is provided, then the loss is simply scaled by the given value. If weights is a tensor of shape [batch_size], then the loss weights apply to each corresponding sample.
If I understand this correctly, it says that specific samples in a batch get weighted differently from others. But this is actually not what I'm looking for. Does anyone know how to implement a weighted version of this loss function where the weights scale the importance of a specific class rather than of specific samples?
To answer my own question:
The authors of the U-Net paper used a pre-computed weight map to handle imbalanced classes.
The Institute for Astronomy of ETH Zurich provides a TensorFlow-based U-Net package which contains a weighted version of the softmax function (not sparse, but they flatten their labels and logits first):
# class_weights: one coefficient per class
class_weights = tf.constant(np.array(class_weights, dtype=np.float32))
# per-pixel weight: pick the weight of each pixel's ground-truth class
weight_map = tf.multiply(flat_labels, class_weights)
weight_map = tf.reduce_sum(weight_map, axis=1)
# per-pixel cross-entropy, scaled by the class weight and averaged
loss_map = tf.nn.softmax_cross_entropy_with_logits_v2(logits=flat_logits, labels=flat_labels)
weighted_loss = tf.multiply(loss_map, weight_map)
loss = tf.reduce_mean(weighted_loss)
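For context, flat_labels and flat_logits are simply the per-pixel tensors reshaped to two dimensions; a sketch of how they might be produced (shapes are illustrative assumptions, not part of the original package):

import tensorflow as tf

# labels: [batch, height, width, num_classes] one-hot ground truth
# logits: [batch, height, width, num_classes] raw network outputs
num_classes = 2
flat_labels = tf.reshape(labels, [-1, num_classes])
flat_logits = tf.reshape(logits, [-1, num_classes])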
I'm trying to build a NN with Keras and Tensorflow to predict the final chart position of a song, given a set of 5 features.
After playing around with it for a few days I realised that although my MAE was getting lower, this was because the model had just learned to predict the mean value of my training set for all inputs, and this was the optimal solution it had found. (This is illustrated in the scatter plot below.)
[Scatter plot: a random sample of 50 data points from my testing set vs. what the network thinks they should be]
At first I thought this was probably because my network was too complicated. I had one input layer of shape (5,) and a single node in the output layer, but also 3 hidden layers with over 32 nodes each.
I then stripped back the excess layers and moved to just a single hidden layer with a couple of nodes, as shown here:
self.model = keras.Sequential([
    keras.layers.Dense(4,
                       activation='relu',
                       input_dim=num_features,
                       kernel_initializer='random_uniform',
                       bias_initializer='random_uniform'),
    keras.layers.Dense(1)
])
Training this with a gradient descent optimiser still results in exactly the same prediction being made every time.
Then it occurred to me that perhaps the actual problem I'm trying to solve isn't hard enough for the network, and that maybe it's just linear. Since this would respond better to not having a hidden layer at all, essentially just doing regular linear regression, I tried that. I changed my model to:
inp = keras.Input(shape=(num_features,))
out = keras.layers.Dense(1, activation='relu')(inp)
self.model = keras.Model(inp,out)
This also changed nothing. My MAE and the predicted values are all the same.
I've tried so many different things, different permutations of optimisation functions, learning rates, network configurations, and nothing can help. I'm pretty sure the data is good, but I've included a sample of it just in case.
chartposition,tagcount,dow,artistscore,timeinchart,finalpos
121,3925,5,35128,7,227
131,4453,3,85545,25,130
69,2583,4,17594,24,523
145,1165,3,292874,151,187
96,1679,5,102593,111,540
134,3494,5,1252058,37,370
6,34895,7,6824048,22,5
A sample of my dataset; finalpos is the value I'm trying to predict. The dataset contains ~40,000 records, split 80/20 training/testing.
def __init__(self, validation_split, num_features, should_log):
    self.should_log = should_log
    self.validation_split = validation_split
    inp = keras.Input(shape=(num_features,))
    out = keras.layers.Dense(1, activation='relu')(inp)
    self.model = keras.Model(inp, out)
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    self.model.compile(loss='mae',
                       optimizer=optimizer,
                       metrics=['mae'])

def train(self, data, labels, plot=False):
    early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
    history = self.model.fit(data,
                             labels,
                             epochs=self.epochs,
                             validation_split=self.validation_split,
                             verbose=0,
                             callbacks=[PrintDot(), early_stop])
    if plot:
        self.plot_history(history)
All code relevant to constructing and training the network.
def normalise_dataset(df, mini, maxi):
return (df - mini)/(maxi-mini)
Normalisation of the input data. Both my testing and training data are normalised to the max and min of the testing set
[Graph: loss vs. validation curves for the one-hidden-layer network with an Adam optimiser, learning rate 0.01]
[Same graph, but with linear regression and a gradient descent optimiser]
So I am pretty sure that your normalization is the issue: you are not normalizing by feature (as is the de-facto industry standard), but across all data.
That means that if you have two different features with very different orders of magnitude/ranges (in your case, compare timeinchart with artistscore), the smaller one is squashed towards zero and carries almost no signal.
Instead, you might want to normalize using something like scikit-learn's StandardScaler. Not only does this normalize per column (so you can pass all features at once), but it also produces unit variance (which makes some assumption about your data, but can potentially help, too).
To transform your data, use something along these lines
from sklearn.preprocessing import StandardScaler
import numpy as np
raw_data = np.array([[1,40], [2, 80]])
scaler = StandardScaler()
processed_data = scaler.fit_transform(raw_data)
# fit() calculates mean etc, transform() puts it to the new range.
print(processed_data) # returns [[-1, -1], [1,1]]
Note that you have two possibilities to normalize/standardize your training data:
Either scale your training and test data together and then split afterwards,
or you fit the scaler only on the training data and then use the same scaler to transform your test data.
Never fit_transform your test set separately from the training data!
Since you would get potentially different mean/min/max values, you can end up with totally wrong predictions! In a sense, the StandardScaler is your definition of your "data source distribution", which is inherently still the same for your test set, even though it might be a subset not exactly following the same properties (due to small sample size etc.).
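A minimal sketch of the second option, assuming X_train and X_test are your already-split feature arrays:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the statistics on the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics for the test set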
Additionally, you might want to use a more advanced optimizer, like Adam, or specify some momentum (0.9 is a good rule-of-thumb choice in practice) for your SGD.
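For example (a sketch; model is the Keras model from the question):

from tensorflow import keras

model.compile(loss='mae', optimizer=keras.optimizers.Adam())
# or plain SGD with momentum:
# model.compile(loss='mae', optimizer=keras.optimizers.SGD(momentum=0.9))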
Turns out the error was a really stupid and easy-to-miss bug.
When I was importing my dataset, I shuffled it; however, when I performed the shuffling, I was accidentally applying it only to the labels, not to the dataset as a whole.
As a result, each label was being assigned to a completely random feature set, so of course the model didn't know what to do with this.
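For reference, a joint shuffle that keeps feature/label pairs aligned might look like this (a sketch, assuming data and labels are NumPy arrays of equal length):

import numpy as np

# shuffle features and labels with the same permutation so the pairs stay aligned
perm = np.random.permutation(len(labels))
data, labels = data[perm], labels[perm]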
Thanks to @dennlinger for suggesting that I look in the place where I eventually found this bug.