Let's say I have a ground truth (output) number in the range [0, 100].
Is it possible to learn to predict a number that minimizes the delta from the original number (ground truth), given the input?
The available Keras objectives are listed here: http://keras.io/objectives/
I think your best bet would be to use mean squared error (loss='mse'), which penalizes predictions based on the square of their difference from the ground truth. You would also want to use linear activations (the default) for the last layer.
If you're especially concerned about keeping the predictions within the range [0, 100], you could create a modified objective function that penalizes predictions outside [0, 100] even more than quadratically, but that's probably not necessary and you could instead just clip the predictions using np.clip(predictions, 0, 100).
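For reference, here is a minimal sketch of that setup (the hidden layer size and input_dim=10 are just assumptions for illustration, not part of the question):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Minimal sketch; layer sizes and input_dim are assumptions for illustration.
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=10))
model.add(Dense(1))  # linear activation (the default) for the output layer
model.compile(optimizer='rmsprop', loss='mse')

# After training, predictions can optionally be clipped to the valid range:
# predictions = np.clip(model.predict(X_test), 0, 100)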
I am working on a neural network with PyTorch which simply maps points from the plane to real numbers, for example
model = nn.Sequential(nn.Linear(2,2),nn.ReLU(),nn.Linear(2,1))
What I want to do, since this network defines a map h:R^2->R, is to compute the gradient of this mapping h in the training loop. So for example
for it in range(epochs):
    pred = model(X_train)
    grad = torch.autograd.grad(pred, X_train)
    ....
The training set has been defined as a tensor with requires_grad=True. My problem is that even though the output for each fixed point is a scalar, since I am passing a batch of N=100 points, the output is actually an Nx1 tensor. This leads to the error that autograd can only compute the gradient of scalar functions.
In fact, with the small change
pred = torch.sum(model(X_train))
everything works perfectly. However, I am interested in each individual gradient, so is there a way to compute all of them together?
Computing the sum as above does give exactly the result I expect, but I wanted to know whether this is the only possibility.
There are other possibilities, but using .sum() is the simplest way. Calling .sum() on the final output vector and computing dpred/dinput gives you the desired result. Here is why:
Since pred = sum_i f(x_i), where i indexes the inputs x_i,
dpred/dinput is the collection [dpred/dx_0, dpred/dx_1, ...].
Consider dpred/dx_0: it equals df(x_0)/dx_0, because df(x_i)/dx_0 = 0 for every i != 0, so the sum does not mix the gradients of different points.
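As a concrete sketch (the shapes are assumptions matching the question's N=100 points in R^2), both the sum trick and passing grad_outputs of ones recover every per-point gradient:

import torch
import torch.nn as nn

# Sketch only; N=100 points in R^2 are assumptions from the question.
model = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 1))
X_train = torch.randn(100, 2, requires_grad=True)

pred = model(X_train)                                # shape (100, 1)
grads = torch.autograd.grad(pred.sum(), X_train)[0]  # shape (100, 2); row i is the gradient of h at x_i

# Equivalent without reducing to a scalar: supply grad_outputs of ones.
pred = model(X_train)
grads_alt = torch.autograd.grad(pred, X_train, grad_outputs=torch.ones_like(pred))[0]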
I'm new to Stack Overflow and I also recently started working with TensorFlow and Keras. Currently I'm developing an architecture using LSTM units. My question was partially discussed here:
What does the implementation of keras.losses.sparse_categorical_crossentropy look like?
However, in my model I have a predicted tensor, y_hat, of size (batch_size, seq_length, vocabulary_dimension) and the true labels, y, of size (batch_size, seq_length).
I would like to know how the value of the loss is computed when I call
loss = sparse_categorical_crossentropy(y, y_hat): how does sparse_categorical_crossentropy calculate the loss value starting from two tensors of different dimensions?
Cross entropy is a way to compare two probability distributions; it tells you how different or similar they are. For a true distribution p and a predicted distribution q over the same categories, it is defined as H(p, q) = -sum_i p_i * log(q_i).
The 'sparse' part in 'sparse_categorical_crossentropy' indicates that y_true must have a single value per row, e.g. [0, 2, ...], indicating which outcome (category) was the right choice. The model then outputs y_pred, which must look like [[.99, .01, 0], [.01, .5, .49], ...]. Here, the model predicts that in the first row the 0th category has a probability of .99, which is very close to the true value, i.e. [1, 0, 0]. sparse_categorical_crossentropy then computes a single number from the two distributions using the formula above and returns it.
If you used a 'categorical_crossentropy' it would expect the y_true to be a one-hot encoded vector, like [[0,0,1], [0,1,0], ...].
If you would like to know the details in depth, you can take a look at the source.
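As a small sketch of how the shapes line up (the batch size, sequence length and vocabulary size below are toy assumptions):

import tensorflow as tf

# Toy shapes: batch_size=2, seq_length=3, vocabulary_dimension=4 (assumptions for illustration).
y_true = tf.constant([[0, 2, 1], [3, 1, 0]])                  # (batch, seq) integer class indices
y_hat = tf.nn.softmax(tf.random.uniform((2, 3, 4)), axis=-1)  # (batch, seq, vocab) probabilities

# The integer labels index into the last axis of y_hat, so the tensors of
# "different dimensions" match up: one loss value per (batch, seq) position.
loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_hat)
print(loss.shape)  # (2, 3)

# The non-sparse version expects one-hot labels instead.
loss_cat = tf.keras.losses.categorical_crossentropy(tf.one_hot(y_true, depth=4), y_hat)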
I've read multiple answers to questions about the 'accuracy' metric used in Keras, but I'm not entirely confident I understand what it means in terms of lane detection. Does the Keras metric count the pixels in the prediction that equal the pixels in the ground truth and divide by the total number of pixels? Or is it necessary to create a custom metric that does this?
From the Keras source on GitHub:
Calculates how often predictions matches labels.
For example, if `y_true` is [1, 2, 3, 4] and `y_pred` is [0, 2, 3, 4]
then the accuracy is 3/4 or .75. If the weights were specified as
[1, 1, 0, 0] then the accuracy would be 1/2 or .5.
So it all depends on how you describe the target vector, i.e. the values obtained from the output layer. Let's assume that you have a 255x255 image where, in matrix form, 1 represents a lane pixel and 0 represents no lane. Flattening it gives a binary vector of length 255*255 = 65025. Then, for each accuracy measurement, Keras compares your model's prediction (where it put the lane) with the original (test) data element by element and computes the fraction that match.
Please note that for outputs this large there are many transforms that reduce the size of the model, and many interesting papers describe the various methods.
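In other words, the accuracy here reduces to the fraction of matching pixels, roughly like this sketch (the 255x255 binary masks are assumptions for illustration):

import numpy as np

# Sketch only; random 255x255 binary lane masks stand in for real predictions and labels.
y_true = np.random.randint(0, 2, size=(255, 255))
y_pred = np.random.randint(0, 2, size=(255, 255))

pixel_accuracy = np.mean(y_true.flatten() == y_pred.flatten())
print(pixel_accuracy)  # fraction of pixels where the prediction equals the ground truth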
I applied linear regression on some features to predict the target, with 10-fold cross-validation.
MinMax scaling was applied to both the features and the target.
Then the features were standardized.
When I run the model, the R² is 0.65 and the MSE is 0.02.
But when I use the targets as they are, without MinMax scaling, I get the same R² but the MSE increases a lot, to 18.
My question is: do we have to preprocess the targets the same way we preprocess the features? And which of the values above is correct? The MSE gets much bigger without scaling the target.
Some people say we have to scale the targets too while others say no.
Thanks in advance.
Whether you scale your target or not changes the 'meaning' of your error. For example, consider two different targets, one in the range [0, 100] and another in [0, 10000]. If you fit models to them (with no scaling), an MSE of 20 would mean very different things for the two models: disastrous in the former case, pretty decent in the latter.
So the fact that you get a lower MSE with the target scaled to [0, 1] than with the original is not surprising.
At the same time, the R² value is independent of the target's range, since it is calculated from ratios of variances.
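A quick sketch on synthetic data (the data and model below are made up purely for illustration) shows this: scaling the target shrinks the MSE, but R² stays the same.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic data; shapes and coefficients are assumptions for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=5.0, size=200)

pred = LinearRegression().fit(X, y).predict(X)
print(mean_squared_error(y, pred), r2_score(y, pred))                    # large MSE

y_scaled = MinMaxScaler().fit_transform(y.reshape(-1, 1)).ravel()
pred_s = LinearRegression().fit(X, y_scaled).predict(X)
print(mean_squared_error(y_scaled, pred_s), r2_score(y_scaled, pred_s))  # much smaller MSE, same R^2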
Scaling allows you to compare model performance for different targets, among other things.
Also for some model types (like NNs) scaling would be more important.
Hope it helps!
First, let me describe my question and situation.
I want to do multi-label classification in Chainer, and my class imbalance problem is very serious.
In this case I must slice the vector in order to calculate the loss function. For example, in multi-label classification most elements of the ground-truth label vector are 0 and only a few are 1. In this situation, applying F.sigmoid_cross_entropy directly to all the 0/1 elements may prevent training from converging, so I decided to use a slice a[[xx,xxx,...,xxx]] (where a is the chainer.Variable output by the last FC layer) to select specific elements for the loss calculation.
Because the label imbalance may lead to poor classification performance on rare classes, I want to give variables with rare ground-truth labels a high loss weight during back-propagation, and give variables with majority labels (those that occur very often in the ground truth) a low weight.
How should I do this? What do you suggest for training with an imbalanced multi-label classification problem in Chainer?
You can use sigmoid_cross_entropy() in no-reduce mode (by passing reduce='no') to obtain a loss value at each spatial location, and the average() function for weighted averaging.
sigmoid_cross_entropy() first computes the loss value at each spatial location and for each item along the batch dimension, and then takes the mean or sum over the spatial and batch dimensions (depending on the normalize option). You can disable this reduction by passing reduce='no'. If you want a weighted average, you should pass this option so that you get the loss value at each location and can reduce it yourself.
After that, the simplest way to do the weighted averaging manually is average(), which accepts a weights argument indicating the weights for averaging. It first computes the weighted sum of the input and the weights, and then divides the result by the sum of the weights. You can build a weight array with the same shape as the input and pass it to average() along with the raw (unreduced) loss values obtained from sigmoid_cross_entropy(..., reduce='no'). It is also fine to multiply by a weight array manually and take the sum, like F.sum(score * weight), if the weights are appropriately scaled (e.g. summing to 1).
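A minimal sketch of this approach (the shapes and the 10x weight on positive labels are assumptions, not a recommendation):

import numpy as np
import chainer.functions as F

# Toy shapes (batch of 8, 5 labels); the 10x weight on rare positive labels is an assumption.
logits = np.random.randn(8, 5).astype(np.float32)               # output of the last FC layer
labels = np.random.randint(0, 2, size=(8, 5)).astype(np.int32)

per_element = F.sigmoid_cross_entropy(logits, labels, reduce='no')  # shape (8, 5), one loss per element

weight = np.where(labels == 1, 10.0, 1.0).astype(np.float32)
loss = F.average(per_element, weights=weight)                    # weighted mean over all elements
# or manually: loss = F.sum(per_element * weight) / float(weight.sum())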
If you are working on multi-label classification, how about using the softmax_cross_entropy loss?
softmax_cross_entropy can take class imbalance into account via its class_weight argument.
https://github.com/chainer/chainer/blob/v3.0.0rc1/chainer/functions/loss/softmax_cross_entropy.py#L57
https://docs.chainer.org/en/stable/reference/generated/chainer.functions.softmax_cross_entropy.html
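For reference, a small sketch of the class_weight option (the toy data and weights are assumptions; note that softmax_cross_entropy expects a single class index per example):

import numpy as np
import chainer.functions as F

# Toy single-label data (assumptions): 8 examples, 5 classes, rare classes 3 and 4 weighted 10x.
logits = np.random.randn(8, 5).astype(np.float32)
labels = np.random.randint(0, 5, size=(8,)).astype(np.int32)
class_weight = np.array([1.0, 1.0, 1.0, 10.0, 10.0], dtype=np.float32)

loss = F.softmax_cross_entropy(logits, labels, class_weight=class_weight)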