Converting PyTorch Boolean target to regression target - python

Question
I have code that is based on Part 2, Chapter 11 of Deep Learning with PyTorch, by Luca Pietro Giovanni Antiga, Thomas Viehmann, and Eli Stevens. It's working just fine. It predicts the value of a Boolean variable. I want to convert this so that it predicts the value of a real number variable that happens to be always between 0 and 34.
There are two parts that I don't know how to convert. First, this part:
pos_t = torch.tensor([
        not candidateInfo_tup.isNodule_bool,
        candidateInfo_tup.isNodule_bool
    ],
    dtype=torch.long,
)
(Why are two values passed in here when one is completely determined by the other?)
and then this part:
self.head_linear = nn.Linear(1152, 2)
self.head_softmax = nn.Softmax(dim=1)
How do I do this?
Guess
I don't want people to think I haven't thought about this at all, so here is my guess:
First part:
age_t = torch.tensor(candidateInfo_tup.age_int, dtype=torch.double)
Second part:
self.head_linear = nn.Linear(299520, 1)
self.head_relu = nn.ReLU()
I'm also guessing that I need to change this:
loss_func = nn.CrossEntropyLoss(reduction='none')
to something like this:
loss_func = nn.L1Loss()
My guesses are based on this article by Christian Versloot.

The example from the book works, but it has some redundant elements that can be confusing.
Normally an output size of 1 is enough for a binary classification problem. To bring it to 0 or 1, one may use a sigmoid and then rounding, as in the example here: PyTorch Binary Classification - same network structure, 'simpler' data, but worse performance?
Or just put after single output neuron this:
y_pred_binary = torch.round(torch.sigmoid(y_pred))
The book example uses an output size of 2 and then applies softmax to get to 0 and 1. This works, but such a technique is typically used in multi-class classification.
For predicting a variable in the 0-34 range:
If these are discrete values, this is called "multi-class classification", as Ken indicated. Use an output size of 35 and softmax in this case. Search for "pytorch multiclass classification" for examples.
If this is a regression, then your changes are in the right direction, except for the 'Second part': instead of ReLU, clip the output at both ends to the range [0, 34]. Also, 299520 is far too large an input for that layer; use whatever input size was there before (1152). Search for "pytorch regression" for examples, or see the sketch below.
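To make that concrete, here is a minimal sketch of what the regression head and loss could look like (the feature size 1152 is taken from the question's snippet; the class name, the [0, 34] clamp, and the use of float32 targets are assumptions, not the book's code):
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    # hypothetical name; 1152 matches the flattened conv features in the question
    def __init__(self):
        super().__init__()
        self.head_linear = nn.Linear(1152, 1)

    def forward(self, conv_flat):
        # clamp instead of ReLU so the prediction stays inside the known [0, 34] range
        return self.head_linear(conv_flat).clamp(0, 34)

# target: a single float per sample, e.g.
# age_t = torch.tensor(candidateInfo_tup.age_int, dtype=torch.float32)
# (float32 matches nn.Linear's default parameter dtype)

loss_func = nn.L1Loss(reduction='none')  # per-sample regression loss, analogous to the book's usage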

Related

Compute cross entropy loss for classification in pytorch

I am trying to build two neural networks for classification, one binary and one multi-class. I am trying to use torch.nn.CrossEntropyLoss() as the loss function, but when I try to train my first neural network I get the following error:
multi-target not supported at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THNN/generic/ClassNLLCriterion.c:22
From my analysis, I found that my dataset has two problems that may have caused the error.
My dataset is one-hot encoded. I used one-hot encoding to preprocess my dataset. The first target variable, Y_binary, has the shape torch.Size([125973, 1]) and is full of 0s and 1s indicating the classes 'No' and 'Yes'.
My data has the wrong dimensions? I found that I can't use a simple vector with the cross-entropy loss function. Some people used the following code to reshape their target vector before feeding it to the loss function:
out = out.permute(0, 2, 3, 1).contiguous().view(-1, class_number)
But I didn't really understand the reasoning behind this code. It seems to me that I need to keep track of the following variables: Class_Number, Batch_size, Dimension_Output. For my code, here are the dimensions:
X_train.shape: (125973, 122)
Y_train2.shape: (125973, 1)
batch_size = 64
K = len(set(Y_train2))  # binary classification; for multi-class classification use K = len(set(Y_train5))
Should the target value be one-hot encoded? If not, how can I feed a nominal feature to the loss function?
If I need to reshape the output, can you help me do this for my code?
I am trying to use this loss function for both my neural networks.
Thank you in advance,
The error comes from the target shape: torch.nn.CrossEntropyLoss() predicts 1 class out of N and expects the target to be a 1-D tensor of class indices, not a one-hot (or (N, 1)) tensor. For one-hot encoded targets, and for multi-label classification, you should use torch.nn.BCEWithLogitsLoss(), which combines a Sigmoid layer and BCELoss in one single class.
In the multi-class (or multi-label) case, if you use Sigmoid + BCELoss, you need the target to be one-hot encoded, i.e. something like this per sample: [0 1 0 0 0 1 0 0 1 0], where a 1 sits at the location of each class that is present.
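As a rough illustration of the shape issue (the tensors here are random placeholders; the shapes mirror the question): nn.CrossEntropyLoss wants raw logits of shape (batch, num_classes) and integer class indices of shape (batch,), while BCEWithLogitsLoss takes float targets with the same shape as the logits.
import torch
import torch.nn as nn

batch_size, num_classes = 64, 2
logits = torch.randn(batch_size, num_classes)              # raw outputs, no softmax applied
targets = torch.randint(0, num_classes, (batch_size, 1))   # shaped like Y_train2: (N, 1)

ce = nn.CrossEntropyLoss()
# ce(logits, targets) fails with "multi-target not supported"
loss_ce = ce(logits, targets.squeeze(1).long())            # 1-D class indices of shape (N,) work

bce = nn.BCEWithLogitsLoss()                               # binary / multi-label alternative
single_logit = torch.randn(batch_size, 1)                  # one output unit
loss_bce = bce(single_logit, targets.float())              # (N, 1) float targets are fine here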

How to get the output with maximum probability from all the predicted outputs of the dense layer?

I trained a neural network for sign language recognition. Here's my output layer: model.add(Dense(units=26,activation="softmax"))
Now I'm getting a probability for each of the 26 letters. Somehow I'm getting 99% accuracy when I test this model with accuracy = model.evaluate(x=test_X,y=test_Y,batch_size=32). I'm new at this and can't understand how this code works; I'm missing something major here. How do I get a 1-D list containing just the predicted letters?
To get probabilities you need to do something like this:
prediction = model.predict(test_X)
probs = prediction.max(1)
But it is important to remember that softmax outputs are normalized scores rather than well-calibrated probabilities for each class.
To get outputs with maximum probability in a single list, run:
np.argmax(model.predict(x_test),axis=1)
Supposing alphabet is a list with all alphabet symbols alphabet = ['a', 'b', ...]
pred = model.predict(test_X)
pred_ind = pred.argmax(1)  # indices of the most probable class, not the probabilities themselves
pred_alphabet = [alphabet[ind] for ind in pred_ind]
will give you the list with predicted symbols.
In a neural network the first layer is for your input image. Let's say your image is 32x32 pixels; in that case you would have 32x32x3 nodes in the input layer, where the 3 comes from the RGB color channels. Then, depending on your design and model, you use an appropriate number of hidden layers; in most scenarios two hidden layers are used. The final layer has one node per distinct class: if you are going to identify 26 distinct signs, you will have 26 nodes in the final layer, as in the sketch below.
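A hedged sketch of such a network in Keras (the 32x32x3 input and two hidden layers follow the description above; the layer widths and optimizer are arbitrary choices, not the asker's model):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

model = Sequential([
    Flatten(input_shape=(32, 32, 3)),   # 32x32 RGB image -> 3072 input nodes
    Dense(128, activation="relu"),      # first hidden layer (width is an assumption)
    Dense(64, activation="relu"),       # second hidden layer
    Dense(26, activation="softmax"),    # one node per sign, as in the question
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])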
model.evaluate(x=test_X,y=test_Y,batch_size=32)
I think here you're evaluating your model on your test data set. You may have separated your data into train and test sets earlier; test_X holds the images in the test set and test_Y the corresponding labels. The network is evaluated 32 images at a time, which is the meaning of batch_size=32.
I hope this information helps you understand what you're doing, but your question is not entirely clear. Please refer to the tutorial below; it might be helpful:
https://www.pyimagesearch.com/2018/09/10/keras-tutorial-how-to-get-started-with-keras-deep-learning-and-python/

SKlearn prediction on test dataset with different shape from training dataset shape

I'm new to ML and would be grateful for any assistance provided. I've run a linear regression prediction using test set A and training set A. I saved the linear regression model and would now like to use the same model to predict a test set A target using features from test set B. Each time I run the model it throws up the error below
How can I successfully predict a test data set from features and a target with different shapes?
Input
print(testB.shape)
print(testA.shape)
Output
(2480, 5)
(1315, 6)
Input
saved_model = joblib.load(filename)
testB_result = saved_model.score(testB_features, testA_target)
print(testB_result)
Output
ValueError: Found input variables with inconsistent numbers of samples: [1315, 2480]
Thanks again
They have inconsistent shapes, which is why the error is being thrown. Have you tried reshaping the data so that they have the same shape? From a quick look, it seems that testB has more samples and one fewer feature than testA.
Think about it, if you have trained your model with 5 features you cannot then ask the same model to make a prediction given 6 features. You speak of using a Linear Regressor, the equation is roughly:
y = b + w0*x0 + w1*x1 + w2*x2 + .. + wN-1*xN-1
Where {
y is your output/label
N is the number of features
b is the bias term
w(i) is the ith weight
x(i) is the ith feature value
}
You have trained a linear regressor with 5 features, effectively producing the following
y (your output/label) = b + w0*x0 + w1*x1 + w2*x2 + w3*x3 + w4*x4
You then ask it to make a prediction given 6 features but it only knows how to deal with 5.
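A small sketch of that mismatch with scikit-learn (random data; the 5-vs-6 column counts mirror the question):
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.random.rand(100, 5)      # model trained on 5 features
y_train = np.random.rand(100)
model = LinearRegression().fit(X_train, y_train)

X_new = np.random.rand(10, 6)         # 6 features -> inconsistent with the fitted model
# model.predict(X_new)                # raises a ValueError about the number of features
preds = model.predict(X_new[:, :5])   # works once the feature count matches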
Aside from that issue, you also have too many samples, testB has 2480 and testA has 1315. These need to match, as the model wants to make 2480 predictions, but you only give it 1315 outputs to compare it to. How can you get a score for 1165 missing samples? Do you now see why the data has to be reshaped?
EDIT
Assuming you have datasets with an equal number of features, as discussed above, you may now look at reshaping (removing samples from) testB like so:
testB = testB[0:1315, :]
testB.shape
(1315, 5)
Or, if you would prefer a solution using the numpy API:
testB = np.delete(testB, np.s_[0:(len(testB)-len(testA))], axis=0)
testB.shape
(1315, 5)
Keep in mind, when doing this you slice out a number of samples. If losing those samples matters (which it can), it may be better to introduce a pre-processing step to deal with the missing values, namely imputing them; a sketch follows below. It is also worth noting that the data you are trimming should be shuffled (unless it already is), as you may otherwise be removing parts of the data the model should be learning about. Neglecting to do this could result in a model that does not generalise as well as you hoped.
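If imputation is the preferred route, here is a small sketch with scikit-learn's SimpleImputer (the array and the "mean" strategy are just illustrative choices):
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan]])

imputer = SimpleImputer(strategy="mean")   # fill missing entries with the column mean
X_filled = imputer.fit_transform(X)        # no rows need to be thrown away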

conv_lstm.py example uses 'binary_crossentropy' loss for regression. Why not using 'mean_squared_error' instead?

2 questions:
I was using the Keras team's conv_lstm.py example on GitHub to predict the next frame of the video created in that example. It is obviously a regression problem, since we are going to predict the next frame. I was wondering why they used this loss
line 38:
seq.compile(loss='binary_crossentropy', optimizer='adadelta')
Instead, I believe using:
seq.compile(loss='mean_squared_error', optimizer='rmsprop')
would result in better predictions, since we are implementing a regression problem, rather than classification.
Am I correct?
In line 107 of the code, they left a comment saying that:
feed it with the first 7 positions and then predict the new positions.
Here is the code they used to predict the new frames given the first 7 input frames:
which = 1004
track = noisy_movies[which][:7, ::, ::, ::]
for j in range(16):
    new_pos = seq.predict(track[np.newaxis, ::, ::, ::, ::])
    new = new_pos[::, -1, ::, ::, ::]
    track = np.concatenate((track, new), axis=0)
Suppose I want to predict 7th frame of a test video.
If I don't feed the model with the last 7 frames, instead feed it with just the 7th frame, would it make difference in prediction?
Thanks.
Well, if the outputs are in the range between 0 and 1, it's totally ok to use 'binary_crossentropy'.
It's as if you had a classification problem with just one class: true or false. (The loss is still a continuous function of the output, though, and in the end the point of zero error is the same as for MSE.)
Depending on the activation functions used (especially with 'sigmoid'), 'binary_crossentropy' can get you results much faster than 'mse', because the cross-entropy gradient does not vanish when the sigmoid saturates the way the squared-error gradient does.
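A quick numerical check of the "same point of zero error" claim (plain NumPy, with an arbitrary fractional target): for a fixed target y in [0, 1], binary cross-entropy as a function of the prediction p is minimised at p = y, just like the squared error.
import numpy as np

y = 0.3                                    # a fractional pixel target in [0, 1]
p = np.linspace(0.001, 0.999, 999)         # candidate predictions

bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
mse = (p - y) ** 2

print(p[np.argmin(bce)], p[np.argmin(mse)])  # both minima sit at ~0.3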
An LSTM layer learns from analysing frames (or steps of any kind of data) recurrently.
It has what is called an "inner state". Every step it analyses brings an update to this inner state, so it works both like a "memory" of what happened until this point and also as some kind of positioner like "where in the movie am I now?"
Thus, having predicted the previous steps is absolutely necessary for the LSTM to give good predictions.
Imagine you have never watched Star Wars before, and you start playing it at the scene where Darth Vader says: "Luke, I am your father". You'd just say: "What?".
Now watch all the movies from the beginning and reach that part. Will your understanding be different? The LSTM will agree with you.

Initializing weights in a 3 layer neural network

So I'm learning the SIMPLEST way to code a neural network, one that can be modified in many ways depending on what you want, basically like a template. I found iamtrask's 11-line neural network code, and the weight initialization makes perfect sense:
syn0 = 2*np.random.random((3,1)) - 1
However, when I look at it for his extended 3 layer network, it looks like this:
syn0 = 2*np.random.random((3,4)) - 1
syn1 = 2*np.random.random((4,1)) - 1
I would understand if syn1 was a bit different, but BOTH are now different! He doesn't explain it, only gives a comment saying, "randomly initialize our weights with mean 0."
Can someone explain to me the mathematical reasoning behind this? Go full crazy if you want; I've been a math person since I was 5.
If by different, you are referring to the arguments of np.random.random(), then it is because you are creating weights with different shapes/dimensions. In this example (which ignores biases), you are trying to go from an input of dimension 3 to an output of dimension 1. With one layer, you require the shape (3,1). For two layers, you need shapes (3,n) and (n,1), where n is any integer. This is just to ensure that matrix multiplication is valid. Here n = 4 has been chosen as the hidden layer dimension.
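A short sketch of those shapes in action (the sigmoid and the (3,4)/(4,1) shapes follow the trask example; the data is made up):
import numpy as np

X = np.random.random((5, 3))              # 5 samples, 3 input features

syn0 = 2 * np.random.random((3, 4)) - 1   # input (3) -> hidden (4), uniform in [-1, 1), mean 0
syn1 = 2 * np.random.random((4, 1)) - 1   # hidden (4) -> output (1)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

l1 = sigmoid(X @ syn0)                    # (5, 3) @ (3, 4) -> (5, 4)
l2 = sigmoid(l1 @ syn1)                   # (5, 4) @ (4, 1) -> (5, 1)
print(l1.shape, l2.shape)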
