PyTorch: Extract learned weights correctly

I am trying to extract the weights from a linear layer, but they do not appear to change, although the error is dropping monotonically (i.e. training is happening). Printing the weights' sum shows that it stays constant:
np.sum(model.fc2.weight.data.numpy())
Here are the code snippets:
def train(epochs):
    model.train()
    for epoch in range(1, epochs+1):
        # Train on train set
        print(np.sum(model.fc2.weight.data.numpy()))
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = Variable(data), Variable(data)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
and
# Define model
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(100, 80, bias=False)
        init.normal(self.fc1.weight, mean=0, std=1)
        self.fc2 = nn.Linear(80, 87)
        self.fc3 = nn.Linear(87, 94)
        self.fc4 = nn.Linear(94, 100)

    def forward(self, x):
        x = self.fc1(x)
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = F.relu(self.fc4(x))
        return x
Maybe I am looking at the wrong parameters, although I checked the docs. Thanks for your help!

Use model.parameters() to get the trainable weights for any model or layer. Remember to wrap it in list(), since it returns a generator; otherwise you cannot print it out.
The following snippet works:
>>> import torch
>>> import torch.nn as nn
>>> l = nn.Linear(3,5)
>>> w = list(l.parameters())
>>> w
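For nn.Linear(3, 5), w is a list of two Parameter tensors: the weight of shape (5, 3) and the bias of shape (5). If you want to watch a single layer the way the question does, named_parameters() is also handy; a minimal sketch, assuming the question's model object:

for name, param in model.named_parameters():
    print(name, param.data.sum())  # e.g. 'fc2.weight' followed by its current sum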


PyTorch Binary classification not learning

I should state that I am new to PyTorch. I wrote this simple program for binary classification. I also created a CSV with two columns of random values, plus an "ok" column whose value is 1 only if the other two values both fall between bounds I chose in advance. Example:
diam_int,diam_est,ok
37.782,125.507,0
41.278,115.15,1
42.248,115.489,1
29.582,113.141,0
37.428,107.247,0
32.947,123.233,0
37.146,121.537,0
38.537,110.032,0
26.553,113.752,0
27.369,121.144,0
41.632,108.178,0
27.655,111.279,0
29.779,109.268,0
43.695,115.649,1
44.587,116.126,0
It seems to me everything is done correctly, and the loss actually decreases (it comes back up slightly after many epochs, but I don't think that's a problem). But when I test my net after training, with a sample batch from the training set, what I get is always a prediction below 0.5 (so always 0 as the estimated output), with a completely random trend.
with torch.no_grad():
    pred = net(trainSet[10])
    trueVal = ySet[10]
    for i in range(len(trueVal)):
        print(trueVal[i], pred[i])
Here is my Net class:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(2, 32)
        self.fc2 = nn.Linear(32, 64)
        self.fc3 = nn.Linear(64, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return torch.sigmoid(x)
Here is my Main class:
import torch
import torch.optim as optim
import torch.nn.functional as F
import pandas as pd

from net import Net

df = pd.read_csv("test.csv")
y = torch.Tensor(df["ok"])
ySet = torch.split(y, 32)
df.drop(["ok"], axis=1, inplace=True)
data = F.normalize(torch.Tensor(df.values), dim=1)
trainSet = torch.split(data, 32)

net = Net()
optimizer = optim.Adam(net.parameters(), lr=0.001)
lossFunction = torch.nn.BCELoss()
EPOCHS = 300

for epoch in range(EPOCHS):
    for i, X in enumerate(trainSet):
        optimizer.zero_grad()
        output = net(X)
        target = ySet[i].reshape(-1, 1)
        loss = lossFunction(output, target)
        loss.backward()
        optimizer.step()
    if epoch % 20 == 0:
        print(loss)
What am I doing wrong? Thanks in advance for the help
Your model is underfitting. Increasing the number of epochs to (say) 3000 makes the model predict perfectly on the examples you showed.
However, after this many epochs the model may be overfitting. A good practice is to use validation data (separate the generated data into train and validation sets) and check the validation loss in each epoch. When the validation loss starts increasing, you are starting to overfit, and that is the point to stop training.
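For illustration, a minimal sketch of that practice against the batches from the question (the 80/20 split and the stop-on-first-increase rule are arbitrary choices for the example):

n_val = max(1, len(trainSet) // 5)                      # hold out ~20% of the batches
train_X, val_X = trainSet[:-n_val], trainSet[-n_val:]
train_y, val_y = ySet[:-n_val], ySet[-n_val:]

best_val = float("inf")
for epoch in range(EPOCHS):
    for X, y in zip(train_X, train_y):
        optimizer.zero_grad()
        loss = lossFunction(net(X), y.reshape(-1, 1))
        loss.backward()
        optimizer.step()
    with torch.no_grad():                               # validation pass, no gradients
        val_loss = sum(lossFunction(net(X), y.reshape(-1, 1))
                       for X, y in zip(val_X, val_y)) / n_val
    if val_loss < best_val:
        best_val = val_loss
    else:
        break                                           # validation loss rose: stop training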

GRU loss decreased up to 0.9 but not further, PyTorch

Here is the code I am using to experiment with a GRU:
import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import *

class N(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(5, 2)
        self.layers = 4
        self.gru = nn.GRU(2, 512, self.layers, batch_first=True)
        self.bat = nn.BatchNorm1d(4)
        self.bat1 = nn.BatchNorm1d(4)
        self.bat2 = nn.BatchNorm1d(4)
        self.fc = nn.Linear(512, 100)
        self.fc1 = nn.Linear(100, 100)
        self.fc2 = nn.Linear(100, 5)
        self.s = nn.Softmax(dim=-1)

    def forward(self, x):
        h0 = torch.zeros(self.layers, x.size(0), 512).requires_grad_()
        x = self.embed(x)
        x, hn = self.gru(x, h0)
        x = self.bat(x)
        x = self.fc(x)
        x = nn.functional.relu(x)
        x = self.bat1(x)
        x = self.fc1(x)
        x = nn.functional.relu(x)
        x = self.bat2(x)
        x = self.fc2(x)
        softmaxed = self.s(x)
        return softmaxed

inp = torch.tensor([[4,3,2,1],[2,3,4,1],[4,1,2,3],[1,2,3,4]])
out = torch.tensor([[3,2,1,4],[3,2,4,1],[1,2,3,4],[2,3,4,1]])
k = 0
n = N()
opt = torch.optim.Adam(n.parameters(), lr=0.0001)

while k < 10000:
    print(inp.shape)
    o = n(inp)
    o = o.view(-1, o.size(-1))
    out = out.view(-1)
    loss = nn.functional.cross_entropy(o.view(-1, o.size(-1)), out.view(-1) - 1)
    acc = ((torch.argmax(o, dim=1) == (out - 1)).sum().item() / out.size(0))
    if k == 10000:
        print(torch.argmax(o, dim=1))
        print(out - 1)
        exit()
    print(loss, acc)
    loss.backward()
    opt.step()
    opt.zero_grad()
    k += 1
print(o[0])
Truncated output:
torch.Size([4, 4])
tensor(0.9593, grad_fn=<NllLossBackward>) 0.9375
torch.Size([4, 4])
tensor(0.9593, grad_fn=<NllLossBackward>) 0.9375
tensor([4.8500e-01, 9.7813e-06, 5.1498e-01, 6.2428e-06, 7.5929e-06],
grad_fn=<SelectBackward>)
The loss is 0.9593 and accuracy reached up to 0.9375. For this simple input data, why is the GRU loss this big? Is there anything wrong with this code? I used cross_entropy as the loss function and Adam as the optimizer. The learning rate is 0.001. I tried multiple learning rates, but all gave the same final result. I added batch normalization; it sped up the training, but loss and accuracy stayed the same. Why does the loss not decrease to 0.2 or so?
I think it's because you are using the cross-entropy loss function, which in PyTorch combines log-softmax and negative log-likelihood. Since your model already performs a softmax before returning the output, you actually end up calculating the negative log-likelihood of the softmax of a softmax. Try removing the final softmax from your model.
PyTorch documentation for cross entropy loss: https://pytorch.org/docs/stable/nn.functional.html#cross-entropy
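Concretely, the fix is to let forward return the raw logits; a minimal sketch against the model above (the inference lines at the end reuse n and inp from the question, and are only needed if you actually want probabilities):

def forward(self, x):
    ...                     # everything up to the last linear layer stays the same
    x = self.fc2(x)
    return x                # raw logits: F.cross_entropy applies log-softmax itself

# at inference time, if probabilities are needed:
probs = torch.softmax(n(inp), dim=-1)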

How to solve a size mismatch error in Python (PyTorch)

I am currently working on multivariate linear regression using PyTorch and I am getting the following error. I searched a lot about this error, and the only thing I learned is that there is a size mismatch between the data and the labels. But how do I solve this error? Please help me, or show me the right way to solve this problem.
size mismatch, m1: [824 x 1], m2: [8 x 8]
import torch
import torch.nn as nn
import numpy as np

Xtr = np.loadtxt("TrainData.csv")
Ytr = np.loadtxt("TrainLabels.csv")
X_train = torch.FloatTensor(Xtr)
Y_train = torch.FloatTensor(Ytr)

#### MODEL ARCHITECTURE ####
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)
        self.lin2 = torch.nn.Linear(8, 1)

    def forward(self, x):
        x = self.lin2(x)
        y_pred = self.linear(x)
        return y_pred

model = Model()
loss_func = nn.MSELoss(size_average=False)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
#print(len(list(model.parameters())))

def count_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

### TRAINING
for epoch in range(2):
    y_pred = model(X_train)
    loss = loss_func(y_pred, Y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

count = count_params(model)
print(count)
test_exp = torch.FloatTensor([[6.0]])
It looks like the order of operations in your forward pass is incorrect. The short answer is to swap them, as shown below. More context on the various shapes follows.
def forward(self, x):
    x = self.lin2(x)
    y_pred = self.linear(x)
    return y_pred

Should be:

def forward(self, x):
    x = self.linear(x)
    y_pred = self.lin2(x)
    return y_pred
Assuming that you have 8 features and some batch size N, your input to the forward pass has shape (N x 8). After you pass it through lin2 it has shape (N x 1). The linear layer expects an input of shape (N x 8) but gets (N x 1), hence the error.
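A quick standalone way to see this (a sketch with dummy data, using the batch size from the error message):

import torch
import torch.nn as nn

x = torch.randn(824, 8)        # 824 samples, 8 features
linear = nn.Linear(8, 8)
lin2 = nn.Linear(8, 1)
print(lin2(x).shape)           # torch.Size([824, 1])
# linear(lin2(x))              # fails: size mismatch, m1: [824 x 1], m2: [8 x 8]
print(lin2(linear(x)).shape)   # torch.Size([824, 1]) -- the correct order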

If we combine a trainable parameter with a non-trainable parameter, is the original trainable param still trainable?

I have two nets and I combine their parameters in some fancy way using only PyTorch operations. I store the result in a third net which has its parameters set to non-trainable. Then I proceed to pass data through this new net. The new net is just a placeholder for:
placeholder_net.W = Op( not_trainable_net.W, trainable_net.W )
Then I pass data:
output = placeholder_net(input)
I am concerned that, since the parameters of the placeholder net are set to non-trainable, it won't actually train the variable that it should train. Will this happen? Or what is the result when you combine a trainable param with a non-trainable param (and then store it where the param is non-trainable)?
Current solution:
del net3.conv0.weight
net3.conv0.weight = net.conv0.weight + net2.conv0.weight
import torch
from torch import nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from collections import OrderedDict
import copy

def dont_train(net):
    '''
    Set training parameters to False.
    '''
    for param in net.parameters():
        param.requires_grad = False
    return net

def get_cifar10():
    transform = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)
    classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
    return trainloader, classes

def combine_nets(net_train, net_no_train, net_place_holder):
    '''
    Combine nets in a way the train net is trainable.
    '''
    params_train = net_train.named_parameters()
    dict_params_place_holder = dict(net_place_holder.named_parameters())
    dict_params_no_train = dict(net_no_train.named_parameters())
    for name, param_train in params_train:
        if name in dict_params_place_holder:
            layer_name, param_name = name.split('.')
            param_no_train = dict_params_no_train[name]
            ## get placeholder layer
            layer_place_holder = getattr(net_place_holder, layer_name)
            delattr(layer_place_holder, param_name)
            ## get new param
            W_new = param_train + param_no_train  # addition is just chosen for the sake of an example
            ## store param in placeholder net
            setattr(layer_place_holder, param_name, W_new)
    return net_place_holder

def combining_nets_lead_to_error():
    '''
    Intention is to only train the net with trainable params.
    The placeholder net is a dummy net; it doesn't actually do anything except hold the
    combination of params, and it's the net that does the forward pass on the data.
    '''
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    ''' create three musketeers '''
    net_train = nn.Sequential(OrderedDict([
        ('conv1', nn.Conv2d(1, 20, 5)),
        ('relu1', nn.ReLU()),
        ('conv2', nn.Conv2d(20, 64, 5)),
        ('relu2', nn.ReLU())
    ])).to(device)
    net_no_train = copy.deepcopy(net_train).to(device)
    net_place_holder = copy.deepcopy(net_train).to(device)
    ''' prepare train, hyperparams '''
    trainloader, classes = get_cifar10()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net_train.parameters(), lr=0.001, momentum=0.9)
    ''' train '''
    net_train.train()
    net_no_train.eval()
    net_place_holder.eval()
    for epoch in range(2):  # loop over the dataset multiple times
        running_loss = 0.0
        for i, (inputs, labels) in enumerate(trainloader, 0):
            optimizer.zero_grad()  # zero the parameter gradients
            inputs, labels = inputs.to(device), labels.to(device)
            # combine nets
            net_place_holder = combine_nets(net_train, net_no_train, net_place_holder)
            #
            outputs = net_place_holder(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            # print statistics
            running_loss += loss.item()
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print('[%d, %5d] loss: %.3f' %
                      (epoch + 1, i + 1, running_loss / 2000))
                running_loss = 0.0
    ''' DONE '''
    print('Done \a')

if __name__ == '__main__':
    combining_nets_lead_to_error()
First, do not use eval() mode for any network. Set the requires_grad flag to False to make the parameters non-trainable for the second network only, and train the placeholder network.
If this doesn't work, you can try the following approach which I prefer.
Instead of using multiple networks, you can use a single network and add a non-trainable layer as a parallel connection after every trainable layer, before the non-linearity.
Set the requires_grad flag to False to make the parallel layer's parameters non-trainable; do not use eval(), and train the network.
Combining the outputs of the layers before the non-linearity is important. Initialize the parameters of the parallel layer, and choose the post-operation such that it gives the same result as when you combine the parameters directly.
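A minimal sketch of that parallel-connection idea (the class name, nn.Linear, and addition as the combining op are my illustrative choices; addition mirrors the example op from the question):

import torch
from torch import nn

class ParallelBlock(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.trainable = nn.Linear(in_features, out_features)
        self.frozen = nn.Linear(in_features, out_features)
        for p in self.frozen.parameters():
            p.requires_grad = False        # non-trainable branch; no eval() needed

    def forward(self, x):
        # combine the branch outputs *before* the non-linearity; for linear layers,
        # adding the outputs is the same as adding the weights and biases
        return torch.relu(self.trainable(x) + self.frozen(x))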
I'm not sure if this is what you want to know.
But if I understand you correctly, you want to know whether the result of an operation between a non-trainable and a trainable variable is still trainable?
If so, this is indeed the case; here is an example:
>>> trainable = torch.ones(1, requires_grad=True)
>>> non_trainable = torch.ones(1, requires_grad=False)
>>> result = trainable + non_trainable
>>> result.requires_grad
True
You might also find torch.set_grad_enabled useful, with some examples given in the PyTorch Migration Guide for version 0.4.0:
https://pytorch.org/2018/04/22/0_4_0-migration-guide.html
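For example, in the same interactive style as above:

>>> x = torch.ones(1, requires_grad=True)
>>> with torch.set_grad_enabled(False):
...     y = x * 2
>>> y.requires_grad
False
>>> with torch.set_grad_enabled(True):
...     z = x * 2
>>> z.requires_grad
True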
The original answer is above; below I address the code you've since added.
In your combine_nets function you needlessly try to remove and set the attribute, when you can simply copy the required value, like this:
def combine_nets(net_train, net_no_train, net_place_holder):
    '''
    Combine nets in a way train net is trainable
    '''
    params_train = net_no_train.named_parameters()
    dict_params_place_holder = dict(net_place_holder.named_parameters())
    dict_params_no_train = dict(net_train.named_parameters())
    for name, param_train in params_train:
        if name in dict_params_place_holder:
            param_no_train = dict_params_no_train[name]
            W_new = param_train + param_no_train
            dict_params_no_train[name].data.copy_(W_new.data)
    return net_place_holder
Due to other errors in the code you supplied, I couldn't make it run without further changes, so I attach below an updated version of the code I gave you earlier:
import torch
from torch import nn
from torch.autograd import Variable
import torch.optim as optim

# toy feed-forward net
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(10, 5)
        self.fc2 = nn.Linear(5, 5)
        self.fc3 = nn.Linear(5, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        return x

def combine_nets(net_train, net_no_train, net_place_holder):
    '''
    Combine nets in a way train net is trainable
    '''
    params_train = net_no_train.named_parameters()
    dict_params_place_holder = dict(net_place_holder.named_parameters())
    dict_params_no_train = dict(net_train.named_parameters())
    for name, param_train in params_train:
        if name in dict_params_place_holder:
            param_no_train = dict_params_no_train[name]
            W_new = param_train + param_no_train
            dict_params_no_train[name].data.copy_(W_new.data)
    return net_place_holder

# define random data
random_input1 = Variable(torch.randn(10,))
random_target1 = Variable(torch.randn(1,))
random_input2 = Variable(torch.rand(10,))
random_target2 = Variable(torch.rand(1,))
random_input3 = Variable(torch.randn(10,))
random_target3 = Variable(torch.randn(1,))

# define nets
net1 = Net()
net_place_holder = Net()
net2 = Net()

# train net1
criterion = nn.MSELoss()
optimizer = optim.SGD(net1.parameters(), lr=0.1)
for i in range(100):
    net1.zero_grad()
    output = net1(random_input1)
    loss = criterion(output, random_target1)
    loss.backward()
    optimizer.step()

# train net2
criterion = nn.MSELoss()
optimizer = optim.SGD(net2.parameters(), lr=0.1)
for i in range(100):
    net2.zero_grad()
    output = net2(random_input2)
    loss = criterion(output, random_target2)
    loss.backward()
    optimizer.step()

# train the placeholder net
criterion = nn.MSELoss()
optimizer = optim.SGD(net_place_holder.parameters(), lr=0.1)
for i in range(100):
    net_place_holder.zero_grad()
    output = net_place_holder(random_input3)
    loss = criterion(output, random_target3)
    loss.backward()
    optimizer.step()

print('#'*50)
print('Weights before combining')
print('')
print('net1 fc3 weight after train:')
print(net1.fc3.weight)
print('net2 fc3 weight after train:')
print(net2.fc3.weight)

combine_nets(net1, net2, net_place_holder)

print('#'*50)
print('')
print('Weights after combining')
print('net1 fc3 weight after train:')
print(net1.fc3.weight)
print('net2 fc3 weight after train:')
print(net2.fc3.weight)

# train net1 and net2 further, jointly
criterion = nn.MSELoss()
optimizer1 = optim.SGD(net1.parameters(), lr=0.1)
for i in range(100):
    net1.zero_grad()
    net2.zero_grad()
    output1 = net1(random_input3)
    output2 = net2(random_input3)
    loss1 = criterion(output1, random_target3)
    loss2 = criterion(output2, random_target3)
    loss = loss1 + loss2
    loss.backward()
    optimizer1.step()

print('#'*50)
print('Weights after further training')
print('')
print('net1 fc3 weight after freeze:')
print(net1.fc3.weight)
print('net2 fc3 weight after freeze:')
print(net2.fc3.weight)

PyTorch loss value not changing

I wrote a module based on this article: http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
The idea is to pass the input through multiple streams, then concatenate them and connect to an FC layer. I divided my source code into 3 custom modules: TextClassifyCnnNet >> FlatCnnLayer >> FilterLayer
FilterLayer:
import math
import torch
import torch.nn as nn

class FilterLayer(nn.Module):
    def __init__(self, filter_size, embedding_size, sequence_length, out_channels=128):
        super(FilterLayer, self).__init__()
        self.model = nn.Sequential(
            nn.Conv2d(1, out_channels, (filter_size, embedding_size)),
            nn.ReLU(inplace=True),
            nn.MaxPool2d((sequence_length - filter_size + 1, 1), stride=1)
        )
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))

    def forward(self, x):
        return self.model(x)
FlatCnnLayer:
class FlatCnnLayer(nn.Module):
    def __init__(self, embedding_size, sequence_length, filter_sizes=[3, 4, 5], out_channels=128):
        super(FlatCnnLayer, self).__init__()
        self.filter_layers = nn.ModuleList(
            [FilterLayer(filter_size, embedding_size, sequence_length, out_channels=out_channels) for
             filter_size in filter_sizes])

    def forward(self, x):
        pools = []
        for filter_layer in self.filter_layers:
            out_filter = filter_layer(x)
            # reshape from (batch_size, out_channels, h, w) to (batch_size, h, w, out_channels)
            pools.append(out_filter.view(out_filter.size()[0], 1, 1, -1))
        x = torch.cat(pools, dim=3)
        x = x.view(x.size()[0], -1)
        x = F.dropout(x, p=dropout_prob, training=True)
        return x
TextClassifyCnnNet (main module):
class TextClassifyCnnNet(nn.Module):
    def __init__(self, embedding_size, sequence_length, num_classes, filter_sizes=[3, 4, 5], out_channels=128):
        super(TextClassifyCnnNet, self).__init__()
        self.flat_layer = FlatCnnLayer(embedding_size, sequence_length, filter_sizes=filter_sizes,
                                       out_channels=out_channels)
        self.model = nn.Sequential(
            self.flat_layer,
            nn.Linear(out_channels * len(filter_sizes), num_classes)
        )

    def forward(self, x):
        x = self.model(x)
        return x
def fit(net, data, save_path):
    if torch.cuda.is_available():
        net = net.cuda()
    for param in list(net.parameters()):
        print(type(param.data), param.size())
    optimizer = optim.Adam(net.parameters(), lr=0.01, weight_decay=0.1)
    X_train, X_test = data['X_train'], data['X_test']
    Y_train, Y_test = data['Y_train'], data['Y_test']
    X_valid, Y_valid = data['X_valid'], data['Y_valid']
    n_batch = len(X_train) // batch_size
    for epoch in range(1, n_epochs + 1):  # loop over the dataset multiple times
        net.train()
        start = 0
        end = batch_size
        for batch_idx in range(1, n_batch + 1):
            # get the inputs
            x, y = X_train[start:end], Y_train[start:end]
            start = end
            end = start + batch_size
            # zero the parameter gradients
            optimizer.zero_grad()
            # forward + backward + optimize
            predicts = _get_predict(net, x)
            loss = _get_loss(predicts, y)
            loss.backward()
            optimizer.step()
            if batch_idx % display_step == 0:
                print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                    epoch, batch_idx * len(x), len(X_train), 100. * batch_idx / (n_batch + 1), loss.data[0]))
        # print statistics
        if epoch % display_step == 0 or epoch == 1:
            net.eval()
            valid_predicts = _get_predict(net, X_valid)
            valid_loss = _get_loss(valid_predicts, Y_valid)
            valid_accuracy = _get_accuracy(valid_predicts, Y_valid)
            print('\r[%d] loss: %.3f - accuracy: %.2f' % (epoch, valid_loss.data[0], valid_accuracy * 100))
    print('\rFinished Training\n')
    net.eval()
    test_predicts = _get_predict(net, X_test)
    test_loss = _get_loss(test_predicts, Y_test).data[0]
    test_accuracy = _get_accuracy(test_predicts, Y_test)
    print('Test loss: %.3f - Test accuracy: %.2f' % (test_loss, test_accuracy * 100))
    torch.save(net.flat_layer.state_dict(), save_path)

def _get_accuracy(predicts, labels):
    predicts = torch.max(predicts, 1)[1].data[0]
    return np.mean(predicts == labels)

def _get_predict(net, x):
    # wrap them in Variable
    inputs = torch.from_numpy(x).float()
    # convert to cuda tensors if cuda flag is true
    if torch.cuda.is_available:
        inputs = inputs.cuda()
    inputs = Variable(inputs)
    return net(inputs)

def _get_loss(predicts, labels):
    labels = torch.from_numpy(labels).long()
    # convert to cuda tensors if cuda flag is true
    if torch.cuda.is_available:
        labels = labels.cuda()
    labels = Variable(labels)
    return F.cross_entropy(predicts, labels)
It seems that the parameters are updated only slightly each epoch, and the accuracy stays the same throughout the whole process, while the same implementation with the same params runs correctly in TensorFlow.
I'm new to PyTorch, so maybe my code has something wrong; please help me find it. Thank you!
P.S.: I tried F.nll_loss + F.log_softmax instead of F.cross_entropy. Theoretically it should return the same result, but in fact a different result is printed out (though it is still a wrong loss value).
I see that in your original code the weight_decay term is set to 0.1. weight_decay is used to regularize the network's parameters; this term may be too strong, so that the regularization is too much. Try reducing the value of weight_decay.
For convolutional neural networks in computer vision tasks, the weight_decay term is usually set to 5e-4 or 5e-5. I am not familiar with text classification; these values may work for you out of the box, or you may have to tweak them a little by trial and error.
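For instance, applied to the optimizer from the question (5e-4 is just the starting value suggested above):

optimizer = optim.Adam(net.parameters(), lr=0.01, weight_decay=5e-4)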
Let me know if it works for you.
I realised that the L2 loss in the Adam optimizer made the loss value remain unchanged (I haven't tried other optimizers yet). It works when I remove the L2 loss:
# optimizer = optim.Adam(net.parameters(), lr=0.01, weight_decay=0.1)
optimizer = optim.Adam(model.parameters(), lr=0.001)
=== UPDATE (See above answer for more detail!) ===
self.features = nn.Sequential(self.flat_layer)
self.classifier = nn.Linear(out_channels * len(filter_sizes), num_classes)
...
optimizer = optim.Adam([
    {'params': model.features.parameters()},
    {'params': model.classifier.parameters(), 'weight_decay': 0.1}
], lr=0.001)
In my case, I was facing the same problem. On my laptop without a GPU the training was fine. When I tried it on a GPU, the model's accuracy and loss did not change after the first epochs. I was using nn.CrossEntropyLoss() with Adam.
Changing Adam to SGD worked for me.
I am sharing this in case anyone else suffers from the same issue.
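For reference, that swap is a one-line change (the momentum value here is a common choice, not taken from the post):

# optimizer = optim.Adam(net.parameters(), lr=0.001)
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)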
