Why is my simple feedforward neural network diverging (PyTorch)?

I am experimenting with a simple 2-layer neural network in PyTorch, feeding in only three inputs of size 10 each, with a single value as output. I have normalised the inputs and lowered the learning rate. It is my understanding that a two-layer fully connected neural network should be able to fit this data trivially.
Features:
0.8138 1.2342 0.4419 0.8273 0.0728 2.4576 0.3800 0.0512 0.6872 0.5201
1.5666 1.3955 1.0436 0.1602 0.1688 0.2074 0.8810 0.9155 0.9641 1.3668
1.7091 0.9091 0.5058 0.6149 0.3669 0.1365 0.3442 0.9482 1.2550 1.6950
[torch.FloatTensor of size 3x10]
Targets
[124, 125, 122]
[torch.FloatTensor of size 3]
The code is adapted from a simple example and I am using MSELoss as the loss function. The loss diverges to infinity after just a few iterations:
features = torch.from_numpy(np.array(features))
x_data = Variable(torch.Tensor(features))
y_data = Variable(torch.Tensor(targets))
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear = torch.nn.Linear(10, 5)
        self.linear2 = torch.nn.Linear(5, 1)

    def forward(self, x):
        l_out1 = self.linear(x)
        y_pred = self.linear2(l_out1)
        return y_pred

model = Model()
criterion = torch.nn.MSELoss(size_average=False)
optim = torch.optim.SGD(model.parameters(), lr=0.001)

def main():
    for iteration in range(1000):
        y_pred = model(x_data)
        loss = criterion(y_pred, y_data)
        print(iteration, loss.data[0])
        optim.zero_grad()
        loss.backward()
        optim.step()

main()
Any help would be appreciated. Thanks
EDIT:
Indeed it seems that this was simply due to the learning rate being too high. Setting it to 0.00001 fixes the divergence, albeit with very slow convergence.
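For context, criterion = torch.nn.MSELoss(size_average=False) sums the squared errors rather than averaging them, and the targets are around 124, so the initial loss and gradients are very large; that is why such a tiny learning rate is needed. An illustrative tweak (not part of the original code) is to rescale the targets and use the default mean reduction so a larger learning rate stays stable:

y_data = y_data / 100.0         # targets become ~1.24 instead of ~124
criterion = torch.nn.MSELoss()  # default mean reduction instead of summing large errors
# after training, multiply model predictions by 100 to recover the original target scale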

This is because you're not using a non-linearity between the layers, so your network is still linear.
You can add a ReLU to make it non-linear, for example by changing the forward method like this:
...
l_out1 = torch.nn.functional.relu(self.linear(x))
y_pred = self.linear2(l_out1)
...

You could also try predicting log(y) instead of y to improve convergence further. The Adam optimizer (adaptive learning rate) should help as well, as should batch normalization (for example between your linear layers).
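Putting the answer's suggestions together, a minimal sketch of the revised model and training setup might look like the following (the ReLU placement, the choice of Adam, and the learning rate are illustrative, not from the original post):

import torch

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear = torch.nn.Linear(10, 5)
        self.linear2 = torch.nn.Linear(5, 1)

    def forward(self, x):
        h = torch.relu(self.linear(x))   # non-linearity between the two layers
        return self.linear2(h)

model = Model()
criterion = torch.nn.MSELoss()                          # mean reduction keeps gradients smaller
optim = torch.optim.Adam(model.parameters(), lr=0.01)   # illustrative learning rate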

Related

neural network trained with PyTorch outputs the mean value for every input

I am using PyTorch in order to get my neural network to recognize digits from the MNIST database.
import torch
import torchvision
I'd like to implement a very simple design similar to what is shown in 3Blue1Brown's video series about neural networks. The following design in particular achieved an error rate of 1.6%.
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layer1 = torch.nn.Linear(784, 800)
        self.layer2 = torch.nn.Linear(800, 10)

    def forward(self, x):
        x = torch.sigmoid(self.layer1(x))
        x = torch.sigmoid(self.layer2(x))
        return x
The data is gathered using torchvision and organised in mini batches containing 32 images each.
batch_size = 32
training_set = torchvision.datasets.MNIST("./", download=True, transform=torchvision.transforms.ToTensor())
training_loader = torch.utils.data.DataLoader(training_set, batch_size=32)
I am using the mean squared error as the loss function and stochastic gradient descent with a learning rate of 0.001 as my optimization algorithm.
net = Net()
loss_function = torch.nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.001)
Finally the network gets trained and saved using the following code:
for images, labels in training_loader:
    optimizer.zero_grad()
    for i in range(batch_size):
        output = net(torch.flatten(images[i]))
        desired_output = torch.tensor([float(j == labels[i]) for j in range(10)])
        loss = loss_function(output, desired_output)
        loss.backward()
    optimizer.step()
torch.save(net.state_dict(), "./trained_net.pth")
However, here are the outputs of some test images:
tensor([0.0978, 0.1225, 0.1018, 0.0961, 0.1022, 0.0885, 0.1007, 0.1077, 0.0994,
0.1081], grad_fn=<SigmoidBackward>)
tensor([0.0978, 0.1180, 0.1001, 0.0929, 0.1006, 0.0893, 0.1010, 0.1051, 0.0978,
0.1067], grad_fn=<SigmoidBackward>)
tensor([0.0981, 0.1227, 0.1018, 0.0970, 0.0979, 0.0908, 0.1001, 0.1092, 0.1011,
0.1088], grad_fn=<SigmoidBackward>)
tensor([0.1061, 0.1149, 0.1037, 0.1001, 0.0957, 0.0919, 0.1044, 0.1022, 0.0997,
0.1052], grad_fn=<SigmoidBackward>)
tensor([0.0996, 0.1137, 0.1005, 0.0947, 0.0977, 0.0916, 0.1048, 0.1109, 0.1013,
0.1085], grad_fn=<SigmoidBackward>)
tensor([0.1008, 0.1154, 0.0986, 0.0996, 0.1031, 0.0952, 0.0995, 0.1063, 0.0982,
0.1094], grad_fn=<SigmoidBackward>)
tensor([0.0972, 0.1235, 0.1013, 0.0984, 0.0974, 0.0907, 0.1032, 0.1075, 0.1001,
0.1080], grad_fn=<SigmoidBackward>)
tensor([0.0929, 0.1258, 0.1016, 0.0978, 0.1006, 0.0889, 0.1001, 0.1068, 0.0986,
0.1024], grad_fn=<SigmoidBackward>)
tensor([0.0982, 0.1207, 0.1040, 0.0990, 0.0999, 0.0910, 0.0980, 0.1051, 0.1039,
0.1078], grad_fn=<SigmoidBackward>)
As you can see, the network seems to approach a state where the answer for every input is:
[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
This neural network is no better than just guessing. Where did I go wrong in my design or code?
Here are a few points that would be useful for you:
At first glance your model is not learning, since your predictions are as good as a random guess. The first step would be to monitor your loss; here you only train for a single epoch. At the very least you should evaluate your model on unseen data:
validation_set = torchvision.datasets.MNIST(
    './', download=True, train=False, transform=torchvision.transforms.ToTensor())
validation_loader = torch.utils.data.DataLoader(validation_set, batch_size=32)
You are using an MSE loss (the L2 norm) to train a classification task, which is not the right tool for this kind of problem. You should instead use the negative log-likelihood. PyTorch offers nn.CrossEntropyLoss, which combines a log-softmax and the negative log-likelihood loss in a single module. This change can be implemented with:
loss_function = torch.nn.CrossEntropyLoss()
and the right target shapes when calling loss_function (see below). Since the loss function applies a log-softmax itself, you shouldn't have an activation function on your model's output.
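For illustration, torch.nn.CrossEntropyLoss takes raw logits of shape (batch, num_classes) together with integer class indices of shape (batch,), so no one-hot targets and no final softmax or sigmoid are needed:

loss_function = torch.nn.CrossEntropyLoss()
logits = torch.randn(32, 10)          # raw model outputs for a batch of 32
labels = torch.randint(0, 10, (32,))  # integer class indices, not one-hot vectors
loss = loss_function(logits, labels)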
You are using sigmoid as the activation function; ReLU works better for the intermediate non-linearity (see the related post above). A sigmoid is better suited to a binary classification task. And again, since we are using nn.CrossEntropyLoss, the activation after layer2 has to be removed.
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.flatten = torch.nn.Flatten()
        self.layer1 = torch.nn.Linear(784, 800)
        self.layer2 = torch.nn.Linear(800, 10)

    def forward(self, x):
        x = self.flatten(x)
        x = torch.relu(self.layer1(x))
        x = self.layer2(x)
        return x
A less crucial point: you could run inference on a whole batch at once instead of looping through the batch one element at a time. A typical training loop for one epoch would look like:
for images, labels in training_loader:
    optimizer.zero_grad()
    output = net(images)
    loss = loss_function(output, labels)
    loss.backward()
    optimizer.step()
With these kinds of modifications, you can expect a validation accuracy of around 80% after a single epoch.
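For completeness, a minimal sketch of how that validation accuracy could be measured, reusing the names from the snippets above (this evaluation loop is illustrative, not part of the original answer):

net.eval()
correct, total = 0, 0
with torch.no_grad():
    for images, labels in validation_loader:
        logits = net(images)                # raw scores, shape (batch, 10)
        predictions = logits.argmax(dim=1)  # predicted class per image
        correct += (predictions == labels).sum().item()
        total += labels.size(0)
print('validation accuracy:', correct / total)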

ValueError: No gradients provided for any variable when calculating loss

I have been trying to implement the training step for a DQN described in this paper on various RL methods using TensorFlow, but when I try to compute the gradient using a GradientTape I get ValueError: No gradients provided for any variable. Below is the training step code:
def train_step(model, target, optimizer, observations, actions, rewards, next_observations):
    with tf.GradientTape() as tape:
        target_logits = tf.math.reduce_max(target(np.expand_dims(next_observations, -1)), 1)
        logits = model(np.expand_dims(observations, -1))
        act_logits = np.ndarray(EXPERIENCE_SAMPLE_SIZE)
        for i in range(EXPERIENCE_SAMPLE_SIZE):
            act_logits[i] = logits[i][actions[i]]
        act_logits = tf.convert_to_tensor(act_logits, dtype=tf.float32)
        y_T = tf.math.add(tf.convert_to_tensor(rewards, dtype=tf.float32), tf.math.scalar_mul(DISCOUNT_RATE, target_logits))
        loss = tf.math.squared_difference(act_logits, y_T)
        loss = tf.math.scalar_mul(1.0 / EXPERIENCE_SAMPLE_SIZE, loss)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
Where model and target are tf.keras.Sequential models that output the expected value of taking each of 5 possible actions, optimizer is SGD, and observations, actions, rewards, and next_observations are NumPy arrays sampled from an experience replay buffer.
This is part of implementing the DQN training-step pseudocode from the aforementioned paper (the figure is not reproduced here).
My best guess is that this error occurs because indexing logits through a NumPy array makes the result impossible to differentiate, but I don't know how else to calculate the Q*(s, a, theta) quantity.
Adding the solution in the answer section for the benefit of the community.
From the comments:
The problem is resolved by replacing the code:
act_logits = np.ndarray(EXPERIENCE_SAMPLE_SIZE)
for i in range(EXPERIENCE_SAMPLE_SIZE):
    act_logits[i] = logits[i][actions[i]]
with the code:
act_logits = tf.math.reduce_max(tf.math.multiply(act_logits, logits), 1)
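For context, the idea behind this fix is to keep the selection of the taken action's Q-value inside TensorFlow ops so gradients can flow through it. A common pattern, sketched here under the assumption that actions holds integer action indices and logits has shape (batch, num_actions), is to one-hot encode the actions:

num_actions = logits.shape[1]
action_mask = tf.one_hot(actions, num_actions, dtype=tf.float32)  # shape (batch, num_actions)
act_logits = tf.reduce_sum(action_mask * logits, axis=1)          # Q-value of each taken action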

Is normalization necessary for regression problem in Neural network

I am learning how to build a neural network using PyTorch.
This formula is the target of my code:
y = 2*X^3 + 7*X^2 - 8*X + 120
It is a regression problem.
I chose this formula because it is simple and the output can be calculated exactly, so I can check whether my neural network is able to predict the output from the given input.
However, I met some problem during training.
The problem occurs in this line of code:
loss = loss_func(prediction, outputs)
The loss computed in this line is NaN (not a number).
I am using MSELoss as the loss function. 100 samples are used to train the ANN model, and the input X_train ranges from -1000 to 1000.
I believe the problem is due to the scale of X_train together with MSELoss: X_train should be scaled to values between 0 and 1 so that MSELoss can compute a reasonable loss.
However, is it possible to train the ANN model without scaling the input to values between 0 and 1 in a regression problem?
Here is my code; it does not use MinMaxScaler and it prints NaN for the loss:
import torch
import torch.nn as nn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch.nn.functional as F
from torch.autograd import Variable
#Load datasets
dataset = pd.read_csv('test_100.csv')
x_temp_train = dataset.iloc[:79, :-1].values
y_temp_train = dataset.iloc[:79, -1:].values
x_temp_test = dataset.iloc[80:, :-1].values
y_temp_test = dataset.iloc[80:, -1:].values
#Turn into tensor
X_train = torch.FloatTensor(x_temp_train)
Y_train = torch.FloatTensor(y_temp_train)
X_test = torch.FloatTensor(x_temp_test)
Y_test = torch.FloatTensor(y_temp_test)
#Define a Artifical Neural Network
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.linear = nn.Linear(1, 1)  #input=1, output=1, bias=True

    def forward(self, x):
        x = self.linear(x)
        return x
net = Net()
print(net)
#Define a Loss function and optimizer
optimizer = torch.optim.SGD(net.parameters(), lr=0.2)
loss_func = torch.nn.MSELoss()
#Training
inputs = Variable(X_train)
outputs = Variable(Y_train)
for i in range(100):  #epoch=100
    prediction = net(inputs)
    loss = loss_func(prediction, outputs)
    optimizer.zero_grad()  #zero the parameter gradients
    loss.backward()        #compute gradients (dloss/dx)
    optimizer.step()       #update the parameters
    if i % 10 == 9:        #plot every 10 epochs
        #plot and show learning process
        plt.cla()
        plt.scatter(X_train.data.numpy(), Y_train.data.numpy())
        plt.plot(X_train.data.numpy(), prediction.data.numpy(), 'r-', lw=2)
        plt.text(0.5, 0, 'Loss=%.4f' % loss.data.numpy(), fontdict={'size': 10, 'color': 'red'})
        plt.pause(0.1)
plt.show()
Thanks for your time.
Is normalization necessary for regression problem in Neural Network?
No.
But...
I can tell you that MSELoss works with non-normalised values. You can tell because:
>>> import torch
>>> torch.nn.MSELoss()(torch.randn(1)-1000, torch.randn(1)+1000)
tensor(4002393.)
MSE is a very well-behaved loss function, and you can't really get NaN without giving it a NaN. I would bet that your model is giving a NaN output.
The two most common causes of a NaN are: an accidental divide by 0, and absurdly large weights/gradients.
I ran a variant of your code on my machine using:
x = torch.randn(79, 1)*1000
y = 2*x**3 + 7*x**2 - 8*x + 120
And it got to NaN in about 20 training steps due to absurdly large weights.
A model can get absurdly large weights if the learning rate is too large. You may think 0.2 is not too large, but that's a typical learning rate people use for normalised data, which forces their gradients to be fairly small. Since you are not using normalised data, let's calculate how large your gradients are (roughly).
First, your x is on the order of 1e3 and your expected output y scales as x^3, so it is on the order of 1e9. MSE then computes (pred - y)^2, which puts your loss on the order of (1e9)^2 = 1e18. This propagates to your gradients, and recall that each weight update is weight -= learning_rate * gradient, so it's easy to see why your weights quickly explode outside of float precision.
How do you fix this? You could use a learning rate on the order of 2e-7, or you could just normalise your data. I recommend normalising your data; it has other nice properties for training and avoids this kind of problem.
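A minimal sketch of that normalisation, standardising both inputs and targets with their own mean and standard deviation (the same statistics would be used to un-scale predictions afterwards):

import torch

x = torch.randn(79, 1) * 1000
y = 2*x**3 + 7*x**2 - 8*x + 120

# standardise inputs and targets to roughly zero mean, unit variance
x_mean, x_std = x.mean(), x.std()
y_mean, y_std = y.mean(), y.std()
x_norm = (x - x_mean) / x_std
y_norm = (y - y_mean) / y_std

# train on (x_norm, y_norm) with an ordinary learning rate, then un-scale predictions:
# y_pred = net(x_norm) * y_std + y_mean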

PyTorch - custom loss function - calculations outside graph

I'm kind of new to PyTorch and trying to wrap my head around it.
I've read about custom loss functions and, as far as I've seen, they cannot be decoupled from the internal computational graph. That means the loss function consumes tensors, does operations on them that are implemented in PyTorch, and outputs a tensor. Is there any way to have a decoupled loss calculation and plug it back in somehow?
Use case
I'm trying to train an encoder whose latent space will be optimised for some statistical quality. This means I don't train in batches; I calculate a single loss value for the whole epoch and the whole data set. Is it even possible to teach a net that way?
class Encoder(nn.Module):
    def __init__(self, genome_size: int):
        super(Encoder, self).__init__()
        self.fc1 = nn.Linear(genome_size, genome_size)
        self.fc2 = nn.Linear(genome_size, genome_size)
        self.fc3 = nn.Linear(genome_size, genome_size)
        self.genome_size = genome_size

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        return x
def train_encoder(
    net: nn.Module,
    optimizer: Optimizer,
    epochs: int,
    population: Tensor,
    fitness: Tensor,
):
    running_loss = 0.0
    for epoch in range(epochs):
        optimizer.zero_grad()
        outputs = net(population)
        # encoder_loss is computationally heavy and cannot be done only on tensors
        # I need to unwrap those tensors to numpy arrays and use them as an input to another model
        loss = encoder_loss(outputs, fitness)
        running_loss += loss
        running_loss.backward()
        optimizer.step()
        print('Encoder loss:', loss)
I've seen some examples with an accumulated running_loss, but my encoder is unable to learn anything; the convergence plot just jumps all over the place.
Thanks for your time <3
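As an illustration of the constraint described above: a computation that leaves the graph (NumPy, another model) can only be plugged back in if you supply its gradient yourself, for example through torch.autograd.Function. This is a sketch of that mechanism, not an answer from the original thread; external_metric and external_metric_grad are hypothetical helpers:

import torch

class DecoupledLoss(torch.autograd.Function):
    # Illustrative only: the forward pass may leave the autograd graph,
    # but then the gradient w.r.t. the encoder outputs must be supplied by hand.
    @staticmethod
    def forward(ctx, outputs):
        ctx.save_for_backward(outputs)
        value = external_metric(outputs.detach().cpu().numpy())      # hypothetical external computation
        return outputs.new_tensor(value)

    @staticmethod
    def backward(ctx, grad_output):
        (outputs,) = ctx.saved_tensors
        grad = external_metric_grad(outputs.detach().cpu().numpy())  # hypothetical hand-derived gradient
        return grad_output * outputs.new_tensor(grad)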

Neural Network predicting the same probability for each row

I've developed a 5-layer neural network for classification and it always predicts the same probability for each row, which ends up predicting the same class.
I'm using ReLU as the activation function (if I use sigmoid or tanh it outputs NaNs) and tf.nn.sigmoid_cross_entropy_with_logits.
The model is this:
X, Y = create_placeholders(n_x, n_y)
parameters = initialize_parameters()
Z5 = forward_propagation(X, parameters)
cost = cost_function(Z5, Y)
optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)
And the prediction:
predicted = tf.nn.sigmoid(Z5)
correct_pred = tf.equal(tf.round(predicted), Y)
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
The cost value drops to 0.24 in the first 100 epochs, but the predicted values don't change.
At first I thought it was due to a class imbalance problem, but I fixed that (upsampling and downsampling the 1s and 0s), so that should not be the problem.
Thanks in advance
