I am learning how to build a neural network using PyTorch.
The target function my code tries to learn is:
y = 2*X^3 + 7*X^2 - 8*X + 120
It is a regression problem.
I chose it because it is simple and the output can be computed exactly, so I can check whether my neural network predicts the correct output for a given input.
However, I ran into a problem during training.
The problem occurs in this line of code:
loss = loss_func(prediction, outputs)
The loss computed in this line is NaN (not a number).
I am using MSELoss as the loss function. 100 data points are used for training the ANN model. The input X_train ranges from -1000 to 1000.
I believe the problem is caused by the values of X_train combined with MSELoss, and that X_train should be scaled to values between 0 and 1 so that MSELoss can compute the loss.
However, is it possible to train the ANN model for a regression problem without scaling the input to values between 0 and 1?
Here is my code. It does not use MinMaxScaler, and it prints a loss of NaN:
import torch
import torch.nn as nn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch.nn.functional as F
from torch.autograd import Variable
#Load datasets
dataset = pd.read_csv('test_100.csv')
x_temp_train = dataset.iloc[:79, :-1].values
y_temp_train = dataset.iloc[:79, -1:].values
x_temp_test = dataset.iloc[80:, :-1].values
y_temp_test = dataset.iloc[80:, -1:].values
#Turn into tensor
X_train = torch.FloatTensor(x_temp_train)
Y_train = torch.FloatTensor(y_temp_train)
X_test = torch.FloatTensor(x_temp_test)
Y_test = torch.FloatTensor(y_temp_test)
#Define an Artificial Neural Network
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.linear = nn.Linear(1,1) #input=1, output=1, bias=True
    def forward(self, x):
        x = self.linear(x)
        return x
net = Net()
print(net)
#Define a Loss function and optimizer
optimizer = torch.optim.SGD(net.parameters(), lr=0.2)
loss_func = torch.nn.MSELoss()
#Training
inputs = Variable(X_train)
outputs = Variable(Y_train)
for i in range(100): #epoch=100
    prediction = net(inputs)
    loss = loss_func(prediction, outputs)
    optimizer.zero_grad() #zero the parameter gradients
    loss.backward() #compute gradients of the loss w.r.t. the parameters
    optimizer.step() #updates the parameters
    if i % 10 == 9: #plot every 10 epochs
        #plot and show learning process
        plt.cla()
        plt.scatter(X_train.data.numpy(), Y_train.data.numpy())
        plt.plot(X_train.data.numpy(), prediction.data.numpy(), 'r-', lw=2)
        plt.text(0.5, 0, 'Loss=%.4f' % loss.data.numpy(), fontdict={'size': 10, 'color': 'red'})
        plt.pause(0.1)
plt.show()
Thanks for your time.
Is normalization necessary for regression problem in Neural Network?
No.
But...
I can tell you that MSELoss works with non-normalised values. You can tell because:
>>> import torch
>>> torch.nn.MSELoss()(torch.randn(1)-1000, torch.randn(1)+1000)
tensor(4002393.)
MSE is a very well-behaved loss function, and you can't really get NaN without giving it a NaN. I would bet that your model is giving a NaN output.
The two most common causes of a NaN are: an accidental divide by 0, and absurdly large weights/gradients.
I ran a variant of your code on my machine using:
x = torch.randn(79, 1)*1000
y = 2*x**3 + 7*x**2 - 8*x + 120
And it got to NaN in about 20 training steps due to absurdly large weights.
A model can get absurdly large weights if the learning rate is too large. You may think 0.2 is not too large, but that's a typical learning rate people use for normalised data, which forces their gradients to be fairly small. Since you are not using normalised data, let's calculate how large your gradients are (roughly).
First, your x is on the order of 1e3, and your expected output y scales as x^3, so y is on the order of 1e9. MSE then calculates (pred - y)^2, which puts your loss on the scale of (1e3^3)^2 = 1e18. This propagates to your gradients, and recall that weight updates are -= gradient*learning_rate, so it's easy to see why your weights fairly quickly explode outside of float precision.
How to fix this? Well you could use a learning rate of 2e-7. Or you could just normalise your data. I recommend normalising your data; it has other nice properties for training and avoids these kinds of problems.
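As a rough illustration of the point above, here is a minimal sketch that trains the same one-parameter linear model twice with lr=0.2: once on raw data and once on standardised data. The helper name `train`, the seed, and the choice of zero-mean/unit-variance scaling (instead of MinMaxScaler) are assumptions made for this example, not part of the original code.

```python
import math
import torch

torch.manual_seed(0)

# Synthetic data mirroring the variant above: x on the order of 1e3.
x = torch.randn(79, 1) * 1000
y = 2 * x**3 + 7 * x**2 - 8 * x + 120

def train(x, y, lr, steps=100):
    """Fit nn.Linear(1, 1) with plain SGD and MSE; return the final loss (hypothetical helper)."""
    net = torch.nn.Linear(1, 1)
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(steps):
        loss = loss_fn(net(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

# Standardise inputs and targets to zero mean and unit variance.
x_n = (x - x.mean()) / x.std()
y_n = (y - y.mean()) / y.std()

raw_loss = train(x, y, lr=0.2)       # weights overflow, loss ends up inf or NaN
norm_loss = train(x_n, y_n, lr=0.2)  # stays finite

print("raw data loss finite:       ", math.isfinite(raw_loss))
print("normalised data loss finite:", math.isfinite(norm_loss))
```

A single linear layer still cannot represent a cubic, so the normalised loss stays well above zero; the sketch only shows that the NaN disappears once the gradients are on a sane scale.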
Related
I am trying to create a bare-minimum PyTorch example purely for learning purposes. I found that my PyTorch code works fine with a really small training set, but as soon as I increase the input data size it stops working. This seems very counterintuitive; ideally, a bigger training set should give better results.
[ I have intentionally not used Object Oriented Paradigm as I am trying to first learn the core functionality hence keeping the code to a bare minimum. ]
import numpy as np
import torch
x_train = np.float32(np.random.rand(25,1)*10)
#Synthesize training data; we will verify the weights and bias later with the trained model
def synthesize_output(input):
    return (1.29*input[0] + 13)
y_train = np.array([synthesize_output(row) for row in x_train]).reshape(-1,1)
X_train = torch.from_numpy(x_train)
Y_train = torch.from_numpy(y_train)
learning_rate = 0.001
# Initialize Weights and Bias to random starting values
w = torch.rand(1, 1, requires_grad=True)
b = torch.rand(1, 1, requires_grad=True)
for iter in range(1, 4001):
    #forward pass : predict values
    y_pred = X_train.mm(w).clamp(min=0).add(b)
    #find loss
    loss = (y_pred - Y_train).pow(2).sum()
    #Backward pass for computing gradients
    loss.backward()
    #Just printing the loss to see how it is changing over the iterations
    if (iter % 100) == 0:
        print(f"Iter: {iter}, Loss={loss}")
    #Manually updating weights
    with torch.no_grad():
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad
        w.grad.zero_()
        b.grad.zero_()
#finally check the weight and bias
print(f"Weights: {w} \n\nBias: {b}")
The above code works as is, but as soon as I increase the data size (just doubling it), it stops working.
x_train = np.float32(np.random.rand(50,1)*10)
In contrast, the basic sklearn sample I created seems to work fine even with a much larger input dataset.
import numpy as np
from sklearn.linear_model import LinearRegression
x_train = np.float32(np.random.rand(2000000,1)*10)
def synthesize_output(input):
    return (1.29*input[0] + 13)
y_train = np.array([synthesize_output(row) for row in x_train]).reshape(-1,1)
lm = LinearRegression()
lm.fit(x_train, y_train)
#finally check the weight and constant
lm.score(x_train, y_train)
print(f"Weight: {lm.coef_}")
print(f"Bias: {lm.intercept_}")
Why is PyTorch not able to handle large input data the way sklearn does?
When I use a very large training set (> 5000 samples), the loss goes to NaN.
How do we typically work around this problem?
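One likely explanation, sketched here as an assumption rather than a confirmed diagnosis: the loss uses .sum(), so the gradient grows in proportion to the number of samples, and a learning rate that is stable for 25 points becomes too large for thousands. Averaging the loss with .mean() keeps the gradient scale independent of the dataset size. The helper `fit`, its `reduction` argument, the seeds, and the shortened 500-step loop are all choices made for this illustration.

```python
import math
import numpy as np
import torch

torch.manual_seed(0)
rng = np.random.default_rng(0)

def fit(n_samples, reduction, lr=0.001, steps=500):
    """Manual gradient descent as in the question; returns the final loss (hypothetical helper)."""
    x = torch.from_numpy(np.float32(rng.random((n_samples, 1)) * 10))
    y = 1.29 * x + 13
    w = torch.rand(1, 1, requires_grad=True)
    b = torch.rand(1, 1, requires_grad=True)
    for _ in range(steps):
        err = x.mm(w).clamp(min=0).add(b) - y
        # "sum" scales the gradient with n_samples, "mean" does not.
        loss = err.pow(2).sum() if reduction == "sum" else err.pow(2).mean()
        loss.backward()
        with torch.no_grad():
            w -= lr * w.grad
            b -= lr * b.grad
            w.grad.zero_()
            b.grad.zero_()
    return loss.item()

sum_loss = fit(5000, "sum")    # diverges: effective step size is 5000x larger
mean_loss = fit(5000, "mean")  # stays finite at the same learning rate

print("summed loss finite:  ", math.isfinite(sum_loss))
print("averaged loss finite:", math.isfinite(mean_loss))
```

sklearn's LinearRegression solves the least-squares problem directly rather than by gradient descent, so it has no learning rate to mis-scale. With the averaged loss and lr=0.001 the fit is stable but slow, so more iterations or a somewhat larger learning rate may still be needed to recover 1.29 and 13 closely.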
Recently, I found a very interesting paper, Physics Informed Deep Learning (Part I): Data-driven Solutions of Nonlinear Partial Differential Equations, and wanted to give it a try. For this, I created a dummy problem and implemented what I understood from the paper.
Problem Statement
Suppose I want to solve the ODE dy/dx = cos(x) with boundary conditions y(0)=y(2*pi)=0. We can easily guess the analytic solution y(x)=sin(x), but I want to see how the model predicts the solution using a PINN.
# import libraries
import torch
import torch.autograd as autograd # computation graph
import torch.nn as nn # neural networks
import torch.optim as optim # optimizers e.g. gradient descent, ADAM, etc.
import matplotlib.pyplot as plt
import numpy as np
#Set default dtype to float32
torch.set_default_dtype(torch.float)
#PyTorch random number generator
torch.manual_seed(1234)
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
Model Architecture
## Model Architecture
class FCN(nn.Module):
    ##Neural Network
    def __init__(self,layers):
        super().__init__() #call __init__ from parent class
        # activation function
        self.activation = nn.Tanh()
        # loss function
        self.loss_function = nn.MSELoss(reduction ='mean')
        # Initialise neural network as a list using nn.ModuleList
        self.linears = nn.ModuleList([nn.Linear(layers[i], layers[i+1]) for i in range(len(layers)-1)])
        self.iter = 0
        # Xavier Normal Initialization
        for i in range(len(layers)-1):
            nn.init.xavier_normal_(self.linears[i].weight.data, gain=1.0)
            # set biases to zero
            nn.init.zeros_(self.linears[i].bias.data)
    # forward pass
    def forward(self,x):
        if not torch.is_tensor(x):
            x = torch.from_numpy(x)
        a = x.float()
        # all layers except the last one are followed by the activation
        for i in range(len(self.linears)-1):
            z = self.linears[i](a)
            a = self.activation(z)
        a = self.linears[-1](a)
        return a
    # Loss Functions
    #Loss PDE
    def lossPDE(self,x_PDE):
        g=x_PDE.clone()
        g.requires_grad=True #Enable differentiation
        f=self.forward(g)
        f_x=autograd.grad(f,g,torch.ones([x_PDE.shape[0],1]).to(device),\
            retain_graph=True, create_graph=True)[0]
        loss_PDE=self.loss_function(f_x,PDE(g))
        return loss_PDE
Generate data
# generate training and evaluation points
x = torch.linspace(min,max,total_points).view(-1,1)
y = torch.sin(x)
print(x.shape, y.shape)
# Set Boundary conditions:
# Actually for this problem
# we don't need an extra boundary constraint
# as it coincides with the x_PDE points & values
# BC_1=x[0,:]
# BC_2=x[-1,:]
# print(BC_1,BC_2)
# x_BC=torch.vstack([BC_1,BC_2])
# print(x_BC)
x_PDE = x[1:-1,:]
print(x_PDE.shape)
x_PDE=x_PDE.float().to(device)
# x_BC=x_BC.to(device)
#Create Model
layers = np.array([1,50,50,50,50,1])
model = FCN(layers)
print(model)
model.to(device)
params = list(model.parameters())
optimizer = torch.optim.Adam(model.parameters(),lr=lr,amsgrad=False)
Train Neural Network
for i in range(500):
    yh = model(x_PDE)
    loss = model.lossPDE(x_PDE) # use mean squared error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if i%(500/10)==0:
        print(loss)
predict the solution using PINN
# predict the solution beyond training set
x = torch.linspace(0,max+max,total_points).view(-1,1)
yh=model(x.to(device))
y=torch.sin(x)
#Error
print(model.lossBC(x.to(device)))
y_plot=y.detach().numpy()
yh_plot=yh.detach().cpu().numpy()
fig, ax1 = plt.subplots()
ax1.plot(x,y_plot,color='blue',label='Real')
ax1.plot(x,yh_plot,color='red',label='Predicted')
ax1.set_xlabel('x',color='black')
ax1.set_ylabel('f(x)',color='black')
ax1.tick_params(axis='y', color='black')
ax1.legend(loc = 'upper left')
But the end result was disappointing: the model was unable to learn this simple ODE. I suspect my model architecture has an issue that I can't figure out myself. It would be a great help if anyone could suggest an improvement.
Thanks in advance.
After checking your code, I have a question about how you evaluate on the test data: I am not sure, but your predictions may be bad because you did not call model.eval().
I am not familiar with this network/model, but in my experience with CNNs and basic GCNs, I usually call model.eval() before predicting results (the orange line on your graph).
For example, if I were you, I would do:
for i in range(500):
    model.train()
    yh = model(x_PDE)
    loss = model.lossPDE(x_PDE) # use mean squared error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if i%(500/10)==0:
        print(loss)
model.eval()
-- your test function, just like your training loop but without the backward pass and optimizer step --
I am not sure whether this will affect your results.
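A different angle worth checking, offered as a sketch and not as a confirmed fix: the posted loss only penalises the residual of dy/dx = cos(x), and that equation alone determines y only up to an additive constant. Adding a penalty for the boundary values y(0)=y(2*pi)=0 pins the constant down. The small network, the function name `pinn_loss`, the 200 collocation points and the 300 Adam steps are all assumptions chosen to keep the example short.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small tanh MLP; the layer sizes are an arbitrary choice for this sketch.
model = nn.Sequential(
    nn.Linear(1, 32), nn.Tanh(),
    nn.Linear(32, 32), nn.Tanh(),
    nn.Linear(32, 1),
)

def pinn_loss(model, x_col, x_bc, y_bc):
    """ODE residual loss for dy/dx = cos(x) plus a boundary-value penalty."""
    x = x_col.clone().requires_grad_(True)
    y = model(x)
    # dy/dx at every collocation point via autograd
    dy_dx = torch.autograd.grad(y, x, torch.ones_like(y), create_graph=True)[0]
    residual = dy_dx - torch.cos(x)
    bc_error = model(x_bc) - y_bc
    return residual.pow(2).mean() + bc_error.pow(2).mean()

x_col = torch.linspace(0.0, 2 * math.pi, 200).view(-1, 1)  # collocation points
x_bc = torch.tensor([[0.0], [2 * math.pi]])                # boundary locations
y_bc = torch.zeros(2, 1)                                   # y(0) = y(2*pi) = 0

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
history = []
for step in range(300):
    optimizer.zero_grad()
    loss = pinn_loss(model, x_col, x_bc, y_bc)
    loss.backward()
    optimizer.step()
    history.append(loss.item())

print(f"loss: first step {history[0]:.4f}, last step {history[-1]:.4f}")
```

Note also that a network trained only on points in [0, 2*pi] has no information about x beyond that range, so predictions on an extended interval should not be expected to follow sin(x).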
I am running a PyTorch ANN model (for a classification task) and I am using GridSearchCV with skorch to search for the optimal hyperparameters.
When I run GridSearchCV using n_jobs=1 (i.e. doing one hyperparameter combination at a time), it runs really slowly.
When I set n_jobs to greater than 1, I get an out-of-memory error. So I am now trying to see if I can use PyTorch’s DataLoader to split the dataset into batches to avoid the memory issue. According to this other PyTorch Forum question (https://discuss.pytorch.org/t/how-to-use-skorch-for-data-that-does-not-fit-into-memory/70081/2), it appears we can use SliceDataset. My code for this is below:
# Setting up artificial neural net model
class TabularModel(nn.Module):
    # Initialize parameters embeds, emb_drop, bn_cont and layers
    def __init__(self, emb_szs, n_cont, out_sz, layers, p=0.5):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_szs])
        self.emb_drop = nn.Dropout(p)
        self.bn_cont = nn.BatchNorm1d(n_cont)
        # Create empty list for each layer in the neural net
        layerlist = []
        # Number of all embedded columns for categorical features
        n_emb = sum((nf for ni, nf in emb_szs))
        # Number of inputs for each layer
        n_in = n_emb + n_cont
        for i in layers:
            # Set the linear function for the weights and biases, wX + b
            layerlist.append(nn.Linear(n_in, i))
            # Using ReLU activation function
            layerlist.append(nn.ReLU(inplace=True))
            # Normalise all the activation function output values
            layerlist.append(nn.BatchNorm1d(i))
            # Set some of the normalised activation function output values to zero
            layerlist.append(nn.Dropout(p))
            # Reassign number of inputs for the next layer
            n_in = i
        # Append last layer
        layerlist.append(nn.Linear(layers[-1], out_sz))
        # Create sequential layers
        self.layers = nn.Sequential(*layerlist)
    # Function for feedforward
    def forward(self, x_cat_cont):
        x_cat = x_cat_cont[:,0:cat_train.shape[1]].type(torch.int64)
        x_cont = x_cat_cont[:,cat_train.shape[1]:].type(torch.float32)
        # Create empty list for embedded categorical features
        embeddings = []
        # Embed categorical features
        for i, e in enumerate(self.embeds):
            embeddings.append(e(x_cat[:,i]))
        # Concatenate embedded categorical features
        x = torch.cat(embeddings, 1)
        # Apply dropout rates to categorical features
        x = self.emb_drop(x)
        # Batch normalize continuous features
        x_cont = self.bn_cont(x_cont)
        # Concatenate categorical and continuous features
        x = torch.cat([x, x_cont], 1)
        # Feed categorical and continuous features into neural net layers
        x = self.layers(x)
        return x
# Use cross entropy loss function since this is a classification problem
# Assign class weights to the loss function
criterion_skorch = nn.CrossEntropyLoss
# Use Adam solver with learning rate 0.001
optimizer_skorch = torch.optim.Adam
from skorch import NeuralNetClassifier
# Random seed chosen to ensure results are reproducible by using the same initial random weights and biases,
# and applying dropout rates to the same random embedded categorical features and neurons in the hidden layers
torch.manual_seed(0)
net = NeuralNetClassifier(module=TabularModel,
module__emb_szs=emb_szs,
module__n_cont=con_train.shape[1],
module__out_sz=2,
module__layers=[30],
module__p=0.0,
criterion=criterion_skorch,
criterion__weight=cls_wgt,
optimizer=optimizer_skorch,
optimizer__lr=0.001,
max_epochs=150,
device='cuda'
)
from sklearn.model_selection import GridSearchCV
param_grid = {'module__layers': [[30], [50,20]],
'module__p': [0.0],
'max_epochs': [150, 175]
}
from torch.utils.data import TensorDataset, DataLoader
from skorch.helper import SliceDataset
# cat_con_train and y_train is a PyTorch tensor
tsr_ds = TensorDataset(cat_con_train.cpu(), y_train.cpu())
torch.manual_seed(0) # Set random seed for shuffling results to be reproducible
d_loader = DataLoader(tsr_ds, batch_size=100000, shuffle=True)
d_loader_slice_X = SliceDataset(d_loader, idx=0)
d_loader_slice_y = SliceDataset(d_loader, idx=1)
models = GridSearchCV(net, param_grid, scoring='roc_auc', n_jobs=2).fit(d_loader_slice_X, d_loader_slice_y)
However, when I run this code, I get the following error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-47-df3fc792ad5e> in <module>()
104
--> 105 models = GridSearchCV(net, param_grid, scoring='roc_auc', n_jobs=2).fit(d_loader_slice_X, d_loader_slice_y)
106
6 frames
/usr/local/lib/python3.6/dist-packages/skorch/helper.py in __getitem__(self, i)
230 def __getitem__(self, i):
231 if isinstance(i, (int, np.integer)):
--> 232 Xn = self.dataset[self.indices_[i]]
233 Xi = self._select_item(Xn)
234 return self.transform(Xi)
TypeError: 'DataLoader' object does not support indexing
How do I fix this? Is there a way to use PyTorch’s DataLoader together with GridSearchCV and skorch (i.e. is there a way to load data in batches into GridSearchCV, to avoid out-of-memory issues when I set n_jobs to greater than 1)?
Many many thanks in advance!
So the first thing is to find out where you run out of memory. You have a very high batch size and presumably only one GPU. If you have more than one GPU, you are already set: you can follow these steps to parallelize the grid search over multiple GPUs using skorch + dask.
If you only have one GPU, the GPU's RAM is obviously the bottleneck, and it cannot hold two instances of the model at once. You could:
reduce the model size (fewer parameters)
reduce the batch size (data takes up less space)
Which route you take is up to you, though.
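On the TypeError itself: the traceback shows SliceDataset evaluating self.dataset[...], so it needs something indexable. A torch Dataset supports indexing, while a DataLoader can only be iterated. The sketch below uses toy tensors (hypothetical stand-ins for cat_con_train and y_train) to show the difference in plain PyTorch. It is an assumption, not something tested against this exact setup, that passing tsr_ds instead of d_loader to SliceDataset and setting the batch size through the net's batch_size argument resolves the error.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy stand-ins for cat_con_train / y_train: 10 rows, 2 features.
X = torch.arange(20, dtype=torch.float32).view(10, 2)
y = torch.arange(10) % 2

tsr_ds = TensorDataset(X, y)
d_loader = DataLoader(tsr_ds, batch_size=4)

# A Dataset supports integer indexing and returns one (features, label) pair.
first_x, first_y = tsr_ds[0]

# A DataLoader has no __getitem__, so indexing raises TypeError.
try:
    d_loader[0]
    loader_indexable = True
except TypeError:
    loader_indexable = False

# Batching only happens when the loader is iterated.
batch_shapes = [tuple(xb.shape) for xb, _ in d_loader]

print("first sample:", first_x.tolist(), int(first_y))
print("DataLoader indexable:", loader_indexable)
print("batch shapes:", batch_shapes)
```

Smaller batches reduce memory use per training step, but with n_jobs=2 two copies of the model still have to fit on the same GPU, which is the constraint described above.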
I'm currently working on a variation of Variational Autoencoder in a sequential setting, where the task is to fit/recover a sequence of real-valued observation data (hence it is a regression problem).
I have built my model using tf.keras with eager execution enabled, and tensorflow_probability (tfp). Following VAE concept, the generative net emits the distribution parameters of the observation data, which I model as multivariate normal. Therefore the outputs are mean and logvar of the predicted distribution.
Regarding training process, the first component of the loss is reconstruction error. That is the log likelihood of the true observation, given the predicted (parameters) distribution from the generative net. Here, I use tfp.distributions, since it is fast and handy.
However, after training finishes with a considerably low loss value, it turns out that my model does not seem to have learned anything. The predicted values from the model are almost flat across the time dimension (recall that the problem is sequential).
Nevertheless, as a sanity check, when I replace the log likelihood with an MSE loss (which is not justifiable when working on a VAE), the model fits the data very well. So I conclude that there must be something wrong with this log likelihood term. Does anyone have a clue and/or a solution for this?
I have considered replacing the log likelihood with a cross-entropy loss, but I think that is not applicable in my case, since my problem is regression and the data can't be normalized into the [0,1] range.
I have also tried an annealed KL term (i.e. weighting the KL term with a constant < 1) when using the log likelihood as the reconstruction loss, but that didn't work either.
Here is my code snippet of the original (using log likelihood as reconstruction error) loss function:
import tensorflow as tf
tfe = tf.contrib.eager
tf.enable_eager_execution()
import tensorflow_probability as tfp
tfd = tfp.distributions
def loss(model, inputs):
    outputs, _ = SSM_model(model, inputs)
    #allocate the corresponding output component
    infer_mean = outputs[:,:,:latent_dim] #mean of latent variable from inference net
    infer_logvar = outputs[:,:,latent_dim : (2 * latent_dim)]
    trans_mean = outputs[:,:,(2 * latent_dim):(3 * latent_dim)] #mean of latent variable from transition net
    trans_logvar = outputs[:,:, (3 * latent_dim):(4 * latent_dim)]
    obs_mean = outputs[:,:,(4 * latent_dim):((4 * latent_dim) + output_obs_dim)] #mean of observation from generative net
    obs_logvar = outputs[:,:,((4 * latent_dim) + output_obs_dim):]
    target = inputs[:,:,2:4]
    #transform logvar to std
    infer_std = tf.sqrt(tf.exp(infer_logvar))
    trans_std = tf.sqrt(tf.exp(trans_logvar))
    obs_std = tf.sqrt(tf.exp(obs_logvar))
    #computing loss at each time step
    time_step_loss = []
    for i in range(tf.shape(outputs)[0].numpy()):
        #distribution of each module
        infer_dist = tfd.MultivariateNormalDiag(infer_mean[i],infer_std[i])
        trans_dist = tfd.MultivariateNormalDiag(trans_mean[i],trans_std[i])
        obs_dist = tfd.MultivariateNormalDiag(obs_mean[i],obs_std[i])
        #log likelihood of observation
        likelihood = obs_dist.prob(target[i]) #shape = 1D = batch_size
        likelihood = tf.clip_by_value(likelihood, 1e-37, 1)
        log_likelihood = tf.log(likelihood)
        #KL of (q|p)
        kl = tfd.kl_divergence(infer_dist, trans_dist) #shape = batch_size
        #the loss
        loss = - log_likelihood + kl
        time_step_loss.append(loss)
    time_step_loss = tf.convert_to_tensor(time_step_loss)
    overall_loss = tf.reduce_sum(time_step_loss)
    overall_loss = tf.cast(overall_loss, dtype='float32')
    return overall_loss
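One possible contributor, shown here as a hedged sketch rather than a diagnosis of the model above: computing prob(), clipping it, and then taking the log can silently remove the gradient. When the density underflows to 0 in float32, the clip replaces it with the constant 1e-37, and the gradient through a clipped value is zero, so the mean receives no learning signal. Working with the log density directly avoids this. The example uses torch.distributions purely for illustration, swapped in for tensorflow_probability; the target, mean and standard deviation values are made up.

```python
import torch
from torch.distributions import Independent, Normal

# Made-up values: the target lies 50 standard deviations from the mean.
target = torch.tensor([5.0, 5.0])
std = torch.tensor([0.1, 0.1])

def diag_normal(mean):
    # Independent(Normal, 1) behaves like a diagonal multivariate normal.
    return Independent(Normal(mean, std), 1)

# Variant A: density -> clip -> log, mirroring prob / clip_by_value / log.
mean_a = torch.zeros(2, requires_grad=True)
likelihood = diag_normal(mean_a).log_prob(target).exp()  # underflows to 0.0
clipped_ll = torch.log(torch.clamp(likelihood, 1e-37, 1.0))
clipped_ll.backward()

# Variant B: use the log density directly.
mean_b = torch.zeros(2, requires_grad=True)
direct_ll = diag_normal(mean_b).log_prob(target)
direct_ll.backward()

print("clipped log-likelihood:", clipped_ll.item(), "grad:", mean_a.grad.tolist())
print("direct log-likelihood: ", direct_ll.item(), "grad:", mean_b.grad.tolist())
```

In the clipped variant the log-likelihood is stuck near log(1e-37), roughly -85, with zero gradient, while the direct variant reports a much lower value with a large gradient pointing toward the target. tfp.distributions also exposes log_prob on its distributions, which would be the analogous call in the original code.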
I am experimenting with a simple 2-layer neural network in PyTorch, feeding in only three inputs of size 10 each, with a single value as output. I have normalised the inputs and lowered the learning rate. It is my understanding that a two-layer fully connected neural network should be able to trivially fit this data.
Features:
0.8138 1.2342 0.4419 0.8273 0.0728 2.4576 0.3800 0.0512 0.6872 0.5201
1.5666 1.3955 1.0436 0.1602 0.1688 0.2074 0.8810 0.9155 0.9641 1.3668
1.7091 0.9091 0.5058 0.6149 0.3669 0.1365 0.3442 0.9482 1.2550 1.6950
[torch.FloatTensor of size 3x10]
Targets
[124, 125, 122]
[torch.FloatTensor of size 3]
The code is adapted from a simple example and I am using MSELoss as the loss function. The loss diverges to infinity after just a few iterations:
features = torch.from_numpy(np.array(features))
x_data = Variable(torch.Tensor(features))
y_data = Variable(torch.Tensor(targets))
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear = torch.nn.Linear(10,5)
        self.linear2 = torch.nn.Linear(5,1)
    def forward(self, x):
        l_out1 = self.linear(x)
        y_pred = self.linear2(l_out1)
        return y_pred
model = Model()
criterion = torch.nn.MSELoss(size_average = False)
optim = torch.optim.SGD(model.parameters(), lr = 0.001)
def main():
    for iteration in range(1000):
        y_pred = model(x_data)
        loss = criterion(y_pred, y_data)
        print(iteration, loss.data[0])
        optim.zero_grad()
        loss.backward()
        optim.step()
Any help would be appreciated. Thanks
EDIT:
Indeed it seems that this was simply due to the learning rate being too high. Setting to 0.00001 fixes convergence issues, albeit giving very slow convergence.
This is because you're not using a non-linearity between the layers, so your network is still linear.
You can use ReLU to make it non-linear. You can change the forward method like this:
...
y_pred = self.linear2(torch.nn.functional.relu(l_out1))
...
You could also try predicting log(y) instead of y to improve convergence even more. The Adam optimizer (adaptive learning rate) should also help, as should batch normalization (for example between your linear layers).
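Putting these suggestions together, a minimal sketch on the three samples from the question might look as follows. It uses a ReLU between the two linear layers and the Adam optimizer. Standardising the targets (instead of predicting log(y)), the learning rate of 0.01, the 500 iterations and the (3, 1) target shape are assumptions made for this example.

```python
import torch

torch.manual_seed(0)

features = torch.tensor([
    [0.8138, 1.2342, 0.4419, 0.8273, 0.0728, 2.4576, 0.3800, 0.0512, 0.6872, 0.5201],
    [1.5666, 1.3955, 1.0436, 0.1602, 0.1688, 0.2074, 0.8810, 0.9155, 0.9641, 1.3668],
    [1.7091, 0.9091, 0.5058, 0.6149, 0.3669, 0.1365, 0.3442, 0.9482, 1.2550, 1.6950],
])
# Shape (3, 1) so the targets line up with the model output without broadcasting.
targets = torch.tensor([[124.0], [125.0], [122.0]])

# Standardise the targets so the loss starts on a scale of about 1.
t_mean, t_std = targets.mean(), targets.std()
targets_n = (targets - t_mean) / t_std

model = torch.nn.Sequential(
    torch.nn.Linear(10, 5),
    torch.nn.ReLU(),        # non-linearity between the two linear layers
    torch.nn.Linear(5, 1),
)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

losses = []
for iteration in range(500):
    loss = criterion(model(features), targets_n)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Undo the target scaling to get predictions in the original units.
with torch.no_grad():
    predictions = model(features) * t_std + t_mean

print(f"loss: first {losses[0]:.4f}, last {losses[-1]:.6f}")
print("predictions:", predictions.squeeze(1).tolist())
```

Keeping the targets as a column of shape (3, 1) matters: with a flat tensor of size 3, MSELoss would broadcast it against the (3, 1) predictions and compare every prediction with every target.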