Recently, I found a very interesting paper, Physics Informed Deep Learning (Part I): Data-driven Solutions of Nonlinear Partial Differential Equations, and wanted to give it a try. To do so, I created a dummy problem and implemented what I understood from the paper.
Problem Statement
Suppose I want to solve the ODE dy/dx = cos(x) with the boundary conditions y(0) = y(2*pi) = 0. Integrating gives y(x) = sin(x) + C, and y(0) = 0 forces C = 0, so the analytic solution is y(x) = sin(x) (and y(2*pi) = 0 is satisfied automatically). But I want to see how well the model predicts the solution using a PINN.
# import libraries
import torch
import torch.autograd as autograd # computation graph
import torch.nn as nn # neural networks
import torch.optim as optim # optimizers e.g. gradient descent, ADAM, etc.
import matplotlib.pyplot as plt
import numpy as np
#Set default dtype to float32
torch.set_default_dtype(torch.float)
#PyTorch random number generator
torch.manual_seed(1234)
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
Model Architecture
## Model Architecture
class FCN(nn.Module):
    ## Neural Network
    def __init__(self, layers):
        super().__init__()  # call __init__ from parent class
        # activation function
        self.activation = nn.Tanh()
        # loss function
        self.loss_function = nn.MSELoss(reduction='mean')
        # Initialise the neural network as a list using nn.ModuleList
        self.linears = nn.ModuleList([nn.Linear(layers[i], layers[i+1]) for i in range(len(layers)-1)])
        self.iter = 0
        # Xavier normal initialisation
        for i in range(len(layers)-1):
            nn.init.xavier_normal_(self.linears[i].weight.data, gain=1.0)
            # set biases to zero
            nn.init.zeros_(self.linears[i].bias.data)

    # forward pass
    def forward(self, x):
        if not torch.is_tensor(x):
            x = torch.from_numpy(x)
        a = x.float()
        for i in range(len(self.linears)-1):
            z = self.linears[i](a)
            a = self.activation(z)
        a = self.linears[-1](a)
        return a

    # Loss functions
    # Loss from the PDE residual: dy/dx should match PDE(x)
    def lossPDE(self, x_PDE):
        g = x_PDE.clone()
        g.requires_grad = True  # enable differentiation
        f = self.forward(g)
        f_x = autograd.grad(f, g, torch.ones([x_PDE.shape[0], 1]).to(device),
                            retain_graph=True, create_graph=True)[0]
        loss_PDE = self.loss_function(f_x, PDE(g))
        return loss_PDE
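The code above calls PDE(g) and later uses min, max, total_points, and lr without showing their definitions. A minimal sketch of what they might look like for this problem follows; the exact values in the original code are not shown, so these are assumptions:
# Assumed helper and hyperparameters; not shown in the original post.
# For dy/dx = cos(x), the residual target is cos evaluated at the collocation points.
def PDE(x):
    return torch.cos(x)

min, max = 0.0, 2 * np.pi   # domain endpoints (shadowing the built-ins, as the code below does)
total_points = 200          # number of collocation points (assumed)
lr = 1e-3                   # Adam learning rate (assumed)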
Generate data
# generate training and evaluation points
x = torch.linspace(min,max,total_points).view(-1,1)
y = torch.sin(x)
print(x.shape, y.shape)
# Set Boundary conditions:
# Actually, for this problem we don't need an extra boundary constraint,
# since it coincides with the x_PDE points and values.
# BC_1=x[0,:]
# BC_2=x[-1,:]
# print(BC_1,BC_2)
# x_BC=torch.vstack([BC_1,BC_2])
# print(x_BC)
x_PDE = x[1:-1,:]
print(x_PDE.shape)
x_PDE=x_PDE.float().to(device)
# x_BC=x_BC.to(device)
#Create Model
layers = np.array([1,50,50,50,50,1])
model = FCN(layers)
print(model)
model.to(device)
params = list(model.parameters())
optimizer = torch.optim.Adam(model.parameters(),lr=lr,amsgrad=False)
Train Neural Network
for i in range(500):
    yh = model(x_PDE)
    loss = model.lossPDE(x_PDE)  # mean squared error of the PDE residual
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if i % (500 // 10) == 0:
        print(loss)
Predict the solution using the PINN
# predict the solution beyond training set
x = torch.linspace(0,max+max,total_points).view(-1,1)
yh=model(x.to(device))
y=torch.sin(x)
#Error
print(model.lossPDE(x.to(device)))  # lossBC is not defined since the BC code above is commented out
y_plot=y.detach().numpy()
yh_plot=yh.detach().cpu().numpy()
fig, ax1 = plt.subplots()
ax1.plot(x,y_plot,color='blue',label='Real')
ax1.plot(x,yh_plot,color='red',label='Predicted')
ax1.set_xlabel('x',color='black')
ax1.set_ylabel('f(x)',color='black')
ax1.tick_params(axis='y', color='black')
ax1.legend(loc = 'upper left')
But the end result was disappointing: the model was unable to learn even this simple ODE. I suspect my model architecture has some issue that I couldn't figure out myself. It would be a great help if anyone could suggest an improvement.
Thanks in advance.
After checking your code, I have a question about the test dataset; I am not sure whether the reason your predictions are bad is that you did not add model.eval().
I am not familiar with this net/model, but from my experience with CNNs and basic GCNs, I tend to use model.eval() when predicting my results (the orange line on your graph).
For example, if I were you, I would do:
for i in range(500):
    model.train()
    yh = model(x_PDE)
    loss = model.lossPDE(x_PDE)  # mean squared error of the PDE residual
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if i % (500 // 10) == 0:
        print(loss)

model.eval()
# -- your test function, just like your training code but without backward()/optimizer.step() --
I am not sure whether this will have an effect on your results.
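For what it's worth, a minimal sketch of such an evaluation step (assuming x_test holds whatever points you want to predict on) might be:
# Hypothetical evaluation step: same forward pass, but no backward() or optimizer.step().
model.eval()
with torch.no_grad():
    yh_test = model(x_test.to(device))
print(yh_test.shape)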
Related
I am building a cascade of neural networks, and I would like to backpropagate the main loss back to the DNNs and also compute an auxiliary loss for each DNN.
I am trying to figure out the best practice when building such a model and how to make sure that my losses are computed properly. Do I build a single torch.nn.Module and a single optimizer, or do I have to create separate modules and optimizers for each network? Also, I am likely to have more than three cascaded DNNs.
Approach a)
import torch
from torch import nn, optim
class MasterNetwork(nn.Module):
    def __init__(self):
        super(MasterNetwork, self).__init__()
        # placeholders for the three sub-networks
        self.dnn1 = nn.ModuleList()
        self.dnn2 = nn.ModuleList()
        self.dnn3 = nn.ModuleList()

    def forward(self, x, z1, z2):
        out1 = self.dnn1(x)
        out2 = self.dnn2(out1 + z1)
        out3 = self.dnn3(out2 + z2)
        return [out1, out2, out3]

def LossFunction(x):
    # do stuff
    return loss  # loss is a scalar value

def ac_loss_1_fn(x):
    # do stuff
    return loss  # loss is a scalar value

def ac_loss_2_fn(x):
    # do stuff
    return loss  # loss is a scalar value

def ac_loss_3_fn(x):
    # do stuff
    return loss  # loss is a scalar value
model = MasterNetwork()
optimizer = optim.Adam(model.parameters())
input = torch.tensor()
z1 = torch.tensor()
z2 = torch.tensor()
outputs = model(input, z1, z2)
main_loss = LossFunction(outputs[2])
ac1_loss = ac_loss_1_fn(outputs[0])
ac2_loss = ac_loss_2_fn(outputs[1])
ac3_loss = ac_loss_3_fn(outputs[2])
optimizer.zero_grad()
'''
This is where I am uncertain about how to backpropagate the AC losses for each DNN
in addition to the main loss.
'''
optimizer.step()
Approach b)
This would mean creating an nn.Module class and an optimizer for each DNN, and then forwarding the loss to the next DNN.
I would prefer a solution for approach a), since it is less tedious and I don't have to deal with tuning multiple optimizers. However, I am not sure whether this is possible. There was a similar question about backpropagating multiple losses; however, I was not able to understand how combining the losses would work for the distinct components.
The solution you are looking for is likely to use some form of the following:
y = torch.stack([main_loss, ac1_loss, ac2_loss, ac3_loss])
y.backward(gradient=torch.tensor([1.0, 1.0, 1.0, 1.0]))
(torch.stack is used here rather than torch.tensor so that the individual losses keep their connection to the computation graph.)
See https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#gradients for confirmation.
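As a self-contained sketch of why this works (the losses below are toy stand-ins, since the actual networks aren't shown), backpropagating a vector of losses with a gradient of ones produces the same gradients as calling backward() on their sum:
import torch

# Toy parameter and stand-in losses; in the real model these would come from the DNNs.
w = torch.randn(3, requires_grad=True)
main_loss = (w ** 2).sum()
ac1_loss = w.sum()
ac2_loss = (2 * w).sum()
ac3_loss = (w - 1).pow(2).sum()

y = torch.stack([main_loss, ac1_loss, ac2_loss, ac3_loss])
y.backward(gradient=torch.ones(4))  # same gradients as (main_loss + ac1_loss + ac2_loss + ac3_loss).backward()
print(w.grad)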
A similar question exists, but it uses different phrasing and was the question I found first when hitting the issue. The similar question is: Pytorch. Can autograd be used when the final tensor has more than a single value in it?
I am learning how to build a neural network using PyTorch.
This formula is the target of my code:
y = 2*X^3 + 7*X^2 - 8*X + 120
It is a regression problem.
I used this because it is simple and the output can be calculated exactly, so I can check whether my neural network is able to predict the output from the given input.
However, I ran into a problem during training.
The problem occurs in this line of code:
loss = loss_func(prediction, outputs)
The loss computed in this line is NaN (not a number).
I am using MSELoss as the loss function. 100 data points are used for training the ANN model. The input X_train ranges from -1000 to 1000.
I believe the problem is due to the values of X_train and how MSELoss works: X_train should be scaled to values between 0 and 1 so that MSELoss can compute the loss.
However, is it possible to train the ANN model without scaling the inputs to values between 0 and 1 in a regression problem?
Here is my code; it does not use MinMaxScaler, and it prints NaN for the loss:
import torch
import torch.nn as nn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch.nn.functional as F
from torch.autograd import Variable
#Load datasets
dataset = pd.read_csv('test_100.csv')
x_temp_train = dataset.iloc[:79, :-1].values
y_temp_train = dataset.iloc[:79, -1:].values
x_temp_test = dataset.iloc[80:, :-1].values
y_temp_test = dataset.iloc[80:, -1:].values
#Turn into tensor
X_train = torch.FloatTensor(x_temp_train)
Y_train = torch.FloatTensor(y_temp_train)
X_test = torch.FloatTensor(x_temp_test)
Y_test = torch.FloatTensor(y_temp_test)
# Define an Artificial Neural Network
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.linear = nn.Linear(1, 1)  # input=1, output=1, bias=True

    def forward(self, x):
        x = self.linear(x)
        return x
net = Net()
print(net)
#Define a Loss function and optimizer
optimizer = torch.optim.SGD(net.parameters(), lr=0.2)
loss_func = torch.nn.MSELoss()
#Training
inputs = Variable(X_train)
outputs = Variable(Y_train)
for i in range(100):  # epoch=100
    prediction = net(inputs)
    loss = loss_func(prediction, outputs)
    optimizer.zero_grad()  # zero the parameter gradients
    loss.backward()        # compute gradients (dloss/dx)
    optimizer.step()       # update the parameters
    if i % 10 == 9:  # print every 10 mini-batches
        # plot and show learning process
        plt.cla()
        plt.scatter(X_train.data.numpy(), Y_train.data.numpy())
        plt.plot(X_train.data.numpy(), prediction.data.numpy(), 'r-', lw=2)
        plt.text(0.5, 0, 'Loss=%.4f' % loss.data.numpy(), fontdict={'size': 10, 'color': 'red'})
        plt.pause(0.1)
plt.show()
Thanks for your time.
Is normalization necessary for a regression problem in a neural network?
No.
But...
I can tell you that MSELoss works with non-normalised values. You can tell because:
>>> import torch
>>> torch.nn.MSELoss()(torch.randn(1)-1000, torch.randn(1)+1000)
tensor(4002393.)
MSE is a very well-behaved loss function, and you can't really get NaN without giving it a NaN. I would bet that your model is giving a NaN output.
The two most common causes of a NaN are: an accidental divide by 0, and absurdly large weights/gradients.
I ran a variant of your code on my machine using:
x = torch.randn(79, 1)*1000
y = 2*x**3 + 7*x**2 - 8*x + 120
And it got to NaN in about 20 training steps due to absurdly large weights.
A model can get absurdly large weights if the learning rate is too large. You may think 0.2 is not too large, but that's a typical learning rate people use for normalised data, which forces their gradients to be fairly small. Since you are not using normalised data, let's calculate how large your gradients are (roughly).
First, your x is on the order of 1e3 and your expected output y scales as x^3, so MSE computes (pred - y)^2, which puts your loss on the order of (1e3^3)^2 = 1e18. This propagates to your gradients, and recall that each weight update is on the order of gradient*learning_rate, so it's easy to see why your weights fairly quickly explode outside of float precision.
How to fix this? Well, you could use a learning rate of 2e-7. Or you could just normalise your data. I recommend normalising your data; it has other nice properties for training and avoids these kinds of problems.
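For illustration, here is a minimal sketch of standardising the data from the question before training (X_train and Y_train are the tensors built from the CSV above; everything else is an assumption):
# Standardise inputs and targets so gradients stay at a reasonable scale.
x_mean, x_std = X_train.mean(), X_train.std()
y_mean, y_std = Y_train.mean(), Y_train.std()

X_norm = (X_train - x_mean) / x_std
Y_norm = (Y_train - y_mean) / y_std

# ... train on (X_norm, Y_norm) exactly as before ...
# To get predictions back on the original scale:
# y_pred = net(X_norm) * y_std + y_mean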
My inducing points are set to trainable but do not change when I call opt.minimize(). Why is that, and what does it mean? Does it mean the model is not learning?
What is the difference between tf.optimizers.Adam(lr) and gpflow.optimizers.Scipy?
The following is a simple classification example adapted from the documentation. When I run this example with gpflow's Scipy optimizer, I get the trained results and the values of the inducing variables keep changing. But when I use the Adam optimizer, I get only a straight-line prediction, and the values of the inducing points remain the same. It suggests that the model is not learning with the Adam optimizer.
(Plots omitted: the data before training, the predictions after training with Adam, and the predictions after training with gpflow's Scipy optimizer.)
The link for the example is https://gpflow.readthedocs.io/en/develop/notebooks/advanced/multiclass_classification.html
import numpy as np
import tensorflow as tf
import warnings
warnings.filterwarnings('ignore') # ignore DeprecationWarnings from tensorflow
import matplotlib.pyplot as plt
import gpflow
from gpflow.utilities import print_summary, set_trainable
from gpflow.ci_utils import ci_niter
from tensorflow2_work.multiclass_classification import plot_posterior_predictions, colors
np.random.seed(0) # reproducibility
# Number of functions and number of data points
C = 3
N = 100
# RBF kernel lengthscale
lengthscale = 0.1
# Jitter
jitter_eye = np.eye(N) * 1e-6
# Input
X = np.random.rand(N, 1)
kernel_se = gpflow.kernels.SquaredExponential(lengthscale=lengthscale)
K = kernel_se(X) + jitter_eye
# Latents prior sample
f = np.random.multivariate_normal(mean=np.zeros(N), cov=K, size=(C)).T
# Hard max observation
Y = np.argmax(f, 1).reshape(-1,).astype(int)
print(Y.shape)
# One-hot encoding
Y_hot = np.zeros((N, C), dtype=bool)
Y_hot[np.arange(N), Y] = 1
data = (X, Y)
plt.figure(figsize=(12, 6))
order = np.argsort(X.reshape(-1,))
print(order.shape)
for c in range(C):
    plt.plot(X[order], f[order, c], '.', color=colors[c], label=str(c))
    plt.plot(X[order], Y_hot[order, c], '-', color=colors[c])
plt.legend()
plt.xlabel('$X$')
plt.ylabel('Latent (dots) and one-hot labels (lines)')
plt.title('Sample from the joint $p(Y, \mathbf{f})$')
plt.grid()
plt.show()
# sum kernel: Matern32 + White
kernel = gpflow.kernels.Matern32() + gpflow.kernels.White(variance=0.01)
# Robustmax Multiclass Likelihood
invlink = gpflow.likelihoods.RobustMax(C) # Robustmax inverse link function
likelihood = gpflow.likelihoods.MultiClass(C, invlink=invlink) # Multiclass likelihood
Z = X[::5].copy() # inducing inputs
#print(Z)
m = gpflow.models.SVGP(kernel=kernel, likelihood=likelihood,
                       inducing_variable=Z, num_latent_gps=C, whiten=True, q_diag=True)
# Only train the variational parameters
set_trainable(m.kernel.kernels[1].variance, True)
set_trainable(m.inducing_variable, True)
print(m.inducing_variable.Z)
print_summary(m)
training_loss = m.training_loss_closure(data)
# opt is either gpflow.optimizers.Scipy() or tf.optimizers.Adam(lr); this is where the two optimizers are swapped (not shown above)
opt.minimize(training_loss, m.trainable_variables)
print(m.inducing_variable.Z)
print_summary(m.inducing_variable.Z)
print(m.inducing_variable.Z)
# %%
plot_posterior_predictions(m, X, Y)
The example given in the question isn't copy&pastable, but it seems like you simply exchange opt = gpflow.optimizers.Scipy() with opt = tf.optimizers.Adam(). The minimize() method of gpflow's Scipy optimizer runs one call of scipy.optimize.minimize, which by default runs to convergence (you can also specify a maximum number of iterations by passing, e.g., options=dict(maxiter=100) to the minimize() call).
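For example, with the model m and closure from the question, a capped Scipy run would look roughly like this:
# Sketch: the Scipy optimizer runs to convergence unless maxiter caps the iterations.
opt = gpflow.optimizers.Scipy()
opt.minimize(training_loss, m.trainable_variables, options=dict(maxiter=100))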
In contrast, the minimize() method of TensorFlow optimizers runs only a single optimization step. To run more steps, say iter = 100, you need to manually write a loop:
for _ in range(iter):
    opt.minimize(model.training_loss, model.trainable_variables)
For this to actually run fast, you also need to wrap the optimization step in tf.function:
@tf.function
def optimization_step():
    opt.minimize(model.training_loss, model.trainable_variables)

for _ in range(iter):
    optimization_step()
This runs exactly iter steps; in TensorFlow you have to handle convergence checks yourself, and your model may or may not have converged after this many steps.
So in your usage, you only ran one step - this did change the parameters, but presumably too little to notice the difference. (You could see a larger effect in one step by making the learning rate much higher, though that would not be a good idea for actually optimizing the model with many steps.)
Usage of the Adam optimizer with GPflow models is demonstrated in the notebook on stochastic variational inference, though it also works for non-stochastic optimization.
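For completeness, here is a sketch of how the model m from the question could be trained with Adam for many steps (the learning rate, step count, and logging below are assumptions, not part of the original code):
# Hypothetical Adam training loop for the SVGP model m from the question.
adam = tf.optimizers.Adam(learning_rate=0.01)   # learning rate is an assumption
training_loss = m.training_loss_closure(data)   # same closure as in the question

@tf.function
def optimization_step():
    adam.minimize(training_loss, m.trainable_variables)

for step in range(1000):                        # step count is an assumption
    optimization_step()
    if step % 100 == 0:
        print(step, training_loss().numpy())
print(m.inducing_variable.Z)                    # check whether the inducing points have moved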
Note that, in any case, all parameters such as inducing point locations are set trainable by default, so your call to set_trainable(..., True) doesn't affect what's going on here.
I am experimenting with a simple two-layer neural network in PyTorch, feeding in only three inputs of size 10 each, with a single value as output. I have normalised the inputs and lowered the learning rate. My understanding is that a two-layer fully connected neural network should be able to fit this data trivially.
Features:
0.8138 1.2342 0.4419 0.8273 0.0728 2.4576 0.3800 0.0512 0.6872 0.5201
1.5666 1.3955 1.0436 0.1602 0.1688 0.2074 0.8810 0.9155 0.9641 1.3668
1.7091 0.9091 0.5058 0.6149 0.3669 0.1365 0.3442 0.9482 1.2550 1.6950
[torch.FloatTensor of size 3x10]
Targets
[124, 125, 122]
[torch.FloatTensor of size 3]
The code is adapted from a simple example and I am using MSELoss as the loss function. The loss diverges to infinity after just a few iterations:
features = torch.from_numpy(np.array(features))
x_data = Variable(torch.Tensor(features))
y_data = Variable(torch.Tensor(targets))
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear = torch.nn.Linear(10, 5)
        self.linear2 = torch.nn.Linear(5, 1)

    def forward(self, x):
        l_out1 = self.linear(x)
        y_pred = self.linear2(l_out1)
        return y_pred

model = Model()
criterion = torch.nn.MSELoss(size_average=False)
optim = torch.optim.SGD(model.parameters(), lr=0.001)

def main():
    for iteration in range(1000):
        y_pred = model(x_data)
        loss = criterion(y_pred, y_data)
        print(iteration, loss.data[0])
        optim.zero_grad()
        loss.backward()
        optim.step()
Any help would be appreciated. Thanks
EDIT:
Indeed, it seems that this was simply due to the learning rate being too high. Setting it to 0.00001 fixes the convergence issues, albeit giving very slow convergence.
This is because you're not using a non-linearity between the layers, so your network is still linear.
You can use ReLU to make it non-linear. You can change the forward method like this:
...
l_out1 = torch.nn.functional.relu(self.linear(x))
y_pred = self.linear2(l_out1)
...
Maybe you can also try to predict log(y) instead of y to improve convergence even more. The Adam optimizer (adaptive learning rate) should help as well, as should BatchNormalization (for example, between your linear layers).
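Putting these suggestions together, a sketch of a modified model that adds the non-linearity and switches to Adam (the learning rate here is an assumption, not from the original answer) could look like this:
import torch

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear = torch.nn.Linear(10, 5)
        self.linear2 = torch.nn.Linear(5, 1)

    def forward(self, x):
        l_out1 = torch.nn.functional.relu(self.linear(x))  # non-linearity between the layers
        y_pred = self.linear2(l_out1)
        return y_pred

model = Model()
criterion = torch.nn.MSELoss()
optim = torch.optim.Adam(model.parameters(), lr=0.01)  # adaptive learning rate (lr is an assumption)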
I'm using a basic neural network in Theano/Lasagne to try to identify facial keypoints in images, and am currently trying to get it to learn a single image (I've just taken the first image from my training set). The images are 96x96 pixels, and there are 30 key points (outputs) that it needs to learn, but it fails to do so. This is my first attempt at using Theano/Lasagne, so I'm sure I've just missed something obvious, but I can't see what I've done wrong:
import sys
import os
import time
import numpy as np
import theano
import theano.tensor as T
import lasagne
import pickle
import matplotlib.pyplot as plt
def load_data():
    with open('FKD.pickle', 'rb') as f:
        save = pickle.load(f)
        trainDataset = save['trainDataset']  # (5000, 1, 96, 96) np.ndarray of pixel values [-1,1]
        trainLabels = save['trainLabels']    # (5000, 30) np.ndarray of target values [-1,1]
        del save  # Hint to help garbage collection free up memory
    # Overtrain on dataset of 1
    trainDataset = trainDataset[:1]
    trainLabels = trainLabels[:1]
    return trainDataset, trainLabels

def build_mlp(input_var=None):
    relu = lasagne.nonlinearities.rectify
    softmax = lasagne.nonlinearities.softmax
    network = lasagne.layers.InputLayer(shape=(None, 1, imageSize, imageSize), input_var=input_var)
    network = lasagne.layers.DenseLayer(network, num_units=numLabels, nonlinearity=softmax)
    return network

def main(num_epochs=500, minibatch_size=500):
    # Load the dataset
    print "Loading data..."
    X_train, y_train = load_data()
    # Prepare Theano variables for inputs and targets
    input_var = T.tensor4('inputs')
    target_var = T.matrix('targets')
    # Create neural network model
    network = build_mlp(input_var)
    # Create a loss expression for training, the mean squared error (MSE)
    prediction = lasagne.layers.get_output(network)
    loss = lasagne.objectives.squared_error(prediction, target_var)
    loss = loss.mean()
    # Create update expressions for training
    params = lasagne.layers.get_all_params(network, trainable=True)
    updates = lasagne.updates.nesterov_momentum(loss, params, learning_rate=0.01, momentum=0.9)
    # Compile a function performing a training step on a mini-batch
    train_fn = theano.function([input_var, target_var], loss, updates=updates)
    # Collect points for final plot
    train_err_plot = []
    # Finally, launch the training loop.
    print "Starting training..."
    # We iterate over epochs:
    for epoch in range(num_epochs):
        # In each epoch, we do a full pass over the training data:
        start_time = time.time()
        train_err = train_fn(X_train, y_train)
        # Then we print the results for this epoch:
        print "Epoch %s of %s took %.3fs" % (epoch+1, num_epochs, time.time()-start_time)
        print "  training loss:\t\t%s" % train_err
        # Save accuracy to show later
        train_err_plot.append(train_err)
    # Show plot
    plt.plot(train_err_plot)
    plt.title('Graph')
    plt.xlabel('Epochs')
    plt.ylabel('Training loss')
    plt.tight_layout()
    plt.show()
imageSize = 96
numLabels = 30
if __name__ == '__main__':
    main(minibatch_size=1)
This gives me a training-loss graph that never gets close to zero (the plot from the original post is not reproduced here).
I'm pretty sure this network should be able to get the loss down to basically zero. I'd appreciate any help or thoughts on the matter :)
EDIT: Removed dropout and hidden layer to simplify the problem.
It turns out that I'd forgotten to change the output node functions from:
lasagne.nonlinearities.softmax
to:
lasagne.nonlinearities.linear
The code I was using as a base was for a classification problem (e.g. working out which digit a picture shows), whereas I was using the network for a regression problem (e.g. trying to find where certain features in an image are located). There are several useful output functions for classification problems, of which softmax is one, but regression problems require a linear output function to work.
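For reference, the corrected build_mlp (using the same imageSize and numLabels globals as above, with the output nonlinearity changed to linear) looks like this:
def build_mlp(input_var=None):
    # Regression over the 30 keypoint values, so the output layer is linear.
    network = lasagne.layers.InputLayer(shape=(None, 1, imageSize, imageSize), input_var=input_var)
    network = lasagne.layers.DenseLayer(network, num_units=numLabels,
                                        nonlinearity=lasagne.nonlinearities.linear)
    return network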
Hope this helps someone else in the future :)