What is the purpose of with torch.no_grad()?

Consider the following code for Linear Regression implemented using PyTorch:
X is the input, Y is the target output for the training set, and w is the parameter that needs to be optimised:
import torch

X = torch.tensor([1, 2, 3, 4], dtype=torch.float32)
Y = torch.tensor([2, 4, 6, 8], dtype=torch.float32)

w = torch.tensor(0.0, dtype=torch.float32, requires_grad=True)

def forward(x):
    return w * x

def loss(y, y_pred):
    return ((y_pred - y)**2).mean()

print(f'Prediction before training: f(5) = {forward(5).item():.3f}')

learning_rate = 0.01
n_iters = 100

for epoch in range(n_iters):
    # predict = forward pass
    y_pred = forward(X)

    # loss
    l = loss(Y, y_pred)

    # calculate gradients = backward pass
    l.backward()

    # update weights
    # w.data = w.data - learning_rate * w.grad
    with torch.no_grad():
        w -= learning_rate * w.grad

    # zero the gradients after updating
    w.grad.zero_()

    if epoch % 10 == 0:
        print(f'epoch {epoch+1}: w = {w.item():.3f}, loss = {l.item():.8f}')
What does the 'with' block do? The requires_grad argument for w is already set to True. Why is the weight update then placed inside a with torch.no_grad() block?

There is no reason to track gradients when updating the weights; that is why you will find a @torch.no_grad() decorator on the step method in any implementation of an optimizer.
The with torch.no_grad() block means: execute these lines without keeping track of the gradients.

The requires_grad argument tells PyTorch that we want to be able to compute gradients with respect to w. However, with torch.no_grad() tells PyTorch not to record the operations inside the block in the computation graph. It is used here (as in most neural-network training loops) so that the weight update itself is not tracked; if the update were recorded, it would become part of the graph and corrupt the gradient computation in later backward passes.
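As a small illustration (a minimal sketch, not part of the original question or answers): an in-place update of a leaf tensor that requires grad raises a RuntimeError outside a no_grad() block, whereas inside the block the subtraction is a plain numerical update that autograd does not record.

import torch

w = torch.tensor(0.0, requires_grad=True)
loss = (w * 2 - 4) ** 2
loss.backward()

# w -= 0.01 * w.grad            # RuntimeError: a leaf Variable that requires grad
                                # is being used in an in-place operation
with torch.no_grad():
    w -= 0.01 * w.grad          # plain numerical update, not tracked by autograd

w.grad.zero_()                  # clear the gradient before the next backward pass
print(w)                        # tensor(0.1600, requires_grad=True)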

Related

The parameters of a model with a custom loss function don't update through learning over epochs

Thank you for reading my post.
I'm currently developing a peak-detection algorithm using a CNN, with the goal of learning the ideal convolution kernel (interpretable as an ideal mother wavelet) that maximizes peak-detection accuracy.
To begin with, I created my own IoU loss function and a simple model and tried to run training. The execution itself works without any errors, but the model parameters don't update through learning over the epochs.
My loss function is defined below.
def IoU(inputs: torch.Tensor, labels: torch.Tensor,
        smooth: float = 0.1, threshold: float = 0.5, alpha: float = 1.0):
    '''
    - alpha: a parameter that sharpens the thresholding.
      if alpha = 1 -> thresholded input is the same as raw input.
    '''
    thresholded_inputs = inputs**alpha / (inputs**alpha + (1 - inputs)**alpha)
    inputs = torch.where(thresholded_inputs < threshold, 0, 1)
    batch_size = inputs.shape[0]
    intersect_tensor = (inputs * labels).view(batch_size, -1)
    intersect = intersect_tensor.sum(-1)
    union_tensor = torch.max(inputs, labels).view(batch_size, -1)
    union = union_tensor.sum(-1)
    iou = (intersect + smooth) / (union + smooth)  # We smooth our division to avoid 0/0
    iou_score = iou.mean()
    return 1 - iou_score
and my training model is,
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 1, kernel_size=32, stride=1, padding=16),
            nn.Linear(257, 256),
            nn.LogSoftmax(1)
        )

    def forward(self, x):
        return self.net(x)

model = MLP()
opt = optim.Adadelta(model.parameters())

# initialization of the kernel of Conv1d
def init_kernel(m):
    if type(m) == nn.Conv1d:
        nn.init.kaiming_normal_(m.weight)
        print(m.weight)
        plt.plot(m.weight[0][0].detach().numpy())

model.apply(init_kernel)
def step(x, y, is_train=True):
    opt.zero_grad()
    y_pred = model(x)
    y_pred = y_pred.reshape(-1, 256)
    loss = IoU(y_pred, y)
    loss.requires_grad = True
    loss.retain_grad = True
    if is_train:
        loss.backward()
        opt.step()
    return loss, y_pred
and lastly, the execution code is,
from torch.autograd.grad_mode import F

train_loss_arr, val_loss_arr = [], []
valbose = 10
epochs = 200

for e in range(epochs):
    train_loss, val_loss, acc = 0., 0., 0.

    for x, y in train_set.as_numpy_iterator():
        x = torch.from_numpy(x)
        y = torch.from_numpy(y)
        model.train()
        loss, y_pred = step(x, y, is_train=True)
        train_loss += loss.item()
    train_loss /= len(train_set)

    for x, y in val_set.as_numpy_iterator():
        x = torch.from_numpy(x)
        y = torch.from_numpy(y)
        model.eval()
        with torch.no_grad():
            loss, y_pred = step(x, y, is_train=False)
        val_loss += loss.item()
    val_loss /= len(val_set)

    train_loss_arr.append(train_loss)
    val_loss_arr.append(val_loss)

    # visualize the current kernel to check whether learning is progressing safely.
    if e % valbose == 0:
        print(f"Epoch[{e}]({(e*100/epochs):0.2f}%): train_loss: {train_loss:0.4f}, val_loss: {val_loss:0.4f}")
        fig, axs = plt.subplots(1, 4, figsize=(12, 4))
        print(y_pred[0], y_pred[0].shape)
        axs[0].plot(x[0][0])
        axs[0].set_title("spectra")
        axs[1].plot(y_pred[0])
        axs[1].set_title("y pred")
        axs[2].plot(y[0])
        axs[2].set_title("y true")
        axs[3].plot(model.state_dict()["net.0.weight"][0][0].numpy())
        axs[3].set_title("kernel1")
        plt.show()
With these programs I tried to evaluate this simple model; however, the model parameters didn't change at all over the epochs.
Visualization of the results at epoch 0 and epoch 30 (figures: prediction and kernel at epoch 0; prediction and kernel at epoch 30).
As you can see, the kernel is not modified at all through learning over the epochs.
I have spent hours trying to figure out what causes this problem, but I'm still not sure how to make my loss function and model trainable.
Thank you.
Try printing the gradient after loss.backward() with:
print(y_pred.grad)
I suspect what you'll find is that, after a backward pass, the gradient of y_pred is zero. This means that either (a) gradient tracking is not enabled for one or more of the variables in the computation graph, or (b) (more likely) you are using an operation which is not differentiable.
In your case, at a minimum torch.where is non-differentiable, so you'll need to replace that. Thresholding operations are non-differentiable and are generally replaced with "soft" thresholding operations (see Softmax instead of max function for classification) so that gradient computation still works. Try replacing this with a soft threshold or no threshold at all.
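As an illustration only (a minimal sketch of what a "soft" threshold could look like; the helper names and the steepness parameter k are assumptions, not the asker's code), a steep sigmoid can stand in for the hard torch.where cut-off, and a soft union keeps the whole loss differentiable:

import torch

def soft_threshold(x: torch.Tensor, threshold: float = 0.5, k: float = 50.0) -> torch.Tensor:
    # Differentiable stand-in for torch.where(x < threshold, 0, 1):
    # a steep sigmoid centred on the threshold lets gradients flow.
    return torch.sigmoid(k * (x - threshold))

def soft_iou_loss(inputs: torch.Tensor, labels: torch.Tensor, smooth: float = 0.1) -> torch.Tensor:
    inputs = soft_threshold(inputs)          # soft mask instead of a hard 0/1 mask
    batch_size = inputs.shape[0]
    intersect = (inputs * labels).view(batch_size, -1).sum(-1)
    # probabilistic OR as a differentiable replacement for torch.max(inputs, labels)
    union = (inputs + labels - inputs * labels).view(batch_size, -1).sum(-1)
    iou = (intersect + smooth) / (union + smooth)
    return 1 - iou.mean()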

Why might a PyTorch optimizer fail to update its parameters?

I am trying to do a simple loss-minimization for a specific variable coeff using PyTorch optimizers. This variable is supposed to be used as an interpolation coefficient for two vectors w_foo and w_bar to find a third vector, w_target.
w_target = w_foo + coeff * (w_bar - w_foo)
With w_foo and w_bar set as constant, at each optimization step I calculate w_target for the given coeff. Loss is determined from w_target using a fairly complex process beyond the scope of this question.
# w_foo.shape = [1, 16, 512]
# w_bar.shape = [1, 16, 512]
# num_layers = 16
# num_steps = 10000

vgg_loss = VGGLoss()
coeff = torch.randn([num_layers, ])
optimizer = torch.optim.Adam([coeff], lr=initial_learning_rate)

for step in range(num_steps):
    w_target = w_foo + torch.matmul(coeff, (w_bar - w_foo))
    optimizer.zero_grad()
    target_image = generator.synthesis(w_target)
    processed_target_image = process(target_image)
    loss = vgg_loss(processed_target_image, source_image)
    loss.backward()
    optimizer.step()
However, when running this, coeff does not change from one step to the next, so the optimizer is essentially useless. I would like to ask for some advice on what I am doing wrong here.
Edit:
As suggested, I will try to elaborate on the loss function. Essentially, w_target is used to generate an image, and VGGLoss uses a VGG feature extractor to compare this synthetic image with a certain exemplar source image.
class VGGLoss(torch.nn.Module):
    def __init__(self, device, vgg):
        super().__init__()
        for param in self.parameters():
            param.requires_grad = True
        self.vgg = vgg  # VGG16 in eval mode

    def forward(self, source, target):
        loss = 0
        source_features = self.vgg(source, resize_images=False, return_lpips=True)
        target_features = self.vgg(target, resize_images=False, return_lpips=True)
        loss += (source_features - target_features).square().sum()
        return loss

PyTorch matrix factorization with fixed item matrix

I estimate ratings in a user-item matrix by decomposing the matrix into two matrices, P and Q, using PyTorch matrix factorization, with the loss function L(X - PQ).
Let's say the rows of X correspond to users, and x is a new user's row, so that the new matrix X' is X concatenated with x.
Now I want to minimize L(X' - P'Q) = L(X - PQ) + L(x - x_p Q), since I have already trained P and Q.
I want to train x_p, the new user's row, but leave Q fixed.
So my question is: is there a way in PyTorch to train the matrix factorization model for P while keeping Q fixed?
Code I'm working with:
class MatrixFactorizationWithBiasXavier(nn.Module):
    def __init__(self, num_people, num_partners, bias=(-0.01, 0.01), emb_size=100):
        super(MatrixFactorizationWithBiasXavier, self).__init__()
        self.person_emb = nn.Embedding(num_people, emb_size)
        self.person_bias = nn.Embedding(num_people, 1)
        self.partner_emb = nn.Embedding(num_partners, emb_size)
        self.parnter_bias = nn.Embedding(num_partners, 1)
        torch.nn.init.xavier_uniform_(self.person_emb.weight)
        torch.nn.init.xavier_uniform_(self.partner_emb.weight)
        self.person_bias.weight.data.uniform_(bias[0], bias[1])
        self.parnter_bias.weight.data.uniform_(bias[0], bias[1])

    def forward(self, u, v):
        u = self.person_emb(u)
        v = self.partner_emb(v)
        bias_u = self.person_bias(u).squeeze()
        bias_v = self.parnter_bias(v).squeeze()
        # calculate dot product
        # u*v is an element-wise vector multiplication
        return torch.sigmoid((u*v).sum(1) + bias_u + bias_v)

def test(model, df_test, verbose=False):
    model.eval()
    # .to(dev) puts the tensors on either gpu or cpu.
    people = torch.LongTensor(df_test.id.values).to(dev)
    partners = torch.LongTensor(df_test.pid.values).to(dev)
    decision = torch.FloatTensor(df_test.decision.values).to(dev)
    y_hat = model(people, partners)
    loss = F.mse_loss(y_hat, decision)
    if verbose:
        print('test loss %.3f ' % loss.item())
    return loss.item()

def train(model, df_train, epochs=100, learning_rate=0.01, weight_decay=1e-5, verbose=False):
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
    model.train()
    for epoch in range(epochs):
        # From numpy to PyTorch tensors.
        # .to(dev) puts the tensors on either gpu or cpu.
        people = torch.LongTensor(df_train.id.values).to(dev)
        partners = torch.LongTensor(df_train.pid.values).to(dev)
        decision = torch.FloatTensor(df_train.decision.values).to(dev)
        # calls the forward method of the model
        y_hat = model(people, partners)
        # Using the mean squared error loss function
        loss = F.mse_loss(y_hat, decision)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if verbose and epoch % 100 == 0:
            print(loss.item())
Found a solution.
It turns out I can register a hook on the embedding I want to stay fixed (partner_emb, which is my Q) so that its gradient is zeroed out:
mask = torch.zeros_like(mf_model.partner_emb.weight)
mf_model.partner_emb.weight.register_hook(lambda grad: grad*mask)
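A minimal sketch of a simpler alternative (assuming the same mf_model as above; this is an illustration, not part of the original answer): freeze Q's weight directly and hand the optimizer only the parameters that still require gradients.

# Freeze Q (the partner embedding) so it is never updated.
mf_model.partner_emb.weight.requires_grad_(False)

# Optimize only the parameters that still require gradients.
optimizer = torch.optim.Adam(
    (p for p in mf_model.parameters() if p.requires_grad),
    lr=0.01,
    weight_decay=1e-5,
)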

How to update weights with TensorFlow Eager Execution?

So I tried TensorFlow's eager execution and my implementation of it wasn't successful. I used GradientTape, and while the program runs, there is no visible update in any of the weights. I've seen some sample algorithms and tutorials using optimizer.apply_gradients() in order to update all variables, but I'm assuming I'm not using it properly.
import tensorflow as tf
import tensorflow.contrib.eager as tfe

# enabling eager execution
tf.enable_eager_execution()

# establishing hyperparameters
LEARNING_RATE = 20
TRAINING_ITERATIONS = 3

# establishing all LABELS
LABELS = tf.constant(tf.random_normal([3, 1]))
# print(LABELS)

# stub statement for input
init = tf.Variable(tf.random_normal([3, 1]))

# declare and initialize all weights
weight1 = tfe.Variable(tf.random_normal([2, 3]))
bias1 = tfe.Variable(tf.random_normal([2, 1]))
weight2 = tfe.Variable(tf.random_normal([3, 2]))
bias2 = tfe.Variable(tf.random_normal([3, 1]))
weight3 = tfe.Variable(tf.random_normal([2, 3]))
bias3 = tfe.Variable(tf.random_normal([2, 1]))
weight4 = tfe.Variable(tf.random_normal([3, 2]))
bias4 = tfe.Variable(tf.random_normal([3, 1]))
weight5 = tfe.Variable(tf.random_normal([3, 3]))
bias5 = tfe.Variable(tf.random_normal([3, 1]))

VARIABLES = [weight1, bias1, weight2, bias2, weight3, bias3, weight4, bias4, weight5, bias5]

def thanouseEyes(input):  # nn model aka: Thanouse's Eyes
    layerResult = tf.nn.relu(tf.matmul(weight1, input) + bias1)
    input = layerResult
    layerResult = tf.nn.relu(tf.matmul(weight2, input) + bias2)
    input = layerResult
    layerResult = tf.nn.relu(tf.matmul(weight3, input) + bias3)
    input = layerResult
    layerResult = tf.nn.relu(tf.matmul(weight4, input) + bias4)
    input = layerResult
    layerResult = tf.nn.softmax(tf.matmul(weight5, input) + bias5)
    return layerResult

# Begin training and update variables
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)

with tf.GradientTape(persistent=True) as tape:  # gradient calculation
    for i in range(TRAINING_ITERATIONS):
        COST = tf.reduce_sum(LABELS - thanouseEyes(init))
        GRADIENTS = tape.gradient(COST, VARIABLES)
        optimizer.apply_gradients(zip(GRADIENTS, VARIABLES))
        print(weight1)
The usage of the optimizer seems fine; however, the computation defined by thanouseEyes() will always return [1., 1., 1.] irrespective of the variables, so the gradients are always 0 and the variables will never be updated (print(thanouseEyes(init)) and print(GRADIENTS) should demonstrate that).
Digging in a bit more: tf.nn.softmax is applied to x = tf.matmul(weight5, input) + bias5, which has shape [3, 1]. So tf.nn.softmax(x) is effectively computing [softmax(x[0]), softmax(x[1]), softmax(x[2])], as tf.nn.softmax applies (by default) to the last axis of its input. x[0], x[1], and x[2] are vectors with a single element, so softmax(x[i]) will always be 1.0.
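A quick check illustrating this (a minimal sketch, not part of the original answer):

import tensorflow as tf
tf.enable_eager_execution()

x = tf.constant([[2.0], [-1.0], [0.5]])  # shape [3, 1]
print(tf.nn.softmax(x))                  # softmax over the last axis -> [[1.], [1.], [1.]]
print(tf.nn.softmax(x, axis=0))          # softmax over the first axis gives a real distribution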
Hope that helps.
Some additional points unrelated to your question that you may be interested in:
As of TensorFlow 1.11, you don't need the tf.contrib.eager module in your program. Replace all occurrences of tfe with tf (i.e., tf.Variable instead of tfe.Variable) and you'll get the same result.
Computation performed inside the context of a GradientTape is "recorded", i.e., it holds on to intermediate tensors so that gradients can be computed later on. Long story short, you'd want to move the GradientTape inside the loop body:
for i in range(TRAINING_ITERATIONS):
    with tf.GradientTape() as tape:
        COST = tf.reduce_sum(LABELS - thanouseEyes(init))
    GRADIENTS = tape.gradient(COST, VARIABLES)
    optimizer.apply_gradients(zip(GRADIENTS, VARIABLES))

How to implement a multivariate linear stochastic gradient descent algorithm in TensorFlow?

I started with a simple implementation of single-variable linear gradient descent, but I don't know how to extend it to a multivariate stochastic gradient descent algorithm.
Single variable linear regression
import tensorflow as tf
import numpy as np

# create random data
x_data = np.random.rand(100).astype(np.float32)
y_data = x_data * 0.5

# Find values for W that compute y_data = W * x_data
W = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
y = W * x_data

# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(y - y_data))
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

# Before starting, initialize the variables
init = tf.initialize_all_variables()

# Launch the graph.
sess = tf.Session()
sess.run(init)

# Fit the line.
for step in xrange(2001):
    sess.run(train)
    if step % 200 == 0:
        print(step, sess.run(W))
There are two parts to your question:
How to change this problem to a higher-dimensional space.
How to change from batch gradient descent to stochastic gradient descent.
To get a higher-dimensional setting, you can define your linear problem as y = <x, w>. Then you just need to change the dimensions of your Variable W to match those of w and replace the multiplication W*x_data with the matrix product tf.matmul(x_data, W), and your code should run just fine.
To change the learning method to stochastic gradient descent, you need to abstract the input of your cost function by using tf.placeholder.
Once you have defined X and y_ to hold your input at each step, you can construct the same cost function. Then you need to run your training step while feeding the proper mini-batch of your data.
Here is an example of how you could implement such behaviour; it should show that W quickly converges to w.
import tensorflow as tf
import numpy as np

# Define dimensions
d = 10    # Size of the parameter space
N = 1000  # Number of data samples

# create random data
w = .5*np.ones(d)
x_data = np.random.random((N, d)).astype(np.float32)
y_data = x_data.dot(w).reshape((-1, 1))

# Define placeholders to feed mini_batches
X = tf.placeholder(tf.float32, shape=[None, d], name='X')
y_ = tf.placeholder(tf.float32, shape=[None, 1], name='y')

# Find values for W that compute y_data = <x, W>
W = tf.Variable(tf.random_uniform([d, 1], -1.0, 1.0))
y = tf.matmul(X, W, name='y_pred')

# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(y_ - y))
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

# Before starting, initialize the variables
init = tf.initialize_all_variables()

# Launch the graph.
sess = tf.Session()
sess.run(init)

# Fit the line.
mini_batch_size = 100
n_batch = N // mini_batch_size + (N % mini_batch_size != 0)
for step in range(2001):
    i_batch = (step % n_batch)*mini_batch_size
    batch = x_data[i_batch:i_batch+mini_batch_size], y_data[i_batch:i_batch+mini_batch_size]
    sess.run(train, feed_dict={X: batch[0], y_: batch[1]})
    if step % 200 == 0:
        print(step, sess.run(W))
Two side notes:
The implementation above is called mini-batch gradient descent, as at each step the gradient is computed using a subset of the data of size mini_batch_size. This is a variant of stochastic gradient descent that is usually used to stabilise the gradient estimate at each step. Plain stochastic gradient descent can be obtained by setting mini_batch_size = 1.
The dataset can be shuffled at every epoch to get an implementation closer to the theoretical setting, as in the sketch below. Some recent work also considers using only one pass through the dataset, as it prevents over-fitting. For a more mathematical and detailed explanation, see Bottou12. This can easily be changed according to your problem setup and the statistical properties you are looking for.
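A minimal sketch of per-epoch shuffling, reusing the setup of the example above (the epoch count n_epochs is just an illustration, not part of the original answer):

n_epochs = 200
for epoch in range(n_epochs):
    # Reshuffle the data at the start of each epoch.
    perm = np.random.permutation(N)
    x_shuffled, y_shuffled = x_data[perm], y_data[perm]
    for b in range(n_batch):
        start = b * mini_batch_size
        xb = x_shuffled[start:start + mini_batch_size]
        yb = y_shuffled[start:start + mini_batch_size]
        sess.run(train, feed_dict={X: xb, y_: yb})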
