PyTorch gradients are not calculated - python

I am building a neural network and have a problem with computing gradients. I take the scalar product of two tensors, u @ h, and normalize them. It is important that no gradients are calculated for h, so I use detach(). In addition, normalization should not be taken into account when the gradients are computed (I do not know how to do this).
import torch
from torch import nn
class Nn(nn.Module):
    def __init__(self):
        super(Nn, self).__init__()
        self.ln = nn.Linear(5, 5)
    def forward(self, x):
        v = self.ln(x)
        u = v.clone()
        h = v.clone()
        u /= u.norm()
        h = h.detach()
        h /= h.norm()
        res = torch.stack([torch.stack([u @ h, u @ h])])
        return res
def patches_generator():
    while True:
        decoder = torch.rand((5, ))
        target = torch.randint(2, (1,))
        yield decoder, target
net = Nn()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters())
for decoder, targets in patches_generator():
    optimizer.zero_grad()
    outputs = net(decoder)
    loss = criterion(outputs, targets)
    loss.backward()  # the RuntimeError is raised during the backward pass
    optimizer.step()
As a result, I get the following error:
RuntimeError: one of the variables needed for gradient computation has
been modified by an inplace operation: [torch.FloatTensor [9, 512, 1,
1]], which is output 0 of ReluBackward1, is at version 3; expected
version 2 instead. Hint: the backtrace further above shows the
operation that failed to compute its gradient. The variable in
question was changed in there or anywhere later. Good luck!

The problem is the in-place division operator applied to u in this line:
u /= u.norm()
changing it to
u = u / u.norm()
makes the code run. The reason is that the in-place operator overwrites the intermediate result from this line
u = v.clone()
which makes it impossible for PyTorch to compute the gradient.
(The error message in the question refers to a ReluBackward1 node that does not appear in the reduced code example. PyTorch ReLU layers have an optional inplace argument that performs the operation in place while still supporting backprop. This often works, because in a sequential network the gradient can be computed without distinguishing between the output of the ReLU activation and the output of the preceding weights, but in more complex architectures it may be necessary to retain the output of the weights.)
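To see the difference between the two forms side by side, here is a minimal sketch that runs the same computation once with the in-place division and once with the out-of-place one. The helper name forward and the flag inplace are made up for this illustration; the layer size follows the question.

```python
import torch
from torch import nn

torch.manual_seed(0)
ln = nn.Linear(5, 5)
x = torch.rand(5)


def forward(inplace: bool) -> torch.Tensor:
    """Dot product of the normalized output with a detached, normalized copy."""
    v = ln(x)
    u = v.clone()
    h = v.detach().clone()   # h carries no gradient
    if inplace:
        u /= u.norm()        # overwrites the tensor that norm() saved for backward
    else:
        u = u / u.norm()     # allocates a new tensor, the saved one stays intact
    h = h / h.norm()
    return u @ h


# In-place variant: backward() is expected to raise the RuntimeError from the question.
try:
    forward(inplace=True).backward()
    inplace_failed = False
except RuntimeError:
    inplace_failed = True

# Out-of-place variant: backward() completes and populates the gradients.
ln.zero_grad()
forward(inplace=False).backward()
print(inplace_failed, ln.weight.grad is not None)
```

The in-place division on the detached h is harmless for autograd, since nothing needs to backpropagate through h.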


Apply gradient to a tensor in a sparse way in PyTorch

I have a very large tensor L (millions of elements), from which I gather a relatively small subtensor S (maybe a thousand elements).
I then apply my model to S, compute the loss, and backpropagate to S and to L with the intent to update only the selected elements of L. The problem is that PyTorch makes L's gradient a dense tensor of the same size as L, so it basically doubles L's memory usage.
Is there an easy way to compute and apply the gradient to L without doubling memory usage?
Sample code to illustrate the problem:
import torch
import torch.nn as nn
from torch.nn.parameter import Parameter
net = nn.Sequential(
    nn.Linear(1, 64),
    nn.Linear(64, 1))
L = Parameter(torch.zeros([1024*1024*256], dtype=torch.float32))
indices = torch.randint(high=256*1024*1024, size=[1024])
S = torch.unsqueeze(L[indices], dim=1)
out = net(S)
loss = out.sum()
loss.backward()
g = L.grad
print(g.shape) # this is huge!
You don't actually need requires_grad on L, as the gradients will be computed and applied manually. Instead, set it on S. That will stop backpropagation at S.
Then you can update the values of L using S.grad and your preferred optimization. Something along these lines:
L = torch.zeros([1024*1024*256], dtype=torch.float32)
S = torch.unsqueeze(L[indices], dim=1)
S.requires_grad_()
out = net(S)
loss = torch.abs(out).sum()
loss.backward()
with torch.no_grad():
    L[indices] -= learning_rate * torch.squeeze(S.grad)
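A scaled-down, runnable sketch of the same idea follows. The sizes and the learning rate are illustrative assumptions, and index_add_ is swapped in for the indexed subtraction because random indices may repeat, in which case an indexed in-place subtraction would not accumulate the repeated contributions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 64), nn.Linear(64, 1))

n_total, n_selected, learning_rate = 10_000, 16, 0.1  # illustrative sizes
L = torch.zeros(n_total)                               # plain tensor, no requires_grad
indices = torch.randint(high=n_total, size=(n_selected,))

# Gather the slice and make it the leaf tensor that autograd tracks.
S = L[indices].unsqueeze(1).requires_grad_(True)
loss = net(S).sum()
loss.backward()                                        # gradient lands only in S.grad

with torch.no_grad():
    # index_add_ accumulates correctly even if `indices` contains duplicates.
    L.index_add_(0, indices, -learning_rate * S.grad.squeeze(1))

# Everything outside the gathered positions should be untouched.
untouched = torch.ones(n_total, dtype=torch.bool)
untouched[indices] = False
print(S.grad.shape, L.grad is None, bool((L[untouched] == 0).all()))
```

The only gradient buffer allocated here has the shape of S, so the extra memory scales with the number of selected elements and not with the size of L.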

Runtime Error : both arguments to matmul need to be at least 1d but they are 0d and 2d

This is the code I have written. I have tried modifying it here and there and always get the same error. As I am a beginner with PyTorch, I am just trying things out to see whether machine learning will work on a linear dataset. I generated a dataset using random values, defined a network with a single linear layer, and then trained the network on the linear data. However, I get an error stating that matmul expects arguments that are at least 1d, while my data, which should be at least 1d, is read as 0d.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random
class LinearNetwork(nn.Module):
    def __init__(self):
        super(LinearNetwork, self).__init__()
        self.fc1 = nn.Linear(1, 1)
    def forward(self, input):
        x = self.fc1(input)
        return x
#Generate a Dataset to train on. Just a linear function, and we wanna see if this thing is working
DataSet = []
grad = random.randint(0,100)
const = random.randint(0,100)
for i in range(0, 1000):
    DataSet.append([i, grad*i + const])
DataSet = torch.tensor(DataSet,dtype = torch.double, requires_grad = True)
#Declare a Linear Network
Net = LinearNetwork()
criterion = nn.MSELoss()
optimizer = optim.SGD(Net.parameters(), lr = 0.01)
#Train Model
for i, data in enumerate(DataSet, 0):
    Input, Target = data
    Output = Net(Input)
    loss = criterion(Output, Target)
print("Gradient of the Function is : " + str(grad))
print("Constant Value of the Function is : " + str(const))
print("Learned Gradient and Constant : " + str(list(Net.parameters())))
There are two errors in your code: the first concerns shapes, the second dtypes. By the way, please use snake_case for variables (e.g. my_dataset, net) and CamelCase for classes, as that is the common Python convention.
Shape error
This one lies here:
for i, data in enumerate(DataSet, 0):
    input, target = data
    output = net(input)
    loss = criterion(output, target)
When you print input.shape you get torch.Size([]), which is a 0d tensor. Matrix multiplication needs tensors that are at least 1d, so you should unsqueeze the input to give it that dimension. Change output = net(input) above to:
output = net(input.unsqueeze(dim=0))
dtype error
DataSet = torch.tensor(DataSet,dtype = torch.double, requires_grad = True)
You create a tensor of type torch.double while your neural network is (by default) of type torch.float. The latter is usually used because it offers enough numerical precision and saves memory. So instead of the above you should do:
DataSet = torch.tensor(DataSet,dtype = torch.float, requires_grad = True)
Shape again
Coming back to shapes, neural networks should operate on at least two dimensions: (batch, features). In your case batch=1 and features=1, so the input should be unsqueezed once more (and I advise you to try your code using batches).
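Putting the shape and dtype advice together, a batched version could look roughly like the sketch below. The slope, intercept, batch size, learning rate and epoch count are illustrative assumptions, and the inputs are scaled to [0, 1] here (unlike the 0..999 range in the question) so that plain SGD stays stable.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
slope, intercept = 3.0, 7.0                      # illustrative values

# Inputs and targets as float32 tensors of shape (N, 1): N samples, 1 feature.
xs = torch.linspace(0.0, 1.0, steps=1000).unsqueeze(1)
ys = slope * xs + intercept

net = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.1)

batch_size = 50
for epoch in range(200):
    for start in range(0, len(xs), batch_size):
        inputs = xs[start:start + batch_size]    # shape (batch, 1)
        targets = ys[start:start + batch_size]   # shape (batch, 1)
        optimizer.zero_grad()
        loss = criterion(net(inputs), targets)
        loss.backward()
        optimizer.step()

# Learned slope and intercept, to compare against the values chosen above.
print(net.weight.item(), net.bias.item())
```

Because every tensor already has the (batch, features) layout and the default float32 dtype, no unsqueeze or cast is needed inside the loop.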

'None' gradient is being returned for variables

I am currently using TensorFlow version 1.14.
In the code below, I am trying to create a dummy model that takes two inputs and provides two outputs, with all weights set to ones and biases to zeros (a single-layer perceptron). I am defining a custom loss function that computes the Jacobian of the output layer with respect to the input layer.
# Prior function
def f_i(x):
    x1 = np.arctanh(x)
    return np.exp(-x1**2)
B = np.random.choice(x, (10000,2), p = f_i(x)/np.sum(f_i(x)))
def my_loss(y_pred, y_true):
    jacobian_tf = jacobian_tensorflow3(sim.output, sim.input)
    loss = tf.abs(tf.linalg.det(jacobian_tf))
    return K.mean(loss)
def jacobian_tensorflow3(x,y, verbose=False):
    jacobian_matrix = []
    it = tqdm(range(ndim)) if verbose else range(ndim)
    for o in it:
        grad_func = tf.gradients(x[:,o], y)
        jacobian_matrix.append(grad_func[0])
    jacobian_matrix = tf.stack(jacobian_matrix)
    jacobian_matrix1 = tf.transpose(jacobian_matrix, perm=[1,0,2])
    return jacobian_matrix1
sim = Sequential()
sim.add(Dense(2, kernel_initializer='ones', bias_initializer='zeros', activation='linear', input_dim=2))
sim.compile(optimizer='adam', loss=my_loss)
sim.fit(B, np.random.random(B.shape), batch_size=100, epochs=2)
This model gives the Jacobian matrix and compiles without issues, but when I run it I get the following error:
ValueError: Variable <tf.Variable 'dense_14/bias:0' shape=(2,) dtype=float32> has `None` for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.
I have been stuck at this step for a long time and am not able to proceed. Any help/suggestions would be beneficial.
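One likely contributor to this error, offered as a sketch and not as a confirmed diagnosis: for a single linear Dense layer y = xW + b, the Jacobian of the output with respect to the input is the transpose of W and does not depend on b. A loss built only from that Jacobian therefore has no path to the bias, which would match the `None` gradient reported for dense_14/bias. The NumPy check below uses finite differences; the helper names dense and numerical_jacobian and the example values are made up for the illustration.

```python
import numpy as np


def dense(x, W, b):
    """Linear layer y = x @ W + b for a single input vector x."""
    return x @ W + b


def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian with J[o, i] = d f(x)[o] / d x[i]."""
    x = np.asarray(x, dtype=float)
    cols = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        cols.append((f(x + step) - f(x - step)) / (2 * eps))
    return np.stack(cols, axis=1)


W = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([0.3, -0.7])

# Same weights, two different biases: the Jacobian should not change.
J_zero_bias = numerical_jacobian(lambda v: dense(v, W, np.zeros(2)), x)
J_other_bias = numerical_jacobian(lambda v: dense(v, W, np.array([5.0, -9.0])), x)

print(np.allclose(J_zero_bias, W.T), np.allclose(J_zero_bias, J_other_bias))
```

If this is indeed the cause, the loss would need a term that depends on the layer's actual output (for example through y_pred) before the bias can receive a gradient.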

Not fully connected layer in tensorflow

I want to create a network where in the input layer nodes are just connected to some nodes in the next layer. Here is a small example:
My solution so far is that I set the weight of the edge between i1 and h1 to zero and after every optimization step I multiply the weights with a matrix (I call this matrix mask matrix) in which every entry is 1 except the entry of the weight of the edge between i1 and h1.
(See code below)
Is this approach right? Or does it have an effect on the gradient descent? Is there another approach to create this kind of network in TensorFlow?
import tensorflow as tf
import tensorflow.contrib.eager as tfe
import numpy as np
model = tf.keras.Sequential([
    tf.keras.layers.Dense(2, activation=tf.sigmoid, input_shape=(2,)), # input shape required
    tf.keras.layers.Dense(2, activation=tf.sigmoid)
])
#set the weights
weights=[np.array([[0, 0.25],[0.2,0.3]]),np.array([0.35,0.35]),np.array([[0.4,0.5],[0.45, 0.55]]),np.array([0.6,0.6])]
model.set_weights(weights)
features = tf.convert_to_tensor([[0.05,0.10 ]])
labels = tf.convert_to_tensor([[0.01,0.99 ]])
mask =np.array([[0, 1],[1,1]])
#define the loss function
def loss(model, x, y):
    y_ = model(x)
    return tf.losses.mean_squared_error(labels=y, predictions=y_)
#define the gradient calculation
def grad(model, inputs, targets):
    with tf.GradientTape() as tape:
        loss_value = loss(model, inputs, targets)
    return loss_value, tape.gradient(loss_value, model.trainable_variables)
#create optimizer and global step
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
global_step = tf.train.get_or_create_global_step()
#optimization step
loss_value, grads = grad(model, features, labels)
optimizer.apply_gradients(zip(grads, model.variables),global_step)
#masking the optimized weights
If you are looking for a solution for the specific example you provided, you can simply use tf.keras Functional API and define two Dense layers where one is connected to both neurons in the previous layer and the other one is only connected to one of the neurons:
from tensorflow.keras.layers import Input, Lambda, Dense, concatenate
from tensorflow.keras.models import Model
inp = Input(shape=(2,))
inp2 = Lambda(lambda x: x[:,1:2])(inp) # get the second neuron
h1_out = Dense(1, activation='sigmoid')(inp2) # only connected to the second neuron
h2_out = Dense(1, activation='sigmoid')(inp) # connected to both neurons
h_out = concatenate([h1_out, h2_out])
out = Dense(2, activation='sigmoid')(h_out)
model = Model(inp, out)
# simply train it using `fit`
The problem with your solution, and with some others suggested in other answers to this post, is that they do not prevent training of this weight. They allow gradient descent to train the non-existent weight and then overwrite it retrospectively. This results in a network that has a zero in this location as desired, but it negatively affects the training process: the backpropagation calculation does not see the masking step, because it is not part of the TensorFlow graph, so gradient descent follows a path that assumes this weight has an effect on the outcome (it does not).
A better solution is to include the masking step as part of your TensorFlow graph, so that it is factored into the gradient descent. Since the masking step is simply an element-wise multiplication by your sparse, binary mask matrix, you can include it in the graph definition using tf.multiply.
Sadly, this means saying goodbye to the user-friendly keras.layers methods and embracing a more nuts-and-bolts approach to TensorFlow. I can't see an obvious way to do it using the layers API.
See the implementation below; I have tried to provide comments explaining what is happening at each stage.
import tensorflow as tf
## Graph definition for model
# set up tf.placeholders for inputs x, and outputs y_
# these remain fixed during training and can have values fed to them during the session
with tf.name_scope("Placeholders"):
    x = tf.placeholder(tf.float32, shape=[None, 2], name="x") # input layer
    y_ = tf.placeholder(tf.float32, shape=[None, 2], name="y_") # output layer
# set up tf.Variables for the weights at each layer from l1 to l3, and set up feeding of initial values
# also set up mask as a variable and set it to be untrainable
with tf.name_scope("Variables"):
    w_l1_values = [[0, 0.25],[0.2,0.3]]
    w_l1 = tf.Variable(w_l1_values, name="w_l1")
    w_l2_values = [[0.4,0.5],[0.45, 0.55]]
    w_l2 = tf.Variable(w_l2_values, name="w_l2")
    mask_values = [[0., 1.], [1., 1.]]
    mask = tf.Variable(mask_values, trainable=False, name="mask")
# link each set of weights as matrix multiplications in the graph. Include an elementwise multiplication by mask.
# Sequence takes us from inputs x to output final_out, which will be compared to labels fed to placeholder y_
l1_out = tf.nn.relu(tf.matmul(x, tf.multiply(w_l1, mask)), name="l1_out")
final_out = tf.nn.relu(tf.matmul(l1_out, w_l2), name="output")
## define loss function and training operation
with tf.name_scope("Loss"):
    # some loss defined as a function of graph output: final_out and labels: y_
    loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=final_out, labels=y_, name="loss")
with tf.name_scope("Train"):
    # some optimisation strategy, arbitrary learning rate
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001, name="optimizer_adam")
    train_op = optimizer.minimize(loss, name="train_op")
# create session, initialise variables and train according to inputs and corresponding labels
# This should show that the values of the first layer weights change, but the one set to 0 remains at 0
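with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    initial_l1_weights = sess.graph.get_tensor_by_name("Variables/w_l1:0")
    inputs = [[0.05, 0.10]]
    labels = [[0.01, 0.99]]
    feed = {"Placeholders/x:0": inputs, "Placeholders/y_:0": labels}
    # forward pass (fetching final_out here is an assumption)
    ans = sess.run(final_out, feed_dict=feed)
    train_steps = 1
    for i in range(train_steps):
        # one training step, then print the first-layer weights (assumed loop body)
        sess.run(train_op, feed_dict=feed)
        initial_l1_weights = sess.graph.get_tensor_by_name("Variables/w_l1:0")
        print(sess.run(initial_l1_weights))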
Or use the answer provided by today for a keras friendly option.
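The effect of putting the mask inside the forward computation can be checked in a few lines. This is a minimal sketch that assumes TensorFlow 2.x with eager execution (unlike the session-based code above); it reuses the first-layer weights, mask, features and labels from the question, and a plain squared error stands in for the loss.

```python
import tensorflow as tf

w = tf.Variable([[0.0, 0.25], [0.2, 0.3]])       # first-layer weights from the question
mask = tf.constant([[0.0, 1.0], [1.0, 1.0]])     # 0 marks the missing connection
x = tf.constant([[0.05, 0.10]])
y = tf.constant([[0.01, 0.99]])

with tf.GradientTape() as tape:
    # The mask is applied inside the forward pass, so autodiff sees it.
    hidden = tf.sigmoid(tf.matmul(x, w * mask))
    loss = tf.reduce_mean(tf.square(hidden - y))

grad = tape.gradient(loss, w)
print(grad.numpy())   # the entry at [0, 0] is zero because mask[0, 0] is zero
```

Since the gradient of the masked entry is exactly zero, a plain gradient descent step leaves that weight where it started; optimizers with weight decay or similar extra terms could still move it, so this check covers only the gradient itself.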
You have multiple options here.
First, you could use the dynamic masking approach in your example. I believe this will work as expected since the gradients w.r.t. the masked-out parameters will be zero (the output is constant when you change the unused parameters). This approach is simple and it can be used even when your mask is not constant during the training.
Second, if you know beforehand which weights will be always zero, you can compose your weight matrix using tf.get_variable to get a submatrix, and then concatenate it with a tf.constant tensor, e.g.:
weights_sub = tf.get_variable("w", [dim_in, dim_out - 1])
zeros = tf.zeros([dim_in, 1])
weights = tf.concat([weights_sub, zeros], axis=1)
This example makes one column of your weight matrix always zero.
Finally, if your mask is more complex, you can use tf.get_variable on a flattened vector and then compose a tf.SparseTensor with the variable values on the used indices:
weights_used = tf.get_variable("w", [num_used_vars])
indices = ... # get your indices in a 2-D matrix of shape [num_used_vars, 2]
dense_shape = tf.constant([dim_in, dim_out], dtype=tf.int64) # this is the final shape of the weight matrix
weights = tf.SparseTensor(indices, weights_used, dense_shape)
EDIT: This probably won't work in combination with Keras' set_weights method, as it expects Numpy arrays, not Tensors.

TensorFlow pass gradient unchanged

Say I have some custom operation binarizer used in a neural network. The operation takes a Tensor and constructs a new Tensor. I would like to modify that operation such that it is only used in the forward pass. In the backward pass, when gradients are calculated, it should just pass through the gradients reaching it.
More concretely, say binarizer is:
def binarizer(input):
    prob = tf.truediv(tf.add(1.0, input), 2.0)
    bernoulli = tf.contrib.distributions.Bernoulli(p=prob, dtype=tf.float32)
    return 2 * bernoulli.sample() - 1
and I set up my network:
# ...
h1_before_b = tf.nn.tanh(tf.matmul(x, W) + bias_h1)
h1 = binarizer(h1_before_b)
# ...
loss = tf.reduce_mean(tf.square(y - y_true))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
How do I tell TensorFlow to skip gradient calculation in the backward pass?
I tried defining a custom operation as described in this answer, however: py_func cannot return Tensors, that's not what it is made for – I get:
UnimplementedError (see above for traceback): Unsupported object type Tensor
You're looking for tf.stop_gradient(input, name=None):
Stops gradient computation.
When executed in a graph, this op outputs its input tensor as-is.
h1 = binarizer(h1_before_b)
h1 = tf.stop_gradient(h1)
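Note that wrapping the output in tf.stop_gradient on its own blocks the gradient at that point, so the layers before the op receive nothing from it. If the intent is to let the incoming gradient pass through unchanged, one common idiom (often called a straight-through estimator) is to add the stopped difference back onto the input. The sketch below assumes TensorFlow 2.x eager execution; hard_sign is a deterministic stand-in for the question's binarizer, and both helper names are made up for the illustration.

```python
import tensorflow as tf


def hard_sign(x):
    """Deterministic stand-in for the question's binarizer (hypothetical helper)."""
    return tf.where(x >= 0.0, tf.ones_like(x), -tf.ones_like(x))


def straight_through(x, op):
    """Forward value is op(x); in the backward pass the op behaves like identity."""
    return x + tf.stop_gradient(op(x) - x)


x = tf.Variable([-0.8, 0.1, 0.6])
with tf.GradientTape() as tape:
    h = straight_through(tf.tanh(x), hard_sign)   # binarized activations
    loss = tf.reduce_sum(3.0 * h)

# The gradient matches 3 * (1 - tanh(x)^2), as if hard_sign were not there.
grad = tape.gradient(loss, x)
print(h.numpy(), grad.numpy())
```

In the question's network this would correspond to h1 = h1_before_b + tf.stop_gradient(binarizer(h1_before_b) - h1_before_b).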
