I would like to implement a stopping condition based on the value of the gradient of the loss function w.r.t. the weights.
For example, let's say I have something like this:
optimizer = tf.train.AdamOptimizer()
grads_and_vars = optimizer.compute_gradients(a_loss_function)
train_op = optimizer.apply_gradients(grads_and_vars)
Then I would like to run the graph with something like this:
for step in range(TotSteps):
    output = sess.run([input], feed_dict=some_dict)
    if grad_taken_in_some_way < some_threshold:
        print("Training finished.")
        break
I am not sure what I should pass to sess.run() in order to also get the gradient as output (besides all the other things I need). I am not even sure whether this is the correct approach or whether I should do it differently. I made some attempts but failed every time. I hope someone has some hints.
Thank you in advance!
EDIT: English correction
EDIT2: The answer by Iballes is exactly what I wanted to do. Still, I am not sure how to norm and sum all the gradients. Since I have different layers in my CNN and different weights with different shapes, if I just do what you suggested, I get an error on the add_n() operation (since I am trying to add together matrices with different shapes). So probably I should do something like:
grad_norms = [tf.nn.l2_normalize(g[0], 0) for g in grads_and_vars]
grad_norm = [tf.reduce_sum(grads) for grads in grad_norms]
final_grad = tf.reduce_sum(grad_norm)
Can anyone confirm this?
Your line output = sess.run([input], feed_dict=some_dict) makes me think that you have a little misunderstanding of the sess.run command. What you call [input] is supposed to be a list of tensors that are to be fetched by the sess.run command. Hence, it is an output rather than an input. To tackle your question, let's assume that you are doing something like output = sess.run(loss, feed_dict=some_dict) instead (in order to monitor the training loss).
Also, I suppose you want to formulate your stopping criterion using the norm of the gradient (the gradient itself is a multi-dimensional quantity). Hence, what you want to do is to fetch the norm of the gradient each time you execute the graph. To that end, you have to do two things. 1) Add the gradient norm to the computation graph. 2) Fetch it in each call to sess.run in your training loop.
Ad 1) You have added the gradients to the graph via
optimizer = tf.train.AdamOptimizer()
grads_and_vars = optimizer.compute_gradients(a_loss_function)
and now have the tensors holding the gradients in grads_and_vars (one for each trained variable in the graph). Let's take the norm of each gradient and then sum it up:
grad_norms = [tf.nn.l2_loss(g) for g, v in grads_and_vars]
grad_norm = tf.add_n(grad_norms)
There you have your gradient norm.
Ad 2) Inside your loop, fetch the gradient norm alongside the loss by telling the sess.run command to do so:
for step in range(TotSteps):
    l, gn = sess.run([loss, grad_norm], feed_dict=some_dict)
    if gn < some_threshold:
        print("Training finished.")
        break
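Regarding the shape error mentioned in EDIT2: since tf.nn.l2_loss(g) reduces each gradient to a single scalar, tf.add_n only ever adds scalars, so differently shaped weight matrices should not be a problem here. If you want the actual Euclidean norm over all gradients instead of a sum of per-variable l2_loss terms, a minimal sketch (assuming the same grads_and_vars as above) could use tf.global_norm:
grads = [g for g, v in grads_and_vars]
grad_norm = tf.global_norm(grads)  # sqrt of the sum of squared entries over all gradients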
Related
I am working on a NN with PyTorch which simply maps points from the plane to real numbers, for example
model = nn.Sequential(nn.Linear(2,2),nn.ReLU(),nn.Linear(2,1))
What I want to do, since this network defines a map h:R^2->R, is to compute the gradient of this mapping h in the training loop. So for example
for it in range(epochs):
    pred = model(X_train)
    grad = torch.autograd.grad(pred, X_train)
    ....
The training set has been defined as a tensor with requires_grad=True. My problem is that even though the output for each fixed point is a scalar, since I am propagating a set of N=100 points, the output is actually an Nx1 tensor. This leads to the error: autograd can only compute the gradient of scalar functions.
In fact, with the small change
pred = torch.sum(model(X_train))
everything works perfectly. However, I am interested in all the individual gradients, so is there a way to compute all these gradients together?
Actually, computing the sum as presented above gives exactly the result I expect, of course, but I wanted to know whether this is the only possibility.
There are other possibilities, but using .sum is the simplest way. Using .sum() on the final loss vector and computing dpred/dinput gives you the desired output. Here is why:
Since pred = sum(loss) = sum(f(xi)),
where i is the index of the input x,
dpred/dinput will be a matrix [dpred/dx0, dpred/dx1, ...].
Consider dpred/dx0: it is equal to df(x0)/dx0, since every other term df(xi)/dx0 is 0.
PS: Please excuse the crappy mathematical expressions... SO does not support latex/math expressions.
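For completeness, here is a minimal sketch of the sum trick, assuming the same 2->1 model and 100 training points as in the question; the equivalent call using grad_outputs is shown as a comment:
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 1))
X_train = torch.randn(100, 2, requires_grad=True)

pred = model(X_train)                          # shape (100, 1), not a scalar
grads, = torch.autograd.grad(pred.sum(), X_train)
print(grads.shape)                             # torch.Size([100, 2]), one gradient per point

# Equivalent without the sum: pass a vector of ones as grad_outputs.
# grads, = torch.autograd.grad(pred, X_train, grad_outputs=torch.ones_like(pred))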
I'm interested in examining loss values per example, using a tf.keras.Model instance, with the following restrictions:
Assuming I already called model.compile(), the information about the loss function is there - I don't want to explicitly define loss like in this example from the docs
Without performing gradient descent (I am aware of the option of accessing the loss in tf.callback.Callbacks, but I don't want to perform GD)
Also, using a callback to get the losses and then reloading the initial weights is not a valid solution; I want to avoid GD entirely.
So is there a way to achieve this? I would expect something that looks like
model = tf.keras.Sequential([....])
model.compile(optimizer=..., loss=...)
single_loss_value = model.get_loss(single_x, single_y)
batch_loss_values = model.get_loss(x, y)
To get the loss for some samples you can use model.evaluate. You can pass x and y as a single sample or as a batch.
model = tf.keras.Sequential([....])
model.compile(optimizer=..., loss=...)
single_loss_value = model.evaluate(single_x, single_y)
batch_loss_values = model.evaluate(x, y)
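Note that model.evaluate returns the loss averaged over the samples you pass in. If you really need one value per example, a minimal sketch (assuming, for illustration, a model trained with an MSE loss) is to run only a forward pass and apply the loss with reduction=NONE, so no gradient step happens at all:
import tensorflow as tf

# Hedged sketch: per-example losses with no gradient step, assuming an MSE loss.
loss_fn = tf.keras.losses.MeanSquaredError(
    reduction=tf.keras.losses.Reduction.NONE)

preds = model(x, training=False)         # forward pass only, weights untouched
per_example_losses = loss_fn(y, preds)   # one loss value per sample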
After writing a simple neural network with numpy, I wanted to compare it numerically with a PyTorch implementation. Running on its own, my neural network implementation converges, so it seems to have no errors.
I have also checked that the forward pass matches PyTorch, so the basic setup is correct.
But something different happens during the backward pass, because the weights after one backpropagation step are different.
I don't want to post the full code here because it is spread over several .py files, and most of it is irrelevant to the question. I just want to know whether PyTorch does "basic" gradient descent or something different.
I am looking at the simplest case, the fully-connected weights of the last layer, because if those are different, everything further back will be different as well:
self.weight += self.learning_rate * hidden_layer.T.dot(output_delta)
where
output_delta = self.expected - self.output
self.expected is the expected value and self.output is the result of the forward pass.
There is no activation or anything further here.
The PyTorch part is:
optimizer = torch.optim.SGD(nn.parameters(), lr=1.0)
criterion = torch.nn.MSELoss(reduction='sum')
output = nn.forward(x_train)
loss = criterion(output, y_train)
loss.backward()
optimizer.step()
optimizer.zero_grad()
So is it possible that with the SGD optimizer and MSELoss it uses some different delta or backpropagation rule, not the basic one mentioned above? If so, I would like to know how to numerically check my numpy solution against PyTorch.
I just want to know whether PyTorch does "basic" gradient descent or something different.
If you set torch.optim.SGD, this means stochastic gradient descent.
There are different implementations of GD, but the one used in PyTorch is applied to mini-batches.
There are GD implementations that optimize the parameters only after a full epoch. As you may guess, they are very "slow"; this may be great for supercomputers to test. There are GD implementations that work on every single sample; as you may guess, their imperfection is "huge" gradient fluctuations.
These are all relative terms, so I am using the quotation marks.
Note that you are using a very large learning rate (lr = 1.0), which suggests you haven't normalized your data first, but this is a skill you will pick up over time.
So is it possible that with the SGD optimizer and MSELoss it uses some different delta or backpropagation rule, not the basic one mentioned above?
It uses exactly what you told it to use.
Here is an example in PyTorch and in plain Python showing that the gradient computation (used in backpropagation) works as expected:
x = torch.tensor([5.], requires_grad=True)
print(x) # tensor([5.], requires_grad=True)
y = 3*x**2
y.backward()
print(x.grad) # tensor([30.])
How would you get this value of 30 in plain Python?
def y(x):
return 3*x**2
x=5
e = 0.01  # eta, a small finite-difference step
g = (y(x+e) - y(x)) / e
print(g)  # 30.0299
As we expect, we get ~30; it would be even closer with a smaller eta.
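If it helps, here is a minimal sketch (my own toy example, not the original code) of checking one SGD + MSELoss(reduction='sum') step on a bias-free linear layer against a manual numpy update; note the factor 2 that the squared error contributes to the gradient:
import numpy as np
import torch

torch.manual_seed(0)
lin = torch.nn.Linear(3, 1, bias=False)   # toy "last layer"
x = torch.randn(4, 3)
y = torch.randn(4, 1)
lr = 1.0

W0 = lin.weight.detach().clone().numpy()  # weights before the step

opt = torch.optim.SGD(lin.parameters(), lr=lr)
loss = torch.nn.MSELoss(reduction='sum')(lin(x), y)
loss.backward()
opt.step()

r = x.numpy() @ W0.T - y.numpy()              # residuals, shape (4, 1)
W_manual = W0 - lr * 2.0 * (r.T @ x.numpy())  # plain gradient-descent step by hand
print(np.allclose(W_manual, lin.weight.detach().numpy(), atol=1e-5))  # True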
I want to use TensorFlow to calculate the gradients of a function. However, if I use the tf.gradients function, it returns a single list of gradients. How can I return a list for each point of the batch?
# in a tensorflow graph I have the following code
tf_x = tf.placeholder(dtype=tf.float32, shape=(None,N_in), name='x')
tf_net #... conveniently defined neural network
tf_y = tf.placeholder(dtype=tf.float32, shape=(None,1), name='y')
tf_cost = (tf_net(tf_x) - tf_y)**2 # this should have length N_samples because I did not apply a tf.reduce_mean
tf_cost_gradients = tf.gradients(tf_cost,tf_net.trainable_weights)
If we run it in a TensorFlow session,
# suppose myx = np.random.randn(N_samples,N_in) and myy conveniently chosen
feed = {tf_x: myx, tf_y: myy}
sess.run(tf_cost_gradients, feed)
I get only one list, and not a list for each sample as I would like. I can use
for i in range(len(myx)):
    feed = {tf_x: myx[i], tf_y: myy[i]}
    sess.run(tf_cost_gradients, feed)
but this is extremely slow! What can I do? Thank you
Although there is an 'aggregation_method' parameter in tf.gradients, it does not make it easy to get the individual gradients.
aggregation_method: Specifies the method used to combine gradient terms.
Please see these threads:
https://github.com/tensorflow/tensorflow/issues/15760
https://github.com/tensorflow/tensorflow/issues/4897
In one of the threads (#4897), Ian Goodfellow makes the following suggestion to speed up individual gradient computation:
This is only pseudocode, but basic idea is:
examples = tf.split(batch)
weight_copies = [tf.identity(weights) for x in examples]
output = tf.stack([f(x, w) for x, w in zip(examples, weight_copies)])
cost = cost_function(output)
per_example_gradients = tf.gradients(cost, weight_copies)
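For concreteness, here is a minimal sketch (my own toy graph with made-up shapes and a single weight matrix, TF1-style) of turning that pseudocode into runnable code:
import tensorflow as tf

N_in, N_samples = 3, 4
tf_x = tf.placeholder(tf.float32, shape=(N_samples, N_in), name='x')
tf_y = tf.placeholder(tf.float32, shape=(N_samples, 1), name='y')
weights = tf.Variable(tf.random_normal((N_in, 1)))

examples = tf.split(tf_x, N_samples, axis=0)   # one (1, N_in) slice per sample
labels = tf.split(tf_y, N_samples, axis=0)
weight_copies = [tf.identity(weights) for _ in examples]

outputs = [tf.matmul(ex, w) for ex, w in zip(examples, weight_copies)]
cost = tf.add_n([tf.reduce_sum(tf.square(o - l))
                 for o, l in zip(outputs, labels)])

# One gradient per example, each taken w.r.t. its own copy of the weights.
per_example_gradients = tf.gradients(cost, weight_copies)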
Using tf.nn.sparse_softmax_cross_entropy_with_logits in TensorFlow, it's possible to calculate the loss only for specific rows by setting the class label to -1 (it is otherwise expected to be in the range 0 to num_classes-1).
Unfortunately this breaks the gradient computations (as is mentioned in the comments in the source nn_ops.py).
What I would like to do is something like the following:
raw_classification_output1 = [0,1,0]
raw_classification_output2 = [0,0,1]
classification_output = tf.concat(0, [raw_classification_output1, raw_classification_output2])
classification_labels = [1,-1]
classification_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(classification_output,classification_labels)
total_loss = tf.reduce_sum(classification_loss) + tf.reduce_sum(other_loss)
optimizer = tf.train.GradientDescentOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(total_loss)
changed_grads_and_vars = #do something to 0 the incorrect gradients
optimizer.apply_gradients(changed_grads_and_vars)
What's the most straightforward way to zero those gradients?
The easiest method is to multiply the classification loss by a tensor of the same shape that is 1 where the loss is desired and 0 where it isn't. This is made easier by the fact that the loss is already zero where you don't want it to be updated. It is basically just a workaround for the fact that this sparse softmax still shows some weird gradient behavior when its loss is zero.
Adding this line after tf.nn.sparse_softmax_cross_entropy_with_logits:
classification_loss_zeroed = tf.mul(classification_loss, tf.to_float(tf.not_equal(classification_loss, 0)))
It should zero out the gradients as well.
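An alternative sketch (my own variable names, and using the keyword form labels=/logits= rather than the positional call above) is to mask the invalid rows before the loss, so -1 never reaches the op at all:
import tensorflow as tf

logits = tf.constant([[0., 1., 0.], [0., 0., 1.]])
labels = tf.constant([1, -1])

valid = tf.not_equal(labels, -1)                              # [True, False]
safe_labels = tf.where(valid, labels, tf.zeros_like(labels))  # replace -1 with a dummy 0
per_row_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=safe_labels, logits=logits)
classification_loss = per_row_loss * tf.cast(valid, tf.float32)  # zero out invalid rows
total_classification_loss = tf.reduce_sum(classification_loss)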