I'm trying to use TensorFlow's GradientDescentOptimizer to solve the 2-dimensional Rosenbrock function, but when I run the program the optimizer sometimes diverges towards infinity. At other times, without changing anything, it finds the right neighborhood but never pinpoints the optimal solution.
My code is as follows:
import tensorflow as tf
x1_data = tf.Variable(initial_value=tf.random_uniform([1], -10, 10),name='x1')
x2_data = tf.Variable(initial_value=tf.random_uniform([1], -10, 10), name='x2')
# Loss function
y = tf.add(tf.pow(tf.sub(1.0, x1_data), 2.0),
           tf.mul(100.0, tf.pow(tf.sub(x2_data, tf.pow(x1_data, 2.0)), 2.0)),
           'y')
opt = tf.train.GradientDescentOptimizer(0.0035)
train = opt.minimize(y)
sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)
for step in xrange(200):
    sess.run(train)
    if step % 10 == 0:
        print(step, sess.run(x1_data), sess.run(x2_data), sess.run(y))
The Rosenbrock problem is defined as y = (1 - x1)^2 + 100 * (x2 - x1^2)^2, with the optimal solution at x1 = x2 = 1.
What am I doing wrong here? Or have I completely misunderstood how to use TensorFlow?
If you decrease the range of the initial x1/x2 (e.g. use -3/3 instead of -10/10) and decrease the learning rate by a factor of 10, it shouldn't blow up as often. Decreasing the learning rate when you see things diverging is often a good thing to try.
Also, the function you're optimizing is designed to make finding the global minimum difficult, so it's no surprise that it finds the valley but not the global optimum ;)
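For example, here is a sketch of that change on top of the question's code (same TF API as above, just a narrower initialization range and a 10x smaller step size):
x1_data = tf.Variable(tf.random_uniform([1], -3, 3), name='x1')
x2_data = tf.Variable(tf.random_uniform([1], -3, 3), name='x2')
y = tf.add(tf.pow(tf.sub(1.0, x1_data), 2.0),
           tf.mul(100.0, tf.pow(tf.sub(x2_data, tf.pow(x1_data, 2.0)), 2.0)), 'y')
train = tf.train.GradientDescentOptimizer(0.00035).minimize(y)  # 0.0035 / 10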
Yes, as #etarion says, this is an optimization problem; your TensorFlow code is fine.
One way to make sure the gradients never explode is to clip them to the range [-10., 10.], for instance:
opt = tf.train.GradientDescentOptimizer(0.0001)
grads_and_vars = opt.compute_gradients(y, [x1_data, x2_data])
clipped_grads_and_vars = [(tf.clip_by_value(g, -10., 10.), v) for g, v in grads_and_vars]
train = opt.apply_gradients(clipped_grads_and_vars)
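As a side note (my addition, not strictly necessary): tf.clip_by_value caps each gradient component independently, which changes the gradient's direction. If you prefer to preserve the direction and only cap the magnitude, tf.clip_by_norm is an alternative:
clipped_grads_and_vars = [(tf.clip_by_norm(g, 10.), v) for g, v in grads_and_vars]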
Related
For nn.CrossEntropyLoss(), what happens if the input labels are all ignore_index? I get a 'nan' loss; how can I fix it? Thanks!
The relevant detail of my code is the following:
[screenshot: code]
Note that IGNORE_INDEX is -100.
The loss_ent then becomes 'nan'. The inputs to criterion() are the following:
[screenshot: criterion's input]
My Python environment is the following:
Python: 3.7.11
torch: 1.12.0+cu116
GPU: NVIDIA A100 80G
Thanks for your attention. Please let me know if there is anything else I need to provide.
There seem to be two possibilities.
Here is some code to help understand the ignore index. If you set all targets to the ignore index, criterion outputs nan because there is no value left to compute.
import torch
import torch.nn as nn
x = torch.randn(5, 10, requires_grad=True)
y = torch.ones(5).long() * (-100)
criterion = nn.CrossEntropyLoss(ignore_index=-100)
loss = criterion(x, y)
print(loss)
tensor(nan, grad_fn=<NllLossBackward0>)
However, if there is even one value to calculate, the loss is calculated properly.
import torch
import torch.nn as nn
x = torch.randn(5, 10, requires_grad=True)
y = torch.ones(5).long() * (-100)
y[0] = 1
criterion = nn.CrossEntropyLoss(ignore_index=-100)
loss = criterion(x, y)
print(loss)
tensor(1.2483, grad_fn=<NllLossBackward0>)
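So one fix, if your batches can legitimately end up with every label ignored, is to skip the criterion for those batches. A minimal sketch (assuming your logits and labels are called x and y as above, and IGNORE_INDEX is -100):
import torch
import torch.nn as nn

IGNORE_INDEX = -100
criterion = nn.CrossEntropyLoss(ignore_index=IGNORE_INDEX)

x = torch.randn(5, 10, requires_grad=True)   # logits
y = torch.ones(5).long() * IGNORE_INDEX      # every label ignored

if (y != IGNORE_INDEX).any():
    loss_ent = criterion(x, y)
else:
    # keep the graph connected but contribute nothing, so backward() stays NaN-free
    loss_ent = x.sum() * 0.0

print(loss_ent)  # tensor(0., grad_fn=<MulBackward0>)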
The other possibility is that the learning rate is too large, which causes the training to diverge.
If the learning rate is too large, the gradient steps overshoot, so the gradients gradually grow and the loss diverges.
How about reducing the learning rate for the part of the model that produces is_gold_ent?
When reducing the learning rate, a common rule of thumb is to reduce it by about a factor of 3, e.g. 0.01, 0.003, 0.001, ...
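In PyTorch you can give that part of the model its own, smaller learning rate via optimizer parameter groups. A rough sketch; TinyModel, backbone and ent_head are placeholder names of my own, not from your code:
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(16, 8)
        self.ent_head = nn.Linear(8, 2)   # the part that produces is_gold_ent

model = TinyModel()
optimizer = torch.optim.Adam([
    {'params': model.backbone.parameters(), 'lr': 1e-3},
    {'params': model.ent_head.parameters(), 'lr': 3e-4},  # roughly 1/3 of the base rate
])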
After writing a simple neural network with numpy, I wanted to compare it numerically with a PyTorch implementation. Running on its own, my neural network implementation converges, so it seems to have no errors.
I've also checked that the forward pass matches PyTorch, so the basic setup is correct.
But something different happens during the backward pass, because the weights after one backpropagation step are different.
I don't want to post the full code here because it is spread over several .py files, and most of it is irrelevant to the question. I just want to know whether PyTorch uses "basic" gradient descent or something different.
I'm looking at the simplest case, the fully-connected weights of the last layer, because if those are different, everything further will be different too:
self.weight += self.learning_rate * hidden_layer.T.dot(output_delta)
where
output_delta = self.expected - self.output
self.expected is the expected value,
self.output is the forward pass result.
No activation or anything further here.
The PyTorch part is:
optimizer = torch.optim.SGD(nn.parameters(), lr=1.0)
criterion = torch.nn.MSELoss(reduction='sum')
output = nn.forward(x_train)
loss = criterion(output, y_train)
loss.backward()
optimizer.step()
optimizer.zero_grad()
So is it possible that with the SGD optimizer and MSELoss it uses some different delta or backpropagation function, not the basic one mentioned above? If so, I'd like to know how to numerically check my numpy solution against PyTorch.
I just want to know whether PyTorch uses "basic" gradient descent or something different.
If you use torch.optim.SGD, this means stochastic gradient descent.
There are different implementations of GD, but the one used in PyTorch is applied to mini-batches.
There are GD implementations that only update the parameters after a full epoch. As you may guess, they are very "slow"; that may be fine on supercomputers. There are also GD implementations that update after every single sample; as you may guess, their weakness is "huge" gradient fluctuations.
These are all relative terms, so I am using the quotes.
Note that you are using a very large learning rate (lr = 1.0), which suggests you haven't normalized your data first; this is a skill you will pick up over time.
So is it possible that with the SGD optimizer and MSELoss it uses some different delta or backpropagation function, not the basic one mentioned above?
It uses exactly what you told it to use.
Here is an example in PyTorch and in plain Python to show that the computation of gradients works as expected (as used in backpropagation):
import torch

x = torch.tensor([5.], requires_grad=True)
print(x) # tensor([5.], requires_grad=True)
y = 3*x**2
y.backward()
print(x.grad) # tensor([30.])
How would you get this value 30 in plain python?
def y(x):
    return 3*x**2

x = 5
e = 0.01  # step size for the finite-difference approximation
g = (y(x+e) - y(x)) / e
print(g)  # 30.0299
As expected we got ~30; it would be even more accurate with a smaller step size.
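If you want a numerical cross-check against your numpy code, you can compare PyTorch's autograd gradient for a single linear layer with the hand-derived formula. This sketch is my own illustration, not your code: for a sum-reduced MSE the gradient w.r.t. the last-layer weights is hidden.T @ (2 * (output - expected)), i.e. your output_delta times -2; since SGD subtracts the gradient, the resulting update points the same way as your rule, just scaled by the factor 2 from the squared-error derivative.
import numpy as np
import torch

hidden = torch.randn(4, 3)                   # hidden_layer activations
W = torch.randn(3, 2, requires_grad=True)    # last-layer weights
expected = torch.randn(4, 2)

output = hidden @ W                          # no activation, as in the question
loss = torch.nn.MSELoss(reduction='sum')(output, expected)
loss.backward()

# hand-derived gradient of sum((output - expected)**2) with respect to W
manual_grad = hidden.numpy().T @ (2 * (output - expected).detach().numpy())
print(np.allclose(W.grad.numpy(), manual_grad, atol=1e-5))  # True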
I'm interested in Deep Learning and recently found out about TensorFlow. I installed it and followed the tutorial at https://www.tensorflow.org/get_started/get_started.
This is the code I came up with by following that tutorial:
import tensorflow as tf
W = tf.Variable(0.3, tf.float32)
b = tf.Variable(-0.3, tf.float32)
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
linear_model = W * x + b
squared_deltas = tf.square(linear_model - y)
loss = tf.reduce_sum(squared_deltas)
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
sess.run(init)
for i in range(1000):
    sess.run(train, {x:[1,2,3,4], y:[0,-1,-2,-3]})
print(sess.run([W, b]))
For the time being, I'm only interested in the code before the training, so as not to get overwhelmed.
Now, I understand (or at least I think I do) parts of this code. It produces the expected result when following the tutorial, but most lines are confusing to me. It might be because I'm not familiar with the mathematics involved, but I don't know how much math is actually involved here, so it's hard to tell whether that's the problem.
Anyway, I understand the first 6 lines.
Then there's this line:
squared_deltas = tf.square(linear_model - y)
As I understand it, it simply returns the square of (linear_model - y)
However, y has no value yet.
Then, loss is assigned the value of tf.reduce_sum(squared_deltas). I understand that loss needs to be as low as possible.
How do I even interpret these last two lines?
I sort of understand tf.Session() and tf.global_variables_initializer() so I'm not too concerned with those two functions right now.
Bonus question: changing the value of the argument to tf.train.GradientDescentOptimizer() in either direction (increase or decrease) gives me the wrong result. How come 0.01 works when 0.1 and 0.001 don't?
I appreciate any help I can get!
Thanks
As I understand it, it simply returns the square of (linear_model - y) However, y has no value yet.
Then, loss is assigned the value of tf.reduce_sum(squared_deltas). I understand that loss needs to be as low as possible.
How do I even interpret these last two lines?
You clearly need to go through the TensorFlow documentation. You are missing the core idea behind TF: it defines a computational graph. At this point no computation is being done, and you are right, there is no "y" yet, at least no values; it is just a symbolic variable (a placeholder). So we are saying that our loss will be the sum of squared differences between the predictions and the true values (y), but we are not providing them yet. Actual values only start "living" in the session; before that, this is just a graph of computations, instructions for TF so it knows what to anticipate.
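To make that concrete, here is a minimal sketch using your own graph (assuming sess.run(init) has already been run, as in your snippet):
# These lines only build graph nodes; nothing is computed and y has no value yet.
squared_deltas = tf.square(linear_model - y)
loss = tf.reduce_sum(squared_deltas)

# Values appear only when the graph is run and the placeholders are fed.
print(sess.run(loss, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]}))  # 23.66 for W=0.3, b=-0.3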
Bonus question: changing the value of the argument to tf.train.GradientDescentOptimizer() in either direction (increase or decrease) gives me the wrong result. How come 0.01 works when 0.1 and 0.001 don't?
Linear regression (which is what you are working with) converges if and only if the learning rate is small enough and you run enough iterations. 0.1 is probably just too big; 0.01 is fine, and so is 0.001, you simply need more than 1000 iterations for 0.001, but it will work (and so will any smaller value, just much more slowly).
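For example, continuing from your snippet (the numbers are illustrative, not tuned):
optimizer = tf.train.GradientDescentOptimizer(0.001)   # 10x smaller than 0.01
train = optimizer.minimize(loss)
sess.run(init)
for i in range(10000):                                 # roughly 10x more iterations
    sess.run(train, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]})
print(sess.run([W, b]))  # should approach [-1.0, 1.0]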
I've been trying to implement Logistic Regression in TensorFlow following the MNIST example but with data from a CSV. Each row is one sample and has 12 dimensions. My code is the following:
batch_size = 5
learning_rate = .001
x = tf.placeholder(tf.float32,[None,12])
y = tf.placeholder(tf.float32,[None,2])
W = tf.Variable(tf.zeros([12,2]))
b = tf.Variable(tf.zeros([2]))
mult = tf.matmul(x,W)
pred = tf.nn.softmax(mult+b)
cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred), reduction_indices=1))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
sess = tf.Session()
sess.run(tf.initialize_all_variables())
avg_cost = 0
total_batch = int(len(Xtrain)/batch_size)
for i in range(total_batch):
    batch_xs = Xtrain[i*batch_size:batch_size*i+batch_size]
    batch_ys = ytrain[i*batch_size:batch_size*i+batch_size]
    _, c = sess.run([optimizer, cost], feed_dict={x: batch_xs, y: batch_ys})
    print(c)
Xtrain is a 252x10 numpy array, and ytrain is a 252x2 one hot numpy array.
The problem: the cost c gets calculated for the first iteration (its value is 0.6931...), but for every iteration after that it returns 'nan'.
Things I've tried: I made sure every component of the model was working on its own. The issue appears only after the first iteration. I've played around with the learning rate, but that doesn't do anything. I've tried initializing the weights as truncated_normal (which I shouldn't need to do for logistic regression anyway), but that doesn't help either.
So, any thoughts? I've spent around 3 hours trying to fix it and have run out of ideas. It seems like something just isn't working when TensorFlow tries to optimize the cost function.
The issue you are having is that log(pred) is not defined for pred = 0. The "hacky" way around this is to use tf.maximum(pred, 1e-15) or tf.clip_by_value(pred, 1e-15, 1.0) inside the log.
An even better solution, however, is to use tf.nn.softmax_cross_entropy_with_logits on the raw logits (mult + b) instead of applying softmax and cross-entropy separately; it deals with edge cases like this (and hence all your problems) automatically!
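Concretely, in your snippet that would look something like this (a sketch; note that the op expects the raw logits mult + b, not the softmaxed pred):
logits = tf.matmul(x, W) + b
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
pred = tf.nn.softmax(logits)  # still available if you need probabilities elsewhere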
For further reading, I'd recommend this great answer:
https://stackoverflow.com/a/34243720/5829427
I would like to implement a stopping condition based on the value of the gradient of the loss function w.r.t. the weights.
For example, let's say I have something like this:
optimizer = tf.train.AdamOptimizer()
grads_and_vars = optimizer.compute_gradients(a_loss_function)
train_op = optimizer.apply_gradients(grads_and_vars)
then I would like to run the graph with something like this:
for step in range(TotSteps):
    output = sess.run([input], feed_dict=some_dict)
    if(grad_taken_in_some_way < some_treshold):
        print("Training finished.")
        break
I am not sure what I should pass to sess.run() in order to also get the gradient as an output (besides all the other things I need). I am not even sure whether this is the correct approach or whether I should do it differently. I made some attempts but failed every time. I hope someone has some hints.
Thank you in advance!
EDIT: English correction
EDIT2: The answer by Iballes is exactly what I wanted to do. Still, I am not sure how to norm and sum all the gradients. Since I have different layers in my CNN and different weights with different shapes, if I just do what you suggested I get an error in the add_n() operation (since I am trying to add together matrices with different shapes). So probably I should do something like:
grad_norms = [tf.nn.l2_normalize(g[0], 0) for g in grads_and_vars]
grad_norm = [tf.reduce_sum(grads) for grads in grad_norms]
final_grad = tf.reduce_sum(grad_norm)
Can anyone confirm this?
Your line output = sess.run([input], feed_dict=some_dict) makes me think you have a slight misunderstanding of the sess.run command. What you call [input] is supposed to be a list of tensors to be fetched by the sess.run command, so it is an output rather than an input. To tackle your question, let's assume that you are doing something like output = sess.run(loss, feed_dict=some_dict) instead (in order to monitor the training loss).
Also, I suppose you want to formulate your stopping criterion using the norm of the gradient (the gradient itself is a multi-dimensional quantity). Hence, what you want to do is to fetch the norm of the gradient each time you execute the graph. To that end, you have to do two things. 1) Add the gradient norm to the computation graph. 2) Fetch it in each call to sess.run in your training loop.
Ad 1) You have added the gradients to the graph via
optimizer = tf.train.AdamOptimizer()
grads_and_vars = optimizer.compute_gradients(a_loss_function)
and now have the tensors holding the gradients in grads_and_vars (one for each trained variable in the graph). Let's take a norm of each gradient and sum them up:
grad_norms = [tf.nn.l2_loss(g) for g, v in grads_and_vars]
grad_norm = tf.add_n(grad_norms)
There you have your gradient norm. (Strictly, tf.nn.l2_loss(g) returns sum(g**2) / 2, i.e. half the squared L2 norm rather than the norm itself, but each term is a scalar, so add_n works regardless of the gradients' shapes and the sum is still a perfectly good magnitude measure for a stopping criterion.)
Ad 2) Inside your loop, fetch the gradient norm alongside the loss, and run the training op in the same call so that a step is actually taken:
for step in range(TotSteps):
    _, l, gn = sess.run([train_op, loss, grad_norm], feed_dict=some_dict)
    if gn < some_treshold:
        print("Training finished.")
        break