The question is whether just changing the learning_rate argument in tf.train.AdamOptimizer actually results in any change in behaviour.
Let's say the code looks like this:
myLearnRate = 0.001
...
output = tf.someDataFlowGraph
trainLoss = tf.losses.someLoss(output)
trainStep = tf.train.AdamOptimizer(learning_rate=myLearnRate).minimize(trainLoss)
with tf.Session() as session:
    # first train step
    session.run(trainStep, feed_dict={input: someData, target: someTarget})
    myLearnRate = myLearnRate * 0.1
    # second train step
    session.run(trainStep, feed_dict={input: someData, target: someTarget})
Would the decreased myLearnRate now be applied in the second trainStep? That is, is the creation of the node trainStep evaluated only once:
trainStep = tf.train.AdamOptimizer(learning_rate=myLearnRate).minimize(trainLoss)
Or is it evaluated with every session.run(trainStep)? How could I have checked whether my AdamOptimizer in Tensorflow actually changed the learning rate?
Disclaimer 1: I'm aware manually changing the LearnRate is bad practice.
Disclaimer 2: I'm aware there is a similar question, but it was solved by feeding a tensor as the learning rate, which is updated at every trainStep (here). That makes me lean towards assuming it would only work with a tensor as the learning_rate input for AdamOptimizer, but I'm neither sure of that nor do I understand the reasoning behind it.
The short answer is that no, your new learning rate is not applied. TF builds the graph before you run it, and changing something on the Python side afterwards will not translate to a change in the graph at run time. You can, however, feed a new learning rate into your graph pretty easily:
# Use a placeholder in the graph for your user-defined learning rate instead
learning_rate = tf.placeholder(tf.float32)
# ...
trainStep = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(trainLoss)
applied_rate = 0.001 # we will update this every training step
with tf.Session() as session:
    # first train step, feeding our applied rate to the graph
    session.run(trainStep, feed_dict={input: someData,
                                      target: someTarget,
                                      learning_rate: applied_rate})
    applied_rate *= 0.1  # update the rate we feed to the graph
    # second train step
    session.run(trainStep, feed_dict={input: someData,
                                      target: someTarget,
                                      learning_rate: applied_rate})
Yes, the optimizer is created only once:
tf.train.AdamOptimizer(learning_rate=myLearnRate)
It remembers the passed learning rate (in fact, it creates a tensor for it if you pass a floating-point number), so your future changes to myLearnRate don't affect it.
Yes, you can create a placeholder and pass it to session.run(), if you really want to. But, as you said, it's pretty uncommon and probably means you are solving your original problem in the wrong way.
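If the goal is simply a decaying learning rate, another option is to let the graph compute the decay itself with tf.train.exponential_decay. A minimal sketch, reusing trainLoss from above; the decay settings here are only illustrative:
import tensorflow as tf

# A non-trainable counter that tracks how many training steps have run.
global_step = tf.Variable(0, trainable=False)
decayed_lr = tf.train.exponential_decay(learning_rate=0.001,
                                        global_step=global_step,
                                        decay_steps=1,    # decay after every step (illustrative)
                                        decay_rate=0.1,
                                        staircase=True)
# Passing global_step to minimize() makes it increment on every training step,
# so the learning rate tensor shrinks automatically.
trainStep = tf.train.AdamOptimizer(learning_rate=decayed_lr).minimize(
    trainLoss, global_step=global_step)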
After I'd written a simple neural network with numpy, I wanted to compare it numerically with a PyTorch implementation. Run on its own, my neural network implementation seems to converge, so it appears to have no errors.
I also checked that the forward pass matches PyTorch, so the basic setup is correct.
But something different happens during the backward pass, because the weights after one backpropagation step are different.
I don't want to post the full code here because it is spread over several .py files, and most of the code is irrelevant to the question. I just want to know whether PyTorch does "basic" gradient descent or something different.
I am looking at the simplest case, the fully-connected weights of the last layer, because if those are different, everything further back will be different too:
self.weight += self.learning_rate * hidden_layer.T.dot(output_delta)
where
output_delta = self.expected - self.output
self.expected is the expected value,
self.output is the forward pass result.
No activation or further stuff here.
The torch part is:
optimizer = torch.optim.SGD(nn.parameters(), lr=1.0)
criterion = torch.nn.MSELoss(reduction='sum')
output = nn.forward(x_train)
loss = criterion(output, y_train)
loss.backward()
optimizer.step()
optimizer.zero_grad()
So is it possible that with the SGD optimizer and MSELoss it uses some different delta or backpropagation function, not the basic one mentioned above? If so, I'd like to know how to numerically check my numpy solution against PyTorch.
I just want to know whether PyTorch does "basic" gradient descent or something different.
If you set torch.optim.SGD, this means stochastic gradient descent.
There are different implementations of GD, but the one used in PyTorch is applied to mini-batches.
There are GD implementations that update the parameters only after a full epoch; as you may guess, they are very "slow", which may be fine if you have a supercomputer to test on. There are also GD implementations that update after every single sample; as you may guess, their drawback is "huge" gradient fluctuations.
These are all relative terms, hence the quotation marks.
Note that you are using a rather big learning rate (lr = 1.0), which usually means you haven't normalized your data first, but that is a skill you will pick up over time.
So is it possible that with the SGD optimizer and MSELoss it uses some different delta or backpropagation function, not the basic one mentioned above?
It uses exactly what you told it to use.
Here is an example in PyTorch and in plain Python showing that the gradient computation (used in backpropagation) works as expected:
x = torch.tensor([5.], requires_grad=True)
print(x) # tensor([5.], requires_grad=True)
y = 3*x**2
y.backward()
print(x.grad) # tensor([30.])
How would you get this value of 30 in plain Python?
def y(x):
    return 3*x**2

x = 5
e = 0.01  # eta, a small step for the finite difference
g = (y(x+e) - y(x)) / e
print(g)  # 30.0299
As expected we got ~30; the approximation would be even better with a smaller eta.
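To numerically check the numpy update against PyTorch for this last layer specifically, one option is to compare the gradients directly. A minimal sketch, not from the original post; the shapes and names are illustrative, assuming a single bias-free linear layer and a sum-of-squares loss:
import numpy as np
import torch

# Illustrative shapes: 4 samples, 3 hidden units, 2 outputs.
hidden = np.random.randn(4, 3).astype(np.float32)   # hidden-layer activations
target = np.random.randn(4, 2).astype(np.float32)   # expected outputs
weight = np.random.randn(3, 2).astype(np.float32)   # last-layer weights

# PyTorch side: sum-of-squares loss, gradient via autograd.
w_t = torch.tensor(weight, requires_grad=True)
output_t = torch.tensor(hidden) @ w_t
loss = torch.nn.functional.mse_loss(output_t, torch.tensor(target), reduction='sum')
loss.backward()

# numpy side: the manual delta from the question.
output = hidden @ weight
delta = target - output                   # output_delta = expected - output
manual_grad = -2.0 * hidden.T @ delta     # d/dW of sum((target - output)**2)

print(np.allclose(w_t.grad.numpy(), manual_grad, atol=1e-5))  # should print True
Note the factor of 2: for a loss of sum((target - output)**2) the gradient is -2 * hidden.T @ (target - output), so an update of the form weight += lr * hidden.T.dot(output_delta) corresponds to gradient descent on half that loss; against SGD with MSELoss(reduction='sum') the two updates can therefore differ by exactly that constant factor.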
I am trying to learn the dynamics of TensorFlow 2.0 by converting my TensorFlow 1.13 script (below) into a TensorFlow 2.0 script. However, I am struggling to do this.
I think the main reason I am struggling is that the TensorFlow 2.0 examples I have seen train neural networks, so they have a model which they compile and fit. However, in my simple example below I am not using a neural network, so I can't see how to adapt this code to TensorFlow 2.0 (for example, how do I replace the session?). Help is much appreciated, and thanks in advance.
data = tf.placeholder(tf.int32)
theta = tf.Variable(np.zeros(100))
p_s = tf.nn.softmax(theta)
loss = tf.reduce_mean(-tf.log(tf.gather(p_s, data)))
train_step = tf.train.AdamOptimizer().minimize(loss)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(10):
        for datum in sample_data():  # sample_data() is a list of integer datapoints
            _ = sess.run([train_step], feed_dict={data: datum})
        print(sess.run(p_s))
I have looked at this (which is most relevant) and so far I have come up with the following:
#data = tf.placeholder(tf.int32)
theta = tf.Variable(np.zeros(100))
p_s = tf.nn.softmax(theta)
loss = tf.reduce_mean(-tf.math.log(tf.gather(p_s, **data**)))
optimizer = tf.keras.optimizers.Adam()
for epoch in range(10):
    for datum in sample_data():
        optimizer.apply_gradients(loss)
    print(p_s)
However, the above obviously does not run, because the placeholder data inside the loss function does not exist anymore, and I am not sure how to replace it. :S
Anyone? Note that I don't have a def forward(x) because my input datum isn't transformed - it is used directly to calculate the loss.
Instead of using the conversion tool (which exists, but which I don't like since it more or less just prefixes the API calls with tf.compat.v1 and keeps the old TensorFlow 1.x API), I'll help you convert your code to the new version.
Sessions have disappeared, and so have placeholders. The reason? The code is executed line by line - that is TensorFlow's eager mode.
To train a model you correctly use an optimizer. If you want to use the minimize method, in TensorFlow 2.0 you have to define the function to minimize (the loss) as a Python callable.
# This is your "model"
theta = tf.Variable(np.zeros(100))
p_s = tf.nn.softmax(theta)
# Define the optimizer
optimizer = tf.keras.optimizers.Adam()
# Define the training loop with the loss inside (because we use the
# .minimnize method that requires a callable with no arguments)
trainable_variables = [theta]
for epoch in range(10):
for datum in sample_data():
# The loss must be callable and return the value to minimize
def loss_fn():
loss = tf.reduce_mean(-tf.math.log(tf.gather(p_s, datum)))
return loss
optimizer.minimize(loss_fn, var_list=trainable_variables)
tf.print("epoch ", epoch, " finished. ps: ", p_s)
Disclaimer: I haven't tested the code - but it should work (or at least give you an idea on how to implement what you're trying to achieve in TF 2)
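For reference, here is a sketch of the same loop written with tf.GradientTape instead of .minimize, under the same assumption that sample_data() yields integer indices; also untested, in the spirit of the disclaimer above:
import numpy as np
import tensorflow as tf

theta = tf.Variable(np.zeros(100))
optimizer = tf.keras.optimizers.Adam()

for epoch in range(10):
    for datum in sample_data():
        # Record the forward computation so gradients w.r.t. theta can be taken.
        with tf.GradientTape() as tape:
            p_s = tf.nn.softmax(theta)
            loss = tf.reduce_mean(-tf.math.log(tf.gather(p_s, datum)))
        grads = tape.gradient(loss, [theta])
        optimizer.apply_gradients(zip(grads, [theta]))
    tf.print("epoch", epoch, "finished. ps:", tf.nn.softmax(theta))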
I'm interested in Deep Learning and recently found out about TensorFlow. I installed it and followed the tutorial at https://www.tensorflow.org/get_started/get_started.
This is the code I came up with by following that tutorial:
import tensorflow as tf
W = tf.Variable(0.3, tf.float32)
b = tf.Variable(-0.3, tf.float32)
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
linear_model = W * x + b
squared_deltas = tf.square(linear_model - y)
loss = tf.reduce_sum(squared_deltas)
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
sess.run(init)
for i in range(1000):
    sess.run(train, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]})

print(sess.run([W, b]))
For the time being, I'm only interested in the code before it does the training, as to not get overwhelmed.
Now, I understand (or at least I think I do) parts of this code. It produces the result as expected following the tutorial, but most lines in this code are confusing to me. It might be because I'm not familiar with the mathematics involved, but I don't know how much math is actually involved here so it's hard to tell if that's the problem.
Anyway, I understand the first 6 lines.
Then there's this line:
squared_deltas = tf.square(linear_model - y)
As I understand it, it simply returns the square of (linear_model - y)
However, y has no value yet.
Then, loss is assigned the value of tf.reduce_sum(squared_deltas). I understand that loss needs to be as low as possible.
How do I even interpret these last two lines?
I sort of understand tf.Session() and tf.global_variables_initializer() so I'm not too concerned with those two functions right now.
Bonus question: changing the value in the argument of tf.train.GradientDescentOptimizer() in either direction (increase or decrease) gives me the wrong result. How come 0.01 works when 0.1 or 0.001 doesn't?
I appreciate any help I can get!
Thanks
As I understand it, it simply returns the square of (linear_model - y) However, y has no value yet.
Then, loss is assigned the value of tf.reduce_sum(squared_deltas). I understand that loss needs to be as low as possible.
How do I even interpret these last two lines?
You clearly need to go through the TensorFlow documentation. You are missing the core idea behind TF: it defines a computational graph, and at this point no computation is being performed. You are right, there is no "y" yet, no values at least; it is just a symbolic variable (placeholder). So we say that our loss will be the sum of squared differences between the predictions and the true values (y), but we are not providing those values yet. Actual values only start "living" inside the session; before that, this is just a graph of computations, a set of instructions for TF so that it knows what to anticipate.
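A minimal sketch of that idea (not from the original answer): the graph is just a description, and y only receives values when you feed them at run time.
import tensorflow as tf

x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
squared_deltas = tf.square(x - y)   # no numbers yet, just a recipe

with tf.Session() as sess:
    # Only now do x and y receive concrete values, so the op can be evaluated.
    print(sess.run(squared_deltas, {x: [1, 2, 3], y: [0, -1, -2]}))  # [ 1.  9. 25.]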
Bonus question: changing the value in the argument of tf.train.GradientDescentOptimizer() in either direction (increase or decrease) gives me the wrong result. How come 0.01 works when 0.1 or 0.001 doesn't?
Linear regression (which is what you are working with) converges if and only if the learning rate is small enough and you run enough iterations. 0.1 is probably just too big; 0.01 is fine, and so is 0.001, you simply need more than 1000 iterations for 0.001, but it will work (and so will any smaller value, just much more slowly).
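If you want to see this for yourself, one simple check (a sketch that reuses loss, x, y and sess from the question's snippet; the candidate rates are just examples) is to rebuild the training op for a few rates and compare the final loss:
# Reuses loss, x, y and sess defined in the snippet above.
for lr in (0.1, 0.01, 0.001):
    train = tf.train.GradientDescentOptimizer(lr).minimize(loss)
    sess.run(tf.global_variables_initializer())   # reset W and b for a fair comparison
    for i in range(1000):
        sess.run(train, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]})
    print(lr, sess.run(loss, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]}))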
I have a standard experiment loop that looks like this:
cross_entropy_target = tf.reduce_mean(tf.reduce_mean(tf.square(target_pred - target)))
cost = cross_entropy_target
opt_target = tf.train.AdamOptimizer(learning_rate=0.00001).minimize(cost)
for epoch in range(num_epochs):
    for mini_batch in range(num_samples / batch_size):
        mb_train_x, mb_train_target = get_mini_batch_stuffs()
        sess.run(opt_target, feed_dict={x: mb_train_x, target: mb_train_target})
This runs and converges to a good prediction loss. Now, here is the same code with a slight modification:
cross_entropy_target = tf.reduce_mean(tf.reduce_mean(tf.square(target_pred - target)))
cross_entropy_target_variable = tf.Variable(0.0)
cost = cross_entropy_target_variable
opt_target = tf.train.AdamOptimizer(learning_rate=0.00001).minimize(cost)
for epoch in range(num_epochs):
    for mini_batch in range(num_samples / batch_size):
        mb_train_x, mb_train_target = get_mini_batch_stuffs()
        new_target_cost = sess.run(cross_entropy_target, feed_dict={x: mb_train_x, time: mb_train_time, target: mb_train_target})
        sess.run(tf.assign(cross_entropy_target_variable, new_target_cost))
        sess.run(opt_target, feed_dict={x: mb_train_x, target: mb_train_target})
Now, instead of the cross_entropy_target being calculated as part of the opt_target graph, I am pre-calculating it, assigning it to a tensorflow variable, and expecting it to make use of that value. This doesn't work at all. The network's outputs never change.
I would expect these two code snippets to have equivalent outcomes. In both cases a feed forward is used to populate the values of target and target_pred, which is then reduced to the scalar value cross_entropy_target. This scalar value is used to inform the magnitude and direction of the gradient updates on the optimizer's .minimize().
In this toy example there is no advantage to my calculating the cross_entropy_target "out of graph" and then assigning it to an in-graph tf.Variable for use in the opt_target run. However, I have a real use case where my cost function is very complex and I have not been able to define it in terms of Tensorflow's existing tensor transforms. Either way, I'd like to understand why using a tf.Variable for an optimizer's cost is incorrect use.
An interesting oddity that may be a byproduct of the solution to this:
If I set cross_entropy_target_variable = tf.Variable(0.0, trainable=False), running the opt_target will crash. It requires that the cost value is modifiable. Indeed, printing out its value before and after running the opt_target produces different values:
cross_entropy_target before = 0.345796853304
cross_entropy_target after = 0.344796866179
Why does running minimize() modify the value of the cost variable?
In your tf.train.AdamOptimizer(...).minimize(cost) line, the optimizer looks at cost, which in the modified code is cross_entropy_target_variable, a plain tf.Variable. That value does not depend on any of your network's weights, so minimize() finds no gradients for them and never updates them; the optimizer was already created against this disconnected cost, and modifying the variable later has no effect. The only variable the cost does depend on is itself (with gradient 1), so each minimize() call simply nudges the cost variable down a little. That is why its value changes after running opt_target, and why making it trainable=False crashes: then there is no variable left for which any gradient exists.
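A tiny sketch illustrating that last point (not from the original answer): minimizing a bare variable just nudges the variable itself.
import tensorflow as tf

c = tf.Variable(5.0)   # stands in for cross_entropy_target_variable
step = tf.train.GradientDescentOptimizer(0.1).minimize(c)   # d(c)/d(c) == 1

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(c))   # 5.0
    sess.run(step)
    print(sess.run(c))   # 4.9 -- the "cost" itself moved; no model weights were touched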
I would like to implement a stopping condition based on the value of the gradient of the loss function w.r.t. the weights.
For example, let's say I have something like this:
optimizer = tf.train.AdamOptimizer()
grads_and_vars = optimizer.compute_gradients(a_loss_function)
train_op = optimizer.apply_gradients(grads_and_vars)
then I would like to run the graph with something like this:
for step in range(TotSteps):
    output = sess.run([input], feed_dict=some_dict)
    if grad_taken_in_some_way < some_treshold:
        print("Training finished.")
        break
I am not sure what I should pass to sess.run() in order to also get the gradient as output (besides all the other stuff I need). I am not even sure whether this is the correct approach or whether I should do it differently. I made some attempts but failed every time. I hope someone has some hints.
Thank you in advance!
EDIT: English correction
EDIT2: The answer by Iballes is exactly what I wanted to do. Still, I am not sure how to norm and sum all the gradients. Since I have different layers in my CNN and different weights with different shapes, if I just do what you suggested I get an error on the add_n() operation (since I am trying to add together matrices with different shapes). So probably I should do something like:
grad_norms = [tf.nn.l2_normalize(g[0], 0) for g in grads_and_vars]
grad_norm = [tf.reduce_sum(grads) for grads in grad_norms]
final_grad = tf.reduce_sum(grad_norm)
Can anyone confirm this?
Your line output = sess.run([input], feed_dict=some_dict) makes me think that you have a little misunderstanding of the sess.run command. What you call [input] is supposed to be a list of tensors that are to be fetched by the sess.run command. Hence, it is an output rather than an input. To tackle your question, let's assume that you are doing something like output = sess.run(loss, feed_dict=some_dict) instead (in order to monitor the training loss).
Also, I suppose you want to formulate your stopping criterion using the norm of the gradient (the gradient itself is a multi-dimensional quantity). Hence, what you want to do is to fetch the norm of the gradient each time you execute the graph. To that end, you have to do two things. 1) Add the gradient norm to the computation graph. 2) Fetch it in each call to sess.run in your training loop.
Ad 1) You have added the gradients to the graph via
optimizer = tf.train.AdamOptimizer()
grads_and_vars = optimizer.compute_gradients(a_loss_function)
and now have the tensors holding the gradients in grads_and_vars (one for each trained variable in the graph). Let's take the norm of each gradient and then sum it up:
grad_norms = [tf.nn.l2_loss(g) for g, v in grads_and_vars]  # tf.nn.l2_loss(g) is sum(g**2)/2, a convenient monotone stand-in for the norm
grad_norm = tf.add_n(grad_norms)
There you have your gradient norm.
Ad 2) Inside your loop, fetch the gradient norm alongside the loss by telling the sess.run command to do so.
for step in range(TotSteps):
    # also run the training op so the variables actually get updated
    _, l, gn = sess.run([train_op, loss, grad_norm], feed_dict=some_dict)
    if gn < some_treshold:
        print("Training finished.")
        break