I'm interested in deep learning and recently found out about TensorFlow. I installed it and followed the tutorial at https://www.tensorflow.org/get_started/get_started .
This is the code I came up with by following that tutorial:
import tensorflow as tf

# Model parameters and inputs
W = tf.Variable(0.3, tf.float32)
b = tf.Variable(-0.3, tf.float32)
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)

# Linear model and loss (sum of squared errors)
linear_model = W * x + b
squared_deltas = tf.square(linear_model - y)
loss = tf.reduce_sum(squared_deltas)

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

# Training
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
sess.run(init)
for i in range(1000):
    sess.run(train, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]})
print(sess.run([W, b]))
For the time being, I'm only interested in the code before the training, so as not to get overwhelmed.
Now, I understand (or at least I think I do) parts of this code. It produces the expected result when following the tutorial, but most lines of this code are confusing to me. It might be because I'm not familiar with the mathematics involved, but I don't know how much math is actually involved here, so it's hard to tell if that's the problem.
Anyway, I understand the first 6 lines.
Then there's this line:
squared_deltas = tf.square(linear_model - y)
As I understand it, it simply returns the square of (linear_model - y). However, y has no value yet.
Then, loss is assigned the value of tf.reduce_sum(squared_deltas). I understand that loss needs to be as low as possible.
How do I even interpret these last two lines?
I sort of understand tf.Session() and tf.global_variables_initializer() so I'm not too concerned with those two functions right now.
Bonus question: changing the value in the argument of tf.train.GradientDescentOptimizer() in either direction (increase or decrease) gives me the wrong result. How come 0.01 works when 0.1 or 0.001 doesn't?
I appreciate any help I can get!
Thanks
As I understand it, it simply returns the square of (linear_model - y). However, y has no value yet.
Then, loss is assigned the value of tf.reduce_sum(squared_deltas). I understand that loss needs to be as low as possible.
How do I even interpret these last two lines?
You clearly need to go through the TensorFlow documentation. You are missing the core idea behind TF: it defines a computational graph. At this point no computations are being performed, and you are right, there is no "y" yet, at least no values; it is just a symbolic variable (a placeholder). So we say that our loss will be the sum of squared differences between the predictions and the true values (y), but we are not providing those values yet. Actual values only start "living" inside a session; before that, this is just a graph of computations, a set of instructions for TF so it knows "what to anticipate".
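As a minimal sketch (reusing the names from your code, purely to illustrate when values appear), nothing is computed until the graph is actually run with concrete values:
print(squared_deltas)  # just a symbolic Tensor object, no numbers yet
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # only now do the placeholders x and y receive values and the graph is evaluated
    print(sess.run(loss, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]}))  # 23.66 with W=0.3, b=-0.3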
Bonus question: changing the value in the argument of tf.train.GradientDescentOptimizer() in either direction (increase or decrease) gives me the wrong result. How come 0.01 works when 0.1 or 0.001 doesn't?
Linear regression (which is what you are working with) converges if and only if the learning rate is small enough and you run enough iterations. 0.1 is probably just too big; 0.01 is fine, and so is 0.001, you simply need more than 1000 iterations for 0.001, but it will work (and so will any smaller value, just much more slowly).
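A quick way to see this, as a sketch that reuses the graph and training loop from your code: keep the smaller learning rate but raise the iteration count.
optimizer = tf.train.GradientDescentOptimizer(0.001)
train = optimizer.minimize(loss)
sess.run(tf.global_variables_initializer())
for i in range(10000):  # roughly 10x more steps than with 0.01
    sess.run(train, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]})
print(sess.run([W, b]))  # should approach [-1.0, 1.0]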
Related
After writing a simple neural network with numpy, I wanted to compare it numerically with a PyTorch implementation. Running on its own, my neural network implementation converges, so it seems to have no errors.
I have also checked that the forward pass matches PyTorch, so the basic setup is correct.
But something different happens in the backward pass, because the weights after one backpropagation step are different.
I don't want to post the full code here because it is spread over several .py files, and most of it is irrelevant to the question. I just want to know whether PyTorch does "basic" gradient descent or something different.
I am looking at the simplest case, the fully-connected weights of the last layer, because if that already differs, everything further back will differ too:
self.weight += self.learning_rate * hidden_layer.T.dot(output_delta)
where
output_delta = self.expected - self.output
self.expected is the expected (target) value,
self.output is the forward pass result.
No activation or further stuff here.
The PyTorch part is:
optimizer = torch.optim.SGD(nn.parameters(), lr=1.0)
criterion = torch.nn.MSELoss(reduction='sum')
output = nn.forward(x_train)
loss = criterion(output, y_train)
loss.backward()
optimizer.step()
optimizer.zero_grad()
So is it possible that with the SGD optimizer and MSELoss it uses some different delta or backpropagation rule, not the basic one mentioned above? If so, I'd like to know how to numerically check my numpy solution against PyTorch.
I just want to know whether PyTorch does "basic" gradient descent or something different.
If you set torch.optim.SGD, this means stochastic gradient descent.
There are different implementations of GD, but the one used in PyTorch is applied to mini-batches.
There are GD implementations that only update the parameters after a full epoch. As you may guess, they are very "slow", which may be fine when you have a lot of compute. There are also GD implementations that update after every single sample; as you may guess, their drawback is "huge" gradient fluctuations.
These are all relative terms, which is why I am using quotes.
Note that you are using a very large learning rate, lr = 1.0, which usually suggests the data has not been normalized first, but this is a feel you pick up over time.
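As a tiny, hedged sketch of the different flavours above in PyTorch terms (model, criterion, optimizer, and dataset are made-up names, and dataset is assumed to yield (input, target) pairs): the flavour of gradient descent is effectively decided by how much data you push through before each optimizer.step().
import torch

# batch_size = len(dataset) -> full-batch ("classic") gradient descent
# batch_size = 1            -> per-sample (online) SGD, very noisy updates
# batch_size = 32           -> the usual mini-batch compromise
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
for x_batch, y_batch in loader:
    optimizer.zero_grad()
    loss = criterion(model(x_batch), y_batch)
    loss.backward()
    optimizer.step()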
So is it possible that with the SGD optimizer and MSELoss it uses some different delta or backpropagation rule, not the basic one mentioned above?
It uses exactly what you told it to use.
Here is an example, in PyTorch and in plain Python, showing that the computed gradient (which is what backpropagation uses) behaves as expected:
import torch

x = torch.tensor([5.], requires_grad=True)
print(x)       # tensor([5.], requires_grad=True)
y = 3*x**2
y.backward()
print(x.grad)  # tensor([30.])
How would you get this value of 30 in plain Python?
def y(x):
    return 3*x**2

x = 5
e = 0.01  # small finite-difference step
g = (y(x+e) - y(x)) / e
print(g)  # 30.0299
As expected, we get roughly 30; a smaller step size (or a central difference, (y(x+e) - y(x-e)) / (2*e)) gives an even better approximation.
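To numerically check the numpy update against PyTorch for the specific layer in the question, a small standalone comparison like the following can help. This is only a sketch with made-up shapes, assuming a single bias-free fully-connected layer with no activation; the point is that MSELoss(reduction='sum') contributes an extra factor of 2 to the gradient compared with the plain (expected - output) delta, so the two updates differ by exactly that factor.
import numpy as np
import torch

np.random.seed(0)
hidden = np.random.randn(4, 3).astype(np.float32)   # activations feeding the last layer
target = np.random.randn(4, 2).astype(np.float32)   # expected values
W0 = np.random.randn(3, 2).astype(np.float32)       # last-layer weights
lr = 0.1

# Manual rule from the question: W += lr * hidden.T.dot(expected - output)
output_np = hidden.dot(W0)
W_manual = W0 + lr * hidden.T.dot(target - output_np)

# PyTorch: one SGD step with MSELoss(reduction='sum') on the same layer
W_t = torch.tensor(W0, requires_grad=True)
opt = torch.optim.SGD([W_t], lr=lr)
loss = torch.nn.MSELoss(reduction='sum')(torch.from_numpy(hidden) @ W_t,
                                         torch.from_numpy(target))
loss.backward()
opt.step()

# The sum-of-squares loss has gradient 2 * hidden.T.dot(output - target),
# so the PyTorch step goes in the same direction but is twice as large:
print(np.allclose(W_t.detach().numpy() - W0, 2 * (W_manual - W0), atol=1e-5))  # True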
The question is whether just changing the learning_rate argument of tf.train.AdamOptimizer actually results in any change in behaviour:
Let's say the code looks like this:
myLearnRate = 0.001
...
output = tf.someDataFlowGraph
trainLoss = tf.losses.someLoss(output)
trainStep = tf.train.AdamOptimizer(learning_rate=myLearnRate).minimize(trainLoss)

with tf.Session() as session:
    # first trainStep
    session.run(trainStep, feed_dict={input: someData, target: someTarget})
    myLearnRate = myLearnRate * 0.1
    # second trainStep
    session.run(trainStep, feed_dict={input: someData, target: someTarget})
Would the decreased myLearnRate now be applied in the second trainStep? That is, is the creation of the node trainStep only evaluated once:
trainStep = tf.train.AdamOptimizer(learning_rate=myLearnRate).minimize(trainLoss)
Or is it evaluated with every session.run(trainStep)? How could I have checked whether my AdamOptimizer in TensorFlow actually changed the learning rate?
Disclaimer 1: I'm aware that manually changing the learning rate is bad practice.
Disclaimer 2: I'm aware there is a similar question, but it was solved by feeding in a tensor as the learnRate, which is updated on every trainStep (here). That makes me lean towards assuming it would only work with a tensor as input for the learning_rate of AdamOptimizer, but I am neither sure of that nor do I understand the reasoning behind it.
The short answer is that no, your new learning rate is not applied. TF builds the graph when you define it, and changing something on the Python side afterwards will not translate into a change in the graph at run time. You can, however, feed a new learning rate into your graph pretty easily:
# Use a placeholder in the graph for your user-defined learning rate instead
learning_rate = tf.placeholder(tf.float32)
# ...
trainStep = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(trainLoss)

applied_rate = 0.001  # we will update this every training step

with tf.Session() as session:
    # first trainStep, feeding our applied rate to the graph
    session.run(trainStep, feed_dict={input: someData,
                                      target: someTarget,
                                      learning_rate: applied_rate})

    applied_rate *= 0.1  # update the rate we feed to the graph

    # second trainStep
    session.run(trainStep, feed_dict={input: someData,
                                      target: someTarget,
                                      learning_rate: applied_rate})
Yes, the optimizer is created only once:
tf.train.AdamOptimizer(learning_rate=myLearnRate)
It remembers the passed learning rate (in fact, it creates a tensor for it if you pass a floating point number), and your future changes to myLearnRate don't affect it.
Yes, you can create a placeholder and pass it to session.run(), if you really want to. But, as you said, it's pretty uncommon and probably means you are solving your original problem in the wrong way.
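If you would rather not pass the rate through feed_dict on every step, another option is to keep the learning rate in a non-trainable variable and assign to it between steps; AdamOptimizer accepts a tensor as its learning_rate. This is only a sketch reusing the hypothetical names from the question (TF1-style):
lr = tf.Variable(0.001, trainable=False, dtype=tf.float32)
trainStep = tf.train.AdamOptimizer(learning_rate=lr).minimize(trainLoss)
decay_lr = tf.assign(lr, lr * 0.1)  # op that shrinks the rate inside the graph

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    session.run(trainStep, feed_dict={input: someData, target: someTarget})
    session.run(decay_lr)  # the optimizer now reads the updated variable
    session.run(trainStep, feed_dict={input: someData, target: someTarget})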
I am new to TensorFlow and am working through the regression examples given here: tensorflow tutorials. Specifically, I am working on the 3rd one, "polynomial_regression.py".
I followed the linear regression example fine, and have now moved on to the polynomial regression.
However, I wanted to try substituting another data set for the made-up one in the example. I did this by exchanging
xs = np.asarray([3.3,4.4,5.5,6.71,6.93,4.168,9.779,6.182,7.59,2.167,
                 7.042,10.791,5.313,7.997,5.654,9.27,3.1], dtype=np.float32)
ys = np.asarray([1.7,2.76,-2.09,3.19,1.9,1.573,3.366,2.596,2.53,1.221,
                 2.827,-3.465,1.65,-2.1004,2.42,2.94,1.3], dtype=np.float32)
n_observations = xs.shape[0]
for
n_observations = 100
xs = np.linspace(-3, 3, n_observations)
ys = np.tan(xs) + np.random.uniform(-0.5, 0.5, n_observations)
I.e. the second block is what was given in the example, and I wanted to run the same training with the new xs, ys, and n_observations. These were the only lines I changed. I also tried changing the dtype of the arrays to float64, but this did not change the output.
The output I am getting (from print(training_cost)) is just a repeated nan. When I switch back to the original data, the network runs fine and generates a fitted function.
Thank you for any ideas!
NaNs can be caused by many things, usually some form of numerical instability. In your case, the substituted xs are much larger than the example's (up to about 10.8 instead of staying within [-3, 3]), so the higher-order polynomial terms and their gradients can get huge and the training diverges. Lowering the learning rate, normalizing the inputs, or using a more stable optimizer are good things to try.
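A minimal sketch of those suggestions, assuming the script keeps xs and ys as plain numpy arrays and builds its training step from a cost tensor (the names cost and train_op here are illustrative, not taken from the tutorial):
# standardize the inputs so the higher polynomial terms stay in a reasonable range
xs = (xs - xs.mean()) / xs.std()
ys = (ys - ys.mean()) / ys.std()

# and/or use a smaller learning rate, or a more forgiving optimizer
learning_rate = 0.001
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
# train_op = tf.train.AdamOptimizer(learning_rate).minimize(cost)  # alternative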
I would like to implement a stopping condition based on the value of the gradient of the loss function w.r.t. the weights.
For example, let's say I have something like this:
optimizer = tf.train.AdamOptimizer()
grads_and_vars = optimizer.compute_gradients(a_loss_function)
train_op = optimizer.apply_gradients(grads_and_vars)
then I would like to run the graph with something like this:
for step in range(TotSteps):
    output = sess.run([input], feed_dict=some_dict)
    if grad_taken_in_some_way < some_threshold:
        print("Training finished.")
        break
I am not sure what I should pass to sess.run() in order to also get the gradient as output (besides all the other things I need). I am not even sure whether this is the correct approach or whether I should do it differently. I made a few attempts but failed every time. I hope someone has some hints.
Thank you in advance!
EDIT: English correction
EDIT2: The answer by Iballes is exactly what I wanted to do. Still, I am not sure how to take the norm of all the gradients and sum them. Since I have different layers in my CNN and different weights with different shapes, if I just do what you suggested, I get an error on the add_n() operation (since I am trying to add together matrices with different shapes). So probably I should do something like:
grad_norms = [tf.nn.l2_normalize(g[0], 0) for g in grads_and_vars]
grad_norm = [tf.reduce_sum(grads) for grads in grad_norms]
final_grad = tf.reduce_sum(grad_norm)
Can anyone confirm this?
Your line output = sess.run([input], feed_dict=some_dict) makes me think that you have a little misunderstanding of the sess.run command. What you call [input] is supposed to be a list of tensors that are to be fetched by the sess.run command. Hence, it is an output rather than an input. To tackle your question, let's assume that you are doing something like output = sess.run(loss, feed_dict=some_dict) instead (in order to monitor the training loss).
Also, I suppose you want to formulate your stopping criterion using the norm of the gradient (the gradient itself is a multi-dimensional quantity). Hence, what you want to do is to fetch the norm of the gradient each time you execute the graph. To that end, you have to do two things. 1) Add the gradient norm to the computation graph. 2) Fetch it in each call to sess.run in your training loop.
Ad 1) You have added the gradients to the graph via
optimizer = tf.train.AdamOptimizer()
grads_and_vars = optimizer.compute_gradients(a_loss_function)
and now have the tensors holding the gradients in grads_and_vars (one for each trained variable in the graph). Let's take the norm of each gradient and then sum it up:
grad_norms = [tf.nn.l2_loss(g) for g, v in grads_and_vars]  # one scalar per variable, regardless of shape
grad_norm = tf.add_n(grad_norms)
There you have your gradient norm. (Strictly speaking, tf.nn.l2_loss returns half the sum of squares of its input, i.e. half the squared L2 norm, but that is fine for a threshold check. Because each l2_loss is a scalar, the add_n works even when your weight tensors have different shapes.)
Ad 2) Inside your loop, fetch the gradient norm alongside the loss by telling the sess.run command to do so.
for step in range(TotSteps):
    l, gn = sess.run([loss, grad_norm], feed_dict=some_dict)
    if gn < some_threshold:
        print("Training finished.")
        break
I'm trying to use TensorFlow's gradient descent optimizer to minimize the 2-dimensional Rosenbrock function, but when I run the program, the optimizer sometimes heads off towards infinity. At other times, without changing anything, it finds the right neighbourhood but does not pinpoint the optimal solution.
My code is as follows:
import tensorflow as tf

x1_data = tf.Variable(initial_value=tf.random_uniform([1], -10, 10), name='x1')
x2_data = tf.Variable(initial_value=tf.random_uniform([1], -10, 10), name='x2')

# Loss function
y = tf.add(tf.pow(tf.sub(1.0, x1_data), 2.0),
           tf.mul(100.0, tf.pow(tf.sub(x2_data, tf.pow(x1_data, 2.0)), 2.0)), 'y')

opt = tf.train.GradientDescentOptimizer(0.0035)
train = opt.minimize(y)

sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)

for step in xrange(200):
    sess.run(train)
    if step % 10 == 0:
        print(step, sess.run(x1_data), sess.run(x2_data), sess.run(y))
The Rosenbrock function is defined as y = (1 - x1)^2 + 100 * (x2 - x1^2)^2, with the global minimum at x1 = x2 = 1.
What am I doing wrong here? Or have I completely misunderstood how to use TensorFlow?
If you decrease the range of the initial x1/x2 (e.g. use -3/3 instead of -10/10) and decrease the learning rate by a factor of 10, it shouldn't blow up as often. Decreasing the learning rate when you see things diverging is often a good thing to try.
Also, the function you're optimizing is designed to make the global minimum hard to find, so it's no surprise that it finds the valley but not the global optimum ;)
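As a concrete sketch of those two changes (only the affected lines from the question's code; 0.00035 is simply the original 0.0035 divided by 10):
x1_data = tf.Variable(initial_value=tf.random_uniform([1], -3, 3), name='x1')
x2_data = tf.Variable(initial_value=tf.random_uniform([1], -3, 3), name='x2')
# ...
opt = tf.train.GradientDescentOptimizer(0.00035)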
Yes, as etarion says, this is an optimization problem; your TensorFlow code is fine.
One way to make sure the gradients never explode is to clip them in the range [-10., 10.] for instance:
opt = tf.train.GradientDescentOptimizer(0.0001)
grads_and_vars = opt.compute_gradients(y, [x1_data, x2_data])
clipped_grads_and_vars = [(tf.clip_by_value(g, -10., 10.), v) for g, v in grads_and_vars]
train = opt.apply_gradients(clipped_grads_and_vars)
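The rest of the script can stay as in the question; the clipped train op is run in the same loop as before (with the lower learning rate you may want a few more steps), roughly:
sess.run(tf.initialize_all_variables())
for step in xrange(2000):
    sess.run(train)
    if step % 100 == 0:
        print(step, sess.run(x1_data), sess.run(x2_data), sess.run(y))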