Tensorflow minimise with respect to only some elements of a variable - python

Is it possible to minimise a loss function by changing only some elements of a variable? In other words, if I have a variable X of length 2, how can I minimise my loss function by changing X[0] and keeping X[1] constant?
Hopefully this code I have attempted will describe my problem:
import tensorflow as tf
import tensorflow.contrib.opt as opt

X = tf.Variable([1.0, 2.0])
X0 = tf.Variable([3.0])
Y = tf.constant([2.0, -3.0])

scatter = tf.scatter_update(X, [0], X0)
with tf.control_dependencies([scatter]):
    loss = tf.reduce_sum(tf.squared_difference(X, Y))

opt = opt.ScipyOptimizerInterface(loss, [X0])
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    opt.minimize(sess)
    print("X: {}".format(X.eval()))
    print("X0: {}".format(X0.eval()))
which outputs:
INFO:tensorflow:Optimization terminated with:
Message: b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
Objective function value: 26.000000
Number of iterations: 0
Number of functions evaluations: 1
X: [3. 2.]
X0: [3.]
where I would like to find the optimal value of X0 = 2 and thus X = [2, 2].
edit
Motivation for doing this: I would like to import a trained graph/model and then tweak various elements of some of the variables depending on some new data I have.

You can use this trick to restrict the gradient calculation to one index:
import tensorflow as tf
import tensorflow.contrib.opt as opt

X = tf.Variable([1.0, 2.0])
part_X = tf.scatter_nd([[0]], [X[0]], [2])
X_2 = part_X + tf.stop_gradient(-part_X + X)
Y = tf.constant([2.0, -3.0])
loss = tf.reduce_sum(tf.squared_difference(X_2, Y))

opt = opt.ScipyOptimizerInterface(loss, [X])
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    opt.minimize(sess)
    print("X: {}".format(X.eval()))
part_X holds only the entry you want to change, embedded in a one-hot-style vector with the same shape as X. part_X + tf.stop_gradient(-part_X + X) is the same as X in the forward pass, since part_X - part_X is 0. However, in the backward pass tf.stop_gradient blocks gradients from flowing to the entries you want to keep fixed.
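The same trick should generalize to several indices; a minimal sketch, with the index set and shapes chosen purely for illustration:
import tensorflow as tf
X = tf.Variable([1.0, 2.0, 3.0, 4.0])
idx = [[0], [2]]                                    # entries allowed to receive gradients
part_X = tf.scatter_nd(idx, tf.gather_nd(X, idx), [4])
X_masked = part_X + tf.stop_gradient(-part_X + X)   # equals X in the forward pass, gradients only at idx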

I'm not sure whether it is possible with the SciPy optimizer interface, but using one of the regular tf.train.Optimizer subclasses you can do it by calling compute_gradients first, then masking the gradients, and then calling apply_gradients, instead of calling minimize (which, as the docs say, basically calls the previous two).
import tensorflow as tf

X = tf.Variable([3.0, 2.0])
# Select updatable parameters
X_mask = tf.constant([True, False], dtype=tf.bool)
Y = tf.constant([2.0, -3.0])
loss = tf.reduce_sum(tf.squared_difference(X, Y))

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
# Get gradients and mask them
((X_grad, _),) = opt.compute_gradients(loss, var_list=[X])
X_grad_masked = X_grad * tf.cast(X_mask, dtype=X_grad.dtype)
# Apply masked gradients
train_step = opt.apply_gradients([(X_grad_masked, X)])

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for i in range(10):
        _, X_val = sess.run([train_step, X])
        print("Step {}: X = {}".format(i, X_val))
    print("Final X = {}".format(X.eval()))
Output:
Step 0: X = [ 2.79999995 2. ]
Step 1: X = [ 2.63999987 2. ]
Step 2: X = [ 2.51199985 2. ]
Step 3: X = [ 2.40959978 2. ]
Step 4: X = [ 2.32767987 2. ]
Step 5: X = [ 2.26214385 2. ]
Step 6: X = [ 2.20971513 2. ]
Step 7: X = [ 2.16777205 2. ]
Step 8: X = [ 2.13421774 2. ]
Step 9: X = [ 2.10737419 2. ]
Final X = [ 2.10737419 2. ]

This should be pretty easy to do by using the var_list parameter of the minimize function.
trainable_var = X[0]
train_op = tf.train.GradientDescentOptimizer(learning_rate=1e-3).minimize(loss, var_list=[trainable_var])
You should note that by convention all trainable variables are added to the tensorflow default collection GraphKeys.TRAINABLE_VARIABLES, so you can get a list of all trainable variables using:
all_trainable_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
This is just a list of variables which you can manipulate as you see fit and use as the var_list parameter.
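For example, here is a hedged sketch that trains only variables from a hypothetical scope named "layer1" (the scope name and loss are placeholders):
all_trainable_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
vars_to_train = [v for v in all_trainable_vars if v.name.startswith('layer1')]
train_op = tf.train.GradientDescentOptimizer(1e-3).minimize(loss, var_list=vars_to_train)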
As a tangent to your question, if you ever want to take customizing the optimization process a step further, you can also compute the gradients manually using grads = tf.gradients(loss, var_list), manipulate the gradients as you see fit, and then call tf.train.GradientDescentOptimizer(...).apply_gradients(grads_and_vars_as_list_of_tuples). Under the hood, minimize is just doing these two steps for you.
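A minimal sketch of those two steps, assuming loss and var_list are already defined in your graph:
opt = tf.train.GradientDescentOptimizer(learning_rate=1e-3)
grads = tf.gradients(loss, var_list)
# ... manipulate grads here (mask, clip, rescale, ...) ...
train_op = opt.apply_gradients(zip(grads, var_list))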
Also note that you are perfectly free to create different optimizers for different collections of variables. You could create an SGD optimizer with learning rate 1e-4 for some variables, and another Adam optimizer with learning rate 1e-2 for another set of variables. Not that there's any specific use case for this, I'm just pointing out the flexibility you now have.

The answer by Oren in the second link below calls a function (defined in the first link) that takes a Boolean hot matrix of the parameters to optimize and the tensor of parameters. It uses stop_gradient and works like a charm for a neural network I developed.
Update only part of the word embedding matrix in Tensorflow
https://github.com/tensorflow/tensorflow/issues/9162
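For reference, a hedged sketch of a helper in that spirit (the exact function in the linked issue may differ):
import tensorflow as tf
def entry_stop_gradients(target, mask):
    # Returns a tensor equal to `target` in the forward pass, but whose
    # gradient flows only where the boolean `mask` is True.
    mask_f = tf.cast(mask, target.dtype)
    return tf.stop_gradient((1.0 - mask_f) * target) + mask_f * target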

Related

How to update weights with TensorFlow Eager Execution?

So I tried TensorFlow's eager execution and my implementation of it wasn't successful. I used GradientTape, and while the program runs, there is no visible update in any of the weights. I've seen some sample algorithms and tutorials using optimizer.apply_gradients() in order to update all variables, but I'm assuming I'm not using it properly.
import tensorflow as tf
import tensorflow.contrib.eager as tfe

# enabling eager execution
tf.enable_eager_execution()

# establishing hyperparameters
LEARNING_RATE = 20
TRAINING_ITERATIONS = 3

# establishing all LABELS
LABELS = tf.constant(tf.random_normal([3, 1]))
# print(LABELS)

# stub statement for input
init = tf.Variable(tf.random_normal([3, 1]))

# declare and initialize all weights
weight1 = tfe.Variable(tf.random_normal([2, 3]))
bias1 = tfe.Variable(tf.random_normal([2, 1]))
weight2 = tfe.Variable(tf.random_normal([3, 2]))
bias2 = tfe.Variable(tf.random_normal([3, 1]))
weight3 = tfe.Variable(tf.random_normal([2, 3]))
bias3 = tfe.Variable(tf.random_normal([2, 1]))
weight4 = tfe.Variable(tf.random_normal([3, 2]))
bias4 = tfe.Variable(tf.random_normal([3, 1]))
weight5 = tfe.Variable(tf.random_normal([3, 3]))
bias5 = tfe.Variable(tf.random_normal([3, 1]))

VARIABLES = [weight1, bias1, weight2, bias2, weight3, bias3, weight4, bias4, weight5, bias5]

def thanouseEyes(input):  # nn model aka: Thanouse's Eyes
    layerResult = tf.nn.relu(tf.matmul(weight1, input) + bias1)
    input = layerResult
    layerResult = tf.nn.relu(tf.matmul(weight2, input) + bias2)
    input = layerResult
    layerResult = tf.nn.relu(tf.matmul(weight3, input) + bias3)
    input = layerResult
    layerResult = tf.nn.relu(tf.matmul(weight4, input) + bias4)
    input = layerResult
    layerResult = tf.nn.softmax(tf.matmul(weight5, input) + bias5)
    return layerResult

# Begin training and update variables
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)

with tf.GradientTape(persistent=True) as tape:  # gradient calculation
    for i in range(TRAINING_ITERATIONS):
        COST = tf.reduce_sum(LABELS - thanouseEyes(init))
        GRADIENTS = tape.gradient(COST, VARIABLES)
        optimizer.apply_gradients(zip(GRADIENTS, VARIABLES))

print(weight1)
The usage of the optimizer seems fine; however, the computation defined by thanouseEyes() will always return [1., 1., 1.] irrespective of the variables, so the gradients are always 0 and the variables will never be updated (print(thanouseEyes(init)) and print(GRADIENTS) should demonstrate that).
Digging in a bit more, tf.nn.softmax is applied to x = tf.matmul(weight5, input) + bias5, which has a shape of [3, 1]. So tf.nn.softmax(x) is effectively computing [softmax(x[0]), softmax(x[1]), softmax(x[2])], as tf.nn.softmax applies (by default) to the last axis of the input. x[0], x[1], and x[2] are vectors with one element, so softmax(x[i]) will always be 1.0.
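A standalone sketch (assuming TF 1.x with eager execution enabled) showing why softmax over a [3, 1] tensor is all ones:
import tensorflow as tf
tf.enable_eager_execution()
x = tf.constant([[0.3], [2.0], [-1.5]])   # shape [3, 1]
print(tf.nn.softmax(x))                   # softmax over the last axis -> [[1.], [1.], [1.]]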
Hope that helps.
Some additional points unrelated to your question that you may be interested in:
As of TensorFlow 1.11, you don't need the tf.contrib.eager module in your program. Replace all occurrences of tfe with tf (i.e., tf.Variable instead of tfe.Variable) and you'll get the same result
Computation performed inside the context of a GradientTape is "recorded", i.e., it holds on to intermediate tensors so that gradients can be computed later on. Long story short, you'd want to move the GradientTape inside the loop body:
for i in range(TRAINING_ITERATIONS):
    with tf.GradientTape() as tape:
        COST = tf.reduce_sum(LABELS - thanouseEyes(init))
    GRADIENTS = tape.gradient(COST, VARIABLES)
    optimizer.apply_gradients(zip(GRADIENTS, VARIABLES))

How to train a simple linear regression model with SGD in pytorch successfully?

I was trying to train a simple polynomial linear regression model in PyTorch with SGD. I wrote some self-contained (what I thought would be extremely simple) code; however, for some reason my model does not train as I thought it should.
I have 5 points sampled from a sine curve and try to fit them with a polynomial of degree 4. This is a convex problem, so GD or SGD should eventually find a solution with zero train error as long as we have enough iterations and a small enough step size. For some reason, however, my model does not train well (even though it seems to be changing the parameters of the model). Anyone have an idea why? Here is the code (I tried making it self-contained and minimal):
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
import torch
from torch.autograd import Variable

from maps import NamedDict
from plotting_utils import *

def index_batch(X,batch_indices,dtype):
    '''
    returns the batch indexed/sliced batch
    '''
    if len(X.shape) == 1: # i.e. dimension (M,) just a vector
        batch_xs = torch.FloatTensor(X[batch_indices]).type(dtype)
    else:
        batch_xs = torch.FloatTensor(X[batch_indices,:]).type(dtype)
    return batch_xs

def get_batch2(X,Y,M,dtype):
    '''
    get batch for pytorch model
    '''
    # TODO fix and make it nicer, there is pytorch forum question
    X,Y = X.data.numpy(), Y.data.numpy()
    N = len(Y)
    valid_indices = np.array( range(N) )
    batch_indices = np.random.choice(valid_indices,size=M,replace=False)
    batch_xs = index_batch(X,batch_indices,dtype)
    batch_ys = index_batch(Y,batch_indices,dtype)
    return Variable(batch_xs, requires_grad=False), Variable(batch_ys, requires_grad=False)

def get_sequential_lifted_mdl(nb_monomials,D_out, bias=False):
    return torch.nn.Sequential(torch.nn.Linear(nb_monomials,D_out,bias=bias))

def train_SGD(mdl, M,eta,nb_iter,logging_freq ,dtype, X_train,Y_train):
    ##
    N_train,_ = tuple( X_train.size() )
    #print(N_train)
    for i in range(nb_iter):
        # Forward pass: compute predicted Y using operations on Variables
        batch_xs, batch_ys = get_batch2(X_train,Y_train,M,dtype) # [M, D], [M, 1]
        ## FORWARD PASS
        y_pred = mdl.forward(batch_xs)
        ## LOSS + Regularization
        batch_loss = (1/M)*(y_pred - batch_ys).pow(2).sum()
        ## BACKWARD PASS
        batch_loss.backward() # Use autograd to compute the backward pass. Now w will have gradients
        ## SGD update
        for W in mdl.parameters():
            delta = eta*W.grad.data
            W.data.copy_(W.data - delta)
        ## train stats
        if i % (nb_iter/10) == 0 or i == 0:
            current_train_loss = (1/N_train)*(mdl.forward(X_train) - Y_train).pow(2).sum().data.numpy()
            print('i = {}, current_loss = {}'.format(i, current_train_loss ) )
        ## Manually zero the gradients after updating weights
        mdl.zero_grad()

##
logging_freq = 100
dtype = torch.FloatTensor
## SGD params
M = 3
eta = 0.0002
nb_iter = 20*1000
##
lb,ub = 0,1
f_target = lambda x: np.sin(2*np.pi*x)
N_train = 5
X_train = np.linspace(lb,ub,N_train)
Y_train = f_target(X_train)
## degree of mdl
Degree_mdl = 4
## pseudo-inverse solution
c_pinv = np.polyfit( X_train, Y_train , Degree_mdl )[::-1]
## linear mdl to train with SGD
nb_terms = c_pinv.shape[0]
mdl_sgd = get_sequential_lifted_mdl(nb_monomials=nb_terms,D_out=1, bias=False)
## Make polynomial Kernel
poly_feat = PolynomialFeatures(degree=Degree_mdl)
Kern_train = poly_feat.fit_transform(X_train.reshape(N_train,1))
Kern_train_pt, Y_train_pt = Variable(torch.FloatTensor(Kern_train).type(dtype), requires_grad=False), Variable(torch.FloatTensor(Y_train).type(dtype), requires_grad=False)
train_SGD(mdl_sgd, M,eta,nb_iter,logging_freq ,dtype, Kern_train_pt,Y_train_pt)
The error seems to hover around 2:
i = 0, current_loss = [ 2.08996224]
i = 2000, current_loss = [ 2.03536892]
i = 4000, current_loss = [ 2.02014995]
i = 6000, current_loss = [ 2.01307297]
i = 8000, current_loss = [ 2.01300406]
i = 10000, current_loss = [ 2.01125693]
i = 12000, current_loss = [ 2.01162267]
i = 14000, current_loss = [ 2.01296973]
i = 16000, current_loss = [ 2.00951076]
i = 18000, current_loss = [ 2.00967121]
which is weird because it should be able to reach zero.
I also plotted the learned function:
the code for the plotting:
##
x_horizontal = np.linspace(lb,ub,1000).reshape(1000,1)
X_plot = poly_feat.fit_transform(x_horizontal)
X_plot_pytorch = Variable( torch.FloatTensor(X_plot), requires_grad=False)
##
fig1 = plt.figure()
# plot objs
p_sgd, = plt.plot(x_horizontal, [ float(f_val) for f_val in mdl_sgd.forward(X_plot_pytorch).data.numpy() ])
p_pinv, = plt.plot(x_horizontal, np.dot(X_plot,c_pinv))
p_data, = plt.plot(X_train,Y_train,'ro')
## legend
nb_terms = c_pinv.shape[0]
legend_mdl = f'SGD solution standard parametrization, number of monomials={nb_terms}, batch-size={M}, iterations={nb_iter}, step size={eta}'
plt.legend(
    [p_sgd,p_pinv,p_data],
    [legend_mdl,f'linear algebra soln, number of monomials={nb_terms}',f'data points = {N_train}']
)
##
plt.xlabel('x'), plt.ylabel('f(x)')
plt.show()
I actually went ahead and implemented a TensorFlow version. That one does seem to train the model. I tried having both of them match by giving them the same initialization:
mdl_sgd[0].weight.data.fill_(0)
but that still didn't work. Tensorflow code:
graph = tf.Graph()
with graph.as_default():
    X = tf.placeholder(tf.float32, [None, nb_terms])
    Y = tf.placeholder(tf.float32, [None,1])
    w = tf.Variable( tf.zeros([nb_terms,1]) )
    #w = tf.Variable( tf.truncated_normal([Degree_mdl,1],mean=0.0,stddev=1.0) )
    #w = tf.Variable( 1000*tf.ones([Degree_mdl,1]) )
    ##
    f = tf.matmul(X,w) # [N,1] = [N,D] x [D,1]
    #loss = tf.reduce_sum(tf.square(Y - f))
    loss = tf.reduce_sum( tf.reduce_mean(tf.square(Y-f), 0))
    l2loss_tf = (1/N_train)*2*tf.nn.l2_loss(Y-f)
    ##
    learning_rate = eta
    #global_step = tf.Variable(0, trainable=False)
    #learning_rate = tf.train.exponential_decay(learning_rate=eta, global_step=global_step,decay_steps=nb_iter/2, decay_rate=1, staircase=True)
    train_step = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(loss)

with tf.Session(graph=graph) as sess:
    Y_train = Y_train.reshape(N_train,1)
    tf.global_variables_initializer().run()
    # Train
    for i in range(nb_iter):
        #if i % (nb_iter/10) == 0:
        if i % (nb_iter/10) == 0 or i == 0:
            current_loss = sess.run(fetches=loss, feed_dict={X: Kern_train, Y: Y_train})
            print(f'i = {i}, current_loss = {current_loss}')
        ## train
        batch_xs, batch_ys = get_batch(Kern_train,Y_train,M)
        sess.run(train_step, feed_dict={X: batch_xs, Y: batch_ys})
I also tried changing the initialization, but it didn't change anything, which makes sense because it shouldn't make a big difference:
mdl_sgd[0].weight.data.normal_(mean=0,std=0.001)
Original post:
https://discuss.pytorch.org/t/how-to-train-a-simple-linear-regression-model-with-sgd-in-pytorch-successfully/9620
This is how it should look:
SOLUTION:
It seems there was an issue with the result being returned as a vector instead of a number, which caused the problem; i.e. the following code fixed things:
y_pred = model(batch_xs).view(-1) # change this to "y_pred = model(batch_xs)" to get the incorrect results
loss = (y_pred - batch_ys).pow(2).mean()
which seems completely mysterious to me. Does someone know why this fixed the issue? It just seems like magic.
The bug is really subtle, but essentially it's because PyTorch follows NumPy broadcasting rules. When you subtract a (3,) array from a (3, 1) column vector, broadcasting produces a (3, 3) matrix (note this wouldn't happen when subtracting a (3,) array from a (1, 3) row vector; 1-D arrays are effectively treated as row vectors). This is really bad because it means we compute the matrix of all pairwise differences between every label and every prediction. Of course this is nonsensical and produces a bug, because we don't want the prediction for one data point to be compared against every other label in the data set.
So the answer is simply to avoid the unintended NumPy-style broadcasting, either by reshaping things during training or before the data is fed. Either one should work.
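A minimal PyTorch sketch of the trap:
import torch
y_pred = torch.zeros(3, 1)                 # column vector, shape (3, 1)
batch_ys = torch.ones(3)                   # plain vector, shape (3,)
diff = y_pred - batch_ys                   # broadcasts to shape (3, 3), not (3,)
print(diff.shape)                          # torch.Size([3, 3])
loss = (y_pred.view(-1) - batch_ys).pow(2).mean()   # shapes match -> the intended scalar loss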
To catch the error, one can use this check:
def check_vectors_have_same_dimensions(Y,Y_):
    '''
    Checks that vectors Y and Y_ have the same dimensions. If they don't,
    then there might be an error caused by wrong broadcasting.
    '''
    DY = tuple( Y.size() )
    DY_ = tuple( Y_.size() )
    if len(DY) != len(DY_):
        return True
    for i in range(len(DY)):
        if DY[i] != DY_[i]:
            return True
    return False
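A hypothetical usage inside the training loop sketched above (names taken from that code):
y_pred = mdl.forward(batch_xs)
if check_vectors_have_same_dimensions(y_pred, batch_ys):
    raise ValueError('y_pred and batch_ys have different shapes; broadcasting would produce a matrix of pairwise differences')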

Calculation of cost function by Iteration using Tensorflow

I have the following code excerpt for calculating the cost function iteratively. Before that, feature scaling, reshaping, the LSTM, and training have been set up, and the same set of data and variables is used to calculate the cost function.
# learning parameter
learning_rate = 0.01

# iterative parameters
EPOCHS = 1000     # number of iterations
PRINT_STEP = 100  # the interval of printing validation result

# read data and data preprocessing
read_data_pd = pd.read_csv('./price.csv')
input_pd = read_data_pd.drop(['year','month','day'], axis=1)
temp_pd = feature_scaling(input_pd[feature_to_scale],sacling_meathod) # call the feature scaling function
input_pd[feature_to_scale] = temp_pd

x_ = tf.placeholder(tf.float32, [None, batch_size, n_features])
y_ = tf.placeholder(tf.float32, [None, 1])

# call the lstm-rnn function
lstm_output = lstm(x_, n_features, batch_size, n_lstm_layers, lstm_scope_name)

# linear regressor
# W is the weight vector
W = tf.Variable(tf.random_normal([n_features, n_pred_class]))
# b is the bias
b = tf.Variable(tf.random_normal([n_pred_class]))
# Y = WX + b
y = tf.matmul(lstm_output, W) + b

# define the cost function
cost_func = tf.reduce_mean(tf.square(y - y_))
train_op = tf.train.AdamOptimizer(learning_rate).minimize(cost_func)

# initialize all variables
init = tf.initialize_all_variables()
with tf.Session() as sess:
    sess.run(init)
    for ii in range(EPOCHS):
        sess.run(train_op, feed_dict={x_:train_input_nparr, y_:train_target_nparr})
        if ii % PRINT_STEP == 0:
            cost = sess.run(cost_func, feed_dict={x_:train_input_nparr, y_:train_target_nparr})
            print 'iteration =', ii, 'training cost:', cost
When I run the program, printing the calculated cost works. Yet each time the program runs, the results are different. For example, at iteration 100 it sometimes prints 0.868856 but other times 0.905526, which suggests the code has some problem.
One thing I noticed is the line that initializes all variables, tf.initialize_all_variables(), since a message said:
initialize_all_variables is deprecated and will be removed after 2017-03-02.
Instructions for updating: Use `tf.global_variables_initializer` instead.
I followed the instruction, but it did not fix the inconsistent results.
Therefore, I would like to know what is wrong with the codes and also how to correct it?

How to implement multivariate linear stochastic gradient descent algorithm in tensorflow?

I started with a simple implementation of single-variable linear regression with gradient descent, but I don't know how to extend it to a multivariate stochastic gradient descent algorithm.
Single variable linear regression
import tensorflow as tf
import numpy as np

# create random data
x_data = np.random.rand(100).astype(np.float32)
y_data = x_data * 0.5

# Find values for W that compute y_data = W * x_data
W = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
y = W * x_data

# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(y - y_data))
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

# Before starting, initialize the variables
init = tf.initialize_all_variables()

# Launch the graph.
sess = tf.Session()
sess.run(init)

# Fit the line.
for step in xrange(2001):
    sess.run(train)
    if step % 200 == 0:
        print(step, sess.run(W))
You have two parts in your question:
How to change this problem to a higher dimension space.
How to change from the batch gradient descent to a stochastic gradient descent.
To get a higher-dimensional setting, you can define your linear problem as y = <x, w>. Then you just need to change the dimension of your Variable W to match that of w and replace the multiplication W*x_data by a matrix product tf.matmul(x_data, W), and your code should run just fine.
To change the learning method to a stochastic gradient descent, you need to abstract the input of your cost function by using tf.placeholder.
Once you have defined X and y_ to hold your input at each step, you can construct the same cost function. Then, you need to call your step by feeding the proper mini-batch of your data.
Here is an example of how you could implement such behavior and it should show that W quickly converges to w.
import tensorflow as tf
import numpy as np

# Define dimensions
d = 10    # Size of the parameter space
N = 1000  # Number of data samples

# create random data
w = .5*np.ones(d)
x_data = np.random.random((N, d)).astype(np.float32)
y_data = x_data.dot(w).reshape((-1, 1))

# Define placeholders to feed mini_batches
X = tf.placeholder(tf.float32, shape=[None, d], name='X')
y_ = tf.placeholder(tf.float32, shape=[None, 1], name='y')

# Find values for W that compute y_data = <x, W>
W = tf.Variable(tf.random_uniform([d, 1], -1.0, 1.0))
y = tf.matmul(X, W, name='y_pred')

# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(y_ - y))
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

# Before starting, initialize the variables
init = tf.initialize_all_variables()

# Launch the graph.
sess = tf.Session()
sess.run(init)

# Fit the line.
mini_batch_size = 100
n_batch = N // mini_batch_size + (N % mini_batch_size != 0)
for step in range(2001):
    i_batch = (step % n_batch)*mini_batch_size
    batch = x_data[i_batch:i_batch+mini_batch_size], y_data[i_batch:i_batch+mini_batch_size]
    sess.run(train, feed_dict={X: batch[0], y_: batch[1]})
    if step % 200 == 0:
        print(step, sess.run(W))
Two side notes:
The implementation above is called mini-batch gradient descent, as at each step the gradient is computed using a subset of the data of size mini_batch_size. This is a variant of stochastic gradient descent that is usually used to stabilize the estimate of the gradient at each step. Stochastic gradient descent can be obtained by setting mini_batch_size = 1.
The dataset can be shuffled at every epoch to get an implementation closer to the theoretical setting; a minimal sketch follows below. Some recent work also considers using only a single pass through the dataset, as it prevents over-fitting. For a more mathematical and detailed explanation, see Bottou12. This can easily be changed according to your problem setup and the statistical properties you are looking for.
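A hedged sketch of per-epoch shuffling with NumPy, reusing the names defined above (n_epochs is illustrative):
import numpy as np
for epoch in range(n_epochs):
    perm = np.random.permutation(N)
    x_shuffled, y_shuffled = x_data[perm], y_data[perm]
    for b in range(n_batch):
        i = b * mini_batch_size
        sess.run(train, feed_dict={X: x_shuffled[i:i+mini_batch_size],
                                   y_: y_shuffled[i:i+mini_batch_size]})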

How to set layer-wise learning rate in Tensorflow?

I am wondering if there is a way to use different learning rates for different layers, like what is in Caffe. I am trying to modify a pre-trained model and use it for other tasks. What I want is to speed up the training for the newly added layers and keep the trained layers at a low learning rate in order to prevent them from being distorted. For example, I have a 5-conv-layer pre-trained model. Now I add a new conv layer and fine-tune it. The first 5 layers would have a learning rate of 0.00001 and the last one would have 0.001. Any idea how to achieve this?
It can be achieved quite easily with 2 optimizers:
var_list1 = [variables from first 5 layers]
var_list2 = [the rest of variables]
train_op1 = GradientDescentOptimizer(0.00001).minimize(loss, var_list=var_list1)
train_op2 = GradientDescentOptimizer(0.0001).minimize(loss, var_list=var_list2)
train_op = tf.group(train_op1, train_op2)
One disadvantage of this implementation is that it computes tf.gradients(.) twice inside the optimizers and thus it might not be optimal in terms of execution speed. This can be mitigated by explicitly calling tf.gradients(.), splitting the list into 2 and passing corresponding gradients to both optimizers.
Related question: Holding variables constant during optimizer
EDIT: Added more efficient but longer implementation:
var_list1 = [variables from first 5 layers]
var_list2 = [the rest of variables]
opt1 = tf.train.GradientDescentOptimizer(0.00001)
opt2 = tf.train.GradientDescentOptimizer(0.0001)
grads = tf.gradients(loss, var_list1 + var_list2)
grads1 = grads[:len(var_list1)]
grads2 = grads[len(var_list1):]
train_op1 = opt1.apply_gradients(zip(grads1, var_list1))
train_op2 = opt2.apply_gradients(zip(grads2, var_list2))
train_op = tf.group(train_op1, train_op2)
You can use tf.trainable_variables() to get all training variables and decide to select from them.
The difference is that in the first implementation tf.gradients(.) is called twice inside the optimizers. This may cause some redundant operations to be executed (e.g. gradients on the first layer can reuse some computations for the gradients of the following layers).
Tensorflow 1.7 introduced tf.custom_gradient that greatly simplifies setting learning rate multipliers, in a way that is now compatible with any optimizer, including those accumulating gradient statistics. For example,
import tensorflow as tf

def lr_mult(alpha):
    @tf.custom_gradient
    def _lr_mult(x):
        def grad(dy):
            return dy * alpha * tf.ones_like(x)
        return x, grad
    return _lr_mult

x0 = tf.Variable(1.)
x1 = tf.Variable(1.)
loss = tf.square(x0) + tf.square(lr_mult(0.1)(x1))

step = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
tf.local_variables_initializer().run()

for _ in range(5):
    sess.run([step])
    print(sess.run([x0, x1, loss]))
Update Jan 22: the recipe below is only a good idea for GradientDescentOptimizer. Other optimizers that keep a running average apply the learning rate before the parameter update, so the recipe below won't affect that part of the equation.
In addition to Rafal's approach, you could use the compute_gradients / apply_gradients interface of Optimizer. For instance, here's a toy network where I use 2x the learning rate for the second parameter:
x = tf.Variable(tf.ones([]))
y = tf.Variable(tf.zeros([]))
loss = tf.square(x-y)
global_step = tf.Variable(0, name="global_step", trainable=False)

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
grads_and_vars = opt.compute_gradients(loss, [x, y])
ygrad, _ = grads_and_vars[1]
train_op = opt.apply_gradients([grads_and_vars[0], (ygrad*2, y)], global_step=global_step)

init_op = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init_op)
for i in range(5):
    sess.run([train_op, loss, global_step])
    print(sess.run([x, y]))
You should see
[0.80000001, 0.40000001]
[0.72000003, 0.56]
[0.68800002, 0.62400001]
[0.67520005, 0.64960003]
[0.67008007, 0.65984005]
Collect learning rate multipliers for each variable like:
self.lr_multipliers[var.op.name] = lr_mult
and then apply them before applying the gradients, like:
def _train_op(self):
    tf.scalar_summary('learning_rate', self._lr_placeholder)
    opt = tf.train.GradientDescentOptimizer(self._lr_placeholder)
    grads_and_vars = opt.compute_gradients(self._loss)
    grads_and_vars_mult = []
    for grad, var in grads_and_vars:
        grad *= self._network.lr_multipliers[var.op.name]
        grads_and_vars_mult.append((grad, var))
        tf.histogram_summary('variables/' + var.op.name, var)
        tf.histogram_summary('gradients/' + var.op.name, grad)
    return opt.apply_gradients(grads_and_vars_mult)
You can find the whole example here.
A slight variation of Sergey Demyanov's answer, where you only have to specify the learning rates you would like to change:
from collections import defaultdict

self.learning_rates = defaultdict(lambda: 1.0)
...
x = tf.layers.Dense(3)(x)
self.learning_rates[x.op.name] = 2.0
...
optimizer = tf.train.MomentumOptimizer(learning_rate=1e-3, momentum=0.9)
grads_and_vars = optimizer.compute_gradients(loss)
grads_and_vars_mult = []
for grad, var in grads_and_vars:
    grad *= self.learning_rates[var.op.name]
    grads_and_vars_mult.append((grad, var))
train_op = optimizer.apply_gradients(grads_and_vars_mult, tf.train.get_global_step())
The first 5 layers would have learning rate of 0.00001 and the last one would have 0.001. Any idea how to achieve this?
There is an easy way to do that using tf.stop_gradient.
Here is an example with 3 layers:
x = layer1(input)
x = layer2(x)
output = layer3(x)
You can shrink your gradient in the first two layers by a ratio of 1/100:
x = layer1(input)
x = layer2(x)
x = 1/100*x + (1-1/100)*tf.stop_gradient(x)
output = layer3(x)
After layer2, the "flow" is split into two branches: one, with a contribution of 1/100, computes its gradient regularly but with the gradient magnitude shrunk by a factor of 1/100; the other branch provides the remaining "flow" without contributing to the gradient, because of the tf.stop_gradient operator. As a result, if you use a learning rate of 0.001 on your model optimizer, the first two layers will effectively have a learning rate of 0.00001.
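A small sanity check of that claim (a sketch, assuming TF 1.x graph mode): the gradient flowing through the mixed branch is scaled by 1/100.
import tensorflow as tf
x = tf.Variable(3.0)
mixed = (1.0/100)*x + (1 - 1.0/100)*tf.stop_gradient(x)
g_plain, = tf.gradients(tf.square(x), [x])       # 2*x
g_mixed, = tf.gradients(tf.square(mixed), [x])   # (2*x) * 1/100
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([g_plain, g_mixed]))          # approximately [6.0, 0.06]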
If you happen to be using tf.slim + slim.learning.create_train_op there is a nice example here:
https://github.com/google-research/tf-slim/blob/master/tf_slim/learning.py#L65
# Create the train_op and scale the gradients by providing a map from variable
# name (or variable) to a scaling coefficient:
gradient_multipliers = {
    'conv0/weights': 1.2,
    'fc8/weights': 3.4,
}
train_op = slim.learning.create_train_op(
    total_loss,
    optimizer,
    gradient_multipliers=gradient_multipliers)
Unfortunately it doesn't seem possible to use a tf.Variable instead of a float value if you want to gradually modify the multiplier.
