Using stop_gradient with AdamOptimizer in TensorFlow - python

I am trying to implement a training/fine-tuning framework in which a certain set of parameters stays fixed in each backpropagation iteration. I want to be able to change the set of updating or fixed parameters from iteration to iteration. The TensorFlow function tf.stop_gradient, which forces the gradients of some parameters to stay zero, is very useful for this purpose, and it works perfectly fine with different optimizers as long as the set of updating or fixed parameters does not change between iterations. It can also handle a varying set of updating or fixed parameters if it is used with stochastic gradient descent. My problem is that tf.stop_gradient cannot handle such cases when used with the Adam optimizer. More specifically, it does keep the gradients of the fixed parameters at zero in the output of the optimizer's compute_gradients, but when the gradients are applied (apply_gradients), the values of the fixed parameters do change. I suppose this is because the optimization step in Adam is not zero even if the gradient is zero (based on Algorithm 1 in Kingma and Ba's paper). Is there a cheap way of freezing a variable set of parameters in each Adam iteration, without explicitly saving the previous iteration's values of the fixed parameters?
More Details:
Suppose I have a single-layer network with a weight matrix variable W and a binary mask matrix placeholder MW that specifies which elements of W should get updated in each iteration (value 1 marks an element to update). Instead of using W directly to write the input/output relationship of this layer, I modify it as below:
masked_W = MW*W + tf.stop_gradient(tf.abs(1-MW)*W)
to mask certain elements of W from having non-zero gradients. I then use masked_W to form the output of the layer, so the loss of the network depends on this masked variable. The point is that MW changes in each iteration. Suppose W is a vector of 4 elements initialized to the all-zero vector. Here is what happens:
opt = tf.train.AdamOptimizer(1e-5)
sess.run(tf.global_variables_initializer())
grads_vars = opt.compute_gradients(loss, var_list=[W])
# initial value of W=[0,0,0,0]
# first iteration:
MW_val = [0,1,1,0]
feed_dict = {MW: MW_val, x: batch_of_data, y_: batch_of_labels}
sess.run(opt.apply_gradients(grads_vars), feed_dict=feed_dict)
# gradient of W=[0,xx,xx,0]
# new value of W=[0,a,b,0]
where xx are some non-zero gradient values, and a and b are new values of updating elements of W. In the second iteration, we change the value assigned to the binary mask matrix MW to [1,0,0,1], hence we expect to have fixed values for W[1] and W[2] and updating values for W[0] and W[3]. But this is what happens:
# second iteration
MW_val = [1,0,0,1]
feed_dict={MW:MW_val, x: batch_of_data, y_:batch_of_labels}
sess.run(opt.apply_gradients(grads_vars), feed_dict=feed_dict)
# gradient of W=[xx,0,0,xx]
# new value of W=[c,aa,bb,d]
That is, although the gradients of W[1] and W[2] are zero, they get new values (aa != a and bb != b). When changing the optimizer from Adam to SGD, the values of fixed parameters stay the same as expected.

I found a solution to my question and am sharing it here in case others find it useful. After the first iteration, the moment estimates of the parameters that were updated in that iteration are already non-zero. Therefore, even if their gradients are set to zero in the second iteration, they will still be updated because of their non-zero moment tensors. To prevent these updates, tf.stop_gradient alone is not enough; we have to reset their moments as well. For the Adam optimizer, this can be done through the optimizer's get_slot method: opt.get_slot(par, 'm') and opt.get_slot(par, 'v'), which give access to the first and second moment tensors of parameter par, respectively. In the example from the question, we have to add the following lines to freeze W[1] and W[2] in the second iteration:
# moments of W after the first iteration (fetched as NumPy arrays)
m_vals = sess.run(opt.get_slot(W, 'm'))
v_vals = sess.run(opt.get_slot(W, 'v'))
# mask copies of them before running the second iteration
masked_m_vals = m_vals.copy()
masked_v_vals = v_vals.copy()
masked_m_vals[[1,2]] = 0
masked_v_vals[[1,2]] = 0
sess.run(opt.get_slot(W, 'm').assign(masked_m_vals))
sess.run(opt.get_slot(W, 'v').assign(masked_v_vals))
It is better to save the masked moments (in the example above, m_vals[[1,2]] and v_vals[[1,2]]), so that if we relax the fixing constraint on W[1] and W[2] in the third iteration, we can restore their moments to the values they had after the first iteration.
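To make the mechanism concrete, here is a minimal pure-Python sketch of a single-parameter Adam update following Algorithm 1 of Kingma and Ba. The function name adam_step, the scalar setup, and the hyperparameter values are assumptions chosen for illustration; this is not TensorFlow's implementation. It suggests why a zero gradient still moves the parameter when the stored moments are non-zero, and why resetting both moments leaves the parameter unchanged:

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative helper)."""
    m = beta1 * m + (1 - beta1) * g        # first moment estimate
    v = beta2 * v + (1 - beta2) * g * g    # second moment estimate
    m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Iteration 1: the parameter receives a non-zero gradient.
w1, m1, v1 = adam_step(w=0.0, g=0.5, m=0.0, v=0.0, t=1)

# Iteration 2: the gradient is zero, but the stored moments are kept.
w2, _, _ = adam_step(w=w1, g=0.0, m=m1, v=v1, t=2)

# Iteration 2 again, this time with both moments reset to zero first.
w2_reset, _, _ = adam_step(w=w1, g=0.0, m=0.0, v=0.0, t=2)

print("kept moments, parameter moved:", w2 != w1)
print("reset moments, parameter frozen:", w2_reset == w1)
```

With the moments kept, the decayed first moment is still non-zero, so the step is non-zero. With both moments reset, the numerator of the update is zero and the parameter stays where it was.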

Alternatively you can pass different subsets of the variables to apply_gradients when you want to update different subsets of variables.

Related

Tensorflow giving a ValueError: No gradients provided for any variable

I'm trying to implement a loss function that increases as the model's ranking of the images gets worse. The algorithm I came up with sorts the predicted score array according to the true scores of the image batch. Then, starting from the largest predicted score, it checks how far that score is from the first position in the array and assigns a loss based on that distance; for the second largest, it checks how far it is from the second array position and assigns a loss based on that, and so on.
To do this, I'm using tf.nn.top_k and other functions that, to my knowledge, are all differentiable, but I still get the "No gradients provided" error.
Can someone please tell me what I am doing wrong?
Please note that the global sub_tensor is a workaround (replacing correct_indices) to avoid using a range, which I know is non-differentiable. It is an array defined outside the function and fixed to a range of the batch length [0-32]. This still didn't work.
import numpy as np
import tensorflow as tf
sub_tensor = tf.constant(np.array([np.arange(32)], dtype='int32'))
def get_ranking_loss(y_true, y_pred):
    global sub_tensor
    _, y_true_ind_k = tf.nn.top_k(y_true, y_true.shape[1])
    sorted_y_pred = tf.gather(y_pred, y_true_ind_k)
    _, y_pred_ind_k = tf.nn.top_k(sorted_y_pred, sorted_y_pred.shape[1])
    # correct_indices = tf.range(0, sorted_y_pred.shape[1])
    subtracted = tf.math.subtract(y_pred_ind_k, sub_tensor)
    absolute = tf.abs(subtracted)
    absolute = tf.cast(absolute, tf.float64)
    return tf.reduce_sum(absolute)
I tried to change almost all functions to be tf functions only, but no luck

How does tensorflow handle non differentiable nodes during gradient calculation?

I understood the concept of automatic differentiation, but couldn't find any explanation how tensorflow calculates the error gradient for non differentiable functions as for example tf.where in my loss function or tf.cond in my graph. It works just fine, but I would like to understand how tensorflow backpropagates the error through such nodes, since there is no formula to calculate the gradient from them.
In the case of tf.where, you have a function with three inputs (condition C, value on true T and value on false F) and one output Out. The gradient function receives one value and has to return three values. Currently, no gradient is computed for the condition (that would hardly make sense), so you just need to compute the gradients for T and F. Assuming the inputs and the output are vectors, imagine C[0] is True. Then Out[0] comes from T[0], and its gradient should propagate back. On the other hand, F[0] would have been discarded, so its gradient should be made zero. If C[1] were False, then the gradient for F[1] should propagate but not the one for T[1]. So, in short, for T you should propagate the given gradient where C is True and make it zero where C is False, and the opposite for F. If you look at the implementation of the gradient of tf.where (Select operation), it does exactly that:
@ops.RegisterGradient("Select")
def _SelectGrad(op, grad):
  c = op.inputs[0]
  x = op.inputs[1]
  zeros = array_ops.zeros_like(x)
  return (None, array_ops.where(c, grad, zeros), array_ops.where(
      c, zeros, grad))
Note that the input values themselves are not used in the computation; that is handled by the gradients of the operations producing those inputs. For tf.cond, the code is a bit more complicated, because the same operation (Merge) is used in different contexts, and tf.cond also uses Switch operations internally. However, the idea is the same. Essentially, Switch operations are used for each input, so the input that was activated (the first if the condition was True and the second otherwise) gets the received gradient, and the other input gets a "switched off" gradient (like None) that does not propagate back further.
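The routing rule described above for tf.where can be sketched in plain Python. The helper name select_grad and the list-based inputs are assumptions for illustration only; the real gradient works on tensors, as in the _SelectGrad code quoted earlier:

```python
def select_grad(cond, grad):
    """Route an upstream gradient to the 'true' and 'false' inputs of a select.

    Returns (gradient for the condition, gradient for T, gradient for F).
    The condition gets no gradient, mirroring the None in _SelectGrad.
    """
    grad_true = [g if c else 0.0 for c, g in zip(cond, grad)]
    grad_false = [0.0 if c else g for c, g in zip(cond, grad)]
    return None, grad_true, grad_false

cond = [True, False, True]
upstream = [0.1, 0.2, 0.3]
grad_cond, grad_true, grad_false = select_grad(cond, upstream)

print(grad_true)   # upstream gradient kept where cond is True, zero elsewhere
print(grad_false)  # upstream gradient kept where cond is False, zero elsewhere
```

Each element of the upstream gradient ends up in exactly one of the two outputs, which is why the values of T and F themselves never appear in the computation.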

Keras: Derivatives of output with respect to time with LSTM

I have been trying to model nonlinear dynamic systems with LSTM networks using Keras. I have had success by simply using the Keras LSTM networks, where I define my input/output relationship something like the following pseudo-code:
x[t] = NN(y[t-200:t],x[t-200-1:t-1])
Where y would be my forcing function and x is the variable I'm after. So I use the past 200 points to estimate the next point. I do this recursively by adding the newly predicted point to my "past outputs" vector.
Now I would like to add some information about the PDE that I'm solving to the loss function, so I need to compute derivatives with respect to time. I have read this answer and the related answers to get started but I can't seem to get that workflow to work with LSTMs. First of all, time is not an explicit variable in my workflow, so I would need to add it as an input to accommodate the workflow in that answer.
So I could add the time vector to the list of inputs, and then try to compute derivatives of the output with respect to the input:
_df1 = grad(model.output,model.input)
df1 = tf.Print( _df1, [ _df1 ], message = "df1" )
For reference, my input dimension is (?,200,3) and my output dimension is simply (?,1). The code above works and I get a (?,200,3) tensor. But when I try to compute the second derivative like so:
_df2 = grad(df1,model.input)
df2 = tf.Print( _df2, [ _df2 ], message = "df2" )
Then I get the error:
TypeError: Second-order gradient for while loops not supported.
Since I only need the derivatives at the current timestep (t), I have tried slicing the tensor, but that doesn't work either.
_df2 = grad(df1[:,-1,-1],model.input)
df2 = tf.Print( _df2, [ _df2 ], message = "df2" )
Even if I could do something like that, I am not too comfortable adding the time vector as an explicit input. I have considered computing the derivative numerically with diff() (given a constant dt), but I am not sure how to do that here when dealing with tensors.
So I'd appreciate any suggestions or ideas to help me solve this problem. Ultimately, I'd like to add the homogeneous portion of the PDE to the loss function. At this point, my equation only has derivatives with respect to time.
Thanks.
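As a rough illustration of the numerical route mentioned in the question (differences over a constant dt), here is a plain-Python sketch. The helper names backward_diff and second_diff and the sample signal are assumptions, and the sketch works on Python lists rather than Keras tensors. The same shifted-slice arithmetic would presumably carry over to tensor slicing, but that is not verified here:

```python
def backward_diff(x, dt):
    """First-derivative estimate at each step t >= 1: (x[t] - x[t-1]) / dt."""
    return [(x[t] - x[t - 1]) / dt for t in range(1, len(x))]

def second_diff(x, dt):
    """Second-derivative estimate at each step t >= 2, using three samples."""
    return [(x[t] - 2 * x[t - 1] + x[t - 2]) / dt ** 2 for t in range(2, len(x))]

dt = 0.5
x = [(k * dt) ** 2 for k in range(5)]  # samples of x(t) = t**2

dx = backward_diff(x, dt)   # approximates dx/dt
d2x = second_diff(x, dt)    # approximates d2x/dt2, which is 2 for x(t) = t**2

print(dx)
print(d2x)
```

Because these estimates only use predicted outputs at consecutive time steps, they do not require time as an explicit network input or a second-order gradient through the recurrent loop.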

How to use tensorflow to approximate hessian matrix's norm

I wonder whether there is any method to recompute gradients with updated weights within a graph, or whether there is a better way to do this. For example, to estimate the Hessian norm, we need to compute
delta ~ N(0, I)
hessian_norm = 1/M \sum_{i=1}^{M} (gradient(f(x+delta)) - gradient(f(x-delta)))/(2*delta)
so we need the gradient value at x+delta. Currently we get None if we use tf.gradients on var+delta directly.
More specifically, if we define
a = tf.Variable
b = some_function(a)
grad = tf.gradients(b, a)
that's a normal gradient computation but if we do
grad_delta = tf.gradients(b, a+delta)
it will return None. This feature seems to make it impossible to approximate the hessian norm using the above method.
b is not a function of a+delta, so you get None. You either need to create a new value b2 that depends on a+delta, or move your variable a by delta and evaluate again to get the second value.
This is similar to how you do line search in TensorFlow.
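The idea of re-evaluating the gradient at the shifted points, rather than differentiating the old output with respect to a+delta, can be sketched without TensorFlow. The function f(x) = x**3, its analytic gradient grad_f, and the helper name below are assumptions chosen so the result is easy to check by hand:

```python
def grad_f(x):
    """Analytic gradient of the example function f(x) = x**3."""
    return 3 * x ** 2

def second_derivative_estimate(x, delta):
    """Central difference of the gradient, evaluated at x + delta and x - delta."""
    g_plus = grad_f(x + delta)    # gradient recomputed at the shifted point
    g_minus = grad_f(x - delta)   # gradient recomputed at the other shifted point
    return (g_plus - g_minus) / (2 * delta)

estimate = second_derivative_estimate(1.0, 1e-3)
print(estimate)  # should be close to the exact second derivative 6 * x = 6
```

In graph terms, g_plus and g_minus correspond to gradients of new outputs built from the shifted inputs (the b2 mentioned above), not to gradients of the original b.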

Spark mllib predicting weird number or NaN

I am new to Apache Spark and trying to use the machine learning library to predict some data. My dataset right now is only about 350 points. Here are 7 of those points:
"365","4",41401.387,5330569
"364","3",51517.886,5946290
"363","2",55059.838,6097388
"362","1",43780.977,5304694
"361","7",46447.196,5471836
"360","6",50656.121,5849862
"359","5",44494.476,5460289
Here's my code:
def parsePoint(line):
    # list() so that pop() also works where map() returns an iterator (Python 3)
    split = list(map(sanitize, line.split(',')))
    rev = split.pop(-2)  # the second-to-last column is used as the label
    return LabeledPoint(rev, split)
def sanitize(value):
    return float(value.strip('"'))
parsedData = textFile.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData, iterations=10)
print(model.predict(parsedData.first().features))
The prediction is something totally crazy, like -6.92840330273e+136. If I don't set iterations in train(), then I get nan as a result. What am I doing wrong? Is it my data set (the size of it, maybe?) or my configuration?
The problem is that LinearRegressionWithSGD uses stochastic gradient descent (SGD) to optimize the weight vector of your linear model. SGD is really sensitive to the provided stepSize which is used to update the intermediate solution.
What SGD does is to calculate the gradient g of the cost function given a sample of the input points and the current weights w. In order to update the weights w you go for a certain distance in the opposite direction of g. The distance is your step size s.
w(i+1) = w(i) - s * g
Since you're not providing an explicit step size value, MLlib assumes stepSize = 1. This does not seem to work for your use case. I'd recommend trying different step sizes, usually lower values, to see how LinearRegressionWithSGD behaves (in the Python API the parameters are named iterations and step):
LinearRegressionWithSGD.train(parsedData, iterations=10, step=0.001)
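The effect of the step size can be illustrated with a toy example in plain Python. The quadratic cost, the helper name gradient_descent, and the numbers are assumptions for illustration and have nothing to do with MLlib's internals. The large scale factor stands in for large, unscaled feature values like the ones in the question:

```python
def gradient_descent(step, scale, w0=1.0, iterations=10):
    """Minimize 0.5 * scale * w**2 with the update w(i+1) = w(i) - s * g."""
    w = w0
    for _ in range(iterations):
        g = scale * w        # gradient of the cost at the current weight
        w = w - step * g     # move against the gradient by the step size
    return w

scale = 1e6  # large curvature, as produced by features with large magnitudes

w_large_step = gradient_descent(step=1.0, scale=scale)
w_small_step = gradient_descent(step=5e-7, scale=scale)

print(w_large_step)  # magnitude blows up: every update overshoots the minimum
print(w_small_step)  # shrinks towards the minimum at w = 0
```

With the large step, each update overshoots by a factor of roughly the scale, which is how huge values and eventually NaN can appear. Lowering the step size, or scaling the features, keeps the iteration stable.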
