Tensorflow gradients for every item of tensor - python

I have a network that takes an Nx3 matrix as input and produces an N-dimensional vector. Say the batch size is 1 and N=1024, so the output has shape (1,1024). I want to compute the gradient of every dimension of the output with respect to the input, i.e. dy/dx for every y. However, TensorFlow's tf.gradients computes d(sum(y))/dx, the aggregated gradient. I know there's no straightforward way to compute the gradient for every output dimension separately, so I decided to just run tf.gradients 1024 times, since I only have to do this once in the project and never again.
So I do this:
start = datetime.datetime.now()
output_code_split = tf.split(output_code, 1024)
# output shape = (1024,)
grad_ops = []
for i in range(1024):
    gr = tf.gradients(output_code_split[i], input)
    # output shape = (1024,1,16,1024,3), where 16 = batch size
    gr = tf.reduce_mean(gr, [0, 1, 2, 3])
    # output shape = (1024,)
    grad_ops.append(gr)
    present = datetime.datetime.now()
    print(i, (present - start).seconds, flush=True)
    # prints time taken to finish previous computation
    start = datetime.datetime.now()
When the code started running, the time between two iterations was 4 seconds, so I figured it would run for roughly 4096 seconds. However, as the number of iterations increases, the time taken for subsequent iterations keeps increasing. The gap, which was 4 seconds when the code started, eventually reached 30 seconds after about 500 iterations, which is too much.
Is the list grad_ops that holds the gradient ops growing bigger and occupying more memory? I'm unfortunately not in a position to do detailed memory profiling of this code. Any ideas about what causes the iteration time to blow up as the loop goes on?
(Note that in the code I'm only creating the gradient ops, not actually evaluating them. That part comes later, but my code never reaches it because of the extreme slowdown mentioned above.)
Thanks.

What blows up your execution time is that you define new operations on the graph in every iteration of your for loop. Every call to tf.gradients and tf.reduce_mean pushes new nodes onto the graph, and the ever-growing graph takes longer and longer to extend and run. What should actually work for you is to use tf.gather with an int32 placeholder that supplies the output dimension to your gradient operation, so the ops are built only once. Something like this:
idx_placeholder = tf.placeholder(tf.int32, shape=(None,))
grad_operation = tf.gradients(tf.gather(output_code_split, idx_placeholder), input)
for i in range(1024):
    sess.run(grad_operation, {idx_placeholder: np.array([i])})
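For reference, a minimal sketch of collecting the per-dimension gradients with that single op; sess, x_batch, and feeding the network input are assumptions about the surrounding code, and feeding one index per run keeps each result a true dy_i/dx rather than an aggregated sum:
import numpy as np

all_grads = []
for i in range(1024):
    # one output index per run, so tf.gradients returns d y_i / d input
    g = sess.run(grad_operation,
                 {idx_placeholder: np.array([i]), input: x_batch})
    all_grads.append(g[0])
all_grads = np.stack(all_grads)  # leading axis indexes the 1024 outputs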

Related

How to set up an infinite dimension shape for a gym environment observation space in order to accumulate observation frames?

I'm writing a custom gym environment in python.
At each step, the environment catches observations as a numpy array of floats and stacks them onto the previous steps' observations.
So the observations look like:
Step 1 : array([[0., 0., 1.25, 2.3]])
Step 2 : array([[0., 0., 1.25, 2.3],
                [0., 0.22, 7.22, 2.3]])
Step 3 : array([[0., 0., 1.25, 2.3],
                [0., 0.22, 7.22, 2.3],
                [0.21, 0.58, 8.25, 2.7]])
etc...
I defined the observation space as follows:
self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape = (,4),dtype=np.float32)
but it doesn't work...
This snippet works if I don't stack observations over time and consider each time step as independent:
self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape = (4,),dtype=np.float32)
The problem is that I don't know the final number of steps to stack, so I can't define the second dimension.
Can I define a kind of infinite/unknown dimension for the shape, or do I have to set a deterministic (and huge) window and fill it with zeros for incoming steps?
Thanks for the answer.
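For what it's worth, a minimal sketch of the padded-window workaround mentioned in the question; MAX_STEPS and the helper name are illustrative, not from the original post, since Box needs a concrete shape:
import numpy as np
from gym import spaces

MAX_STEPS = 500  # assumed upper bound on the number of steps to stack

observation_space = spaces.Box(low=-np.inf, high=np.inf,
                               shape=(MAX_STEPS, 4), dtype=np.float32)

def pad_observations(stacked_obs):
    # stacked_obs: (n_steps, 4) array of the observations collected so far
    padded = np.zeros((MAX_STEPS, 4), dtype=np.float32)
    padded[:stacked_obs.shape[0]] = stacked_obs
    return padded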

How to avoid too many ops from a loop in tensorflow?

** Edit: to avoid ambiguity, my problem is "too many" ops, i.e. a large number of ops, not the numerical value of a certain op.
My old title: How to avoid a large number of ops from a loop in tensorflow?
** The main text follows.
Hi, Guys. I'm using tensorflow 1.12 to implement a network.
The background is:
In my loss function, I want to compute the loss over a large number (about 5000) of random point pairs sampled from every input image, which costs a lot of memory and time. The indices of the point pairs are chosen by a certain method with randomness, so I think it's impossible to fold the loop into a single matrix operation. It looks like this (in my_loss_func()):
# gt & pred is tensors of [w, h]
# x1, y1, x2, y2 are lists of the point pairs' coordinates which were randomly sampled from input image
# for example, x1 = [1, 2, 3], y1 = [4, 5, 6], x2 = [7, 8, 9], y2 = [10, 11, 12]
# then it will compute loss between gt[1, 4] & pred[7, 10], gt[2, 5] & pred[8, 11] etc.
num_pairs = 5000
...
# compute loss for this batch
for i in range(num_pairs):
diff = gt[x1[i], y1[i]] - pred[x2[i], y2[i]]
loss = some_math_computation(diff)
return loss/num_pairs
The problem is:
If I build the graph like above, it creates the loss-computing ops 5k times, which costs me a lot. In fact, every time I run the program, I have to wait about 10 minutes before the first batch of data starts training. In my DEBUG log, I found that the loop runs only about 25 times per second, so the large time cost definitely comes from here. I feel like such a noob.
Can you tell me how to avoid building the graph like this? Btw, I'm using tf.estimator, so calling sess.run() directly is not an option.
My thoughts:
Maybe I can build the whole graph once and save it to a file. Then every time I run the model, I can load the graph from that file, so I won't need to wait so long. Will that work?
Please tell me the correct way to implement this kind of loss. I think this kind of loss might be quite common in some fields of computer vision.
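As a sketch (not from the original post), the per-pair indexing can usually be replaced by a single tf.gather_nd call, so the graph contains a constant number of ops no matter how large num_pairs is. The names x1, y1, x2, y2 and some_math_computation are taken from the question, and some_math_computation is assumed to work elementwise:
import tensorflow as tf

def my_loss_func(gt, pred, x1, y1, x2, y2):
    # (num_pairs, 2) index matrices: one (row, col) pair per sampled point
    gt_idx = tf.stack([x1, y1], axis=1)
    pred_idx = tf.stack([x2, y2], axis=1)
    # gather all sampled pixels at once instead of indexing in a Python loop
    diff = tf.gather_nd(gt, gt_idx) - tf.gather_nd(pred, pred_idx)
    # elementwise loss over all pairs, averaged; only a handful of ops in total
    return tf.reduce_mean(some_math_computation(diff))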

Question about terminology used in LSTM inputs - seq_length vs context_size for sliding window approach

I have a time series of n vectors that I need to feed to an LSTM with a sliding window approach.
In different resources that I read online, seq_length is often referred to as the window length (or the number of LSTM cells), and context_size is defined as the input size to the LSTM at a timestep (e.g. one input vector per time step).
At any given point in time t, I want to pass the points x_{t-m}, …, x_{t-1}, x_{t} to the LSTM, followed by a dense layer, and predict a categorical target attribute at every timestep.
If I want to follow the sliding window approach, do I need to explicitly split input data into windows of size m ?
For instance:
[x_{t-m}, …, x_{t-1},…,x_{t}], [x_{t-m+1}, …, x_{t},…,x_{t+1}], [x_{t-m+2}, …, x_{t+1}, x_{t+2}] etc.
Or, can I split the input data into non-overlapping chunks [x_{t-m}, …, x_{t-1},…,x_{t}], [x_{t+1}, …, x_{t+m-1},x_{t+m}] etc. and instead resize the context_size ?
embeds = embeds.unfold(1, context_size, 1) # Keeping the step size to be 1
embeds = embeds.view(embeds.size(0), embeds.size(1), -1)
Is there a better way to implement sliding window approach for timeseries data ?
An LSTM recurses over the data by default, in that the prediction at time t depends on all of the past (subject to memory). In this case it seems you want the prediction at time t to depend on the m+1 most recent instances. If so, you do not need a recurrent net: you can simply use a linear layer and feed in the sliding window at each instant. However, if you are using a recurrent net, you don't need to pass the same inputs again and again.
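As a rough illustration of this (sizes and names are made up, not from the original question), the overlapping windows can be built once with unfold and fed through a plain linear layer, giving one prediction per timestep:
import torch
import torch.nn as nn

batch, n, feat, m, n_classes = 8, 100, 4, 5, 3
series = torch.randn(batch, n, feat)  # stand-in for the input sequence

# overlapping windows of the m+1 most recent steps: (batch, n - m, m + 1, feat)
windows = series.unfold(1, m + 1, 1).permute(0, 1, 3, 2)

# flatten each window and predict a class per timestep with a linear layer
linear = nn.Linear((m + 1) * feat, n_classes)
logits = linear(windows.reshape(batch, n - m, -1))  # (batch, n - m, n_classes)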

Using stop_gradient with AdamOptimizer in TensorFlow

I am trying to implement a training/fine-tuning framework in which, in each backpropagation iteration, a certain set of parameters stays fixed. I want to be able to change the set of updating or fixed parameters from iteration to iteration. The TensorFlow method tf.stop_gradient, which apparently forces the gradients of some parameters to stay zero, is very useful for this purpose, and it works perfectly fine with different optimizers if the set of updating or fixed parameters does not change from iteration to iteration. It can also handle a varying set of updating or fixed parameters if it is used with stochastic gradient descent. My problem is that tf.stop_gradient cannot handle such cases when used with the Adam optimizer. More specifically, it does keep the gradients of the fixed parameters at zero in the output of compute_gradients, but when applying the gradients (apply_gradients), the values of the fixed parameters do change. I suppose this is because the optimization step in Adam is not zero even when the gradient is zero (based on Algorithm 1 in Kingma and Ba's paper). Is there a cheap way of freezing a variable set of parameters in each Adam iteration, without explicitly saving the previous iteration's values of the fixed parameters?
More Details:
Suppose I have a single-layer network with a weight matrix variable W and a binary mask matrix placeholder MW that specifies which elements of W should get updated in each iteration (a value of 1 in the mask marks an element to be updated). Instead of using W directly to write the input/output relationship of this layer, I modify it as below
masked_W = MW*W + tf.stop_gradient(tf.abs(1-MW)*W)
to mask certain elements of W from having non-zero gradients. Then I use masked_W to form the output of the layer, and consequently the loss of the network depends on this masked variable. The point is that MW changes in each iteration. Suppose W is a vector of 4 elements initialized to the all-zero vector. Here is what happens:
opt = tf.train.AdamOptimizer(1e-5)
sess.run(tf.global_variables_initializer())
grads_vars = opt.compute_gradients(loss, var_list=[W])
# initial value of W = [0,0,0,0]
# first iteration:
MW_val = [0,1,1,0]
feed_dict = {MW: MW_val, x: batch_of_data, y_: batch_of_labels}
sess.run(opt.apply_gradients(grads_vars), feed_dict=feed_dict)
# gradient of W = [0,xx,xx,0]
# new value of W = [0,a,b,0]
where xx are some non-zero gradient values, and a and b are the new values of the updated elements of W. In the second iteration, we change the value assigned to the binary mask matrix MW to [1,0,0,1], hence we expect to have fixed values for W[1] and W[2] and updated values for W[0] and W[3]. But this is what happens:
# second iteration
MW_val = [1,0,0,1]
feed_dict = {MW: MW_val, x: batch_of_data, y_: batch_of_labels}
sess.run(opt.apply_gradients(grads_vars), feed_dict=feed_dict)
# gradient of W = [xx,0,0,xx]
# new value of W = [c,aa,bb,d]
That is, although the gradients of W[1] and W[2] are zero, they get new values (aa != a and bb != b). When changing the optimizer from Adam to SGD, the values of fixed parameters stay the same as expected.
I found a solution to my question and am sharing it here in case others find it useful. After the first iteration, the moment estimates of the parameters that were updated in that iteration are already non-zero. Therefore, even if their gradients are set to zero in the second iteration, they still get updated because of their non-zero moment tensors. To prevent the updates, using tf.stop_gradient alone is not enough; we have to zero out their moments as well. In the case of the Adam optimizer, this can be done through the optimizer's get_slot method: opt.get_slot(par, 'm') and opt.get_slot(par, 'v') give access to the first and second moment tensors of parameter par, respectively. In the example from the question, we have to add the following lines to freeze W[1] and W[2] in the second iteration:
# moment slots of W after the first iteration
m_vals = sess.run(opt.get_slot(W, 'm'))
v_vals = sess.run(opt.get_slot(W, 'v'))
# zero the moments of the elements we want to freeze
masked_m_vals = m_vals.copy()
masked_v_vals = v_vals.copy()
masked_m_vals[[1, 2]] = 0
masked_v_vals[[1, 2]] = 0
# write the masked moments back before running the second iteration
sess.run(opt.get_slot(W, 'm').assign(masked_m_vals))
sess.run(opt.get_slot(W, 'v').assign(masked_v_vals))
It is better to save the masked moments (in the example above, m_vals[[1,2]] and v_vals[[1,2]]) so that, if we relax the freezing of W[1] and W[2] in the third iteration, we can restore their moments to their original values from the first iteration.
Alternatively you can pass different subsets of the variables to apply_gradients when you want to update different subsets of variables.
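A minimal sketch of that alternative (W_a and W_b are illustrative variable subsets, not from the original post; note this works at the granularity of whole variables rather than individual elements of W):
opt = tf.train.AdamOptimizer(1e-5)
# one (gradient, variable) list per subset of variables
grads_a = opt.compute_gradients(loss, var_list=[W_a])
grads_b = opt.compute_gradients(loss, var_list=[W_b])
train_a = opt.apply_gradients(grads_a)  # updates only W_a
train_b = opt.apply_gradients(grads_b)  # updates only W_b

sess.run(tf.global_variables_initializer())
sess.run(train_a, feed_dict=feed_dict)  # iteration 1: W_b stays fixed
sess.run(train_b, feed_dict=feed_dict)  # iteration 2: W_a stays fixed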

Rearranging numpy array without memory error

I am currently trying to run a Monte Carlo simulation for which I would need more than 5,000,000 iterations (the outputs are still not consistent at that point).
When I try to run it for more than 5 million iterations, however, I get a memory error when rearranging my array to get the data sorted in a way I can easily plot it.
The error occurs at:
np.array([np.array([run_single_regression(inputs) for x in xrange(iterations)])]).transpose()
This is the function I run:
def Monte_Carlo_regressions(filename, iterations, do_plot=False):
    inputs = data_assignment_regression(filename)
    total_pow, total_energy = np.array([np.array([run_single_regression(inputs) for x in xrange(iterations)])]).transpose()
    if do_plot:
        plot(total_pow, 'Total Power Capacity (GW)')
        plot(total_energy, 'Total Energy Storage Capacity (TWh)')
    return total_pow.mean(0), total_pow.std(0), total_energy.mean(0), total_energy.std(0)
The data_assignment_regression(filename) function returns a set of 1D arrays assigned to the inputs.
The run_single_regression(inputs) function estimates the power and energy outputs for that iteration and returns a numpy array containing the power and energy for that iteration.
How can I avoid the memory error? Is there a way I could rearrange the array without having to store all the values?
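One way around storing every sample (a sketch, not from the original post, reusing the question's data_assignment_regression and run_single_regression helpers and assuming each sample is a length-2 array of power and energy) is to accumulate running sums for the mean and standard deviation, so memory stays constant regardless of the iteration count; the trade-off is that the raw samples are no longer available for plotting:
import numpy as np

def monte_carlo_streaming(filename, iterations):
    inputs = data_assignment_regression(filename)
    sums = np.zeros(2)     # running sums of [power, energy]
    sq_sums = np.zeros(2)  # running sums of squares
    for _ in range(iterations):
        sample = np.asarray(run_single_regression(inputs), dtype=np.float64)
        sums += sample
        sq_sums += sample ** 2
    means = sums / iterations
    stds = np.sqrt(sq_sums / iterations - means ** 2)
    # same return layout as Monte_Carlo_regressions: power mean/std, energy mean/std
    return means[0], stds[0], means[1], stds[1]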
