How to window or reset streaming operations in tensorflow? - python

Tensorflow provides all sorts of nice streaming operations to aggregate statistics along batches, such as tf.metrics.mean.
However, I find that accumulating all values since the beginning often does not make a lot of sense. For example, one might instead want statistics per epoch, or over any other time window that makes sense in a given context.
Is there any way to restrict the history of such streaming statistics, for example by resetting the streaming ops so that they start accumulating from scratch?
Work-arounds:
accumulate by hand across batches
use a "soft" sliding window via an exponential moving average (EMA); a sketch of this is given below

One way to do it is to call the initializer of the relevant variables in the streaming op. For example,
import tensorflow as tf
x = tf.random_normal(())
mean_x, update_op = tf.metrics.mean(x, name='mean_x')
# get the initializers of the local variables (total and count)
my_metric_variables = [v for v in tf.local_variables() if v.name.startswith('mean_x/')]
# or maybe just
# my_metric_variables = tf.get_collection('metric_variables')
reset_ops = [v.initializer for v in my_metric_variables]
with tf.Session() as sess:
    tf.local_variables_initializer().run()
    for _ in range(100):
        for _ in range(100):
            sess.run(update_op)
        print(sess.run(mean_x))
        # if you comment the following line out, the estimate of the mean converges to 0
        sess.run(reset_ops)

The metrics in tf.contrib.eager.metrics (which work both with and without eager execution) have an init_variables() op you can call if you want to reset their internal variables.
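For example, a minimal sketch assuming TF 1.x with eager execution enabled and the tf.contrib.eager.metrics API:
import tensorflow as tf
import tensorflow.contrib.eager as tfe

tf.enable_eager_execution()
m = tfe.metrics.Mean()
for epoch in range(3):
    for _ in range(100):
        m(tf.random_normal(()))  # accumulate a value
    print(epoch, m.result().numpy())
    m.init_variables()  # reset the internal total/count for the next epoch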

Related

Vectorized beam search decoder is not faster on GPU - Tensorflow 2

I'm trying to run an RNN beam search on a tf.keras.Model in a vectorized way to have it work completely on GPU. However, despite having everything as a tf.function, as vectorized as I can make it, it runs at exactly the same speed with or without a GPU. Attached is a minimal example with a fake model. In reality, for n=32, k=32, steps=128, which is what I would want to work with, this takes 20s (per n=32 samples) to decode, both on CPU and on GPU!
I must be missing something. When I train the model, on GPU a training iteration (128 steps) with batch size 512 takes 100ms, and on CPU a training iteration with batch size 32 takes 1 sec. The GPU isn't saturated at batch size 512. I get that I have overhead from doing the steps individually and doing a blocking operation per step, but in terms of computation my overhead is negligible compared to the rest of the model.
I also get that using a tf.keras.Model in this way is probably not ideal, but is there another way to wire output tensors via a function back to the input tensors, and particularly also rewire the states?
Full working example:
https://gist.github.com/meowcat/e3eaa4b8543a7c8444f4a74a9074b9ae
@tf.function
def decode_beam(states_init, scores_init, y_init, steps, k, n):
    # (tm is presumably tf.math, aliased in the full gist)
    states = states_init
    scores = scores_init
    xstep = embed_y_to_x(y_init)
    # Keep the results in TensorArrays
    y_chain = tf.TensorArray(dtype="int32", size=steps)
    sequences_chain = tf.TensorArray(dtype="int32", size=steps)
    scores_chain = tf.TensorArray(dtype="float32", size=steps)
    for i in range(steps):
        # model_decode is the trained model with 3.5 million trainable params.
        # Run a single step of the RNN model.
        y, states = model_decode([xstep, states])
        # Add scores of step n to previous scores
        # (I left out the sequence end killer for this demo)
        scores_y = tf.expand_dims(tf.reshape(scores, y.shape[:-1]), 2) + tm.log(y)
        # Reshape into (n, k, tokens) and find the best k sequences to continue
        # for each of the n candidates
        scores_y = tf.reshape(scores_y, [n, -1])
        top_k = tm.top_k(scores_y, k, sorted=False)
        # Transform the indices. I was using tf.unravel_index but
        # `tf.debugging.set_log_device_placement(True)` indicated that this would be
        # placed on the CPU, thus I rewrote it
        top_k_index = tf.reshape(
            top_k[1] + tf.reshape(tf.range(n), (-1, 1)) * scores_y.shape[1], [-1])
        ysequence = top_k_index // y.shape[2]
        ymax = top_k_index % y.shape[2]
        # This gives us two (n*k,) tensors with the parent sequence (ysequence)
        # and the chosen character (ymax) per sequence.
        # For continuation, pick the states, and "return" the scores
        states = tf.gather(states, ysequence)
        scores = tf.reshape(top_k[0], [-1])
        # Write the results into the TensorArrays,
        # and embed for the next step
        xstep = embed_y_to_x(ymax)
        y_chain = y_chain.write(i, ymax)
        sequences_chain = sequences_chain.write(i, ysequence)
        scores_chain = scores_chain.write(i, scores)
    # Done: stack up the results and return them
    sequences_final = sequences_chain.stack()
    y_final = y_chain.stack()
    scores_final = scores_chain.stack()
    return sequences_final, y_final, scores_final
There was a lot going on here. I will comment on it because it might help others to resolve TensorFlow performance issues.
Profiling
The GPU profiler library (cupti) was not loading correctly on the cluster, stopping me from doing any useful profiling on the GPU. That was fixed, so I get useful profiles of the GPU now.
Note this very useful answer (the only one on the web) that shows how to profile arbitrary TensorFlow 2 code, rather than Keras training:
https://stackoverflow.com/a/56698035/1259675
logdir = "log"
writer = tf.summary.create_file_writer(logdir)
tf.summary.trace_on(graph=True, profiler=True)
# run any #tf.function decorated functions here
sequences, y, scores = decode_beam_steps(
y_init, states_init, scores_init,
steps = steps, k = k, n = n, pad_mask = pad_mask)
with writer.as_default():
tf.summary.trace_export(name="model_trace", step=0, profiler_outdir=logdir)
tf.summary.trace_off()
Note that an old Chromium version is needed to look at the profiling results, since at the time (4-17-20) this fails in current Chrome/Chromium.
Small optimizations
The graph was made a bit lighter, but not significantly faster, by using unroll=True in the LSTM cells used by the model (not shown here): since only a single step is run at a time, the symbolic loop only adds clutter. This significantly cuts the time for the first iteration of the function above, when AutoGraph builds the graph. Note that this time is enormous (see below).
unroll=False (the default) builds in 300 seconds, unroll=True builds in 100 seconds. Note that the performance itself stays the same (15-20 sec/iteration for n=32, k=32).
implementation=1 made it slightly slower, so I stayed with the default of implementation=2.
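For context, this is roughly where those options go; a sketch with placeholder layer sizes, not the actual model:
import tensorflow as tf

# hypothetical single-step decoder cell; units=512 is a placeholder value
lstm = tf.keras.layers.LSTM(
    units=512,
    return_sequences=True,
    return_state=True,
    unroll=True,         # unroll the symbolic loop; fine since only one step is run
    implementation=2)    # the default; implementation=1 was slightly slower here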
Using tf.while_loop instead of relying on AutoGraph
The culprit was the for i in range(steps) loop. I had it both in the inlined version shown above and in a modularized one:
for i in range(steps):
    ystep, states = model_decode([xstep, states])
    ymax, ysequence, states, scores = model_beam_step(
        ystep, states, scores, k, n, pad_mask)
    xstep = model_rtox(ymax)
    y_chain = y_chain.write(i, ymax)
    sequences_chain = sequences_chain.write(i, ysequence)
    scores_chain = scores_chain.write(i, scores)
where model_beam_step does all the beam search math. Unsurprisingly, both performed equally badly, and in particular, both took ~100/300 seconds on the first run while AutoGraph traced the graph. Further, tracing the graph with the profiler gives a crazy 30-50 MB file that won't easily load in TensorBoard and more or less crashes it. The profile had dozens of parallel GPU streams with a single operation each.
Substituting this with a tf.while_loop slashed the setup time to zero (back_prop=False makes only a very small difference), and produced a nice 500 KB graph that can easily be looked at in TensorBoard and profiled in a useful way, with 4 GPU streams.
beam_steps_cond = lambda i, y_, seq_, sc_, xstep, states, scores: i < steps

def decode_beam_steps_body(i, y_, seq_, sc_, xstep, states, scores):
    y, states = model_decode([xstep, states])
    ymax, ysequence, states, scores = model_beam_step(
        y, states, scores, k, n, pad_mask)
    xstep = model_rtox(ymax)
    y_ = y_.write(i, ymax)
    seq_ = seq_.write(i, ysequence)
    sc_ = sc_.write(i, scores)
    i = i + 1
    return i, y_, seq_, sc_, xstep, states, scores

_, y_chain, sequences_chain, scores_chain, _, _, _ = \
    tf.while_loop(
        cond=beam_steps_cond,
        body=decode_beam_steps_body,
        loop_vars=[i, y_chain, sequences_chain, scores_chain,
                   xstep, states, scores],
        back_prop=False
    )
Finally, the real problem
Being able to actually look at the profile in a meaningful way showed me that the real issue was an output postprocessing function that runs on the CPU. I didn't suspect it because it had been running fast earlier, but I had overlooked that a beam-search modification I made produces far more than k sequences per candidate, which massively slows the processing down. Thus, it was erasing every benefit I gained from being efficient on the GPU in the decoding step. Without this postprocessing, the GPU runs at more than 2 iterations/sec. Refactoring the postprocessing (which is extremely fast if done right) into TensorFlow resolved the issue.

Only save transformed parameters in PyMC3

I want to shift my variables in PyMC3. I'm currently just using deterministic transforms, but when I perform the inference, it saves both the original samples and the shifted samples (which is expected behavior).
Example code:
import pymc3 as pm

x_lower = -3
with pm.Model():
    x = pm.Gamma('x', alpha=2., beta=1.5)
    x_shift = pm.Deterministic("x_shift", x + x_lower)
    trace = pm.sample(1000, tune=1000)

trace.remove_values("x")  # my current solution
tp = pm.traceplot(trace)
# other analysis...
Now trace stores all the x samples and all the x_shift samples, which is clearly a waste as the number of variables and samples increases. I can do trace.remove_values("x") before continuing with the analysis, but I would prefer to simply not save x at all.
Another option is to not save x_shift at all, but I can't find how to add x_lower on to the samples after inference. So this isn't really a solution if I want to use the in-built analysis tools.
Can I save only the x_shift samples, and not the x samples, when I sample?
You can specify exactly what you want to save by setting the trace argument in the pm.sample() function, e.g.,
trace = pm.sample(1000, tune=1000, trace=[x_shift])
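In the context of the question's model, that would look roughly like this (a sketch, assuming a PyMC3 version where trace accepts a list of variables to track):
import pymc3 as pm

x_lower = -3
with pm.Model():
    x = pm.Gamma('x', alpha=2., beta=1.5)
    x_shift = pm.Deterministic("x_shift", x + x_lower)
    # only x_shift is stored in the trace
    trace = pm.sample(1000, tune=1000, trace=[x_shift])
pm.traceplot(trace)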
In case anyone finds this...
I went for a still-unsatisfying solution: I'm not using Deterministic transformations any more. I still transform the variables, but don't save the transformed values; I just transform the saved (original) samples after sampling.
The above code now looks like this:
import pymc3 as pm

x_lower = -3

def f_x(x):
    return x + x_lower

with pm.Model():
    x = pm.Gamma('x', alpha=2., beta=1.5)
    x_shift = f_x(x)
    # keep = {"x": f_x}  # something like this for more variables
    trace = pm.sample(1000, tune=1000)

trace = {"x": f_x(trace["x"])}  # note this merges all chains
# trace = {varname: f(trace[varname]) for varname, f in keep.items()}
tp = pm.traceplot(trace)
# other analysis...
I think the pm.traceplot(trace) still works with trace in this form, but otherwise just import arviz and use it directly --- it does work with dicts like this.
Note: take care with more complicated transformations. e.g. you'll have to use pm.math.exp in the model, but np.exp in the post-sampling transformation.
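For example, an illustration of that caveat (not code from the original model):
import numpy as np
import pymc3 as pm

def f_model(x):
    return pm.math.exp(x)   # symbolic op, used inside the model

def f_post(samples):
    return np.exp(samples)  # numpy op, used on the saved sample arrays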

Resetting kernel hyperparameter values in GPflow

My use case is this: I have a function that takes in a kernel of the user's choice; I then iterate through every date in the dataset and use Gaussian process regression to estimate a model with the specified kernel. However, since I'm pointing to the kernel object, I need to reset it to its default values before I run the next iteration.
import gpflow

class WrapperClass(object):
    def __init__(self, kernel):
        super().__init__()
        self.kernel = kernel

    def fit(self, X, y):
        m = gpflow.models.GPR(X, y, self.kernel)  # I need to reset the kernel here
        # some code later

def some_function(Xs, ys, ts, f):
    for t in ts:
        X = Xs.loc[t]  # pandas dataframe
        y = ys.loc[t]  # pandas
        f.fit(X, y)

k1 = gpflow.kernels.RBF(1)
k2 = gpflow.kernels.White(0.1)
k = k1 + k2
f = WrapperClass(k)
some_function(Xs, ys, ts, f)
I've found the read_trainables() method on the kernel, so one strategy is to save the settings the user has provided, but there doesn't seem to be any way to set them?
In [7]: k1.read_trainables()
Out[7]: {'Sum/rbf/lengthscales': array(1.), 'Sum/rbf/variance': array(1.)}
Cheers,
Steve
You can set the parameters of Parameterized objects (models, kernels, likelihoods etc) using assign(): k1.assign(k1.read_trainables()) (or some other dict of path-value pairs). You might as well create a new kernel object, though!
Note that each time you create new parameterized objects - this applies both to kernels and models, as in your fit() method - you add operations to the TensorFlow graph, which can slow down graph computation significantly if it grows a lot. You probably want to look into manually handling tf.Graph() and tf.Session() to keep them distinct for each model. (See the notebooks on session handling and further tips and tricks in the new GPflow documentation.)
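Putting the assign() suggestion into the question's wrapper might look roughly like this (a sketch against GPflow 1.x; the optimizer call is just an example and not part of the original code):
import gpflow

class WrapperClass(object):
    def __init__(self, kernel):
        super().__init__()
        self.kernel = kernel
        # remember the hyperparameters the user supplied
        self.initial_values = kernel.read_trainables()

    def fit(self, X, y):
        # reset the kernel to the saved hyperparameters before refitting
        self.kernel.assign(self.initial_values)
        m = gpflow.models.GPR(X, y, self.kernel)
        gpflow.train.ScipyOptimizer().minimize(m)  # example optimizer call
        return m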

How does one read TensorBoard histograms for a 1D example in TensorFlow?

I made the simplest 1D example for TensorBoard (tracking the minimization of a quadratic) but I get plots that don't make sense to me and I can't figure out why. Is it my own implementation or is TensorBoard buggy?
Here are the plots:
HISTOGRAM:
Usually I think of histograms as bar graphs that encode probability distributions (or frequency counts). I assume that the y-axis shows the values and the x-axis the count? Since my number of steps is 120, that seemed a reasonable guess.
and Scalar plot:
why is there a strange line going through my plots?
The code that produced it (you should be able to copy paste it and run it):
## run cmd to collect model: python playground.py --logdir=/tmp/playground_tmp
## show board on browser run cmd: tensorboard --logdir=/tmp/playground_tmp
## browser: http://localhost:6006/
import tensorflow as tf

# x variable
x = tf.Variable(10.0, name='x')
# b placeholder (simulates the "data" part of the training)
b = tf.placeholder(tf.float32)
# make model (1/2)(x-b)^2
xx_b = 0.5*tf.pow(x-b, 2)
y = xx_b

learning_rate = 1.0
# get optimizer
opt = tf.train.GradientDescentOptimizer(learning_rate)
# gradient variable list = [ (gradient, variable) ]
gv = opt.compute_gradients(y, [x])
# transformed gradient variable list = [ (T(gradient), variable) ]
decay = 0.9  # decay the gradient for the sake of the example
tgv = [(decay*g, v) for (g, v) in gv]  # list [(grad, var)]
# apply transformed gradients
apply_transform_op = opt.apply_gradients(tgv)

# track value of x
x_scalar_summary = tf.scalar_summary("x", x)
x_histogram_summary = tf.histogram_summary('x_his', x)

with tf.Session() as sess:
    merged = tf.merge_all_summaries()
    tensorboard_data_dump = '/tmp/playground_tmp'
    writer = tf.train.SummaryWriter(tensorboard_data_dump, sess.graph)
    sess.run(tf.initialize_all_variables())
    epochs = 120
    for i in range(epochs):
        b_val = 1.0  # fake data (in SGD it would be different on every epoch)
        # applies the gradients
        [summary_str_apply_transform, _] = sess.run([merged, apply_transform_op], feed_dict={b: b_val})
        writer.add_summary(summary_str_apply_transform, i)
I also ran into the same problem of multiple lines appearing in TensorBoard (when I tried your code, TensorBoard showed the duplicate-graph warning below and presented only one curve, which is better than what I got):
WARNING:tensorflow:Found more than one graph event per run. Overwriting the graph with the newest event.
Nevertheless, the solution is the same as @Olivier Moindrot mentioned: delete the old logs. Sometimes TensorBoard caches old results, so you may also want to restart the TensorBoard service.
The way to make sure the newest summary is presented, as the MNIST example shows, is to log to a fresh folder:
if tf.gfile.Exists(FLAGS.summaries_dir):
    tf.gfile.DeleteRecursively(FLAGS.summaries_dir)
tf.gfile.MakeDirs(FLAGS.summaries_dir)
Link to full source, with TF version r0.10: https://github.com/tensorflow/tensorflow/blob/r0.10/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py

Avoid cluttering the tensorflow graph with assign operations

I have to run something like the following code
import tensorflow as tf

sess = tf.Session()
x = tf.Variable(42.)
for i in range(10000):
    sess.run(x.assign(42.))
    sess.run(x)
    print(i)
several times. The actual code is much more complicated and uses more variables.
The problem is that each instantiated assign op adds a new operation to the TensorFlow graph, which makes the graph grow and eventually slows down the computation.
I could use feed_dict= to set the value, but I would like to keep my state in the graph, so that I can easily query it in other places.
Is there some way of avoiding cluttering the current graph in this case?
I think I've found a good solution for this:
I define a placeholder y and create an op that assigns the value of y to x.
I can then use that op repeatedly, using feed_dict={y: value} to assign a new value to x.
This doesn't add another op to the graph.
It turns out that the loop runs much more quickly than before as well.
import tensorflow as tf

sess = tf.Session()
x = tf.Variable(42.)
y = tf.placeholder(dtype=tf.float32)
assign = x.assign(y)
sess.run(tf.initialize_all_variables())
for i in range(10000):
    sess.run(assign, feed_dict={y: i})
    print(i, sess.run(x))
Each time you call sess.run(x.assign(42.)), two things happen: (i) a new assign operation is added to the computational graph sess.graph, and (ii) the newly added operation executes. No wonder the graph gets pretty large if the loop repeats many times. If you define the assignment operation before execution (asgnmnt_operation in the example below), just a single operation is added to the graph, so the performance is good:
import tensorflow as tf

x = tf.Variable(42.)
c = tf.constant(42.)
asgnmnt_operation = x.assign(c)
sess = tf.Session()
for i in range(10000):
    sess.run(asgnmnt_operation)
    sess.run(x)
    print(i)
