PyCaffe got different gradients for each run of net.backward? - python

Today I ran into something really weird.
I load a Caffe model, feed input, call net.forward, and check the output data: perfect.
Then I feed labels into the diff of the bottom layer's blobs, call net.backward, and compare the gradients (params.diff) with the result from a C++ Caffe program using the same model. They were different.
Furthermore, when I ran net.backward several more times in Python, I got different gradients each time. This is not the case for the C++ program: the gradients stay the same no matter how many times you run net.backward, as long as you do not change the bottom diff.
I checked the bottom layer's blobs and diff; they were unchanged in both the Python and C++ programs, and the weights were also unchanged. This was really weird.
Can anyone provide some hints? I can provide code if necessary.
Here is part of the code:
def train_one_step(X, y, lr):
    net.blobs['data'].data[...] = X
    # Forward, to get the softmax output
    output = net.forward()
    prob = output['prob']
    # Calculate the loss of the cross-entropy loss function
    net.blobs['prob'].diff[:] = y[:] - prob[:]
    # Calculate the gradients of the net parameters
    net.backward()
    # Update the weights based on the gradients and learning rate
    for key in net.params:
        net.params[key][0].data[:] += lr * net.params[key][0].diff[:]
        net.params[key][1].data[:] += lr * net.params[key][1].diff[:]
    return loss, prob
I just want to write my own step function (the step of the solver), so that I can apply some tricks to the loss before the backward pass, among other things. I know this is quite inefficient, since a lot of data is exchanged between GPU and CPU.
To test it, I kept feeding the same sample (same X, y), and I got different diff data each time. That means this function cannot work.
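One possible explanation, which nothing in this post confirms: recent versions of Caffe add parameter gradients into params.diff on every backward call, and the solver zeroes them before each iteration. If that applies here, a hand-written step function has to do the zeroing itself (as far as I know, pycaffe exposes this as net.clear_param_diffs()). The toy below is plain Python, not Caffe; ToyNet and its methods are hypothetical stand-ins that only mimic the assumed accumulate-then-clear behaviour.

```python
# Toy stand-in for a net whose backward pass ADDS to the stored
# parameter gradient instead of overwriting it (assumed behaviour).
class ToyNet:
    def __init__(self, weight):
        self.weight = weight
        self.weight_diff = 0.0   # plays the role of params[...].diff

    def backward(self, x, top_diff):
        # d(loss)/d(weight) for y = weight * x, accumulated in place
        self.weight_diff += x * top_diff

    def clear_param_diffs(self):
        self.weight_diff = 0.0


net = ToyNet(weight=0.5)

# Same input, no clearing: the stored gradient grows on every call.
without_clear = []
for _ in range(3):
    net.backward(x=2.0, top_diff=1.5)
    without_clear.append(net.weight_diff)

# Same input, cleared before each call: the gradient is stable.
with_clear = []
for _ in range(3):
    net.clear_param_diffs()
    net.backward(x=2.0, top_diff=1.5)
    with_clear.append(net.weight_diff)

print(without_clear)  # [3.0, 6.0, 9.0]
print(with_clear)     # [3.0, 3.0, 3.0]
```

If the real net behaves like the first loop, clearing the parameter diffs at the top of train_one_step would be the thing to try.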


Issue with keras fit_generator epoch

I'm creating an LSTM model for text generation using Keras. The dataset I'm using (around 25 novels, about 1.4 million words) can't be processed at once (a memory issue when converting my outputs with to_categorical()), so I created a custom generator function to read the data in.
# Data generator for fit and evaluate
def generator(batch_size):
start = 0
end = batch_size
while True:
x = sequences[start:end,:-1]
y = sequences[start:end,-1]
y = to_categorical(y, num_classes=vocab_size)
yield x, y
if batch_size == len(lines):
start += batch_size
end += batch_size
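As posted, the index update at the end of the generator looks incomplete, so the slice can run past the end of the data and yield empty batches. A minimal sketch of a generator that wraps around, assuming the intent was to restart from the beginning once the data is exhausted. It uses plain lists so it runs without Keras; one_hot, batch_generator and toy_sequences are hypothetical names, and one_hot only stands in for to_categorical.

```python
def one_hot(index, num_classes):
    """Return a one-hot list; stands in for Keras to_categorical here."""
    row = [0] * num_classes
    row[index] = 1
    return row


def batch_generator(sequences, batch_size, vocab_size):
    """Yield (x, y) batches forever, restarting once the data runs out."""
    start = 0
    while True:
        batch = sequences[start:start + batch_size]
        x = [seq[:-1] for seq in batch]                      # all but last token
        y = [one_hot(seq[-1], vocab_size) for seq in batch]  # last token
        yield x, y
        start += batch_size
        if start >= len(sequences):   # wrap so no batch is ever empty
            start = 0


# Hypothetical toy data: 5 sequences of 3 token ids, vocabulary of 4.
toy_sequences = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1], [0, 2, 1]]
gen = batch_generator(toy_sequences, batch_size=2, vocab_size=4)
batches = [next(gen) for _ in range(4)]
print([len(x) for x, _ in batches])  # [2, 2, 1, 2]
```

The fourth batch is the first one again, which is what a second epoch would need.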
When I execute the method, the following error is thrown after 1 epoch of training is done.
UnknownError: [_Derived_] CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/ 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
[[{{node CudnnRNN}}]]
[[sequential/lstm/StatefulPartitionedCall]] [Op:__inference_train_function_25138]
Function call stack:
train_function -> train_function -> train_function
Does anyone know how to solve this issue? Thanks
According to many sources on the Internet, this issue seems to occur when using an LSTM layer together with a Masking layer while training on a GPU.
The following workarounds may help:
1. If you can compromise on speed, you can train your model on the CPU rather than the GPU. It works without any error.
2. As per this comment, check whether your input sequences consist entirely of zeros, as the Masking layer may mask all of the inputs.
3. If possible, you can disable eager execution. As per this comment, it works without any error.
4. Instead of using a Masking layer, you can try the alternatives mentioned in this link:
a. Add the argument mask_zero=True to the Embedding layer, or
b. Pass a mask argument manually when calling layers that support this argument.
5. The last option is to remove the Masking layer, if that is possible.
If none of the above workarounds solves your problem, the TensorFlow team is working to resolve this error. We may have to wait until that is fixed.
Hope this information helps. Happy Learning!

In tensorflow-probability, how do I update a learnable prior used only in KL-divergence?

I'm working on a variational auto-encoder and I'd like the prior used in the KL-divergence regularization of the latent distribution to have its loc (mean) and scale (stddev) updated.
The below snippet is a contrived minimal example demonstrating what I'm trying to achieve. This starts to work but then just freezes after some random number of epochs (sometimes 1, sometimes 200, but usually around 7 or 8). There's no error message or anything.
loc = tf.Variable(tf.random.normal([ndim], stddev=0.1, dtype=tf.float32))
scale = tfp.util.TransformedVariable(
tf.random.normal([ndim], mean=1.0, stddev=0.1, dtype=tf.float32),
bijector=tfb.Chain([tfb.Shift(1e-5), tfb.Softplus(), tfb.Shift(0.5413)]))
prior = tfd.Independent(tfd.Normal(loc=loc, scale=scale), reinterpreted_batch_ndims=1)
_input = tfkl.Input(shape=(1,))
_loc = tfkl.Dense(ndim, name="loc_params")(_input)
_scale = tfkl.Dense(ndim, name="untransformed_scale_params")(_input)
_scale = tf.math.softplus(_scale + np.log(np.exp(1) - 1)) + 1e-5
_output = tfpl.DistributionLambda(
make_distribution_fn=lambda t: tfd.Independent(tfd.Normal(loc=t[0], scale=t[1])),
activity_regularizer=tfpl.KLDivergenceRegularizer(prior, use_exact_kl=True, weight=0.1)
)([_loc, _scale])
model = tf.keras.Model(_input, _output)
model.compile(optimizer='adam', loss=lambda y_true, model_out: -model_out.log_prob(y_true))
hist = model.fit(..., epochs=N_EPOCHS, verbose=2)
I have a runnable gist here.
A more concrete example, and an architecture close to what I'm trying to update and simplify, is the tfp example for disentangled_vae. In its manual training loop, a new tfd.MultivariateNormalDiag is instantiated on every loop, though it is parameterized using persistent tf.Variables. I'm trying my best to avoid manual training loops, and I'm also trying to move to more Keras-like syntax, so I'd rather not do a direct port of this example.
Any advice is greatly appreciated. Thanks!
Edit: The activity_regularizer seems to work fine when attached to a latent (bottleneck) distribution. I have a more complete example in this Colab notebook. As this works in my architecture, I'm no longer in need of an answer.
However, I highly doubt having model fitting freeze is desirable behaviour, so this remains a problem.
As the machinery works in most circumstances, just not the contrived freezing example above, I no longer consider this a question that needs an answer.
I have reported the errorless freezing behaviour via the tensorflow-probability repository issues page. See here.

Why does keras model predict slower after compile?

In theory, prediction time should be constant, as the weights have a fixed size. How do I get my speed back after compile (without needing to remove the optimizer)?
See associated experiment:
UPDATE - 1/15/2020: the current best practice for small batch sizes should be to feed inputs to the model directly - i.e. preds = model(x), and if layers behave differently at train / inference, model(x, training=False). Per latest commit, this is now documented.
I haven't benchmarked these, but per the Git discussion, it's also worth trying predict_on_batch() - especially with improvements in TF 2.1.
ULTIMATE CULPRIT: self._experimental_run_tf_function = True. It's experimental. But it's not actually bad.
To any TensorFlow devs reading: clean up your code. It's a mess. And it violates important coding practices, such as one function does one thing; _process_inputs does a lot more than "process inputs", same for _standardize_user_data. "I'm not paid enough" - but you do pay, in extra time spent understanding your own stuff, and in users filling your Issues page with bugs easier resolved with a clearer code.
SUMMARY: it's only a little slower with compile().
compile() sets an internal flag which assigns a different prediction function to predict. This function constructs a new graph upon each call, slowing it down relative to uncompiled. However, the difference is only pronounced when train time is much shorter than data processing time. If we increase the model size to at least mid-sized, the two become equal. See code at the bottom.
This slight increase in data processing time is more than compensated by amplified graph capability. Since it's more efficient to keep only one model graph around, the one pre-compile is discarded. Nonetheless: if your model is small relative to data, you are better off without compile() for model inference. See my other answer for a workaround.
Compare model performance compiled vs uncompiled as I have in code at the bottom.
Compiled is faster: run predict on a compiled model.
Compiled is slower: run predict on an uncompiled model.
Yes, both are possible, and it will depend on (1) data size; (2) model size; (3) hardware. Code at the bottom actually shows compiled model being faster, but 10 iterations is a small sample. See "workarounds" in my other answer for the "how-to".
This took a while to debug, but was fun. Below I describe the key culprits I discovered, cite some relevant documentation, and show profiler results that led to the ultimate bottleneck.
(FLAG == self.experimental_run_tf_function, for brevity)
Model by default instantiates with FLAG=False. compile() sets it to True.
predict() involves acquiring the prediction function, func = self._select_training_loop(x)
Without any special kwargs passed to predict and compile, all other flags are such that:
(A) FLAG==True --> func = training_v2.Loop()
(B) FLAG==False --> func = training_arrays.ArrayLikeTrainingLoop()
From source code docstring, (A) is heavily graph-reliant, uses more distribution strategy, and ops are prone to creating & destroying graph elements, which "may" (do) impact performance.
True culprit: _process_inputs(), accounting for 81% of runtime. Its major component? _create_graph_function(), 72% of runtime. This method does not even exist for (B). Using a mid-sized model, however, _process_inputs comprises less than 1% of runtime. Code at bottom, and profiling results follow.
(A): <class 'tensorflow.python.keras.engine.data_adapter.TensorLikeDataAdapter'>, used in _process_inputs() . Relevant source code
(B): numpy.ndarray, returned by convert_eager_tensors_to_numpy. Relevant source code, and here
(A): distribution function, and here
(B): distribution function (different), and here
PROFILER: results for code in my other answer, "tiny model", and in this answer, "medium model":
Tiny model: 1000 iterations, compile()
Tiny model: 1000 iterations, no compile()
Medium model: 10 iterations
DOCUMENTATION (indirectly) on effects of compile(): source
Unlike other TensorFlow operations, we don't convert python
numerical inputs to tensors. Moreover, a new graph is generated for each
distinct python numerical value, for example calling g(2) and g(3) will
generate two new graphs
function instantiates a separate graph for every unique set of input
shapes and datatypes. For example, the following code snippet will result
in three distinct graphs being traced, as each input has a different
A single tf.function object might need to map to multiple computation graphs
under the hood. This should be visible only as performance (tracing graphs has
a nonzero computational and memory cost) but should not affect the correctness
of the program
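The quoted passages describe one graph being traced per distinct Python value and per distinct input shape and dtype. The toy class below is plain Python, not TensorFlow; ToyTracedFunction and its signature rule are hypothetical and only mimic that caching rule, so that the number of "traces" can be counted.

```python
class ToyTracedFunction:
    """Caches one 'graph' per input signature and counts how many are built."""

    def __init__(self, fn):
        self.fn = fn
        self.graphs = {}      # signature -> traced callable
        self.trace_count = 0

    def _signature(self, arg):
        # Python numbers are keyed by value; sequences by element type and length.
        if isinstance(arg, (int, float)):
            return ("python_value", arg)
        return ("tensor_like", type(arg[0]).__name__, len(arg))

    def __call__(self, arg):
        key = self._signature(arg)
        if key not in self.graphs:
            self.trace_count += 1      # the expensive step
            self.graphs[key] = self.fn
        return self.graphs[key](arg)


def double(v):
    if isinstance(v, (int, float)):
        return v * 2
    return [e * 2 for e in v]


g = ToyTracedFunction(double)
g(2)
g(3)                  # two distinct Python values: two traces
g([1.0, 2.0])
g([5.0, 6.0])         # same element type and length: one trace, reused
g([1.0, 2.0, 3.0])    # new length: another trace
print(g.trace_count)  # 4
```

Five calls cost four traces here; only the call that repeats an earlier signature is cheap.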
from tensorflow.keras.layers import Input, Dense, LSTM, Bidirectional, Conv1D
from tensorflow.keras.layers import Flatten, Dropout
from tensorflow.keras.models import Model
import numpy as np
from time import time
def timeit(func, arg, iterations):
    t0 = time()
    for _ in range(iterations):
        func(arg)
    print("%.4f sec" % (time() - t0))
batch_size = 32
batch_shape = (batch_size, 400, 16)
ipt = Input(batch_shape=batch_shape)
x = Bidirectional(LSTM(512, activation='relu', return_sequences=True))(ipt)
x = LSTM(512, activation='relu', return_sequences=True)(ipt)
x = Conv1D(128, 400, 1, padding='same')(x)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(128, activation='relu')(x)
x = Dense(64, activation='relu')(x)
out = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)
X = np.random.randn(*batch_shape)
timeit(model.predict, X, 10)
model.compile('adam', loss='binary_crossentropy')
timeit(model.predict, X, 10)
34.8542 sec
34.7435 sec
UPDATE: see actual answer posted as a separate answer; this post contains supplemental info
.compile() sets up the majority of TF/Keras graph, including losses, metrics, gradients, and partly the optimizer and its weights - which guarantees a notable slowdown.
What is unexpected is the extent of slowdown - 10-fold on my own experiment, and for predict(), which doesn't update any weights. Looking into TF2's source code, graph elements appear tightly intertwined, with resources not necessarily being allocated "fairly".
This is possibly an oversight by the developers regarding predict's performance for an uncompiled model, as models are typically used compiled, but in practice this is an unacceptable difference. It's also possible it's a "necessary evil", as there is a simple workaround (see below).
This isn't a complete answer, and I hope someone can provide it here - if not, I'd suggest opening a Github issue on TensorFlow. (OP has; here)
Workaround: train a model, save its weights, re-build the model without compiling, load the weights. Do not save the entire model (e.g., with model.save()), as it'll load compiled; instead use model.save_weights() and model.load_weights().
Workaround 2: above, but use load_model(path, compile=False); suggestion credit: D. Möller
UPDATE: to clarify, optimizer is not fully instantiated with compile, including its weights and updates tensors - this is done when the first call to a fitting function is made (fit, train_on_batch, etc), via model._make_train_function().
The observed behavior is thus even more strange. Worse yet, building the optimizer does not elicit any further slowdowns (see below) - suggesting "graph size" is not the main explanation here.
EDIT: on some models, a 30x slowdown. TensorFlow, what have you done. Example below:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
import numpy as np
from time import time
def timeit(func, arg, iterations):
    t0 = time()
    for _ in range(iterations):
        func(arg)
    print("%.4f sec" % (time() - t0))
ipt = Input(shape=(4,))
x = Dense(2, activation='relu')(ipt)
out = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)
X = np.random.randn(32,4)
timeit(model.predict, X, 1000)
model.compile('adam', loss='binary_crossentropy')
timeit(model.predict, X, 1000)
model._make_train_function() # build optimizer
timeit(model.predict, X, 1000)
0.9891 sec
29.785 sec
29.521 sec

free(): invalid pointer Aborted (core dumped)

I'm trying to run my Python program. It seems that it should run smoothly; however, I encounter an error that I haven't seen before. It says:
free(): invalid pointer
Aborted (core dumped)
However, I'm not sure how to go about fixing the error, since it doesn't give me much information about the problem itself.
At first I thought it was a problem with the sizes of the tensors in my network; however, they are completely fine. I've googled the problem a little and found that it is a problem with allocating memory where I shouldn't, but I don't know how to fix it.
My code is divided into two different files, and I use two libraries to be able to use the Sinkhorn loss function and to randomly sample a mesh.
import argparse
import point_cloud_utils as pcu
import time
import numpy as np
import torch
import torch.nn as nn
from fml.nn import SinkhornLoss
import common
def main():
    # x is a tensor of shape [n, 3] containing the positions of the vertices that
    x = torch.from_numpy(common.loadpointcloud("sphere.txt"))
    # t is a tensor of shape [n, 3] containing a set of nicely distributed samples in the unit cube
    v, f = common.unit_cube()
    t = torch.from_numpy(pcu.sample_mesh_lloyd(v, f, x.shape[0]).astype(np.float32))  # sample randomly a point cloud (cube for now?)
    # The model is a simple fully connected network mapping a 3D parameter point to 3D
    phi = common.MLP(in_dim=3, out_dim=3)
    # Eps is 1/lambda and max_iters is the maximum number of Sinkhorn iterations to do
    emd_loss_fun = SinkhornLoss(eps=1e-3, max_iters=20,
                                stop_thresh=1e-3, return_transport_matrix=True)
    mse_loss_fun = torch.nn.MSELoss()
    # Adam optimizer at first
    optimizer = torch.optim.Adam(phi.parameters(), lr=10e-3)
    fit_start_time = time.time()
    for epoch in range(100):
        optimizer.zero_grad()
        # Do the forward pass of the neural net, evaluating the function at the parametric points
        y = phi(t)
        # Compute the Sinkhorn divergence between the reconstruction*(using the francis library) and the target
        # NOTE: The Sinkhorn function expects a batch of b point sets (i.e. tensors of shape [b, n, 3])
        # since we only have 1, we unsqueeze so x and y have dimension [1, n, 3]
        with torch.no_grad():
            _, P = emd_loss_fun(phi(t).unsqueeze(0), x.unsqueeze(0))
        # Project the transport matrix onto the space of permutation matrices and compute the L-2 loss
        # between the permuted points
        loss = mse_loss_fun(y[P.squeeze().max(0)[1], :], x)
        # loss = mse_loss_fun(P.squeeze() @ y, x)  # Use the transport matrix directly
        # Take an optimizer step
        loss.backward()
        optimizer.step()
        print("Epoch %d, loss = %f" % (epoch, loss.item()))
    fit_end_time = time.time()
    print("Total time = %f" % (fit_end_time - fit_start_time))
    # Plot the ground truth, reconstructed points, and a mesh representing the fitted function, phi
if __name__ == "__main__":
    main()
The error message is:
free(): invalid pointer
Aborted (core dumped)
That again doesn't help me that much. I'll appreciate it a lot if someone has any idea what is happening or if you know more about this error.
Edit: The cause is actually known. The recommended solution is to build both packages from source.
There is a known issue with importing both open3d and PyTorch. The cause is unknown.
A few possible workarounds exist:
(1) Some people have found that changing the order in which you import the two packages can resolve the issue, though in my personal testing both ways crash.
(2) Other people have found compiling both packages from source to help.
(3) Still others have found that moving open3d and PyTorch to be called from separate scripts resolves the issue.
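Workaround (3) can be sketched with the standard library alone: each stage runs in its own interpreter, so the two packages are never loaded into the same process. This is only an illustration under that assumption; run_isolated and the two placeholder snippets are hypothetical, and a real pipeline would pass data between the stages through files.

```python
import subprocess
import sys


def run_isolated(code):
    """Run a Python snippet in a fresh interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,   # raise CalledProcessError if the child crashes
    )
    return result.stdout.strip()


# Placeholder snippets: in real use, the first would import only one of the
# conflicting packages and write its result to disk, the second would import
# only the other package and read that result back.
sampling_output = run_isolated("print('sampled points written')")
training_output = run_isolated("print('training finished')")
print(sampling_output)
print(training_output)
```

With check=True, a child process that aborts raises an exception in the parent instead of taking the parent down with it.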
Note for future readers: This bug was filed as issue #21018.
This is not a problem in your Python code. It is a bug in PyTorch (probably) or in Python itself (unlikely, but possible).
free(3) is a C function that releases dynamically allocated memory when it is no longer needed. You cannot (easily) call it from Python, because memory management is a low-level implementation detail normally handled by the Python interpreter. However, you are also using PyTorch, which is written in C++ and C, and does have the ability to directly allocate and free memory.
In this case, some C code has tried to release a block of memory, but the block of memory it tried to release was not dynamically allocated in the first place, which is an error. You should report this behavior to the PyTorch developers. Include as much detail as possible, including the shortest code you can find that reproduces the problem, and the complete output of that program.

Tensorflow compute_gradients and apply_gradients running out of memory

I have the following lines as part of a program:
tensor_gradients = optimizer.compute_gradients(cross_entropy)
with tf.Session() as session:
    for step in range(20000):
        batch = mnist.train.next_batch(train_batch_size)
        feed = {input_x: batch[0], input_y: batch[1]}
        gradients = session.run([tensor_gradients], feed)[0]
        for i in range(len(gradients)):
            gradients[i] = (gradients[i][0], tensor_gradients[i][1])
        ... computation on gradients ...
        training_step = optimizer.apply_gradients(gradients)
        training = session.run([training_step], feed)
The reason I'm doing this is that I want to modify the gradients using numpy. The above code runs out of memory around step 800. However, if you pass tensor_gradients to the optimizer.apply_gradients step instead of gradients, the code does not run out of memory.
training_step = optimizer.apply_gradients(tensor_gradients)
Any ideas at what might be happening? The rest of the code remains the same except for the line above. Is it possible that the numpy arrays in gradients are not being garbage collected because they are being passed into the apply_gradients step? I have no idea where the memory leak could be, or whether I'm inadvertently adding to the tensorflow graph by passing modified gradients (in numpy array form) back into apply_gradients.
Any ideas at what might be happening?
OOM happens because you're constructing the graph inside the loop. This builds a graph with 20,000x nodes, and running it may need more memory than you have.
Move all TF operations that build the graph outside the loop, i.e. everything except feed_dict construction and session.run calls.
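The effect can be illustrated with a toy graph that only counts its ops. This is plain Python, not TensorFlow; ToyGraph and both training functions are hypothetical. In real TF1 code, the second pattern would correspond to creating placeholders for the modified gradients once, calling apply_gradients once on the (placeholder, variable) pairs, and feeding the NumPy arrays through feed_dict on each step.

```python
class ToyGraph:
    """Minimal stand-in for a static graph: it only records added ops."""

    def __init__(self):
        self.ops = []

    def add_op(self, name):
        self.ops.append(name)
        return len(self.ops) - 1   # handle of the new op


def train_building_inside_loop(steps):
    graph = ToyGraph()
    graph.add_op("compute_gradients")
    for _ in range(steps):
        graph.add_op("apply_gradients")   # a new op on every step
    return len(graph.ops)


def train_building_once(steps):
    graph = ToyGraph()
    graph.add_op("compute_gradients")
    graph.add_op("gradient_placeholder")        # fed with NumPy values later
    apply_op = graph.add_op("apply_gradients")  # built a single time
    for _ in range(steps):
        _ = apply_op                            # only "run" the existing op
    return len(graph.ops)


print(train_building_inside_loop(20000))  # 20001
print(train_building_once(20000))         # 3
```

The first function's graph grows with the step count, while the second stays at three ops however long training runs.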
Reply to comments
Apply gradients builds the graph?
Yes, if you look in the docs:
An `Operation` that applies the specified gradients. If `global_step`
was not None, that operation also increments `global_step`.
