I-m trying to run my python program it seems that it should run smoothly however I encounter an error that I haven't seen before it says:
free(): invalid pointer
Aborted (core dumped)
However I'm not sure how to try and fix error since it doesn't give me too much information about the problem itself.
At first I thought it should be a problem with the sizes of the tensor in my network however they are completely fine. I've google the problem a little and found that I can see that is a problem with allocating memory where I shouldn't, but I don't know how to fix this problem
My code is divided in two different files, and I use two libraries to be able to use Sinkhorn loss function and make sample randomly a mesh.
import argparse
import point_cloud_utils as pcu
import time
import numpy as np
import torch
import torch.nn as nn
from fml.nn import SinkhornLoss
import common
def main():
# x is a tensor of shape [n, 3] containing the positions of the vertices that
x = torch._C.from_numpy(common.loadpointcloud("sphere.txt"))
# t is a tensor of shape [n, 3] containing a set of nicely distributed samples in the unit cube
v, f = common.unit_cube()
t = torch._C.sample_mesh_lloyd(pcu.lloyd(v,f,x.shape[0]).astype(np.float32)) # sample randomly a point cloud (cube for now?)
# The model is a simple fully connected network mapping a 3D parameter point to 3D
phi = common.MLP(in_dim=3, out_dim=3)
# Eps is 1/lambda and max_iters is the maximum number of Sinkhorn iterations to do
emd_loss_fun = SinkhornLoss(eps=1e-3, max_iters=20,
stop_thresh=1e-3, return_transport_matrix=True)
mse_loss_fun = torch.nn.MSELoss()
# Adam optimizer at first
optimizer = torch.optim.Adam(phi.parameters(), lr= 10e-3)
fit_start_time = time.time()
for epoch in range(100):
optimizer.zero_grad()
# Do the forward pass of the neural net, evaluating the function at the parametric points
y = phi(t)
# Compute the Sinkhorn divergence between the reconstruction*(using the francis library) and the target
# NOTE: The Sinkhorn function expects a batch of b point sets (i.e. tensors of shape [b, n, 3])
# since we only have 1, we unsqueeze so x and y have dimension [1, n, 3]
with torch.no_grad():
_, P = emd_loss_fun(phi(t).unsqueeze(0), x.unsqueeze(0))
# Project the transport matrix onto the space of permutation matrices and compute the L-2 loss
# between the permuted points
loss = mse_loss_fun(y[P.squeeze().max(0)[1], :], x)
# loss = mse_loss_fun(P.squeeze() # y, x) # Use the transport matrix directly
# Take an optimizer step
loss.backward()
optimizer.step()
print("Epoch %d, loss = %f" % (epoch, loss.item()))
fit_end_time = time.time()
print("Total time = %f" % (fit_end_time - fit_start_time))
# Plot the ground truth, reconstructed points, and a mesh representing the fitted function, phi
common.visualitation(x,t,phi)
if __name__ == "__main__":
main()
The error message is:
free(): invalid pointer
Aborted (core dumped)
That again doesn't help me that much. I'll appreciate it a lot if someone has any idea what is happening or if you know more about this error.
Edit: The cause is actually known. The recommended solution is to build both packages from source.
There is a known issue with importing both open3d and PyTorch. The cause is unknown. https://github.com/pytorch/pytorch/issues/19739
A few possible workarounds exist:
(1) Some people have found that changing the order in which you import the two packages can resolve the issue, though in my personal testing both ways crash.
(2) Other people have found compiling both packages from source to help.
(3) Still others have found that moving open3d and PyTorch to be called from separate scripts resolves the issue.
Note for future readers: This bug was filed as issue #21018.
This is not a problem in your Python code. It is a bug in PyTorch (probably) or in Python itself (unlikely, but possible).
free(3) is a C function that releases dynamically allocated memory when it is no longer needed. You cannot (easily) call it from Python, because memory management is a low-level implementation detail normally handled by the Python interpreter. However, you are also using PyTorch, which is written in C++ and C, and does have the ability to directly allocate and free memory.
In this case, some C code has tried to release a block of memory, but the block of memory it tried to release was not dynamically allocated in the first place, which is an error. You should report this behavior to the PyTorch developers. Include as much detail as possible, including the shortest code you can find that reproduces the problem, and the complete output of that program.
Related
I'm working on a variational auto-encoder and I'd like the prior used in the KL-divergence regularization of the latent distribution to have its loc (mean) and scale (stddev) updated.
The below snippet is a contrived minimal example demonstrating what I'm trying to achieve. This starts to work but then just freezes after some random number of epochs (sometimes 1, sometimes 200, but usually around 7 or 8). There's no error message or anything.
loc = tf.Variable(tf.random.normal([ndim], stddev=0.1, dtype=tf.float32))
scale = tfp.util.TransformedVariable(
tf.random.normal([ndim], mean=1.0, stddev=0.1, dtype=tf.float32),
bijector=tfb.Chain([tfb.Shift(1e-5), tfb.Softplus(), tfb.Shift(0.5413)]))
prior = tfd.Independent(tfd.Normal(loc=loc, scale=scale), reinterpreted_batch_ndims=1)
_input = tfkl.Input(shape=(1,))
_loc = tfkl.Dense(ndim, name="loc_params")(_input)
_scale = tfkl.Dense(ndim, name="untransformed_scale_params")(_input)
_scale = tf.math.softplus(_scale + np.log(np.exp(1) - 1)) + 1e-5
_output = tfpl.DistributionLambda(
make_distribution_fn=lambda t: tfd.Independent(tfd.Normal(loc=t[0], scale=t[1])),
activity_regularizer=tfpl.KLDivergenceRegularizer(prior, use_exact_kl=True, weight=0.1)
)([_loc, _scale])
model = tf.keras.Model(_input, _output)
model.compile(optimizer='adam', loss=lambda y_true, model_out: -model_out.log_prob(y_true))
hist = model.fit(ds, epochs=N_EPOCHS, verbose=2)
I have a runnable gist here.
A more concrete example, and an architecture close to what I'm trying to update and simplify, is the tfp example for disentangled_vae. In its manual training loop, a new tfd.MultivariateNormalDiag is instantiated on every loop, though it is parameterized using persistent tf.Variables. I'm trying my best to avoid manual training loops, and I'm also trying to move to more Keras-like syntax, so I'd rather not do a direct port of this example.
Any advice is greatly appreciated. Thanks!
Edit: The activity_regularizer seems to work fine when attached to a latent (bottleneck) distribution. I have a more complete example in this Colab notebook. As this works in my architecture, I'm no longer in need of an answer.
However, I highly doubt having model fitting freeze is desirable behaviour, so this remains a problem.
As the machinery works in most circumstances, just not the contrived freezing example above, I no longer consider this a question that needs an answer.
I have reported the errorless freezing behaviour via the tensorflow-probability repository issues page. See here.
In theory, the prediction should be constant as the weights have a fixed size. How do I get my speed back after compile (without the need to remove optimizer)?
See associated experiment: https://nbviewer.jupyter.org/github/off99555/TensorFlowExperiments/blob/master/test-prediction-speed-after-compile.ipynb?flush_cache=true
UPDATE - 1/15/2020: the current best practice for small batch sizes should be to feed inputs to the model directly - i.e. preds = model(x), and if layers behave differently at train / inference, model(x, training=False). Per latest commit, this is now documented.
I haven't benchmarked these, but per the Git discussion, it's also worth trying predict_on_batch() - especially with improvements in TF 2.1.
ULTIMATE CULPRIT: self._experimental_run_tf_function = True. It's experimental. But it's not actually bad.
To any TensorFlow devs reading: clean up your code. It's a mess. And it violates important coding practices, such as one function does one thing; _process_inputs does a lot more than "process inputs", same for _standardize_user_data. "I'm not paid enough" - but you do pay, in extra time spent understanding your own stuff, and in users filling your Issues page with bugs easier resolved with a clearer code.
SUMMARY: it's only a little slower with compile().
compile() sets an internal flag which assigns a different prediction function to predict. This function constructs a new graph upon each call, slowing it down relative to uncompiled. However, the difference is only pronounced when train time is much shorter than data processing time. If we increase the model size to at least mid-sized, the two become equal. See code at the bottom.
This slight increase in data processing time is more than compensated by amplified graph capability. Since it's more efficient to keep only one model graph around, the one pre-compile is discarded. Nonetheless: if your model is small relative to data, you are better off without compile() for model inference. See my other answer for a workaround.
WHAT SHOULD I DO?
Compare model performance compiled vs uncompiled as I have in code at the bottom.
Compiled is faster: run predict on a compiled model.
Compiled is slower: run predict on an uncompiled model.
Yes, both are possible, and it will depend on (1) data size; (2) model size; (3) hardware. Code at the bottom actually shows compiled model being faster, but 10 iterations is a small sample. See "workarounds" in my other answer for the "how-to".
DETAILS:
This took a while to debug, but was fun. Below I describe the key culprits I discovered, cite some relevant documentation, and show profiler results that led to the ultimate bottleneck.
(FLAG == self.experimental_run_tf_function, for brevity)
Model by default instantiates with FLAG=False. compile() sets it to True.
predict() involves acquiring the prediction function, func = self._select_training_loop(x)
Without any special kwargs passed to predict and compile, all other flags are such that:
(A) FLAG==True --> func = training_v2.Loop()
(B) FLAG==False --> func = training_arrays.ArrayLikeTrainingLoop()
From source code docstring, (A) is heavily graph-reliant, uses more distribution strategy, and ops are prone to creating & destroying graph elements, which "may" (do) impact performance.
True culprit: _process_inputs(), accounting for 81% of runtime. Its major component? _create_graph_function(), 72% of runtime. This method does not even exist for (B). Using a mid-sized model, however, _process_inputs comprises less than 1% of runtime. Code at bottom, and profiling results follow.
DATA PROCESSORS:
(A): <class 'tensorflow.python.keras.engine.data_adapter.TensorLikeDataAdapter'>, used in _process_inputs() . Relevant source code
(B): numpy.ndarray, returned by convert_eager_tensors_to_numpy. Relevant source code, and here
MODEL EXECUTION FUNCTION (e.g. predict)
(A): distribution function, and here
(B): distribution function (different), and here
PROFILER: results for code in my other answer, "tiny model", and in this answer, "medium model":
Tiny model: 1000 iterations, compile()
Tiny model: 1000 iterations, no compile()
Medium model: 10 iterations
DOCUMENTATION (indirectly) on effects of compile(): source
Unlike other TensorFlow operations, we don't convert python
numerical inputs to tensors. Moreover, a new graph is generated for each
distinct python numerical value, for example calling g(2) and g(3) will
generate two new graphs
function instantiates a separate graph for every unique set of input
shapes and datatypes. For example, the following code snippet will result
in three distinct graphs being traced, as each input has a different
shape
A single tf.function object might need to map to multiple computation graphs
under the hood. This should be visible only as performance (tracing graphs has
a nonzero computational and memory cost) but should not affect the correctness
of the program
COUNTEREXAMPLE:
from tensorflow.keras.layers import Input, Dense, LSTM, Bidirectional, Conv1D
from tensorflow.keras.layers import Flatten, Dropout
from tensorflow.keras.models import Model
import numpy as np
from time import time
def timeit(func, arg, iterations):
t0 = time()
for _ in range(iterations):
func(arg)
print("%.4f sec" % (time() - t0))
batch_size = 32
batch_shape = (batch_size, 400, 16)
ipt = Input(batch_shape=batch_shape)
x = Bidirectional(LSTM(512, activation='relu', return_sequences=True))(ipt)
x = LSTM(512, activation='relu', return_sequences=True)(ipt)
x = Conv1D(128, 400, 1, padding='same')(x)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(128, activation='relu')(x)
x = Dense(64, activation='relu')(x)
out = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)
X = np.random.randn(*batch_shape)
timeit(model.predict, X, 10)
model.compile('adam', loss='binary_crossentropy')
timeit(model.predict, X, 10)
Outputs:
34.8542 sec
34.7435 sec
UPDATE: see actual answer posted as a separate answer; this post contains supplemental info
.compile() sets up the majority of TF/Keras graph, including losses, metrics, gradients, and partly the optimizer and its weights - which guarantees a notable slowdown.
What is unexpected is the extent of slowdown - 10-fold on my own experiment, and for predict(), which doesn't update any weights. Looking into TF2's source code, graph elements appear tightly intertwined, with resources not necessarily being allocated "fairly".
Possible overlook by developers on predict's performance for an uncompiled model, as models are typically used compiled - but in practice, this is an unacceptable difference. It's also possible it's a "necessary evil", as there is a simple workaround (see below).
This isn't a complete answer, and I hope someone can provide it here - if not, I'd suggest opening a Github issue on TensorFlow. (OP has; here)
Workaround: train a model, save its weights, re-build the model without compiling, load the weights. Do not save the entire model (e.g. model.save()), as it'll load compiled - instead use model.save_weights() and model.load_weights().
Workaround 2: above, but use load_model(path, compile=False); suggestion credit: D. Möller
UPDATE: to clarify, optimizer is not fully instantiated with compile, including its weights and updates tensors - this is done when the first call to a fitting function is made (fit, train_on_batch, etc), via model._make_train_function().
The observed behavior is thus even more strange. Worse yet, building the optimizer does not elicit any further slowdowns (see below) - suggesting "graph size" is not the main explanation here.
EDIT: on some models, a 30x slowdown. TensorFlow, what have you done. Example below:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
import numpy as np
from time import time
def timeit(func, arg, iterations):
t0 = time()
for _ in range(iterations):
func(arg)
print("%.4f sec" % (time() - t0))
ipt = Input(shape=(4,))
x = Dense(2, activation='relu')(ipt)
out = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)
X = np.random.randn(32,4)
timeit(model.predict, X, 1000)
model.compile('adam', loss='binary_crossentropy')
timeit(model.predict, X, 1000)
model._make_train_function() # build optimizer
timeit(model.predict, X, 1000)
Outputs:
0.9891 sec
29.785 sec
29.521 sec
I have doing an analysis trying to see the relation between training time and maximum iteration in SVC. The data I use is some randomly generated number and I plotted the training time against max_iter of SVC fit. I checked logs and each binary classifier has reached the max_iter (I output all console logs which showed detailed warning for each binary classifier and count them). However, I was assuming that the training time will be strictly linear to the iteration but actually, in the case that the training data has many labels e.g. say 40, then the plot does not show it's linear.
It seems as the maximum iteration goes up, each iteration takes slight less time than before. While if we changed label_size to be 2 (which means each fit contains only 1 binary classifier), the line is straight.
What causes that to happen?
Here is my source code:
# -*- coding: utf-8 -*-
import numpy as np
from sklearn.svm import SVC
import time
import pandas as pd
def main(row_size, label_size):
np.random.seed(2019)
y = np.array([i for i in range(label_size) for j in range(row_size
/ label_size)])
if len(y) < row_size:
y = np.append(y, [y[-1]] * (row_size - len(y)))
X = np.random.rand(row_size, 300)
print X.shape, y.shape
return (X, y)
def train_svm(X, y, max_iter):
best_params = {'C': 1}
clf = SVC(
C=best_params['C'],
kernel=str('linear'),
probability=False,
class_weight='balanced',
max_iter=max_iter,
random_state=2018,
verbose=True,
)
start = time.time()
clf.fit(X, y)
end = time.time()
return end - start
if __name__ == '__main__':
row_size = 20000
m_iter = range(10, 401, 20)
label_size = [40]
data = {
'label_size': [],
'max_iter': [],
'row_size': [],
'time': [],
}
for it in m_iter:
for l in label_size:
(X, y) = main(row_size, l)
t = train_svm(X, y, max_iter=it)
data['label_size'].append(l)
data['max_iter'].append(it)
data['row_size'].append(row_size)
data['time'].append(t)
df = pd.DataFrame(data)
df.to_csv('svc_iter.csv', index=None)
Well, there could be loads of reasons for that "very slight change". Scikit-Learn doesn't operate natively, it's built upon different libraries and it may be using loads of optimizers..etc!
Besides, your first graph is very close to linear!
Nevertheless, a big noticeable reasonable factor that contributed in those tiny changes is the Decomposition Method in Support Vector Machine.
The idea of decomposition methodology for classification tasks is to break down a complex classification task into several simpler and more manageable sub-tasks that are solvable by using existing induction methods, then joining their solutions together in order to solve the original problem.
This method is an iterative process and in each iteration only few variables are updated.
For more details about the mathematical approach, please refer to this paper, section 6.2 The Decomposition Method..
Moreover and specifically speaking, SVM implements two tricks called shrinking and caching for the decomposition method.
The Shrinking idea is that an optimal solution α of the SVM dual problem may contain some bounded elements (i.e., αi = 0 or C). These elements may have already been bounded in the middle of the decomposition iterations. To save the training time, the shrinking technique tries to identify and remove some bounded elements, so a smaller optimization problem is solved.
The Caching idea is an effective technique for reducing the computational time of the decomposition method, so elements are calculated as needed. We can use available memory (called kernel cache) to store some recently used permutation of the matrix Qij. Then, some kernel elements may not need to be recalculate.
For more details about the mathematical approach, please refer to this paper, section 5 Shrinking and Caching.
Technical Proof:
I repeated your experiment (that's way I asked for your code to follow the same exact approach), with and without using the shrinking and caching optimization.
Using Shrinking and Caching
The default value of the parameter shrinking in sklearn SVC is set to True, keeping that as it is, produced the following output:
Plotting it gives:
Note how at some point, the time drops noticeably reflecting the effect of shrinking and caching.
Without Using Shrinking and Caching
Using the same exact approach but this time, setting the parameter shrinking explicitly to False as follows:
clf = SVC(
C=best_params['C'],
kernel=str('linear'),
probability=False,
class_weight='balanced',
max_iter=max_iter,
random_state=2018,
verbose=True,
shrinking=False
)
Produced the following output:
Plotting it gives:
Note how unlike previous plot, there is no noticeable drop in time at some point, it's rather just a very tiny fluctuations along with the entire plot.
Comparing Pearson Correlations
In conclusion:
Without using the Shrinking and Caching (updated later with caching), the linearity improved, although it's not 100% linear, but if you take into account that Scikit-Learn internally uses libsvm library to handle all computations. And this library is wrapped using C and Cython, you would have a higher tolerance to your definition about 'Linear' of the relationship between maximum iterations and time. Also, here is a cool discussion about why algorithms may not give the exact same precise definite running time every time.
And that would be even clearer to you if you plot the interval times, so you can see clearly how the drops happen suddenly noticeably in more than one place.
While it keeps almost the same flow without using the optimization tricks.
Important Update
It turned out that the aforementioned reason for this issue (i.e. Shrinking and Caching) is correct, or more precisely, it's a very big factor of that phenomenon.
But the thing I missed is the following:
I was talking about Shrinking and Caching but I missed the later parameter for caching which is set by default to 200 MB.
Repeating the same simulations more than one time and setting the cache_size parameter to a very small number (because zero is not acceptable and throws an error) in addition to shrinking=False, resulted in an extremely-close-to linear pattern between max_iter and time:
clf = SVC(
C=best_params['C'],
kernel=str('linear'),
probability=False,
class_weight='balanced',
max_iter=max_iter,
random_state=2018,
verbose=False,
shrinking=False,
cache_size = 0.000000001
)
By the way, you don't need to set verbose=True, you can check if it reached the maximum iteration via the ConvergenceWarning, so you can redirect those warnings to a file and it'll be million times easier to follow, just add this code:
import warnings, sys
def customwarn(message, category, filename, lineno, file=None, line=None):
with open('warnings.txt', 'a') as the_file:
the_file.write(warnings.formatwarning(message, category, filename, lineno))
warnings.showwarning = customwarn
Also you don't need to re-generate the dataset after each iteration, so take it out the loop like this:
(X, y) = main(row_size, 40)
for it in m_iter:
....
....
Final Conclusion
Shrinking and Caching tricks coming from Decomposition Method in SVM play a big significant role in improving the execution time as the number of iterations increases. Besides, there are other small players that may be contributing in this matter such as internal usage of libsvm library to handle all computations which is wrapped using C and Cython.
I have the following lines as part of a program:
tensor_gradients = optimizer.compute_gradients(cross_entropy)
with tf.Session() as session:
for step in range(20000):
batch = mnist.train.next_batch(train_batch_size)
feed = {input_x: batch[0], input_y: batch[1]}
gradients = session.run([tensor_gradients], feed)[0]
for i in range(len(gradients)):
gradients[i] = (gradients[i][0], tensor_gradients[i][1])
... computation on gradients ...
training_step = optimizer.apply_gradients(gradients)
training = session.run([training_step], feed)
The reason I'm doing this is because I want to modify the gradients using numpy. The above code runs out of memory around step 800. However, if you replace the optimizer.apply_gradients step by tensor_gradients, then the code does not run out of memory.
training_step = optimizer.apply_gradients(tensor_gradients)
Any ideas at what might be happening? The rest of the code remains the same except for the line above. Is it possible that the numpy arrays in gradients is not being garbage collected because they are being passed into the apply_gradients step? I have no idea where the memory leak could be or if I'm inadvertently adding to the tensorflow graph by passing modified gradients (in numpy array form) back into apply_gradients.
Any ideas at what might be happening?
OOM happens because you're constructing the graph inside the loop: This builds a graph with 20,000x nodes, and running it may need more memory than you have.
Move all TF operations that build the graph outside the loop, i.e. everything except feed_dict construction and sess.run calls.
Reply to comments
Apply gradients builds the graph?
Yes, if you look in the docs:
Returns:
An `Operation` that applies the specified gradients. If `global_step`
was not None, that operation also increments `global_step`.
Today, I got a really weird thing.
I load a caffe model, feed input, net.forward, check the output data, perfect.
Then, I feed labels to the bottom layer blobs.diff, net.backward, then check the gradients (params.diff) with the result from same model caffe c++ program. They were different.
Further, when I continued to run net.backward several times at python, each time I got different gradients. This is not the case for C++ programs, they keep the same no matter how many time you run net.backward, as long as you did not change the bottom diff.
I check the bottom layer's blobs and diff, they kept unchanged both in python and C++ programs, and weights were also unchanged. This was really weird.
Anyone can provide some hints? I can provide codes if it is necessary.
Here is part of the codes :
def train_one_step(X, y, lr) :
net.blobs['data'].data[...] = X
#Forward, to get the softmax output
output = net.forward()
prob = output['prob']
#Calculate the loss of cross entropy loss function
net.blobs['prob'].diff[:] = y[:] - prob[:]
#Calculate the gradients of net parameter
net.backward()
#Renew weights based on gradients and learning rate
for key in net.params:
net.params[key][0].data[:] += lr * net.params[key][0].diff[:]
net.params[key][1].data[:] += lr * net.params[key][1].diff[:]
return loss, prob
I just want to dig out my own step function (the step of solver), so I can make some trick on the loss before it backwards, and something else. I know this is quite low efficient, data between GPU, CPU exchanged a lot.
In order to test it, I kept input the same sample(same X, y), you get different diff data. That means this function cannot work.