Getting the autograd counter of a tensor in PyTorch

I am using PyTorch for training a network. I was going through the autograd documentation and here it is mentioned that for each tensor there is a counter that the autograd implements to track the "version" of any tensor. How can I get this counter for any tensor in the graph?
Reason why I need it.
I have encountered the autograd error
[torch.cuda.FloatTensor [x, y, z]], which is output 0 of torch::autograd::CopySlices, is at version 7; expected version 6 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
This is not new to me and I have been successful in handling it before. This time around I am not able to see why the tensor would be at version 7 instead of being at 6. To answer this, I would want to know the version at any given point in the run.
Thanks.

The counter is available as the attribute tensor_name._version.
A minimal working example showing how to use it:
import torch
a = torch.zeros(10, 5)
print(a._version) # prints 0
a[:, 1] = 1
print(a._version) # prints 1
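To see which kinds of operations move the counter, a short sketch like the following may help. It relies on _version, which is a private attribute, so its behavior could change between releases; the tensor names are purely illustrative.

```python
import torch

a = torch.zeros(3)
versions = [a._version]      # freshly created tensor

a.add_(1)                    # in-place op directly on a
versions.append(a._version)

b = a + 1                    # out-of-place op: a is untouched, b is a new tensor
versions.append(a._version)

v = a[:2]                    # a view shares the version counter of its base
v.mul_(2)                    # in-place op through the view
versions.append(a._version)

print(versions)
```

Printing _version before and after each suspicious statement in the forward pass is usually enough to find the in-place write that the error message complains about.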


RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: PyTorch error

I am trying to run some code in PyTorch but I got stuck at this point:
In the first iteration, both backward operations, for the Discriminator and the Generator, run fine:
....
self.G_loss.backward(retain_graph=True)
self.D_loss.backward()
...
At the second iteration, when self.G_loss.backward(retain_graph=True) executes, I get this error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [8192, 512]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
According to torch.autograd.set_detect_anomaly, the last of the following lines in the Discriminator network is responsible for this:
bottleneck = bottleneck[:-1]
self.embedding = x.view(x.size(0), -1)
self.logit = self.layers[-1](self.embedding)
The strange thing is that I have used that network architecture in other code where it worked properly. Any suggestions?
The full error:
site-packages\torch\autograd\__init__.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [8192, 512]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
Solved by removing the lines that accumulated the loss in place (loss += loss_val).
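Why an in-place accumulation can trigger this error can be shown in isolation. On tensors, += is an in-place operation, so it bumps the version counter of a tensor that autograd may have saved for the backward pass. The sketch below is not the GAN code from the question; the function name and the use of exp() (which saves its output for backward) are assumptions chosen only to reproduce the mechanism.

```python
import torch

def backward_after_update(in_place):
    """Return (version of the saved tensor, whether backward succeeded)."""
    x = torch.ones(3, requires_grad=True)
    y = x.exp()        # exp() keeps its output around for the backward pass
    saved = y          # handle on the tensor that autograd saved
    if in_place:
        y += 1         # in-place: modifies the saved tensor itself
    else:
        y = y + 1      # out-of-place: creates a new tensor, saved one is intact
    version = saved._version
    try:
        y.sum().backward()
        return version, True
    except RuntimeError:
        return version, False

print(backward_after_update(in_place=True))
print(backward_after_update(in_place=False))
```

Rewriting an accumulation as loss = loss + loss_val instead of loss += loss_val is the out-of-place variant of the same update.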

Pytorch Multi-GPU Issue

I want to train my model on 2 GPUs (ids 5 and 6), so I run my code with CUDA_VISIBLE_DEVICES=5,6 train.py. However, when I print torch.cuda.current_device I still get id 0 rather than 5 or 6. But torch.cuda.device_count is 2, which seems right. How can I use GPUs 5 and 6 correctly?
It is most likely correct. PyTorch only sees two GPUs (therefore indexed 0 and 1) which are actually your GPU 5 and 6.
Check the actual usage with nvidia-smi. If it is still inconsistent, you might need to set an environment variable:
export CUDA_DEVICE_ORDER=PCI_BUS_ID
(See Inconsistency of IDs between 'nvidia-smi -L' and cuDeviceGetName())
You can check the device name to verify that it is the GPU you expect. However, I think that by setting CUDA_VISIBLE_DEVICES outside the program, you have forced torch to look only at those 2 GPUs, so torch indexes them as 0 and 1. Because of this, current_device reports 0.
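The renumbering described in both answers can be sketched without a GPU. The helper below is hypothetical (it is not part of PyTorch or CUDA) and only mirrors the rule that the entries of CUDA_VISIBLE_DEVICES are re-indexed from 0 in the order they are listed.

```python
def logical_to_physical(visible_devices):
    """Map the indices a process sees to the entries of CUDA_VISIBLE_DEVICES."""
    entries = [e.strip() for e in visible_devices.split(",") if e.strip()]
    return dict(enumerate(entries))

mapping = logical_to_physical("5,6")
print(mapping)  # logical index 0 corresponds to entry "5", index 1 to entry "6"
```

Inside the training script, cuda:0 and cuda:1 are therefore the right names to use. Whether entry "5" is the same card that nvidia-smi lists as GPU 5 depends on the device ordering, which is what CUDA_DEVICE_ORDER=PCI_BUS_ID is meant to align.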

Issue with keras fit_generator epoch

I'm creating an LSTM model for text generation using Keras. The dataset I'm using (around 25 novels, about 1.4 million words) can't be processed at once (a memory issue when converting my outputs with to_categorical()), so I created a custom generator function to read the data in.
# Data generator for fit and evaluate
def generator(batch_size):
    start = 0
    end = batch_size
    while True:
        x = sequences[start:end,:-1]
        #print(x)
        y = sequences[start:end,-1]
        y = to_categorical(y, num_classes=vocab_size)
        #print(y)
        yield x, y
        if batch_size == len(lines):
            break
        else:
            start += batch_size
            end += batch_size
When I execute the model.fit() method, the following error is thrown after the first epoch finishes training:
UnknownError: [_Derived_] CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1459): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
[[{{node CudnnRNN}}]]
[[sequential/lstm/StatefulPartitionedCall]] [Op:__inference_train_function_25138]
Function call stack:
train_function -> train_function -> train_function
Does anyone know how to solve this issue? Thanks.
From many sources on the Internet, this issue seems to occur when an LSTM layer is used together with a Masking layer while training on a GPU.
The following workarounds may help:
1. If you can compromise on speed, train your model on the CPU rather than the GPU. It works without any error.
2. As per this comment, check whether your input sequences consist entirely of zeros, since the Masking layer may then mask all the inputs.
3. If possible, disable eager execution. As per this comment, it works without any error.
4. Instead of using a Masking layer, try the alternatives mentioned in this link:
a. Add the argument mask_zero=True to the Embedding layer, or
b. Pass a mask argument manually when calling layers that support this argument.
5. As a last resort, remove the Masking layer, if that is possible.
If none of the above workarounds solves your problem, the Google TensorFlow team is working to resolve this error, and we may have to wait until it is fixed.
Hope this information helps. Happy learning!
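Separately from the masking workarounds, one thing that may be worth checking is that the generator in the question never resets start, so once it runs past the end of sequences every further slice is empty. Whether that is what triggers the cuDNN error here is only a guess. A minimal sketch of a generator that wraps around, assuming sequences is a 2-D integer NumPy array and using a NumPy one-hot encoding in place of to_categorical to stay self-contained:

```python
import numpy as np

def batch_generator(sequences, batch_size, vocab_size):
    """Yield (x, y) batches forever, restarting at the top after each pass."""
    n = len(sequences)
    one_hot = np.eye(vocab_size, dtype=np.float32)
    start = 0
    while True:
        end = min(start + batch_size, n)
        x = sequences[start:end, :-1]          # all tokens but the last
        y = one_hot[sequences[start:end, -1]]  # one-hot of the last token
        yield x, y
        start = end if end < n else 0          # wrap around, never yield empty

# Toy data: 4 sequences of length 3 over a vocabulary of 5 tokens.
toy = np.arange(12).reshape(4, 3) % 5
gen = batch_generator(toy, batch_size=3, vocab_size=5)
shapes = [next(gen)[0].shape for _ in range(3)]
print(shapes)
```

With a generator like this, steps_per_epoch would be the number of batches in one pass, so that Keras knows where an epoch ends.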

free(): invalid pointer Aborted (core dumped)

I'm trying to run my Python program. It seems that it should run smoothly; however, I encounter an error that I haven't seen before:
free(): invalid pointer
Aborted (core dumped)
However, I'm not sure how to go about fixing the error, since it doesn't give me much information about the problem itself.
At first I thought it was a problem with the sizes of the tensors in my network, but they are completely fine. I've googled the problem a little and found that it is a problem with allocating memory where I shouldn't, but I don't know how to fix it.
My code is divided into two different files, and I use two libraries, one for the Sinkhorn loss function and one to randomly sample a mesh.
import argparse
import point_cloud_utils as pcu
import time
import numpy as np
import torch
import torch.nn as nn
from fml.nn import SinkhornLoss
import common

def main():
    # x is a tensor of shape [n, 3] containing the positions of the vertices that
    x = torch._C.from_numpy(common.loadpointcloud("sphere.txt"))
    # t is a tensor of shape [n, 3] containing a set of nicely distributed samples in the unit cube
    v, f = common.unit_cube()
    t = torch._C.sample_mesh_lloyd(pcu.lloyd(v,f,x.shape[0]).astype(np.float32)) # sample randomly a point cloud (cube for now?)
    # The model is a simple fully connected network mapping a 3D parameter point to 3D
    phi = common.MLP(in_dim=3, out_dim=3)
    # Eps is 1/lambda and max_iters is the maximum number of Sinkhorn iterations to do
    emd_loss_fun = SinkhornLoss(eps=1e-3, max_iters=20,
                                stop_thresh=1e-3, return_transport_matrix=True)
    mse_loss_fun = torch.nn.MSELoss()
    # Adam optimizer at first
    optimizer = torch.optim.Adam(phi.parameters(), lr= 10e-3)
    fit_start_time = time.time()
    for epoch in range(100):
        optimizer.zero_grad()
        # Do the forward pass of the neural net, evaluating the function at the parametric points
        y = phi(t)
        # Compute the Sinkhorn divergence between the reconstruction*(using the francis library) and the target
        # NOTE: The Sinkhorn function expects a batch of b point sets (i.e. tensors of shape [b, n, 3])
        # since we only have 1, we unsqueeze so x and y have dimension [1, n, 3]
        with torch.no_grad():
            _, P = emd_loss_fun(phi(t).unsqueeze(0), x.unsqueeze(0))
        # Project the transport matrix onto the space of permutation matrices and compute the L-2 loss
        # between the permuted points
        loss = mse_loss_fun(y[P.squeeze().max(0)[1], :], x)
        # loss = mse_loss_fun(P.squeeze() # y, x) # Use the transport matrix directly
        # Take an optimizer step
        loss.backward()
        optimizer.step()
        print("Epoch %d, loss = %f" % (epoch, loss.item()))
    fit_end_time = time.time()
    print("Total time = %f" % (fit_end_time - fit_start_time))
    # Plot the ground truth, reconstructed points, and a mesh representing the fitted function, phi
    common.visualitation(x,t,phi)

if __name__ == "__main__":
    main()
The error message is:
free(): invalid pointer
Aborted (core dumped)
That again doesn't help me that much. I'll appreciate it a lot if someone has any idea what is happening or if you know more about this error.
Edit: The cause is actually known. The recommended solution is to build both packages from source.
There is a known issue with importing both open3d and PyTorch. The cause is unknown. https://github.com/pytorch/pytorch/issues/19739
A few possible workarounds exist:
(1) Some people have found that changing the order in which you import the two packages can resolve the issue, though in my personal testing both ways crash.
(2) Other people have found compiling both packages from source to help.
(3) Still others have found that moving open3d and PyTorch to be called from separate scripts resolves the issue.
Note for future readers: This bug was filed as issue #21018.
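Workaround (3), keeping the two packages in separate scripts, can be sketched with a child interpreter. Everything in the child program below is a hypothetical stand-in: in practice it would import the conflicting package and do the mesh sampling, whereas here it only emits a couple of points as JSON so the sketch runs anywhere.

```python
import json
import subprocess
import sys

# Hypothetical child program; the real one would import the conflicting
# package and compute the samples instead of hard-coding them.
CHILD_CODE = """
import json
points = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
print(json.dumps(points))
"""

def run_isolated(code):
    """Run code in a fresh interpreter and parse its stdout as JSON."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)

points = run_isolated(CHILD_CODE)
print(points)
```

The parent process can then import PyTorch and turn points into a tensor without both native libraries ever being loaded into the same process.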
This is not a problem in your Python code. It is a bug in PyTorch (probably) or in Python itself (unlikely, but possible).
free(3) is a C function that releases dynamically allocated memory when it is no longer needed. You cannot (easily) call it from Python, because memory management is a low-level implementation detail normally handled by the Python interpreter. However, you are also using PyTorch, which is written in C++ and C, and does have the ability to directly allocate and free memory.
In this case, some C code has tried to release a block of memory, but the block of memory it tried to release was not dynamically allocated in the first place, which is an error. You should report this behavior to the PyTorch developers. Include as much detail as possible, including the shortest code you can find that reproduces the problem, and the complete output of that program.

SVD calculation error from lapack function when using scikit-learn's Linear Discriminant Analysis class

I'm classifying 2-class, 1-D data using scikit-learn's LDA classifier in a machine learning pipeline I've created. The following exception occurred:
ValueError: Internal work array size computation failed: -10
at the following line:
LinearDiscriminantAnalysis.fit(X,y)
where X = [-5e15, -5e15, -5e15, 5.7e16] and y = [0, 0, 0, 1], both float64 data-type
Additionally the following error was printed to console:
Intel MKL ERROR: Parameter 10 was incorrect on entry to DGESDD
After a quick Google search, dgesdd is a function in LAPACK which scikit-learn relies upon. The dgesdd documentation tells us that the function computes the singular value decomposition (SVD) of a real M-by-N matrix A.
Going back to the original exception, I found it was raised in scipy.linalg.lapack.py at the _compute_lwork function. This function takes as input a function, which in this case I believe is the dgesdd function. CTRL-F "-10" on the dgesdd documentation page gives the logic behind this error code, but I don't know Fortran so I'm not exactly sure what it means.
My guess is that the SVD calculation is failing due to either (1) the large values in the X array, or (2) the fact that 3 of the values in the X array are exactly the same number.
I will keep reading into SVD and its limitations. Any insight on how to avoid this error would be tremendously appreciated.
Here is a screenshot of the error
This is the definition of DGESDD:
subroutine dgesdd (JOBZ, M, N, A, LDA, S, U, LDU, VT, LDVT, WORK, LWORK, IWORK, INFO)
The error indicates that the value passed to MKL's implementation of the routine for the 10th parameter, LDVT (the leading dimension of the V**T matrix), does not comply with the expectations of that routine.
This could be a bug in Intel's implementation, which is rather unlikely, assuming a battery of tests stress-tests these routines, but not impossible. Which version of MKL is this? Or it is a bug in the LDA code, which is more likely:
LDVT is INTEGER
The leading dimension of the array VT. LDVT >= 1;
if JOBZ = 'A' or JOBZ = 'O' and M >= N, LDVT >= N;
if JOBZ = 'S', LDVT >= min(M,N).
Would you please print M, N, LDA, LDU and LDVT?
If you set LDVT properly the workspace analysis will run just fine.
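The LDVT constraints quoted above can be written out as a small check. This is only a transcription of that excerpt into Python; min_ldvt is a hypothetical helper, not a SciPy or LAPACK function, and JOBZ values not covered by the excerpt fall back to the general bound LDVT >= 1.

```python
def min_ldvt(jobz, m, n):
    """Smallest LDVT allowed by the quoted DGESDD documentation."""
    if jobz == "A" or (jobz == "O" and m >= n):
        return max(1, n)
    if jobz == "S":
        return max(1, min(m, n))
    return 1

# The data in the question has 4 samples and 1 feature, so M = 4 and N = 1.
for jobz in ("A", "S", "O"):
    print(jobz, min_ldvt(jobz, 4, 1))
```

Comparing the printed M, N and LDVT values against this bound would show whether the caller or the library is at fault.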
Regarding the "Intel MKL ERROR: Parameter 10 was incorrect on entry to DGESDD" problem: it was fixed in MKL 2018 Update 4 (September 2018). The MKL 2018 bug fix list has the details.
An easy way to check which version of MKL you are using is to set the environment variable MKL_VERBOSE=1 and look at the output, which contains this kind of information.
For example:
MKL_VERBOSE Intel(R) MKL 2019.0 Update 2 Product build 20190118 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions (Intel(R) AVX) enabled processors, Lnx 2.80GHz lp64 intel_thread
MKL_VERBOSE ZGETRF(85,85,0x13e66f0,85,0x13e1080,0) 6.18ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20
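From Python, the variable can also be set before the numerical libraries are imported. The sketch below assumes a NumPy build linked against MKL; with any other BLAS/LAPACK backend it runs but prints no MKL_VERBOSE lines.

```python
import os

# Must be set before NumPy (and therefore MKL) is loaded into the process.
os.environ["MKL_VERBOSE"] = "1"

import numpy as np

# Any LAPACK-backed call will do; with MKL, a version banner like the one
# shown above is printed alongside the per-call trace.
singular_values = np.linalg.svd(np.eye(2), compute_uv=False)
print(singular_values)
```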
