Bug when using TensorFlow-GPU + Python multiprocessing? - python

I have noticed a strange behavior when I use TensorFlow-GPU + Python multiprocessing.
I have implemented a DCGAN with some customizations and my own dataset. Since I am conditioning the DCGAN to certain features, I have training data and also test data for evaluation.
Due to the size of my datasets, I have written data loaders that run concurrently and preload into a queue using Python's multiprocessing.
The structure of the code roughly looks like this:
class ConcurrentLoader:
    def __init__(self, dataset):
        ...

class DCGAN:
    ...

net = DCGAN()
training_data = ConcurrentLoader(path_to_training_data)
test_data = ConcurrentLoader(path_to_test_data)
This code runs fine on TensorFlow-CPU and on TensorFlow-GPU <= 1.3.0 using CUDA 8.0, but when I run the exact same code with TensorFlow-GPU 1.4.1 and CUDA 9 (latest releases of TF & CUDA as of Dec 2017) it crashes:
2017-12-20 01:15:39.524761: E tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2017-12-20 01:15:39.527795: E tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2017-12-20 01:15:39.529548: E tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2017-12-20 01:15:39.535341: E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-12-20 01:15:39.535383: E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-12-20 01:15:39.535397: F tensorflow/core/kernels/conv_ops.cc:667] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)
[1] 32299 abort (core dumped) python dcgan.py --mode train --save_path ~/tf_run_dir/test --epochs 1
What really confuses me is that if I just remove test_data the error does not occur. Thus, for some strange reason, TensorFlow-GPU 1.4.1 & CUDA 9 work with just a single ConcurrentLoader, but crash when multiple loaders are initialized.
Even more interesting is that (after the exception) I have to manually shut down the python processes, because the GPU's VRAM, the system's RAM and even the python processes stay alive after the script crashes.
Furthermore, it must have some weird connection to Python's multiprocessing module, because when I implement the same model in Keras (using the TF backend!) the code also runs just fine, with 2 concurrent loaders. I guess Keras somehow creates a layer of abstraction in between that keeps TF from crashing.
Where could I possibly have screwed up with the multiprocessing module that it causes crashes like this one?
These are the parts of the code that use multiprocessing inside the ConcurrentLoader:
def __init__(self, dataset):
    ...
    self._q = mp.Queue(64)
    self._file_cycler = cycle(img_files)
    self._worker = mp.Process(target=self._worker_func, daemon=True)
    self._worker.start()

def _worker_func(self):
    while True:
        ...  # gets next filepaths from self._file_cycler
        buffer = list()
        for im_path in paths:
            ...  # uses OpenCV to load each image & puts it into the buffer
        self._q.put(np.array(buffer).astype(np.float32))
...and this is it.
Where have I written "unstable" or "non-pythonic" multiprocessing code? I thought daemon=True should ensure that every process gets killed as soon as the main process dies? Unfortunately, that is not what happens with this specific error.
Did I misuse the default multiprocessing.Process or multiprocessing.Queue here? I thought simply writing a class that stores batches of images inside a Queue and makes them accessible through methods / instance variables should be just fine.
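For what it's worth, one thing I could still try (based on the "To use CUDA with multiprocessing, you must use the 'spawn' start method" errors people report for CUDA + multiprocessing) is to build the loader from a 'spawn' context with a module-level worker function, so the child never inherits the parent's CUDA state. A rough sketch of what I mean (the image loading is stubbed out, and the loader would have to be created under an if __name__ == '__main__': guard):
import multiprocessing as mp
import numpy as np

def _worker_func(q, img_files):
    # runs in a freshly spawned interpreter, so it inherits no CUDA state
    while True:
        for im_path in img_files:
            # load the image here (e.g. with OpenCV); a random array stands in for it
            img = np.random.rand(64, 64, 3).astype(np.float32)
            q.put(img)

class ConcurrentLoader:
    def __init__(self, img_files):
        ctx = mp.get_context('spawn')  # do not fork a CUDA-initialized parent
        self._q = ctx.Queue(64)
        self._worker = ctx.Process(target=_worker_func,
                                   args=(self._q, img_files),
                                   daemon=True)
        self._worker.start()

    def next_batch(self):
        return self._q.get()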

I am getting the same error when trying to use tensorflow and multiprocessing
E tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
but in a different environment: tf 1.4 + cuda 8.0 + cudnn 6.0.
matrixMulCUBLAS in the CUDA sample codes works fine.
I am wondering about the correct solution too!
And the reference failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED on an AWS p2.xlarge instance did not work for me.

Related

TensorRT Error:[context.cpp::setStream::121] Error Code 1: Cudnn (CUDNN_STATUS_MAPPING_ERROR)

When I use the TensorRT inference code officially provided by NVIDIA:
# This function is generalized for multiple inputs/outputs.
# inputs and outputs are expected to be lists of HostDeviceMem objects.
def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]
Every time I run this line,
context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
I get the error message
[TRT] [E] 1: [context.cpp::setStream::121] Error Code 1: Cudnn (CUDNN_STATUS_MAPPING_ERROR)
Do you have this error when you start running?
tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
If so, downgrade your TensorFlow from 2.10 to 2.9.2.
See more at: https://github.com/google-research/multinerf/issues/47
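For reference, pinning that version in a pip-managed environment would look like:
pip install tensorflow==2.9.2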
Hope it helps!

How does torch.cuda.synchronize() behave?

According to the PyTorch documentation torch.cuda.synchronize "Waits for all kernels in all streams on a CUDA device to complete.". Questions:
Should this say "Waits for all kernels in all streams initiated by this Python session on a CUDA device to complete"? In other words, if Python session A is running CUDA operations, and I call torch.cuda.synchronize() in Python session B, that won't care about what's happening in Python session A right?
Surely if we don't call torch.cuda.synchronize(), but try to work with any python code referencing the tensors in the computation graph, then it's like implicitly calling it right?
Q2 in code:
output = model(inputs) # cuda starts working here
a = 1 + 1 # cuda might still be running the previous line. This line can run at the same time
other_model(output) # This implicitly does the same thing as torch.cuda.synchronize() then does a forward pass of other_model
b = a + a # This line can't happen until cuda is done and the previous line has been executed
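For context, the place where this matters most to me is timing-style code, roughly like this sketch (the variable names are just illustrative):
import time
import torch

device = torch.device('cuda')
x = torch.randn(4096, 4096, device=device)

start = time.time()
y = x @ x                     # kernel is queued; this call returns almost immediately
launch = time.time() - start  # measures only the launch overhead

torch.cuda.synchronize()      # block until all queued kernels on this device finish
total = time.time() - start   # now includes the actual GPU compute time

print(f'launch: {launch:.4f}s, total: {total:.4f}s')
# Anything that needs the values on the CPU (e.g. y.sum().item()) synchronizes implicitly.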

CUDA Illegal Memory Access error when using torch.cat

I was playing around with pytorch concatenate and wanted to see if I could use an output tensor that had a different device to the input tensors, here is the code:
import torch
a = torch.ones(4)
b = torch.ones(4)
c = torch.zeros(8).cuda()
print(c)
ab = torch.cat([a,b], out=c)
print(c)
I am running this inside a jupyter notebook. pytorch version: 1.7.1
I get the following error:
...
\Anaconda3\envs\...\lib\site-packages\torch\_tensor_str.py in __init__(self, tensor)
87
88 else:
---> 89 nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
90
91 if nonzero_finite_vals.numel() == 0:
RuntimeError: CUDA error: an illegal memory access was encountered
It happens if you try to access the tensor c (in this case with a print).
I couldn't find anything in the documentation that said I can't do this, other than perhaps this line:
" ... any python sequence of tensors of the same type ... "
The error is kind of curious though... any ideas?
It appears that the behavior changes according to the version of pytorch. With version 1.3.0 I get the error expected object of backend CUDA but got CPU, but with version 1.5.0 I do indeed get the same error as you do. This would probably be worth mentioning on their github, because I believe the former error is more useful than the latter.
Anyway, both errors come from the fact that you concatenate CPU tensors into a GPU one. You can solve it very easily:
# Move the tensors to the GPU prior to concatenating
ab = torch.cat([a.cuda(),b.cuda()], out=c)
or
# Move the tensor after concatenating
c.copy_(torch.cat([a,b]).cuda())
I don't have a notebook, but I believe you will have to restart your kernel; the error you get seems to break it really badly. My python shell just cannot compute anything anymore after getting the illegal memory access.
I faced a similar issue and reproduced the error as above with minor differences:
# 080521 debug RuntimeError: CUDA error: an illegal memory access was encountered
# https://stackoverflow.com/questions/66985008/cuda-illegal-memory-access-error-when-using-torch-cat
import torch
a = torch.ones(4)
b = torch.ones(4)
c = torch.zeros(8).cuda()
print(c)
ab = torch.cat([a,b], out=c) # throws error below:
print(c)
# RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
# (when checking arugment for argument tensors in method wrapper__cat_out_out)
# i.e. 'expected object of backend CUDA but got CPU'
Applying the logic from Using CUDA with pytorch? (setting the default tensor type to cuda) solved the error:
import torch
torch.set_default_tensor_type('torch.cuda.FloatTensor')
a = torch.ones(4)
b = torch.ones(4)
c = torch.zeros(8).cuda()
print(c)
ab = torch.cat([a,b], out=c)
print(c)

Simple PytorchBenchmark from transformers lib of Huggingface script gives CUDA initialisation error

I am trying to run a simple benchmark script from the transformers lib from Huggingface, but it fails due to a CUDA error, which leads to another error:
1 / 1
Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Traceback (most recent call last):
File "/home/cbarkhof/code-thesis/Experimentation/Benchmarking/benchmarking-models-claartje.py", line 12, in <module>
benchmark.run()
File "/home/cbarkhof/.local/lib/python3.6/site-packages/transformers/benchmark/benchmark_utils.py", line 674, in run
memory, inference_summary = self.inference_memory(model_name, batch_size, sequence_length)
ValueError: too many values to unpack (expected 2)
The script simply follows the example as shown on this page:
from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

benchmark_args = PyTorchBenchmarkArguments(models=["bert-base-uncased"],
                                           batch_sizes=[8],
                                           sequence_lengths=[8, 32, 128, 512],
                                           save_to_csv=True,
                                           log_filename='log',
                                           env_info_csv_file='env_info')
benchmark = PyTorchBenchmark(benchmark_args)
benchmark.run()
If anyone can point me to why this might be happening. Please let me know :). Cheers!
This error is due to the CUDA runtime not supporting process forking, which is the default multiprocessing start method in PyTorch. The suggested workaround is to call torch.multiprocessing.set_start_method('spawn') before importing transformers, but in my experiments this raises an AttributeError: Can't pickle local object error.
If you're happy with the script running as a single process when using GPUs, you can pass the multi_process=False arg, or the --no_multi_process flag if using the command-line script. Alternatively, if running this on CPU only without CUDA, you should be able to make use of multiple cores; passing the cuda=False arg or --no_cuda flag should also work.
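For example, adapting the script from the question (only the relevant arguments shown; multi_process is the flag mentioned above):
from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

benchmark_args = PyTorchBenchmarkArguments(models=["bert-base-uncased"],
                                           batch_sizes=[8],
                                           sequence_lengths=[8, 32, 128, 512],
                                           multi_process=False)  # run in the main process; CUDA is never re-initialized in a fork
benchmark = PyTorchBenchmark(benchmark_args)
benchmark.run()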

How do I check if PyTorch is using the GPU?

How do I check if PyTorch is using the GPU? The nvidia-smi command can detect GPU activity, but I want to check it directly from inside a Python script.
These functions should help:
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.current_device()
0
>>> torch.cuda.device(0)
<torch.cuda.device at 0x7efce0b03be0>
>>> torch.cuda.get_device_name(0)
'GeForce GTX 950M'
This tells us:
CUDA is available and can be used by one device.
Device 0 refers to the GPU GeForce GTX 950M, and it is currently chosen by PyTorch.
As it hasn't been proposed here, I'm adding a method using torch.device, as this is quite handy, also when initializing tensors on the correct device.
# setting device on GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

# Additional info when using cuda
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3, 1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3, 1), 'GB')
Edit: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved. So use memory_cached for older versions.
Output:
Using device: cuda
Tesla K80
Memory Usage:
Allocated: 0.3 GB
Cached: 0.6 GB
As mentioned above, using device makes it possible to:
Move tensors to the respective device:
torch.rand(10).to(device)
Create a tensor directly on the device:
torch.rand(10, device=device)
This makes switching between CPU and GPU comfortable without changing the actual code.
Edit:
As there have been some questions and confusion about the cached and allocated memory, I'm adding some additional information about it:
torch.cuda.max_memory_cached(device=None) returns the maximum GPU memory managed by the caching allocator in bytes for a given device.
torch.cuda.memory_allocated(device=None) returns the current GPU memory usage by tensors in bytes for a given device.
You can either directly hand over a device as specified further above in the post, or you can leave it None and it will use the current_device().
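For example, on the current device (using the newer *_reserved name mentioned in the edit above):
import torch

if torch.cuda.is_available():
    # current allocation by tensors vs. peak memory held by the caching allocator
    print('allocated:    ', round(torch.cuda.memory_allocated() / 1024**3, 2), 'GB')
    print('peak reserved:', round(torch.cuda.max_memory_reserved() / 1024**3, 2), 'GB')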
Additional note: Old graphics cards with CUDA compute capability 3.0 or lower may be visible but cannot be used by PyTorch! Thanks to hekimgil for pointing this out! - "Found GPU0 GeForce GT 750M which is of cuda capability 3.0. PyTorch no longer supports this GPU because it is too old. The minimum cuda capability that we support is 3.5."
After you start running the training loop, if you want to watch from the terminal whether your program is utilizing GPU resources, and to what extent, you can simply use watch, as in:
$ watch -n 2 nvidia-smi
This will continuously update the usage stats every 2 seconds until you press Ctrl+C.
If you need more control over which GPU stats are shown, you can use a more sophisticated version of nvidia-smi with --query-gpu=.... Below is a simple illustration of this:
$ watch -n 3 nvidia-smi --query-gpu=index,gpu_name,memory.total,memory.used,memory.free,temperature.gpu,pstate,utilization.gpu,utilization.memory --format=csv
which outputs the stats as a CSV table.
Note: There should not be any space between the comma-separated query names in --query-gpu=.... Otherwise those values will be ignored and no stats are returned.
Also, you can check whether your installation of PyTorch detects your CUDA installation correctly by doing:
In [13]: import torch
In [14]: torch.cuda.is_available()
Out[14]: True
A True status means that PyTorch is configured correctly and can use the GPU, although you still have to move/place the tensors onto it with the necessary statements in your code.
If you want to do this inside Python code, then look into this module:
https://github.com/jonsafari/nvidia-ml-py or in pypi here: https://pypi.python.org/pypi/nvidia-ml-py/
From a practical standpoint, just one minor digression:
import torch
dev = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
This dev now knows whether it is cuda or cpu.
And there is a difference in how you deal with models and with tensors when moving to cuda. It is a bit strange at first.
import torch
import torch.nn as nn
dev = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
t1 = torch.randn(1,2)
t2 = torch.randn(1,2).to(dev)
print(t1) # tensor([[-0.2678, 1.9252]])
print(t2) # tensor([[ 0.5117, -3.6247]], device='cuda:0')
t1.to(dev)
print(t1) # tensor([[-0.2678, 1.9252]])
print(t1.is_cuda) # False
t1 = t1.to(dev)
print(t1) # tensor([[-0.2678, 1.9252]], device='cuda:0')
print(t1.is_cuda) # True
class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(1,2)

    def forward(self, x):
        x = self.l1(x)
        return x
model = M() # not on cuda
model.to(dev) # is on cuda (all parameters)
print(next(model.parameters()).is_cuda) # True
This is all a bit tricky, but understanding it once helps you deal with it quickly and with less debugging.
From the official site's get started page, you can check if the GPU is available for PyTorch like so:
import torch
torch.cuda.is_available()
Reference: PyTorch | Get Started
Query                                    Command
Does PyTorch see any GPUs?               torch.cuda.is_available()
Are tensors stored on GPU by default?    torch.rand(10).device
Set default tensor type to CUDA:         torch.set_default_tensor_type(torch.cuda.FloatTensor)
Is this tensor a GPU tensor?             my_tensor.is_cuda
Is this model stored on the GPU?         all(p.is_cuda for p in my_model.parameters())
To check if there is a GPU available:
torch.cuda.is_available()
If the above function returns False,
you either have no GPU,
or the Nvidia drivers have not been installed, so the OS does not see the GPU,
or the GPU is being hidden by the environment variable CUDA_VISIBLE_DEVICES. When the value of CUDA_VISIBLE_DEVICES is -1, all of your devices are being hidden. You can check that value in code with this line: os.environ['CUDA_VISIBLE_DEVICES']
If the above function returns True that does not necessarily mean that you are using the GPU. In Pytorch you can allocate tensors to devices when you create them. By default, tensors get allocated to the cpu. To check where your tensor is allocated do:
# assuming that 'a' is a tensor created somewhere else
a.device # returns the device where the tensor is allocated
Note that you cannot operate on tensors allocated in different devices. To see how to allocate a tensor to the GPU, see here: https://pytorch.org/docs/stable/notes/cuda.html
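A short sketch putting those checks together (only the standard os / torch calls mentioned above):
import os
import torch

# See whether devices are hidden via the environment variable mentioned above
print('CUDA_VISIBLE_DEVICES =', os.environ.get('CUDA_VISIBLE_DEVICES'))

a = torch.zeros(4)
print(a.device)           # cpu -- the default allocation
if torch.cuda.is_available():
    a = a.to('cuda')      # explicitly allocate on the GPU
    print(a.device)       # cuda:0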
Almost all answers here reference torch.cuda.is_available(). However, that is only one side of the coin. It tells you whether the GPU (actually CUDA) is available, not whether it is actually being used. In a typical setup, you would set your device with something like this:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
but in larger environments (e.g. research) it is also common to give the user more options, so based on input they can disable CUDA, specify CUDA IDs, and so on. In such case, whether or not the GPU is used is not only based on whether it is available or not. After the device has been set to a torch device, you can get its type property to verify whether it's CUDA or not.
if device.type == 'cuda':
    # do something
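A minimal sketch of such a setup (the flag names here are just an illustration, not a convention):
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument('--no-cuda', action='store_true', help='disable CUDA even if available')
parser.add_argument('--gpu-id', type=int, default=0, help='CUDA device index to use')
args = parser.parse_args()

use_cuda = torch.cuda.is_available() and not args.no_cuda
device = torch.device(f'cuda:{args.gpu_id}') if use_cuda else torch.device('cpu')

if device.type == 'cuda':
    print('Using', torch.cuda.get_device_name(device))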
Simply run the following command from a command prompt or Linux environment.
python -c 'import torch; print(torch.cuda.is_available())'
The above should print True
python -c 'import torch; print(torch.rand(2,3).cuda())'
This one should print the following:
tensor([[0.7997, 0.6170, 0.7042], [0.4174, 0.1494, 0.0516]], device='cuda:0')
If you are here because your pytorch always gives False for torch.cuda.is_available(), that's probably because you installed your pytorch version without GPU support. (E.g.: you coded on a laptop, then are testing on a server.)
The solution is to uninstall and install pytorch again with the right command from pytorch downloads page. Also refer this pytorch issue.
It is possible for
torch.cuda.is_available()
to return True but to get the following error when running
>>> torch.rand(10).to(device)
as suggested by MBT:
RuntimeError: CUDA error: no kernel image is available for execution on the device
This link explains that
... torch.cuda.is_available only checks whether your driver is compatible with the version of cuda we used in the binary. So it means that CUDA 10.1 is compatible with your driver. But when you do computation with CUDA, it couldn't find the code for your arch.
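To inspect that mismatch from Python, something like this works on recent PyTorch versions (the example outputs are illustrative):
import torch

print(torch.version.cuda)                   # CUDA version the binary was built against
print(torch.cuda.get_arch_list())           # architectures the binary ships kernels for, e.g. ['sm_37', 'sm_50', ...]
print(torch.cuda.get_device_capability(0))  # compute capability of GPU 0, e.g. (3, 0)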
If you are using Linux, I suggest installing nvtop:
https://github.com/Syllo/nvtop
It gives you a live, htop-style overview of GPU utilization and memory usage.
For a MacBook M1 system:
import torch
print(torch.backends.mps.is_available(), torch.backends.mps.is_built())
And both should be True.
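To actually place a tensor on the Apple-silicon GPU, something like this should work on a recent PyTorch build with MPS support:
import torch

device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
x = torch.ones(3, device=device)
print(x.device)   # mps:0 on an M1 machine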
Create a tensor on the GPU as follows:
$ python
>>> import torch
>>> print(torch.rand(3,3).cuda())
Do not quit; open another terminal and check whether the python process is using the GPU with:
$ nvidia-smi
Using the code below
import torch
torch.cuda.is_available()
will only show whether the GPU is present and detected by pytorch or not.
But in "Task Manager --> Performance" the GPU utilization may be only a few percent,
which means you are actually running on the CPU.
To solve the above issue, check and change:
Graphics settings --> turn on Hardware-accelerated GPU scheduling, then restart.
Open the NVIDIA control panel --> Desktop --> Display GPU in the notification area.
[Note: If you have newly installed Windows then you also have to agree to the terms and conditions in the NVIDIA control panel]
This should work!
# Step 1: import the torch library
import torch

# Step 2: create a tensor
tensor = torch.tensor([5, 6])

# Step 3: find the device type
# Output 1: below we get the size (tensor.shape), the dimension (tensor.ndim),
# and the device on which the tensor is processed
tensor, tensor.device, tensor.ndim, tensor.shape
(tensor([5, 6]), device(type='cpu'), 1, torch.Size([2]))

# or
# Output 2: below we get only the device type
tensor.device
device(type='cpu')
# As my system uses a CPU processor: "11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz 2.42 GHz"

# Find out whether the tensor was processed on the GPU
print(tensor, torch.cuda.is_available())
# the output will be
tensor([5, 6]) False
# The above output is False, hence it is not on the GPU.
# Happy coding :)
