In some example code I see this:
self.models["pose_encoder"] = \
    networks.ResnetEncoder(18, self.opt.weights_init == "pretrained",
                           num_input_images=self.num_pose_frames)
self.models["pose_encoder"].to("cuda:4")
with ResnetEncoder defined by
class ResnetEncoder(nn.Module):
    """Pytorch module for a resnet encoder
    """
    def __init__(self, num_layers, pretrained, num_input_images=1, **kwargs):
        super(ResnetEncoder, self).__init__()
I am confused about what happens to the module when the .to("cuda:4") call runs. Do all the tensors defined in the module move to cuda:4?
What causes my error now is that I have a member function in the module, and in that function I define a tensor:
def A(self):
    self.mytensor = []
    # after some operation this is not empty any more
    self.mytensor.cuda()
and this error occurs:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:4 and cuda:0!
I know the cuda:0 is caused by the .cuda() call in the last line of code. But I don't know of any way to move self.mytensor to cuda:4. I can pass the device as a parameter to the module's constructor, but I guess there is a better way to do this. I also want the device to be able to change at runtime, so I don't want to use os.environ either. Is there any way to do this?
According to the documentation, .to(device) is a torch.Tensor method used to move a tensor to a different device.
The device can be specified as cuda:<gpu_index> or cpu; .to('cuda:4') will move the tensor to GPU number 4 (you can check the number of GPUs with the nvidia-smi command in a terminal).
Any operation between two tensors requires that they live on the same device. The error is raised because your self.mytensor is on GPU 0; you can move it to GPU 4 with self.mytensor = self.mytensor.to('cuda:4') (note that .to() returns a new tensor rather than moving it in place, so you need the reassignment).
Note: calling .cuda() with no argument is equivalent to .to('cuda:0') (the current default CUDA device). You can specify the device with self.mytensor.cuda(4).
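To avoid hard-coding the device, two common patterns may help. This is only a sketch, and the class and variable names (MyModule, mybuffer, new_values) are made up for illustration: either register the tensor as a buffer so that module.to("cuda:4") moves it along with the parameters, or look up the device of the module's parameters at runtime and move the new tensor there.
import torch
import torch.nn as nn

class MyModule(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(4, 4)
        # Option 1: register as a buffer, so .to(...)/.cuda(...) on the
        # module also moves this tensor together with the parameters.
        self.register_buffer("mybuffer", torch.zeros(4))

    def A(self):
        # Option 2: query whatever device the module currently lives on
        # (assumes the module has at least one parameter) and move the
        # freshly created tensor there at runtime.
        device = next(self.parameters()).device
        new_values = torch.ones(4)          # stands in for "some operation"
        self.mytensor = new_values.to(device)
With Option 1 the buffer automatically follows model.to('cuda:4'); with Option 2 self.mytensor ends up on the same device as the weights no matter where the module was moved.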
I was playing around with PyTorch's concatenate (torch.cat) and wanted to see if I could use an output tensor that is on a different device than the input tensors. Here is the code:
import torch
a = torch.ones(4)
b = torch.ones(4)
c = torch.zeros(8).cuda()
print(c)
ab = torch.cat([a,b], out=c)
print(c)
I am running this inside a Jupyter notebook. PyTorch version: 1.7.1.
I get the following error:
...
\Anaconda3\envs\...\lib\site-packages\torch\_tensor_str.py in __init__(self, tensor)
87
88 else:
---> 89 nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
90
91 if nonzero_finite_vals.numel() == 0:
RuntimeError: CUDA error: an illegal memory access was encountered
It happens when you try to access the tensor c (in this case with a print).
I couldn't find anything in the documentation that says I can't do this, other than perhaps this line:
" ... any python sequence of tensors of the same type ... "
The error is kind of curious though... any ideas?
It appears that the behavior changes with the version of PyTorch. With version 1.3.0 I get the error expected object of backend CUDA but got CPU, but with version 1.5.0 I do indeed get the same error as you do. This would probably be worth mentioning on their GitHub, because I believe the former error is more useful than the latter.
Anyway, both errors come from the fact that you concatenate CPU tensors into a GPU one. You can solve it very easily:
# Move the tensors to the GPU prior to concatenating
ab = torch.cat([a.cuda(),b.cuda()], out=c)
or
# Move the tensor after concatenating
c.copy_(torch.cat([a,b]).cuda())
I don't have a notebook, but I believe you will have to restart your kernel; the error you get seems to break it quite badly. My Python shell cannot compute anything anymore after the illegal memory access.
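For completeness, a third variant (just a sketch): skip out= entirely, create the inputs on the GPU, and let torch.cat allocate the result on the same device as the inputs.
import torch

a = torch.ones(4, device='cuda')
b = torch.ones(4, device='cuda')
c = torch.cat([a, b])    # result is allocated on the same GPU as the inputs
print(c.device)          # cuda:0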
I faced a similar issue and reproduced the error as above with minor differences:
# 080521 debug RuntimeError: CUDA error: an illegal memory access was encountered
# https://stackoverflow.com/questions/66985008/cuda-illegal-memory-access-error-when-using-torch-cat
import torch
a = torch.ones(4)
b = torch.ones(4)
c = torch.zeros(8).cuda()
print(c)
ab = torch.cat([a,b], out=c) # throws error below:
print(c)
# RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
# (when checking arugment for argument tensors in method wrapper__cat_out_out)
# i.e. 'expected object of backend CUDA but got CPU'
Applying the logic from Using CUDA with pytorch? (setting the default tensor type to CUDA) solved the error:
import torch
torch.set_default_tensor_type('torch.cuda.FloatTensor')
a = torch.ones(4)
b = torch.ones(4)
c = torch.zeros(8).cuda()
print(c)
ab = torch.cat([a,b], out=c)
print(c)
I have noticed a strange behavior when I use TensorFlow-GPU + Python multiprocessing.
I have implemented a DCGAN with some customizations and my own dataset. Since I am conditioning the DCGAN to certain features, I have training data and also test data for evaluation.
Due to the size of my datasets, I have written data loaders that run concurrently and preload into a queue using Python's multiprocessing.
The structure of the code roughly looks like this:
class ConcurrentLoader:
    def __init__(self, dataset):
        ...

class DCGAN:
    ...

net = DCGAN()
training_data = ConcurrentLoader(path_to_training_data)
test_data = ConcurrentLoader(path_to_test_data)
This code runs fine on TensorFlow-CPU and on TensorFlow-GPU <= 1.3.0 using CUDA 8.0, but when I run the exact same code with TensorFlow-GPU 1.4.1 and CUDA 9 (latest releases of TF & CUDA as of Dec 2017) it crashes:
2017-12-20 01:15:39.524761: E tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2017-12-20 01:15:39.527795: E tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2017-12-20 01:15:39.529548: E tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2017-12-20 01:15:39.535341: E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-12-20 01:15:39.535383: E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-12-20 01:15:39.535397: F tensorflow/core/kernels/conv_ops.cc:667] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)
[1] 32299 abort (core dumped) python dcgan.py --mode train --save_path ~/tf_run_dir/test --epochs 1
What really confuses me is that if I just remove test_data the error does not occur. Thus, for some strange reason, TensorFlow-GPU 1.4.1 & CUDA 9 work with just a single ConcurrentLoader, but crash when multiple loaders are initialized.
Even more interesting is that (after the exception) I have to shut down the Python processes manually, because the GPU's VRAM and the system's RAM stay allocated and the Python processes stay alive after the script crashes.
Furthermore, it has to have some weird connection to Python's multiprocessing module, because when I implement the same model in Keras (using the TF backend!) the code runs just fine with 2 concurrent loaders. I guess Keras is somehow creating a layer of abstraction in between that keeps TF from crashing.
Where could I possibly have screwed up with the multiprocessing module that it causes crashes like this one?
These are the parts of the code that use multiprocessing inside the ConcurrentLoader:
def __init__(self, dataset):
    ...
    self._q = mp.Queue(64)
    self._file_cycler = cycle(img_files)
    self._worker = mp.Process(target=self._worker_func, daemon=True)
    self._worker.start()

def _worker_func(self):
    while True:
        ...  # gets next filepaths from self._file_cycler
        buffer = list()
        for im_path in paths:
            ...  # uses OpenCV to load each image & puts it into the buffer
        self._q.put(np.array(buffer).astype(np.float32))
...and this is it.
Where have I written "unstable" or "non-pythonic" multiprocessing code? I thought daemon=True should ensure that every process gets killed as soon as the main process dies? Unfortunately, this is not the case for this specific error.
Did I misuse the default multiprocessing.Process or multiprocessing.Queue here? I thought simply writing a class where I store batches of images inside a Queue and make it accessible through methods / instance variables should be just fine.
I am getting the same error when trying to use TensorFlow with multiprocessing:
E tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
but in a different environment: TF 1.4 + CUDA 8.0 + cuDNN 6.0.
matrixMulCUBLAS from the CUDA sample codes works fine.
I am wondering about the correct solution too!
The reference failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED on an AWS p2.xlarge instance did not work for me either.
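Not an answer confirmed in this thread, only a commonly suggested workaround, under the assumption that the crash comes from forking a process that has already initialized CUDA: start worker processes with the 'spawn' start method before any TensorFlow/CUDA work happens in the parent. A minimal sketch:
# Sketch: use the 'spawn' start method so workers do not inherit CUDA state.
# Assumes all TensorFlow/CUDA initialization happens after the workers start.
import multiprocessing as mp
import numpy as np

def worker_func(q):
    # load/preprocess images here and push batches into the queue
    q.put(np.zeros((4, 64, 64, 3), dtype=np.float32))

if __name__ == "__main__":
    mp.set_start_method("spawn")   # must be called before creating any Process
    q = mp.Queue(64)
    worker = mp.Process(target=worker_func, args=(q,), daemon=True)
    worker.start()
    batch = q.get()
    print(batch.shape)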
How do I check if PyTorch is using the GPU? The nvidia-smi command can detect GPU activity, but I want to check it directly from inside a Python script.
These functions should help:
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.current_device()
0
>>> torch.cuda.device(0)
<torch.cuda.device at 0x7efce0b03be0>
>>> torch.cuda.get_device_name(0)
'GeForce GTX 950M'
This tells us:
CUDA is available and can be used by one device.
Device 0 refers to the GPU GeForce GTX 950M, and it is currently chosen by PyTorch.
As it hasn't been proposed here, I'm adding a method using torch.device, which is quite handy, also when initializing tensors on the correct device.
# setting device on GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

# Additional info when using CUDA
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3, 1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3, 1), 'GB')
Edit: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved. So use memory_cached for older versions.
Output:
Using device: cuda
Tesla K80
Memory Usage:
Allocated: 0.3 GB
Cached: 0.6 GB
As mentioned above, using device it is possible to:
Move tensors to the respective device:
torch.rand(10).to(device)
Create a tensor directly on the device:
torch.rand(10, device=device)
This makes switching between CPU and GPU comfortable without changing the actual code.
Edit:
As there have been some questions and confusion about cached and allocated memory, I'm adding some additional information about it:
torch.cuda.max_memory_cached(device=None): Returns the maximum GPU memory managed by the caching allocator in bytes for a given device.
torch.cuda.memory_allocated(device=None) Returns the current GPU memory usage by tensors in bytes for a given device.
You can either directly hand over a device as specified further above in the post or you can leave it None and it will use the current_device().
Additional note: Old graphics cards with CUDA compute capability 3.0 or lower may be visible but cannot be used by PyTorch! Thanks to hekimgil for pointing this out! - "Found GPU0 GeForce GT 750M which is of cuda capability 3.0. PyTorch no longer supports this GPU because it is too old. The minimum cuda capability that we support is 3.5."
After you start the training loop, if you want to watch from the terminal whether your program is utilizing GPU resources and to what extent, you can simply use watch, as in:
$ watch -n 2 nvidia-smi
This will continuously update the usage stats every 2 seconds until you press Ctrl+C.
If you need more control over which GPU stats are shown, you can use a more sophisticated version of nvidia-smi with --query-gpu=.... Below is a simple illustration:
$ watch -n 3 nvidia-smi --query-gpu=index,gpu_name,memory.total,memory.used,memory.free,temperature.gpu,pstate,utilization.gpu,utilization.memory --format=csv
which would output the stats as comma-separated values, roughly one row per GPU.
Note: There should not be any spaces between the comma-separated query names in --query-gpu=.... Otherwise those values will be ignored and no stats are returned.
Also, you can check whether your installation of PyTorch detects your CUDA installation correctly by doing:
In [13]: import torch
In [14]: torch.cuda.is_available()
Out[14]: True
A True status means that PyTorch is configured correctly and can use the GPU, although you still have to move/place the tensors with the necessary statements in your code.
If you want to do this inside Python code, then look into this module:
https://github.com/jonsafari/nvidia-ml-py or in pypi here: https://pypi.python.org/pypi/nvidia-ml-py/
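For instance, here is a minimal sketch using that package (assuming nvidia-ml-py is installed and importable as pynvml):
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)         # GPU 0
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # values in bytes
util = pynvml.nvmlDeviceGetUtilizationRates(handle)   # percentages
print("memory used / total:", mem.used, "/", mem.total)
print("gpu util %:", util.gpu, "| mem util %:", util.memory)
pynvml.nvmlShutdown()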
From a practical standpoint, just one minor digression:
import torch
dev = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
This dev now knows whether it is cuda or cpu.
And there is a difference in how you deal with models and with tensors when moving them to CUDA. It is a bit strange at first.
import torch
import torch.nn as nn
dev = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
t1 = torch.randn(1,2)
t2 = torch.randn(1,2).to(dev)
print(t1) # tensor([[-0.2678, 1.9252]])
print(t2) # tensor([[ 0.5117, -3.6247]], device='cuda:0')
t1.to(dev)
print(t1) # tensor([[-0.2678, 1.9252]])
print(t1.is_cuda) # False
t1 = t1.to(dev)
print(t1) # tensor([[-0.2678, 1.9252]], device='cuda:0')
print(t1.is_cuda) # True
class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(1, 2)

    def forward(self, x):
        x = self.l1(x)
        return x
model = M() # not on cuda
model.to(dev) # is on cuda (all parameters)
print(next(model.parameters()).is_cuda) # True
This is all a bit tricky, but understanding it once helps you work quickly with less debugging.
From the official site's get started page, you can check if the GPU is available for PyTorch like so:
import torch
torch.cuda.is_available()
Reference: PyTorch | Get Started
Query                                      Command
Does PyTorch see any GPUs?                 torch.cuda.is_available()
Are tensors stored on the GPU by default?  torch.rand(10).device
Set the default tensor type to CUDA:       torch.set_default_tensor_type(torch.cuda.FloatTensor)
Is this tensor a GPU tensor?               my_tensor.is_cuda
Is this model stored on the GPU?           all(p.is_cuda for p in my_model.parameters())
To check if there is a GPU available:
torch.cuda.is_available()
If the above function returns False,
you either have no GPU,
or the Nvidia drivers have not been installed so the OS does not see the GPU,
or the GPU is being hidden by the environment variable CUDA_VISIBLE_DEVICES. When the value of CUDA_VISIBLE_DEVICES is -1, all your devices are being hidden. You can check that value in code with this line: os.environ['CUDA_VISIBLE_DEVICES']
If the above function returns True, that does not necessarily mean that you are using the GPU. In PyTorch you can allocate tensors to devices when you create them. By default, tensors are allocated on the CPU. To check where your tensor is allocated, do:
# assuming that 'a' is a tensor created somewhere else
a.device # returns the device where the tensor is allocated
Note that you cannot operate on tensors allocated on different devices. To see how to allocate a tensor to the GPU, see here: https://pytorch.org/docs/stable/notes/cuda.html
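For illustration, a minimal sketch of the two usual ways to get a tensor onto the GPU (assuming a CUDA device is available):
import torch

x = torch.zeros(3, device='cuda')   # created directly on the GPU
y = torch.zeros(3).to('cuda')       # created on the CPU, then moved
print(x.device, y.device)           # cuda:0 cuda:0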
Almost all answers here reference torch.cuda.is_available(). However, that's only one side of the coin. It tells you whether the GPU (actually CUDA) is available, not whether it's actually being used. In a typical setup, you would set your device with something like this:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
but in larger environments (e.g. research) it is also common to give the user more options, so based on input they can disable CUDA, specify CUDA IDs, and so on. In such cases, whether the GPU is used is not based solely on whether it is available. After the device has been set to a torch device, you can check its type property to verify whether it is CUDA or not.
if device.type == 'cuda':
    # do something
From the command prompt or a Linux shell, simply run the following commands.
python -c 'import torch; print(torch.cuda.is_available())'
The above should print True
python -c 'import torch; print(torch.rand(2,3).cuda())'
This one should print the following:
tensor([[0.7997, 0.6170, 0.7042],
        [0.4174, 0.1494, 0.0516]], device='cuda:0')
If you are here because torch.cuda.is_available() always returns False in your PyTorch, that's probably because you installed a PyTorch version without GPU support. (E.g., you coded on a laptop and are then testing on a server.)
The solution is to uninstall PyTorch and install it again with the right command from the PyTorch downloads page. Also refer to this PyTorch issue.
It is possible for
torch.cuda.is_available()
to return True but to get the following error when running
>>> torch.rand(10).to(device)
as suggested by MBT:
RuntimeError: CUDA error: no kernel image is available for execution on the device
This link explains that
... torch.cuda.is_available only checks whether your driver is compatible with the version of cuda we used in the binary. So it means that CUDA 10.1 is compatible with your driver. But when you do computation with CUDA, it couldn't find the code for your arch.
If you are using Linux, I suggest installing nvtop:
https://github.com/Syllo/nvtop
You will get a live, htop-like view of GPU utilization and memory usage.
For a MacBook M1 system:
import torch
print(torch.backends.mps.is_available(), torch.backends.mps.is_built())
And both should be True.
Create a tensor on the GPU as follows:
$ python
>>> import torch
>>> print(torch.rand(3,3).cuda())
Do not quit; open another terminal and check whether the Python process is using the GPU with:
$ nvidia-smi
Using the code below
import torch
torch.cuda.is_available()
will only tell you whether the GPU is present and detected by PyTorch or not.
But in Task Manager -> Performance, the GPU utilization may be only a few percent,
which means you are actually running on the CPU.
To solve the above issue, check and change:
Graphics settings --> turn on hardware-accelerated GPU scheduling, then restart.
Open the NVIDIA Control Panel --> Desktop --> Display GPU in the notification area
[Note: If you have a freshly installed Windows, you also have to accept the terms and conditions in the NVIDIA Control Panel.]
This should work!
Step 1: import the torch library
import torch

# Step 2: create a tensor
tensor = torch.tensor([5, 6])

# Step 3: find the device type
# Output 1: below we get the size (tensor.shape), dimension (tensor.ndim),
# and the device on which the tensor is processed
tensor, tensor.device, tensor.ndim, tensor.shape
(tensor([5, 6]), device(type='cpu'), 1, torch.Size([2]))

# or
# Output 2: below we get only the device type
tensor.device
device(type='cpu')
# My system uses a CPU processor: "11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz 2.42 GHz"

# Check whether the tensor is processed on the GPU
print(tensor, torch.cuda.is_available())
# the output will be
tensor([5, 6]) False
# The above output is False, hence it is not on the GPU
# Happy coding :)
I have tried to modify the CIFAR-10 example to run on the new TensorFlow distributed runtime. However, I get the following error when trying to run the program:
InvalidArgumentError: Cannot assign a device to node 'softmax_linear/biases/ExponentialMovingAverage':
Could not satisfy explicit device specification '/job:local/task:0/device:CPU:0'
I start the cluster using the following commands. On the first node I run:
bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server --cluster_spec='local|10.31.101.101:7777;10.31.101.224:7778' --job_name=local --task_id=0
...and on the second node I run:
bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server --cluster_spec='local|10.31.101.101:7777;10.31.101.224:7778' --job_name=local --task_id=1
For the CIFAR-10 multi-GPU code, I made some simple modifications, replacing two lines in the train() function. The following line:
with tf.Graph().as_default(), tf.device('/cpu:0'):
...is replaced with:
with tf.Graph().as_default(), tf.device('/job:local/task:0/cpu:0'):
and the following line:
with tf.device('/gpu:%d' % i):
...is replaced with:
with tf.device('/job:local/task:0/gpu:%d' % i):
In my understanding, the second substitution should take care of placing the model on the right device. Running a simpler example, like the code below, works fine:
with tf.device('/job:local/task:0/cpu:0'):
    c = tf.constant("Hello, distributed TensorFlow!")
    sess.run(c)
    print(c)
I can't tell from your program, but my guess is that you also have to modify the line that creates the session to specify the address of one of your worker tasks. For example, given your configuration above, you might write:
sess = tf.Session(
    "grpc://10.31.101.101:7777",
    config=tf.ConfigProto(
        allow_soft_placement=True,
        log_device_placement=FLAGS.log_device_placement))
As it happens, we've been trying to improve that error message to make it less confusing. If you update to the latest version in GitHub and run the same code, you should see an error message that explains why the device specification could not be satisfied.