How do I check if PyTorch is using the GPU?

How do I check if PyTorch is using the GPU? The nvidia-smi command can detect GPU activity, but I want to check it directly from inside a Python script.

These functions should help:
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.current_device()
0
>>> torch.cuda.device(0)
<torch.cuda.device at 0x7efce0b03be0>
>>> torch.cuda.get_device_name(0)
'GeForce GTX 950M'
This tells us:
CUDA is available and can be used by one device.
Device 0 refers to the GPU GeForce GTX 950M, and it is currently chosen by PyTorch.

Since it hasn't been proposed here, I'm adding a method using torch.device, which is also quite handy when initializing tensors on the correct device.
# setting device on GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

# additional info when using CUDA
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3, 1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3, 1), 'GB')
Edit: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved. So use memory_cached for older versions.
Output:
Using device: cuda
Tesla K80
Memory Usage:
Allocated: 0.3 GB
Cached: 0.6 GB
As mentioned above, using device it is possible to:
Move tensors to the respective device:
torch.rand(10).to(device)
Create a tensor directly on the device:
torch.rand(10, device=device)
This makes switching between CPU and GPU convenient without changing the actual code.
Edit:
As there have been some questions and confusion about cached and allocated memory, I'm adding some additional information about it:
torch.cuda.max_memory_cached(device=None): Returns the maximum GPU memory managed by the caching allocator in bytes for a given device.
torch.cuda.memory_allocated(device=None): Returns the current GPU memory usage by tensors in bytes for a given device.
You can either hand over a device directly, as specified further above in the post, or leave it as None and the current_device() will be used.
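For a quick illustration of the difference between allocated and reserved memory, here is a minimal sketch (the exact numbers depend on your GPU and on the allocator's state):
import torch

if torch.cuda.is_available():
    x = torch.empty(1024, 1024, device='cuda')              # a ~4 MB float32 tensor
    print(torch.cuda.memory_allocated() / 1024**2, 'MiB')   # memory occupied by live tensors
    print(torch.cuda.memory_reserved() / 1024**2, 'MiB')    # memory held by the caching allocator
    del x
    # after the tensor is freed, allocated drops but reserved usually stays the same,
    # because the allocator keeps the freed block cached for reuse
    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())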
Additional note: Old graphics cards with a CUDA compute capability of 3.0 or lower may be visible but cannot be used by PyTorch! Thanks to hekimgil for pointing this out! - "Found GPU0 GeForce GT 750M which is of cuda capability 3.0. PyTorch no longer supports this GPU because it is too old. The minimum cuda capability that we support is 3.5."

After you start the training loop, if you want to watch from the terminal whether your program is utilizing GPU resources, and to what extent, you can simply use watch:
$ watch -n 2 nvidia-smi
This will continuously update the usage stats every 2 seconds until you press Ctrl+C.
If you need more control over which GPU stats are reported, you can use a more sophisticated version of nvidia-smi with --query-gpu=.... Below is a simple illustration:
$ watch -n 3 nvidia-smi --query-gpu=index,gpu_name,memory.total,memory.used,memory.free,temperature.gpu,pstate,utilization.gpu,utilization.memory --format=csv
which would output the stats in CSV form, one row per GPU.
Note: There should not be any space between the comma-separated query names in --query-gpu=.... Otherwise those values will be ignored and no stats are returned.
Also, you can check whether your installation of PyTorch detects your CUDA installation correctly by doing:
In [13]: import torch
In [14]: torch.cuda.is_available()
Out[14]: True
A True status means that PyTorch is configured correctly and can use the GPU, although you still have to move/place the tensors on it with the necessary statements in your code.
If you want to do this inside Python code, then look into this module:
https://github.com/jonsafari/nvidia-ml-py or in pypi here: https://pypi.python.org/pypi/nvidia-ml-py/
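If you want to query this from Python, a minimal sketch using the pynvml bindings shipped by that package (assuming a single NVIDIA GPU at index 0) could look like this:
import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)         # first GPU
util = pynvml.nvmlDeviceGetUtilizationRates(handle)   # .gpu / .memory are percentages
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # .used / .total are in bytes
print('GPU util: %d%%, mem: %d/%d MiB' % (util.gpu, mem.used // 1024**2, mem.total // 1024**2))
pynvml.nvmlShutdown()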

From a practical standpoint, just one minor digression:
import torch
dev = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
This dev now knows whether it is cuda or cpu.
There is also a difference in how you deal with models versus tensors when moving them to cuda. It is a bit strange at first.
import torch
import torch.nn as nn
dev = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
t1 = torch.randn(1,2)
t2 = torch.randn(1,2).to(dev)
print(t1) # tensor([[-0.2678, 1.9252]])
print(t2) # tensor([[ 0.5117, -3.6247]], device='cuda:0')
t1.to(dev)
print(t1) # tensor([[-0.2678, 1.9252]])
print(t1.is_cuda) # False
t1 = t1.to(dev)
print(t1) # tensor([[-0.2678, 1.9252]], device='cuda:0')
print(t1.is_cuda) # True
class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(1, 2)

    def forward(self, x):
        x = self.l1(x)
        return x

model = M()     # not on cuda
model.to(dev)   # is on cuda (all parameters)
print(next(model.parameters()).is_cuda) # True
This is all a bit tricky, but understanding it once helps you work fast with less debugging.

From the official site's get started page, you can check if the GPU is available for PyTorch like so:
import torch
torch.cuda.is_available()
Reference: PyTorch | Get Started

Query: Does PyTorch see any GPUs?
Command: torch.cuda.is_available()

Query: Are tensors stored on the GPU by default?
Command: torch.rand(10).device

Query: How do I set the default tensor type to CUDA?
Command: torch.set_default_tensor_type(torch.cuda.FloatTensor)

Query: Is this tensor a GPU tensor?
Command: my_tensor.is_cuda

Query: Is this model stored on the GPU?
Command: all(p.is_cuda for p in my_model.parameters())
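Putting those checks together in one script, here is a sketch; my_tensor and my_model below are just placeholder objects, and set_default_tensor_type is left out since it changes global state:
import torch
import torch.nn as nn

print(torch.cuda.is_available())   # does PyTorch see any GPUs?
print(torch.rand(10).device)       # default device for new tensors (cpu)

my_tensor = torch.rand(10)
my_model = nn.Linear(10, 1)
if torch.cuda.is_available():
    my_tensor = my_tensor.cuda()
    my_model = my_model.cuda()

print(my_tensor.is_cuda)                              # is this tensor a GPU tensor?
print(all(p.is_cuda for p in my_model.parameters()))  # is the whole model on the GPU?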

To check if there is a GPU available:
torch.cuda.is_available()
If the above function returns False,
you either have no GPU,
or the Nvidia drivers have not been installed so the OS does not see the GPU,
or the GPU is being hidden by the environment variable CUDA_VISIBLE_DEVICES. When the value of CUDA_VISIBLE_DEVICES is -1, all your devices are being hidden. You can check that value in code with this line: os.environ['CUDA_VISIBLE_DEVICES']
If the above function returns True, that does not necessarily mean that you are using the GPU. In PyTorch you can allocate tensors to devices when you create them. By default, tensors are allocated on the CPU. To check where your tensor is allocated, do:
# assuming that 'a' is a tensor created somewhere else
a.device # returns the device where the tensor is allocated
Note that you cannot operate on tensors allocated on different devices. To see how to allocate a tensor to the GPU, see here: https://pytorch.org/docs/stable/notes/cuda.html
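A small sketch combining these checks, assuming you just want to see where things currently live:
import os
import torch

print(os.environ.get('CUDA_VISIBLE_DEVICES'))   # None means the variable is simply not set
a = torch.rand(3)
print(a.device)                                 # cpu by default
if torch.cuda.is_available():
    a = a.to('cuda')
    print(a.device)                             # cuda:0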

Almost all answers here reference torch.cuda.is_available(). However, that's only one side of the coin. It tells you whether the GPU (actually CUDA) is available, not whether it's actually being used. In a typical setup, you would set your device with something like this:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
but in larger environments (e.g. research) it is also common to give the user more options, so based on input they can disable CUDA, specify CUDA IDs, and so on. In that case, whether or not the GPU is used is not based only on whether it is available. After the device has been set to a torch device, you can check its type property to verify whether it's CUDA or not.
if device.type == 'cuda':
# do something
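A sketch of that pattern follows; the --disable-cuda and --gpu-id flags are purely illustrative, not a standard API:
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument('--disable-cuda', action='store_true', help='force CPU even if CUDA is available')
parser.add_argument('--gpu-id', type=int, default=0, help='which CUDA device to use')
args = parser.parse_args()

if torch.cuda.is_available() and not args.disable_cuda:
    device = torch.device('cuda:{}'.format(args.gpu_id))
else:
    device = torch.device('cpu')

if device.type == 'cuda':
    print('Using', torch.cuda.get_device_name(device))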

Simply run the following command from the command prompt or a Linux shell:
python -c 'import torch; print(torch.cuda.is_available())'
The above should print True
python -c 'import torch; print(torch.rand(2,3).cuda())'
This one should print the following:
tensor([[0.7997, 0.6170, 0.7042],
        [0.4174, 0.1494, 0.0516]], device='cuda:0')

If you are here because your PyTorch always gives False for torch.cuda.is_available(), that's probably because you installed a PyTorch version without GPU support (e.g. you coded on a laptop and are now testing on a server).
The solution is to uninstall PyTorch and install it again with the right command from the PyTorch downloads page. Also refer to this PyTorch issue.
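One quick way to confirm that the installed build itself lacks GPU support (rather than it being a driver problem) is to look at torch.version.cuda, which is None in CPU-only builds:
import torch

print(torch.__version__)           # a CPU-only wheel typically shows a "+cpu" suffix
print(torch.version.cuda)          # None if the installed build has no CUDA support
print(torch.cuda.is_available())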

It is possible for
torch.cuda.is_available()
to return True but to get the following error when running
>>> torch.rand(10).to(device)
as suggested by MBT:
RuntimeError: CUDA error: no kernel image is available for execution on the device
This link explains that
... torch.cuda.is_available only checks whether your driver is compatible with the version of cuda we used in the binary. So it means that CUDA 10.1 is compatible with your driver. But when you do computation with CUDA, it couldn't find the code for your arch.
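To check for this mismatch from Python, you can compare the card's compute capability with the architectures your binary was compiled for; a sketch, assuming a recent PyTorch release that provides torch.cuda.get_arch_list():
import torch

print(torch.cuda.get_device_capability(0))  # e.g. (3, 5) for the card
print(torch.cuda.get_arch_list())           # architectures the binary supports, e.g. ['sm_37', 'sm_50', ...]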

If you are using Linux, I suggest installing nvtop:
https://github.com/Syllo/nvtop
You will get a live, htop-style view of GPU utilization and memory.

For a MacBook M1 system:
import torch
print(torch.backends.mps.is_available(), torch.backends.mps.is_built())
And both should be True.
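If both are True, a minimal sketch for actually placing work on the Apple GPU:
import torch

device = torch.device('mps') if torch.backends.mps.is_available() else torch.device('cpu')
x = torch.rand(3, 3, device=device)
print(x.device)  # mps:0 on an M1 machine with a recent PyTorch build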

Create a tensor on the GPU as follows:
$ python
>>> import torch
>>> print(torch.rand(3,3).cuda())
Do not quit the interpreter; open another terminal and check whether the Python process is using the GPU with:
$ nvidia-smi

Using the code below
import torch
torch.cuda.is_available()
will only tell you whether the GPU is present and detected by PyTorch or not.
But in Task Manager -> Performance the GPU utilization may still be only a few percent,
which means you are actually running on the CPU.
To solve the above issue, check and change:
Graphics settings: turn on the hardware-accelerated GPU setting, then restart.
Open the NVIDIA Control Panel -> Desktop -> Display GPU in the notification area.
[Note: If you have newly installed Windows, you also have to agree to the terms and conditions in the NVIDIA Control Panel.]
This should work!

Step 1: import the torch library
import torch
# Step 2: create a tensor
tensor = torch.tensor([5, 6])
# Step 3: find the device type
# Output 1: below we get the size (tensor.shape), dimension (tensor.ndim),
# and the device on which the tensor is processed
tensor, tensor.device, tensor.ndim, tensor.shape
(tensor([5, 6]), device(type='cpu'), 1, torch.Size([2]))
# or
# Output 2: below we get only the device type
tensor.device
device(type='cpu')
# As my system uses a CPU processor: "11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz"
# Find out whether the tensor is processed on the GPU:
print(tensor, torch.cuda.is_available())
# The output will be:
tensor([5, 6]) False
# The above output is False, hence the tensor is not on the GPU
# Happy coding :)

Related

How to use all GPUs in SageMaker real-time inference?

I have deployed a model for real-time inference on a single-GPU instance and it works fine.
Now I want to use multiple GPUs to decrease the inference time. What do I need to change in my inference.py to make it work?
Here is some of my code:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
def model_fn(model_dir):
logger.info("Loading first model...")
model = Model().to(DEVICE)
with open(os.path.join(model_dir, "checkpoint.pth"), "rb") as f:
model.load_state_dict(torch.load(f, map_location=DEVICE)['state_dict'])
model = model.eval()
logger.info("Loading second model...")
model_2 = Model_2()
model_2.to(DEVICE)
checkpoint = torch.load('checkpoint_2.pth', map_location=DEVICE)
model_2(remove_prefix_state_dict(checkpoint['state_dict']), strict=True)
model_2 = model_2()
logger.info('Done loading models')
return {'first_model': model, 'second_model': model_2}
def input_fn(request_body, request_content_type):
assert request_content_type=='application/json'
url = json.loads(request_body)['url']
save_name = json.loads(request_body)['save_name']
logger.info(f'Image url: {url}')
img = Image.open(requests.get(url, stream=True).raw).convert('RGB')
w, h = img.size
input_tensor = preprocess(img)
input_batch = input_tensor.unsqueeze(0).to(DEVICE)
logger.info('Image ready to predict!')
return {'tensor':input_batch, 'w':w,'h':h,'image':img, 'save_name':save_name}
def predict_fn(input_object, model):
data = input_object['tensor']
logger.info('Generating prediction based on the input image')
model_1 = model['first_model']
model_2 = model['second_model']
d0, d1, d2, d3, d4, d5, d6 = model_1(data)
torch.cuda.empty_cache()
mask = torch.argmax(d0[0], axis=0).cpu().numpy()
mask = np.where(mask==2, 255, mask)
mask = np.where(mask==1, 128, mask)
img = input_object['image']
final_image = Image.fromarray(mask).resize((input_object['w'], input_object['h'])).convert('L')
img = np.array(img)[:,:,::-1]
final_image = np.array(final_image)
image_dict = to_dict(img, final_image)
final_image = model_2_process(model_2, image_dict)
torch.cuda.empty_cache()
return {"final_ouput": final_image, 'image':input_object['image'], 'save_name': input_object['save_name']}
I was thinking that maybe it could be done with torch multiprocessing; any tips?
To infer a single model that fits on one GPU, multi-GPU instances are generally not advised: inference is a share-nothing task, so you can use N single-GPU instances and things are simpler and equally performant.
Inference on a multi-GPU host is useful in 2 cases: (1) if you do model-parallel inference (not your case) or (2) if your inference service consists of a graph of models that call each other. In that case, the proximity of the various models called in the DAG can reduce latency. That seems to be your situation.
The answer mentioning Torch DDP and DP is not exactly appropriate, since the value of those libraries is to conduct multi-GPU gradient descent (averaging the gradients across GPUs in particular), which, as mentioned above, does not happen at inference. Actually, a well-done, optimized inference ideally doesn't even use PyTorch or TensorFlow at all, but instead a prediction-only optimized runtime such as SageMaker Neo, ONNXRuntime or NVIDIA TensorRT, to reduce memory footprint and latency.
My recommendations are the following:
Try using NVIDIA Triton, which supports those DAG use cases well and is supported on SageMaker: https://aws.amazon.com/fr/blogs/machine-learning/deploy-fast-and-scalable-ai-with-nvidia-triton-inference-server-in-amazon-sagemaker/
If you want to do things custom, you could try assigning the 2 models to different CUDA device ids in PyTorch (see the sketch after this list). Because CUDA kernels run asynchronously, this could be enough to get some parallelism and a bit of acceleration versus 1 GPU, if your models can run in parallel.
I saw multiprocessing used once (with MXNet) to load-balance inference requests across GPUs (in this AWS blog post), but that was for a share-nothing, map-style distribution of batches of inferences. In your case you seem to have a connection between your models, so Triton is probably a better fit.
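A rough sketch of that custom option; Model, Model_2 and batch stand in for the question's own classes and input, and a 2-GPU instance is assumed:
import torch

dev0 = torch.device('cuda:0')
dev1 = torch.device('cuda:1')

model_1 = Model().to(dev0).eval()    # placeholder classes from the question
model_2 = Model_2().to(dev1).eval()

with torch.no_grad():
    out_1 = model_1(batch.to(dev0))
    # move the intermediate result to the second model's device before calling it
    out_2 = model_2(out_1.to(dev1))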
Finally, if your goal is to reduce latency, there are other ideas:
Fix any CPU bottleneck. Your code seems to have a lot of CPU work (pre-processing, numpy...). Are you sure the GPU is the bottleneck? If the CPU is at 80%+, try a large single-GPU G5 instance, such as g5.16xlarge. They are great for computer vision inference.
Use a better GPU. If you are using a P2, P3 or G4dn, try G5 instead.
Optimize code. 2 things to try, depending on the bottleneck:
If you do the inference in Torch, try to avoid doing algebra with Numpy, and do as much as possible with torch tensors on GPU.
If the GPU is the bottleneck, try replacing PyTorch with ONNXRuntime or NVIDIA TensorRT (a minimal export sketch follows below).
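For the ONNX route, a minimal export sketch; model and the input shape are placeholders for your own network:
import torch

dummy_input = torch.randn(1, 3, 224, 224)   # representative input shape, adjust to your model
torch.onnx.export(model, dummy_input, 'model.onnx',
                  input_names=['input'], output_names=['output'])
# model.onnx can then be served with onnxruntime or compiled with TensorRT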
You must use torch.nn.DataParallel or torch.nn.parallel.DistributedDataParallel (read "Multi-GPU Examples" and "Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel").
You must call the function by passing at least these three parameters:
module (Module): the module to be parallelized (your model).
device_ids (list of python:int or torch.device): CUDA devices. For single-device modules, device_ids can contain exactly one device id, which represents the only CUDA device where the input module corresponding to this process resides. Alternatively, device_ids can also be None. For multi-device modules and CPU modules, device_ids must be None. When device_ids is None for both cases, both the input data for the forward pass and the actual module must be placed on the correct device. (default: None)
output_device (int or torch.device): device location of output for single-device CUDA modules. For multi-device modules and CPU modules, it must be None, and the module itself dictates the output location. (default: device_ids[0] for single-device modules)
for example:
from torch.nn.parallel import DistributedDataParallel
model = DistributedDataParallel(model, device_ids=[i], output_device=i)

How does torch.cuda.synchronize() behave?

According to the PyTorch documentation torch.cuda.synchronize "Waits for all kernels in all streams on a CUDA device to complete.". Questions:
Should this say "Waits for all kernels in all streams initiated by this Python session on a CUDA device to complete"? In other words, if Python session A is running CUDA operations, and I call torch.cuda.synchronize() in Python session B, that won't care about what's happening in Python session A right?
Surely if we don't call torch.cuda.synchronize(), but try to work with any python code referencing the tensors in the computation graph, then it's like implicitly calling it right?
Q2 in code:
output = model(inputs) # cuda starts working here
a = 1 + 1 # cuda might still be running the previous line. This line can run at the same time
other_model(output) # This implicitly does the same thing as torch.cuda.synchronize() then does a forward pass of other_model
b = a + a # This line can't happen until cuda is done and the previous line has been executed
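One concrete case where the explicit call matters is timing GPU code. A sketch, assuming a CUDA device is present (sizes are arbitrary):
import time
import torch

x = torch.randn(4096, 4096, device='cuda')
torch.cuda.synchronize()           # make sure setup work is finished before timing
start = time.time()
y = x @ x                          # kernel is launched asynchronously
torch.cuda.synchronize()           # block until the kernel has actually finished
print(f'matmul took {time.time() - start:.4f} s')
# without the second synchronize() you would mostly measure the kernel launch overhead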

'use_cuda' set to True when cuda is unavailable. Make sure CUDA is available or set use_cuda=False

I am trying to create a BERT model for classifying the Turkish language. Here is my code:
import pandas as pd
import torch
df = pd.read_excel (r'preparedDataNoId.xlsx')
df = df.sample(frac = 1)
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.10)
print('train shape: ',train_df.shape)
print('test shape: ',test_df.shape)
from simpletransformers.classification import ClassificationModel
# define hyperparameters
train_args = {"reprocess_input_data": True,
              "fp16": False,
              "num_train_epochs": 4}
# Create a ClassificationModel
model = ClassificationModel(
    "bert", "dbmdz/bert-base-turkish-cased",
    num_labels=4,
    args=train_args
)
I am using Anaconda and Spyder. I think everything is correct, but when I run this I get the following error:
'use_cuda' set to True when cuda is unavailable. Make sure CUDA is available or set use_cuda=False.
how can I fix this exactly?
I ran into the same problem. If you have CUDA available, then set both use_cuda and fp16 to True. If not, then set both to False.
CUDA is a parallel computing platform and programming model developed by Nvidia for general computing on its own GPUs.
If your computer does not have GPU, this error will be thrown to you.
Don't forget to include this parameter:
use_cuda=False
This will not affect your result; it will just take a few more seconds than usual to process.
model = ClassificationModel(
    "bert", "dbmdz/bert-base-turkish-cased",
    num_labels=4,
    args=train_args,
    use_cuda=False
)
Adding use_cuda=False will help if the GPU is not available.
If the GPU is unavailable on your computer, make sure to check CUDA or try use_cuda=False in the args of your model. This error is thrown because CUDA does not exist on your computer.

How do I list all currently available GPUs with pytorch?

I know I can access the current GPU using torch.cuda.current_device(), but how can I get a list of all the currently available GPUs?
You can list all the available GPUs by doing:
>>> import torch
>>> available_gpus = [torch.cuda.device(i) for i in range(torch.cuda.device_count())]
>>> available_gpus
[<torch.cuda.device object at 0x7f2585882b50>]
Check how many GPUs are available with PyTorch
import torch
num_of_gpus = torch.cuda.device_count()
print(num_of_gpus)
In case you want to use the first GPU:
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
Replace the 0 in the above command with another number if you want to use another GPU.
I know this answer is kind of late. I thought the author of the question asked what devices are actually available to PyTorch, not:
how many are available (obtainable with device_count()), or
the device manager handle (obtainable with torch.cuda.device(i)), which is what some of the other answers give.
If you want to know the actual GPU name (e.g. NVIDIA GeForce RTX 2070), try the following instead:
import torch
for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_properties(i).name)
Note the use of the get_device_properties(i) function. This returns an object that looks like this:
_CudaDeviceProperties(name='NVIDIA GeForce RTX 2070', major=8, minor=6, total_memory=12044MB, multi_processor_count=28)
This object contains a property called name. You may optionally drill down directly to the name property to get the human-readable name associated with the GPU in question.
Extending the previous replies with device properties
$ python3 -c "import torch; print([(i, torch.cuda.get_device_properties(i)) for i in range(torch.cuda.device_count())])"
[(0, _CudaDeviceProperties(name='NVIDIA GeForce RTX 3060', major=8, minor=6, total_memory=12044MB, multi_processor_count=28))]

automatically choose a device keras tensorflow [duplicate]

I have access through ssh to a cluster of n GPUs. Tensorflow automatically gave them names gpu:0,...,gpu:(n-1).
Others have access too and sometimes they take random gpus.
I did not place any tf.device() explicitly, because that is cumbersome, and even if I selected GPU number j, someone might already be on GPU number j, which would be problematic.
I would like to go through the GPUs' usage, find the first one that is unused, and use only that one.
I guess someone could parse the output of nvidia-smi with bash, get a variable i, and feed that variable i to the TensorFlow script as the number of the GPU to use.
I have never seen any example of this. I imagine it is a pretty common problem. What would be the simplest way to do that? Is a pure TensorFlow one available?
I'm not aware of a pure-TensorFlow solution. The problem is that the existing place for TensorFlow configuration is the Session config. However, the GPU memory pool is shared by all TensorFlow sessions within a process, so the Session config would be the wrong place to add it, and there's no mechanism for process-global config (but there should be, to also be able to configure the process-global Eigen threadpool). So you need to do this at the process level, by using the CUDA_VISIBLE_DEVICES environment variable.
Something like this:
import subprocess, re

# Nvidia-smi GPU memory parsing.
# Tested on nvidia-smi 370.23

def run_command(cmd):
    """Run command, return output as string."""
    output = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True).communicate()[0]
    return output.decode("ascii")

def list_available_gpus():
    """Returns list of available GPU ids."""
    output = run_command("nvidia-smi -L")
    # lines of the form GPU 0: TITAN X
    gpu_regex = re.compile(r"GPU (?P<gpu_id>\d+):")
    result = []
    for line in output.strip().split("\n"):
        m = gpu_regex.match(line)
        assert m, "Couldn't parse " + line
        result.append(int(m.group("gpu_id")))
    return result

def gpu_memory_map():
    """Returns map of GPU id to memory allocated on that GPU."""
    output = run_command("nvidia-smi")
    gpu_output = output[output.find("GPU Memory"):]
    # lines of the form
    # | 0  8734  C  python  11705MiB |
    memory_regex = re.compile(r"[|]\s+?(?P<gpu_id>\d+)\D+?(?P<pid>\d+).+[ ](?P<gpu_memory>\d+)MiB")
    rows = gpu_output.split("\n")
    result = {gpu_id: 0 for gpu_id in list_available_gpus()}
    for row in rows:
        m = memory_regex.search(row)
        if not m:
            continue
        gpu_id = int(m.group("gpu_id"))
        gpu_memory = int(m.group("gpu_memory"))
        result[gpu_id] += gpu_memory
    return result

def pick_gpu_lowest_memory():
    """Returns GPU with the least allocated memory."""
    memory_gpu_map = [(memory, gpu_id) for (gpu_id, memory) in gpu_memory_map().items()]
    best_memory, best_gpu = sorted(memory_gpu_map)[0]
    return best_gpu
You can then put this in utils.py and set the GPU in your TensorFlow script before the first tensorflow import, i.e.:
import utils
import os
os.environ["CUDA_VISIBLE_DEVICES"] = str(utils.pick_gpu_lowest_memory())
import tensorflow
An implementation along the lines of Yaroslav Bulatov's solution is available on https://github.com/bamos/setGPU.
