I have an NLP model trained with PyTorch that I want to run on a Jetson Xavier. I installed jetson-stats to monitor CPU and GPU usage. When I run the Python script, only the CPU cores are under load; the GPU bar does not move. I searched Google with keywords like "How to check if pytorch is using the GPU?" and checked the results on stackoverflow.com and elsewhere. Following the advice given to others with a similar issue, I confirmed that CUDA is available and that there is a CUDA device on my Jetson Xavier. However, I don't understand why the GPU bar does not change while the CPU core bars max out.
I don't want to use the CPU; it takes far too long to compute. As far as I can tell, the script is using the CPU, not the GPU. How can I be sure, and if it is using the CPU, how can I switch it to the GPU?
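For reference, here is the quick check I ran (a minimal sketch; model is the model loaded in the code below):

import torch

print(torch.cuda.is_available())          # prints True on my Jetson
print(torch.cuda.get_device_name(0))      # name of the CUDA device
print(next(model.parameters()).device)    # "cpu" here means the weights never left the CPU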
Note: the model is taken from the Hugging Face transformers library. I have tried calling the cuda() method on the model (model.cuda()). In that scenario the GPU is used, but I cannot get an output from the model; it raises an exception.
Here is the code:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
import torch
BERT_DIR = "savasy/bert-base-turkish-squad"
tokenizer = AutoTokenizer.from_pretrained(BERT_DIR)
model = AutoModelForQuestionAnswering.from_pretrained(BERT_DIR)
nlp = pipeline("question-answering", model=model, tokenizer=tokenizer)

def infer(question, corpus):
    try:
        ans = nlp(question=question, context=corpus)
        return ans["answer"], ans["score"]
    except Exception:
        return None, 0
The problem has been solved by loading the pipeline with the device parameter:
nlp = pipeline("question-answering", model=BERT_DIR, device=0)
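Passing device=0 puts the model on the first CUDA device, and the pipeline then moves its tokenized inputs there as well. A quick way to verify (a minimal sketch, assuming the nlp object above):

print(nlp.device)                            # should print device(type='cuda', index=0)
print(next(nlp.model.parameters()).device)   # should print cuda:0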
For the model to work on the GPU, both the model and the data have to be moved to the GPU. With the pipeline API, passing the device argument takes care of moving the tokenized inputs. You can do this as follows:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
import torch

BERT_DIR = "savasy/bert-base-turkish-squad"
device = torch.device("cuda")

tokenizer = AutoTokenizer.from_pretrained(BERT_DIR)
model = AutoModelForQuestionAnswering.from_pretrained(BERT_DIR)
model.to(device)  ## model to GPU

# device=0 makes the pipeline run on the GPU; it moves the tokenized
# inputs (the data) to that device before each forward pass, so the
# raw question/context strings can be passed in unchanged
nlp = pipeline("question-answering", model=model, tokenizer=tokenizer, device=0)

def infer(question, corpus):
    try:
        ans = nlp(question=question, context=corpus)
        return ans["answer"], ans["score"]
    except Exception:
        return None, 0
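Example call (the question and context strings are placeholders, not from the original post):

answer, score = infer("Türkiye'nin başkenti neresidir?", "Türkiye'nin başkenti Ankara'dır.")
print(answer, score)   # expected: "Ankara" plus a confidence score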
Related
I have deployed a model for real-time inference on a single-GPU instance, and it works fine.
Now I want to use multiple GPUs to decrease the inference time. What do I need to change in my inference.py to make it work?
Here is some of my code:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def model_fn(model_dir):
    logger.info("Loading first model...")
    model = Model().to(DEVICE)
    with open(os.path.join(model_dir, "checkpoint.pth"), "rb") as f:
        model.load_state_dict(torch.load(f, map_location=DEVICE)['state_dict'])
    model = model.eval()

    logger.info("Loading second model...")
    model_2 = Model_2()
    model_2.to(DEVICE)
    checkpoint = torch.load('checkpoint_2.pth', map_location=DEVICE)
    model_2.load_state_dict(remove_prefix_state_dict(checkpoint['state_dict']), strict=True)
    model_2 = model_2.eval()

    logger.info('Done loading models')
    return {'first_model': model, 'second_model': model_2}
def input_fn(request_body, request_content_type):
    assert request_content_type == 'application/json'
    url = json.loads(request_body)['url']
    save_name = json.loads(request_body)['save_name']
    logger.info(f'Image url: {url}')
    img = Image.open(requests.get(url, stream=True).raw).convert('RGB')
    w, h = img.size
    input_tensor = preprocess(img)
    input_batch = input_tensor.unsqueeze(0).to(DEVICE)
    logger.info('Image ready to predict!')
    return {'tensor': input_batch, 'w': w, 'h': h, 'image': img, 'save_name': save_name}
def predict_fn(input_object, model):
    data = input_object['tensor']
    logger.info('Generating prediction based on the input image')
    model_1 = model['first_model']
    model_2 = model['second_model']
    d0, d1, d2, d3, d4, d5, d6 = model_1(data)
    torch.cuda.empty_cache()
    mask = torch.argmax(d0[0], axis=0).cpu().numpy()
    mask = np.where(mask==2, 255, mask)
    mask = np.where(mask==1, 128, mask)
    img = input_object['image']
    final_image = Image.fromarray(mask).resize((input_object['w'], input_object['h'])).convert('L')
    img = np.array(img)[:,:,::-1]
    final_image = np.array(final_image)
    image_dict = to_dict(img, final_image)
    final_image = model_2_process(model_2, image_dict)
    torch.cuda.empty_cache()
    return {"final_ouput": final_image, 'image': input_object['image'], 'save_name': input_object['save_name']}
I was thinking of maybe using torch multiprocessing; any tips?
The answer mentioning Torch DDP and DP is not exactly appropriate, since the value of those libraries is to do multi-GPU gradient descent (in particular, averaging gradients across GPUs), which does not happen at inference. In fact, well-optimized inference ideally doesn't use PyTorch or TensorFlow at all, but a prediction-only optimized runtime such as SageMaker Neo, ONNX Runtime or NVIDIA TensorRT, to reduce memory footprint and latency.
To infer a single model that fits on one GPU, multi-GPU instances are generally not advised: inference is a share-nothing task, so you can use N single-GPU instances; things are simpler and equally performant.
Inference on a multi-GPU host is useful in two cases: (1) if you do model-parallel inference (not your case), or (2) if your inference service consists of a graph of models that call each other. In that case, the proximity of the various models in the DAG can reduce latency. That seems to be your situation.
My recommendations are the following:
Try NVIDIA Triton, which supports those DAG use cases well and is supported on SageMaker: https://aws.amazon.com/fr/blogs/machine-learning/deploy-fast-and-scalable-ai-with-nvidia-triton-inference-server-in-amazon-sagemaker/
If you want to do something custom, you could try assigning the two models to different CUDA device IDs in PyTorch. Because CUDA kernels run asynchronously, this can be enough to get some parallelism and a bit of acceleration over a single GPU if your models can run in parallel (see the sketch after these recommendations).
I once saw multiprocessing used (with MXNet) to load-balance inference requests across GPUs (in an AWS blog post), but that was for a share-nothing, map-style distribution of batches of inferences. In your case there seems to be a connection between your models, so Triton is probably a better fit.
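A minimal sketch of that custom two-device idea (the model classes and inputs are placeholders, not the original code):

import torch

# Each model gets its own GPU.
model_1 = Model().eval().to("cuda:0")
model_2 = Model_2().eval().to("cuda:1")

with torch.no_grad():
    # CUDA kernels are launched asynchronously, so these two forward passes
    # can overlap when the models have no data dependency on each other.
    out_1 = model_1(input_1.to("cuda:0", non_blocking=True))
    out_2 = model_2(input_2.to("cuda:1", non_blocking=True))

    # If model_2 actually consumes model_1's output, copy it across devices first:
    # out_2 = model_2(out_1.to("cuda:1"))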
Finally, if your goal is to reduce latency, there are other ideas:
Fix any CPU bottleneck. Your code seems to have a lot of CPU work (pre-processing, NumPy...). Are you sure the GPU is the bottleneck? If the CPU is at 80%+, try a large single-GPU G5 instance, such as g5.16xlarge. They are great for computer vision inference.
Use a better GPU: if you are using a P2, P3 or G4dn instance, try G5 instead.
Optimize the code. Two things to try, depending on the bottleneck:
If you do the inference in Torch, try to avoid doing algebra with NumPy, and do as much as possible with torch tensors on the GPU.
If the GPU is the bottleneck, try replacing PyTorch with ONNX Runtime or NVIDIA TensorRT (see the sketch below).
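A minimal sketch of the ONNX Runtime route (the model class and input shape are placeholders, not taken from the original code):

import torch
import onnxruntime as ort

# Export the PyTorch model once.
model = Model().eval()
dummy = torch.randn(1, 3, 320, 320)
torch.onnx.export(model, dummy, "model.onnx", input_names=["input"], output_names=["output"])

# Run inference with ONNX Runtime, on GPU if the CUDA execution provider is available.
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
outputs = sess.run(None, {"input": dummy.numpy()})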
You must use torch.nn.DataParallel or torch.nn.parallel.DistributedDataParallel (read "Multi-GPU Examples" and "Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel").
You must construct it with at least these three parameters:
module (Module) – module to be parallelized (your model)
device_ids (list of python:int or torch.device) – CUDA devices. For single-device modules, device_ids can contain exactly one device id, which represents the only CUDA device where the input module corresponding to this process resides. Alternatively, device_ids can also be None. For multi-device modules and CPU modules, device_ids must be None. When device_ids is None for both cases, both the input data for the forward pass and the actual module must be placed on the correct device. (default: None)
output_device (int or torch.device) – Device location of output for single-device CUDA modules. For multi-device modules and CPU modules, it must be None, and the module itself dictates the output location. (default: device_ids[0] for single-device modules)
for example:
from torch.nn.parallel import DistributedDataParallel
model = DistributedDataParallel(model, device_ids=[i], output_device=i)
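Note that DistributedDataParallel also needs an initialized process group, one process per GPU. A minimal sketch of that surrounding setup (the helper names here are illustrative, not from the PyTorch example above):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def setup(rank, world_size):
    # One process per GPU; NCCL is the usual backend for CUDA tensors.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def wrap_model(model, rank):
    # Move the module to its GPU first, then wrap it.
    model = model.to(rank)
    return DistributedDataParallel(model, device_ids=[rank], output_device=rank)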
I am running a TensorFlow training job on a Linux machine with 4 cores.
When checking the CPU utilization with htop, only one core is fully utilized, whereas the others sit at only ~15%.
How can I make sure TF is using all CPUs to full capacity?
I am aware of the issue "Using multiple CPU cores in TensorFlow" - how can I make it work for TensorFlow 2?
I am using the following code to generate the samples:
class WindowGenerator():
    def make_dataset(self, data, stride=1):
        data = np.array(data, dtype=np.float32)
        ds = tf.keras.preprocessing.timeseries_dataset_from_array(
            data=data,
            targets=None,
            sequence_length=self.total_window_size,
            sequence_stride=stride,
            shuffle=False,
            batch_size=self.batch_size,)
        ds = ds.map(self.split_window)
        return ds

    @property
    def train(self):
        return self.make_dataset(self.train_df)

    @property
    def val(self):
        return self.make_dataset(self.val_df)

    @property
    def test(self):
        return self.make_dataset(self.test_df, stride=24)
I'm using the following code to run the model training. sampleMgmt is an instance of the WindowGenerator class, and early_stopping defines the training termination criteria.
history = model.fit(sampleMgmt.train, epochs=self.nrEpochs,
                    validation_data=sampleMgmt.val,
                    callbacks=[early_stopping],
                    verbose=1)
Have you tried this?
# ConfigProto/Session live under tf.compat.v1 in TensorFlow 2
config = tf.compat.v1.ConfigProto(device_count={"CPU": 8})
with tf.compat.v1.Session(config=config) as sess:
    ...
(source)
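For TensorFlow 2 specifically, one way to control CPU threading without sessions (a sketch; the thread counts are illustrative) is the tf.config.threading API, set before any ops run:

import tensorflow as tf

# Threads used inside a single op (e.g. a matmul) and across independent ops.
tf.config.threading.set_intra_op_parallelism_threads(4)
tf.config.threading.set_inter_op_parallelism_threads(4)

# If the input pipeline is the bottleneck, the map step can also be parallelized:
# ds = ds.map(self.split_window, num_parallel_calls=tf.data.AUTOTUNE)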
In the end, I did two things outside my code to make TF use all available CPUs.
I used a SSD instead of a HDD
I used the Tensorflow Enterprise Image, provided in Google Cloud, instead of the Deep Learning Image
I am trying to create a BERT model for classifying Turkish-language text. Here is my code:
import pandas as pd
import torch

df = pd.read_excel(r'preparedDataNoId.xlsx')
df = df.sample(frac=1)

from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.10)
print('train shape: ', train_df.shape)
print('test shape: ', test_df.shape)

from simpletransformers.classification import ClassificationModel

# define hyperparameter
train_args = {"reprocess_input_data": True,
              "fp16": False,
              "num_train_epochs": 4}

# Create a ClassificationModel
model = ClassificationModel(
    "bert", "dbmdz/bert-base-turkish-cased",
    num_labels=4,
    args=train_args
)
I am using Anaconda and Spyder. I think everything is correct, but when I run this I get the following error:
'use_cuda' set to True when cuda is unavailable. Make sure CUDA is available or set use_cuda=False.
How exactly can I fix this?
I ran into the same problem. If you have CUDA available, then set both use_cuda and fp16 to True. If not, then set both to False.
CUDA is a parallel computing platform and programming model developed by Nvidia for general computing on its own GPUs.
If your computer does not have a GPU, this error will be thrown.
Don't forget to include this parameter:
use_cuda=False
This will not affect your results; it will just take a few more seconds than usual to process.
model = ClassificationModel(
    "bert", "dbmdz/bert-base-turkish-cased",
    num_labels=4,
    args=train_args,
    use_cuda=False
)
Adding use_cuda=False will help if a GPU is not available.
If a GPU is unavailable on your computer, make sure to check CUDA or try use_cuda=False in the args of your model. This error is thrown because CUDA does not exist on your computer.
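If you want the same script to run on machines with and without a GPU, a small sketch (the conditional is my addition, not part of the original code):

import torch
from simpletransformers.classification import ClassificationModel

cuda_available = torch.cuda.is_available()

model = ClassificationModel(
    "bert", "dbmdz/bert-base-turkish-cased",
    num_labels=4,
    args={"reprocess_input_data": True,
          "fp16": cuda_available,      # fp16 only makes sense on GPU
          "num_train_epochs": 4},
    use_cuda=cuda_available            # fall back to CPU when no CUDA device is present
)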
Thanks mostly to this question, I have been able to solve the problem of TensorFlow allocating memory I didn't want allocated. However, I have recently found that, despite using set_session with allow_growth=True, calling model.fit still allocates all the memory, and I can no longer use it for the rest of my program, even after the function has exited and the model (a local variable) should no longer hold any allocated memory.
Here is some example code demonstrating this:
from numpy import array
from keras import Input, Model
from keras.layers import Conv2D, Dense, Flatten
from keras.optimizers import SGD

# stops keras/tensorflow from allocating all the GPU's memory immediately
from tensorflow.compat.v1.keras.backend import set_session
from tensorflow.compat.v1 import Session, ConfigProto, GPUOptions
tf_config = ConfigProto(gpu_options=GPUOptions(allow_growth=True))
session = Session(config=tf_config)
set_session(session)

# makes the neural network
def make_net():
    input = Input((2, 3, 3))
    conv = Conv2D(256, (1, 1))(input)
    flattened_input = Flatten()(conv)
    output = Dense(1)(flattened_input)
    model = Model(inputs=input, outputs=output)
    sgd = SGD(0.2, 0.9)
    model.compile(sgd, 'mean_squared_error')
    model.summary()
    return model

def make_data(input_data, target_output):
    input_data.append([[[0 for i in range(3)] for j in range(3)] for k in range(2)])
    target_output.append(0)

def main():
    data_amount = 4096
    input_data = []
    target_output = []
    model = make_net()
    for i in range(data_amount):
        make_data(input_data, target_output)
    model.fit(array(input_data), array(target_output), batch_size=len(input_data))
    return

while True:
    main()
When I run this code with the Pycharm debugger, I find that the GPU RAM used stays at around 0.1GB until I run model.fit for the first time, at which point the memory usage shoots up to 3.2GB of my 4GB of GPU RAM. I have also noted that the memory usage doesn't increase after the first time that model.fit is run and that if I remove the convolutional layer from my network, the memory increase doesn't happen at all.
Could someone please shed some light on my problem?
UPDATE: Setting per_process_gpu_memory_fraction in GPUOptions to 0.1 helps limit the effect in the code included, but not in my actual program. A better solution would still be helpful.
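For reference, the per_process_gpu_memory_fraction setting mentioned in the update looks roughly like this (a sketch using the same compat.v1 imports as the example above):

from tensorflow.compat.v1 import Session, ConfigProto, GPUOptions
from tensorflow.compat.v1.keras.backend import set_session

# Cap this process at ~10% of the GPU's memory instead of letting it grow to the full card.
gpu_options = GPUOptions(per_process_gpu_memory_fraction=0.1)
set_session(Session(config=ConfigProto(gpu_options=gpu_options)))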
I used to face this problem. I found a solution from someone I can no longer track down; I paste it below. In fact, I found that if you set allow_growth=True, TensorFlow still seems to use all your memory, so you should just set a hard limit.
Try this:
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    # Restrict TensorFlow to only use the first GPU
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, False)
            tf.config.experimental.set_virtual_device_configuration(
                gpu,
                [
                    tf.config.experimental.VirtualDeviceConfiguration(
                        memory_limit=12288  # set your limit
                    )
                ],
            )
        tf.config.experimental.set_visible_devices(gpus[0], "GPU")
        logical_gpus = tf.config.experimental.list_logical_devices("GPU")
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        print(e)
Training with SGD and the whole training data in one batch can (depending on your input data) consume a lot of memory.
Try lowering your batch_size (e.g. 8, 16, 32).
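Applied to the example above, that would look like this (sketch):

# Feed the data in small batches instead of one giant batch of 4096 samples.
model.fit(array(input_data), array(target_output), batch_size=32)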
The Migration guide recommends the following to make code CPU/GPU agnostic:
# at beginning of the script
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
...
# then whenever you get a new Tensor or Module
# this won't copy if they are already on the desired device
input = data.to(device)
model = MyModule(...).to(device)
I did this and ran my code on a CPU-only device, but my model crashed when fed an input array, saying it was expecting a CPU tensor, not a GPU one. Somehow my model was automatically converting the CPU input array to a GPU array. Finally I traced it down to this command in my code:
model = torch.nn.DataParallel(model).to(device)
Even though I convert the model to 'cpu', nn.DataParallel overrides this. The best solution I came up with was a conditional:
if device.type == 'cpu':
    model = model.to(device)
else:
    model = torch.nn.DataParallel(model).to(device)
This does not seem elegant. Is there a better way?
How about
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)
model = model.to(device)
?
You don't need DataParallel if you have only one GPU.