I am running a TensorFlow training job on a Linux machine with 4 cores.
When checking the CPU utilization with htop, only one core is fully utilized, whereas the others sit at only ~15% (the image below shows a screenshot of htop).
How can I make sure TF is using all CPUs to full capacity?
I am aware of the question "Using multiple CPU cores in TensorFlow" - how do I make it work for TensorFlow 2?
I am using the following code to generate the samples:
class WindowGenerator():
    def make_dataset(self, data, stride=1):
        data = np.array(data, dtype=np.float32)
        ds = tf.keras.preprocessing.timeseries_dataset_from_array(
            data=data,
            targets=None,
            sequence_length=self.total_window_size,
            sequence_stride=stride,
            shuffle=False,
            batch_size=self.batch_size,)
        ds = ds.map(self.split_window)
        return ds

    @property
    def train(self):
        return self.make_dataset(self.train_df)

    @property
    def val(self):
        return self.make_dataset(self.val_df)

    @property
    def test(self):
        return self.make_dataset(self.test_df, stride=24)
I'm using the following code to run the model training. sampleMgmt is an instance of the WindowGenerator class, and early_stopping defines the training termination criteria.
history = model.fit(sampleMgmt.train, epochs=self.nrEpochs,
                    validation_data=sampleMgmt.val,
                    callbacks=[early_stopping],
                    verbose=1)
Have you tried this?
config = tf.ConfigProto(device_count={"CPU": 8})
with tf.Session(config=config) as sess:
(source)
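For TensorFlow 2, where ConfigProto and Session no longer exist, the closest equivalent appears to be the tf.config.threading API; a minimal sketch (these calls must happen before TensorFlow executes any operation):
import tensorflow as tf

# TF2 equivalent of the ConfigProto approach: tell TF how many threads it may use.
tf.config.threading.set_intra_op_parallelism_threads(4)  # threads within a single op (e.g. a matmul)
tf.config.threading.set_inter_op_parallelism_threads(4)  # threads across independent ops
If the input pipeline itself is the bottleneck, passing num_parallel_calls to the map call (e.g. ds.map(self.split_window, num_parallel_calls=tf.data.AUTOTUNE) on recent TF 2.x versions) also spreads the preprocessing across cores.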
In the end, I did two things outside my code to make TF use all CPUs available.
I used an SSD instead of an HDD
I used the Tensorflow Enterprise Image, provided in Google Cloud, instead of the Deep Learning Image
Related
I have deployed a model for real-time inference on a single-GPU instance, and it works fine.
Now I want to use multiple GPUs to decrease the inference time. What do I need to change in my inference.py to make it work?
Here is some of my code:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
def model_fn(model_dir):
logger.info("Loading first model...")
model = Model().to(DEVICE)
with open(os.path.join(model_dir, "checkpoint.pth"), "rb") as f:
model.load_state_dict(torch.load(f, map_location=DEVICE)['state_dict'])
model = model.eval()
logger.info("Loading second model...")
model_2 = Model_2()
model_2.to(DEVICE)
checkpoint = torch.load('checkpoint_2.pth', map_location=DEVICE)
model_2(remove_prefix_state_dict(checkpoint['state_dict']), strict=True)
model_2 = model_2()
logger.info('Done loading models')
return {'first_model': model, 'second_model': model_2}
def input_fn(request_body, request_content_type):
    assert request_content_type == 'application/json'
    url = json.loads(request_body)['url']
    save_name = json.loads(request_body)['save_name']
    logger.info(f'Image url: {url}')
    img = Image.open(requests.get(url, stream=True).raw).convert('RGB')
    w, h = img.size
    input_tensor = preprocess(img)
    input_batch = input_tensor.unsqueeze(0).to(DEVICE)
    logger.info('Image ready to predict!')
    return {'tensor': input_batch, 'w': w, 'h': h, 'image': img, 'save_name': save_name}
def predict_fn(input_object, model):
    data = input_object['tensor']
    logger.info('Generating prediction based on the input image')
    model_1 = model['first_model']
    model_2 = model['second_model']

    d0, d1, d2, d3, d4, d5, d6 = model_1(data)
    torch.cuda.empty_cache()
    mask = torch.argmax(d0[0], axis=0).cpu().numpy()
    mask = np.where(mask == 2, 255, mask)
    mask = np.where(mask == 1, 128, mask)

    img = input_object['image']
    final_image = Image.fromarray(mask).resize((input_object['w'], input_object['h'])).convert('L')
    img = np.array(img)[:, :, ::-1]
    final_image = np.array(final_image)
    image_dict = to_dict(img, final_image)

    final_image = model_2_process(model_2, image_dict)
    torch.cuda.empty_cache()

    return {"final_ouput": final_image, 'image': input_object['image'], 'save_name': input_object['save_name']}
I was thinking that maybe torch multiprocessing could help. Any tips?
Two points first:
1. To infer a single model that fits on one GPU, multi-GPU instances are generally not advised: inference is a share-nothing task, so you can use N single-GPU instances and things are simpler and equally performant.
2. Inference on a multi-GPU host is useful in two cases: (a) if you do model-parallel inference (not your case), or (b) if your service's inference consists of a graph of models that call each other. In that case, the proximity of the various models in the DAG can reduce latency. That seems to be your situation.
The answer mentioning Torch DDP and DP is not exactly appropriate, since the value of those libraries is to conduct multi-GPU gradient descent (averaging the gradients across GPUs in particular), which, as mentioned in point 1, does not happen at inference. Actually, a well-done, optimized inference ideally doesn't even use PyTorch or TensorFlow at all, but instead a prediction-only optimized runtime such as SageMaker Neo, ONNX Runtime or NVIDIA TensorRT, to reduce memory footprint and latency.
My recommendations are the following:
Try using NVIDIA Triton, which supports those DAG use cases well and is supported on SageMaker. https://aws.amazon.com/fr/blogs/machine-learning/deploy-fast-and-scalable-ai-with-nvidia-triton-inference-server-in-amazon-sagemaker/
If you want to do things custom, you could try assigning the 2 models to different CUDA device ids in PyTorch. Because CUDA kernels run asynchronously, this could be enough to get some parallelism and a bit of acceleration vs. 1 GPU if your models can run in parallel (a rough sketch follows this list).
I saw multiprocessing used once (with MXNet) to load-balance inference requests across GPUs (in this AWS blog post), but it was for share-nothing, map-style distribution of batches of inferences. In your case you seem to have a connection between your models, so Triton is probably a better fit.
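As a rough illustration of the custom option above (a toy sketch with placeholder models, not the poster's actual pipeline), pinning each model to its own CUDA device could look like this; it assumes two visible GPUs:
import torch
import torch.nn as nn

# Toy stand-ins for the two models, pinned to different GPUs.
model_a = nn.Linear(512, 512).to("cuda:0").eval()
model_b = nn.Linear(512, 512).to("cuda:1").eval()

x = torch.randn(8, 512)
with torch.no_grad():
    out_a = model_a(x.to("cuda:0"))   # kernel launched on GPU 0; the call returns immediately
    out_b = model_b(x.to("cuda:1"))   # kernel launched on GPU 1 while GPU 0 may still be busy
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")      # wait for both devices before using the results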
Finally, if your goal is to reduce latency, there are other ideas:
Fix any CPU bottleneck. Your code seems to have a lot of CPU work (pre-processing, numpy...). Are you sure the GPU is the bottleneck? If the CPU is at 80%+, try a large single-GPU G5 instance, such as g5.16xlarge; they are great for computer vision inference.
Use a better GPU. If you are using a P2, P3 or G4dn instance, try G5 instead.
Optimize the code. Two things to try, depending on the bottleneck:
If you do the inference in Torch, try to avoid doing algebra with NumPy and do as much as possible with torch tensors on the GPU.
If the GPU is the bottleneck, try to replace PyTorch with ONNX Runtime or NVIDIA TensorRT (a rough export sketch follows).
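As an illustration of that last point (a toy sketch with a placeholder model, not the poster's networks), exporting a PyTorch module to ONNX and serving it with ONNX Runtime could look roughly like this; CUDAExecutionProvider assumes the onnxruntime-gpu package is installed:
import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

# Export a toy model to ONNX once, then serve it with ONNX Runtime on the GPU.
model = nn.Linear(512, 512).eval()
dummy = torch.randn(1, 512)
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["output"])

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
outputs = session.run(None, {"input": np.random.randn(1, 512).astype(np.float32)})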
You must use torch.nn.DataParallel or torch.nn.parallel.DistributedDataParallel (read "Multi-GPU Examples" and "Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel").
You must call the function by passing at least these three parameters:
module (Module) – module to be parallelized (your model)
device_ids (list of python:int or torch.device) – CUDA devices. For single-device modules, device_ids can contain exactly one device id, which represents the only CUDA device where the input module corresponding to this process resides. Alternatively, device_ids can also be None. For multi-device modules and CPU modules, device_ids must be None. When device_ids is None for both cases, both the input data for the forward pass and the actual module must be placed on the correct device. (default: None)
output_device (int or torch.device) – Device location of output for single-device CUDA modules. For multi-device modules and CPU modules, it must be None, and the module itself dictates the output location. (default: device_ids[0] for single-device modules)
for example:
from torch.nn.parallel import DistributedDataParallel
model = DistributedDataParallel(model, device_ids=[i], output_device=i)
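Note that DistributedDataParallel only works inside an initialized process group. A minimal single-node sketch (assuming one process per GPU launched with torchrun, and MyModel as a placeholder for your own network) could look like this:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Launched e.g. with: torchrun --nproc_per_node=2 script.py
dist.init_process_group(backend="nccl")          # reads the env vars that torchrun sets
local_rank = int(os.environ["LOCAL_RANK"])       # also set by torchrun
torch.cuda.set_device(local_rank)

model = MyModel().to(local_rank)                 # MyModel is a placeholder for your network
model = DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)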
I have an NLP model trained with PyTorch that is to be run on a Jetson Xavier. I installed jetson-stats to monitor CPU and GPU usage. When I run the Python script, only the CPU cores are under load; the GPU bar does not increase. I searched Google with keywords like "How to check if pytorch is using the GPU?" and checked the results on stackoverflow.com etc. According to the advice given to others facing a similar issue, CUDA is available and there is a CUDA device on my Jetson Xavier. However, I don't understand why the GPU bar does not change while the CPU core bars max out.
I don't want to use the CPU; it takes too long to compute. In my opinion it is using the CPU, not the GPU. How can I be sure, and if it is using the CPU, how can I change it to the GPU?
Note: The model is taken from the huggingface transformers library. I have tried using the cuda() method on the model (model.cuda()). In this scenario the GPU is used, but I cannot get an output from the model; it raises an exception.
Here is the code:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
import torch

BERT_DIR = "savasy/bert-base-turkish-squad"

tokenizer = AutoTokenizer.from_pretrained(BERT_DIR)
model = AutoModelForQuestionAnswering.from_pretrained(BERT_DIR)
nlp = pipeline("question-answering", model=model, tokenizer=tokenizer)

def infer(question, corpus):
    try:
        ans = nlp(question=question, context=corpus)
        return ans["answer"], ans["score"]
    except:
        ans = None
        pass
    return None, 0
The problem was solved by loading the pipeline with the device parameter:
nlp = pipeline("question-answering", model=BERT_DIR, device=0)
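To double-check that the pipeline really ended up on the GPU (a quick sanity check against the nlp object above), you can inspect the device of the model's parameters:
import torch

# Both checks should point at the GPU on a working Jetson CUDA setup.
print(torch.cuda.is_available())             # expected: True
print(next(nlp.model.parameters()).device)   # expected: cuda:0, not cpu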
For the model to work on the GPU, both the data and the model have to be loaded onto the GPU.
You can do this as follows:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
import torch

BERT_DIR = "savasy/bert-base-turkish-squad"
device = torch.device("cuda")

tokenizer = AutoTokenizer.from_pretrained(BERT_DIR)
model = AutoModelForQuestionAnswering.from_pretrained(BERT_DIR)
model.to(device)  ## model to GPU
nlp = pipeline("question-answering", model=model, tokenizer=tokenizer, device=0)  ## run the pipeline on GPU 0

def infer(question, corpus):
    try:
        # question and corpus are plain strings; the pipeline tokenizes them and
        # moves the resulting tensors to the model's device.
        ans = nlp(question=question, context=corpus)
        return ans["answer"], ans["score"]
    except:
        ans = None
        pass
    return None, 0
I am using the high-level Estimator on TF:
estim = tf.contrib.learn.Estimator(...)
estim.fit(some_input)
If some_input has x, y, and batch_size, the code runs, but with a warning; so I tried to use input_fn and managed to send x and y through this input_fn, but not the batch_size. I didn't find any example for this.
Could anyone share a simple example that uses input_fn as input to the estim.fit / estim.evaluate, and uses batch_size as well?
Do I have to use tf.train.batch? If so, how does it merge into the higher-level implementation (tf.layers), given that I don't manage the tf.Graph() or session myself?
Below is the warning I got:
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py:657: calling evaluate
(from tensorflow.contrib.learn.python.learn.estimators.estimator) with y is deprecated and will be removed after 2016-12-01.
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
est = Estimator(...) -> est = SKCompat(Estimator(...))
The link provided in Roi's own comment was indeed really helpful. Since I was struggling with the same question as well for a while, I would like to summarize the answer provided by the link above as a reference:
def batched_input_fn(dataset_x, dataset_y, batch_size):
    def _input_fn():
        all_x = tf.constant(dataset_x, shape=dataset_x.shape, dtype=tf.float32)
        all_y = tf.constant(dataset_y, shape=dataset_y.shape, dtype=tf.float32)
        sliced_input = tf.train.slice_input_producer([all_x, all_y])
        return tf.train.batch(sliced_input, batch_size=batch_size)
    return _input_fn
This can then be used like this example (using TensorFlow v1.1):
model = CustomModel(FLAGS.learning_rate)
estimator = tf.estimator.Estimator(model_fn=model.build(), params=model.params())

estimator.train(input_fn=batched_input_fn(
                    train.features,
                    train.labels,
                    FLAGS.batch_size),
                steps=FLAGS.train_steps)
Unfortunately, this approach is about 10x slower than manual feeding (using TensorFlow's low-level API) or than using the whole dataset with train.shape[0] == batch_size and not using tf.train.slice_input_producer() and tf.train.batch() at all. At least on my machine (CPU only). I'm really wondering why this approach is so slow. Any ideas?
Edit:
I could speed it up a bit by using num_threads > 1 as a parameter for tf.train.batch(). On a VM with 2 CPUs, I'm able to double the performance with this batching mechanism compared to the default num_threads=1. But it is still 5x slower than manual feeding.
Results might be different on a native system or a system that uses all CPU cores for the input pipeline and the GPU for the model computation. It would be great if somebody could post their results in the comments.
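(Not part of the original answer, but for reference:) later TF 1.x releases replace the queue-based pipeline with tf.data, which usually removes exactly this kind of input bottleneck. A rough tf.data equivalent of batched_input_fn, assuming dataset_x and dataset_y are NumPy arrays, might look like:
import tensorflow as tf

def batched_input_fn(dataset_x, dataset_y, batch_size):
    # tf.data variant: no slice_input_producer, queue runners, or coordinators needed.
    def _input_fn():
        ds = tf.data.Dataset.from_tensor_slices(
            (dataset_x.astype('float32'), dataset_y.astype('float32')))
        ds = ds.shuffle(buffer_size=len(dataset_x)).repeat().batch(batch_size)
        return ds.make_one_shot_iterator().get_next()
    return _input_fn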
I'm using Keras with TensorFlow as the backend.
I have one compiled/trained model.
My prediction loop is slow so I would like to find a way to parallelize the predict_proba calls to speed things up.
I would like to take a list of batches (of data) and then per available gpu, run model.predict_proba() over a subset of those batches.
Essentially:
data = [ batch_0, batch_1, ... , batch_N ]
on gpu_0 => return predict_proba(batch_0)
on gpu_1 => return predict_proba(batch_1)
...
on gpu_N => return predict_proba(batch_N)
I know that it's possible in pure Tensorflow to assign ops to a given gpu (https://www.tensorflow.org/tutorials/using_gpu). However, I don't know how this translates to my situation since I've built/compiled/trained my model using Keras' api.
I had thought that maybe I just needed to use python's multiprocessing module and start a process per gpu that would run predict_proba(batch_n). I know this is theoretically possible given another SO post of mine: Keras + Tensorflow and Multiprocessing in Python. However, this still leaves me with the dilemma of not knowing how to actually "choose" a gpu to operate the process on.
My question boils down to: how does one parallelize prediction for one model in Keras across multiple gpus when using Tensorflow as Keras' backend?
Additionally I am curious if similar parallelization for prediction is possible with only one gpu.
A high level description or code example would be greatly appreciated!
Thanks!
I created a simple example to show how to run a Keras model across multiple GPUs. Basically, multiple processes are created, and each process owns one GPU. To specify the GPU id in a process, setting the environment variable CUDA_VISIBLE_DEVICES is a very straightforward way (os.environ["CUDA_VISIBLE_DEVICES"]). I hope this git repo can help you.
https://github.com/yuanyuanli85/Keras-Multiple-Process-Prediction
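The core idea, sketched very roughly below (the model.h5 path and the .npy file names are placeholders, not taken from the linked repo): each worker process pins itself to one GPU via CUDA_VISIBLE_DEVICES before importing Keras, loads the model, and predicts its own share of the batches. The parent process must not import or initialize TensorFlow itself.
import os
from multiprocessing import Process

def predict_on_gpu(gpu_id, batch_file, out_file):
    # Restrict this process to a single GPU; must happen before Keras/TF is imported.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    import numpy as np
    from keras.models import load_model

    model = load_model("model.h5")       # placeholder path to the trained model
    batches = np.load(batch_file)        # placeholder: this worker's slice of the data
    np.save(out_file, model.predict(batches))

if __name__ == "__main__":
    jobs = [Process(target=predict_on_gpu, args=(i, "batches_%d.npy" % i, "preds_%d.npy" % i))
            for i in range(2)]           # one process per GPU
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()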
You can use this function to parallelize a Keras model (credits to kuza55).
https://github.com/kuza55/keras-extras/blob/master/utils/multi_gpu.py
from keras.layers import merge
from keras.layers.core import Lambda
from keras.models import Model
import tensorflow as tf

def make_parallel(model, gpu_count):
    def get_slice(data, idx, parts):
        shape = tf.shape(data)
        size = tf.concat([shape[:1] // parts, shape[1:]], axis=0)
        stride = tf.concat([shape[:1] // parts, shape[1:] * 0], axis=0)
        start = stride * idx
        return tf.slice(data, start, size)

    outputs_all = []
    for i in range(len(model.outputs)):
        outputs_all.append([])

    # Place a copy of the model on each GPU, each getting a slice of the batch
    for i in range(gpu_count):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('tower_%d' % i) as scope:

                inputs = []
                # Slice each input into a piece for processing on this GPU
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_n = Lambda(get_slice, output_shape=input_shape,
                                     arguments={'idx': i, 'parts': gpu_count})(x)
                    inputs.append(slice_n)

                outputs = model(inputs)

                if not isinstance(outputs, list):
                    outputs = [outputs]

                # Save all the outputs for merging back together later
                for l in range(len(outputs)):
                    outputs_all[l].append(outputs[l])

    # merge outputs on CPU
    with tf.device('/cpu:0'):
        merged = []
        for outputs in outputs_all:
            merged.append(merge(outputs, mode='concat', concat_axis=0))

        return Model(input=model.inputs, output=merged)
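A minimal usage sketch (assuming two GPUs and a batch size divisible by gpu_count, since get_slice splits each batch evenly; predict is used here because the wrapped functional Model does not expose predict_proba):
# Wrap the already-trained model so each batch is split across the GPUs at predict time.
parallel_model = make_parallel(model, gpu_count=2)

# data is a single NumPy array here; Keras handles the batching.
probs = parallel_model.predict(data, batch_size=256)   # each GPU sees 128 samples per batch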
(I'm using TensorFlow 1.0 and Python 2.7)
I'm having trouble getting an Estimator to work with queues. Indeed, if I use the deprecated SKCompat interface with custom data files and a given batch size, the model trains properly. I'm trying to use the new interface with an input_fn that batches features out of TFRecord files (equivalent to my custom data files). The script runs properly, but the loss value doesn't change after 200 or 300 steps. It seems that the model is looping over a small input batch (this would explain why the loss converges so fast).
I have a 'run.py' script that looks like the following:
import tensorflow as tf
from tensorflow.contrib import learn, metrics

# [...]

evalMetrics = {'accuracy': learn.MetricSpec(metric_fn=metrics.streaming_accuracy)}

runConfig = learn.RunConfig(save_summary_steps=10)
estimator = learn.Estimator(model_fn=myModel,
                            params=myParams,
                            model_dir='/tmp/myDir',
                            config=runConfig)

session = tf.Session(graph=tf.get_default_graph())

with session.as_default():
    tf.global_variables_initializer()

    coordinator = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=session, coord=coordinator)

    estimator.fit(input_fn=lambda: inputToModel(trainingFileList), steps=10000)
    estimator.evaluate(input_fn=lambda: inputToModel(evalFileList), steps=10000, metrics=evalMetrics)

    coordinator.request_stop()
    coordinator.join(threads)

session.close()
My inputToModel function looks like this:
import tensorflow as tf

def inputToModel(fileList):
    features = {'rawData': tf.FixedLenFeature([100], tf.float32),
                'label': tf.FixedLenFeature([], tf.int64)}
    tensorDict = tf.contrib.learn.read_batch_record_features(fileList,
                                                             batch_size=100,
                                                             features=features,
                                                             randomize_input=True,
                                                             reader_num_threads=4,
                                                             num_epochs=1,
                                                             name='inputPipeline')
    tf.local_variables_initializer()
    data = tensorDict['rawData']
    labelTensor = tensorDict['label']
    inputTensor = tf.reshape(data, [-1, 10, 10, 1])
    return inputTensor, labelTensor
Any help or suggestions are welcome!
Try to use: tf.global_variables_initializer().run()
I want to do a similar thing, but I do not know how to use the Estimator API with multi-threading. There is an Experiment class for serving too - it might be useful.
Delete the lines session = tf.Session(graph=tf.get_default_graph()) and session.close(), and try:
with tf.Session() as sess:
    tf.global_variables_initializer().run()
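Putting the two suggestions together, the session handling in run.py would look roughly like this (a sketch of the proposed change, not a guaranteed fix for the constant loss):
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    tf.local_variables_initializer().run()   # num_epochs in read_batch_record_features creates a local epoch counter

    coordinator = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coordinator)

    estimator.fit(input_fn=lambda: inputToModel(trainingFileList), steps=10000)
    estimator.evaluate(input_fn=lambda: inputToModel(evalFileList), steps=10000, metrics=evalMetrics)

    coordinator.request_stop()
    coordinator.join(threads)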