Why is TensorFlow 2 much slower than TensorFlow 1?

Why is TensorFlow 2 much slower than TensorFlow 1? - python

It's been cited by many users as the reason for switching to Pytorch, but I've yet to find a justification/explanation for sacrificing the most important practical quality, speed, for eager execution.
Below is code benchmarking performance, TF1 vs. TF2 - with TF1 running anywhere from 47% to 276% faster.
My question is: what is it, at the graph or hardware level, that yields such a significant slowdown?
Looking for a detailed answer - am already familiar with broad concepts. Relevant Git
Specs: CUDA 10.0.130, cuDNN 7.4.2, Python 3.7.4, Windows 10, GTX 1070
Benchmark results:
UPDATE: Disabling Eager Execution per below code does not help. The behavior, however, is inconsistent: sometimes running in graph mode helps considerably, other times it runs slower relative to Eager.
Benchmark code:
# use tensorflow.keras... to benchmark tf.keras; used GPU for all above benchmarks
from keras.layers import Input, Dense, LSTM, Bidirectional, Conv1D
from keras.layers import Flatten, Dropout
from keras.models import Model
from keras.optimizers import Adam
import keras.backend as K
import numpy as np
from time import time
batch_shape = (32, 400, 16)
X, y = make_data(batch_shape)
model_small = make_small_model(batch_shape)
model_small.train_on_batch(X, y) # skip first iteration which builds graph
timeit(model_small.train_on_batch, 200, X, y)
K.clear_session() # in my testing, kernel was restarted instead
model_medium = make_medium_model(batch_shape)
model_medium.train_on_batch(X, y) # skip first iteration which builds graph
timeit(model_medium.train_on_batch, 10, X, y)
Functions used:
def timeit(func, iterations, *args):
t0 = time()
for _ in range(iterations):
func(*args)
print("Time/iter: %.4f sec" % ((time() - t0) / iterations))
def make_small_model(batch_shape):
ipt = Input(batch_shape=batch_shape)
x = Conv1D(128, 400, strides=4, padding='same')(ipt)
x = Flatten()(x)
x = Dropout(0.5)(x)
x = Dense(64, activation='relu')(x)
out = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)
model.compile(Adam(lr=1e-4), 'binary_crossentropy')
return model
def make_medium_model(batch_shape):
ipt = Input(batch_shape=batch_shape)
x = Bidirectional(LSTM(512, activation='relu', return_sequences=True))(ipt)
x = LSTM(512, activation='relu', return_sequences=True)(x)
x = Conv1D(128, 400, strides=4, padding='same')(x)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(128, activation='relu')(x)
x = Dense(64, activation='relu')(x)
out = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)
model.compile(Adam(lr=1e-4), 'binary_crossentropy')
return model
def make_data(batch_shape):
return np.random.randn(*batch_shape), np.random.randint(0, 2, (batch_shape[0], 1))

UPDATE 8/1730/2020: TF 2.3 has finally done it: all cases run as fast, or notably faster, than any previous version.
Further, my previous update was unfair to TF; my GPU was to blame, has been overheating lately. If you see a rising stem plot of iteration times, it's a reliable symptom. Lastly, see a dev's note on Eager vs Graph.
This might be my last update on this answer. The true stats on your model's speed can only be found by you, on your device.
UPDATE 5/19/2020: TF 2.2, using same tests: only a minor improvement in Eager speed. Plots for Large-Large Numpy train_on_batch case below, x-axis is successive fit iterations; my GPU isn't near its full capacity, so doubt it's throttling, but iterations do get slower over time.
Per above, Graph and Eager are 1.56x and 1.97x slower than their TF1 counterparts, respectively. Unsure I'll debug this further, as I'm considering switching to Pytorch per TensorFlow's poor support for custom / low-level functionality. I did, however, open an Issue to get devs' feedback.
UPDATE 2/18/2020: I've benched 2.1 and 2.1-nightly; the results are mixed. All but one configs (model & data size) are as fast as or much faster than the best of TF2 & TF1. The one that's slower, and slower dramatically, is Large-Large - esp. in Graph execution (1.6x to 2.5x slower).
Furthermore, there are extreme reproducibility differences between Graph and Eager for a large model I tested - one not explainable via randomness/compute-parallelism. I can't currently present reproducible code for these claims per time constraints, so instead I strongly recommend testing this for your own models.
Haven't opened a Git issue on these yet, but I did comment on the original - no response yet. I'll update the answer(s) once progress is made.
VERDICT: it isn't, IF you know what you're doing. But if you don't, it could cost you, lots - by a few GPU upgrades on average, and by multiple GPUs worst-case.
THIS ANSWER: aims to provide a high-level description of the issue, as well as guidelines for how to decide on the training configuration specific to your needs. For a detailed, low-level description, which includes all benchmarking results + code used, see my other answer.
I'll be updating my answer(s) w/ more info if I learn any - can bookmark / "star" this question for reference.
ISSUE SUMMARY: as confirmed by a TensorFlow developer, Q. Scott Zhu, TF2 focused development on Eager execution & tight integration w/ Keras, which involved sweeping changes in TF source - including at graph-level. Benefits: greatly expanded processing, distribution, debug, and deployment capabilities. The cost of some of these, however, is speed.
The matter, however, is fairly more complex. It isn't just TF1 vs. TF2 - factors yielding significant differences in train speed include:
TF2 vs. TF1
Eager vs. Graph mode
keras vs. tf.keras
numpy vs. tf.data.Dataset vs. ...
train_on_batch() vs. fit()
GPU vs. CPU
model(x) vs. model.predict(x) vs. ...
Unfortunately, almost none of the above are independent of the other, and each can at least double execution time relative to another. Fortunately, you can determine what'll work best systematically, and with a few shortcuts - as I'll be showing.
WHAT SHOULD I DO? Currently, the only way is - experiment for your specific model, data, and hardware. No single configuration will always work best - but there are do's and don't's to simplify your search:
>> DO:
train_on_batch() + numpy + tf.keras + TF1 + Eager/Graph
train_on_batch() + numpy + tf.keras + TF2 + Graph
fit() + numpy + tf.keras + TF1/TF2 + Graph + large model & data
>> DON'T:
fit() + numpy + keras for small & medium models and data
fit() + numpy + tf.keras + TF1/TF2 + Eager
train_on_batch() + numpy + keras + TF1 + Eager
[Major] tf.python.keras; it can run 10-100x slower, and w/ plenty of bugs; more info
This includes layers, models, optimizers, & related "out-of-box" usage imports; ops, utils, & related 'private' imports are fine - but to be sure, check for alts, & whether they're used in tf.keras
Refer to code at bottom of my other answer for an example benchmarking setup. The list above is based mainly on the "BENCHMARKS" tables in the other answer.
LIMITATIONS of the above DO's & DON'T's:
This question's titled "Why is TF2 much slower than TF1?", and while its body concerns training explicitly, the matter isn't limited to it; inference, too, is subject to major speed differences, even within the same TF version, import, data format, etc. - see this answer.
RNNs are likely to notably change the data grid in the other answer, as they've been improved in TF2
Models primarily used Conv1D and Dense - no RNNs, sparse data/targets, 4/5D inputs, & other configs
Input data limited to numpy and tf.data.Dataset, while many other formats exist; see other answer
GPU was used; results will differ on a CPU. In fact, when I asked the question, my CUDA wasn't properly configured, and some of the results were CPU-based.
Why did TF2 sacrifice the most practical quality, speed, for eager execution? It hasn't, clearly - graph is still available. But if the question is "why eager at all":
Superior debugging: you've likely come across multitudes of questions asking "how do I get intermediate layer outputs" or "how do I inspect weights"; with eager, it's (almost) as simple as .__dict__. Graph, in contrast, requires familiarity with special backend functions - greatly complicating the entire process of debugging & introspection.
Faster prototyping: per ideas similar to above; faster understanding = more time left for actual DL.
HOW TO ENABLE/DISABLE EAGER?
tf.enable_eager_execution() # TF1; must be done before any model/tensor creation
tf.compat.v1.disable_eager_execution() # TF2; above holds
Misleading in TF2; see here.
ADDITIONAL INFO:
Careful with _on_batch() methods in TF2; according to the TF dev, they still use a slower implementation, but not intentionally - i.e. it's to be fixed. See other answer for details.
REQUESTS TO TENSORFLOW DEVS:
Please fix train_on_batch(), and the performance aspect of calling fit() iteratively; custom train loops are important to many, especially to me.
Add documentation / docstring mention of these performance differences for users' knowledge.
Improve general execution speed to keep peeps from hopping to Pytorch.
ACKNOWLEDGEMENTS: Thanks to
Q. Scott Zhu, TensorFlow developer, for his detailed clarification on the matter.
P. Andrey for sharing useful testing, and discussion.
UPDATES:
11/14/19 - found a model (in my real application) that that runs slower on TF2 for all* configurations w/ Numpy input data. Differences ranged 13-19%, averaging 17%. Differences between keras and tf.keras, however, were more dramatic: 18-40%, avg. 32% (both TF1 & 2). (* - except Eager, for which TF2 OOM'd)
11/17/19 - devs updated on_batch() methods in a recent commit, stating to have improved speed - to be released in TF 2.1, or available now as tf-nightly. As I'm unable to get latter running, will delay benching until 2.1.
2/20/20 - prediction performance is also worth benching; in TF2, for example, CPU prediction times can involve periodic spikes

THIS ANSWER: aims to provide a detailed, graph/hardware-level description of the issue - including TF2 vs. TF1 train loops, input data processors, and Eager vs. Graph mode executions. For an issue summary & resolution guidelines, see my other answer.
PERFORMANCE VERDICT: sometimes one is faster, sometimes the other, depending on configuration. As far as TF2 vs TF1 goes, they're about on par on average, but significant config-based differences do exist, and TF1 trumps TF2 more often than vice versa. See "BENCHMARKING" below.
EAGER VS. GRAPH: the meat of this entire answer for some: TF2's eager is slower than TF1's, according to my testing. Details further down.
The fundamental difference between the two is: Graph sets up a computational network proactively, and executes when 'told to' - whereas Eager executes everything upon creation. But the story only begins here:
Eager is NOT devoid of Graph, and may in fact be mostly Graph, contrary to expectation. What it largely is, is executed Graph - this includes model & optimizer weights, comprising a great portion of the graph.
Eager rebuilds part of own graph at execution; a direct consequence of Graph not being fully built -- see profiler results. This has a computational overhead.
Eager is slower w/ Numpy inputs; per this Git comment & code, Numpy inputs in Eager include the overhead cost of copying tensors from CPU to GPU. Stepping through source code, data handling differences are clear; Eager directly passes Numpy, while Graph passes tensors which then evaluate to Numpy; uncertain of the exact process, but latter should involve GPU-level optimizations
TF2 Eager is slower than TF1 Eager - this is... unexpected. See benchmarking results below. Differences span from negligible to significant but are consistent. Unsure why it's the case - if a TF dev clarifies, will update the answer.
TF2 vs. TF1: quoting relevant portions of a TF dev's, Q. Scott Zhu's, response - w/ bit of my emphasis & rewording:
In eager, the runtime needs to execute the ops and return the numerical value for every line of python code. The nature of single step execution causes it to be slow.
In TF2, Keras leverages tf.function to build its graph for training, eval, and prediction. We call them "execution function" for the model. In TF1, the "execution function" was a FuncGraph, which shared some common components as the TF function, but has a different implementation.
During the process, we somehow left an incorrect implementation for train_on_batch(), test_on_batch() and predict_on_batch(). They are still numerically correct, but the execution function for x_on_batch is a pure python function, rather than a tf.function wrapped python function. This will cause slowness
In TF2, we convert all input data into a tf.data.Dataset, by which we can unify our execution function to handle the single type of the inputs. There might be some overhead in the dataset conversion, and I think this is a one-time-only overhead, rather than a per-batch cost
With the last sentence of the last paragraph above, and the last clause of the below paragraph:
To overcome the slowness in eager mode, we have #tf.function, which will turn a python function into a graph. When feed numerical value like np array, the body of the tf.function is converted into a static graph, being optimized, and return the final value, which is fast and should have similar performance as TF1 graph mode.
I disagree - per my profiling results, which show Eager's input data processing to be substantially slower than Graph's. Also, unsure about tf.data.Dataset in particular, but Eager does repeatedly call multiple of the same data conversion methods - see profiler.
Lastly, dev's linked commit: Significant number of changes to support the Keras v2 loops.
Train Loops: depending on (1) Eager vs. Graph; (2) input data format, training in will proceed with a distinct train loop - in TF2, _select_training_loop(), training.py, one of:
training_v2.Loop()
training_distributed.DistributionMultiWorkerTrainingLoop(
training_v2.Loop()) # multi-worker mode
# Case 1: distribution strategy
training_distributed.DistributionMultiWorkerTrainingLoop(
training_distributed.DistributionSingleWorkerTrainingLoop())
# Case 2: generator-like. Input is Python generator, or Sequence object,
# or a non-distributed Dataset or iterator in eager execution.
training_generator.GeneratorOrSequenceTrainingLoop()
training_generator.EagerDatasetOrIteratorTrainingLoop()
# Case 3: Symbolic tensors or Numpy array-like. This includes Datasets and iterators
# in graph mode (since they generate symbolic tensors).
training_generator.GeneratorLikeTrainingLoop() # Eager
training_arrays.ArrayLikeTrainingLoop() # Graph
Each handles resource allocation differently and bears consequences on performance & capability.
Train Loops: fit vs train_on_batch, keras vs. tf.keras: each of the four uses different train loops, though perhaps not in every possible combination. keras' fit, for example, uses a form of fit_loop, e.g. training_arrays.fit_loop(), and its train_on_batch may use K.function(). tf.keras has a more sophisticated hierarchy described in part in previous section.
Train Loops: documentation -- relevant source docstring on some of the different execution methods:
Unlike other TensorFlow operations, we don't convert python
numerical inputs to tensors. Moreover, a new graph is generated for each
distinct python numerical value
function instantiates a separate graph for every unique set of input
shapes and datatypes.
A single tf.function object might need to map to multiple computation graphs
under the hood. This should be visible only as performance (tracing graphs has
a nonzero computational and memory cost)
Input data processors: similar to above, the processor is selected case-by-case, depending on internal flags set according to runtime configurations (execution mode, data format, distribution strategy). The simplest case's with Eager, which works directly w/ Numpy arrays. For some specific examples, see this answer.
MODEL SIZE, DATA SIZE:
Is decisive; no single configuration crowned itself atop all model & data sizes.
Data size relative to model size is important; for small data & models, data transfer (e.g. CPU to GPU) overhead can dominate. Likewise, small overhead processors can run slower on large data per data conversion time dominating (see convert_to_tensor in "PROFILER")
Speed differs per train loops' and input data processors' differing means of handling resources.
BENCHMARKS: the grinded meat. -- Word Document -- Excel Spreadsheet
Terminology:
%-less numbers are all seconds
% computed as (1 - longer_time / shorter_time)*100; rationale: we're interested by what factor one is faster than the other; shorter / longer is actually a non-linear relation, not useful for direct comparison
% sign determination:
TF2 vs TF1: + if TF2 is faster
GvE (Graph vs. Eager): + if Graph is faster
TF2 = TensorFlow 2.0.0 + Keras 2.3.1; TF1 = TensorFlow 1.14.0 + Keras 2.2.5
PROFILER:
PROFILER - Explanation: Spyder 3.3.6 IDE profiler.
Some functions are repeated in nests of others; hence, it's hard to track down the exact separation between "data processing" and "training" functions, so there will be some overlap - as pronounced in the very last result.
% figures computed w.r.t. runtime minus build time
Build time computed by summing all (unique) runtimes which were called 1 or 2 times
Train time computed by summing all (unique) runtimes which were called the same # of times as the # of iterations, and some of their nests' runtimes
Functions are profiled according to their original names, unfortunately (i.e. _func = func will profile as func), which mixes in build time - hence the need to exclude it
TESTING ENVIRONMENT:
Executed code at bottom w/ minimal background tasks running
GPU was "warmed up" w/ a few iterations before timing iterations, as suggested in this post
CUDA 10.0.130, cuDNN 7.6.0, TensorFlow 1.14.0, & TensorFlow 2.0.0 built from source, plus Anaconda
Python 3.7.4, Spyder 3.3.6 IDE
GTX 1070, Windows 10, 24GB DDR4 2.4-MHz RAM, i7-7700HQ 2.8-GHz CPU
METHODOLOGY:
Benchmark 'small', 'medium', & 'large' model & data sizes
Fix # of parameters for each model size, independent of input data size
"Larger" model has more parameters and layers
"Larger" data has a longer sequence, but same batch_size and num_channels
Models only use Conv1D, Dense 'learnable' layers; RNNs avoided per TF-version implem. differences
Always ran one train fit outside of benchmarking loop, to omit model & optimizer graph building
Not using sparse data (e.g. layers.Embedding()) or sparse targets (e.g. SparseCategoricalCrossEntropy()
LIMITATIONS: a "complete" answer would explain every possible train loop & iterator, but that's surely beyond my time ability, nonexistent paycheck, or general necessity. The results are only as good as the methodology - interpret with an open mind.
CODE:
import numpy as np
import tensorflow as tf
import random
from termcolor import cprint
from time import time
from tensorflow.keras.layers import Input, Dense, Conv1D
from tensorflow.keras.layers import Dropout, GlobalAveragePooling1D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import tensorflow.keras.backend as K
#from keras.layers import Input, Dense, Conv1D
#from keras.layers import Dropout, GlobalAveragePooling1D
#from keras.models import Model
#from keras.optimizers import Adam
#import keras.backend as K
#tf.compat.v1.disable_eager_execution()
#tf.enable_eager_execution()
def reset_seeds(reset_graph_with_backend=None, verbose=1):
if reset_graph_with_backend is not None:
K = reset_graph_with_backend
K.clear_session()
tf.compat.v1.reset_default_graph()
if verbose:
print("KERAS AND TENSORFLOW GRAPHS RESET")
np.random.seed(1)
random.seed(2)
if tf.__version__[0] == '2':
tf.random.set_seed(3)
else:
tf.set_random_seed(3)
if verbose:
print("RANDOM SEEDS RESET")
print("TF version: {}".format(tf.__version__))
reset_seeds()
def timeit(func, iterations, *args, _verbose=0, **kwargs):
t0 = time()
for _ in range(iterations):
func(*args, **kwargs)
print(end='.'*int(_verbose))
print("Time/iter: %.4f sec" % ((time() - t0) / iterations))
def make_model_small(batch_shape):
ipt = Input(batch_shape=batch_shape)
x = Conv1D(128, 40, strides=4, padding='same')(ipt)
x = GlobalAveragePooling1D()(x)
x = Dropout(0.5)(x)
x = Dense(64, activation='relu')(x)
out = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)
model.compile(Adam(lr=1e-4), 'binary_crossentropy')
return model
def make_model_medium(batch_shape):
ipt = Input(batch_shape=batch_shape)
x = ipt
for filters in [64, 128, 256, 256, 128, 64]:
x = Conv1D(filters, 20, strides=1, padding='valid')(x)
x = GlobalAveragePooling1D()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(128, activation='relu')(x)
x = Dense(64, activation='relu')(x)
out = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)
model.compile(Adam(lr=1e-4), 'binary_crossentropy')
return model
def make_model_large(batch_shape):
ipt = Input(batch_shape=batch_shape)
x = Conv1D(64, 400, strides=4, padding='valid')(ipt)
x = Conv1D(128, 200, strides=1, padding='valid')(x)
for _ in range(40):
x = Conv1D(256, 12, strides=1, padding='same')(x)
x = Conv1D(512, 20, strides=2, padding='valid')(x)
x = Conv1D(1028, 10, strides=2, padding='valid')(x)
x = Conv1D(256, 1, strides=1, padding='valid')(x)
x = GlobalAveragePooling1D()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(128, activation='relu')(x)
x = Dense(64, activation='relu')(x)
out = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)
model.compile(Adam(lr=1e-4), 'binary_crossentropy')
return model
def make_data(batch_shape):
return np.random.randn(*batch_shape), \
np.random.randint(0, 2, (batch_shape[0], 1))
def make_data_tf(batch_shape, n_batches, iters):
data = np.random.randn(n_batches, *batch_shape),
trgt = np.random.randint(0, 2, (n_batches, batch_shape[0], 1))
return tf.data.Dataset.from_tensor_slices((data, trgt))#.repeat(iters)
batch_shape_small = (32, 140, 30)
batch_shape_medium = (32, 1400, 30)
batch_shape_large = (32, 14000, 30)
batch_shapes = batch_shape_small, batch_shape_medium, batch_shape_large
make_model_fns = make_model_small, make_model_medium, make_model_large
iterations = [200, 100, 50]
shape_names = ["Small data", "Medium data", "Large data"]
model_names = ["Small model", "Medium model", "Large model"]
def test_all(fit=False, tf_dataset=False):
for model_fn, model_name, iters in zip(make_model_fns, model_names, iterations):
for batch_shape, shape_name in zip(batch_shapes, shape_names):
if (model_fn is make_model_large) and (batch_shape == batch_shape_small):
continue
reset_seeds(reset_graph_with_backend=K)
if tf_dataset:
data = make_data_tf(batch_shape, iters, iters)
else:
data = make_data(batch_shape)
model = model_fn(batch_shape)
if fit:
if tf_dataset:
model.train_on_batch(data.take(1))
t0 = time()
model.fit(data, steps_per_epoch=iters)
print("Time/iter: %.4f sec" % ((time() - t0) / iters))
else:
model.train_on_batch(*data)
timeit(model.fit, iters, *data, _verbose=1, verbose=0)
else:
model.train_on_batch(*data)
timeit(model.train_on_batch, iters, *data, _verbose=1)
cprint(">> {}, {} done <<\n".format(model_name, shape_name), 'blue')
del model
test_all(fit=True, tf_dataset=False)

Related

How to use all GPUs in SageMaker real-time inference?

I have deployed a model on real-time inference in a single gpu instance, it works fine.
Now I want to use a multiple GPUs to decrease the inference time, what do I need to change in my inference.py to make it work?
Here is some of my code:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
def model_fn(model_dir):
logger.info("Loading first model...")
model = Model().to(DEVICE)
with open(os.path.join(model_dir, "checkpoint.pth"), "rb") as f:
model.load_state_dict(torch.load(f, map_location=DEVICE)['state_dict'])
model = model.eval()
logger.info("Loading second model...")
model_2 = Model_2()
model_2.to(DEVICE)
checkpoint = torch.load('checkpoint_2.pth', map_location=DEVICE)
model_2(remove_prefix_state_dict(checkpoint['state_dict']), strict=True)
model_2 = model_2()
logger.info('Done loading models')
return {'first_model': model, 'second_model': model_2}
def input_fn(request_body, request_content_type):
assert request_content_type=='application/json'
url = json.loads(request_body)['url']
save_name = json.loads(request_body)['save_name']
logger.info(f'Image url: {url}')
img = Image.open(requests.get(url, stream=True).raw).convert('RGB')
w, h = img.size
input_tensor = preprocess(img)
input_batch = input_tensor.unsqueeze(0).to(DEVICE)
logger.info('Image ready to predict!')
return {'tensor':input_batch, 'w':w,'h':h,'image':img, 'save_name':save_name}
def predict_fn(input_object, model):
data = input_object['tensor']
logger.info('Generating prediction based on the input image')
model_1 = model['first_model']
model_2 = model['second_model']
d0, d1, d2, d3, d4, d5, d6 = model_1(data)
torch.cuda.empty_cache()
mask = torch.argmax(d0[0], axis=0).cpu().numpy()
mask = np.where(mask==2, 255, mask)
mask = np.where(mask==1, 128, mask)
img = input_object['image']
final_image = Image.fromarray(mask).resize((input_object['w'], input_object['h'])).convert('L')
img = np.array(img)[:,:,::-1]
final_image = np.array(final_image)
image_dict = to_dict(img, final_image)
final_image = model_2_process(model_2, image_dict)
torch.cuda.empty_cache()
return {"final_ouput": final_image, 'image':input_object['image'], 'save_name': input_object['save_name']}
I was thinking that maybe with torch multiprocessing, any tips?

The answer mentioning Torch DDP and DP is not exactly appropriate since the value of those libraries is to conduct multi-GPU gradient descent (averaging the gradient inter-GPU in particular), which, as mentioned in 1., does not happen at inference. Actually, a well-done, optimized inference ideally doesn't even use PyTorch or TensorFlow at all, but instead a prediction-only optimized runtime such as SageMaker Neo, ONNXRuntime or NVIDIA TensorRT, to reduce memory footprint and latency.
to infer a single model that fits in a GPU, multi-GPU instances are generally not advised: inference is a share-nothing task, so that you can use N single-GPU instance and things are simpler and equally performant.
Inference on Multi-GPU host is useful in 2 cases: (1) if you do model parallel inference (not your case) or (2) if your service inference consists of a graph of models that are calling each other. In which case, the proximity of the various models called in the DAG can reduce latency. That seems to be your situation
My recommendations are the following:
Try using NVIDIA Triton, that supports well those DAG use-cases and is supported on SageMaker. https://aws.amazon.com/fr/blogs/machine-learning/deploy-fast-and-scalable-ai-with-nvidia-triton-inference-server-in-amazon-sagemaker/
If you want to do things custom, you could try assigning the 2 models to different cuda device id in PyTorch. Because cuda kernels are run asynchronously this could be enough to have some parallelism and a bit of acceleration vs 1 GPU if your models can run parallel
I saw multiprocessing used once (with MXNet) to load-balance inference requests across GPUs (in this AWS blog post) but it was for share-nothing, map-style distribution of batches of inferences. In your case you seem to have to connection between your model so Triton is probably a better fit.
Eventually, if your goal is to reduce latency, there are other ideas:
Fix any CPU bottleneck Your code seem to have a lot of CPU work (pre-processing, numpy...). Are you sure GPU is the bottleneck? If CPU is at 80%+, try large single-GPU G5, such as G5.16xlarge. They are great for computer vision inference
Use a better GPU if you are using a P2, P3 or G4dn, try G5 instead
Optimize code. 2 things to try, depending on the bottleneck:
If you do the inference in Torch, try to avoid doing algebra with Numpy, and do as much as possible with torch tensors on GPU.
If GPU is the bottleneck, try to replace PyTorch by ONNXRuntime or NVIDIA TensorRT.

You must use torch.nn.DataParallel or torch.nn.parallel.DistributedDataParallel (read "Multi-GPU Examples" and "Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel").
You must call the function by passing at least these three parameters:
module (Module) – module to be parallelized (your model)
device_ids (list of python:int or torch.device) – CUDA devices.
For single-device modules, device_ids can contain
exactly one device id, which represents the only CUDA device where the
input module corresponding to this process resides. Alternatively,
device_ids can also be None.
For multi-device modules and CPU
modules, device_ids must be None.
When device_ids is None for both cases, both the input data for the
forward pass and the actual module must be placed on the correct
device. (default: None)
output_device (int or torch.device) – Device location of output for single-device CUDA modules.
For multi-device modules and CPU modules, it must be None, and the module itself dictates the output location. (default: device_ids[0] for single-device modules)
for example:
from torch.nn.parallel import DistributedDataParallel
model = DistributedDataParallel(model, device_ids=[i], output_device=i)

Why does keras model.fit use so much memory despite using allow_growth=True?

I have, thanks to this question mostly been able to solve the problem of tensorflow allocating memory which I didn't want allocated. However, I have recently found that despite my using set_session with allow_growth=True, using model.fit will still mean that all the memory is allocated and I can no longer use it for the rest of my program, even when the function is exited and the model should no longer have any allocated memory due to the fact that the model is a local variable.
Here is some example code demonstrating this:
from numpy import array
from keras import Input, Model
from keras.layers import Conv2D, Dense, Flatten
from keras.optimizers import SGD
# stops keras/tensorflow from allocating all the GPU's memory immediately
from tensorflow.compat.v1.keras.backend import set_session
from tensorflow.compat.v1 import Session, ConfigProto, GPUOptions
tf_config = ConfigProto(gpu_options=GPUOptions(allow_growth=True))
session = Session(config=tf_config)
set_session(session)
# makes the neural network
def make_net():
input = Input((2, 3, 3))
conv = Conv2D(256, (1, 1))(input)
flattened_input = Flatten()(conv)
output = Dense(1)(flattened_input)
model = Model(inputs=input, outputs=output)
sgd = SGD(0.2, 0.9)
model.compile(sgd, 'mean_squared_error')
model.summary()
return model
def make_data(input_data, target_output):
input_data.append([[[0 for i in range(3)] for j in range(3)] for k in range(2)])
target_output.append(0)
def main():
data_amount = 4096
input_data = []
target_output = []
model = make_model()
for i in range(data_amount):
make_data(input_data, target_output)
model.fit(array(input_data), array(target_output), batch_size=len(input_data))
return
while True:
main()
When I run this code with the Pycharm debugger, I find that the GPU RAM used stays at around 0.1GB until I run model.fit for the first time, at which point the memory usage shoots up to 3.2GB of my 4GB of GPU RAM. I have also noted that the memory usage doesn't increase after the first time that model.fit is run and that if I remove the convolutional layer from my network, the memory increase doesn't happen at all.
Could someone please shine some light on my problem?
UPDATE: Setting per_process_gpu_memory_fraction in GPUOptions to 0.1 helps limit the effect in the code included, but not in my actual program. A better solution would still be helpful.

I used to face this problem. And I found a solution from someone who I can't find anymore. His solution I paste below. In fact, I found that if you set allow_growth=True, tensorflow seems to use all your memory. So you should just set your max limit.
try this:
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
# Restrict TensorFlow to only use the first GPU
try:
for gpu in gpus:
tf.config.experimental.set_memory_growth(gpu, False)
tf.config.experimental.set_virtual_device_configuration(
gpu,
[
tf.config.experimental.VirtualDeviceConfiguration(
memory_limit=12288 # set your limit
)
],
)
tf.config.experimental.set_visible_devices(gpus[0], "GPU")
logical_gpus = tf.config.experimental.list_logical_devices("GPU")
print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
except RuntimeError as e:
# Visible devices must be set before GPUs have been initialized
print(e)

Training with SGD and the whole training data in one batch can (depending on your input data) be very memory consumptive.
Try tweaking your batch_size to a lower size (e.g. 8, 16, 32)

Why does keras model predict slower after compile?

In theory, the prediction should be constant as the weights have a fixed size. How do I get my speed back after compile (without the need to remove optimizer)?
See associated experiment: https://nbviewer.jupyter.org/github/off99555/TensorFlowExperiments/blob/master/test-prediction-speed-after-compile.ipynb?flush_cache=true

UPDATE - 1/15/2020: the current best practice for small batch sizes should be to feed inputs to the model directly - i.e. preds = model(x), and if layers behave differently at train / inference, model(x, training=False). Per latest commit, this is now documented.
I haven't benchmarked these, but per the Git discussion, it's also worth trying predict_on_batch() - especially with improvements in TF 2.1.
ULTIMATE CULPRIT: self._experimental_run_tf_function = True. It's experimental. But it's not actually bad.
To any TensorFlow devs reading: clean up your code. It's a mess. And it violates important coding practices, such as one function does one thing; _process_inputs does a lot more than "process inputs", same for _standardize_user_data. "I'm not paid enough" - but you do pay, in extra time spent understanding your own stuff, and in users filling your Issues page with bugs easier resolved with a clearer code.
SUMMARY: it's only a little slower with compile().
compile() sets an internal flag which assigns a different prediction function to predict. This function constructs a new graph upon each call, slowing it down relative to uncompiled. However, the difference is only pronounced when train time is much shorter than data processing time. If we increase the model size to at least mid-sized, the two become equal. See code at the bottom.
This slight increase in data processing time is more than compensated by amplified graph capability. Since it's more efficient to keep only one model graph around, the one pre-compile is discarded. Nonetheless: if your model is small relative to data, you are better off without compile() for model inference. See my other answer for a workaround.
WHAT SHOULD I DO?
Compare model performance compiled vs uncompiled as I have in code at the bottom.
Compiled is faster: run predict on a compiled model.
Compiled is slower: run predict on an uncompiled model.
Yes, both are possible, and it will depend on (1) data size; (2) model size; (3) hardware. Code at the bottom actually shows compiled model being faster, but 10 iterations is a small sample. See "workarounds" in my other answer for the "how-to".
DETAILS:
This took a while to debug, but was fun. Below I describe the key culprits I discovered, cite some relevant documentation, and show profiler results that led to the ultimate bottleneck.
(FLAG == self.experimental_run_tf_function, for brevity)
Model by default instantiates with FLAG=False. compile() sets it to True.
predict() involves acquiring the prediction function, func = self._select_training_loop(x)
Without any special kwargs passed to predict and compile, all other flags are such that:
(A) FLAG==True --> func = training_v2.Loop()
(B) FLAG==False --> func = training_arrays.ArrayLikeTrainingLoop()
From source code docstring, (A) is heavily graph-reliant, uses more distribution strategy, and ops are prone to creating & destroying graph elements, which "may" (do) impact performance.
True culprit: _process_inputs(), accounting for 81% of runtime. Its major component? _create_graph_function(), 72% of runtime. This method does not even exist for (B). Using a mid-sized model, however, _process_inputs comprises less than 1% of runtime. Code at bottom, and profiling results follow.
DATA PROCESSORS:
(A): <class 'tensorflow.python.keras.engine.data_adapter.TensorLikeDataAdapter'>, used in _process_inputs() . Relevant source code
(B): numpy.ndarray, returned by convert_eager_tensors_to_numpy. Relevant source code, and here
MODEL EXECUTION FUNCTION (e.g. predict)
(A): distribution function, and here
(B): distribution function (different), and here
PROFILER: results for code in my other answer, "tiny model", and in this answer, "medium model":
Tiny model: 1000 iterations, compile()
Tiny model: 1000 iterations, no compile()
Medium model: 10 iterations
DOCUMENTATION (indirectly) on effects of compile(): source
Unlike other TensorFlow operations, we don't convert python
numerical inputs to tensors. Moreover, a new graph is generated for each
distinct python numerical value, for example calling g(2) and g(3) will
generate two new graphs
function instantiates a separate graph for every unique set of input
shapes and datatypes. For example, the following code snippet will result
in three distinct graphs being traced, as each input has a different
shape
A single tf.function object might need to map to multiple computation graphs
under the hood. This should be visible only as performance (tracing graphs has
a nonzero computational and memory cost) but should not affect the correctness
of the program
COUNTEREXAMPLE:
from tensorflow.keras.layers import Input, Dense, LSTM, Bidirectional, Conv1D
from tensorflow.keras.layers import Flatten, Dropout
from tensorflow.keras.models import Model
import numpy as np
from time import time
def timeit(func, arg, iterations):
t0 = time()
for _ in range(iterations):
func(arg)
print("%.4f sec" % (time() - t0))
batch_size = 32
batch_shape = (batch_size, 400, 16)
ipt = Input(batch_shape=batch_shape)
x = Bidirectional(LSTM(512, activation='relu', return_sequences=True))(ipt)
x = LSTM(512, activation='relu', return_sequences=True)(ipt)
x = Conv1D(128, 400, 1, padding='same')(x)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(128, activation='relu')(x)
x = Dense(64, activation='relu')(x)
out = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)
X = np.random.randn(*batch_shape)
timeit(model.predict, X, 10)
model.compile('adam', loss='binary_crossentropy')
timeit(model.predict, X, 10)
Outputs:
34.8542 sec
34.7435 sec

UPDATE: see actual answer posted as a separate answer; this post contains supplemental info
.compile() sets up the majority of TF/Keras graph, including losses, metrics, gradients, and partly the optimizer and its weights - which guarantees a notable slowdown.
What is unexpected is the extent of slowdown - 10-fold on my own experiment, and for predict(), which doesn't update any weights. Looking into TF2's source code, graph elements appear tightly intertwined, with resources not necessarily being allocated "fairly".
Possible overlook by developers on predict's performance for an uncompiled model, as models are typically used compiled - but in practice, this is an unacceptable difference. It's also possible it's a "necessary evil", as there is a simple workaround (see below).
This isn't a complete answer, and I hope someone can provide it here - if not, I'd suggest opening a Github issue on TensorFlow. (OP has; here)
Workaround: train a model, save its weights, re-build the model without compiling, load the weights. Do not save the entire model (e.g. model.save()), as it'll load compiled - instead use model.save_weights() and model.load_weights().
Workaround 2: above, but use load_model(path, compile=False); suggestion credit: D. Möller
UPDATE: to clarify, optimizer is not fully instantiated with compile, including its weights and updates tensors - this is done when the first call to a fitting function is made (fit, train_on_batch, etc), via model._make_train_function().
The observed behavior is thus even more strange. Worse yet, building the optimizer does not elicit any further slowdowns (see below) - suggesting "graph size" is not the main explanation here.
EDIT: on some models, a 30x slowdown. TensorFlow, what have you done. Example below:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
import numpy as np
from time import time
def timeit(func, arg, iterations):
t0 = time()
for _ in range(iterations):
func(arg)
print("%.4f sec" % (time() - t0))
ipt = Input(shape=(4,))
x = Dense(2, activation='relu')(ipt)
out = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)
X = np.random.randn(32,4)
timeit(model.predict, X, 1000)
model.compile('adam', loss='binary_crossentropy')
timeit(model.predict, X, 1000)
model._make_train_function() # build optimizer
timeit(model.predict, X, 1000)
Outputs:
0.9891 sec
29.785 sec
29.521 sec

Warning `tried to deallocate nullptr` when using tensorflow eager execution with tf.keras

As per the tensorflow team suggestion, I'm getting used to tensorflow's eager execution with tf.keras. However, whenever I train a model, I receive a warning (EDIT: actually, I receive this warning repeated many times, more than once per training step, flooding my standard output):
E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
The warning doesn't seem to affect the quality of the training but I wonder what it means and if it is possible to get rid of it.
I use a conda virtual environment with python 3.7 and tensorflow 1.12 running on a CPU. (EDIT: a test with python 3.6 gives the same results.) A minimal code that reproduces the warnings follows. Interestingly, it is possible to comment the line tf.enable_eager_execution() and see that the warnings disappear.
import numpy as np
import tensorflow as tf
tf.enable_eager_execution()
N_EPOCHS = 50
N_TRN = 10000
N_VLD = 1000
# the label is positive if the input is a number larger than 0.5
# a little noise is added, just for fun
x_trn = np.random.random(N_TRN)
x_vld = np.random.random(N_VLD)
y_trn = ((x_trn + np.random.random(N_TRN) * 0.02) > 0.5).astype(float)
y_vld = ((x_vld + np.random.random(N_VLD) * 0.02) > 0.5).astype(float)
# a simple logistic regression
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(1, input_dim=1))
model.add(tf.keras.layers.Activation('sigmoid'))
model.compile(
optimizer=tf.train.AdamOptimizer(),
# optimizer=tf.keras.optimizers.Adam(), # doesn't work at all with tf eager execution
loss='binary_crossentropy',
metrics=['accuracy']
)
# Train model on dataset
model.fit(
x_trn, y_trn,
epochs=N_EPOCHS,
validation_data=(x_vld, y_vld),
)
model.summary()

Quick solutions:
It did not appear when I ran the same script in TF 1.11 while the optimization was performed to reach the same final validation accuracy on a synthetic dataset.
OR
Suppress the errors/warning using the native os module (adapted from https://stackoverflow.com/a/38645250/2374160). ie; by setting the Tensorflow logging environment variable to not show any error messages.
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import tensorflow as tf
More info:
Solving this error in the correct way may require familiarity with MKL library calls and its interfacing on Tensorflow which is written in C (this is beyond my current TF expertise)
In my case, this memory deallocation error occurred whenever the
apply_gradients() method of an optimizer was called. In your script, it is called when the model is being fitted to the training data.
This error is raised from here: tensorflow/core/common_runtime/mkl_cpu_allocator.h
I hope this helps as a temporary solution for convenience.

Keras + Tensorflow: Prediction on multiple gpus

I'm using Keras with tensorflow as backend.
I have one compiled/trained model.
My prediction loop is slow so I would like to find a way to parallelize the predict_proba calls to speed things up.
I would like to take a list of batches (of data) and then per available gpu, run model.predict_proba() over a subset of those batches.
Essentially:
data = [ batch_0, batch_1, ... , batch_N ]
on gpu_0 => return predict_proba(batch_0)
on gpu_1 => return predict_proba(batch_1)
...
on gpu_N => return predict_proba(batch_N)
I know that it's possible in pure Tensorflow to assign ops to a given gpu (https://www.tensorflow.org/tutorials/using_gpu). However, I don't know how this translates to my situation since I've built/compiled/trained my model using Keras' api.
I had thought that maybe I just needed to use python's multiprocessing module and start a process per gpu that would run predict_proba(batch_n). I know this is theoretically possible given another SO post of mine: Keras + Tensorflow and Multiprocessing in Python. However, this still leaves me with the dilemma of not knowing how to actually "choose" a gpu to operate the process on.
My question boils down to: how does one parallelize prediction for one model in Keras across multiple gpus when using Tensorflow as Keras' backend?
Additionally I am curious if similar parallelization for prediction is possible with only one gpu.
A high level description or code example would be greatly appreciated!
Thanks!

I created one simple example to show how to run keras model across multiple gpus. Basically, multiple processes are created and each of process owns a gpu. To specify the gpu id in process, setting env variable CUDA_VISIBLE_DEVICES is a very straightforward way (os.environ["CUDA_VISIBLE_DEVICES"]). Hope this git repo can help you.
https://github.com/yuanyuanli85/Keras-Multiple-Process-Prediction

You can use this function to parallelize a Keras model (credits to kuza55).
https://github.com/kuza55/keras-extras/blob/master/utils/multi_gpu.py
.
from keras.layers import merge
from keras.layers.core import Lambda
from keras.models import Model
import tensorflow as tf
def make_parallel(model, gpu_count):
def get_slice(data, idx, parts):
shape = tf.shape(data)
size = tf.concat([ shape[:1] // parts, shape[1:] ],axis=0)
stride = tf.concat([ shape[:1] // parts, shape[1:]*0 ],axis=0)
start = stride * idx
return tf.slice(data, start, size)
outputs_all = []
for i in range(len(model.outputs)):
outputs_all.append([])
#Place a copy of the model on each GPU, each getting a slice of the batch
for i in range(gpu_count):
with tf.device('/gpu:%d' % i):
with tf.name_scope('tower_%d' % i) as scope:
inputs = []
#Slice each input into a piece for processing on this GPU
for x in model.inputs:
input_shape = tuple(x.get_shape().as_list())[1:]
slice_n = Lambda(get_slice, output_shape=input_shape, arguments={'idx':i,'parts':gpu_count})(x)
inputs.append(slice_n)
outputs = model(inputs)
if not isinstance(outputs, list):
outputs = [outputs]
#Save all the outputs for merging back together later
for l in range(len(outputs)):
outputs_all[l].append(outputs[l])
# merge outputs on CPU
with tf.device('/cpu:0'):
merged = []
for outputs in outputs_all:
merged.append(merge(outputs, mode='concat', concat_axis=0))
return Model(input=model.inputs, output=merged)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.