I am trying to train a ResNet50 model using Keras with the TensorFlow backend. I'm using a SageMaker GPU instance (ml.p3.2xlarge), but my training time is extremely long. I am using the conda_tensorflow_p36 kernel and I have verified that tensorflow-gpu is installed.
When inspecting the output of nvidia-smi I see the process is on the GPU, but the utilization is never above 0%.
Tensorflow also recognizes the GPU.
Screenshot of training time.
Is SageMaker in fact using the GPU even though the usage is 0%?
Could the long epoch training time be caused by another issue?
Looks like you've completed 8 steps and it just takes very long. What's your step time?
It might be due to data loading. Where is the data stored? Try to take data loading out of the picture by caching and feeding a single image to the DNN repeatedly and see if that helps.
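For example, something along these lines should tell you quickly whether the pipeline is the bottleneck (a sketch; model is assumed to be your compiled ResNet50, and the fake data/labels are placeholders for whatever your loss expects):
# Hypothetical check: fit on one cached in-memory batch.
# If this is fast while your real training is slow, data loading is the culprit.
import numpy as np

x = np.random.rand(32, 224, 224, 3).astype("float32")   # one fake image batch
y = np.random.randint(0, 1000, size=(32,))               # fake integer labels

model.fit(x, y, epochs=5, batch_size=32)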
I am already using Google Colab to train my model, so I will not use my own GPU for training. I want to ask: is there a performance difference between GPU and CPU when working with a pre-trained model? I already trained a model on the Google Colab GPU and used it on my own local CPU. Should I use a GPU for testing?
It depends on how many predictions you need to do. In training you are making many calculations, so parallelisation on a GPU shortens the overall training time. When using a trained model, you usually only need a sparse prediction every so often, and in that situation a CPU is fine. However, if you need to do as many predictions as during training, then a GPU would be beneficial. This is particularly true for reinforcement learning, where your model must adapt to continuously changing environmental input.
There is a shared server with 2 GPUs at my institution. Suppose two team members each want to train a model at the same time: how do they get Keras to train their model on a specific GPU so as to avoid resource conflicts?
Ideally, Keras should figure out which GPU is currently busy training a model and then use the other GPU to train the other model. However, this doesn't seem to be the case. It seems that by default Keras only uses the first GPU (since the Volatile GPU-Util of the second GPU is always 0%).
Possibly a duplicate of my previous question.
It's a bit more complicated. Keras will allocate memory on both GPUs, although it will only use one GPU by default. Check keras.utils.multi_gpu_model for using several GPUs.
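For reference, its usage looks roughly like this (a sketch with a placeholder model and data; the API was later deprecated in favour of tf.distribute):
from keras.utils import multi_gpu_model

# Replicate the already-built single-GPU model onto 2 GPUs; each input batch is
# split across the replicas and the outputs are concatenated back on the CPU.
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(loss="categorical_crossentropy", optimizer="adam")
parallel_model.fit(x_train, y_train, epochs=10, batch_size=256)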
I found the solution by choosing the GPU using the environment variable CUDA_VISIBLE_DEVICES.
You can add this manually, before importing keras or tensorflow, to choose your GPU:
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # first GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # second GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # run on CPU
To do this automatically, I made a function that parses nvidia-smi, detects which GPU is already in use, and sets the variable accordingly.
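A minimal sketch of such a helper (not my exact function) could look like this; it assumes nvidia-smi is on the PATH and relies on its standard --query-gpu output:
import os
import subprocess

def pick_free_gpu():
    # Ask nvidia-smi for "index, utilization" per GPU and pick the least busy one.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"]
    ).decode()
    gpus = [line.split(",") for line in out.strip().splitlines()]
    idx, _ = min(gpus, key=lambda g: int(g[1]))
    os.environ["CUDA_VISIBLE_DEVICES"] = idx.strip()

pick_free_gpu()   # call this before importing keras / tensorflow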
If you are using a training script, you can simply set it on the command line when invoking the script:
CUDA_VISIBLE_DEVICES=1 python train.py
If you want to train models on cloud GPUs (e.g. GPU instances from AWS), try this library:
!pip install aibro==0.0.45 --extra-index-url https://test.pypi.org/simple
from aibro.train import fit
machine_id = 'g4dn.4xlarge'  # instance name on AWS
job_id, trained_model, history = fit(
    model=model,
    train_X=train_X,
    train_Y=train_Y,
    validation_data=(validation_X, validation_Y),
    machine_id=machine_id
)
Tutorial: https://colab.research.google.com/drive/19sXZ4kbic681zqEsrl_CZfB5cegUwuIB#scrollTo=ERqoHEaamR1Y
Hi, I have a problem with Keras on Python 3.6.
My environment is Keras with Python, CPU only.
The problem is that when I repeatedly call predict on the same Keras model with different inputs, it gets slower and slower.
My code is as simple as this:
for i in range(100):
    model.predict(x)
The first run is fast, maybe 2 seconds, but the second run takes 3 seconds and the third takes 5 seconds... It keeps getting slower and slower even though I use the same input.
How can I keep repeated Keras predict calls fast? I can't have them getting slower; it is very critical for me.
How can I fix it?
Try using the __call__ method directly. The documentation of the predict method states the following:
For small numbers of inputs that fit in one batch, directly use __call__() for faster execution, e.g., model(x).
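For example (a sketch, assuming tf.keras; model and x stand in for the objects in your own loop):
import tensorflow as tf

# predict() spins up a full prediction loop on every call, which adds noticeable
# overhead for tiny inputs; calling the model directly runs a single forward pass.
x_tensor = tf.convert_to_tensor(x)
for i in range(100):
    y = model(x_tensor, training=False)   # instead of model.predict(x)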
I see that performance is critical in this case. So, if that doesn't help, you could use OpenVINO, which is optimized for Intel hardware but should work with any CPU. Performance should be much better than with Keras directly.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert an HDF5 model, so you have to save it as a SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert the SavedModel
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (change data_type). Run this in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (the graphics integrated into your CPU, like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO.
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")

# Get the output layer
output_layer_ir = compiled_model_ir.output(0)

# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
If the model's fit or predict function is called batch after batch, the samples in each batch take slightly different amounts of time over the course of the iteration, and as you keep requesting more and more batches of predictions, the model's prediction time gets longer and longer.
I'm running the TensorFlow Object Detection API to train my own detector using the object_detection/train.py script, found here. The problem is that I'm getting CUDA_ERROR_OUT_OF_MEMORY constantly.
I found some suggestions to reduce the batch size so the trainer consumes less memory, but I reduced it from 16 to 4 and I'm still getting the same error. The difference is that with batch_size=16 the error was thrown at step ~18, and now it is being thrown at step ~70. EDIT: setting batch_size=1 didn't solve the problem either, as I still got the error at step ~2700.
What can I do to make it run smoothly until I stop the training process? I don't really need training to be fast.
EDIT:
I'm currently using a GTX 750 Ti 2GB for this. The GPU is not being used for anything else than training and providing monitor image. Currently, I'm using only 80 images for training and 20 images for evaluation.
I think it is not about batch_size, because you were able to start the training in the first place.
Open a terminal and run
nvidia-smi -l
to check whether other processes kick in when this error happens. If you set batch_size=16, you can find out pretty quickly.
Found the solution to my problem. The batch_size was not the problem, but a higher batch_size made the training memory consumption increase faster, because I was using the config.gpu_options.allow_growth = True setting. This allows TensorFlow to increase memory consumption when needed and to use up to 100% of the GPU memory.
The problem was that I was running the eval.py script at the same time (as recommended in their tutorial) and it was using part of the GPU memory. When the train.py script tried to use the full 100%, the error was thrown.
I solved it by setting the maximum memory use to 70% for the training process, which also fixed some stuttering during training. This may not be the optimum value for my GPU, but it is configurable, e.g. via config.gpu_options.per_process_gpu_memory_fraction = 0.7.
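With the TF1-style API used by train.py at the time, that looks roughly like this (a sketch; exactly where the config is plugged in depends on your training script):
import tensorflow as tf

# Cap this process at ~70% of GPU memory so eval.py has room to run alongside.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.7
sess = tf.Session(config=config)   # or pass `config` to whatever builds your session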
Another option is to dedicate the GPU for training and use the CPU for evaluation.
Disadvantage: Evaluation will consume a large portion of your CPU, but only for a few seconds every time a training checkpoint is created, which is not often.
Advantage: 100% of your GPU is used for training all the time
To target CPU, set this environment variable before you run the evaluation script:
export CUDA_VISIBLE_DEVICES=-1
You can explicitly set the evaluation batch size to 1 in pipeline.config to consume less memory:
eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  batch_size: 1
}
If you're still having issues, TensorFlow may not be releasing GPU memory between training runs. Try restarting your terminal or IDE and try again. This answer has more details.
I am currently training a recurrent net on Tensorflow for a text classification problem and am running into performance and out of memory issues. I am on AWS g2.8xlarge with Ubuntu 14.04, and a recent nightly build of tensorflow (which I downloaded on Aug 25).
1) Performance issue:
On the surface, both the CPU and GPU are highly under-utilized. I've run multiple tests on this (and have used line_profiler and memory_profiler in the process). Train durations scale linearly with number of epochs, so I tested with 1 epoch. For RNN config = 1 layer, 20 nodes, training time = 146 seconds.
Incidentally, that number is about 20 seconds higher/slower than the same test run on a g2.2xlarge!
Here is a snapshot of System Monitor and nvidia-smi (updated every 2 seconds) about 20 seconds into the run:
[Snapshot of System Monitor and nvidia-smi, early in the run]
As you can see, GPU utilization is at 19%. When I use nvprof, I find that the total GPU process time is about 27 seconds or so. Also, except for one vCPU, all others are very under-utilized. The numbers stay around this level, till the end of the epoch where I measure error across the entire training set, sending GPU utilization up to 45%.
Unless I am missing something, on the surface, each device is sitting around waiting for something to happen.
2) Out of memory issue:
If I increase the number of nodes to 200, I get an out-of-memory error on the GPU side. As you can see from the snapshot above, only one of the four GPUs is used. I've found that the way to get TensorFlow to use a GPU has to do with how you assign the model to devices. If you don't specify anything, TensorFlow assigns it to a GPU; if you specify a GPU, only that one is used. TensorFlow does not like it when I assign the model to multiple devices with a "for d in ['/gpu:0',...]" loop; I run into an issue with re-using the embedding variable. I would like to use all 4 GPUs for this (without setting up distributed TensorFlow). Here is the snapshot of the out-of-memory error:
[Snapshot of the out-of-memory error]
Any suggestions you may have for both these problems would be greatly appreciated!
Re (1), to improve GPU utilization did you try increasing the batch size and / or shortening the sequences you use for training?
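As a generic illustration of that advice (a Keras-style sketch, not the asker's TF code; sequences, y_train and model are placeholders):
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 100       # cap sequence length: fewer sequential RNN time steps per example
BATCH_SIZE = 256    # bigger batches: more parallel work per step, so the GPU stays busier

x_train = pad_sequences(sequences, maxlen=MAX_LEN, truncating="post")
model.fit(x_train, y_train, batch_size=BATCH_SIZE, epochs=1)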
Re (2), to use multiple GPUs you do need to manually assign the ops to GPU devices, I believe. The right way is to place ops on specific GPUs by doing
with tf.device("/gpu:0"):
    ...
with tf.device("/gpu:1"):
    ...
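To work around the embedding-variable reuse error specifically, the usual TF1-style pattern is to share a variable scope across the towers (a sketch; build_tower and inputs_shards are hypothetical stand-ins for your own model code):
import tensorflow as tf

def build_tower(inputs):
    # hypothetical: embedding lookup + RNN + loss for one shard of the batch
    ...

for i, device in enumerate(["/gpu:0", "/gpu:1", "/gpu:2", "/gpu:3"]):
    with tf.device(device):
        # reuse=True after the first tower so the embedding and the other weights
        # are shared rather than redeclared, which avoids the reuse error
        with tf.variable_scope("model", reuse=(i > 0)):
            build_tower(inputs_shards[i])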