I'm training a neural network on my GPU with PyTorch.
But the GPU utilization is strangely capped at about 50-60%.
That's a waste of computing resources, yet I can't push it any higher.
I'm sure the hardware is fine, because running two of my processes at the same time, or training a simple NN (a DCGAN, for instance), can both push the GPU to 95% utilization or more (which is how it is supposed to be).
My NN contains several convolution layers, so it should be using more GPU resources.
Besides, I think the data from the dataset is being fed fast enough, because I set num_workers=64 in my DataLoader instance and my disk works just fine.
I'm just confused about what is happening.
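For reference, the DataLoader is set up roughly like this (a simplified sketch, not my exact code; the dataset and batch size below are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the real one
train_dataset = TensorDataset(torch.randn(10000, 3, 64, 64),
                              torch.zeros(10000, dtype=torch.long))

train_loader = DataLoader(
    train_dataset,
    batch_size=64,      # placeholder value
    shuffle=True,
    num_workers=64,     # many worker processes, so loading should keep up with the GPU
)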
Dev details:
GPU: Nvidia GTX 1080 Ti
OS: Ubuntu 64-bit
I can only guess without further research, but it could be that your network is small in terms of layer size (not number of layers), so each training step is not enough work to occupy all of the GPU's resources. Or, at least, the ratio between the data size and the transfer speed (to GPU memory) is poor, and the GPU stays idle most of the time.
tl;dr: the GPU jobs are not long enough to justify the memory transfers.
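One way to check this is to profile a few training steps and compare the time spent in host-to-device copies against the time spent in compute kernels; if the copies (or the gaps between kernels) dominate, the GPU sits idle. A minimal sketch with torch.profiler (available in recent PyTorch versions), using a dummy model and batch as stand-ins for yours:

import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Dummy stand-ins for the real model and batch
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3)).cuda()
batch = torch.randn(64, 3, 64, 64)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        x = batch.cuda(non_blocking=True)   # host-to-device transfer
        loss = model(x).sum()
        loss.backward()

# Large "Memcpy HtoD" entries or long idle gaps point to transfer/launch overhead
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))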
I'm using a GPU to train quite a lot of models. I want to tune the architecture of the network, so I train different models sequentially to compare their performances (I'm using keras-tuner).
The problem is that some models are very small and some others are very large. I don't want to allocate all the GPU memory to my trainings, only the quantity I need. I've set TF_FORCE_GPU_ALLOW_GROWTH to true, meaning that when a model requires a large amount of memory, the GPU will allocate it. However, once that big model has been trained, the memory is not released, even if the next trainings are of tiny models.
Is there a way to force the GPU to release unused memory? Something like TF_FORCE_GPU_ALLOW_SHRINK?
Automatic shrinking might be difficult to achieve. If so, I would be happy with a manual release that I could add in a callback to be run after each training.
You can try enabling GPU memory growth using this code:
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Allocate GPU memory on demand instead of reserving it all up front
    tf.config.experimental.set_memory_growth(gpus[0], True)
The second method is to configure a virtual GPU device with tf.config.set_logical_device_configuration and set a hard limit on the total memory to allocate on the GPU.
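For example, a minimal sketch of that second method (the 4096 MB limit is just an illustrative value):

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Hard-cap this process to 4096 MB on the first GPU
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)])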
Please check this link for more details.
We are testing CatBoost on both CPU and GPU.
While it runs much faster on GPU than on CPU, the results we are getting are much worse, even though we are using the same data.
I am talking around 50% worse.
How is this possible?
We are using the following code to run it on CPU and only changing the task_type to GPU when running on GPU:
from catboost import CatBoostClassifier

catBoostModel = CatBoostClassifier(
    task_type="CPU",
    early_stopping_rounds=50,
    eval_metric="Precision",
    cat_features=["Symbol"],
    auto_class_weights="Balanced",
    thread_count=-1
)
What are we missing?
There are some hyperparameters for which CatBoost uses different default values on CPU and GPU. There are also some hyperparameters that are only available on GPU or only available on CPU. The CatBoost documentation provides all the details.
This means that even if you are running the same code both on CPU and on GPU, you are likely training two different models. You can use model.get_all_params() (where model is your trained model object) to get the list of all hyperparameters and compare between CPU and GPU.
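For instance, a rough sketch of that comparison (the toy data is purely illustrative, and the GPU run of course requires a CUDA-capable device):

from catboost import CatBoostClassifier

# Toy data purely for illustration (100 rows, 3 numeric features)
X = [[i % 7, (i * 3) % 5, i % 2] for i in range(100)]
y = [i % 2 for i in range(100)]

cpu_model = CatBoostClassifier(task_type="CPU", iterations=20, verbose=False)
gpu_model = CatBoostClassifier(task_type="GPU", iterations=20, verbose=False)
cpu_model.fit(X, y)
gpu_model.fit(X, y)

# List every hyperparameter whose resolved value differs between the two runs
cpu_params = cpu_model.get_all_params()
gpu_params = gpu_model.get_all_params()
for key in sorted(set(cpu_params) | set(gpu_params)):
    if cpu_params.get(key) != gpu_params.get(key):
        print(key, "CPU:", cpu_params.get(key), "GPU:", gpu_params.get(key))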
For a project I'm working on, I am using an altered version of Mask RCNN to train a model that will find objects in an image. These images are relatively small, about 300 x 200 pixels, and I train them for a relatively long time, around 100 epochs.
However, my main question relates to the batch size and how TensorFlow allocates memory on the GPU for the validation stage each epoch. I want to increase my batch size to help smooth out the validation curve and increase the accuracy of the overall model. However, if I increase my batch size too drastically, I get an OOM (GPU out-of-memory) and keras_scratch_graph error. I'm currently working with two NVIDIA Quadro P5000s that have 16 GB of VRAM each. With about 3 images per GPU, I can have a maximum batch size of 6 before it errors out. I've looked around, and most people either say to just decrease the batch size, which I would prefer not to do, or to enable GPU memory growth, which I couldn't get to work either. I could decrease the complexity of my model to reduce the size of the tensors being evaluated, but I don't want to risk it, as it could cause my accuracy to decrease or my loss to increase.
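For reference, this is roughly what I tried for enabling GPU growth (a simplified sketch, not my exact code):

import tensorflow as tf

# Turn on memory growth for both Quadro P5000s so TensorFlow allocates on demand
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)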
Is there a way that I can offload some images onto my system's physical memory, or am I purely limited to the amount of RAM available on my GPU? Are there any more compact or robust methods out there that could solve this issue?
I want to find out how much GPU memory my TensorFlow model needs at inference. So I used tf.contrib.memory_stats.MaxBytesInUse, which returned 6168 MB.
But with config.gpu_options.per_process_gpu_memory_fraction I can use a much smaller fraction of my GPU, and the model still runs fine without needing more time per inference step.
Is there a way to determine how much GPU memory a TensorFlow model requires? I could keep decreasing the GPU memory fraction until TF crashes, but I guess there is a more elegant and precise way?
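For reference, this is roughly how I measure it (a simplified sketch using the TF 1.x contrib API; the 0.3 fraction is just an example value):

import tensorflow as tf
from tensorflow.contrib.memory_stats import MaxBytesInUse

# Limit the fraction of GPU memory this process may allocate
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.3

with tf.Session(config=config) as sess:
    # ... run one inference step of the model here ...
    print("Peak GPU memory in use (bytes):", sess.run(MaxBytesInUse()))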
I use this notebook from Kaggle to run LSTM neural network.
I started training the neural network and saw that it was far too slow. It is almost three times slower than CPU training.
CPU performance: 8 min per epoch;
GPU performance: 26 min per epoch.
After this, I decided to look for an answer in this question on Stack Overflow, and I applied CuDNNLSTM (which runs only on GPU) instead of LSTM.
As a result, GPU performance became only 1 min per epoch, but the accuracy of the model decreased by 3%.
Questions:
1) Does somebody know why the GPU works slower than the CPU with the classic LSTM layer? I do not understand why this happens.
2) Why does training become much faster, and the accuracy of the model decrease, when I use CuDNNLSTM instead of LSTM?
P.S.:
My CPU: Intel Core i7-7700 Processor (8M Cache, up to 4.20 GHz)
My GPU: nVidia GeForce GTX 1050 Ti (4 GB)
My guess is that it's simply a different, better implementation and, if the implementation is different, you shouldn't expect identical results.
In general, efficiently implementing an algorithm on a GPU is hard, and getting maximum performance requires architecture-specific implementations. Therefore, it wouldn't be surprising if an implementation specific to Nvidia's GPUs had enhanced performance compared to a general implementation for GPUs. It also wouldn't be surprising that Nvidia would sink significantly more resources into accelerating their code for their GPUs than a team working on a general implementation would.
The other possibility is that the data type used on the backend has changed from double- to single- or even half-precision float. The smaller data types mean you can crunch more numbers faster at the cost of accuracy. For NN applications this is often acceptable because no individual number needs to be especially accurate for the net to produce acceptable results.
I had a similar problem today and found two things that may be helpful to others (this is a regression problem on a data set with ~2.1MM rows, running on a machine with 4 P100 GPUs):
Using the CuDNNLSTM layer instead of the LSTM layer on a GPU machine reduced the fit time from ~13500 seconds to ~400 seconds per epoch.
Increasing the batch size (~500 to ~4700) reduced it to ~130 seconds per epoch.
Reducing the batch size increased loss and val loss, so you'll need to make a decision about the trade-offs you want to make.
In Keras, CuDNNLSTM is the fast LSTM implementation backed by cuDNN:
# Drop-in replacement for the LSTM layer that runs on NVIDIA's cuDNN kernels
model.add(CuDNNLSTM(units, input_shape=(len(X_train), len(X_train[0])), return_sequences=True))
It can only be run on the GPU with the TensorFlow backend.
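A minimal, self-contained sketch for context (the data shapes and layer sizes are arbitrary assumptions; it needs a CUDA GPU and the TensorFlow backend):

import numpy as np
from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dense

# Illustrative data: 1000 samples, 20 timesteps, 8 features
X_train = np.random.rand(1000, 20, 8).astype("float32")
y_train = np.random.rand(1000, 1).astype("float32")

model = Sequential()
# input_shape is (timesteps, features)
model.add(CuDNNLSTM(64, input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dense(1))
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, batch_size=128, epochs=2)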