Recently I discovered something rather strange in a project I have been working on for quite a while. The model is fairly conventional: a convnet with a few fully connected layers. For data loading I use the tf.data API, but the same thing happened with the queue-based code I had before porting to tf.data. A few hours after training begins, CPU usage rises to very high levels, 1500-2000% as reported by htop. At the beginning of training everything is fine and the main process shows only about 200% CPU usage. Attached is a screenshot of the htop output; another worrying thing is all the child processes that also show a pretty high CPU load.
I am using tensorflow-gpu version 1.11, running on an NVIDIA Tesla V100. I am pretty sure the model runs on the GPU and not on the CPU: nvidia-smi shows the GPU at about 70% utilization.
Obviously I cannot ask for an exact cause, and it would be difficult to strip the problem down to a reproducible test case. However, maybe you could point me to some debugging techniques that are applicable in such a case.
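For what it's worth, one experiment I may try is pinning TensorFlow's CPU thread pools and the tf.data map parallelism to small explicit values and watching whether the CPU usage still climbs over time. A minimal, self-contained sketch of that setup, with toy data and a trivial model standing in for my real pipeline (TF 1.x API):

import numpy as np
import tensorflow as tf

# Cap TensorFlow's CPU thread pools so runaway CPU usage is easier to attribute.
config = tf.ConfigProto(intra_op_parallelism_threads=2,
                        inter_op_parallelism_threads=2)

# Toy stand-in for my real input pipeline, with explicit, small map parallelism.
data = np.random.rand(1024, 64).astype("float32")
dataset = tf.data.Dataset.from_tensor_slices(data)
dataset = dataset.map(lambda x: x * 2.0, num_parallel_calls=2)
dataset = dataset.batch(32).prefetch(1)
features = dataset.make_one_shot_iterator().get_next()

# Trivial stand-in for the model so the graph actually runs.
loss = tf.reduce_mean(tf.layers.dense(features, 1))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    while True:
        try:
            sess.run(train_op)
        except tf.errors.OutOfRangeError:
            break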
Related
I am working on a project with some ML models. I worked on it during the summer and have returned to it recently. For some reason, it is a lot slower to train and test now than it was then; I think the problem is that Python is suddenly not using all of my system resources.
I am working on a project where I am building, training, and testing some ML models. I am using the sktime package in a python3.7 conda environment with jupyter notebook to do so.
I first started working on this project in the summer, and when I was building the models, I timed how long the training process took. I am resuming the project now, and I tried training the exact same model with the exact same data again, and it took around 6 hours this time compared to 76 minutes when I trained it during the summer. I have noticed that running inference on the test set also takes longer.
I am running on a 10-core M1 Max with 64 GB of RAM. I can tell that my computer is barely breaking a sweat right now: Activity Monitor says python is using 99.9% CPU, but overall the user is only using around 11.6% of the CPU. I remember the computer working a lot harder on this project during the summer (the fan would actually turn on and the machine would get hot, but that is not happening now), so I have a feeling that the environment is not using the full system resources available to it. I have checked that my RAM limit in jupyter is sufficient, so that is not the issue. I am very confused about what could have changed in this environment between the summer and now to cause this. Any help would be much appreciated.
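One thing I plan to check, though I am not sure it is relevant, is whether numpy's BLAS backend and the thread pools used by sktime's dependencies are stuck at a single thread in this environment, since 99.9% of one core on a 10-core machine roughly matches the ~11.6% overall usage I am seeing. A small diagnostic sketch, assuming the threadpoolctl package is installed in the environment:

import os
import numpy as np
from threadpoolctl import threadpool_info

# Report which BLAS/OpenMP libraries numpy is linked against and how many
# threads each pool is currently allowed to use.
for pool in threadpool_info():
    print(pool.get("internal_api"), pool.get("filepath"),
          "num_threads =", pool.get("num_threads"))

# Common environment variables that silently cap thread counts.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS",
            "MKL_NUM_THREADS", "VECLIB_MAXIMUM_THREADS"):
    print(var, "=", os.environ.get(var))

print("numpy:", np.__version__)
np.show_config()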
I'm running some notebooks which, at different points, are both CPU and GPU intensive. Running the notebook on my local PC is fast in terms of CPU power, but slow as my GPU cannot be used for Torch (I have a Ryzen 9 with an AMD GPU). On the other hand, running the notebook on the Colab GPU is fast in the GPU sections, but terribly slow in the CPU sections.
I know that it is possible to use my CPU by connecting to a local runtime, but then I am also stuck with my local GPU. Is it possible to use only my local CPU together with the Google Colab GPU at the same time?
(An alternative would be to run the CPU-intensive code on my local machine, store the intermediate results, and then use the Google GPU for the GPU-intensive parts, but that is of course suboptimal.)
As per my knowledge, you can distribute training among multiple GPUs, multiple machines, or TPUs with a built-in distribution strategy.
The MultiWorkerMirroredStrategy is most likely the one you are looking for. It has two implementations for cross-device communication; notably, CommunicationImplementation.RING is RPC-based and supports both CPU and GPU.
Details: Link 1 Link 2
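For illustration, a minimal sketch of creating the strategy with the RING implementation (TF 2.x API; the TF_CONFIG below is a single-worker placeholder, and in a real setup you would list every participating machine):

import json
import os
import tensorflow as tf

# Single-worker placeholder cluster spec; replace with your actual worker addresses.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["localhost:12345"]},
    "task": {"type": "worker", "index": 0},
})

# RING is RPC-based and works for both CPU and GPU devices.
options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.RING)
strategy = tf.distribute.MultiWorkerMirroredStrategy(communication_options=options)

with strategy.scope():
    # Any model built inside the scope is replicated across the workers.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")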
I am experimenting with designing a semantic segmentation network using PyTorch. It performs well on my computer. For better performance, we moved the network to a computer with much more GPU capacity. With the environments matched only in the versions of PyTorch and torchvision, performance degrades both on the new PC and on Google Colab. I simply copied and pasted the code and ran it as a test before doing other experiments.
The network structure is the same, so are there other external factors that can degrade performance (e.g. GPU, RAM, etc.)?
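To narrow it down, one thing I could do is dump version and device information on each machine and compare the outputs; a minimal sketch of that check:

import torch
import torchvision

# Compare these values between the original machine, the new PC, and Colab.
print("torch:", torch.__version__, "torchvision:", torchvision.__version__)
print("CUDA runtime:", torch.version.cuda, "cuDNN:", torch.backends.cudnn.version())
print("cudnn.benchmark:", torch.backends.cudnn.benchmark,
      "cudnn.deterministic:", torch.backends.cudnn.deterministic)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0),
          "capability:", torch.cuda.get_device_capability(0))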
I am trying to train a neural network on a subset of the GPUs that I have access to.
The problem is that I want to use only two GPUs, so I added the following line to my .bashrc:
export CUDA_VISIBLE_DEVICES=6,7
However, when I watch the GPUs, I find that my program is mainly using other GPUs (4, 5) at 100% memory utilization, and is using the visible GPUs only at very low memory utilization (9% at most).
I have already closed and reopened all the terminals so that ~/.bashrc is re-read, but the problem still persists!
Any help is appreciated!
P.S.: I have already run the same program successfully on only two GPUs, but on a different server, so memory is not the constraint here.
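As a further check, I could verify from inside the process which devices are actually visible, and set the variable in the script itself before the framework is imported instead of relying on .bashrc. A minimal sketch (shown with PyTorch as an assumed example; the same idea applies to other frameworks):

import os

# Must be set before the deep-learning framework is imported,
# otherwise the process may have already enumerated all GPUs.
os.environ["CUDA_VISIBLE_DEVICES"] = "6,7"

import torch

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    # With the mask applied, cuda:0 and cuda:1 map to physical GPUs 6 and 7.
    print("cuda:%d ->" % i, torch.cuda.get_device_name(i))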
I have a Python application using TensorFlow, and I'm running it inside a Docker container. When running locally I see the memory usage stay well under 4GB of RAM, although some large files are being written and processed. When TensorFlow reaches the point of creating its first checkpoint file, I get the following exception:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
My model is complex, so this file may be above 1GB. My data is images, and I have already downloaded about 30GB of data just to begin running the model, so I don't know whether it is just chance that this keeps happening here or whether the file really is too large. I am only loading a small batch of images into memory per training epoch, so I am trying to keep RAM usage low. My VirtualBox config looks like so:
The error appears to come from C++, so I assume it originates inside TensorFlow. Has anyone seen anything like this, or does anyone know what I could change? I feel that there is enough RAM allocated, but maybe my disk access isn't configured correctly?
Most likely not enough RAM. 4GB is very small for training with TensorFlow (particularly in Python with a default install). Your model size may well be below 1GB, but during training, and especially when writing out a checkpoint, TensorFlow temporarily allocates additional memory for buffering, and that is probably where the allocation fails. If you normally run fine with 4GB until checkpointing, 8GB should be enough.
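If you want to confirm how much memory the container actually sees before raising the limit, a quick standard-library check like the following should work inside a Linux container (inside a Docker-Toolbox-style VirtualBox VM, this reflects the VM's memory):

# Quick check of the memory actually available inside the container.
def meminfo_gb(key):
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(key + ":"):
                return int(line.split()[1]) / (1024 * 1024)  # kB -> GB
    return None

print("MemTotal:     %.2f GB" % meminfo_gb("MemTotal"))
print("MemAvailable: %.2f GB" % meminfo_gb("MemAvailable"))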