I am starting to get into deep learning and I am trying out the example from Chapter 6 on neuralnetworksanddeeplearning.com. Theano tells me that it is using my GPU (a GTX 780). However, the GPU usage hovers at only around 40~50% and the clock speed is only ~800 MHz (the normal boost clock in games is ~1100 MHz).
Is this normal? Or is something wrong here?
It is normal. Actually, 40~50% should be considered high usage. Some operations, like vector concatenation, are performed on the CPU. The GPU has to wait for these operations to complete before it can use their results as input. There is also overhead from loading data from memory.
For this reason, people commonly run two or three programs on the same GPU to take full advantage of it.
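If you want to see which operations actually land on the CPU, one way is to inspect the compiled graph. This is only a minimal sketch adapted from the device test in the Theano documentation; the T.exp graph is just a stand-in for your network:

    import numpy as np
    import theano
    import theano.tensor as T

    # A tiny stand-in graph; with THEANO_FLAGS=device=gpu,floatX=float32
    # the shared variable lives on the GPU.
    x = theano.shared(np.random.rand(1000, 1000).astype(theano.config.floatX))
    f = theano.function([], T.exp(x))

    # Ops whose names start with "Gpu" run on the GPU; anything else
    # (e.g. some concatenations) falls back to the CPU and makes the GPU wait.
    for node in f.maker.fgraph.toposort():
        print(node.op)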
I have a python script that loops through a dataset of videos and applies a face and lip detector function to each video. The function returns a 3D numpy array of pixel data centered on the human lips in each frame.
The dataset is quite large (70 GB total, ~500,000 videos, each about 1 second in duration) and executing it on a normal CPU would take days. I have an Nvidia 2080 Ti that I would like to use to run the code. Is it possible to include some code that executes my entire script on the available GPU? Or am I oversimplifying a complex problem?
So far I have been trying to implement this using numba and pycuda and haven't made any progress, as the examples provided don't really fit my problem well.
Your first problem is actually getting your Python code to run on all CPU cores!
Python is not fast, and this is pretty much by design. More accurately, the design of Python emphasizes other qualities. Multi-threading is fairly hard in general, and Python can't make it easy due to those design constraints. A pity, because modern CPUs are highly parallel. In your case, there's a lucky escape: your problem is also highly parallel. You can simply divide those 500,000 videos over the CPU cores. Each core runs a copy of the Python script over its own share of the input. Even a quad-core would process 4 × 125,000 files using that strategy.
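A minimal sketch of that divide-and-conquer approach using the standard library's multiprocessing module; process_video, the video folder, and the worker count are placeholders for your own detector function and data:

    from multiprocessing import Pool
    from pathlib import Path

    def process_video(path):
        # Placeholder: run your face/lip detector on one video and save or
        # return the resulting 3D numpy array of lip-centred pixel data.
        ...

    if __name__ == "__main__":
        videos = sorted(Path("videos").glob("*.mp4"))  # hypothetical location
        # One worker per core; each core works through its own share of files.
        with Pool(processes=4) as pool:
            pool.map(process_video, videos)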
As for the GPU, that's not going to help much with Python code. Python simply doesn't know how to send data to the GPU, send commands to the GPU, or get results back. Some Python extensions can use the GPU, such as TensorFlow, but they use the GPU for their own internal purposes, not to run Python code.
I am trying to run a very resource-intensive Python program which processes text with NLP methods for several different classification tasks.
The program takes several days to run, so I am trying to allocate more resources to it. However, I don't really understand whether I did the right thing, because with my new allocation the Python code is not significantly faster.
Here is some information about my notebook:
I have a notebook running Windows 10 with an Intel Core i7 with 4 cores (8 logical processors) @ 2.5 GHz and 32 GB of physical memory.
What I did:
I changed some parameters in the vmoptions file, so that it looks like this now:
-Xms30g
-Xmx30g
-Xmn30g
-Xss128k
-XX:MaxPermSize=30g
-XX:ParallelGCThreads=20
-XX:ReservedCodeCacheSize=500m
-XX:+UseConcMarkSweepGC
-XX:SoftRefLRUPolicyMSPerMB=50
-ea
-Dsun.io.useCanonCaches=false
-Djava.net.preferIPv4Stack=true
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
My problem:
However, as I said, my code is not running significantly faster. On top of that, when I open the Task Manager I can see that PyCharm uses nearly 80% of the memory but 0% of the CPU, while Python uses 20% of the CPU and 0% of the memory.
My question:
What do I need to do so that my Python code runs faster?
Is it possible that I need to allocate more CPU to PyCharm or Python?
What is the connection between the memory allocated to PyCharm and the runtime of the Python interpreter?
Thank you very much =)
You cannot increase CPU usage manually. Note also that the vmoptions file only configures the JVM that runs the PyCharm IDE itself; it has no effect on the Python interpreter that executes your script, which is why PyCharm's memory usage went up but your code did not get any faster. Try one of these solutions:
Try to rewrite your algorithm to be multi-threaded or multi-process; then you can use more of your CPU. Note that not all programs can profit from multiple cores: a calculation done in steps, where each step depends on the results of the previous one, will not get faster on more cores. Problems that can be vectorized (applying the same calculation to large arrays of data) are relatively easy to spread over multiple cores, because the individual calculations are independent.
Use numpy. It is an extension written in C that can use optimized linear algebra libraries like ATLAS. It can speed up numerical calculations significantly compared to standard Python.
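As a rough illustration of the vectorization point (the array size and the operation are arbitrary), the loop below is executed by the interpreter one element at a time, while the numpy call does the same arithmetic in optimized C:

    import numpy as np

    data = np.random.rand(1_000_000)

    # Plain Python loop: every iteration goes through the interpreter.
    total_loop = 0.0
    for value in data:
        total_loop += value * value

    # Vectorized equivalent: one call into numpy's compiled code.
    total_vec = float(np.dot(data, data))

    print(total_loop, total_vec)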
You can adjust the number of CPU cores to be used by the IDE when running the active tasks (for example, indexing header files, updating symbols, and so on) in order to keep the performance properly balanced between AppCode and other applications running on your machine.
Use this link.
I have a deep learning model that is right at the edge of a memory allocation error (because of its weight matrices). I trimmed the model's complexity to the level where it works fine for my predictions (though it could be better), and it runs fine in RAM; however, when I switch Theano to use the GPU for much faster training (a GPU with 2 GB of GDDR5 VRAM), it throws an allocation error.
I searched a lot for how to share RAM with the GPU. Many people state that it is not possible (without references or explanation) and that even if you could, it would be slow. There are always one or two people on forums who claim it can be done (I checked the whole first page of Google results), but again this is very unreliable information with nothing to support it.
I understand the slowness argument, but is using GPU + RAM actually slower than using CPU + RAM for matrix-heavy computations in deep learning? Nobody ever mentions that. All the arguments I have read (buy a new card, lower your settings) were about gaming, where that advice makes sense to me because games aim for just-in-time frame rates rather than overall throughput.
My blind guess is that the bus connecting the GPU to system RAM is the narrowest pipe in the system (slower than RAM itself), so it makes more sense to use CPU + RAM (which have a really fast bus between them) than a faster GPU plus system RAM. Otherwise, it wouldn't make much sense.
Since you tagged your question with CUDA, I can give you the following answer.
With managed (unified) memory, a kernel can reference memory that may or may not currently reside in GPU memory. In this way, the GPU memory works somewhat like a cache, and you are not limited by the actual GPU memory size.
Since this is a software technique, I would say it is on-topic for SO.
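I am not aware of a Theano switch that enables this directly, but as an illustration of the idea in Python, CuPy can be told to allocate its arrays from CUDA managed (unified) memory. This is only a sketch: the array size is arbitrary, oversubscribing VRAM this way requires a Pascal-or-newer GPU, and it comes with a real performance cost.

    import cupy as cp

    # Route all CuPy allocations through cudaMallocManaged so data can spill
    # into host RAM and be migrated to the GPU on demand.
    pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
    cp.cuda.set_allocator(pool.malloc)

    # Roughly 3.6 GB of float32, more than a 2 GB card holds physically.
    w = cp.random.rand(30000, 30000, dtype=cp.float32)
    print(float(w.sum()))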
I have implemented a TensorFlow DNN model (2 hidden layers with ELU activations, trained on MNIST) as a Python class in order to wrap the TF calls within another library that has its own optimization routines and tools.
When running some tests on a Tesla K20 I noticed that the GPU was being used at 4% of its total capacity. I therefore looked a bit more closely at the log device placement and found that all the critical operations like MatMul, Sum, Add, Mean, etc. were being assigned to the CPU.
The first thing that came to mind was that it was because I was using dtype=float64, so I switched to dtype=float32. While a lot more operations were then assigned to the GPU, a good number were still assigned to the CPU, such as Mean, gradient/Mean_grad/Prod, and gradient/Mean.
So here comes my first question (I link a working code example at the end):
1) Why would that be? I have written different TF models that consist of simple tensor multiplications and reductions, and they run fully on the GPU as long as I use single precision.
And here comes the second question:
2) Why does TF assign the graph to different devices depending on the data type? I understand that not all kernels are implemented for the GPU, but I would have thought that things like MatMul could run on the GPU for both single and double precision.
3) Could the fact that the model is wrapped in a Python class have an effect? I do not think so, because, as I said, it did not happen for other models that were wrapped similarly but were simpler.
4) What sort of steps can I take to run the model fully on a GPU?
Here is a full working example of my code that I have isolated from my library: https://gist.github.com/smcantab/8ecb679150a327738102
If you run it and look at the output, you'll see how different parts of the graph have been assigned to different devices. To see how this changes with types and devices, change dtype and device within main() at the end of the example. Note that if I set allow_soft_placement=False, the graph fails to initialize.
Any word of advice would be really appreciated.
As Yaroslav noted, Mean in particular was not implemented for the GPU at the time, but it is now available, so these operations should run on the GPU with the latest TensorFlow (as per the DEVICE_GPU registration at that link).
Prior to the availability of Mean on the GPU, the status was:
(a) You can implement mean by hand, because reduce_sum is available on the GPU (see the sketch after this list).
(b) I've re-pinged someone to see if there's an easy way to add the GPU support, but we'll see.
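A minimal sketch of option (a) in TF 1.x-style graph code; the placeholder shape is just an example:

    import tensorflow as tf

    x = tf.placeholder(tf.float32, shape=[None, 784])  # hypothetical input
    # Hand-rolled mean using reduce_sum, which is available on the GPU.
    mean_by_hand = tf.reduce_sum(x) / tf.cast(tf.size(x), tf.float32)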
Regarding float64 on the GPU, someone opened an issue three days ago with a patch for supporting float64 reductions on the GPU; it is currently being reviewed and tested.
No, it doesn't matter if it's wrapped in Python - it's really just about whether a kernel has been defined for the op to execute on the GPU or not. In many cases, the answer to "why is X supported on the GPU but Y not?" comes down to whether or not there has been demand for Y to run on the GPU. The answer for float64 is simpler: float32 is a lot faster, so in most cases people work to make their models run in float32 when possible, because it gives all-around speed benefits.
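If it helps, here is a small TF 1.x-style sketch (the toy graph is just an illustration) of logging device placement while keeping everything in float32 so that as many ops as possible have GPU kernels:

    import tensorflow as tf

    # Toy float32 graph; MatMul and the reduction both have GPU kernels.
    a = tf.random_normal([1024, 1024], dtype=tf.float32)
    b = tf.random_normal([1024, 1024], dtype=tf.float32)
    loss = tf.reduce_mean(tf.matmul(a, b))

    # log_device_placement prints the device chosen for each op;
    # allow_soft_placement lets ops without a GPU kernel fall back to the CPU.
    config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
    with tf.Session(config=config) as sess:
        print(sess.run(loss))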
Most consumer graphics cards, like the GTX 980, 1080, etc., are stripped of most of their double-precision floating-point hardware units. Since these cards are much cheaper and therefore more ubiquitous than the newer Tesla units (which do have FP64 hardware), double-precision calculations on consumer cards are very slow compared to single precision. FP64 calculations on a GPU without FP64 hardware seem to be about 32× slower than FP32. I believe this is why FP32 calculations tend to be assigned to the GPU while FP64 goes to the CPU (which is faster for FP64 on most such systems). Hopefully, in the future, frameworks will test the GPU's capabilities at runtime to decide where to assign FP64 calculations.
I'm trying to execute the logistic_sgd.py code on an Amazon cluster running the ami-b141a2f5 (Theano - CUDA 7) image.
Instead of the included MNIST database I am using the SD19 database, which requires changing a few dimensional constants, but otherwise no code has been touched. The code runs fine locally on my CPU, but once I copy the code and data to the Amazon cluster over SSH and run it there, I get this output:
It looks to me like it is running out of VRAM, but it was my understanding that the code should already run on a GPU without any tinkering needed on my part. Even after following the suggestion in the error message, the error persists.
There's nothing especially strange here. The error message is almost certainly accurate: there really isn't enough VRAM. Often, a script will run fine on the CPU but then fail like this on the GPU, simply because there is usually much more system memory available than GPU memory, especially since system memory is virtualized (it can page out to disk if required) while GPU memory isn't.
For this script, there needs to be enough memory to store the training, validation, and testing data sets, the model parameters, and enough working space to store intermediate results of the computation. There are two options available:
1. Reduce the amount of memory needed for one or more of these three components. Reducing the amount of training data is usually easiest; reducing the size of the model comes next. Unfortunately, both of those options will often impair the quality of the result you are looking for. Reducing the memory needed for intermediate results is usually beyond the developer's control; it is managed by Theano, though there is sometimes scope for altering the computation to use less once you have a good understanding of Theano's internals.
2. If the model parameters and working memory can fit in GPU memory, then the most common solution is to change the code so that the data is no longer stored in GPU memory (i.e. just store it as numpy arrays, not as Theano shared variables) and to pass each batch of data in as inputs instead of givens, as sketched below. The LSTM sample code is an example of this approach.
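A minimal sketch of that change for a toy logistic-regression model; the shapes are MNIST-style placeholders rather than anything taken from logistic_sgd.py or SD19:

    import numpy as np
    import theano
    import theano.tensor as T

    # Data stays in host RAM as plain numpy arrays, not as shared variables.
    train_set_x = np.random.rand(1000, 784).astype('float32')
    train_set_y = np.random.randint(0, 10, size=1000).astype('int32')

    x = T.matrix('x')
    y = T.ivector('y')
    W = theano.shared(np.zeros((784, 10), dtype='float32'), name='W')
    b = theano.shared(np.zeros(10, dtype='float32'), name='b')

    p_y = T.nnet.softmax(T.dot(x, W) + b)
    cost = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])
    updates = [(p, p - 0.1 * T.grad(cost, p)) for p in (W, b)]

    # Each minibatch is passed in as an ordinary input, so only one batch at
    # a time has to be copied to (and fit in) GPU memory.
    train_model = theano.function(inputs=[x, y], outputs=cost, updates=updates)

    batch_size = 128
    for i in range(train_set_x.shape[0] // batch_size):
        xb = train_set_x[i * batch_size:(i + 1) * batch_size]
        yb = train_set_y[i * batch_size:(i + 1) * batch_size]
        train_model(xb, yb)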