Run out of VRAM using Theano on Amazon cluster

Run out of VRAM using Theano on Amazon cluster - python

I'm trying to execute the logistic_sgd.py code on an Amazon cluster running the ami-b141a2f5 (Theano - CUDA 7) image.
Instead of the included MNIST database I am using the SD19 database, which requires changing a few dimensional constants, but otherwise no code has been touched. The code runs fine locally, on my CPU, but once I SSH the code and data to the Amazon cluster and run it there, I get this output:
It looks to me like it is running out of VRAM, but it was my understanding that the code should run on a GPU already, without any tinkering on my part necessary. After following the suggestion from the error message, the error persists.

There's nothing especially strange here. The error message is almost certainly accurate: there really isn't enough VRAM. Often, a script will run fine on CPU but then fail like this on GPU simply because there is usually much more system memory available than GPU memory, especially since the system memory is virtualized (and can page out to disk if required) while the GPU memory isn't.
For this script, there needs to be enough memory to store the training, validation, and testing data sets, the model parameters, and enough working space to store intermediate results of the computation. There are two options available:
Reduce the amount of memory needed for one or more of these three components. Reducing the amount of training data is usually easiest; reducing the size of the model next. Unfortunately both of those two options will often impair the quality of the result that is being looked for. Reducing the amount of memory needed for intermediate results is usually beyond the developers control -- it is managed by Theano, but there is sometimes scope for altering the computation to achieve this goal once a good understanding of Theano's internals is achieved.
If the model parameters and working memory can fit in GPU memory then the most common solution is to change the code so that the data is no longer stored in GPU memory (i.e. just store it as numpy arrays, not as Theano shared variables) then pass each batch of data in as inputs instead of givens. The LSTM sample code is an example of this approach.

Related

Torch.cuda.empty_cache() replacement in case of CPU only enviroment

Currently, I am using PyTorch built with CPU only support. When I run inference, somehow information for that input file is stored in cache and memory keeps on increasing for every new unique file used for inference. On the other hand, memory usage does not increase if i use the same file again and again.
Is there a way to clear cache like cuda.empty_cache() in case of CPUs only.

Python/ Pycharm memory and CPU allocation for faster runtime?

I am trying to run a very capacity intensive python program which process text with NLP methods for conducting different classifications tasks.
The runtime of the programm takes several days, therefore, I am trying to allocate more capacity to the programm. However, I don't really understand if I did the right thing, because with my new allocation the python code is not significantly faster.
Here are some information about my notebook:
I have a notebook running windows 10 with a intel core i7 with 4 core (8 logical processors) # 2.5 GHZ and 32 gb physical memory.
What I did:
I changed some parameters in the vmoptions file, so that it looks like this now:
-Xms30g
-Xmx30g
-Xmn30g
-Xss128k
-XX:MaxPermSize=30g
-XX:ParallelGCThreads=20
-XX:ReservedCodeCacheSize=500m
-XX:+UseConcMarkSweepGC
-XX:SoftRefLRUPolicyMSPerMB=50
-ea
-Dsun.io.useCanonCaches=false
-Djava.net.preferIPv4Stack=true
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
My problem:
However, as I said my code is not running significantly faster. On top of that, if I am calling the taskmanager I can see that pycharm uses neraly 80% of the memory but 0% CPU, and python uses 20% of the CPU and 0% memory.
My question:
What do I need to do that the runtime of my python code gets faster?
Is it possible that i need to allocate more CPU to pycharm or python?
What is the connection beteen the allocation of memory to pycharm and the runtime of the python interpreter?
Thank you very much =)

You can not increase CPU usage manually. Try one of these solutions:
Try to rewrite your algorithm to be multi-threaded. then you can use
more of your CPU. Note that, not all programs can profit from
multiple cores. In these cases, calculation done in steps, where the
next step depends on the results of the previous step, will not be
faster using more cores. Problems that can be vectorized (applying
the same calculation to large arrays of data) can relatively easy be
made to use multiple cores because the individual calculations are
independent.
Use numpy. It is an extension written in C that can use optimized
linear algebra libraries like ATLAS. It can speed up numerical
calculations significantly compared to standard python.

You can adjust the number of CPU cores to be used by the IDE when running the active tasks (for example, indexing header files, updating symbols, and so on) in order to keep the performance properly balanced between AppCode and other applications running on your machine.
ues this link

Benchmark of HowTo: Reading Data

I'm using tensorflow 0.10 and I was benchmarking the examples found in the official HowTo on reading data. This HowTo illustrates different methods to move data to tensorflow, using the same MNIST example.
I was surprised by the results and I was wondering if anyone has enough low-level understanding to explain what is happening.
In the HowTo there are basically 3 methods to read in data:
Feeding: building the mini-batch in python and passing it with sess.run(..., feed_dict={x: mini_batch})
Reading from files: use tf operations to open the files and create mini-batches. (Bypass handling data in python.)
Preloaded data: load all the data in either a single tf variable or constant and use tf functions to break that up in mini-batches. The variable or constant is pinned to the cpu, not gpu.
The scripts I used to run my benchmarks are found within tensorflow:
Feeding: examples/tutorials/mnist/fully_connected_feed.py
Reading from files: examples/how_tos/reading_data/convert_to_records.py and examples/how_tos/reading_data/fully_connected_reader.py
Preloaded data (constant): examples/how_tos/reading_data/fully_connected_preloaded.py
Preloaded data (variable): examples/how_tos/reading_data/fully_connected_preloaded_var.py
I ran those scripts unmodified, except for the last two because they crash --for version 0.10 at least-- unless I add an extra sess.run(tf.initialize_local_variables()).
Main Question
The time to execute 100 mini-batches of 100 examples running on a GTX1060:
Feeding: ~0.001 s
Reading from files: ~0.010 s
Preloaded data (constant): ~0.010 s
Preloaded data (variable): ~0.010 s
Those results are quite surprising to me. I would have expected Feeding to be the slowest since it does almost everything in python, while the other methods use lower-level tensorflow/C++ to carry similar operations. It is the complete opposite of what I expected. Does anyone understand what is going on?
Secondary question
I have access to another machine which has a Titan X and older NVidia drivers. The relative results were roughly in line with the above, except for Preloaded data (constant) which was catastrophically slow, taking many seconds for a single mini-batch.
Is this some known issue that performance can vary greatly with hardware/drivers?

Update Oct 9 the slowness comes because the computation runs too fast for Python to pre-empt the computation thread and to schedule the pre-fetching threads. Computation in main thread takes 2ms and apparently that's too little for the pre-fetching thread to grab the GIL. Pre-fetching thread has larger delay and hence can always be pre-empted by computation thread. So the computation thread runs through all of the examples, and then spends most of the time blocked on GIL as some prefetching thread gets scheduled and enqueues a single example. The solution is to increase number of Python threads, increase queue size to fit the entire dataset, start queue runners, and then pause main thread for a couple of seconds to give queue runners to pre-populate the queue.
Old stuff
That's surprisingly slow.
This looks some kind of special cases making the last 3 examples unnecessarily slow (most effort went into optimizing large models like ImageNet, so MNIST didn't get as much attention).
You can diagnose the problems by getting timelines, as described here
Here are 3 of those examples with timeline collection enabled.
Here's the timeline for feed_dict implementation
The important thing to notice is that matmul takes a good chunk of the time, so the reading overhead is not significant
Now here's the timeline for reader implementation
You can see that operation is bottlenecked on QueueDequeueMany which takes whopping 45ms.
If you zoom in, you'll see a bunch of tiny MEMCPY and Cast operations, which is a sign of some op being CPU only (parse_single_example), and the dequeue having to schedule multiple independent CPU->GPU transfers
For the var example below with GPU disabled, I don't see tiny little ops, but QueueDequeueMany still takes over 10ms. The timing seems to scale linearly with batch size, so there's some fundamental slowness there. Filed #4740

Yaroslav nails the problem well. With small models you'll need to speed up the data import. One way to do this is with the Tensorflow function, tf.TFRecordReader.read_up_to, that reads multiple records in each session.run() call, thereby removing the excess overhead caused by multiple calls.
enqueue_many_size = SOME_ENQUEUE_MANY_SIZE
reader = tf.TFRecordReader(options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.ZLIB))
_, queue_batch = reader.read_up_to(filename_queue, enqueue_many_size)
batch_serialized_example = tf.train.shuffle_batch(
[queue_batch],
batch_size=batch_size,
num_threads=thread_number,
capacity=capacity,
min_after_dequeue=min_after_dequeue,
enqueue_many=True)
This was also addressed in this SO question.

The main question is that why the example with the preloaded data (constant)
examples/how_tos/reading_data/fully_connected_preloaded.py is significantly slower than other data loading example codes when using GPU.
I had the same problem, that fully_connected_preloaded.py is unexpectedly slow on my Titan X. The problem was that the whole dataset was pre-loaded on CPU, not GPU.
First, let me share my initial attempts. I applied the following performance tips by Yaroslav.
set capacity=55000 for tf.train.slice_input_producer.(55000 is the size of MNIST training set in my case)
set num_threads=5 for tf.train.batch.
set capacity=500 for tf.train.batch.
put time.sleep(10) after tf.train.start_queue_runners.
However, the average speed per each batch stays the same. I tried timeline visualization for profiling, and still got QueueDequeueManyV2 dominating.
The problem was the line 65 of fully_connected_preloaded.py. The following code loads entire dataset to CPU, still providing a bottleneck for CPU-GPU data transmission.
with tf.device('/cpu:0'):
input_images = tf.constant(data_sets.train.images)
input_labels = tf.constant(data_sets.train.labels)
Hence, I switched the device allocation.
with tf.device('/gpu:0')
Then I got x100 speed-up per each batch.
Note:
This was possible because Titan X has enough memory space to preload entire dataset.
In the original code(fully_connected_preloaded.py), the comment in the line 64 says "rest of pipeline is CPU-only". I am not sure about what this comment intended.

TensorFlow PoolAllocator huge number of requests

using Tensorflow r0.9/r.10 I get the following message, that makes me worried I've set my neural network model in the wrong way.
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 6206792 get requests, put_count=6206802 evicted_count=5000 eviction_rate=0.000805568 and unsatisfied allocation rate=0.000806536
The network I use is similar to AlexNet/VGG-M, I create the variables and the ops in a function called once, and then I just loop over multiple epochs calling the same omptimizer, loss and prediction function for each mini-batch iteration.
Another thing that makes me worried is that the network can be unstable when using a large batch size: it runs fine for few epochs, and then it goes out of memory (trying to allocate...).
Is there any way to check if there is something wrong and what it is?

This is an info-level log statement (the "I" prefix). It does not necessarily mean that anything is wrong: however, the pool allocator (a cache for allocations) is finding that it frequently has to fall back on the underlying allocator. This may indicate memory pressure.
For your instability problem: as you observe, large batches can lead to out-of-memory errors. There is some nondeterminism to operator scheduling, which is why you may not see it fail every time. Try lowering your batch size until you consistently no longer see out of memory errors.

PyCUDA/CUDA: Causes of non-deterministic launch failures?

Anyone following CUDA will probably have seen a few of my queries regarding a project I'm involved in, but for those who haven't I'll summarize. (Sorry for the long question in advance)
Three Kernels, One Generates a data set based on some input variables (deals with bit-combinations so can grow exponentially), another solves these generated linear systems, and another reduction kernel to get the final result out. These three kernels are ran over and over again as part of an optimisation algorithm for a particular system.
On my dev machine (Geforce 9800GT, running under CUDA 4.0) this works perfectly, all the time, no matter what I throw at it (up to a computational limit based on the stated exponential nature), but on a test machine (4xTesla S1070's, only one used, under CUDA 3.1) the exact same code (Python base, PyCUDA interface to CUDA kernels), produces the exact results for 'small' cases, but in mid-range cases, the solving stage fails on random iterations.
Previous problems I've had with this code have been to do with the numeric instability of the problem, and have been deterministic in nature (i.e fails at exactly the same stage every time), but this one is frankly pissing me off, as it will fail whenever it wants to.
As such, I don't have a reliable way to breaking the CUDA code out from the Python framework and doing proper debugging, and PyCUDA's debugger support is questionable to say the least.
I've checked the usual things like pre-kernel-invocation checking of free memory on the device, and occupation calculations say that the grid and block allocations are fine. I'm not doing any crazy 4.0 specific stuff, I'm freeing everything I allocate on the device at each iteration and I've fixed all the data types as being floats.
TL;DR, Has anyone come across any gotchas regarding CUDA 3.1 that I haven't seen in the release notes, or any issues with PyCUDA's autoinit memory management environment that would cause intermittent launch failures on repeated invocations?

Have you tried:
cuda-memcheck python yourapp.py
You likely have an out of bounds memory access.

You can use nVidia CUDA Profiler and see what gets executed before the failure.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.