Sharing RAM with GPU - python

I have a deep learning model that sits right at the edge of a memory allocation error (the weight matrices). I trimmed the model's complexity to the point where it works fine for my predictions (though it could be better) and fits in my RAM, but when I switch Theano to use the GPU for much faster training (a GPU with 2 GB of GDDR5 VRAM), it throws an allocation error.
I searched a lot for how to share RAM with the GPU. Many people state that it is not possible (without references or explanation) and that even if you could, it would be slow. There are always one or two people on the forums who claim it can be done (I checked the whole first page of Google results), but again it is very unreliable information with nothing to support it.
I understand the slowness argument, but is GPU + RAM actually slower than CPU + RAM for the matrix-heavy computations in deep learning? Nobody ever mentions that. All the arguments I've read (like "buy a new card" or "use lower settings") were about gaming, and there it makes sense to me, because you care about just-in-time performance rather than overall throughput.
My blind guess is that the bus connecting the GPU to RAM is the narrowest pipe in the system (slower than RAM itself), so it makes more sense to use CPU + RAM (which has a really fast bus) than the faster GPU (+ RAM). Otherwise it wouldn't make much sense.

Since you tagged your question with CUDA, I can give you the following answer.
With managed (unified) memory you can, from a kernel, reference memory that may or may not currently reside in GPU memory. In this way the GPU memory works somewhat like a cache, and you are not limited by the actual GPU memory size.
Since this is a software technique I would say this is on-topic for SO.
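As an illustration from Python (a hedged sketch, not part of the original answer): CuPy can be told to allocate its arrays with cudaMallocManaged, so arrays larger than the physical VRAM can still be created and the driver pages data between host and device on demand. This assumes cupy.cuda.malloc_managed is available in your CuPy version; verify before relying on it:

# Hedged sketch: use CUDA managed (unified) memory from Python via CuPy.
# Assumes cupy is installed and cupy.cuda.malloc_managed exists in your version.
import cupy as cp

# Route all CuPy allocations through cudaMallocManaged so arrays may exceed
# physical VRAM; the driver migrates pages as the GPU touches them.
managed_pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
cp.cuda.set_allocator(managed_pool.malloc)

# This array can be larger than a 2 GB card's VRAM, at a speed cost, since
# pages are shuttled over PCIe when the GPU accesses them.
weights = cp.zeros((20000, 20000), dtype=cp.float32)  # ~1.6 GB; scale up to test
result = weights @ weights[:, :100]                   # computed on the GPU
print(result.shape)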

Related

How to know if python process is garbage collecting too often?

I am tuning the performance of a Python job (that uses PyTorch and Nvidia CUDA). My Python process runs on a shared cluster where the maximum RAM needed must be specified explicitly. Since lower is better, we tend to request the lowest amount of RAM that doesn't cause an out-of-memory error.
In particular, I noticed that the RAM limit can make quite a difference in performance. For instance, if I set RAM=6GB, my job takes 27 hours. If I set max RAM=10GB (keeping all other variables the same), the same job takes about ~16 hours. Most of the work is done on the GPU using GPU RAM, so the CPU and CPU RAM are only used for housekeeping and moving tensors.
My suspicion is that the garbage collector runs too often when I set less RAM. I had observed this kind of behavior when I was dealing with the Java VM, and I had tools to inspect how much time my JVM process spends in the garbage collector.
However, with python, I am clueless.
Are there any ways to inspect the memory management, in particular, Garbage Collector time (fraction of runtime)?
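One hedged way to start measuring this, assuming CPython (this sketch is not from the original post): the standard gc module exposes gc.callbacks, which are invoked at the start and stop of each collection, so you can accumulate the time spent collecting:

# Hedged sketch: time CPython's cyclic garbage collector via gc.callbacks.
# Each callback receives (phase, info), where phase is "start" or "stop".
import gc
import time

_gc_start = None
gc_total_seconds = 0.0

def _gc_timer(phase, info):
    global _gc_start, gc_total_seconds
    if phase == "start":
        _gc_start = time.perf_counter()
    elif phase == "stop" and _gc_start is not None:
        gc_total_seconds += time.perf_counter() - _gc_start

gc.callbacks.append(_gc_timer)

# ... run the training job here, then report gc_total_seconds as a
# fraction of total wall-clock time ...

Bear in mind this only measures CPython's cyclic collector; reference counting, allocator pressure, and swapping will not show up here.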

Python/ Pycharm memory and CPU allocation for faster runtime?

I am trying to run a very capacity-intensive Python program which processes text with NLP methods for conducting different classification tasks.
The runtime of the program is several days, therefore I am trying to allocate more capacity to the program. However, I don't really understand if I did the right thing, because with my new allocation the Python code is not significantly faster.
Here is some information about my notebook:
I have a notebook running Windows 10 with an Intel Core i7 with 4 cores (8 logical processors) @ 2.5 GHz and 32 GB of physical memory.
What I did:
I changed some parameters in the vmoptions file, so that it looks like this now:
-Xms30g
-Xmx30g
-Xmn30g
-Xss128k
-XX:MaxPermSize=30g
-XX:ParallelGCThreads=20
-XX:ReservedCodeCacheSize=500m
-XX:+UseConcMarkSweepGC
-XX:SoftRefLRUPolicyMSPerMB=50
-ea
-Dsun.io.useCanonCaches=false
-Djava.net.preferIPv4Stack=true
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
My problem:
However, as I said, my code is not running significantly faster. On top of that, if I open the Task Manager I can see that PyCharm uses nearly 80% of the memory but 0% CPU, while Python uses 20% of the CPU and 0% memory.
My question:
What do I need to do so that my Python code runs faster?
Is it possible that I need to allocate more CPU to PyCharm or Python?
What is the connection between the allocation of memory to PyCharm and the runtime of the Python interpreter?
Thank you very much =)
You cannot increase CPU usage manually. Try one of these solutions:
Try to rewrite your algorithm to be multi-threaded; then you can use more of your CPU. Note that not all programs can profit from multiple cores: a calculation done in steps, where the next step depends on the results of the previous step, will not get faster with more cores. Problems that can be vectorized (applying the same calculation to large arrays of data) can relatively easily be made to use multiple cores, because the individual calculations are independent (a minimal sketch follows at the end of this answer).
Use numpy. It is an extension written in C that can use optimized linear algebra libraries like ATLAS. It can speed up numerical calculations significantly compared to standard Python.
You can adjust the number of CPU cores to be used by the IDE when running the active tasks (for example, indexing header files, updating symbols, and so on) in order to keep the performance properly balanced between AppCode and other applications running on your machine.
Use this link.
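To make the vectorization and multiple-cores point concrete, here is a hedged sketch (the work done in score_chunk is a made-up placeholder, not your NLP code), contrasting a numpy-vectorized computation with a multiprocessing.Pool that spreads independent chunks over the 8 logical processors. Note that the vmoptions file only configures PyCharm's own JVM; it has no effect on how fast the Python interpreter runs your code.

# Hedged sketch: two ways to use more of the CPU from plain Python.
import numpy as np
from multiprocessing import Pool

def score_chunk(chunk):
    # Placeholder for an independent piece of work, e.g. classifying a batch of texts.
    return sum(len(text) for text in chunk)

def main():
    # 1) Vectorize with numpy: one C-level call instead of a Python loop.
    data = np.random.rand(10_000_000)
    total = float(np.sum(data * data))

    # 2) Spread independent chunks over cores with multiprocessing.
    texts = ["some document"] * 100_000
    chunks = [texts[i:i + 10_000] for i in range(0, len(texts), 10_000)]
    with Pool(processes=8) as pool:          # 8 logical processors on this notebook
        results = pool.map(score_chunk, chunks)
    print(total, sum(results))

if __name__ == "__main__":                   # guard required for multiprocessing on Windows
    main()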

Theano only using 40% of my GPU

I am starting to get into deep learning and I am trying out the example from Chapter 6 on neuralnetworksanddeeplearning.com. Theano tells me that it is using my GPU (a GTX 780). However, GPU usage hovers at only around 40~50% and the clock speed is only ~800 MHz (the normal boost clock in games is ~1100 MHz).
Is this normal? Or is something wrong here?
It is normal. Actually, 40~50% should be considered high usage. Some operations, like vector concatenation, are performed on the CPU, and the GPU has to wait for these operations to complete before it can use the results as input. Besides that, overhead can be caused by loading data from memory.
So people commonly run 2~3 programs on the same GPU to take full advantage of it.

Run out of VRAM using Theano on Amazon cluster

I'm trying to execute the logistic_sgd.py code on an Amazon cluster running the ami-b141a2f5 (Theano - CUDA 7) image.
Instead of the included MNIST database I am using the SD19 database, which requires changing a few dimensional constants, but otherwise no code has been touched. The code runs fine locally on my CPU, but once I copy the code and data over SSH to the Amazon cluster and run it there, I get this output:
It looks to me like it is running out of VRAM, but it was my understanding that the code should run on a GPU already, without any tinkering on my part necessary. After following the suggestion from the error message, the error persists.
There's nothing especially strange here. The error message is almost certainly accurate: there really isn't enough VRAM. Often, a script will run fine on CPU but then fail like this on GPU simply because there is usually much more system memory available than GPU memory, especially since the system memory is virtualized (and can page out to disk if required) while the GPU memory isn't.
For this script, there needs to be enough memory to store the training, validation, and testing data sets, the model parameters, and enough working space to store intermediate results of the computation. There are two options available:
Reduce the amount of memory needed for one or more of these three components. Reducing the amount of training data is usually easiest; reducing the size of the model comes next. Unfortunately, both of those options will often impair the quality of the result you are looking for. Reducing the amount of memory needed for intermediate results is usually beyond the developer's control; it is managed by Theano, though there is sometimes scope for altering the computation to achieve this goal once you have a good understanding of Theano's internals.
If the model parameters and working memory can fit in GPU memory, then the most common solution is to change the code so that the data is no longer stored in GPU memory (i.e. just store it as numpy arrays, not as Theano shared variables) and then pass each batch of data in as inputs instead of givens. The LSTM sample code is an example of this approach; a minimal sketch of the pattern follows below.
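A hedged sketch of that pattern (the model here is a toy softmax classifier with made-up sizes, not the logistic_sgd.py code itself): the dataset stays in host RAM as numpy arrays, and each call to the compiled function copies only one minibatch to the GPU.

# Hedged sketch: pass minibatches as function inputs instead of keeping the
# whole dataset in GPU memory via shared variables and givens.
import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')                     # one minibatch of inputs
y = T.ivector('y')                    # corresponding labels
W = theano.shared(np.zeros((784, 10), dtype=theano.config.floatX), name='W')
b = theano.shared(np.zeros(10, dtype=theano.config.floatX), name='b')

p_y = T.nnet.softmax(T.dot(x, W) + b)
cost = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])
updates = [(p, p - 0.1 * T.grad(cost, p)) for p in (W, b)]

# No givens=...: only the parameters live on the GPU permanently.
train = theano.function([x, y], cost, updates=updates)

train_x = np.random.rand(50000, 784).astype(theano.config.floatX)  # stays in system RAM
train_y = np.random.randint(0, 10, size=50000).astype('int32')

batch = 500
for i in range(0, len(train_x), batch):
    train(train_x[i:i + batch], train_y[i:i + batch])  # only this slice is transferred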

what changes when your input is giga/terabyte sized?

I took my first baby step into real scientific computing today when I was shown a data set where the smallest file is 48000 fields by 1600 rows (haplotypes for several people, for chromosome 22). And this is considered tiny.
I write Python, so I've spent the last few hours reading about HDF5, and Numpy, and PyTables, but I still feel like I'm not really grokking what a terabyte-sized data set actually means for me as a programmer.
For example, someone pointed out that with larger data sets, it becomes impossible to read the whole thing into memory, not because the machine has insufficient RAM, but because the architecture has insufficient address space! It blew my mind.
What other assumptions have I been relying on in the classroom that just don't work with input this big? What kinds of things do I need to start doing or thinking about differently? (This doesn't have to be Python specific.)
I'm currently engaged in high-performance computing in a small corner of the oil industry and regularly work with datasets of the orders of magnitude you are concerned about. Here are some points to consider:
Databases don't have a lot of traction in this domain. Almost all our data is kept in files, some of those files are based on tape file formats designed in the 70s. I think that part of the reason for the non-use of databases is historic; 10, even 5, years ago I think that Oracle and its kin just weren't up to the task of managing single datasets of O(TB) let alone a database of 1000s of such datasets.
Another reason is a conceptual mismatch between the normalisation rules for effective database analysis and design and the nature of scientific data sets.
I think (though I'm not sure) that the performance reason(s) are much less persuasive today. And the concept-mismatch reason is probably also less pressing now that most of the major databases available can cope with spatial data sets which are generally a much closer conceptual fit to other scientific datasets. I have seen an increasing use of databases for storing meta-data, with some sort of reference, then, to the file(s) containing the sensor data.
However, I'd still be looking at, in fact am looking at, HDF5. It has a couple of attractions for me (a) it's just another file format so I don't have to install a DBMS and wrestle with its complexities, and (b) with the right hardware I can read/write an HDF5 file in parallel. (Yes, I know that I can read and write databases in parallel too).
Which takes me to the second point: when dealing with very large datasets you really need to be thinking of using parallel computation. I work mostly in Fortran, one of its strengths is its array syntax which fits very well onto a lot of scientific computing; another is the good support for parallelisation available. I believe that Python has all sorts of parallelisation support too so it's probably not a bad choice for you.
Sure you can add parallelism on to sequential systems, but it's much better to start out designing for parallelism. To take just one example: the best sequential algorithm for a problem is very often not the best candidate for parallelisation. You might be better off using a different algorithm, one which scales better on multiple processors. Which leads neatly to the next point.
I think also that you may have to come to terms with surrendering any attachments you have (if you have them) to lots of clever algorithms and data structures which work well when all your data is resident in memory. Very often, trying to adapt them to a situation where you can't get the data into memory all at once is much harder (and less performant) than brute force: regarding the entire file as one large array and streaming through it.
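To make the "treat the file as one big array and stream through it" idea concrete in Python, here is a hedged sketch using h5py (the file name, dataset name, and shape are made up):

# Hedged sketch: process an HDF5 dataset far larger than RAM in slabs,
# instead of adapting an all-in-memory algorithm.
import h5py
import numpy as np

row_sums = []
with h5py.File('haplotypes.h5', 'r') as f:          # made-up file name
    dset = f['calls']                                # made-up dataset, stays on disk
    chunk_rows = 4096
    for start in range(0, dset.shape[0], chunk_rows):
        block = dset[start:start + chunk_rows]       # only this slab is read into RAM
        row_sums.append(block.sum(axis=1))

row_sums = np.concatenate(row_sums)
print(row_sums.shape)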
Performance starts to matter in a serious way, both the execution performance of programs and developer performance. It's not that a 1 TB dataset requires 10 times as much code as a 1 GB dataset so you have to work faster; it's that some of the ideas you will need to implement will be crazily complex, and will probably have to be written by domain specialists, i.e. the scientists you are working with. Here the domain specialists write in Matlab.
But this is going on too long; I'd better get back to work.
In a nutshell, the main differences IMO:
You should know beforehand what your likely bottleneck will be (I/O or CPU) and focus on the best algorithm and infrastructure to address this. I/O quite frequently is the bottleneck.
Choice and fine-tuning of an algorithm often dominates any other choice made. Even modest changes to algorithms and access patterns can impact performance by orders of magnitude. You will be micro-optimizing a lot. The "best" solution will be system-dependent.
Talk to your colleagues and other scientists to profit from their experiences with these data sets. A lot of tricks cannot be found in textbooks.
Pre-computing and storing can be extremely successful.
Bandwidth and I/O
Initially, bandwidth and I/O are often the bottleneck. To give you some perspective: at the theoretical limit of SATA 3, it takes about 30 minutes to read 1 TB. If you need random access, need to read the data several times, or need to write, you want to do this in memory most of the time, or you need something substantially faster (e.g. iSCSI with InfiniBand). Your system should ideally be able to do parallel I/O to get as close as possible to the theoretical limit of whichever interface you are using. For example, simply accessing different files in parallel from different processes, or HDF5 on top of MPI-2 I/O, is pretty common. Ideally, you also do computation and I/O in parallel so that one of the two is "for free".
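A hedged sketch of the "different files in different processes" idea (the file names are placeholders):

# Hedged sketch: read several files in parallel processes so the combined
# throughput gets closer to what the storage interface can deliver.
import hashlib
from multiprocessing import Pool

def checksum(path):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(64 * 1024 * 1024), b''):  # 64 MB reads
            h.update(block)
    return path, h.hexdigest()

if __name__ == '__main__':
    paths = ['part0.dat', 'part1.dat', 'part2.dat', 'part3.dat']    # placeholder names
    with Pool(processes=4) as pool:
        for path, digest in pool.imap_unordered(checksum, paths):
            print(path, digest)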
Clusters
Depending on your case, either I/O or CPU might then be the bottleneck. No matter which one it is, huge performance increases can be achieved with clusters if you can distribute your tasks effectively (for example, with MapReduce). This might require totally different algorithms than the typical textbook examples. Spending development time here is often the best time spent.
Algorithms
In choosing between algorithms, big O of an algorithm is very important, but algorithms with similar big O can be dramatically different in performance depending on locality. The less local an algorithm is (i.e. the more cache misses and main memory misses), the worse the performance will be - access to storage is usually an order of magnitude slower than main memory. Classical examples for improvements would be tiling for matrix multiplications or loop interchange.
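As a small illustration of the tiling idea (a hedged sketch only; in practice you would call an optimized BLAS, which already blocks internally):

# Hedged sketch: blocked (tiled) matrix multiplication. Working on block-sized
# tiles keeps operands in cache; the same locality argument matters even more
# when the data lives on disk instead of in main memory.
import numpy as np

def blocked_matmul(A, B, block=256):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, block):
        for j in range(0, m, block):
            for p in range(0, k, block):
                C[i:i + block, j:j + block] += (
                    A[i:i + block, p:p + block] @ B[p:p + block, j:j + block]
                )
    return C

A = np.random.rand(1024, 1024)
B = np.random.rand(1024, 1024)
assert np.allclose(blocked_matmul(A, B), A @ B)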
Computer, Language, Specialized Tools
If your bottleneck is I/O, this means that algorithms for large data sets can benefit from more main memory (e.g. 64 bit) or programming languages / data structures with less memory consumption (e.g., in Python __slots__ might be useful), because more memory might mean less I/O per CPU time. BTW, systems with TBs of main memory are not unheard of (e.g. HP Superdomes).
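A hedged illustration of the __slots__ point (the exact savings depend on the Python version and the number of attributes):

# Hedged sketch: __slots__ removes the per-instance __dict__, which can shrink
# memory use noticeably when millions of small objects are held at once.
import sys

class Marker:                      # ordinary class: each instance carries a dict
    def __init__(self, pos, allele):
        self.pos = pos
        self.allele = allele

class SlimMarker:                  # __slots__ class: fixed attribute layout, no dict
    __slots__ = ('pos', 'allele')
    def __init__(self, pos, allele):
        self.pos = pos
        self.allele = allele

a, b = Marker(1, 'A'), SlimMarker(1, 'A')
print(sys.getsizeof(a) + sys.getsizeof(a.__dict__))  # plain object plus its dict
print(sys.getsizeof(b))                              # slotted object only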
Similarly, if your bottleneck is the CPU, faster machines, languages and compilers that allow you to use special features of an architecture (e.g. SIMD like SSE) might increase performance by an order of magnitude.
The way you find and access data and store meta-information can be very important for performance. You will often use flat files or domain-specific non-standard packages to store data (e.g. not a relational DB directly) that enable you to access data more efficiently. For example, kdb+ is a specialized database for large time series, and ROOT uses a TTree object to access data efficiently. The PyTables you mention would be another example.
While some languages have naturally lower memory overhead in their types than others, that really doesn't matter for data this size - you're not holding your entire data set in memory regardless of the language you're using, so the "expense" of Python is irrelevant here. As you pointed out, there simply isn't enough address space to even reference all this data, let alone hold onto it.
What this normally means is either a) storing your data in a database, or b) adding resources in the form of additional computers, thus adding to your available address space and memory. Realistically you're going to end up doing both of these things. One key thing to keep in mind when using a database is that a database isn't just a place to put your data while you're not using it - you can do WORK in the database, and you should try to do so. The database technology you use has a large impact on the kind of work you can do, but an SQL database, for example, is well suited to do a lot of set math and do it efficiently (of course, this means that schema design becomes a very important part of your overall architecture). Don't just suck data out and manipulate it only in memory - try to leverage the computational query capabilities of your database to do as much work as possible before you ever put the data in memory in your process.
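A hedged sketch of "do the work in the database" from Python (sqlite3 is used purely for illustration; the database file, table, and column names are made up):

# Hedged sketch: push aggregation into the database instead of pulling every
# row into Python memory first. All names below are illustrative.
import sqlite3

conn = sqlite3.connect('haplotypes.db')
cur = conn.cursor()

# Let the database scan, group, and aggregate; only the small result set
# (one row per chromosome) ever enters the Python process.
cur.execute("""
    SELECT chromosome, COUNT(*) AS n_markers, AVG(frequency) AS mean_freq
    FROM markers
    GROUP BY chromosome
""")
for chromosome, n_markers, mean_freq in cur.fetchall():
    print(chromosome, n_markers, mean_freq)

conn.close()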
The main assumptions are about the amount of CPU/cache/RAM/storage/bandwidth you can have in a single machine at an acceptable price. There are lots of answers here on Stack Overflow still based on the old assumptions of a 32-bit machine with 4 GB of RAM, about a terabyte of storage, and a 1 Gb network. With 16 GB DDR3 RAM modules at 220 EUR, machines with 512 GB of RAM and 48 cores can be built at reasonable prices. The switch from hard disks to SSDs is another important change.
