I am tuning the performance of a Python job (that uses PyTorch and NVIDIA CUDA). My Python process runs on a shared cluster where the maximum RAM needed must be specified explicitly. Since lower is better, we tend to request the smallest amount of RAM that doesn't cause an Out of Memory error.
In particular, I noticed that the RAM limit can make quite a difference to performance. For instance, if I set RAM=6GB, my job takes 27 hours; if I set max RAM=10GB (keeping all other variables the same), the same job takes about 16 hours. Most of the work is done on the GPU using GPU RAM, so the CPU and CPU RAM are only used for housekeeping and moving tensors.
My suspicion is that the garbage collector runs too often when I set less RAM. I observed this kind of behavior when I was working with the Java VM, and there I had tools to inspect how much time my JVM process spent in the garbage collector.
However, with Python, I am clueless.
Are there any ways to inspect Python's memory management, in particular the time spent in the garbage collector (as a fraction of runtime)?
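One starting point (a minimal sketch, assuming CPython 3.3+ where gc.callbacks is available) is to register a callback around each collection and accumulate the time spent in the cyclic garbage collector:

import gc
import time

_gc_start = None
gc_time_total = 0.0  # cumulative seconds spent inside the cyclic GC

def _gc_callback(phase, info):
    # CPython calls this at the start and end of every collection.
    global _gc_start, gc_time_total
    if phase == "start":
        _gc_start = time.perf_counter()
    elif phase == "stop" and _gc_start is not None:
        gc_time_total += time.perf_counter() - _gc_start
        _gc_start = None

gc.callbacks.append(_gc_callback)

run_start = time.perf_counter()
# ... your training / data-loading workload goes here; as a stand-in,
# create lots of cyclic garbage so that collections actually happen:
for _ in range(200_000):
    d = {}
    d["self"] = d  # reference cycle, only reclaimable by the cyclic GC
wall = time.perf_counter() - run_start

print(f"GC time: {gc_time_total:.3f}s of {wall:.3f}s "
      f"({gc_time_total / wall:.1%} of runtime)")

Note that this only measures CPython's cyclic collector; time lost to reference-count deallocations or to the OS swapping/paging under a tight memory limit would not show up here.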
Related
I am deploying an algorithm on an HPC system and want to tune the amount of resources I request to optimize the performance of the algorithm. I have come across several applications I can use to find the ideal number of CPUs (e.g. https://sebastianraschka.com/Articles/2014_multiprocessing.html), but would like to also tune the amount of memory I allocate. Are there any common benchmark functions or applications to test CPU and memory allocation? Ideally, the runtime of such a function would decrease as more resources are allocated, but should also reach a point of diminishing returns.
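I am not aware of a standard benchmark for this, but here is a minimal sketch of the kind of probe you could run yourself (the names cpu_task and bench_cpu are made up for illustration; standard library only). It times a fixed batch of CPU-bound work across different worker counts, which should show exactly this diminishing-returns curve:

import time
from multiprocessing import Pool

def cpu_task(n):
    # CPU-bound busy work: sum of squares up to n
    return sum(i * i for i in range(n))

def bench_cpu(workers, tasks=16, n=2_000_000):
    # Time `tasks` identical CPU-bound jobs spread over `workers` processes.
    start = time.perf_counter()
    with Pool(workers) as pool:
        pool.map(cpu_task, [n] * tasks)
    return time.perf_counter() - start

if __name__ == "__main__":
    for w in (1, 2, 4, 8, 16):
        print(f"{w:2d} workers: {bench_cpu(w):6.2f}s")

Memory is harder to probe this way, since runtime usually only degrades once the working set exceeds what was granted and the job starts swapping or gets killed.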
I have a Parquet file of around 10+ GB whose columns are mainly strings. When loading it into memory, the memory usage can peak at 110 GB, and after loading finishes it drops back to around 40 GB.
I'm working on a high-performance computer where memory is allocated per job, so I do have access to large amounts of memory. However, it seems wasteful that I have to request 128 GB just for loading the data, when 64 GB is sufficient afterwards. Also, the 128 GB allocations are more often unavailable.
My naive conjecture is that the Python interpreter treats the 512 GB of physical memory on the HPC node as the total available memory, so it does not run garbage collection as often as it actually needs to. For example, when I load the data with 64 GB of memory, it never throws a MemoryError; the kernel is simply killed and restarted.
I was wondering whether this excessive memory usage during loading is regular behavior for pyarrow, or whether it is due to the particular setup of my environment. If the latter, is it possible to somehow limit the available memory during loading?
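On the last point, one option on Linux (a sketch using the standard-library resource module; the 64 GB figure is just the example from the question) is to cap the process's address space so oversized allocations raise MemoryError instead of getting the process killed:

import resource

def limit_address_space(max_gb):
    # Cap this process's virtual address space (Linux/Unix only).
    max_bytes = int(max_gb * 1024 ** 3)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

limit_address_space(64)
# import pyarrow.parquet as pq
# table = pq.read_table("data.parquet")  # now fails with MemoryError past ~64 GB

Note that RLIMIT_AS limits virtual address space, which can be larger than resident memory for processes that use memory mapping.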
We fixed a memory use bug that's present in 0.14.0/0.14.1 (which is probably what you're using right now).
https://issues.apache.org/jira/browse/ARROW-6060
We are also introducing an option to read string columns as categorical (aka DictionaryArray in Arrow parlance), which will also reduce memory usage. See https://issues.apache.org/jira/browse/ARROW-3325 and the discussion in
https://ursalabs.org/blog/2019-06-07-monthly-report/
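For reference, a minimal sketch of what that looks like with pyarrow.parquet.read_table and its read_dictionary option (the file and column names are placeholders):

import pyarrow.parquet as pq

# Read the heavy string columns as DictionaryArray (pandas "category")
# instead of plain strings, which typically cuts memory use substantially.
table = pq.read_table(
    "data.parquet",
    read_dictionary=["col_a", "col_b"],  # placeholder column names
)
df = table.to_pandas()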
I am trying to run a very resource-intensive Python program which processes text with NLP methods to carry out different classification tasks.
The program takes several days to run, therefore I am trying to allocate more capacity to it. However, I don't really know whether I did the right thing, because with my new allocation the Python code is not significantly faster.
Here is some information about my notebook:
I have a notebook running Windows 10 with an Intel Core i7 with 4 cores (8 logical processors) @ 2.5 GHz and 32 GB of physical memory.
What I did:
I changed some parameters in the vmoptions file, so that it looks like this now:
-Xms30g
-Xmx30g
-Xmn30g
-Xss128k
-XX:MaxPermSize=30g
-XX:ParallelGCThreads=20
-XX:ReservedCodeCacheSize=500m
-XX:+UseConcMarkSweepGC
-XX:SoftRefLRUPolicyMSPerMB=50
-ea
-Dsun.io.useCanonCaches=false
-Djava.net.preferIPv4Stack=true
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
My problem:
However, as I said, my code is not running significantly faster. On top of that, when I open the Task Manager I can see that PyCharm uses nearly 80% of the memory but 0% CPU, while Python uses 20% of the CPU and 0% of the memory.
My question:
What do I need to do so that my Python code runs faster?
Is it possible that I need to allocate more CPU to PyCharm or Python?
What is the connection between the allocation of memory to PyCharm and the runtime of the Python interpreter?
Thank you very much =)
You cannot increase CPU usage manually. Try one of these solutions:
Try to rewrite your algorithm to be multi-threaded; then you can use more of your CPU. Note that not all programs can profit from multiple cores. In such cases, a calculation done in steps, where the next step depends on the results of the previous step, will not be faster with more cores. Problems that can be vectorized (applying the same calculation to large arrays of data) can relatively easily be made to use multiple cores, because the individual calculations are independent.
Use NumPy. It is an extension written in C that can use optimized linear algebra libraries like ATLAS. It can speed up numerical calculations significantly compared to standard Python.
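As a rough illustration of the NumPy point (a minimal sketch; the exact numbers will differ on your machine), compare a pure-Python loop with the vectorized equivalent:

import time
import numpy as np

n = 10_000_000

data = list(range(n))
start = time.perf_counter()
squared = [x * x for x in data]   # pure-Python loop, one interpreter step per element
py_time = time.perf_counter() - start

arr = np.arange(n)
start = time.perf_counter()
squared_np = arr * arr            # vectorized, runs in compiled C code
np_time = time.perf_counter() - start

print(f"pure Python: {py_time:.3f}s, NumPy: {np_time:.3f}s")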
You can adjust the number of CPU cores to be used by the IDE when running the active tasks (for example, indexing header files, updating symbols, and so on) in order to keep the performance properly balanced between AppCode and other applications running on your machine.
Use this link.
I have a deep learning model that is right at the edge of a memory allocation error (weight matrices). I trimmed the model's complexity to a level where it works fine for my predictions (although it could be better), and it works fine with my CPU RAM; however, when I switch Theano to use the GPU for much faster training (a GPU with 2 GB GDDR5 VRAM), it throws an allocation error.
I searched a lot for how to share RAM with the GPU, and many people state that it is not possible (without references or explanation) and that even if you could, it would be slow. There are always one or two people on the forums who state it can be done (I checked the whole first page of Google results), but again it is very unreliable information without anything to support it.
I understand the slowness argument, but is using GPU + RAM actually slower than using CPU + RAM for matrix-heavy computations in deep learning? Nobody ever mentions that. All the arguments I've read (like buy a new card, or lower your settings) were about gaming, and that makes sense to me, since there you aim for better real-time performance rather than overall speed.
My blind guess is that the bus connecting the GPU to RAM is the narrowest pipe in the system (slower than RAM), so it makes more sense to use CPU + RAM (which has a really fast bus) than the faster GPU (+ RAM). Otherwise, it wouldn't make much sense.
Since you tagged your question with CUDA, I can give you the following answer.
With managed memory you can, from a kernel, reference memory that may or may not currently reside in GPU memory. In this way, the GPU memory works somewhat like a cache, and you are not limited by the actual GPU memory size.
Since this is a software technique, I would say this is on-topic for SO.
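The question is about Theano, but as an illustration of the managed-memory idea from Python, here is a hedged sketch using CuPy (an assumption, not something the question uses), which lets you route allocations through CUDA unified/managed memory so arrays can exceed physical VRAM:

import cupy as cp

# Route all CuPy allocations through cudaMallocManaged (unified memory).
# Pages migrate between host RAM and GPU memory on demand, so the array
# below can be larger than a 2 GB card, at the cost of transfer speed.
cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)

x = cp.zeros((30000, 30000), dtype=cp.float32)  # ~3.6 GB
y = x + 1.0                                     # computed on the GPU
cp.cuda.Stream.null.synchronize()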
To allocate space for a variable in GPU memory, there must be a large enough contiguous memory region. In other words, unlike with RAM, you cannot have fragmented memory regions allocated to a single variable on the GPU. Having different shared variables stored in GPU memory and continuously updating them could cause memory fragmentation. Therefore, even if there is enough free memory (in terms of bytes) on the GPU, you may not be able to use it, because it is not in one contiguous block.
My question is: how does Theano deal with this problem?
Does shared_var.set_value([]) release all the memory assigned to that shared variable, so that the next update (shared_var.set_value(newDataPoints)) will only allocate the memory actually needed, and thereby avoid memory fragmentation?
Here it is explained that updating a shared variable may still cause memory fragmentation. So I wonder whether the problem persists if the parameter borrow or allow_gc (in theanorc) is set to True?
How can one keep track of the amount of free memory in a (contiguous) block on a GPU?
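Not Theano-specific, but as a sketch of the last point, free GPU memory can be queried from Python with pynvml (an assumption; it would need to be installed separately, and it reports total free bytes, not the largest contiguous block):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"free:  {info.free / 1024**2:.0f} MiB")
print(f"used:  {info.used / 1024**2:.0f} MiB")
print(f"total: {info.total / 1024**2:.0f} MiB")
pynvml.nvmlShutdown()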