I am trying to run a very resource-intensive Python program that processes text with NLP methods for several different classification tasks.
The program takes several days to run, so I am trying to allocate more resources to it. However, I don't really understand whether I did the right thing, because with my new allocation the Python code is not significantly faster.
Here is some information about my notebook:
It runs Windows 10 on an Intel Core i7 with 4 cores (8 logical processors) at 2.5 GHz and has 32 GB of physical memory.
What I did:
I changed some parameters in PyCharm's vmoptions file, so that it now looks like this:
-Xms30g
-Xmx30g
-Xmn30g
-Xss128k
-XX:MaxPermSize=30g
-XX:ParallelGCThreads=20
-XX:ReservedCodeCacheSize=500m
-XX:+UseConcMarkSweepGC
-XX:SoftRefLRUPolicyMSPerMB=50
-ea
-Dsun.io.useCanonCaches=false
-Djava.net.preferIPv4Stack=true
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
My problem:
However, as I said, my code is not running significantly faster. On top of that, when I open the Task Manager I can see that PyCharm uses nearly 80% of the memory but 0% of the CPU, while Python uses 20% of the CPU and 0% of the memory.
My question:
What do I need to do to make my Python code run faster?
Is it possible that I need to allocate more CPU to PyCharm or Python?
What is the connection between the memory allocated to PyCharm and the runtime of the Python interpreter?
Thank you very much =)
You cannot increase CPU usage manually. Also note that the vmoptions file only configures the JVM that runs the PyCharm IDE itself; it has no effect on the separate Python process that executes your code. Try one of these solutions:
Try to rewrite your algorithm to run on multiple cores (in CPython this usually means multiple processes rather than threads, because the GIL prevents threads from running Python bytecode in parallel); then you can use more of your CPU. Note that not all programs can profit from multiple cores. A calculation done in steps, where the next step depends on the results of the previous step, will not get faster with more cores. Problems that can be vectorized (applying the same calculation to large arrays of data) can relatively easily be made to use multiple cores, because the individual calculations are independent.
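For example, if each document can be classified independently, the work can be spread over worker processes with the multiprocessing module. A minimal sketch, assuming independent per-document work; classify_document and the data are hypothetical placeholders, not your code:

    # Hypothetical sketch: spread independent classification work over CPU cores.
    from multiprocessing import Pool

    def classify_document(text):
        # placeholder for the real NLP classification of one document
        return len(text.split())

    if __name__ == "__main__":
        documents = ["first text ...", "second text ...", "third text ..."]
        with Pool(processes=4) as pool:  # roughly one worker per physical core
            results = pool.map(classify_document, documents)
        print(results)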
Use NumPy. It is an extension written in C that can use optimized linear algebra libraries like ATLAS. It can speed up numerical calculations significantly compared to standard Python.
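A minimal sketch of the difference, with made-up data:

    # Hypothetical example: the same calculation as a Python loop and vectorized.
    import numpy as np

    values = np.random.rand(1_000_000)

    squared_loop = [v * v for v in values]   # plain Python loop
    squared_np = values * values             # one vectorized call, runs in optimized C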
You can also adjust the number of CPU cores used by the IDE when running background tasks (for example, indexing, updating symbols, and so on) in order to keep the performance properly balanced between the IDE and the other applications running on your machine.
Use this link.
I am looking to do something very basic. I have a piece of code that I did not write which performs some processing that takes approximately 10 minutes to run on a single data set. I have 50,000 data sets, so I would like to utilize many GPUs to run this in parallel. I am familiar with how to do this on CPUs, but I do not know how to do it on GPUs. I see many examples of how to increase the speed of certain function calls with GPUs via numba, although I cannot find how I would run a for loop on a GPU. Is this possible? In essence, I have 50,000 image names, and I want a loop that reads through all the images, performs the processing, and then writes the extracted information to a .csv file.
I'm participating in the Supercomputer Challenge.
From my experience, it's a complicated job to boost CPU code with a GPU.
But there are some Python projects/libraries that may help you:
CuPy: makes it easy to convert NumPy code to CUDA code
Numba: the JIT compiler you mentioned above
PyCUDA: run C CUDA code from Python
RAPIDS: the cuXX libraries (cuDF, cuGraph, and so on) developed by Nvidia
Easy -> hard: CuPy/RAPIDS > Numba > PyCUDA
In summary, you should study CuPy if you are using NumPy, or try to find similar GPU-accelerated methods in the RAPIDS libraries (e.g. cuGraph). PyCUDA is the most difficult option for this case.
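Since CuPy mirrors much of the NumPy API, porting array code is often just a matter of moving the data to the GPU. A minimal sketch, assuming a CUDA-capable GPU and the cupy package are installed (the array shapes are arbitrary):

    # Hypothetical sketch: a NumPy-style computation executed on the GPU via CuPy.
    import numpy as np
    import cupy as cp

    x_cpu = np.random.rand(2000, 2000).astype(np.float32)

    x_gpu = cp.asarray(x_cpu)        # copy the array to GPU memory
    y_gpu = cp.dot(x_gpu, x_gpu.T)   # the matrix product runs on the GPU
    y_cpu = cp.asnumpy(y_gpu)        # copy the result back to host memory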
Just some suggestions. Speed up!
I have a Python script that loops through a dataset of videos and applies a face and lip detector function to each video. The function returns a 3D NumPy array of pixel data centered on the human lips in each frame.
The dataset is quite large (70 GB total, ~500,000 videos, each about 1 second in duration), and executing it on a normal CPU would take days. I have an Nvidia 2080 Ti that I would like to use to execute the code. Is it possible to include some code that executes my entire script on the available GPU? Or am I oversimplifying a complex problem?
So far I have been trying to implement this using numba and pycuda and haven't made any progress, as the examples provided don't really fit my problem well.
Your first problem is actually getting your Python code to run on all CPU cores!
Python is not fast, and this is pretty much by design. More accurately, the design of Python emphasizes other qualities. Multi-threading is fairly hard in general, and Python can't make it easy due to those design constraints. A pity, because modern CPUs are highly parallel. In your case there's a lucky escape: your problem is also highly parallel. You can just divide those 500,000 videos over the CPU cores. Each core runs a copy of the Python script over its own input. Even a quad-core would process 4 x 125,000 files using that strategy.
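A minimal sketch of that strategy, assuming the videos can be processed independently; extract_lips and the file names are hypothetical stand-ins for the real detector and dataset:

    # Hypothetical sketch: one pool of worker processes, one video per task.
    from multiprocessing import Pool, cpu_count

    def extract_lips(video_path):
        # placeholder: run the face/lip detector on one video and save its result
        return video_path

    if __name__ == "__main__":
        video_paths = ["clip_000001.mp4", "clip_000002.mp4"]  # ... the ~500,000 files
        with Pool(processes=cpu_count()) as pool:
            for _ in pool.imap_unordered(extract_lips, video_paths, chunksize=64):
                pass  # results could also be written to disk inside extract_lips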
As for the GPU, that's not going to help much with plain Python code. Python simply doesn't know how to send data to the GPU, send commands to it, or get results back. Some Python extensions can use the GPU, such as TensorFlow, but they use the GPU for their own internal purposes, not to run Python code.
I have a Xeon workstation with 16 cores and 32 threads. On that machine I run a large set of simulations in parallel, split across 32 processes that each run 68,750 simulations (2.2 million in total) in a loop.
Data is written to an SSD, and a Python process (a while loop with a waiting time) continuously gathers the output, consolidates it, and stores it away on another, regular hard disk.
Now, when I start everything, and for about the first day, a single simulation only takes a few seconds (all simulations have very similar complexity/load). But then the time a single simulation takes gets longer and longer, up to a hundredfold after about a week. The CPU temperatures are fine, the disk temperatures are fine, etc.
However, whereas at the beginning Windows uses all CPU power at 100%, it throttles the power a lot once things get slow (I have a tool running that shows the load of all 32 threads). I tried hibernating and restarting, just to check whether it is something with the hardware, but this does not change it, or not by much.
Does anybody have an explanation? Is there a tweak to apply to Windows to change this behaviour? I can of course use some workarounds, like splitting the simulations further apart and restarting the machine completely every now and then to start a new set of simulations. But I am interested in experience with, and hopefully solutions to, this problem instead.
On comments 1 and 2: Yes, there are no leaks (I tested the C++ code with Valgrind) and the usage is stable. The system has 192 GB of RAM available and only uses a small proportion of it (the 32 simulations each take about 190 MB), and in the C++ everything is closed. For Python I assume the same (leaving a scope there closes handles, if I am not mistaken). However, closing the Python program doesn't change a thing.
Thanks!
Frederik
I'm trying to execute the logistic_sgd.py code on an Amazon cluster running the ami-b141a2f5 (Theano - CUDA 7) image.
Instead of the included MNIST database I am using the SD19 database, which requires changing a few dimensional constants, but otherwise no code has been touched. The code runs fine locally on my CPU, but once I copy the code and data to the Amazon cluster over SSH and run it there, I get this output:
It looks to me like it is running out of VRAM, but it was my understanding that the code should run on a GPU already, without any tinkering necessary on my part. After following the suggestion from the error message, the error persists.
There's nothing especially strange here. The error message is almost certainly accurate: there really isn't enough VRAM. Often, a script will run fine on CPU but then fail like this on GPU simply because there is usually much more system memory available than GPU memory, especially since the system memory is virtualized (and can page out to disk if required) while the GPU memory isn't.
For this script, there needs to be enough memory to store the training, validation, and testing data sets, the model parameters, and enough working space to store intermediate results of the computation. There are two options available:
Reduce the amount of memory needed for one or more of these three components. Reducing the amount of training data is usually easiest; reducing the size of the model comes next. Unfortunately, both of those options will often impair the quality of the result that is being sought. Reducing the amount of memory needed for intermediate results is usually beyond the developer's control; it is managed by Theano, but there is sometimes scope for altering the computation to achieve this goal once a good understanding of Theano's internals is achieved.
If the model parameters and working memory can fit in GPU memory, then the most common solution is to change the code so that the data is no longer stored in GPU memory (i.e. just store it as NumPy arrays, not as Theano shared variables) and then pass each batch of data in as inputs instead of givens. The LSTM sample code is an example of this approach.
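A much simplified sketch of the "inputs instead of givens" idea, not the full logistic_sgd.py, with random placeholder data: the data stays in host memory as NumPy arrays, and each batch is passed to the compiled Theano function call by call, so only one batch at a time needs to fit in GPU memory.

    # Simplified illustration only; shapes and data are placeholders.
    import numpy as np
    import theano
    import theano.tensor as T

    x = T.matrix('x')
    y = T.ivector('y')
    W = theano.shared(np.zeros((784, 10), dtype=theano.config.floatX), name='W')
    b = theano.shared(np.zeros(10, dtype=theano.config.floatX), name='b')

    p_y = T.nnet.softmax(T.dot(x, W) + b)
    cost = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])
    updates = [(p, p - 0.1 * T.grad(cost, p)) for p in (W, b)]

    # the batch is declared as an input, so it is not kept in GPU memory between calls
    train_model = theano.function(inputs=[x, y], outputs=cost, updates=updates)

    train_x = np.random.rand(1000, 784).astype(theano.config.floatX)
    train_y = np.random.randint(0, 10, size=1000).astype('int32')
    batch_size = 100
    for start in range(0, len(train_x), batch_size):
        train_model(train_x[start:start + batch_size], train_y[start:start + batch_size])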
I've noticed some oddities in the memory usage of my program when run under PyPy versus CPython. Under PyPy the program not only uses a substantially larger initial amount of memory than under CPython, but this memory usage also increases over time quite dramatically. At the end of the program, PyPy is using around 170 MB, compared to 14 MB when run under CPython.
I found a user with the exact same problem, albeit on a smaller scale, but the solutions that worked for him provided only minor help for my program: "pypy memory usage grows forever?"
The two things I tried were setting the environment variables PYPY_GC_MAX to 100MB and PYPY_GC_GROWTH to 1.1, and manually calling gc.collect() at each generation.
I'm determining the memory usage with
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss/1000
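For reference, a sketch of those two measures (the function names are illustrative, and the environment variables are set before launching PyPy, e.g. PYPY_GC_MAX=100MB PYPY_GC_GROWTH=1.1 pypy run_gp.py):

    # Illustrative only: force a collection after each generation and log memory.
    import gc
    import resource

    def evolve_population():
        pass  # placeholder for one generation of the genetic-programming run

    def memory_mb():
        # ru_maxrss is reported in kilobytes on Linux
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000

    for generation in range(100):
        evolve_population()
        gc.collect()  # explicit collection after each generation
        print(generation, memory_mb(), "MB")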
Here's the runtime and memory usage under different conditions:
Version: time taken, memory used at end of run
PyPy 2.5.0: 100s, 173MB
PyPy with PYPY_GC_MAX = 100MB and PYPY_GC_GROWTH = 1.1: 102s, 178MB
PyPy with gc.collect(): 108s, 131MB
Python 2.7.3: 167s, 14MB
As you can see, the program is much quicker under PyPy than under CPython, which is why I moved to it in the first place, but at the cost of a roughly 10-fold increase in memory.
The program is an implementation of genetic programming, where I'm evolving an arithmetic binary tree over 100 generations, with 200 individuals in the population. Each node in the tree has a reference to its 2 children, and these trees can grow in size, although for this experiment they stay relatively stable. Depending on the application, this program can run for anywhere from 10 minutes up to a few hours, but for the results here I've used a smaller dataset to highlight the issue.
Does anyone have any idea a) what could be causing this, and b) if it's possible to limit the memory usage to somewhat more respectable levels?
PyPy is known to use more baseline memory than CPython, and this number is known to increase over time as the JIT compiles more and more machine code. It does (or at least should) converge: the memory usage should increase as your program runs, but only up to a maximum. You should get roughly the same usage after running for 10 minutes as after several hours.
We can discuss endlessly whether 170 MB is too much for a "baseline". What I can tell you is that a program that uses several GB of memory on CPython does not use significantly more on PyPy; that's our goal and our experience so far, but please report it as a bug if your experience is different.