Python consumes all RAM, process killed

I'm running Linux Mint via VirtualBox, and the Python code I'm using iterates over a large data set to produce plots. The first couple of times I tried to run it, part way through the process it stopped with a message simply saying "Killed".
A bit of research suggests this is most likely due to a lack of RAM for the process. When I repeated the run while monitoring the system resource usage (using top -s), sure enough I could watch the RAM usage climbing at a fairly constant rate as the program ran. I'm giving the VirtualBox VM all the RAM my system can afford (just over 2 GB), but it doesn't seem to be enough for the iterations I'm doing. The code looks like this:
import os

for file in os.listdir('folder/'):
    calledfunction('folder/' + file, 'output/' + file)
calledfunction produces a PNG image, and each call adds about 50 MB of RAM usage; I want to do this about 40 times.
So, my question is: can I use a function to prevent the build-up of RAM usage, or clear the RAM after each iteration? I've seen people talking about garbage collection, but I'm not really sure how to use it, or where I can/should put it in my loop. Any pro tips?
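For reference, something like this is what I have in mind; the gc call placement is a guess, and the plotting stub below is just a stand-in for my real calledfunction:

import gc
import os

import matplotlib
matplotlib.use('Agg')  # render off-screen, no GUI backend needed
import matplotlib.pyplot as plt

def calledfunction(in_path, out_path):
    # hypothetical stand-in for the real plotting function
    fig, ax = plt.subplots()
    # ... read in_path and draw the plot onto ax ...
    fig.savefig(out_path + '.png')
    plt.close(fig)  # releasing the figure is often what actually frees the RAM

for name in os.listdir('folder/'):
    calledfunction('folder/' + name, 'output/' + name)
    gc.collect()  # explicitly run the garbage collector after each iteration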
Thanks

Related

Is there a way to reduce resource usage when reading and writing large dataframes with polars?

For my specific problem I have been converting ".csv" files to ".parquet" files. The CSV files on disk are about 10-20 GB each.
A while back I was converting ".SAS7BDAT" files of similar size to ".parquet" files of similar data, but now I receive the data as CSVs, so this might not be a good comparison. For those I used the pyreadstat library to read the files in (with multi-threading enabled via its parameter, which didn't make a difference for some reason) and pandas to write. It was also a tiny bit faster, but I suspect the code ran on a single thread, and it took a week to convert all my data.
This time I tried the polars library and it was blazing fast. CPU usage was near 100% and memory usage was also quite high. I tested it on a single file that would have taken hours before, and it completed in minutes. The problem is that it uses too much of my computer's resources and my PC stalls; VS Code has crashed on some occasions. I have tried passing in the low-memory parameter, but it still uses a lot of resources. My suspicion is the "reader.next_batches(500)" call, but I don't know for sure.
Regardless, is there a way to limit the CPU and memory usage while running this operation, so I can at least browse the internet or listen to music while it runs in the background? With pandas the process is too slow; with polars the process is fast but my PC becomes unusable at times. See image for the code I used.
Thanks.
I tried the low-memory parameter with polars but memory usage was still quite high. I was expecting to at least be able to use my PC while this worked in the background. My hope is to use 50-80% of my PC's resources, so that enough is left free for other work while the files are being converted.
I see you're on Windows, so convert your notebook into a .py script and then run it from the command line with:
start /low python yourscript.py
And/or use Task Manager to lower the priority of your Python process once it's running.
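If you would rather cap things from inside Python instead, here is a rough sketch, assuming the conversion can go through polars' streaming engine (the file names are placeholders): limit the thread pool before importing polars and stream the CSV straight to parquet.

import os
os.environ["POLARS_MAX_THREADS"] = "4"  # must be set before polars is imported

import polars as pl

# Stream the CSV straight to parquet so the whole file is never held in memory at once.
pl.scan_csv("big_input.csv").sink_parquet("big_output.parquet")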

spaCy: using nlp.pipe on a large dataset in Python, multiprocessing leads to processes going into sleep state. How to properly use all CPU cores?

I'm working on an NLP classification problem over a large database of emails (~1 million). I need to use spaCy to parse the texts, and I'm using the nlp.pipe() method as nlp.pipe(emails, n_process=CPU_CORES, batch_size=20) to loop over the dataset.
The code works, but I'm facing a (maybe not so) weird behavior:
the processes are created, but they are all in the SLEEP state except one; occasionally some of them go into the RUN state for a few seconds and then go back to sleep. So I end up with a single process using one core at 100%, and of course the script is not using all the CPU cores.
It's like the processes don't get "fed" input data from pipe.
Does anybody know how to use spaCy's nlp.pipe properly, or at least how to avoid this situation? Is there no way to use nlp.pipe with the GPU?
Thank you very much!
Sandro
EDIT: I still have no solution, but I've noticed that if I set batch_size=divmod(len(emails), CPU_CORES), the processes all start running at 100% CPU and after a few seconds they all switch to the sleeping state except one. It really looks like some element in the spaCy pipe gets locked while waiting for something to finish... any idea?
EDIT2: Setting batch_size=divmod(len(emails),CPU_CORES) while processing a large dataset leads inevitably to a spacy memory error:
MemoryError: Unable to allocate array with shape (1232821, 288) and data type float32
This is maybe not so weird, since my machine has 10 GB of RAM and (1232821 × 288 × 32) bits / 8 ≈ 1.4 GB, which multiplied by 6 (CPU_CORES) comes to about 8.5 GB of RAM needed. So, with other stuff already in memory, I guess it is plausible.
I've found that using n_process=n works well for some models, like en_core_web_lg, but fails with others, such as en_core_web_trf.
For whatever reason, en_core_web_trf seems to use all cores with just a batch_size specified, whereas en_core_web_lg uses just one unless n_process=n is given. On the other hand, en_core_web_trf fails with a closure error if n_process=n is specified.
OK, I think I found an improvement, although honestly the behavior is still not really clear to me. Now there are far fewer sleeping processes, with most of them running steadily and only a few sleeping or switching between the two states.
What I did was clean up and speed up all the code inside the for loop and set the nlp.pipe arguments like this:
for e in nlp.pipe(emails, n_process=CPU_CORES - 1, batch_size=200):
If anybody has an explanation for this, or any suggestion on how to improve it even further, it is of course more than welcome :)
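For reference, a minimal, self-contained version of that setup (the model name, the disabled components and the emails list below are placeholders):

import multiprocessing

import spacy

# Assumption: en_core_web_lg is installed; disable pipeline components you don't need.
nlp = spacy.load("en_core_web_lg", disable=["ner"])

emails = ["first email body ...", "second email body ..."]  # placeholder data
CPU_CORES = multiprocessing.cpu_count()

# A moderate batch_size keeps every worker process fed; one huge batch serialises the work.
for doc in nlp.pipe(emails, n_process=CPU_CORES - 1, batch_size=200):
    lemmas = [t.lemma_ for t in doc]  # placeholder per-document work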

Is it possible in Python to pre-allocate heap to fail-fast if memory is unavailable?

I am running a python program that processes a large dataset. Sometimes, it runs into a MemoryError when the machine runs out of memory.
I would like any MemoryError that is going to occur to happen at the start of execution, not in the middle. That is, the program should fail-fast: if the machine will not have enough memory to run to completion, the program should fail as soon as possible.
Is it possible for Python to pre-allocate space on the heap?
Is it possible to allocate heap memory at the start of a Python process?
Python uses as much memory as it needs, so if your program is running out of memory, it would still run out even if there were a way to allocate the memory at the start.
One option is to add swap space to increase your total available memory, although performance will be very poor in many scenarios.
The best solution, if possible, is to change the program to process data in chunks instead of loading it entirely.
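As an illustration of chunked processing, here is a sketch with pandas, assuming the data comes from a CSV and each chunk can be handled independently (the file and column names are placeholders):

import pandas as pd

total = 0
# Read 100,000 rows at a time instead of loading the whole file.
for chunk in pd.read_csv("large_dataset.csv", chunksize=100_000):
    total += chunk["value"].sum()  # placeholder per-chunk work
print(total)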

dask execution gets stuck in LocalCluster

I am using an EC2 VM with 16 cores and 64 GB RAM. I wrote a Dask program that applies a filter to a dataframe, concatenates it with another one, and then writes the data back to disk. If I run it in LocalCluster mode by simply calling client = Client(), the execution gets stuck at some point after writing some data. During this period the CPU utilisation is very low, and I can easily tell that nothing is being executed; the size of the part files also stops increasing at this point. This goes on forever. But if I execute it without creating a LocalCluster, it runs very slowly (low CPU utilisation) and finishes the program. I'm trying to understand how I can fix this.
Note: Nobody else is using the VM and the data size ranges from 3GB to 25GB.
Dask version: 2.15.0 & 2.17.2
Unfortunately there is not enough information in your question to provide a helpful answer. There are many things that could be going on.
In this situation we recommend watching the Dask dashboard, which can give you much more information about what is going on. Hopefully that can help you identify your issue.
https://docs.dask.org/en/latest/diagnostics-distributed.html
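As a starting point, something like this gives you explicit worker limits and the dashboard address; the worker count and memory limit below are only guesses for a 16-core / 64 GB machine:

from dask.distributed import Client

# Explicit limits instead of the defaults picked by a bare Client().
client = Client(n_workers=8, threads_per_worker=2, memory_limit="7GB")
print(client.dashboard_link)  # open this URL to watch tasks, workers and memory live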

Numpy: running out of memory on one machine while accomplishing the same task on another

For my project I need to store two large arrays in memory at once. I try to create them as follows:
import numpy

matrix_for_words_train = numpy.zeros(shape=(435679, 542))
matrix_for_words_test = numpy.zeros(shape=(435679, 542))
However, on my desktop PC the second line resulted in a MemoryError. When I tried the same thing on my laptop, it succeeded. What puzzles me is that the desktop has twice as much memory as the laptop (8 GB versus 4 GB). Both machines run Ubuntu; the desktop has 12.04 and the laptop 14.04 (both 32-bit), and on both machines I executed the above script with Python 2.7.
Just in case, I checked the available memory with free and it seems fine (total memory is shown as expected, and the desktop has more than twice as much free memory). I guess I'm totally missing something here.
Thanks in advance!
If both computers are 32-bit, they are in practice both limited to 4 GB of RAM (unless you have worked around it, for example by enabling PAE).
A 32-bit operating system usually cannot address more than 4 GB of RAM.
Besides, out-of-memory errors are raised not only when there is practically no free memory left for the entire system, but also when the operating system decides it cannot allocate more memory to this particular process.
Furthermore, note that NumPy arrays require a contiguous block of memory. So even if the OS can find enough free memory, it might not find that amount as a single contiguous block. Each of these arrays needs roughly 1.9 GB (435679 × 542 × 8 bytes for the default float64), and it is quite possible there is no single contiguous region that large available in your address space.
Do you have to use arrays? Can you implement your solution differently? If so, I would recommend that. What are you trying to do?
Anyway, out-of-memory errors are usually not very deterministic from the programmer's point of view. You might not even get the same results on the same PC on the same day.
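If you do stick with arrays, here is a rough sketch of how you might check the footprint and shrink it with a smaller dtype or a memory-mapped array (the file name is just a placeholder):

import numpy as np

shape = (435679, 542)
print(shape[0] * shape[1] * 8 / 1e9)  # ~1.9 GB per array with the default float64 dtype

# Half the footprint, if float32 precision is enough:
train32 = np.zeros(shape, dtype=np.float32)

# Or keep the data on disk and let the OS page it in as needed:
train_mm = np.memmap("words_train.dat", dtype=np.float64, mode="w+", shape=shape)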
