dask execution gets stuck in LocalCluster - python

I am using an EC2 VM with 16 cores and 64GB RAM. I wrote a Dask program that applies a filter to a dataframe, concatenates it with another one, and then writes the result back to disk. If I run it in LocalCluster mode by simply calling client = Client(), the execution gets stuck at some point after writing some data. During this period the CPU utilisation is very low, so clearly nothing is being executed, and the size of the part files stops increasing. This goes on forever. But if I execute it without creating a LocalCluster, it runs very slowly (low CPU utilisation) yet finishes the program. I am trying to understand how I can fix this.
Note: Nobody else is using the VM and the data size ranges from 3GB to 25GB.
Dask version: 2.15.0 & 2.17.2

Unfortunately there is not enough information in your question to provide a helpful answer. There are many things that could be going on.
In this situation we recommend watching the Dask dashboard, which can give you much more information about what is going on. Hopefully that can help you identify your issue.
https://docs.dask.org/en/latest/diagnostics-distributed.html
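For example, here is a minimal sketch of starting the cluster explicitly so the dashboard address is printed up front (the worker count, threads, and memory limit below are illustrative, not recommendations for your workload):

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=4, memory_limit="14GB")
client = Client(cluster)
print(client.dashboard_link)  # open this URL in a browser while the job runs

Watching the task stream and worker memory panels there should show whether workers are spilling to disk or sitting idle when the job stalls.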

Related

Is there a way to reduce resource usage when reading and writing large dataframes with polars?

For my specific problem I have been converting ".csv" files to ".parquet" files. The CSV files on disk are about 10-20 GB each.
A while back I was converting ".SAS7BDAT" files of similar size to ".parquet" files of similar data, but now I get the data as CSVs, so this might not be a good control. I used the pyreadstat library to read those files in (with multi-threading enabled via its parameter, which didn't make a difference for some reason) and pandas to write. It was a tiny bit faster, but I suspect the code ran on a single thread, and it took a week to convert all my data.
This time, I tried the polars library and it was blazing fast. CPU usage was near 100% and memory usage was also quite high. I tested this on a single file that would have taken hours before, and it completed in minutes. The problem is that it uses too much of my computer's resources and my PC stalls. VSCode has crashed on some occasions. I have tried passing in the low-memory parameter, but it still uses a lot of resources. My suspicion is the reader.next_batches(500) call, but I don't know for sure.
Regardless, is there a way to limit the CPU and memory usage while running this operation so I can at least browse the internet or listen to music while it runs in the background? With pandas the process is too slow; with polars the process is fast but my PC becomes unusable at times. See the image for the code I used.
Thanks.
I tried the low-memory parameter with polars but memory usage was still quite high. I was expecting to at least be able to use my PC while this worked in the background. My hope is to use 50-80% of my PC's resources so that enough is free for other work while the files are being converted.
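For reference, a minimal sketch of polars' lazy/streaming route, which keeps memory bounded by never materialising the whole frame; the file names are placeholders, and scan_csv/sink_parquet is a different API from the batched reader described above:

import polars as pl

# Stream the CSV through to parquet without loading it all into RAM.
pl.scan_csv("input.csv").sink_parquet("output.parquet")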
I see you're on Windows, so convert your notebook into a .py script and then run it from the command line with:
start /low python yourscript.py
And/or use Task Manager to lower the priority of your Python process once it's running.
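If you'd rather cap it from inside Python, a rough sketch under two assumptions: polars reads the POLARS_MAX_THREADS environment variable (it must be set before polars is imported), and the third-party psutil package is installed for lowering the priority programmatically. The thread count here is illustrative:

import os
os.environ["POLARS_MAX_THREADS"] = "8"  # must be set before importing polars
import polars as pl

import psutil
# Same idea as `start /low`, but from within the script (Windows only).
psutil.Process().nice(psutil.BELOW_NORMAL_PRIORITY_CLASS)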

Spacy, using nlp.pipe on a large dataset in python, multiprocessing leads to processes going into sleep state. How to properly use all CPU cores?

I'm working on an NLP classification problem over a large database of emails (~1 million). I need to use spacy to parse the texts, and I'm using the nlp.pipe() method as nlp.pipe(emails, n_process=CPU_CORES, batch_size=20) to loop over the dataset.
The code works, but I'm facing a (maybe not so) weird behavior:
the processes are being created, but they are all in SLEEP state except one; occasionally a few of them go into RUN state for some seconds and then back to sleep. So I find myself with one single process using one core at 100%, but of course the script is not using all the CPU cores.
It's like the processes don't get "fed" input data from the pipe.
Does anybody know how to properly use the spacy nlp pipe, or in any case how to avoid this situation? Is there no way to use nlp.pipe with the GPU?
Thank you very much!
Sandro
EDIT: I still have no solution, but I've noticed that if I set batch_size=divmod(len(emails), CPU_CORES), the processes all start running at 100% CPU, and after a few seconds they all switch to sleeping state except one. It really looks like some element in the spacy pipe gets locked while waiting for something to end... any idea??
EDIT2: Setting batch_size=divmod(len(emails), CPU_CORES) while processing a large dataset inevitably leads to a spacy memory error:
MemoryError: Unable to allocate array with shape (1232821, 288) and data type float32
*This is maybe not so weird, as my machine has 10GB of RAM and (1232821 × 288 × 32) bits / 8 = 1.4GB, which multiplied by 6 (CPU_CORES) gives 8.5GB of RAM needed. Therefore I guess that, with other stuff already in memory, it turns out to be plausible.*
I've found that using n_process=n works well for some models, like en_core_web_lg, but fails with others, such as en_core_web_trf.
For whatever reason, en_core_web_trf seems to use all cores with just a batch_size specified, whereas en_core_web_lg uses just one unless n_process=n is given. Moreover, en_core_web_trf fails with a closure error if n_process=n is specified.
Ok, I think I found an improvement, but honestly the behavior is still not really clear to me. Now there are far fewer sleeping processes: most of them run stably, with a few sleeping or switching between the two states.
What I did was to clean up and speed up all the code inside the for loop and set the nlp.pipe args like this:
for e in nlp.pipe(emails, n_process=CPU_CORES - 1, batch_size=200):
If anybody has an explanation for this, or any suggestion on how to improve it even more, it's of course more than welcome :)
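For context, a minimal runnable sketch of that pattern; the model name, core count, sample texts, and per-doc work are all placeholders standing in for the asker's setup:

import spacy

CPU_CORES = 6  # placeholder core count
nlp = spacy.load("en_core_web_lg")
emails = ["first email text", "second email text"]  # placeholder data

# A moderate batch_size with n_process=CPU_CORES-1 leaves one core for the parent.
for doc in nlp.pipe(emails, n_process=CPU_CORES - 1, batch_size=200):
    lemmas = [tok.lemma_ for tok in doc]  # keep the per-doc work cheap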

Pandas inconsistent speed and memory usage when manipulating data frames on Windows

I am using pandas on Windows 10 and am running into some inconsistent performance issues. The data sets I am working with are not small, but not huge either, perhaps on the order of 1GB.
I generally use code of the form:
df = pd.read_csv('...')
new_column = df.column_a / df.column_b
df = df.assign(new_column=new_column)
What I observe is that the speed and memory usage varies significantly from run to run.
Most of the time memory usage stays below 2GB and the calculation above runs smoothly. I use Atom's Hydrogen plugin for interactive execution.
Every now and then, after several executions, things suddenly grind almost to a halt, and Windows Task Manager reports memory usage of over 6GB for the Python session. But it is not repeatable: when I restart and walk through the code, this might happen at a different point, or not at all.
Any ideas what causes this and how I could avoid it? I am not sure whether the system tries to allocate memory for temporary results and squeezes the RAM, or whether it is some interaction between Hydrogen and pandas, or something completely different. If this happened during a merge it would be more intuitive, as there is potential for frames ballooning, but here it happens on operations on 1d numpy arrays.
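One concrete source of duplication in the snippet above: df.assign returns a copy of the entire frame. A minimal sketch of the same step without that extra copy, using the column names from the snippet (whether this explains the 6GB spikes is only a guess):

import pandas as pd

df = pd.read_csv('...')  # path elided as in the original
# Plain column assignment modifies df in place instead of copying the whole frame.
df['new_column'] = df.column_a / df.column_b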
I tried modin, but ran into a lot of Windows issues. Unfortunately, work requires me to use Windows, so switching to Unix is not an option.
I would appreciate any help.

Slower execution of AWS Lambda batch-writes to DynamoDB with multiple threads

Disclaimer: I know this question will annoy some people because it's vague, theoretical, and has little code.
I have an AWS Lambda function in Python which reads a file of denormalized records off S3, formats its contents correctly, and then uploads that to DynamoDB with a batch write. It all works as advertised. I then tried to break the uploading part of this pipeline into threads, hoping to use DynamoDB's write capacity more efficiently. However, the multithreaded version is slower by about 50%. Since the code is very long, I have included pseudocode:
import threading

NUM_THREADS = 4
LINES_PER_THREAD = 1000  # illustrative; e.g. total lines // NUM_THREADS

threads = []
lines = []
for line in file:  # the records file read off S3
    lines.append(line)
    if len(lines) >= LINES_PER_THREAD:  # enough lines for a single thread
        t = threading.Thread(target=upload, args=(lines,))  # upload = the batch-write step
        t.start()
        threads.append(t)
        lines = []  # clear the list of lines
for t in threads:
    t.join()
Important notes and possible sources of the problem I've checked so far:
When testing this locally using DynamoDB Local, threading does make my program run faster.
If instead I use only 1 thread, or even if I use multiple threads but join each thread right after starting it (effectively single-threaded), the program completes much quicker: ~30s with 1 thread versus ~45s multithreaded.
I have no shared memory between threads, no locks, etc.
I have tried both creating a new DynamoDB connection for each thread and sharing one connection across them, with no effect either way.
I have confirmed that adding more threads does not overwhelm the write capacity of DynamoDB, since it makes the same number of batch write requests and I don't have more unprocessed items throughout execution than with a single thread.
Threading should improve the execution time since the program is network bound, even though Python threads do not really run on multiple cores.
I have tried reading the entire file first, and then spawning all the threads, thinking that perhaps it's better to not interrupt the disk IO, but to no effect.
I have tried both the threading library and the multiprocessing library.
Again, I know this question is very theoretical, so it's probably hard to see the source of the issue, but is there some Lambda quirk I'm not aware of? Is there something else I can try to help diagnose the issue? Any help is appreciated.
Nate, have you completely ruled out a problem on the DynamoDB end? The total number of write requests may be the same, but the number per second would be different with multiple threads.
The console has some useful graphs that show whether your writes (or batch writes) are being throttled at all. If you don't have the right 'back off, retry' logic in your Lambda function, Lambda will just try and try again, and your problem gets worse.
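As an illustration (not the asker's code), a minimal back-off sketch around boto3's batch_write_item that resubmits UnprocessedItems instead of hammering the table; the retry count and sleep schedule are placeholders:

import time
import boto3

dynamodb = boto3.client("dynamodb")

def batch_write_with_backoff(request_items, max_retries=5):
    # Resubmit whatever DynamoDB reports as unprocessed, waiting
    # exponentially longer between attempts so throttling can clear.
    for attempt in range(max_retries):
        response = dynamodb.batch_write_item(RequestItems=request_items)
        request_items = response.get("UnprocessedItems", {})
        if not request_items:
            return
        time.sleep(0.1 * 2 ** attempt)
    raise RuntimeError("Items still unprocessed after retries")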
One other thing, which might have been obvious to you (but not me!): I was under the impression that batch writes saved you money on the capacity-planning front (that 200 writes in batches of 20 would only cost you 10 write units, for example; I could have sworn I heard an AWS guy mention this in a presentation, but that's beside the point).
In fact, batch writes save you some time, but nothing economically.
One last thought: I'd bet that Lambda processing time is cheaper than upping your DynamoDB write capacity. If you're in no particular rush for Lambda to finish, why not let it run its course single-threaded?
Good luck!
Turns out that the threading is faster, but only once the file reaches a certain size. I was originally working with a file of about 1/2 MB. With a 10 MB file, the threaded version came out about 50% faster. Still unsure why it wouldn't work with the smaller file; maybe it just needs time to get a'cooking, you know what I mean? Computers are moody things.
As a backdrop, I have good experience with Python and DynamoDB, along with Python's multiprocessing library. Since your file size was fairly small, it may have been the setup time of the processes that confused you about performance. If you haven't already, use Python multiprocessing pools, and use map or imap depending on your use case if you need to communicate any data back to the main thread. Using a pool is the darn simplest way to run multiple processes in Python. If making your application faster is a priority, you may want to look into golang concurrency, and you could always build that code into a binary to use from within Python. Cheers.
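A minimal sketch of that pool approach, assuming a write_batch function that wraps the question's format-and-batch-write step (the input file, chunk size, and worker count are placeholders; 25 is DynamoDB's per-request batch-write limit):

from multiprocessing import Pool

def write_batch(lines):
    # Hypothetical stand-in for formatting the lines and batch-writing them.
    return len(lines)

if __name__ == "__main__":
    lines = open("records.txt").read().splitlines()  # placeholder input
    batches = [lines[i:i + 25] for i in range(0, len(lines), 25)]
    with Pool(processes=4) as pool:
        for written in pool.imap_unordered(write_batch, batches):
            pass  # results stream back as workers finish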

Python consumes all RAM, process killed

I'm running Linux Mint via VirtualBox, and the Python code I'm using contains an iteration over a large data set to produce plots. The first couple of times I tried to run it, part way through the process it stopped with a message simply saying "Killed".
A bit of research showed that this is most likely due to a lack of RAM for the process. When I repeated the run and monitored system resource usage (using the command top -s), sure enough I could watch the RAM usage going up at a fairly constant rate as the program ran. I'm giving VirtualBox all the RAM my system can afford (just over 2GB), but it doesn't seem to be enough for the iterations I'm doing. The code looks like this:
import os

for file in os.listdir('folder/'):
    calledfunction('folder/' + file, 'output/' + file)
The calledfunction produces a PNG image, so it takes about 50MB of RAM per iteration, and I want to do it about 40 times.
So, my question is: can I prevent this build-up of RAM usage, or clear the RAM after each iteration? I've seen people talking about garbage collection, but I'm not really sure how to use it, or where I can/should put it in my loop. Any protips?
Thanks
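For illustration only, a minimal sketch of where explicit garbage collection would sit in that loop; gc is standard library, calledfunction is stubbed here, and if the plots come from matplotlib then closing figures inside calledfunction usually matters more than gc.collect:

import gc
import os

def calledfunction(src, dst):
    """Placeholder stub for the plot-producing routine."""

for file in os.listdir('folder/'):
    calledfunction('folder/' + file, 'output/' + file)
    gc.collect()  # release unreachable objects before the next iteration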
