Multiprocessing so slow - python

I have a function that does the following:
Takes a file as input and does basic cleaning.
Extracts the required items from the file and writes them into a pandas dataframe.
The dataframe is finally converted to CSV and written into a folder.
This is the sample code:
import os
import multiprocessing
import pandas as pd

def extract_function(filename):
    with open(filename, 'r') as f:
        input_data = f.readlines()
    try:
        # some basic searching / pattern matching / extracting
        # a dataframe with 10 columns is created and the extracted
        # values are filled into the empty dataframe
        # finally df.to_csv()
        ...
    except Exception:
        pass

if __name__ == '__main__':
    pool_size = multiprocessing.cpu_count()
    filenames = os.listdir("/home/Desktop/input")
    pool = multiprocessing.Pool(pool_size)
    pool.map(extract_function, filenames)
    pool.close()
    pool.join()
The total number of files in the input folder is 4000. I used multiprocessing, as running the program normally with a for loop was taking some time. Below are the execution times of both approaches:
Normal CPU processing = 139.22 seconds
Multiprocessing = 18.72 seconds
My system specifications are:
Intel i5 7th gen, 12 GB RAM, 1 TB HDD, Ubuntu 16.04
While running the program for the 4000 files, all the cores were fully used (averaging around 90% per core). So I decided to increase the number of files and repeat the process, this time going from 4,000 to 120,000 input files. But now the CPU usage was erratic at the start, and after some time the utilization dropped (average usage around 10% per core). RAM utilization is also low, averaging 4 GB at most (the remaining 8 GB is free). With the 4000 files as input the CSV writing was fast: I could see a jump of 1000 files or more in an instant. But with the 120,000 files as input, the writing slowed to around 300 files at a time, the slowdown progressed roughly linearly, and after some time it dropped to around 50-70 files at a time. All this time the majority of the RAM is free. I restarted the machine and tried the same thing to clear any unwanted zombie processes, but the result is still the same.
What is the reason for this? How can I achieve the same multiprocessing performance for a large number of files?
Note :
* Each file averages around 300 KB in size.
* Each output file written is around 200 bytes.
* The total number of files is 4,080, so the total size is ~1.2 GB.
* The same 4,080 files were copied to get the 120,000 files.
* This program is an experiment to check multiprocessing for a large number of files.
Update 1
I have tried the same code on a much more powerful machine:
Intel i7 8th gen 8700, 1 TB SSHD & 60 GB RAM.
The file writing was much faster than on the normal HDD. The program took:
For 4,000 files - 3.7 sec
For 120,000 files - 2 min
At some point during the experiment I got the fastest completion time, 84 sec, and at that point it gave me consistent results across two consecutive runs. Thinking that this might be because I had set the thread factor for the pool size correctly, I restarted and tried again. But this time it was much slower. To give some perspective, during normal runs around 3,000-4,000 files are written within a second or two, but this time it was writing below 600 files in a second. The RAM was again barely used, and even though the multiprocessing module is being used, all the CPU cores average only around 3-7% utilization.

Reading from and writing to disk is slow, compared to running code and data from RAM. It is extremely slow compared to running code and data from the internal cache in the CPU.
In an attempt to make this faster, several caches are used:
1. A hard disk generally has a built-in cache. In 2012 I did some write testing on this. With the hard disk's write cache disabled, writing speed dropped from 72 MiB/s to 12 MiB/s.
2. Most operating systems today use otherwise unoccupied RAM as a disk cache.
3. The CPU has several levels of built-in caches as well.
(Usually there is a way to disable caches 1 and 2. If you try that, you'll see read and write speed drop like a rock.)
So my guess is that once you pass a certain number of files, you exhaust one or more of the caches, and disk I/O becomes the bottleneck.
To verify, you would have to add code to extract_function to measure 3 things:
How long it takes to read the data from disk.
How long it takes to do the calculations.
How long it takes to write the CSV.
Have extract_function return a tuple of those three numbers, and analyse them. Instead of map, I would advise using imap_unordered, so you can start evaluating the numbers as soon as they become available (a sketch follows below).
If disk I/O turns out to be the problem, consider using an SSD.
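A minimal sketch of that instrumentation, assuming the parsing/extraction work can be wrapped in a single call; process_lines here is a hypothetical placeholder, not code from the question:

import os
import time
import multiprocessing

def extract_function(filename):
    t0 = time.perf_counter()
    with open(filename, 'r') as f:
        input_data = f.readlines()
    t1 = time.perf_counter()
    df = process_lines(input_data)          # placeholder for the pattern matching / extraction
    t2 = time.perf_counter()
    df.to_csv(filename + '.csv', index=False)
    t3 = time.perf_counter()
    return t1 - t0, t2 - t1, t3 - t2        # read, compute, write durations

if __name__ == '__main__':
    filenames = os.listdir("/home/Desktop/input")
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        for read_s, calc_s, write_s in pool.imap_unordered(extract_function, filenames):
            print(f"read={read_s:.3f}s  calc={calc_s:.3f}s  write={write_s:.3f}s")

If the write column grows as the number of files grows while the other two stay flat, that points at disk I/O as the bottleneck.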

As @RolandSmith and @selbie suggested, I avoided continuously writing output to CSV files by collecting the results in dataframes and appending to them instead. I think this cleared up the inconsistencies. I checked the "feather" and "parquet" high-performance IO formats suggested by @CoMartel, but I think they are aimed at compressing large files into a smaller on-disk structure, and appending was not supported for them.
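A rough sketch of that change, assuming each worker returns its extracted rows and one combined dataframe is written once at the end; parse_lines here is just a placeholder, not the actual extraction code:

import os
import multiprocessing
import pandas as pd

def extract_rows(filename):
    with open(filename, 'r') as f:
        lines = f.readlines()
    return parse_lines(lines)               # placeholder: returns a list of row dicts

if __name__ == '__main__':
    filenames = os.listdir("/home/Desktop/input")
    all_rows = []
    with multiprocessing.Pool() as pool:
        for rows in pool.imap_unordered(extract_rows, filenames):
            all_rows.extend(rows)
    # one dataframe and one write instead of 120,000 tiny CSV files
    pd.DataFrame(all_rows).to_csv("combined_output.csv", index=False)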
Observations:
* The program runs slowly on the first run. Successive runs are faster. This behaviour was consistent.
* I checked for trailing Python processes running after the program completes but couldn't find any. So some kind of caching within the CPU/RAM makes the successive runs faster.
* For 4,000 input files the program took 72 sec on the first execution and then an average of 14-15 sec for all successive runs.
* Restarting the system clears those caches and makes the first run slow again.
* The average fresh run time is 72 sec. But killing the program as soon as it starts and then running it again took 40 sec for the first run after the termination, and about 14 sec on average for all successive runs after that.
* During a fresh run, core utilization is around 10-13%. After the successive runs, core utilization reaches 100%.
* I checked with the 120,000 files and it follows the same pattern. So, for now, the inconsistency is solved. If such code needs to be used as a server, a dry run should be done first so that the CPU/RAM caches are warm before it starts accepting API queries, for faster results.

Related

How much does file write order matter for future sequential reading?

I'm preprocessing a large volume of raw files for future analysis, where they'll be read back sequentially. The raw files are on a network file server and the processed files are being written to a local external USB drive (12TB HDD, ReFS w/ 4k clusters). Processed files are ~100 KB each and I anticipate accessing them all a few times per year, sorted by filename. There are 60+ million files.
My general question is how important it is that the preprocessed files are written in the same order they'll be read. Details follow...
Python's multiprocessing module is being used for the preprocessing to max out I/O and CPU. I have a loop around os.walk that lists N=1000 files from the file server, then passes that list into Pool.map(), which maps the work to a function that processes the files one by one and writes them to the USB drive. This seems to be working well from a performance standpoint. However, I've noticed that the files are not being processed and written in perfectly sequential order, which I assume is because the pool's workers are uncoordinated. It's fairly close but not exact. Here's a fictional sample (pretend these are filenames) listed in the order they're written to the USB drive:
1
4
2
3
6
5
9
11
7
10
8
12
17
21
In the future I'll be reading these back in small batches of "adjacent filenames" (N = 16 or 32) like this:
(1, 2, 3, ..., 14, 15, 16). All files will be read sequentially, but they're processed in batches (feeding them into a neural network). I assume this reading will be single-threaded, so it's actually requesting one file at a time from the OS, although I may end up relying on a data loader in TensorFlow or PyTorch, which could do multiple reads in parallel... I'm still learning about that part and am not sure how it will work exactly. But the gist of it is that the files will be read sequentially based on their filename.
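For concreteness, the read side I have in mind would look roughly like this (the directory path is only an example):

import os
from itertools import islice

def batched_filenames(directory, batch_size=16):
    # yield batches of adjacent filenames in sorted (filename) order
    names = iter(sorted(os.listdir(directory)))
    while True:
        batch = list(islice(names, batch_size))
        if not batch:
            return
        yield batch

for batch in batched_filenames("/mnt/usb/processed", batch_size=16):
    payloads = []
    for name in batch:
        with open(os.path.join("/mnt/usb/processed", name), 'rb') as f:
            payloads.append(f.read())
    # feed `payloads` into the network here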
My questions are:
How important is it that the files be written in the exact same order they'll be read? Am I causing excessive disk thrashing even though the files are written in approximate filename order such that most files within a given "read-batch" are "close together" on the disk platters? I'm worried about decreased performance and increased risk of drive failure.
If I do want to address this, how would I do so? My untested noob ideas are below:
Have the pool return the processed files as in-memory objects, and then use single-threaded code to write them to disk (see the sketch after this list). This would consume more memory, forcing me to decrease the number and/or depth of the pools, leading to greater overhead. (I assume the single-threaded writing would not become a bottleneck.)
Perhaps this single-threaded write activity could run in parallel with the os.walk listing? They're hitting different I/O channels and neither is CPU intensive.
My second thought was to somehow coordinate the pools to get them to write files in sequential order, but I'm not sure how to do that or if it's even possible or desirable.
Third idea is to have the pools write files to the system's SSD instead of the USB drive, then at some regular interval use single-threaded code to move them to the external USB drive in the desired order. Essentially treating the SSD like a large buffer.
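A rough sketch of the first idea, with hypothetical make_output_name and transform helpers standing in for the real processing: Pool.imap (unlike imap_unordered) yields results in submission order, so a single writer in the parent process writes the files in exactly the order they were listed.

import os
import multiprocessing

def process_file(src_path):
    # worker: read and transform one raw file, return the result in memory
    with open(src_path, 'rb') as f:
        raw = f.read()
    return make_output_name(src_path), transform(raw)   # placeholder helpers

def write_batch(file_list, out_dir):
    with multiprocessing.Pool() as pool:
        # imap preserves submission order, so writes happen in listing order
        for name, payload in pool.imap(process_file, file_list, chunksize=8):
            with open(os.path.join(out_dir, name), 'wb') as f:
                f.write(payload)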
Anyways, if anyone has any thoughts on how to approach this, I'd love to hear them!

DASK Memory Per Worker Guide

I'm currently working on refactoring some legacy analytics into Python/DASK to show the efficacy of this as a solution going forward.
I'm trying to set up a demo scenario and am having problems with memory and would like some advice.
My scenario: I have my data split into 52 gzip-compressed parquet files on S3; each one, uncompressed in memory, is around 100 MB, giving a total dataset size of ~5.5 GB and exactly 100,000,000 rows.
My scheduler is on a T2.Medium (4GB/2vCPUs), as are my 4 workers.
Each worker is being run with 1 process, 1 thread and a memory limit of 4GB, i.e. dask-worker MYADDRESS --nprocs 1 --nthreads=1 --memory-limit=4GB.
Now, I'm pulling the parquet files and immediately repartitioning on a column in such a way that I end up with roughly 480 partitions each of ~11MB.
Then I'm using map_partitions to do the main body of work.
This works fine for small datasets, however for the 100 mil dataset, my workers keep crashing due to not having enough memory.
What am I doing wrong here?
For implementation-specific info, the function I'm passing to map_partitions can sometimes need roughly 1GB, due to what is essentially a cross join on the partition dataframe.
Am I not understanding something to do with DASK's architecture? Between my scheduler and my 4 workers there is 20GB of memory to work with, yet this is proving not to be enough.
From what I've read in the DASK documentation, as long as each partition, and what you do with that partition, fits in the memory of the worker, you should be OK?
Is 4GB just not enough? Does it need way more to handle scheduler/inter-process communication overheads?
Thanks for reading.
See https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-partitions
I'll copy the text here for convenience
Your chunks of data should be small enough so that many of them fit in a worker’s available memory at once. You often control this when you select partition size in Dask DataFrame or chunk size in Dask Array.
Dask will likely manipulate as many chunks in parallel on one machine as you have cores on that machine. So if you have 1 GB chunks and ten cores, then Dask is likely to use at least 10 GB of memory. Additionally, it’s common for Dask to have 2-3 times as many chunks available to work on so that it always has something to work on.
If you have a machine with 100 GB and 10 cores, then you might want to choose chunks in the 1GB range. You have space for ten chunks per core which gives Dask a healthy margin, without having tasks that are too small
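As a minimal sketch of how that advice could be applied here (the S3 paths and the per-partition function are placeholders, and this sizes partitions by bytes rather than by the column used in the question):

import dask.dataframe as dd

df = dd.read_parquet("s3://my-bucket/input/")          # placeholder path

# Size partitions so that several of them, plus the ~1GB the mapped
# function needs on top, still fit inside a 4GB worker.
df = df.repartition(partition_size="100MB")

result = df.map_partitions(heavy_cross_join)           # placeholder function
result.to_parquet("s3://my-bucket/output/")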

How to find the maximum memory that my Python code needs?

I need to add some lines to my code that show the maximum memory used during the runtime of the code. My code opens a CSV file, processes it, and writes a new CSV file. I need to know the maximum memory that was allocated over the whole run.
I need this to compare different alternatives in terms of memory usage. I have tried memory_usage() and df.memory_usage(deep=True).max(), but I don't know what the numbers they generate mean.
I need to know, for example, that this code, processing this CSV file, allocated 12 MB of RAM at the most memory-consuming moment of the run.
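One way to get a number like that with the standard library is tracemalloc, which tracks Python allocations and reports the current and peak values; process_csv below is a placeholder for the actual processing code:

import tracemalloc

tracemalloc.start()

process_csv("input.csv", "output.csv")   # placeholder for the real work

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"Peak memory allocated during the run: {peak / 1024 / 1024:.1f} MB")

Note that tracemalloc only counts allocations made through Python's allocator; an external tool or resource.getrusage would be needed to see the process's total footprint.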

Python/PyCharm memory and CPU allocation for faster runtime?

I am trying to run a very resource-intensive Python program which processes text with NLP methods for several different classification tasks.
The program takes several days to run, so I am trying to allocate more capacity to it. However, I don't really understand whether I did the right thing, because with my new allocation the Python code is not significantly faster.
Here is some information about my notebook:
It runs Windows 10 with an Intel Core i7 with 4 cores (8 logical processors) @ 2.5 GHz and 32 GB physical memory.
What I did:
I changed some parameters in the vmoptions file, so that it looks like this now:
-Xms30g
-Xmx30g
-Xmn30g
-Xss128k
-XX:MaxPermSize=30g
-XX:ParallelGCThreads=20
-XX:ReservedCodeCacheSize=500m
-XX:+UseConcMarkSweepGC
-XX:SoftRefLRUPolicyMSPerMB=50
-ea
-Dsun.io.useCanonCaches=false
-Djava.net.preferIPv4Stack=true
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
My problem:
However, as I said, my code is not running significantly faster. On top of that, if I open the Task Manager I can see that PyCharm uses nearly 80% of the memory but 0% of the CPU, and Python uses 20% of the CPU and 0% of the memory.
My question:
What do I need to do to make my Python code run faster?
Is it possible that I need to allocate more CPU to PyCharm or Python?
What is the connection between the memory allocated to PyCharm and the runtime of the Python interpreter?
Thank you very much =)
You cannot increase CPU usage manually. Try one of these solutions:
* Rewrite your algorithm to be multi-threaded; then you can use more of your CPU. Note that not all programs can profit from multiple cores. Calculations done in steps, where the next step depends on the results of the previous step, will not be faster using more cores. Problems that can be vectorized (applying the same calculation to large arrays of data) can relatively easily be made to use multiple cores, because the individual calculations are independent.
* Use numpy. It is an extension written in C that can use optimized linear algebra libraries like ATLAS. It can speed up numerical calculations significantly compared to standard Python (a small sketch follows below).
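As a small illustration of the numpy point, the same sum of squares computed with a plain Python loop and with a vectorized numpy call (the array size is arbitrary):

import numpy as np

data = np.random.rand(10_000_000)

# pure-Python loop: interpreted, one element at a time
total = 0.0
for x in data:
    total += x * x

# vectorized: the same computation done in optimized C inside numpy
total_np = np.dot(data, data)

On typical hardware the vectorized version is usually faster by a couple of orders of magnitude.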
You can adjust the number of CPU cores to be used by the IDE when running the active tasks (for example, indexing header files, updating symbols, and so on) in order to keep the performance properly balanced between AppCode and other applications running on your machine.
use this link

Time needed for simulations in a loop (C++ app, but also Python) increases over time

I have a Xeon workstation with 16 cores and 32 threads. On that machine I run a large set of simulations in parallel, split across 32 processes that each run 68,750 simulations (2.2 million in total) in a loop.
Data is written to an SSD, and a Python process (a while loop with a waiting time) continuously gathers the output, consolidates it, and stores it on another, regular hard disk.
When I start everything, and for about the first day, a single simulation takes only a few seconds (all simulations have very similar complexity/load). But then the time a single simulation takes gets longer and longer, up to a hundredfold after about a week. The CPU temperatures are fine, the disk temperatures are fine, etc.
However, whereas at the beginning Windows uses all CPU power at 100%, it throttles the power a lot when things get slow (I have a tool running that shows the load of all 32 threads). I tried to hibernate and restart, just to check whether it is something with the hardware, but this does not change it, or not by much.
Does anybody have an explanation? Is there a tweak to apply to Windows to change this behaviour? I can of course work around it, e.g. by splitting the simulations further apart and restarting the machine completely every now and then to start a new set of simulations. But I am interested in experience and, hopefully, solutions to this problem instead.
On comments 1 and 2: yes, there are no leaks (I tested the C++ code with valgrind) and the usage is stable. The system has 192 GB of RAM available and only a small proportion of it is used (the 32 simulations take about 190 MB each), and in the C++ code everything is closed. For Python I guess so too (leaving scope closes handles there, if I am not mistaken). However, closing the Python program doesn't change a thing.
Thanks!
Frederik
