I have ~1000 separate HDF5 files stored on disk. Each only takes around 10ms to load into memory, so I was wondering what the best way is to load them in parallel such that I get close to a linear performance boost.
I've tried multiprocessing, but that ends up being slower than just loading them serially due to the overhead of setting up the processes. I've also looked into Cython, specifically prange, but I'm having trouble optimizing it to make it any faster. Any pointers would be appreciated!
This sounds like a job for mapreduce, but if you only have one machine then I would recommend using pipes. Write one script that opens the files and prints the data to stdout, then another script that reads the data from stdin and processes it. Then you redirect script1 into script2.
# script1.py
FILES_TO_READ = ...
for filename in FILES_TO_READ:
    # open the file
    # do work
    # print data
# script2.py
while True:
    try:
        line = input()
    except EOFError:  # stop once script1's output is exhausted
        break
    # do work
$> ./script1.py | ./script2.py
Related
Let's say I have a python script to read and process a csv in which each line can be processed independently. Then let's say I have another python script that I am using to call the original script using os.system(), as such:
Script A:
import sys

with open(sys.argv[1], 'r') as f:
    ...  # do some processing for each line
Script B:
import os
os.system('python Script_A.py somefile.csv')
How are computing resources shared between Script A and Script B? How does the resource allocation change if I call Script A as a subprocess of Script B instead of a system command? What happens if I multiprocess within either of those scenarios?
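For concreteness, by calling Script A as a subprocess I mean something along these lines (a sketch using subprocess.run from the standard library; the exact arguments are just illustrative):

import subprocess

# Run Script A as a child process, wait for it to finish, and capture its output.
# Unlike os.system(), this gives Script B direct access to the child's stdout,
# stderr, and return code.
result = subprocess.run(
    ['python', 'Script_A.py', 'somefile.csv'],
    capture_output=True,
    text=True,
)
print(result.returncode)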
To complicate things further, how would the GIL come into play across these separate instances of python?
I am not looking for a library or a ready-made solution; rather, I'd like to understand, through the lens of python, how resources would be allocated in these scenarios so that I can optimize my code for my processing use case.
Cheers!
I have a python script that is called from a java program.
The java program feeds data to the python script over sys.stdin, and the java program receives data back from the python process's output stream.
What is known is this:
Running the command 'python script.py' from the java program on 10MB of data takes about 35 seconds.
However, running 'python script.py > temp.data' and then reading the result back with cat temp.data is significantly faster.
The performance gap gets even more drastic as the data gets larger.
To address this, I am thinking there may be a way to change sys.stdout to mimic what I am doing with the temp file.
Or maybe I could pipe the python script's output to a virtual file.
Any recommendations?
This is probably a buffering problem that shows up when the Java program writes to one filehandle and reads from another. The ordering of those operations on the Java side and the size of the writes are suboptimal, and it's slowing itself down.
I would try "python -u script.py" to see what happens when you ask Python not to buffer; that should in principle be slower, but it might nudge your calling program into racing a different way, perhaps a faster one.
The larger fix, I think, is to batch your output, as you are already doing in your test, and read the resulting file, or to use POSIX select() or filehandle events to control how your Java program times its writes and reads.
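If you go the batching route from inside script.py itself, a minimal sketch of the idea might look like this (process() here is just a stand-in for whatever the script actually does to the data):

# script.py -- sketch: read everything, process, then emit one large buffered write
import sys

def process(data):
    # stand-in for the script's real processing
    return data

def main():
    data = sys.stdin.buffer.read()    # slurp all the input from the Java side
    result = process(data)
    sys.stdout.buffer.write(result)   # one big write instead of many small ones
    sys.stdout.buffer.flush()         # flush once at the end

if __name__ == '__main__':
    main()

That keeps the interleaving of reads and writes on the Java side to a minimum, much like the temp.data workaround does.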
I have a large number of small files to download and process from s3.
The downloading is fairly fast, as the individual files are only a few megabytes each; together, they are about 100GB. The processing takes roughly twice as long as the downloading and is purely CPU bound. Therefore, by processing files in multiple threads while other files are downloading, it should be possible to shorten the overall runtime.
Currently, I download a file, process it, and move on to the next file. Is there a way in python to download all the files one after another and process each one as soon as it finishes downloading? The key difference is that while each file is being processed, another is always downloading.
My code looks like:
import subprocess

files = {'txt': ['filepath1', 'filepath2', ...],
         'tsv': ['filepath1', 'filepath2', ...]
         }

for kind in files.keys():
    subprocess.check_call(f'mkdir -p {kind}', shell=True)
    subprocess.call(f'mkdir -p {kind}/normalized', shell=True)

    for i, file in enumerate(files[kind]):
        subprocess.call(f'aws s3 cp s3://mys3bucket.com/{file} {kind}/', shell=True)
        f = file.split('/')[-1]
        subprocess.check_call(f'my_process_function --input "{kind}/{f}" --output "{kind}/normalized/normalize_{f}" --units relab', shell=True)
I've also written a multiprocessing solution where I can simultaneously download and process multiple files, but this doesn't result in a speed improvement as the network speed was already saturated. The bottleneck is in the processing. I've included it in case it helps you guys.
import subprocess
from contextlib import closing
from os import cpu_count
from multiprocessing import Pool

def download_and_proc(file, kind='txt'):
    subprocess.call(f'aws s3 cp s3://mys3bucket.com/{file} {kind}/', shell=True)
    f = file.split('/')[-1]
    subprocess.check_call(f'my_process_function --input "{kind}/{f}" --output "{kind}/normalized/normalize_{f}" --units relab', shell=True)

with closing(Pool(processes=cpu_count()*2)) as pool:
    pool.map(download_and_proc, files)
Your current multiprocessing code should be pretty close to optimal over the long term. It won't always be downloading at maximum speed, since the same threads of execution that are responsible for downloading a file will wait until the file has been processed before downloading another one. But it should usually have all the CPU consumed in processing, even if some network capacity is going unused. If you tried to always be downloading too, you'd eventually run out of files to download and the network would go idle for the same amount of time, just all at the end of the batch job.
One possible exception is if the time taken to process a file is always exactly the same. Then you might find your workers running in lockstep, where they all download at the same time, then all process at the same time, even though there are more workers than there are CPUs for them to run on. Unless the processing is somehow tied to a real time clock, that doesn't seem likely to occur for very long. Most of the time you'd have some processes finishing before others, and so the downloads would end up getting staggered.
So improving your code is not likely to give you much in the way of a speedup. If you think you need it though, you could split the downloading and processing into two separate pools. It might even work to do one of them as a single-process loop in the main process, but I'll show the full two-pool version here:
def download_worker(file, kind='txt'):
    subprocess.call(f'aws s3 cp s3://mys3bucket.com/{file} {kind}/', shell=True)
    return file

def processing_worker(file, kind='txt'):
    f = file.split('/')[-1]
    subprocess.check_call(f'my_process_function --input "{kind}/{f}" --output "{kind}/normalized/normalize_{f}" --units relab', shell=True)

with Pool() as download_pool, Pool() as processing_pool:
    downloaded_iterator = download_pool.imap(download_worker, files)  # imap returns an iterator
    processing_pool.map(processing_worker, downloaded_iterator)
This should both download and process as fast as your system is capable. If the downloading of a file takes less time than its processing, then it's pretty likely that the first pool will be done before the second one, which the code will handle just fine. If the processing is not the bottleneck, it will support that too (the second pool will simply be idle some of the time, waiting on files to finish downloading).
I have a list of files that I need to access and process from S3 buckets through a lambda function, and the idea is to loop through the files and collect data from all of them in parallel. My first thought was to use threading, but I ran into an issue where my max pool size was limited to 10, whereas I'm processing many more files than that. I want to be able to keep adding processes until all files have been accessed, rather than building a list of processes up front and then running them in parallel, which seems to be how multiprocessing's Pool works. I'd appreciate any suggestions.
You may not be able to achieve performance gains using multi-threading under a single process in python due to the GIL. However, you could use a bash script to start multiple python processes simultaneously.
For example, if you wanted to perform some tasks and write the results to a common file, you could use the following prep.sh file to create an empty results file.
#!/bin/bash
if [ -e results.txt ]
then
    rm results.txt
fi
touch results.txt
And the following control.sh file to spawn your multiple python processes.
#!/bin/bash
Arr=( Do Many Things )
for i in "${Arr[#]}"; do
python process.py $i &
done
With the following process.py file, which simply takes a command-line argument and appends it to the results file followed by a newline.
#!/bin/python
import sys
target = sys.argv[1]
def process(argument):
with open("results.txt", "a") as results:
results.write(argument+"\n")
process(target)
Obviously you would need to edit the array in control.sh to reflect the files you need to access, and the process in process.py to reflect the retrieval and analysis of those files.
So I have two scripts: main.py, which is run at startup and runs in the background, and otherscript.py, which is run whenever the user invokes it.
main.py crunches some data and then writes it out to a file on every iteration of its while loop (the data is about ~1.17 MB), erasing the old data, so data.txt always contains the latest crunched data.
otherscript.py reads data.txt (the current data at that instant) and then does something with it.
main.py
while True:
    data = crunchData()
    with open("data.txt", "w") as f:  # overwrite data.txt with the latest data
        f.write(data)
otherscript.py
with open("data.txt") as f:
    data = f.read()
doSomethingWithData(data)
How can I make the connection between the two scripts faster? Are there any alternatives to passing the data through a file?
This is a problem of Inter-Process Communication (IPC). In your case, you basically have a producer process, and a consumer process.
One way of doing IPC, as you've found, is using files. However, it'll saturate the disk quickly if there's lots of data going through.
If you had a straight consumer that wanted to read all the data all the time, the easiest way to do this would probably be a pipe - at least if you're on a unix platform (mac, linux).
If you want a cross-platform solution, my advice, in this case, would be to use a socket. Basically, you open a network port on the producer process, and every time a consumer connects, you dump the latest data. You can find a simple how-to on sockets here.
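Here's a minimal sketch of the socket approach, under the assumption that the producer just hands its latest crunched batch to each consumer that connects; the port number and crunch_data() are placeholders for your actual setup, and a real version would do the crunching in its own thread so it isn't blocked waiting for connections.

# producer (main.py side) -- sketch, not production code
import socket

HOST, PORT = '127.0.0.1', 50007   # placeholder address and port

def crunch_data():
    return b'latest data'         # stand-in for the real data crunching

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((HOST, PORT))
    server.listen()
    latest = crunch_data()
    while True:
        conn, _ = server.accept()     # a consumer connected
        with conn:
            conn.sendall(latest)      # dump the latest data to it
        latest = crunch_data()        # then crunch the next batch

# consumer (otherscript.py side) -- connect, read everything, then use it
import socket

HOST, PORT = '127.0.0.1', 50007

chunks = []
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
    client.connect((HOST, PORT))
    while True:
        chunk = client.recv(4096)
        if not chunk:                 # empty read means the producer closed the connection
            break
        chunks.append(chunk)

data = b''.join(chunks)
# doSomethingWithData(data)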