I have a list of files that I need to access and process from S3 buckets through a Lambda function, and the idea is to loop through the files and collect data from all of them in parallel. My first thought was to use threading, but that limited my maximum pool size to 10, while I'm processing many more files than that. I want to be able to keep submitting work until all files have been accessed, instead of building a list of processes up front and then running them in parallel, which seems to be how multiprocessing's Pool works. I'd appreciate any suggestions.
You may not be able to achieve performance gains using multi-threading under a single process in Python due to the GIL. However, you could use a bash script to start multiple Python processes simultaneously.
For example, if you wanted to perform some tasks and write the results to a common file, you could use the following prep.sh file to create an empty results file.
#!/bin/bash
if [ -e results.txt ]
then
    rm results.txt
fi
touch results.txt
And the following control.sh file to spawn your multiple Python processes.
#!/bin/bash
Arr=( Do Many Things )
for i in "${Arr[@]}"; do
    python process.py "$i" &
done
With the following process.py file, which simply takes a command line argument and writes it to the results file followed by a newline.
#!/bin/python
import sys

target = sys.argv[1]

def process(argument):
    with open("results.txt", "a") as results:
        results.write(argument + "\n")

process(target)
Obviously, you would need to edit the array in control.sh to reflect the files you need to access, and the process function in process.py to reflect the retrieval and analysis of those files, as sketched below.
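For instance, process.py could fetch each object from S3 with boto3 before analysing it. This is just a minimal sketch under assumptions not stated above: the bucket name is hypothetical, and each command line argument is assumed to be an S3 object key.

#!/bin/python
import sys
import boto3

target = sys.argv[1]  # S3 object key passed in from control.sh

def process(key):
    # hypothetical bucket name; replace with your own
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket="my-bucket", Key=key)["Body"].read()
    # ... analyse `body` here ...
    with open("results.txt", "a") as results:
        results.write(key + "\n")

process(target)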
Related
I have a large number of small files to download and process from S3.
The downloading is fairly fast, as the individual files are only a few megabytes each. Together, they are about 100 GB. The processing takes roughly twice as long as the downloading and is purely CPU bound. Therefore, by processing files in multiple threads while downloading other files, it should be possible to shorten the overall runtime.
Currently, I am downloading a file, processing it and moving on to the next file. Is there a way in Python where I download all files one after another and process each one as soon as it completes downloading? The key difference here is that while each file is being processed, another is always downloading.
My code looks like:
files = {'txt': ['filepath1', 'filepath2', ...],
         'tsv': ['filepath1', 'filepath2', ...]}

for kind in files.keys():
    subprocess.check_call(f'mkdir -p {kind}', shell=True)
    subprocess.call(f'mkdir -p {kind}/normalized', shell=True)
    for i, file in enumerate(files[kind]):
        subprocess.call(f'aws s3 cp s3://mys3bucket.com/{file} {kind}/', shell=True)
        f = file.split('/')[-1]
        subprocess.check_call(f'my_process_function --input "{kind}/{f}" --output "{kind}/normalized/normalize_{f}" --units relab', shell=True)
I've also written a multiprocessing solution where I can simultaneously download and process multiple files, but this doesn't result in a speed improvement as the network speed was already saturated. The bottleneck is in the processing. I've included it in case it helps you guys.
from contextlib import closing
from os import cpu_count
from multiprocessing import Pool
import subprocess

def download_and_proc(file, kind='txt'):
    subprocess.call(f'aws s3 cp s3://mys3bucket.com/{file} {kind}/', shell=True)
    f = file.split('/')[-1]
    subprocess.check_call(f'my_process_function --input "{kind}/{f}" --output "{kind}/normalized/normalize_{f}" --units relab', shell=True)

with closing(Pool(processes=cpu_count()*2)) as pool:
    pool.map(download_and_proc, files)
Your current multiprocessing code should be pretty close to optimal over the long term. It won't always be downloading at maximum speed, since the same threads of execution that are responsible for downloading a file will wait until the file has been processed before downloading another one. But it should usually have all the CPU consumed in processing, even if some network capacity is going unused. If you tried to always be downloading too, you'd eventually run out of files to download and the network would go idle for the same amount of time, just all at the end of the batch job.
One possible exception is if the time taken to process a file is always exactly the same. Then you might find your workers running in lockstep, where they all download at the same time, then all process at the same time, even though there are more workers than there are CPUs for them to run on. Unless the processing is somehow tied to a real time clock, that doesn't seem likely to occur for very long. Most of the time you'd have some processes finishing before others, and so the downloads would end up getting staggered.
So improving your code is not likely to give you much in the way of a speedup. If you think you need it though, you could split the downloading and processing into two separate pools. It might even work to do one of them as a single-process loop in the main process, but I'll show the full two-pool version here:
def download_worker(file, kind='txt'):
    subprocess.call(f'aws s3 cp s3://mys3bucket.com/{file} {kind}/', shell=True)
    return file

def processing_worker(file, kind='txt'):
    f = file.split('/')[-1]
    subprocess.check_call(f'my_process_function --input "{kind}/{f}" --output "{kind}/normalized/normalize_{f}" --units relab', shell=True)

with Pool() as download_pool, Pool() as processing_pool:
    downloaded_iterator = download_pool.imap(download_worker, files)  # imap returns an iterator
    processing_pool.map(processing_worker, downloaded_iterator)
This should both download and process as fast as your system is capable. If the downloading of a file takes less time than its processing, then it's pretty likely that the first pool will be done before the second one, which the code will handle just fine. If the processing is not the bottleneck, it will support that too (the second pool will be idle some of the time, waiting on files to finish downloading).
I have two folders. FolderA contains dozens of executables, FolderB contains an initial input file and a subsequent input file (both text files).
I would like to write a script that will do the following:
Create a folder for each of the executables
Copy the corresponding executable and a copy of the initial input file into this new folder
Run the executable on this input file
Once that process is complete, copy the subsequent input file and run the executable again
End when this second process is done
This could easily be a for loop, and I could accomplish it using the os package. However, I'd like to see if there is a way to run this process in parallel for all the executables, or for some strategic number of executables at each iteration.
I've never done parallel processing before and I also have no idea how it can be accomplished for such a two-step execution process. Any help would be most appreciated. Thanks.
You can easily use multiprocessing for that.
Write a function which runs the entire process for a given executable:
def foo(exe_path):
    # do stuff for one executable (see the sketch below)
    ...
Then feed it into map:
import os
import multiprocessing

pool = multiprocessing.Pool(os.cpu_count() - 1)
pool.map(foo, list_of_paths)
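For the two-step run described in the question, foo might look something like the sketch below. It is only an illustration under assumptions not in the original post: the input file names initial_input.txt and subsequent_input.txt are hypothetical, and it assumes each executable picks up its input file from the working directory it is run in.

import os
import shutil
import subprocess

def foo(exe_path, input_dir="FolderB"):
    # hypothetical input file names; adjust to whatever FolderB actually contains
    initial = os.path.join(input_dir, "initial_input.txt")
    subsequent = os.path.join(input_dir, "subsequent_input.txt")

    # create a working folder named after the executable
    workdir = os.path.splitext(os.path.basename(exe_path))[0]
    os.makedirs(workdir, exist_ok=True)

    # copy the executable and the initial input into the new folder
    exe_copy = shutil.copy(exe_path, workdir)
    shutil.copy(initial, workdir)

    # first run; subprocess.run blocks until the executable has finished
    subprocess.run([os.path.abspath(exe_copy)], cwd=workdir, check=True)

    # once the first run is done, copy the subsequent input and run again
    shutil.copy(subsequent, workdir)
    subprocess.run([os.path.abspath(exe_copy)], cwd=workdir, check=True)

pool.map(foo, list_of_paths) then runs this whole two-step sequence for several executables at once, one worker process per executable.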
I have ~1000 separate HDF5 files stored on disk. Each only takes around 10 ms to load into memory, so I was wondering what's the best way to load them in parallel such that I get roughly a linear performance boost.
I've tried multiprocessing, but that ends up being slower than just loading them serially due to the overhead of setting up the processes. I've looked into Cython, specifically prange, but I'm having trouble optimizing it to get it faster. Any pointers would be appreciated!
This sounds like a job for MapReduce, but if you only have one machine then I would recommend using pipes. Write one script to open the files and print the data to stdout, then another script that reads the data from stdin and processes it. Then you redirect script1 into script2.
# script1.py
FILES_TO_READ = ...

for filename in FILES_TO_READ:
    # open the file
    # do work
    # print the data, one record per line
    pass

# script2.py
import sys

for line in sys.stdin:  # read records until script1 closes the pipe
    # do work
    pass
$> ./script1.py | ./script2.py
I need to process log files from a remote directory and parse their contents. This task can be divided into the following:
Run find over ssh with criteria to determine which files to process
Get the relevant contents of these files with zgrep
Process the results from step 2 with a Python algorithm and store them in a local DB
Steps 1 and 3 are very fast, so I am looking to improve Step 2.
The files are stored as gz or plaintext and the order in which they are processed is very important. Newer logs need to be processed first to avoid discrepancies with older logs.
To get and filter the log lines, I have tried the following approaches:
Download logs to a temp folder and process them as they are downloaded, in parallel. A Python process triggers an scp command, and a parallel thread inspects the temp folder for completed downloads until scp has finished. Once a file has finished downloading, run zgrep on it, process the output and delete the file.
Run ssh remote zgrep "regex" file1 file2 file3, grab the results and process them.
Method 2 is a more readable and elegant solution, but it is also much slower. Using method 1 I can download and parse 280 files in about 1 minute 30 seconds; using method 2 it takes closer to 5 minutes. One of the main problems with the download-and-process approach is that the directory can be altered while the script is running, which means several extra checks are needed in the code.
To run the shell commands from Python I am currently using subprocess.check_output together with the multiprocessing and threading modules.
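For reference, a rough sketch of what method 2 looks like with subprocess.check_output; the host name, pattern and file names below are placeholders, not from the actual setup.

import subprocess

host = "user@remotehost"           # placeholder
pattern = "some-regex"             # placeholder
log_files = ["a.log.gz", "b.log"]  # placeholder, newest first

# run zgrep remotely and capture all matching lines in one call;
# note that zgrep exits non-zero when nothing matches, which
# check_output raises as CalledProcessError
cmd = ["ssh", host, "zgrep", "-h", pattern] + log_files
output = subprocess.check_output(cmd, text=True)
for line in output.splitlines():
    ...  # step 3: parse the line and store it in the local DB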
Can you think of a way this algorithm can be improved?
I have a directory with thousands of files, and each of them has to be processed (by a Python script) and subsequently deleted.
I would like to write a bash script that reads a file in the folder, processes it, deletes it and moves onto another file - the order is not important. There will be n running instances of this bash script (e.g. 10), all operating on the same directory. They quit when there are no more files left in the directory.
I think this creates a race condition. Could you give me advice (or a code snippet) on how to make sure that no two bash scripts operate on the same file?
Or do you think I should rather implement multithreading in Python (instead of running n different bash scripts)?
You can use the fact that file renames (on the same file system) are atomic on Unix systems, i.e. a file is either renamed or not. For the sake of clarity, let us assume that all files you need to process have names beginning with A (you can avoid this by having a separate folder for the files you are currently processing).
Then, your bash script iterates over the files, tries to rename each one, calls the Python script (I call it process here) if the rename succeeds, and otherwise just continues. Like this:
#!/bin/bash
for file in A*; do
    pfile=processing.$file
    if mv "$file" "$pfile"; then
        process "$pfile"
        rm "$pfile"
    fi
done
This snippet uses the fact that mv returns a 0 exit code if it was able to move the file and a non-zero exit code otherwise.
The only sure way that no two scripts will act on the same file at the same time is to employ some kind of file locking mechanism. A simple way to do this could be to rename the file before beginning work, by appending some known string to the file name. The work is then done and the file deleted. Each script tests the file name before doing anything, and moves on if it is 'special'.
A more complex approach would be to maintain a temporary file containing the names of files that are 'in process'. This file would obviously need to be removed once everything is finished.
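As a rough illustration of the rename-before-work idea in Python (the function name and the processing. prefix are made up for this sketch, not from any particular library):

import glob
import os

def claim_next_file(directory="."):
    # try to claim an unprocessed file by renaming it; on the same file system
    # the rename succeeds for exactly one worker and fails for the others
    for path in glob.glob(os.path.join(directory, "*")):
        name = os.path.basename(path)
        if name.startswith("processing."):
            continue  # already claimed by some worker
        claimed = os.path.join(directory, "processing." + name)
        try:
            os.rename(path, claimed)
        except OSError:
            continue  # another worker got there first; try the next file
        return claimed
    return None  # nothing left to claim

Each worker would call claim_next_file in a loop, process the returned path, delete it, and stop once None comes back.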
I think the solution to your problem is the producer/consumer pattern. I think this solution is the right place to start:
producer/consumer problem with python multiprocessing
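For completeness, a minimal sketch of that producer/consumer pattern with multiprocessing; the worker count and the file-gathering step are placeholders you would adapt to your directory.

from multiprocessing import Process, Queue

N_WORKERS = 10  # e.g. the n instances mentioned in the question

def producer(queue, files):
    # enqueue every file, then one sentinel per consumer so they all stop
    for f in files:
        queue.put(f)
    for _ in range(N_WORKERS):
        queue.put(None)

def consumer(queue):
    while True:
        f = queue.get()
        if f is None:  # sentinel: no more work
            break
        # process the file with your Python script, then delete it
        ...

if __name__ == "__main__":
    files = [...]  # gather the file names in the directory up front
    q = Queue()
    workers = [Process(target=consumer, args=(q,)) for _ in range(N_WORKERS)]
    for w in workers:
        w.start()
    producer(q, files)
    for w in workers:
        w.join()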