Using Python's concurrent.futures to process objects in parallel

I just started using the concurrent.futures library from Python 3 to apply a number of functions to a list of images, in order to process and reshape them.
The functions are resize(height, width) and opacity(number).
I also have an images() function that yields file-like objects,
so I tried this code to process my images in parallel:
import concurrent.futures
from mainfile import images
from mainfile import shape

def parallel_image_processing():
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        future = executor.submit(images)
        for fileobject in future.result():
            future1 = executor.submit(shape.resize, fileobject, "65", "85")
            future2 = executor.submit(shape.opacity, fileobject, "0.5")
Could somebody tell me if I am on the right path to accomplish this?

I would recommend making images just yield a path, rather than an open file object:
def images():
    ...
    yield os.path.join(image_dir[0], filename)
And then using this:
from functools import partial

def open_and_call(func, filename, args=(), kwargs={}):
    with open(filename, 'rb') as f:
        return func(f, *args, **kwargs)

def parallel_image_processing():
    resize_func = partial(open_and_call, shape.resize, args=("65", "85"))
    opacity_func = partial(open_and_call, shape.opacity, args=("0.5",))  # note the comma: one-element tuple
    img_list = list(images())
    with concurrent.futures.ProcessPoolExecutor(max_workers=5) as executor:
        # executor.map returns iterators over the results, not Future objects;
        # consuming them waits for the workers and re-raises any worker exceptions.
        resize_results = list(executor.map(resize_func, img_list))
        opacity_results = list(executor.map(opacity_func, img_list))

if __name__ == "__main__":
    # Make sure the entry point to the function that creates the executor
    # is inside an `if __name__ == "__main__"` guard if you're on Windows.
    parallel_image_processing()
If you're using CPython (as opposed to an alternative implementation without a GIL, like Jython), you don't want to use ThreadPoolExecutor, because image processing is CPU-intensive; due to the GIL, only one thread can execute Python bytecode at a time in CPython, so you won't actually do anything in parallel if you use threads for this use-case. Instead, use ProcessPoolExecutor, which uses processes instead of threads and avoids the GIL altogether. Note that this is why I recommended not yielding file-like objects from images - you can't pass an open file handle to the worker processes. You have to open the files in the workers instead.
To do this, we have our executor call a little shim function (open_and_call), which will open the file in the worker process, and then call the resize/opacity functions with the correct arguments.
I'm also using executor.map instead of executor.submit, so that we can call resize/opacity for every item yielded by images() without an explicit for loop. I use functools.partial to make it easier to call a function taking multiple arguments with executor.map (which, used with a single iterable, passes only one argument to the mapped function).
There's also no need to call images() in the executor, since you're going to wait for its results before continuing anyway. Just call it like a normal function. I convert the generator object returned by images() to a list prior to calling map, as well. If you're concerned about memory usage, you can call images() directly in each map call, but if not, it's probably faster to just call images() once and store it as a list.

Related

Multiprocessing where new process starts halfway through other process

I have a Python script that does two things: 1) it downloads a large file by making an API call, and 2) it preprocesses that large file. I want to use multiprocessing to run my script. Each individual part (1 and 2) takes quite long. Everything happens in memory due to the large size of the files, so ideally a single core would do both (1) and (2) consecutively.
I have a large number of cores available (100+), but I can only have 4 API calls running at the same time (a limitation set by the API developers). So what I want to do is spawn 4 cores that start downloading by making an API call, and as soon as one of those cores is done downloading and starts preprocessing, I want a new core to start the whole process as well. That way there are always 4 cores downloading, and as many cores as needed doing the preprocessing. I do not know, however, how to have a new core spawn as soon as another core is finished with the first part of the script.
My actual code is way too complex to just dump here, but let's say I have the following two functions:
import requests

def make_api_call(val):
    """Function that does part 1): makes an API call, stores the result in memory
    and returns a large satellite GeoTIFF.
    """
    large_image = requests.get(val)
    return large_image

def preprocess_large_image(large_image):
    """Function that does part 2): preprocesses a large image and returns the relevant data."""
    results = preprocess(large_image)
    return results
How can I make sure that as soon as a single core/process is finished with make_api_call and starts preprocess_large_image, another core spawns and starts the entire process as well, so that there are always 4 images downloading side by side? Thank you in advance for the help!
This is a perfect application for a multiprocessing.Semaphore (or, for safety, a BoundedSemaphore)! Basically you put a lock around the API-call part of the process, but let up to 4 worker processes hold the lock at any given time. For various reasons, things like Lock, Semaphore, Queue, etc. all need to be passed at the creation of a Pool, rather than when a method like map or imap is called. This is done by specifying an initialization function in the pool constructor.
import multiprocessing as mp

def api_call(arg):
    # placeholder for the real API call
    foo = f"downloaded {arg}"
    return foo

def process_data(foo):
    # placeholder for the real preprocessing
    return "done"

def map_func(arg):
    global semaphore
    with semaphore:          # held only while downloading
        foo = api_call(arg)
    return process_data(foo)  # semaphore released, so another worker can start downloading

def init_pool(s):
    global semaphore
    semaphore = s

if __name__ == "__main__":
    arglist = range(16)          # placeholder work items
    n_workers = 8                # should be great enough that you always have a free worker waiting on semaphore.acquire()
    s = mp.BoundedSemaphore(4)   # max concurrent API calls
    with mp.Pool(n_workers, init_pool, (s,)) as p:
        for result in p.imap(map_func, arglist):
            print(result)
If both the downloading (part 1) and the conversion (part 2) take a long time, there is not much reason to do everything in memory.
Keep in mind that networking is generally slower than disk operations.
So I would suggest using two pools, saving the downloaded files to disk, and sending file names to the workers.
The first Pool is created with four workers and does the downloading. The worker saves the image to a file and returns the filename. With this Pool you use the imap_unordered method, because that starts yielding values as soon as they become available.
The second Pool does the image processing. It gets fed by apply_async, which returns an AsyncResult object.
We need to save those to keep track of when all the conversions are finished.
Note that map or imap_unordered are not suitable here because they require a ready-made iterable.
import multiprocessing
import time

import requests

def download(url):
    large_image = requests.get(url)
    filename = url_to_filename(url)  # you need to write this
    with open(filename, "wb") as imgf:
        imgf.write(large_image.content)
    return filename

def process_image(name):
    with open(name, "rb") as f:
        large_image = f.read()
    # File processing goes here
    with open(name, "wb") as f:
        f.write(large_image)
    return name

dlp = multiprocessing.Pool(processes=4)
# Default pool size is os.cpu_count(); might be too much.
imgp = multiprocessing.Pool(processes=20)
urllist = ['http://foo', 'http://bar']  # et cetera
in_progress = []
for name in dlp.imap_unordered(download, urllist):
    in_progress.append(imgp.apply_async(process_image, (name,)))
# Wait for the conversions to finish.
while in_progress:
    finished = []
    for res in in_progress:
        if res.ready():
            finished.append(res)
    for f in finished:
        in_progress.remove(f)
        print(f"Finished processing '{f.get()}'.")
    time.sleep(0.1)

Python Multiprocessing return results as set in chunksize

I would like to process a large number of csv files stored in file_list with a function called get_scores_dataframe. This function takes a second argument, phenotypes, stored in another list. The function then writes the result back to csv files. I managed to parallelize this task using ProcessPoolExecutor(), and it works.
with concurrent.futures.ProcessPoolExecutor() as executor:
    phenotypes = [phenotype for i in range(len(file_list))]
    futures = executor.map(get_scores_dataframe, file_list, phenotypes,
                           chunksize=25)
    filenames = executor.map(os.path.basename, file_list)
    for future, filename in zip(futures, filenames):
        futures.to_csv(os.path.join(f'{output_path}', f'{filename}.csv'),
                       index=False)
As you can see, I am using a context manager for this, and within the context manager the map() method, where I can set the option chunksize. However, I would like the program to write the csv files as it finishes processing each dataframe. It appears that the context manager waits until all jobs are done and then writes the results to the csv files.
Do you have an idea how I can achieve this?
First, executor.map does not return Future instances, so your variable futures is poorly named. It does return an iterator that yields the return values of applying get_scores_dataframe to each element of file_list in turn. Second, seeing how this is used next, it would appear that these return values are input files (which may or may not be the same file as the input argument -- can't be sure from the lack of code shown). Also, using the process pool map function rather than the builtin map function to get the base name of the filename arguments seems like overkill. Finally, in your code, it would not be futures.to_csv, but rather future.to_csv. So I am confused as to how your code could have worked at all.
If you modify your function get_scores_dataframe to return a tuple consisting of a dataframe and the original passed filename argument, then we can process the results in completion order using as_completed:
import concurrent.futures
from concurrent.futures import as_completed
import multiprocessing
import os

with concurrent.futures.ProcessPoolExecutor(multiprocessing.cpu_count() - 1) as executor:
    futures = [executor.submit(get_scores_dataframe, file, phenotype) for file in file_list]
    for future in as_completed(futures):
        # it is assumed the return value is a tuple: (data frame, original filename argument)
        df, file = future.result()
        csv_filename = os.path.basename(file)
        df.to_csv(os.path.join(f'{output_path}', f'{csv_filename}.csv'), index=False)
Now, by using submit you are losing the ability to chunk up job submissions. We can switch to using multiprocessing.Pool with imap_unordered. But imap_unordered can only pass a single argument to the worker function. So, if you are able to modify your worker to change the order of the arguments, we can make phenotype the first one and use a partial (see the manual):
import multiprocessing
from functools import partial

POOL_SIZE = multiprocessing.cpu_count() - 1  # leave 1 for the main process

def compute_chunksize(iterable_size):
    if iterable_size == 0:
        return 0
    chunksize, extra = divmod(iterable_size, POOL_SIZE * 4)
    if extra:
        chunksize += 1
    return chunksize

with multiprocessing.Pool(POOL_SIZE) as pool:
    chunksize = compute_chunksize(len(file_list))
    worker = partial(get_scores_dataframe, phenotype)
    # it is assumed that get_scores_dataframe returns a tuple: (data frame, original filename argument)
    for df, file in pool.imap_unordered(worker, file_list, chunksize):
        csv_filename = os.path.basename(file)
        df.to_csv(os.path.join(f'{output_path}', f'{csv_filename}.csv'), index=False)

multiprocessing.Pool: calling helper functions when using apply_async's callback option

How does the flow of apply_async work between calling the iterable (?) function and the callback function?
Setup: I am reading some lines of all the files inside a 2000-file directory, some with millions of lines, some with only a few. Some header/formatting/date data is extracted to characterize each file. This is done on a 16-CPU machine, so it made sense to multiprocess it.
Currently, the expected result is being sent to a list (ahlala) so I can print it out; later, this will be written to *.csv. This is a simplified version of my code, originally based off this extremely helpful post.
import multiprocessing as mp
import os

def dirwalker(directory):
    ahlala = []

    # X() reads files and grabs lines, calls helper function to calculate
    # info, and returns stuff to the callback function
    def X(f):
        fileinfo = Z(arr_of_lines)
        return fileinfo

    # Y() reads other types of files and does the same thing
    def Y(f):
        fileinfo = Z(arr_of_lines)
        return fileinfo

    # results() is the callback function
    def results(r):
        ahlala.extend(r)  # or .append, haven't yet decided

    # helper function
    def Z(arr):
        return fileinfo  # to X() or Y()!

    for _, _, files in os.walk(directory):
        pool = mp.Pool(mp.cpu_count())
        for f in files:
            if filetype(f) == filetypeX:
                pool.apply_async(X, args=(f,), callback=results)
            elif filetype(f) == filetypeY:
                pool.apply_async(Y, args=(f,), callback=results)
        pool.close()
        pool.join()
    return ahlala
Note, the code works if I put all of Z(), the helper function, into either X(), Y(), or results(), but is this either repetitive or possibly slower than it could be? I know that the callback function is called for every function call, but when is the callback function called? Is it after pool.apply_async()...finishes all the jobs for the processes? Shouldn't it be faster if these helper functions were called within the scope (?) of the first function pool.apply_async() takes (in this case, X())? If not, should I just put the helper function in results()?
Other related ideas: Are daemon processes why nothing shows up? I am also very confused about how to queue things, and whether this is the problem. This seems like a place to start learning it, but can queuing be safely ignored when using apply_async, or only at the cost of a noticeable time inefficiency?
You're asking about a whole bunch of different things here, so I'll try to cover it all as best I can:
The function you pass to callback will be executed in the main process (not the worker) as soon as the worker process returns its result. It is executed in a thread that the Pool object creates internally. That thread consumes objects from a result_queue, which is used to get the results from all the worker processes. After the thread pulls the result off the queue, it executes the callback. While your callback is executing, no other results can be pulled from the queue, so it's important that the callback finishes quickly. With your example, as soon as one of the calls to X or Y you make via apply_async completes, the result will be placed into the result_queue by the worker process, and then the result-handling thread will pull the result off of the result_queue, and your callback will be executed.
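To make that concrete, here is a minimal sketch (using a made-up work function purely for illustration) that prints where each piece runs: the worker executes in a child process, while the callback executes in the parent process, on the Pool's internal result-handler thread.
import multiprocessing as mp
import threading

def work(x):
    # runs in a child process
    return x, mp.current_process().name

def on_result(result):
    # runs in the parent process, on the Pool's internal result-handler thread
    value, worker_name = result
    print(f"{value} was computed in {worker_name}; "
          f"callback is running in {mp.current_process().name} "
          f"on thread {threading.current_thread().name}")

if __name__ == "__main__":
    with mp.Pool(2) as pool:
        for i in range(4):
            pool.apply_async(work, (i,), callback=on_result)
        pool.close()
        pool.join()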
Second, I suspect the reason you're not seeing anything happen with your example code is because all of your worker function calls are failing. If a worker function fails, callback will never be executed. The failure won't be reported at all unless you try to fetch the result from the AsyncResult object returned by the call to apply_async. However, since you're not saving any of those objects, you'll never know the failures occurred. If I were you, I'd try using pool.apply while you're testing so that you see errors as soon as they occur.
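As a minimal sketch of that (with a made-up might_fail worker purely for illustration), keeping the AsyncResult objects around and calling .get() on them makes worker exceptions visible instead of silent:
import multiprocessing as mp

def might_fail(x):
    if x == 2:
        raise ValueError("boom")   # simulate a worker failure
    return x * x

def on_result(value):
    print("callback got", value)

if __name__ == "__main__":
    with mp.Pool(2) as pool:
        async_results = [pool.apply_async(might_fail, (i,), callback=on_result)
                         for i in range(4)]
        pool.close()
        pool.join()
    # .get() re-raises any exception the worker hit; without it the failure is
    # silent and the callback simply never fires for that task.
    for r in async_results:
        try:
            r.get()
        except ValueError as exc:
            print("worker failed:", exc)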
The reason the workers are probably failing (at least in the example code you provided) is because X and Y are defined as functions inside another function. multiprocessing passes functions and objects to worker processes by pickling them in the main process, and unpickling them in the worker processes. Functions defined inside other functions are not picklable, which means multiprocessing won't be able to successfully unpickle them in the worker process. To fix this, define both functions at the top level of your module, rather than embedded inside the dirwalker function.
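You can see that pickling limitation in isolation with a small sketch (no multiprocessing involved, just pickle):
import pickle

def top_level():
    return "ok"

def outer():
    def nested():
        return "ok"
    return nested

pickle.dumps(top_level)    # a module-level function pickles fine (by reference)
try:
    pickle.dumps(outer())  # a function defined inside another function does not
except (pickle.PicklingError, AttributeError) as exc:
    print("can't pickle nested function:", exc)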
You should definitely continue to call Z from X and Y, not in results. That way, Z can be run concurrently across all your worker processes, rather than having to be run one call at a time in your main process. And remember, your callback function is supposed to be as quick as possible, so you don't hold up processing results. Executing Z in there would slow things down.
Here's some simple example code that's similar to what you're doing, that hopefully gives you an idea of what your code should look like:
import multiprocessing as mp
import os

# X() reads files and grabs lines, calls helper function to calculate
# info, and returns stuff to the callback function
def X(f):
    fileinfo = Z(f)
    return fileinfo

# Y() reads other types of files and does the same thing
def Y(f):
    fileinfo = Z(f)
    return fileinfo

# helper function
def Z(arr):
    return arr + "zzz"

def dirwalker(directory):
    ahlala = []

    # results() is the callback function
    def results(r):
        ahlala.append(r)  # or .extend, haven't yet decided

    for _, _, files in os.walk(directory):
        pool = mp.Pool(mp.cpu_count())
        for f in files:
            if len(f) > 5:  # Just an arbitrary thing to split up the list with
                # In Python 3 there's also an error_callback you can use to handle
                # errors. It's not available in Python 2.7 though :(
                pool.apply_async(X, args=(f,), callback=results)
            else:
                pool.apply_async(Y, args=(f,), callback=results)
        pool.close()
        pool.join()
    return ahlala

if __name__ == "__main__":
    print(dirwalker("/usr/bin"))
Output:
['ftpzzz', 'findhyphzzz', 'gcc-nm-4.8zzz', 'google-chromezzz' ... # lots more here ]
Edit:
You can create a dict object that's shared between your parent and child processes using the multiprocessing.Manager class:
pool = mp.Pool(mp.cpu_count())
m = mp.Manager()
helper_dict = m.dict()
for f in files:
    if len(f) > 5:
        pool.apply_async(X, args=(f, helper_dict), callback=results)
    else:
        pool.apply_async(Y, args=(f, helper_dict), callback=results)
Then make X and Y take a second argument called helper_dict (or whatever name you want), and you're all set.
The caveat is that this works by creating a server process that contains a normal dict, and all your other processes talk to that one dict via a Proxy object. So every time you read or write to the dict, you're doing IPC. This makes it a lot slower than a real dict.
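As a rough sketch of that change (assuming, purely for illustration, that the shared dict is used to cache per-file results from Z in the example above):
def X(f, helper_dict):
    if f in helper_dict:            # every read goes through the Manager proxy (IPC)
        return helper_dict[f]
    fileinfo = Z(f)                 # Z as in the example above
    helper_dict[f] = fileinfo       # every write goes through the proxy as well
    return fileinfo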

python multiprocessing.pool.imap() alternatives

I'm running into an issue when using pool, so I need to replace it with some alternative.
I tried joblib.Parallel, but it works more like map, not imap. I'd like to get the results as an iterator and be able to start handling them as soon as the first one is available, without waiting for the others to complete.
You could use pool.apply_async with a callback function, which will be called for each one as it finishes.
The callback can only take one argument, so if your function returns more than one thing, your callback function will need to be aware of this and unpack it (it will receive a single argument: a tuple containing the function's return values); see the follow-up sketch after the example below.
Example:
from multiprocessing import Pool
import time

def f(x):
    time.sleep(1 / ((x + 1) * 100))  # x + 1 avoids dividing by zero when x == 0
    return x ** 2

def callback(value):
    print(value)

if __name__ == "__main__":
    # in Python 3.3+ you can use a context manager with the pool
    pool = Pool(3)
    for value in range(10):
        pool.apply_async(f, (value,), callback=callback)
    pool.close()
    pool.join()
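Building on the note above about multi-value returns, here is a minimal follow-up sketch (with a made-up g function) where the worker returns a tuple and the callback unpacks it:
from multiprocessing import Pool

def g(x):
    return x, x ** 2              # two values -> delivered to the callback as one tuple

def callback(result):
    value, square = result        # unpack the single tuple argument
    print(f"{value} squared is {square}")

if __name__ == "__main__":
    pool = Pool(3)
    for value in range(5):
        pool.apply_async(g, (value,), callback=callback)
    pool.close()
    pool.join()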

How to efficiently iterate over multiple generators?

I've got three different generators, which yield data from the web. Therefore, each iteration may take a while until it's done.
I want to mix the calls to the generators, and thought about round-robin (found here).
The problem is that every call is blocked until it's done.
Is there a way to loop through all the generators at the same time, without blocking?
You can do this with the iter() method on my ThreadPool class.
pool.iter() yields threaded function return values until all of the decorated+called functions finish executing. Decorate all of your async functions, call them, then loop through pool.iter() to catch the values as they happen.
Example:
import time
from threadpool import ThreadPool

pool = ThreadPool(max_threads=25, catch_returns=True)

# decorate any functions you need to aggregate
# if you're pulling a function from an outside source
# you can still say 'func = pool(func)' or 'pool(func)()'
@pool
def data(ID, start):
    for i in range(start, start + 4):
        yield ID, i
        time.sleep(1)

# each of these calls will spawn a thread and return immediately
# make sure you do either pool.finish() or pool.iter()
# otherwise your program will exit before the threads finish
data("generator 1", 5)
data("generator 2", 10)
data("generator 3", 64)

for value in pool.iter():
    # this will print the generators' return values as they yield
    print(value)
In short, no: there's no good way to do this without threads.
Sometimes ORMs are augmented with some kind of peek function or callback that will signal when data is available. Otherwise, you'll need to spawn threads in order to do this. If threads are not an option, you might try switching out your database library for an asynchronous one.
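If you do go the threading route with just the standard library, a minimal sketch of the idea (the interleave helper below is an illustration, not an existing API) is to drain each generator in its own thread and push items onto a shared queue, so the main loop sees values as soon as any generator produces one:
import threading
import queue

def interleave(*generators):
    q = queue.Queue()
    done = object()  # sentinel marking one generator as exhausted

    def drain(gen):
        for item in gen:
            q.put(item)
        q.put(done)

    for gen in generators:
        threading.Thread(target=drain, args=(gen,), daemon=True).start()

    remaining = len(generators)
    while remaining:
        item = q.get()
        if item is done:
            remaining -= 1
        else:
            yield item

# usage: for value in interleave(gen1(), gen2(), gen3()): ...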
