Using threading/multiprocessing in Python to download images concurrently

I have a list of search queries to build a dataset:
classes = [...]. There are 100 search queries in this list.
Basically, I divide the list into 4 chunks of 25 queries.
def divide_chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i:i + n]

classes = list(divide_chunks(classes, 25))
And below I've created a function that downloads the queries of a given chunk one by one:
def download_chunk(n):
    for label in classes[n]:
        try:
            downloader.download(label, limit=1000, output_dir='dataset', adult_filter_off=True, force_replace=False, verbose=True)
        except:
            pass
However, I want to run the 4 chunks concurrently. In other words, I want to run 4 separate iterative operations at the same time. I tried both the threading and multiprocessing approaches, but neither of them works:
process_1 = Process(target=download_chunk(0))
process_1.start()
process_2 = Process(target=download_chunk(1))
process_2.start()
process_3 = Process(target=download_chunk(2))
process_3.start()
process_4 = Process(target=download_chunk(3))
process_4.start()
process_1.join()
process_2.join()
process_3.join()
process_4.join()
###########################################################
thread_1 = threading.Thread(target=download_chunk(0)).start()
thread_2 = threading.Thread(target=download_chunk(1)).start()
thread_3 = threading.Thread(target=download_chunk(2)).start()
thread_4 = threading.Thread(target=download_chunk(3)).start()

You're calling download_chunk yourself when you construct each Process/Thread and passing its return value (None) as the target, so all the work happens sequentially in the main process. You need to provide the function and its arguments separately so that the call is deferred to the new thread/process:
For example:
Process(target=download_chunk, args=(0,))
Refer to the multiprocessing docs for more information about using the multiprocessing.Process class.
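Applied to all four chunks, a minimal sketch of that corrected pattern (reusing download_chunk from the question) might look like this:
from multiprocessing import Process

if __name__ == '__main__':
    # The function object and its argument are passed separately;
    # download_chunk is only called inside each child process.
    processes = [Process(target=download_chunk, args=(n,)) for n in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
The same fix applies to the threading version: threading.Thread(target=download_chunk, args=(0,)).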
For this use-case, I would suggest using multiprocessing.Pool:
from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(4) as pool:
        pool.map(download_chunk, range(4))
It handles the work of creating, starting, and later joining the 4 processes. The pool calls download_chunk once for each argument in the iterable, which is range(4) in this case, distributing those calls across its processes.
More info about multiprocessing.Pool can be found in the docs.
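Since downloading images is mostly I/O-bound, a thread pool is also worth considering. This is only a sketch, under the assumption that the downloader spends most of its time waiting on the network rather than holding the GIL:
from concurrent.futures import ThreadPoolExecutor

if __name__ == '__main__':
    # Four worker threads, one chunk index each; the with-block waits for all of them.
    with ThreadPoolExecutor(max_workers=4) as executor:
        executor.map(download_chunk, range(4))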

Related

Programmatically setting number of processes with ray

I want to use Ray to parallelize some computations in python. As part of this, I want a method which takes the desired number of worker processes as an argument.
The introductory articles on Ray that I can find say to specify the number of processes at the top level, which is different from what I want. Is it possible to specify it the way one would when instantiating e.g. a multiprocessing.Pool object, as illustrated below?
Example using multiprocessing:
import multiprocessing as mp

def f(x):
    return 2 * x

def compute_results(x, n_jobs=4):
    with mp.Pool(n_jobs) as pool:
        res = pool.map(f, x)
    return res

data = [1, 2, 3]
results = compute_results(data, n_jobs=4)
Example using Ray:
import ray

# Tutorials say to designate the number of cores already here
@ray.remote
def f(x):
    return 2 * x

def compute_results(x):
    result_ids = [f.remote(val) for val in x]
    res = ray.get(result_ids)
    return res
If you run f.remote() four times, Ray will create four worker processes to run them.
By the way, you can use multiprocessing.Pool with Ray: https://docs.ray.io/en/latest/ray-more-libs/multiprocessing.html
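Following that link, here is a sketch of passing the worker count at call time via Ray's drop-in pool; it assumes the ray.util.multiprocessing.Pool wrapper described in those docs:
from ray.util.multiprocessing import Pool

def f(x):
    return 2 * x

def compute_results(x, n_jobs=4):
    # Same shape as the multiprocessing example above, but backed by Ray workers.
    pool = Pool(processes=n_jobs)
    return pool.map(f, x)

results = compute_results([1, 2, 3], n_jobs=4)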

Can multiprocessing.Queue replace Manager.list() in python?

I am working on a project where I am using multiprocessing and trying to minimize the total time. (I have tested that one process takes around 4 secs, so if 8 processes work in parallel they should take around the same time, or let's say around 6 to 7 secs at most.)
Among the arguments, a Manager().list() (let's call it main_list) is passed to every process; after processing a txt file (which involves conversions, transformations and multiplications of hex data), each process appends a list to main_list.
The same procedure is followed in all 8 processes.
Using Manager().list(), it was taking around 22 secs. I wanted a way around this so I could reduce that time. Now I am using a Queue to achieve my goal, but it seems like the queue will not be effective for this approach?
import multiprocessing as mp
import time

def square(x, q):
    q.put((x, x * x))

if __name__ == '__main__':
    qout = mp.Queue()
    processes = []
    t1 = time.perf_counter()
    for i in range(10):
        p = mp.Process(target=square, args=(i, qout))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
    unsorted_result = [qout.get() for p in processes]
    result = [t[1] for t in sorted(unsorted_result)]
    t2 = time.perf_counter()
    print(t2 - t1)
    print(result)
OUTPUT
0.7646916
I want to be sure whether I can use a Queue this way instead of Manager().list() to reduce this time.
I am sorry for not sharing the actual code.
See my comment to your question. This would be the solution using a multiprocessing pool with its map method:
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    # Create a pool with 10 processes:
    pool = Pool(10)
    result = pool.map(square, range(10))
    print(result)
    pool.close()
    pool.join()
Prints:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
The managed list that you were using is represented by a proxy object. Every append operation you do on that list results in a message being sent to a thread running in a process started by the multiprocessing.SyncManager instance that was created when you presumably called multiprocessing.Manager(). The actual list resides in that process. So managed lists are generally not the most efficient solution available.
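If you do stick with a Queue rather than a Pool, one way to keep the traffic low, sketched here under the assumption that each process can build its result locally (as in your hex-processing description), is to put one batched message per process instead of one per item, and to drain the queue before joining:
import multiprocessing as mp

def worker(chunk, q):
    # Build the whole result locally, then send it as a single message.
    local = [x * x for x in chunk]
    q.put(local)

if __name__ == '__main__':
    q = mp.Queue()
    chunks = [range(0, 5), range(5, 10)]
    procs = [mp.Process(target=worker, args=(c, q)) for c in chunks]
    for p in procs:
        p.start()
    # Drain the queue before joining so a full pipe buffer can't block the workers.
    results = [q.get() for _ in procs]
    for p in procs:
        p.join()
    print(results)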

Plotting the pool map for multi processing Python

How can I run a multiprocessing pool where run1-3 are processed asynchronously, using a multiprocessing tool in Python? I am trying to pass the values (10,2,4), (55,6,8), (9,8,7) for run1, run2, run3 respectively.
import multiprocessing

def Numbers(number, number2, divider):
    value = number * number2 / divider
    return value

if __name__ == "__main__":
    with multiprocessing.Pool(3) as pool:  # 3 processes
        run1, run2, run3 = pool.map(Numbers, [(10,2,4),(55,6,8),(9,8,7)])  # map input & output
You just need to use method starmap instead of map, which, according to the documentation:
Like map() except that the elements of the iterable are expected to be iterables that are unpacked as arguments.
Hence an iterable of [(1,2), (3, 4)] results in [func(1,2), func(3,4)].
import multiprocessing

def Numbers(number, number2, divider):
    value = number * number2 / divider
    return value

if __name__ == "__main__":
    with multiprocessing.Pool(3) as pool:  # 3 processes
        run1, run2, run3 = pool.starmap(Numbers, [(10,2,4),(55,6,8),(9,8,7)])  # map input & output
        print(run1, run2, run3)
Prints:
5.0 41.25 10.285714285714286
Note
This is the correct way of doing what you want to do, but you will not find that using multiprocessing for such a trivial worker function improves performance; in fact, it will degrade performance because of the overhead of creating the pool and passing arguments and results between address spaces.
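Since the question also mentions running the three calls asynchronously: if you want to submit them without blocking and collect the results later, a sketch using starmap_async with the same worker function could look like this:
import multiprocessing

def Numbers(number, number2, divider):
    return number * number2 / divider

if __name__ == "__main__":
    with multiprocessing.Pool(3) as pool:
        # starmap_async returns immediately; get() blocks until all three finish.
        async_result = pool.starmap_async(Numbers, [(10, 2, 4), (55, 6, 8), (9, 8, 7)])
        # ... the main process could do other work here ...
        run1, run2, run3 = async_result.get()
    print(run1, run2, run3)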
Python's multiprocessing library does, however, have a wrapper for piping data between a parent and child process: the Manager, which provides shared data structures such as a shared dictionary. There is a good Stack Overflow post here about the topic.
Using multiprocessing you can pass unique arguments and a shared dictionary to each process, and you must ensure each process writes to a different key in the dictionary.
An example of this in use given your example is as follows:
import multiprocessing

def worker(process_key, return_dict, compute_array):
    """worker function"""
    number = compute_array[0]
    number2 = compute_array[1]
    divider = compute_array[2]
    return_dict[process_key] = number * number2 / divider

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    return_dict = manager.dict()
    jobs = []
    compute_arrays = [[10, 2, 4], [55, 6, 8], [9, 8, 7]]
    for i in range(len(compute_arrays)):
        p = multiprocessing.Process(target=worker, args=(
            i, return_dict, compute_arrays[i]))
        jobs.append(p)
        p.start()
    for proc in jobs:
        proc.join()
    print(return_dict)
Edit: The information from Booboo is much more precise. I had a recommendation for threading, which I'm removing as it's certainly not the right tool in Python here due to the GIL.

How to concurrently call multiple separate functions and get ordered results in Python's concurrent.futures?

Suppose I have two independent functions. I'd like to call them concurrently, using python's concurrent.futures.ThreadPoolExecutor. Is there a way to call them using Executor and ensure they are returned in order of submission?
I understand this is possible with Executor.map, but I am looking to parallelize two separate functions, not one function with an iterable input.
I have example code below, but it doesn't guarantee that fn_a will return first (by design of the wait function).
from concurrent.futures import ThreadPoolExecutor, wait
import time

def fn_a():
    t_sleep = 0.5
    print("fn_a: Wait {} seconds".format(t_sleep))
    time.sleep(t_sleep)
    ret = t_sleep * 5  # Do unique work
    return "fn_a: return {}".format(ret)

def fn_b():
    t_sleep = 1.0
    print("fn_b: Wait {} seconds".format(t_sleep))
    time.sleep(t_sleep)
    ret = t_sleep * 10  # Do unique work
    return "fn_b: return {}".format(ret)

if __name__ == "__main__":
    with ThreadPoolExecutor() as executor:
        futures = []
        futures.append(executor.submit(fn_a))
        futures.append(executor.submit(fn_b))
        complete_futures, incomplete_futures = wait(futures)
        for f in complete_futures:
            print(f.result())
I'm also interested in knowing whether there is a way to do this with joblib.
Think I found a reasonable option using lambda and partials. The partials allow me to pass arguments to some functions in the parallelized iterable, but not others.
from functools import partial
import concurrent.futures

fns = [partial(fn_a), partial(fn_b)]
data = []
with concurrent.futures.ThreadPoolExecutor() as executor:
    try:
        for result in executor.map(lambda x: x(), fns):
            data.append(result)
    except Exception as exc:
        print("A submitted callable raised:", exc)
Since it uses executor.map, the results are returned in order of submission.
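Another option, sketched only from the code already shown above: because futures is a plain list kept in submission order, iterating it and calling result() on each future preserves that order without needing wait():
from concurrent.futures import ThreadPoolExecutor

if __name__ == "__main__":
    with ThreadPoolExecutor() as executor:
        futures = [executor.submit(fn_a), executor.submit(fn_b)]
        # result() blocks until each future finishes, so the results print
        # in submission order regardless of which function finishes first.
        for f in futures:
            print(f.result())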

Python process merge results

I am currently trying to implement a class to do intensive computations:
import random
import multiprocessing as mp

class IntensiveStuff:
    def __init__(self):
        self.N = 20
        self.nb_process = 4
        self.set_of_things = set()

    def lunch_multiprocessing(self):
        processes = []
        for i in range(self.nb_process):
            processes.append(mp.Process(target=self.process_method, args=()))
        [x.start() for x in processes]
        [x.join() for x in processes]
        set_of_things = ...  # I want all the sub_set of 'process_method' updated in set_of_things

    def process_method(self):
        sub_set = set()
        for _ in range(self.N):
            sub_set.add(random.randint(0, 100))
I want to compute independent calculations, put the results in a sub_set for each process, and merge all the sub_sets into set_of_things (which contains objects in the real code).
I have tried to use Queue without success; any advice?
P.S.: I have also tried to reproduce the code in Can a set() be shared between Python processes? but without any luck.
Processes don't share memory by default, but they can communicate via pipes, sockets, etc. The multiprocessing module has special objects for this (I believe they use pipes under the hood). multiprocessing.Queue should also work, but I often use these two objects:
multiprocessing.Manager().list() and
multiprocessing.Manager().dict()
results = mp.Manager().list()

# now a bit of your code
processes = []
for i in range(self.nb_process):
    processes.append(mp.Process(target=self.process_method, args=(results,)))

def process_method(self, results):
    sub_set = set()
    for _ in range(self.N):
        sub_set.add(random.randint(0, 100))
    results.append(sub_set)  # or whatever you really need to add to results
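Putting the pieces together for the merge step the question asks about, a sketch that keeps the class from the question and uses the Manager().list() approach from this answer could look like this:
import random
import multiprocessing as mp

class IntensiveStuff:
    def __init__(self):
        self.N = 20
        self.nb_process = 4
        self.set_of_things = set()

    def lunch_multiprocessing(self):
        results = mp.Manager().list()
        processes = [mp.Process(target=self.process_method, args=(results,))
                     for _ in range(self.nb_process)]
        [x.start() for x in processes]
        [x.join() for x in processes]
        # Merge every sub_set produced by the workers into one set.
        merged = set()
        for sub_set in results:
            merged |= sub_set
        self.set_of_things = merged

    def process_method(self, results):
        sub_set = set()
        for _ in range(self.N):
            sub_set.add(random.randint(0, 100))
        results.append(sub_set)

if __name__ == '__main__':
    stuff = IntensiveStuff()
    stuff.lunch_multiprocessing()
    print(stuff.set_of_things)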
