How to "batch write" from output Queue using multiprocessing? - python

Suppose I have the following multiprocessing structure:
import multiprocessing as mp

def worker(working_queue, output_queue):
    while True:
        if working_queue.empty() == True:
            break
        else:
            picked = working_queue.get()
            res_item = "Number " + str(picked)
            output_queue.put(res_item)
    return

if __name__ == '__main__':
    static_input = xrange(100)
    working_q = mp.Queue()
    output_q = mp.Queue()
    results_bank = []
    for i in static_input:
        working_q.put(i)
    processes = [mp.Process(target=worker, args=(working_q, output_q)) for i in range(2)]
    for proc in processes:
        proc.start()
    for proc in processes:
        proc.join()
    results_bank = []
    while True:
        if output_q.empty() == True:
            break
        results_bank.append(output_q.get_nowait())
    if len(results_bank) == len(static_input):
        print "Good run"
    else:
        print "Bad run"
My question: How would I 'batch' write my results to a single file while the working_queue is still 'working' (or at least, not finished)?
Note: My actual data structure is not sensitive to unordered results relative to inputs (despite my example using integers).
Also, I think that batch/set writing from the output queue is better practice than writing from the growing results bank object. However, I am open to solutions relying on either approach. I am new to multiprocessing, so I am unsure of the best practice or the most efficient solution(s) to this question.

If you wish to use mp.Processes and mp.Queues, here is a way to process the results in batches. The main idea is in the writer function, below:
import itertools as IT
import multiprocessing as mp

SENTINEL = None
static_len = 100

def worker(working_queue, output_queue):
    for picked in iter(working_queue.get, SENTINEL):
        res_item = "Number {:2d}".format(picked)
        output_queue.put(res_item)

def writer(output_queue, threshold=10):
    result_length = 0
    items = iter(output_queue.get, SENTINEL)
    for batch in iter(lambda: list(IT.islice(items, threshold)), []):
        print('\n'.join(batch))
        result_length += len(batch)
    state = 'Good run' if result_length == static_len else 'Bad run'
    print(state)

if __name__ == '__main__':
    num_workers = 2
    static_input = range(static_len)
    working_q = mp.Queue()
    output_q = mp.Queue()

    writer_proc = mp.Process(target=writer, args=(output_q,))
    writer_proc.start()

    for i in static_input:
        working_q.put(i)

    processes = [mp.Process(target=worker, args=(working_q, output_q))
                 for i in range(num_workers)]
    for proc in processes:
        proc.start()

    # Put one SENTINEL per worker in the Queue to tell the workers to exit their for-loop
    for _ in range(num_workers):
        working_q.put(SENTINEL)

    for proc in processes:
        proc.join()

    output_q.put(SENTINEL)
    writer_proc.join()
When passed two arguments, iter expects a callable and a sentinel:
iter(callable, sentinel). The callable (i.e. a function) gets called repeatedly until it returns a value equal to the sentinel. So
items = iter(output_queue.get, SENTINEL)
defines items to be an iterable which, when iterated over, will return items from output_queue
until output_queue.get() returns SENTINEL.
The for-loop:
for batch in iter(lambda: list(IT.islice(items, threshold)), []):
calls the lambda function repeatedly until an empty list is returned. When called, the lambda function returns a list of up to threshold number of items from the iterable items. Thus, this is an idiom for "grouping by n items without padding". See this post for more on this idiom.
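As a standalone illustration of this grouping idiom (a minimal sketch, independent of the queues above; the numbers are arbitrary):
import itertools as IT

items = iter(range(23))   # any iterator will do
threshold = 10

# Slice off up to `threshold` items per call until the iterator is exhausted.
for batch in iter(lambda: list(IT.islice(items, threshold)), []):
    print(batch)
# Prints three batches, of sizes 10, 10 and 3.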
Note that it is not a good practice to test working_q.empty(). It could lead to a race condition. For example, suppose we have the 2 worker processes on these lines when the working_q has only 1 item left in it:
def worker(working_queue, output_queue):
    while True:
        if working_queue.empty() == True:        # <-- Process-1
            break
        else:
            picked = working_queue.get()         # <-- Process-2
            res_item = "Number " + str(picked)
            output_queue.put(res_item)
    return
Suppose Process-1 calls working_queue.empty() while there is still one item in the queue, so it returns False. Then Process-2 calls working_queue.get() and obtains the last item. Process-1 then reaches picked = working_queue.get() and hangs, because there are no more items in the queue.
Therefore, use sentinels (as shown above) to concretely signal when a for-loop
or while-loop should stop instead of checking queue.empty().
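If a sentinel is awkward to arrange, another common pattern, which also appears in answers further down this page, is to call get with a timeout and catch queue.Empty. A minimal sketch, assuming the queue is fully pre-filled before the workers start (the 1-second timeout is an arbitrary heuristic, so this is weaker than a sentinel):
import queue  # only needed for the queue.Empty exception

def worker(working_queue, output_queue):
    while True:
        try:
            # Block for up to 1 second; if nothing arrives, assume the work is done.
            picked = working_queue.get(timeout=1)
        except queue.Empty:
            break
        output_queue.put("Number " + str(picked))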

There is no operation like a "batch q.get". But it is good practice to put/pop a batch of items instead of items one by one.
Which is exactly what multiprocessing.Pool.map does with its chunksize parameter :)
For writing output as soon as possible there is Pool.imap_unordered, which returns an iterable instead of a list.
import multiprocessing

def work(item):
    return "Number " + str(item)

if __name__ == '__main__':
    static_input = range(100)
    chunksize = 10
    with multiprocessing.Pool() as pool:
        for out in pool.imap_unordered(work, static_input, chunksize):
            print(out)
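To tie this back to the original question of batch-writing while the workers are still running, here is a minimal sketch that combines imap_unordered with the islice grouping idiom from the previous answer; the file name results.txt and the batch size of 25 are arbitrary choices for illustration:
import itertools
import multiprocessing

def work(item):
    return "Number " + str(item)

if __name__ == '__main__':
    static_input = range(100)
    with multiprocessing.Pool() as pool, open('results.txt', 'w') as fh:
        results = pool.imap_unordered(work, static_input, chunksize=10)
        # Pull up to 25 finished results at a time and write them to disk as one batch.
        for batch in iter(lambda: list(itertools.islice(results, 25)), []):
            fh.write('\n'.join(batch) + '\n')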

Related

Concurrent Python for loops with multiprocessing

I've got a list of links in my program, and I want to visit these links and collect some data from them. I've also got three other, initially empty, lists that I will append to based on the result. I don't know how to use multiprocessing for this task.
All processes should read from the main list and write (append) to one of the other three lists.
How do I distribute the links between processes?
Is multiprocessing even suitable for this task? If yes, which method should be used?
import os
from multiprocessing import Process

def remove_redirect():
    main_list = get_data_from_db()
    empty_list_1 = []
    empty_list_2 = []
    empty_list_3 = []
    for link in main_list:
        req_result = manage_requests(link)
        if isinstance(req_result, bool):
            pass
        else:
            try:
                if req_result.find('div', class_="errorName"):
                    empty_list_1.append(link)
                elif req_result.find('section', class_='filter'):
                    empty_list_2.append(link)
                elif req_result.find('div', class_="proStatus"):
                    empty_list_3.append(link)
            except Exception as error:
                print(error)
                pass

if __name__ == "__main__":
    processes = []
    for i in range(os.cpu_count()):
        print('Registering process %d' % i)
        processes.append(Process(target=remove_redirect))
    for process in processes:
        process.start()
    for process in processes:
        process.join()
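No answer is included for this question in this excerpt, but one hedged sketch of an approach: since child processes cannot append to the parent's lists directly, let a pool classify each link and do the appending in the parent process as results stream back. Here manage_requests and get_data_from_db are the poster's own helpers and are assumed to exist; the bucket numbers are illustrative:
from multiprocessing import Pool

def classify(link):
    # Runs in a worker process: return a (bucket, link) pair instead of
    # appending to a shared list.
    req_result = manage_requests(link)
    if isinstance(req_result, bool):
        return None
    if req_result.find('div', class_="errorName"):
        return (1, link)
    if req_result.find('section', class_='filter'):
        return (2, link)
    if req_result.find('div', class_="proStatus"):
        return (3, link)
    return None

if __name__ == "__main__":
    buckets = {1: [], 2: [], 3: []}
    with Pool() as pool:
        # imap_unordered yields results as soon as each worker finishes a link.
        for result in pool.imap_unordered(classify, get_data_from_db()):
            if result is not None:
                bucket, link = result
                buckets[bucket].append(link)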

Why does the following code not work with queue.SimpleQueue, but works with multiprocessing.SimpleQueue?

The following code is taken from the book "Fluent Python". The comments were put there by me, from different sources.
import sys
from time import perf_counter
from typing import NamedTuple
from multiprocessing import Process, SimpleQueue, cpu_count
from multiprocessing import queues

from primes import is_prime, NUMBERS

class PrimeResult(NamedTuple):
    n: int
    prime: bool
    elapsed: float

JobQueue = queues.SimpleQueue[int]
ResultQueue = queues.SimpleQueue[PrimeResult]

def check(n: int) -> PrimeResult:
    t0 = perf_counter()
    res = is_prime(n)
    return PrimeResult(n, res, perf_counter() - t0)

def worker(jobs: JobQueue, results: ResultQueue) -> None:
    # SimpleQueue.get(block=True, timeout=None)
    # Remove and return an item from the queue. If optional args block is true
    # and timeout is None (the default), block if necessary until an item is
    # available. If timeout is a positive number, it blocks at most timeout
    # seconds and raises the Empty exception if no item was available within
    # that time. Otherwise (block is false), return an item if one is immediately
    # available, else raise the Empty exception (timeout is ignored in that case).
    while n := jobs.get():
        print(n)
        results.put(check(n))
    # the following line will tell main to increase procs_done by 1.
    results.put(PrimeResult(0, False, 0.0))

def start_jobs(procs: int, jobs: JobQueue, results: ResultQueue) -> None:
    for n in NUMBERS:
        jobs.put(n)
    for _ in range(procs):
        proc = Process(target=worker, args=(jobs, results))
        proc.start()
        # zero will evaluate to False in the while loop of the worker function
        jobs.put(0)

def report(procs: int, results: ResultQueue) -> int:
    checked = 0
    procs_done = 0
    while procs_done < procs:
        n, prime, elapsed = results.get()
        if n == 0:
            procs_done += 1
        else:
            checked += 1
            label = "P" if prime else " "
            print(f"{n:16} {label} {elapsed:9.6f}s")
    return checked

def main() -> None:
    if len(sys.argv) < 2:
        procs = cpu_count()
    else:
        procs = int(sys.argv[1])
    print(f"Checking {len(NUMBERS)} numbers with {procs} processes:")
    t0 = perf_counter()
    jobs: JobQueue = SimpleQueue()
    results: ResultQueue = SimpleQueue()
    start_jobs(procs, jobs, results)
    checked = report(procs, results)
    elapsed = perf_counter() - t0
    print(f"{checked} checks in {elapsed:.2f}s")

if __name__ == "__main__":
    main()
If I now change the import to from queue import SimpleQueue, this same code does not work, even though, according to the Python documentation:
The [multiprocessing.]Queue, SimpleQueue and JoinableQueue types are
multi-producer, multi-consumer FIFO queues modelled on the queue.Queue
class in the standard library. They differ in that Queue lacks the
task_done() and join() methods introduced into Python 2.5’s
queue.Queue class.
The problem seems to be that the processes never get the zero-valued task, and I'm not sure why... This means the loop never exits.
After much debugging, I figured out that somehow the queues were not being shared... The results queue seemed to be restarted, and the jobs queue as well, but with an increasing number of zeros...
Then it dawned on me that I should be using something intended for threads, which by definition share the same memory... and that's it!
The queue module is meant for threads, whereas, if we specifically want queues shared between processes, we have to use the multiprocessing module.
Below is the code that worked for me:
import sys
from time import perf_counter
from typing import NamedTuple
from multiprocessing import cpu_count  # Process and SimpleQueue are no longer needed
from threading import Thread
# below is for multithreading
from queue import (
    Queue as SimpleQueue,
    Empty,
    Full,
)
# The Queue, SimpleQueue and JoinableQueue types are multi-producer, multi-consumer
# FIFO queues modelled on the queue.Queue class in the standard library.
# They differ in that Queue lacks the task_done() and join() methods introduced into
# Python 2.5's queue.Queue class.
# from multiprocessing import queues

from primes import is_prime, NUMBERS

class PrimeResult(NamedTuple):
    n: int
    prime: bool
    elapsed: float

JobQueue = SimpleQueue[int]
ResultQueue = SimpleQueue[PrimeResult]

def check(n: int) -> PrimeResult:
    t0 = perf_counter()
    res = is_prime(n)
    print(f"inside check with {n}")
    return PrimeResult(n, res, perf_counter() - t0)

def worker(jobs: JobQueue, results: ResultQueue) -> None:
    # SimpleQueue.get(block=True, timeout=None)
    # Remove and return an item from the queue. If optional args block is true
    # and timeout is None (the default), block if necessary until an item is
    # available. If timeout is a positive number, it blocks at most timeout
    # seconds and raises the Empty exception if no item was available within
    # that time. Otherwise (block is false), return an item if one is immediately
    # available, else raise the Empty exception (timeout is ignored in that case).
    try:
        n = jobs.get()
        while n != 0:
            print(f"\nGetting element {n} in jobs queue")
            # Queue.put(item, block=True, timeout=None)
            # Put item into the queue. If optional args block is true and timeout is
            # None (the default), block if necessary until a free slot is available.
            # If timeout is a positive number, it blocks at most timeout seconds and
            # raises the Full exception if no free slot was available within that time.
            # Otherwise (block is false), put an item on the queue if a free slot is
            # immediately available, else raise the Full exception (timeout is ignored in
            # that case).
            results.put(check(n))
            jobs.task_done()
            print(f"exited check with {n}")
            n = jobs.get()
            print(f"exited get with new {n}")
    except (Empty, Full) as e:
        print(f"exception: {repr(e)}")
    # the following line will tell main to increase procs_done by 1.
    print(f"exited while with n = {n}")
    results.put(PrimeResult(0, False, 0.0))
    print("after results.put(PrimeResult(0, False, 0.0))")

def start_jobs(procs: int, jobs: JobQueue, results: ResultQueue) -> None:
    for n in NUMBERS:
        # we're putting all the numbers on the queue.
        jobs.put(n)
        print(f"putting element {n} in jobs queue")
    for _ in range(procs):
        # proc = Process(target=worker, args=(jobs, results))
        # proc.start()
        # start()
        # Start the process's activity.
        # This must be called at most once per process object. It arranges for the
        # object's run() method to be invoked in a separate process.
        # run()
        # Method representing the process's activity.
        # You may override this method in a subclass.
        # The standard run() method invokes the callable object passed to the
        # object's constructor as the target argument, if any, with sequential and
        # keyword arguments taken from the args and kwargs arguments, respectively.
        thrd = Thread(target=worker, args=(jobs, results))
        thrd.start()
        # zero will evaluate to False in the while loop of the worker function.
        # The jobs queue already has all the numbers. Now, we're putting the poison
        # pill to make the worker function finish, thus ending the corresponding
        # thread.
        jobs.put(0)
        print(f"putting element {0} in jobs queue")

def report(procs: int, results: ResultQueue) -> int:
    checked = 0
    procs_done = 0
    while procs_done < procs:
        n, prime, elapsed = results.get()
        if n == 0:
            procs_done += 1
        else:
            checked += 1
            label = "P" if prime else " "
            print(f"{n:16} {label} {elapsed:9.6f}s")
        print(f"\tprocs_done = {procs_done}")
    return checked

def main() -> None:
    # sys.argv - The list of command line arguments passed to a Python script.
    # argv[0] is the script name (it is operating system dependent whether this
    # is a full pathname or not). If the command was executed using the -c
    # command line option to the interpreter, argv[0] is set to the string '-c'.
    # If no script name was passed to the Python interpreter, argv[0] is the empty
    # string. To loop over the standard input, or the list of files given on the
    # command line, see the fileinput module.
    if len(sys.argv) < 2:
        procs = cpu_count()
    else:
        procs = int(sys.argv[1])
    print(f"Checking {len(NUMBERS)} numbers with {procs} processes:")
    t0 = perf_counter()
    jobs: JobQueue = SimpleQueue()
    results: ResultQueue = SimpleQueue()
    start_jobs(procs, jobs, results)
    checked = report(procs, results)
    elapsed = perf_counter() - t0
    print(f"{checked} checks in {elapsed:.2f}s")

if __name__ == "__main__":
    main()
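For completeness, a minimal sketch (not part of the original post) of the underlying point: a queue.Queue is only visible to threads of the same process, whereas a multiprocessing.Queue pushes items through a pipe so they reach other processes:
import multiprocessing
import queue
import threading

def produce(q):
    q.put("hello")

if __name__ == "__main__":
    # Threads share the parent's memory, so queue.Queue works across threads.
    tq = queue.Queue()
    t = threading.Thread(target=produce, args=(tq,))
    t.start()
    t.join()
    print(tq.get())   # -> hello

    # Processes do not share memory; multiprocessing.Queue pickles items through a pipe.
    mq = multiprocessing.Queue()
    p = multiprocessing.Process(target=produce, args=(mq,))
    p.start()
    print(mq.get())   # -> hello
    p.join()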

Python multiprocessing possible deadlock with two queue as producer-consumer pattern?

I'm wondering if there can be a sort of deadlock in the following code. I have to read each element of a database (about 1 million items), process it, then collect the results in a single file.
I've parallelized the execution with multiprocessing using two Queues and three types of processes:
Reader: Main process which reads the database and adds the read items in a task_queue
Worker: Pool of processes. Each worker gets an item from task_queue, processes the item, saves the results in an intermediate file stored in item_name/item_name.txt and puts the item_name in a completed_queue
Writer: Process which gets an item_name from completed_queue, gets the intermediate result from item_name/item_name.txt and writes it in results.txt
from multiprocessing import Pool, Process, Queue

class Computation():

    def __init__(self, K):
        self.task_queue = Queue()
        self.completed_queue = Queue()
        self.n_cpus = K

    def reader(self,):
        with open(db, "r") as db:
            ...  # Read an item
            self.task_queue.put(item)

    def worker(self,):
        while True:
            item = self.task_queue.get(True)
            if item == "STOP":
                break
            self.process_item(item)

    def writer_process(self,):
        while True:
            f = self.completed_queue.get(True)
            if f == "DONE":
                break
            self.write_f(f)

    def run(self,):
        pool = Pool(n_cpus, self.worker, args=())
        writer = Process(target=self.writer_process, args=())
        writer.start()
        self.reader()
        pool.close()
        pool.join()
        self.completed_queue.put("DONE")
        writer.join()
The code works, but it seems that sometimes the writer or the pool stops working (or they are very slow). Is a deadlock possible in this scenario?
There are a couple of issues with your code. First, by using the queues as you are, you are in effect creating your own process pool and have no need for using the multiprocessing.Pool class at all. You are using a pool initializer as an actual pool worker and it's a bit of a misuse of this class; you would be better off to just use regular Process instances (my opinion, anyway).
Second, although it is well and good that you are putting the message DONE on the queue to signal writer_process to terminate, you have not done similarly for the self.n_cpus worker processes, which are looking for 'STOP' messages; therefore the reader function needs to put self.n_cpus STOP messages in the task queue:
from multiprocessing import Process, Queue

class Computation():

    def __init__(self, K):
        self.task_queue = Queue()
        self.completed_queue = Queue()
        self.n_cpus = K

    def reader(self,):
        with open(db, "r") as db:
            ...  # Read an item
            self.task_queue.put(item)
        # signal to the worker processes to terminate:
        for _ in range(self.n_cpus):
            self.task_queue.put('STOP')

    def worker(self,):
        while True:
            item = self.task_queue.get(True)
            if item == "STOP":
                break
            self.process_item(item)

    def writer_process(self,):
        while True:
            f = self.completed_queue.get(True)
            if f == "DONE":
                break
            self.write_f(f)

    def run(self):
        processes = [Process(target=self.worker) for _ in range(self.n_cpus)]
        for p in processes:
            p.start()
        writer = Process(target=self.writer_process, args=())
        writer.start()
        self.reader()
        for p in processes:
            p.join()
        self.completed_queue.put("DONE")
        writer.join()
Personally, instead of using 'STOP' and 'DONE' as the sentinel messages, I would use None, assuming that is not a valid actual message. I have tested the above code where reader just processed strings in a list and self.process_item(item) simply appended ' done' to each of those strings and put the modified string on the completed_queue, and replaced self.write_f in the writer_process with a print call. I did not see any problems with the code as is.
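As a minimal sketch of that None-sentinel variant (illustrative only, reusing the two-argument iter form shown in an earlier answer on this page; process_item and write_f are the poster's own methods):
from multiprocessing import Queue

SENTINEL = None  # assumes None is never a real work item or file name

class Computation():

    def __init__(self, K):
        self.task_queue = Queue()
        self.completed_queue = Queue()
        self.n_cpus = K

    def worker(self):
        # iter(callable, sentinel) keeps calling get() until it returns SENTINEL.
        for item in iter(self.task_queue.get, SENTINEL):
            self.process_item(item)

    def writer_process(self):
        for f in iter(self.completed_queue.get, SENTINEL):
            self.write_f(f)
The reader would then put self.n_cpus None values instead of 'STOP', and run() would put a single None instead of 'DONE'.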
Update to use a Managed Queue
Disclaimer: I have had no experience using mpi4py and have no idea how the queue proxies would get distributed across different computers. The above code may not be sufficient, as suggested by the following article, How to share a multiprocessing queue object between multiple computers. However, that code is creating instances of Queue.Queue (that code is Python 2 code) and not the proxies that are returned by the multiprocessing.SyncManager. The documentation on this is very poor. Try the managed-queue version below to see if it works better (it will be slower).
Because of the proxies returned by manager.Queue(), I have had to rearrange the code a bit; the queues are now being passed explicitly as arguments to the process functions:
from multiprocessing import Process, Manager

class Computation():

    def __init__(self, K):
        self.n_cpus = K

    def reader(self, task_queue):
        with open(db, "r") as db:
            ...  # Read an item
            task_queue.put(item)
        # signal to the worker processes to terminate:
        for _ in range(self.n_cpus):
            task_queue.put('STOP')

    def worker(self, task_queue, completed_queue):
        while True:
            item = task_queue.get(True)
            if item == "STOP":
                break
            self.process_item(item)

    def writer_process(self, completed_queue):
        while True:
            f = completed_queue.get(True)
            if f == "DONE":
                break
            self.write_f(f)

    def run(self):
        with Manager() as manager:
            task_queue = manager.Queue()
            completed_queue = manager.Queue()
            processes = [Process(target=self.worker, args=(task_queue, completed_queue))
                         for _ in range(self.n_cpus)]
            for p in processes:
                p.start()
            writer = Process(target=self.writer_process, args=(completed_queue,))
            writer.start()
            self.reader(task_queue)
            for p in processes:
                p.join()
            completed_queue.put("DONE")
            writer.join()

Implementing "competing" processes in python

I'm trying to implement a function that takes 2 functions as arguments, runs both, returns the value of the function that returns first and kills the slower function before it finishes its execution.
My problem is that when I try to empty the Queue object I use to collect the return values, I get stuck.
Is there a more 'correct' way to handle this scenario or even an existing module? If not, can anyone explain what I'm doing wrong?
Here is my code (the implementation of the above function is 'run_both()'):
import multiprocessing as mp
from time import sleep

Q = mp.Queue()

def dump_queue(queue):
    result = []
    for i in iter(queue.get, 'STOP'):
        result.append(i)
    return result

def rabbit(x):
    sleep(10)
    Q.put(x)

def turtle(x):
    sleep(30)
    Q.put(x)

def run_both(a, b):
    a.start()
    b.start()
    while a.is_alive() and b.is_alive():
        sleep(1)
    if a.is_alive():
        a.terminate()
    else:
        b.terminate()
    a.join()
    b.join()
    return dump_queue(Q)

p1 = mp.Process(target=rabbit, args=(1,))
p1 = mp.Process(target=turtle, args=(2,))
run_both(p1, p2)
Here's an example to call 2 or more functions with multiprocessing and return the fastest result. There are a few important things to note however.
Running multiprocessing code in IDLE sometimes causes problems. This example works, but I did run into that issue while trying to solve this.
Multiprocessing code should start from inside an if __name__ == '__main__' clause, or else it will be run again if the main module is re-imported by another process. Read the multiprocessing doc page for more info.
The result queue is passed directly to each process that uses it. If you use the queue by referencing a global name in the module, the code fails on Windows, because a new instance of the queue is used by each process. Read more here: Multiprocessing Queue.get() hangs
I have also added a bit of a feature here to know which process' result was actually used.
import multiprocessing as mp
import time
import random

def task(value):
    # our dummy task is to sleep for a random amount of time and
    # return the given arg value
    time.sleep(random.random())
    return value

def process(q, idx, fn, args):
    # simply call function fn with args, and push its result in the queue with its index
    q.put([fn(*args), idx])

def fastest(calls):
    queue = mp.Queue()
    # we must pass the queue directly to each process that may use it
    # or else on Windows, each process will have its own copy of the queue
    # making it useless
    procs = []
    # create a 'mp.Process' that calls our 'process' for each call and start it
    for idx, call in enumerate(calls):
        fn = call[0]
        args = call[1:]
        p = mp.Process(target=process, args=(queue, idx, fn, args))
        procs.append(p)
        p.start()
    # wait for the queue to have something
    result, idx = queue.get()
    for proc in procs:  # kill all processes that may still be running
        proc.terminate()
        # proc may be using queue, so queue may be corrupted.
        # https://docs.python.org/3.8/library/multiprocessing.html?highlight=queue#multiprocessing.Process.terminate
        # we no longer need queue though so this is fine
    return result, idx

if __name__ == '__main__':
    from datetime import datetime
    start = datetime.now()
    print(start)
    # to be compatible with 'fastest', each call is a list with the first
    # element being callable, followed by args to be passed
    calls = [
        [task, 1],
        [task, 'hello'],
        [task, [1, 2, 3]]
    ]
    val, idx = fastest(calls)
    end = datetime.now()
    print(end)
    print('elapsed time:', end - start)
    print('returned value:', val)
    print('from call at index', idx)
Example output:
2019-12-21 04:01:09.525575
2019-12-21 04:01:10.171891
elapsed time: 0:00:00.646316
returned value: hello
from call at index 1
Apart from the typo on the penultimate line which should read:
p2 = mp.Process(target=turtle, args=(2,)) # not p1
the simplest change you can make to get the program to work is to add:
Q.put('STOP')
to the end of turtle() and rabbit().
You also don't really need to keep looping to watch whether the processes are alive; by definition, if you just read the message queue and receive STOP, one of them has finished, so you could replace run_both() with:
def run_both(a, b):
    a.start()
    b.start()
    result = dump_queue(Q)
    a.terminate()
    b.terminate()
    return result
You may also need to think about what happens if both processes put some messages in the queue at much the same time. They could get mixed up. Maybe consider using 2 queues, or combining each process's results into a single message, as sketched below, rather than appending multiple values together from queue.get().
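A minimal sketch of the single-message idea (illustrative only, with shortened sleeps): each function puts exactly one tuple that tags its own name, so the first complete message identifies the winner and nothing can interleave. Passing the queue as an argument also follows the Windows advice from the other answer:
import multiprocessing as mp
from time import sleep

def rabbit(q, x):
    sleep(1)
    q.put(('rabbit', x))   # one atomic message per process

def turtle(q, x):
    sleep(3)
    q.put(('turtle', x))

if __name__ == '__main__':
    q = mp.Queue()
    a = mp.Process(target=rabbit, args=(q, 1))
    b = mp.Process(target=turtle, args=(q, 2))
    a.start()
    b.start()
    winner, value = q.get()   # blocks until the first complete message arrives
    a.terminate()
    b.terminate()
    print(winner, value)      # -> rabbit 1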

Python multiprocessing with Queue (split loads dynamically)

I am trying to use multiprocessing to process a very large number of files.
I tried to put the list of files into a queue and make 3 workers split the load with a common Queue data type. However, this does not seem to work. I am probably misunderstanding the queue in the multiprocessing package.
Below is the example source code:
import multiprocessing
from multiprocessing import Queue

def worker(i, qu):
    """worker function"""
    while ~qu.empty():
        val = qu.get()
        print 'Worker:', i, ' start with file:', val
        j = 1
        for k in range(i*10000, (i+1)*10000):  # some time consuming process
            for j in range(i*10000, (i+1)*10000):
                j = j + k
        print 'Worker:', i, ' end with file:', val

if __name__ == '__main__':
    jobs = []
    qu = Queue()
    for j in range(100, 110):  # file numbers are from 100 to 110
        qu.put(j)
    for i in range(3):  # 3 multiprocess
        p = multiprocessing.Process(target=worker, args=(i, qu))
        jobs.append(p)
        p.start()
    p.join()
Thanks for the comments.
I came to know that using a Pool is the best solution.
import multiprocessing
import time

def worker(val):
    """worker function"""
    print 'Worker: start with file:', val
    time.sleep(1.1)
    print 'Worker: end with file:', val

if __name__ == '__main__':
    file_list = range(100, 110)
    p = multiprocessing.Pool(2)
    p.map(worker, file_list)
Three issues:
1) you are joining only on the 3rd process
2) why not use multiprocessing.Pool?
3) race condition on qu.get()
1 & 3)
import multiprocessing
from multiprocessing import Queue
from Queue import Empty  # the Empty exception lives in the Queue module (Python 2)

def worker(i, qu):
    """worker function"""
    while 1:
        try:
            val = qu.get(timeout=1)
        except Empty:
            break  # Yay, no race condition
        print 'Worker:', i, ' start with file:', val
        j = 1
        for k in range(i*10000, (i+1)*10000):  # some time consuming process
            for j in range(i*10000, (i+1)*10000):
                j = j + k
        print 'Worker:', i, ' end with file:', val

if __name__ == '__main__':
    jobs = []
    qu = Queue()
    for j in range(100, 110):  # file numbers are from 100 to 110
        qu.put(j)
    for i in range(3):  # 3 multiprocess
        p = multiprocessing.Process(target=worker, args=(i, qu))
        jobs.append(p)
        p.start()
    for p in jobs:  # <--- join on all processes ...
        p.join()
2)
for how to use the Pool, see:
https://docs.python.org/2/library/multiprocessing.html
You are joining only the last of your created processes. That means if the first or the second process is still working while the third is finished, your main process goes down and kills the remaining processes before they are finished.
You should join them all in order to wait until they are finished:
for p in jobs:
    p.join()
Another thing is you should consider using qu.get_nowait() in order to get rid of the race condition between qu.empty() and qu.get().
For example:
try:
    while 1:
        message = self.queue.get_nowait()
        """ do something fancy here """
except Queue.Empty:
    pass
I hope that helps
