Python multiprocessing for dataset preparation

Python multiprocessing for dataset preparation - python

I'm looking for shorter ways to prepare my dataset for a machine-learning task. I found that the multiprocessing library might helpful. However, because I'm a newbie in multiprocessing, I couldn't find a proper way.
I first wrote some codes like below:
class DatasetReader:
def __init__(self):
self.data_list = Read_Data_from_file
self.data = []
def _ready_data(self, ex, idx):
# Some complex functions that takes several minutes
def _dataset_creator(self, queue):
for idx, ex in enumerate(self.data_list):
queue.put(self._ready_data(ex, idx))
def _dataset_consumer(self, queue):
total_mem = 0.0
t = tqdm(range(self.num_data), total=self.num_data, desc='Building Dataset ', bar_format='{desc}:{percentage:3.0f}% ({n_fmt}/{total_fmt}) [{elapsed}<{remaining},{rate_fmt}{postfix}]')
for idx in t:
ins = queue.get()
self.data.append(ins)
gc.collect()
def _build_dataset(self):
queue = Queue()
creator = Process(target=self._dataset_creator, args=(queue,))
consumer = Process(target=self._dataset_consumer, args=(queue,))
creator.start()
consumer.start()
queue.close()
queue.join_thread()
creator.join()
consumer.join()
However, in my opinion, because the _dataset_creator processes data (here _ready_data) in serial manner, this would not be helpful for reducing time consumption.
So, I modified the code to generate multiple processes that process one datum:
class DatasetReader:
def __init__(self):
self.data_list = Read_Data_from_file
self.data = []
def _ready_data(self, ex, idx):
# Some complex functions that takes several minutes
def _dataset_creator(self, ex, idx, queue):
queue.put(self._ready_data(ex, idx))
def _dataset_consumer(self, queue):
total_mem = 0.0
t = tqdm(range(self.num_data), total=self.num_data, desc='Building Dataset ', bar_format='{desc}:{percentage:3.0f}% ({n_fmt}/{total_fmt}) [{elapsed}<{remaining},{rate_fmt}{postfix}]')
for idx in t:
ins = queue.get()
self.data.append(ins)
gc.collect()
def _build_dataset(self):
queue = Queue()
for idx, ex in enumerate(self.data_list):
p = Process(target=self._dataset_creator, args=(ex, idx, queue,))
p.start()
consumer = Process(target=self._dataset_consumer, args=(queue,))
consumer.start()
queue.close()
queue.join_thread()
consumer.join()
However, this returns me errors:
Process Process-18:
Traceback ~~~
RuntimeError: can't start new thread
Traceback ~~~
OSError: [Errno 12] Cannot allocate memory
Could you help me to process complex data in a parallel way?
EDIT 1:
Thanks to #tdelaney, I can reduce the time consumption by generating self.num_worker processes (16 in my experiment):
def _dataset_creator(self, pid, queue):
for idx, ex in list(enumerate(self.data_list))[pid::self.num_worker]:
queue.put(self._ready_data(ex, idx))
def _dataset_consumer(self, queue):
t = tqdm(range(self.num_data), total=self.num_data, desc='Building Dataset ', bar_format='{desc}:{percentage:3.0f}% ({n_fmt}/{total_fmt}) [{elapsed}<{remaining},{rate_fmt}{postfix}]')
for _ in t:
ins = queue.get()
self.data[ins['idx']] = ins
def _build_dataset(self):
queue = Queue()
procs = []
for pid in range(self.num_worker):
p = Process(target=self._dataset_creator, args=(pid, queue,))
procs.append(p)
p.start()
consumer = Process(target=self._dataset_consumer, args=(queue,))
consumer.start()
queue.close()
queue.join_thread()
for p in procs:
p.join()
consumer.join()

I'm trying to sketch out what a solution with a multiprocessing pool would look like. I got rid of the consumer process completely because it looks like the parent process is just waiting anyway (and needs the data eventually) so it can be the consumer. So, I set up a pool and use imap_unordered to handle passing the data to the worker.
I guessed that the data processing doesn't really need the DatasetReader at all and moved it out to its own function. On Windows, either the entire DataReader object is serialized to the subprocess (including data you don't want) or the child version of the object is incomplete and may crash when you try to use it.
Either way, changes made to a DatasetReader object in the child processes aren't seen in the parent. This can be unexpected if the parent is dependent on updated state in that object. Its best to severely bracket what's happening in subprocesses, in my opinion.
from multiprocessing import Pool, get_start_method, cpu_count
# moved out of class (assuming it is not class dependent) so that
# the entire DatasetReader object isn't pickled and sent to
# the child on spawning systems like Microsoft Windows
def _ready_data(idx_ex):
idx, ex = idx_ex
# Some complex functions that take several minutes
result = complex_functions(ex)
return (idx, result)
class DatasetReader:
def __init__(self):
self.data_list = Read_Data_from_file
self.data = [None] * len(data_list)
def _ready_data_fork(self, idx):
# on forking system, call worker with object data
return _ready_data((idx, self.data_list[idx]))
def run(self):
t = tqdm(range(self.num_data), total=self.num_data, desc='Building Dataset ',
bar_format='{desc}:{percentage:3.0f}% ({n_fmt}/{total_fmt}) '
'[{elapsed}<{remaining},{rate_fmt}{postfix}]')
pool = Pool(min(cpu_count, len(self.data_list)))
if get_start_method() == 'fork':
# on forking system, self.data_list is in child process and
# we only pass the index
result_iter = pool.imap_unordered(self._ready_data_fork,
(idx for idx in range(len(data_list))),
chunksize=1)
else:
# on spawning system, we need to pass the data
result_iter = pool.imap_unordered(_ready_data,
enumerate(self.data_list,
chunksize=1)
for idx, result in result_iter:
next(t)
self.data[idx] = result
pool.join()

Related

Python multiprocessing possible deadlock with two queue as producer-consumer pattern?

I'm wondering if there can be a sort of deadlock in the following code. I have to read each element of a database (about 1 million items), process it, then collect the results in a unique file.
I've parallelized the execution with multiprocessing using two Queue's and three types of processes:
Reader: Main process which reads the database and adds the read items in a task_queue
Worker: Pool of processes. Each worker gets an item from task_queue, processes the item, saves the results in an intermediate file stored in item_name/item_name.txt and puts the item_name in a completed_queue
Writer: Process which gets an item_name from completed_queue, gets the intermediate result from item_name/item_name.txt and writes it in results.txt
from multiprocessing import Pool, Process, Queue
class Computation():
def __init__(self,K):
self.task_queue = Queue()
self.completed_queue = Queue()
self.n_cpus = K
def reader(self,):
with open(db, "r") as db:
... # Read an item
self.task_queue.put(item)
def worker(self,):
while True:
item = self.task_queue.get(True)
if item == "STOP":
break
self.process_item(item)
def writer_process(self,):
while True:
f = self.completed_queue.get(True)
if f == "DONE":
break
self.write_f(f)
def run(self,):
pool = Pool(n_cpus, self.worker, args=())
writer = Process(target=self.writer_process, args=())
writer.start()
self.reader()
pool.close()
pool.join()
self.completed_queue.put("DONE")
writer.join()
The code works, but it seems that sometimes the writer or the pool stops working (or they are very slow). Is a deadlock possible in this scenario?

There are a couple of issues with your code. First, by using the queues as you are, you are in effect creating your own process pool and have no need for using the multiprocessing.Pool class at all. You are using a pool initializer as an actual pool worker and it's a bit of a misuse of this class; you would be better off to just use regular Process instances (my opinion, anyway).
Second, although it is well and good that you are putting message DONE to the writer_process to signal it to terminate, you have not done similarly for the self.n_cpus worker processes, which are looking for 'STOP' messages, and therefore the reader function needs to put self.n_cpus STOP messages in the task queue:
from multiprocessing import Process, Queue
class Computation():
def __init__(self, K):
self.task_queue = Queue()
self.completed_queue = Queue()
self.n_cpus = K
def reader(self,):
with open(db, "r") as db:
... # Read an item
self.task_queue.put(item)
# signal to the worker processes to terminate:
for _ in range(self.n_cpus):
self.task_queue.put('STOP')
def worker(self,):
while True:
item = self.task_queue.get(True)
if item == "STOP":
break
self.process_item(item)
def writer_process(self,):
while True:
f = self.completed_queue.get(True)
if f == "DONE":
break
self.write_f(f)
def run(self):
processes = [Process(target=self.worker) for _ in range(self.n_cpus)]
for p in processes:
p.start()
writer = Process(target=self.writer_process, args=())
writer.start()
self.reader()
for p in processes:
p.join()
self.completed_queue.put("DONE")
writer.join()
Personally, instead of using 'STOP' and 'DONE' as the sentinel messages, I would use None instead, assuming that is not a valid actual message. I have tested the above code where reader just processed strings in a list and self.process_item(item) simply appended ' done' to the each of those strings and put the modified string on the completed_queue and replaced self.write_f in the writer_process with a print call. I did not see any problems with the code as is.
Update to use a Managed Queue
Disclaimer: I have had no experience using mpi4py and have no idea how the queue proxies would get distributed across different computers. The above code may not be sufficient as suggested by the following article, How to share mutliprocessing queue object between multiple computers. However, that code is creating instances of Queue.Queue (that code is Python 2 code) and not the proxies that are returned by the multiprocessing.SyncManager. The documentation on this is very poor. Try the above change to see if it works better (it will be slower).
Because the proxy returned by manager.Queue(), I have had to rearrange the code a bit; the queues are now being passed explicitly as arguments to the process functions:
from multiprocessing import Process, Manager
class Computation():
def __init__(self, K):
self.n_cpus = K
def reader(self, task_queue):
with open(db, "r") as db:
... # Read an item
# signal to the worker processes to terminate:
for _ in range(self.n_cpus):
task_queue.put('STOP')
def worker(self, task_queue, completed_queue):
while True:
item = task_queue.get(True)
if item == "STOP":
break
self.process_item(item)
def writer_process(self, completed_queue):
while True:
f = completed_queue.get(True)
if f == "DONE":
break
self.write_f(f)
def run(self):
with Manager() as manager:
task_queue = manager.Queue()
completed_queue = manager.Queue()
processes = [Process(target=self.worker, args=(task_queue, completed_queue)) for _ in range(self.n_cpus)]
for p in processes:
p.start()
writer = Process(target=self.writer_process, args=(completed_queue,))
writer.start()
self.reader(task_queue)
for p in processes:
p.join()
completed_queue.put("DONE")
writer.join()

Python - Incrementing a count producing unexpected results

In below code I expect print('q.count' , q.count) to be 2 as count is a varialble initialised once using q = QueueFun() and then incremented in the read_queue method, instead print('q.count' , q.count) prints 0. What is the correct method of sharing a counter between multiprocessesing Processes ?
Complete code:
from multiprocessing import Process, Queue, Pool, Lock
class QueueFun():
def __init__(self):
self.count = 0
self.lock = Lock()
def write_queue(self, work_tasks, max_size):
for i in range(0, max_size):
print("Writing to queue")
work_tasks.put(1)
def read_queue(self, work_tasks, max_size):
while self.count != max_size:
self.lock.acquire()
self.count += 1
self.lock.release()
print('self.count' , self.count)
print('')
print('Reading from queue')
work_tasks.get()
if __name__ == '__main__':
q = QueueFun()
max_size = 1
work_tasks = Queue()
write_processes = []
for i in range(0,2):
write_processes.append(Process(target=q.write_queue,
args=(work_tasks,max_size)))
for p in write_processes:
p.start()
read_processes = []
for i in range(0, 2):
read_processes.append(Process(target=q.read_queue,
args=(work_tasks,max_size)))
for p in read_processes:
p.start()
for p in read_processes:
p.join()
for p in write_processes:
p.join()
print('q.count' , q.count)

Unlike threads, different processes have different address
spaces: they do not share memory with each other. Writing
to a variable in one process will not change an (unshared)
variable in another process.
In the original example, the count was 0 at the end, because
the main process never changed it (no matter what the other
spawned processes did).
Probably better to communicate between processes with Queue.
If it's really necessary, Value or Array could be used:
17.2.1.5. Sharing state between processes
As mentioned above, when doing concurrent programming it is usually
best to avoid using shared state as far as possible. This is
particularly true when using multiple processes.
However, if you really do need to use some shared data then
multiprocessing provides a couple of ways of doing so.
Shared memory Data can be stored in a shared memory map using Value or
Array.
...
These shared objects will be process and thread-safe.
multiprocessing.Value
Operations like += which involve a read and write are not atomic.
A slightly modified version of the question's code:
from multiprocessing import Process, Queue, Value
class QueueFun():
def __init__(self):
self.readCount = Value('i', 0)
self.writeCount = Value('i', 0)
def write_queue(self, work_tasks, MAX_SIZE):
with self.writeCount.get_lock():
if self.writeCount != MAX_SIZE:
self.writeCount.value += 1
work_tasks.put(1)
def read_queue(self, work_tasks, MAX_SIZE):
with self.readCount.get_lock():
if self.readCount.value != MAX_SIZE:
self.readCount.value += 1
work_tasks.get()
if __name__ == '__main__':
q = QueueFun()
MAX_SIZE = 2
work_tasks = Queue()
write_processes = []
for i in range(MAX_SIZE):
write_processes.append(Process(target=q.write_queue,
args=(work_tasks,MAX_SIZE)))
for p in write_processes: p.start()
read_processes = []
for i in range(MAX_SIZE):
read_processes.append(Process(target=q.read_queue,
args=(work_tasks,MAX_SIZE)))
for p in read_processes: p.start()
for p in read_processes: p.join()
for p in write_processes: p.join()
print('q.writeCount.value' , q.writeCount.value)
print('q.readCount.value' , q.readCount.value)
Note: printing to standard output from multiple processes,
can result in output getting mixed up (not synchronized).

Python port scanner, how to determine optimal number of threads

I have a multi threaded Python port scanner where each thread in a loop gets something (an IP address/port pair) from a common queue, does some work on it (connects, does a handshake and grabs the server's version) and loops again.
Here's some partial code:
import threading, queue, multiprocessing
class ScanProcess(multiprocessing.Process):
threads = []
def __init__(self, squeue, dqueue, count):
self.squeue = squeue
self.dqueue = dqueue
self.count = count
self._init_threads()
super(ScanProcess, self).__init__()
def _init_threads(self):
self.threads = [ScanThread(self.squeue, self.dqueue) for _ in range(0, self.count)]
def _start_threads(self):
for thread in self.threads:
thread.start()
def _join_threads(self):
for thread in self.threads:
thread.join()
def run(self):
self._start_threads()
self._join_threads()
class ScanThread(threading.Thread):
def __init__(self, squeue, dqueue):
self.squeue = squeue
self.dqueue = dqueue
super(ScanThread, self).__init__()
def run(self):
while not self.squeue.empty():
try:
target = self.squeue.get(block=False)
# do the actual work then put the result in dqueue
except queue.Empty:
continue
# how many threads/processes
process_count = 2
thread_count = 10
# load tasks from file or network and fill the queues
squeue = multiprocessing.Queue()
dqueue = multiprocessing.Queue()
# create and start everything
processes = [ScanProcess(squeue, dqueue, thread_count) for _ in range(0, process_count)]
for process in processes:
process.start()
for process in processes:
process.join()
# enjoy the show!
The problem I've got is that the number of threads is currently set manually. I'd like to set it automatically to saturate the network connection while not dropping packets, but I have no idea how to begin implementing that. Could anyone summarize how nmap/zmap do it?
Any help is appreciated.

How to process input in parallel with python, but without processes?

I have a list of input data and would like to process it in parallel, but processing each takes time as network io is involved. CPU usage is not a problem.
I would not like to have the overhead of additional processes since I have a lot of things to process at a time and do not want to setup inter process communication.
# the parallel execution equivalent of this?
import time
input_data = [1,2,3,4,5,6,7]
input_processor = time.sleep
results = map(input_processor, input_data)
The code I am using makes use of twisted.internet.defer so a solution involving that is fine as well.

You can easily define Worker threads that work in parallel till a queue is empty.
from threading import Thread
from collections import deque
import time
# Create a new class that inherits from Thread
class Worker(Thread):
def __init__(self, inqueue, outqueue, func):
'''
A worker that calls func on objects in inqueue and
pushes the result into outqueue
runs until inqueue is empty
'''
self.inqueue = inqueue
self.outqueue = outqueue
self.func = func
super().__init__()
# override the run method, this is starte when
# you call worker.start()
def run(self):
while self.inqueue:
data = self.inqueue.popleft()
print('start')
result = self.func(data)
self.outqueue.append(result)
print('finished')
def test(x):
time.sleep(x)
return 2 * x
if __name__ == '__main__':
data = 12 * [1, ]
queue = deque(data)
result = deque()
# create 3 workers working on the same input
workers = [Worker(queue, result, test) for _ in range(3)]
# start the workers
for worker in workers:
worker.start()
# wait till all workers are finished
for worker in workers:
worker.join()
print(result)
As expected, this runs ca. 4 seconds.
One could also write a simple Pool class to get rid of the noise in the main function:
from threading import Thread
from collections import deque
import time
class Pool():
def __init__(self, n_threads):
self.n_threads = n_threads
def map(self, func, data):
inqueue = deque(data)
result = deque()
workers = [Worker(inqueue, result, func) for i in range(self.n_threads)]
for worker in workers:
worker.start()
for worker in workers:
worker.join()
return list(result)
class Worker(Thread):
def __init__(self, inqueue, outqueue, func):
'''
A worker that calls func on objects in inqueue and
pushes the result into outqueue
runs until inqueue is empty
'''
self.inqueue = inqueue
self.outqueue = outqueue
self.func = func
super().__init__()
# override the run method, this is starte when
# you call worker.start()
def run(self):
while self.inqueue:
data = self.inqueue.popleft()
print('start')
result = self.func(data)
self.outqueue.append(result)
print('finished')
def test(x):
time.sleep(x)
return 2 * x
if __name__ == '__main__':
data = 12 * [1, ]
pool = Pool(6)
result = pool.map(test, data)
print(result)

You can use the multiprocessing module. Without knowing more about how you want it to process, you can use a pool of workers:
import multiprocessing as mp
import time
input_processor = time.sleep
core_num = mp.cpu_count()
pool=Pool(processes = core_num)
result = [pool.apply_async(input_processor(i)) for for i in range(1,7+1) ]
result_final = [p.get() for p in results]
for n in range(1,7+1):
print n, result_final[n]
The above keeps track of the order each task is done. It also does not allow the processes to talk to each other.
Editted:
To call this as a function, you should input the input data and number of processors:
def parallel_map(processor_count, input_data):
pool=Pool(processes = processor_count)
result = [pool.apply_async(input_processor(i)) for for i in input_data ]
result_final = np.array([p.get() for p in results])
result_data = np.vstack( (input_data, result_final))
return result_data

I assume you are using Twisted. In that case, you can launch multiple deferreds and wait for the completion of all of them using DeferredList:
http://twistedmatrix.com/documents/15.4.0/core/howto/defer.html#deferredlist
If input_processor is a non-blocking call (returns deferred):
def main():
input_data = [1,2,3,4,5,6,7]
input_processor = asyn_function
for entry in input_data:
requests.append(defer.maybeDeferred(input_processor, entry))
deferredList = defer.DeferredList(requests, , consumeErrors=True)
deferredList.addCallback(gotResults)
return deferredList
def gotResults(results):
for (success, value) in result:
if success:
print 'Success:', value
else:
print 'Failure:', value.getErrorMessage()
In case input_processor is a long/blocking function, you can use deferToThread instead of maybeDeferred:
def main():
input_data = [1,2,3,4,5,6,7]
input_processor = syn_function
for entry in input_data:
requests.append(threads.deferToThread(input_processor, entry))
deferredList = defer.DeferredList(requests, , consumeErrors=True)
deferredList.addCallback(gotResults)
return deferredList

How to use multiprocessing queue in Python?

I'm having much trouble trying to understand just how the multiprocessing queue works on python and how to implement it. Lets say I have two python modules that access data from a shared file, let's call these two modules a writer and a reader. My plan is to have both the reader and writer put requests into two separate multiprocessing queues, and then have a third process pop these requests in a loop and execute as such.
My main problem is that I really don't know how to implement multiprocessing.queue correctly, you cannot really instantiate the object for each process since they will be separate queues, how do you make sure that all processes relate to a shared queue (or in this case, queues)

My main problem is that I really don't know how to implement multiprocessing.queue correctly, you cannot really instantiate the object for each process since they will be separate queues, how do you make sure that all processes relate to a shared queue (or in this case, queues)
This is a simple example of a reader and writer sharing a single queue... The writer sends a bunch of integers to the reader; when the writer runs out of numbers, it sends 'DONE', which lets the reader know to break out of the read loop.
You can spawn as many reader processes as you like...
from multiprocessing import Process, Queue
import time
import sys
def reader_proc(queue):
"""Read from the queue; this spawns as a separate Process"""
while True:
msg = queue.get() # Read from the queue and do nothing
if msg == "DONE":
break
def writer(count, num_of_reader_procs, queue):
"""Write integers into the queue. A reader_proc() will read them from the queue"""
for ii in range(0, count):
queue.put(ii) # Put 'count' numbers into queue
### Tell all readers to stop...
for ii in range(0, num_of_reader_procs):
queue.put("DONE")
def start_reader_procs(qq, num_of_reader_procs):
"""Start the reader processes and return all in a list to the caller"""
all_reader_procs = list()
for ii in range(0, num_of_reader_procs):
### reader_p() reads from qq as a separate process...
### you can spawn as many reader_p() as you like
### however, there is usually a point of diminishing returns
reader_p = Process(target=reader_proc, args=((qq),))
reader_p.daemon = True
reader_p.start() # Launch reader_p() as another proc
all_reader_procs.append(reader_p)
return all_reader_procs
if __name__ == "__main__":
num_of_reader_procs = 2
qq = Queue() # writer() writes to qq from _this_ process
for count in [10**4, 10**5, 10**6]:
assert 0 < num_of_reader_procs < 4
all_reader_procs = start_reader_procs(qq, num_of_reader_procs)
writer(count, len(all_reader_procs), qq) # Queue stuff to all reader_p()
print("All reader processes are pulling numbers from the queue...")
_start = time.time()
for idx, a_reader_proc in enumerate(all_reader_procs):
print(" Waiting for reader_p.join() index %s" % idx)
a_reader_proc.join() # Wait for a_reader_proc() to finish
print(" reader_p() idx:%s is done" % idx)
print(
"Sending {0} integers through Queue() took {1} seconds".format(
count, (time.time() - _start)
)
)
print("")

Here's a dead simple usage of multiprocessing.Queue and multiprocessing.Process that allows callers to send an "event" plus arguments to a separate process that dispatches the event to a "do_" method on the process. (Python 3.4+)
import multiprocessing as mp
import collections
Msg = collections.namedtuple('Msg', ['event', 'args'])
class BaseProcess(mp.Process):
"""A process backed by an internal queue for simple one-way message passing.
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.queue = mp.Queue()
def send(self, event, *args):
"""Puts the event and args as a `Msg` on the queue
"""
msg = Msg(event, args)
self.queue.put(msg)
def dispatch(self, msg):
event, args = msg
handler = getattr(self, "do_%s" % event, None)
if not handler:
raise NotImplementedError("Process has no handler for [%s]" % event)
handler(*args)
def run(self):
while True:
msg = self.queue.get()
self.dispatch(msg)
Usage:
class MyProcess(BaseProcess):
def do_helloworld(self, arg1, arg2):
print(arg1, arg2)
if __name__ == "__main__":
process = MyProcess()
process.start()
process.send('helloworld', 'hello', 'world')
The send happens in the parent process, the do_* happens in the child process.
I left out any exception handling that would obviously interrupt the run loop and exit the child process. You can also customize it by overriding run to control blocking or whatever else.
This is really only useful in situations where you have a single worker process, but I think it's a relevant answer to this question to demonstrate a common scenario with a little more object-orientation.

I had a look at multiple answers across stack overflow and the web while trying to set-up a way of doing multiprocessing using queues for passing around large pandas dataframes. It seemed to me that every answer was re-iterating the same kind of solutions without any consideration of the multitude of edge cases one will definitely come across when setting up calculations like these. The problem is that there is many things at play at the same time. The number of tasks, the number of workers, the duration of each task and possible exceptions during task execution. All of these make synchronization tricky and most answers do not address how you can go about it. So this is my take after fiddling around for a few hours, hopefully this will be generic enough for most people to find it useful.
Some thoughts before any coding examples. Since queue.Empty or queue.qsize() or any other similar method is unreliable for flow control, any code of the like
while True:
try:
task = pending_queue.get_nowait()
except queue.Empty:
break
is bogus. This will kill the worker even if milliseconds later another task turns up in the queue. The worker will not recover and after a while ALL the workers will disappear as they randomly find the queue momentarily empty. The end result will be that the main multiprocessing function (the one with the join() on the processes) will return without all the tasks having completed. Nice. Good luck debugging through that if you have thousands of tasks and a few are missing.
The other issue is the use of sentinel values. Many people have suggested adding a sentinel value in the queue to flag the end of the queue. But to flag it to whom exactly? If there is N workers, assuming N is the number of cores available give or take, then a single sentinel value will only flag the end of the queue to one worker. All the other workers will sit waiting for more work when there is none left. Typical examples I've seen are
while True:
task = pending_queue.get()
if task == SOME_SENTINEL_VALUE:
break
One worker will get the sentinel value while the rest will wait indefinitely. No post I came across mentioned that you need to submit the sentinel value to the queue AT LEAST as many times as you have workers so that ALL of them get it.
The other issue is the handling of exceptions during task execution. Again these should be caught and managed. Moreover, if you have a completed_tasks queue you should independently count in a deterministic way how many items are in the queue before you decide that the job is done. Again relying on queue sizes is bound to fail and returns unexpected results.
In the example below, the par_proc() function will receive a list of tasks including the functions with which these tasks should be executed alongside any named arguments and values.
import multiprocessing as mp
import dill as pickle
import queue
import time
import psutil
SENTINEL = None
def do_work(tasks_pending, tasks_completed):
# Get the current worker's name
worker_name = mp.current_process().name
while True:
try:
task = tasks_pending.get_nowait()
except queue.Empty:
print(worker_name + ' found an empty queue. Sleeping for a while before checking again...')
time.sleep(0.01)
else:
try:
if task == SENTINEL:
print(worker_name + ' no more work left to be done. Exiting...')
break
print(worker_name + ' received some work... ')
time_start = time.perf_counter()
work_func = pickle.loads(task['func'])
result = work_func(**task['task'])
tasks_completed.put({work_func.__name__: result})
time_end = time.perf_counter() - time_start
print(worker_name + ' done in {} seconds'.format(round(time_end, 5)))
except Exception as e:
print(worker_name + ' task failed. ' + str(e))
tasks_completed.put({work_func.__name__: None})
def par_proc(job_list, num_cpus=None):
# Get the number of cores
if not num_cpus:
num_cpus = psutil.cpu_count(logical=False)
print('* Parallel processing')
print('* Running on {} cores'.format(num_cpus))
# Set-up the queues for sending and receiving data to/from the workers
tasks_pending = mp.Queue()
tasks_completed = mp.Queue()
# Gather processes and results here
processes = []
results = []
# Count tasks
num_tasks = 0
# Add the tasks to the queue
for job in job_list:
for task in job['tasks']:
expanded_job = {}
num_tasks = num_tasks + 1
expanded_job.update({'func': pickle.dumps(job['func'])})
expanded_job.update({'task': task})
tasks_pending.put(expanded_job)
# Use as many workers as there are cores (usually chokes the system so better use less)
num_workers = num_cpus
# We need as many sentinels as there are worker processes so that ALL processes exit when there is no more
# work left to be done.
for c in range(num_workers):
tasks_pending.put(SENTINEL)
print('* Number of tasks: {}'.format(num_tasks))
# Set-up and start the workers
for c in range(num_workers):
p = mp.Process(target=do_work, args=(tasks_pending, tasks_completed))
p.name = 'worker' + str(c)
processes.append(p)
p.start()
# Gather the results
completed_tasks_counter = 0
while completed_tasks_counter < num_tasks:
results.append(tasks_completed.get())
completed_tasks_counter = completed_tasks_counter + 1
for p in processes:
p.join()
return results
And here is a test to run the above code against
def test_parallel_processing():
def heavy_duty1(arg1, arg2, arg3):
return arg1 + arg2 + arg3
def heavy_duty2(arg1, arg2, arg3):
return arg1 * arg2 * arg3
task_list = [
{'func': heavy_duty1, 'tasks': [{'arg1': 1, 'arg2': 2, 'arg3': 3}, {'arg1': 1, 'arg2': 3, 'arg3': 5}]},
{'func': heavy_duty2, 'tasks': [{'arg1': 1, 'arg2': 2, 'arg3': 3}, {'arg1': 1, 'arg2': 3, 'arg3': 5}]},
]
results = par_proc(task_list)
job1 = sum([y for x in results if 'heavy_duty1' in x.keys() for y in list(x.values())])
job2 = sum([y for x in results if 'heavy_duty2' in x.keys() for y in list(x.values())])
assert job1 == 15
assert job2 == 21
plus another one with some exceptions
def test_parallel_processing_exceptions():
def heavy_duty1_raises(arg1, arg2, arg3):
raise ValueError('Exception raised')
return arg1 + arg2 + arg3
def heavy_duty2(arg1, arg2, arg3):
return arg1 * arg2 * arg3
task_list = [
{'func': heavy_duty1_raises, 'tasks': [{'arg1': 1, 'arg2': 2, 'arg3': 3}, {'arg1': 1, 'arg2': 3, 'arg3': 5}]},
{'func': heavy_duty2, 'tasks': [{'arg1': 1, 'arg2': 2, 'arg3': 3}, {'arg1': 1, 'arg2': 3, 'arg3': 5}]},
]
results = par_proc(task_list)
job1 = sum([y for x in results if 'heavy_duty1' in x.keys() for y in list(x.values())])
job2 = sum([y for x in results if 'heavy_duty2' in x.keys() for y in list(x.values())])
assert not job1
assert job2 == 21
Hope that is helpful.

in "from queue import Queue" there is no module called queue, instead multiprocessing should be used. Therefore, it should look like "from multiprocessing import Queue"

Just made a simple and general example for demonstrating passing a message over a Queue between 2 standalone programs. It doesn't directly answer the OP's question but should be clear enough indicating the concept.
Server:
multiprocessing-queue-manager-server.py
import asyncio
import concurrent.futures
import multiprocessing
import multiprocessing.managers
import queue
import sys
import threading
from typing import Any, AnyStr, Dict, Union
class QueueManager(multiprocessing.managers.BaseManager):
def get_queue(self, ident: Union[AnyStr, int, type(None)] = None) -> multiprocessing.Queue:
pass
def get_queue(ident: Union[AnyStr, int, type(None)] = None) -> multiprocessing.Queue:
global q
if not ident in q:
q[ident] = multiprocessing.Queue()
return q[ident]
q: Dict[Union[AnyStr, int, type(None)], multiprocessing.Queue] = dict()
delattr(QueueManager, 'get_queue')
def init_queue_manager_server():
if not hasattr(QueueManager, 'get_queue'):
QueueManager.register('get_queue', get_queue)
def serve(no: int, term_ev: threading.Event):
manager: QueueManager
with QueueManager(authkey=QueueManager.__name__.encode()) as manager:
print(f"Server address {no}: {manager.address}")
while not term_ev.is_set():
try:
item: Any = manager.get_queue().get(timeout=0.1)
print(f"Client {no}: {item} from {manager.address}")
except queue.Empty:
continue
async def main(n: int):
init_queue_manager_server()
term_ev: threading.Event = threading.Event()
executor: concurrent.futures.ThreadPoolExecutor = concurrent.futures.ThreadPoolExecutor()
i: int
for i in range(n):
asyncio.ensure_future(asyncio.get_running_loop().run_in_executor(executor, serve, i, term_ev))
# Gracefully shut down
try:
await asyncio.get_running_loop().create_future()
except asyncio.CancelledError:
term_ev.set()
executor.shutdown()
raise
if __name__ == '__main__':
asyncio.run(main(int(sys.argv[1])))
Client:
multiprocessing-queue-manager-client.py
import multiprocessing
import multiprocessing.managers
import os
import sys
from typing import AnyStr, Union
class QueueManager(multiprocessing.managers.BaseManager):
def get_queue(self, ident: Union[AnyStr, int, type(None)] = None) -> multiprocessing.Queue:
pass
delattr(QueueManager, 'get_queue')
def init_queue_manager_client():
if not hasattr(QueueManager, 'get_queue'):
QueueManager.register('get_queue')
def main():
init_queue_manager_client()
manager: QueueManager = QueueManager(sys.argv[1], authkey=QueueManager.__name__.encode())
manager.connect()
message = f"A message from {os.getpid()}"
print(f"Message to send: {message}")
manager.get_queue().put(message)
if __name__ == '__main__':
main()
Usage
Server:
$ python3 multiprocessing-queue-manager-server.py N
N is a integer indicating how many servers should be created. Copy one of the <server-address-N> output by the server and make it the first argument of each multiprocessing-queue-manager-client.py.
Client:
python3 multiprocessing-queue-manager-client.py <server-address-1>
Result
Server:
Client 1: <item> from <server-address-1>
Gist: https://gist.github.com/89062d639e40110c61c2f88018a8b0e5
UPD: Created a package here.
Server:
import ipcq
with ipcq.QueueManagerServer(address=ipcq.Address.AUTO, authkey=ipcq.AuthKey.AUTO) as server:
server.get_queue().get()
Client:
import ipcq
client = ipcq.QueueManagerClient(address=ipcq.Address.AUTO, authkey=ipcq.AuthKey.AUTO)
client.get_queue().put('a message')

We implemented two versions of this, one a simple multi thread pool that can execute many types of callables, making our lives much easier and the second version that uses processes, which is less flexible in terms of callables and requires and extra call to dill.
Setting frozen_pool to true will freeze execution until finish_pool_queue is called in either class.
Thread Version:
'''
Created on Nov 4, 2019
#author: Kevin
'''
from threading import Lock, Thread
from Queue import Queue
import traceback
from helium.loaders.loader_retailers import print_info
from time import sleep
import signal
import os
class ThreadPool(object):
def __init__(self, queue_threads, *args, **kwargs):
self.frozen_pool = kwargs.get('frozen_pool', False)
self.print_queue = kwargs.get('print_queue', True)
self.pool_results = []
self.lock = Lock()
self.queue_threads = queue_threads
self.queue = Queue()
self.threads = []
for i in range(self.queue_threads):
t = Thread(target=self.make_pool_call)
t.daemon = True
t.start()
self.threads.append(t)
def make_pool_call(self):
while True:
if self.frozen_pool:
#print '--> Queue is frozen'
sleep(1)
continue
item = self.queue.get()
if item is None:
break
call = item.get('call', None)
args = item.get('args', [])
kwargs = item.get('kwargs', {})
keep_results = item.get('keep_results', False)
try:
result = call(*args, **kwargs)
if keep_results:
self.lock.acquire()
self.pool_results.append((item, result))
self.lock.release()
except Exception as e:
self.lock.acquire()
print e
traceback.print_exc()
self.lock.release()
os.kill(os.getpid(), signal.SIGUSR1)
self.queue.task_done()
def finish_pool_queue(self):
self.frozen_pool = False
while self.queue.unfinished_tasks > 0:
if self.print_queue:
print_info('--> Thread pool... %s' % self.queue.unfinished_tasks)
sleep(5)
self.queue.join()
for i in range(self.queue_threads):
self.queue.put(None)
for t in self.threads:
t.join()
del self.threads[:]
def get_pool_results(self):
return self.pool_results
def clear_pool_results(self):
del self.pool_results[:]
Process Version:
'''
Created on Nov 4, 2019
#author: Kevin
'''
import traceback
from helium.loaders.loader_retailers import print_info
from time import sleep
import signal
import os
from multiprocessing import Queue, Process, Value, Array, JoinableQueue, Lock,\
RawArray, Manager
from dill import dill
import ctypes
from helium.misc.utils import ignore_exception
from mem_top import mem_top
import gc
class ProcessPool(object):
def __init__(self, queue_processes, *args, **kwargs):
self.frozen_pool = Value(ctypes.c_bool, kwargs.get('frozen_pool', False))
self.print_queue = kwargs.get('print_queue', True)
self.manager = Manager()
self.pool_results = self.manager.list()
self.queue_processes = queue_processes
self.queue = JoinableQueue()
self.processes = []
for i in range(self.queue_processes):
p = Process(target=self.make_pool_call)
p.start()
self.processes.append(p)
print 'Processes', self.queue_processes
def make_pool_call(self):
while True:
if self.frozen_pool.value:
sleep(1)
continue
item_pickled = self.queue.get()
if item_pickled is None:
#print '--> Ending'
self.queue.task_done()
break
item = dill.loads(item_pickled)
call = item.get('call', None)
args = item.get('args', [])
kwargs = item.get('kwargs', {})
keep_results = item.get('keep_results', False)
try:
result = call(*args, **kwargs)
if keep_results:
self.pool_results.append(dill.dumps((item, result)))
else:
del call, args, kwargs, keep_results, item, result
except Exception as e:
print e
traceback.print_exc()
os.kill(os.getpid(), signal.SIGUSR1)
self.queue.task_done()
def finish_pool_queue(self, callable=None):
self.frozen_pool.value = False
while self.queue._unfinished_tasks.get_value() > 0:
if self.print_queue:
print_info('--> Process pool... %s' % (self.queue._unfinished_tasks.get_value()))
if callable:
callable()
sleep(5)
for i in range(self.queue_processes):
self.queue.put(None)
self.queue.join()
self.queue.close()
for p in self.processes:
with ignore_exception: p.join(10)
with ignore_exception: p.terminate()
with ignore_exception: del self.processes[:]
def get_pool_results(self):
return self.pool_results
def clear_pool_results(self):
del self.pool_results[:]
def test(eg):
print 'EG', eg
Call with either:
tp = ThreadPool(queue_threads=2)
tp.queue.put({'call': test, 'args': [random.randint(0, 100)]})
tp.finish_pool_queue()
or
pp = ProcessPool(queue_processes=2)
pp.queue.put(dill.dumps({'call': test, 'args': [random.randint(0, 100)]}))
pp.queue.put(dill.dumps({'call': test, 'args': [random.randint(0, 100)]}))
pp.finish_pool_queue()

A multi-producers and multi-consumers example, verified. It should be easy to modify it to cover other cases, single/multi producers, single/multi consumers.
from multiprocessing import Process, JoinableQueue
import time
import os
q = JoinableQueue()
def producer():
for item in range(30):
time.sleep(2)
q.put(item)
pid = os.getpid()
print(f'producer {pid} done')
def worker():
while True:
item = q.get()
pid = os.getpid()
print(f'pid {pid} Working on {item}')
print(f'pid {pid} Finished {item}')
q.task_done()
for i in range(5):
p = Process(target=worker, daemon=True).start()
# send thirty task requests to the worker
producers = []
for i in range(2):
p = Process(target=producer)
producers.append(p)
p.start()
# make sure producers done
for p in producers:
p.join()
# block until all workers are done
q.join()
print('All work completed')
Explanation:
Two producers and five consumers in this example.
JoinableQueue is used to make sure all elements stored in queue will be processed. 'task_done' is for worker to notify an element is done. 'q.join()' will wait for all elements marked as done.
With #2, there is no need to join wait for every worker.
But it is important to join wait for every producer to store element into queue. Otherwise, program exit immediately.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python multiprocessing for dataset preparation - python

Related

Python multiprocessing possible deadlock with two queue as producer-consumer pattern?

Python - Incrementing a count producing unexpected results

Python port scanner, how to determine optimal number of threads

How to process input in parallel with python, but without processes?

How to use multiprocessing queue in Python?

Categories

Resources