Fill a Queue with Objects from several data loaders using multiprocessing - python

I am working on a machine learning input pipeline. I wrote a data loader that reads data from a large .hdf file and returns slices, which takes roughly 2 seconds per slice. Therefore I would like to use a queue that takes in objects from several data loaders and can return single objects via a next function (like a generator). Furthermore, the processes that fill the queue should run in the background, refilling the queue whenever it is not full. I cannot get it to work properly. It did work with a single data loader, but it gave me the same slices 4 times.
import multiprocessing as mp

class Queue_Generator():
    def __init__(self, data_loader_list):
        self.pool = mp.Pool(4)
        self.data_loader_list = data_loader_list
        self.queue = mp.Queue(maxsize=16)
        self.pool.map(self.fill_queue, self.data_loader_list)

    def fill_queue(self, gen):
        self.queue.put(next(gen))

    def __next__(self):
        yield self.queue.get()
What I get from this:
NotImplementedError: pool objects cannot be passed between processes or pickled
Thanks in advance

Your specific error means that you cannot have a pool as part of your class when you are passing class methods to a pool: pickling the bound method pulls in the whole instance, including the pool, and pool objects cannot be pickled. What I would suggest is the following:
import multiprocessing as mp
from queue import Empty


class QueueGenerator(object):
    def __init__(self, data_loader_list):
        self.data_loader_list = data_loader_list
        self.queue = mp.Queue(maxsize=16)

    def __iter__(self):
        processes = list()
        for _ in range(4):
            pr = mp.Process(target=fill_queue, args=(self.queue, self.data_loader_list))
            pr.start()
            processes.append(pr)
        return self

    def __next__(self):
        # The timeout needs a value, otherwise your loop will never stop. Make it
        # something that gives your processes enough time to update the queue, but
        # not so long that your program freezes for an extended period of time
        # after all information has been processed.
        try:
            return self.queue.get(timeout=1)
        except Empty:
            raise StopIteration


# have fill_queue as a separate function
def fill_queue(queue, gen):
    while True:
        try:
            value = next(gen)
            queue.put(value)
        except StopIteration:  # assumes the given data_loader_list is an iterator
            break
    print('stopping')


gen = iter(range(70))
qg = QueueGenerator(gen)

for val in qg:
    print(val)

# test if it works several times:
for val in qg:
    print(val)
The next issue for you to solve, I think, is to make data_loader_list something that provides new information in every separate process. But since you have not given any information about that, I can't help you with it. The above does, however, give you a way to have the processes fill your queue, which is then passed out as an iterator.
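For illustration only, here is a minimal sketch of one possibility (not part of the answer above): it assumes data_loader_list is a list of independent, picklable loaders, with plain lists standing in for the real hdf-backed loaders, and starts one worker process per loader:

import multiprocessing as mp


def fill_queue_single(queue, loader):
    # each worker drains exactly one loader into the shared queue
    for item in loader:
        queue.put(item)


if __name__ == '__main__':
    queue = mp.Queue(maxsize=16)
    # plain lists stand in for the real hdf-backed data loaders
    data_loader_list = [list(range(i * 10, i * 10 + 5)) for i in range(4)]
    workers = [mp.Process(target=fill_queue_single, args=(queue, loader))
               for loader in data_loader_list]
    for w in workers:
        w.start()
    for _ in range(sum(len(loader) for loader in data_loader_list)):
        print(queue.get())
    for w in workers:
        w.join()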

Not quite sure why you are yielding in __next__; that doesn't look right to me. __next__ should return a value, not a generator object.
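For example, a minimal iterator sketch (a plain counter, purely illustrative) where __next__ returns one value per call instead of yielding:

class Counter:
    """Minimal iterator: __next__ returns one value per call (or raises StopIteration)."""
    def __init__(self, limit):
        self.limit = limit
        self.current = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.current >= self.limit:
            raise StopIteration
        self.current += 1
        return self.current - 1


print(list(Counter(3)))  # prints [0, 1, 2]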
Here is a simple way that you can return the results of parallel functions as a generator. It may or may not meet your specific requirements, but it can be tweaked to suit. It will keep processing data_loader_list until it is exhausted. This may use a lot of memory compared to keeping, for example, 4 items in a Queue at all times.
import multiprocessing as mp


def read_lines(data_loader):
    from time import sleep
    sleep(2)
    return f'did something with {data_loader}'


def make_gen(data_loader_list):
    with mp.Pool(4) as pool:
        for result in pool.imap(read_lines, data_loader_list):
            yield result


if __name__ == '__main__':
    data_loader_list = [i for i in range(15)]
    result_generator = make_gen(data_loader_list)
    print(type(result_generator))
    for i in result_generator:
        print(i)
Using imap means that the results can be processed as they are produced; map (and map_async followed by .get()) would block until all results were ready before the for loop could start. See this question for more.
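As a rough illustration of that difference (a sketch with an assumed 4-worker pool and a hypothetical slow_double task):

import multiprocessing as mp
import time


def slow_double(x):
    time.sleep(1)
    return 2 * x


if __name__ == '__main__':
    with mp.Pool(4) as pool:
        start = time.time()
        # imap yields the first result after about 1 second;
        # pool.map would return nothing until the whole batch (about 2 seconds) is done
        for result in pool.imap(slow_double, range(8)):
            print(f'{time.time() - start:.1f}s: {result}')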

Related

Python: parallel execution of yield expressions in generator

I have a generator function that iterates over a large number of parameters and yields the result of another function called with those parameters. The inner function may take quite a long time to execute, so I would like to use multiprocessing to speed things up. It may also be important that I would like to be able to stop this generator in the middle of execution. But I'm not sure what the right way is to implement such logic. I need something like a queue, giving me the ability to add new tasks after old ones have finished and to yield results as soon as they are ready. I've looked over multiprocessing.Queue, but at first glance it seems not suitable for my case. Maybe somebody can advise what I should use in such a scenario?
Here is the approximate code for my task:
def gen(**kwargs):
    for param in get_params():
        yield inner_func(param)
Use the multiprocessing.pool.Pool class for multiprocessing, since its terminate method cancels all running tasks as well as those scheduled to run (cancelling futures in the concurrent.futures module does not stop tasks that are already running). And as @MisterMiyakgi indicated, it should not be necessary to use a generator. However, you should use the imap_unordered method, which returns an iterable that can be iterated to get results as they are generated by your inner_function, whereas if you were to use map you would not be able to get the first generated value until all values had been generated.
from multiprocessing import Pool


def get_params():
    """ Generator function. """
    # For example:
    for next_param in range(10):
        yield next_param


def inner_function(param):
    """ Long running function. """
    # For example:
    return param ** 2


def gen():
    pool = Pool()
    # Use imap_unordered if we do not care about the order of results, else imap:
    iterable = pool.imap_unordered(inner_function, get_params())
    # The iterable can be iterated as if it were a generator.
    # Add a terminate method to the iterable:
    def terminate():
        pool.terminate()
        pool.close()
        pool.join()
    iterable.terminate = terminate
    return iterable


# Usage:
if __name__ == '__main__':  # required for Windows
    iterable = gen()
    # iterable.terminate() should be called when done iterating the iterable,
    # but it can be called at any time to kill all running and scheduled tasks.
    # After calling terminate(), do not iterate the iterable any further.
    for result in iterable:
        print(result)
        if result == 36:
            iterable.terminate()  # kill all remaining tasks, if any
            break
Prints:
0
1
4
9
16
25
36

Organizing the contents/outputs of Python's multiprocessing queue

I'm writing a script that processes several different instances of a Class object, which contains a number of attributes and methods. The objects are all placed in a single list (myobjects = [myClass(IDnumber=1), myClass(IDnumber=2), myClass(IDnumber=3)]) and then modified by fairly simple for loops that call specific functions of the objects, of the form
for x in myobjects:
    x.myfunction()
This script utilizes logging to forward all output to a logfile that I can check later. I'm attempting to parallelize this script, because it's fairly straightforward to do so (example below), and I need to utilize a queue in order to organize the logging output from each Process. This aspect works flawlessly: I can define a new logfile for each process and then pass the object-specific logfile back to my main script, which then organizes the main logfile by appending each minor logfile in turn.
from multiprocessing import Process, Queue

queue = Queue()
threads = []
mainlog = 'mylogs.log'  # this is set up in my __init__.py but included here as demonstration

for x in myobjects:
    logfile = x.IDnumber + '.log'
    thread = Process(target=x.myfunction, args=(logfile, queue))
    threads.append(thread)
    thread.start()

for thread in threads:
    if thread.is_alive():
        thread.join()

while not queue.empty():
    minilog = queue.get()
    minilog_open = open(minilog, 'r')
    mainlog_open = open(mainlog, 'a+')
    mainlog_open.write(minilog_open.read())
My problem, now, is that I also need these objects to update a specific attribute, x.success, as True or False. Normally, in serial, x.success is updated at the end of x.myfunction() and is passed along to wherever it needs to go, and everything works great. However, in this parallel implementation, x.myfunction() populates x.success inside the child Process, but that information never makes it back to the main script: if I add print(success) inside myfunction(), I see True or False, but if I add for x in myobjects: print(x.success) after the queue.get() block, I just see None. I realize that I can just use queue.put(success) in myfunction() the same way I use queue.put(logfile), but what happens when two or more processes finish simultaneously? There's no guarantee (that I know of) that my queue will be organized like
logfile (for myobjects[0])
success = True (for myobjects[0])
logfile (for myobjects[1])
success = False (for myobjects[1]) (etc etc)
How can I organize object-specific outputs from a queue, if this queue contains both logfiles and variables? I need to know the content of x.success for each x.myfunction(), so that information has to come back to the main process somehow.
The OP has requested an example to demonstrate the concepts mentioned in my comment. An explanation follows the code:
import concurrent.futures


class MyObject:
    def __init__(self):
        self._ID = str(id(self))
        self._status = None

    @property
    def ID(self):
        return self._ID

    @property
    def status(self):
        return self._status

    @status.setter
    def status(self, status):
        self._status = status

    def MyFunction(self):
        # do the real work here
        self.status = True


def MyThreadFunc(args):
    myObject = args[0]
    myObject.MyFunction()
    # note that the wrapper function returns a tuple
    return myObject.status, myObject.ID


if __name__ == '__main__':
    N = 10  # number of instances of MyObject
    myObjects = [MyObject() for _ in range(N)]
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = {executor.submit(MyThreadFunc, [o]): o for o in myObjects}
        for future in concurrent.futures.as_completed(futures):
            _status, _id = future.result()
            print(f'Status is {_status} for ID {_id}')
The class MyObject obviously doesn't do very much. The key features are that it has a string version of its id, a status and a function that does something but implicitly returns None.
We write a wrapper function that takes a reference to an instance of MyObject (the first element of the iterable args), executes MyFunction() on that particular instance, and then returns that instance's status and ID as a tuple.
The main loop uses a pattern that I use a lot and I'm sure many others do too. Using a dictionary comprehension, we build the so-called "futures", mapping each future back to its MyObject instance. Remember that the second argument passed to submit() here must be an iterable, because MyThreadFunc expects a single sequence of arguments even though it only needs one value.
We then wait for the threads to complete and get their return values.
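If the work were moved into separate processes instead (for example concurrent.futures.ProcessPoolExecutor), the worker operates on a copy of each object, so the returned values have to be copied back onto the originals in the main process. A minimal sketch of that idea, using plain dicts as hypothetical stand-ins for the objects:

import concurrent.futures


def work(obj_id):
    # stand-in for MyFunction running in a worker process; returns (status, ID)
    return True, obj_id


if __name__ == '__main__':
    objects = {str(i): {'status': None} for i in range(10)}  # hypothetical records
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = {executor.submit(work, obj_id): obj_id for obj_id in objects}
        for future in concurrent.futures.as_completed(futures):
            status, obj_id = future.result()
            objects[obj_id]['status'] = status  # copy the result back in the main process
    print(objects)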

Make python generator run in background

Right now I have some code that does roughly the following
def generator():
    while True:
        value = do_some_lengthy_IO()
        yield value


def model():
    for datapoint in generator():
        do_some_lengthy_computation(datapoint)
Right now, the I/O and the computation happen serially. Ideally the two should run concurrently (with the generator having the next value ready), since they share nothing but the value being passed. I started looking into this and got very confused by the multiprocessing, threading, and async options, and could not get a minimal working example going. Also, since some of this seems to be fairly recent functionality, I am using Python 3.6.
I ended up figuring it out. The simplest way is to use the multiprocessing package and a pipe to communicate with the child process. I wrote a wrapper that can take any generator:
import time
import multiprocessing


def bg(gen):
    def _bg_gen(gen, conn):
        while conn.recv():
            try:
                conn.send(next(gen))
            except StopIteration:
                conn.send(StopIteration)
                return

    parent_conn, child_conn = multiprocessing.Pipe()
    p = multiprocessing.Process(target=_bg_gen, args=(gen, child_conn))
    p.start()

    parent_conn.send(True)
    while True:
        parent_conn.send(True)
        x = parent_conn.recv()
        if x is StopIteration:
            return
        else:
            yield x


def generator(n):
    for i in range(n):
        time.sleep(1)
        yield i


# This takes 2s/iteration
for i in generator(100):
    time.sleep(1)

# This takes 1s/iteration
for i in bg(generator(100)):
    time.sleep(1)
The only thing missing right now is that, for infinite generators, the process is never killed, but that can easily be added by doing a parent_conn.send(False).
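A sketch of that shutdown path, assuming a small hypothetical wrapper class (without the one-request prefetch of the bg function above) so there is somewhere to hang a close() method:

import itertools
import multiprocessing
import time


def _worker(gen_fn, gen_args, conn):
    # build the generator inside the child so nothing unpicklable crosses the pipe
    gen = gen_fn(*gen_args)
    while conn.recv():                    # a False request means "stop"
        try:
            conn.send(next(gen))
        except StopIteration:
            conn.send(StopIteration)
            return


class BackgroundGenerator:
    def __init__(self, gen_fn, *gen_args):
        self._parent, child = multiprocessing.Pipe()
        self._proc = multiprocessing.Process(target=_worker, args=(gen_fn, gen_args, child))
        self._proc.start()

    def __iter__(self):
        return self

    def __next__(self):
        self._parent.send(True)           # request the next value
        value = self._parent.recv()
        if value is StopIteration:
            raise StopIteration
        return value

    def close(self):
        self._parent.send(False)          # the worker's recv() returns False and it exits
        self._proc.join()


def ticker():
    for i in itertools.count():           # an infinite generator
        time.sleep(1)
        yield i


if __name__ == '__main__':
    bg_gen = BackgroundGenerator(ticker)
    for value in bg_gen:
        print(value)
        if value == 3:
            bg_gen.close()                # stop the otherwise never-ending worker
            break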

parallel writing to list in python

I have multiple parallel threads writing into one list in Python. My code is:
import threading

global_list = []


class MyThread(threading.Thread):
    ...

    def run(self):
        results = self.calculate_results()
        global_list.extend(results)


def total_results():
    for param in params:
        t = MyThread(param)
        t.start()
    while threading.active_count() > 1:
        pass
    return global_list
I don't like this approach as it has:
An overall global variable -> what would be the way to have a local variable for the `total_results` function?
The way I check when the list can be returned seems somewhat clumsy; what would be the standard way?
Is your computation CPU-intensive? If so, you should look at the multiprocessing module, which is included with Python and offers a fairly easy-to-use Pool class into which you can feed compute tasks and later get all the results. If you need a lot of CPU time this will be faster anyway, because Python doesn't do threading all that well: only a single interpreter thread can run at a time in one process. Multiprocessing sidesteps that (and offers the Pool abstraction, which makes your job easier). Oh, and if you really want to stick with threads, multiprocessing has a ThreadPool too.
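For instance, a minimal sketch of the Pool approach, with a stand-in calculate_results function since the original isn't shown:

from multiprocessing import Pool


def calculate_results(param):
    # stand-in for the real per-thread computation
    return [param, param ** 2]


def total_results(params):
    # no global variable: map returns one result list per param,
    # and we flatten them just like global_list.extend(results) did
    with Pool() as pool:
        all_results = []
        for partial in pool.map(calculate_results, params):
            all_results.extend(partial)
    return all_results


if __name__ == '__main__':
    print(total_results(range(5)))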
1 - Use a class variable, shared between all Worker instances, to collect your results
from threading import Thread


class Worker(Thread):
    results = []
    ...

    def run(self):
        results = self.calculate_results()
        Worker.results.extend(results)  # extending a list is thread safe
2 - Use join() to wait until all the threads are done and give them the computation time they need
def total_results(params):
    # create all workers
    workers = [Worker(p) for p in params]
    # start all workers
    [w.start() for w in workers]
    # wait for all of them to finish
    [w.join() for w in workers]
    # get the result
    return Worker.results

Get all items from thread Queue

I have one thread that writes results into a Queue.
In another thread (GUI), I periodically (in the IDLE event) check if there are results in the queue, like this:
from queue import Empty

def queue_get_all(q):
    items = []
    while True:
        try:
            items.append(q.get_nowait())
        except Empty:
            break
    return items
Is this a good way to do it ?
Edit:
I'm asking because sometimes the waiting thread gets stuck for a few seconds without taking out new results.
The "stuck" problem turned out to be because I was doing the processing in the idle event handler, without making sure that such events are actually generated by calling wx.WakeUpIdle, as is recommended.
If you're always pulling all available items off the queue, is there any real point in using a queue, rather than just a list with a lock? I.e.:
from __future__ import with_statement
import threading


class ItemStore(object):
    def __init__(self):
        self.lock = threading.Lock()
        self.items = []

    def add(self, item):
        with self.lock:
            self.items.append(item)

    def getAll(self):
        with self.lock:
            items, self.items = self.items, []
            return items
If you're also pulling them individually, and making use of the blocking behaviour for empty queues, then you should use Queue, but your use case looks much simpler, and might be better served by the above approach.
[Edit2] I'd missed the fact that you're polling the queue from an idle loop, and from your update, I see that the problem isn't related to contention, so the below approach isn't really relevant to your problem. I've left it in in case anyone finds a blocking variant of this useful:
For cases where you do want to block until you get at least one result, you can modify the above code to wait for data to become available through being signalled by the producer thread. Eg.
class ItemStore(object):
    def __init__(self):
        self.cond = threading.Condition()
        self.items = []

    def add(self, item):
        with self.cond:
            self.items.append(item)
            self.cond.notify()  # Wake 1 thread waiting on cond (if any)

    def getAll(self, blocking=False):
        with self.cond:
            # If blocking is true, always return at least 1 item
            while blocking and len(self.items) == 0:
                self.cond.wait()
            items, self.items = self.items, []
            return items
I think the easiest way of getting all items out of the queue is the following:
def get_all_queue_result(queue):
    result_list = []
    while not queue.empty():
        result_list.append(queue.get())
    return result_list
I'd be very surprised if the get_nowait() call caused the pause by not returning promptly when the queue was empty.
Could it be that you're posting a large number of (maybe big?) items between checks which means the receiving thread has a large amount of data to pull out of the Queue? You could try limiting the number you retrieve in one batch:
from queue import Empty

def queue_get_all(q):
    items = []
    maxItemsToRetrieve = 10
    for numOfItemsRetrieved in range(0, maxItemsToRetrieve):
        try:
            if numOfItemsRetrieved == maxItemsToRetrieve:
                break
            items.append(q.get_nowait())
        except Empty:
            break
    return items
This would limit the receiving thread to pulling up to 10 items at a time.
The simplest method is using a list comprehension:
items = [q.get() for _ in range(q.qsize())]
Use of the range function is generally frowned upon, but I haven't found a simpler method yet.
If you're done writing to the queue, qsize should do the trick without needing to check the queue for each iteration.
responseList = []
for items in range(0, q.qsize()):
    responseList.append(q.get_nowait())
I see you are using get_nowait(), which, according to the documentation, "return[s] an item if one is immediately available, else raise[s] the Empty exception".
Now, you happen to break out of the loop when an Empty exception is thrown. Thus, if there is no result immediately available in the queue, your function returns an empty items list.
Is there a reason why you are not using the get() method instead? It may be the case that the get_nowait() fails because the queue is servicing a put() request at that same moment.
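For example, a small variant (a sketch, with an assumed half-second timeout) that blocks briefly for the first item and then drains whatever else is already queued:

from queue import Empty, Queue


def queue_get_all_blocking(q, timeout=0.5):
    items = []
    try:
        # wait up to `timeout` seconds for at least one result
        items.append(q.get(timeout=timeout))
        # then grab anything else that is already queued
        while True:
            items.append(q.get_nowait())
    except Empty:
        pass
    return items


q = Queue()
for i in range(3):
    q.put(i)
print(queue_get_all_blocking(q))  # [0, 1, 2]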
