I spent quite a bit of time looking into how to use the multiprocessing package, but couldn't find anything on how to use it inside a QGIS plugin. I am developing a plugin that runs an optimization for several elements, and I would like to parallelize it.
I found a useful link on multi-threading inside a python plugin (http://snorf.net/blog/2013/12/07/multithreading-in-qgis-python-plugins/), but nothing on using the multiprocessing module, which might be easier?
I have been trying with a very basic example. I am only showing the run function from the plugin here:
def run(self):
    """Run method that performs all the real work"""
    # show the dialog
    self.dlg.show()
    # Run the dialog event loop
    result = self.dlg.exec_()
    # See if OK was pressed and run code
    if result:
        #Get number of cores
        nProcs = mp.cpu_count()
        #Start a Process
        p = mp.Pool(nProcs)
        #Define function
        def cube(x):
            return x**3
        #Run parallel
        results = p.map(cube, range(1,7))
When I run this code from the plugin in QGIS, it opens several QGIS windows, which then return errors (can't load layers, etc.). What am I missing? Do I need to start a worker first on another thread and then use multiprocessing there? Or would we use another function from multiprocessing?
Please let me know if the question needs edits. I am working under Windows 7, using QGIS 2.10.
Thanks,
UPDATE
I created a worker class to implement the function and sent it to a new thread, but I get the same problem when I use multiprocessing in that thread.
The class I created is as follows:
class Worker(QObject):
    '''Example worker'''
    def __init__(self, result_queue, f, attr=[], repet=None, nbCores=None):
        QObject.__init__(self)
        if not hasattr(f, '__call__'):
            #Check if not a function
            raise TypeError('Worker expected a function as second argument')
        if not isinstance(attr, list) and not repet==None:
            #Check if not a list if there is a repet command
            raise TypeError('Input problem:\nThe arguments for the function should be in a list if repet is provided')
        if not all(isinstance(elem, list) for elem in attr) and repet==None and len(inspect.getargspec(f).args) > 1:
            #Check if not a list of lists if there isn't a repet command
            raise TypeError('Input problem:\nThe arguments for the function should be a list of lists if repet is not provided')
        if not repet == None and (not isinstance(repet, int) or repet == 0):
            #Check that provided an integer greater than 0
            raise TypeError('If provided, repet should be None or a strictly positive integer')
        self.result_queue = result_queue
        self.f = f
        self.attr = attr
        self.repet = repet
        self.nbCores = nbCores
        if self.nbCores == None:
            self.nbCores = mp.cpu_count() - 1
    def fStar(self, arg):
        """Convert the function to taking a list as arguments"""
        return self.f(*arg)
    def run(self):
        ret = None
        try:
            if self.repet == 1:
                # estimates the function based on provided arguments
                ret = self.f(*self.attr) #The star unpacks the list into attributes
            else:
                pool = mp.Pool(processes=self.nbCores)
                if self.repet > 1:
                    ret = pool.map(self.fStar, itools.repeat(self.attr,self.repet))
                elif self.repet == None:
                    ret = pool.map(self.fStar, self.attr)
                pool.close()
                pool.join()
        except Exception, e:
            #I can't pass an exception, it makes qgis bug
            pass
        self.result_queue.put(ret) #Pass the result to the queue
    finished = pyqtSignal(object)
    error = pyqtSignal(Exception, basestring)
I start the worker and send it to a new thread using the following function:
def startWorker(f, attr, repet=None, nbCores=None):
    #Create a result queue
    result_queue = queue.Queue()
    # create a new worker instance
    worker = Worker(result_queue, f, attr, repet, nbCores)
    # start the worker in a new thread
    thread = QThread()
    worker.moveToThread(thread)
    thread.started.connect(worker.run)
    thread.start()
    #Clean up when the thread is finished
    worker.deleteLater()
    thread.quit()
    thread.wait()
    thread.deleteLater()
    #Export the result to the queue
    res = []
    while not result_queue.empty():
        r = result_queue.get()
        if r is None:
            continue
        res.append(r)
    return res
As in my initial question, I just replaced results = p.map(cube, range(1,7)) with a call to the startWorker function.
Please let me know if you have any idea how to make this work. I implemented the work in multiple threads, but it would be much faster to use several cores...
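One commonly reported cause of the extra QGIS windows on Windows is that multiprocessing starts each child by launching sys.executable, which inside QGIS is the QGIS binary rather than python.exe. A minimal sketch of a possible workaround follows. It points multiprocessing at a standalone interpreter with multiprocessing.set_executable and keeps the worker function at module level so that it can be pickled. The helper find_python_executable and the guessed interpreter locations are assumptions for illustration, not something confirmed for QGIS 2.10.

```python
import multiprocessing as mp
import os
import sys


def cube(x):
    """Worker function, defined at module level so child processes can import it."""
    return x ** 3


def find_python_executable():
    """Return the path of a standalone Python interpreter, or None if not found.

    Assumption: the interpreter shipped with QGIS sits directly under
    sys.exec_prefix. Adjust the candidates for your installation.
    """
    candidates = [
        os.path.join(sys.exec_prefix, "python.exe"),
        os.path.join(sys.exec_prefix, "pythonw.exe"),
    ]
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def run_parallel(values, processes=None):
    """Map cube over values in a process pool and return the list of results."""
    if sys.platform == "win32":
        python_exe = find_python_executable()
        if python_exe is not None:
            # Children are launched with this interpreter instead of sys.executable.
            mp.set_executable(python_exe)
    pool = mp.Pool(processes or mp.cpu_count())
    try:
        return pool.map(cube, values)
    finally:
        pool.close()
        pool.join()
```

Inside the plugin's run method this would be called as results = run_parallel(range(1, 7)), with cube and run_parallel living in a module that the child processes can import.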
I use a dedicated Python (3.8) library to control a motor drive via a USB port.
The Python library provided by the motor control drive manufacturers (ODrive) allows a single Python process to control one or more drives.
However, I would like to run 3 processes, each controlling 1 drive.
After researching options (I first considered virtual machines, Docker containers, and multi-threading), I came to believe that the easiest way to do so would be to use multiprocessing.
My problem is that I would then need a way to manage (i.e., start, monitor, and stop independently) multiple processes. The practical reason behind it is that the motors are connected to different setups. Each setup must be able to be stopped and restarted separately if it malfunctions, for instance, without affecting the other running setups.
After reading around the internet and Stack Overflow, I now understand how to create a pool of processes, how to associate processes with processor cores, how to start a pool of processes, and how to queue/join them (the latter not being needed for me).
What I don't know is how to manage them independently.
How can I separately start/stop different processes without affecting the execution of the others?
Are there libraries to manage them (perhaps even with a GUI)?
I'd probably do something like this:
import random
import time
from multiprocessing import Process, Queue

class MotorProcess:
    def __init__(self, name, com_related_params):
        self.name = name
        # Made up some parameters relating to communication
        self._params = com_related_params
        self._command_queue = Queue()
        self._status_queue = Queue()
        self._process = None

    def start(self):
        if self._process and self._process.is_alive():
            return
        self._process = Process(target=self.run_processing,
                                args=(self._command_queue, self._status_queue,
                                      self._params))
        self._process.start()

    @staticmethod
    def run_processing(command_queue, status_queue, params):
        while True:
            # Check for commands
            if not command_queue.empty():
                msg = command_queue.get(block=True, timeout=0.05)
                if msg == "stop motor":
                    status_queue.put("Stopping motor")
                elif msg == "exit":
                    return
                elif msg.startswith("move"):
                    status_queue.put("moving motor to blah")
                    # TODO: msg parsing and move motor
                else:
                    status_queue.put("unknown command")
            # Update status
            # TODO: query motor status
            status_queue.put(f"Motor is {random.randint(0, 100)}")
            time.sleep(0.5)

    def is_alive(self):
        if self._process and self._process.is_alive():
            return True
        return False

    def get_status(self):
        if not self.is_alive():
            return ["not running"]
        # Empty the queue
        recent = []
        while not self._status_queue.empty():
            recent.append(self._status_queue.get(False))
        return recent

    def stop_process(self):
        if not self.is_alive():
            return
        self._command_queue.put("exit")
        # Empty the stats queue otherwise it could potentially stop
        # the process from closing.
        while not self._status_queue.empty():
            self._status_queue.get()
        self._process.join()

    def send_command(self, command):
        self._command_queue.put(command)

if __name__ == "__main__":
    processes = [MotorProcess("1", None), MotorProcess("2", None)]
    while True:
        cmd = input()
        if cmd == "start 1":
            processes[0].start()
        elif cmd == "move 1 to 100":
            processes[0].send_command("move to 100")
        elif cmd == "exit 1":
            processes[0].stop_process()
        else:
            for n, p in enumerate(processes):
                print(f"motor {n + 1}", end="\n\t")
                print("\n\t".join(p.get_status()))
Not production ready (e.g. no exception handling, no proper command parsing, etc.) but shows the idea.
Shout if there are any problems :D
You can create multiple multiprocessing.Process instances manually like this:
import multiprocessing

def my_func(a, b):
    pass

p = multiprocessing.Process(target=my_func, args=(100, 200))
p.start()
and manage them using multiprocessing primitives Queue, Event, Condition etc. Please refer to the official documentation for details: https://docs.python.org/3/library/multiprocessing.html
In the following example multiple processes are started and stopped independently. Event is used to determine when to stop a process. Queue is used for results passing from the child processes to the main process.
import multiprocessing
import queue
import random
import time

def worker_process(
    process_id: int,
    results_queue: multiprocessing.Queue,
    to_stop: multiprocessing.Event,
):
    print(f"Process {process_id} is started")
    while not to_stop.is_set():
        print(f"Process {process_id} is working")
        time.sleep(0.5)
        result = random.random()
        results_queue.put((process_id, result))
    print(f"Process {process_id} exited")

# The guard is needed on platforms that spawn child processes (e.g. Windows)
if __name__ == "__main__":
    process_pool = []
    result_queue = multiprocessing.Queue()
    while True:
        if random.random() < 0.3:
            # starting a new process
            process_id = random.randint(0, 10_000)
            to_stop = multiprocessing.Event()
            p = multiprocessing.Process(
                target=worker_process, args=(process_id, result_queue, to_stop)
            )
            p.start()
            process_pool.append((p, to_stop))
        if random.random() < 0.2:
            # closing a random process
            if process_pool:
                process, to_stop = process_pool.pop(
                    random.randint(0, len(process_pool) - 1)
                )
                to_stop.set()
                process.join()
        try:
            p_id, result = result_queue.get_nowait()
            print(f"Completed: process_id={p_id} result={result}")
        except queue.Empty:
            pass
        time.sleep(1)
I have two custom Python classes, the first one has a method to make some calculations (using Pool) and create a new instance attribute, and the second one is used to aggregate two objects of the first class and has a method with which I want to send said calculations (also in parallel) in the two first-class objects and correctly save their new instance attributes.
Dummy code:
from multiprocessing import Pool, Process

class State:
    def __init__(self, data):
        self.data = data
    def calculate(self):
        with Pool() as p:
            p.map(function, args)
        new_attribute = ...  # some code that reads the files generated with the Pool
        self.new_attribute = new_attribute
        return

class Pair:
    def __init__(self, state1:State, state2:State):
        self.state1 = state1
        self.state2 = state2
    def calculate_states(self):
        for state in [self.state1, self.state2]:
            p = Process(target=state.calculate, args=args)
            p.start()
        return

state1 = State(data1)
state2 = State(data2)
pair = Pair(state1, state2)
pair.calculate_states()
The problem is that, as I have found out during my extensive research about the problem, multiprocessing.Process creates copies of the namespace in which the processes work, and the values aren't returned to the main namespace. Setting the process.daemon to True produces an error, because "daemonic processes aren't allowed to have children", which is the same thing that happens if I exchange the Processes by an additional Pool. Using multiprocess (instead of multiprocessing) or concurrent.futures doesn't seem to work either. Additionally, I don't understand how multiprocessing.Queue works and I'm not sure if it could be applied here (I have read somewhere that it could be used).
I would like to do what I am trying to do without having to pass a shared-memory object to the Processes (to write the new_attribute into it and then apply it to the States in the main namespace). I hope someone can point me towards the solution even if I have not provided a working code/reproducible example.
Your problem arises from invoking method calculate as a new subprocess. You can still compute the new attributes in parallel without doing that by using map_async with a callback argument.
I have taken your code and provided missing function implementations to demonstrate:
from multiprocessing import Pool, cpu_count

def some_code(data):
    if data == 1:
        return 1032
    if data == 2:
        return 9874
    raise ValueError('Invalid data value:', data)

def function(val):
    ...
    # return value is not of interest

class State:
    def __init__(self, data):
        self.data = data
    def calculate(self, pool, args):
        pool.map_async(function, args, callback=self.callback)
    def callback(self, result):
        """
        Called when map_async completes
        """
        new_attribute = some_code(self.data)
        self.new_attribute = new_attribute

class Pair:
    def __init__(self, state1:State, state2:State):
        self.state1 = state1
        self.state2 = state2
    def calculate_states(self):
        args = (6, 9, 18)
        # Assumption is computation is VERY CPU-intensive
        # If there is quite a bit of I/O involved then: pool_size = 2 * len(args)
        # If it's mostly I/O you should have been using multithreading to begin with
        pool_size = min(2*len(args), cpu_count())
        with Pool(pool_size) as pool:
            for state in [self.state1, self.state2]:
                state.calculate(pool, args)
            # wait for tasks to complete
            pool.close()
            pool.join()

# Required for Windows:
if __name__ == '__main__':
    data1 = 1
    data2 = 2
    state1 = State(data1)
    state2 = State(data2)
    pair = Pair(state1, state2)
    pair.calculate_states()
    print(state1.new_attribute, state2.new_attribute)
Prints:
1032 9874
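The question also asks how multiprocessing.Queue could be applied. As a rough sketch, separate from the map_async approach above: each child process puts its result on a shared queue, and the parent reads the results back and assigns the attributes itself. The names compute_attribute, worker and calculate_in_children are invented for this illustration, and the squaring is only a stand-in for the real calculation.

```python
from multiprocessing import Process, Queue


def compute_attribute(data):
    """Stand-in for the real calculation (hypothetical)."""
    return data * data


def worker(name, data, results):
    """Runs in the child process: send the result back to the parent."""
    results.put((name, compute_attribute(data)))


def calculate_in_children(inputs):
    """Start one process per input and collect {name: result} in the parent."""
    results = Queue()
    procs = [
        Process(target=worker, args=(name, data, results))
        for name, data in inputs.items()
    ]
    for p in procs:
        p.start()
    # Read one result per process before joining, so no child blocks on a full pipe.
    # Note: this waits forever if a child dies before putting its result.
    collected = dict(results.get() for _ in procs)
    for p in procs:
        p.join()
    return collected


if __name__ == "__main__":
    new_attributes = calculate_in_children({"state1": 3, "state2": 4})
    print(new_attributes)
```

In the parent, the collected values would then be written to the State objects, for example state1.new_attribute = new_attributes["state1"]. The attribute is set in the main process, so it is not lost when the children exit.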
I have a class (MyClass) which contains a queue (self.msg_queue) of actions that need to be run and I have multiple sources of input that can add tasks to the queue.
Right now I have three functions that I want to run concurrently:
MyClass.get_input_from_user()
Creates a window in tkinter that has the user fill out information and when the user presses submit it pushes that message onto the queue.
MyClass.get_input_from_server()
Checks the server for a message, reads the message, and then puts it onto the queue. This method uses functions from MyClass's parent class.
MyClass.execute_next_item_on_the_queue()
Pops a message off of the queue and then acts upon it. It is dependent on what the message is, but each message corresponds to some method in MyClass or its parent which gets run according to a big decision tree.
Process description:
After the class has joined the network, I have it spawn three threads (one for each of the above functions). Each threaded function adds items to the queue with the syntax "self.msg_queue.put(message)" and removes items from the queue with "self.msg_queue.get_nowait()".
Problem description:
The issue I am having is that it seems that each thread is modifying its own queue object (they are not sharing the queue, msg_queue, of the class of which they, the functions, are all members).
I am not familiar enough with multiprocessing to know which error messages are the important ones; however, it states that it cannot pickle a weakref object (it gives no indication of which object is the weakref object), and that within the queue.put() call the line "self._sem.acquire(block, timeout)" yields a "[WinError 5] Access is denied" error. Would it be safe to assume that this failure is caused by the queue's reference not being copied over properly?
[I am using Python 3.7.2 and the Multiprocessing package's Process and Queue]
[I have seen multiple Q/As about having threads shuttle information between classes--create a master harness that generates a queue and then pass that queue as an argument to each thread. If the functions didn't have to use other functions from MyClass I could see adapting this strategy by having those functions take in a queue and use a local variable rather than class variables.]
[I am fairly confident that this error is not the result of passing my queue to the tkinter object as my unit tests on how my GUI modifies its caller's queue work fine]
Below is a minimal reproducible example for the queue's error:
from multiprocessing import Queue
from multiprocessing import Process
import queue
import time

class MyTest:
    def __init__(self):
        self.my_q = Queue()
        self.counter = 0
    def input_function_A(self):
        while True:
            self.my_q.put(self.counter)
            self.counter = self.counter + 1
            time.sleep(0.2)
    def input_function_B(self):
        while True:
            self.counter = 0
            self.my_q.put(self.counter)
            time.sleep(1)
    def output_function(self):
        while True:
            try:
                var = self.my_q.get_nowait()
            except queue.Empty:
                var = -1
            except:
                break
            print(var)
            time.sleep(1)
    def run(self):
        process_A = Process(target=self.input_function_A)
        process_B = Process(target=self.input_function_B)
        process_C = Process(target=self.output_function)
        process_A.start()
        process_B.start()
        process_C.start()
        # without this it generates the WinError:
        # with this it still behaves as if the two input functions do not modify the queue
        process_C.join()

if __name__ == '__main__':
    test = MyTest()
    test.run()
Indeed, these are not "threads" but "processes". If you were using multithreading rather than multiprocessing, the self.my_q instance would be the same object, placed at the same memory space on the computer.
multiprocessing runs each target in a separate process (a fork of the parent on Linux, a newly spawned interpreter on Windows), and any data in the original process (the one in execution in the "run" call) will be duplicated when it is used, so each subprocess will see its own "Queue" instance, unrelated to the others.
The correct way to have various processes share a multiprocessing.Queue object is to pass it as a parameter to the target methods. The simplest way to reorganize your code so that it works is thus:
from multiprocessing import Queue
from multiprocessing import Process
import queue
import time

class MyTest:
    def __init__(self):
        self.my_q = Queue()
        self.counter = 0
    # The parameter is named q so it does not shadow the queue module,
    # which is still needed for queue.Empty below.
    def input_function_A(self, q):
        while True:
            q.put(self.counter)
            self.counter = self.counter + 1
            time.sleep(0.2)
    def input_function_B(self, q):
        while True:
            self.counter = 0
            q.put(self.counter)
            time.sleep(1)
    def output_function(self, q):
        while True:
            try:
                var = q.get_nowait()
            except queue.Empty:
                var = -1
            except:
                break
            print(var)
            time.sleep(1)
    def run(self):
        # Pass the Queue instance itself (not the queue module) to every worker
        process_A = Process(target=self.input_function_A, args=(self.my_q,))
        process_B = Process(target=self.input_function_B, args=(self.my_q,))
        process_C = Process(target=self.output_function, args=(self.my_q,))
        process_A.start()
        process_B.start()
        process_C.start()
        # without this it generates the WinError:
        # with this it still behaves as if the two input functions do not modify the queue
        process_C.join()

if __name__ == '__main__':
    test = MyTest()
    test.run()
As you can see, since your class is not actually sharing any data through the instance's attributes, this "class" design does not make much sense for your application, except for grouping the different workers in the same code block.
It would be possible to have a magic-multiprocess-class with some internal method that actually starts the worker methods and shares the Queue instance, so if you have a lot of those in a project, there would be a lot less boilerplate.
Something along these lines:
from multiprocessing import Queue
from multiprocessing import Process
import queue
import time

class MPWorkerBase:
    def __init__(self, *args, **kw):
        self.queue = None
        self.is_parent_process = False
        self.is_child_process = False
        self.processes = []
        # ensure this can be used as a collaborative mixin
        super().__init__(*args, **kw)

    def run(self):
        if self.is_parent_process or self.is_child_process:
            # workers already initialized
            return
        self.queue = Queue()
        processes = []
        cls = self.__class__
        for name in dir(cls):
            method = getattr(cls, name)
            if callable(method) and getattr(method, "_MP_worker", False):
                process = Process(target=self._start_worker, args=(self.queue, name))
                # collect in the local list; self.processes is assigned below
                processes.append(process)
                process.start()
        # Setting these attributes only here ensures the child processes have the initial values for them.
        self.is_parent_process = True
        self.processes = processes

    def _start_worker(self, queue, method_name):
        # this method is called in a new spawned process - attribute
        # changes here no longer reflect attributes on the
        # object in the initial process
        # overwrite queue in this process with the queue object sent over the wire:
        self.queue = queue
        self.is_child_process = True
        # call the worker method
        getattr(self, method_name)()

    def __del__(self):
        for process in self.processes:
            process.join()

def worker(func):
    """decorator to mark a method as a worker that should
    run in its own subprocess
    """
    func._MP_worker = True
    return func

class MyTest(MPWorkerBase):
    def __init__(self):
        super().__init__()
        self.counter = 0

    @worker
    def input_function_A(self):
        while True:
            self.queue.put(self.counter)
            self.counter = self.counter + 1
            time.sleep(0.2)

    @worker
    def input_function_B(self):
        while True:
            self.counter = 0
            self.queue.put(self.counter)
            time.sleep(1)

    @worker
    def output_function(self):
        while True:
            try:
                var = self.queue.get_nowait()
            except queue.Empty:
                var = -1
            except:
                break
            print(var)
            time.sleep(1)

if __name__ == '__main__':
    test = MyTest()
    test.run()
Background:
I am working on a telecoms network discovery script that is run by crontab on Linux. It uses a seed file of initial network nodes, connects to them, gets all their neighbors, then connects to those neighbors, and so on. Typical recursion.
To speed up the whole thing I was using multi-threading with a Semaphore, so I had only a certain number of running threads but a huge number of started threads waiting. At a certain point I ran into the maximum thread limit of Linux, so the script was not able to start new threads.
Problem:
In pursuit of a design that would allow multi-threading of this recursion, it seemed to me that this is a hybrid producer/consumer scenario: multiple consumers are also producers.
A consumer takes an item from the queue, consumes it and, if there are any results, puts each result back into the queue.
To make it really nice I would like to create a design pattern that is usable for any type of recursive function, in other words with any args and kwargs.
What I expect from such a function is that I pass it any combination of variables (args, kwargs) that it needs, and in return I get a list of arguments that I can pass to it again in further recursions.
Questions:
Is there any better way to handle getting args, kwargs from function return other than the one I used? I basically created a tuple (args, kwargs) (tuple(), dict()), that the func returns and Worker splits it into args, kwargs afterwards. Ideal would be to not need to create that tuple at all.
Would you have any other improvement tips on this design?
Thank you sincerely!
Current Code:
#!/usr/bin/env python3
from queue import Queue, Empty
from threading import Thread
from time import sleep
from random import choice, random

class RecursiveWorkerThread(Thread):
    def __init__(self, name, pool):
        Thread.__init__(self)
        self.name = name
        self.pool = pool
        self.tasks = pool.tasks
        self.POISON = pool.POISON
        self.daemon = False
        self.result = None
        self.start()

    def run(self):
        print(f'WORKER {self.name} - is awake.')
        while True:
            if not self.tasks.empty():
                # take task from queue
                try:
                    func, f_args, f_kwargs = self.tasks.get(timeout=1)
                    # check for POISON
                    if func is self.POISON:
                        print(f'WORKER {self.name} - POISON found. Sending it back to queue. Dying...')
                        self.pool.add_task(self.POISON)
                        break
                    # try to perform the task on arguments and get result
                    try:
                        self.result = func(*f_args, **f_kwargs)
                    except Exception as e:
                        print(e)
                    # recursive part, add results to queue
                    print(f'WORKER {self.name} - FUNC: ({func.__name__}) IN: (args: {f_args}, kwargs: {f_kwargs}) OUT: ({self.result}).')
                    for n_args, n_kwargs in self.result:
                        self.pool.add_task(func, *n_args, **n_kwargs)
                    # mark one task done in queue
                    self.tasks.task_done()
                except Empty:
                    pass
            sleep(random())

class RecursiveThreadPool:
    def __init__(self, num_threads):
        self.tasks = Queue()
        self.POISON = object()
        print('\nTHREAD_POOL - initialized.\nTHREAD_POOL - waking up WORKERS.')
        self.workers = [RecursiveWorkerThread(name=str(num), pool=self) for num in range(num_threads)]

    def add_task(self, func, *args, **kwargs):
        if func is not self.POISON:
            print(f'THREAD_POOL - task received: [func: ({func.__name__}), args: ({args}), kwargs:({kwargs})]')
        else:
            print('THREAD_POOL - task received: POISON.')
        self.tasks.put((func, args, kwargs))

    def wait_for_completion(self):
        print('\nTHREAD_POOL - waiting for all tasks to be completed.')
        self.tasks.join()
        print('\nTHREAD_POOL - all tasks have been completed.\nTHREAD_POOL - sending POISON to queue.')
        self.add_task(self.POISON)
        print('THREAD_POOL - waiting for WORKERS to die.')
        for worker in self.workers:
            worker.join()
        print('\nTHREAD_POOL - all WORKERS are dead.\nTHREAD_POOL - FINISHED.')

# Test part
if __name__ == '__main__':
    percentage = [True] * 2 + [False] * 8
    # example function
    def get_subnodes(node):
        maximum_subnodes = 2
        sleep(5 * random())
        result_list = list()
        for i in range(maximum_subnodes):
            # apply chance on every possible subnode
            if choice(percentage):
                new_node = node + '.' + str(i)
                # create single result
                args = tuple()
                kwargs = dict({'node': new_node})
                # append it to the result list
                result_list.append((args, kwargs))
        return result_list
    # 1) Init a Thread pool with the desired number of worker threads
    THREAD_POOL = RecursiveThreadPool(10)
    # 2) Put initial data into queue
    initial_nodes = 10
    for root_node in [str(i) for i in range(initial_nodes)]:
        THREAD_POOL.add_task(get_subnodes, node=root_node)
    # 3) Wait for completion
    THREAD_POOL.wait_for_completion()
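The producer/consumer recursion described above could also be sketched with the standard library's concurrent.futures.ThreadPoolExecutor, which caps the number of threads and queues the waiting tasks internally. This is only one possible design, not a drop-in replacement: run_recursive is a made-up helper, it keeps the (args, kwargs) return convention from the question, and the deterministic get_subnodes below is a stand-in for the real network query.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_recursive(func, initial_calls, max_workers=4):
    """Call func on each (args, kwargs) pair and on every pair it returns.

    Returns the list of (args, kwargs) pairs that were processed.
    """
    processed = []
    lock = threading.Lock()
    all_done = threading.Event()
    # Starts at 1 so the counter cannot reach zero while seeding is in progress.
    pending = 1

    def finish_one():
        nonlocal pending
        with lock:
            pending -= 1
            if pending == 0:
                all_done.set()

    def submit(executor, args, kwargs):
        nonlocal pending
        with lock:
            pending += 1
        executor.submit(task, executor, args, kwargs)

    def task(executor, args, kwargs):
        try:
            new_calls = func(*args, **kwargs)
            with lock:
                processed.append((args, kwargs))
            # Recursive part: schedule every result as a new task.
            for n_args, n_kwargs in new_calls:
                submit(executor, n_args, n_kwargs)
        finally:
            finish_one()

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for args, kwargs in initial_calls:
            submit(executor, args, kwargs)
        finish_one()     # seeding is complete
        all_done.wait()  # every task, including recursive ones, has finished
    return processed


def get_subnodes(node):
    """Deterministic stand-in for the network query: two children, depth limit 2."""
    if node.count(".") >= 2:
        return []
    return [((), {"node": f"{node}.{i}"}) for i in range(2)]


visited = run_recursive(get_subnodes, [((), {"node": "0"})])
print(sorted(kwargs["node"] for _, kwargs in visited))
# ['0', '0.0', '0.0.0', '0.0.1', '0.1', '0.1.0', '0.1.1']
```

The pending counter plays the role of Queue.join() in the original code: it is incremented before a task is submitted and decremented when the task ends, so it only reaches zero once no task is running or waiting. An exception raised by func is silently dropped in this sketch; real code would want to log it.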
The Scipy minimization function (just to use as an example), has the option of adding a callback function at each step. So I can do something like,
def my_callback(x):
    print x
scipy.optimize.fmin(func, x0, callback=my_callback)
Is there a way to use the callback function to create a generator version of fmin, so that I could do,
for x in my_fmin(func,x0):
    print x
It seems like it might be possible with some combination of yields and sends, but I can't quite think of anything.
As pointed out in the comments, you could do it in a new thread, using Queue. The drawback is that you'd still need some way to access the final result (what fmin returns at the end). My example below uses an optional callback to do something with it (another option would be to just yield it as well, though your calling code would then have to differentiate between iteration results and final results):
from thread import start_new_thread
from Queue import Queue
import scipy.optimize

def my_fmin(func, x0, end_callback=(lambda x:x), timeout=None):
    q = Queue() # fmin produces, the generator consumes
    job_done = object() # signals the processing is done
    # Producer
    def my_callback(x):
        q.put(x)
    def task():
        ret = scipy.optimize.fmin(func,x0,callback=my_callback)
        q.put(job_done)
        end_callback(ret) # "Returns" the result of the main call
    # Starts fmin in a new thread
    start_new_thread(task,())
    # Consumer
    while True:
        next_item = q.get(True,timeout) # Blocks until an input is available
        if next_item is job_done:
            break
        yield next_item
Update: to block the execution of the next iteration until the consumer has finished processing the last one, it's also necessary to use task_done and join.
    # Producer
    def my_callback(x):
        q.put(x)
        q.join() # Blocks until task_done is called
    # Consumer
    while True:
        next_item = q.get(True,timeout) # Blocks until an input is available
        if next_item is job_done:
            break
        yield next_item
        q.task_done() # Unblocks the producer, so a new iteration can start
Note that maxsize=1 is not necessary, since no new item will be added to the queue until the last one is consumed.
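For readers on Python 3, where the thread and Queue modules became threading and queue, the task_done/join handshake can be sketched as below. The names iterate_callbacks and fake_optimizer are invented for this illustration; fake_optimizer merely stands in for a function such as scipy.optimize.fmin that reports each step through a callback.

```python
import queue
import threading


def iterate_callbacks(run, timeout=None):
    """Yield every value that run(callback) passes to its callback.

    The producer is held inside the callback until the consumer has
    finished with the previous value and asks for the next one.
    """
    q = queue.Queue()
    job_done = object()  # sentinel: the producer has finished

    def callback(value):
        q.put(value)
        q.join()  # blocks until the consumer calls task_done()

    def task():
        try:
            run(callback)
        finally:
            q.put(job_done)

    threading.Thread(target=task, daemon=True).start()
    while True:
        item = q.get(True, timeout)  # blocks until a value is available
        if item is job_done:
            break
        yield item
        q.task_done()  # unblocks the producer for the next iteration


def fake_optimizer(callback):
    """Stand-in for a routine that reports each step via a callback."""
    for step in range(3):
        callback(step * 10)


print(list(iterate_callbacks(fake_optimizer)))  # [0, 10, 20]
```

The thread is created as a daemon so that an abandoned generator does not keep the interpreter alive; the producer itself still stays blocked in that case, as discussed in the next update.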
Update 2: Also note that, unless all items are eventually retrieved by this generator, the created thread will deadlock (it will block forever and its resources will never be released). The producer is waiting on the queue, and since it stores a reference to that queue, it will never be reclaimed by the gc even if the consumer is. The queue will then become unreachable, so nobody will be able to release the lock.
A clean solution for that is unknown, if possible at all (since it would depend on the particular function used in place of fmin). A workaround could be made using timeout, having the producer raise an exception if put blocks for too long:
    q = Queue(maxsize=1)
    # Producer
    def my_callback(x):
        q.put(x)
        q.put("dummy",True,timeout) # Blocks until the first result is retrieved
        q.join() # Blocks again until task_done is called
    # Consumer
    while True:
        next_item = q.get(True,timeout) # Blocks until an input is available
        q.task_done() # (one "task_done" per "get")
        if next_item is job_done:
            break
        yield next_item
        q.get() # Retrieves the "dummy" object (must be after yield)
        q.task_done() # Unblocks the producer, so a new iteration can start
Generator as coroutine (no threading)
Let's have a FakeFtp class whose retrbinary function takes a callback that is called with each successfully read chunk of data:
class FakeFtp(object):
    def __init__(self):
        self.data = iter(["aaa", "bbb", "ccc", "ddd"])
    def login(self, user, password):
        self.user = user
        self.password = password
    def retrbinary(self, cmd, cb):
        for chunk in self.data:
            cb(chunk)
Using a simple callback function has the disadvantage that it is called repeatedly and cannot easily keep context between calls.
The following code defines a process_chunks generator, which is able to receive chunks of data one by one and process them. In contrast to a simple callback, here we are able to keep all the processing within one function without losing context.
from contextlib import closing
from itertools import count

def main():
    processed = []
    def process_chunks():
        for i in count():
            try:
                # (repeatedly) get the chunk to process
                chunk = yield
            except GeneratorExit:
                # finish_up
                print("Finishing up.")
                return
            else:
                # Here process the chunk as you like
                print("inside coroutine, processing chunk:", i, chunk)
                product = "processed({i}): {chunk}".format(i=i, chunk=chunk)
                processed.append(product)
    with closing(process_chunks()) as coroutine:
        # Get the coroutine to the first yield
        coroutine.next()
        ftp = FakeFtp()
        # next line repeatedly calls `coroutine.send(data)`
        ftp.retrbinary("RETR binary", cb=coroutine.send)
        # each callback "jumps" to `yield` line in `process_chunks`
    print("processed result", processed)
    print("DONE")
To see the code in action, put the FakeFtp class, the code shown above and following line:
main()
into one file and call it:
$ python headsandtails.py
('inside coroutine, processing chunk:', 0, 'aaa')
('inside coroutine, processing chunk:', 1, 'bbb')
('inside coroutine, processing chunk:', 2, 'ccc')
('inside coroutine, processing chunk:', 3, 'ddd')
Finishing up.
('processed result', ['processed(0): aaa', 'processed(1): bbb', 'processed(2): ccc', 'processed(3): ddd'])
DONE
How it works
processed = [] is here just to show that the generator process_chunks has no problem cooperating with its external context. Everything is wrapped in def main(): to prove that there is no need to use global variables.
def process_chunks() is the core of the solution. It might take one-shot input parameters (not used here), but the main point where it receives input is each yield line, which returns whatever anyone sends via .send(data) into an instance of this generator. One can call coroutine.send(chunk) directly, but in this example it is done by passing coroutine.send as the callback.
Note that in a real solution there is no problem having multiple yields in the code; they are processed one by one. This might be used e.g. to read (and ignore) the header of a CSV file and then continue processing the records with data.
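As a small illustration of the multiple-yield idea, here is a sketch written for Python 3, where next(coroutine) replaces coroutine.next(). The function name process_csv_lines and the sample lines are invented for this example. The first yield receives the CSV header and the later ones receive the data records:

```python
def process_csv_lines(rows):
    """Coroutine: the first value sent is the header, the rest are records."""
    header = yield  # first send(): the header line
    columns = header.split(",")
    while True:
        line = yield  # every following send(): one data record
        rows.append(dict(zip(columns, line.split(","))))


rows = []
coroutine = process_csv_lines(rows)
next(coroutine)  # advance to the first yield
for line in ["name,qty", "bolt,3", "nut,5"]:
    coroutine.send(line)  # could equally be passed somewhere as a callback
coroutine.close()
print(rows)
# [{'name': 'bolt', 'qty': '3'}, {'name': 'nut', 'qty': '5'}]
```

Because the header is consumed by its own yield, the loop that handles records needs no "is this the first line?" flag.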
We could instantiate and use the generator as follows:
coroutine = process_chunks()
# Get the coroutine to the first yield
coroutine.next()
ftp = FakeFtp()
# next line repeatedly calls `coroutine.send(data)`
ftp.retrbinary("RETR binary", cb=coroutine.send)
# each callback "jumps" to `yield` line in `process_chunks`
# close the coroutine (will throw the `GeneratorExit` exception into the
# `process_chunks` coroutine).
coroutine.close()
The real code uses the contextlib closing context manager to ensure that coroutine.close() is always called.
Conclusions
This solution does not provide the kind of iterator from which data is consumed in the traditional style, "from outside". On the other hand, we are able to:
use the generator "from inside"
keep all iterative processing within one function without being interrupted between callbacks
optionally use external context
provide usable results to outside
all this can be done without using threading
Credits: The solution is heavily inspired by SO answer Python FTP “chunk” iterator (without loading entire file into memory)
written by user2357112
Concept: use a blocking queue with maxsize=1 and a producer/consumer model.
The callback produces a value; the next call to the callback then blocks on the full queue.
The consumer yields the value from the queue, tries to get another value, and blocks on the read.
The producer is then allowed to push to the queue; rinse and repeat.
Usage:
def dummy(func, arg, callback=None):
    for i in range(100):
        callback(func(arg+i))
# Dummy example:
for i in Iteratorize(dummy, lambda x: x+1, 0):
    print(i)
# example with scipy:
for i in Iteratorize(scipy.optimize.fmin, func, x0):
    print(i)
Can be used as expected for an iterator:
for i in take(5, Iteratorize(dummy, lambda x: x+1, 0)):
    print(i)
Iteratorize class:
from thread import start_new_thread
from Queue import Queue
class Iteratorize:
    """
    Transforms a function that takes a callback
    into a lazy iterator (generator).
    """
    def __init__(self, func, ifunc, arg, callback=None):
        self.mfunc=func
        self.ifunc=ifunc
        self.c_callback=callback
        self.q = Queue(maxsize=1)
        self.stored_arg=arg
        self.sentinel = object()
        def _callback(val):
            self.q.put(val)
        def gentask():
            ret = self.mfunc(self.ifunc, self.stored_arg, callback=_callback)
            self.q.put(self.sentinel)
            if self.c_callback:
                self.c_callback(ret)
        start_new_thread(gentask, ())
    def __iter__(self):
        return self
    def next(self):
        obj = self.q.get(True,None)
        if obj is self.sentinel:
            raise StopIteration
        else:
            return obj
Can probably do with some cleaning up to accept *args and **kwargs for the function being wrapped and/or the final result callback.
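One possible shape for that clean-up, as a sketch only and in Python 3 syntax (queue and threading instead of Queue and thread). It assumes the wrapped function accepts its callback through a keyword argument named callback; the result attribute is my own addition for exposing the wrapped function's return value in place of the final result callback.

```python
import queue
import threading


class Iteratorize:
    """Turn func(*args, callback=..., **kwargs) into a lazy iterator."""

    def __init__(self, func, *args, **kwargs):
        self.q = queue.Queue(maxsize=1)
        self.sentinel = object()
        self.result = None              # return value of func, set when it finishes

        def task():
            try:
                # the producer blocks in q.put until the consumer takes the item
                self.result = func(*args, callback=self.q.put, **kwargs)
            finally:
                # always signal the end, even if func raised
                self.q.put(self.sentinel)

        threading.Thread(target=task, daemon=True).start()

    def __iter__(self):
        return self

    def __next__(self):
        obj = self.q.get()
        if obj is self.sentinel:
            raise StopIteration
        return obj


def dummy(func, arg, callback=None):
    for i in range(5):
        callback(func(arg + i))
    return "done"


it = Iteratorize(dummy, lambda x: x + 1, 0)
values = list(it)
print(values, it.result)
```

Two limitations of this sketch: an exception raised inside the wrapped function only ends the iteration and is not re-raised in the consumer, and calling next() again after StopIteration would block, because nothing more is put on the queue.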
How about
data = []
scipy.optimize.fmin(func,x0,callback=data.append)
for line in data:
    print line
If not, what exactly do you want to do with the generator's data?
A variant of Frits' answer, that:
Supports send to choose a return value for the callback
Supports throw to choose an exception for the callback
Supports close to gracefully shut down
Does not compute a queue item until it is requested
The complete code with tests can be found on github
import queue
import threading
import collections.abc
class generator_from_callback(collections.abc.Generator):
    def __init__(self, expr):
        """
        expr: a function that takes a callback
        """
        self._expr = expr
        self._done = False
        self._ready_queue = queue.Queue(1)
        self._done_queue = queue.Queue(1)
        self._done_holder = [False]
        # local to avoid reference cycles
        ready_queue = self._ready_queue
        done_queue = self._done_queue
        done_holder = self._done_holder
        def callback(value):
            done_queue.put((False, value))
            cmd, *args = ready_queue.get()
            if cmd == 'close':
                raise GeneratorExit
            elif cmd == 'send':
                return args[0]
            elif cmd == 'throw':
                raise args[0]
        def thread_func():
            try:
                cmd, *args = ready_queue.get()
                if cmd == 'close':
                    raise GeneratorExit
                elif cmd == 'send':
                    if args[0] is not None:
                        raise TypeError("can't send non-None value to a just-started generator")
                elif cmd == 'throw':
                    raise args[0]
                ret = expr(callback)
                raise StopIteration(ret)
            except BaseException as e:
                done_holder[0] = True
                done_queue.put((True, e))
        self._thread = threading.Thread(target=thread_func)
        self._thread.start()
    def __next__(self):
        return self.send(None)
    def send(self, value):
        if self._done_holder[0]:
            raise StopIteration
        self._ready_queue.put(('send', value))
        is_exception, val = self._done_queue.get()
        if is_exception:
            raise val
        else:
            return val
    def throw(self, exc):
        if self._done_holder[0]:
            raise StopIteration
        self._ready_queue.put(('throw', exc))
        is_exception, val = self._done_queue.get()
        if is_exception:
            raise val
        else:
            return val
    def close(self):
        if not self._done_holder[0]:
            self._ready_queue.put(('close',))
        self._thread.join()
    def __del__(self):
        self.close()
Which works as:
In [3]: def callback(f):
   ...:     ret = f(1)
   ...:     print("gave 1, got {}".format(ret))
   ...:     f(2)
   ...:     print("gave 2")
   ...:     f(3)
   ...:
In [4]: i = generator_from_callback(callback)
In [5]: next(i)
Out[5]: 1
In [6]: i.send(4)
gave 1, got 4
Out[6]: 2
In [7]: next(i)
gave 2, got None
Out[7]: 3
In [8]: next(i)
StopIteration
For scipy.optimize.fmin, you would use generator_from_callback(lambda c: scipy.optimize.fmin(func, x0, callback=c))
Solution to handle non-blocking callbacks
The solution using threading and queue is pretty good: it is fast and cross-platform, and probably the best one.
Here I provide a not-too-bad alternative, mainly for handling non-blocking callbacks, e.g. callbacks invoked from the parent function through threading.Thread(target=callback).start() or in other non-blocking ways.
import pickle
import select
import subprocess
def my_fmin(func, x0):
    # open a process to use as a pipeline
    proc = subprocess.Popen(['cat'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    def my_callback(x):
        # x might be any object, not only str, so we use pickle to dump it
        proc.stdin.write(pickle.dumps(x).replace(b'\n', b'\\n') + b'\n')
        proc.stdin.flush()
    from scipy import optimize
    optimize.fmin(func, x0, callback=my_callback)
    # this is meant to handle non-blocking callbacks, e.g. called somewhere
    # through `threading.Thread(target=callback).start()`
    while select.select([proc.stdout], [], [], 0)[0]:
        yield pickle.loads(proc.stdout.readline()[:-1].replace(b'\\n', b'\n'))
    # close the process
    proc.communicate()
Then you can use the function like this:
# unfortunately, `scipy.optimize.fmin`'s callback is blocking.
# so this example is just for showing how-to.
for x in my_fmin(lambda x: x**2, 3):
    print(x)
Although this solution seems quite simple and readable, it is not as fast as the threading and queue solution, because:
Processes are much heavier than threads.
Passing data through a pipe instead of memory is much slower.
Besides, it doesn't work on Windows, because the select module on Windows can only handle sockets, not pipes and other file descriptors.
For a super simple approach...
def callback_to_generator():
    data = []
    method_with_callback(blah, foo, callback=data.append)
    for item in data:
        yield item
Yes, this isn't good for large data
Yes, this blocks on all items being processed first
But it still might be useful for some use cases :)
Also thanks to #winston-ewert as this is just a small variant on his answer :)