Turn functions with a callback into Python generators?

The SciPy minimization function (just to use as an example) has the option of adding a callback function that is called at each step. So I can do something like,
def my_callback(x):
    print x

scipy.optimize.fmin(func, x0, callback=my_callback)
Is there a way to use the callback function to create a generator version of fmin, so that I could do,
for x in my_fmin(func, x0):
    print x
It seems like it might be possible with some combination of yields and sends, but I can't quite think of anything.

As pointed out in the comments, you could do it in a new thread, using a Queue. The drawback is that you'd still need some way to access the final result (what fmin returns at the end). My example below uses an optional callback to do something with it (another option would be to just yield it too, though your calling code would have to differentiate between iteration results and final results):
from thread import start_new_thread
from Queue import Queue

import scipy.optimize

def my_fmin(func, x0, end_callback=(lambda x: x), timeout=None):
    q = Queue()          # fmin produces, the generator consumes
    job_done = object()  # signals the processing is done

    # Producer
    def my_callback(x):
        q.put(x)

    def task():
        ret = scipy.optimize.fmin(func, x0, callback=my_callback)
        q.put(job_done)
        end_callback(ret)  # "Returns" the result of the main call

    # Starts fmin in a new thread
    start_new_thread(task, ())

    # Consumer
    while True:
        next_item = q.get(True, timeout)  # Blocks until an input is available
        if next_item is job_done:
            break
        yield next_item
Update: to block the execution of the next iteration until the consumer has finished processing the last one, it's also necessary to use task_done and join.
# Producer
def my_callback(x):
    q.put(x)
    q.join()  # Blocks until task_done is called

# Consumer
while True:
    next_item = q.get(True, timeout)  # Blocks until an input is available
    if next_item is job_done:
        break
    yield next_item
    q.task_done()  # Unblocks the producer, so a new iteration can start
Note that maxsize=1 is not necessary, since no new item will be added to the queue until the last one is consumed.
Update 2: Also note that, unless all items are eventually retrieved by this generator, the created thread will deadlock (it will block forever and its resources will never be released). The producer is waiting on the queue, and since it stores a reference to that queue, the queue will never be reclaimed by the gc even if the consumer is gone. The queue then becomes unreachable from any live code, so nothing will ever release the lock.
A clean solution for that is unknown, if possible at all (since it would depend on the particular function used in the place of fmin). A workaround could be made using timeout, having the producer raise an exception if put blocks for too long:
q = Queue(maxsize=1)

# Producer
def my_callback(x):
    q.put(x)
    q.put("dummy", True, timeout)  # Blocks until the first result is retrieved
    q.join()                       # Blocks again until task_done is called

# Consumer
while True:
    next_item = q.get(True, timeout)  # Blocks until an input is available
    q.task_done()                     # (one "task_done" per "get")
    if next_item is job_done:
        break
    yield next_item
    q.get()        # Retrieves the "dummy" object (must be after yield)
    q.task_done()  # Unblocks the producer, so a new iteration can start
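For reference, a minimal usage sketch of my_fmin as defined above (func, report, and the starting point are illustrative; SciPy is assumed to be installed):
import scipy.optimize

def func(x):
    return (x[0] - 3.0) ** 2

def report(ret):
    print("fmin returned: {0}".format(ret))   # runs in the producer thread

for x in my_fmin(func, [0.0], end_callback=report):
    print("intermediate point: {0}".format(x))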

Generator as coroutine (no threading)
Let's have a FakeFtp class whose retrbinary method calls a callback with each successfully read chunk of data:
class FakeFtp(object):
    def __init__(self):
        self.data = iter(["aaa", "bbb", "ccc", "ddd"])

    def login(self, user, password):
        self.user = user
        self.password = password

    def retrbinary(self, cmd, cb):
        for chunk in self.data:
            cb(chunk)
Using a simple callback function has the disadvantage that it is called repeatedly, and the callback function cannot easily keep context between calls.
The following code defines a process_chunks generator, which is able to receive chunks of data one by one and process them. In contrast to a simple callback, here we are able to keep all the processing within one function without losing context.
from contextlib import closing
from itertools import count

def main():
    processed = []

    def process_chunks():
        for i in count():
            try:
                # (repeatedly) get the chunk to process
                chunk = yield
            except GeneratorExit:
                # finish_up
                print("Finishing up.")
                return
            else:
                # Here process the chunk as you like
                print("inside coroutine, processing chunk:", i, chunk)
                product = "processed({i}): {chunk}".format(i=i, chunk=chunk)
                processed.append(product)

    with closing(process_chunks()) as coroutine:
        # Get the coroutine to the first yield
        coroutine.next()
        ftp = FakeFtp()
        # next line repeatedly calls `coroutine.send(data)`
        ftp.retrbinary("RETR binary", cb=coroutine.send)
        # each callback "jumps" to `yield` line in `process_chunks`

    print("processed result", processed)
    print("DONE")
To see the code in action, put the FakeFtp class, the code shown above and following line:
main()
into one file and call it:
$ python headsandtails.py
('inside coroutine, processing chunk:', 0, 'aaa')
('inside coroutine, processing chunk:', 1, 'bbb')
('inside coroutine, processing chunk:', 2, 'ccc')
('inside coroutine, processing chunk:', 3, 'ddd')
Finishing up.
('processed result', ['processed(0): aaa', 'processed(1): bbb', 'processed(2): ccc', 'processed(3): ddd'])
DONE
How it works
processed = [] is here just to show that the process_chunks generator has no problem cooperating with its external context. Everything is wrapped in def main(): to prove that there is no need to use global variables.
def process_chunks() is the core of the solution. It could take one-shot input parameters (not used here), but the main point is where it receives input: each yield line returns whatever is sent via .send(data) into this generator instance. One could call coroutine.send(chunk) directly, but in this example it is done via a callback that refers to coroutine.send.
Note that in a real solution there is no problem with having multiple yields in the code; they are processed one by one. This can be used, e.g., to read (and ignore) the header of a CSV file and then continue processing the data records, as sketched below.
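As an illustration of that, a hedged sketch assuming the callback delivers one CSV line per chunk:
def process_csv_lines():
    header = yield                       # first send(): the header line, read and ignored
    print("skipping header:", header)
    while True:
        try:
            line = yield                 # every further send(): one data record
        except GeneratorExit:
            print("Finishing up.")
            return
        print("record:", line.split(","))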
We could instantiate and use the generator as follows:
coroutine = process_chunks()
# Get the coroutine to the first yield
coroutine.next()
ftp = FakeFtp()
# next line repeatedly calls `coroutine.send(data)`
ftp.retrbinary("RETR binary", cb=coroutine.send)
# each callback "jumps" to `yield` line in `process_chunks`
# close the coroutine (will throw the `GeneratorExit` exception into the
# `process_chunks` coroutine).
coroutine.close()
The real code uses the contextlib.closing context manager to ensure that coroutine.close() is always called.
Conclusions
This solution does not provide the sort of iterator that consumes data in the traditional style, "from outside". On the other hand, we are able to:
use the generator "from inside"
keep all iterative processing within one function without being interrupted between callbacks
optionally use external context
provide usable results to outside
all this can be done without using threading
Credits: The solution is heavily inspired by SO answer Python FTP “chunk” iterator (without loading entire file into memory)
written by user2357112

Concept: Use a blocking queue with maxsize=1 and a producer/consumer model.
The callback produces; the next call to the callback will then block on the full queue.
The consumer yields the value from the queue, tries to get another value, and blocks on the read.
The producer is then allowed to push to the queue; rinse and repeat.
Usage:
def dummy(func, arg, callback=None):
    for i in range(100):
        callback(func(arg + i))

# Dummy example:
for i in Iteratorize(dummy, lambda x: x + 1, 0):
    print(i)

# example with scipy:
for i in Iteratorize(scipy.optimize.fmin, func, x0):
    print(i)
Can be used as expected for an iterator:
for i in take(5, Iteratorize(dummy, lambda x: x + 1, 0)):
    print(i)
Iteratorize class:
from thread import start_new_thread
from Queue import Queue

class Iteratorize:
    """
    Transforms a function that takes a callback
    into a lazy iterator (generator).
    """
    def __init__(self, func, ifunc, arg, callback=None):
        self.mfunc = func
        self.ifunc = ifunc
        self.c_callback = callback
        self.q = Queue(maxsize=1)
        self.stored_arg = arg
        self.sentinel = object()

        def _callback(val):
            self.q.put(val)

        def gentask():
            ret = self.mfunc(self.ifunc, self.stored_arg, callback=_callback)
            self.q.put(self.sentinel)
            if self.c_callback:
                self.c_callback(ret)

        start_new_thread(gentask, ())

    def __iter__(self):
        return self

    def next(self):
        obj = self.q.get(True, None)
        if obj is self.sentinel:
            raise StopIteration
        else:
            return obj
Can probably do with some cleaning up to accept *args and **kwargs for the function being wrapped and/or the final result callback.
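A hedged sketch of what that cleanup could look like (the name Iteratorize2 and the final_callback keyword are illustrative assumptions, not part of the original):
from thread import start_new_thread
from Queue import Queue

class Iteratorize2:
    """Like Iteratorize above, but forwards arbitrary *args/**kwargs
    to the wrapped function (a sketch, not the original code)."""
    def __init__(self, func, *args, **kwargs):
        self.final_callback = kwargs.pop('final_callback', None)  # optional result callback
        self.q = Queue(maxsize=1)
        self.sentinel = object()

        def _callback(val):
            self.q.put(val)

        def gentask():
            kwargs['callback'] = _callback  # assumes the wrapped function takes `callback=`
            ret = func(*args, **kwargs)
            self.q.put(self.sentinel)
            if self.final_callback:
                self.final_callback(ret)

        start_new_thread(gentask, ())

    def __iter__(self):
        return self

    def next(self):
        obj = self.q.get(True, None)
        if obj is self.sentinel:
            raise StopIteration
        return obj
For example, Iteratorize2(scipy.optimize.fmin, func, x0, xtol=1e-6) would forward xtol straight to fmin.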

How about
data = []
scipy.optimize.fmin(func, x0, callback=data.append)
for line in data:
    print line
If not, what exactly do you want to do with the generator's data?

A variant of Frits' answer, that:
Supports send to choose a return value for the callback
Supports throw to choose an exception for the callback
Supports close to gracefully shut down
Does not compute a queue item until it is requested
The complete code with tests can be found on github
import queue
import threading
import collections.abc

class generator_from_callback(collections.abc.Generator):
    def __init__(self, expr):
        """
        expr: a function that takes a callback
        """
        self._expr = expr
        self._done = False
        self._ready_queue = queue.Queue(1)
        self._done_queue = queue.Queue(1)
        self._done_holder = [False]

        # local to avoid reference cycles
        ready_queue = self._ready_queue
        done_queue = self._done_queue
        done_holder = self._done_holder

        def callback(value):
            done_queue.put((False, value))
            cmd, *args = ready_queue.get()
            if cmd == 'close':
                raise GeneratorExit
            elif cmd == 'send':
                return args[0]
            elif cmd == 'throw':
                raise args[0]

        def thread_func():
            try:
                cmd, *args = ready_queue.get()
                if cmd == 'close':
                    raise GeneratorExit
                elif cmd == 'send':
                    if args[0] is not None:
                        raise TypeError("can't send non-None value to a just-started generator")
                elif cmd == 'throw':
                    raise args[0]
                ret = expr(callback)
                raise StopIteration(ret)
            except BaseException as e:
                done_holder[0] = True
                done_queue.put((True, e))

        self._thread = threading.Thread(target=thread_func)
        self._thread.start()

    def __next__(self):
        return self.send(None)

    def send(self, value):
        if self._done_holder[0]:
            raise StopIteration
        self._ready_queue.put(('send', value))
        is_exception, val = self._done_queue.get()
        if is_exception:
            raise val
        else:
            return val

    def throw(self, exc):
        if self._done_holder[0]:
            raise StopIteration
        self._ready_queue.put(('throw', exc))
        is_exception, val = self._done_queue.get()
        if is_exception:
            raise val
        else:
            return val

    def close(self):
        if not self._done_holder[0]:
            self._ready_queue.put(('close',))
        self._thread.join()

    def __del__(self):
        self.close()
Which works as:
In [3]: def callback(f):
   ...:     ret = f(1)
   ...:     print("gave 1, got {}".format(ret))
   ...:     f(2)
   ...:     print("gave 2")
   ...:     f(3)
   ...:
In [4]: i = generator_from_callback(callback)
In [5]: next(i)
Out[5]: 1
In [6]: i.send(4)
gave 1, got 4
Out[6]: 2
In [7]: next(i)
gave 2, got None
Out[7]: 3
In [8]: next(i)
StopIteration
For scipy.optimize.fmin, you would use generator_from_callback(lambda c: scipy.optimize.fmin(func, x0, callback=c))
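A minimal end-to-end sketch of that usage (func and x0 are illustrative; SciPy is assumed to be installed):
import scipy.optimize

def func(x):
    return (x[0] - 3.0) ** 2

x0 = [0.0]

gen = generator_from_callback(lambda c: scipy.optimize.fmin(func, x0, callback=c))
for x in gen:
    print(x)   # one intermediate parameter vector per fmin iteration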

Solution to handle non-blocking callbacks
The solution using threading and queue is pretty good: high-performance, cross-platform, and probably the best one.
Here I provide a not-too-bad solution, which is mainly for handling non-blocking callbacks, e.g. ones called from the parent function through threading.Thread(target=callback).start(), or in other non-blocking ways.
import pickle
import select
import subprocess

def my_fmin(func, x0):
    # open a process to use as a pipeline
    proc = subprocess.Popen(['cat'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def my_callback(x):
        # x might be any object, not only str, so we use pickle to dump it
        proc.stdin.write(pickle.dumps(x).replace(b'\n', b'\\n') + b'\n')
        proc.stdin.flush()

    from scipy import optimize
    optimize.fmin(func, x0, callback=my_callback)

    # this is meant to handle non-blocking callbacks, e.g. called somewhere
    # through `threading.Thread(target=callback).start()`
    while select.select([proc.stdout], [], [], 0)[0]:
        yield pickle.loads(proc.stdout.readline()[:-1].replace(b'\\n', b'\n'))

    # close the process
    proc.communicate()
Then you can use the function like this:
# unfortunately, `scipy.optimize.fmin`'s callback is blocking.
# so this example is just for showing how-to.
for x in my_fmin(lambda x: x**2, 3):
    print(x)
Although this solution seems quite simple and readable, it's not as high-performance as the threading-and-queue solution, because:
Processes are much heavier than threads.
Passing data through a pipe instead of memory is much slower.
Besides, it doesn't work on Windows, because the select module on Windows can only handle sockets, not pipes and other file descriptors.

For a super simple approach...
def callback_to_generator():
    data = []
    method_with_callback(blah, foo, callback=data.append)
    for item in data:
        yield item
Yes, this isn't good for large data
Yes, this blocks on all items being processed first
But it still might be useful for some use cases :)
Also thanks to @winston-ewert as this is just a small variant on his answer :)

Related

Implementing "competing" processes in python

I'm trying to implement a function that takes 2 functions as arguments, runs both, returns the value of the function that returns first and kills the slower function before it finishes its execution.
My problem is that when I try to empty the Queue object I use to collect the return values, I get stuck.
Is there a more 'correct' way to handle this scenario or even an existing module? If not, can anyone explain what I'm doing wrong?
Here is my code (the implementation of the above function is 'run_both()'):
import multiprocessing as mp
from time import sleep

Q = mp.Queue()

def dump_queue(queue):
    result = []
    for i in iter(queue.get, 'STOP'):
        result.append(i)
    return result

def rabbit(x):
    sleep(10)
    Q.put(x)

def turtle(x):
    sleep(30)
    Q.put(x)

def run_both(a, b):
    a.start()
    b.start()
    while a.is_alive() and b.is_alive():
        sleep(1)
    if a.is_alive():
        a.terminate()
    else:
        b.terminate()
    a.join()
    b.join()
    return dump_queue(Q)

p1 = mp.Process(target=rabbit, args=(1,))
p1 = mp.Process(target=turtle, args=(2,))
run_both(p1, p2)
Here's an example to call 2 or more functions with multiprocessing and return the fastest result. There are a few important things to note however.
Running multiprocessing code in IDLE sometimes causes problems. This example works, but I did run into that issue trying to solve this.
Multiprocessing code should start from inside an if __name__ == '__main__' clause, or else it will be run again if the main module is re-imported by another process. Read the multiprocessing doc page for more info.
The result queue is passed directly to each process that uses it. When you use the queue by referencing a global name in the module, the code fails on Windows because a new instance of the queue is used by each process. Read more here: Multiprocessing Queue.get() hangs
I have also added a bit of a feature here to know which process' result was actually used.
import multiprocessing as mp
import time
import random

def task(value):
    # our dummy task is to sleep for a random amount of time and
    # return the given arg value
    time.sleep(random.random())
    return value

def process(q, idx, fn, args):
    # simply call function fn with args, and push its result in the queue with its index
    q.put([fn(*args), idx])

def fastest(calls):
    queue = mp.Queue()
    # we must pass the queue directly to each process that may use it
    # or else on Windows, each process will have its own copy of the queue
    # making it useless
    procs = []

    # create a 'mp.Process' that calls our 'process' for each call and start it
    for idx, call in enumerate(calls):
        fn = call[0]
        args = call[1:]
        p = mp.Process(target=process, args=(queue, idx, fn, args))
        procs.append(p)
        p.start()

    # wait for the queue to have something
    result, idx = queue.get()

    for proc in procs:  # kill all processes that may still be running
        proc.terminate()
        # proc may be using queue, so queue may be corrupted.
        # https://docs.python.org/3.8/library/multiprocessing.html?highlight=queue#multiprocessing.Process.terminate
        # we no longer need queue though so this is fine

    return result, idx

if __name__ == '__main__':
    from datetime import datetime

    start = datetime.now()
    print(start)

    # to be compatible with 'fastest', each call is a list with the first
    # element being callable, followed by args to be passed
    calls = [
        [task, 1],
        [task, 'hello'],
        [task, [1, 2, 3]]
    ]

    val, idx = fastest(calls)
    end = datetime.now()
    print(end)
    print('elapsed time:', end - start)
    print('returned value:', val)
    print('from call at index', idx)
Example output:
2019-12-21 04:01:09.525575
2019-12-21 04:01:10.171891
elapsed time: 0:00:00.646316
returned value: hello
from call at index 1
Apart from the typo on the penultimate line which should read:
p2 = mp.Process(target=turtle, args=(2,)) # not p1
the simplest change you can make to get the program to work is to add:
Q.put('STOP')
to the end of turtle() and rabbit().
You also don't really need to keep looping to watch whether the processes are alive: by definition, if you read the message queue and receive STOP, one of them has finished. So you could replace run_both() with:
def run_both(a, b):
    a.start()
    b.start()
    result = dump_queue(Q)
    a.terminate()
    b.terminate()
    return result
You may also need to think about what happens if both processes put some messages in the queue at much the same time. They could get mixed up. Maybe consider using 2 queues, or joining all the results up into a single message rather than appending multiple values together from queue.get().
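For instance, a hedged sketch of the single-message idea (the queue is passed explicitly here, which also avoids the Windows pitfall mentioned in the other answer; sleep times are shortened for illustration):
import multiprocessing as mp
from time import sleep

def rabbit(x, q):
    sleep(1)
    q.put(('rabbit', x))   # one complete message per process, so results cannot interleave

def turtle(x, q):
    sleep(3)
    q.put(('turtle', x))

def run_both(a, b, q):
    a.start()
    b.start()
    winner = q.get()       # blocks until the first complete message arrives
    a.terminate()
    b.terminate()
    a.join()
    b.join()
    return winner

if __name__ == '__main__':
    q = mp.Queue()
    p1 = mp.Process(target=rabbit, args=(1, q))
    p2 = mp.Process(target=turtle, args=(2, q))
    print(run_both(p1, p2, q))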

using python multiprocessing package inside a qgis plugin code

I spent quite a bit of time looking into how to use the multiprocessing package, but couldn't find anything on how to use it inside a plugin in QGIS. I am developing a plugin that does some optimization for several elements. I would like to parallelize it.
I found a useful link on multi-threading inside a python plugin (http://snorf.net/blog/2013/12/07/multithreading-in-qgis-python-plugins/), but nothing on using the multiprocessing module, which might be easier?
I have been trying with a very basic example. I am only showing the run function from the plugin here:
def run(self):
    """Run method that performs all the real work"""
    # show the dialog
    self.dlg.show()
    # Run the dialog event loop
    result = self.dlg.exec_()
    # See if OK was pressed and run code
    if result:
        # Get number of cores
        nProcs = mp.cpu_count()
        # Start a Process
        p = mp.Pool(nProcs)
        # Define function
        def cube(x):
            return x**3
        # Run parallel
        results = p.map(cube, range(1, 7))
When I run this code from the plugin in QGIS, it opens several QGIS windows, which then return errors (can't load layers, etc.). What am I missing? Do I need to start a worker first on another thread and then use multiprocessing there? Or would we use another function from multiprocessing?
Please let me know if the question needs edits. I am working under Windows 7, using QGIS 2.10.
Thanks,
UPDATE
I created a worker class to implement the function and sent it to a new thread, but I get the same problem when I use multiprocessing in that thread.
The class I created is as follows:
class Worker(QObject):
    '''Example worker'''
    def __init__(self, result_queue, f, attr=[], repet=None, nbCores=None):
        QObject.__init__(self)
        if not hasattr(f, '__call__'):
            # Check if not a function
            raise TypeError('Worker expected a function as second argument')
        if not isinstance(attr, list) and not repet == None:
            # Check if not a list if there is a repet command
            raise TypeError('Input problem:\nThe arguments for the function should be in a list if repet is provided')
        if not all(isinstance(elem, list) for elem in attr) and repet == None and len(inspect.getargspec(f).args) > 1:
            # Check if not a list of lists if there isn't a repet command
            raise TypeError('Input problem:\nThe arguments for the function should be a list of lists if repet is not provided')
        if not repet == None and (not isinstance(repet, int) or repet == 0):
            # Check that provided an integer greater than 0
            raise TypeError('If provided, repet should be None or a strictly positive integer')

        self.result_queue = result_queue
        self.f = f
        self.attr = attr
        self.repet = repet
        self.nbCores = nbCores
        if self.nbCores == None:
            self.nbCores = mp.cpu_count() - 1

    def fStar(self, arg):
        """Convert the function to taking a list as arguments"""
        return self.f(*arg)

    def run(self):
        ret = None
        try:
            if self.repet == 1:
                # estimates the function based on provided arguments
                ret = self.f(*self.attr)  # The star unpacks the list into attributes
            else:
                pool = mp.Pool(processes=self.nbCores)
                if self.repet > 1:
                    ret = pool.map(self.fStar, itools.repeat(self.attr, self.repet))
                elif self.repet == None:
                    ret = pool.map(self.fStar, self.attr)
                pool.close()
                pool.join()
        except Exception, e:
            # I can't pass an exception, it makes qgis bug
            pass
        self.result_queue.put(ret)  # Pass the result to the queue

    finished = pyqtSignal(object)
    error = pyqtSignal(Exception, basestring)
I start the worker and send it to a new thread using the following function:
def startWorker(f, attr, repet=None, nbCores=None):
    # Create a result queue
    result_queue = queue.Queue()
    # create a new worker instance
    worker = Worker(result_queue, f, attr, repet, nbCores)
    # start the worker in a new thread
    thread = QThread()
    worker.moveToThread(thread)
    thread.started.connect(worker.run)
    thread.start()
    # Clean up when the thread is finished
    worker.deleteLater()
    thread.quit()
    thread.wait()
    thread.deleteLater()
    # Export the result to the queue
    res = []
    while not result_queue.empty():
        r = result_queue.get()
        if r is None:
            continue
        res.append(r)
    return res
As in my initial question, I just replaced results = p.map(cube, range(1,7)) by calling the startWorker function
Please let me know if you have any idea how to make this work. I implemented the work in multiple threads, but it would be much faster to use several cores...

Transform callbacks to generator in Python?

Let's say we have some library (e.g. for XML parsing) that accepts a callback and calls it every time it encounters some event (e.g. finds some XML tag). I'd like to be able to transform those callbacks into a generator that can be iterated via a for loop. Is that possible in Python without using threads or collecting all the callback results (i.e. with lazy evaluation)?
Example:
# this is how I can produce the items
def callback(item):
    # do something with each item
    pass

parser.parse(xml_file, callback=callback)

# this is how the items should be consumed
for item in iter_parse(xml_file):
    print(item)
I've tried to study whether coroutines could be used, but it seems that coroutines are useful for pushing data from the producer, while generators pull data to the consumer.
The natural idea was that the producer and consumer would be coroutines that ping the execution flow back and forth.
I've managed to get a producer-consumer pattern working with the asyncio loop (in a similar way to this answer). However, it cannot be used like a generator in a for loop:
import asyncio

q = asyncio.Queue(maxsize=1)

@asyncio.coroutine
def produce(data):
    for v in data:
        print("Producing:", v)
        yield from q.put(v)
    print("Producer waiting")
    yield from q.put(None)
    print("Producer done")

@asyncio.coroutine
def consume():
    while True:
        print("Consumer waiting")
        value = yield from q.get()
        print("Consumed:", value)
        if value is not None:
            # process the value
            yield from asyncio.sleep(0.5)
        else:
            break
    print("Consumer done")

tasks = [
    asyncio.Task(consume()),
    asyncio.Task(produce(data=range(5)))
]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
The problem is that the result cannot be iterated in a for loop since it is managed by the loop.
When I rewrite the code so that the callback is called from an ordinary function, the problem is that asyncio.Queue.put() called from the callback doesn't block and the computation is not lazy.
import asyncio

q = asyncio.Queue(maxsize=1)

def parse(data, callback):
    for value in data:
        # yield from q.put(value)
        callback(value)

@asyncio.coroutine
def produce(data):
    @asyncio.coroutine
    def enqueue(value):
        print('enqueue()', value)
        yield from q.put(value)

    def callback(value):
        print('callback()', value)
        asyncio.async(enqueue(value))

    parse(data, callback)
    print('produce()')
    print('produce(): enqueuing sentinel value')
    asyncio.async(enqueue(None))
    print('produce(): done')

@asyncio.coroutine
def consume():
    print('consume()')
    while True:
        print('consume(): waiting')
        value = yield from q.get()
        print('consumed:', value)
        if value is not None:
            # here we'd like to yield and use this in a for loop elsewhere
            print(value)
        else:
            break
    print('consume(): done')

tasks = [
    asyncio.Task(consume()),
    asyncio.Task(produce(range(5)))
]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

# I'd like:
# for value in iter_parse(data=range(5)):
#     print('consumed:', value)
Is this kind of computation even possible with asyncio, or do I need to use greenlet or gevent? It seems that in gevent it is possible to iterate over async results in a for loop, but I don't want to depend on another library if possible, and it is not completely ready for Python 3.
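For reference, a rough sketch of how the gevent route mentioned above could look; parse stands in for the callback-based library, and this is an assumption-laden illustration rather than a recommendation:
import gevent
from gevent.queue import Queue

def parse(data, callback):          # stand-in for the library's callback-based API
    for value in data:
        callback(value)

def iter_parse(data):
    q = Queue(maxsize=1)            # put() blocks the producer greenlet until get()
    done = object()

    def producer():
        parse(data, callback=q.put) # the callback just feeds the queue
        q.put(done)

    gevent.spawn(producer)
    while True:
        item = q.get()              # switches to the producer greenlet when empty
        if item is done:
            return
        yield item

for value in iter_parse(range(5)):
    print('consumed:', value)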

Python: 'before' and 'after' for multiprocessing workers

Update: Here is a more specific example
Suppose I want to compile some statistical data from a sizable set of files:
I can make a generator (line for line in fileinput.input(files)) and some processor:
from collections import defaultdict

scores = defaultdict(int)

def process(line):
    if 'Result' in line:
        res = line.split('\"')[1].split('-')[0]
        scores[res] += 1
The question is how to handle this when one gets to multiprocessing.Pool.
Of course it's possible to define a multiprocessing.sharedctypes structure as well as a custom struct instead of a defaultdict, but this seems rather painful. On the other hand, I can't think of a pythonic way to instantiate something before the process starts, or to return something to the main thread after the generator has run out.
So you basically create a histogram. This can easily be parallelized, because histograms can be merged without complication. One might want to say that this problem is trivially parallelizable or "embarrassingly parallel". That is, you do not need to worry about communication among workers.
Just split your data set into multiple chunks, let your workers work on these chunks independently, collect the histogram of each worker, and then merge the histograms.
In practice, this problem is best handled by letting each worker process/read its own file. That is, a "task" could be a file name. You should not start pickling file contents and sending them around between processes through pipes. Let each worker process retrieve the bulk data directly from files. Otherwise your architecture spends too much time on inter-process communication instead of doing real work.
Do you need an example or can you figure this out yourself?
Edit: example implementation
I have a number of data files with file names in this format: data0.txt, data1.txt, ... .
Example contents:
wolf
wolf
cat
blume
eisenbahn
The goal is to create a histogram over the words contained in the data files. This is the code:
from multiprocessing import Pool
from collections import Counter
import glob

def build_histogram(filepath):
    """This function is run by a worker process.
    The `filepath` argument is communicated to the worker
    through a pipe. The return value of this function is
    communicated to the manager through a pipe.
    """
    hist = Counter()
    with open(filepath) as f:
        for line in f:
            hist[line.strip()] += 1
    return hist

def main():
    """This function runs in the manager (main) process."""
    # Collect paths to data files.
    datafile_paths = glob.glob("data*.txt")

    # Create a pool of worker processes and distribute work.
    # The input to worker processes (function argument) as well
    # as the output by worker processes is transmitted through
    # pipes, behind the scenes.
    pool = Pool(processes=3)
    histograms = pool.map(build_histogram, datafile_paths)

    # Properly shut down the pool of worker processes, and
    # wait until all of them have finished.
    pool.close()
    pool.join()

    # Merge sub-histograms. Do not create too many intermediate
    # objects: update the first sub-histogram with the others.
    # Relevant docs: collections.Counter.update
    merged_hist = histograms[0]
    for h in histograms[1:]:
        merged_hist.update(h)

    for word, count in merged_hist.items():
        print "%s: %s" % (word, count)

if __name__ == "__main__":
    main()
Test output:
python countwords.py
eisenbahn: 12
auto: 6
cat: 1
katze: 10
stadt: 1
wolf: 3
zug: 4
blume: 5
herbert: 14
destruction: 4
I had to modify the original pool.py (the trouble was that worker is defined as a plain function, with no inheritance to hook into) to get what I want, but it's not so bad, and probably better than writing a new pool entirely.
class worker(object):
    def __init__(self, inqueue, outqueue, initializer=None, initargs=(), maxtasks=None,
                 wrap_exception=False, finalizer=None, finargs=()):
        assert maxtasks is None or (type(maxtasks) == int and maxtasks > 0)
        put = outqueue.put
        get = inqueue.get
        self.completed = 0
        if hasattr(inqueue, '_writer'):
            inqueue._writer.close()
            outqueue._reader.close()
        if initializer is not None:
            initializer(self, *initargs)

        def run(self):
            while maxtasks is None or (maxtasks and self.completed < maxtasks):
                try:
                    task = get()
                except (EOFError, OSError):
                    util.debug('worker got EOFError or OSError -- exiting')
                    break
                if task is None:
                    util.debug('worker got sentinel -- exiting')
                    break
                job, i, func, args, kwds = task
                try:
                    result = (True, func(*args, **kwds))
                except Exception as e:
                    if wrap_exception:
                        e = ExceptionWithTraceback(e, e.__traceback__)
                    result = (False, e)
                try:
                    put((job, i, result))
                except Exception as e:
                    wrapped = MaybeEncodingError(e, result[1])
                    util.debug("Possible encoding error while sending result: %s" % (
                        wrapped))
                    put((job, i, (False, wrapped)))
                self.completed += 1
            if finalizer:
                finalizer(self, *finargs)
            util.debug('worker exiting after %d tasks' % self.completed)

        run(self)

Python generator pre-fetch?

I have a generator that takes a long time for each iteration to run. Is there a standard way to have it yield a value, then generate the next value while waiting to be called again?
The generator would be called each time a button is pressed in a gui and the user would be expected to consider the result after each button press.
EDIT: a workaround might be:
def initialize():
    res = next(gen)

def btn_callback():
    display(res)
    res = next(gen)
    if not res:
        return
If I wanted to do something like your workaround, I'd write a class like this:
class PrefetchedGenerator(object):
    def __init__(self, generator):
        self._data = generator.next()
        self._generator = generator
        self._ready = True

    def next(self):
        if not self._ready:
            self.prefetch()
        self._ready = False
        return self._data

    def prefetch(self):
        if not self._ready:
            self._data = self._generator.next()
            self._ready = True
It is more complicated than your version, because I made it so that it handles not calling prefetch or calling prefetch too many times. The basic idea is that you call .next() when you want the next item. You call prefetch when you have "time" to kill.
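A minimal usage sketch (Python 2, matching the class above; slow_gen is an illustrative stand-in for the expensive generator):
import time

def slow_gen():
    for i in range(3):
        time.sleep(2)       # stand-in for the expensive per-item work
        yield i

gen = PrefetchedGenerator(slow_gen())
print(gen.next())           # returns the item fetched during construction
gen.prefetch()              # call this while the application has time to kill
print(gen.next())           # returns immediately, since prefetch() already ran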
Your other option is a thread:
import threading
import Queue

class BackgroundGenerator(threading.Thread):
    def __init__(self, generator):
        threading.Thread.__init__(self)
        self.queue = Queue.Queue(1)
        self.generator = generator
        self.daemon = True
        self.start()

    def run(self):
        for item in self.generator:
            self.queue.put(item)
        self.queue.put(None)

    def next(self):
        next_item = self.queue.get()
        if next_item is None:
            raise StopIteration
        return next_item
This will run separately from your main application. Your GUI should remain responsive no matter how long it takes to fetch each iteration.
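A minimal usage sketch (Python 2, matching the class above; slow_gen is an illustrative stand-in):
import time

def slow_gen():
    for i in range(3):
        time.sleep(2)       # stand-in for the expensive per-item work
        yield i * i

bg = BackgroundGenerator(slow_gen())
time.sleep(5)               # e.g. the user is busy; items keep being produced meanwhile
print(bg.next())            # 0, already waiting in the queue
print(bg.next())            # 1, was computed in the background in the meantime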
No. A generator is not asynchronous. This isn't multiprocessing.
If you want to avoid waiting for the calculation, you should use the multiprocessing package so that an independent process can do your expensive calculation.
You want a separate process which is calculating and enqueueing results.
Your "generator" can then simply dequeue the available results.
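A hedged sketch of that separate-process idea (expensive_step and the other names are illustrative, not from the answer):
import multiprocessing as mp
import time

def expensive_step(i):
    time.sleep(1)              # stand-in for the long calculation
    return i * i

def producer(q):
    for i in range(5):
        q.put(expensive_step(i))
    q.put(None)                # sentinel: no more results

def results(q):
    """Generator that simply dequeues whatever the worker process has produced."""
    while True:
        item = q.get()
        if item is None:
            return
        yield item

if __name__ == '__main__':
    q = mp.Queue(1)            # keep at most one precomputed result ahead
    p = mp.Process(target=producer, args=(q,))
    p.start()
    for value in results(q):
        print(value)
    p.join()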
You can definitely do this with generators: just create your generator so that each call to next alternates between computing the next value and returning it, by putting in multiple yield statements. Here is an example:
import itertools, time

def quick_gen():
    counter = itertools.count().next
    def long_running_func():
        time.sleep(2)
        return counter()
    while True:
        x = long_running_func()
        yield
        yield x
>>> itr = quick_gen()
>>> itr.next() # setup call, takes two seconds
>>> itr.next() # returns immediately
0
>>> itr.next() # setup call, takes two seconds
>>> itr.next() # returns immediately
1
Note that the generator does not automatically do the processing to get the next value; it is up to the caller to call next twice for each value. For your use case you would call next once as a setup, and then each time the user clicks the button you would display the next value generated, then call next again for the pre-fetch.
I was after something similar. I wanted yield to quickly return a value (if it could) while a background thread processed the next one.
import Queue
import time
import threading

class MyGen():
    def __init__(self):
        self.queue = Queue.Queue()
        # Put a first element into the queue, and initialize our thread
        self.i = 1
        self.t = threading.Thread(target=self.worker, args=(self.queue, self.i))
        self.t.start()

    def __iter__(self):
        return self

    def worker(self, queue, i):
        time.sleep(1)  # Take a while to process
        queue.put(i**2)

    def __del__(self):
        self.stop()

    def stop(self):
        while True:  # Flush the queue
            try:
                self.queue.get(False)
            except Queue.Empty:
                break
        self.t.join()

    def next(self):
        # Start a thread to compute the next next.
        self.t.join()
        self.i += 1
        self.t = threading.Thread(target=self.worker, args=(self.queue, self.i))
        self.t.start()

        # Now deliver the already-queued element
        while True:
            try:
                print "request at", time.time()
                obj = self.queue.get(False)
                self.queue.task_done()
                return obj
            except Queue.Empty:
                pass
            time.sleep(.001)

if __name__ == '__main__':
    f = MyGen()
    for i in range(5):
        # time.sleep(2)  # Comment out to get items as they are ready
        print "*********"
        print f.next()
        print "returned at", time.time()
The code above gave the following results:
*********
request at 1342462505.96
1
returned at 1342462505.96
*********
request at 1342462506.96
4
returned at 1342462506.96
*********
request at 1342462507.96
9
returned at 1342462507.96
*********
request at 1342462508.96
16
returned at 1342462508.96
*********
request at 1342462509.96
25
returned at 1342462509.96
