Update: Here is a more specific example
Suppose I want to compile some statistical data from a sizable set of files:
I can make a generator (line for line in fileinput.input(files)) and some processor:
from collections import defaultdict

scores = defaultdict(int)

def process(line):
    if 'Result' in line:
        res = line.split('\"')[1].split('-')[0]
        scores[res] += 1
The question is how to handle this when one gets to the multiprocessing.Pool.
Of course it's possible to use multiprocessing.sharedctypes with a custom struct instead of a defaultdict, but this seems rather painful. On the other hand, I can't think of a pythonic way to instantiate something before the worker process starts, or to return something to the main process after the generator has run out.
So you basically create a histogram. This can easily be parallelized, because histograms can be merged without complication. One might want to say that this problem is trivially parallelizable or "embarrassingly parallel". That is, you do not need to worry about communication among workers.
Just split your data set into multiple chunks, let your workers work on these chunks independently, collect the histogram of each worker, and then merge the histograms.
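For instance, two per-worker histograms can be combined directly; a tiny illustration using collections.Counter (the full example below relies on the same merge idea):

from collections import Counter

# Two partial histograms, as two independent workers might produce them:
h1 = Counter({"wolf": 2, "cat": 1})
h2 = Counter({"wolf": 1, "blume": 3})

# Merging is just addition (or Counter.update, as in the example below):
merged = h1 + h2        # counts: wolf=3, blume=3, cat=1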
In practice, this problem is best handled by letting each worker read and process its own file. That is, a "task" could be a file name. You should not start pickling file contents and sending them around between processes through pipes. Let each worker process retrieve the bulk data directly from the files. Otherwise your architecture spends too much time on inter-process communication instead of doing real work.
Do you need an example or can you figure this out yourself?
Edit: example implementation
I have a number of data files with file names in this format: data0.txt, data1.txt, ... .
Example contents:
wolf
wolf
cat
blume
eisenbahn
The goal is to create a histogram over the words contained in the data files. This is the code:
from multiprocessing import Pool
from collections import Counter
import glob


def build_histogram(filepath):
    """This function is run by a worker process.
    The `filepath` argument is communicated to the worker
    through a pipe. The return value of this function is
    communicated to the manager through a pipe.
    """
    hist = Counter()
    with open(filepath) as f:
        for line in f:
            hist[line.strip()] += 1
    return hist


def main():
    """This function runs in the manager (main) process."""
    # Collect paths to data files.
    datafile_paths = glob.glob("data*.txt")

    # Create a pool of worker processes and distribute work.
    # The input to worker processes (function argument) as well
    # as the output by worker processes is transmitted through
    # pipes, behind the scenes.
    pool = Pool(processes=3)
    histograms = pool.map(build_histogram, datafile_paths)

    # Properly shut down the pool of worker processes, and
    # wait until all of them have finished.
    pool.close()
    pool.join()

    # Merge sub-histograms. Do not create too many intermediate
    # objects: update the first sub-histogram with the others.
    # Relevant docs: collections.Counter.update
    merged_hist = histograms[0]
    for h in histograms[1:]:
        merged_hist.update(h)

    for word, count in merged_hist.items():
        print "%s: %s" % (word, count)


if __name__ == "__main__":
    main()
Test output:
python countwords.py
eisenbahn: 12
auto: 6
cat: 1
katze: 10
stadt: 1
wolf: 3
zug: 4
blume: 5
herbert: 14
destruction: 4
I had to modify the original pool.py to get what I want (the trouble was that worker is defined as a bare function, with nothing to inherit from), but it's not so bad, and probably better than writing a new pool entirely.
class worker(object):

    def __init__(self, inqueue, outqueue, initializer=None, initargs=(), maxtasks=None,
                 wrap_exception=False, finalizer=None, finargs=()):
        assert maxtasks is None or (type(maxtasks) == int and maxtasks > 0)
        put = outqueue.put
        get = inqueue.get
        self.completed = 0
        if hasattr(inqueue, '_writer'):
            inqueue._writer.close()
            outqueue._reader.close()
        if initializer is not None:
            initializer(self, *initargs)

        def run(self):
            while maxtasks is None or (maxtasks and self.completed < maxtasks):
                try:
                    task = get()
                except (EOFError, OSError):
                    util.debug('worker got EOFError or OSError -- exiting')
                    break
                if task is None:
                    util.debug('worker got sentinel -- exiting')
                    break
                job, i, func, args, kwds = task
                try:
                    result = (True, func(*args, **kwds))
                except Exception as e:
                    if wrap_exception:
                        e = ExceptionWithTraceback(e, e.__traceback__)
                    result = (False, e)
                try:
                    put((job, i, result))
                except Exception as e:
                    wrapped = MaybeEncodingError(e, result[1])
                    util.debug("Possible encoding error while sending result: %s" % (
                        wrapped))
                    put((job, i, (False, wrapped)))
                self.completed += 1
            if finalizer:
                finalizer(self, *finargs)
            util.debug('worker exiting after %d tasks' % self.completed)

        run(self)
I'm trying to implement a function that takes 2 functions as arguments, runs both, returns the value of the function that returns first and kills the slower function before it finishes its execution.
My problem is that when I try to empty the Queue object I use to collect the return values, I get stuck.
Is there a more 'correct' way to handle this scenario or even an existing module? If not, can anyone explain what I'm doing wrong?
Here is my code (the implementation of the above function is 'run_both()'):
import multiprocessing as mp
from time import sleep

Q = mp.Queue()

def dump_queue(queue):
    result = []
    for i in iter(queue.get, 'STOP'):
        result.append(i)
    return result

def rabbit(x):
    sleep(10)
    Q.put(x)

def turtle(x):
    sleep(30)
    Q.put(x)

def run_both(a, b):
    a.start()
    b.start()
    while a.is_alive() and b.is_alive():
        sleep(1)
    if a.is_alive():
        a.terminate()
    else:
        b.terminate()
    a.join()
    b.join()
    return dump_queue(Q)

p1 = mp.Process(target=rabbit, args=(1,))
p1 = mp.Process(target=turtle, args=(2,))
run_both(p1, p2)
Here's an example of calling 2 or more functions with multiprocessing and returning the fastest result. There are a few important things to note, however.
Running multiprocessing code in IDLE sometimes causes problems. This example works, but I did run into that issue while trying to solve this.
Multiprocessing code should start from inside an if __name__ == '__main__' clause, or else it will be run again if the main module is re-imported by another process. Read the multiprocessing doc page for more info.
The result queue is passed directly to each process that uses it. If you use the queue by referencing a global name in the module, the code fails on Windows because each process gets a new instance of the queue. Read more here: Multiprocessing Queue.get() hangs
I have also added a bit of a feature here to know which process's result was actually used.
import multiprocessing as mp
import time
import random


def task(value):
    # our dummy task is to sleep for a random amount of time and
    # return the given arg value
    time.sleep(random.random())
    return value


def process(q, idx, fn, args):
    # simply call function fn with args, and push its result in the queue with its index
    q.put([fn(*args), idx])


def fastest(calls):
    queue = mp.Queue()

    # we must pass the queue directly to each process that may use it
    # or else on Windows, each process will have its own copy of the queue
    # making it useless
    procs = []

    # create a 'mp.Process' that calls our 'process' for each call and start it
    for idx, call in enumerate(calls):
        fn = call[0]
        args = call[1:]
        p = mp.Process(target=process, args=(queue, idx, fn, args))
        procs.append(p)
        p.start()

    # wait for the queue to have something
    result, idx = queue.get()

    for proc in procs:  # kill all processes that may still be running
        proc.terminate()

    # proc may be using queue, so queue may be corrupted.
    # https://docs.python.org/3.8/library/multiprocessing.html?highlight=queue#multiprocessing.Process.terminate
    # we no longer need queue though so this is fine

    return result, idx


if __name__ == '__main__':
    from datetime import datetime

    start = datetime.now()
    print(start)

    # to be compatible with 'fastest', each call is a list with the first
    # element being callable, followed by args to be passed
    calls = [
        [task, 1],
        [task, 'hello'],
        [task, [1, 2, 3]]
    ]

    val, idx = fastest(calls)
    end = datetime.now()
    print(end)
    print('elapsed time:', end - start)
    print('returned value:', val)
    print('from call at index', idx)
Example output:
2019-12-21 04:01:09.525575
2019-12-21 04:01:10.171891
elapsed time: 0:00:00.646316
returned value: hello
from call at index 1
Apart from the typo on the penultimate line, which should read:
p2 = mp.Process(target=turtle, args=(2,)) # not p1
the simplest change you can make to get the program to work is to add:
Q.put('STOP')
to the end of turtle() and rabbit().
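A minimal sketch of that change (everything else from the question stays the same; dump_queue() stops at the first 'STOP' it reads):

def rabbit(x):
    sleep(10)
    Q.put(x)
    Q.put('STOP')   # signal dump_queue() that a complete result is available

def turtle(x):
    sleep(30)
    Q.put(x)
    Q.put('STOP')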
You also don't really need to keep looping to watch whether the processes are alive: by definition, if you read the message queue and receive STOP, one of them has finished. So you could replace run_both() with:
def run_both(a, b):
    a.start()
    b.start()
    result = dump_queue(Q)
    a.terminate()
    b.terminate()
    return result
You may also need to think about what happens if both processes put messages in the queue at much the same time: they could get mixed up. Maybe consider using 2 queues, or joining all the results up into a single message rather than appending multiple values from repeated queue.get() calls (see the sketch below).
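A rough sketch of the single-message idea (names and structure here are my own, not from the question; passing the queue explicitly also avoids the Windows issue noted in the other answer):

import multiprocessing as mp
from time import sleep

def rabbit(q, x):
    sleep(10)
    q.put(('rabbit', x))      # one complete message per racer

def turtle(q, x):
    sleep(30)
    q.put(('turtle', x))

def run_both(q, a, b):
    a.start()
    b.start()
    winner, value = q.get()   # blocks until the faster process reports
    a.terminate()
    b.terminate()
    a.join()
    b.join()
    return winner, value

if __name__ == '__main__':
    q = mp.Queue()
    p1 = mp.Process(target=rabbit, args=(q, 1))
    p2 = mp.Process(target=turtle, args=(q, 2))
    print(run_both(q, p1, p2))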
Testing Environment:
Python Version: 3.5.1
OS Platform: Ubuntu 16.04
IDE: PyCharm Community Edition 2016.3.2
I wrote a simple program to test process safety. I find that subprocess2 won't run until subprocess1 has finished. It seems that the instance variable self.count is process-safe. How do the processes share this variable? Do they share self directly?
Another question: when I use a Queue, I have to use multiprocessing.Manager to guarantee process safety manually, or the program won't run as expected. (If you uncomment self.queue = multiprocessing.Queue(), this program won't run normally, but using self.queue = multiprocessing.Manager().Queue() is OK.)
The last question is why the final result is 900. I think it should be 102.
Sorry for asking so many questions, but I'm indeed curious about these things. Thanks a lot!
Code:
import multiprocessing
import time

class Test:
    def __init__(self):
        self.pool = multiprocessing.Pool(1)
        self.count = 0
        #self.queue = multiprocessing.Queue()
        #self.queue = multiprocessing.Manager().Queue()

    def subprocess1(self):
        for i in range(3):
            print("Subprocess 1, count = %d" % self.count)
            self.count += 1
            time.sleep(1)
        print("Subprocess 1 Completed")

    def subprocess2(self):
        self.count = 100
        for i in range(3):
            print("Subprocess 2, count = %d" % self.count)
            self.count += 1
            time.sleep(1)
        print("Subprocess 2 Completed")

    def start(self):
        self.pool.apply_async(func=self.subprocess1)
        print("Subprocess 1 has been started")
        self.count = 900
        self.pool.apply_async(func=self.subprocess2)
        print("Subprocess 2 has been started")
        self.pool.close()
        self.pool.join()

    def __getstate__(self):
        self_dict = self.__dict__.copy()
        del self_dict['pool']
        return self_dict

    def __setstate__(self, state):
        self.__dict__.update(state)

if __name__ == '__main__':
    test = Test()
    test.start()
    print("Final Result, count = %d" % test.count)
Output:
Subprocess 1 has been started
Subprocess 2 has been started
Subprocess 1, count = 0
Subprocess 1, count = 1
Subprocess 1, count = 2
Subprocess 1 Completed
Subprocess 2, count = 100
Subprocess 2, count = 101
Subprocess 2, count = 102
Subprocess 2 Completed
Final Result, count = 900
The underlying details are rather tricky (see the Python3 documentation for more, and note that the details are slightly different for Python2), but essentially, when you pass self.subprocess1 or self.subprocess2 as an argument to self.pool.apply_async, Python ends up calling:
pickle.dumps(self)
in the main process—the initial one on Linux before forking, or the one invoked as __main__ on Windows—and then, eventually, pickle.loads() of the resulting byte-string in the pool process.[1] The pickle.dumps code winds up calling your own __getstate__ function; that function's job is to return something that can be serialized to a byte-string.[2] The subsequent pickle.loads creates a blank instance of the appropriate type, does not call its __init__, and then uses its __setstate__ function to fill in the object (instead of __init__ing it).
Your __getstate__ returns the dictionary holding the state of self, minus the pool object, for good reason:
>>> import multiprocessing
>>> x = multiprocessing.Pool(1)
>>> import pickle
>>> pickle.dumps(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 492, in __reduce__
'pool objects cannot be passed between processes or pickled'
NotImplementedError: pool objects cannot be passed between processes or pickled
Since pool objects refuse to be pickled (serialized), we must avoid even attempting to do that.
In any case, all of this means that the pool process has its own copy of self, which has its own copy of self.count (and is missing self.pool entirely). These items are not shared in any way so it is safe to modify self.count there.
I find the simplest mental model of this is to give each worker process a name: Alice, Bob, Carol, and so on, if you like. You can then think of the main process as "you": you copy something and give the copy to Alice, then copy it and give that one to Bob, and so on. Function calls, such as apply or apply_async, copy all of their arguments—including the implied self for bound methods.
When using a multiprocessing.Queue, you get something that knows how to work between the various processes, sharing data as needed, with appropriate synchronization. This lets you pass copies of data back and forth. However, like a pool instance, a multiprocessing.Queue instance cannot be copied. The multiprocessing routines do let you copy a multiprocessing.Manager().Queue() instance, which is good if you want a copied and otherwise private Queue() instance. (The internal details of this are complicated.[3])
The final result you get is just 900 because you are looking only at the original self object.
Note that each applied function (from apply or apply_async) returns a result. This result is copied back, from the worker process to the main process. With apply_async, you may choose to get called back as soon as the result is ready. If you want this result you should save it somewhere, or use the get function (as shown in that same answer) to wait for it when you need it.
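A minimal, self-contained sketch of both points; the Box class and bump method are hypothetical, not from the question:

from multiprocessing import Pool

class Box:
    def __init__(self):
        self.count = 0

    def bump(self):
        self.count += 1              # changes only the worker's pickled copy
        return self.count

if __name__ == "__main__":
    box = Box()
    pool = Pool(1)
    async_result = pool.apply_async(box.bump)   # `box` is pickled and sent to the worker
    pool.close()
    pool.join()
    print(async_result.get())        # 1, computed in the worker and copied back
    print(box.count)                 # still 0 in the main process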
[1] We can say "the" pool process here without worrying about which one, as you limited yourself to just one. In any case, though, there is a simple byte-oriented, two-way communications stream, managed by the multiprocessing code, connecting each worker process with the parent process that invoked it. If you create two such pool processes, each one has its own byte-stream connecting to the main process. This means it would not matter if there were two or more: the behavior would be the same.
[2] This "something" is often a dictionary, but see Simple example of use of __setstate__ and __getstate__ for details.
[3] The output of pickle.dumps on such an instance is:
>>> pickle.dumps(y)
(b'\x80\x03cmultiprocessing.managers\n'
b'RebuildProxy\n'
b'q\x00(cmultiprocessing.managers\n'
b'AutoProxy\n'
b'q\x01cmultiprocessing.managers\n'
b'Token\n'
b'q\x02)\x81q\x03X\x05\x00\x00\x00Queueq\x04X$\x00\x00\x00/tmp/pymp-pog4bhub/listener-0_uwd8c9q\x05X\t\x00\x00\x00801b92400q\x06\x87q\x07bX\x06\x00\x00\x00pickleq\x08}q\tX\x07\x00\x00\x00exposedq\n'
b'(X\x05\x00\x00\x00emptyq\x0bX\x04\x00\x00\x00fullq\x0cX\x03\x00\x00\x00getq\rX\n'
b'\x00\x00\x00get_nowaitq\x0eX\x04\x00\x00\x00joinq\x0fX\x03\x00\x00\x00putq\x10X\n'
b'\x00\x00\x00put_nowaitq\x11X\x05\x00\x00\x00qsizeq\x12X\t\x00\x00\x00task_doneq\x13tq\x14stq\x15Rq\x16.\n')
I did a little trickiness to split this at newlines and then manually added the parentheses, just to keep the long line from being super-long. The arguments will vary on different systems; this particular one uses a file system object that is a listener socket, that allows cooperating Python processes to establish a new byte stream between themselves.
Question: ... why the final result is 900? I think it should be 102.
The result should be 106: range is 0-based, so you get 3 iterations per task, and the two tasks end at counts of 3 and 103 respectively.
You can get the expected output, for instance:
import multiprocessing as mp
import time

class PoolTasks(object):
    def __init__(self):
        self.count = None

    def task(self, n, start):
        import os
        pid = os.getpid()
        count = start
        print("Task %s in Process %s has been started - start=%s" % (n, pid, count))
        for i in range(3):
            print("Task %s in Process %s, count = %d " % (n, pid, count))
            count += 1
            time.sleep(1)
        print("Task %s in Process %s has been completed - count=%s" % (n, pid, count))
        return count

    def start(self):
        with mp.Pool(processes=4) as pool:
            # launching multiple tasks asynchronously using processes
            multiple_results = [pool.apply_async(self.task, (p)) for p in [(1, 0), (2, 100)]]
            # sum result from tasks
            self.count = 0
            for res in multiple_results:
                self.count += res.get()

if __name__ == '__main__':
    pool = PoolTasks()
    pool.start()
    print('sum(count) = %s' % pool.count)
Output:
Task 1 in Process 5601 has been started - start=0
Task 1 in Process 5601, count = 0
Task 2 in Process 5602 has been started - start=100
Task 2 in Process 5602, count = 100
Task 1 in Process 5601, count = 1
Task 2 in Process 5602, count = 101
Task 1 in Process 5601, count = 2
Task 2 in Process 5602, count = 102
Task 1 in Process 5601 has been completed - count=3
Task 2 in Process 5602 has been completed - count=103
sum(count) = 106
Tested with Python 3.4.2
I want to run parallel computation on some input data which is loaded from a file. (The file can be really big, so I use a generator for this.)
Up to a certain number of items my code runs OK, but above this threshold the program hangs (some of the worker processes do not end).
Any suggestions? (I am running this with python2.7, 8 CPUs; 5,000 lines still OK, 7,500 does not work.)
Firstly, you need an input file. Generate it in bash:
for i in {0..10000}; do echo -e "$i"'\r' >> counter.txt; done
Then, run this:
python2.7 main.py 100 counter.txt > run_log.txt
main.py:
#!/usr/bin/python2.7

import os, sys, signal, time
import Queue
import multiprocessing as mp


def eat_queue(job_queue, result_queue):
    """Eats input queue, feeds output queue
    """
    proc_name = mp.current_process().name
    while True:
        try:
            job = job_queue.get(block=False)
            if job == None:
                print(proc_name + " DONE")
                return
            result_queue.put(execute(job))
        except Queue.Empty:
            pass


def execute(x):
    """Does the computation on the input data
    """
    return x*x


def save_result(result):
    """Saves results in a list
    """
    result_list.append(result)


def load(ifilename):
    """Generator reading the input file and
    yielding it row by row
    """
    ifile = open(ifilename, "r")
    for line in ifile:
        line = line.strip()
        num = int(line)
        yield (num)
    ifile.close()
    print("file closed".upper())


def put_tasks(job_queue, ifilename):
    """Feeds the job queue
    """
    for item in load(ifilename):
        job_queue.put(item)
    for _ in range(get_max_workers()):
        job_queue.put(None)


def get_max_workers():
    """Returns optimal number of processes to run
    """
    max_workers = mp.cpu_count() - 2
    if max_workers < 1:
        return 1
    return max_workers


def run(workers_num, ifilename):
    job_queue = mp.Queue()
    result_queue = mp.Queue()

    # decide how many processes are to be created
    max_workers = get_max_workers()
    print "processes available: %d" % max_workers
    if workers_num < 1 or workers_num > max_workers:
        workers_num = max_workers

    workers_list = []
    # a process for feeding job queue with the input file
    task_gen = mp.Process(target=put_tasks, name="task_gen",
                          args=(job_queue, ifilename))
    workers_list.append(task_gen)

    for i in range(workers_num):
        tmp = mp.Process(target=eat_queue, name="w%d" % (i+1),
                         args=(job_queue, result_queue))
        workers_list.append(tmp)

    for worker in workers_list:
        worker.start()

    for worker in workers_list:
        worker.join()
        print "worker %s finished!" % worker.name


if __name__ == '__main__':
    result_list = []
    args = sys.argv
    workers_num = int(args[1])
    ifilename = args[2]
    run(workers_num, ifilename)
This is because nothing in your code takes anything off result_queue. The behavior then depends on internal queue buffering details: if "not a lot" of data is waiting, everything appears fine, but if "a lot" of data is waiting, everything freezes. Not much more can be said, because it involves layers of internal magic ;-) But the docs do warn about it:
Warning
As mentioned above, if a child process has put items on a queue (and it has not used JoinableQueue.cancel_join_thread), then that process will not terminate until all buffered items have been flushed to the pipe.
This means that if you try joining that process you may get a deadlock unless you are sure that all items which have been put on the queue have been consumed. Similarly, if the child process is non-daemonic then the parent process may hang on exit when it tries to join all its non-daemonic children.
Note that a queue created using a manager does not have this issue. See Programming guidelines.
One easy way to repair that: First add
result_queue.put(None)
before eat_queue() returns. Then add:
count = 0
while count < workers_num:
    if result_queue.get() is None:
        count += 1
before the main program .join()s the workers. That drains the result queue, and everything shuts down cleanly then.
BTW, this code is pretty bizarre:
while True:
    try:
        job = job_queue.get(block=False)
        if job == None:
            print(proc_name + " DONE")
            return
        result_queue.put(execute(job))
    except Queue.Empty:
        pass
Why are you doing non-blocking get()? This turns into a CPU-hog "busy loop" so long as the queue is empty. The primary point of .get() is to supply an efficient way to wait for work to show up. So:
while True:
    job = job_queue.get()
    if job is None:
        print(proc_name + " DONE")
        break
    else:
        result_queue.put(execute(job))
result_queue.put(None)
does the same thing, but far more efficiently.
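For what it's worth, the same blocking loop can also be written with the two-argument form of iter(), which other snippets in this thread already use with a 'STOP' sentinel:

for job in iter(job_queue.get, None):    # blocks in get(); stops at the None sentinel
    result_queue.put(execute(job))
print(proc_name + " DONE")
result_queue.put(None)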
Queue size caution
You didn't ask about this, but let's cover it before it bites you ;-) By default, there is no bound on a Queue's size. If, e.g., you add a billion items to the Queue, it will demand enough RAM to hold a billion items. So if your producer(s) can generate work items faster than your consumer(s) can process them, memory use can get out of hand quickly.
Fortunately, that's easy to repair: specify a maximum queue size. For example,
job_queue = mp.Queue(maxsize=10*workers_num)
                     ^^^^^^^^^^^^^^^^^^^^^^
Then job_queue.put(some_work_item) will block until consumers reduce the size of the queue to less than the maximum. This way you can process enormous problems with a queue that requires trivial RAM.
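A small self-contained illustration of that back-pressure effect (the numbers and names here are made up for the demo):

import multiprocessing as mp
import time

def slow_consumer(q):
    for item in iter(q.get, None):   # drain until the None sentinel
        time.sleep(0.01)             # pretend each item takes a while to process

if __name__ == '__main__':
    q = mp.Queue(maxsize=10)         # at most 10 items buffered at any time
    c = mp.Process(target=slow_consumer, args=(q,))
    c.start()
    for i in range(1000):
        q.put(i)                     # blocks whenever the buffer is full
    q.put(None)
    c.join()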
I'm trying to share an existing object across multiple processes using the proxy methods described here. My multiprocessing idiom is the worker/queue setup, modeled after the 4th example here.
The code needs to do some calculations on data that are stored in rather large files on disk. I have a class that encapsulates all the I/O interactions, and once it has read a file from disk, it saves the data in memory for the next time a task needs to use the same data (which happens often).
I thought I had everything working from reading the examples linked to above. Here is a mock up of the code that just uses numpy random arrays to model the disk I/O:
import numpy
from multiprocessing import Process, Queue, current_process, Lock
from multiprocessing.managers import BaseManager

nfiles = 200
njobs = 1000


class BigFiles:

    def __init__(self, nfiles):
        # Start out with nothing read in.
        self.data = [ None for i in range(nfiles) ]
        # Use a lock to make sure only one process is reading from disk at a time.
        self.lock = Lock()

    def access(self, i):
        # Get the data for a particular file
        # In my real application, this function reads in files from disk.
        # Here I mock it up with random numpy arrays.
        if self.data[i] is None:
            with self.lock:
                self.data[i] = numpy.random.rand(1024,1024)
        return self.data[i]

    def summary(self):
        return 'BigFiles: %d, %d Storing %d of %d files in memory'%(
            id(self),id(self.data),
            (len(self.data) - self.data.count(None)),
            len(self.data) )


# I'm using a worker/queue setup for the multprocessing:
def worker(input, output):
    proc = current_process().name
    for job in iter(input.get, 'STOP'):
        (big_files, i, ifile) = job
        data = big_files.access(ifile)

        # Do some calculations on the data
        answer = numpy.var(data)

        msg = '%s, job %d'%(proc, i)
        msg += '\n Answer for file %d = %f'%(ifile, answer)
        msg += '\n ' + big_files.summary()
        output.put(msg)


# A class that returns an existing file when called.
# This is my attempted workaround for the fact that Manager.register needs a callable.
class ObjectGetter:
    def __init__(self, obj):
        self.obj = obj
    def __call__(self):
        return self.obj


def main():
    # Prior to the place where I want to do the multprocessing,
    # I already have a BigFiles object, which might have some data already read in.
    # (Here I start it out empty.)
    big_files = BigFiles(nfiles)
    print 'Initial big_files.summary = ',big_files.summary()

    # My attempt at making a proxy class to pass big_files to the workers
    class BigFileManager(BaseManager):
        pass
    getter = ObjectGetter(big_files)
    BigFileManager.register('big_files', callable = getter)
    manager = BigFileManager()
    manager.start()

    # Set up the jobs:
    task_queue = Queue()
    for i in range(njobs):
        ifile = numpy.random.randint(0, nfiles)
        big_files_proxy = manager.big_files()
        task_queue.put( (big_files_proxy, i, ifile) )

    # Set up the workers
    nproc = 12
    done_queue = Queue()
    process_list = []
    for j in range(nproc):
        p = Process(target=worker, args=(task_queue, done_queue))
        p.start()
        process_list.append(p)
        task_queue.put('STOP')

    # Log the results
    for i in range(njobs):
        msg = done_queue.get()
        print msg

    print 'Finished all jobs'
    print 'big_files.summary = ',big_files.summary()

    # Shut down the workers
    for j in range(nproc):
        process_list[j].join()
    task_queue.close()
    done_queue.close()


main()
This works in the sense that it calculates everything correctly, and it is caching the data that is read along the way. The only problem I'm having is that at the end, the big_files object doesn't have any of the files loaded. The final msg returned is:
Process-2, job 999. Answer for file 198 = 0.083406
BigFiles: 4303246400, 4314056248 Storing 198 of 200 files in memory
But then after it's all done, we have:
Finished all jobs
big_files.summary = BigFiles: 4303246400, 4314056248 Storing 0 of 200 files in memory
So my question is: What happened to all the stored data? It's claiming to be using the same self.data according to the id(self.data). But it's empty now.
I want the end state of big_files to have all the saved data that it accumulated along the way, since I actually have to repeat this entire process many times, so I don't want to have to redo all the (slow) I/O each time.
I'm assuming it must have something to do with my ObjectGetter class. The examples for using BaseManager only show how to make a new object that will be shared, not share an existing one. So am I doing something wrong with way I get the existing big_files object? Can anyone suggest a better way to do this step?
Thanks much!
The SciPy minimization function (just to use as an example) has the option of adding a callback function, called at each step. So I can do something like,
def my_callback(x):
    print x

scipy.optimize.fmin(func, x0, callback=my_callback)
Is there a way to use the callback function to create a generator version of fmin, so that I could do,
for x in my_fmin(func, x0):
    print x
It seems like it might be possible with some combination of yields and sends, but I can't quite think of anything.
As pointed out in the comments, you could do it in a new thread, using Queue. The drawback is that you'd still need some way to access the final result (what fmin returns at the end). My example below uses an optional callback to do something with it (another option would be to just yield it also, though your calling code would have to differentiate between iteration results and final results):
from thread import start_new_thread
from Queue import Queue

def my_fmin(func, x0, end_callback=(lambda x:x), timeout=None):

    q = Queue()          # fmin produces, the generator consumes
    job_done = object()  # signals the processing is done

    # Producer
    def my_callback(x):
        q.put(x)
    def task():
        ret = scipy.optimize.fmin(func, x0, callback=my_callback)
        q.put(job_done)
        end_callback(ret)  # "Returns" the result of the main call

    # Starts fmin in a new thread
    start_new_thread(task, ())

    # Consumer
    while True:
        next_item = q.get(True, timeout)  # Blocks until an input is available
        if next_item is job_done:
            break
        yield next_item
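For reference, a hypothetical usage sketch (it assumes scipy.optimize has been imported, which the code above needs anyway; the objective function is just an arbitrary quadratic):

import scipy.optimize   # also needed by my_fmin() above

for x in my_fmin(lambda v: (v[0] - 2.0) ** 2, [0.0]):
    print x             # each intermediate parameter vector, as fmin iterates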
Update: to block the execution of the next iteration until the consumer has finished processing the last one, it's also necessary to use task_done and join.
# Producer
def my_callback(x):
    q.put(x)
    q.join()  # Blocks until task_done is called

# Consumer
while True:
    next_item = q.get(True, timeout)  # Blocks until an input is available
    if next_item is job_done:
        break
    yield next_item
    q.task_done()  # Unblocks the producer, so a new iteration can start
Note that maxsize=1 is not necessary, since no new item will be added to the queue until the last one is consumed.
Update 2: Also note that, unless all items are eventually retrieved by this generator, the created thread will deadlock (it will block forever and its resources will never be released). The producer is waiting on the queue, and since it stores a reference to that queue, it will never be reclaimed by the gc even if the consumer is. The queue will then become unreachable, so nobody will be able to release the lock.
A clean solution for that is unknown, if possible at all (since it would depend on the particular function used in the place of fmin). A workaround could be made using timeout, having the producer raise an exception if put blocks for too long:
q = Queue(maxsize=1)

# Producer
def my_callback(x):
    q.put(x)
    q.put("dummy", True, timeout)  # Blocks until the first result is retrieved
    q.join()                       # Blocks again until task_done is called

# Consumer
while True:
    next_item = q.get(True, timeout)  # Blocks until an input is available
    q.task_done()                     # (one "task_done" per "get")
    if next_item is job_done:
        break
    yield next_item
    q.get()        # Retrieves the "dummy" object (must be after yield)
    q.task_done()  # Unblocks the producer, so a new iteration can start
Generator as coroutine (no threading)
Let's have a FakeFtp class with a retrbinary method that uses a callback, called with each successfully read chunk of data:
class FakeFtp(object):
    def __init__(self):
        self.data = iter(["aaa", "bbb", "ccc", "ddd"])

    def login(self, user, password):
        self.user = user
        self.password = password

    def retrbinary(self, cmd, cb):
        for chunk in self.data:
            cb(chunk)
Using a simple callback function has the disadvantage that it is called repeatedly, and the callback function cannot easily keep context between calls.
The following code defines a process_chunks generator, which is able to receive chunks of data one by one and process them. In contrast to a simple callback, here we are able to keep all the processing within one function without losing context.
from contextlib import closing
from itertools import count


def main():
    processed = []

    def process_chunks():
        for i in count():
            try:
                # (repeatedly) get the chunk to process
                chunk = yield
            except GeneratorExit:
                # finish_up
                print("Finishing up.")
                return
            else:
                # Here process the chunk as you like
                print("inside coroutine, processing chunk:", i, chunk)
                product = "processed({i}): {chunk}".format(i=i, chunk=chunk)
                processed.append(product)

    with closing(process_chunks()) as coroutine:
        # Get the coroutine to the first yield
        coroutine.next()
        ftp = FakeFtp()
        # next line repeatedly calls `coroutine.send(data)`
        ftp.retrbinary("RETR binary", cb=coroutine.send)
        # each callback "jumps" to `yield` line in `process_chunks`

    print("processed result", processed)
    print("DONE")
To see the code in action, put the FakeFtp class, the code shown above and the following line:
main()
into one file and call it:
$ python headsandtails.py
('inside coroutine, processing chunk:', 0, 'aaa')
('inside coroutine, processing chunk:', 1, 'bbb')
('inside coroutine, processing chunk:', 2, 'ccc')
('inside coroutine, processing chunk:', 3, 'ddd')
Finishing up.
('processed result', ['processed(0): aaa', 'processed(1): bbb', 'processed(2): ccc', 'processed(3): ddd'])
DONE
How it works
processed = [] is here just to show that the generator process_chunks has no problem cooperating with its external context. Everything is wrapped into def main(): to prove there is no need to use global variables.
def process_chunks() is the core of the solution. It might have one-shot input parameters (not used here), but the main point is where it receives input: each yield line returns whatever someone sends via .send(data) into an instance of this generator. One could call coroutine.send(chunk) directly, but in this example it is done via a callback referring to coroutine.send.
Note that in a real solution there is no problem having multiple yields in the code; they are processed one by one. This could be used, e.g., to read (and ignore) the header of a CSV file and then continue processing the data records, as in the sketch below.
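A hypothetical standalone sketch of that multi-yield idea (it does not use FakeFtp; the first yield consumes the header line, the later yields consume data records):

def csv_coroutine():
    header = yield                     # the first chunk sent in is the header line
    print("ignoring header:", header)
    while True:
        try:
            record = yield             # every further chunk is a data record
        except GeneratorExit:
            print("done")
            return
        print("processing record:", record)

coroutine = csv_coroutine()
next(coroutine)                        # advance to the first yield
for line in ["name,age", "alice,30", "bob,25"]:
    coroutine.send(line)
coroutine.close()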
We could instantiate and use the generator as follows:
coroutine = process_chunks()
# Get the coroutine to the first yield
coroutine.next()
ftp = FakeFtp()
# next line repeatedly calls `coroutine.send(data)`
ftp.retrbinary("RETR binary", cb=coroutine.send)
# each callback "jumps" to `yield` line in `process_chunks`
# close the coroutine (will throw the `GeneratorExit` exception into the
# `process_chunks` coroutine).
coroutine.close()
The real code uses the contextlib closing context manager to ensure that coroutine.close() is always called.
Conclusions
This solution does not provide the sort of iterator that consumes data in the traditional style "from outside". On the other hand, we are able to:
use the generator "from inside"
keep all iterative processing within one function without being interrupted between callbacks
optionally use external context
provide usable results to outside
all this can be done without using threading
Credits: The solution is heavily inspired by SO answer Python FTP “chunk” iterator (without loading entire file into memory)
written by user2357112
Concept: Use a blocking queue with maxsize=1 and a producer/consumer model.
The callback produces, then the next call to the callback will block on the full queue.
The consumer then yields the value from the queue, tries to get another value, and blocks on read.
The producer is then allowed to push to the queue; rinse and repeat.
Usage:
def dummy(func, arg, callback=None):
    for i in range(100):
        callback(func(arg+i))

# Dummy example:
for i in Iteratorize(dummy, lambda x: x+1, 0):
    print(i)

# example with scipy:
for i in Iteratorize(scipy.optimize.fmin, func, x0):
    print(i)
Can be used as expected for an iterator:
for i in take(5, Iteratorize(dummy, lambda x: x+1, 0)):
    print(i)
Iteratorize class:
from thread import start_new_thread
from Queue import Queue


class Iteratorize:
    """
    Transforms a function that takes a callback
    into a lazy iterator (generator).
    """

    def __init__(self, func, ifunc, arg, callback=None):
        self.mfunc = func
        self.ifunc = ifunc
        self.c_callback = callback
        self.q = Queue(maxsize=1)
        self.stored_arg = arg
        self.sentinel = object()

        def _callback(val):
            self.q.put(val)

        def gentask():
            ret = self.mfunc(self.ifunc, self.stored_arg, callback=_callback)
            self.q.put(self.sentinel)
            if self.c_callback:
                self.c_callback(ret)

        start_new_thread(gentask, ())

    def __iter__(self):
        return self

    def next(self):
        obj = self.q.get(True, None)
        if obj is self.sentinel:
            raise StopIteration
        else:
            return obj
Can probably do with some cleaning up to accept *args and **kwargs for the function being wrapped and/or the final result callback.
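A hypothetical sketch of that cleanup, using the same thread/Queue approach as above; it assumes the wrapped function accepts a callback= keyword argument, like scipy.optimize.fmin:

from thread import start_new_thread
from Queue import Queue


class IteratorizeArgs:
    """Like Iteratorize above, but forwards arbitrary *args/**kwargs."""

    def __init__(self, func, *args, **kwargs):
        self.final_callback = kwargs.pop('final_callback', None)
        self.q = Queue(maxsize=1)
        self.sentinel = object()

        def _callback(val):
            self.q.put(val)

        def gentask():
            ret = func(*args, callback=_callback, **kwargs)
            self.q.put(self.sentinel)
            if self.final_callback:
                self.final_callback(ret)

        start_new_thread(gentask, ())

    def __iter__(self):
        return self

    def next(self):
        obj = self.q.get(True, None)
        if obj is self.sentinel:
            raise StopIteration
        return obj

# Usage would then be, e.g.:
#   for i in IteratorizeArgs(scipy.optimize.fmin, func, x0):
#       print i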
How about
data = []
scipy.optimize.fmin(func, x0, callback=data.append)
for line in data:
    print line
If not, what exactly do you want to do with the generator's data?
A variant of Frits' answer, that:
Supports send to choose a return value for the callback
Supports throw to choose an exception for the callback
Supports close to gracefully shut down
Does not compute a queue item until it is requested
The complete code with tests can be found on github
import queue
import threading
import collections.abc


class generator_from_callback(collections.abc.Generator):
    def __init__(self, expr):
        """
        expr: a function that takes a callback
        """
        self._expr = expr
        self._done = False
        self._ready_queue = queue.Queue(1)
        self._done_queue = queue.Queue(1)
        self._done_holder = [False]

        # local to avoid reference cycles
        ready_queue = self._ready_queue
        done_queue = self._done_queue
        done_holder = self._done_holder

        def callback(value):
            done_queue.put((False, value))
            cmd, *args = ready_queue.get()
            if cmd == 'close':
                raise GeneratorExit
            elif cmd == 'send':
                return args[0]
            elif cmd == 'throw':
                raise args[0]

        def thread_func():
            try:
                cmd, *args = ready_queue.get()
                if cmd == 'close':
                    raise GeneratorExit
                elif cmd == 'send':
                    if args[0] is not None:
                        raise TypeError("can't send non-None value to a just-started generator")
                elif cmd == 'throw':
                    raise args[0]
                ret = expr(callback)
                raise StopIteration(ret)
            except BaseException as e:
                done_holder[0] = True
                done_queue.put((True, e))

        self._thread = threading.Thread(target=thread_func)
        self._thread.start()

    def __next__(self):
        return self.send(None)

    def send(self, value):
        if self._done_holder[0]:
            raise StopIteration
        self._ready_queue.put(('send', value))
        is_exception, val = self._done_queue.get()
        if is_exception:
            raise val
        else:
            return val

    def throw(self, exc):
        if self._done_holder[0]:
            raise StopIteration
        self._ready_queue.put(('throw', exc))
        is_exception, val = self._done_queue.get()
        if is_exception:
            raise val
        else:
            return val

    def close(self):
        if not self._done_holder[0]:
            self._ready_queue.put(('close',))
        self._thread.join()

    def __del__(self):
        self.close()
Which works as:
In [3]: def callback(f):
   ...:     ret = f(1)
   ...:     print("gave 1, got {}".format(ret))
   ...:     f(2)
   ...:     print("gave 2")
   ...:     f(3)
   ...:
In [4]: i = generator_from_callback(callback)
In [5]: next(i)
Out[5]: 1
In [6]: i.send(4)
gave 1, got 4
Out[6]: 2
In [7]: next(i)
gave 2, got None
Out[7]: 3
In [8]: next(i)
StopIteration
For scipy.optimize.fmin, you would use generator_from_callback(lambda c: scipy.optimize.fmin(func, x0, callback=c))
Solution to handle non-blocking callbacks
The solution using threading and queue is pretty good: high-performance and cross-platform, and probably the best one.
Here I provide this not-too-bad solution, which is mainly for handling non-blocking callbacks, e.g. called from the parent function through threading.Thread(target=callback).start(), or other non-blocking ways.
import pickle
import select
import subprocess


def my_fmin(func, x0):
    # open a process to use as a pipeline
    proc = subprocess.Popen(['cat'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def my_callback(x):
        # x might be any object, not only str, so we use pickle to dump it
        proc.stdin.write(pickle.dumps(x).replace(b'\n', b'\\n') + b'\n')
        proc.stdin.flush()

    from scipy import optimize
    optimize.fmin(func, x0, callback=my_callback)

    # this is meant to handle non-blocking callbacks, e.g. called somewhere
    # through `threading.Thread(target=callback).start()`
    while select.select([proc.stdout], [], [], 0)[0]:
        yield pickle.loads(proc.stdout.readline()[:-1].replace(b'\\n', b'\n'))

    # close the process
    proc.communicate()
Then you can use the function like this:
# unfortunately, `scipy.optimize.fmin`'s callback is blocking.
# so this example is just for showing how-to.
for x in my_fmin(lambda x: x**2, 3):
    print(x)
Although this solution seems quite simple and readable, it's not as high-performance as the threading and queue solution, because:
Processes are much heavier than threads.
Passing data through a pipe instead of memory is much slower.
Besides, it doesn't work on Windows, because the select module on Windows can only handle sockets, not pipes and other file descriptors.
For a super simple approach...
def callback_to_generator():
    data = []
    method_with_callback(blah, foo, callback=data.append)
    for item in data:
        yield item
Yes, this isn't good for large data
Yes, this blocks on all items being processed first
But it still might be useful for some use cases :)
Also thanks to @winston-ewert as this is just a small variant on his answer :)