pipeline an iterator to multiple consumers?


Is it possible to "pipeline" consumption of a generator across multiple consumers?
For example, it's common to have code with this pattern:
def consumer1(iterator):
    for item in iterator:
        foo(item)

def consumer2(iterator):
    for item in iterator:
        bar(item)

myiter = list(big_generator())
v1 = consumer1(myiter)
v2 = consumer2(myiter)
In this case, multiple functions completely consume the same iterator, making it necessary to cache the iterator in a list. Since each consumer exhausts the iterator, itertools.tee is useless.
I see code like this a lot and I always wish I could get the consumers to consume one item at a time in order instead of caching the entire iterator. E.g.:
consumer1 consumes myiter[0]
consumer2 consumes myiter[0]
consumer1 consumes myiter[1]
consumer2 consumes myiter[1]
etc...
If I were to make up a syntax, it would look like this:
c1_retval, c2_retval = iforkjoin(big_generator(), (consumer1, consumer2))
You can get close with threads or multiprocessing and teed iterators, but threads consume at different speeds meaning that the value deque cached inside tee could get very large. The point here is not to exploit parallelism or to speed up tasks but to avoid caching large sections of the iterator.
It seems to me that this might be impossible without modifying the consumers because the flow of control is in the consumer. However, when a consumer actually consumes the iterator control passes into the iterator's next() method, so maybe it is possible to invert the flow of control somehow so that the iterator blocks the consumers one at a time until it can feed them all?
If this is possible, I'm not clever enough to see how. Any ideas?

With the limitation of not changing consumers' code (i.e. having a loop in them), you're left with only two options:
the approach you already include in your question: caching the generated items in memory, then iterating over them multiple times.
running each consumer in a thread and implementing some kind of synchronized itertools.tee, one with a buffer of size 1, which blocks serving item i+1 until item i has been served to all consumers.
There are no other options. You can't achieve all of the following at once, because they contradict one another:
having a generator
having a loop to consume all of it
then, (serially-)after the previous loop has finished, having another loop to consume all of it again
only keeping O(1) items in memory (or disk, etc.) while consuming them
not regenerating (i.e. not re-creating the generator)
The generated items must be stored somewhere if you want to reuse them.
If changing the consumers' code is acceptable, clearly #monkey's solution is the simplest and most straightforward.
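The second option above (one thread per consumer, synchronized on a one-item buffer) might be sketched as follows. This is only an illustration, not a tested library: the name iforkjoin comes from the question, every other name is made up, and it assumes that every consumer fully exhausts its iterator. If one consumer stops early or raises, the others will block forever.

```python
import threading

_DONE = object()  # sentinel marking the end of the source iterator


def iforkjoin(iterable, consumers):
    """Run each consumer in its own thread, feeding all of them in lockstep."""
    source = iter(iterable)
    slot = [None]  # the one-item buffer shared by all consumers

    def advance():
        # Barrier action: runs in exactly one thread, once every consumer
        # has asked for the next item.
        slot[0] = next(source, _DONE)

    barrier = threading.Barrier(len(consumers), action=advance)

    def view():
        # Per-consumer iterator: wait for everyone, then read the shared slot.
        while True:
            barrier.wait()
            item = slot[0]
            if item is _DONE:
                return
            yield item

    results = [None] * len(consumers)

    def run(index, consumer):
        results[index] = consumer(view())

    threads = [threading.Thread(target=run, args=(i, c))
               for i, c in enumerate(consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Something like iforkjoin(big_generator(), (consumer1, consumer2)) would then return the consumers' return values as a list while holding only one generated item at a time.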

Doesn't this work? Or does each consumer require the entire iterator, so that handing each one a single item at a time like this won't work? If so, I think you have to either keep a copy or run the generator twice.
for item in big_generator():
    consumer1.handle_item(item)
    consumer2.handle_item(item)
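For illustration, push-style consumers of this kind could look like the sketch below. Only handle_item appears in the answer; the class names Summer and Collector and the helper feed_all are hypothetical.

```python
class Summer:
    """Push-style consumer: receives one item at a time and keeps a total."""

    def __init__(self):
        self.total = 0

    def handle_item(self, item):
        self.total += item


class Collector:
    """Push-style consumer that simply remembers what it has seen."""

    def __init__(self):
        self.items = []

    def handle_item(self, item):
        self.items.append(item)


def feed_all(iterable, consumers):
    # The loop lives outside the consumers, so each item is generated once
    # and handed to every consumer before the next one is produced.
    for item in iterable:
        for consumer in consumers:
            consumer.handle_item(item)


summer, collector = Summer(), Collector()
feed_all((n * n for n in range(5)), (summer, collector))
print(summer.total, collector.items)
```

The cost is that each consumer has to keep its state on an object instead of in local variables of a loop.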

Related

Fastest way to share a very large dict between multiple processes without copying

TL;DR: How to share a large (200MB) read only dict between multiple processes in a performant way, that is accessed VERY heavily without each process having a full copy in memory.
EDIT: It looks like if I just pass the dictionary as the argument for the multiprocessing.Pool/Process, it won't actually create a copy unless a worker modifies the dictionary. I just assumed it would copy. This behavior seems to be Unix only where fork is available and even then not always. But if so, it should solve my problem until this is converted to an ETL job.
What I'm trying to do:
I have a task to improve a script that replicates data from one store to another, normalizing and transforming the data on the way. This task works on the scale of around 100 million documents coming from the source document store that get rolled up and pushed to another destination document store.
Each document has an ID, and there is another document store that is essentially a key-value store of those IDs mapped to some additional information needed for this task. This store is a lot smaller, and querying it while documents from the main store come through is not really an option without heavy caching, and that heavy cache ends up being a copy of the whole thing very quickly. I just create the whole dictionary from that entire store at the beginning, before starting anything, and use that. That dictionary is around 200MB in size. Note that this dictionary is only ever read from.
For this I have set up multiprocessing and have around 30 concurrent processes. I've divided the work so that each process hits different indices, and the whole thing can be done in around 4 hours.
I have noticed that I am extremely CPU bound when doing the following 2 things:
Using a thread pool/threads (what I'm currently doing) so each thread can access the dict without issue. The GIL is killing me and I have one process maxing out at 100% all the time with other CPUs sitting idle. Switching to PyPy helped a lot, but I'm still not happy with this approach.
Creating a multiprocessing.Manager().dict() for the large dict and having the child processes access it through that. The server process that this approach creates is constantly at 100% CPU. I don't know why, as I only ever read from this dictionary, so I doubt it's a locking thing. I don't know how the Manager works internally, but I'm guessing that the child processes are connecting via pipes/sockets for each fetch and the overhead of this is massive. It also suggests that using Redis/Memcached will have the same problem if true. Maybe it can be configured better?
I am Memory bound when doing these things:
Using a SharedMemory view. You can't seem to do this for dicts like I need to. I can serialize the dict to get it into the shared view, but for it to be usable in the child process you need to deserialize the data back into an actual usable dict, which creates the copy in the process.
I strongly suspect that unless I've missed something I'm just going to have to "download more ram" or rewrite from Python into something without a GIL (or use ETL like it should be done in...).
In the case of ram, what is the most efficient way to store a dict like this to make it sting less? It's currently a standard dict mapped to a tuple of the extra information consisting of 3 long/float.
doc_to_docinfo = {
    "ID1": (5.2, 3.0, 455),
}
Are there any more efficient hashmap implementations for this use case than what I'm doing?
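The fork behavior mentioned in the EDIT above can be sketched like this. It is a minimal illustration that assumes a Unix system where the "fork" start method is available; DOC_TO_DOCINFO, lookup_worker and sum_in_children are made-up names, and the small dict stands in for the real 200MB one. Note that CPython updates reference counts when objects are read, so pages inherited from the parent can still be copied gradually even though the dict is never modified.

```python
import multiprocessing as mp

# Hypothetical stand-in for the large read-only dict; built BEFORE forking
# so that the children inherit it instead of receiving a pickled copy.
DOC_TO_DOCINFO = {"ID%d" % i: (float(i), 3.0, i) for i in range(1000)}


def lookup_worker(doc_ids, out_queue):
    # Reads the module-level dict inherited from the parent process.
    total = sum(DOC_TO_DOCINFO[doc_id][2] for doc_id in doc_ids)
    out_queue.put(total)


def sum_in_children(chunks):
    ctx = mp.get_context("fork")  # assumption: Unix only
    out_queue = ctx.Queue()
    procs = [ctx.Process(target=lookup_worker, args=(chunk, out_queue))
             for chunk in chunks]
    for p in procs:
        p.start()
    # Drain the queue before joining so no child blocks on a full pipe.
    totals = [out_queue.get() for _ in procs]
    for p in procs:
        p.join()
    return sorted(totals)
```

Only the document IDs and the small results cross process boundaries here; the dict itself is never pickled.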

You seem to have a problem similar to mine. It is possible to use my source here to partition those dictionary keys per thread. My suggestion: split the document IDs into partitions of length 3 or 4, keep the partition table in sync across all processes/threads, and then just route each part of your documents to the matching process/thread. As an entry point, a process does a lookup in the partition table to find out which process can handle that part of the dictionary. If you are clever about balancing the partitions, you could also have each thread manage an equal number of documents.

Concurrent access to list from multiple threads in python when data is appended constantly

We have a list to which data is appended at regular time intervals. This procedure takes time, so using an ordinary mutex to protect the entire list during writes is not the most efficient solution. How can reads and writes to such a list be organized in a more concurrent fashion?

You don't need locking when using list.append() or list.extend() with multiple threads. In CPython these operations are thread-safe.
Here is a brief overview of operations that are thread-safe: https://docs.python.org/3/faq/library.html#what-kinds-of-global-value-mutation-are-thread-safe
It's also worth mentioning that from a performance standpoint it's much faster to prepare sub-lists in separate threads, and then extend the main list with these sub-lists.
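A minimal sketch of that last suggestion, with made-up names (collect, worker): each thread builds its own sub-list and publishes it with a single extend() call.

```python
import threading


def collect(worker_count, items_per_worker):
    shared = []

    def worker(worker_id):
        # Build results privately, with no contention on the shared list.
        local = [(worker_id, n) for n in range(items_per_worker)]
        # One extend() call publishes the whole batch.
        shared.extend(local)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared


result = collect(4, 1000)
print(len(result))
```

The order of the batches in the shared list depends on thread scheduling, so readers should not rely on it.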

python converting map to list taking a long time

EDIT: I'm using Python 3.5.0, and so map() will return an iterator instead of a list, unlike Python 2.x
I have a list of units and I am calling a REST api on all of them to return more data about them. I'm using map() to do this, but when I try to convert that map to a list, the program hangs there and doesn't proceed (both when I run it and debug it)
data = list(map(lambda product: client.request(units_url + "/" + product), units))
At first I thought maybe it was an issue with calling the api so quickly, but when I iterate through the map (without converting it to a list) manually and print it goes just fine:
data = map(lambda product: client.request(units_url + "/" + product), units)
for item in data:
    print(item) # <-- this works just fine for the entire map
Anyone know why I'm getting this behavior?

When you list-ify the map, that means every single request is dispatched serially, waits for completion, then stores to the resulting list. If you're dispatching 1000 requests, that means each request must complete in order, one by one, before the list is constructed and you see the first result; it's entirely synchronous.
You get results (almost) immediately in the direct map iteration case because it only makes one request at a time; instead of waiting for 1000 requests, it waits for 1, you process that result, then it waits for another, etc.
If the goal is to minimize latency, take a look at multiprocessing.Pool.imap (or the thread based version of the pool implemented in multiprocessing.dummy; threads can be ideal for parallel network I/O requests and won't require pickling data for IPC). With the Pool's map, imap, or imap_unordered methods (choose one based on your needs), the requests will be dispatched asynchronously, several at a time (depending on the number of workers you select). If you absolutely must have a list, Pool.map will usually construct it faster; if you can iterate directly and don't care about the ordering of results, Pool.imap_unordered will get you results as fast as the workers can get them, in whatever order they are satisfied in. Plain map without a Pool isn't getting you any magical performance benefits (a list comprehension would usually run faster actually), so use a Pool.
Simple example code for fastest results:
import multiprocessing.dummy as multiprocessing # Import thread based version of library; for network I/O should work fine

with multiprocessing.Pool(8) as pool: # Pool of eight worker threads
    for item in pool.imap_unordered(lambda product: client.request(units_url + "/" + product), units):
        print(item)
If you really need to, you can use Pool.map and store to a real list, and assuming you have the bandwidth to run eight parallel requests (or however many workers you configure the pool for), that should (roughly) divide the time to complete the map by eight.
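A sketch of that Pool.map variant, using the thread-based pool. The function fetch is a hypothetical stand-in for client.request(units_url + "/" + product), with a short sleep simulating network latency; the unit names are invented.

```python
import time
from multiprocessing.pool import ThreadPool


def fetch(product):
    # Hypothetical stand-in for the real REST call.
    time.sleep(0.05)
    return "data for " + product


units = ["unit%d" % i for i in range(16)]

with ThreadPool(8) as pool:      # eight worker threads
    data = pool.map(fetch, units)  # a real list, in the same order as units

print(len(data), data[0])
```

Pool.map blocks until every request has finished, so it trades the early first result of imap_unordered for an ordered list.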

Better answer than I previously had. Check out this link. Near the bottom of the answer it gives a great analysis on why you should really use a list comprehension.
data = [ client.request(units_url + "/" + product) for product in units ]

Implementing a Timer in Python

General Overview
I have a medium-sized Django project
I have a bunch of prefix trees in memory (as opposed to in the DB)
The nodes of these trees represent entities/objects that are subject to a timeout, i.e. I need to time out these nodes at various points in time
Design:
Essentially, I needed a Timer construct that allows me to fire a resettable one-shot timer and give it a callback that can perform some operation on the entity creating the timer, which in this case is a node of the tree.
After looking through the various options, I couldn't find anything that I could natively use (like some django app). The Timer object in Python is not suitable for this since it won't scale/perform. Thus I decided to write my own timer based on:
A sorted list of time-delta objects that holds the time-horizon
A mechanism to trigger the "tick"
Implementation Choices:
Went with a wrapper around Bisect for the sorted delta list:
http://code.activestate.com/recipes/577197-sortedcollection/
Went with celery to provide the tick - A granularity of 1 minute, where the worker would trigger the timer_tick function provided by my Timer class.
The timer_tick essentially should go through the sorted list, decrementing the head node every tick. Then if any nodes have ticked down to 0, kick off the callback and remove those nodes from the sorted timer list.
Creating a timer involves instantiating a Timer object which returns the id of the object. This id is stored in db and associated with an entry in DB that represents the entity creating the timer
Additional Data Structures:
In order to track the Timer instances (which get instantiated for each timer creation) I have a WeakRef Dictionary that maps the id to obj
So essentially, I have 2 data-structures in memory of my main Django project.
Problem Statement:
Since the celery worker needs to walk the timer list and also potentially modify the id2obj map, looks like I need to find a way to share state between my celery worker and main
Going through SO/Google, I find the following suggestions
Manager
Shared Memory
Unfortunately, the bisect wrapper doesn't lend itself very well to pickling and/or state sharing. I tried the Manager approach by creating a dict and trying to embed the sorted list within the dict. It came out with an error (kind of expected, I guess, since the memory held by the sorted list is not shared, and embedding it within a "shared" memory object will not work).
Finally...Question:
Is there a way I can share my SortedCollection and WeakRef dict with the worker thread?
Alternate solution:
How about keeping the worker thread simple...having it write to DB for every tick and then using a post Db signal to get notified on the main and execute the processing of expired timers in the main. Of course, the con is that I lose parallelisation.

Let's start with some comments on your existing implementation:
Went with a wrapper around Bisect for the sorted delta list: http://code.activestate.com/recipes/577197-sortedcollection/
While this gives you O(1) pops (as long as you keep the list in reverse time order), it makes each insert O(N) (and likewise for less common operations like deleting arbitrary jobs if you have a "cancel" API). Since you're doing exactly as many inserts as pops, this means the whole thing is algorithmically no better than an unsorted list.
Replacing this with a heapq (that's exactly what they're for) gives you O(log N) inserts. (Note that Python's heapq doesn't have a peek, but that's because heap[0] is equivalent to heap.peek(0), so you don't need it.)
If you need to make other operations (cancel, iterate non-destructively, etc.) O(log N) as well, you want a search tree; look at blist and bintrees on PyPI for some good ones.
Went with celery to provide the tick - A granularity of 1 minute, where the worker would trigger the timer_tick function provided by my Timer class. The timer_tick essentially should go through the sorted list, decrementing the head node every tick. Then if any nodes have ticked down to 0, kick off the callback and remove those nodes from the sorted timer list.
It's much nicer to just keep the target times instead of the deltas. With target times, you just have to do this:
while q.peek().timestamp <= now():
    process(q.pop())
Again, that's O(1) rather than O(N), and it's a lot simpler, and it treats the elements on the queue as immutable, and it avoids any possible problems with iterations taking longer than your tick time (probably not a problem with 1-minute ticks…).
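Putting the heapq and target-time suggestions together, a minimal sketch might look like this. TimerQueue, add and pop_due are made-up names, and deadlines are plain numbers here for simplicity.

```python
import heapq
import itertools


class TimerQueue:
    """Min-heap of (deadline, timer_id, callback) entries."""

    def __init__(self):
        self._heap = []
        self._ids = itertools.count()

    def add(self, deadline, callback):
        # The unique timer_id breaks ties, so callbacks are never compared.
        timer_id = next(self._ids)
        heapq.heappush(self._heap, (deadline, timer_id, callback))
        return timer_id

    def pop_due(self, now):
        # heap[0] is the earliest deadline; stop at the first future one.
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, timer_id, callback = heapq.heappop(self._heap)
            due.append((timer_id, callback))
        return due
```

A tick then only calls pop_due(current_time) and runs whatever comes back; entries that are not yet due are never touched.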
Now, on to your main question:
Is there a way I can share my SortedCollection
Yes. If you just want a priority heap of (timestamp, id) pairs, you can fit that into a multiprocessing.Array just as easily as a list, except for the need to keep track of length explicitly. Then you just need to synchronize every operation, and… that's it.
If you're only ticking once/minute, and you expect to be busy more often than not, you can just use a Lock to synchronize, and have the schedule-worker(s) tick itself.
But honestly, I'd drop the ticks completely and just use a Condition—it's more flexible, and conceptually simpler (even if it's a bit more code), and it means you're using 0% CPU when there's no work to be done and responding quickly and smoothly when you're under load. For example:
def schedule_job(timestamp, job):
    job_id = add_job_to_shared_dict(job) # see below
    with scheduler_condition:
        # store the id, not the job: the worker looks the job up by top[1]
        scheduler_heap.push((timestamp, job_id))
        scheduler_condition.notify_all()

def scheduler_worker_run_once():
    with scheduler_condition:
        while True:
            top = scheduler_heap.peek()
            if top is not None:
                delay = top[0] - now()
                if delay <= 0:
                    break
                scheduler_condition.wait(delay)
            else:
                scheduler_condition.wait()
        top = scheduler_heap.pop()
    if top is not None:
        job = pop_job_from_shared_dict(top[1])
        process_job(job)
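For reference, here is a self-contained, single-process version of the same pattern using threading.Condition and heapq. It is a sketch under assumptions: the Scheduler class and its method names are invented, jobs are plain callables, and sharing across processes would need multiprocessing primitives and shared storage in place of the in-memory list.

```python
import heapq
import threading
import time


class Scheduler:
    def __init__(self):
        self._cond = threading.Condition()
        self._heap = []  # entries are (deadline, sequence, job)
        self._seq = 0    # tie-breaker so jobs themselves are never compared

    def schedule(self, deadline, job):
        with self._cond:
            heapq.heappush(self._heap, (deadline, self._seq, job))
            self._seq += 1
            # Wake the worker: the new job may be due before the one it waits on.
            self._cond.notify_all()

    def run_once(self):
        """Block until the earliest job is due, then run it."""
        with self._cond:
            while True:
                if self._heap:
                    delay = self._heap[0][0] - time.monotonic()
                    if delay <= 0:
                        break
                    self._cond.wait(delay)
                else:
                    self._cond.wait()
            _, _, job = heapq.heappop(self._heap)
        return job()  # run the job outside the lock
```

A worker thread would simply call run_once() in a loop; deadlines are expressed on the time.monotonic() clock.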
Anyway, that brings us to the weakdict full of jobs.
Since a weakdict is explicitly storing references to in-process objects, it doesn't make any sense to share it across processes. What you want to store are immutable objects that define what the jobs actually are, not the mutable jobs themselves. Then it's just a plain old dict.
But still, a plain old dict is not an easy thing to share across processes.
The easy way to do that is to use a dbm database (or a shelve wrapper around one) instead of an in-memory dict, synchronized with a Lock. But this means re-flushing and re-opening the database every time anyone wants to change it, which may be unacceptable.
Switching to, say, a sqlite3 database may seem like overkill, but it may be a whole lot simpler.
On the other hand… the only operations you actually have here are "map the next id to this job and return the id" and "pop and return the job specified by this id". Does that really need to be a dict? The keys are integers, and you control them. An Array, plus a single Value for the next key, and a Lock, and you're almost done. The problem is that you need some kind of scheme for key overflow. Instead of just next_id += 1, you have to roll over, and check for already-used slots:
with lock:
    next_id += 1
    if next_id == size: next_id = 0
    if arr[next_id] is None:
        arr[next_id] = job
        return next_id
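A fuller sketch of that slot scheme, kept in-process for simplicity. SlotTable is a made-up name, a plain list and threading.Lock stand in for the shared Array, Value and Lock, and jobs are assumed never to be None. A real multiprocessing.Array holds fixed-size C values, so jobs would need a fixed-size encoding there.

```python
import threading


class SlotTable:
    def __init__(self, size):
        self._lock = threading.Lock()
        self._slots = [None] * size  # None marks a free slot
        self._next_id = -1

    def add(self, job):
        """Store job in the next free slot and return its id."""
        with self._lock:
            size = len(self._slots)
            # Probe each slot at most once, rolling over at the end.
            for _ in range(size):
                self._next_id = (self._next_id + 1) % size
                if self._slots[self._next_id] is None:
                    self._slots[self._next_id] = job
                    return self._next_id
            raise RuntimeError("no free slot")

    def pop(self, job_id):
        """Remove and return the job stored under job_id."""
        with self._lock:
            job, self._slots[job_id] = self._slots[job_id], None
            return job
```

Bounding the probe to one full pass avoids spinning forever when every slot is in use.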
Another option is to just store the dict in the main process, and use a Queue to make other processes query it.

Clean way to get near-LIFO behavior from multiprocessing.Queue? (or even just not near-FIFO)

Does anyone know a clean way to get near-LIFO or even not near-FIFO (e.g. random) behavior from multiprocessing.Queue?
Alternative Question: Could someone point me to the code for the thread that manages the actual storage structure behind multiprocessing.Queue? It seems like it would be trivial within that to provide approximately LIFO access, but I got lost in the rabbit hole trying to find it.
Notes:
I believe multiprocessing.Queue does not guarantee order. Fine. But it is near-FIFO so near-LIFO would be great.
I could pull all the current items off the queue and reverse the order before working with them, but I prefer to avoid a kludge if possible.
(edit) To clarify: I am doing a CPU bound simulation with multiprocessing and so can't use the specialized queues from Queue. Since I haven't seen any answers for a few days, I've added the alternative question above.
In case it is an issue, below is slight evidence that multiprocessing.Queue is near-FIFO. It just shows that in a simple case (a single thread), it is perfectly FIFO on my system:
import multiprocessing as mp
import Queue

q = mp.Queue()
for i in xrange(1000):
    q.put(i)

deltas = []
while True:
    try:
        value1 = q.get(timeout=0.1)
        value2 = q.get(timeout=0.1)
        deltas.append(value2-value1)
    except Queue.Empty:
        break

#positive deltas would indicate the numbers are coming out in increasing order
min_delta, max_delta = min(deltas), max(deltas)
avg_delta = sum(deltas)/len(deltas)
print "min", min_delta
print "max", max_delta
print "avg", avg_delta
prints: min, max, and average are exactly 1 (perfect FIFO)

I've looked over the Queue class that lives in Lib/multiprocessing/queues.py in my Python installation (Python 2.7, but nothing obvious is different in the version from Python 3.2 that I briefly checked). Here's how I understand it works:
There are two sets of objects that are maintained by the Queue object. One set consists of multiprocess-safe primitives that are shared by all processes. The others are created and used separately by each process.
The cross-process objects are set up in the __init__ method:
A Pipe object, whose two ends are saved as self._reader and self._writer.
A BoundedSemaphore object, which counts (and optionally limits) how many objects are in the queue.
A Lock object for reading the Pipe, and on non-Windows platforms another for writing. (I assume that this is because writing to a pipe is inherently multiprocess-safe on Windows.)
The per-process objects are set up in the _after_fork and _start_thread methods:
A collections.deque object used to buffer writes to the Pipe.
A threading.Condition object used to signal when the buffer is not empty.
A threading.Thread object that does the actual writing. It is created lazily, so it won't exist until at least one write to the Queue has been requested in a given process.
Various Finalize objects that clean stuff up when the process ends.
A get from the queue is pretty simple. You acquire the read lock, grab an object from the read end of the Pipe, and release the semaphore to free up a slot.
A put is more complicated. It uses multiple threads. The caller to put first acquires the semaphore (blocking if the queue is full), then grabs the condition's lock, starts up the writer thread if it isn't running yet, adds its object to the buffer and signals the condition before unlocking it.
The writer thread loops forever (until canceled) in the _feed method. If the buffer is empty, it waits on the notempty condition. Then it takes an item from the buffer, acquires the write lock (if it exists) and writes the item to the Pipe.
So, given all of that, can you modify it to get a LIFO queue? It doesn't seem easy. Pipes are inherently FIFO objects, and while the Queue can't guarantee FIFO behavior overall (due to the asynchronous nature of the writes from multiple processes) it is always going to be mostly FIFO.
If you have only a single consumer, you could get objects from the queue and add them to your own process-local stack. It would be harder to do a multi-consumer stack, though with shared memory a bounded-size stack wouldn't be too hard. You'd need a lock, a pair of conditions (for blocking/signaling on full and empty states), a shared integer value (for the number of values held) and a shared array of an appropriate type (for the values themselves).
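The bounded shared-memory stack described above could be sketched roughly as follows. SharedStack is a made-up name, the element type is assumed to be a C int, and the sketch has only been reasoned through, not benchmarked. The object has to be created before the worker processes are started so that they inherit it.

```python
import multiprocessing as mp


class SharedStack:
    """Bounded LIFO stack of C ints living in shared memory."""

    def __init__(self, capacity, typecode="i"):
        self._capacity = capacity
        self._lock = mp.Lock()
        # Two conditions on the same lock: one per blocking state.
        self._not_empty = mp.Condition(self._lock)
        self._not_full = mp.Condition(self._lock)
        # Raw shared values; all access is guarded by self._lock.
        self._count = mp.Value("i", 0, lock=False)
        self._items = mp.Array(typecode, capacity, lock=False)

    def push(self, value):
        with self._not_full:
            while self._count.value == self._capacity:
                self._not_full.wait()  # block while the stack is full
            self._items[self._count.value] = value
            self._count.value += 1
            self._not_empty.notify()

    def pop(self):
        with self._not_empty:
            while self._count.value == 0:
                self._not_empty.wait()  # block while the stack is empty
            self._count.value -= 1
            value = self._items[self._count.value]
            self._not_full.notify()
            return value
```

Arbitrary Python objects would not fit this layout; they would have to be encoded into fixed-size records first.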

There is a LIFO queue in the Queue package (queue in Python 3). This isn't exposed in the multiprocessing or multiprocessing.queues modules.
Replacing your line q = mp.Queue() with q = Queue.LifoQueue() and running prints: min, max and average as exactly -1.
(Also I think you should always get exactly FIFO/LIFO order when getting items from only one thread.)
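For reference, a Python 3 port of the question's test using queue.LifoQueue might look like the sketch below. Keep in mind that queue.LifoQueue is shared between threads of one process only, not between processes.

```python
import queue

q = queue.LifoQueue()
for i in range(1000):
    q.put(i)

deltas = []
while True:
    try:
        value1 = q.get_nowait()
        value2 = q.get_nowait()
    except queue.Empty:
        break
    deltas.append(value2 - value1)

# Negative deltas indicate the numbers are coming out in decreasing order.
print("min", min(deltas))
print("max", max(deltas))
print("avg", sum(deltas) / len(deltas))
```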
