General Overview
I have a medium-size Django project.
I have a bunch of prefix trees in memory (as opposed to in the DB).
The nodes of these trees represent entities/objects that are subject to a timeout, i.e. I need to time out these nodes at various points in time.
Design:
Essentially, I needed a Timer construct that lets me fire a resettable one-shot timer and associate a callback with it that can perform some operation on the entity that created the timer, which in this case is a node of the tree.
After looking through the various options, I couldn't find anything I could use natively (like some Django app). Python's Timer object is not suitable for this since it won't scale/perform. So I decided to write my own timer based on:
A sorted list of time-delta objects that holds the time-horizon
A mechanism to trigger the "tick"
Implementation Choices:
Went with a wrapper around Bisect for the sorted delta list:
http://code.activestate.com/recipes/577197-sortedcollection/
Went with Celery to provide the tick, at a granularity of 1 minute, where the worker triggers the timer_tick function provided by my Timer class.
timer_tick essentially walks the sorted list, decrementing the head node on every tick. If any nodes have ticked down to 0, it kicks off their callbacks and removes those nodes from the sorted timer list.
Creating a timer involves instantiating a Timer object, which returns the id of the object. This id is stored in the DB and associated with the DB entry that represents the entity creating the timer.
Additional Data Structures:
In order to track the Timer instances (one is instantiated for each timer creation), I have a WeakRef dictionary that maps id to object.
So essentially, I have two data structures in the memory of my main Django project.
Problem Statement:
Since the Celery worker needs to walk the timer list and also potentially modify the id2obj map, it looks like I need to find a way to share state between my Celery worker and the main process.
Going through SO/Google, I found the following suggestions:
Manager
Shared Memory
Unfortunately, the bisect wrapper doesn't lend itself very well to pickling and/or state sharing. I tried the Manager approach by creating a dict and embedding the sorted list within that dict; it came back with an error (kind of expected, I guess, since the memory held by the sorted list is not shared, and embedding it within a "shared" memory object will not work).
Finally...Question:
Is there a way I can share my SortedCollection and WeakRef dict with the worker thread?
Alternate solution:
How about keeping the worker thread simple: have it write to the DB on every tick, then use a post-save DB signal to get notified in the main process and execute the processing of expired timers there. Of course, the con is that I lose parallelisation.
Let's start with some comments on your existing implementation:
Went with a wrapper around Bisect for the sorted delta list: http://code.activestate.com/recipes/577197-sortedcollection/
While this gives you O(1) pops (as long as you keep the list in reverse time order), it makes each insert O(N) (and likewise for less common operations like deleting arbitrary jobs if you have a "cancel" API). Since you're doing exactly as many inserts as pops, this means the whole thing is algorithmically no better than an unsorted list.
Replacing this with a heapq (that's exactly what they're for) gives you O(log N) inserts. (Note that Python's heapq doesn't have a peek method, but that's because heap[0] already is the peek, so you don't need one.)
If you need to make other operations (cancel, iterate non-destructively, etc.) O(log N) as well, you want a search tree; look at blist and bintrees on PyPI for some good ones.
Went with Celery to provide the tick, at a granularity of 1 minute, where the worker triggers the timer_tick function provided by my Timer class. timer_tick essentially walks the sorted list, decrementing the head node on every tick. If any nodes have ticked down to 0, it kicks off their callbacks and removes those nodes from the sorted timer list.
It's much nicer to just keep the target times instead of the deltas. With target times, you just have to do this:
while q.peek().timestamp <= now():
    process(q.pop())
Again, that's O(1) rather than O(N), and it's a lot simpler, and it treats the elements on the queue as immutable, and it avoids any possible problems with iterations taking longer than your tick time (probably not a problem with 1-minute ticks…).
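For concreteness, a minimal sketch of that combination: a heapq of (target_timestamp, timer_id) pairs. The names here are illustrative, not part of your existing Timer class:

import heapq
import time

timer_heap = []  # (target_timestamp, timer_id) pairs

def schedule(delay_seconds, timer_id):
    # O(log N) insert of an absolute target time
    heapq.heappush(timer_heap, (time.time() + delay_seconds, timer_id))

def pop_expired():
    # timer_heap[0] is the O(1) "peek"; pop everything that is already due
    now = time.time()
    while timer_heap and timer_heap[0][0] <= now:
        yield heapq.heappop(timer_heap)[1]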
Now, on to your main question:
Is there a way I can share my SortedCollection
Yes. If you just want a priority heap of (timestamp, id) pairs, you can fit that into a multiprocessing.Array just as easily as a list, except for the need to keep track of length explicitly. Then you just need to synchronize every operation, and… that's it.
If you're only ticking once per minute, and you expect to be busy more often than not, you can just use a Lock to synchronize, and have the scheduler worker(s) tick themselves.
But honestly, I'd drop the ticks completely and just use a Condition—it's more flexible, and conceptually simpler (even if it's a bit more code), and it means you're using 0% CPU when there's no work to be done and responding quickly and smoothly when you're under load. For example:
def schedule_job(timestamp, job):
    job_id = add_job_to_shared_dict(job)  # see below
    with scheduler_condition:
        scheduler_heap.push((timestamp, job_id))
        scheduler_condition.notify_all()

def scheduler_worker_run_once():
    with scheduler_condition:
        while True:
            top = scheduler_heap.peek()
            if top is not None:
                delay = top[0] - now()
                if delay <= 0:
                    break
                scheduler_condition.wait(delay)
            else:
                scheduler_condition.wait()
        top = scheduler_heap.pop()
    if top is not None:
        job = pop_job_from_shared_dict(top[1])
        process_job(job)
Anyway, that brings us to the weakdict full of jobs.
Since a weakdict is explicitly storing references to in-process objects, it doesn't make any sense to share it across processes. What you want to store are immutable objects that define what the jobs actually are, not the mutable jobs themselves. Then it's just a plain old dict.
But still, a plain old dict is not an easy thing to share across processes.
The easy way to do that is to use a dbm database (or a shelve wrapper around one) instead of an in-memory dict, synchronized with a Lock. But this means re-flushing and re-opening the database every time anyone wants to change it, which may be unacceptable.
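For example, a minimal sketch of that shelve-plus-Lock variant; the file name and helper names are made up, and the Lock has to be created before the worker processes are started so they all share it:

import shelve
from multiprocessing import Lock

shelf_lock = Lock()  # created before forking/starting workers

def put_job(job_id, job):
    with shelf_lock:
        db = shelve.open('jobs.shelf')  # re-opened (and re-flushed) on every change
        try:
            db[str(job_id)] = job
        finally:
            db.close()

def pop_job(job_id):
    with shelf_lock:
        db = shelve.open('jobs.shelf')
        try:
            return db.pop(str(job_id))
        finally:
            db.close()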
Switching to, say, a sqlite3 database may seem like overkill, but it may be a whole lot simpler.
On the other hand… the only operations you actually have here are "map the next id to this job and return the id" and "pop and return the job specified by this id". Does that really need to be a dict? The keys are integers, and you control them. An Array, plus a single Value for the next key, and a Lock, and you're almost done. The problem is that you need some kind of scheme for key overflow. Instead of just next_id += 1, you have to roll over, and check for already-used slots:
with lock:
    next_id += 1
    if next_id == size: next_id = 0
    if arr[next_id] is None:
        arr[next_id] = job
        return next_id
Another option is to just store the dict in the main process, and use a Queue to make other processes query it.
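A rough sketch of that owner-process pattern; the message format, queue count, and names are all made up, and the queues are created up front and passed to the worker processes when they are started:

from multiprocessing import Queue

# created before the workers are started, and passed to them
requests = Queue()
replies = [Queue() for _ in range(4)]  # one private reply queue per worker

def dict_owner(requests, replies):
    # runs in the main/owner process; the jobs dict lives only here
    jobs = {}
    for msg in iter(requests.get, None):  # a None message shuts it down
        if msg[0] == 'put':               # ('put', job_id, job)
            jobs[msg[1]] = msg[2]
        else:                             # ('pop', worker_idx, job_id)
            replies[msg[1]].put(jobs.pop(msg[2], None))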
Related
TL;DR: How do I share a large (200 MB), read-only dict that is accessed VERY heavily between multiple processes in a performant way, without each process having a full copy in memory?
EDIT: It looks like if I just pass the dictionary as the argument to multiprocessing.Pool/Process, it won't actually create a copy unless a worker modifies the dictionary. I just assumed it would copy. This behavior seems to be Unix-only, where fork is available, and even then not always. But if so, it should solve my problem until this is converted to an ETL job.
What I'm trying to do:
I have a task to improve a script that replicates data from one store to another, normalizing and transforming the data on the way. This task works at the scale of around 100 million documents coming from the source document store, which get rolled up and pushed to a destination document store.
Each document has an ID, and there is another document store that is essentially a key-value store mapping those IDs to some additional information needed for this task. This store is a lot smaller, and querying it while documents from the main store come through is not really an option without heavy caching, and that heavy cache quickly ends up being a copy of the whole thing. So I just build the whole dictionary from that entire store at the beginning, before starting anything, and use that. That dictionary is around ~200 MB in size. Note that this dictionary is only ever read from.
For this I have set up multiprocessing with around 30 concurrent processes. I've divided the work so that each process hits different indices, and they can do the whole thing in around 4 hours.
I have noticed that I am extremely CPU-bound when doing the following two things:
Using a thread pool/threads (what I'm currently doing) so each thread can access the dict without issue. The GIL is killing me: I have one process maxing out at 100% all the time with the other CPUs sitting idle. Switching to PyPy helped a lot, but I'm still not happy with this approach.
Creating a multiprocessing.Manager().dict() for the large dict and having the child processes access it through that. The server process that this approach creates is constantly at 100% CPU. I don't know why; I only ever read from this dictionary, so I doubt it's a locking thing. I don't know how the Manager works internally, but I'm guessing that the child processes connect via pipes/sockets for each fetch, and the overhead of this is massive. If true, this also suggests that using Redis/Memcache would have the same problem. Maybe it can be configured better?
I am memory-bound when doing these things:
Using a SharedMemory view. You can't seem to do this for dicts the way I need to. I can serialize the dict to get it into the shared view, but for it to be usable in the child process you need to deserialize the data back into an actual usable dict, which creates the copy in that process.
I strongly suspect that, unless I've missed something, I'm just going to have to "download more RAM" or rewrite this from Python into something without a GIL (or use ETL like it should be done in...).
In the case of RAM, what is the most efficient way to store a dict like this to make it sting less? It's currently a standard dict mapping IDs to a tuple of the extra information, consisting of 3 longs/floats.
doc_to_docinfo = {
    "ID1": (5.2, 3.0, 455),
}
Are there any more efficient hashmap implementations for this use case than what I'm doing?
You seem to have a problem similar to mine. It is possible to use my source here to create a partitioning of those dictionary keys per thread. My suggestion: split the document IDs into partitions of length 3 or 4, keep the partition table in sync across all processes/threads, and then just move the relevant parts of your documents to each process/thread; as an entry point, the process does a dictionary lookup and finds out which process can handle that part of the dictionary. If you are clever about balancing the partitions, you can also keep the number of documents managed per thread roughly equal.
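One way to read that suggestion, treating the "length 3 or 4" as an ID-prefix length (purely illustrative; crc32 is used only to get a routing value that is stable across processes):

import zlib

def worker_for(doc_id, num_workers, prefix_len=4):
    # route a document by a short prefix of its ID, so each worker only needs
    # the slice of the ID -> info dict that covers its own prefixes
    return zlib.crc32(doc_id[:prefix_len].encode('utf-8')) % num_workers

def split_docinfo(doc_to_docinfo, num_workers, prefix_len=4):
    partitions = [{} for _ in range(num_workers)]
    for doc_id, info in doc_to_docinfo.items():
        partitions[worker_for(doc_id, num_workers, prefix_len)][doc_id] = info
    return partitions  # pass partitions[i] to worker i only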
When I use the dictionary.get() function, does it lock the whole dictionary? I am developing a multiprocess, multithreaded program. The dictionary acts as a state table to keep track of data. I have to impose a size limit on the dictionary, so whenever the limit is hit, I have to do garbage collection on the table, based on timestamps. The current implementation delays add operations while garbage collection is iterating through the whole table.
I will have two or more threads, one just to add data and one just to do garbage collection. Performance is critical in my program, which handles streaming data: whenever it receives a message, it has to look for it in the state table, add the record if it doesn't exist yet, or otherwise copy certain information and send it along the pipe.
I have thought of using multiprocessing to do the search and add operations concurrently, but with processes I would have to make a copy of the state table for each process, and in that case the synchronization overhead is too high. I also read that multiprocessing.Manager().dict() locks access for each CRUD operation. I cannot spare that overhead, so my current approach uses threading.
So my question is: while one thread is doing a .get() or del dict['key'] operation on the table, will the other (inserting) thread be blocked from accessing it?
Note: I have read through most of SO's Python-dictionary-related posts, but I cannot seem to find the answer. Most people only answer that even though Python dictionary operations are atomic, it is safer to use a Lock for insertion/update. I'm handling a huge amount of streaming data, so locking every time is not ideal for me. Please advise if there is a better approach.
If the process of hashing or comparing the keys in your dictionary could invoke arbitrary Python code (basically, if the keys aren't all Python built-in types implemented in C, e.g. str, int, float, etc.), then yes, it would be possible for a race condition to occur in which the GIL is released while a bucket collision is being resolved (during the equality test), and another thread could leap in and cause the object being compared against to disappear from the dict. They try to ensure it doesn't actually crash the interpreter, but it has been a source of errors in the past.
If that's a possibility (or you're on a non-CPython interpreter, where there is no GIL providing basic guarantees like this), then you should really use a lock to coordinate access. On CPython, as long as you're on modern Python 3, the cost will be fairly low; contention on the lock should be fairly low since the GIL ensures only one thread is actually running at once; most of the time your lock should be uncontended (because the contention is on the GIL), so the incremental cost of using it should be fairly small.
A note: you might consider using collections.OrderedDict to simplify the process of limiting the size of your table. With OrderedDict, you can implement the size limit as a strict LRU (least-recently-used) system by doing additions to the table as:
with lock:
    try:
        try:
            odict.move_to_end(key)  # If key already existed, make sure it's "renewed"
        finally:
            odict[key] = value  # set new value whether or not key already existed
    except KeyError:
        # move_to_end raising KeyError means newly added key, so we might
        # have grown larger than limit
        if len(odict) > maxsize:
            odict.popitem(False)  # Pops oldest item
and usage done as:
with lock:
    # move_to_end optional; if using key means it should live longer, then do it
    # if only setting key should refresh it, omit move_to_end
    odict.move_to_end(key)
    return odict[key]
This does need a lock, but it also reduces the work for garbage collection when it grows too large from "check every key" (O(n) work) to "pop the oldest item off without looking at anything else" (O(1) work).
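For completeness, the names the snippets above assume might be set up roughly like this (the size limit is just an example value):

import threading
from collections import OrderedDict

lock = threading.Lock()
odict = OrderedDict()  # the state table, kept in recency order
maxsize = 100000       # whatever cap fits your memory budget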
A lock is used to avoid race conditions, so that no two threads can change the dict at the same time. It is advisable that you use the lock; otherwise you might run into a race condition that causes the program to fail. A mutex lock can be used to coordinate the two threads.
Is it possible to "pipeline" consumption of a generator across multiple consumers?
For example, it's common to have code with this pattern:
def consumer1(iterator):
    for item in iterator:
        foo(item)

def consumer2(iterator):
    for item in iterator:
        bar(item)

myiter = list(big_generator())
v1 = consumer1(myiter)
v2 = consumer2(myiter)
In this case, multiple functions completely consume the same iterator, making it necessary to cache the iterator in a list. Since each consumer exhausts the iterator, itertools.tee is useless.
I see code like this a lot and I always wish I could get the consumers to consume one item at a time in order instead of caching the entire iterator. E.g.:
consumer1 consumes myiter[0]
consumer2 consumes myiter[0]
consumer1 consumes myiter[1]
consumer2 consumes myiter[1]
etc...
If I were to make up a syntax, it would look like this:
c1_retval, c2_retval = iforkjoin(big_generator(), (consumer1, consumer2))
You can get close with threads or multiprocessing and teed iterators, but threads consume at different speeds meaning that the value deque cached inside tee could get very large. The point here is not to exploit parallelism or to speed up tasks but to avoid caching large sections of the iterator.
It seems to me that this might be impossible without modifying the consumers because the flow of control is in the consumer. However, when a consumer actually consumes the iterator control passes into the iterator's next() method, so maybe it is possible to invert the flow of control somehow so that the iterator blocks the consumers one at a time until it can feed them all?
If this is possible, I'm not clever enough to see how. Any ideas?
With the limitation of not changing consumers' code (i.e. having a loop in them), you're left with only two options:
the approach you already include in your question: caching the generated items in memory, then iterating over them multiple times.
running each consumer in a thread and implementing some kind of synchronized itertools.tee, one with a buffer of size 1, which blocks serving item i+1 until item i has been served to all consumers (a rough sketch follows below).
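Here is a rough sketch of that second option, using one size-1 queue per consumer so that at most one item per consumer is ever buffered. iforkjoin is the made-up name from the question, not a drop-in library, and the sketch assumes each consumer fully drains the iterator it is handed:

import threading
from queue import Queue  # `from Queue import Queue` on Python 2

_DONE = object()  # sentinel marking the end of the stream

def _as_iterator(q):
    # presents a per-consumer queue as a plain iterator, so the consumers'
    # own for-loops work unchanged
    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item

def iforkjoin(iterable, consumers):
    queues = [Queue(maxsize=1) for _ in consumers]
    results = [None] * len(consumers)

    def run(i, consumer, q):
        results[i] = consumer(_as_iterator(q))

    threads = [threading.Thread(target=run, args=(i, c, q))
               for i, (c, q) in enumerate(zip(consumers, queues))]
    for t in threads:
        t.start()
    for item in iterable:
        for q in queues:  # blocks until that consumer has taken its previous item
            q.put(item)
    for q in queues:
        q.put(_DONE)
    for t in threads:
        t.join()
    return results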
There are no other options. You can't achieve all of the below, as they contradict each other:
having a generator
having a loop to consume all of it
then, (serially-)after the previous loop has finished, having another loop to consume all of it again
only keeping O(1) items in memory (or disk, etc.) while consuming them
not regenerating (i.e. not re-creating the generator)
The generated items must be stored somewhere if you want to reuse them.
If changing the consumers' code is acceptable, clearly #monkey's solution is the simplest and most straightforward.
Doesn't this work? Or does each consumer require the entire iterator at once, so that handing out one item at a time like this won't work? If so, then I think you either have to create a copy or generate the list twice.
for item in big_generator():
    consumer1.handle_item(item)
    consumer2.handle_item(item)
Does anyone know a clean way to get near-LIFO or even not near-FIFO (e.g. random) behavior from multiprocessing.Queue?
Alternative Question: Could someone point me to the code for the thread that manages the actual storage structure behind multiprocessing.Queue? It seems like it would be trivial within that to provide approximately LIFO access, but I got lost in the rabbit hole trying to find it.
Notes:
I believe multiprocessing.Queue does not guarantee order. Fine. But it is near-FIFO so near-LIFO would be great.
I could pull all the current items off the queue and reverse the order before working with them, but I prefer to avoid a kludge if possible.
(edit) To clarify: I am doing a CPU-bound simulation with multiprocessing, so I can't use the specialized queues from the Queue module. Since I haven't seen any answers for a few days, I've added the alternative question above.
In case it is an issue, below is slight evidence that multiprocessing.Queue is near-FIFO. It just shows that in a simple case (a single thread), it is perfectly FIFO on my system:
import multiprocessing as mp
import Queue

q = mp.Queue()
for i in xrange(1000):
    q.put(i)

deltas = []
while True:
    try:
        value1 = q.get(timeout=0.1)
        value2 = q.get(timeout=0.1)
        deltas.append(value2 - value1)
    except Queue.Empty:
        break

# positive deltas would indicate the numbers are coming out in increasing order
min_delta, max_delta = min(deltas), max(deltas)
avg_delta = sum(deltas) / len(deltas)
print "min", min_delta
print "max", max_delta
print "avg", avg_delta
prints: min, max, and average are exactly 1 (perfect FIFO)
I've looked over the Queue class that lives in Lib/multiprocessing/queues.py in my Python installation (Python 2.7, but nothing obvious is different in the version from Python 3.2 that I briefly checked). Here's how I understand it works:
There are two sets of objects maintained by the Queue object. One set is the multiprocess-safe primitives that are shared by all processes. The others are created and used separately by each process.
The cross-process objects are set up in the __init__ method:
A Pipe object, whose two ends are saved as self._reader and self._writer.
A BoundedSemaphore object, which counts (and optionally limits) how many objects are in the queue.
A Lock object for reading the Pipe, and on non-Windows platforms another for writing. (I assume that this is because writing to a pipe is inherently multiprocess-safe on Windows.)
The per-process objects are set up in the _after_fork and _start_thread methods:
A collections.deque object used to buffer writes to the Pipe.
A threading.Condition object used to signal when the buffer is not empty.
A threading.Thread object that does the actual writing. It is created lazily, so it won't exist until at least one write to the Queue has been requested in a given process.
Various Finalize objects that clean stuff up when the process ends.
A get from the queue is pretty simple. You acquire the read lock, decrement the semaphore, and grab an object from the read end of the Pipe.
A put is more complicated. It uses multiple threads. The caller to put grabs the condition's lock, then adds its object to the buffer and signals the condition before unlocking it. It also increments the semaphore and starts up the writer thread if it isn't running yet.
The writer thread loops forever (until canceled) in the _feed method. If the buffer is empty, it waits on the notempty condition. Then it takes an item from the buffer, acquires the write lock (if it exists) and writes the item to the Pipe.
So, given all of that, can you modify it to get a LIFO queue? It doesn't seem easy. Pipes are inherently FIFO objects, and while the Queue can't guarantee FIFO behavior overall (due to the asynchronous nature of the writes from multiple processes) it is always going to be mostly FIFO.
If you have only a single consumer, you could get objects from the queue and add them to your own process-local stack. It would be harder to do a multi-consumer stack, though with shared memory a bounded-size stack wouldn't be too hard. You'd need a lock, a pair of conditions (for blocking/signaling on full and empty states), a shared integer value (for the number of values held) and a shared array of an appropriate type (for the values themselves).
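As a rough illustration of that last idea, here is a sketch of a bounded, multi-consumer stack built from exactly those primitives. It holds integer values only, all names are made up, and the object has to be created in the parent before the worker processes are started so they all inherit it:

import multiprocessing as mp

class SharedStack(object):
    """Bounded LIFO stack of integers, shared between processes."""

    def __init__(self, maxsize):
        self._lock = mp.Lock()
        self._not_empty = mp.Condition(self._lock)
        self._not_full = mp.Condition(self._lock)
        self._count = mp.Value('i', 0, lock=False)       # number of values held
        self._values = mp.Array('i', maxsize, lock=False)

    def put(self, value):
        with self._lock:
            while self._count.value == len(self._values):
                self._not_full.wait()                     # block while full
            self._values[self._count.value] = value
            self._count.value += 1
            self._not_empty.notify()

    def get(self):
        with self._lock:
            while self._count.value == 0:
                self._not_empty.wait()                    # block while empty
            self._count.value -= 1
            value = self._values[self._count.value]
            self._not_full.notify()
            return value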
There is a LIFO queue in the Queue module (queue in Python 3). This isn't exposed in the multiprocessing or multiprocessing.queues modules.
Replacing your line q = mp.Queue() with q = Queue.LifoQueue() and running prints: min, max and average as exactly -1.
(Also I think you should always get exactly FIFO/LIFO order when getting items from only one thread.)
There is a list of data that I want to process. However, I need to process the data with multiple instances to increase efficiency.
Each time, each instance should take out one item, delete it from the list, and process it with some procedures.
First I tried storing the list in an SQLite database, but SQLite allows multiple read locks, which means multiple instances might get the same item from the database.
Is there any way to make sure each instance gets a unique item to process?
I could use another data structure (another database, or just a file) if needed.
By the way, is there a way to check whether a DELETE operation is successful or not, after executing cursor.execute(delete_query)?
How about another field in the DB as a flag (e.g. PROCESSING, UNPROCESSED, PROCESSED)?
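For example, a claim step along those lines with sqlite3 (the table and column names are assumptions); cursor.rowcount also answers the side question about checking whether an UPDATE or DELETE actually affected a row:

import sqlite3

conn = sqlite3.connect('items.db')

def claim(item_id):
    # flip the flag only if the row is still UNPROCESSED; a single UPDATE is
    # atomic, and rowcount tells you whether it (or a DELETE) changed anything
    cur = conn.execute(
        "UPDATE items SET status = 'PROCESSING' "
        "WHERE id = ? AND status = 'UNPROCESSED'", (item_id,))
    conn.commit()
    return cur.rowcount == 1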
From what I know, you'll need to start up multiple instances of the Python interpreter (or at least multiple executing processes) to get true concurrency with Python, so you could:
make one broker process that tells the others which record they're allowed to take (via something like 0mq, for instance); this could effectively make your broker a bottleneck, though.
section off parts of your database per process, if your data is easily divisible (ascending numeric primary keys, for example).
Note that things like greenlets and tasklets are really executed one after the other; they switch very fast because they don't have the true threading/process overhead, but they are not executed truly concurrently.
The simplest way is to generate the items in a single process and pass them for processing to multiple worker processes e.g.:
from multiprocessing import Pool

def process(item):
    pass  # executed in worker processes

def main():
    p = Pool()  # use all available CPUs
    for result in p.imap_unordered(process, open('items.txt')):
        pass

if __name__ == '__main__':
    main()
Why not read all the items from the database and put them in a queue? You can have a worker get an item, process it, and move on to the next one.
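A sketch of that pattern, shown with multiprocessing worker processes (rather than threads) since the question asks for multiple instances; handle() is a stand-in for the real per-item processing, and each queued item is taken by exactly one worker, so there are no duplicates:

from multiprocessing import Process, Queue

def handle(item):
    pass  # replace with the real per-item processing

def worker(q):
    for item in iter(q.get, None):  # None is a per-worker shutdown sentinel
        handle(item)

def main(items, num_workers=4):
    q = Queue()
    workers = [Process(target=worker, args=(q,)) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for item in items:  # e.g. the rows read from the database up front
        q.put(item)
    for _ in workers:
        q.put(None)
    for w in workers:
        w.join()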