How does Python pass-by-object-reference work with asynchronous code? - python

foo = { 'bar': None }
for i in range(3):
    foo['bar'] = i
    send_to_async_queue(foo)
Can I be certain that the queue processes 0, 1, 2 every time? If not, how do I ensure that it does?

It depends on how send_to_async_queue is implemented.
If you are using something like Celery, the dictionary will be serialized (as JSON, pickle or another method you choose) before it is sent to the queue. So, you are safe.
If you are using threads or another in-memory mechanism to hold the queue and process it, every queued item is a reference to the same dictionary, so a later change to foo is visible to any consumer that has not yet processed the earlier item.
In this case, you can serialize the dictionary yourself before putting it in the queue, or make a copy of it using send_to_async_queue(copy.deepcopy(foo)) (after import copy).
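As a minimal sketch of the difference, the snippet below uses the standard library queue.Queue as a stand-in for an in-memory send_to_async_queue (an assumption, since the question does not say how it is implemented). The helper name enqueue_all is hypothetical.

```python
import copy
import queue

def enqueue_all(make_payload):
    """Enqueue foo three times, then drain the queue and report what consumers would see."""
    q = queue.Queue()  # stand-in for an in-memory async queue
    foo = {'bar': None}
    for i in range(3):
        foo['bar'] = i
        q.put(make_payload(foo))
    # Consumers run only after the loop has finished mutating foo.
    return [q.get()['bar'] for _ in range(3)]

shared = enqueue_all(lambda d: d)    # the same dict object is queued three times
copied = enqueue_all(copy.deepcopy)  # each queued item is an independent snapshot

print(shared)  # [2, 2, 2]
print(copied)  # [0, 1, 2]
```

For a flat dictionary like this one, foo.copy() would be enough; copy.deepcopy also protects nested values.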

Related

Handling endless data stream with multiprocessing and Queues

I want to use the Python 2.7 multiprocessing package to operate on an endless stream of data. A subprocess will constantly receive data via TCP/IP or UDP packets and immediately place the data in a multiprocessing.Queue. However, at certain intervals, say, every 500ms, I only want to operate on a user specified slice of this data. Let's say, the last 200 data packets.
I know I can put() and get() on the Queue, but how can I create that slice of data without a) Backing up the queue and b) Keeping things threadsafe?
I'm thinking I have to constantly get() from the Queue with another subprocess to prevent the Queue from getting full. Then I have to store the data in another data structure (such as a list) to build the user specified slice. But the data structure would probably not be thread safe, so it does not sound like a good solution.
Is there some programming paradigm that achieves what I am trying to do easily? I looked at the multiprocessing.Manager class, but wasn't sure it would work.
You can do this as follows:
Use an instance of the threading.Lock class. Call method acquire to claim exclusive access to your queue from a certain thread and call release to grant other threads access.
Since you want to keep gathering your input, copying the whole queue would probably be too expensive. Probably the fastest way is to first collect data in one queue, then swap it for another and let a different thread read the old one into your application. Protect the swapping with a Lock instance, so you can be sure that whenever the writer acquires the lock, the current 'listener' queue is ready to receive data.
If only recent data is important, use two circular buffers instead of queues, allowing old data to be overwritten.
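The swap idea can be sketched as follows. This is only an illustration for threads inside one process (as the answer describes), not a multiprocessing solution. The class name SwappingBuffer and the maxlen of 200 are assumptions taken from the "last 200 packets" example in the question. collections.deque with maxlen acts as the circular buffer.

```python
import threading
from collections import deque

class SwappingBuffer:
    """The writer appends to the active buffer; the reader swaps in a fresh one."""

    def __init__(self, maxlen=200):
        self._lock = threading.Lock()
        self._maxlen = maxlen
        self._active = deque(maxlen=maxlen)  # circular: old items fall off the left

    def put(self, item):
        with self._lock:
            self._active.append(item)

    def swap(self):
        """Return the collected items and start collecting into an empty buffer."""
        with self._lock:
            old, self._active = self._active, deque(maxlen=self._maxlen)
        return list(old)

buf = SwappingBuffer(maxlen=200)
for packet in range(500):   # stand-in for packets arriving from the network
    buf.put(packet)
snapshot = buf.swap()       # the most recent 200 packets: 300 .. 499
```

The lock is held only for an append or for the pointer swap, so the reader never blocks the writer for long.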

Python multiprocessing - function-like communication between two processes

I've got the following problem:
I have two different classes; let's call them the interface and worker. The interface is supposed to accept requests from outside and multiplex them to several workers.
Contrary to almost every example I have found, I have several peculiarities:
The workers are not supposed to be recreated for every request.
The workers are different; a request for workers[0] cannot be answered by workers[1]. This multiplexing is done in interface.
I have a number of function-like calls which are difficult to model via events or simple queues.
There are a few different requests, which would make one queue per request difficult.
For example, assume that each worker is storing a single integer number (let's say the number of calls this worker received). In non-parallel processing, I'd use something like this:
class interface(object):
    workers = None  # set somewhere else

    def get_worker_calls(self, worker_id):
        return self.workers[worker_id].get_calls()

class worker(object):
    calls = 0

    def get_calls(self):
        self.calls += 1
        return self.calls
This, obviously, doesn't work. What does?
Or, maybe more relevantly, I don't have experience with multiprocessing. Is there a design paradigm I'm missing that would easily solve the above?
Thanks!
For reference, I have considered several approaches, and I was unable to find a good one:
Use one request and answer queue. I've discarded this idea since that would either block interface for the answer time of the current worker (making it scale badly), or would require me to send around extra information.
Use of one request queue. Each message contains a pipe to return the answer to that request. After fixing the issue with being unable to send pipes via pipes, I've run into problems with pipe closing unless sending both ends over the connection.
Use of one request queue. Each message contains a queue to return the answer to that request. Fails since I cannot send queues via queues, but the reduction trick doesn't work.
The above also applies to the respective Manager-generated objects.
Multiprocessing means you have two or more separate processes running. Unlike with multithreading, one process cannot directly access the memory of another.
Your best shot is to use some kind of external queue mechanism; you can start with Celery or RQ. RQ is simpler, but Celery has built-in monitoring.
But you have to know that multiprocessing will work only if Celery/RQ are able to "pack" the needed functions/classes and send them to the other process. Therefore you have to use module-level functions (defined at the top level of a file, not belonging to any class).
You can always implement it yourself: Redis is very simple, and ZeroMQ and RabbitMQ are also good.
The Beaver library is a good example of how to deal with multiprocessing in Python using a ZeroMQ queue.
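If an external broker is more than you need, function-like calls to long-lived workers can also be sketched with nothing but the standard library: one multiprocessing Pipe per worker, created before the worker starts, so no pipe or queue ever has to be sent through another one. This is a swapped-in technique, not what the answer above proposes. The names worker_loop and Interface and the string requests are hypothetical, and the "fork" start method is an assumption that limits the sketch to POSIX systems (with "spawn" the demo lines would need an if __name__ == '__main__': guard).

```python
import multiprocessing as mp

def worker_loop(conn):
    """Runs in the child process: answer requests until told to stop."""
    calls = 0  # state that lives only inside this worker process
    while True:
        request = conn.recv()
        if request == "stop":
            break
        if request == "get_calls":
            calls += 1
            conn.send(calls)
    conn.close()

class Interface:
    def __init__(self, n_workers):
        ctx = mp.get_context("fork")  # assumption: POSIX; see the note above
        self._conns = []
        self._procs = []
        for _ in range(n_workers):
            parent_conn, child_conn = ctx.Pipe()
            proc = ctx.Process(target=worker_loop, args=(child_conn,))
            proc.start()
            child_conn.close()  # the parent keeps only its own end
            self._conns.append(parent_conn)
            self._procs.append(proc)

    def get_worker_calls(self, worker_id):
        """Blocking, function-like call to one specific worker."""
        conn = self._conns[worker_id]
        conn.send("get_calls")
        return conn.recv()

    def close(self):
        for conn, proc in zip(self._conns, self._procs):
            conn.send("stop")
            proc.join()
            conn.close()

iface = Interface(n_workers=2)
answers = [iface.get_worker_calls(0), iface.get_worker_calls(0), iface.get_worker_calls(1)]
iface.close()
print(answers)  # [1, 2, 1]
```

Each call blocks only on the worker it addresses, so a slow workers[0] does not delay a request to workers[1] as long as the interface issues the calls from separate threads.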

Concurrency on sqlite database using python

There is a list of data that I want to deal with. However I need to process the data with multiple instances to increase efficiency.
Each time each instance shall take out one item, delete it from the list and process it with some procedures.
First I tried to store the list in a sqlite database, but sqlite allows multiple read-locks which means multiple instances might get the same item from the database.
Is there any way to make sure each instance gets a unique item to process?
I could use another data structure (another database or just a file) if needed.
By the way, is there a way to check whether a DELETE operation is successful or not, after executing cursor.execute(delete_query)?
How about another field in db as a flag (e.g. PROCESSING, UNPROCESSED, PROCESSED)?
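A rough sketch of the flag idea with the standard sqlite3 module is shown below. The table name items, its columns and the helper claim_next_item are all hypothetical. BEGIN IMMEDIATE takes the write lock before the row is selected, so two instances cannot claim the same item, and cursor.rowcount reports how many rows a statement changed, which is also how you can check whether a DELETE actually removed anything.

```python
import sqlite3

def claim_next_item(conn):
    """Mark one UNPROCESSED row as PROCESSING and return it, or None if nothing is left."""
    conn.execute("BEGIN IMMEDIATE")  # acquire the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, payload FROM items "
            "WHERE status = 'UNPROCESSED' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            conn.execute("COMMIT")
            return None
        cur = conn.execute(
            "UPDATE items SET status = 'PROCESSING' "
            "WHERE id = ? AND status = 'UNPROCESSED'",
            (row[0],),
        )
        conn.execute("COMMIT")
        # rowcount is the number of rows the UPDATE (or a DELETE) affected
        return row if cur.rowcount == 1 else None
    except Exception:
        conn.execute("ROLLBACK")
        raise

# isolation_level=None: transactions are controlled explicitly with BEGIN/COMMIT.
# An in-memory database keeps the demo self-contained; real workers would share a file.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT, "
    "status TEXT DEFAULT 'UNPROCESSED')"
)
conn.executemany("INSERT INTO items (payload) VALUES (?)", [("a",), ("b",)])

first = claim_next_item(conn)   # (1, 'a')
second = claim_next_item(conn)  # (2, 'b')
third = claim_next_item(conn)   # None: nothing left to claim
```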
From what I know, you'll need to start up multiple instances of the Python interpreter (or at least multiple executing processes) to get true concurrency with Python, so you could:
make one broker process that tells the others which record they're allowed to take (via something like 0mq, for instance); this could effectively make your broker a bottleneck, though.
section off parts of your database per process, if your data is easily divisible (ascending numbers for primary keys, for example).
Things like greenlets and tasklets are really executed one after the other; they switch really fast because they don't have the true threading/process overhead, but they're not executed truly concurrently.
The simplest way is to generate the items in a single process and pass them for processing to multiple worker processes e.g.:
from multiprocessing import Pool

def process(item):
    pass  # executed in worker processes

def main():
    p = Pool()  # use all available CPUs
    for result in p.imap_unordered(process, open('items.txt')):
        pass

if __name__ == '__main__':
    main()
Why not read in all the items from the database and put them in a queue? You can have a worker thread get an item, process it and move on to the next one.
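That suggestion could look roughly like this. The function name run_workers, the number of threads and the "multiply by two" placeholder are assumptions for illustration; in practice the items would come from the database and the processing would be your own procedure.

```python
import queue
import threading

def run_workers(items, n_workers=4):
    """Put every item in a queue and let worker threads take them one at a time."""
    work = queue.Queue()
    for item in items:
        work.put(item)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                item = work.get_nowait()  # each item is handed to exactly one thread
            except queue.Empty:
                return                    # nothing left to do
            processed = item * 2          # placeholder for the real processing
            with results_lock:
                results.append(processed)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

results = run_workers(range(10))
print(sorted(results))  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Because the queue hands each item to a single consumer, no item is processed twice, which is the property the question asks for.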

How to synchronize python lists?

I have different threads, and after processing they put data in a common list. Is there anything built into Python that lets a list or a numpy array be accessed by only a single thread at a time? If not, what is an elegant way of doing it?
According to Thread synchronisation mechanisms in Python, reading a single item from a list and modifying a list in place are guaranteed to be atomic. If this is right (although it seems to be partially contradicted by the very existence of the Queue module), then if your code is all of the form:
try:
    val = mylist.pop()
except IndexError:
    pass  # wait for a while or exit
else:
    pass  # process val
And everything put into mylist is done by .append(), then your code is already threadsafe. If you don't trust that one document on that score, use a queue.Queue (Queue.Queue in Python 2), which does all synchronisation for you, and has a better API than list for concurrent programs - particularly, it gives you the option of blocking indefinitely, or for a timeout, waiting for .get() to return an item if you don't have anything else the thread could be getting on with in the meantime.
For numpy arrays, and in general any case where you need more than a producer/consumer queue, use a Lock or RLock from threading - these implement the context manager protocol, so using them is quite simple:
with mylock:
    pass  # process as necessary
And python will guarantee that the lock gets released once you fall off the end of the with block - including in tricky cases like if something you do raises an exception.
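As a small illustration of why the lock matters for anything beyond a single append or pop, the sketch below protects a check-then-append sequence on a shared list. The names shared, shared_lock and record are hypothetical.

```python
import threading

shared = []
shared_lock = threading.Lock()

def record(start, count):
    """Add each number once; the membership test and the append must not be interleaved."""
    for n in range(start, start + count):
        with shared_lock:  # released automatically, even if the body raises
            if n not in shared:
                shared.append(n)

# Four threads all try to record the same 100 numbers.
threads = [threading.Thread(target=record, args=(0, 100)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(shared))  # 100: no duplicates, because check and append happen under the lock
```

Without the lock, two threads could both find a number missing and both append it.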
Finally, consider whether multiprocessing is a better fit for your application than threading - threads in Python aren't guaranteed to actually run concurrently, and in CPython they can only do so while they are inside C-level code that releases the global interpreter lock. multiprocessing gets around that issue, but may have some extra overhead - if you haven't already, you should read the docs to determine which one suits your needs better.
threading provides Lock objects if you need to protect an entire critical section, or the Queue module provides a queue that is threadsafe.
How about the standard library Queue?

Multiprocessing Queue maxsize limit is 32767

I'm trying to write a Python 2.6 (OSX) program using multiprocessing, and I want to populate a Queue with more than the default of 32767 items.
from multiprocessing import Queue
Queue(2**15) # raises OSError
Queue(32767) works fine, but any higher number (e.g. Queue(32768)) fails with OSError: [Errno 22] Invalid argument
Is there a workaround for this issue?
One approach would be to wrap your multiprocessing.Queue with a custom class (just on the producer side, or transparently from the consumer perspective). Using that, you would queue up items to be dispatched to the Queue object that you're wrapping, and only feed things from the local queue (a Python list() object) into the multiprocessing.Queue as space becomes available, with exception handling to throttle when the Queue is full.
That's probably the easiest approach since it should have the minimum impact on the rest of your code. The custom class should behave just like a Queue while hiding the underlying multiprocessing.Queue behind your abstraction.
(One approach might be to have your producer use threads, one thread to manage the dispatch from a threading Queue to your multiprocessing.Queue and any other threads actually just feeding the threading Queue).
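A minimal sketch of the producer-side wrapper described above might look like this. The class name BufferedProducer and its flush method are hypothetical, and the tiny maxsize of 2 is only there to make the throttling visible. It relies on put_nowait raising queue.Full when the wrapped queue has no free slot.

```python
import multiprocessing
import queue
from collections import deque

class BufferedProducer:
    """Keeps overflow items in a local backlog and feeds the bounded queue as space frees up."""

    def __init__(self, target_queue):
        self._target = target_queue
        self._backlog = deque()

    def put(self, item):
        self._backlog.append(item)
        self.flush()

    def flush(self):
        """Move backlog items into the target until it is full; return how many remain."""
        while self._backlog:
            try:
                self._target.put_nowait(self._backlog[0])
            except queue.Full:
                break  # throttle: try again on the next put() or flush()
            self._backlog.popleft()
        return len(self._backlog)

bounded = multiprocessing.Queue(maxsize=2)
producer = BufferedProducer(bounded)
for n in range(5):
    producer.put(n)               # never blocks, never raises
pending_before = producer.flush() # 3 items still wait in the local backlog

received = []
while len(received) < 5:
    received.append(bounded.get(timeout=5))  # the consumer side, in-process for the demo
    producer.flush()                         # a slot was freed, so feed the next item
print(received)  # [0, 1, 2, 3, 4]
```

In a real program the consumer would run in another process and the producer would call flush() periodically, or from a dedicated thread as the parenthetical above suggests.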
I've already answered the original question, but I do feel like adding that Redis lists are quite reliable and the Python module's support for them is extremely easy to use for implementing a Queue-like object. These have the advantage of allowing one to scale out over multiple nodes (across a network) as well as just over multiple processes.
Basically, to use those you'd just pick a key (string) for your queue name, have your producers push into it, and have your workers (task consumers) loop on blocking pops from that key.
The Redis BLPOP and BRPOP commands both take a list of keys (lists/queues) and an optional timeout value. They return a tuple (key, value) or None (on timeout). So you can easily write up an event-driven system that's very similar to the familiar structure of select() (but at a much higher level). The only things you have to watch for are missing keys and invalid key types (just wrap your queue operations with exception handlers, of course). (If some other application stomps on your shared Redis server, removing keys or replacing keys that you were using as queues with string/integer or other types of values ... well, you have a different problem at that point). :)
Another advantage of this model is that Redis does persist its data to disk. So your work queue could survive system restarts if you chose to allow it.
(Of course you could implement a simple Queue as a table in SQLite or any other SQL system if you really wanted to do so, just using some sort of auto-incrementing index for the sequencing and a column to mark each item as having been "done" (consumed); but that does involve somewhat more complexity than using what Redis gives you "out of the box").
Working for me on MacOSX
>>> import Queue
>>> Queue.Queue(30000000)
<Queue.Queue instance at 0x1006035f0>
