Threadsafe way to copy-and-clear array - python

One array accumulates datagrams; on an interval (or once it reaches some length) it is flushed to a database. It accumulates asynchronously from the datagram_received network event:
class Protocol:
    flows = []

    def datagram_received(self, data, addr):
        ...
        self.flows.append(flow)
And flushed by this method:
def store(self):
    flows = []
    while len(self.flows):
        flows.append(self.flows.pop(0))
    self.db.insert(flows)
    sleep(10)
    self.store()
How can I optimize this to replace the while loop with one thread-safe operation?
The module runs one instance of the class, but in two threads.
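One common approach (a minimal sketch, untested; the explicit Lock and the swap idiom are my suggestion, not from the question) is to swap the list out under a lock, so the receiving thread never appends to the list being flushed:

import threading
from time import sleep

class Protocol:
    def __init__(self):
        self.flows = []
        self._lock = threading.Lock()

    def datagram_received(self, data, addr):
        flow = ...  # build the flow record as before
        with self._lock:
            self.flows.append(flow)

    def store(self):
        while True:
            with self._lock:
                # Hand the accumulated list to this thread and start a
                # fresh one for the receiver, in one locked step.
                flows, self.flows = self.flows, []
            if flows:
                self.db.insert(flows)
            sleep(10)

A collections.deque is another option: its append() and popleft() are documented as thread-safe, which avoids the explicit lock at the cost of draining items one at a time.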

Rewrite Python generator as asynchronous

I have a Python script that generates [str, float] tuples which are then indexed into ElasticSearch using a custom function which eventually calls helper.streaming_bulk().
This is how the generator is implemented:
doc_ids: List[str] = [...]
docs = ((doc_id, get_value(doc_id)) for doc_id in doc_ids)
get_value() calls a remote service that computes a float value per document id.
Next, these tuples are passed on to update_page_quality_bulk():
for success, item in update_page_quality_bulk(
    islice(doc_qualities, size)
):
    total_success += success
    if not success:
        logging.error(item)
Internally, update_page_quality_bulk() creates the ElasticSearch requests.
One of the advantages of using a generator here is that the first size elements can be fed into update_page_quality_bulk() through islice().
In order to make the entire process faster, I would like to parallelize the get_value() calls. As mentioned, these are remote calls, so the local compute cost is negligible, but the duration is significant.
The order of the tuples does not matter, neither which elements are passed into update_page_quality_bulk(). On a high level, I would like to make the get_value() calls (up to x in parallel) for any n tuples and pass on whichever ones are finished first.
My naive attempt was to define get_value() as asynchronous:
async def get_value():
    ...
and await the call in the generator:
docs = ((doc_id, await get_value(doc_id)) for doc_id in doc_ids)
However, this raises an error in the subsequent islice() call:
TypeError: 'async_generator' object is not iterable
Removing the islice call and passing the unmodified docs generator to update_page_quality_bulk() causes the same error to be raised when looping over the tuples to convert them into ElasticSearch requests.
I am aware that the ElasticSearch client provides asynchronous helpers, but they don't seem applicable here because I need to generate the actions first.
According to this answer, it seems like I have to change the implementation to using a queue.
This answer implies that it cannot be done without using multiprocessing due to the Python GIL, but that answer is not marked as correct and is quite old too.
Generally, I am looking for a way to change the current logic as little as possible while parallelizing the get_value() calls.
So, you want to pass a "synchronous-looking" generator to a call that expects a normal lazy generator, such as islice, and keep getting the results in parallel.
It sounds like a job for asyncio.as_completed: you use your plain generator to create tasks - these are run in parallel by the asyncio machinery, and the results are made available as the tasks are completed (d'oh!).
However, since update_page_quality_bulk is not asyncio-aware, it will never yield control to the asyncio loop, so the loop can never complete the tasks that have their results ready. This would likely block.
Calling update_page_quality_bulk in another thread probably won't work either. I did not try it here, but I'd say you can't just iterate over docs in a different thread than the one where it (and its tasks) were created.
So, first things first - the "generator expression" syntax does not work when you want some terms of the generator to be calculated asynchronously, as you found out - we refactor that so that the tuples are created in a coroutine function, and we wrap all calls to it in tasks (some of the asyncio functions do the wrapping in a task automatically).
Then we can use the asyncio machinery to schedule all the calls and call update_page_quality_bulk as the results arrive. The problem is that as_completed, as stated above, can't be passed directly to a non-async function: the asyncio loop would never get control back. Instead, we keep picking the results of tasks in the main thread and call the sync function in another thread, using a queue.Queue to pass the fetched results. And finally, so that the results can be consumed as they are made available inside update_page_quality_bulk, we create a small wrapper class around the queue.Queue so that it can be consumed as an iterator - this is transparent for the code consuming the iterator.
# example code: untested
import asyncio
import logging
import queue
import threading

# get_value, update_page_quality_bulk and doc_ids come from the question.

async def get_doc_values(doc_id):
    loop = asyncio.get_running_loop()
    # run_in_executor runs the synchronous function in parallel in a thread pool.
    # Check the docs - you might want to pass a custom executor with more than
    # the default number of workers, instead of None:
    return doc_id, await loop.run_in_executor(None, get_value, doc_id)

def update_es(iterator):
    # This function runs in a separate thread.
    total_success = 0  # kept local here; the question accumulates it outside
    for success, item in update_page_quality_bulk(iterator):
        total_success += success
        if not success:
            logging.error(item)

sentinel = Ellipsis  # "...": a nice sentinel that also works for multiprocessing

class Iterator:
    """Allows the queue, fed in the main thread by the tasks as they are
    completed, to behave like an ordinary iterator which can be consumed
    by update_page_quality_bulk in another thread.
    """
    def __init__(self, source_queue):
        self.source = source_queue

    def __iter__(self):
        return self

    def __next__(self):
        value = self.source.get()
        if value is sentinel:
            raise StopIteration()
        return value

async def main():
    # Note: queue.Queue, not threading.Queue - the threading module has no Queue.
    work_queue = queue.Queue()
    iterator = Iterator(work_queue)
    es_worker = threading.Thread(target=update_es, args=(iterator,))
    es_worker.start()
    for doc_value_task in asyncio.as_completed(
        [get_doc_values(doc_id) for doc_id in doc_ids]
    ):
        doc_value = await doc_value_task
        work_queue.put(doc_value)
    work_queue.put(sentinel)  # unblock the consumer so that join() can return
    es_worker.join()
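With the body wrapped in an async main() as in the sketch above, the whole pipeline can then be started with asyncio.run(main()).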

How to limit the number of simultaneous connections in Twisted

So I have a Twisted server I built, and I was wondering: what is the best way to limit the number of simultaneous connections?
Is having my Factory return None the best way? When I do this, I throw a lot of exceptions like:
exceptions.AttributeError: 'NoneType' object has no attribute 'makeConnection'
I would like some way to have the clients just sit in a queue until the current connection number goes back down, but I don't know how to do that asynchronously.
Currently my factory does it like this:
class HandleClientFactory(Factory):
    def __init__(self):
        self.numConnections = 0

    def buildProtocol(self, addr):
        # limit connection number here
        if self.numConnections >= Max_Clients:
            logging.warning("Reached maximum Client connections")
            return None
        return HandleClient(self)
which works, but disconnects rather than waits, and also throws a lot of unhandled errors.
You have to build this yourself. Fortunately, the pieces are mostly in place to do so (you could probably ask for slightly more suitable pieces but ...)
First, to avoid the AttributeError (which indeed causes the connection to be closed), be sure to return an IProtocol provider from your buildProtocol method.
class DoesNothing(Protocol):
    pass

class YourFactory(Factory):
    def buildProtocol(self, addr):
        if self.currentConnections < self.maxConnections:
            return Factory.buildProtocol(self, addr)
        protocol = DoesNothing()
        protocol.factory = self
        return protocol
If you use this factory (filling in the missing pieces - eg, initializing maxConnections and tracking currentConnections correctly) then you'll find that clients which connect once the limit has been reached are given the DoesNothing protocol. They can send as much data as they like to this protocol. It will discard it all. It will never send them any data. It will leave the connection open until they close it. In short, it does nothing.
However, you also wanted clients to actually receive service once connection count fell below the limit.
To do this, you need a few more pieces:
You have to keep any data they might send buffered so it is available to be read when you're ready to read it.
You have to keep track of the connections so you can begin to service them when the time is ripe.
You have to begin to service them at said time.
For the first of these, you can use the feature of most transports to "pause":
class PauseTransport(Protocol):
    def makeConnection(self, transport):
        transport.pauseProducing()

class YourFactory(Factory):
    def buildProtocol(self, addr):
        if self.currentConnections < self.maxConnections:
            return Factory.buildProtocol(self, addr)
        protocol = PauseTransport()
        protocol.factory = self
        return protocol
PauseTransport is similar to DoesNothing but with the minor (and useful) difference that as soon as it is connected to a transport it tells the transport to pause. Thus, no data will ever be read from the connection and it will all remain buffered for whenever you're ready to deal with it.
For the next requirement, many possible solutions exist. One of the simplest is to use the factory as storage:
class PauseAndStoreTransport(Protocol):
    def makeConnection(self, transport):
        transport.pauseProducing()
        self.factory.addPausedTransport(transport)

class YourFactory(Factory):
    def buildProtocol(self, addr):
        # As above
        ...

    def addPausedTransport(self, transport):
        self.transports.append(transport)
Again, with the proper setup (eg, initialize the transports attribute), you now have a list of all the transports corresponding to connections you've accepted above the concurrency limit, which are waiting for service.
For the last requirement, all that is necessary is to instantiate and initialize the protocol that's actually capable of serving your clients. Instantiation is easy (it's your protocol, you probably know how it works). Initialization is largely a matter of calling the makeConnection method:
class YourFactory(Factory):
    def buildProtocol(self, addr):
        # As above
        ...

    def addPausedTransport(self, transport):
        # As above
        ...

    def oneConnectionDisconnected(self):
        self.currentConnections -= 1
        if self.currentConnections < self.maxConnections:
            transport = self.transports.pop(0)
            protocol = self.buildProtocol(address)
            protocol.makeConnection(transport)
            transport.resumeProducing()
I've omitted the details of keeping track of the address argument required by buildProtocol (with the transport carried from its point of origin to this part of the program, it should be clear how to do something similar for the original address value if your program actually wants it).
Apart from that, all that happens here is you take the next queued transport (you could use a different scheduling algorithm if you want, eg LIFO) and hook it up to a protocol of your choosing just as Twisted would do. Finally, you undo the earlier pause operation so data will begin to flow.
Or... almost. This would be pretty slick except Twisted transports don't actually expose any way to change which protocol they deliver data to. Thus, as written, data from clients will actually be delivered to the original PauseAndStoreTransport protocol instance. You can hack around this (and "hack" is clearly the right word). Store both the transport and PauseAndStoreTransport instance in the list on the factory and then:
def oneConnectionDisconnected(self):
    self.currentConnections -= 1
    if self.currentConnections < self.maxConnections:
        originalProtocol, transport = self.transports.pop(0)
        newProtocol = self.buildProtocol(address)
        originalProtocol.dataReceived = newProtocol.dataReceived
        originalProtocol.connectionLost = newProtocol.connectionLost
        newProtocol.makeConnection(transport)
        transport.resumeProducing()
Now the object that the transport wants to call methods on has had its methods replaced by those from the object that you want the methods called on. Again, this is clearly a hack. You can probably put together something less hackish if you want (eg, a third protocol class that explicitly supports delegating to another protocol). The idea will be the same - it'll just be more wear on your keyboard. For what it's worth, I suspect that it may be both easier and less typing to do something similar using Tubes but I'll leave an attempt at a solution based on that library to someone else for now.
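For what that less hackish version might look like, here is a rough, untested sketch of such a third protocol class - my elaboration, not part of the original answer - that explicitly delegates to another protocol:

class Delegator(Protocol):
    """Pauses its transport on connection; later hands everything over
    to the real protocol, avoiding the method-patching hack above."""
    _target = None

    def makeConnection(self, transport):
        Protocol.makeConnection(self, transport)
        transport.pauseProducing()
        # Store the protocol (which knows its transport) rather than the
        # bare transport, so delegation can happen later.
        self.factory.addPausedTransport(self)

    def delegateTo(self, target):
        self._target = target
        target.makeConnection(self.transport)
        self.transport.resumeProducing()

    def dataReceived(self, data):
        self._target.dataReceived(data)

    def connectionLost(self, reason):
        if self._target is not None:
            self._target.connectionLost(reason)

The factory's dequeue step then becomes delegator.delegateTo(self.buildProtocol(address)) instead of patching methods.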
I've avoided addressing the problem of keeping currentConnections properly up to date. Since you already had numConnections in your question I'm assuming you know how to manage that part. All I've done in the last step here is suppose that the way you do the decrement step is by calling oneConnectionDisconnected on the factory.
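For instance, a minimal sketch (my illustration; it assumes the protocol keeps a factory reference, as HandleClient(self) in the question suggests):

class HandleClient(Protocol):
    def connectionMade(self):
        self.factory.currentConnections += 1

    def connectionLost(self, reason):
        self.factory.oneConnectionDisconnected()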
I've also avoided addressing the event that a queued connection gets bored and goes away. This will mostly work as written - Twisted won't notice the connection was closed until you call resumeProducing and then connectionLost will be called on your application protocol. This should be fine since your protocol needs to handle lost connections anyway.

Do I have to lock all functions that calls to one or more locked function for multi-threading?

OK, I have been reading this article about locking vital statements the correct way in Python.
The example is:
lock = threading.Lock()

def get_first_part():
    lock.acquire()
    try:
        ... # fetch data for first part from shared object
    finally:
        lock.release()
    return data

def get_second_part():
    lock.acquire()
    try:
        ... # fetch data for second part from shared object
    finally:
        lock.release()
    return data

def get_both_parts():
    lock.acquire()
    try:
        first = get_first_part()
        second = get_second_part()
    finally:
        # The finally block always executes, no matter what happens before it -
        # return/break/continue included. https://docs.python.org/2/tutorial/errors.html
        lock.release()
    return first, second
So here is my problem: I have a serial device control class in which every function calls into these vital statements (functions that require a thread lock, an RLock).
Some of the functions call only get_first_part(), some call both, and some are more complex and might call, say, first, second, third, then first again.
So do I have to put the lock.acquire() and release in every such function in the class? What's the preferred way to lock and release all of these functions? Use decorators?
Thanks.
The with statement is way better:
lock = threading.RLock()

def get_first_part():
    with lock:
        return ... # data for first part
etc, etc.
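One point worth spelling out (my note, not part of the original answer): the RLock above, rather than a plain Lock, is what keeps get_both_parts from deadlocking, because it acquires the lock and then calls get_first_part, which acquires it again. An RLock is reentrant, so the owning thread may acquire it repeatedly:

def get_both_parts():
    with lock:                      # outer acquisition
        first = get_first_part()    # inner acquisition of the same RLock is fine;
        second = get_second_part()  # with a plain Lock this would deadlock
    return first, second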
And yes, of course, you can use a decorator if you wish:
import functools

def withlock(func):
    @functools.wraps(func)  # note: functools.wraps, not functools.wrap
    def wrapper(*a, **k):
        with lock:
            return func(*a, **k)
    return wrapper
and then
@withlock
def get_first_part():
    return ... # data for first part
However, it's unusual that the whole body of every function needs to hold the lock throughout, and the with statement gives you better granularity about which parts exactly need the lock (any preparation and post-processing can happen before acquiring, and after releasing, the lock). So I normally go with the elegant with statement instead, rather than the "coarse-grained" approach of a decorator in such cases.
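To illustrate that granularity point, a hypothetical sketch (prepare_request, shared_object and post_process are made-up names, not from the article):

def get_first_part():
    request = prepare_request()              # preparation: no lock needed
    with lock:
        data = shared_object.fetch(request)  # only the shared access is guarded
    return post_process(data)                # post-processing: lock already released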

Twisted wait for event in loop

I want to read and process some data from an external service. I ask the service if there is any data, if something was returned I process it and ask again (so data can be processed immediately when it's available) and otherwise I wait for a notification that data is available. This can be written as an infinite loop:
def loop(self):
    while True:
        data = yield self.get_data_nonblocking()
        if data is not None:
            yield self.process_data(data)
        else:
            yield self.data_available

def on_data_available(self):
    self.data_available.fire()
How can data_available be implemented here? It could be a Deferred but a Deferred cannot be reset, only recreated. Are there better options?
Can this loop be integrated into the Twisted event loop? I can read and process data right in on_data_available and write some code instead of the loop checking get_data_nonblocking but I feel like then I'll need some locks to make sure data is processed in the same order it arrives (the code above enforces it because it's the only place where it's processed). Is this a good idea at all?
Consider the case of a TCP connection. The receiver buffer for a TCP connection can either have data in it or not. You can get that data, or get nothing, without blocking by using the non-blocking socket API:
data = socket.recv(1024)
if data:
    self.process_data(data)
You can wait for data to be available using select() (or any of the basically equivalent APIs):
socket.setblocking(False)
while True:
    data = socket.recv(1024)
    if data:
        self.process_data(data)
    else:
        select([socket], [], [])
Of these, only select() is particularly Twisted-unfriendly (though the Twisted idiom is certainly not to make your own socket.recv calls). You could replace the select call with a Twisted-friendly version though (implement a Protocol with a dataReceived method that fires a Deferred - sort of like your on_data_available method - toss in some yields and make this whole thing an inlineCallbacks generator).
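A rough sketch of that Twisted-friendly replacement (untested; NotifyingProtocol and wait_for_data are hypothetical names of mine):

from twisted.internet import defer
from twisted.internet.protocol import Protocol

class NotifyingProtocol(Protocol):
    """Buffers incoming data and fires a Deferred when something arrives."""
    def connectionMade(self):
        self._buffer = []
        self._waiting = None

    def dataReceived(self, data):
        self._buffer.append(data)
        if self._waiting is not None:
            d, self._waiting = self._waiting, None
            d.callback(None)

    def wait_for_data(self):
        # A fresh Deferred per wait - a Deferred cannot be reset, only recreated.
        self._waiting = defer.Deferred()
        return self._waiting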
But though that's one way you can get data from a TCP connection, that's not the API that Twisted encourages you to use to do so. Instead, the API is:
class SomeProtocol(Protocol):
    def dataReceived(self, data):
        # Your logic here
I don't see how your case is substantially different. What if, instead of the loop you wrote, you did something like this:
class YourDataProcessor(object):
    def process_data(self, data):
        # Your logic here

class SomeDataGetter(object):
    def __init__(self, processor):
        self.processor = processor

    def on_available_data(self):
        data = self.get_data_nonblocking()
        if data is not None:
            self.processor.process_data(data)
Now there are no Deferreds at all (except perhaps in whatever implements on_available_data or get_data_nonblocking but I can't see that code).
If you leave this roughly as-is, you are guaranteed in-order execution because Twisted is single-threaded (except in a couple of places that are very clearly marked), and in a single-threaded program an earlier call to process_data must complete before any later call to process_data can be made (excepting, of course, the case where process_data reentrantly invokes itself - but that's another story).
If you switch this back to using inlineCallbacks (or any equivalent "coroutine" flavored drink mix) then you are probably introducing the possibility of out-of-order execution.
For example, if get_data_nonblocking returns a Deferred and you write something like this:
@inlineCallbacks
def on_available_data(self):
    data = yield self.get_data_nonblocking()
    if data is not None:
        self.processor.process_data(data)
Then you have changed on_available_data to say that a context switch is allowed when calling get_data_nonblocking. In this case, depending on your implementation of get_data_nonblocking and on_available_data, it's entirely possible that:
1. on_available_data is called
2. get_data_nonblocking is called and returns a Deferred
3. on_available_data tells execution to switch to another context (via yield / inlineCallbacks)
4. on_available_data is called again
5. get_data_nonblocking is called again and returns a Deferred (perhaps the same one! perhaps a new one! depends on how it's implemented)
6. The second invocation of on_available_data tells execution to switch to another context (same reason)
7. The reactor spins around for a while and eventually an event arrives that causes the Deferred returned by the second invocation of get_data_nonblocking to fire
8. Execution switches back to the second on_available_data frame
9. process_data is called with whatever data the second get_data_nonblocking call returned
10. Eventually the same things happen to the first set of objects and process_data is called again with whatever data the first get_data_nonblocking call returned
Now perhaps you've processed data out of order - again, this depends on more details of other parts of your system.
If so, you can always re-impose order. There are a lot of different possible approaches to this. Twisted itself doesn't come with any APIs that are explicitly in support of this operation so the solution involves writing some new code. Here's one idea (untested) for an approach - a queue-like class that knows about object sequence numbers:
class SequencedQueue(object):
    """
    A queue-like type which guarantees objects come out of the queue in the order
    defined by a sequence number associated with the objects when they are put into
    the queue.

    Application code manages sequence number assignment so that sequence numbers don't
    have to have the same order as `put` calls on this type.
    """
    def __init__(self):
        # The sequence number of the object that should be given out
        # by the next call to `get`
        self._next_sequence = 0
        # The sequence number of the next result that needs to be provided.
        self._next_result = 0
        # A holding area for objects past _next_sequence
        self._queue = {}
        # A holding area for Deferreds waiting for results
        self._waiting = {}

    def put(self, sequence, object):
        """
        Put an object into the queue at a particular point in the sequence.
        """
        if sequence < self._next_sequence:
            # Programming error. The sequence number
            # of the object being put has already been used.
            raise ...
        self._queue[sequence] = object
        self._check_waiters()

    def get(self):
        """
        Get an object from the queue which has the next sequence number
        following whatever was previously gotten.
        """
        result = self._waiting[self._next_sequence] = Deferred()
        self._next_sequence += 1
        self._check_waiters()
        return result

    def _check_waiters(self):
        """
        Find any Deferreds previously given out by get calls which can now be given
        their results and give them to them.
        """
        while True:
            seq = self._next_result
            if seq in self._queue and seq in self._waiting:
                self._next_result += 1
                # XXX Probably a re-entrancy bug here. If a callback calls back in to
                # put then this loop might run recursively
                self._waiting.pop(seq).callback(self._queue.pop(seq))
            else:
                break
The expected behavior (modulo any bugs I accidentally added) is something like:
q = SequencedQueue()
d1 = q.get()
d2 = q.get()
# Nothing in particular happens
q.put(1, "second result")
# d1 fires with "first result" and afterwards d2 fires with "second result"
q.put(0, "first result")
Using this, just make sure you assign sequence numbers in the order you want data dispatched rather than the order it actually shows up somewhere. For example:
@inlineCallbacks
def on_available_data(self):
    sequence = self._process_order
    data = yield self.get_data_nonblocking()
    if data is not None:
        self._process_order += 1
        self.sequenced_queue.put(sequence, data)
Elsewhere, some code can consume the queue sort of like:
@inlineCallbacks
def queue_consumer(self):
    while True:
        data = yield self.sequenced_queue.get()
        yield self.process_data(data)

Blocking dict in Python?

Is there a data structure in Python that resembles a blocking dictionary? This data structure must fulfill these requirements:
it must be randomly accessible and allow any element to be modified/deleted (not just the first or last)
it must have a blocking get() and put()
it must be thread-safe
I would have used a queue but, although blocking and thread-safe, it's not randomly accessible. A dict is not blocking either (as far as my Python knowledge goes).
As an example, think of one producer thread adding key-value pairs to such a data-structure (updating values for existing keys if already present - this is where a queue won't cut it), and a worker blocking on get() and consuming these key-value pairs as they become available.
Many many thanks!
edit:
Let's assume the producer polls a CI server and gets project-status pairs. It generates the differences in project statuses and puts them in the aforementioned data structure. The worker picks up these project-status updates and displays them one by one as an animation on the screen.
class Producer:
    def generateProjectStatusChanges():
        ...

    def updateSuperAwesomeDataStructure(changes):
        for (proj, stat) in changes:
            # queue won't do, because the update could take place in the middle
            # of the queue - hence the dict behavior
            superAwesomeDS.putOrUpdate(proj, stat)

    def watchForUpdates():
        changes = generateProjectStatusChanges()
        updateSuperAwesomeDataStructure(changes)
        time.sleep(self.interval)

class Worker:
    def blockingNotifyAnimation():
        ...

    def watchForUpdates():
        while True:
            proj, stat = superAwesomeDS.getFirstPair()  # or any pair really
            blockingNotifyAnimation(proj, stat)
Something along the following lines should do the trick (untested):
import threading

class UpdatableBlockingQueue(object):
    def __init__(self):
        self.queue = {}
        self.cv = threading.Condition()

    def put(self, key, value):
        with self.cv:
            self.queue[key] = value
            self.cv.notify()

    def pop(self):
        with self.cv:
            while not self.queue:
                self.cv.wait()
            return self.queue.popitem()
It uses a dictionary for the queue and a condition variable for serialising access and signalling between threads.
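A hypothetical usage, matching the Producer/Worker sketch from the question (putOrUpdate and getFirstPair map onto put and pop here):

q = UpdatableBlockingQueue()

# Producer thread: updating an existing key replaces its pending value,
# which is exactly what a plain Queue cannot do.
q.put("project-a", "passing")
q.put("project-a", "failing")

# Worker thread: blocks until at least one pair is available.
proj, stat = q.pop()  # -> ("project-a", "failing")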
