I want to read and process some data from an external service. I ask the service if there is any data, if something was returned I process it and ask again (so data can be processed immediately when it's available) and otherwise I wait for a notification that data is available. This can be written as an infinite loop:
def loop(self):
    while True:
        data = yield self.get_data_nonblocking()
        if data is not None:
            yield self.process_data(data)
        else:
            yield self.data_available

def on_data_available(self):
    self.data_available.fire()
How can data_available be implemented here? It could be a Deferred but a Deferred cannot be reset, only recreated. Are there better options?
Can this loop be integrated into the Twisted event loop? I can read and process data right in on_data_available and write some code instead of the loop checking get_data_nonblocking but I feel like then I'll need some locks to make sure data is processed in the same order it arrives (the code above enforces it because it's the only place where it's processed). Is this a good idea at all?
Consider the case of a TCP connection. The receiver buffer for a TCP connection can either have data in it or not. You can get that data, or get nothing, without blocking by using the non-blocking socket API:
data = socket.recv(1024)
if data:
    self.process_data(data)
You can wait for data to be available using select() (or any of the basically equivalent APIs):
socket.setblocking(False)
while True:
    data = socket.recv(1024)
    if data:
        self.process_data(data)
    else:
        select([socket], [], [])
Of these, only select() is particularly Twisted-unfriendly (though the Twisted idiom is certainly not to make your own socket.recv calls). You could replace the select call with a Twisted-friendly version though (implement a Protocol with a dataReceived method that fires a Deferred - sort of like your on_data_available method - toss in some yields and make this whole thing an inlineCallbacks generator).
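For illustration, here is a rough, untested sketch of that idea; BufferingProtocol, wait_for_data, and process_data are made-up names, not part of Twisted:

from twisted.internet import defer, protocol

class BufferingProtocol(protocol.Protocol):
    """Hypothetical protocol: buffers received bytes and lets a consumer wait for more."""
    def __init__(self):
        self._buffer = b""
        self._waiting = None

    def dataReceived(self, data):
        self._buffer += data
        if self._waiting is not None:
            d, self._waiting = self._waiting, None
            d.callback(None)              # wake up the waiting loop

    def get_data_nonblocking(self):
        data, self._buffer = self._buffer, b""
        return data

    def wait_for_data(self):
        self._waiting = defer.Deferred()
        return self._waiting

@defer.inlineCallbacks
def loop(proto):
    while True:
        data = proto.get_data_nonblocking()
        if data:
            yield process_data(data)      # process_data as in the question
        else:
            yield proto.wait_for_data()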
But though that's one way you can get data from a TCP connection, that's not the API that Twisted encourages you to use to do so. Instead, the API is:
class SomeProtocol(Protocol):
    def dataReceived(self, data):
        # Your logic here
I don't see how your case is substantially different. What if, instead of the loop you wrote, you did something like this:
class YourDataProcessor(object):
    def process_data(self, data):
        # Your logic here

class SomeDataGetter(object):
    def __init__(self, processor):
        self.processor = processor

    def on_available_data(self):
        data = self.get_data_nonblocking()
        if data is not None:
            self.processor.process_data(data)
If you leave this roughly as-is, you are guaranteed in-order execution because Twisted is single-threaded (except in a couple of places that are very clearly marked) and in a single-threaded program, an earlier call to process_data must complete before any later call to process_data could be made (excepting, of course, the case where process_data reentrantly invokes itself - but that's another story).
If you switch this back to using inlineCallbacks (or any equivalent "coroutine" flavored drink mix) then you are probably introducing the possibility of out-of-order execution.
For example, if get_data_nonblocking returns a Deferred and you write something like this:
@inlineCallbacks
def on_available_data(self):
    data = yield self.get_data_nonblocking()
    if data is not None:
        self.processor.process_data(data)
Then you have changed on_available_data to say that a context switch is allowed when calling get_data_nonblocking. In this case, depending on your implementation of get_data_nonblocking and on_available_data, it's entirely possible that:
1. on_available_data is called
2. get_data_nonblocking is called and returns a Deferred
3. on_available_data tells execution to switch to another context (via yield / inlineCallbacks)
4. on_available_data is called again
5. get_data_nonblocking is called again and returns a Deferred (perhaps the same one! perhaps a new one! depends on how it's implemented)
6. The second invocation of on_available_data tells execution to switch to another context (same reason)
7. The reactor spins around for a while and eventually an event arrives that causes the Deferred returned by the second invocation of get_data_nonblocking to fire
8. Execution switches back to the second on_available_data frame
9. process_data is called with whatever data the second get_data_nonblocking call returned
10. Eventually the same things happen to the first set of objects and process_data is called again with whatever data the first get_data_nonblocking call returned
Now perhaps you've processed data out of order - again, this depends on more details of other parts of your system.
If so, you can always re-impose order. There are a lot of different possible approaches to this. Twisted itself doesn't come with any APIs that are explicitly in support of this operation so the solution involves writing some new code. Here's one idea (untested) for an approach - a queue-like class that knows about object sequence numbers:
from twisted.internet.defer import Deferred

class SequencedQueue(object):
    """
    A queue-like type which guarantees objects come out of the queue in the order
    defined by a sequence number associated with the objects when they are put into
    the queue.

    Application code manages sequence number assignment so that sequence numbers don't
    have to have the same order as `put` calls on this type.
    """
    def __init__(self):
        # The sequence number of the object that should be given out
        # by the next call to `get`
        self._next_sequence = 0

        # The sequence number of the next result that needs to be provided.
        self._next_result = 0

        # A holding area for objects past _next_sequence
        self._queue = {}

        # A holding area for Deferreds handed out by `get` that haven't fired yet
        self._waiting = {}

    def put(self, sequence, object):
        """
        Put an object into the queue at a particular point in the sequence.
        """
        if sequence < self._next_sequence:
            # Programming error.  The sequence number
            # of the object being put has already been used.
            raise ...
        self._queue[sequence] = object
        self._check_waiters()

    def get(self):
        """
        Get an object from the queue which has the next sequence number
        following whatever was previously gotten.
        """
        result = self._waiting[self._next_sequence] = Deferred()
        self._next_sequence += 1
        self._check_waiters()
        return result

    def _check_waiters(self):
        """
        Find any Deferreds previously given out by get calls which can now be given
        their results and give them to them.
        """
        while True:
            seq = self._next_result
            if seq in self._queue and seq in self._waiting:
                self._next_result += 1
                # XXX Probably a re-entrancy bug here.  If a callback calls back in to
                # put then this loop might run recursively
                self._waiting.pop(seq).callback(self._queue.pop(seq))
            else:
                break
The expected behavior (modulo any bugs I accidentally added) is something like:
q = SequencedQueue()
d1 = q.get()
d2 = q.get()
# Nothing in particular happens
q.put(1, "second result")
# d1 fires with "first result" and afterwards d2 fires with "second result"
q.put(0, "first result")
Using this, just make sure you assign sequence numbers in the order you want data dispatched rather than the order it actually shows up somewhere. For example:
@inlineCallbacks
def on_available_data(self):
    sequence = self._process_order
    data = yield self.get_data_nonblocking()
    if data is not None:
        self._process_order += 1
        self.sequenced_queue.put(sequence, data)
Elsewhere, some code can consume the queue sort of like:
@inlineCallbacks
def queue_consumer(self):
    while True:
        data = yield self.sequenced_queue.get()
        yield self.process_data(data)
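For completeness, an untested sketch of how those two pieces might hang together; OrderedConsumer is a made-up name, and it assumes the _process_order counter and sequenced_queue attributes that the snippets above use:

class OrderedConsumer(object):
    def __init__(self):
        self._process_order = 0
        self.sequenced_queue = SequencedQueue()
        # on_available_data and queue_consumer are the methods shown above;
        # kick off the consumer loop once and let it run for the lifetime of the object:
        self.queue_consumer()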
I have a Python script that generates [str, float] tuples which are then indexed into ElasticSearch using a custom function which eventually calls helper.streaming_bulk().
This is how the generator is implemented:
doc_ids: List[str] = [...]
docs = ((doc_id, get_value(doc_id)) for doc_id in doc_ids)
get_value() calls a remote service that computes a float value per document id.
Next, these tuples are passed on to update_page_quality_bulk():
for success, item in update_page_quality_bulk(
    islice(doc_qualities, size)
):
    total_success += success
    if not success:
        logging.error(item)
Internally, update_page_quality_bulk() creates the ElasticSearch requests.
One of the advantages of using a generator here is that the first size elements can be fed into update_page_quality_bulk() through islice().
In order to make the entire process faster, I would like to parallelize the get_value() calls. As mentioned, these are remote calls so the local compute cost is negligible, but the duration is significant.
The order of the tuples does not matter, neither which elements are passed into update_page_quality_bulk(). On a high level, I would like to make the get_value() calls (up to x in parallel) for any n tuples and pass on whichever ones are finished first.
My naive attempt was to define get_value() as asynchronous:
async def get_value():
    ...
and await the call in the generator:
docs = ((doc_id, await get_value(doc_id)) for doc_id in doc_ids)
However, this raises an error in the subsequent islice() call:
TypeError: 'async_generator' object is not iterable
Removing the islice call and passing the unmodified docs generator to update_page_quality_bulk() causes the same error to be raised when looping over the tuples to convert them into ElasticSearch requests.
I am aware that the ElasticSearch client provides asynchronous helpers, but they don't seem applicable here because I need to generate the actions first.
According to this answer, it seems like I have to change the implementation to using a queue.
This answer implies that it cannot be done without using multiprocessing due to Python GIL, but that answer is not marked as correct and is quite old too.
Generally, I am looking for a way to change the current logic as little as possible while parallelizing the get_value() calls.
So, you want to pass a "synchronous looking" generator to a call that expects a normal lazy generator such as islice, and keep getting the results for this in parallel.
It sounds like a work for asyncio.as_completed: you use your plain generator to create tasks - these are run in parallel by the asyncio machinery, and the results are made available as the tasks are completed (d'oh!).
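A minimal, untested illustration of as_completed on its own (fetch here is just a stand-in for the remote get_value call):

import asyncio

async def fetch(doc_id):
    await asyncio.sleep(0.1)          # stand-in for the remote get_value() call
    return doc_id, 0.5

async def main():
    doc_ids = ["a", "b", "c"]
    for task in asyncio.as_completed([fetch(d) for d in doc_ids]):
        doc_id, value = await task    # results arrive in completion order,
        print(doc_id, value)          # not in submission order

asyncio.run(main())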
However, since update_page_quality_bulk is not asyncio aware, it will never yield control to the asyncio loop, so the loop can never complete the tasks that have their results ready. This would likely block.
Calling update_page_quality_bulk in another thread probably won't work either. I did not try it here, but I'd say you can't just iterate over docs in a different thread than the one where it (and its tasks) were created.
So, first things first: the "generator expression" syntax does not work when you want some terms of the generator to be calculated asynchronously, as you found out. We refactor that so that the tuples are created in a coroutine function, and we wrap all calls to it in tasks (some of the asyncio functions do that wrapping automatically).
Then we can use the asyncio machinery to schedule all the calls and call update_page_quality_bulk as these results arrive. The problem is that as_completed, as stated above, can't be passed directly to a non-async function: the asyncio loop would never get control back. Instead, we keep picking the results of tasks in the main thread and call the sync function in another thread, using a Queue to pass the fetched results. And finally, so that the results can be consumed as they become available inside update_page_quality_bulk, we create a small wrapper class around threading.Queue so that it can be consumed as an iterator - this is transparent for the code consuming the iterator.
# example code: untested
import asyncio
import logging
import threading

async def get_doc_values(doc_id):
    loop = asyncio.get_running_loop()
    # run_in_executor runs the synchronous function in parallel in a thread-pool
    # check the docs - you might want to pass a custom executor with more than
    # the default number of workers, instead of None:
    return doc_id, await loop.run_in_executor(None, get_value, doc_id)

def update_es(iterator):
    # this function runs in a separate thread -
    total_success = 0
    for success, item in update_page_quality_bulk(iterator):
        total_success += success
        if not success:
            logging.error(item)

sentinel = Ellipsis  # "...": Python's Ellipsis - a nice sentinel that also works for multiprocessing

class Iterator:
    """This allows the queue, fed in the main thread by the tasks as they are completed,
    to behave like an ordinary iterator, which can be consumed by "update_page_quality_bulk"
    in another thread.
    """
    def __init__(self, source_queue):
        self.source = source_queue

    def __iter__(self):
        return self

    def __next__(self):
        value = self.source.get()
        if value is sentinel:
            raise StopIteration()
        return value

# the part below must itself run inside a coroutine (see the wrapper after this block):
queue = threading.Queue()
iterator = Iterator(queue)
es_worker = threading.Thread(target=update_es, args=(iterator,))
es_worker.start()
for doc_value_task in asyncio.as_completed([get_doc_values(doc_id) for doc_id in doc_ids]):
    doc_value = await doc_value_task
    queue.put(doc_value)
queue.put(sentinel)  # unblock the consumer thread once all results are queued
es_worker.join()
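Since the final for/await loop has to run inside a coroutine, one way to drive the whole thing (untested, reusing the names defined above) would be:

async def main(doc_ids):
    queue = threading.Queue()
    es_worker = threading.Thread(target=update_es, args=(Iterator(queue),))
    es_worker.start()
    for doc_value_task in asyncio.as_completed([get_doc_values(d) for d in doc_ids]):
        queue.put(await doc_value_task)
    queue.put(sentinel)               # tell the consumer thread we're done
    es_worker.join()

asyncio.run(main(doc_ids))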
I have a defer.inlineCallbacks function for incrementally updating a large (>1k) list one piece at a time. This list may change at any time, and I'm getting bugs because of that behavior.
The simplest representation of what I'm doing is:
@defer.inlineCallbacks
def _get_details(self, dt=None):
    data = self.data
    for e in data:
        if needs_update(e):
            more_detail = yield get_more_detail(e)
            do_the_update(e, more_detail)
    schedule_future(self._get_details)
self.data is a list of dictionaries which is initially populated with basic information (e.g. a name and ID) at application start. _get_details will run whenever allowed to by the reactor to get more detailed information for each item in data, updating the item as it goes along.
This works well when self.data does not change, but once it is changed (can be at any point) the loop obviously refers to the wrong information. In fact in that situation it would be better to just stop the loop entirely.
I'm able to set a flag in my class (which the inlineCallbacks code can then check) when the data is changed.
1. Where should this check be conducted?
2. How does the inlineCallbacks code execute compared to a normal Deferred (and indeed to a normal Python generator)?
3. Does code execution stop every time it encounters yield (i.e. can I rely on the code between one yield and the next being atomic)?
4. In the case of unreliable large lists, should I even be looping through the data (for e in data), or is there a better way?
The Twisted reactor never preempts your code while it is executing -- you have to voluntarily yield to the reactor by returning a value. This is why it is such a terrible thing to write Twisted code that blocks on I/O, because the reactor is not able to schedule any tasks while you are waiting for your disk.
So the short answer is that yes, execution is atomic between yields.
Without @inlineCallbacks, the _get_details function returns a generator. The @inlineCallbacks decorator simply wraps the generator in a Deferred that traverses the generator until it reaches a StopIteration exception or a defer.returnValue call. When either of those conditions is reached, inlineCallbacks fires its Deferred. It's quite clever, really.
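To make that concrete, here's a rough, untested comparison of the same step written with explicit Deferred callbacks and with @inlineCallbacks (get_more_detail and do_the_update are the functions from the question):

from twisted.internet import defer

# Explicit callback style:
def update_one(element):
    d = get_more_detail(element)      # returns a Deferred
    d.addCallback(lambda detail: do_the_update(element, detail))
    return d

# Generator style: @inlineCallbacks drives the generator, resuming it each time
# the yielded Deferred fires, until StopIteration or defer.returnValue:
@defer.inlineCallbacks
def update_one_inline(element):
    detail = yield get_more_detail(element)
    do_the_update(element, detail)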
I don't know enough about your use case to help with your concurrency problem. Maybe make a copy of the list with tuple() and update that. But it seems like you really want an event-driven solution and not a state-driven one.
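For instance, an untested sketch of the copy idea, iterating over a snapshot so the loop is unaffected if self.data is replaced mid-way (it still shares the dicts themselves):

@defer.inlineCallbacks
def _get_details(self, dt=None):
    snapshot = tuple(self.data)       # shallow copy taken before the first yield
    for e in snapshot:
        if needs_update(e):
            more_detail = yield get_more_detail(e)
            do_the_update(e, more_detail)
    schedule_future(self._get_details)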
You need to protect access to the shared resource (self.data).
You can do this with: twisted.internet.defer.DeferredLock.
http://twistedmatrix.com/documents/current/api/twisted.internet.defer.DeferredLock.html
Method acquire
Attempt to acquire the lock. Returns a Deferred that fires on lock
acquisition with the DeferredLock as the value. If the lock is locked,
then the Deferred is placed at the end of a waiting list.
Method release
Release the lock. If there is a waiting list, then the first Deferred in that waiting list will be called back.
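An untested sketch of how that could look wrapped around the loop from the question; everything that touches self.data acquires the same lock first:

from twisted.internet import defer

# created once, e.g. in __init__:
#     self._data_lock = defer.DeferredLock()

@defer.inlineCallbacks
def _get_details(self, dt=None):
    yield self._data_lock.acquire()
    try:
        for e in self.data:
            if needs_update(e):
                more_detail = yield get_more_detail(e)
                do_the_update(e, more_detail)
    finally:
        self._data_lock.release()
    schedule_future(self._get_details)

@defer.inlineCallbacks
def replace_data(self, new_data):
    # any code that mutates self.data grabs the same lock first
    yield self._data_lock.acquire()
    try:
        self.data = new_data
    finally:
        self._data_lock.release()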
@defer.inlineCallbacks
def _get_details(self, dt=None):
    data = self.data
    i = 0
    while i < len(data):
        e = data[i]
        if needs_update(e):
            more_detail = yield get_more_detail(e)
            # if the list changed while we were waiting, stop the loop
            if i >= len(data) or data[i] != e:
                break
            do_the_update(e, more_detail)
        i += 1
    schedule_future(self._get_details)
Based on more testing, the following are my observations.
for e in data iterates through elements, with the element still existing even if data itself does not, both before and after the yield statement.
As far as I can tell, execution is atomic between one yield and the next.
Looping through the data is more transparently done by using a counter. This also allows for checking whether the data has changed. The check can be done anytime after yield because any changes must have occurred before yield returned. This results in the code shown above.
self.data is a list of dictionaries...once it is changed (can be at any point) the loop obviously refers to the wrong information
If you're modifying a list while you iterate it, as Raymond Hettinger would say, "You're living in the land of sin and you deserve everything that happens to you." :) Scenarios like this should be avoided or the list should be immutable. To circumvent this problem, you can use self.data.pop() or a DeferredQueue object to store data. This way you can add and remove elements at any time without causing adverse effects. Example with a list:
@defer.inlineCallbacks
def _get_details(self, dt=None):
    try:
        data = yield self.data.pop()
    except IndexError:
        schedule_future(self._get_details)
        defer.returnValue(None)  # exit function

    if needs_update(data):
        more_detail = yield get_more_detail(data)
        do_the_update(data, more_detail)
    schedule_future(self._get_details)
Take a look at DeferredQueue because a Deferred is returned when the get() function is called, to which you can chain callbacks to handle each element you pop from the queue.
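A minimal untested sketch of that approach with DeferredQueue (get_more_detail and do_the_update as in the question):

from twisted.internet import defer

queue = defer.DeferredQueue()

@defer.inlineCallbacks
def consumer():
    while True:
        item = yield queue.get()      # fires as soon as something has been put()
        more_detail = yield get_more_detail(item)
        do_the_update(item, more_detail)

# producer side, anywhere in the reactor thread:
queue.put({"name": "example", "id": 1})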
Behold my simple Python memcached code below:
import memcache
memcache_client = memcache.Client(['127.0.0.1:11211'], debug=True)
key = "myList"
obj = ["A", "B", "C"]
memcache_client.set(key, obj)
Now, suppose I want to append an element "D" to the list cached as myList, how can I do it atomically?
I know this is wrong because it is not atomic:
memcache_client.set(key, memcache_client.get(key) + ["D"])
The above statement contains a race condition. If another thread executes this same instruction at the exact right moment, one of the updates will get clobbered.
How can I solve this race condition? How can I update a list or dictionary stored in memcached atomically?
Here's the corresponding function of the python client API
https://cloud.google.com/appengine/docs/python/memcache/clientclass#Client_cas
Also, here's a nice tutorial by Guido van Rossum. Hopefully he explains the Python stuff better than I do ;)
Here's how the code should look in your case:
memcache_client = memcache.Client(['127.0.0.1:11211'], debug=True)
key = "myList"

while True:  # Retry loop; probably it should be limited to some reasonable number of retries
    obj = memcache_client.gets(key)
    assert obj is not None, 'Uninitialized object'
    if memcache_client.cas(key, obj + ["D"]):
        break
The whole workflow remains the same: first you fetch a value (with some internal information bound to the key), then modify the fetched value, then attempt to update it in memcache. The only difference is that the value (actually, the key/value pair) is checked to make sure it hasn't been changed simultaneously by a parallel process. In the latter case the call fails and you should retry the workflow from the beginning. Also, if you have a multi-threaded application, then each memcache_client instance likely should be thread-local.
Also don't forget that there're incr() and decr() methods for simple integer counters which are "atomic" by their nature.
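For example (untested, with the same python-memcached client as above):

memcache_client.set("counter", 0)
memcache_client.incr("counter")     # atomic server-side increment -> 1
memcache_client.incr("counter", 5)  # -> 6
memcache_client.decr("counter")     # -> 5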
If you don't want a race condition then you must use the Lock primitive from the threading module. For example:
lock = threading.Lock()

def thread_func():
    # hold the lock for the whole read-modify-write so no other thread interleaves
    lock.acquire()
    obj = get_obj()
    memcache_client.set(key, obj)
    lock.release()
Question:
Essentially I want to return a unique result from the database every time a view is called (until I run out of unique objects and have to start over). I was thinking that a simple and elegant solution would be to use a generator to handle this. Is this possible, and if so, how can this be approached with regard to pulling values from the ORM?
Note:
I think sessions or utilizing a design pattern like Memento may be a solution here, but I'm really curious to see if and how Python generators could be used in this context.
As Django is synchronous WSGI, you have to process each request as standalone; your Python environment can be killed or switched to another one at any time.
Still if you have no fear and a single process, you can make a file scope dictionary with session ids and iterators that you'll consume each time
from django.shortcuts import render
from collections import defaultdict
import uuid

def iterator():
    for item in DatabaseTable.objects.all():
        yield item

sessions_current_iterators = defaultdict(iterator)

def my_view(request):
    id = request.session.get("iterator_id", None)
    if id is None:
        id = request.session["iterator_id"] = str(uuid.uuid4())
    try:
        return render(request, "item_template.html", {"item": next(sessions_current_iterators[id])})
    except StopIteration:
        request.session.pop("iterator_id")
        return render(request, "end_template.html", {})
but: NEVER USE THIS ON A PRODUCTION ENVIRONMENT!
Generators are great to reduce memory consumption while computing the request, or can be good for a Tornado web service, but clearly, Django should not share data between requests in local variables.
You can always use yield where you can use return (since these are Python features, not Django ones). The only caveat here is that the same function is called for every request, so the continuation after the yield may serve another client instead of the one you intend.
However, you can beat this problem by using a higher-level function (a generator here). Basically, the function keeps a dictionary of generators indexed by unique keys derived from the requests. Every time the function is called, check whether an entry already exists for the request in the dictionary; if not, add a new generator for that request. Then invoke the generator for the given request and yield whatever it produces.
To keep the dictionary in memory, the main function is itself a generator that never really exits: its body initializes the dictionary to an empty dictionary once, then wraps everything else in an infinite while loop. When called the first time, the dictionary is initialized and the while loop starts. In the loop, the function creates and stores a generator in the dictionary if no entry already exists for the given request, then invokes the generator for the request and yields its result at the bottom of the while. When called again, the main function resumes at the top of the while. The code is like so:
def main_func(request, *args):
    funcs = {}
    while True:
        request_key = make_key(request)
        if request_key not in funcs:
            def generator_func():
                # your generator code here...
                # remember to delete the funcs[request_key] entry before returning...
                ...
            # store the generator instance so its state persists across calls
            funcs[request_key] = generator_func()
        yield next(funcs[request_key])

def make_key(request):
    # quick and dirty impl
    return str(request.session)
Is there a data structure in Python that resembles a blocking dictionary? This data structure must fulfill these requirements:
it must be randomly accessible and allow any element to be modified/deleted (not just the first or last)
it must have a blocking get() and put()
it must be thread-safe
I would have used a queue but, although blocking and thread-safe, it's not randomly accessible. A dict is not blocking either (as far as my Python knowledge goes).
As an example, think of one producer thread adding key-value pairs to such a data-structure (updating values for existing keys if already present - this is where a queue won't cut it), and a worker blocking on get() and consuming these key-value pairs as they become available.
Many many thanks!
edit:
Let's assume the producer polls a CI server and gets project-status pairs. It generates the differences in project statuses and puts them in the aforementioned data structure. The worker picks up these project-status updates and displays them one by one as an animation on the screen.
class Producer:
    def generateProjectStatusChanges(self):
        ...

    def updateSuperAwesomeDataStructure(self, changes):
        for (proj, stat) in changes:
            # queue won't do cause the update could take place in the middle of the queue
            # hence the dict behavior
            superAwesomeDS.putOrUpdate(proj, stat)

    def watchForUpdates(self):
        changes = self.generateProjectStatusChanges()
        self.updateSuperAwesomeDataStructure(changes)
        time.sleep(self.interval)


class Worker:
    def blockingNotifyAnimation(self, proj, stat):
        ...

    def watchForUpdates(self):
        while True:
            proj, stat = superAwesomeDS.getFirstPair()  # or any pair really
            self.blockingNotifyAnimation(proj, stat)
Something along the following lines should do the trick (untested):
import threading

class UpdatableBlockingQueue(object):
    def __init__(self):
        self.queue = {}
        self.cv = threading.Condition()

    def put(self, key, value):
        with self.cv:
            self.queue[key] = value
            self.cv.notify()

    def pop(self):
        with self.cv:
            while not self.queue:
                self.cv.wait()
            return self.queue.popitem()
It uses a dictionary for the queue and a condition variable for serialising access and signalling between threads.
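Plugged into the example from the question, usage would look roughly like this (untested; superAwesomeDS is the shared instance):

superAwesomeDS = UpdatableBlockingQueue()

# producer thread
for proj, stat in generateProjectStatusChanges():
    superAwesomeDS.put(proj, stat)     # overwrites any pending status for proj

# worker thread
while True:
    proj, stat = superAwesomeDS.pop()  # blocks until at least one pair is available
    blockingNotifyAnimation(proj, stat)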