Can Python Generators be used in Django Views?

Question:
Essentially I want to return a unique result from the database every time a view is called (until I run out of unique objects and have to start over). I was thinking that a simple and elegant solution would be to use a generator to handle this. Is this possible, and if so, how can it be approached with regard to pulling values from the ORM?
Note:
I think sessions or utilizing a design pattern like Memento may be a solution here, but I'm really curious to see if and how Python generators could be used in this context.

As Django is a synchronous WSGI application, you have to treat each request as standalone; your Python process can be killed or swapped for another one at any time.
Still, if you are fearless and run a single process, you can keep a module-level dictionary of session ids and iterators that you consume on each request:
from django.shortcuts import render
from collections import defaultdict
import uuid


def iterator():
    for item in DatabaseTable.objects.all():
        yield item


# module-level state: one iterator per session id (single process only)
sessions_current_iterators = defaultdict(iterator)


def my_view(request):
    if request.session.get("iterator_id") is None:
        request.session["iterator_id"] = str(uuid.uuid4())
    try:
        item = next(sessions_current_iterators[request.session["iterator_id"]])
        return render(request, "item_template.html", {"item": item})
    except StopIteration:
        request.session.pop("iterator_id")
        return render(request, "end_template.html", {})
But: NEVER USE THIS IN A PRODUCTION ENVIRONMENT!
Generators are great for reducing memory consumption while computing a response, and they can work well for something like a Tornado web service, but a Django view should clearly not share data between requests in module-level variables.
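If you do need this behaviour in production, a safer route is the session-based idea from the question's note: keep only a cursor in the session and query the database fresh on every request. A minimal sketch, reusing DatabaseTable and the template names from the example above (the primary-key cursor is an assumption about what "unique result" means here):

from django.shortcuts import render

def my_view(request):
    # only a cursor (the last primary key served) lives in the session;
    # no state is shared between processes
    last_pk = request.session.get("last_pk", 0)
    item = (DatabaseTable.objects
            .filter(pk__gt=last_pk)
            .order_by("pk")
            .first())
    if item is None:
        # ran out of unique objects; start over on the next request
        request.session.pop("last_pk", None)
        return render(request, "end_template.html", {})
    request.session["last_pk"] = item.pk
    return render(request, "item_template.html", {"item": item})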

You can always use yield where you can use return, since this is a Python feature rather than a Django one. The caveat is that the same view function is called for every request, so the continuation after the yield may end up serving a different client than the one you intended.

You can work around this with a higher-level generator that keeps a dictionary of per-request generators, indexed by a unique key derived from the request. Each time the outer function runs, it checks whether an entry already exists for the request; if not, it creates one. It then advances the generator for that request and yields whatever that generator produces. To keep the dictionary alive between calls, the outer function is itself a generator: its body initializes the dictionary once and then wraps everything else in an infinite while loop, so it never really exits. On the first call the dictionary is initialized and the loop starts; on later calls execution resumes at the top of the loop, creating an inner generator if needed, advancing it, and yielding its value at the bottom of the loop. The code looks like this:
def main_func(request, *args):
    funcs = {}
    while True:
        request_key = make_key(request)
        if request_key not in funcs:
            def generator_func():
                # your generator code here...
                # remember to delete funcs[request_key] before returning...
                yield
            # store the generator object (not the function) so its state survives
            funcs[request_key] = generator_func()
        yield next(funcs[request_key])


def make_key(request):
    # quick and dirty implementation
    return str(request.session)

Related

Python recursive tree example

I'm analyzing a bit of code in one of the Jenkins builds which uses recursion to get the URLs of Jenkins' downstream jobs:
def get_all_downstream_jobs_urls(ds_jobs: set = None):
    global JENKINS_JOBS
    if not ds_jobs:
        ds_jobs = set(); ds_jobs.update(extract_ds_job_url(get_ds_jobs(BASE_JOB_URL)))
    temp = ds_jobs
    for _ in ds_jobs.copy():
        result = extract_ds_job_url(get_ds_jobs(_))  # <--- jenkins rest api call
        if result: temp.update(result); JENKINS_JOBS.update(temp);
        else: return temp
    return get_all_downstream_jobs_urls(temp)
This works OK for a project that has downstream jobs, although it makes too many calls to the Jenkins REST API, but if a project doesn't have downstream jobs it gets stuck in infinite recursion. Could you help me figure out where the issue is?
If extract_ds_job_url(get_ds_jobs(BASE_JOB_URL)) returns an empty set, you are forever calling get_all_downstream_jobs_urls(temp). That's because the for loop is not going to do anything.
The test at the top should check for None instead:
if ds_jobs is None:
and a separate test for ds_jobs being empty should end the recursion:
if not ds_jobs:
    # no downstream jobs to process
    return set()
I can't vouch for the rest of the logic, but there are certainly many style errors in the code. I'd refactor it to at least get rid of some of those errors:
JENKINS_JOBS is never rebound, so global JENKINS_JOBS is redundant and confusing and should be removed.
It's not clear why the function is updating a global and is returning the result set. It should do one, or the other, not both.
_ is, by convention, a throw-away variable. It signals that the value is not going to be used. Yet here the code does use it. It should be renamed job_url instead.
You really never should use ; in production code. Put the code on separate lines.
ds_jobs = set() then ds_jobs.update(...) is a way too verbose spelling of ds_jobs = set(...).
temp is not a good variable name, updated might be a better name. It should be made a copy when assigned, so updated = set(ds_jobs), and the .copy() call can be removed from the for loop.
the return when the first job URL doesn't have downstream URLs is probably not what you want either.
If you really want a tree of downstream URLs, the recursive call should not pass in all the job URLs collected so far! It is just as likely to call the Jenkins API again and again for a job URL that was already checked (see the recursive sketch after this list).
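For comparison, a recursion-based sketch with those fixes applied could look like this (the function name is made up, and extract_ds_job_url and get_ds_jobs are assumed to behave as in the question); the iterative version below is still the better option:

def collect_downstream_job_urls(job_url=BASE_JOB_URL, seen=None):
    # recurse per job URL with a shared 'seen' set, so each URL is queried once
    if seen is None:
        seen = set()
    for downstream_url in extract_ds_job_url(get_ds_jobs(job_url)):
        if downstream_url in seen:
            continue
        seen.add(downstream_url)
        collect_downstream_job_urls(downstream_url, seen)
    return seen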
The following code removes the recursion by using a stack instead, and is guaranteed to call the Jenkins API for each job URL just once:
def get_all_downstream_jobs_urls():
    ds_jobs = set()
    # seed the stack with the downstream URLs of the base job
    stack = list(extract_ds_job_url(get_ds_jobs(BASE_JOB_URL)))
    while stack:
        job_url = stack.pop()
        if job_url in ds_jobs:
            # already seen before, skip
            continue
        ds_jobs.add(job_url)
        # add downstream jobs to the stack for further processing
        stack.extend(extract_ds_job_url(get_ds_jobs(job_url)))
    return ds_jobs
Last but not least, I strongly suspect that using a third-party library like the jenkinsapi package would make this all even simpler; the Jenkins API probably lets you query this information in a single call, and a library would make that call for you and give you readily parsed Python objects.

Loop through changing dataset with inlineCallbacks/yield (python-twisted)

I have a defer.inlineCallbacks function for incrementally updating a large (>1k item) list, one piece at a time. This list may change at any time, and I'm getting bugs because of that behavior.
The simplest representation of what I'm doing is:
@defer.inlineCallbacks
def _get_details(self, dt=None):
    data = self.data
    for e in data:
        if needs_update(e):
            more_detail = yield get_more_detail(e)
            do_the_update(e, more_detail)
    schedule_future(self._get_details)
self.data is a list of dictionaries which is initially populated with basic information (e.g. a name and ID) at application start. _get_details will run whenever allowed to by the reactor to get more detailed information for each item in data, updating the item as it goes along.
This works well when self.data does not change, but once it is changed (which can happen at any point) the loop obviously refers to the wrong information. In that situation it would actually be better to stop the loop entirely.
I'm able to set a flag in my class (which the inlineCallback can then check) when the data is changed.
Where should this check be conducted?
How does the inlineCallbacks code execute compared to a normal deferred (and indeed to a normal Python generator)?
Does code execution stop every time it encounters yield (i.e. can I rely on the code between one yield and the next being atomic)?
In the case of unreliable large lists, should I even be looping through the data (for e in data), or is there a better way?
The Twisted reactor never preempts your code while it is executing; you have to voluntarily yield control to the reactor by returning a value. This is why it is such a terrible thing to write Twisted code that blocks on I/O: the reactor cannot schedule any tasks while you are waiting on your disk.
So the short answer is that yes, execution is atomic between yields.
Without @inlineCallbacks, the _get_details function returns a generator. The @inlineCallbacks decorator simply wraps the generator in a Deferred that traverses the generator until it raises StopIteration or calls defer.returnValue. When either of those conditions is reached, inlineCallbacks fires its Deferred. It's quite clever, really.
I don't know enough about your use case to help with your concurrency problem. Maybe make a copy of the list with tuple() and update that. But it seems like you really want an event-driven solution and not a state-driven one.
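A minimal sketch of that snapshot idea, reusing the helpers from the question (needs_update, get_more_detail, do_the_update, schedule_future):

from twisted.internet import defer

@defer.inlineCallbacks
def _get_details(self, dt=None):
    # work on an immutable snapshot; concurrent changes to self.data
    # no longer break this pass, though updates may land on stale items
    snapshot = tuple(self.data)
    for e in snapshot:
        if needs_update(e):
            more_detail = yield get_more_detail(e)
            do_the_update(e, more_detail)
    schedule_future(self._get_details)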
You need to protect access to the shared resource (self.data).
You can do this with twisted.internet.defer.DeferredLock:
http://twistedmatrix.com/documents/current/api/twisted.internet.defer.DeferredLock.html
Method acquire
Attempt to acquire the lock. Returns a Deferred that fires on lock
acquisition with the DeferredLock as the value. If the lock is locked,
then the Deferred is placed at the end of a waiting list.
Method release
Release the lock. If there is a waiting list, then the first Deferred in that waiting list will be called back.
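A sketch of how DeferredLock might be wired into the loop from the question; any other code that mutates self.data would have to acquire the same lock (the self.data_lock attribute is an assumption, created once in __init__):

from twisted.internet import defer

# in __init__: self.data_lock = defer.DeferredLock()

@defer.inlineCallbacks
def _get_details(self, dt=None):
    # acquire() returns a Deferred that fires once we hold the lock
    yield self.data_lock.acquire()
    try:
        for e in self.data:
            if needs_update(e):
                more_detail = yield get_more_detail(e)
                do_the_update(e, more_detail)
    finally:
        self.data_lock.release()
    schedule_future(self._get_details)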
Another option is to loop with an index and check for changes after each yield:

@defer.inlineCallbacks
def _get_details(self, dt=None):
    data = self.data
    i = 0
    while i < len(data):
        e = data[i]
        if needs_update(e):
            more_detail = yield get_more_detail(e)
            if i >= len(data) or data[i] != e:
                break
            do_the_update(e, more_detail)
        i += 1
    schedule_future(self._get_details)
Based on more testing, the following are my observations.
for e in data iterates through the elements, and an element keeps existing even if data itself no longer does, both before and after the yield statement.
As far as I can tell, execution is atomic between one yield and the next.
Looping through the data is more transparently done using a counter. This also allows checking whether the data has changed. The check can be done any time after the yield, because any changes must have occurred before the yield returned. This results in the code shown above.
self.data is a list of dictionaries...once it is changed (can be at any point) the loop obviously refers to the wrong information
If you're modifying a list while you iterate over it, as Raymond Hettinger would say, "You're living in the land of sin and you deserve everything that happens to you." :) Scenarios like this should be avoided, or the list should be immutable. To work around this problem, you can use self.data.pop() or a DeferredQueue object to store the data. This way you can add and remove elements at any time without causing adverse effects. Example with a list:
@defer.inlineCallbacks
def _get_details(self, dt=None):
    try:
        data = yield self.data.pop()
    except IndexError:
        schedule_future(self._get_details)
        defer.returnValue(None)  # exit function
    if needs_update(data):
        more_detail = yield get_more_detail(data)
        do_the_update(data, more_detail)
    schedule_future(self._get_details)
Take a look at DeferredQueue: a Deferred is returned when its get() function is called, and you can chain callbacks to it to handle each element you pop from the queue.
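A rough sketch of the DeferredQueue variant (the queue name and consume_updates are made up; the helpers come from the question):

from twisted.internet import defer

update_queue = defer.DeferredQueue()

@defer.inlineCallbacks
def consume_updates():
    while True:
        # get() returns a Deferred that fires as soon as an item is available
        item = yield update_queue.get()
        if needs_update(item):
            more_detail = yield get_more_detail(item)
            do_the_update(item, more_detail)

Producers then simply call update_queue.put(new_item) whenever the data changes.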

How to add an item to a memcached list atomically (in Python)

Behold my simple Python memcached code below:
import memcache
memcache_client = memcache.Client(['127.0.0.1:11211'], debug=True)
key = "myList"
obj = ["A", "B", "C"]
memcache_client.set(key, obj)
Now, suppose I want to append an element "D" to the list cached as myList, how can I do it atomically?
I know this is wrong because it is not atomic:
memcache_client.set(key, memcache_client.get(key) + ["D"])
The above statement contains a race condition. If another thread executes this same instruction at the exact right moment, one of the updates will get clobbered.
How can I solve this race condition? How can I update a list or dictionary stored in memcached atomically?
Here's the corresponding function of the Python client API:
https://cloud.google.com/appengine/docs/python/memcache/clientclass#Client_cas
Also, here's a nice tutorial by Guido van Rossum; hopefully he explains the Python side better than I can ;)
Here's how the code should look in your case:

memcache_client = memcache.Client(['127.0.0.1:11211'], debug=True)
key = "myList"

while True:  # retry loop; in practice, limit it to a reasonable number of retries
    obj = memcache_client.gets(key)
    assert obj is not None, 'Uninitialized object'
    if memcache_client.cas(key, obj + ["D"]):
        break
The overall workflow stays the same: first you fetch the value (together with some internal state bound to the key), then modify the fetched value, then attempt to write it back to memcache. The only difference is that memcache checks that the key/value pair hasn't been changed simultaneously by a parallel process. If it has, the call fails and you should retry the workflow from the beginning. Also, if you have a multi-threaded application, each memcache_client instance should likely be thread-local.
Also, don't forget that there are incr() and decr() methods for simple integer counters, which are atomic by nature.
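For example (the "page_hits" key is made up; incr() and decr() are part of the memcache client shown above):

memcache_client.set("page_hits", 0)
memcache_client.incr("page_hits")     # atomic += 1
memcache_client.incr("page_hits", 5)  # atomic += 5
memcache_client.decr("page_hits")     # atomic -= 1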
If you want to avoid a race between threads of the same process, you can use the Lock primitive from the threading module. For example:

import threading

lock = threading.Lock()

def thread_func():
    obj = get_obj()
    with lock:
        memcache_client.set(key, obj)

Creating an asynchronous method with Google App Engine's NDB

I want to make sure I understand how to create tasklets and asynchronous methods. What I have is a method that returns a list. I want it to be called from somewhere and immediately allow other calls to be made. So I have this:
future_1 = get_updates_for_user(userKey, aDate)
future_2 = get_updates_for_user(anotherUserKey, aDate)
somelist.extend(future_1)
somelist.extend(future_2)
....
@ndb.tasklet
def get_updates_for_user(userKey, lastSyncDate):
    noteQuery = ndb.GqlQuery('SELECT * FROM Comments WHERE ANCESTOR IS :1 AND modifiedDate > :2', userKey, lastSyncDate)
    note_list = list()
    qit = noteQuery.iter()
    while (yield qit.has_next_async()):
        note = qit.next()
        noteDic = note.to_dict()
        note_list.append(noteDic)
    raise ndb.Return(note_list)
Is this code doing what I'd expect it to do? Namely, will the two calls run asynchronously? Am I using futures correctly?
Edit: Well after testing, the code does produce the desired results. I'm a newbie to Python - what are some ways to test to see if the methods are running async?
It's pretty hard to verify for yourself that the methods are running concurrently -- you'd have to put copious logging in. Also in the dev appserver it'll be even harder as it doesn't really run RPCs in parallel.
Your code looks okay, it uses yield in the right place.
My only recommendation is to name your function get_updates_for_user_async() -- that matches the convention NDB itself uses and is a hint to the reader of your code that the function returns a Future and should be yielded to get the actual result.
An alternative way to do this is to use the map_async() method on the Query object; it would let you write a callback that just contains the to_dict() call:
@ndb.tasklet
def get_updates_for_user_async(userKey, lastSyncDate):
    noteQuery = ndb.gql('...')
    note_list = yield noteQuery.map_async(lambda note: note.to_dict())
    raise ndb.Return(note_list)
Advanced tip: you can simplify this even more by dropping the @ndb.tasklet decorator and just returning the Future returned by map_async():

def get_updates_for_user_async(userKey, lastSyncDate):
    noteQuery = ndb.gql('...')
    return noteQuery.map_async(lambda note: note.to_dict())
This is a general slight optimization for async functions that contain only one yield and immediately return the yielded value. (If you don't immediately get this, you're in good company, and it runs the risk of being broken by a future maintainer who doesn't either. :-)
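To tie this back to the calling code in the question: a caller that is not itself a tasklet would normally block on each future with get_result() rather than extending the list with the future objects. A sketch, assuming the variable names from the question:

# start both queries concurrently, then wait for their results
future_1 = get_updates_for_user_async(userKey, aDate)
future_2 = get_updates_for_user_async(anotherUserKey, aDate)
somelist.extend(future_1.get_result())
somelist.extend(future_2.get_result())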

Python generator's 'yield' in separate function

I'm implementing a utility library which is a sort-of task manager intended to run within the distributed environment of Google App Engine cloud computing service. (It uses a combination of task queues and memcache to execute background processing). I plan to use generators to control the execution of tasks, essentially enforcing a non-preemptive "concurrency" via the use of yield in the user's code.
The trivial example - processing a bunch of database entities - could be something like the following:
class EntityWorker(Worker):
    def setup(self):
        self.entity_query = Entity.all()

    def run(self):
        for e in self.entity_query:
            do_something_with(e)
            yield
As we know, yield is a two-way communication channel, allowing values to be passed to the code that uses the generator. This makes it possible to simulate a "preemptive API" such as the SLEEP call below:
def run(self):
    for e in self.entity_query:
        do_something_with(e)
        yield Worker.SLEEP, timedelta(seconds=1)
But this is ugly. It would be great to hide the yield in a separate function which could be invoked in a simple way:
self.sleep(timedelta(seconds=1))
The problem is that putting yield in the sleep function turns it into a generator function. The call above would therefore just return another generator. Only after adding .next() and yield again would we obtain the previous result:
yield self.sleep(timedelta(seconds=1)).next()
which is of course even more ugly and unnecessarily verbose than before.
Hence my question: Is there a way to put yield into a function without turning it into a generator function, but still making it usable by other generators to yield values computed by it?
You seem to be missing the obvious:
class EntityWorker(Worker):
    def setup(self):
        self.entity_query = Entity.all()

    def run(self):
        for e in self.entity_query:
            do_something_with(e)
            yield self.sleep(timedelta(seconds=1))

    def sleep(self, wait):
        return Worker.SLEEP, wait
It's the yield that turns a function into a generator; it's impossible to leave it out.
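To make the two-way channel concrete, here is a rough sketch of how a driver loop could consume what run() yields; the drive() helper and the way Worker.SLEEP is honoured are assumptions about how the task manager is wired, not part of the question's framework:

import time

def drive(worker):
    # advance the worker's generator and honour each yielded command;
    # a real task manager would reschedule via task queues instead of sleeping
    for command, wait in worker.run():
        if command is Worker.SLEEP:
            time.sleep(wait.total_seconds())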
To hide the yield you need a higher-order function; in your example it's map:
from itertools import imap

def slowmap(f, sleep, *iters):
    for row in imap(f, self.entity_query):
        yield Worker.SLEEP, wait

def run():
    return slowmap(do_something_with,
                   (Worker.SLEEP, timedelta(seconds=1)),
                   self.entity_query)
Alas, this won't work. But a "middle-way" could be fine:
def sleepjob(*a, **k):
    if a:
        return Worker.SLEEP, a[0]
    else:
        return Worker.SLEEP, timedelta(**k)
So
yield self.sleepjob(timedelta(seconds=1))
yield self.sleepjob(seconds=1)
look OK to me.
I would suggest you have a look at ndb. It uses generators as coroutines (as you are proposing here), allowing you to write programs that work with RPCs asynchronously.
The API does this by wrapping the generator with another function that 'primes' the generator (it calls .next() immediately so that the code begins execution). The tasklets are also designed to work with App Engine's RPC infrastructure, making it possible to use any of the existing asynchronous API calls.
With the concurrency model used in ndb, you yield either a future object (similar to what is described in PEP 3148) or an App Engine RPC object. When that RPC has completed, execution in the function that yielded the object is allowed to continue.
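As an illustration of the 'priming' idea mentioned above, a plain-Python sketch (the primed decorator name is made up; ndb's real implementation is considerably more involved):

def primed(generator_function):
    # wrap a generator function so callers receive a generator that has
    # already been advanced to its first yield
    def wrapper(*args, **kwargs):
        gen = generator_function(*args, **kwargs)
        next(gen)  # run the body up to the first yield
        return gen
    return wrapper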
If you are using a model derived from ndb.model.Model then the following will allow you to asynchronously iterate over a query:
from ndb import tasklets

@tasklets.tasklet
def run():
    it = iter(Entity.query())
    # Other tasklets will be allowed to run if the next call has to wait for an rpc.
    while (yield it.has_next_async()):
        entity = it.next()
        do_something_with(entity)
Although ndb is still considered experimental (some of its error handling code still needs some work), I would recommend you have a look at it. I have used it in my last 2 projects and found it to be an excellent library.
Make sure you read through the documentation linked from the main page, and also the companion documentation for the tasklet stuff.
