I'm analyzing a bit of code in one of our Jenkins builds which uses recursion to collect the URLs of Jenkins downstream jobs:
def get_all_downstream_jobs_urls(ds_jobs: set = None):
    global JENKINS_JOBS
    if not ds_jobs:
        ds_jobs = set(); ds_jobs.update(extract_ds_job_url(get_ds_jobs(BASE_JOB_URL)))
    temp = ds_jobs
    for _ in ds_jobs.copy():
        result = extract_ds_job_url(get_ds_jobs(_))  # <--- Jenkins REST API call
        if result: temp.update(result); JENKINS_JOBS.update(temp);
        else: return temp
    return get_all_downstream_jobs_urls(temp)
This works OK for a project that has downstream jobs, although it makes too many calls to the Jenkins REST API; but if a project doesn't have downstream jobs, it gets stuck in infinite recursion. Could you help me figure out where the issue is?
If extract_ds_job_url(get_ds_jobs(BASE_JOB_URL)) returns an empty set, you end up forever calling get_all_downstream_jobs_urls(temp). That's because the for loop then has nothing to iterate over, so the function never returns from inside the loop.
The test at the top should check for None instead:
if ds_jobs is None:
and a separate test for ds_jobs being empty should end the recursion:
if not ds_jobs:
    # no downstream jobs to process
    return set()
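Put together, the top of the function could look like this (a sketch that keeps the helper names extract_ds_job_url, get_ds_jobs and BASE_JOB_URL from the question):

def get_all_downstream_jobs_urls(ds_jobs: set = None):
    if ds_jobs is None:
        # first call: seed with the base job's downstream URLs
        ds_jobs = set(extract_ds_job_url(get_ds_jobs(BASE_JOB_URL)))
    if not ds_jobs:
        # no downstream jobs left to process
        return set()
    ...  # the rest of the traversal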
I can't vouch for the rest of the logic, but there are certainly many style errors in the code. I'd refactor it to at least get rid of some of those errors:
JENKINS_JOBS is never rebound, so global JENKINS_JOBS is redundant and confusing and should be removed.
It's not clear why the function both updates a global and returns the result set. It should do one or the other, not both.
_ is, by convention, a throw-away variable; it signals that the value is not going to be used. Yet here the code does use it. It should be renamed to job_url instead.
You really should never use ; in production code. Put the statements on separate lines.
ds_jobs = set() then ds_jobs.update(...) is a way too verbose spelling of ds_jobs = set(...).
temp is not a good variable name; updated might be a better one. It should be made a copy when assigned, i.e. updated = set(ds_jobs), and then the .copy() call can be removed from the for loop.
The early return when the first job URL doesn't have downstream URLs is probably not what you want either.
If you really want a tree of downstream URLs, the recursive call should not pass in all the job URLs collected so far! It's just as likely to call the Jenkins API again and again for a job URL that was already checked.
The following code removes the recursion by using a stack instead, and is guaranteed to call the Jenkins API for each job URL just once:
def get_all_downstream_jobs_urls():
    ds_jobs = set()
    # seed the stack with the base job's downstream URLs
    stack = list(extract_ds_job_url(get_ds_jobs(BASE_JOB_URL)))
    while stack:
        job_url = stack.pop()
        if job_url in ds_jobs:
            # already seen before, skip
            continue
        ds_jobs.add(job_url)
        # add downstream jobs to the stack for further processing
        stack.extend(extract_ds_job_url(get_ds_jobs(job_url)))
    return ds_jobs
Last but not least, I strongly suspect that using a third-party library like the jenkinsapi package would make this all even simpler; the Jenkins API probably lets you query this information in a single call, and a library would make that call for you and give you readily-parsed Python objects.
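For illustration, a sketch of what that might look like with jenkinsapi; the server URL and job name are made up, and the method and attribute names used here (get_downstream_jobs(), baseurl) should be double-checked against the library's documentation:

from jenkinsapi.jenkins import Jenkins

server = Jenkins('http://jenkins.example.com:8080')  # hypothetical server URL
base_job = server.get_job('my-base-job')             # hypothetical job name

def walk_downstream(job, seen=None):
    # depth-first walk over the downstream tree, visiting each job once
    if seen is None:
        seen = set()
    for ds in job.get_downstream_jobs():
        if ds.baseurl not in seen:
            seen.add(ds.baseurl)
            walk_downstream(ds, seen)
    return seen

print(walk_downstream(base_job))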
If I have a structure in an async webserver like this:

import contextvars
...
my_context_var = contextvars.ContextVar("var")

@app.route("/foo")  # decorator from webserver
async def some_web_endpoint():
    local_ctx_var = my_context_var.set(params.get("bar"))  # app sets params
    await some_function_that_can_raise()
    my_context_var.reset(local_ctx_var)
Will it leak memory if I don't wrap the call in a finally: block and some_function_that_can_raise() raises an Exception? (Without a finally: block like the one below, .reset() would never be called in that case.)
try:
    await some_function_that_can_raise()
finally:
    my_context_var.reset(local_ctx_var)
...or is it safe to assume the value will be destroyed when the request scope ends?
The async example in the upstream docs doesn't actually bother .reset()-ing it at all!
In such a case, .reset() is redundant as it happens right before the context is cleaned up anyways.
To add some more context (ha): I've recently been learning about ContextVars, and I assume the second is the case.
local_ctx_var is the only name which refers to the Token (from .set()), and as the name is deleted when the request scope ends, the local value should become a candidate for garbage collection, preventing a potential leak and making .reset() unnecessary for short-lived scopes (hooray).
...but I'm not absolutely certain, and while there's some very helpful information on the subject, it muddies the water slightly:
What happens if I don't reset Python's ContextVars? (implies it'll be GC'd as one would expect)
Context variables in Python (explicitly uses finally:)
Yes - the previous value of the context var is kept in the token object in this case. There is a rather similar question, where one of the answers runs a simple benchmark to show that calling context_var.set() multiple times and discarding the return value does not consume memory, compared to, say, creating a new string and keeping a reference to it.
Given the benchmark, I experimented further and concluded there is no leak - in fact, in code like the above, calling reset is indeed redundant. It is useful if you have to restore the previous value inside a loop construct for some reason.
A new value is set on top of the last saved context, and the value previously set in the current version of the context is simply discarded along the way: the only references to it are the ones left in the tokens, if any. In other words, what preserves previous values in a "stack-like" way are calls to Context.run and contextvars.copy_context only, not ContextVar.set.
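A short, self-contained demonstration of those token semantics:

import contextvars

var = contextvars.ContextVar("var")

token_first = var.set("first")
token_second = var.set("second")  # "first" is now only reachable through token_second
print(var.get())                  # -> second

var.reset(token_second)           # restore the value saved when that token was created
print(var.get())                  # -> first

var.reset(token_first)            # back to the original "unset" state
# var.get() would now raise LookupError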
I have a defer.inlineCallbacks function for incrementally updating a large (>1k) list one piece at a time. This list may change at any time, and I'm getting bugs because of that behavior.
The simplest representation of what I'm doing is:
@defer.inlineCallbacks
def _get_details(self, dt=None):
    data = self.data
    for e in data:
        if needs_update(e):
            more_detail = yield get_more_detail(e)
            do_the_update(e, more_detail)
    schedule_future(self._get_details)
self.data is a list of dictionaries which is initially populated with basic information (e.g. a name and ID) at application start. _get_details will run whenever allowed to by the reactor to get more detailed information for each item in data, updating the item as it goes along.
This works well when self.data does not change, but once it is changed (which can happen at any point) the loop obviously refers to the wrong information. In fact, in that situation it would be better to stop the loop entirely.
I'm able to set a flag in my class (which the inlineCallback can then check) when the data is changed.
Where should this check be conducted?
How does the inlineCallbacks code execute compared to a normal Deferred (and indeed to a normal Python generator)?
Does code execution stop every time it encounters yield (i.e. can I rely on the code between one yield and the next being atomic)?
In the case of unreliable large lists, should I even be looping through the data (for e in data), or is there a better way?
The Twisted reactor never preempts your code while it is executing -- you have to voluntarily yield to the reactor by returning a value. This is why it is such a terrible thing to write Twisted code that blocks on I/O: the reactor is not able to schedule any tasks while you are waiting on your disk.
So the short answer is that yes, execution is atomic between yields.
Without @inlineCallbacks, the _get_details function returns a generator. The @inlineCallbacks decorator simply wraps the generator in a Deferred that traverses the generator until it reaches a StopIteration exception or a defer.returnValue exception. When either of those conditions is reached, inlineCallbacks fires its Deferred. It's quite clever, really.
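As a minimal illustration of those mechanics (a sketch; task.deferLater here just stands in for any call that returns a Deferred):

from twisted.internet import defer, reactor, task

@defer.inlineCallbacks
def example():
    # everything up to the yield runs atomically, without preemption
    value = yield task.deferLater(reactor, 1.0, lambda: 42)
    # execution resumes here only once the Deferred has fired
    defer.returnValue(value * 2)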
I don't know enough about your use case to help with your concurrency problem. Maybe make a copy of the list with tuple() and update that. But it seems like you really want an event-driven solution and not a state-driven one.
You need to protect access to the shared resource (self.data).
You can do this with: twisted.internet.defer.DeferredLock.
http://twistedmatrix.com/documents/current/api/twisted.internet.defer.DeferredLock.html
Method acquire: Attempt to acquire the lock. Returns a Deferred that fires on lock acquisition with the DeferredLock as the value. If the lock is locked, then the Deferred is placed at the end of a waiting list.
Method release: Release the lock. If there is a waiting list, then the first Deferred in that waiting list will be called back.
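A sketch of how that could look here, assuming a defer.DeferredLock() is created once (e.g. as self._lock in __init__) and that any other code mutating self.data acquires the same lock:

from twisted.internet import defer

@defer.inlineCallbacks
def _get_details(self, dt=None):
    yield self._lock.acquire()  # waits until the lock is free
    try:
        for e in self.data:
            if needs_update(e):
                more_detail = yield get_more_detail(e)
                do_the_update(e, more_detail)
    finally:
        self._lock.release()
    schedule_future(self._get_details)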
@defer.inlineCallbacks
def _get_details(self, dt=None):
    data = self.data
    i = 0
    while i < len(data):
        e = data[i]
        if needs_update(e):
            more_detail = yield get_more_detail(e)
            # the list may have changed while we were waiting on the yield
            if i >= len(data) or data[i] != e:
                break
            do_the_update(e, more_detail)
        i += 1
    schedule_future(self._get_details)
Based on more testing, the following are my observations.
for e in data iterates through the elements, and each element continues to exist even if data itself no longer does, both before and after the yield statement.
As far as I can tell, execution is atomic between one yield and the next.
Looping through the data is more transparently done by using a counter, which also allows checking whether the data has changed. The check can be done any time after the yield, because any changes must have occurred before the yield returned. This results in the code shown above.
self.data is a list of dictionaries...once it is changed (can be at any point) the loop obviously refers to the wrong information
If you're modifying a list while you iterate it, as Raymond Hettinger would say, "You're living in the land of sin and you deserve everything that happens to you." :) Scenarios like this should be avoided, or the list should be immutable. To circumvent this problem, you can use self.data.pop() or a DeferredQueue object to store the data. That way you can add and remove elements at any time without causing adverse effects. Example with a list:
@defer.inlineCallbacks
def _get_details(self, dt=None):
    try:
        data = yield self.data.pop()
    except IndexError:
        schedule_future(self._get_details)
        defer.returnValue(None)  # exit function
    if needs_update(data):
        more_detail = yield get_more_detail(data)
        do_the_update(data, more_detail)
    schedule_future(self._get_details)
Take a look at DeferredQueue: a Deferred is returned when the get() function is called, to which you can chain callbacks to handle each element you pop from the queue.
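A sketch of the DeferredQueue variant; get_more_detail and do_the_update are the question's own helpers, and the consumer/producer split is illustrative:

from twisted.internet import defer

queue = defer.DeferredQueue()

@defer.inlineCallbacks
def consumer():
    while True:
        item = yield queue.get()  # fires as soon as an item is available
        more_detail = yield get_more_detail(item)
        do_the_update(item, more_detail)

queue.put({"name": "example", "id": 1})  # producers can add items at any time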
Behold my simple Python memcached code below:
import memcache
memcache_client = memcache.Client(['127.0.0.1:11211'], debug=True)
key = "myList"
obj = ["A", "B", "C"]
memcache_client.set(key, obj)
Now, suppose I want to append an element "D" to the list cached as myList, how can I do it atomically?
I know this is wrong because it is not atomic:
memcache_client.set(key, memcache_client.get(key) + ["D"])
The above statement contains a race condition. If another thread executes this same instruction at the exact right moment, one of the updates will get clobbered.
How can I solve this race condition? How can I update a list or dictionary stored in memcached atomically?
The operation you want is compare-and-set (CAS); here's the corresponding function of the Python client API:
https://cloud.google.com/appengine/docs/python/memcache/clientclass#Client_cas
Also, here's a nice tutorial by Guido van Rossum. Hopefully he explains the Python side better than I can ;)
Here's how the code would look in your case:
# cache_cas=True lets python-memcached remember the CAS ids returned by gets()
memcache_client = memcache.Client(['127.0.0.1:11211'], debug=True, cache_cas=True)
key = "myList"

while True:  # retry loop; it should probably be limited to some reasonable number of retries
    obj = memcache_client.gets(key)
    assert obj is not None, 'Uninitialized object'
    if memcache_client.cas(key, obj + ["D"]):
        break
The whole workflow remains the same: first you fetch the value (with some internal information bound to the key), then modify the fetched value, then attempt to update it in memcache. The only difference is that the value (actually, the key/value pair) is checked to make sure it hasn't been changed simultaneously by a parallel process. If it has, the call fails and you should retry the workflow from the beginning. Also, if you have a multi-threaded application, each memcache_client instance likely should be thread-local.
Also don't forget that there are incr() and decr() methods for simple integer counters, which are "atomic" by their nature.
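For example, a simple counter (hypothetical key name) needs no CAS loop at all, since the increment happens server-side:

memcache_client.set("page_hits", 0)
memcache_client.incr("page_hits")  # atomic server-side increment
memcache_client.decr("page_hits")  # atomic server-side decrement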
If the race is only between threads of a single process, you can serialize the whole read-modify-write with the Lock primitive from the threading module. For example:

import threading

lock = threading.Lock()

def thread_func():
    with lock:
        obj = memcache_client.get(key)
        memcache_client.set(key, obj + ["D"])

Note that a process-local lock only helps when every writer lives in the same process; across processes you still need cas().
Question:
Essentially I want to return a unique result from the database every time a view is called (until I run out of unique objects and have to start over). I was thinking that a simple and elegant solution would be to use a generator to handle this. Is this possible, and if so, how can it be approached with regard to pulling values from the ORM?
Note:
I think sessions or utilizing a design pattern like Memento may be a solution here, but I'm really curious to see if and how Python generators could be used in this context.
As Django is a synchronous WSGI application, you have to treat each request as standalone; your Python process can be killed or switched for another one at any time.
Still, if you have no fear and run a single process, you can make a module-level dictionary mapping session IDs to iterators that you consume on each request:
from django.shortcuts import render
from collections import defaultdict
import uuid

def iterator():
    for item in DatabaseTable.objects.all():
        yield item

sessions_current_iterators = defaultdict(iterator)

def my_view(request):
    id = request.session.get("iterator_id", None)
    if id is None:
        id = str(uuid.uuid4())
        request.session["iterator_id"] = id
    try:
        return render(request, "item_template.html", {"item": next(sessions_current_iterators[id])})
    except StopIteration:
        request.session.pop("iterator_id")
        return render(request, "end_template.html", {})
But: NEVER USE THIS IN A PRODUCTION ENVIRONMENT!
Generators are great for reducing memory consumption while computing a single request, and they can work well in a Tornado web service, but Django should clearly not share data between requests via local variables.
You can always use yield where you can use return (these are Python features, not Django ones). The only caveat is that the same view function is called for every request, so the continuation after a yield may end up serving a different client than the one you intend.
You can work around this with a higher-order generator: a function that keeps a dictionary of generators, indexed by unique keys derived from the requests. Every time the function is called, it checks whether an entry already exists for the request's key; if not, it adds a new generator for that request. It then advances the generator for the given request and yields whatever that generator produces.
To keep the dictionary in memory between calls, the main function is itself a generator: it initializes the dictionary once, then wraps everything else in an infinite while loop so it never really exits. On the first call the dictionary is initialized and the while loop starts; on every subsequent call the main function resumes at the top of the while loop. The code is like so:
def main_func(request, *args):
    funcs = {}
    while True:
        request_key = make_key(request)
        if request_key not in funcs:
            def generator_func():
                # your generator code here...
                # remember to delete the funcs entry before returning...
                yield
            funcs[request_key] = generator_func()  # store the generator instance
        yield next(funcs[request_key])

def make_key(request):
    # quick and dirty impl
    return str(request.session)
I want to make sure I understand how to create tasklets and asynchronous methods. I have a method that returns a list; I want it to be called from somewhere and immediately allow other calls to be made. So I have this:
future_1 = get_updates_for_user(userKey, aDate)
future_2 = get_updates_for_user(anotherUserKey, aDate)
somelist.extend(future_1.get_result())  # block until the Future resolves
somelist.extend(future_2.get_result())
....

@ndb.tasklet
def get_updates_for_user(userKey, lastSyncDate):
    noteQuery = ndb.GqlQuery('SELECT * FROM Comments WHERE ANCESTOR IS :1 AND modifiedDate > :2', userKey, lastSyncDate)
    note_list = list()
    qit = noteQuery.iter()
    while (yield qit.has_next_async()):
        note = qit.next()
        noteDic = note.to_dict()
        note_list.append(noteDic)
    raise ndb.Return(note_list)
Is this code doing what I'd expect it to do? Namely, will the two calls run asynchronously? Am I using futures correctly?
Edit: Well, after testing, the code does produce the desired results. I'm a newbie to Python - what are some ways to test whether the methods are running asynchronously?
It's pretty hard to verify for yourself that the methods are running concurrently -- you'd have to put copious logging in. Also in the dev appserver it'll be even harder as it doesn't really run RPCs in parallel.
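One low-tech sketch, assuming ndb.sleep and ndb.Future.wait_all from the tasklets module; interleaved "started"/"finished" log lines suggest the tasklets overlap:

import logging
import time
from google.appengine.ext import ndb

@ndb.tasklet
def traced(label, seconds):
    # if the two tasklets overlap, their log lines interleave
    logging.info('%s started at %.3f', label, time.time())
    yield ndb.sleep(seconds)  # ndb.sleep returns a Future, so the tasklet suspends here
    logging.info('%s finished at %.3f', label, time.time())

f1 = traced('A', 1.0)
f2 = traced('B', 1.0)
ndb.Future.wait_all([f1, f2])  # completes in ~1s total if the two ran concurrently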
Your code looks okay, it uses yield in the right place.
My only recommendation is to name your function get_updates_for_user_async() -- that matches the convention NDB itself uses and is a hint to the reader of your code that the function returns a Future and should be yielded to get the actual result.
An alternative way to do this is to use the map_async() method on the Query object; it would let you write a callback that just contains the to_dict() call:
@ndb.tasklet
def get_updates_for_user_async(userKey, lastSyncDate):
    noteQuery = ndb.gql('...')
    note_list = yield noteQuery.map_async(lambda note: note.to_dict())
    raise ndb.Return(note_list)
Advanced tip: you can simplify this even more by dropping the @ndb.tasklet decorator and just returning the Future returned by map_async():
def get_updates_for_user_async(userKey, lastSyncDate):
    noteQuery = ndb.gql('...')
    return noteQuery.map_async(lambda note: note.to_dict())
This is a general slight optimization for async functions that contain only one yield and immediately return the value yielded. (If you don't immediately get this, you're in good company, and it runs the risk of being broken by a future maintainer who doesn't either. :-)