Why is _post_put_hook not running inside a transaction? - python

I have some code that queues up a task inside _post_put_hook.
The task retrieves the key and fetches the entity. However, the worker sometimes fails because the entity for that key hasn't been created yet, though it succeeds the next time it runs. Note that we're retrieving the entity by key, so I expect the data to be consistent.
I'm only calling the enqueue on commit, so I'd expect the entity to have been created by the time the task runs. In the sample below, I find that _post_put_hook is not in a transaction, which seems to be the cause of the issue. But why isn't it in a transaction?
Here's a sample:
@ndb.synctasklet
def log_usage(self):
    @ndb.transactional_tasklet(xg=True)
    def _txn():
        yield Log.insert_document_log_async()
    yield _txn()

class Log(ndb.Expando):
    @classmethod
    @ndb.tasklet
    def insert_document_log_async(cls):
        log = cls()
        logging.debug("insert document log in transaction: {}".format(ndb.in_transaction()))
        yield log.put_async()

    @ndb.synctasklet
    def _post_put_hook(self, future):
        @ndb.synctasklet
        def _callback_on_commit():
            key = future.get_result()
            yield SqlTaskHelper.enqueue_syncronise_sql_model_async(key)
        logging.debug("_post_put_hook In transaction: {}".format(ndb.in_transaction()))
        ndb.get_context().call_on_commit(lambda: _callback_on_commit())
The code is executed as follows:
log_usage is called, which calls insert_document_log_async.
When insert_document_log_async is called, the logging indicates that we're in a transaction (insert document log in transaction: True).
But the _post_put_hook logging indicates that we're not in a transaction, so call_on_commit executes the callback immediately, which I suspect is the issue. The task runs shortly afterwards, and the entity isn't always available yet.
I'd like to know why _post_put_hook is executing outside of a transaction.
Thanks

Your question was answered on Google Groups. I'm re-posting the answer from there:
"Note that post hooks do not check whether the RPC was successful. The hook runs regardless of any failure that might have occurred, most notably contention, which happens when you attempt to write to a single entity group too quickly. Also note that it is normal for a small number of datastore operations to time out during normal operation. Read more here about the most common datastore issues and here about how to avoid contention.
In case you need any coding assistance, I suggest you post your inquiries on Stack Overflow where the community of developers are better prepared to assist you in that matter. Google Groups is oriented more towards general opinions, trends, and issues of general nature regarding Google Cloud Platform.
If an exception is detected by Datastore, it is raised when the code calls get_result(), so no key is returned. However, note that "all post-hooks have a Future argument at the end of the call signature. This Future object holds the result of the action. You can call get_result on this Future to retrieve the result; you can be sure that get_result won't block, since the Future is complete by the time the hook is called."
That said, if no exception is raised, the future already holds the result and get_result() does not block, yet you are occasionally failing to retrieve the key. Take a look at this Stack Overflow post for a suggestion that resolves an issue similar to yours."
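The two points quoted above are that a post-hook receives an already-completed future and that it runs even when the operation failed. Both can be seen in plain Python without App Engine. The sketch below uses concurrent.futures.Future as a stand-in for ndb's Future, so result() plays the role of get_result(). The hook function and the key tuple are hypothetical, and this is not how ndb itself dispatches hooks.

```python
from concurrent.futures import Future

# Hypothetical stand-in for an ndb post-put hook. Like _post_put_hook(self, future),
# it receives the future of the put, which is already complete when the hook runs.
hook_log = []


def post_put_hook(future):
    assert future.done()  # a done-callback only ever sees a finished future
    try:
        key = future.result()  # does not block here; ndb's equivalent is get_result()
    except Exception as exc:
        # The hook still runs when the operation failed; the error surfaces on result().
        hook_log.append(("failed", repr(exc)))
        return
    hook_log.append(("ok", key))


succeeded = Future()
succeeded.add_done_callback(post_put_hook)
succeeded.set_result(("Log", 42))  # callbacks run synchronously here

failed = Future()
failed.add_done_callback(post_put_hook)
failed.set_exception(RuntimeError("too much contention"))
```

Because of this, a hook that enqueues work should check for failure (here, the exception from result()) before assuming the write happened.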

Related

Check for atomic context

One of my methods doesn't work when run in an atomic context. I want to ask Django whether it's running a transaction.
The method can create a thread or a process and save the result to the database. This is a bit odd, but there is a huge performance benefit when a process can be used.
I find that processes in particular are a bit sketchy with Django. I know that Django will raise an exception if the method chooses to save the results in a process while running in an atomic context.
If I can check for an atomic context, I can throw an exception straight away (instead of getting odd errors) or force the method to only create a thread.
I found the is_managed() method but according to this question it's been removed in Django 1.8.
According to this ticket there are a couple ways to detect this: not transaction.get_autocommit() (using a public API) or transaction.get_connection().in_atomic_block (using a private API).
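Here is a minimal sketch of the guard the question describes. The function name, the exception class and the strict flag are hypothetical. The connection argument is duck-typed so the sketch runs without Django. In a real project you would pass django.db.transaction.get_connection(), whose in_atomic_block attribute is the private API mentioned above. You could also swap the check for the public not transaction.get_autocommit().

```python
from types import SimpleNamespace


class AtomicContextError(RuntimeError):
    """Hypothetical error for 'process-based saving requested inside atomic()'."""


def choose_worker_kind(connection, strict=False):
    """Return "process" outside an atomic block and "thread" inside one.

    `connection` is duck-typed. In a Django project it would be
    django.db.transaction.get_connection(); in_atomic_block is the private
    attribute mentioned in the ticket.
    """
    if getattr(connection, "in_atomic_block", False):
        if strict:
            # Fail fast instead of hitting odd errors later.
            raise AtomicContextError("cannot save from a separate process inside atomic()")
        return "thread"  # fall back to a thread, as the question proposes
    return "process"


# Usage with fake connections standing in for Django's:
print(choose_worker_kind(SimpleNamespace(in_atomic_block=False)))  # process
print(choose_worker_kind(SimpleNamespace(in_atomic_block=True)))   # thread
```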

Contention problems in Google App Engine

I'm having contention problems in Google App Engine, and try to understand what's going on.
I have a request handler annotated with:
@ndb.transactional(xg=True, retries=5)
..and in that code I fetch some stuff, update some others etc. But sometimes an error like this one comes in the log during a request:
16:06:20.930 suspended generator _get_tasklet(context.py:329) raised TransactionFailedError(too much contention on these datastore entities. please try again. entity group key: app: "s~my-appname"
path <
Element {
type: "PlayerGameStates"
name: "hannes2"
}
>
)
16:06:20.930 suspended generator get(context.py:744) raised TransactionFailedError(too much contention on these datastore entities. please try again. entity group key: app: "s~my-appname"
path <
Element {
type: "PlayerGameStates"
name: "hannes2"
}
>
)
16:06:20.930 suspended generator get(context.py:744) raised TransactionFailedError(too much contention on these datastore entities. please try again. entity group key: app: "s~my-appname"
path <
Element {
type: "PlayerGameStates"
name: "hannes2"
}
>
)
16:06:20.936 suspended generator transaction(context.py:1004) raised TransactionFailedError(too much contention on these datastore entities. please try again. entity group key: app: "s~my-appname"
path <
Element {
type: "PlayerGameStates"
name: "hannes2"
}
>
)
..followed by a stack trace. I can update with the whole stack trace if needed, but it's kind of long.
I don't understand why this happens. Looking at the line in my code where the exception comes from, I run get_by_id on a totally different entity (Round). The "PlayerGameStates" entity named "hannes2" mentioned in the error messages is the parent of another entity, GameState, which was fetched with get_by_id_async a few lines earlier:
# GameState is read by get_async
gamestate_future = GameState.get_by_id_async(id, ndb.Key('PlayerGameStates', player_key))
...
gamestate = gamestate_future.get_result()
...
The weird(?) thing is that no writes to the datastore occur for that entity. My understanding is that contention errors can occur if the same entity is updated at the same time, in parallel, or maybe if too many writes occur in a short period of time.
But can it happen when reading entities also? ("suspended generator get.."??) And, is this happening after the 5 ndb.transaction retries..? I can't see anything in the log that indicates that any retries have been made.
Any help is greatly appreciated.
Yes, contention can happen for both read and write ops.
After a transaction starts (in your case, when the handler decorated with @ndb.transactional() is invoked), any entity group accessed, by read or write ops alike, is immediately marked as such. At that moment it is not known whether there will be a write op by the end of the transaction, and it doesn't even matter.
The too much contention error (which is different from a conflict error!) indicates that too many parallel transactions are simultaneously trying to access the same entity group. It can happen even if none of the transactions actually attempts to write!
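The answer explains two things. First, an entity group is enlisted as soon as it is accessed, even by a read. Second, the group is named after the root ancestor. Together these explain why the error names PlayerGameStates "hannes2" even though only a child GameState was read. The sketch below is a toy model under those two assumptions only. The real datastore's contention detection is more involved, and every name here is hypothetical.

```python
from collections import Counter


def entity_group(key_path):
    """Root (kind, id) pair of a flattened key path; the root identifies the entity group.

    ('PlayerGameStates', 'hannes2', 'GameState', 7) -> ('PlayerGameStates', 'hannes2')
    """
    return tuple(key_path[:2])


class ToyTransaction:
    """Toy model: every key a transaction touches, read or written, enlists its group."""

    def __init__(self):
        self.groups = set()

    def get(self, key_path):
        self.groups.add(entity_group(key_path))  # a plain read still enlists the group

    def put(self, key_path):
        self.groups.add(entity_group(key_path))


def contended_groups(transactions):
    """Entity groups enlisted by more than one of the given concurrent transactions."""
    counts = Counter(group for txn in transactions for group in txn.groups)
    return {group for group, n in counts.items() if n > 1}


# Two read-only transactions, like two overlapping requests to the handler:
t1, t2 = ToyTransaction(), ToyTransaction()
for txn in (t1, t2):
    txn.get(("PlayerGameStates", "hannes2", "GameState", 1))
t1.get(("Round", 10))
print(contended_groups([t1, t2]))  # {('PlayerGameStates', 'hannes2')}
```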
Note: this contention is NOT emulated by the development server, it can only be seen when deployed on GAE, with the real datastore!
What can add to the confusion is the automatic re-tries of the transactions, which can happen after both actual write conflicts or just plain access contention. These retries may appear to the end-user as suspicious repeated execution of some code paths - the handler in your case.
Retries can actually make matters worse (for a brief time) by throwing even more accesses at the already heavily accessed entity groups. I've seen such patterns, with transactions only succeeding once the exponential backoff delays grew large enough (if the retry count is high enough) to let things cool down a bit by allowing the transactions already in progress to complete.
My approach to this was to move most of the transactional stuff on push queue tasks, disable retries at the transaction and task level and instead re-queue the task entirely - fewer retries but spaced further apart.
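The idea of spacing retries further apart can be illustrated with a generic exponential-backoff loop. This is a sketch, not ndb's own retry logic, and the ContentionError class is a hypothetical stand-in for the datastore error. The randomized delay ("full jitter") is an addition not mentioned in the answer. It keeps competing transactions from retrying in lockstep. The sleep and rng parameters exist only so the behaviour can be tested deterministically.

```python
import random
import time


class ContentionError(Exception):
    """Hypothetical stand-in for a 'too much contention' failure."""


def run_with_backoff(operation, retries=3, base_delay=0.1, max_delay=5.0,
                     sleep=time.sleep, rng=None):
    """Call operation(); on ContentionError, sleep with exponential backoff and retry.

    Before retry n (0-based) the delay is drawn uniformly from
    [0, min(max_delay, base_delay * 2**n)] ("full jitter"), so competing
    transactions spread out instead of retrying in lockstep.
    """
    rng = rng or random.Random()
    for attempt in range(retries + 1):
        try:
            return operation()
        except ContentionError:
            if attempt == retries:
                raise  # out of retries: let the caller (or a re-queued task) handle it
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng.uniform(0, cap))
```

In the answer's approach, the final re-raise is where the task would be re-queued as a whole, giving fewer but more widely spaced attempts.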
Usually when you run into such problems you have to revisit your data structures and/or the way you're accessing them (your transactions). In addition to solutions that maintain strong consistency (which can be quite expensive), you may want to re-check whether consistency is actually a must. In some cases it's added as a blanket requirement just because it appears to simplify things. In my experience it doesn't :)
Another thing that can help (but only a bit) is using a faster (and more expensive) instance type: shorter execution times translate into a slightly lower risk of transactions overlapping. I noticed this when I needed an instance with more memory, which happened to also be faster :)

Django with Celery - existing object not found

I am having problem with executing celery task from another celery task.
Here is the problematic snippet (data object already exists in database, its attributes are just updated inside finalize_data function):
def finalize_data(data):
    data = update_statistics(data)
    data.save()
    from apps.datas.tasks import optimize_data
    optimize_data.delay(data.pk)

@shared_task
def optimize_data(data_pk):
    data = Data.objects.get(pk=data_pk)
    # Do something with data
Get call in optimize_data function fails with "Data matching query does not exist."
If I call the retrieve by pk function in finalize_data function it works fine. It also works fine if I delay the celery task call for some time.
This line:
optimize_data.apply_async((data.pk,), countdown=10)
instead of
optimize_data.delay(data.pk)
works fine. But I don't want to use hacks in my code. Is it possible that .save() call is asynchronously blocking access to that row/object?
I know that this is an old post but I stumbled on this problem today. Lee's answer pointed me to the correct direction but I think a better solution exists today.
Using the on_commit handler provided by Django, this problem can be solved without the hackish countdown, whose purpose may not be obvious to readers of the code.
I'm not sure if this existed when the question was posted but I'm just posting the answer so that people who come here in the future know about the alternative.
I'm guessing your caller is inside a transaction that hasn't committed before celery starts to process the task. Hence celery can't find the record. That is why adding a countdown makes it work.
A 1 second countdown will probably work as well as the 10 second one in your example. I've used 1 second countdowns throughout code to deal with this issue.
Another solution is to stop using transactions.
You could use an on_commit hook to make sure the celery task isn't triggered until after the transaction commits?
Django docs: Performing actions after commit
It's a feature that was added in Django 1.9.
from django.db import transaction

def do_something():
    pass  # send a mail, invalidate a cache, fire off a Celery task, etc.

transaction.on_commit(do_something)
You can also wrap your function in a lambda:
transaction.on_commit(lambda: some_celery_task.delay('arg1'))
The function you pass in will be called immediately after the database write made at the point where on_commit() is called has been successfully committed.
If you call on_commit() while there isn’t an active transaction, the callback will be executed immediately.
If that hypothetical database write is instead rolled back (typically when an unhandled exception is raised in an atomic() block), your function will be discarded and never called.
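The three rules just listed can be modelled in a few lines: callbacks are deferred until commit, run immediately outside a transaction, and discarded on rollback. The ToyConnection class below is hypothetical and is not Django's implementation. In particular, it ignores savepoints in nested atomic blocks, which Django does handle. In the Celery question, the real call would follow the lambda pattern above: transaction.on_commit(lambda: optimize_data.delay(data.pk)).

```python
from contextlib import contextmanager


class ToyConnection:
    """Toy model of on_commit() semantics (not Django's code)."""

    def __init__(self):
        self.in_transaction = False
        self._pending = []

    def on_commit(self, func):
        if not self.in_transaction:
            func()  # no active transaction: run immediately
        else:
            self._pending.append(func)  # defer until the outermost block commits

    @contextmanager
    def atomic(self):
        if self.in_transaction:
            yield  # nested block: the outer block decides (savepoints ignored here)
            return
        self.in_transaction = True
        try:
            yield
        except BaseException:
            self._pending.clear()  # rollback: callbacks are discarded, never called
            raise
        finally:
            self.in_transaction = False
        callbacks, self._pending = self._pending, []
        for func in callbacks:
            func()  # commit succeeded: run callbacks in registration order
```

With this ordering, a task queued through on_commit cannot start before the row it looks up has been committed, which is exactly the race the countdown was papering over.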

When can DeadlineExceededError exception be thrown in Google App Engine

I am writing a deferred task which is intended to construct a file in the blobstore for download. I am modelling the code on the example given in the docs:
http://code.google.com/appengine/articles/deferred.html
The idea is to structure the code so that if there is a DeadlineExceededError the handler can tidy up and kick off a new deferred task to continue later.
What I'd like to know is when exactly can this exception be thrown? Are there any operations which are guaranteed to be atomic and therefore will not be interrupted?
In the example (referenced above), they update a variable called start_key as they finish processing each record. But if the main loop were interrupted between extending the to_put and to_delete lists, the data would be wrong, as it would miss a set of deletes.
If an exception can be raised at any point then it could be halfway through the batch_write, or between the db.put and clearing of the to_put list.
This is logically equivalent to a thread-safety problem. To solve one of those, you normally rely on knowing which operations are guaranteed to be atomic and which are not.
How does this work?
Thanks
A DeadlineExceededError can be thrown literally any time at all. If there were a time when it couldn't be thrown, an abusive app could simply execute that code in a loop.
You can avoid this several ways:
Proactively check how long you've been executing for and stop at a good time before you hit the deadline.
Put the exception handler somewhere it can store the state as of the last set of completed operations (e.g., discarding anything since the last iteration of the outer loop in which the exception was thrown).
Use backends, which do not have deadlines.
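The first suggestion (stop at a good time before the deadline) can be sketched as a loop that checks a time budget before each item. It advances its checkpoint only after an item has fully completed. The function name, the safety margin and the index-based checkpoint are assumptions. On App Engine the budget would be the request deadline, and the returned index would be handed to a new deferred task (not shown). Because DeadlineExceededError can still strike inside handle(), each item's work should ideally be safe to repeat.

```python
import time


def process_until_deadline(items, start_index, handle, budget_seconds,
                           safety_margin=1.0, clock=time.monotonic):
    """Process items[start_index:] until the time budget is nearly used up.

    Returns the index of the first unprocessed item (len(items) when finished),
    which a caller could pass to a new deferred task to continue later.
    """
    deadline = clock() + budget_seconds - safety_margin
    index = start_index
    while index < len(items):
        if clock() >= deadline:
            break  # stop at a clean boundary, before the hard deadline hits
        handle(items[index])
        index += 1  # only advance the checkpoint after the item fully completes
    return index
```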

Communicating end of Queue

I'm learning to use the Queue module, and am a bit confused about how a queue consumer thread can be made to know that the queue is complete. Ideally I'd like to use get() from within the consumer thread and have it throw an exception if the queue has been marked "done". Is there a better way to communicate this than by appending a sentinel value to mark the last item in the queue?
original (most of this has changed; see updates below)
Based on some of the suggestions (thanks!) of Glenn Maynard and others, I decided to roll up a descendant of Queue.Queue that implements a close method. It's available in the form of a primitive (unpackaged) module. I'll clean this up a bit and package it properly when I have a bit more time. For now the module only contains the CloseableQueue class and the Closed exception class. I'm planning to expand it to also include subclasses of Queue.LifoQueue and Queue.PriorityQueue.
It's in a pretty preliminary state currently, which is to say that although it passes its test suite, I haven't actually used it for anything yet. Your mileage may vary. I'll keep this answer updated with exciting news.
The CloseableQueue class differs a bit from Glenn's suggestion in that closing the queue will prevent future puts, but not prevent future gets until the queue is emptied. This made the most sense to me; it seemed like functionality to clear the queue could be added as a separate mixin* that would be orthogonal to the closeability functionality. So basically with CloseableQueue, by closing the queue you indicate that the last element has been put. There's also an option to do this atomically by passing last=True to the final put call. Subsequent calls to put, and subsequent calls to get once the queue is emptied, as well as outstanding blocked calls matching those descriptions, will raise the Closed exception.
This is mostly useful for situations where a single producer is generating data for one or more consumers, but it could also be useful for a multi-multi arrangement where consumers are waiting for a particular item or set of items. In particular it doesn't provide a way to determine that all of a number of producers have finished production. Getting that working would entail the provision of some way to register producers (.open()?), as well as a way to indicate that producer registration is itself closed.
Suggestions and/or code reviews are quite welcome. I haven't written a whole lot of concurrency code, but hopefully the test suite is thorough enough that the fact that the code passes it is an indication of the code's quality, rather than the suite's lack thereof. I was able to reuse a bunch of the code from the Queue module's test suite: the file itself is included in this module and used as a basis for various subclasses and routines, including regression testing. This probably (hopefully) helped to avoid complete ineptitude in the testing department. The code itself just overrides Queue.get and Queue.put with fairly minimal changes, and adds the close and closed methods.
I've sort of intentionally avoided using any new-fangled fanciness like context managers in both the code itself and the test suite, in an effort to keep the code as backwards-compatible as the Queue module itself, which is considerably backwards indeed. I'll probably add __enter__ and __exit__ methods at some point; otherwise, contextlib's closing() function should be applicable to a CloseableQueue instance.
*: Here I use the term "mixin" loosely. As the Queue module's classes are old-style, mixins would need to be mixed using class factory functions; some restrictions apply; offer void where prohibited by Guido.
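Below is a rough Python 3 sketch of the behaviour described above. close() prevents future puts, gets keep working until the queue is drained and then raise Closed, and put(item, last=True) puts and closes atomically. It is not the code of the CloseableQueue package on PyPI. It relies on queue.Queue's internal mutex and condition variables, and for brevity it supports only blocking put()/get() without timeouts, so the inherited put_nowait()/get_nowait() should not be used.

```python
import queue
import threading


class Closed(Exception):
    """Raised by put() after close(), and by get() once the queue is closed and empty."""


class CloseableQueue(queue.Queue):
    """Sketch only: blocking put()/get() without timeouts."""

    def __init__(self, maxsize=0):
        super().__init__(maxsize)
        self._closed = False

    def _close_locked(self):
        # Caller holds self.mutex; wake every blocked getter and putter.
        self._closed = True
        self.not_empty.notify_all()
        self.not_full.notify_all()

    def close(self):
        with self.mutex:
            self._close_locked()

    def put(self, item, last=False):
        with self.not_full:
            while (self.maxsize > 0 and self._qsize() >= self.maxsize
                   and not self._closed):
                self.not_full.wait()
            if self._closed:
                raise Closed("put() on a closed queue")
            self._put(item)
            self.unfinished_tasks += 1
            self.not_empty.notify()
            if last:
                self._close_locked()  # final item and close happen atomically

    def get(self):
        with self.not_empty:
            while not self._qsize() and not self._closed:
                self.not_empty.wait()
            if not self._qsize():
                raise Closed("get() on a closed, empty queue")
            item = self._get()
            self.not_full.notify()
            return item
```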
update
The CloseableQueue module now provides CloseableLifoQueue and CloseablePriorityQueue classes. I've also added some convenience functions to support iteration. Still need to rework it as a proper package. There's a class factory function to allow for convenient subclassing of other Queue.Queue-derived classes.
update 2
CloseableQueue is now available via PyPI, e.g. with
$ easy_install CloseableQueue
Comments and criticism are welcome, especially from this answer's anonymous downvoter.
Queues don't inherently have the idea of being complete or done. They can be used indefinitely. To close one up when you are done, you will indeed need to put None or some other magic value at the end and write the logic to check for it, as you described. The ideal way would probably be to subclass the Queue object.
See http://en.wikipedia.org/wiki/Queue_(data_structure) to learn more about queue in general.
A sentinel is a natural way to shut down a queue, but there are a couple things to watch out for.
First, remember that you may have more than one consumer, so you need to send a sentinel once for each running consumer, and guarantee that each consumer will only consume one sentinel, to ensure that each consumer receives its shutdown sentinel.
Second, remember that Queue defines an interface, and that when possible, code should behave regardless of the underlying Queue. You might have a PriorityQueue, or you might have some other class that exposes the same interface and returns values in some other order.
Unfortunately, it's hard to deal with both of these. To deal with the general case of different queues, a consumer that's shutting down must continue to consume values after receiving its shutdown sentinel until the queue is empty. That means that it may consume another thread's sentinel. This is a weakness of the Queue interface: it should have a Queue.shutdown call to cause an exception to be thrown by all consumers, but that's missing.
So, in practice:
if you're sure you're only ever using a regular Queue, simply send one sentinel per thread.
if you may be using a PriorityQueue, ensure that the sentinel has the lowest priority.
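A minimal sketch of the first case: a regular FIFO queue.Queue with one sentinel per consumer. Each consumer stops after taking exactly one sentinel. The demo function and the squaring work are hypothetical. With queue.PriorityQueue, which returns the lowest-valued entry first, "lowest priority" means the sentinel entry must compare greater than every real entry, e.g. a priority of float('inf'). Python 3.13 later added queue.Queue.shutdown() and the queue.ShutDown exception, which fill the gap described above. The sketch sticks to sentinels so it also runs on older versions.

```python
import queue
import threading

SENTINEL = object()  # unique marker; can never be mistaken for a real work item


def consumer(q, results, lock):
    while True:
        item = q.get()
        if item is SENTINEL:
            break  # consume exactly one sentinel, then stop taking items
        with lock:
            results.append(item * item)


def square_all(items, n_consumers=3):
    """Hypothetical demo: square items using n_consumers worker threads."""
    q = queue.Queue()  # plain FIFO, so every real item precedes the sentinels
    results, lock = [], threading.Lock()
    threads = [threading.Thread(target=consumer, args=(q, results, lock))
               for _ in range(n_consumers)]
    for t in threads:
        t.start()
    for item in items:
        q.put(item)
    for _ in threads:
        q.put(SENTINEL)  # one sentinel per running consumer
    for t in threads:
        t.join()
    return sorted(results)


print(square_all(range(5)))  # [0, 1, 4, 9, 16]
```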
A Queue is a FIFO (first in, first out) structure, so remember that the consumer can be faster than the producer. When a consumer thread detects that the queue is empty, it normally takes one of the following actions:
Ask the threading API to switch to the next thread.
Sleep a few milliseconds and then check the queue again.
Wait on an event (such as a new message arriving in the queue).
If you want the consumer threads to terminate after the job is complete, put a sentinel value in the queue to terminate them.
The best practice way of doing this would be to have the queue itself notify a client that it has reached the 'done' state. The client can then take any action that is appropriate.
What you have suggested, periodically checking the queue to see whether it is done, would be highly undesirable. Polling is an antipattern in multithreaded programming; you should always use notifications.
EDIT:
So you're saying that the queue itself knows that it's 'done' based on some criteria and needs to notify the clients of that fact. I think you are correct, and the best way to do this is to throw when a client calls get() while the queue is in the done state. Throwing would negate the need for a sentinel value on the client side. Internally, the queue can detect that it is 'done' in any way it pleases (e.g., the queue is empty, its state was set to done, etc.). I don't see any need for a sentinel value.
