I'm having contention problems in Google App Engine and am trying to understand what's going on.
I have a request handler annotated with:
@ndb.transactional(xg=True, retries=5)
..and in that code I fetch some entities, update others, and so on. But sometimes an error like this one shows up in the log during a request:
16:06:20.930 suspended generator _get_tasklet(context.py:329) raised TransactionFailedError(too much contention on these datastore entities. please try again. entity group key: app: "s~my-appname"
path <
Element {
type: "PlayerGameStates"
name: "hannes2"
}
>
)
16:06:20.930 suspended generator get(context.py:744) raised TransactionFailedError(too much contention on these datastore entities. please try again. entity group key: app: "s~my-appname"
path <
Element {
type: "PlayerGameStates"
name: "hannes2"
}
>
)
16:06:20.930 suspended generator get(context.py:744) raised TransactionFailedError(too much contention on these datastore entities. please try again. entity group key: app: "s~my-appname"
path <
Element {
type: "PlayerGameStates"
name: "hannes2"
}
>
)
16:06:20.936 suspended generator transaction(context.py:1004) raised TransactionFailedError(too much contention on these datastore entities. please try again. entity group key: app: "s~my-appname"
path <
Element {
type: "PlayerGameStates"
name: "hannes2"
}
>
)
..followed by a stack trace. I can update with the whole stack trace if needed, but it's kind of long.
I don't understand why this happens. Looking at the line in my code where the exception is raised, I run get_by_id on a totally different entity (Round). The "PlayerGameStates" entity with name "hannes2" mentioned in the error messages is the parent of another entity, GameState, which was fetched with get_async a few lines earlier:
# GameState is read by get_async
gamestate_future = GameState.get_by_id_async(id, ndb.Key('PlayerGameStates', player_key))
...
gamestate = gamestate_future.get_result()
...
The weird(?) thing is, there are no writes to the datastore occurring for that entity. My understanding is that contention errors can occur if the same entity is updated at the same time, in parallel.. or maybe if too many writes occur in a short period of time..
But can it happen when reading entities too? ("suspended generator get.."??) And is this happening after the 5 ndb transaction retries? I can't see anything in the log indicating that any retries have been made.
Any help is greatly appreciated.
Yes, contention can happen for both read and write ops.
After a transaction starts - in your case when the handler annotated with @ndb.transactional() is invoked - any entity group accessed (by read or write ops, doesn't matter) is immediately marked as such. At that moment it is not known whether there will be a write op by the end of the transaction - it doesn't even matter.
The too much contention error (which is different from a conflict error!) indicates that too many parallel transactions simultaneously try to access the same entity group. It can happen even if none of the transactions actually attempts to write!
Note: this contention is NOT emulated by the development server, it can only be seen when deployed on GAE, with the real datastore!
What can add to the confusion is the automatic retrying of the transactions, which can happen after either actual write conflicts or plain access contention. These retries may appear to the end user as suspicious repeated execution of some code paths - the handler in your case.
Retries can actually make matters worse (for a brief time) by throwing even more accesses at the already heavily accessed entity groups. I've seen patterns where transactions only succeed once the exponential backoff delays grow large enough (if the number of retries is large enough) to let things cool down a bit, by allowing the transactions already in progress to complete.
My approach to this was to move most of the transactional work onto push queue tasks, disable retries at the transaction and task level, and instead re-queue the task entirely - fewer retries, but spaced further apart.
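A rough sketch of what that pattern can look like - this is not your code; the handler URL, the GameState stand-in model, and process_game_update() are made up for illustration, and task-level retries would additionally be limited via task_retry_limit in queue.yaml:

from google.appengine.api import datastore_errors, taskqueue
from google.appengine.ext import ndb


class GameState(ndb.Model):      # stand-in for the model from the question
    data = ndb.JsonProperty()


@ndb.transactional(xg=True, retries=0)   # no automatic transaction retries
def process_game_update(player_name):
    parent = ndb.Key('PlayerGameStates', player_name)
    gamestate = GameState.get_by_id('current', parent=parent)  # hypothetical id
    # ... apply the update ...
    gamestate.put()


def handle_game_update_task(player_name):
    try:
        process_game_update(player_name)
    except datastore_errors.TransactionFailedError:
        # Re-queue the whole task with a countdown instead of retrying inline,
        # giving the contended entity group time to cool down.
        taskqueue.add(url='/tasks/game-update',
                      params={'player': player_name},
                      countdown=30)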
Usually when you run into such problems you have to revisit your data structures and/or the way you're accessing them (your transactions). In addition to solutions that maintain strong consistency (which can be quite expensive), you may want to re-check whether consistency is actually a must. In some cases it's added as a blanket requirement just because it appears to simplify things. From my experience it doesn't :)
Another thing that can help (but only a bit) is using a faster (also more expensive) instance type - shorter execution times translate into a slightly lower risk of transactions overlapping. I noticed this as I needed an instance with more memory, which happened to also be faster :)
Related
Hello, I don't think this is the right place for this question, but I don't know where else to ask it. I want to make a website and an API for that website using the same SQLAlchemy database. Would just running them at the same time, independently, be safe, or would this cause corruption from two writes happening at the same time?
SQLA is a Python wrapper for SQL. It is not its own database. If you're running your website (perhaps Flask?) and managing your API from the same script, you can simply use the same reference to your instance of SQLA. Meaning, when you use SQLA to connect to a database and save that to a variable, what really happens is that the connection is saved to a variable, and you continually reference that variable, as opposed to the more inefficient method of creating a new connection every time. So when you say
using the same SQLAlchemy database
I believe you are actually referring to the actual underlying database itself, not the SQLA wrapper/connection to it.
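In practice that usually means both the website and the API import one shared engine/session module instead of each creating its own connection. A minimal sketch (the module name and connection URL are placeholders):

# db.py - one shared SQLAlchemy engine/session that both the website and
# the API import, instead of each calling create_engine per request.
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine("sqlite:///app.db")   # placeholder URL
Session = scoped_session(sessionmaker(bind=engine))

Both applications then do from db import Session rather than building their own connections.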
If your website and API are not running in the same script (or even if they are, depending on how your API handles simultaneous requests), you may encounter a race condition, which, according to Wikipedia, is defined as:
the condition of an electronics, software, or other system where the system's substantive behavior is dependent on the sequence or timing of other uncontrollable events. It becomes a bug when one or more of the possible behaviors is undesirable.
This may be what you are referring to when you mentioned
would this cause corruption from two writes happening at the same time.
To avoid such situations, a lock can be placed on a file: when a process accesses it, (depending on the OS) a check is performed to see if there is a "lock" on that file, and if so, access is refused until the lock is released. This is how SQLite protects its database file: a connection holding a write transaction locks the file, and other connections are refused (or made to wait) until that transaction finishes and the lock is released. This is the real issue you might encounter when using two simultaneous connections to a SQLite db.
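A tiny sketch of what that looks like with SQLite specifically - two connections to the same file, where timeout=0 makes the second writer fail immediately instead of waiting:

import sqlite3

# isolation_level=None puts the connections in autocommit mode so the
# explicit BEGIN statements below control the transactions directly.
a = sqlite3.connect("demo.db", timeout=0, isolation_level=None)
b = sqlite3.connect("demo.db", timeout=0, isolation_level=None)

a.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
a.execute("BEGIN IMMEDIATE")              # first connection takes the write lock
a.execute("INSERT INTO t VALUES (1)")

try:
    b.execute("BEGIN IMMEDIATE")          # second writer cannot acquire the lock
except sqlite3.OperationalError as exc:
    print(exc)                            # "database is locked"

a.execute("COMMIT")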
However, if you are using something like MySQL, where you connect to a SQL server process and NOT a file, there is no such file lock on the database, and you may run into that nasty race condition in the following made-up scenario:
Stack Overflow queries the reputation of an account to see if it should be banned due to negative reputation.
AT THE EXACT SAME TIME, someone upvotes an answer made by that account, which would bring the account back above the ban threshold.
The outcome is now determined by the speed of execution of these 2 tasks.
If the upvoter has, say, a slow computer, and the "upvote" does not get processed by StackOverflow before the reputation query completes, the account will be banned. However, if there is some lag on Stack Overflow's end, and the upvote processes before the account query finishes, the account will not get banned.
The key concept behind this example is that all of these steps can occur within fractions of a second, and the outcome depends on the speed of execution on both ends.
To address the issue of data corruption, most databases have a system in place that properly orders database reads and writes; however, there are still semantic issues that may arise, such as in the example given above.
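One way to close that semantic gap at the application level is to do the read and the decision in one transaction and lock the row while it is checked, so a concurrent upvote has to wait until the check commits. A rough SQLAlchemy sketch, where the Account model and the -10 threshold are invented for the example (with_for_update() maps to SELECT ... FOR UPDATE on server databases like MySQL):

from sqlalchemy import Boolean, Column, Integer
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Account(Base):
    __tablename__ = "accounts"
    id = Column(Integer, primary_key=True)
    reputation = Column(Integer, default=0)
    banned = Column(Boolean, default=False)

def maybe_ban(session: Session, account_id: int) -> None:
    account = (
        session.query(Account)
        .filter(Account.id == account_id)
        .with_for_update()            # row lock: concurrent writers must wait
        .one()
    )
    if account.reputation < -10:      # hypothetical ban threshold
        account.banned = True
    session.commit()                  # releases the lock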
Two applications can use the same database, since the DB is a separate application that will be accessed by each Flask app.
What you are asking can be done and is the methodology used by many large web applications, especially when the API is written in a different framework than the main application.
Since SQL databases are ACID compliant, they have a system in place to queue the multiple read/write requests put to them and perform them in the correct order while ensuring data reliability.
One question to ask, though, is whether it is useful to write two separate applications. For most Flask-only projects the best approach would be to separate the project using blueprints, having a “main” blueprint and an “api” blueprint, as sketched below.
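A minimal sketch of that layout; the module and route names are illustrative, and both blueprints can share the same database session and models:

from flask import Blueprint, Flask, jsonify

main = Blueprint("main", __name__)
api = Blueprint("api", __name__, url_prefix="/api")

@main.route("/")
def index():
    return "website home"

@api.route("/items")
def list_items():
    return jsonify([])

app = Flask(__name__)
app.register_blueprint(main)
app.register_blueprint(api)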
I'm running a python Flask server to perform tricky algorithms, one of which assigns cables to tubes.
from typing import List

class Tube:
    max_capacity = 5
    cables: List[str]

    def has_capacity(self):
        return len(self.cables) < self.max_capacity
The max capacity was always 5, but now there's a new customer that actually has tubes that can fit 6 cables.
When I receive a request, I now just set Tube.max_capacity = request.args.get('max_capacity', 5). Then each instance of Tube will have the correct setting.
I was wondering if this will keep working if there are multiple requests being handled at the same time?
Are the Flask (I use Gunicorn as WSGI) processes all separate from each other such that this is safe to do? I don't want to end up with strange bugs because the max capacity changed halfway through a request because another request came in.
EDIT:
I tried this out and it appears to work as intended:
import time
from random import randint

@app.route('/concurrency')
def concurrency():
    my_value = randint(0, 100)
    Concurrency.value = my_value
    time.sleep(8)
    return f"My value: {my_value} should be equal to Concurrency.value {Concurrency.value}"

class Concurrency:
    value = 10
Still, I want to know more about how multiple Flask/Gunicorn requests work to be certain.
WSGI applications are typically served using multiple processes - possibly on different servers - and requests from the same user will be handled by the first available process. IOW: you do NOT want to change any module- or class-level variables on a per-request basis; this is guaranteed to mess up everything.
It's impossible to tell you exactly how to solve the issue without much more context, but in all cases, you'll have to rethink your design.
EDIT:
how do processes behave? If one of them sets the value, does another process see that value as well?
Of course not - each process is totally isolated from the others - so changing a module-level variable or class attribute will only affect the current process. But since processes are not tied to clients (which process will handle a given request is totally unpredictable), such changes in one process will not necessarily be seen by the next request if it's served by another process. AND:
Or, is a process re-used, and then still has the value from the previous request?
Processes are of course reused, but that doesn't mean the same process will be reused for the next request from the same user - and this is the second part of the issue: when serving another user, your process will still use the "updated" max_capacity value from the previous user.
IOW, what you're doing is guaranteed to mess up everything for all your users. That's why we use external (out-of-process) means to store and share per-user data between requests - either sessions (for volatile data) or a database (for permanent storage).
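Applied to the Tube example from the question, the simplest fix is to treat max_capacity as per-request data passed into the objects that need it, instead of mutating the class. A minimal sketch (route name and the number of tubes are invented):

from typing import List

from flask import Flask, request

app = Flask(__name__)

class Tube:
    def __init__(self, max_capacity: int = 5):
        self.max_capacity = max_capacity
        self.cables: List[str] = []

    def has_capacity(self) -> bool:
        return len(self.cables) < self.max_capacity

@app.route('/assign')
def assign():
    # Each request builds its own Tube objects with its own capacity;
    # nothing shared between concurrent requests gets modified.
    max_capacity = int(request.args.get('max_capacity', 5))
    tubes = [Tube(max_capacity) for _ in range(10)]
    return str(sum(t.has_capacity() for t in tubes))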
I have some code that queues up a task inside _post_put_hook.
The task retrieves the key and fetches the entity. However, sometimes the worker fails because the object for that key hasn't been created yet, but it succeeds when it next runs. Note that we're retrieving the object by key, so I expect the data to be consistent.
I'm only calling the enqueue on commit, so I'd expect the object to be created by the time the task runs. In the sample below, I find that _post_put_hook is not in a transaction, which seems to be the cause of the issue - but why isn't it in a transaction?
Here's a sample:
@ndb.synctasklet
def log_usage(self):
    @ndb.transactional_tasklet(xg=True)
    def _txn():
        yield Log.insert_document_log_async()
    yield _txn()


class Log(ndb.Expando):
    @classmethod
    @ndb.tasklet
    def insert_document_log_async(cls):
        log = cls()
        logging.debug("insert document log in transaction: {}".format(ndb.in_transaction()))
        yield log.put_async()

    @ndb.synctasklet
    def _post_put_hook(self, future):
        @ndb.synctasklet
        def _callback_on_commit():
            key = future.get_result()
            yield SqlTaskHelper.enqueue_syncronise_sql_model_async(key)
        logging.debug("_post_put_hook In transaction: {}".format(ndb.in_transaction()))
        ndb.get_context().call_on_commit(lambda: _callback_on_commit())
The code is executed as follows:
log_usage is called, which calls insert_document_log_async.
When calling insert_document_log_async, logging indicates that we're in a transaction (insert document log in transaction: True).
But the _post_put_hook logging indicates we're not in a transaction (so call_on_commit is executed immediately, which I suspect is the issue). The task runs shortly after, and the entity isn't always available.
I'd like to know why _post_put_hook is executing outside of a transaction.
Thanks
Your question was answered on Google Groups. I'm re-posting from there:
"Note that post hooks do not check whether the RPC was successful. The hook runs regardless of failure that might have occurred due to issues, more specifically the contention which is when you attempt to write to a single entity group too quickly. Also note that it is normal that a small number of datastore operations will result in timeout in normal operation. Read more here about the most common datastore issues and here how to avoid the contention.
In case you need any coding assistance, I suggest you post your inquiries on Stack Overflow where the community of developers are better prepared to assist you in that matter. Google Groups is oriented more towards general opinions, trends, and issues of general nature regarding Google Cloud Platform.
If an exception is detected by Datastore, it would be raised when the code calls get_result(), so the key would not return. However, note that “all post- hooks have a Future argument at the end of the call signature. This Future object holds the result of the action. You can call get_result on this Future to retrieve the result; you can be sure that get_result won't block, since the Future is complete by the time the hook is called.”
That said, in case you don’t have an exception, the future already has the result and get_result function is not blocking, occasionally failing to retrieve the key. Take a look at this Stack Overflow post with a suggestion to resolve an issue similar to your case."
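Based on that, one defensive option is to have the hook inspect the future for an exception before enqueuing anything. A sketch reusing the names from the question (assuming ndb's Future.get_exception(); not tested against your setup):

    @ndb.synctasklet
    def _post_put_hook(self, future):
        # The hook runs even if the RPC failed, so check the future first.
        if future.get_exception() is not None:
            logging.warning("put failed, not enqueuing sync task: %s",
                            future.get_exception())
            return

        @ndb.synctasklet
        def _callback_on_commit():
            key = future.get_result()   # safe now: the future holds a result
            yield SqlTaskHelper.enqueue_syncronise_sql_model_async(key)

        ndb.get_context().call_on_commit(_callback_on_commit)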
In programming web applications, Django in particular, sometimes we have a set of actions that must all succeed or all fail (in order to ensure a predictable state of some sort). Now obviously, when we are working with the database, we can use transactions.
But in some circumstances, these (all or nothing) constraints are needed outside of a database context
(e.g. If payment is a success, we must send the product activation code or else risk customer complaints, etc)
But let's say that on some fateful day, the send_code() function just failed time and again due to some temporary network error (one that lasted for 1+ hours).
Should I log the error, and manually fix the problem, e.g. send the mail manually
Should I set up some kind of work queue, where when things fail, they just go back onto the end of the queue for future retry?
What if the logging/queueing systems also fail? (am I worrying too much at this point?)
We use microservices in our company, and at least once a month one of our microservices is down for a while. We have a Transaction model for the payment process, with statuses for every step that happens before we send a product to the user. If something goes wrong or one of the connected microservices is down, we mark it with status=error and save it to the database. Then we use a cron job to find and finish those processes. Try something to begin with, and if it does not fit your needs, try something else.
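A rough Django-flavoured sketch of that pattern; the model fields, statuses, and send_code() are placeholders for your own code:

from django.db import models

class Transaction(models.Model):
    STATUS_CHOICES = [("pending", "pending"), ("error", "error"), ("done", "done")]
    status = models.CharField(max_length=16, choices=STATUS_CHOICES, default="pending")
    attempts = models.PositiveIntegerField(default=0)

def send_code(txn):
    """Placeholder for the flaky external call from the question."""
    raise NotImplementedError

def retry_failed_sends():
    """Run periodically from a cron job or management command."""
    for txn in Transaction.objects.filter(status="error"):
        try:
            send_code(txn)
            txn.status = "done"
        except Exception:
            txn.attempts += 1        # stays in "error", picked up again next run
        txn.save()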
I'm experiencing occasional Exceeded soft private memory limit errors in a wide variety of request handlers in App Engine. I understand that this error means that the RAM used by the instance has exceeded the amount allocated, and how that causes the instance to shut down.
I'd like to understand what might be the possible causes of the error, and to start, I'd like to understand how app engine python instances are expected to manage memory. My rudimentary assumptions were:
1. An F2 instance starts with 256 MB
2. When it starts up, it loads my application code - let's say 30 MB
3. When it handles a request it has 226 MB available
   - so long as that request does not exceed 226 MB (+ margin of error), the request completes without error
   - if it does exceed 226 MB + margin, the instance completes the request, logs the 'Exceeded soft private memory limit' error, then terminates - now go back to step 1
4. When that request returns, any memory used by it is freed up - i.e. the unused RAM goes back to 226 MB
5. Steps 3-4 are repeated for each request passed to the instance, indefinitely
That's how I presumed it would work, but given that I'm occasionally seeing this error across a fairly wide set of request handlers, I'm now not so sure. My questions are:
a) Does step #4 happen?
b) What could cause it not to happen? or not to fully happen? e.g. how could memory leak between requests?
c) Could storing data in module-level variables cause memory usage to leak? (I'm not knowingly using module-level variables in that way)
d) What tools / techniques can I use to get more data? E.g. measure memory usage at entry to request handler?
In answers/comments, where possible, please link to the gae documentation.
[edit] Extra info: my app is configured as threadsafe: false. If this has a bearing on the answer, please state what it is. I plan to change to threadsafe: true soon.
[edit] Clarification: This question is about the expected behavior of gae for memory management. So while suggestions like 'call gc.collect()' might well be partial solutions to related problems, they don't fully answer this question. Up until the point that I understand how gae is expected to behave, using gc.collect() would feel like voodoo programming to me.
Finally: If I've got this all backwards then I apologize in advance - I really can't find much useful info on this, so I'm mostly guessing..
App Engine's Python interpreter does nothing special, in terms of memory management, compared to any other standard Python interpreter. So, in particular, there is nothing special that happens "per request", such as your hypothetical step 4. Rather, as soon as any object's reference count decreases to zero, the Python interpreter reclaims that memory (module gc is only there to deal with garbage cycles -- when a bunch of objects never get their reference counts down to zero because they refer to each other even though there is no accessible external reference to them).
So, memory could easily "leak" (in practice, though technically it's not a leak) "between requests" if you use any global variable -- said variables will survive the instance of the handler class and its (e.g) get method -- i.e, your point (c), though you say you are not doing that.
Once you declare your module to be threadsafe, an instance may happen to serve multiple requests concurrently (up to what you've set as max_concurrent_requests in the automatic_scaling section of your module's .yaml configuration file; the default value is 8). So, your instance's RAM will need to be a multiple of what each request needs.
As for (d), to "get more data" (I imagine you actually mean, get more RAM), the only thing you can do is configure a larger instance_class for your memory-hungry module.
To use less RAM, there are many techniques -- which have nothing to do with App Engine, everything to do with Python, and in particular, everything to do with your very specific code and its very specific needs.
The one GAE-specific issue I can think of is that ndb's caching has been reported to leak -- see https://code.google.com/p/googleappengine/issues/detail?id=9610 ; that thread also suggests workarounds, such as turning off ndb caching or moving to old db (which does no caching and has no leak). If you're using ndb and have not turned off its caching, that might be the root cause of "memory leak" problems you're observing.
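If that does turn out to be the culprit, the caching can be switched off either on the request's context or per model. A quick sketch (the handler and MyModel are placeholders):

from google.appengine.ext import ndb

def my_handler():
    # Per request: the context is recreated for every request, so set the
    # policies where the request is handled (or in shared middleware).
    ctx = ndb.get_context()
    ctx.set_cache_policy(False)       # disable the in-context cache
    ctx.set_memcache_policy(False)    # disable memcache use
    # ... rest of the handler ...

class MyModel(ndb.Model):
    # Per model: applies everywhere this model is read or written.
    _use_cache = False
    _use_memcache = False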
Point 4 is an invalid assumption; Python's garbage collector doesn't return memory that easily. The Python process keeps holding that memory, and it isn't reused until the garbage collector makes a pass. In the meantime, if some other request requires more memory, new memory might be allocated on top of the memory from the first request. If you want to force Python to garbage collect, you can use gc.collect() as mentioned here.
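As a small instrumentation sketch (a webapp2 handler is assumed), you can force a collection at the end of each request and log how many objects it reclaimed, to see whether unreachable cycles are accumulating between requests:

import gc
import logging

import webapp2

class InstrumentedHandler(webapp2.RequestHandler):
    def get(self):
        self.response.write('ok')

    def dispatch(self):
        try:
            super(InstrumentedHandler, self).dispatch()
        finally:
            freed = gc.collect()
            logging.info('gc.collect() reclaimed %d objects', freed)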
Take a look at this Q&A for approaches to check on garbage collection and for potential alternate explanations: Google App Engine DB Query Memory Usage