App Engine Batch Put Too Large? - python

I occasionally get this error when I do batch puts.
RequestTooLargeError: The request to API call datastore_v3.Put() was too large.
The call that triggers this is a db.put() on a list of 1,000+ entities. Each entity has a single db.TextProperty field filled with about 20,000 characters. Each entity also has a parent entity, although none of the entities in the list passed to db.put share a common parent. Each of the parent entities stores about 10 integers and isn't very large.
My first instinct was to split up the number of entities being passed to db.put, but I wasn't sure that was the real cause (see the edit below).
Any ideas on the cause of this?
Edit: Splitting up the entities does work. For example, I can do this:
for entity in entities:
    entity.put()
But the answer to this question suggests that the number of entities being put shouldn't matter, so I'm still confused.
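A middle ground between one oversized db.put() and a separate put() per entity is to write in smaller chunks. A minimal sketch, assuming a chunk size of 50 (an arbitrary number to tune against the request size limit):

from google.appengine.ext import db

def put_in_chunks(entities, chunk_size=50):
    # Write the entities in smaller batches so no single
    # datastore_v3.Put() RPC exceeds the request size limit.
    for i in range(0, len(entities), chunk_size):
        db.put(entities[i:i + chunk_size])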

Remember that a quick test to find an entity's size is to try to write it to memcache. If it exceeds memcache's 1 MB limit, the write will fail and you can catch the exception. May be of use if you follow Nick's suggestion to isolate the issue.
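A rough sketch of that probe (whether an oversized value surfaces as a False return or a ValueError may depend on the SDK version, so both are handled):

from google.appengine.api import memcache

def roughly_too_big(entity):
    # Crude size check: memcache rejects values over ~1 MB,
    # which is close to the datastore's per-entity limit.
    try:
        return not memcache.set('size_probe', entity)
    except ValueError:
        return True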

I had been using a StringProperty and just ran into the same issue when the data stored in that property eventually grew too big (it passed 1 MB). I was able to fix it for now with a JsonProperty with compressed=True. Might be an option.
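A minimal sketch of that approach, with Payload as a made-up model name; the compressed bytes are what count against the entity size limit:

from google.appengine.ext import ndb

class Payload(ndb.Model):
    # Stored zlib-compressed on write and decompressed on read;
    # text that compresses well gains a lot of headroom this way.
    data = ndb.JsonProperty(compressed=True)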

Is this batch update inside a transaction? If so, remember there is a 10 MB limit on transaction size.

Related

Ndb strong consistency and frequent writes

I'm trying to achieve strong consistency with ndb using Python, and it looks like I'm missing something, since my reads behave as if they're not strongly consistent.
The query is:
links = Link.query(ancestor=lead_key).filter(Link.last_status == None).fetch(keys_only=True)
if links:
    do_action()
The key structure is:
Lead root (generic key) -> Lead -> Website (one per lead) -> Link
I have many tasks that are executed concurrently using TaskQueue, and this query is performed at the end of every task. Sometimes I'm getting a "too much contention" exception when updating the last_status field, but I deal with it using retries. Can it break strong consistency?
The expected behavior is having do_action() called when there are no links left with last_status equal to None. The actual behavior is inconsistent: sometimes do_action() is called twice and sometimes not called at all.
Using an ancestor key to get strong consistency has a limitation: you're limited to one update per second per entity group. One way to work around this is to shard the entity groups. Sharding Counters describes the technique. It's an old article, but as far as I know, the advice is still sound.
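For reference, a compressed sketch of the pattern from that article, with illustrative model and function names: spread the writes across N entity groups and sum them when reading.

import random
from google.appengine.ext import ndb

NUM_SHARDS = 20

class CounterShard(ndb.Model):
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(name):
    # Each write lands in a randomly chosen shard, so concurrent
    # writers rarely contend on the same entity group.
    shard_id = '%s-%d' % (name, random.randint(0, NUM_SHARDS - 1))
    shard = CounterShard.get_by_id(shard_id) or CounterShard(id=shard_id)
    shard.count += 1
    shard.put()

def total(name):
    keys = [ndb.Key(CounterShard, '%s-%d' % (name, i)) for i in range(NUM_SHARDS)]
    return sum(s.count for s in ndb.get_multi(keys) if s)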
Adding to Dave's answer, which covers the first thing to check:
One thing that isn't well documented and can be a bit surprising is that contention can be caused by read operations as well, not only by writes.
Whenever a transaction starts, the entity groups being accessed (by read or write ops, it doesn't matter) are marked as such. The too much contention error indicates that too many parallel transactions are simultaneously trying to access the same entity group. It can happen even if none of the transactions actually attempts to write!
Note: this contention is NOT emulated by the development server; it can only be seen when deployed on GAE, with the real datastore!
What can add to the confusion is the automatic retrying of transactions, which can happen after actual write conflicts as well as after plain access contention. These retries may appear to the end user as suspicious repeated execution of some code paths, which I suspect could explain your reports of do_action() being called twice.
Usually when you run into such problems you have to revisit your data structures and/or the way you're accessing them (your transactions). In addition to solutions that maintain strong consistency (which can be quite expensive), you may want to re-check whether consistency is actually a must. In some cases it's added as a blanket requirement just because it appears to simplify things. From my experience it doesn't :)
There is nothing in your sample that ensures that your code is only called once.
For the moment, I am going to assume that your "do_action" function does something to the Link entities, specifically that it sets the "last_status" property.
If you do not perform the query and the write to the Link entity inside a transaction, then it is possible for two different requests (task queue tasks) to get results back from the query and then both write their new value to the Link entity, with the last write overwriting the previous value.
Remember that even if you do use a transaction, you don't know until the transaction has successfully completed that nobody else tried to perform a write. This is important if you are trying to do something external to the datastore (for example, making an HTTP request to an external system), as you may see HTTP requests from transactions that eventually fail with a concurrent modification exception.
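A minimal sketch of that idea, assuming the task itself is what sets last_status and that Link lives in the lead's entity group, as the key structure above suggests (the function name and status value are made up):

from google.appengine.ext import ndb

@ndb.transactional
def finish_link(link_key, lead_key):
    # Update the link and re-check the remaining work inside one
    # transaction, so two tasks cannot both observe "nothing left".
    link = link_key.get()
    link.last_status = 'done'
    link.put()
    remaining = Link.query(Link.last_status == None,
                           ancestor=lead_key).count(limit=1)
    return remaining == 0  # caller runs do_action() only on a True result

Because transactions on the same entity group are serialized, only one of the concurrent tasks can be the one that clears the last remaining None and sees zero left.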

How efficient is Google App Engine ndb.delete_multi()?

I'm working on something to clear my database of ~10,000 entities, and my plan is to put it in a task that deletes 200 at a time using ndb.delete_multi() and then recursively calls itself until there are no entities left.
For now, I don't have the recursion in it yet so I could run the code a few times manually and check for errors, quota use, etc. The code is:
entities = MyModel.query_all(ndb.Key('MyModel', '*defaultMyModel')).fetch(200)
key_list = ndb.put_multi(entities)
ndb.delete_multi(key_list)
All the query_all() does is query MyModel and return everything.
I've done some testing by commenting out things and running the method, and it looks like the first two lines take up the expected number of writes (~200).
Running the third line, ndb.delete_multi(), takes up about 8% of my 50,000 daily write allowance, so about 4,000 writes, 20 times as many as I think it should be doing.
I've also made sure the key_list contains only 200 keys with logging.
Any ideas on why this takes up so many writes? Am I using the method wrong? Or does it just use a ton of memory? In that case, is there any way for me to do this more efficiently?
Thanks.
When you delete an entity, the Datastore has to remove an entity and a record from an index for each indexed property as well as for each custom index. The number of writes is not dependent on which delete method you use.
Your code example is extremely inefficient. If you are deleting large numbers of entities then you will need to batch the code below (a cursor-based sketch of that follows the snippet), but the key point is that you should be retrieving the data with a keys-only query and then deleting:
from google.appengine.ext import ndb

ndb.delete_multi(
    MyModel.query().fetch(keys_only=True)
)
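When the dataset is too large for a single fetch, one way to batch it is with query cursors; a sketch (the batch size is arbitrary):

from google.appengine.ext import ndb

def delete_all_mymodel(batch_size=500):
    # Page through the keys with a cursor, deleting one page at a time.
    cursor, more = None, True
    while more:
        keys, cursor, more = MyModel.query().fetch_page(
            batch_size, start_cursor=cursor, keys_only=True)
        if keys:
            ndb.delete_multi(keys)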
Regarding the number of write operations (see Andrei's answer), make sure that only the fields on your model that actually need to be queried have indexing enabled.
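A sketch of what that looks like on an ndb model (property names are illustrative):

from google.appengine.ext import ndb

class MyModel(ndb.Model):
    name = ndb.StringProperty()               # filtered or sorted on: keep indexed
    body = ndb.TextProperty()                 # TextProperty is never indexed
    note = ndb.StringProperty(indexed=False)  # never queried: skips the index writes

Every indexed property value adds index entries that have to be written on every put and removed on every delete, which is where the unexpected write counts usually come from.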

Is fetch() better than list(Model.all().run()) for returning a list from a datastore query?

Using Google App Engine Python 2.7 Query Class -
I need to produce a list of results that I pass to my django template. There are two ways I've found to do this.
Use fetch(); however, the docs say that fetch should almost never be used. https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_fetch
Use run() and then wrap it in list(), thereby creating the list object.
Is one preferable to the other in terms of memory usage? Is there another way I could be doing this?
The key here is why fetch “should almost never be used”. The documentation says that fetch will get all the results, therefore having to keep all of them in memory at the same time. If the data you get is big, you will need lots of memory.
You say you can wrap run inside list. Sure, you can do that, but you will hit exactly the same problem—list will force all the elements into memory. So, this solution is actually discouraged on the same basis as using fetch.
Now, you could say: so what should I do? The answer is: in most cases you can deal with elements of your data one by one, without keeping them all in memory at the same time. For example, if all you need is to put the result data into a django template, and you know that it will be used at most once in your template, then the django template will happily take any iterator—so you can pass the run call result directly without wrapping it into list.
Similarly, if you need to do some processing, for example go over the results to find the element with the highest price or ranking, or whatever, you can just iterate over the result of run.
But if your usage requires having all the elements in memory (e.g. your django template uses the data from the query several times), then you have a case where fetch or list(run(…)) actually makes sense. In the end, this is just the typical trade-off: if your application needs an algorithm that requires all the data in memory, you have to pay for it, so you can either redesign your algorithms and usage to work with an iterator, or use fetch and pay for it with longer processing times and higher memory usage. Google of course encourages you to do the first thing. And this is what "should almost never be used" actually means.
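A sketch of the two styles (MyModel and render are stand-ins for your model and view code):

# Streaming: a single pass (e.g. one loop in the template) can consume
# the iterator returned by run(), which pulls results in batches.
items = MyModel.all().run()
render('page.html', {'items': items})

# Materialized: fetch() holds every result in memory at once; only
# worth it when the data is reused several times.
items = MyModel.all().fetch(1000)
render('page.html', {'items': items, 'count': len(items)})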

Google Application Engine slow in case of Python

I am reading a "table" in Python on GAE that has 1000 rows, and the program stops because the time limit is reached (so it takes at least 20 seconds).
Is that possible that GAE is that slow? Is there a way to fix that?
Is this because I use free service and I do not pay for it?
Thank you.
The code itself is this:
liststocks = []
userall = user.all()  # this has three fields username... trying to optimise by this line
stocknamesall = stocknames.all()  # 1 field, name of the stocks, trying to optimise by this line too
for u in userall:  # userall has 1000 users
    for stockname in stocknamesall:  # 4 stocks
        astock = stocksowned()  # it is also a "table", no relevance I think
        astock.quantity = random.randint(1, 100)
        astock.nameid = u.key()
        astock.stockid = stockname.key()
        liststocks.append(astock)
GAE is slow when used inefficiently. Like any framework, sometimes you have to know a little bit about how it works in order to efficiently use it. Luckily, I think there is an easy improvement that will help your code a lot.
It is faster to use fetch() explicitly instead of using the iterator. The iterator causes entities to be fetched in "small batches" - each "small batch" results in a round-trip to the datastore to get more data. If you use fetch(), then you'll get all the data at once with just one round-trip to the datastore. In short, use fetch() if you know you are going to need lots of results.
In this case, using fetch() will help a lot - you can easily get all your users and stocknames in one round-trip to the datastore each. Right now you're making lots of extra round-trips to the datastore and re-fetching stockname entities too!
Try this (you said your table has 1000 rows, so I use fetch(1000) to make sure you get all the results; use a larger number if needed):
userall = user.all().fetch(1000)
stocknamesall = stocknames.all().fetch(1000)
# rest of the code as-is
To see where you could make additional improvements, please try out AppStats so you can see exactly why your request is taking so long. You might even consider posting a screenshot (like this) of the appstats info about your request along with your post.
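If Appstats isn't set up yet, the usual way to enable it on the Python 2.7 runtime (per the GAE docs, but worth double-checking against your SDK version) is an appengine_config.py like this, plus the appstats builtin enabled in app.yaml:

# appengine_config.py -- records RPC and datastore timings for every request
def webapp_add_wsgi_middleware(app):
    from google.appengine.ext.appstats import recording
    return recording.appstats_wsgi_middleware(app)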

Getting a list of child entities in App Engine using get_by_key_name (Python)

My adventures with entity groups continue after a slightly embarrassing beginning (see Under some circumstances an App Engine get_by_key_name call using an existing key_name returns None).
I now see that I can't do a normal get_by_key_name call over a list of entities for child entities that have more than one parent entity. As the Model docs say,
"Multiple entities requested by one (get_by_key_name) call must all have the same parent."
I've gotten into the habit of doing something like the following:
# Model just has the basic properties
entities = Model.get_by_key_name(key_names)
# ContentModel has all the text and blob properties for Model
content_entities = ContentModel.get_by_key_name(content_key_names)
for entity, content_entity in zip(entities, content_entities):
    # do some stuff
Now that ContentModel entities are children of Model entities, this won't work because of the single-parent requirement.
An easy way to enable the above scenario with entity groups is to be able to pass a list of parents to a get_by_key_name call, but I'm guessing that there's a good reason why this isn't currently possible. I'm wondering if this is a hard rule (as in there is absolutely no way such a call could ever work) or if perhaps the db module could be modified so that this type of call would work, even if it meant a greater CPU expense.
I'd also really like to see how others accomplish this sort of task. I can think of a bunch of ways of handling it, like using GQL queries, but none I can think of approach the performance of a get_by_key_name call.
Just create a key list and do a get on it.
entities = Model.get_by_key_name(key_names)
content_keys = [db.Key.from_path('Model', name, 'ContentModel', name)
                for name in key_names]
content_entities = ContentModel.get(content_keys)
Note that I assume the key_name for each ContentModel entity is the same as its parent Model. (For a 1:1 relationship, it makes sense to reuse the key_name.)
I'm embarrassed to say that the restriction ('must be in the same entity group') actually no longer applies in this case. Please do feel free to file a documentation bug!
In any case, get_by_key_name is only syntactic sugar for get, as Bill Katz illustrates. You can even go a step further and use db.get on a list of keys to get everything in one go; db.get doesn't care about model type.
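Building on Bill Katz's snippet (and keeping his assumption that each ContentModel reuses its parent's key_name), a sketch of that single db.get call; results come back in the same order as the keys:

from google.appengine.ext import db

model_keys = [db.Key.from_path('Model', name) for name in key_names]
content_keys = [db.Key.from_path('Model', name, 'ContentModel', name)
                for name in key_names]
results = db.get(model_keys + content_keys)   # one RPC, mixed kinds
entities = results[:len(key_names)]
content_entities = results[len(key_names):]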
