Say I have a query that will be executed often, most likely yielding the same results.
Is it correct that using:
for key in qry.iter(keys_only=True):
    item = key.get()
    # do something with item
would perform better than:
for item in qry:
    # do something with item
Because in the first example the query only loads keys, and the subsequent key.get() calls take advantage of NDB's caching mechanism, whereas the second example always fetches the entities from the datastore? Or have I misunderstood something?
I would doubt that the first form would perform better -- it is always possible that the values are not in the cache, and then, presuming you are getting more than one entity back, you'd be making multiple round trips, one per key.get() call. That quickly gets slower.
A better approach is indeed what's shown in http://code.google.com/p/appengine-ndb-experiment/issues/detail?id=118 -- use ndb.get_multi(q.fetch(keys_only=True)). But even that is worse if your cache hit rate is too low; this is discussed extensively in the issue.
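For concreteness, here is a minimal sketch of that batched pattern, assuming qry is the query from the question:

from google.appengine.ext import ndb

# One keys-only query (cheap), then a single batched get. Entities that
# are already in NDB's in-context cache or memcache skip the datastore
# round trip entirely.
keys = qry.fetch(keys_only=True)
for item in ndb.get_multi(keys):
    # do something with item
    pass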
AFAIK it will not make any difference, because internally NDB caches everything, including query results. If you are going to do other work with each entity, try the async API; that can save valuable time. Edit: moreover, if NDB knows the query in advance, it can even prefetch the results.
I read this about six months back, so I'm not sure what the current behavior is.
I'm trying to achieve strong consistency with ndb using python.
It looks like I'm missing something, because my reads behave as if they're not strongly consistent.
The query is:
links = (Link.query(ancestor=lead_key)
         .filter(Link.last_status == None)
         .fetch(keys_only=True))
if links:
    do_action()
The key structure is:
Lead root (generic key) -> Lead -> Website (one per lead) -> Link
I have many tasks that are executed concurrently using TaskQueue, and this query is performed at the end of every task. Sometimes I get a "too much contention" exception when updating the last_status field, but I deal with it using retries. Can that break strong consistency?
The expected behavior is having do_action() called when there are no links left with last_status equal to None. The actual behavior is inconsistent: sometimes do_action() is called twice, and sometimes it is not called at all.
Using an ancestor key to get strong consistency has a limitation: you're limited to one update per second per entity group. One way to work around this is to shard the entity groups. Sharding Counters describes the technique. It's an old article, but as far as I know the advice is still sound.
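A minimal sketch of the sharding technique from that article, adapted here from memory (the model name, shard count, and key format are illustrative, not the article's exact code):

import random
from google.appengine.ext import ndb

NUM_SHARDS = 20  # more shards = more sustained writes per second

class CounterShard(ndb.Model):
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def _increment(shard_id):
    # Each shard is its own entity group, so concurrent writers contend
    # over NUM_SHARDS groups instead of a single one.
    shard = CounterShard.get_by_id(shard_id)
    if shard is None:
        shard = CounterShard(id=shard_id)
    shard.count += 1
    shard.put()

def increment(name):
    _increment('%s-%d' % (name, random.randint(0, NUM_SHARDS - 1)))

def get_count(name):
    keys = [ndb.Key(CounterShard, '%s-%d' % (name, i))
            for i in range(NUM_SHARDS)]
    return sum(s.count for s in ndb.get_multi(keys) if s is not None)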
Adding to Dave's answer, which covers the first thing to check:
One thing which isn't well documented and can be a bit surprising is that contention can be caused by read operations as well, not only by writes.
Whenever a transaction starts, the entity groups being accessed (by read or write ops, it doesn't matter) are marked as such. The "too much contention" error indicates that too many parallel transactions are simultaneously trying to access the same entity group. It can happen even if none of the transactions actually attempts to write!
Note: this contention is NOT emulated by the development server; it can only be seen when deployed on GAE, with the real datastore!
What can add to the confusion is the automatic retrying of transactions, which can happen after both actual write conflicts and plain access contention. These retries may appear to the end user as suspicious repeated execution of some code paths, which I suspect could explain your report of do_action() being called twice.
Usually when you run into such problems you have to revisit your data structures and/or the way you're accessing them (your transactions). In addition to solutions that maintain strong consistency (which can be quite expensive), you may want to re-check whether consistency is actually a must. In some cases it's added as a blanket requirement just because it appears to simplify things. From my experience it doesn't :)
There is nothing in your sample that ensures that your code is only called once.
For the moment, I am going to assume that your "do_action" function does something to the Link entities, specifically that it sets the "last_status" property.
If you do not perform the query and the write to the Link entity inside a transaction, then it is possible for two different requests (task queue tasks) to get results back from the query, and then both write their new value to the Link entity (with the last write overwriting the previous value).
Remember that even if you do use a transaction, you don't know until the transaction is successfully completed that nobody else tried to perform a write. This is important if you are trying to do something external to datastore (for example, making a http request to an external system), as you may see http requests from transactions that would eventually fail with a concurrent modification exception.
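As a minimal sketch of that idea (assuming ndb, which the question's code appears to use; the 'done' status value and the function name are placeholders), wrapping both the status update and the check in one transaction on the lead's entity group serializes the concurrent tasks:

from google.appengine.ext import ndb

@ndb.transactional
def finish_link(link_key, lead_key):
    link = link_key.get()
    link.last_status = 'done'  # placeholder status value
    link.put()
    # Queries inside a transaction do not see the transaction's own
    # writes, so the link updated above must be excluded by hand.
    pending = Link.query(Link.last_status == None,
                         ancestor=lead_key).fetch(keys_only=True)
    return all(k == link_key for k in pending)

# Two concurrent tasks on the same lead serialize on the entity group,
# so exactly one of them observes that it finished the last link.
if finish_link(link_key, lead_key):
    do_action()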
In my Python (Flask) code, I need to get the first element and the last one sorted by a given variable from a SQLAlchemy query.
I first wrote the following code :
first_valuation = Valuation.query.filter_by(..).order_by(sqlalchemy.desc(Valuation.date)).first()
# Do some things
last_valuation = Valuation.query.filter_by(..).order_by(sqlalchemy.asc(Valuation.date)).first()
# Do other things
As these queries can be heavy for the PostgreSQL database, and as I am duplicating my code, I think it would be better to use only one request, but I don't know SQLAlchemy well enough to do it...
(For example, when are queries actually triggered?)
What is the best solution to this problem ?
1) How to get First and Last record from a sql query? is about how to get the first and last records in one query.
2) Here are the docs on the sqlalchemy query API. Specifically, pay attention to union_all (to implement the answers from above); a rough sketch follows below.
The docs also have info on when queries are triggered. Basically, queries are triggered by methods that return results, like first() or all(); that means Valuation.query.filter_by(..).order_by(sqlalchemy.desc(Valuation.date)) on its own will not emit a query to the database.
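A rough sketch of the union_all route (SQLAlchemy's Query.union_all combines the two one-row selects into a single statement, so only one round trip is made; filter_by(asset_id=42) is a hypothetical stand-in for your own criteria):

from sqlalchemy import asc, desc

newest_q = (Valuation.query.filter_by(asset_id=42)
            .order_by(desc(Valuation.date)).limit(1))
oldest_q = (Valuation.query.filter_by(asset_id=42)
            .order_by(asc(Valuation.date)).limit(1))

# Row order of a UNION ALL is not guaranteed, so distinguish the two
# rows by date instead of by position.
rows = newest_q.union_all(oldest_q).all()
first_valuation = max(rows, key=lambda v: v.date)  # newest, as in the question
last_valuation = min(rows, key=lambda v: v.date)   # oldest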
Also, if memory is not a problem, I'd say get all() the objects from your first query and just pick the first and last results in Python:
results = Valuation.query.filter_by(..).order_by(sqlalchemy.desc(Valuation.date)).all()
first_valuation = results[0]
last_valuation = results[-1]
It will be faster than performing two (even unioned) queries, but it can potentially eat a lot of memory if your table is large enough.
No need to complicate the process so much.
first_valuation = Valuation.query.filter_by(..).order_by(sqlalchemy.desc(Valuation.date)).first()
# Do some things
last_valuation = Valuation.query.filter_by(..).order_by(sqlalchemy.asc(Valuation.date)).first()
This is what you have, and it's good enough. It's not heavy for any database. If you think it's becoming too heavy, you can always add an index.
Don't try to get all the results using all() and then pick items from the list. When you call all(), it loads everything into memory, which is extremely bad if you have a lot of results. It's far better to execute just two queries to get those two items.
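If you do go the index route, a minimal sketch, assuming a Flask-SQLAlchemy model with a date column:

# index=True creates a b-tree index on date, so each
# ORDER BY date ... LIMIT 1 query reads a single row from
# one end of the index instead of sorting the whole table.
class Valuation(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    date = db.Column(db.DateTime, index=True)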
Two code examples (simplified):
.get outside the transaction (object from .get passed into the transactional function)
@db.transactional
def update_object_1_txn(obj, new_value):
    obj.prop1 = new_value
    return obj.put()
.get inside the transaction
@db.transactional
def update_object2_txn(obj_key, new_value):
    obj = db.get(obj_key)
    obj.prop1 = new_value
    return obj.put()
Is the first example logically sound? Is the transaction there useful at all; does it provide anything? I'm trying to better understand App Engine's transactions. Would choosing the second option prevent concurrent modifications of that object?
To answer your question in one word: yes, your second example is the way to do it. Within the boundaries of a transaction, you get some data, change it, and commit the new value.
Your first one is not wrong, though, because you don't read from obj. So even though it might not have the same value that the earlier get returned, you wouldn't notice. Put another way: as written, your examples aren't good at illustrating the point of a transaction, which is usually called "test and set". See a good Wikipedia article on it here: http://en.wikipedia.org/wiki/Test-and-set
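A minimal test-and-set sketch in the style of the second example (the expected parameter and the prop1 check are illustrative):

@db.transactional
def update_if_unchanged(obj_key, expected, new_value):
    # Test and set: re-read inside the transaction, check the current
    # value, and only write if it is still what the caller expected.
    obj = db.get(obj_key)
    if obj.prop1 != expected:
        return None  # someone else changed it first
    obj.prop1 = new_value
    return obj.put()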
More specific to GAE, as defined in GAE docs, a transaction is:
a set of Datastore operations on one or more entities. Each transaction is guaranteed to be atomic, which means that transactions are never partially applied. Either all of the operations in the transaction are applied, or none of them are applied.
which tells you it doesn't have to be just for test and set; it can also be useful for ensuring the batch commit of several entities, etc.
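For instance, a sketch of such a batch commit across two hypothetical Account entities (xg=True makes it a cross-group transaction, since the two accounts live in different entity groups):

from google.appengine.ext import db

@db.transactional(xg=True)
def transfer(from_key, to_key, amount):
    src, dst = db.get([from_key, to_key])
    src.balance -= amount
    dst.balance += amount
    db.put([src, dst])  # both writes commit atomically, or neither does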
Using Google App Engine Python 2.7 Query Class -
I need to produce a list of results that I pass to my django template. There are two ways I've found to do this.
Use fetch(); however, the docs say that fetch should almost never be used: https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_fetch
Use run() and then wrap it in list(), thereby creating a list object.
Is one preferable to the other in terms of memory usage? Is there another way I could be doing this?
The key here is why fetch "should almost never be used". The documentation says that fetch gets all the results, and therefore has to keep all of them in memory at the same time. If the data you get is big, you will need lots of memory.
You say you can wrap run inside list. Sure, you can do that, but you will hit exactly the same problem: list will force all the elements into memory. So this solution is discouraged on the same basis as using fetch.
Now, you could ask: so what should I do? The answer is: in most cases you can deal with the elements of your data one by one, without keeping them all in memory at the same time. For example, if all you need is to pass the result data to a django template, and you know it will be iterated over at most once in your template, then the django template will happily take any iterator, so you can pass the run() result directly without wrapping it in list.
Similarly, if you need to do some processing, for example go over the results to find the element with the highest price or ranking, or whatever, you can just iterate over the result of run.
But if your usage requires having all the elements in memory (e.g. your django template uses the data from the query several times), then you have a case where fetch or list(run(…)) actually makes sense. In the end, this is just the typical trade-off: if your application needs an algorithm that requires all the data in memory, you have to pay for it in memory. So you can either redesign your algorithm and usage to work with an iterator, or use fetch and pay for it with longer processing times and higher memory usage. Google, of course, encourages you to do the former, and that is what "should almost never be used" actually means.
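A minimal sketch of the iterator route, assuming a webapp2 handler and a hypothetical Item model; the template iterates over items exactly once:

import webapp2
from google.appengine.ext.webapp import template

class ItemsHandler(webapp2.RequestHandler):
    def get(self):
        query = Item.all()  # the old db.Query class from the question
        # run() returns an iterator that streams results in batches,
        # so the whole result set is never held in memory at once.
        html = template.render('items.html', {'items': query.run()})
        self.response.write(html)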
For a model like:
class Thing(ndb.Model):
    visible = ndb.BooleanProperty()
    made_by = ndb.KeyProperty(kind=User)
    belongs_to = ndb.KeyProperty(kind=AnotherThing)
I am essentially performing an 'or' query, but comparing different properties, so I can't use a built-in OR. I want to get all Thing entities (belonging to a particular AnotherThing) which either have visible set to True, or have visible set to False and made_by set to the current user.
Which would be less demanding on the datastore (i.e. cost less financially):
1. Query to get everything, i.e. Thing.query(Thing.belongs_to == some_thing.key), and iterate through the results, keeping the visible ones and the ones that aren't visible but are made_by the current user?
2. Query to get the visible ones, i.e. Thing.query(Thing.belongs_to == some_thing.key, Thing.visible == True), and query separately to get the non-visible ones by the current user, i.e. Thing.query(Thing.belongs_to == some_thing.key, Thing.visible == False, Thing.made_by == current_user)?
Number 1 would get many unneeded results, like non-visible Things made by other users, which I think means many reads of the datastore? Number 2 is two whole queries, though, which is also possibly unnecessarily heavy, right? I'm still trying to work out what kinds of interaction with the datastore cause what kinds of costs.
I'm using ndb, tasklets and memcache where necessary, in case that's relevant.
Number 2 is going to cost less, for two reasons. First, you pay for each read of the datastore and for each entity returned by a query, so with the first option you pay to read and return all of the data, while with the second you only pay for what you need.
Secondly, you also pay for backend or frontend time, and with the first method you spend time iterating through all the results, whereas the second method needs no extra time.
I can't see a case where the first option is better (maybe if you only have a few entities?).
To understand how reads and queries are billed, scroll down a little on:
https://developers.google.com/appengine/docs/billing
You will see how Reads, Writes and Smalls are added up for reads, writes and queries.
I would also just query for the ones that are owned by the current user, instead of visible == False and owner == current; this way you don't need a composite index, which will save some time. You can also make visible a partial index, thus saving some space as well (only index it when it's True, assuming you never need to query for False ones). You will need to do a little work to remove duplicates, but that is probably not too bad; a sketch follows.
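A minimal sketch of that two-query-plus-dedupe approach, assuming the Thing model from the question and deduplicating by entity key:

# Query 1: everything visible under this AnotherThing.
visible_q = Thing.query(Thing.belongs_to == some_thing.key,
                        Thing.visible == True)
# Query 2: everything made by the current user, visible or not; no
# composite index over (visible, made_by) is needed this way.
mine_q = Thing.query(Thing.belongs_to == some_thing.key,
                     Thing.made_by == current_user)

# The two result sets can overlap (visible things made by the user),
# so deduplicate by key.
things = {t.key: t for t in visible_q}
things.update({t.key: t for t in mine_q})
results = list(things.values())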
You are probably best benchmarking both cases using real-world data. It's hard to determine things like this in the abstract, as there are many subtleties that may affect overall performance.
I would expect option 2 to be better, though. Loading tons of objects that you don't care about puts a heavy burden on the datastore that I don't think the overhead of an extra query comes close to. Of course, it depends on how many extra things there are, etc.