Google NDB: How to Make a Keys Only Query By Id - python

I'd like to check whether an entity still exists in an NDB datastore. I have the entity's ID, and I do not want this operation to count as a read operation, but I can't see how to make a keys_only=True query while using get_by_id.

It's not possible to use keys_only with the .get() operation.
You can do it with a query, but you will get one read operation anyway, and queries are slower and don't use memcache. It is probably still worth doing if your entity is big enough:
Foo.query(Foo.key == ndb.Key(Foo, '11nNpmkaQk3iJ1kIFNQXAM')).get(keys_only=True)
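As a minimal sketch of that pattern (the entity_exists helper is illustrative, not from the question):

from google.appengine.ext import ndb

class Foo(ndb.Model):
    pass

def entity_exists(entity_id):
    # A keys-only query skips deserializing the entity payload, although
    # the query itself still counts as one read, as noted above.
    key = Foo.query(Foo.key == ndb.Key(Foo, entity_id)).get(keys_only=True)
    return key is not None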

Related

Google Cloud Datastore Indexes for count queries

Google Cloud Datastore mandates that composite indexes be built in order to query on multiple fields of one kind. Take the following query as an example:
class Greeting(ndb.Model):
    user = ndb.StringProperty()
    place = ndb.StringProperty()

# Query 1
Greeting.query(Greeting.user == 'yash@gmail.com', Greeting.place == 'London').fetch()

# Query 2
Greeting.query(Greeting.user == 'yash@gmail.com', Greeting.place == 'London').count()
I am using python with ndb to access cloud datastore. In the above example, Query 1 raises NeedIndexError if there is no composite index defined on user and place. But Query 2 works fine even if there is no index on user and place.
I would like to understand how Cloud Datastore fetches the count (Query 2) without the index when it mandates the index for fetching the list of entities (Query 1). I understand it stores stats per kind per index, which would give a quicker response for counts on existing indexes (refer to the docs). But that doesn't explain the above behaviour.
Note: there is no issue when querying on one property of a given kind, as Cloud Datastore has built-in indexes on single properties by default.
There is no clear and direct explanation of why this happens, but most likely it's because of how the improved query planner works with zigzag merge joins over the built-in single-property indexes.
You can read more about this here: https://cloud.google.com/appengine/articles/indexselection#Improved_Query_Planner
The likely reason count() works while fetch() does not is that with count() you don't need to keep a large number of results in memory.
So in the case of count() the work can easily be scaled by splitting it into multiple chunks processed in parallel and then summing the per-chunk counts into one result. You can't do this cheaply with cursors/recordsets.
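To illustrate the intuition only (this is not Datastore's actual implementation; split_into_chunks and count_chunk are hypothetical):

# Counts compose trivially: count disjoint chunks of the result space
# in parallel, then sum. Merging full, ordered result sets is far more
# expensive, which is one plausible reason count() has more freedom.
def parallel_count(query, split_into_chunks, count_chunk):
    return sum(count_chunk(chunk) for chunk in split_into_chunks(query))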

Python, gae, ndb - get all keys in a kind

I know how to get an entity by its key using Book.get_by_id(key),
where Book is an ndb.Model.
How do I get all the keys within my Kind?
Is it done using fetch() (https://cloud.google.com/appengine/docs/python/ndb/queryclass#Query_fetch)?
I don't want to get the keys/IDs from a given entity or some value. I just want to retrieve all the available keys, so I can fetch their respective entities and display them all to the user.
If you only want the keys, use the keys_only keyword in the fetch() method:
Book.query().fetch(keys_only=True)
Then you can fetch all the entities using ndb.get_multi(keys). According to Guido, this may be more efficient than returning the entities in the query (if the entities are already in the cache).
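A minimal sketch of that two-step pattern, reusing the Book model from the question:

from google.appengine.ext import ndb

class Book(ndb.Model):
    title = ndb.StringProperty()

keys = Book.query().fetch(keys_only=True)  # index scan, no entity payloads
books = ndb.get_multi(keys)                # key lookups; may hit the cache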
With all_books = Book.query().fetch() the all_books variable will now have every entity of your Book model.
Note though that when you have lots of entities in the Book model, it won't be a good idea to load and show them all at once. You will need some kind of pagination implementation (depending on what exactly you're doing); otherwise your pages will take forever to load, which creates a bad experience for your users.
Read more at https://cloud.google.com/appengine/docs/python/ndb/queries
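A rough cursor-based pagination sketch, assuming the Book model above (PAGE_SIZE and the token plumbing are illustrative):

from google.appengine.ext import ndb

PAGE_SIZE = 20

def get_books_page(websafe_cursor=None):
    # Resume from the cursor the client sent back, if any.
    start = ndb.Cursor(urlsafe=websafe_cursor) if websafe_cursor else None
    books, next_cursor, more = Book.query().fetch_page(PAGE_SIZE, start_cursor=start)
    # Hand the client an opaque token it can send back for the next page.
    next_token = next_cursor.urlsafe() if (more and next_cursor) else None
    return books, next_token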
If you only wish to get all the keys, just use
Book.query().fetch(keys_only=True)
which will return a list of all keys of that kind. If you want the IDs rather than the keys, you can use:
[key.id() for key in Book.query().fetch(keys_only=True)]

Why is a keys_only query not returning strongly consistent results?

From what I have been reading in the Google Docs and other SO questions, keys_only queries should return strongly consistent results (here and here, for example).
My code looks something like this:
class ClientsPage(SomeHandler):
    def get(self):
        query = Client.query()
        clients = query.fetch(keys_only=True)
        self.write(len(clients))
Even though I am fetching the results with the keys_only=True parameter I am getting stale results right after the creation of a new Client object (which is a root entity). If there were 2 client objects before the insertion, it keeps showing 2 after inserting and redirecting. I have to manually refresh the page in order to see the number change to 3.
I understand I could use ancestor queries, but I am testing some things first and I was surprised to see that a keys_only query returned stale results. Can anyone please explain to me what's going on?
EDIT 1:
This happened in the development server, I have not tested it in production.
Eventual consistency exists because the Datastore needs time to update all indexes. A keys-only query is the same as any other query, except that it tells the Datastore: I don't need the entire entity, just return the key. The query still looks at the indexes to get the list of results, so it sees the same index lag.
In contrast, getting an entity by its key does not need to look at the indexes, so it is always strongly consistent.
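A short sketch contrasting the access paths (the 'Account' parent is a hypothetical ancestor; an ancestor query only helps if the Client entities were created under that parent):

from google.appengine.ext import ndb

class Client(ndb.Model):
    name = ndb.StringProperty()

# Strongly consistent: a lookup by key bypasses the indexes entirely.
client = ndb.Key(Client, 42).get()

# Strongly consistent: an ancestor query is confined to one entity group.
parent = ndb.Key('Account', 'default')
keys = Client.query(ancestor=parent).fetch(keys_only=True)

# Eventually consistent: a global query reads possibly-stale indexes.
keys = Client.query().fetch(keys_only=True)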

google app engine cross group transactions needing parent ancestor

From my understanding, @db.transactional(xg=True) allows for transactions across entity groups; however, the following code returns "queries inside transactions must have ancestors".
@db.transactional(xg=True)
def insertUserID(self, userName):
    user = User.gql("WHERE userName = :1", userName).get()
    highestUser = User.all().order('-userID').get()
    nextUserID = highestUser.userID + 1
    user.userID = nextUserID
    user.put()
Do you need to pass in the key for each entity despite being a cross group transaction? Can you please help modify this example accordingly?
An XG transaction can be applied across at most 25 entity groups. An ancestor query limits the query to a single entity group, so you would be able to run queries within those 25 entity groups in a single XG transaction.
A transactional query without an ancestor could potentially touch every entity group in the application and lock everything up, so you get an error message instead.
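A minimal sketch of a query that is legal inside a transaction (parent_key is an assumption; it must be the ancestor the User entities were created under):

from google.appengine.ext import db

@db.transactional(xg=True)
def get_user_in_txn(parent_key, user_name):
    # The ancestor() clause pins the query to one entity group,
    # which is what makes it permissible inside a transaction.
    return User.all().ancestor(parent_key).filter('userName =', user_name).get()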
In App Engine one usually tries to avoid monotonically increasing IDs. The auto-assigned ones might go like 101, 10001, 10002 and so on. If you know that you need monotonically increasing IDs and it will work for you performance-wise, how about the following (a code sketch follows this answer):
1. Have some kind of model representation of userId, to enable key_name usage and direct lookup.
2. Query for the highest candidate userId outside the transaction.
3. In the transaction, do a get_or_insert-style lookup: UserId.get_by_key_name(candidateid + 1). If it is already present and points to a different user, try again with +2 and so on until you find a free one and create it, updating the userid attribute of the user at the same time.
If the XG transaction updating UserId+User is too slow, perhaps create UserId plus a task in a (non-XG) transaction, and let the task associate UserId and User afterwards when it executes. Or use a single backend that serializes UserId creation, perhaps with put_async, if you retry to avoid holes in the sequence; that could allow something like 50 creations per second.
If it's possible to use userName as key_name you can do direct lookup instead of query and make things faster and cheaper.
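A rough sketch of the candidate-probing idea above (the UserId model, its fields, and the retry loop are assumptions, not code from the question):

from google.appengine.ext import db

class UserId(db.Model):
    # key_name is str(numeric id), enabling direct lookup inside the txn.
    value = db.IntegerProperty(required=True)
    user = db.ReferenceProperty()

def highest_candidate():
    # Run OUTSIDE the transaction: non-ancestor queries are not allowed inside.
    highest = UserId.all().order('-value').get()
    return (highest.value + 1) if highest else 1

@db.transactional(xg=True)
def try_claim(candidate, user):
    # Direct get by key_name is fine inside a transaction.
    if UserId.get_by_key_name(str(candidate)) is not None:
        return False  # taken; the caller retries with candidate + 1
    UserId(key_name=str(candidate), value=candidate, user=user).put()
    user.userID = candidate
    user.put()
    return True

def assign_user_id(user):
    candidate = highest_candidate()
    while not try_claim(candidate, user):
        candidate += 1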
Cross group transactions allow you to perform a transaction across multiple groups, but they don't remove the prohibition on queries inside transactions. You need to perform the query outside the transaction, and pass the ID of the entity in (and then check any invariants specified in the query still hold) - or, as Shay suggests, use IDs so you don't have to do a query in the first place.
Every datastore entity has a key; a key (among other things) contains either a numeric ID that App Engine assigns to it or a key_name which you can give it.
In your case it looks like you can use the numeric ID: after you call put() on the user entity you will have user.key().id() (or user.key.id() if you're using NDB), which will be unique for each user (as long as all the users have the same parent, which is None in your code).
This ID is not sequential but is guaranteed to be unique, as illustrated below.
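A two-line illustration, using the db API from the question:

user = User(userName='alice')
user.put()                    # App Engine assigns the numeric ID on put()
unique_id = user.key().id()   # with NDB this would be user.key.id()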

Django objects.filter, how "expensive" would this be?

I am trying to make a search view in Django. It is a search form with free-text input plus some options to select, so you can filter on years and so on. Below is part of the code from the view, the part that does the filtering, and I would like some input on how expensive this would be on the database server.
soknad_list = Soknad.objects.all()

if var1:
    soknad_list = soknad_list.filter(pub_date__year=var1)
if var2:
    soknad_list = soknad_list.filter(muncipality__name__exact=var2)
if var3:
    soknad_list = soknad_list.filter(genre__name__exact=var3)

# TEXT SEARCH
stop_word_list = re.compile(STOP_WORDS, re.IGNORECASE)
search_term = '%s' % request.GET['q']
cleaned_search_term = stop_word_list.sub('', search_term)
cleaned_search_term = cleaned_search_term.strip()
if len(cleaned_search_term) != 0:
    soknad_list = soknad_list.filter(Q(dream__icontains=cleaned_search_term) | Q(tags__icontains=cleaned_search_term) | Q(name__icontains=cleaned_search_term) | Q(school__name__icontains=cleaned_search_term))
So what I do is first make a queryset of all objects, then check which variables exist (I fetch these with GET at an earlier point) and filter the results if they do. But this doesn't seem too elegant; it probably does a lot of queries to achieve the result, so is there a better way to do it?
It does exactly what I want, but I guess there is a better/smarter way to do this. Any ideas?
filter itself doesn't execute a query; no query is executed until you explicitly fetch items from the queryset (e.g. with get()), and list(queryset) also executes it.
You can see the query that will be generated by using:
soknad_list.query.as_sql()[0]
You can then put that into your database shell to see how long the query takes, or use EXPLAIN (if your database backend supports it) to see how expensive it is.
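For example (in newer Django versions, str(queryset.query) renders the same SQL; the year value is illustrative):

# The queryset is lazy -- nothing has hit the database yet.
qs = Soknad.objects.filter(pub_date__year=2008)

# Render the SQL Django would run, then EXPLAIN it in your DB shell.
print str(qs.query)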
As Aaron mentioned, you should get hold of the query text that is going to be run against the database and use EXPLAIN (or some other method) to view the query execution plan. Once you have the execution plan you can see what is going on in the database itself. There are a lot of operations that seem very expensive to run through procedural code but are trivial for any database, especially if you provide indexes the database can use to speed up your query.
If I read your question correctly, you're retrieving a result set of all rows in the Soknad table. Once you have these results back, you use the filter() method to trim your results down to meet your criteria. From looking at the Django documentation, it looks like this will do an in-memory filter rather than re-query the database (of course, this really depends on which data access layer you're using and not on Django itself).
The most optimal solution would be to use a full-text search engine (Lucene, Ferret, etc.) to handle this for you. If that is not available or practical, the next best option would be to construct a query predicate (WHERE clause) before issuing your query to the database and let the database perform the filtering.
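A sketch of building the predicate up front with Q objects, borrowing the variables from the question:

import operator
from django.db.models import Q

predicates = []
if var1:
    predicates.append(Q(pub_date__year=var1))
if var2:
    predicates.append(Q(muncipality__name__exact=var2))
if var3:
    predicates.append(Q(genre__name__exact=var3))

if predicates:
    # AND the pieces together; the database sees a single WHERE clause.
    soknad_list = Soknad.objects.filter(reduce(operator.and_, predicates))
else:
    soknad_list = Soknad.objects.all()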
However, as with all things that involve the database, the real answer is 'it depends.' The best suggestion is to try several different approaches using data that is close to production and benchmark them over at least three iterations before settling on a final solution. It may be just as fast, or even faster, to filter in memory rather than in the database.
