Google Cloud Datastore Indexes for count queries - python

Google Cloud Datastore mandates that composite indexes be built in order to query on multiple fields of one kind. Take the following queries as an example:
class Greeting(ndb.Model):
    user = ndb.StringProperty()
    place = ndb.StringProperty()

# Query 1
Greeting.query(Greeting.user == 'yash@gmail.com', Greeting.place == 'London').fetch()

# Query 2
Greeting.query(Greeting.user == 'yash@gmail.com', Greeting.place == 'London').count()
I am using Python with NDB to access Cloud Datastore. In the above example, Query 1 raises NeedIndexError if there is no composite index defined on user and place, but Query 2 works fine even without that index.
I would like to understand how Cloud Datastore computes the count (Query 2) without the index, when it mandates the index for fetching the list of entities (Query 1). I understand it stores statistics per kind per index, which would give quicker responses for counts on existing indexes (see the datastore statistics docs), but that alone doesn't explain the behaviour above.
Note: There is no issue when querying on one property of a given kind, as Cloud Datastore builds single-property indexes by default.

There is no clear, direct explanation of why this happens, but it is most likely down to how the improved query planner performs zigzag merge joins over the built-in single-property indexes.
You can read more about this here: https://cloud.google.com/appengine/articles/indexselection#Improved_Query_Planner
The likely reason count() works where fetch() does not is that count() never has to materialize many results in memory. A count can be split into chunks that are processed in parallel over the single-property indexes, with the per-chunk counts then summed into one. You can't do that cheaply when you have to return the actual entities through cursors/result sets.
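For reference, the composite index that lets Query 1 run is declared in index.yaml, the standard App Engine index definition file:

indexes:
- kind: Greeting
  properties:
  - name: user
  - name: place

With this deployed, fetch() on the two-property query works as well.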

Related

How to compare multiple dates on an NDB query?

I need to fetch objects via an NDB query that match a given start and end date, but I'm unable to run this traditionally simple query because NDB complains:
from google.appengine.ext import ndb
from datetime import datetime

from server.page.models import Post

now = datetime.now()
query = Post.query(
    Post.status == Post.STATUS_ACTIVE,
    Post.date_published_start <= now,
    Post.date_published_end >= now,
)
count = query.count()
Error:
BadRequestError: Only one inequality filter per query is supported.
Encountered both date_published_start and date_published_end
Are there any workarounds for this?
Dynamically obtaining a single result list that can be used directly for pagination, without further processing, is not possible because of the single-inequality-filter-per-query limitation. See the related GAE issue 4301.
As Jeff mentioned, filtering by one inequality (ideally the most restrictive one) followed by further dynamic processing of the results is always an option, inefficient as you noted, but unavoidable if you need total flexibility of the search.
You could improve the performance by using a projection query, reducing the amount of data transferred from the datastore to just the relevant properties.
You could also try to perform 2 keys-only queries, one for each inequality, then compute the intersection of the results - this could give you the pagination counts and list of entities (as keys) faster. Finally you'd get the entities for the current page by direct key lookups for the keys in the page list, ideally batched (using ndb.get_multi()).
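A rough sketch of that intersection approach, reusing the models from the question (ordering and page-size handling are left out):

from datetime import datetime

from google.appengine.ext import ndb

now = datetime.now()

# One keys-only query per inequality; each needs only its own index.
started = Post.query(Post.status == Post.STATUS_ACTIVE,
                     Post.date_published_start <= now).fetch(keys_only=True)
not_ended = Post.query(Post.status == Post.STATUS_ACTIVE,
                       Post.date_published_end >= now).fetch(keys_only=True)

# Intersect the two key sets; query order is lost here, so re-sort if needed.
matching_keys = list(set(started) & set(not_ended))
total = len(matching_keys)                        # pagination count
current_page = ndb.get_multi(matching_keys[:20])  # batched lookup of one page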
Depending on the intended use you might have other alternatives in some cases (additional work required, of course).
You could restrict the scope of the queries. Instead of querying all Post entities since the beginning of time, results from a certain year or month may suffice in some cases. You could then add year and/or month properties to Post and include them as equality filters in your queries, potentially reducing the number of results to process dynamically from thousands to, say, hundreds or fewer.
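A hedged sketch of that narrowing idea; the year and month computed properties are hypothetical additions, not part of the original model:

from datetime import datetime

from google.appengine.ext import ndb

class Post(ndb.Model):
    STATUS_ACTIVE = 1

    status = ndb.IntegerProperty()
    date_published_start = ndb.DateTimeProperty()
    date_published_end = ndb.DateTimeProperty()
    # Derived automatically at put() time, so they can serve as equality filters.
    year = ndb.ComputedProperty(lambda self: self.date_published_start.year)
    month = ndb.ComputedProperty(lambda self: self.date_published_start.month)

# Equality filters plus the single allowed inequality, over a much smaller set:
now = datetime.now()
query = Post.query(Post.status == Post.STATUS_ACTIVE,
                   Post.year == 2016, Post.month == 6,
                   Post.date_published_start <= now)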
You could also avoid the queries altogether for typical, often-used cases. For example, if the intended use is to generate a few kinds of monthly reports, you could keep Report entities containing lists of Post keys for each report kind/month, updated whenever a Post entity's relevant properties change. Instead of querying Post entities for a report, you'd just use the already available list from the respective Report entity. You could also store/cache the actual report upon generation, for direct re-use (instead of re-generating it on every access).
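A minimal sketch of that pre-computed report idea; the Report model and its naming scheme are made up for illustration:

from google.appengine.ext import ndb

class Report(ndb.Model):
    # Keyed by report kind + month, e.g. 'active-2016-06'.
    post_keys = ndb.KeyProperty(kind='Post', repeated=True)

def register_post(post):
    # Call this whenever a Post's relevant properties change.
    report_id = 'active-%04d-%02d' % (post.date_published_start.year,
                                      post.date_published_start.month)
    report = Report.get_or_insert(report_id)
    if post.key not in report.post_keys:
        report.post_keys.append(post.key)
        report.put()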
Another workaround for querying with multiple filters and inequalities is to use the Search API.
https://cloud.google.com/appengine/training/fts_adv/lesson1#query_options
From the documentation:

For example, the query job tag:"very important" sent < 2011-02-28 finds documents that have the term job in any field, contain the phrase very important in a tag field, and have a sent date prior to February 28, 2011.
Just put your data from Datastore query into Search documents and run your query on these documents.
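A hedged sketch of that approach; the index name 'posts' and the field names are invented for illustration, and post is assumed to be an existing Post entity:

import datetime

from google.appengine.api import search

index = search.Index(name='posts')

# Mirror a Post into a search document whenever it is saved.
index.put(search.Document(
    doc_id=str(post.key.id()),
    fields=[search.AtomField(name='status', value='active'),
            search.DateField(name='published_start',
                             value=post.date_published_start.date()),
            search.DateField(name='published_end',
                             value=post.date_published_end.date())]))

# Multiple date inequalities in one query, which Datastore itself can't do:
today = datetime.date.today().isoformat()
results = index.search('status:active AND published_start <= %s '
                       'AND published_end >= %s' % (today, today))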

Counting the number of distinct strings given by a GQL Query in Python

Suppose I have the following GQL database,
class Signatories(db.Model):
    name = db.StringProperty()
    event = db.StringProperty()
This database holds information regarding events that people have signed up for. Say I have the following entries, in the format (name, event): (Bob, TestEvent), (Bob, TestEvent2), (Fred, TestEvent), (John, TestEvent).
The dilemma is that I cannot just aggregate all of Bob's events into one entity, because I'd like to query for all the people signed up for a specific event, and I'd also like to add such entries without having to manually update an existing entity every single time.
How could I count the number of distinct strings given by a GQL Query in Python (in my example, I am specifically trying to see how many people are currently signed up for events)?
I have tried mcount = db.GqlQuery("SELECT name FROM Signatories").count(), but this of course returns the total number of entries, regardless of the uniqueness of each name.
I have also tried count = len(member), where member = db.GqlQuery("SELECT name FROM Signatories"), but unfortunately that only raises an error.
You can't, at least not directly. (By the way, you don't have a GQL database.)
If you have a small number of items, fetch them into memory, use a set operation to produce the unique set, and count that.
If you have larger numbers of entities that make in-memory filtering and counting problematic, then your strategy will be to aggregate the count as you create them. For example:
Create a separate entity each time you create an event, with the pair of strings as the key. That way you will only ever have one entity in the datastore representing the specific pair. Then you can do a straight count.
However, as you get large numbers of these entities, you will need to do some additional work to count them, since a single query.count() will become too expensive. You then need to start looking at counting strategies using the datastore.
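A short sketch of both suggestions; the Signer marker model is a hypothetical variant that keys one entity per distinct name:

from google.appengine.ext import db

# Small datasets: dedupe in memory with a set.
distinct_count = len(set(s.name for s in db.GqlQuery("SELECT name FROM Signatories")))

# Larger datasets: write a marker entity keyed on the name itself, so
# duplicates collapse into one entity, and count the markers instead.
class Signer(db.Model):
    pass

def record_signup(name, event):
    Signatories(name=name, event=event).put()
    Signer.get_or_insert(name)    # no-op if this name was already recorded

signed_up_people = Signer.all().count()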

Google NDB: How to Make a Keys Only Query By Id

I'd like to check whether an entity still exists in an NDB datastore. I have the entity's ID, and I don't want this operation to count as a read operation, but I can't see how to make a keys_only=True query while using get_by_id.
It's not possible to do this with a .get() operation.
You can do it with a query, but you will incur one read operation anyway, and queries are slower and don't use memcache. It's probably still worth doing if your entity is big enough:
Foo.query(Foo.key == ndb.Key(Foo, '11nNpmkaQk3iJ1kIFNQXAM')).get(keys_only=True)
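For example, a minimal existence check built on that query (Foo and the id are placeholders from the question):

from google.appengine.ext import ndb

key = ndb.Key(Foo, '11nNpmkaQk3iJ1kIFNQXAM')
exists = Foo.query(Foo.key == key).get(keys_only=True) is not None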

google app engine cross group transactions needing parent ancestor

From my understanding, @db.transactional(xg=True) allows for transactions across entity groups; however, the following code returns "queries inside transactions must have ancestors".
@db.transactional(xg=True)
def insertUserID(self, userName):
    user = User.gql("WHERE userName = :1", userName).get()
    highestUser = User.all().order('-userID').get()
    nextUserID = highestUser.userID + 1
    user.userID = nextUserID
    user.put()
Do you need to pass in the key for each entity despite being a cross group transaction? Can you please help modify this example accordingly?
An XG transaction can be applied across max 25 entity groups. Ancestor query limits the query to a single entity group, and you would be able to do queries within those 25 entity groups in a single XG transaction.
A transactional query without parent would potentially include all entity groups in the application and lock everything up, so you get an error message instead.
In App Engine one usually tries to avoid monotonically increasing ids; the auto-assigned ones might go like 101, 10001, 10002 and so on. If you know that you need monotonically increasing ids and it will work for you performance-wise, how about:
1. Have some kind of model representation of userId, to enable key_name usage and direct lookup.
2. Query for userId outside the transaction, to get the highest candidate id.
3. In a transaction, do a get_or_insert: look up UserId.get_by_key_name(candidateid+1). If it is already present and pointing to a different user, try again with +2 and so on, until you find a free one and create it, updating the userid attribute of user at the same time (see the sketch after this list).
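A hedged sketch of steps 2-3; UserId is a hypothetical model whose key_name is the numeric id as a string:

from google.appengine.ext import db

class UserId(db.Model):
    user = db.ReferenceProperty(User)

def claim_user_id(user, candidate):
    # Probe candidate ids until a free one is claimed; get_or_insert is
    # transactional, so two racing requests cannot both claim the same id.
    while True:
        entity = UserId.get_or_insert(str(candidate), user=user)
        if UserId.user.get_value_for_datastore(entity) == user.key():
            return candidate   # this id is ours; now set user.userID to it
        candidate += 1         # taken by a different user, try the next one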
If the XG transaction updating UserId+User is too slow, perhaps create UserId plus a task in a (non-XG) transaction and let the executing task associate UserId and User afterwards. Or use a single backend that serializes UserId creation, perhaps with put_async and retries to avoid holes in the sequence, managing something like 50 creations per second.
If it's possible to use userName as key_name you can do direct lookup instead of query and make things faster and cheaper.
Cross group transactions allow you to perform a transaction across multiple groups, but they don't remove the prohibition on queries inside transactions. You need to perform the query outside the transaction, and pass the ID of the entity in (and then check any invariants specified in the query still hold) - or, as Shay suggests, use IDs so you don't have to do a query in the first place.
Every datastore entity has a key, and a key (among other things) has either a numeric id that App Engine assigns to it or a key_name that you can give it.
In your case it looks like you can use the numeric id: after you call put() on the user entity you will have user.key().id() (or user.key.id() if you're using NDB), which will be unique for each user (as long as all the users have the same parent, which is None in your code).
This id is not sequential, but is guaranteed to be unique.
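For example, with the db model from the question:

user = User(userName='alice')
user.put()
auto_id = user.key().id()   # assigned by the datastore; unique, not sequential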

How to get the distinct value of one of my models in Google App Engine

I have a model, below, and I would like to get all the distinct area values. The SQL equivalent is select distinct area from tutorials
class Tutorials(db.Model):
    path = db.StringProperty()
    area = db.StringProperty()
    sub_area = db.StringProperty()
    title = db.StringProperty()
    content = db.BlobProperty()
    rating = db.RatingProperty()
    publishedDate = db.DateTimeProperty()
    published = db.BooleanProperty()
I know that in Python I can do
>>> a = ['google.com', 'livejournal.com', 'livejournal.com', 'google.com', 'stackoverflow.com']
>>> b = set(a)
>>> b
set(['livejournal.com', 'google.com', 'stackoverflow.com'])
But that would require me to move the area values out of the query into another list and then run set() on that list (which sounds very inefficient), and if a distinct item were in position 1001 in the datastore I wouldn't see it because of the fetch limit of 1000.
I would like to get all the distinct values of area in my datastore to dump it to the screen as links.
Datastore cannot do this for you in a single query. A datastore request always returns a consecutive block of results from an index, and an index always consists of all the entities of a given type, sorted according to whatever orders are specified. There's no way for the query to skip items just because one field has duplicate values.
One option is to restructure your data. For example, introduce a new entity type representing an "area". On adding a Tutorial you create the corresponding "area" if it doesn't already exist, and on deleting a Tutorial you delete the corresponding "area" if no Tutorials remain with the same "area". If each area stored a count of Tutorials in that area, this might not be too onerous (although keeping things consistent with transactions etc. would be quite fiddly). The entity's key could be based on the area string itself, meaning you can always do key lookups rather than queries to get area entities.
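A rough sketch of that restructuring; the Area model and function names are made up for illustration:

from google.appengine.ext import db

class Area(db.Model):
    # key_name is the area string itself, so lookups need no query.
    tutorial_count = db.IntegerProperty(default=0)

def add_tutorial_area(area_name):
    def txn():
        area = Area.get_by_key_name(area_name) or Area(key_name=area_name)
        area.tutorial_count += 1
        area.put()
    db.run_in_transaction(txn)

# Listing the distinct areas is then a plain query over Area entities.
areas = [a.key().name() for a in Area.all()]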
Another option is to use a queued task or cron job to periodically build a list of all areas, accumulating it over multiple requests if need be, and put the results either in the datastore or in memcache. That of course means the list of areas may be temporarily out of date at times (or, with constant changes, never entirely up to date), which may or may not be acceptable to you.
Finally, if there are likely to be very few areas compared with tutorials, you could do it on the fly by requesting the first Tutorial (sorted by area), then requesting the first Tutorial whose area is greater than the area of the first, and so on. But this requires one request per distinct area, so is unlikely to be fast.
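A sketch of that on-the-fly walk, using the question's model:

def distinct_areas():
    areas = []
    tut = Tutorials.all().order('area').get()
    while tut is not None:
        areas.append(tut.area)
        # One request per distinct area: the first entity past the current value.
        tut = Tutorials.all().filter('area >', tut.area).order('area').get()
    return areas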
The DISTINCT keyword was introduced in SDK release 1.7.4.
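With that release this becomes a single projection query; a sketch with the db API (distinct queries require a projection):

areas = [t.area for t in db.Query(Tutorials, projection=('area',), distinct=True)]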
This has been asked before, and the conclusion was that using sets is fine.
