According to the documentation, when using ancestor queries the following limitation will be enforced:
Ancestor queries allow you to make strongly consistent queries to the
datastore, however entities with the same ancestor are limited to 1
write per second.
from google.appengine.ext import ndb

class Customer(ndb.Model):
    name = ndb.StringProperty()

class Purchase(ndb.Model):
    price = ndb.IntegerProperty()

purchase1 = Purchase(parent=customer_entity.key)
purchase2 = Purchase(parent=customer_entity.key)
purchase3 = Purchase(parent=customer_entity.key)

purchase1.put()
purchase2.put()
purchase3.put()
Taking the same example, if I were to write three purchases at the same time, would I get an exception, since the writes are less than a second apart?
Here you can find two excellent videos about the datastore, strong consistency and entity groups: Datastore Introduction and Datastore Query, Index and Transaction.
About your example: you can use put_multi(), which "counts" as a single entity group write.
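As a rough sketch of what that could look like (assuming the Purchase entities are created with the customer as their parent, and with illustrative price values):

from google.appengine.ext import ndb

# All three purchases live in the customer's entity group (same parent)
# and are written in one batched call instead of three separate put()s.
purchases = [
    Purchase(parent=customer_entity.key, price=10),
    Purchase(parent=customer_entity.key, price=20),
    Purchase(parent=customer_entity.key, price=30),
]
ndb.put_multi(purchases)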
Google Cloud Datastore mandates that composite indexes be built in order to query on multiple fields of one kind. Take the following queries for example:
class Greeting(ndb.Model):
    user = ndb.StringProperty()
    place = ndb.StringProperty()

# Query 1
Greeting.query(Greeting.user == 'yash@gmail.com', Greeting.place == 'London').fetch()

# Query 2
Greeting.query(Greeting.user == 'yash@gmail.com', Greeting.place == 'London').count()
I am using python with ndb to access cloud datastore. In the above example, Query 1 raises NeedIndexError if there is no composite index defined on user and place. But Query 2 works fine even if there is no index on user and place.
I would like to understand how Cloud Datastore fetches the count (Query 2) without the index when it mandates the index for fetching the list of entities (Query 1). I understand it stores stats per kind per index, which would result in quicker responses for counts on existing indexes (refer to the docs). But I'm unable to explain the above behaviour.
Note: there is no issue when querying on a single property of a given kind, as Cloud Datastore builds single-property indexes by default.
There is no clear and direct explanation of why this happens, but most likely it's because of how the improved query planner works with zigzag merge joins over the built-in single-property indexes.
You can read more about this here: https://cloud.google.com/appengine/articles/indexselection#Improved_Query_Planner
The likely reason count() works while fetch() does not is that with count() you don't need to keep a lot of results in memory.
So in the case of count() the datastore can easily scale by splitting the work into multiple chunks processed in parallel and then just summing the per-chunk counts into one. You can't do this cheaply with cursors/result sets.
I need to fetch objects with an NDB query that match a given start and end date, but I'm not able to run this traditionally simple query because NDB complains:
from google.appengine.ext import ndb
from datetime import datetime

from server.page.models import Post

now = datetime.now()

query = Post.query(
    Post.status == Post.STATUS_ACTIVE,
    Post.date_published_start <= now,
    Post.date_published_end >= now,
)

count = query.count()
Error:
BadRequestError: Only one inequality filter per query is supported.
Encountered both date_published_start and date_published_end
Are there any workarounds for this?
Dynamically obtaining a single result list that can be directly used for pagination without any further processing is not possible due to the single-inequality-filter-per-query limitation. See the related GAE issue 4301.
As Jeff mentioned, filtering by one inequality (ideally the most restrictive one) followed by further dynamic processing of the results is always an option, inefficient as you noted, but unavoidable if you need total flexibility of the search.
You could improve the performance by using a projection query, reducing the amount of data transferred from the datastore to just the relevant properties.
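For example (a minimal sketch, assuming the listing only needs a hypothetical title property plus the start date):

# Keys and the projected properties are served from the index, so much
# less data is transferred than with a full fetch().
qry = Post.query(Post.status == Post.STATUS_ACTIVE,
                 Post.date_published_start <= now)
results = qry.fetch(projection=[Post.title, Post.date_published_start])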
You could also try to perform 2 keys-only queries, one for each inequality, then compute the intersection of the results - this could give you the pagination counts and list of entities (as keys) faster. Finally you'd get the entities for the current page by direct key lookups for the keys in the page list, ideally batched (using ndb.get_multi()).
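A hedged sketch of that approach (offset and page_size are illustrative pagination parameters):

# One keys-only query per inequality, then intersect the two key sets.
started = Post.query(Post.status == Post.STATUS_ACTIVE,
                     Post.date_published_start <= now).fetch(keys_only=True)
not_ended = Post.query(Post.date_published_end >= now).fetch(keys_only=True)

matching_keys = list(set(started) & set(not_ended))
total = len(matching_keys)                        # pagination count
page_keys = matching_keys[offset:offset + page_size]
posts = ndb.get_multi(page_keys)                  # batched lookup for the current page

Note that the intersection loses any ordering, so you'd sort the resulting entities yourself if the page needs to be ordered.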
Depending on the intended use you might have other alternatives in some cases (additional work required, of course).
You could restrict the scope of the queries. Instead of querying all Post entities since the beginning of time, maybe just results in a certain year or month would suffice in certain cases. Then you could add year and/or month Post properties which you can include as equality filters in your queries, potentially reducing the number of results to process dynamically from thousands to, say, hundreds or less.
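For instance (a sketch, assuming you can add denormalized year and month properties to Post and keep them in sync when writing entities):

class Post(ndb.Model):
    # ... existing properties ...
    year = ndb.IntegerProperty()
    month = ndb.IntegerProperty()

# year/month act as extra equality filters, shrinking the result set
# before the single allowed inequality filter is applied.
qry = Post.query(Post.status == Post.STATUS_ACTIVE,
                 Post.year == 2016,
                 Post.month == 11,
                 Post.date_published_start <= now)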
You could also avoid the queries altogether for typical, often-used cases. For example, if the intended use is to generate a few kinds of monthly reports, you could have some Report entities containing lists of Post keys for each such report kind/month, which you would update whenever a Post entity's relevant properties change. Instead of querying Post entities for a report you'd instead just use the already available lists from the respective Report entity. You could also store/cache the actual report upon generation, for direct re-use (instead of re-generating it at every access).
Another workaround for querying with multiple filters and inequalities is to use the Search API.
https://cloud.google.com/appengine/training/fts_adv/lesson1#query_options
From the documentation:
For example, the query job tag:"very important" sent < 2011-02-28
finds documents with the term job in any field, and also contain the
phrase very important in a tag field, and a sent date prior to
February 28, 2011.
Just put your data from Datastore query into Search documents and run your query on these documents.
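A minimal sketch of that idea (the index name and field names are illustrative, and post stands for an existing Post entity):

from google.appengine.api import search

index = search.Index(name='posts')

# Mirror the queryable Post properties into a Search document.
# DateField works with day precision, so datetime values are truncated to dates.
index.put(search.Document(
    doc_id=post.key.urlsafe(),
    fields=[
        search.AtomField(name='status', value='active'),
        search.DateField(name='published_start', value=post.date_published_start.date()),
        search.DateField(name='published_end', value=post.date_published_end.date()),
    ]))

# Both date bounds can live in a single query string.
results = index.search('status: active AND published_start <= 2016-11-30 '
                       'AND published_end >= 2016-11-30')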
I am storing a key of an entity as a property of another in order to relate them. We are in a refactor stage at this point in the project so I was thinking about introducing ancestors.
Is there a performance difference between the two approaches? Any given advantages that I might gain if we introduce ancestors?
class Book(ndb.Model):
    ...

class Article(ndb.Model):
    book_key = ndb.KeyProperty(kind=Book, required=True)

book_key = ndb.Key("Book", 12345)

# 1st: ancestor query approach
qry = Article.query(ancestor=book_key)

# 2nd: simple key property query approach
qry = Article.query(Article.book_key == book_key)
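Note that the ancestor query only returns articles that were created in the book's entity group, i.e. with the book key as the parent of their own key, for example:

# The article's key gets the book key as its parent; book_key is still
# set here because the model above marks it as required.
article = Article(parent=book_key, book_key=book_key)
article.put()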
The ancestor query will always be fully consistent. Querying by book_key, on the other hand, will not necessarily be consistent: you may find that recent changes will not be shown in that query.
On the other hand, introducing an ancestor imposes a limit on the number of updates: you can only do one update per second to any entity group (i.e. the ancestor and its children).
It's a trade-off for you as to which one is more important in your app.
From my understanding, @db.transactional(xg=True) allows for transactions across entity groups; however, the following code returns "queries inside transactions must have ancestors".
@db.transactional(xg=True)
def insertUserID(self, userName):
    user = User.gql("WHERE userName = :1", userName).get()
    highestUser = User.all().order('-userID').get()
    nextUserID = highestUser.userID + 1
    user.userID = nextUserID
    user.put()
Do you need to pass in the key for each entity despite it being a cross-group transaction? Can you please help modify this example accordingly?
An XG transaction can be applied across at most 25 entity groups. An ancestor query limits the query to a single entity group, and you would be able to run queries within those 25 entity groups in a single XG transaction.
A transactional query without an ancestor would potentially include every entity group in the application and lock everything up, so you get an error message instead.
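A rough sketch of what that looks like with the db API (the parent keys and the specific updates are illustrative):

@db.transactional(xg=True)
def swap_user_ids(parent_key_a, parent_key_b):
    # Each query is pinned to a single entity group via ancestor(), so the
    # transaction only touches two groups in total (the limit is 25).
    user_a = User.all().ancestor(parent_key_a).filter('userName =', 'alice').get()
    user_b = User.all().ancestor(parent_key_b).filter('userName =', 'bob').get()
    user_a.userID, user_b.userID = user_b.userID, user_a.userID
    user_a.put()
    user_b.put()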
In App Engine one usually tries to avoid monotonically increasing ids; the auto-assigned ones might go like 101, 10001, 10002 and so on. If you know that you need monotonically increasing ids and that it will work for you performance-wise, how about:
1. Have some kind of model representation of userId to enable key_name usage and direct lookup.
2. Query for userId outside the transaction and get the highest candidate id.
3. In the transaction do a get_or_insert-style lookup: UserId.get_by_key_name(candidateid + 1). If it is already present and pointing to a different user, try again with +2 and so on until you find a free one and create it, updating the userid attribute of the user at the same time.
If the XG transaction of updating UserId + User is too slow, perhaps create UserId + a task in a (non-XG) transaction, and let the executing task associate UserId and User afterwards. Or use a single backend that can serialize UserId creation, perhaps allowing put_async if you retry to avoid holes in the sequence, and do something like 50 creations per second.
If it's possible to use userName as the key_name you can do a direct lookup instead of a query and make things faster and cheaper.
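A hedged sketch of steps 1-3 above, using the db API from the question (the UserId model and its property names are illustrative):

from google.appengine.ext import db

class UserId(db.Model):
    # key_name is the stringified numeric id; records which user claimed it
    user = db.ReferenceProperty(User)

@db.transactional(xg=True)
def claim_id(user, candidate):
    # Gets by key are allowed inside transactions, unlike ancestor-less queries.
    if UserId.get_by_key_name(str(candidate)) is not None:
        return False                      # id already taken, caller retries with +1
    UserId(key_name=str(candidate), user=user).put()
    user.userID = candidate
    user.put()
    return True

def allocate_user_id(user):
    # Query for the highest assigned id outside the transaction.
    highest = User.all().order('-userID').get()
    candidate = (highest.userID if highest and highest.userID else 0) + 1
    while not claim_id(user, candidate):
        candidate += 1
    return candidate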
Cross group transactions allow you to perform a transaction across multiple groups, but they don't remove the prohibition on queries inside transactions. You need to perform the query outside the transaction, and pass the ID of the entity in (and then check any invariants specified in the query still hold) - or, as Shay suggests, use IDs so you don't have to do a query in the first place.
Every datastore entity has a key; a key (among other things) has a numeric id that App Engine assigns to it, or a key_name which you can give it.
In your case it looks like you can use the numeric id: after you call put() on the user entity you will have user.key().id() (or user.key.id() if you're using NDB), which will be unique for each user (as long as all the users have the same parent, which is None in your code).
This id is not sequential but is guaranteed to be unique.
Say I have a User model and an Article model, where user and article are a one-to-many relation, so I can access articles like this:
user = session.query(User).filter_by(id=1).one()
print user.articles
But this will list all of the user's articles. What if I want to limit the articles to 10? In Rails there is an all() method which can take limit / offset; in SQLAlchemy there is also an all() method, but it takes no params. How do I achieve this?
Edit:
it seems user.articles[10:20] is valid, but the SQL didn't use 10 / 20 in the query, so in fact it will load all matched data and slice it in Python?
The solution is to use a dynamic relationship as described in the collection configuration techniques section of the SQLAlchemy documentation.
By specifying the relationship as
class User(...):
    # ...
    articles = relationship('Articles', order_by='desc(Articles.date)', lazy='dynamic')
you can then write user.articles.limit(10) which will generate and execute a query to fetch the last ten articles by the user. Or you can use the [x:y] syntax if you prefer which will automatically generate a LIMIT clause.
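For instance (a small usage sketch):

# The dynamic relationship returns a Query, so LIMIT/OFFSET are rendered
# into the SQL instead of loading everything and slicing in Python.
latest = user.articles.limit(10).all()
page = user.articles[10:20]   # emits LIMIT 10 OFFSET 10 in the generated SQL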
Performance should be reasonable unless you want to query the past ten articles for 100 or so users (in which instance at least 101 queries will be sent to the server).