How to compare multiple dates on an NDB query? - python

I need to fetch objects in an NDB query that match a given start and end date, but I'm not able to run this traditionally simple query because NDB complains:
from google.appengine.ext import ndb
from datetime import datetime
from server.page.models import Post
now = datetime.now()
query = Post.query(
    Post.status == Post.STATUS_ACTIVE,
    Post.date_published_start <= now,
    Post.date_published_end >= now,
)
count = query.count()
Error:
BadRequestError: Only one inequality filter per query is supported.
Encountered both date_published_start and date_published_end
Are there any workarounds for this?

Dynamically obtaining a single result list that can be used directly for pagination without any further processing is not possible because of the single-inequality-filter-per-query limitation. Related GAE issue 4301.
As Jeff mentioned, filtering by one inequality (ideally the most restrictive one) followed by further dynamic processing of the results is always an option, inefficient as you noted, but unavoidable if you need total flexibility of the search.
You could improve the performance by using a projection query, reducing the amount of data transferred from the datastore to just the relevant properties.
You could also try to perform 2 keys-only queries, one for each inequality, then compute the intersection of the results - this could give you the pagination counts and list of entities (as keys) faster. Finally you'd get the entities for the current page by direct key lookups for the keys in the page list, ideally batched (using ndb.get_multi()).
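A rough sketch of that approach, reusing the Post model from the question (page_size and offset are placeholder pagination variables, and the intersection assumes the key sets fit in instance memory):
from datetime import datetime
from google.appengine.ext import ndb
from server.page.models import Post

now = datetime.now()

# Two keys-only queries, each with a single inequality filter.
started = Post.query(
    Post.status == Post.STATUS_ACTIVE,
    Post.date_published_start <= now,
).fetch(keys_only=True)
not_ended = Post.query(
    Post.status == Post.STATUS_ACTIVE,
    Post.date_published_end >= now,
).fetch(keys_only=True)

# Intersect the key sets; the size is the total count for pagination.
active_keys = list(set(started) & set(not_ended))
total = len(active_keys)

# Batched lookup of just the entities on the current page.
page_keys = active_keys[offset:offset + page_size]
posts = ndb.get_multi(page_keys)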
Depending on the intended use you might have other alternatives in some cases (additional work required, of course).
You could restrict the scope of the queries. Instead of querying all Post entities since the beginning of time, results from a certain year or month might suffice in some cases. You could then add year and/or month properties to Post and include them as equality filters in your queries, potentially reducing the number of results to process dynamically from thousands to, say, hundreds or fewer.
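A sketch of that idea, where year and month are assumed additions to the Post model:
class Post(ndb.Model):
    # ... existing properties from the question ...
    year = ndb.IntegerProperty()
    month = ndb.IntegerProperty()

# Both new properties are equality filters, so only one inequality remains.
query = Post.query(
    Post.status == Post.STATUS_ACTIVE,
    Post.year == now.year,
    Post.month == now.month,
    Post.date_published_start <= now,
)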
You could also avoid the queries altogether for typical, often-used cases. For example, if the intended use is to generate a few kinds of monthly reports, you could have Report entities containing lists of Post keys for each such report kind/month, which you'd update whenever a Post entity's relevant properties change. Instead of querying Post entities for a report you'd just use the already available list from the respective Report entity. You could also store/cache the actual report upon generation, for direct re-use (instead of re-generating it at every access).
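One possible shape for such an aggregate entity (model and property names are assumptions):
class Report(ndb.Model):
    report_kind = ndb.StringProperty()  # e.g. 'active-posts'
    month = ndb.StringProperty()        # e.g. '2016-04'
    post_keys = ndb.KeyProperty(kind='Post', repeated=True)

# Update post_keys whenever a Post's relevant properties change; generating
# the report is then just ndb.get_multi(report.post_keys), no query needed.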

Another workaround for querying with multiple filters and inequalities is to use the Search API.
https://cloud.google.com/appengine/training/fts_adv/lesson1#query_options
From the documentation:
For example, the query job tag:"very important" sent < 2011-02-28
finds documents with the term job in any field, and also contain the
phrase very important in a tag field, and a sent date prior to
February 28, 2011.
Just put your data from Datastore query into Search documents and run your query on these documents.
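A rough sketch of that idea (index name and field names are assumptions; note that Search API date fields are compared at day precision):
from google.appengine.api import search

index = search.Index(name='posts')

# Mirror the relevant Post fields into a search document.
index.put(search.Document(
    doc_id=str(post.key.id()),
    fields=[
        search.AtomField(name='status', value='ACTIVE'),
        search.DateField(name='date_published_start',
                         value=post.date_published_start.date()),
        search.DateField(name='date_published_end',
                         value=post.date_published_end.date()),
    ],
))

# Both date comparisons are allowed in a single search query.
today = datetime.now().date().isoformat()
results = index.search(
    'status: ACTIVE AND date_published_start <= %s AND date_published_end >= %s'
    % (today, today))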

Related

Django ORM Limit QuerySet By Using Start Limit

I am currently building a Django project which deals with a result set of about 20k rows.
All of this data is returned as JSON responses, and I am parsing it for use in a template.
Currently, I am using objects.all() from the Django ORM.
I would like to know if I can get the complete result set in parts.
Say, if the result is 10k rows, then split it into chunks of 2k rows each.
My approach would be to lazy-load the data, using a limit variable incremented by 2k at a time.
I would like to know whether this approach is feasible, or any help in this regard?
Yes, you can make use of .iterator(…) [Django-doc]. As the documentation says:
for obj in MyModel.objects.all().iterator(chunk_size=2000):
    # … do something with obj
    pass
This will fetch the records in chunks of 2,000 in this case. If you set the chunk size higher, it will fetch more records per query, but then you need more memory to hold all of those records at a given moment. Setting chunk_size lower will result in less memory usage, but more queries to the database.
You might however be interested in pagination [Django-doc] instead. In that case the request contains a page number, and you return only a limited number of records. This is often better, since not all clients necessarily need all the data; furthermore the client usually has to process the data itself, and if the chunks are too large, the client can get flooded as well.
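A minimal sketch of the pagination approach (view and template names are placeholders; Paginator.get_page() exists from Django 2.0 onward):
from django.core.paginator import Paginator
from django.shortcuts import render

def my_list_view(request):
    paginator = Paginator(MyModel.objects.all(), 2000)  # 2,000 rows per page
    page = paginator.get_page(request.GET.get('page', 1))
    return render(request, 'my_template.html', {'page': page})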

Firestore query takes a too long time to get the value of only one field

Hi, community.
I have a question/issue about firestore query from Firebase.
I have a collection of around 18000 documents. I would like to get the value of a single same field of some of these documents. I use the python firestore_v1 library from google-cloud-python client. So, for example with list_edges.length = 250:
[db_firestore.document(f"edges/{edge['id']}").get({"distance"}).to_dict()["distance"] for edge in list_edges]
it takes 30+ seconds to evaluate, while the equivalent collection on MongoDB takes no more than 3 seconds doing this, and that loads the whole object, not just one field:
list(db_mongo["edges"].find({"city_id": {"$eq": city_id}, "id": {"$in": [edge["id"] for edge in list_edges]}}))
Having said that, I thought the solution could be to separate the large collection by city_id, so I created a new collection and copied the corresponding documents into it; now the query looks like:
[db_firestore.document(f"edges/7/edges/{edge['id']}").get({"distance"}).to_dict()["distance"] for edge in list_edges]
where 7 is a city_id.
However, it takes the same time. So, maybe the issue is around the .get() method, but I could not find any optimized solution for my case.
Could you help me with this? Thanks!
EDITED
I got an answer from Firestore support. The problem is that I make 250 requests, calling .get() for each document separately. The idea is to get all the data I want in only one request, so I need to modify the query.
Let's assume I have the next DB:
An edges collection with multiple edge_id documents. For each new request, I use a newly generated list of edges I need to fetch.
In MongoDB, I can do it with the $in operator (having edge_id inside the document), but in Firestore the in operator only accepts up to 10 values.
So, I need to find out another way to do this.
Any ideas? Thanks!
Firebase recently added support for a limited in operation. See:
The blog post announcing the feature.
The documentation on in and array-contains-any queries.
From the latter:
cities_ref = db.collection(u'cities')
query = cities_ref.where(u'country', u'in', [u'USA', u'Japan'])
A few caveats though:
You can have at most 10 values in the in clause, and you can have only one in (or array-contains-any) clause per query.
I am not sure if you can use this operator to select by ID.
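If each document also stores its id as a regular field (as the MongoDB version suggests), one workaround is to batch the ids into groups of 10 and run one in query per batch; a rough sketch, where the 'id' field name is an assumption:
from google.cloud import firestore

db = firestore.Client()

def chunks(items, size=10):
    for i in range(0, len(items), size):
        yield items[i:i + size]

edge_ids = [edge['id'] for edge in list_edges]
distances = {}
# 25 queries for 250 ids instead of 250 separate .get() calls.
for batch in chunks(edge_ids):
    for snapshot in db.collection('edges').where('id', 'in', batch).stream():
        distances[snapshot.get('id')] = snapshot.get('distance')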

Google Cloud Datastore Indexes for count queries

Google Cloud Datastore mandates that composite indexes be built in order to query on multiple fields of one kind. Take the following queries for example:
class Greeting(ndb.Model):
    user = ndb.StringProperty()
    place = ndb.StringProperty()

# Query 1
Greeting.query(Greeting.user == 'yash@gmail.com', Greeting.place == 'London').fetch()
# Query 2
Greeting.query(Greeting.user == 'yash@gmail.com', Greeting.place == 'London').count()
I am using python with ndb to access cloud datastore. In the above example, Query 1 raises NeedIndexError if there is no composite index defined on user and place. But Query 2 works fine even if there is no index on user and place.
I would like to understand how cloud datastore fetches the count (Query 2) without the index when it mandates the index for fetching the list of entities (Query 1). I understand it stores Stats per kind per index which would result in quicker response for counts on existing indexes (Refer docs). But I'm unable to explain the above behaviour.
Note: There is no issue when querying on one property of a given kind, as Cloud Datastore has single-property indexes by default.
There is no clear and direct explanation of why this happens, but most likely it's because of how the improved query planner performs zigzag merge joins over the built-in single-property indexes.
You can read more about this here: https://cloud.google.com/appengine/articles/indexselection#Improved_Query_Planner
The likely reason count() works while fetch() does not is that with count() you don't need to keep a lot of results in memory.
So in the case of count() the work can easily be split into multiple chunks processed in parallel, with the partial counts simply summed into one result. You can't do this cheaply with cursors/recordsets.

Efficient way to use filter() twice in Django

I am relatively new to Django and Python, but I have not been able to quite figure this one out.
I essentially want to query the database using filter for a large group of users. Then I want to make a bunch of queries on just this subset of users. So I thought it would be most efficient to first query with my larger filter parameters, and then make my separate filter queries on that set. In code, it looks like this:
#Get the big groups of users, like all people with brown hair.
group_of_users = Data.objects.filter(......)
#Now get all the people with brown hair and blue eyes, and then all with green eyes, etc.
for each haircolor:
    subset_of_group = group_of_users.filter(....)
That is just pseudo-code by the way, I am not that inept. I thought this would be more efficient, but it seems that if I eliminate the first query and simply get the querysets in the for loop, it is much faster (I actually timed it).
I fear this is because when I filter first, and then filter each time in the for loop, it is actually doing both sets of filter queries on each for loop execution. So really, doing twice the amount of work I want. I thought with caching this would not matter, as the first filter results would be cached and it would still be faster, but again, I timed it with multiple tests and the single filter is faster. Any ideas?
EDIT:
So it seems that querying for a set of data, and then trying to further query only against that set of data, is not possible. Rather, I should query for a set of data and then further parse that data using regular Python.
As garnertb and lanzz said, it doesn't matter where you use the filter function; the only thing that matters is when you evaluate the query (see when querysets are evaluated). My guess is that in your tests you evaluate the queryset somewhere in your code, and that you do more evaluations in the test with separate filter calls.
Whenever a queryset is evaluated, its results are cached. However, this cache does not carry over if you use another method, such as filter or order_by, on the queryset. So you can't evaluate the bigger set and then filter that queryset to retrieve the smaller sets without doing another query.
If you only have a small set of hair colours, you can get away with doing a query for each colour. However, if you have many of them, the number of queries will have a severe impact on performance. In that case it might be better to do a query for the full set of users you want to use, and then do the subsequent processing in Python:
qs = Data.objects.filter(hair='brown')
objects = dict()
for obj in qs:
    objects.setdefault(obj.haircolour, []).append(obj)
for (k, v) in objects.items():
    print "Objects for colour '%s':" % k
    for obj in v:
        print "- %s" % obj
Filtering Django querysets does not perform any database operation, until you actually try to access the result. Filtering only adds conditions to the queryset, which are then used to build the final query when you access the result of the query.
When you assign group_of_users = Data.objects.filter(...), no data is retrieved from the database; you just get a queryset that knows you want records satisfying a specific condition (the filtering parameters you supplied to Data.objects.filter), but it does not pre-fetch those actual users. After that, when you assign subset_of_group = group_of_users.filter(....), you don't filter just that previous group of users, you only add more conditions to the queryset; still no data has been retrieved from the database at this point. Only when you actually try to access the results of the queryset (e.g. by iterating over it, slicing it, or accessing a single index in it) will the queryset build a (usually) single query that retrieves only the user records satisfying all the filtering conditions you have accumulated in your querysets up to that point. It will still need to filter your entire users table to find those matching users; it cannot take advantage of the "previously retrieved" users from the group_of_users = Data.objects.filter(...) queryset, because nothing was actually retrieved at that point.
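A small illustration of that laziness (field names are placeholders):
group_of_users = Data.objects.filter(hair='brown')    # no database query yet
subset_of_group = group_of_users.filter(eyes='blue')  # still no query, just more conditions
print(subset_of_group.query)   # the single combined SQL that will be run
users = list(subset_of_group)  # only now does one SELECT ... WHERE ... execute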
Your approach is exactly right and it is efficient. The Querysets don't touch the database until they are evaluated, so you can add as many filters as you like and the database won't be touched. Django's excellent documentation provides all the information you need to figure out what operations cause the Queryset to be evaluated.

Need a way to count entities in GAE datastore that meet a certain condition? (over 1000 entities)

I'm building an app on GAE that needs to report on events occurring. An event has a type and I also need to report by event type.
For example, say there is an event A, B and C. They occur periodically at random. User logs in and creates a set of entities to which those events can be attributed. When the user comes back to check the status, I need to be able to tell how many events of A, B and/or C occurred during a specific time range, say a day or a month.
The 1000 limit is throwing a wrench into how I would normally do it. I don't need to retrieve all of the entities and present them to the user, but I do need to show the total count for a specific date range. Any suggestions?
I'm a bit of python/GAE noob...
App Engine is not a relational database and you won't be able to quickly do counts on the fly like this. The best approach is to update the counts at write time, not generate them at read time.
When generating counts, there are two general approaches that scale well with App Engine to minimize write contention:
Store the count in Memcache or local memory and periodically flush. This is the simplest solution, but it can be volatile and data loss is probable.
Use a Sharded Counter. This approach is a bit more reliable but more complex. You won't be able to sort easily by count, but you could periodically flush the count to another indexed field and sort by that (see the sketch below).
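A minimal sketch of the sharded-counter approach with ndb (model and key naming are assumptions; add a date component to the shard key if you need per-day or per-month counts):
import random
from google.appengine.ext import ndb

NUM_SHARDS = 20  # more shards = more write throughput, slower reads

class EventCounterShard(ndb.Model):
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(event_type):
    shard_id = '%s-%d' % (event_type, random.randint(0, NUM_SHARDS - 1))
    shard = EventCounterShard.get_or_insert(shard_id)
    shard.count += 1
    shard.put()

def get_count(event_type):
    keys = [ndb.Key(EventCounterShard, '%s-%d' % (event_type, i))
            for i in range(NUM_SHARDS)]
    return sum(s.count for s in ndb.get_multi(keys) if s is not None)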
Results of datastore count() queries and offsets for all datastore queries are no longer capped at 1000. (Since Version 1.3.6)
My approach would be to have an aggregate model or models to keep track of event types, dates and counts. I'm not 100% sure how you should model this given your requirements, though.
Then, I'd fire off deferred tasks to asynchronously update the appropriate aggregate models whenever a user does something that triggers an event.
Nick Johnson's Background work with the deferred library article has much more information, and provides a framework that you might find useful for doing the kind of aggregation you're talking about.
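As a rough sketch of wiring that up with the deferred library (the aggregate-update function and its arguments are hypothetical):
import datetime
from google.appengine.ext import deferred

def update_event_aggregate(event_type, event_date):
    # Hypothetical: bump the count on an aggregate entity keyed by
    # event type + date, ideally inside a transaction.
    pass

# Wherever the event is recorded, queue the aggregation work asynchronously:
deferred.defer(update_event_aggregate, 'A', datetime.date.today())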
Would a solution using cursors (like the one below) work for you? I personally use this method to count the number of entries in a scenario similar to yours, and haven't seen yet any problems with it (although I run on a schedule, since constant querying of the data store is pretty taxing on the CPU quota).
def count(query):
    i = 0
    while True:
        result = query.fetch(1000)
        i = i + len(result)
        if len(result) < 1000:
            break
        cursor = query.cursor()
        query.with_cursor(cursor)
    return i
This post is quite old, but I would like to provide a useful reference. App Engine now offers a built-in API to access datastore statistics:
For Python,
from google.appengine.ext.db import stats
global_stat = stats.GlobalStat.all().get()
print 'Total bytes stored: %d' % global_stat.bytes
print 'Total entities stored: %d' % global_stat.count
For Java,
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Query;
// ...
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Entity globalStat = datastore.prepare(new Query("__Stat_Total__")).asSingleEntity();
Long totalBytes = (Long) globalStat.getProperty("bytes");
Long totalEntities = (Long) globalStat.getProperty("count");
It is also possible to get the entity count for a particular kind only. Take a look at these references:
https://developers.google.com/appengine/docs/python/datastore/stats
https://developers.google.com/appengine/docs/java/datastore/stats
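For example, a per-kind count can be read from KindStat (the 'Event' kind name here is just a placeholder):
from google.appengine.ext.db import stats

kind_stat = stats.KindStat.all().filter('kind_name =', 'Event').get()
if kind_stat:
    print 'Entities of kind Event: %d' % kind_stat.count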
This sounds very similar to a question that I posed on StackOverflow: How to get the distinct value of one of my models in Google App Engine. I needed to know how to get distinct values for the entities within one of my models, where there were going to be over 1000 entities for that model.
Hope that helps.
