What is the appropriate way to handle counts in app engine(ndb or db)?
I have two projects: one is a django-nonrel project and the other is a pure Django project, but both need the ability to take a query and get a count back. The results could be greater than 1,000.
I saw some posts that said I could use Sharded Counters but they are counting all entities. I need to be able to know how many entities have the following properties x=1,y=True,z=3
#Is this the appropriate way?
count = some_entity.gql(query_string).count(SOME_LARGE_NUMBER)
The datastore is not good at this sort of query, because of tradeoffs to make it distributed. These include fairly slow reads, and very limited indexing.
If there are a limited set of statistics you need (number of users, articles, etc) then you can keep running totals in a separate entity. This means you need to do two writes(puts) when something changes: one for the entity that changes, and one to update the stats entity. But you only need one read(get) to get your statistics, instead of however many entities they are distilled from.
You may be uncomfortable with this because it goes against what we all learned about normalisation, but it is far more efficient and in many cases works fine. You can always have a cron job periodically do your queries to check the statistics are accurate, if this is critical.
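As a minimal sketch of the running-totals idea (the model and property names here are made up for illustration, using the old db API since that is what the question uses), you keep one stats entity and touch it on every change:

from google.appengine.ext import db

class SiteStats(db.Model):
    # running totals you care about; add more properties as needed
    user_count = db.IntegerProperty(default=0)

def add_user(new_user):
    stats = SiteStats.get_or_insert('global')
    new_user.put()             # write #1: the entity that changed
    stats.user_count += 1
    stats.put()                # write #2: the stats entity

def get_user_count():
    # one get() instead of counting however many user entities exist
    return SiteStats.get_or_insert('global').user_count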
Since you are using db.Model, here is one way to count all your entities matching some filters when the result might exceed the old 1,000 hard limit (if that limit still applies):
FETCH_LIMIT = 1000

def count_model(x=1, y=True, z=3):
    # keys-only queries are cheaper than fetching full entities
    model_qry = MyModel.all(keys_only=True)
    model_qry.filter('x =', x)
    model_qry.filter('y =', y)
    model_qry.filter('z =', z)

    count = None
    total = 0
    cursor = None
    while count != 0:
        if cursor:
            # continue counting from where the previous batch stopped
            count = model_qry.with_cursor(cursor).count(limit=FETCH_LIMIT)
        else:
            count = model_qry.count(limit=FETCH_LIMIT)
        total += count
        cursor = model_qry.cursor()

    return total
If you're going to run the above inside a request then you might time out, so consider using Task Queues instead.
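For instance, a hedged sketch of pushing the count into a task with the deferred library (this assumes the deferred builtin is enabled in app.yaml, and store_count is a made-up helper):

from google.appengine.ext import deferred

def store_count():
    total = count_model(x=1, y=True, z=3)
    # persist `total` somewhere cheap to read, e.g. memcache or a stats entity

def my_handler():
    deferred.defer(store_count)   # runs later on the default task queue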
Also, as FoxyLad proposed, it is much better to keep running totals in a separate entity for performance reasons, and to have the above method as a cron job that runs on a regular basis to keep the stats in perfect sync.
Related
I'm working on something to clear my database of ~10,000 entities, and my plan is to put it in a task that deletes 200 at a time using ndb.delete_multi() and then recursively calls itself again until there are no entities left.
For now, I don't have the recursion in it yet so I could run the code a few times manually and check for errors, quota use, etc. The code is:
entities = MyModel.query_all(ndb.Key('MyModel', '*defaultMyModel')).fetch(200)
key_list = ndb.put_multi(entities)
ndb.delete_multi(key_list)
All the query_all() does is query MyModel and return everything.
I've done some testing by commenting out things and running the method, and it looks like the first two lines take up the expected amount of writes (~200).
Running the third line, ndb.delete_multi(), takes up about 8% of my 50,000 daily write allowance, so about 4000 writes--20 times as many as I think it should be doing.
I've also made sure the key_list contains only 200 keys with logging.
Any ideas on why this takes up so many writes? Am I using the method wrong? Or does it just use a ton of memory? In that case, is there any way for me to do this more efficiently?
Thanks.
When you delete an entity, the Datastore has to remove the entity itself plus an index record for each indexed property and for each custom (composite) index. The number of writes does not depend on which delete method you use.
Your code example is extremely inefficient. If you are deleting large numbers of entities then you will need to batch the code below, but the key point is that you should retrieve the data with a keys-only query and then delete:
from google.appengine.ext import ndb

ndb.delete_multi(
    MyModel.query().fetch(keys_only=True)
)
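A hedged sketch of what batching that might look like with fetch_page and a cursor (the batch size is arbitrary):

from google.appengine.ext import ndb

def delete_all_mymodel(batch_size=500):
    cursor = None
    while True:
        keys, cursor, more = MyModel.query().fetch_page(
            batch_size, keys_only=True, start_cursor=cursor)
        if keys:
            ndb.delete_multi(keys)
        if not more:
            break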
In regards to the number of write operations (see Andrei's answer), ensure that only the properties on your model that actually need to be indexed have indexing enabled.
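For example, with ndb you opt a property out of indexing like this (the property names are purely illustrative):

from google.appengine.ext import ndb

class MyModel(ndb.Model):
    title = ndb.StringProperty()                        # queried on, so leave indexed
    body = ndb.TextProperty()                           # TextProperty is never indexed
    internal_note = ndb.StringProperty(indexed=False)   # no index writes for this one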
I have to increment three different counters in a single transaction. Besides that, I have to manipulate three other entities as well. I get
too many entity groups in a single transaction
I've used the recipe from https://developers.google.com/appengine/articles/sharding_counters to implement my counters. I increment my counters inside some model (class) methods, depending on business logic.
As a workaround I implemented a deferred increment method that uses tasks to update the counter. But that doesn't scale well if the number of counters increases further, as there is a limit on the number of tasks in a single transaction as well (I think it's 5), and I guess it's not the most effective way.
I also found https://github.com/DocSavage/sharded_counter/blob/master/counter.py which seems to ensure updating the counter even in case of a db error through memcache. But I don't want to increment my counters if the transaction fails.
Another idea is to remember the counters I have to increment during a web request and to increment them in a single deferred task. I don't know how to implement this in a clean and thread-safe way without passing objects created in the request to the model methods. I think this code would be ugly and not in the same transaction:
def my_request_handler():
    counter_session = model.counter_session()
    model.mylogic(counter_session, other_params)
    counter_session.write()
Any experiences or ideas?
BTW: I'm using python, ndb and flask
It would be ok if the counter is not 100% accurate.
As said in Transactions and entity groups:
the simplest approach is to determine which entities you need to be able to process in the same transaction. Then, when you create those entities, place them in the same entity group by declaring them with a common ancestor. They will then all be in the same entity group and you will always be able to update and read them transactionally.
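As a rough sketch of that advice (ignoring sharding for the moment; the kind and helper names are made up), you could hang the counters you mutate together, and where possible the other entities too, off one common ancestor key so a single transaction can touch them all:

from google.appengine.ext import ndb

ROOT = ndb.Key('CounterRoot', 'request-counters')   # common ancestor

class SimpleCounter(ndb.Model):
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment_counters(names):
    # all counters share the same ancestor, hence one entity group
    for name in names:
        key = ndb.Key(SimpleCounter, name, parent=ROOT)
        counter = key.get() or SimpleCounter(key=key)
        counter.count += 1
        counter.put()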
For a model like:
class Thing(ndb.Model):
    visible = ndb.BooleanProperty()
    made_by = ndb.KeyProperty(kind=User)
    belongs_to = ndb.KeyProperty(kind=AnotherThing)
Essentially I am performing an 'or' query, but comparing different properties, so I can't use the built-in OR. I want to get all Thing entities (belonging to a particular AnotherThing) which either have visible set to True, or have visible set to False and made_by set to the current user.
Which would be less demanding on the datastore (ie financially cost less):
Query to get everything, ie: Thing.query(Thing.belongs_to == some_thing.key) and iterate through the results, storing the visible ones, and the ones that aren't visible but are made_by the current user?
Query to get the visible ones, ie: Thing.query(Thing.belongs_to == some_thing.key, Thing.visible == True), and query separately to get the non-visible ones by the current user, ie: Thing.query(Thing.belongs_to == some_thing.key, Thing.visible == False, Thing.made_by == current_user)?
Number 1. would get many unneeded results, like non-visible Things by other users - which I think is many reads of the datastore? 2. is two whole queries though, which is also possibly unnecessarily heavy, right? I'm still trying to work out what kinds of interaction with the database cause what kinds of costs.
I'm using ndb, tasklets and memcache where necessary, in case that's relevant.
Number two is going to cost less, for two reasons. First, you pay for each datastore read and for each entity returned by a query, so with the first option you pay to query and read all the data, while with the second you only pay for what you need.
Secondly, you also pay for frontend or backend instance time, and with the first method you spend time iterating through all the results, whereas the second method needs no such filtering pass.
I can't see a case where the first option is better (maybe if you only have a few entities?).
To understand how reads and queries cost you scroll down a little on:
https://developers.google.com/appengine/docs/billing
You will see how Read, Writes and Smalls are added up for reads, writes and queries.
I would also just query for the ones that are owned by the current user instead of visible=False and owner=current; this way you don't need a composite index, which will save some time. You can also make visible a partial index (only index it when True, assuming you never need to query for the False ones), thus saving some space as well. You will need to do a little work to remove duplicates, but that is probably not too bad.
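A hedged sketch of the two-query-plus-dedupe idea (current_user_key is assumed to be the ndb.Key of the signed-in user):

visible = Thing.query(Thing.belongs_to == some_thing.key,
                      Thing.visible == True).fetch()
mine = Thing.query(Thing.belongs_to == some_thing.key,
                   Thing.made_by == current_user_key).fetch()

# de-duplicate on the entity key, since a visible Thing made by the
# current user will appear in both result lists
things = list({t.key: t for t in visible + mine}.values())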
You are probably best benchmarking both cases using real-world data. It's hard to determine things like this in the abstract, as there are many subtleties that may affect overall performance.
I would expect option 2 to be better though. Loading tons of objects that you don't care about is simply going to put a heavy burden on the data store that I don't think an extra query would be comparable to. Of course, it depends on how many extra things, etc.
I have a ~10M record MySQL table that I interface with using SqlAlchemy. I have found that queries on large subsets of this table will consume too much memory even though I thought I was using a built-in generator that intelligently fetched bite-sized chunks of the dataset:
for thing in session.query(Things):
    analyze(thing)
To avoid this, I find I have to build my own iterator that bites off in chunks:
lastThingID = None
while True:
    q = query.order_by(Thing.id.desc())      # walk the ids downwards
    if lastThingID is not None:
        q = q.filter(Thing.id < lastThingID)
    things = q.limit(querySize).all()
    if not things:
        break
    for thing in things:
        lastThingID = thing.id
        analyze(thing)
Is this normal or is there something I'm missing regarding SA built-in generators?
The answer to this question seems to indicate that the memory consumption is not to be expected.
Most DBAPI implementations fully buffer rows as they are fetched - so usually, before the SQLAlchemy ORM even gets a hold of one result, the whole result set is in memory.
But then, the way Query works is that it fully loads the given result set by default before returning your objects to you. The rationale here concerns queries that are more than simple SELECT statements. For example, in joins to other tables that may return the same object identity multiple times in one result set (common with eager loading), the full set of rows needs to be in memory so that the correct results can be returned; otherwise collections and such might be only partially populated.
So Query offers an option to change this behavior through yield_per(). This call will cause the Query to yield rows in batches, where you give it the batch size. As the docs state, this is only appropriate if you aren't doing any kind of eager loading of collections so it's basically if you really know what you're doing. Also, if the underlying DBAPI pre-buffers rows, there will still be that memory overhead so the approach only scales slightly better than not using it.
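For completeness, a minimal sketch of what that call looks like (the batch size is arbitrary, and this assumes no eager-loaded collections, as noted above):

for thing in session.query(Things).yield_per(1000):
    analyze(thing)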
I hardly ever use yield_per(); instead, I use a better version of the LIMIT approach you suggest above using window functions. LIMIT and OFFSET have a huge problem that very large OFFSET values cause the query to get slower and slower, as an OFFSET of N causes it to page through N rows - it's like doing the same query fifty times instead of one, each time reading a larger and larger number of rows. With a window-function approach, I pre-fetch a set of "window" values that refer to chunks of the table I want to select. I then emit individual SELECT statements that each pull from one of those windows at a time.
The window function approach is on the wiki and I use it with great success.
Also note: not all databases support window functions; you need Postgresql, Oracle, or SQL Server. IMHO using at least Postgresql is definitely worth it - if you're using a relational database, you might as well use the best.
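To make the windowed idea concrete, here is a hedged, simplified sketch (not the actual wiki recipe; the helper name is made up): one cheap ids-only pass establishes window boundaries, then one bounded SELECT is emitted per window, so OFFSET is never used.

def windowed_query(session, model, id_col, window_size=1000):
    # cheap pass: primary keys only, in order, to define the windows
    ids = [pk for (pk,) in session.query(id_col).order_by(id_col)]
    for offset in range(0, len(ids), window_size):
        chunk = ids[offset:offset + window_size]
        lo, hi = chunk[0], chunk[-1]
        # one bounded SELECT per window
        for obj in (session.query(model)
                    .filter(id_col >= lo, id_col <= hi)
                    .order_by(id_col)):
            yield obj

# usage (hypothetical):
# for thing in windowed_query(session, Thing, Thing.id):
#     analyze(thing)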
I am not a database expert, but when using SQLAlchemy as a simple Python abstraction layer (ie, not using the ORM Query object) I've come up with a satisfying solution to query a 300M-row table without exploding memory usage...
Here is a dummy example:
from sqlalchemy import create_engine, select
conn = create_engine("DB URL...").connect()
q = select([huge_table])
proxy = conn.execution_options(stream_results=True).execute(q)
Then, I use the SQLAlchemy fetchmany() method to iterate over the results in an infinite while loop:
while 'batch not empty':  # equivalent of 'while True', but clearer
    batch = proxy.fetchmany(100000)  # 100,000 rows at a time
    if not batch:
        break

    for row in batch:
        pass  # ... do your stuff here ...

proxy.close()
This method allowed me to do all kinds of data aggregation without any dangerous memory overhead.
NOTE: stream_results works with Postgres and the psycopg2 adapter, but I guess it won't work with every DBAPI, nor with every database driver...
There is an interesting usecase in this blog post that inspired my above method.
I've been looking into efficient traversal/paging with SQLAlchemy and would like to update this answer.
I think you can use the slice call to properly limit the scope of a query and you could efficiently reuse it.
Example:
window_size = 10  # or whatever limit you like
window_idx = 0
while True:
    start, stop = window_size * window_idx, window_size * (window_idx + 1)
    things = query.slice(start, stop).all()
    if not things:
        break
    for thing in things:
        analyze(thing)
    if len(things) < window_size:
        break
    window_idx += 1
In the spirit of Joel's answer, I use the following:
WINDOW_SIZE = 1000

def qgen(query):
    start = 0
    while True:
        stop = start + WINDOW_SIZE
        things = query.slice(start, stop).all()
        if len(things) == 0:
            break
        for thing in things:
            yield thing
        start += WINDOW_SIZE
Using LIMIT/OFFSET is bad, because the database has to walk past all {OFFSET} rows first, so the larger the OFFSET, the slower the request gets.
Using a windowed query also gave me bad results on a large table with a large amount of data (you wait too long for the first results, which was not good in my case of a chunked web response).
The best approach is given here: https://stackoverflow.com/a/27169302/450103. In my case I resolved the problem simply by using an index on a datetime field and fetching the next query with datetime >= previous_datetime. Stupid, because I had used that index in different cases before, but thought that a windowed query would be better for fetching all the data. In my case I was wrong.
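A hedged sketch of that keyset approach (the model and column names are made up, and it assumes the timestamps are distinct enough to make progress; with many duplicates you would combine the timestamp with the primary key):

last_seen = None
batch_size = 1000
while True:
    q = session.query(Thing).order_by(Thing.created_at)
    if last_seen is not None:
        q = q.filter(Thing.created_at > last_seen)
    rows = q.limit(batch_size).all()
    if not rows:
        break
    for row in rows:
        analyze(row)
    last_seen = rows[-1].created_at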
AFAIK, the first variant still gets all the tuples from the table (with one SQL query) but builds the ORM representation for each entity while iterating. So it is more efficient than building a list of all entities before iterating, but you still have to fetch all the (raw) data into memory.
Thus, using LIMIT on huge tables sounds like a good idea to me.
If you're working with Postgres or an RDBMS that supports cursors, it is very simple to iterate efficiently over a lot of rows:
with db_session() as session:
    for partition in session.stream(select(Thing.id)).partitions(1000):
        for item in partition:
            analyze(item)
This creates a forward cursor that fetches the result in batches of 1000 rows, which results in minimal memory usage on the server and on the client.
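If you are on a regular synchronous Session with SQLAlchemy 1.4+, a roughly equivalent pattern (a hedged sketch, not the answerer's exact setup) uses the yield_per execution option together with Result.partitions():

from sqlalchemy import select
from sqlalchemy.orm import Session

with Session(engine) as session:  # `engine` is assumed to exist already
    result = session.execute(
        select(Thing.id).execution_options(yield_per=1000)
    )
    for partition in result.partitions():   # batches of 1000 rows
        for (thing_id,) in partition:
            analyze(thing_id)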
I am reading a "table" in Python in GAE that has 1000 rows and the program stops because the time limit is reached. (So it takes at least 20 seconds.)(
Is that possible that GAE is that slow? Is there a way to fix that?
Is this because I use free service and I do not pay for it?
Thank you.
The code itself is this:
liststocks = []
userall = user.all()              # this has three fields, username... trying to optimise by this line
stocknamesall = stocknames.all()  # 1 field, name of the stocks; trying to optimise by this line too
for u in userall:                 # userall has 1000 users
    for stockname in stocknamesall:   # 4 stocks
        astock = stocksowned()        # it is also a "table", no relevance I think
        astock.quantity = random.randint(1, 100)
        astock.nameid = u.key()
        astock.stockid = stockname.key()
        liststocks.append(astock)
GAE is slow when used inefficiently. Like any framework, sometimes you have to know a little bit about how it works in order to efficiently use it. Luckily, I think there is an easy improvement that will help your code a lot.
It is faster to use fetch() explicitly instead of using the iterator. The iterator causes entities to be fetched in "small batches" - each "small batch" results in a round-trip to the datastore to get more data. If you use fetch(), then you'll get all the data at once with just one round-trip to the datastore. In short, use fetch() if you know you are going to need lots of results.
In this case, using fetch() will help a lot - you can easily get all your users and stocknames in one round-trip to the datastore each. Right now you're making lots of extra round-trips to the datastore and re-fetching stockname entities too!
Try this (you said your table has 1000 rows, so I use fetch(1000) to make sure you get all the results; use a larger number if needed):
userall=user.all().fetch(1000)
stocknamesall=stocknames.all().fetch(1000)
# rest of the code as-is
To see where you could make additional improvements, please try out AppStats so you can see exactly why your request is taking so long. You might even consider posting a screenshot (like this) of the appstats info about your request along with your post.