I am reading a "table" in Python in GAE that has 1000 rows and the program stops because the time limit is reached. (So it takes at least 20 seconds.)(
Is that possible that GAE is that slow? Is there a way to fix that?
Is this because I use free service and I do not pay for it?
Thank you.
The code itself is this:
liststocks=[]
userall=user.all() # this has three fields username... trying to optimise by this line
stocknamesall=stocknames.all() # 1 field, name of the stocks trying to optimise by this line too
for u in userall: # userall has 1000 users
for stockname in stocknamesall: # 4 stocks
astock= stocksowned() #it is also a "table", no relevance I think
astock.quantity = random.randint(1,100)
astock.nameid = u.key()
astock.stockid = stockname.key()
liststocks.append(astock);
GAE is slow when used inefficiently. Like any framework, sometimes you have to know a little bit about how it works in order to efficiently use it. Luckily, I think there is an easy improvement that will help your code a lot.
It is faster to use fetch() explicitly instead of using the iterator. The iterator causes entities to be fetched in "small batches" - each "small batch" results in a round-trip to the datastore to get more data. If you use fetch(), then you'll get all the data at once with just one round-trip to the datastore. In short, use fetch() if you know you are going to need lots of results.
In this case, using fetch() will help a lot - you can easily get all your users and stocknames in one round-trip to the datastore each. Right now you're making lots of extra round-trips to the datastore and re-fetching stockname entities too!
Try this (you said your table has 1000 rows, so I use fetch(1000) to make sure you get all the results; use a larger number if needed):
userall=user.all().fetch(1000)
stocknamesall=stocknames.all().fetch(1000)
# rest of the code as-is
To see where you could make additional improvements, please try out AppStats so you can see exactly why your request is taking so long. You might even consider posting a screenshot (like this) of the appstats info about your request along with your post.
Related
I'm working on something to clear my database of ~10,000 entities, and my plan is to put it in a task that deletes 200 at a time using ndb.delete_multi() and then recursively calls itself again until there are no entities left.
For now, I don't have the recursion in it yet so I could run the code a few times manually and check for errors, quota use, etc. The code is:
entities = MyModel.query_all(ndb.Key('MyModel', '*defaultMyModel')).fetch(200)
key_list = ndb.put_multi(entities)
ndb.delete_multi(key_list)
All the query_all() does is query MyModel and return everything.
I've done some testing by commenting out things and running the method, and it looks like the first two lines take up the expected amount of writes (~200).
Running the third line, ndb.delete_multi(), takes up about 8% of my 50,000 daily write allowance, so about 4000 writes--20 times as many as I think it should be doing.
I've also made sure the key_list contains only 200 keys with logging.
Any ideas on why this takes up so many writes? Am I using the method wrong? Or does it just use a ton of memory? In that case, is there any way for me to do this more efficiently?
Thanks.
When you delete an entity, the Datastore has to remove an entity and a record from an index for each indexed property as well as for each custom index. The number of writes is not dependent on which delete method you use.
Your code example is extremely inefficient. If you are deleting large numbers of entities than you will need to batch the below but, you should be retrieving data with a keys_only query and then deleting:
from google.appengine.ext import ndb
ndb.delete_multi(
MyModel.query().fetch(keys_only=True)
)
In regards to the number of write operations (see Andrei's answer), ensure only the fields on your model that are required to be indexed "have an index enabled".
Using Google App Engine Python 2.7 Query Class -
I need to produce a list of results that I pass to my django template. There are two ways I've found to do this.
Use fetch, however in the docs it says that fetch should almost never be used. https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_fetch
Use run() and then wrap it into list() thereby creating the list object.
Is one preferable to the other in terms of memory usage? Is there another way I could be doing this?
The key here is why fetch “should almost never be used”. The documentation says that fetch will get all the results, therefore having to keep all of them in memory at the same time. If the data you get is big, you will need lots of memory.
You say you can wrap run inside list. Sure, you can do that, but you will hit exactly the same problem—list will force all the elements into memory. So, this solution is actually discouraged on the same basis as using fetch.
Now, you could say: so what should I do? The answer is: in most cases you can deal with elements of your data one by one, without keeping them all in memory at the same time. For example, if all you need is to put the result data into a django template, and you know that it will be used at most once in your template, then the django template will happily take any iterator—so you can pass the run call result directly without wrapping it into list.
Similarly, if you need to do some processing, for example go over the results to find the element with the highest price or ranking, or whatever, you can just iterate over the result of run.
But if your usage requires having all the elements in memory (e.g.: your django template uses the data from the query several times), then you have a case where fetch or list(run(…)) actually has sense. In the end—this is just the typical trade-off: if you need for your application to apply an algorithm which requires all the data in memory, you need to pay for it by using up memory. So, you can either redesign your algorithms and usage to work with an iterator, or use fetch and pay for it by longer processing times and higher memory usage. Google of course encourages you to do the first thing. And this is what “should almost never be used” actually means.
I am writing a reusable django application for returning json result for jquery ui autocomplete.
Currently i am storing the Class/function for getting the result in a dictionary with a unique key for each class/function.
When a request comes then I selects the corresponding class/function from the dict and returns the output.
My query is whether is the best practice to do the above or are there some other tricks to obtains the same result.
Sample GIST : https://gist.github.com/ajumell/5483685
You seem to be talking about a form of memoization.
This is OK, as long as you don't rely on that result being in the dictionary. This is because the memory will be local to each process, and you can't guarantee subsequent requests being handled by the same process. But if you have a fallback where you generate the result, this is a perfectly good optimization.
That's a very general question. It primary depends on the infrastructure of your code. The way your class and models are defined and the dynamics of the application.
Second, is important to have into account the resources of the server where your application is running. How much memory do you have available, and how much disk space so you can take into account what would be better for the application.
Last but not least, it's important to take into account how much operations does it need to put all these resources in memory. Memory is volatile, so if your application restarts you'll have to instantiate all the classes again and maybe this is to much work.
Resuming, as an optimization is very good choice to keep in memory objects that are queried often (that's what cache is all about) but you have to take into account all of the previous stuff.
Storing a series of functions in a dictionary and conditionally selecting one based on the request is a perfectly acceptable way to handle it.
If you would like a more specific answer it would be very helpful to post your actual code. And secondly, this might be better suited to codereview.stackexchange
For a model like:
class Thing(ndb.Model):
visible = ndb.BooleanProperty()
made_by = ndb.KeyProperty(kind=User)
belongs_to = ndb.KeyProperty(kind=AnotherThing)
Essentially performing an 'or' query, but comparing different properties so I can't use a built in OR... I want to get all Thing (belonging to a particular AnotherThing) which either have visible set to True or visible is False and made_by is the current user.
Which would be less demanding on the datastore (ie financially cost less):
Query to get everything, ie: Thing.query(Thing.belongs_to == some_thing.key) and iterate through the results, storing the visible ones, and the ones that aren't visible but are made_by the current user?
Query to get the visible ones, ie: Thing.query(Thing.belongs_to == some_thing.key, Thing.visible == "True") and query separately to get the non-visible ones by the current user, ie: Thing.query(Thing.belongs_to == some_thing.key, Thing.visible == "False", Thing.made_by = current_user)?
Number 1. would get many unneeded results, like non-visible Things by other users - which I think is many reads of the datastore? 2. is two whole queries though, which is also possibly unnecessarily heavy, right? I'm still trying to work out what kinds of interaction with the database cause what kinds of costs.
I'm using ndb, tasklets and memcache where necessary, in case that's relevant.
Number two is going to be financially less for two reasons. First you pay for each read of the data store and for each returned entity in a query, therefore you will be paying more for the first one which you have to Read all data and query all data. The second way you only pay for what you need.
Secondly you also pay for backend or frontend time, and you will be using time to iterate through all your results in the first method, where as you need to spend no time for the second method.
I can't see a way where the first option is better. (maybe if you only have a few entities??)
To understand how reads and queries cost you scroll down a little on:
https://developers.google.com/appengine/docs/billing
You will see how Read, Writes and Smalls are added up for reads, writes and queries.
I would also just query for ones that are owned by the current user instead of visible=false and owner=current, this way you don't need a composite index which will save some time. You can also make visible a partial index this was saving some space as well (only index it when true, assuming you never need to query for false ones). You will need to do a litte work to remove duplicates, but that is probably not to bad.
You are probably best benchmarking both cases using real-world data. It's hard to determine things like this in the abstract, as there are many subtleties that may affect overall performance.
I would expect option 2 to be better though. Loading tons of objects that you don't care about is simply going to put a heavy burden on the data store that I don't think an extra query would be comparable to. Of course, it depends on how many extra things, etc.
Say I have a query that will be executed often, most likely yielding the same results.
Is it correct that using:
for key in qry.iter(keys_only=True):
item = key.get()
#do something with item
Would perform better than:
for item in qry:
#do something with item
Because in the first example, the query will only load the keys and subsequent calls to key.get() will take advantage of NDB's caching mechanism, whereas example 2 will always fetch the entities from the store? Or have I misunderstood something?
I would doubt that the second form would perform better -- it is always possible that the values are not in the cache, and then, presuming you are getting more than one entity back, you'd be making multiple roundtrips. That quickly gets slower.
A better approach is indeed what's shown in http://code.google.com/p/appengine-ndb-experiment/issues/detail?id=118 -- use ndb.multi_get(q.fetch(keys_only=True)). But even that is worse if your cache hit rate is too low; this is extensively discussed in the issue.
AFAIK It will not make any different, because internally, ndb caches everything, including query. If you are going to do other stuff with each one, try async api. that can save valuable time. edit : moreover, if ndb knows query in advance, it can even prefetch them.
I have read this six months back so not sure what is current behavior.