I'm working on something to clear my database of ~10,000 entities, and my plan is to put it in a task that deletes 200 at a time using ndb.delete_multi() and then recursively calls itself until there are no entities left.
For now I haven't added the recursion yet, so that I can run the code a few times manually and check for errors, quota use, etc. The code is:
entities = MyModel.query_all(ndb.Key('MyModel', '*defaultMyModel')).fetch(200)
key_list = ndb.put_multi(entities)
ndb.delete_multi(key_list)
All the query_all() does is query MyModel and return everything.
I've done some testing by commenting things out and running the method, and it looks like the first two lines use the expected number of writes (~200).
Running the third line, ndb.delete_multi(), uses about 8% of my 50,000 daily write allowance, so about 4,000 writes, 20 times as many as I think it should be doing.
I've also made sure the key_list contains only 200 keys with logging.
Any ideas on why this takes up so many writes? Am I using the method wrong? Or does it just use a ton of memory? In that case, is there any way for me to do this more efficiently?
Thanks.
When you delete an entity, the Datastore has to remove the entity itself plus an index record for each indexed property, as well as an entry in each custom (composite) index. The number of writes does not depend on which delete method you use.
Your code example is extremely inefficient. If you are deleting large numbers of entities you will need to batch the code below, but the important part is that you should be retrieving the data with a keys-only query and then deleting:
from google.appengine.ext import ndb

ndb.delete_multi(
    MyModel.query().fetch(keys_only=True)
)
Regarding the number of write operations (see Andrei's answer), make sure that only the properties on your model that you actually need to query on have an index enabled.
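A minimal sketch of the batched version, assuming the MyModel kind from the question and the 200-entity batch size you mentioned; properties you never filter or sort on can additionally be declared with indexed=False to cut the per-entity write cost:

from google.appengine.ext import ndb

def delete_all_mymodel(batch_size=200):
    # Repeatedly fetch one page of keys (no entity data) and delete it,
    # until the query comes back empty.
    while True:
        keys = MyModel.query().fetch(batch_size, keys_only=True)
        if not keys:
            break
        ndb.delete_multi(keys)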
I'm looking for ideas on how to improve a report that takes up to 30 minutes to process on the server. I'm currently working with Django and MySQL, but if there is a solution that requires changing the language or the SQL database, I'm open to it.
The report I'm talking about reads multiple Excel files and inserts all the rows from those files into a table (the report table), ending up with somewhere between 12K and 15K records; the table has around 50 columns. This part doesn't take that much time.
Once I have all the records on the report table I start applying multiple phases of business logic so I end having something like this:
def create_report():
    business_logic_1()
    business_logic_2()
    business_logic_3()
    business_logic_4()
Each business_logic_X function does something very similar: it starts with a ReportModel.objects.all() and then applies multiple calculations, like checking dates and quantities, and updates the records. Since it's a 12K-record table, this quickly adds time to the complete report.
The reason I run these as separate functions rather than doing all the processing in one pass is that the logic from the first function needs to be completed before the logic in the next functions works (e.g. the first function finds all related records and applies the same status to all of them).
The first thing I know could be optimized is somehow caching the result of objects.all() instead of calling it in each function, but I'm not sure how to pass it on to the next function without saving the records first.
I already optimized the report a bit by using update_fields in the save() calls inside the functions, and that saved a bit of time.
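Roughly what I have in mind is something like this sketch (compute_status is just a placeholder for the real checks):

def create_report():
    # Evaluate the queryset once and hand the same objects to every phase,
    # instead of each phase calling ReportModel.objects.all() again.
    records = list(ReportModel.objects.all())
    business_logic_1(records)
    business_logic_2(records)
    business_logic_3(records)
    business_logic_4(records)

def business_logic_1(records):
    for record in records:
        record.status = compute_status(record)   # placeholder for the real logic
        record.save(update_fields=['status'])    # what I currently do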
My question is, is there a better approach to this kind of problem? Is Django/MySQL the right stack for this?
What takes time is the business logic you're running in Django; it causes several round trips between the database and the application.
It sounds like there are several tables involved, so I would suggest that you write your query in raw SQL and only bring the results into the application once you have them, if you need them there at all.
The ORM has a raw() method that you can use, or you could drop down to an even lower level and interface with your database directly.
Without seeing more of what you do, I can't give more specific advice.
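As a rough sketch of what that can look like (the SQL, table and column names are made up; adapt them to your schema):

from django.db import connection

from myapp.models import ReportModel  # assumed app/model path

# Option A: raw() returns model instances from a hand-written SELECT.
rows = ReportModel.objects.raw(
    "SELECT id, quantity, status FROM myapp_reportmodel WHERE quantity > %s",
    [0],
)

# Option B: drop to a cursor and do the per-row work entirely in SQL.
with connection.cursor() as cursor:
    cursor.execute(
        "UPDATE myapp_reportmodel SET status = 'done' WHERE quantity = 0"
    )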
Using Google App Engine Python 2.7 Query Class -
I need to produce a list of results that I pass to my django template. There are two ways I've found to do this.
Use fetch(). However, the docs say that fetch should almost never be used. https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_fetch
Use run() and then wrap it in list(), thereby creating a list object.
Is one preferable to the other in terms of memory usage? Is there another way I could be doing this?
The key here is why fetch “should almost never be used”. The documentation says that fetch will get all the results, therefore having to keep all of them in memory at the same time. If the data you get is big, you will need lots of memory.
You say you can wrap run inside list. Sure, you can do that, but you will hit exactly the same problem—list will force all the elements into memory. So, this solution is actually discouraged on the same basis as using fetch.
Now, you could say: so what should I do? The answer is: in most cases you can deal with elements of your data one by one, without keeping them all in memory at the same time. For example, if all you need is to put the result data into a django template, and you know that it will be used at most once in your template, then the django template will happily take any iterator—so you can pass the run call result directly without wrapping it into list.
Similarly, if you need to do some processing, for example go over the results to find the element with the highest price or ranking, or whatever, you can just iterate over the result of run.
But if your usage requires having all the elements in memory (e.g. your django template uses the data from the query several times), then you have a case where fetch or list(run(…)) actually makes sense. In the end, this is just the typical trade-off: if your application needs an algorithm that requires all the data in memory, you have to pay for it with memory. So, you can either redesign your algorithms and usage to work with an iterator, or use fetch and pay for it with longer processing times and higher memory usage. Google of course encourages you to do the first thing. And this is what "should almost never be used" actually means.
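A small sketch of the two approaches with the old db Query class (the Item model and the context dictionaries are assumptions, not from the question):

from google.appengine.ext import db

class Item(db.Model):  # hypothetical model, just for illustration
    name = db.StringProperty()

query = Item.all()

# Streaming: the template iterates the results once, so passing the run()
# iterator is enough and entities are pulled in batches as they are consumed.
streaming_context = {'items': query.run()}

# Materialized: fetch() (or list(query.run())) loads every result into memory,
# which you only need if the data is used more than once.
in_memory_context = {'items': query.fetch(1000)}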
For a model like:
class Thing(ndb.Model):
    visible = ndb.BooleanProperty()
    made_by = ndb.KeyProperty(kind=User)
    belongs_to = ndb.KeyProperty(kind=AnotherThing)
Essentially I'm performing an 'or' query, but comparing different properties, so I can't use the built-in OR. I want to get all Thing entities (belonging to a particular AnotherThing) which either have visible set to True, or have visible set to False and were made_by the current user.
Which would be less demanding on the datastore (ie financially cost less):
1. Query to get everything, i.e. Thing.query(Thing.belongs_to == some_thing.key), and iterate through the results, keeping the visible ones and the ones that aren't visible but are made_by the current user?
2. Query to get the visible ones, i.e. Thing.query(Thing.belongs_to == some_thing.key, Thing.visible == True), and query separately to get the non-visible ones by the current user, i.e. Thing.query(Thing.belongs_to == some_thing.key, Thing.visible == False, Thing.made_by == current_user)?
Number 1 would get many unneeded results, like non-visible Things made by other users, which I think means many datastore reads? Number 2 is two whole queries, though, which is also possibly unnecessarily heavy, right? I'm still trying to work out what kinds of interaction with the database cause what kinds of costs.
I'm using ndb, tasklets and memcache where necessary, in case that's relevant.
Number two is going to cost less, for two reasons. First, you pay for each datastore read and for each entity returned by a query, so with the first option you pay to read and return all the data, while with the second you only pay for what you need.
Secondly, you also pay for backend or frontend instance time, and with the first method you spend time iterating through all your results, whereas the second method needs no extra time for that.
I can't see a way where the first option is better (maybe if you only have a few entities?).
To understand how reads and queries are billed, scroll down a little on:
https://developers.google.com/appengine/docs/billing
You will see how Reads, Writes and Smalls are added up for reads, writes and queries.
I would also just query for the ones that are owned by the current user instead of visible=False and owner=current_user; this way you don't need a composite index, which will save some time. You can also make visible a partial index, saving some space as well (only index it when True, assuming you never need to query for False ones). You will need to do a little work to remove duplicates, but that is probably not too bad.
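A sketch of that two-query approach with in-memory de-duplication (names follow the question; current_user_key is assumed to be the ndb key of the current user):

# Two targeted queries instead of scanning everything, then merge by key.
visible_things = Thing.query(Thing.belongs_to == some_thing.key,
                             Thing.visible == True).fetch()
my_things = Thing.query(Thing.belongs_to == some_thing.key,
                        Thing.made_by == current_user_key).fetch()

# De-duplicate: a Thing that is both visible and mine appears in both lists.
merged = {t.key: t for t in visible_things + my_things}.values()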
You are probably best benchmarking both cases using real-world data. It's hard to determine things like this in the abstract, as there are many subtleties that may affect overall performance.
I would expect option 2 to be better though. Loading tons of objects that you don't care about is simply going to put a heavy burden on the data store that I don't think an extra query would be comparable to. Of course, it depends on how many extra things, etc.
I occasionally get this error when I do batch puts.
RequestTooLargeError: The request to API call datastore_v3.Put() was too large.
The call that triggers this does a db.put call on a list of 1000+ entities. Each entity has a single db.TextProperty field, filled with about 20,000 characters. Each entity also has a parent entity, although none of the entities in the list passed to db.put share a common parent. Each of the parent entities store about 10 integers and aren't very large.
My first instinct was to split up the number of entities being passed to db.put, but
Any ideas on the cause of this?
Edit: Splitting up the entities does work. For example, I can do this:
for entity in entities:
    entity.put()
But the answer to this question suggests that the number of entities being put shouldn't matter. So still confused.
Remember that a quick test to find an entity's size is to try to write it to memcache. If it exceeds memcache's 1 MB limit, the write will fail and you can catch the exception. May be of use if you follow Nick's suggestion to isolate the issue.
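A rough probe along those lines (a sketch, using one entity from the list in the question; the memcache client raises ValueError for oversized values):

from google.appengine.api import memcache

# Quick size check: memcache rejects values over roughly 1 MB, so a failed
# or raising set() flags an entity that is close to the limit.
try:
    fits = memcache.set('size-probe', entity)
except ValueError:
    fits = False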
I used a StringProperty until now and just ran into the same issue when the data stored in that property eventually grew too big (it just passed 1 MB). I was able to fix it for now with a JsonProperty with compressed=True. Might be an option.
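In ndb terms that looks roughly like this (the model and property names are made up):

from google.appengine.ext import ndb

class BigRecord(ndb.Model):
    # JsonProperty with compressed=True stores the value zlib-compressed,
    # which keeps large blobs further away from the size limits.
    payload = ndb.JsonProperty(compressed=True)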
Is this batch update inside a transaction? If so, remember there is a 10 MB limit on the transaction size.
I am reading a "table" in Python on GAE that has 1000 rows, and the program stops because the time limit is reached. (So it takes at least 20 seconds.)
Is it possible that GAE is that slow? Is there a way to fix that?
Is this because I use the free service and do not pay for it?
Thank you.
The code itself is this:
liststocks=[]
userall=user.all() # this has three fields username... trying to optimise by this line
stocknamesall=stocknames.all() # 1 field, name of the stocks trying to optimise by this line too
for u in userall: # userall has 1000 users
    for stockname in stocknamesall: # 4 stocks
        astock = stocksowned() # it is also a "table", no relevance I think
        astock.quantity = random.randint(1,100)
        astock.nameid = u.key()
        astock.stockid = stockname.key()
        liststocks.append(astock)
GAE is slow when used inefficiently. Like any framework, sometimes you have to know a little bit about how it works in order to efficiently use it. Luckily, I think there is an easy improvement that will help your code a lot.
It is faster to use fetch() explicitly instead of using the iterator. The iterator causes entities to be fetched in "small batches" - each "small batch" results in a round-trip to the datastore to get more data. If you use fetch(), then you'll get all the data at once with just one round-trip to the datastore. In short, use fetch() if you know you are going to need lots of results.
In this case, using fetch() will help a lot - you can easily get all your users and stocknames in one round-trip to the datastore each. Right now you're making lots of extra round-trips to the datastore and re-fetching stockname entities too!
Try this (you said your table has 1000 rows, so I use fetch(1000) to make sure you get all the results; use a larger number if needed):
userall=user.all().fetch(1000)
stocknamesall=stocknames.all().fetch(1000)
# rest of the code as-is
To see where you could make additional improvements, please try out AppStats so you can see exactly why your request is taking so long. You might even consider posting a screenshot of the AppStats info about your request along with your post.