Slow Django database operations on Google App Engine - python

I'm testing Google App Engine and Django-nonrel with the free quota. It seems to me that database operations against the Datastore are hideously slow.
Take, for example, this simplified request handler, which takes in multipart/form-data of XML blobs, parses them, and inserts them into the database:
def post(request):
    fields = cgi.FieldStorage(request)
    with transaction.commit_on_success():
        for xmlblob in fields.getlist('xmlblob'):
            blob_object = parse_xml(xmlblob)
            blob_object.save()
Blob_object has five fields, all of them of type CharField.
For just ca. 30 blobs (about 1 kB of XML altogether), that function takes 5 seconds to return and uses over 30,000 api_cpu_ms. CPU time is supposed to be equivalent to the amount of work a 1.2 GHz Intel x86 processor could do in that time, but I am pretty sure no x86 processor available would need 30 seconds to insert 30 rows into a database.
Without saving the objects to the database (that is, just parsing the XML and throwing away the result), the request takes mere milliseconds.
So is Google App Engine really so slow that I can't save even a few dozen entities to the Datastore in a normal request, or am I missing something here? And of course, even if I did the inserts in a Backend or via a Task Queue, it would still cost hundreds of times more than what seems acceptable.
Edit: I found out that, by default, GAE does two index writes per property for each entity. Most of those properties should not be indexed, so the question is: how can I mark properties as unindexed in Django-nonrel?
I still feel, though, that even with the index writes, the database operation is taking a ridiculous amount of time.

In the absence of batch operations, there's not much you can do to reduce wallclock times. Batch operations are pretty essential to reducing wallclock time on App Engine (or any distributed platform with RPCs, really).
Under the current billing model, CPU milliseconds reported by the datastore reflect the cost of the operation rather than the actual time it took, and are a way of billing for resources. Under the new billing model, these will be billed explicitly as datastore operations, instead.
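As a rough illustration of batching, here is a sketch using the plain App Engine db API (it assumes parse_xml() returns google.appengine.ext.db entities rather than Django-nonrel model instances, so it is not a drop-in replacement for the handler above):

import cgi
from google.appengine.ext import db

def post(request):
    fields = cgi.FieldStorage(request)
    # Build all entities first, then write them in one batched RPC.
    # db.put() accepts a list, so this replaces ~30 individual save() calls.
    entities = [parse_xml(xmlblob) for xmlblob in fields.getlist('xmlblob')]
    db.put(entities)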

I have not found a real answer yet, but I made some calculations on the cost. Currently every indexed property field costs around $0.20 to $0.30 per 10k inserts. With the upcoming billing model (see the Pricing FAQ), the cost will be exactly $0.10 per 100k operations, or $0.20 per indexed field per 100k inserts with two index write operations per insert.
So as the price seems to be going down by a factor of ten, the observed slowness is indeed unexpected behaviour. Since the free quota is plenty for my test runs, and with the new pricing model coming, I won't let it bother me for now.

Related

Managing Heroku RAM for Unique Application

I have a Flask application that allows users to query a fairly small database (2.4M rows) using SQL. It's similar to HackerRank but more limited in scope. It's deployed on Heroku.
I've noticed during testing that I can predictably hit an R14 error (memory quota exceeded) or R15 (memory quota greatly exceeded) by running large queries. The queries that typically cause this are outside what a normal user might run, such as SELECT * FROM some_huge_table. That said, I am concerned that these errors will become a regular occurrence even for small queries once 5, 10, or 100 users are querying at the same time.
I'm looking for some advice on how to manage memory quotas for this type of interactive site. Here's what I've explored so far:
Changing the number of gunicorn workers. This has had some effect, but I still hit R14 and R15 errors consistently.
Forcing limits on user queries, based on either the query text or the EXPLAIN output. This does reduce memory usage, but I'm afraid it won't scale to even a very modest number of users.
Moving to a higher Heroku tier. The plan I currently use provides ~512 MB of RAM; the largest plan is around 14 GB. Again, this would help, but it won't scale even moderately, to say nothing of the associated costs.
Reducing the size of the database significantly. I would like to avoid this if possible. Doing the napkin math on a table going from 1.9M rows down to 10k or 50k, the application would have greatly reduced memory needs and would scale better, but it would still have some moderate maximum usage limit.
As you can see, I'm a novice at best when it comes to memory management. I'm looking for some strategies/ideas on how to solve this general problem, and if it's the case that I need to either drastically cut the data size or throw tons of $ at this, that's OK too.
Thanks
Coming from my personal experience, I see two approaches:
1. plan for it
Starting from your example, this means you calculate the maximum memory that a single request can use, multiply it by the number of gunicorn workers, and use dynos big enough for that.
For a different application this could be a valid approach; I don't think it is for you.
2. reduce memory usage, solution 1
The fact that so much application memory is used makes me think that your code likely loads the whole result set into memory (probably even multiple times, in multiple formats) before returning it to the client.
In the end, your application is only fetching data from the database and converting it to some output format (JSON/CSV?).
What you are probably looking for is streaming responses.
Your Flask view then works on a record-by-record basis: it reads a single record, converts it to your output format, and yields it.
Both your database client library and Flask support this (on the database side it is usually called a server-side cursor or iterator).
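A minimal sketch of that idea, assuming Heroku Postgres accessed through psycopg2 (the view name, CSV output, and query handling are illustrative, not from the question):

import csv
import io
import os

import psycopg2
from flask import Flask, Response, request

app = Flask(__name__)

@app.route("/query")
def run_user_query():
    sql = request.args["q"]  # should already be validated/limited elsewhere

    def generate():
        conn = psycopg2.connect(os.environ["DATABASE_URL"])
        try:
            # A named cursor is a server-side cursor: psycopg2 fetches rows
            # in chunks of itersize instead of loading the whole result set.
            with conn.cursor(name="user_query") as cur:
                cur.itersize = 1000
                cur.execute(sql)
                for row in cur:
                    buf = io.StringIO()
                    csv.writer(buf).writerow(row)
                    yield buf.getvalue()
        finally:
            conn.close()

    return Response(generate(), mimetype="text/csv")

The key point is that the generator yields one row at a time, so the dyno's memory footprint stays roughly constant no matter how many rows the query returns.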
2. reduce memory usage, solution 2
Other services often go for simple pagination or limit result sets to manage server-side memory, for example:
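A hypothetical wrapper (Postgres syntax assumed; the helper name is my own) that caps how many rows any user query can ever materialise:

def limit_user_sql(user_sql, page=0, page_size=500):
    # Wrap the user's SELECT in a subquery so the server never builds more
    # than one page of results; page and page_size are chosen by the server.
    return "SELECT * FROM (%s) AS user_q LIMIT %d OFFSET %d" % (
        user_sql, page_size, page * page_size)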
security sidenote
It sounds like users can actually define the SQL statement in their API requests. This is both a security and an application risk: apart from issuing INSERT, UPDATE, or DELETE statements, a user could craft a SQL statement that will not only blow up your application's memory but also damage your database.

What is the Google Appengine Ndb GQL query max limit?

I am looking around for an answer to what the maximum number of results is that I can get from a GQL query on NDB on Google App Engine. I am using an implementation with cursors, but it would be much faster if I retrieved them all at once.
This depends on lots of things, like the size of the entities and the number of values that need to be looked up in the index, so it's best to benchmark it for your specific application. Also beware that if you find that on a sunny day it takes, say, 10 seconds to load all your items, that probably means some small fraction of your queries will run into a timeout due to natural variations in datastore performance, and occasionally your app will hit the timeout all the time when the datastore is having a bad day (it happens).
Basically, you no longer have the old limit of 1000 entities per query, but consider using a reasonable limit anyway, because you can hit the timeout error, and it's better to get the results in batches so users won't wait during load time.
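For reference, a sketch of batched fetching with NDB cursors (the Item model and batch size are placeholders, not from the question):

from google.appengine.ext import ndb

class Item(ndb.Model):
    name = ndb.StringProperty()

def fetch_all_in_batches(batch_size=500):
    results, cursor, more = [], None, True
    while more:
        # fetch_page() returns (results, cursor, more) and lets you resume
        # from the cursor, so each batch stays well under the timeout.
        page, cursor, more = Item.query().fetch_page(batch_size,
                                                     start_cursor=cursor)
        results.extend(page)
    return results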

Is there a way to cache the fetch output?

I'm working on a closed system running in the cloud.
What I need is a search function that uses a user-typed-in regexp to filter the rows in a dataset.
phrase = re.compile(request.get("query"))
data = Entry.all().fetch(50000)  # this takes around 10 s when there are 6,000 records
result = [x for x in data if phrase.search(x.title)]
Now, the database itself won't change much, and there will be no more than 200-300 searches a day.
Is there a way to somehow cache all the Entries (I expect there will be no more than 50,000 of them, each no bigger than 500 bytes), so that retrieving them won't take over 10 seconds? Or perhaps to parallelize it? I don't mind 10 CPU seconds, but I do mind the 10 seconds the user has to wait.
To preempt any answers like "index it and use .filter()": the query is a regexp, and I don't know of any indexing mechanism that would allow using a regexp.
You can also use cachepy or performance engine (shameless plug) to store the data on App Engine's local instances, so you get faster access to all entities without being limited by memcache boundaries or datastore latency.
Hint: A local instance gets killed if it exceeds about 185 MB of memory, so you can actually store quite a lot of data in it if you know what you're doing.
Since there is a bounded number of entries, you can memcache all of them and then do the filtering in memory as you've outlined. Note, however, that each memcache entry cannot exceed 1 MB, but you can fetch up to 32 MB of memcache entries in parallel.
So split the entries into subsets, memcache the subsets, and then read them back in parallel using precomputed memcache keys.
More here:
http://code.google.com/appengine/docs/python/memcache/functions.html
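A sketch of that approach (the chunk count and key names are illustrative; each cached value must stay under memcache's 1 MB limit after pickling):

from google.appengine.api import memcache

CHUNKS = 32  # keep each cached value comfortably under the 1 MB limit
KEYS = ['entries:%d' % i for i in range(CHUNKS)]

def cache_entries(entries):
    # Stripe the entities across CHUNKS separate memcache values.
    chunks = dict((KEYS[i], entries[i::CHUNKS]) for i in range(CHUNKS))
    memcache.set_multi(chunks)

def load_entries():
    # get_multi() fetches all chunks in a single batched RPC.
    cached = memcache.get_multi(KEYS)
    entries = []
    for key in KEYS:
        entries.extend(cached.get(key, []))
    return entries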
Since your data is on the order of 20 MB, you may be able to load it entirely into local instance memory, which will be as fast as you can get. Alternatively, you could store it as a data file alongside your app; reading that will be faster than accessing the datastore.

Do a mass db.delete on App Engine, without eating CPU

We've got a reasonably-sized database on Google App Engine - just over 50,000 entities - that we want to clear out stale data from. The plan was to write a deferred task to iterate over the entities we no longer wanted, and delete them in batches.
One complication is that our entities also have child entities that we also want to purge -- no problem, we thought; we'd just query the datastore for those entities, and drop them at the same time as the parent:
query = ParentKind.all()
query.count(100)
query.filter('bar =', 'foo')

to_delete = []
for entity in query:
    to_delete.append(entity)
    to_delete.extend(ChildKindA.all().ancestor(entity).fetch(100))
    to_delete.extend(ChildKindB.all().ancestor(entity).fetch(100))

db.delete(to_delete)
We limited ourselves to deleting 100 ParentKind entities at a time; each ParentKind had around 40 child ChildKindA and ChildKindB entities total - perhaps 4000 entities.
This seemed reasonable at the time, but we ran one batch as a test, and the resulting query took 9 seconds to run -- and spent 1933 seconds in billable CPU time accessing the datastore.
This seems pretty harsh -- 0.5 billable seconds per entity! -- but we're not entirely sure what we're doing wrong. Is it simply the size of the batch? Are ancestor queries particularly slow? Or, are deletes (and indeed, all datastore accesses) simply slow as molasses?
Update
We changed our queries to be keys_only, and while that reduced the time to run one batch to 4.5 wall-clock seconds, it still cost ~1,900 seconds of CPU time.
Next, we installed Appstats in our app (thanks, kevpie) and ran a smaller batch -- 10 parent entities, which amounts to ~450 entities in total. Here's the updated code:
query = ParentKind.all(keys_only=True)
query.count(10)
query.filter('bar =', 'foo')

to_delete = []
for entity in query:
    to_delete.append(entity)
    to_delete.extend(ChildKindA.all(keys_only=True).ancestor(entity).fetch(100))
    to_delete.extend(ChildKindB.all(keys_only=True).ancestor(entity).fetch(100))

db.delete(to_delete)
The results from Appstats:
service.call           #RPCs   real time   api time
datastore_v3.RunQuery     22       352ms      555ms
datastore_v3.Delete        1       366ms   132825ms
taskqueue.BulkAdd          1         7ms        0ms
The Delete call is the single most expensive part of the operation!
Is there a way around this? Nick Johnson mentioned that using the bulk delete handler is currently the fastest way to delete, but ideally we don't want to delete all entities of a kind, just the ones that match our initial bar = foo query, plus their children.
We recently added a bulk-delete handler, documented here. It takes the most efficient possible approach to bulk deletion, though it still consumes CPU quota.
If you want to spread out the CPU burn, you could create a mapreduce job. It will still iterate over every entity (this is a current limitation of the mapper API), but you can check whether each entity meets the condition and delete it (or not) at that point.
To slow down the CPU usage, assign the mapper to a task queue that you've configured to run slower than normal. You can spread the run time out over several days and not eat up all your CPU quota.
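A sketch of what the mapper function might look like with the App Engine mapreduce library's operation API (the bar check mirrors the question's filter; handling of the child entities is left out here):

from mapreduce import operation as op

def delete_stale(entity):
    # Called once per ParentKind entity by the mapper framework.
    if entity.bar == 'foo':
        # Yielding delete operations lets the framework batch the RPCs
        # and throttle them via the task queue the job runs on.
        yield op.db.Delete(entity)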

SimpleDB query performance improvement using boto

I am trying to use SimpleDB in the following way.
I want to keep 48 hours' worth of data in SimpleDB at any time and query it for different purposes.
Each domain holds 1 hour's worth of data, so at any time there are 48 domains in SimpleDB.
As new data is constantly uploaded, I delete the oldest domain and create a new domain for each new hour.
Each domain is about 50 MB in size; the total size of all the domains is around 2.2 GB.
Each item in a domain has the following attributes:
identifier - around 50 characters long -- 1 per item
timestamp - timestamp value -- 1 per item
serial_n_data - 500-1000 bytes data -- 200 per item
I'm using the Python boto library to upload and query the data.
I send 1 item/sec, with around 200 attributes, to the domain.
For one application of this data, I need to get all the data from all 48 domains. The query looks like SELECT * FROM domain, run against each of the domains.
I use 8 threads to query the data, with each thread responsible for a few domains: e.g. thread 1 handles domains 1-6, thread 2 handles domains 7-12, and so on.
It takes close to 13 minutes to get all the data. I am using boto's select method for this. I need much faster performance than this. Any suggestions on speeding up the querying process? Is there another language I could use that would speed things up?
Use more threads
I would suggest inverting your threads/domain ratio from 1/6 to something closer to 30/1. Most of the time taken to pull down large chunks of data from SimpleDB is going to be spent waiting. In this situation upping the thread count will vastly improve your throughput.
One of the limits of SimpleDB is the query response size cap at 1 MB. This means pulling down the 50 MB in a single domain will take a minimum of 50 Selects (the original + 49 additional pages). These must occur sequentially because the NextToken from the current response is needed for the next request. If each Select takes 2+ seconds (not uncommon with large responses and high request volume), you spend 2 minutes on each domain. If every thread has to iterate through each of 6 domains in turn, that's about 12 minutes right there. One thread per domain should easily cut that down to about 2 minutes.
But you should be able to do much better than that. SimpleDB is optimized for concurrency. I would try 30 threads per domain, giving each thread a portion of the hour to query on, since it is log data after all. For example:
SELECT * FROM domain WHERE timestamp between '12:00' and '12:02'
(Obviously, you'd use real timestamp values.) All 30 queries can be kicked off without waiting for any responses. In this way you still need to make at least 50 queries per domain, but instead of making them all sequentially you get a lot more concurrency. You will have to test for yourself how many threads give you the best throughput. I would encourage you to try up to 60 per domain, breaking the SELECT conditions down into one-minute increments. If that works for you, you will have fully parallel queries and most likely have eliminated all follow-up pages. If you get 503 ServiceUnavailable errors, scale back the threads.
The domain is the basic unit of scalability for SimpleDB, so it is good that you have a convenient way to partition your data. You just need to take advantage of the concurrency. Rather than 13 minutes, I wouldn't be surprised if you were able to get the data in 13 seconds for an app running on EC2 in the same region, but the actual time it takes will depend on a number of other factors.
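A rough sketch of that thread-per-time-slice approach with boto (Python 2-era boto; the domain and timestamp attribute follow the question, everything else here is illustrative):

import threading
import Queue

import boto

def worker(domain_name, slices, results, lock):
    # One SimpleDB connection per thread; each thread drains time slices
    # from the shared queue and runs one SELECT per slice.
    conn = boto.connect_sdb()
    domain = conn.get_domain(domain_name, validate=False)
    while True:
        try:
            start, end = slices.get_nowait()
        except Queue.Empty:
            return
        query = ("select * from `%s` where timestamp between '%s' and '%s'"
                 % (domain_name, start, end))
        rows = list(domain.select(query))  # boto follows NextToken for us
        with lock:
            results.extend(rows)

def parallel_select(domain_name, time_slices, num_threads=30):
    slices = Queue.Queue()
    for s in time_slices:  # e.g. [('12:00', '12:02'), ('12:02', '12:04'), ...]
        slices.put(s)
    results, lock = [], threading.Lock()
    threads = [threading.Thread(target=worker,
                                args=(domain_name, slices, results, lock))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

Tune num_threads upward until you start seeing 503 ServiceUnavailable errors, then back off.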
Cost Concerns
As a side note, I should mention the costs of what you are doing, even though you haven't raised the issue. CreateDomain and DeleteDomain are heavyweight operations. Normally I wouldn't advise using them so often. You are charged about 25 seconds of box usage each time so creating and deleting one each hour adds up to about $70 per month just for domain management. You can store orders of magnitude more data in a domain than the 50MB you mention. So you might want to let the data accumulate more before you delete. If your queries include the timestamp (or could be made to include the timestamp) query performance may not be hurt at all by having an extra GB of old data in the domain. In any case, GetAttributes and PutAttributes will never suffer a performance hit with a large domain size, it is only queries that don't make good use of a selective index. You'd have to test your queries to see. That is just a suggestion, I realize that the create/delete is cleaner conceptually.
Also, writing 200 attributes at a time is expensive, due to a quirk in the box usage formula. The box usage for writes is proportional to the number of attributes raised to the power of 3! The formula, in hours, is:
0.0000219907 + 0.0000000002 N^3
That's the base charge plus the per-attribute charge, where N is the number of attributes. In your situation, if you write all 200 attributes in a single request, the box usage charges will be about $250 per million items ($470 per million if you write 256 attributes). If you break each request into 4 requests with 50 attributes each, you will quadruple your PutAttributes volume but reduce the box usage charges by an order of magnitude, to about $28 per million items. If you are able to break the requests down, it may be worth doing. If you cannot (due to request volume, or just the nature of your app), SimpleDB can end up being extremely unappealing from a cost standpoint.
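A quick check of that arithmetic (the box usage price of roughly $0.14 per hour is an assumption, implied by the dollar figures above rather than quoted in the answer):

BOX_USAGE_PRICE_PER_HOUR = 0.14  # assumed price, not quoted in the answer

def put_cost_per_million_items(attrs_per_request, requests_per_item=1):
    # Box usage hours for one PutAttributes call with N attributes.
    hours = 0.0000219907 + 0.0000000002 * attrs_per_request ** 3
    return hours * requests_per_item * 1000000 * BOX_USAGE_PRICE_PER_HOUR

print(put_cost_per_million_items(200))    # ~ $227, the "about $250 per million"
print(put_cost_per_million_items(256))    # ~ $473
print(put_cost_per_million_items(50, 4))  # ~ $26, the "about $28 per million"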
I have had the same issue as you, Charlie. After profiling the code, I narrowed the performance problem down to SSL. It seems that is where most of the time, and hence most of the CPU cycles, is being spent.
I have read of a problem in the httplib library (which boto uses for SSL) where the performance doesn't increase unless the packets are over a certain size, though that was for Python 2.5 and may have already been fixed.
SDB Explorer uses multithreaded BatchPutAttributes calls to achieve high write throughput when uploading bulk data to Amazon SimpleDB, and it allows multiple parallel uploads. If you have the bandwidth, you can take full advantage of it by running a number of BatchPutAttributes processes at once in a parallel queue, which will reduce the time spent processing.
http://www.sdbexplorer.com/
