We've got a reasonably-sized database on Google App Engine - just over 50,000 entities - that we want to clear out stale data from. The plan was to write a deferred task to iterate over the entities we no longer wanted, and delete them in batches.
One complication is that these entities also have child entities that we want to purge at the same time -- no problem, we thought; we'd just query the datastore for those entities and drop them alongside the parent:
from google.appengine.ext import db

query = ParentKind.all()
query.filter('bar =', 'foo')
to_delete = []

for entity in query.fetch(100):  # at most 100 parents per batch
    to_delete.append(entity)
    to_delete.extend(ChildKindA.all().ancestor(entity).fetch(100))
    to_delete.extend(ChildKindB.all().ancestor(entity).fetch(100))

db.delete(to_delete)
We limited ourselves to deleting 100 ParentKind entities at a time; each ParentKind entity had around 40 ChildKindA and ChildKindB children in total, so one batch touched perhaps 4,000 entities.
This seemed reasonable at the time, but we ran one batch as a test, and the resulting query took 9 seconds to run -- and spent 1933 seconds in billable CPU time accessing the datastore.
This seems pretty harsh -- 0.5 billable seconds per entity! -- but we're not entirely sure what we're doing wrong. Is it simply the size of the batch? Are ancestor queries particularly slow? Or, are deletes (and indeed, all datastore accesses) simply slow as molasses?
Update
We changed our queries to be keys_only, and while that reduced the time to run one batch to 4.5 real seconds, it still cost ~1900 seconds in CPU time.
Next, we installed Appstats in our app (thanks, kevpie) and ran a smaller batch -- 10 parent entities, which would amount to ~450 entities in total. Here's the updated code:
query = ParentKind.all(keys_only=True)
query.filter('bar =', 'foo')
to_delete = []

for key in query.fetch(10):  # keys_only queries yield keys, not entities
    to_delete.append(key)
    to_delete.extend(ChildKindA.all(keys_only=True).ancestor(key).fetch(100))
    to_delete.extend(ChildKindB.all(keys_only=True).ancestor(key).fetch(100))

db.delete(to_delete)
The results from Appstats:
service.call             #RPCs    real time    api time
datastore_v3.RunQuery    22       352ms        555ms
datastore_v3.Delete      1        366ms        132825ms
taskqueue.BulkAdd        1        7ms          0ms
The Delete call is the single most expensive part of the operation!
Is there a way around this? Nick Johnson mentioned that using the bulk delete handler is the fastest way to delete at present, but ideally we don't want to delete all entities of a kind, just the ones that match, and are children of, our initial bar = foo query.
We recently added a bulk-delete handler, documented here. It takes the most efficient possible approach to bulk deletion, though it still consumes CPU quota.
If you want to spread out the CPU burn, you could create a mapreduce job. It will still iterate over every entity (this is a current limitation of the mapper API), but you can check whether each entity meets the condition, and delete it or not, at that time.
To slow down the CPU usage, assign the mapper to a task queue that you've configured to run slower than normal. You can spread the run time out over several days and not eat up all your CPU quota.
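For illustration, here is a minimal sketch of what such a mapper might look like with the GAE mapreduce library, reusing the ParentKind and bar = 'foo' names from the question (the mapreduce.yaml wiring with DatastoreInputReader is omitted):

from mapreduce import operation as op

def process(entity):
    # The mapper visits every ParentKind entity; only matching ones are deleted.
    if entity.bar == 'foo':
        yield op.db.Delete(entity)

Deleting the child entities would still require ancestor queries inside the same function, so this only sketches the parent half of the problem.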
Related
Recently I had to import 48,000 records into Google App Engine. The stored 'tables' are ndb.Model types. Each of these records is checked against a couple of other 'tables' in the 'database' for integrity purposes, and then written (.put()).
To do this, I uploaded a .csv file into Google Cloud Storage and processed it from there in a task queue. This processed about 10 .csv rows per second and errored after 41,000 records with an out of memory error. Splitting the .csv file into 2 sets of 24,000 records each fixed this problem.
So, my questions are:
a) is this the best way to do this?
b) is there a faster way (the next upload might be around 400,000 records)? and
c) how do I get over (or stop) the out of memory error?
Many thanks,
David
1) Have you thought about (even temporarily) upgrading your server instances?
https://cloud.google.com/appengine/docs/standard/#instance_classes
2) I don't think a 41,000-row CSV is enough to run out of memory, so you probably need to change your processing:
a) Break up the processing using multiple tasks, rolling your own cursor to process a couple thousand at a time, then spinning up a new task.
b) Experiment with ndb.put_multi() (a sketch combining (a) and (b) follows below)
Sharing some code of your loop and puts might help
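Here is a rough sketch of what (a) and (b) might look like together; read_csv_rows and record_from_row are hypothetical helpers standing in for your Cloud Storage reader and your row-to-entity conversion with its integrity checks:

from google.appengine.api import taskqueue
from google.appengine.ext import ndb

CHUNK_SIZE = 2000  # tune so one chunk fits comfortably in memory

def process_chunk(offset):
    rows = read_csv_rows(offset, CHUNK_SIZE)  # hypothetical GCS reader
    if not rows:
        return  # import finished
    entities = [record_from_row(r) for r in rows]  # hypothetical conversion
    ndb.put_multi(entities)  # one batched RPC instead of one put() per record
    # Chain a fresh task for the next chunk so memory is released in between.
    taskqueue.add(url='/import-worker', params={'offset': offset + CHUNK_SIZE})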
The ndb in-context cache could be contributing to the memory errors. From the docs:
When executing long-running queries in background tasks, it's possible for the in-context cache to consume large amounts of memory. This is because the cache keeps a copy of every entity that is retrieved or stored in the current context. To avoid memory exceptions in long-running tasks, you can disable the cache or set a policy that excludes whichever entities are consuming the most memory.
You can prevent caching on a case by case basis by setting a context option in your ndb calls, for example
foo.put(use_cache=False)
Completely disabling caching might degrade performance if you are often using the same objects for your comparisons. If that's the case, you could flush the cache periodically to stop it getting too big.
if some_condition:
    context = ndb.get_context()
    context.clear_cache()
I'm testing Google App Engine and Django-nonrel with the free quota. It seems to me that database operations to the Datastore are hideously slow.
Take, for example, this simplified function processing a request, which takes in a multipart/form-data of XML blobs, parses them, and inserts them into the database:
import cgi
from django.db import transaction

def post(request):
    fields = cgi.FieldStorage(request)
    with transaction.commit_on_success():
        for xmlblob in fields.getlist('xmlblob'):
            blob_object = parse_xml(xmlblob)
            blob_object.save()
Blob_object has five fields, all of them of type CharField.
For just ca. 30 blobs (with about 1 kB of XML altogether), that function takes 5 seconds to return, and uses over 30,000 api_cpu_ms. CPU time is supposed to be equivalent to the amount of work a 1.2 GHz Intel x86 processor could do in that time, but I am pretty sure no x86 processor available would need 30 seconds to insert 30 rows into a database.
Without saving objects to database (that is, just parsing the XML and throwing away the result) the request takes merely milliseconds.
So should Google App Engine really be so slow that I can't save even a few dozen entities to the Datastore in a normal request, or am I missing something here? And of course, even if I did the inserts in some backend or by using a task queue, it would still cost hundreds of times more than what would seem acceptable.
Edit: I found out that, by default, GAE does two index writes per property for each entity. Most of these properties should not be indexed, so the question is: how can I set properties as unindexed in Django-nonrel?
I still feel, though, that even with the index writes, the database operations are taking a ridiculous amount of time.
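For reference, in the underlying App Engine db API a property is excluded from indexing with indexed=False; how Django-nonrel exposes that option is exactly the open question, but a low-level sketch (the model and property names here are hypothetical) looks like:

from google.appengine.ext import db

class BlobModel(db.Model):
    # indexed=False skips the two index writes for this property on every put().
    payload = db.StringProperty(indexed=False)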
In the absence of batch operations, there's not much you can do to reduce wallclock times. Batch operations are pretty essential to reducing wallclock time on App Engine (or any distributed platform with RPCs, really).
Under the current billing model, CPU milliseconds reported by the datastore reflect the cost of the operation rather than the actual time it took, and are a way of billing for resources. Under the new billing model, these will be billed explicitly as datastore operations, instead.
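To make the batching point concrete, here is a hedged sketch of the handler above rewritten around a single batch put, assuming parse_xml could be made to return low-level google.appengine.ext.db entities rather than Django-nonrel models:

import cgi
from google.appengine.ext import db

def post(request):
    fields = cgi.FieldStorage(request)
    # Build all the entities first, then write them in one batched RPC.
    blob_objects = [parse_xml(x) for x in fields.getlist('xmlblob')]
    db.put(blob_objects)  # one datastore round trip for all ~30 entities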
I have not found a real answer yet, but I made some calculations for the cost. Currently every indexed property field costs around $0.20 to $0.30 per 10k inserts. With the upcoming billing model (Pricing FAQ) the cost will be exactly $0.1 per 100k operations, or $0.2 per indexed field per 100k inserts with 2 index write operations per insert.
So as the price seems to be going down by a factor of ten, the observed slowness is indeed unexpected behaviour. As the free quota is more than enough for my test runs, and with the new pricing model coming, I won't let it bother me at this time.
I'm working on a closed system running in the cloud.
What I need is a search function that uses user-typed-in regexp to filter the rows in a dataset.
import re

phrase = re.compile(request.get("query"))
data = Entry.all().fetch(50000)  # this takes around 10s when there are 6000 records
result = [x for x in data if phrase.search(x.title)]
Now, the database itself won't change too much, and there will be no more than 200-300 searches a day.
Is there a way to somehow cache all the Entries (I expect there will be no more than 50,000 of them, each no bigger than 500 bytes), so that retrieving them doesn't take upwards of 10 seconds? Or perhaps to parallelize it? I don't mind 10 CPU seconds, but I do mind the 10 seconds the user has to wait.
To address any answers like "index it and use .filter()": the query is a regexp, and I don't know of any indexing mechanism that would allow using a regexp.
You can also use cachepy or performance engine (shameless plug) to store the data on app engine's local instances, so you can have faster access to all entities without getting limited by memcache boundaries or datastore latency.
Hint: A local instance gets killed if it surpasses about 185 MB of memory, so you can store actually quite a lot of data in it if you know what you're doing.
Since there is a bounded number of entries, you can memcache all of them and then do the filtering in memory, as you've outlined. However, note that each memcache entry cannot exceed 1 MB, while you can fetch up to 32 MB of memcache entries in parallel.
So split the entries into subsets, memcache the subsets, and then read them in parallel by precomputing the memcache keys (a sketch follows below).
More here:
http://code.google.com/appengine/docs/python/memcache/functions.html
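Here is one hedged way to do the striping, assuming the entries pickle cleanly (if they don't, db.model_to_protobuf can serialize them first); the key names and chunk count are arbitrary:

import pickle
from google.appengine.api import memcache

NUM_CHUNKS = 32  # keep each pickled chunk safely under the 1 MB value limit

def cache_entries(entries):
    chunks = {}
    for i in range(NUM_CHUNKS):
        # Stripe entries across chunks so each stays roughly the same size.
        chunks['entries:%d' % i] = pickle.dumps(entries[i::NUM_CHUNKS], 2)
    memcache.set_multi(chunks)

def load_entries():
    keys = ['entries:%d' % i for i in range(NUM_CHUNKS)]
    cached = memcache.get_multi(keys)
    if len(cached) < NUM_CHUNKS:
        return None  # a chunk was evicted; fall back to the datastore
    entries = []
    for key in keys:
        entries.extend(pickle.loads(cached[key]))
    return entries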
Since your data is on the order of 20 MB, you may be able to load it entirely into local instance memory, which will be as fast as you can get. Alternately, you could store it as a data file alongside your app; reading that will be faster than accessing the datastore.
I'm using the Python GAE SDK.
I have some processing that needs to be done on 6000+ instances of MyKind. It is too slow to be done in a single request, so I'm using the task queue. If I make a single task process only one entity, then it should take only a few seconds.
The documentation says that only 100 tasks can be added in a "batch". (What do they mean by that? In one request? In one task?)
So, assuming that "batch" means "request", I'm trying to figure out what the best way is to create a task for each entity in the datastore. What do you think?
It's easier if I can assume that the order of MyKind will never change. (The processing will never actually change the MyKind instances - it only creates new instances of other types.) I could just make a bunch of tasks, giving each one an offset of where to start, spaced less than 100 apart. Then, each task could create individual tasks that do the actual processing.
But what if there are so many entities that the original request can't add all the necessary scheduling tasks? This makes me think I need a recursive solution - each task looks at the range it is given. If only one element is present in the range, it does processing on it. Otherwise, it subdivides the range further into subsequent tasks.
If I can't count on using offsets and limits to identify entities (because their ordering isn't ensured to be constant), maybe I could just use their keys? But then I could be sending 1000s of keys around, which seems unwieldy.
Am I going down the right path here, or is there another design I should consider?
When you run code like taskqueue.add(url='/worker', params={'cursor': cursor}) you are enqueueing a task; scheduling a request to execute out of band using the parameters you provide. You can apparently schedule up to 100 of these in one operation.
I don't think you want to, though. Task chaining would make this a lot simpler:
Your worker task would do something like this:
1. Run a query to fetch some records for processing. If a cursor was provided in the task params, use it. Limit the query to 10 records, or whatever you think can finish in 30 seconds.
2. Process your 10 records.
3. If your query returned 10 records, enqueue another task and pass it the updated cursor from your query so it can pick up where you left off.
4. If you got fewer than 10 records, you're done. Hooray! Fire off an email or something and quit.
With this route, you only need to kick off the first task, and the rest add themselves.
Note that if a task fails, App Engine will retry it until it succeeds, so you don't need to worry about a datastore hiccup causing one task to timeout and break the chain.
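A hedged sketch of such a worker, reusing MyKind from the question (the /worker URL and the process() call are placeholders):

from google.appengine.api import taskqueue
from google.appengine.ext import webapp

BATCH_SIZE = 10

class Worker(webapp.RequestHandler):
    def post(self):
        query = MyKind.all()
        cursor = self.request.get('cursor')
        if cursor:
            query.with_cursor(cursor)  # resume where the previous task stopped
        entities = query.fetch(BATCH_SIZE)
        for entity in entities:
            process(entity)  # hypothetical per-entity work
        if len(entities) == BATCH_SIZE:
            # Probably more to do: chain the next task with the fresh cursor.
            taskqueue.add(url='/worker', params={'cursor': query.cursor()})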
Edit:
The steps above do not guarantee that an entity will be processed only once. Tasks should generally run only once, but Google does advise you to design for idempotence. If it's a major concern, here's one way to handle it:
Put a status flag on each entity to be processed, or create a complementary entity to hold the flag. It should have states akin to Pending, Processing, and Processed.
When you fetch a new entity to process, transactionally check and advance the flag: only process the entity if it's Pending, and advance the flag again when processing finishes.
Note that it's not strictly necessary to add the processing flag to every entity before you start. Your "pending" state can just mean the property or corresponding entity doesn't exist yet.
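One hedged way to implement the claim step with the old db API; the status property and its string states are the assumed addition described above:

from google.appengine.ext import db

def claim(key):
    """Atomically advance an entity from 'pending' to 'processing'.

    Returns the entity if this task won the claim, None if another did.
    """
    def txn():
        entity = db.get(key)
        if entity is None or entity.status != 'pending':
            return None
        entity.status = 'processing'
        entity.put()
        return entity
    return db.run_in_transaction(txn)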
Also, depending on your design, you can do what I did, which is to number all of the records that need to be processed. I process about 3,500 items, each taking around 3 seconds. To avoid overlap and timeouts, and to account for future growth, my first task fetches the list of all the unique items of that kind from the database. It then divides that list into chunks of 500 identifiers each, looping until it has accounted for every unique item, and posts each chunk of 500 identifiers to a second tier of handler tasks. Each of the second-tier handler tasks (currently seven or eight of them) has a unique list of 500 items, and each of those adds 500 tasks, one per unique identifier.
Since it's all managed through loops, and the counting is based on the number of unique items in my database, I can add as many unique items as I want and the number of tasks will expand to accommodate them with no duplication. I use it to track prices in a game on a daily basis, so it's all fired off by a cron job and requires no intervention on my part at all.
I am trying to use SimpleDB in the following way.
I want to keep 48 hours' worth of data in SimpleDB at any time and query it for different purposes.
Each domain holds 1 hour's worth of data, so at any time there are 48 domains present in SimpleDB.
As new data is constantly uploaded, I delete the oldest domain and create a new domain for each new hour.
Each domain is about 50 MB in size; the total size of all the domains is around 2.2 GB.
Each item in a domain has the following attributes:
identifier - around 50 characters long -- 1 per item
timestamp - timestamp value -- 1 per item
serial_n_data - 500-1000 bytes of data -- 200 per item
I'm using python boto library to upload and query the data.
I insert about 1 item per second, each with around 200 attributes, into the current domain.
For one of the application of this data, I need to get all the data from all the 48 domains.
The Query looks like, "SELECT * FROM domain", for all the domains.
I use 8 threads to query data with each thread taking responsibility of few domains.
e.g. domains 1-6: thread 1; domains 7-12: thread 2; and so on.
It takes close to 13 minutes to get all the data. I am using boto's select method for this, and I need much faster performance. Any suggestions for speeding up the querying process? Is there another language I could use that would speed things up?
Use more threads
I would suggest inverting your threads/domain ratio from 1/6 to something closer to 30/1. Most of the time taken to pull down large chunks of data from SimpleDB is going to be spent waiting. In this situation upping the thread count will vastly improve your throughput.
One of the limits of SimpleDB is the query response size cap at 1 MB. This means pulling down the 50 MB in a single domain will take a minimum of 50 Selects (the original plus 49 additional pages). These must occur sequentially, because the NextToken from the current response is needed for the next request. If each Select takes 2+ seconds (not uncommon with large responses and high request volume), you spend 2 minutes on each domain. If every thread has to iterate through each of 6 domains in turn, that's about 12 minutes right there. One thread per domain should cut that down to about 2 minutes easily.
But you should be able to do much better than that. SimpleDB is optimized for concurrency. I would try 30 threads per domain, giving each thread a portion of the hour to query on, since it is log data after all. For example:
SELECT * FROM domain WHERE timestamp between '12:00' and '12:02'
(Obviously, you'd use real timestamp values) All 30 queries can be kicked off without waiting for any responses. In this way you still need to make at least 50 queries per domain, but instead of making them all sequentially you can get a lot more concurrency. You will have to test for yourself how many threads gives you the best throughput. I would encourage you to try up to 60 per domain, breaking the Select conditions down to one minute increments. If it works for you then you will have fully parallel queries and most likely have eliminated all follow up pages. If you get 503 ServiceUnavailable errors, scale back the threads.
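A hedged boto sketch of that fan-out, with one thread per one-minute slice of a domain (the domain name and the minute_ranges helper are hypothetical):

import threading
import boto

def fetch_slice(domain, start, end, out):
    # boto's select() is a generator that follows NextToken pages for you.
    query = ("select * from `%s` where timestamp between '%s' and '%s'"
             % (domain.name, start, end))
    out.extend(domain.select(query))

sdb = boto.connect_sdb()
domain = sdb.get_domain('hour-00')  # hypothetical domain name
results = []
threads = [threading.Thread(target=fetch_slice, args=(domain, s, e, results))
           for (s, e) in minute_ranges()]  # hypothetical (start, end) pairs
for t in threads:
    t.start()
for t in threads:
    t.join()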
The domain is the basic unit of scalability for SimpleDB, so it is good that you have a convenient way to partition your data. You just need to take advantage of the concurrency. Rather than 13 minutes, I wouldn't be surprised if you were able to get the data in 13 seconds for an app running on EC2 in the same region. But the actual time it takes will depend on a number of other factors.
Cost Concerns
As a side note, I should mention the costs of what you are doing, even though you haven't raised the issue. CreateDomain and DeleteDomain are heavyweight operations. Normally I wouldn't advise using them so often. You are charged about 25 seconds of box usage each time so creating and deleting one each hour adds up to about $70 per month just for domain management. You can store orders of magnitude more data in a domain than the 50MB you mention. So you might want to let the data accumulate more before you delete. If your queries include the timestamp (or could be made to include the timestamp) query performance may not be hurt at all by having an extra GB of old data in the domain. In any case, GetAttributes and PutAttributes will never suffer a performance hit with a large domain size, it is only queries that don't make good use of a selective index. You'd have to test your queries to see. That is just a suggestion, I realize that the create/delete is cleaner conceptually.
Also, writing 200 attributes at a time is expensive, due to a quirk in the box usage formula: the box usage for writes is proportional to the number of attributes raised to the power of 3! The formula, in hours, is:
0.0000219907 + 0.0000000002 N^3
For the base charge plus the per-attribute charge, where N is the number of attributes. In your situation, if you write all 200 attributes in a single request, the box usage charges will be about $250 per million items ($470 per million if you write 256 attributes). If you break each request into 4 requests with 50 attributes each, you will quadruple your PutAttributes volume but reduce the box usage charges by an order of magnitude, to about $28 per million items. If you are able to break the requests down, it may be worth doing. If you cannot (due to request volume, or just the nature of your app), it means that SimpleDB can end up being extremely unappealing from a cost standpoint.
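For a quick check of that arithmetic, here is the formula evaluated in Python; the $0.154 per box-usage hour rate is an assumption, not from the answer:

RATE = 0.154  # assumed USD per box-usage hour

def box_hours(n_attrs):
    # Box usage per PutAttributes request, from the formula above.
    return 0.0000219907 + 0.0000000002 * n_attrs ** 3

print(1e6 * box_hours(200) * RATE)      # one 200-attribute request/item: ~$250
print(1e6 * 4 * box_hours(50) * RATE)   # four 50-attribute requests/item: ~$29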
I have had the same issue as you, Charlie. After profiling the code, I narrowed the performance problem down to SSL. It seems like that is where it spends most of its time, and hence most of its CPU cycles.
I have read of a problem in the httplib library (which boto uses for SSL) where the performance doesn't increase unless the packets are over a certain size, though that was for Python 2.5 and may have already been fixed.
SDB Explorer uses multithreaded BatchPutAttributes calls to achieve high write throughput while uploading bulk data to Amazon SimpleDB. It allows multiple parallel uploads: if you have the bandwidth, you can take full advantage of it by running a number of BatchPutAttributes operations at once in a parallel queue, which reduces the time spent processing.
http://www.sdbexplorer.com/