We're designing a system that will take thousands of rows at a time and send them via JSON to a REST API built on Google App Engine. Typically 3-300KB of data but let's say in extreme cases a few MB.
The REST API app will then adapt this data to models on the server and save them to the Datastore. Are we likely to (eventually if not immediately) encounter any performance bottlenecks here with Google App Engine, whether it's working with that many models or saving so many rows of data at a time to the datastore?
The client does a GET to get thousands of records, then a PUT with thousands of records. Is there any reason for this to take more than a few seconds, and necessitate the need for a Task queues API?
The only bottleneck in App Engine (apart from the single entity group limitation) is how many entities you can process in a single thread on a single instance. This number depends on your use case and the quality of your code. Once you reach a limit, you can (a) use a more powerful instance, (b) use multi-threading and/or (c) add more instances to scale up your processing capacity to any level you desire.
Task API is a very useful tool for large data loads. It allows you to split your job into a large number of smaller tasks, set the desired processing rate, and let App Engine automatically adjust the number of instances to meet that rate. Another option is a MapReduce API.
This is a really good question, one that I've been asked in interviews, seen pop up in a lot of different situations as well. Your system essentially consists of two things:
Savings (or writing) models to the data store
Reading from the data store.
From my experience of this problem, when you view these two things differently you're able to come up with solid solutions to both. I typically use a cache, such as memcachd, in order to keep data easily accessible for reading. At the same time, for writing, I try to have a main db and a few slave instances as well. All the writes will go to the slave instances (thereby not locking up the main db for reads that sync to the cache), and the writes to the slave db's can be distributed in a round robin approach there by ensuring that your insert statements are not skewed by any of the model's attributes having a high occurance.
Related
Recently I had to import 48,000 records into Google App Engine. The stored 'tables' are 'ndb.model' types. Each of these records is checked against a couple of other 'tables' in the 'database' for integrity purposes and then written (.put()).
To do this, I uploaded a .csv file into Google Cloud Storage and processed it from there in a task queue. This processed about 10 .csv rows per second and errored after 41,000 records with an out of memory error. Splitting the .csv file into 2 sets of 24,000 records each fixed this problem.
So, my questions are:
a) is this the best way to do this?
b) is there a faster way (the next upload might be around 400,000 records)? and
c) how do I get over (or stop) the out of memory error?
Many thanks,
David
1) Have you thought about (even temporarily) upgrading your server instances?
https://cloud.google.com/appengine/docs/standard/#instance_classes
2) I don't think a 41000 row csv is enough to run out of memory, so you probably need to change your processing:
a) Break up the processing using multiple tasks, rolling your own cursor to process a couple thousand at a time, then spinning up a new task.
b) Experiment with ndb.put_multi()
Sharing some code of your loop and puts might help
The ndb in-context cache could be contributing to the memory errors. From the docs:
With executing long-running queries in background tasks, it's possible for the in-context cache to consume large amounts of memory. This is because the cache keeps a copy of every entity that is retrieved or stored in the current context. To avoid memory exceptions in long-running tasks, you can disable the cache or set a policy that excludes whichever entities are consuming the most memory.
You can prevent caching on a case by case basis by setting a context option in your ndb calls, for example
foo.put(use_cache=False)
Completely disabling caching might degrade performance if you are often using the same objects for your comparisons. If that's the case, you could flush the cache periodically to stop it getting too big.
if some_condition:
context = ndb.get_context()
context.clear_cache()
This is a rather specific question to advanced users of celery. Let me explain the use case I have:
Usecase
I have to run ~1k-100k tasks that will run a simulation (movie) and return the data of the simulation as a rather large list of smaller objects (frames), say 10k-100k per frame and 1k frames. So the total amount of data produced will be very large, but assume that I have a database that can handle this. Speed is not a key factor here. Later I need to compute features from each frame which can be done completely independent.
The frames look like a dict that point to some numpy arrays and other simple data like strings and numbers and have a unique identifier UUID.
Important is that the final objects of interest are arbitrary joins and splits of these generated lists. As a metaphor consider the result movies be chop and recombined into new movies. These final lists (movies) are then basically a list of references to the frames using their UUIDs.
Now, I consider using celery to get these first movies and since these will end up in the backend DB anyway I might just keep these results indefinitely, at least the ones I specify to keep.
My question
Can I configure a backend, preferably a NonSQL DB, in a way to keep the results and access these later independent from Celery using the objects UUID. And if so does that make sense because of overhead and performance, etc.
Another possibility would be to not return anything and let the worker store the result in a DB. Is that preferred? It seems unnecessary to have a second channel of communication to another DB when Celery can do this already.
I am also interested in comments on using Celery in general for highly independent tasks that run long (>1h) and return large result objects. A fail is not problematic and can just be restarted. The resulting movies are stochastic! so functional approaches can be problematic. Even storing the random seed might not garantuee reproducible results! although I do not have side-effects. I just might have lots of workers available that are widely distributed. Imagine lots of desktop machines in a closed environment where every worker helps even if it is slow. Network speed and security is not an issue here. I know that this is not the original use case, but it seemed very easy to use it for these cases. The best analogy I found are projects like Folding#Home.
Can I configure a backend, preferably a NonSQL DB, in a way to keep the results and access these later independent from Celery using the objects UUID.
Yes, you can configure celery to store its results in a NoSQL database such as redis for access by UUID later. The two settings that will control the behavior of interest for you are result_expires and result_backend.
result_backend will specify which NoSQL database you want to store your results in (e.g., elasticsearch or redis) while result_expires will specify how long after a task completes that the task's result will be available for access.
After the task completes, you can access the results in python like this:
from celery.result import AsyncResult
result = task_name.delay()
print result.id
uuid = result.id
checked_result = AsyncResult(uuid)
# and you can access the result output here however you'd like
And if so does that make sense because of overhead and performance, etc.
I think this strategy makes perfect sense. I have typically used this a number of times when generating long-running reports for web users. The initial post will return the UUID from the celery task. The web client can poll the app sever via javascript using the UUID to see if the task is ready/complete. Once the report is ready, the page can redirect the user to the route that will allow the user to download or view the report by passing in the UUID.
I have an function that operates on a single collection, recursively doing two aggregations then updating documents. All my indexes are correct.
At the moment I cannot refactor this code, and when the function runs it slams Mongo for about 10 minutes to process the data. Seems to grow based on collection size, averaging about 60 seconds for every additional 3k documents. This collection can grow to hundreds of thousands of documents. (The documents are small - about 10 keys with very small values each.)
There is no need for the result of this function to be real-time - it is scheduled, so it's perfectly fine for me to throttle it back.
The question is, is there any way to tell mongo to limit the CPU it grants to an operation? Or should I address the throttling in Python using sleep or some other method?
recursively doing two aggregations then updating documents
It looks like you need to consider to re-model your schema. The flexibility of MongoDB documents schema is something to optimise your process. See MongoDB: Data Modeling for more infos, examples and patterns.
The question is, is there any way to tell mongo to limit the CPU it grants to an operation?
MongoDB does not have a feature to limit a CPU usage per operation. This feature may not make sense in distributed fashion. For example, a limit of 1 CPU for an operation that spans multiple shards may not be as simple/desired anymore.
Alternatively, depending on your use case, if the function does not have to be real-time, you could utilise secondary read-preference.
Essentially, directing your scheduled reporting to a secondary member, allowing your primary to process other data.
Although make sure you read the pros and cons of this secondary read beforehand.
See also MongoDB Replication
I'm looking for a way to get the size of my database. Right now, I've got a class method in my datastore model called query_all which does exactly that: it queries all entities of that model and returns them.
In my code, I call that like this:
query = MyModel.query_all(ndb.Key('MyModel', '*defaultMyModel'))
count = query.count()
return 'Database size: {}'.format(count)
Now this works, but because my database is pretty big (it has ~10,000 entities), every time I call this it takes away 20% of my Datastore Small Operations quota (I can do 0.05 million ops per day).
Is there any way for me to do this more efficiently? I think it's the .count() that's causing the problem and not the query_all(...), but I'm not sure.
Any ideas?
Thanks.
Don't use the datastore for reporting:
Rollup the data in realtime and use Memcache.
When I want to track some consumable resource:
What I have done in the past is create a QuotaManagementService that I call when I create or destroy some resource.
On every insert/delete operation you increment/decrement the count for that resource.
One easy way to do with with GAE is with MemcacheService, it even has inc()/dec() methods for you to manipulate the counts. You can also do this async so that it doesn't affect latency.
Since Memcache is an in memory cache only you would want to persist the information some how. You might want to store the value in the datastore periodically, maybe from a background task. Or you could just live with recreating it when your application is updated.
Google Analytics
You could also pump create/delete events into Google Analytics so you can just go to the dashboard and see how many calls have been made any other detailed stuff you wouldn't have to write much code for this either.
Either way this can be pretty transparent with Python
You could probably just create a Decorator to automatically inc/dec the counts and not even manage it with code.
Memcache writes are free so I don't think I would write every event to a DataStore counter if they were frequent.
Here is some example code for using the DataStore directly instead of Memcache for high volume writes to a counter.
Our situation is as follows:
We are working on a schoolproject where the intention is that multiple teams walk around in a city with smarthphones and play a city game while walking.
As such, we can have 10 active smarthpones walking around in the city, all posting their location, and requesting data from the google appengine.
Someone is behind a webbrowser,watching all these teams walk around, and sending them messages etc.
We are using the datastore the google appengine provides to store all the data these teams send and request, to store the messages and retrieve them etc.
However we soon found out we where at our max limit of reads and writes, so we searched for a solution to be able to retrieve periodic updates(which cost the most reads and writes) without using any of the limited resources google provides. And obviously, because it's a schoolproject we don't want to pay for more reads and writes.
Storing this information in global variables seemed an easy and quick solution, which it was... but when we started to truly test we noticed some of our data was missing and then reappearing. Which turned out to be because there where so many requests being done to the cloud that a new instance was made, and instances don't keep these global variables persistent.
So our question is:
Can we somehow make sure these global variables are always the same on every running instance of google appengine.
OR
Can we limit the amount of instances ever running, no matter how many requests are done to '1'.
OR
Is there perhaps another way to store this data in a better way, without using the datastore and without using globals.
You should be using memcache. If you use the ndb (new database) library, you can automatically cache the results of queries. Obviously this won't improve your writes much, but it should significantly improve the numbers of reads you can do.
You need to back it with the datastore as data can be ejected from memcache at any time. If you're willing to take the (small) chance of losing updates you could just use memcache. You could do something like store just a message ID in the datastore and have the controller periodically verify that every message ID has a corresponding entry in memcache. If one is missing the controller would need to reenter it.
Interesting question. Some bad news first, I don't think there's a better way of storing data; no, you won't be able to stop new instances from spawning and no, you cannot make seperate instances always have the same data.
What you could do is have the instances perioidically sync themselves with a master record in the datastore, by choosing the frequency of this intelligently and downloading/uploading the information in one lump you could limit the number of read/writes to a level that works for you. This is firmly in the kludge territory though.
Despite finding the quota for just about everything else, I can't find the limits for free read/write so it is possible that they're ludicrously small but the fact that you're hitting them with a mere 10 smartphones raises a red flag to me. Are you certain that the smartphones are being polled (or calling in) at a sensible frequency? It sounds like you might be hammering them unnecessarily.
Consider jabber protocol for communication between peers. Free limits are on quite high level for it.
First, definitely use memcache as Tim Delaney said. That alone will probably solve your problem.
If not, you should consider a push model. The advantage is that your clients won't be asking you for new data all the time, only when something has actually changed. If the update is small enough that you can deliver it in the push message, you won't need to worry about datastore reads on memcache misses, or any other duplicate work, for all those clients: you read the data once when it changes and push it out to everyone.
The first option for push is C2DM (Android) or APN (iOS). These are limited on the amount of data they send and the frequency of updates.
If you want to get fancier you could use XMPP instead. This would let you do more frequent updates with (I believe) bigger payloads but might require more engineering. For a starting point, see Stack Overflow questions about Android and iOS.
Have fun!