What's the best way to handle tasks executed in Celery where the result is large? I'm thinking of things like table dumps and the like, where I might be returning data in the hundreds of megabytes.
I'm thinking that the naive approach of cramming the message into the result database is not going to serve me here, even less so if I use AMQP for my result backend. However, latency is an issue for some of these; depending on the particular instance of the export, sometimes I have to block until it returns and directly emit the export data from the task client (an HTTP request came in for the export content, it doesn't exist yet, but it must be provided in the response to that request, no matter how long that takes).
So, what's the best way to write tasks for this?
One option would be to have a static HTTP server running on all of your worker machines. Your task can then dump the large result to a unique file in the static root and return a URL reference to the file. The receiver can then fetch the result at its leisure.
E.g., something vaguely like this:
import socket
from celery import shared_task

@shared_task
def dump_db(db):
    # Some code to dump the DB to /srv/http/static/<db>.sql
    return 'http://%s/%s.sql' % (socket.gethostname(), db)
You would of course need some means of reaping old files, as well as guaranteeing uniqueness, and probably other issues, but you get the general idea.
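The uniqueness and reaping concerns could be handled roughly as follows. This is only a sketch: the function names `write_unique_dump` and `reap_old_files`, the `.sql` suffix, and the age threshold are assumptions for illustration, not part of Celery.

```python
import os
import time
import uuid


def write_unique_dump(static_root, data, suffix=".sql"):
    """Write data (bytes) to a uniquely named file and return the file name."""
    name = uuid.uuid4().hex + suffix
    path = os.path.join(static_root, name)
    with open(path, "wb") as f:
        f.write(data)
    return name


def reap_old_files(static_root, max_age_seconds, now=None):
    """Delete files older than max_age_seconds; return the removed names."""
    now = time.time() if now is None else now
    removed = []
    for name in os.listdir(static_root):
        path = os.path.join(static_root, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return removed
```

The task would return a URL built from the generated name, and something like a periodic job would call `reap_old_files` on the static root.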
I handle this by structuring my app to write the multi-megabyte results into files, which I then memmap into memory so they are shared among all processes that use that data. This totally finesses the question of how to get the results to another machine, but if the results are that large, it sounds like these tasks are internal tasks that coordinate between server processes.
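A minimal sketch of the memory-mapping idea, using the standard library `mmap` module. The helper name `open_shared_result` is hypothetical, and it assumes the result file is non-empty (mapping an empty file raises an error).

```python
import mmap


def open_shared_result(path):
    """Map a result file read-only; returns (file_object, mmap_object)."""
    f = open(path, "rb")
    # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return f, mm
```

Each process that needs the data calls this on the same path and slices the mapping like a bytes object (for example `mm[:100]`), closing both objects when done.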
This is a rather specific question to advanced users of celery. Let me explain the use case I have:
Usecase
I have to run ~1k-100k tasks, each of which runs a simulation (a movie) and returns the simulation data as a rather large list of smaller objects (frames), say 10k-100k per frame and 1k frames. So the total amount of data produced will be very large, but assume that I have a database that can handle this. Speed is not a key factor here. Later I need to compute features from each frame, which can be done completely independently.
Each frame looks like a dict that points to some numpy arrays and other simple data like strings and numbers, and has a unique identifier (UUID).
What matters is that the final objects of interest are arbitrary joins and splits of these generated lists. As a metaphor, consider the resulting movies being chopped up and recombined into new movies. These final lists (movies) are then basically lists of references to the frames by their UUIDs.
Now, I consider using celery to get these first movies and since these will end up in the backend DB anyway I might just keep these results indefinitely, at least the ones I specify to keep.
My question
Can I configure a backend, preferably a NoSQL DB, in a way that keeps the results so I can access them later, independently of Celery, using the object's UUID? And if so, does that make sense in terms of overhead, performance, etc.?
Another possibility would be to not return anything and let the worker store the result in a DB. Is that preferred? It seems unnecessary to have a second channel of communication to another DB when Celery can do this already.
I am also interested in comments on using Celery in general for highly independent tasks that run long (>1h) and return large result objects. A failure is not problematic; the task can just be restarted. The resulting movies are stochastic, so functional approaches can be problematic. Even storing the random seed might not guarantee reproducible results, although I do not have side effects. I just might have lots of workers available that are widely distributed. Imagine lots of desktop machines in a closed environment where every worker helps even if it is slow. Network speed and security are not an issue here. I know that this is not the original use case, but Celery seemed very easy to use for these cases. The best analogy I found is projects like Folding@Home.
Can I configure a backend, preferably a NoSQL DB, in a way that keeps the results so I can access them later, independently of Celery, using the object's UUID?
Yes, you can configure celery to store its results in a NoSQL database such as redis for access by UUID later. The two settings that will control the behavior of interest for you are result_expires and result_backend.
result_backend will specify which NoSQL database you want to store your results in (e.g., elasticsearch or redis) while result_expires will specify how long after a task completes that the task's result will be available for access.
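As a rough illustration, the two settings could live in a configuration module like the one below. The module name, the Redis URL (a server on localhost, database 0), and the 30-day retention are all assumptions; adjust them to your setup.

```python
# celeryconfig.py (hypothetical module name)
from datetime import timedelta

# Assumption: a Redis server on localhost, database 0.
result_backend = "redis://localhost:6379/0"

# Keep results for 30 days; Celery also accepts a plain number of seconds.
result_expires = timedelta(days=30)
```

A module like this can be loaded with `app.config_from_object("celeryconfig")`. If results should be kept indefinitely, the Celery documentation describes setting `result_expires` to `None`, subject to what the chosen backend supports.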
After the task completes, you can access the results in python like this:
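from celery.result import AsyncResult

result = task_name.delay()  # task_name is a placeholder for your Celery task
uuid = result.id
print(uuid)

# Later, possibly in another process, look the result up again by its UUID
checked_result = AsyncResult(uuid)
# checked_result.ready() tells you whether the task has finished,
# and checked_result.get() returns the stored return value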
And if so does that make sense because of overhead and performance, etc.
I think this strategy makes perfect sense. I have used it a number of times when generating long-running reports for web users. The initial POST returns the UUID of the Celery task. The web client can then poll the app server via JavaScript, using the UUID, to see whether the task is complete. Once the report is ready, the page can redirect the user to a route that lets them download or view the report by passing in the UUID.
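The polling endpoint in that flow might build its response along these lines. This is a framework-independent sketch: `report_status`, the payload keys, and `download_url` are hypothetical; only `ready()` and `successful()` are methods of Celery's `AsyncResult`.

```python
def report_status(async_result, download_url):
    """Build the payload a polling endpoint could return for one task.

    async_result is anything exposing ready() and successful(), as
    celery.result.AsyncResult does.
    """
    if not async_result.ready():
        return {"state": "pending"}
    if async_result.successful():
        return {"state": "done", "url": download_url}
    return {"state": "failed"}
```

The client keeps polling while the state is "pending" and follows the URL once the state is "done".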
I am running a Flask server which loads data into a MongoDB database. Since there is a large amount of data, and this takes a long time, I want to do this via a background job.
I am using Redis as the message broker and Python-rq to implement the job queues. All the code runs on Heroku.
As I understand, python-rq uses pickle to serialise the function to be executed, including the parameters, and adds this along with other values to a Redis hash value.
Since the parameters contain the information to be saved to the database, the payload is quite large (~50MB). When it is serialised and saved to Redis, it not only takes a noticeable amount of time but also consumes a large amount of memory. Redis plans on Heroku cost $30 p/m for only 100MB. In fact, I very often get OOM errors like:
OOM command not allowed when used memory > 'maxmemory'.
I have two questions:
Is python-rq well suited to this task or would Celery's JSON serialisation be more appropriate?
Is there a way to not serialise the parameter but rather a reference to it?
Your thoughts on the best solution are much appreciated!
Since you mentioned in your comment that your task input is a large list of key value pairs, I'm going to recommend the following:
Load up your list of key/value pairs in a file.
Upload the file to Amazon S3.
Get the resulting file URL, and pass that into your RQ task.
In your worker task, download the file.
Parse the file line-by-line, inserting the documents into Mongo.
Using the method above, you'll be able to:
Quickly break up your tasks into manageable chunks.
Upload these small, compressed files to S3 quickly (use gzip).
Greatly reduce your redis usage by requiring much less data to be passed over the wires.
Configure S3 to automatically delete your files after a certain amount of time (there are S3 settings for this: you can have it delete automatically after 1 day, for instance).
Greatly reduce memory consumption on your worker by processing the file one line at-a-time.
For use cases like what you're doing, this will be MUCH faster and require much less overhead than sending these items through your queueing system.
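The file-based part of this approach can be sketched as below. The S3 upload/download and the RQ enqueue are deliberately left out; `write_pairs`, `load_pairs`, the JSON-lines layout, and the `insert_one` callback (standing in for something like a pymongo collection's `insert_one`) are assumptions for illustration.

```python
import gzip
import json


def write_pairs(path, pairs):
    """Write (key, value) pairs as gzip-compressed JSON, one document per line."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for key, value in pairs:
            f.write(json.dumps({"key": key, "value": value}) + "\n")


def load_pairs(path, insert_one):
    """Stream the file line by line, passing each document to insert_one."""
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            insert_one(json.loads(line))
            count += 1
    return count
```

The web process would call `write_pairs`, upload the file, and enqueue only the URL; the worker would download the file and call `load_pairs`, so only one line is held in memory at a time.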
Hope this helps!
It turns out that the solution that worked for me is to save the data to Amazon S3 storage and then pass the URI to the function in the background task.
Is it viable to have a logger entity in app engine for writing logs? I'll have an app with ~1500req/sec and am thinking about doing it with a taskqueue. Whenever I receive a request, I would create a task and put it in a queue to write something to a log entity (with a date and string properties).
I need this because I have to show statistics on the site, and I think that writing logs this way and reading them with a backend later would solve the problem. It would rock if I had programmatic access to the App Engine logs (from logging), but since that's unavailable, I don't see any other way to do it.
Feedback is much welcome
There are a few ways to do this:
Accumulate logs and write them in a single datastore put at the end of the request. This is the highest latency option, but only slightly - datastore puts are fairly fast. This solution also consumes the least resources of all the options.
Accumulate logs and enqueue a task queue task with them, which writes them to the datastore (or does whatever else you want with them). This is slightly faster (task queue enqueues tend to be quick), but it's slightly more complicated, and limited to 100kb of data (which hopefully shouldn't be a limitation).
Enqueue a pull task with the data, and have a regular push task or a backend consume the queue and batch-and-insert into the datastore. This is more complicated than option 2, but also more efficient.
Run a backend that accumulates and writes logs, and make URLFetch calls to it to store logs. The urlfetch handler can write the data to the backend's memory and return asynchronously, making this the fastest in terms of added user latency (less than 1ms for a urlfetch call)! This will require waiting for Python 2.7, though, since you'll need multi-threading to process the log entries asynchronously.
You might also want to take a look at the Prospective Search API, which may allow you to do some filtering and pre-processing on the log data.
How about keeping a memcache data structure of request info (recorded as requests arrive) and then running a cron job every 5 minutes (or more often) that crunches the stats on the last 5 minutes of requests from memcache and records just those stats in the datastore for that 5-minute interval? The same (or a different) cron job could then clear the memcache too, so that it doesn't get too big.
Then you can run big-picture analysis based on the aggregate of 5 minute interval stats, which might be more manageable than analyzing hours of 1500req/s data.
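The crunching step could look something like this. It is a plain-Python sketch that ignores memcache and cron entirely; the `(timestamp, path)` record shape and the function names are assumptions.

```python
from collections import Counter

INTERVAL_SECONDS = 300  # five minutes


def bucket_start(timestamp):
    """Round a Unix timestamp down to the start of its 5-minute interval."""
    return int(timestamp) - int(timestamp) % INTERVAL_SECONDS


def crunch(requests):
    """Count requests per (interval start, path) from (timestamp, path) records."""
    stats = Counter()
    for timestamp, path in requests:
        stats[(bucket_start(timestamp), path)] += 1
    return stats
```

The cron job would read the recorded requests, call `crunch`, and store one small entity per interval instead of one per request.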
I have pretty standard Django+Rabbitmq+Celery setup with 1 Celery task and 5 workers.
Task uploads the same (I simplify a bit) big file (~100MB) asynchronously to a number of remote PCs.
Everything works fine, at the expense of using lots of memory, since every task/worker loads that big file into memory separately.
What I would like is some kind of cache accessible to all tasks, i.e. load the file only once. Django caching based on locmem would be perfect, but as the documentation says, "each process will have its own private cache instance", and I need this cache to be accessible to all workers.
Tried to play with Celery signals like described in #2129820, but that's not what I need.
So the question is: is there a way I can define something global in Celery (like a class based on dict, where I could load the file, or something similar)? Or is there a Django trick I could use in this situation?
Thanks.
Why not simply stream the upload(s) from disk instead of loading the whole file in memory ?
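Streaming from disk could be as simple as reading the file in fixed-size pieces. A minimal sketch, where `iter_chunks` and the 64 KB default are illustrative choices:

```python
def iter_chunks(path, chunk_size=64 * 1024):
    """Yield the file's contents in pieces of at most chunk_size bytes."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

Each upload then sends one chunk at a time, so memory use per task stays around `chunk_size` rather than the full ~100MB.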
It seems to me that what you need is the memcached backend for Django. That way each task in Celery will have access to it.
Maybe you can use threads instead of processes for this particular task. Since threads all share the same memory, you only need one copy of the data in memory, but you still get parallel execution.
( this means not using Celery for this task )
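A sketch of the thread-based variant, using the standard library's `concurrent.futures`. The `send(host, data)` callable is hypothetical and stands in for whatever performs the actual upload to one remote PC.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_to_all(data, hosts, send, max_workers=5):
    """Call send(host, data) for every host in parallel threads.

    All threads receive a reference to the same bytes object, so the
    payload is held in memory once.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(send, host, data) for host in hosts]
        # Results come back in the same order as hosts.
        return [f.result() for f in futures]
```

Since uploads are I/O-bound, threads can overlap them reasonably well even under the GIL.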
i'm looking to write a daemon that:
reads a message from a queue (sqs, rabbit-mq, whatever ...) containing a path to a zip file
updates a record in the database saying something like "this job is processing"
reads the aforementioned archive's contents and inserts a row into a database w/ information culled from file meta data for each file found
duplicates each file to s3
deletes the zip file
marks the job as "complete"
read next message in queue, repeat
this should be running as a service, and initiated by a message queued when someone uploads a file via the web frontend. the uploader doesn't need to immediately see the results, but the upload should be processed in the background fairly expediently.
i'm fluent with python, so the very first thing that comes to mind is writing a simple server with twisted to handle each request and carry out the process mentioned above. but, i've never written anything like this that would run in a multi-user context. it's not going to service hundreds of uploads per minute or hour, but it'd be nice if it could handle several at a time, reasonably. i also am not terribly familiar with writing multi-threaded applications and dealing with issues like blocking.
how have people solved this in the past? what are some other approaches i could take?
thanks in advance for any help and discussion!
I've used Beanstalkd as a queueing daemon to very good effect (some near-time processing and image resizing - over 2 million so far in the last few weeks). Throw a message into the queue with the zip filename (maybe from a specific directory) [I serialise a command and parameters in JSON], and when you reserve the message in your worker-client, no one else can get it, unless you allow it to time out (when it goes back to the queue to be picked up).
The rest is the unzipping and uploading to S3, for which there are other libraries.
If you want to handle several zip files at once, run as many worker processes as you want.
I would avoid doing anything multi-threaded and instead use the queue and the database to synchronize as many worker processes as you care to start up.
For this application I think twisted or any framework for creating server applications is going to be overkill.
Keep it simple. Python script starts up, checks the queue, does some work, checks the queue again. If you want a proper background daemon you might want to just make sure you detach from the terminal as described here: How do you create a daemon in Python?
Add some logging, maybe a try/except block to email out failures to you.
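The "check the queue, do some work, check again" loop might be sketched like this. An in-process `queue.Queue` stands in for the real message queue, and `set_status`, `record_file`, and `upload_file` are hypothetical callbacks for the database updates and the S3 copy.

```python
import os
import queue
import zipfile


def process_zip(zip_path, record_file, upload_file):
    """Record and upload every file in the archive, then delete the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            record_file(info.filename, info.file_size)
            with archive.open(info) as member:
                upload_file(info.filename, member)
    os.remove(zip_path)


def run_worker(jobs, set_status, record_file, upload_file):
    """Process queued zip paths until the queue is empty."""
    while True:
        try:
            zip_path = jobs.get_nowait()
        except queue.Empty:
            break
        set_status(zip_path, "processing")
        process_zip(zip_path, record_file, upload_file)
        set_status(zip_path, "complete")
```

A real daemon would block waiting for the next message instead of exiting on an empty queue, and would wrap `process_zip` in a try/except to log or email failures. Running several copies of the script gives parallelism without threads.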
i opted to use a combination of celery (http://ask.github.com/celery/introduction.html), rabbitmq, and a simple django view to handle uploads. the workflow looks like this:
django view accepts, stores upload
a celery Task is dispatched to process the upload. all work is done inside the Task.