I am running a Flask server which loads data into a MongoDB database. Since there is a large amount of data, and this takes a long time, I want to do this via a background job.
I am using Redis as the message broker and Python-rq to implement the job queues. All the code runs on Heroku.
As I understand, python-rq uses pickle to serialise the function to be executed, including the parameters, and adds this along with other values to a Redis hash value.
Since the parameters contain the information to be saved to the database, it quite large (~50MB) and when this is serialised and saved to Redis, not only does it take a noticeable amount of time but it also consumes a large amount of memory. Redis plans on Heroku cost $30 p/m for 100MB only. In fact I every often get OOM errors like:
OOM command not allowed when used memory > 'maxmemory'.
I have two questions:
Is python-rq well suited to this task or would Celery's JSON serialisation be more appropriate?
Is there a way to not serialise the parameter but rather a reference to it?
Your thoughts on the best solution are much appreciated!
Since you mentioned in your comment that your task input is a large list of key value pairs, I'm going to recommend the following:
Load up your list of key/value pairs in a file.
Upload the file to Amazon S3.
Get the resulting file URL, and pass that into your RQ task.
In your worker task, download the file.
Parse the file line-by-line, inserting the documents into Mongo.
Using the method above, you'll be able to:
Quickly break up your tasks into manageable chunks.
Upload these small, compressed files to S3 quickly (use gzip).
Greatly reduce your redis usage by requiring much less data to be passed over the wires.
Configure S3 to automatically delete your files after a certain amount of time (there are S3 settings for this: you can have it delete automatically after 1 day, for instance).
Greatly reduce memory consumption on your worker by processing the file one line at-a-time.
For use cases like what you're doing, this will be MUCH faster and require much less overhead than sending these items through your queueing system.
Hope this helps!
It turns out that the solution that worked for is to save the data to Amazon S3 storage, and then pass the URI to function in the background task.
Related
I am working with a legacy project of Kubeflow, the pipelines have a few components in order to apply some kind of filters to data frame.
In order to do this, each component downloads the data frame from S3 applies the filter and uploads it into S3 again.
In the components where the data frame is used for training or validating the models, download from S3 the data frame.
The question is about if this is a best practice, or is better to share the data frame directly between components, because the upload to the S3 can fail, and then fail the pipeline.
Thanks
As always with questions asking for "best" or "recommended" method, the primary answer is: "it depends".
However, there are certain considerations worth spelling out in your case.
Saving to S3 in between pipeline steps.
This stores intermediate result of the pipeline and as long as the steps take long time and are restartable it may be worth doing that. What "long time" means is dependent on your use case though.
Passing the data directly from component to component. This saves you storage throughput and very likely the not insignificant time to store and retrieve the data to / from S3. The downside being: if you fail mid-way in the pipeline, you have to start from scratch.
So the questions are:
Are the steps idempotent (restartable)?
How often the pipeline fails?
Is it easy to restart the processing from some mid-point?
Do you care about the processing time more than the risk of loosing some work?
Do you care about the incurred cost of S3 storage/transfer?
The question is about if this is a best practice
The best practice is to use the file-based I/O and built-in data-passing features. The current implementation uploads the output data to storage in upstream components and downloads the data in downstream components. This is the safest and most portable option and should be used until you see that it no longer works for you (100GB datasets will probably not work reliably).
or is better to share the data frame directly between components
How can you "directly share" in-memory python object between different python programs running in containers on different machines?
because the upload to the S3 can fail, and then fail the pipeline.
The failed pipeline can just be restarted. The caching feature will make sure that already finished tasks won't be re-executed.
Anyways, what is the alternative? How can you send the data between distributed containerized programs without sending it over the network?
Recently I had to import 48,000 records into Google App Engine. The stored 'tables' are 'ndb.model' types. Each of these records is checked against a couple of other 'tables' in the 'database' for integrity purposes and then written (.put()).
To do this, I uploaded a .csv file into Google Cloud Storage and processed it from there in a task queue. This processed about 10 .csv rows per second and errored after 41,000 records with an out of memory error. Splitting the .csv file into 2 sets of 24,000 records each fixed this problem.
So, my questions are:
a) is this the best way to do this?
b) is there a faster way (the next upload might be around 400,000 records)? and
c) how do I get over (or stop) the out of memory error?
Many thanks,
David
1) Have you thought about (even temporarily) upgrading your server instances?
https://cloud.google.com/appengine/docs/standard/#instance_classes
2) I don't think a 41000 row csv is enough to run out of memory, so you probably need to change your processing:
a) Break up the processing using multiple tasks, rolling your own cursor to process a couple thousand at a time, then spinning up a new task.
b) Experiment with ndb.put_multi()
Sharing some code of your loop and puts might help
The ndb in-context cache could be contributing to the memory errors. From the docs:
With executing long-running queries in background tasks, it's possible for the in-context cache to consume large amounts of memory. This is because the cache keeps a copy of every entity that is retrieved or stored in the current context. To avoid memory exceptions in long-running tasks, you can disable the cache or set a policy that excludes whichever entities are consuming the most memory.
You can prevent caching on a case by case basis by setting a context option in your ndb calls, for example
foo.put(use_cache=False)
Completely disabling caching might degrade performance if you are often using the same objects for your comparisons. If that's the case, you could flush the cache periodically to stop it getting too big.
if some_condition:
context = ndb.get_context()
context.clear_cache()
Part of a project I am involved in consists of developing a scientific application for internal usage, working with large collection of files (about 20000) each of which is ~100Mb in size. Files are accompanied with meta information, used to select subsets of the whole set.
Update after reading response
Yes, processing is located in the single server room.
The application selects two subsets of these files. On the first stage it processes each file individually and independently yielding up to 30 items from one file, for the second stage. Each resulting item is also stored in a file, and file size varies from 5 to 60Kb.
On the second stage the app processes all possible pairs of results that have been produced on the first stage where the first element of a pair has came from the 1st subset and 2nd - from the 2nd - a cross-join or Cartesian product of two sets.
Typical amount of items in the first subset is thousands, and in the 2nd - tens of thousands. Therefore amount of all possible pairs in the second stage is hundreds of millions.
Typical time for processing of a single source 100Mb file is about 1 second, of a single pair of 1st stage results - microseconds. The application is not for real-time processing, its general use case would be to submit a job for overnight calculation and obtain results in the morning.
We have got already a version of an application, developed earlier, when we have had much less data. It is developed in Python and uses file system and data structures from the Python library. The computations are performed on 10 PCs, connected with self-developed software, written with Twisted. Files are stored on NAS and on PCs' local drives. Now the app performs very poorly, especially on the second stage and after it, during aggregation of results.
Currently I am looking at MongoDB to accomplish this task. However, I do not have too much experience with such tools and open for suggestions.
I have conducted some experiments with MongoDB and PyMongo and have found that loading the whole file from the database takes about 10 seconds over the Gigabit ethernet. Minimal chunk size for processing is ~3Mb and it is retrieved for 320 msec. Loading files from a local drive is faster.
MongoDB config contained a single line with a path.
However, very appealing feature of the database is its ability to store metainformation and support search for it, as well as automatic replication.
This is also a persistent data storage, therefore, the computations can be continued after accidental stop (currently we have to start over).
So, my questions are.
Is MongoDB a right choice?
If yes, then what are the guide lines for a data model?
Is it possible to improve retrieval time for files?
Or, is it reasonable to store files in a file system, as before, and store paths to them in the database?
Creation of a list of all possible pairs for the 2nd stage has been performed in the client python code and also took rather long time (I haven't measured it).
Will MongoDB server do better?
In this case you could go for sharded gridfs as a part of mongo more here
That will allow for faster file retrieval process and still have metadata together with file record.
other way to speed up when using only replica set is to have a kind of logic balancer and get files one time from master other time form slave (or other slave in a king of roundRobin way).
Storing files in file system will be always a bit faster and as long this is about one server room (-> processed locally) - I probably will stick to it, but with huge concern of backup.
What's the best way to handle tasks executed in Celery where the result is large? I'm thinking of things like table dumps and the like, where I might be returning data in the hundreds of megabytes.
I'm thinking that the naive approach of cramming the message into the result database is not going to serve me here, much less if I use AMQP for my result backend. However, I have some of these where latency is an issue; depending on the particular instance of the export, sometimes I have to block until it returns and directly emit the export data from the task client (an HTTP request came in for the export content, it doesn't exist, but must be provided in the response to that request ... no matter how long that takes)
So, what's the best way to write tasks for this?
One option would be to have a static HTTP server running on all of your worker machines. Your task can then dump the large result to a unique file in the static root and return a URL reference to the file. The receiver can then fetch the result at its leisure.
eg. Something vaguely like this:
#task
def dump_db(db):
# Some code to dump the DB to /srv/http/static/db.sql
return 'http://%s/%s.sql' % (socket.gethostname(), db)
You would of course need some means of reaping old files, as well as guaranteeing uniqueness, and probably other issues, but you get the general idea.
I handle this by structuring my app to write the multi-megabyte results into files, which I them memmap into memory so they are shared among all processes that use that data... This totally finesses the question of how to get the results to another machine, but if the results are that large, it sounds like the these tasks are internal tasks coordinate between server processes.
I have pretty standard Django+Rabbitmq+Celery setup with 1 Celery task and 5 workers.
Task uploads the same (I simplify a bit) big file (~100MB) asynchronously to a number of remote PCs.
All is working fine at the expense of using lots of memory, since every task/worker load that big file into memory separatelly.
What I would like to do is to have some kind of cache, accessible to all tasks, i.e. load the file only once. Django caching based on locmem would be perfect, but like documentation says: "each process will have its own private cache instance" and I need this cache accessible to all workers.
Tried to play with Celery signals like described in #2129820, but that's not what I need.
So the question is: is there a way I can define something global in Celery (like a class based on dict, where I could load the file or smth). Or is there a Django trick I could use in this situation ?
Thanks.
Why not simply stream the upload(s) from disk instead of loading the whole file in memory ?
It seems to me that what you need is memcached backed for django. That way each task in Celery will have access to it.
Maybe you can use threads instead of processes for this particular task. Since threads all share the same memory, you only need one copy of the data in memory, but you still get parallel execution.
( this means not using Celery for this task )