I am working on a Django project, to be deployed on Heroku, which has around 12 update functions. Each takes around 15 minutes to run. Let's call them update1(), update2(), ..., update10().
I am deploying with one worker dyno on Heroku, and I would like to run n or more of these at once. (They are not really computationally intensive; they are all HTML parsers. But the data is time-sensitive, so I would like them to be called as often as possible.)
I've read a lot of Celery and APScheduler documentation, but I'm not really sure which is the best/easiest for me. Do scheduled tasks run concurrently if their times overlap (e.g. if I run one every 2 minutes and another every 3 minutes), or do they wait until each one finishes?
Is there any way I can queue these functions so that at least a few of them are running at once? What is the suggested number of simultaneous calls for this use case?
Based on your use case description, you do not need a scheduler, so APScheduler will not match your requirements well.
Do you have a web dyno besides your worker dyno? The usual design pattern for this type of processing is to set up a control thread or control process (your web dyno) that accepts requests. These requests are then placed on a request queue.
This queue is read by one or more worker threads or worker processes (your worker dyno). I have not worked with Celery, but it looks like a match for your requirements. How many worker threads or worker dynos you will need is difficult to determine from your description. You will also need to specify how many update requests you need to process per second, and whether each request is CPU bound or IO bound.
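As a rough illustration of the pattern described above, here is a minimal sketch that uses only Python's standard library rather than Celery. The helper names `make_update` and `run_updates`, the pool size, and the sleep that stands in for fetching and parsing HTML are all assumptions made for illustration. For IO-bound work like this, a handful of threads in one process can overlap their waiting time.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def make_update(name, seconds):
    """Build a stand-in for an IO-bound update function (hypothetical)."""
    def update():
        time.sleep(seconds)  # placeholder for fetching and parsing HTML
        return name
    return update


def run_updates(updates, max_workers=3):
    """Run the given callables on a small pool of worker threads.

    Returns the results in the order the jobs finished.
    """
    finished = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fn) for fn in updates]
        for future in as_completed(futures):
            finished.append(future.result())
    return finished


if __name__ == "__main__":
    jobs = [make_update(f"update{i}", 0.1) for i in range(1, 4)]
    start = time.monotonic()
    print(run_updates(jobs), f"{time.monotonic() - start:.2f}s")
```

With three workers the three jobs overlap, so the total time should be close to that of a single job rather than the sum. How large `max_workers` can usefully be depends on whether the work is really IO bound, as noted above.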
Related
I have tasks that do a get request to an API.
I have around 70,000 requests that I need to make, and I want to spread them out over 24 hours, so that not all 70k requests run at, for example, 10 AM.
How would I do that with Celery and Django? I have been searching for hours but can't find a good, simple solution.
The database has a list of games that need to be refreshed. Currently I have a cron job that creates tasks every hour. But is it better to create a task for every game and make it repeat every hour?
The typical approach is to send tasks whenever you need some work done, no matter how many there are (even hundreds of thousands). The execution, however, is controlled by how many workers (and worker processes) you have subscribed to a dedicated queue. The key here is the dedicated queue: it is a common way of preventing all workers from starting on the newly created tasks. This goes beyond basic Celery usage. You need to use celery multi for this use case, or create two or more separate Celery workers manually with different queues.
If you do not want to over-complicate things, you can keep your current setup but give these tasks the lowest priority, so that any new, more important task gets executed first. The problem with this approach is that only the Redis and RabbitMQ backends support priorities, as far as I know.
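The dedicated-queue idea above can be sketched without Celery, using plain threads and `queue.Queue` from the standard library. This is only an illustration under assumed names (`run_with_dedicated_queues`, the queue names in the usage note); it is not Celery's API. In Celery itself, the analogous setup is to route the tasks to a named queue and start a worker that consumes only that queue with the `-Q` option.

```python
import queue
import threading


def worker(q, results):
    """Consume jobs from one queue until a None sentinel arrives."""
    while True:
        job = q.get()
        if job is None:
            break
        results.append(job())


def run_with_dedicated_queues(jobs_by_queue, workers_per_queue):
    """Give every named queue its own workers so queues cannot starve each other.

    jobs_by_queue maps a queue name to a list of callables;
    workers_per_queue maps a queue name to a worker count (default 1).
    """
    results = {name: [] for name in jobs_by_queue}
    queues = {name: queue.Queue() for name in jobs_by_queue}
    threads = []
    for name, q in queues.items():
        for _ in range(workers_per_queue.get(name, 1)):
            t = threading.Thread(target=worker, args=(q, results[name]))
            t.start()
            threads.append(t)
    for name, jobs in jobs_by_queue.items():
        for job in jobs:
            queues[name].put(job)
    # One sentinel per worker tells that worker to stop once the jobs are done.
    for name, q in queues.items():
        for _ in range(workers_per_queue.get(name, 1)):
            q.put(None)
    for t in threads:
        t.join()
    return results
```

For example, `run_with_dedicated_queues({"refresh": bulk_jobs, "default": other_jobs}, {"refresh": 2})` caps the bulk refresh work at two workers while the `default` queue keeps its own worker free for everything else.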
I am currently using Celery to run my background tasks.
Let's say I want to run these background tasks on a weekly basis.
Is it a good idea to throw 20 tasks under one worker?
Each task would make at least 800 web requests.
The Celery docs say:
A single Celery process can process millions of tasks a minute, with sub-millisecond round-trip latency (using RabbitMQ, librabbitmq, and optimized settings).
So basically, one task could be for one user. I would need to run at least 50 different tasks, each of them making about 800 web requests. I thought I might need a new worker for each task, but after reviewing the docs it doesn't seem like I need multiple workers; instead I can throw them all at one worker and it would be just fine. I don't feel confident about that, though. If I am making 800 web requests per task, do I need multiple workers, or should I just do everything under one worker?
I'm building a web application that has some long-running jobs to do, and some very short. The jobs are invoked by users on the website, and can run anywhere from a few seconds to several hours. The spawned job needs to provide status updates to the user via the website.
I'm new to Heroku, and am thinking the proper approach is to spawn a new dyno for each background task, and to communicate status via a database or memcache record. Maybe?
My question is whether this is a technically feasible and advisable approach.
I ask because the documentation has a different mindset: that the worker dyno pulls jobs off a queue, and if you want things to go faster you run more dynos. I'm not sure that will work: could a 10 second job get blocked waiting for a couple of 10 hour jobs to finish? There is a way of determining the size of the job but, again, there is a highly variable amount of work to do before it is known.
I've not found any examples suggesting it is even possible for the web dyno to spin up workers ad hoc. Is it? Is the solution to multi-thread the worker dyno? If so, what about potential memory space issues?
What's my best approach?
thx.
I'm doing some metric analysis on my web app, which makes extensive use of Celery. I have one metric which measures the full trip from a post_save signal through a Celery task (which itself calls a number of different Celery tasks) to the end of that task. I've been hitting the server with up to 100 requests in 5 seconds.
What I find interesting is that when I hit the server with hundreds of requests (which entails thousands of Celery tasks being queued), the time it takes for the trip from post_save to the end of the main Celery task increases significantly, even though I never do any additional database calls, and none of the Celery tasks should be blocking the main task.
Could the fact that there are so many celery tasks in the queue when I make a bunch of requests really quickly be slowing down the logic in my post_save function and main celery task? That is, could the processing associated with getting the sub-tasks that the main celery task creates onto a crowded queue be having a significant impact on the time it takes to reach the end of the main celery task?
It's impossible to really answer your question without an in-depth analysis of your actual code AND benchmark protocol, and while I have some working experience with Python, Django and Celery, I wouldn't be able to do such an in-depth analysis. Still, there are a couple of very obvious points:
If your workers are running on the same computer as your Django instance, they will compete with the Django process(es) for CPU, RAM and IO.
If the benchmark "client" is also running on the same computer, then you have a "heisenbench" case: bombing a server with hundreds of HTTP requests per second also uses a serious amount of resources...
To make a long story short: concurrent / parallel programming won't give you more processing power, it will only allow you to (more or less) easily scale horizontally.
I'm not sure about slowing down, but it can cause your application to hang. I've had this problem where one application would back up several other queues with no workers. My application could then no longer queue messages.
Open up a Django shell, try to queue a task, and then hit Ctrl+C. I can't quite remember what the stack trace should be, but if you post it here I could confirm it.
I have a web scraper (command-line scripts) written in Python that runs on 4-5 Amazon EC2 instances.
What I do is place a copy of these Python scripts on each of these EC2 servers and run them.
So the next time I change the program, I have to do it for all the copies.
So, you can see the problem of redundancy, management and monitoring.
So, to reduce the redundancy and make management easier, I want to place the code on a separate server from which it can be executed on the other EC2 servers, and also monitor these Python programs, and the logs created by them, through a Django/web interface situated on this server.
There are at least two issues you're dealing with:
monitoring of execution of the scraping tasks
deployment of code to multiple servers
and each of them requires a different solution.
In general I would recommend using a task queue for this kind of assignment (I have tried Celery running on Amazon EC2 and was very pleased with it).
One advantage of the task queue is that it abstracts the definition of the task from the worker which actually performs it. So you send the tasks to the queue, and then a variable number of workers (servers with multiple workers) process those tasks by asking for them one at a time. Each worker if it's idle will connect to the queue and ask for some work. If it receives it (a task) it will start processing it. Then it might send the results back and it will ask for another task and so on.
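The pull model described above can be sketched in a few lines of standard-library Python. This is a simplified stand-in rather than Celery itself: the names `drain` and `process_all` are hypothetical, threads play the role of worker servers, and an in-process `queue.Queue` plays the role of the broker.

```python
import queue
import threading


def drain(task_queue, worker_id, done):
    """Ask the queue for one task at a time until none are left."""
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            return  # nothing left to do, so this worker stops
        done.append((worker_id, task()))


def process_all(tasks, num_workers):
    """Load the tasks onto a queue and let num_workers workers drain it.

    Returns a list of (worker_id, result) pairs.
    """
    task_queue = queue.Queue()
    for task in tasks:
        task_queue.put(task)
    done = []
    threads = [
        threading.Thread(target=drain, args=(task_queue, i, done))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

The task definitions never mention a worker, so calling `process_all(tasks, 1)` or `process_all(tasks, 5)` produces the same set of results; only the time taken and which worker handled which task should differ.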
This means that the number of workers can change over time, and they will process the tasks from the queue automatically until there are no more tasks to process. A good use case for this is Amazon's Spot instances, which can greatly reduce the cost. Just send your tasks to the queue, create X spot requests and watch the servers process your tasks. You don't really need to care about servers going up and down at any moment because the price went above your bid. That's nice, isn't it?
Now, this implicitly takes care of monitoring, because Celery has tools for monitoring the queue and processing; it can even be integrated with Django using django-celery.
When it comes to deployment of code to multiple servers, Celery doesn't support that. The reasons behind this are of different nature, see e.g. this discussion. One of them might be that it's just difficult to implement.
I think it's possible to live without it, but if you really care, I think there's a relatively simple DIY solution. Put your code under VCS (I recommend Git) and check for updates on a regular basis. If there's an update, run a bash script which will kill your workers, make all the updates and start the workers again so that they can process more tasks. Given Celery's ability to handle failure, this should work just fine.
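A minimal sketch of that DIY update check, written in Python rather than bash, might look like the following. It assumes the working copy tracks an upstream branch, and that `restart_cmd` is whatever command restarts your workers; that command is hypothetical and depends entirely on how the workers are managed. Unlike the description above, this sketch pulls first and restarts afterwards instead of killing the workers before updating. The function would be called periodically, for example from cron.

```python
import subprocess


def run(cmd, cwd="."):
    """Run a command and return its stripped stdout (raises on failure)."""
    completed = subprocess.run(
        cmd, cwd=cwd, check=True, capture_output=True, text=True
    )
    return completed.stdout.strip()


def update_if_needed(restart_cmd, cwd=".", runner=run):
    """Pull and restart workers only when the upstream branch has moved.

    restart_cmd is a hypothetical command list that restarts the workers.
    Returns True if an update was applied, False otherwise.
    """
    runner(["git", "fetch"], cwd)
    local = runner(["git", "rev-parse", "HEAD"], cwd)
    remote = runner(["git", "rev-parse", "@{u}"], cwd)  # upstream commit
    if local == remote:
        return False
    runner(["git", "pull", "--ff-only"], cwd)
    runner(restart_cmd, cwd)
    return True
```

The `runner` parameter exists only so that the logic can be exercised without a real repository; in normal use the default is enough.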