Say there are 40 different users of a mobile app calling the server, which delivers some content created with FFmpeg.
It takes about 5 seconds to create the content for each user.
I was just wondering if FFmpeg would process the commands simultaneously, or if they would be done in a queue.
Basically, would it take approximately 5 seconds for everyone, or would it take 5 to 200 seconds per person, depending on their position in the queue?
Also, if it is done via queueing, how could I change the tasks to run simultaneously? I don't want my users to wait a long time.
That depends on how many worker processes you have.
Since you added the Heroku tag, I'm assuming you're using Heroku. On Heroku, one dyno is one such worker process.
Routing on Heroku is more or less random, but provided you have a large number of users (40 is probably not enough, though), you should be able to serve as many users simultaneously as you have dynos.
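A back-of-the-envelope sketch of the queueing math, assuming each worker runs exactly one FFmpeg job at a time and requests are served in arrival order (`wait_seconds` is just an illustrative name):

```python
def wait_seconds(position: int, workers: int, job_seconds: float) -> float:
    """Approximate wait for the user at `position` (0-based), assuming
    each worker runs one FFmpeg job at a time and jobs are served
    strictly in arrival order."""
    rounds = position // workers + 1  # which "batch" this user falls into
    return rounds * job_seconds

# 40 users, one worker process, 5 s per job: the last user waits 200 s.
print(wait_seconds(39, 1, 5))   # 200
# The same load spread over 8 workers: the last user waits only 25 s.
print(wait_seconds(39, 8, 5))   # 25
```

So with a single worker everyone queues behind each other, while adding dynos (or worker processes per dyno) brings the worst-case wait down roughly linearly.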
Related
I have tasks that do a get request to an API.
I have around 70,000 requests that I need to make, and I want to spread them out over 24 hours, so that not all 70k requests fire at, for example, 10 AM.
How would I do that with Celery and Django? I have been searching for hours but can't find a good, simple solution.
The database has a list of games that need to be refreshed. Currently I have a cron job that creates tasks every hour. But is it better to create a task for every game and have it repeat every hour?
The typical approach is to send tasks whenever you need some work done, no matter how many there are (even hundreds of thousands). The execution, however, is controlled by how many workers (and worker processes) you have subscribed to a dedicated queue. The key here is the dedicated queue: that is a common way of not allowing all workers to start executing the newly created tasks. This goes beyond basic Celery usage. You need to use celery multi for this use case, or create two or more separate Celery workers manually with different queues.
If you do not want to over-complicate things, you can keep your current setup but give these tasks the lowest priority, so that if any new, more important task gets created, it will be executed first. The problem with this approach is that, as far as I know, only the Redis and RabbitMQ brokers support priorities.
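As a sketch of the spreading idea itself: compute a per-task countdown so the sends land evenly across the 24-hour window, then pass it to `apply_async`. The `refresh_game` task and `game_ids` list in the comment are hypothetical names:

```python
import math

def spread_countdowns(n_tasks: int, window_seconds: int = 24 * 3600):
    """Yield a `countdown` (seconds from now) for each task so that
    n_tasks are spaced evenly across the window instead of all firing
    at once."""
    step = window_seconds / n_tasks
    for i in range(n_tasks):
        yield math.floor(i * step)

# With Celery you would then schedule each task with its offset, e.g.:
#   for game_id, countdown in zip(game_ids, spread_countdowns(len(game_ids))):
#       refresh_game.apply_async(args=[game_id], countdown=countdown)
delays = list(spread_countdowns(70_000))
print(delays[0], delays[-1])  # 0 86398
```

The broker holds the delayed messages, so the worker pool stays busy at a steady rate instead of being flooded once a day.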
I am currently using celery to run my background tasks.
Let's say I want to run these background tasks on a weekly basis.
Is it a good idea to throw 20 tasks under one worker?
Each task would make at least about 800 web requests.
In the Celery docs it says:
A single Celery process can process millions of tasks a minute, with sub-millisecond round-trip latency (using RabbitMQ, librabbitmq, and optimized settings).
So basically, one task could be for one user. I would need to run at least 50 different tasks, each making about 800 web requests. I thought maybe I would need a new worker for each task, but reviewing the docs it doesn't seem like I need multiple workers; I can throw them all at once and it would be fine. I don't feel confident about that, though. What should I do in my case: if each task makes 800 web requests, do I need multiple workers, or should I just do everything under one worker?
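Since 800 web requests per task are IO-bound rather than CPU-bound, one worker can overlap many of them. A minimal stdlib sketch of that idea (the real task would use an HTTP client in place of the `fetch` stand-in; `run_task` is a hypothetical name):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for one of the ~800 web requests a task makes; in the real
# task this would be an HTTP call (requests.get, urllib, ...).
def fetch(url: str) -> str:
    return f"fetched {url}"

def run_task(user: str, n_requests: int = 800, pool_size: int = 50) -> int:
    """One 'task' for one user: fire its web requests through a thread
    pool, since the work is IO-bound, not CPU-bound."""
    urls = [f"https://example.com/{user}/{i}" for i in range(n_requests)]
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        results = list(pool.map(fetch, urls))
    return len(results)

print(run_task("alice"))  # 800
```

With Celery itself you would get a similar effect by running the worker with a larger or green-thread pool (e.g. `celery -A app worker -P gevent -c 100`) rather than adding more workers.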
I'm building a web application that has some long-running jobs to do, and some very short. The jobs are invoked by users on the website, and can run anywhere from a few seconds to several hours. The spawned job needs to provide status updates to the user via the website.
I'm new to heroku, and am thinking the proper approach is to spawn a new dyno for each background task, and to communicate status via database or a memcache record. Maybe?
My question is whether this is technically feasible, and an advisable approach?
I ask because the documentation has a different mindset: the worker dyno pulls jobs off a queue, and if you want things to go faster you run more dynos. I'm not sure that will work: could a 10-second job get blocked waiting for a couple of 10-hour jobs to finish? There is a way of determining the size of the job but, again, there is a highly variable amount of work to do before it is known.
I've not found any examples suggesting it is even possible for the web dyno to spin up workers ad hoc. Is it? Is the solution to multi-thread the worker dyno? If so, what about potential memory space issues?
What's my best approach?
thx.
I have web-scrapers (command-line scripts) written in Python that run on 4-5 Amazon EC2 instances.
What I do is place a copy of these Python scripts on each EC2 server and run them there.
So the next time I change the program, I have to do it for all the copies.
So you can see the problems: redundancy, management, and monitoring.
So, to reduce the redundancy and allow easy management, I want to place the code on a separate server from which it can be executed on the other EC2 servers, and also monitor these Python programs, and the logs they create, through a Django web interface on that server.
There are at least two issues you're dealing with:
monitoring of execution of the scraping tasks
deployment of code to multiple servers
and each of them requires a different solution.
In general I would recommend using a task queue for this kind of assignment (I have tried Celery running on Amazon EC2 and was very pleased with it).
One advantage of a task queue is that it separates the definition of the task from the worker that actually performs it. You send tasks to the queue, and then a variable number of workers (servers with multiple worker processes) process those tasks by asking for them one at a time. Each idle worker connects to the queue and asks for work; if it receives a task, it starts processing it, possibly sends the results back, asks for another task, and so on.
This means that the number of workers can change over time, and they will keep processing tasks from the queue automatically until there are none left. A nice use case for this is Amazon's Spot Instances, which can greatly reduce the cost. Just send your tasks to the queue, create X spot requests, and watch the servers process your tasks. You don't really need to care about servers going up and down at any moment because the price went above your bid. That's nice, isn't it?
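The pull model described above can be sketched with nothing but the standard library; here threads stand in for Celery worker processes and the scraping itself is faked:

```python
import queue
import threading

task_queue: "queue.Queue[str]" = queue.Queue()
results = []

def worker() -> None:
    """What each worker does: ask the queue for a task, process it,
    then ask for the next one, until the queue is drained."""
    while True:
        try:
            url = task_queue.get_nowait()
        except queue.Empty:
            return  # no more work: the worker simply goes idle
        results.append(f"scraped {url}")  # stand-in for the real scraping
        task_queue.task_done()

# The producer only puts task definitions on the queue...
for i in range(10):
    task_queue.put(f"https://example.com/page/{i}")

# ...and any number of workers (here 3 threads; in Celery, processes on
# possibly many servers) drain it. Adding workers just speeds this up.
threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 10
```

Note the producer never knows, or cares, how many workers exist, which is exactly what makes spot instances coming and going harmless.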
Now, this also implicitly takes care of monitoring: Celery has tools for monitoring the queue and the workers, and it can even be integrated with Django using django-celery.
When it comes to deploying code to multiple servers, Celery doesn't support that. The reasons behind this are of a different nature; see e.g. this discussion. One of them might be that it's simply difficult to implement.
I think it's possible to live without it, but if you really care, there's a relatively simple DIY solution: put your code under VCS (I recommend Git) and check for updates on a regular basis. If there's an update, run a script that kills your workers, applies the updates, and starts the workers again so they can process more tasks. Given Celery's ability to handle failure, this should work just fine.
I am working on a project, to be deployed on Heroku in Django, which has around 12 update functions. They each take around 15 minutes to run. Let's call them update1(), update2()...update10().
I am deploying with one worker dyno on Heroku, and I would like to run n or more of these at once (they are not really computationally intensive, as they are all HTML parsers, but the data is time-sensitive, so I would like them to be called as often as possible).
I've read a lot of Celery and APScheduler documentation, but I'm not really sure which is the best or easiest for me. Do scheduled tasks run concurrently if their times overlap (i.e. if I run one every 2 minutes and another every 3 minutes), or do they wait until each one finishes?
Any way I can queue these functions, so at least a few of them are running at once? What is the suggested number of simultaneous calls for this use-case?
Based on your use-case description, you do not need a scheduler, so APScheduler will not match your requirements well.
Do you have a web dyno besides your worker dyno? The usual design pattern for this type of processing is to set up a control thread or control process (your web dyno) that accepts requests. These requests are then placed on a request queue.
This queue is read by one or more worker threads or worker processes (your worker dyno). I have not worked with Celery, but it looks like a match for your requirements. How many worker threads or worker dynos you will need is difficult to determine from your description. You will also need to specify how many update requests you need to process per second, and whether the requests are CPU-bound or IO-bound.
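For a rough sizing, Little's law helps: in steady state you need about arrival_rate × service_time units of concurrency. A sketch using the numbers from the question as an illustrative assumption (12 updates of roughly 15 minutes each per hour; `workers_needed` is a hypothetical helper):

```python
import math

def workers_needed(requests_per_second: float, seconds_per_update: float,
                   concurrency_per_worker: int = 1) -> int:
    """Rough sizing via Little's law: in steady state you need about
    arrival_rate * service_time units of concurrency. IO-bound updates
    can share a worker via threads or green threads
    (concurrency_per_worker > 1); CPU-bound updates need roughly one
    process each."""
    in_flight = requests_per_second * seconds_per_update
    # round() before ceil() to dodge floating-point jitter like 3.0000000004
    return max(1, math.ceil(round(in_flight / concurrency_per_worker, 6)))

# 12 updates per hour, ~15 minutes (900 s) each, treated as CPU-bound:
print(workers_needed(12 / 3600, 900))      # 3
# The same load, IO-bound with 10 green threads per worker:
print(workers_needed(12 / 3600, 900, 10))  # 1
```

This is only a steady-state lower bound; bursty arrivals or a queue-length target would push the number up.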