On the server side, I need a way to execute tasks in the background, run them frequently, and start them at a specific time.
My stack is Python for the back end (Sanic framework), Vue.js for the front end, MongoDB as the main DB, and Redis for caching.
I'm also using Docker containers (docker-compose).
I have worked with Celery before, but I want to know the best solution for production, one that is stable and reliable.
On the client side: besides the server-side need above, I sometimes need to run a job scheduler on clients, i.e. embedded devices such as a Raspberry Pi that can run Python or JavaScript.
So, what are your solutions for these use cases?
In production we have both long- and short-running tasks, and in total our Celery cluster executes up to 6M tasks per day, so naturally I would recommend Celery. It is made for this purpose, and if you are a Python developer you have another reason to pick Celery. Finally, Celery is the only Python task queue system known to me that has a high-availability (HA) scheduler (https://github.com/mixkorshun/celery-beatx and https://github.com/sibson/redbeat).
There are two other (Python) projects that should be mentioned as alternatives to Celery - Huey (https://github.com/coleifer/huey) and Apache Airflow (https://github.com/apache/airflow).
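For the HA scheduler mentioned above, here is a minimal sketch of wiring RedBeat into an existing Celery app (the Redis URL and project name are assumptions):

# RedBeat keeps the schedule in Redis and coordinates via a lock,
# so beat itself is no longer a single point of failure
app.conf.redbeat_redis_url = "redis://localhost:6379/1"

# start beat with the RedBeat scheduler class:
#   celery -A yourproject beat -S redbeat.RedBeatScheduler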
I'm one of the core devs of Sanic. I would agree with the other answers that Celery is a great option. For anyone in need of a more lightweight solution, I have a post about an alternative approach that stays entirely inside Sanic: https://community.sanicframework.org/t/how-to-use-asyncio-queues-in-sanic/166/4
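That post goes into detail; the following is only a rough sketch of the general idea (an in-process asyncio.Queue consumed by a background task), assuming a recent Sanic release, with all names illustrative. Keep in mind that anything still in the queue is lost if the process restarts, which is the main trade-off versus an external queue like Celery.

import asyncio
from sanic import Sanic, response

app = Sanic("worker_demo")

async def worker(app):
    # consume jobs pushed by the request handlers, one at a time
    while True:
        job = await app.ctx.queue.get()
        ...  # do the actual background work here

@app.listener("before_server_start")
async def setup(app, loop):
    app.ctx.queue = asyncio.Queue()
    app.add_task(worker(app))

@app.route("/enqueue")
async def enqueue(request):
    await request.app.ctx.queue.put({"payload": request.args.get("payload")})
    return response.json({"queued": True})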
Starting a new process in the background in Python is as simple as calling os.fork(). For a comprehensive example, see https://python-course.eu/forking.php
EDIT:
For a fully featured solution, I'd recommend forking a background process as described above, and then using a library like https://github.com/dbader/schedule to execute jobs at scheduled intervals in that background process.
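A minimal sketch of that combination (os.fork plus the schedule package); the job body and the 10-minute interval are only placeholders:

import os
import time
import schedule  # pip install schedule

def job():
    print("running the scheduled job")  # the real work goes here

pid = os.fork()
if pid == 0:
    # child process: run the scheduler loop forever in the background
    schedule.every(10).minutes.do(job)
    while True:
        schedule.run_pending()
        time.sleep(1)
else:
    # parent process: carry on with the rest of the application
    print("scheduler running in background process %d" % pid)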
Related
I have a heavy function (a lot of calculations are done) that outputs an individual number for each user in my Django project. This number changes only a little over time, so to minimize the server load I thought about running the function once a day, saving the output, and just referencing the saved output. I know these kinds of things are usually handled with Celery, but the package requires a lot of site packages and extra modules, so I thought about writing a simple function like:
from datetime import datetime, timedelta

last_run = None        # last time the function was called
cached_output = None   # saved result of the heavy calculation

def whatever():
    global last_run, cached_output
    if last_run is None or datetime.now() - last_run > timedelta(days=1):
        cached_output = ...  # the heavy calculation goes here
        last_run = datetime.now()
    return cached_output
I like to keep my code clean and not install packages that are not really required, so I would like to know if there are any downsides to "just" using Python, or any gain if I did this with Celery. The task does not need to be asynchronous, so I don't care about that.
Is there a clear use case for when Celery should be used and when it shouldn't? Is there a performance loss or gain?
I hope somebody can explain that properly.
Celery is a clear winner but I would like to explain this with pros and cons.
Pros:
You can control Celery from Django very easily. Running a Celery task, cancelling a task, and checking the state/progress of a task can all be done from within Django.
Running a periodic task with Celery is very simple: just register the task from Django, run the Celery worker, and voilà, you are done; no need to mess around with crontab or background processes (there's a sketch after this answer).
Celery is very easy to set up and run. You might already know that if you have gone through the introduction to Celery.
Cons:
One of the cons is that you need at least one broker such as Redis or RabbitMQ (and optionally a result backend) running alongside Celery for queuing purposes. Although RabbitMQ is not heavy, you do need to install it once.
Another is that the Celery worker itself takes some memory, but that won't be an issue if you are on a server; on a local machine the memory consumption might seem high to you.
I would suggest Celery because it gives you more control over your tasks than a simple background process does.
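To make the comparison concrete, here is a rough sketch of what the Celery side could look like; it assumes a standard django-celery setup with the project's Celery instance available as app, and all task and module names are illustrative:

# myapp/tasks.py
from celery import shared_task

@shared_task
def compute_user_number(user_id):
    # the heavy calculation goes here
    return 42

# anywhere in Django: run, inspect, or cancel the task
result = compute_user_number.delay(1)   # enqueue it in the background
result.state                            # e.g. 'PENDING', 'SUCCESS'
result.revoke(terminate=True)           # cancel it if needed

# periodic execution via celery beat (e.g. in the project's celery.py)
app.conf.beat_schedule = {
    "recompute-daily": {
        "task": "myapp.tasks.compute_user_number",
        "schedule": 60 * 60 * 24,   # once a day, in seconds
        "args": (1,),
    },
}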
In my Django project, I need to collect data from about 50 remote servers into the local database every minute or every 30 seconds. Although this works with crontab on the remote servers, I want to do it inside the project. At first I considered django-celery; however, it is geared towards asynchronous processing, and the data-collection task cannot be delayed, so I think it may not fit. What if I use a Python timer, and what do I need to pay attention to? Excuse my ignorance of Python and Django. I'd appreciate other advice or ideas. Many thanks.
Basically, you can use Celery's periodic tasks with the expires option, which makes sure your tasks will not be executed twice.
You could also run your own script with an infinite loop that performs the collection. If a run takes more than a minute, you can spawn your tasks using eventlet or gevent. Another option is to create Celery tasks from that script and be sure your tasks execute every N seconds, as you prefer.
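A rough sketch of the periodic-task-with-expires idea; the broker URL, module name, and 30-second interval are assumptions:

# tasks.py
from celery import Celery

app = Celery("collector", broker="redis://localhost:6379/0")

@app.task
def collect_data():
    # pull the data from the ~50 remote servers into the local DB
    ...

app.conf.beat_schedule = {
    "collect-every-30s": {
        "task": "tasks.collect_data",
        "schedule": 30.0,
        "options": {"expires": 25},  # drop a run that sat in the queue too long
    },
}

# run it with:  celery -A tasks worker --beat --loglevel=info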
I have a web scraper (command-line scripts) written in Python that runs on 4-5 Amazon EC2 instances.
What I do is place a copy of these Python scripts on each EC2 server and run them there.
So the next time I change the program, I have to do it for every copy.
So, you can see the problem of redundancy, management and monitoring.
To reduce the redundancy and make management easier, I want to place the code on a separate server, from which it can be executed on the other EC2 servers, and also monitor these Python programs and the logs they create through a Django/web interface on that server.
There are at least two issues you're dealing with:
monitoring of execution of the scraping tasks
deployment of code to multiple servers
and each of them requires a different solution.
In general I would recommend using a task queue for this kind of assignment (I have tried Celery running on Amazon EC2 and was very pleased with it).
One advantage of a task queue is that it separates the definition of a task from the worker that actually performs it. So you send tasks to the queue, and then a variable number of workers (servers running multiple worker processes) process those tasks by asking for them one at a time. Each idle worker connects to the queue and asks for some work; if it receives a task, it starts processing it, may send the results back, and then asks for another task, and so on.
This means that the number of workers can change over time, and they will keep processing tasks from the queue automatically until there are no more tasks to process. A nice use case for this is Amazon's Spot Instances, which can greatly reduce the cost. Just send your tasks to the queue, create X spot requests and watch the servers process your tasks. You don't really need to care about servers going up and down at any moment because the price went above your bid. That's nice, isn't it?
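A rough sketch of that split with Celery; the broker URL and the task body are assumptions:

# scraper_tasks.py -- the same module is deployed to the dispatcher and to every worker node
from celery import Celery

app = Celery("scraper", broker="amqp://guest:guest@broker-host//")

@app.task
def scrape(url):
    # fetch and parse one page; the real scraping logic goes here
    ...

# on the dispatching server, push work into the queue:
#   for url in urls:
#       scrape.delay(url)
#
# on each EC2 / spot instance, start a worker that pulls from the queue:
#   celery -A scraper_tasks worker --loglevel=info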
Now, this implicitly takes care of monitoring: Celery has tools for monitoring the queue and the workers, and it can even be integrated with Django using django-celery.
When it comes to deployment of code to multiple servers, Celery doesn't support that. The reasons behind this are varied (see e.g. this discussion); one of them might be that it's just difficult to implement.
I think it's possible to live without it, but if you really care, there is a relatively simple DIY solution. Put your code under version control (I recommend Git) and check for updates on a regular basis. If there is an update, run a script that kills your workers, applies the update, and starts the workers again so they can process more tasks. Given Celery's ability to handle failure, this should work just fine.
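A rough DIY sketch of that update-and-restart check in Python (run it from cron, for example); the repository path, branch, and worker command are assumptions:

import subprocess

REPO = "/srv/scraper"        # assumed checkout location on each worker server
BRANCH = "origin/master"     # assumed branch to track

def rev(ref):
    # return the commit hash a ref points to
    return subprocess.run(["git", "-C", REPO, "rev-parse", ref],
                          capture_output=True, text=True, check=True).stdout.strip()

subprocess.run(["git", "-C", REPO, "fetch"], check=True)
if rev("HEAD") != rev(BRANCH):
    # stop the workers, apply the update, start them again
    subprocess.run(["pkill", "-f", "celery -A scraper_tasks worker"], check=False)
    subprocess.run(["git", "-C", REPO, "merge", BRANCH], check=True)
    subprocess.run(["celery", "-A", "scraper_tasks", "worker", "--detach"],
                   check=True, cwd=REPO)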
Question is relevant to this and this;
the difference is, I'd prefer something with possibly more precision and lower load (a per-minute cron job isn't preferable for this) and with minimal overhead (i.e. installing Celery with RabbitMQ seems like overkill).
An example task for this is a personal reminders server (with reminders that can be edited over the web and sent out through e-mail or XMPP).
I'm probably looking for something more like node.js's setTimeout, but for Django (and though I might prefer to implement the reminders in node.js anyway, it's still a potentially interesting question).
For example, it's possible to start new threads in a Django app (with functions consisting of sleep() and send()); in what ways can this be bad?
The problem with using threads for this is the typical set of problems with Python threads that always drives people towards multi-process solutions instead. It is compounded here by the fact that your thread isn't driven by the normal request-response cycle. This is summarized nicely by Malcolm Tredinnick here:
Have to disagree. Threads are not a good solution to this problem. The issue is process management. As written, your threads will never be rejoined. Webserver processes have a lifecycle uncontrollable by you (the MaxRequestsPerChild Apache parameter and similar things in other servers) and you are messing with that by using threads.
If you need a process with a lifecycle that is not matched by the request-response path — something long running and independent of the response — a completely separate process is definitely the right model to use. Using a thread is tying it to the response lifecycle, which will have unintended side-effects.
A possible solution for you might be to have a long-running process performing your tasks, which gets a wake-up signal from a light cron job.
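A rough sketch of that pattern, using a Unix signal as the wake-up mechanism (the script name, the handler body, and the crontab line are illustrative):

# reminder_daemon.py -- long-running process, woken up by cron
import signal

def process_due_reminders():
    ...  # query the DB and send anything that is due

def wake_up(signum, frame):
    process_due_reminders()

signal.signal(signal.SIGUSR1, wake_up)

while True:
    signal.pause()  # sleep until something sends SIGUSR1

# crontab entry to wake it every minute:
#   * * * * * pkill -USR1 -f reminder_daemon.py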
Another possibility would be to build something using 0mq, which is much lighter than AMQP-style queues (at the cost of some features, of course). Tarek Ziade is working on a Mozilla project called powerhose that uses 0mq, looks super simple, and has a heartbeat capability with a resolution of one second.
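A minimal sketch of the 0mq route with pyzmq (the socket address and the message format are assumptions):

import zmq

# producer side, e.g. inside a Django view: hand the reminder to the worker
ctx = zmq.Context()
push = ctx.socket(zmq.PUSH)
push.connect("tcp://127.0.0.1:5555")
push.send_json({"user_id": 1, "send_at": "2012-05-01T09:00:00"})

# consumer side, a separate long-running process:
#   ctx = zmq.Context()
#   pull = ctx.socket(zmq.PULL)
#   pull.bind("tcp://127.0.0.1:5555")
#   while True:
#       job = pull.recv_json()
#       ...  # schedule the reminder and send it via e-mail or XMPP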
I have a series of maintenance tasks for a Python WSGI application that are a bit too complex for a crontab (jobs need to be run at frequencies derived from the size of the job queue, manage a connection pool to a group of EC2 instances, etc.).
How should I implement a long-running, event-driven python program? I've never needed this functionality before, so I'm not even sure what to google.
Most of the large, modern Python sites use Celery for this type of work. It is a distributed task queue that also supports scheduling of tasks.
Though probably a bit heavyweight for a small site, it'll grow with you. I'm looking to implement it myself (sans Rabbit) shortly.
I recently found another choice for Django users, django-tasks, which is focused on fewer, longer, batch-processing-type jobs. There is also django-ztask, which uses ZeroMQ.
Addendum: I just came across Gearman, which has Python bindings.