In my Django project, I need to collect data from about 50 remote servers into the local database every minute or every 30 seconds. This currently works with crontab on the remote servers, but I want to handle it inside the project itself. At first I considered django-celery, but it is geared towards asynchronous processing and my data-collection task cannot tolerate delays, so I'm not sure it's a good fit. What if I implement this with a Python timer instead, and what would I need to pay attention to? Please excuse my limited knowledge of Python and Django. I'd appreciate any other advice or ideas. Many thanks.
Basically, you can use Celery's periodic tasks with the `expires` option, which ensures that your tasks will not be executed twice.
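For illustration, here is a minimal sketch of such a periodic task. The broker URL, the task name and the 30-second interval are placeholders of mine; `expires` is set just below the interval so a run that isn't picked up in time is simply dropped instead of piling up:

```python
# Minimal sketch: a 30-second periodic task with expires, assuming a Redis broker.
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

app.conf.beat_schedule = {
    "collect-remote-data": {
        "task": "tasks.collect_data",
        "schedule": 30.0,                # run every 30 seconds
        "options": {"expires": 25},      # discard the run if no worker picks it up in time
    },
}

@app.task
def collect_data():
    ...  # fetch data from the remote servers and write it to the local database
```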
Alternatively, you could run your own script with an infinite loop that performs the collection. If a single run takes longer than a minute, you can spawn concurrent tasks using eventlet or gevent. Another option is to create Celery tasks from this script and be sure that your tasks execute every N seconds, as you prefer.
On the server side: I need a way to execute some tasks in the background, frequently, and to start them at a specific time.
My stack is Python for the back end (Sanic framework), Vue.js for the front end, MongoDB as the main database and Redis for caching.
I'm also running everything in Docker containers (docker-compose).
I have worked with Celery before, but I want to know what the best solution for production is, one that is guaranteed to be stable and reliable.
On the client side: besides running the above on the server, I sometimes need to run a job scheduler on clients, i.e. embedded devices such as a Raspberry Pi that can run Python or JavaScript.
So, what are your solutions for these use cases?
In production we have both long- and short-running tasks, and in total our Celery cluster executes up to 6M tasks per day, so naturally I would recommend Celery. It is made for this purpose, and if you are a Python developer you have another reason to pick it. Finally, Celery is the only Python task-queue system known to me that has a highly available scheduler (https://github.com/mixkorshun/celery-beatx and https://github.com/sibson/redbeat).
There are two other (Python) projects that should be mentioned as alternatives to Celery - Huey (https://github.com/coleifer/huey) and Apache Airflow (https://github.com/apache/airflow).
I'm one of the core devs for Sanic. I would agree with the other answers that Celery is a great option. For anyone in need of a more lightweight solution, I have a post about an alternative approach using only Sanic: https://community.sanicframework.org/t/how-to-use-asyncio-queues-in-sanic/166/4
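The linked post has the details; as a rough illustration of the general idea (not the exact code from that post), a long-running coroutine can be attached to the app with `app.add_task()`. The `periodic_worker()` body and the 30-second interval are placeholders:

```python
# Sketch: an in-process background loop inside Sanic, started before the server boots.
import asyncio
from sanic import Sanic
from sanic.response import json

app = Sanic("scheduler_demo")

async def periodic_worker():
    while True:
        # do the periodic work here (HTTP calls, DB writes, ...)
        await asyncio.sleep(30)

@app.listener("before_server_start")
async def start_background_task(app, loop):
    app.add_task(periodic_worker())

@app.route("/")
async def index(request):
    return json({"status": "ok"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```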
Starting a new process in the background in Python is as simple as calling os.fork(). For a comprehensive example, see https://python-course.eu/forking.php
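A minimal sketch of the idea (Unix-only, since fork() is not available on Windows); the child's loop body is just a placeholder:

```python
# Sketch: fork a child that keeps working while the parent returns immediately.
import os
import time

pid = os.fork()
if pid == 0:
    # child process: do the background work
    while True:
        print("child working, pid", os.getpid())
        time.sleep(5)
else:
    # parent process: carries on without waiting for the child
    print("forked background worker with pid", pid)
```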
EDIT:
For a fully featured solution, I'd recommend forking a background process as described above, and then using a library like https://github.com/dbader/schedule to execute jobs at scheduled intervals in that background process.
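Something along these lines could then run inside that background process; the job body and the 30-second interval are placeholders:

```python
# Sketch: run a job every 30 seconds with the `schedule` library.
import time
import schedule

def collect_data():
    print("collecting data from remote servers...")

schedule.every(30).seconds.do(collect_data)

while True:
    schedule.run_pending()
    time.sleep(1)
```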
I hope you are all having an amazing day. I am working on a project in Python. The script's job is to automate actions and tasks on a social media platform via HTTP requests. As of now, one instance of this script accesses one user account. Now I want to create a website where users can register, enter their credentials for the social media platform, and have an instance of this script run the automation tasks for them. I've thought about creating a new process of this script every time a user registers, but that doesn't seem efficient. I also thought about using threads, but that doesn't seem reasonable either, especially if 10,000 users register. What is the best way to do this? How can I scale? Thank you so much in advance.
What is the nature of the tasks that you're running?
Are the tasks simply jobs that run at a scheduled time of day, or every X minutes? For that, you could have your web application register cron jobs or similar, and each cron job can spawn an instance of your script, which I assume is short-running, to carry out the automated task one user at a time. If the exact timing of the script doesn't matter, you could scatter these runs throughout the day, on separate machines if need be.
The above approach probably won't scale well to 10,000 users, and you will need something more robust, especially if the script needs to run continuously (e.g. you are polling some data from Facebook and need to react to its changes). If there is a lot of communication per user, you could consider a producer-consumer model, where a bunch of producer scripts (which run continuously) issue work requests into a global queue that a bunch of consumer scripts poll and carry out. You could also load-balance such consumers and producers across multiple machines.
Of course, you would want to squeeze some parallelism out of the extra cores of your machines by carrying out this work across multiple threads or processes. You can do this quite easily in Python using the multiprocessing module.
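A rough sketch of that producer-consumer idea with the multiprocessing module; handle_account() and the user list are placeholders:

```python
# Sketch: a queue of per-user work items drained by a pool of consumer processes.
from multiprocessing import Process, Queue

def handle_account(user):
    print(f"running automation for {user}")   # placeholder for the real work

def consumer(queue):
    while True:
        user = queue.get()
        if user is None:        # sentinel: no more work
            break
        handle_account(user)

if __name__ == "__main__":
    q = Queue()
    workers = [Process(target=consumer, args=(q,)) for _ in range(4)]
    for w in workers:
        w.start()
    for user in (f"user_{i}" for i in range(100)):   # the producer side
        q.put(user)
    for _ in workers:                                # one sentinel per worker
        q.put(None)
    for w in workers:
        w.join()
```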
I have a web scraper (command-line scripts) written in Python that runs on 4-5 Amazon EC2 instances.
What I do is place a copy of these Python scripts on the EC2 servers and run them there.
So the next time I change the program, I have to do it for all the copies.
So you can see the problem of redundancy, management and monitoring.
To reduce the redundancy and make management easier, I want to place the code on a separate server from which it can be executed on the other EC2 servers, and also monitor these Python programs and the logs they create through a Django/web interface located on that server.
There are at least two issues you're dealing with:
monitoring of execution of the scraping tasks
deployment of code to multiple servers
and each of them requires a different solution.
In general I would recommend using a task queue for this kind of assignment (I have tried Celery running on Amazon EC2 and was very pleased with it).
One advantage of a task queue is that it abstracts the definition of the task from the worker that actually performs it. So you send the tasks to the queue, and then a variable number of workers (servers running multiple workers) process those tasks by asking for them one at a time. Whenever a worker is idle it connects to the queue and asks for some work. If it receives some (a task), it starts processing it, possibly sends the results back, asks for another task, and so on.
This means that the number of workers can change over time, and they will process the tasks from the queue automatically until there are no more tasks to process. A nice use case for this is Amazon's Spot instances, which will greatly reduce the cost. Just send your tasks to the queue, create X spot requests and watch the servers process your tasks. You don't really need to care about servers going up or down at any moment because the price went above your bid. That's nice, isn't it?
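A minimal sketch of that split, assuming a Redis broker and a scrape() task name of my own choosing:

```python
# Sketch: define the task once; any number of workers (e.g. spot instances) consume it.
from celery import Celery

app = Celery("scraper", broker="redis://localhost:6379/0")

@app.task
def scrape(url):
    ...  # fetch and process a single URL

# From the Django side (or anywhere else), enqueue work with:
#   scrape.delay("https://example.com/page-1")
# Each worker machine then runs:
#   celery -A scraper worker --loglevel=info
```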
Now, this implicitly takes care of monitoring: Celery has tools for monitoring the queue and the processing, and it can even be integrated with Django using django-celery.
When it comes to deployment of code to multiple servers, Celery doesn't support that. The reasons behind this are varied, see e.g. this discussion. One of them might be that it's just difficult to implement.
I think it's possible to live without it, but if you really care, there's a relatively simple DIY solution. Put your code under version control (I recommend Git) and check for updates on a regular basis. If there's an update, run a script which kills your workers, pulls in all the updates and starts the workers again so that they can process more tasks. Given Celery's ability to handle failure, this should work just fine.
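A rough DIY sketch of that check-and-restart idea, using plain git via subprocess; the branch name and the supervisord service name are assumptions, and restarting will look different under another process manager:

```python
# Sketch: pull new code when the remote branch moves ahead, then restart the workers.
import subprocess

def remote_has_updates(branch="master"):
    subprocess.run(["git", "fetch"], check=True)
    local = subprocess.run(["git", "rev-parse", "HEAD"],
                           capture_output=True, text=True, check=True).stdout.strip()
    remote = subprocess.run(["git", "rev-parse", f"origin/{branch}"],
                            capture_output=True, text=True, check=True).stdout.strip()
    return local != remote

if __name__ == "__main__":
    if remote_has_updates():
        subprocess.run(["git", "pull"], check=True)
        # assumed service name; adjust for your systemd/supervisord setup
        subprocess.run(["supervisorctl", "restart", "celery-worker"], check=True)
```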
This seems like a simple question, but I am having trouble finding the answer.
I am making a web app which would require the constant running of a task.
I'll use sites like Pingdom or Twitterfeed as an analogy. As you may know, Pingdom checks uptime, so it is constantly checking websites to see if they are up, and Twitterfeed checks RSS feeds to see if they've changed and then tweets the changes. I too need to run a simple script that cycles through URLs in a database and performs an action on them.
My question is: how should I implement this? I am familiar with cron, currently using it to do my server backups. Would this be the way to go?
I know how to make a Python script which runs indefinitely, starting back at the beginning with the next URL in the database when I'm done. Should I just run that on the server? How will I know it is always running and doesn't crash or something?
I hope this question makes sense and I hope I am not repeating someone else or anything.
Thank you,
Sam
Edit: To be clear, I need the task to run constantly. As in, check URL 1 in the database, check URL 2 in the database, check URL 3 and, when it reaches the last one, go right back to the beginning. Thanks!
If you need a task that runs repeatedly and can be started from the command line, that's exactly what cron is ideal for.
I don't see any drawbacks to this approach.
Update:
Okay, I initially saw the issue somewhat differently. Now I see several solutions:
a) run the cron task at set intervals and let it process one batch of data per run, picking up the next batch on the following run; use PID files, the database or semaphores to avoid parallel processes (see the sketch after this list);
b) move the processing into the code that inserts/updates data in the database, so the information is processed the moment it is inserted or updated;
c) write a daemon process which resides in memory and checks the data in real time.
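For option a), a minimal PID-file sketch (the file path and the script body are placeholders):

```python
# Sketch: exit immediately if a previous cron run of this script is still alive.
import os
import sys

PIDFILE = "/tmp/url_checker.pid"

def already_running():
    try:
        with open(PIDFILE) as f:
            pid = int(f.read().strip())
        os.kill(pid, 0)          # raises OSError if that process no longer exists
        return True
    except (OSError, ValueError):
        return False

if __name__ == "__main__":
    if already_running():
        sys.exit(0)
    with open(PIDFILE, "w") as f:
        f.write(str(os.getpid()))
    try:
        ...  # process the URLs from the database here
    finally:
        os.remove(PIDFILE)
```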
cron would definitely be a way to go with this, as well as any other task scheduler you may prefer.
The main point is found in the title to your question:
Run a repeating task for a web app
The background task and the web application should be kept separate. They can share code, they can share access to a database, but they should be separate and discrete application contexts. (Consider them as separate UIs accessing the same back-end logic.)
The main reason for this is that web applications and background processes are architecturally very different and aren't meant to be mixed. Consider the structure of a web application hosted inside a web server (Apache, IIS, etc.). When is the application "running"? When is it "on"? It's not really a running task. It's a service waiting for input (requests) to handle, generating output (responses), and then going back to waiting.
Web applications are for responding to requests. Scheduled tasks or daemon jobs are for running repeated processes in the background. Keeping the two separate will make your management of the two a lot easier.
I have a task which I execute once a minute using celerybeat. It works fine. Sometimes, though, the task takes a few seconds more than a minute to run, because of which two instances of the task run at once. This leads to race conditions that mess things up.
I can (and probably should) fix my task to behave properly, but I wanted to know if Celery has any built-in way to ensure this. My cursory Google searches and RTFMs yielded no results.
You could add a lock, using something like memcached or just your db.
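For example, with Django's cache framework (backed by memcached or any shared cache) the lock can be a single key. This is only a sketch, assuming `app` is your Celery app and a lock timeout longer than any single run:

```python
# Sketch: cache.add() is atomic, so at most one running instance holds the lock.
from celery import Celery
from django.core.cache import cache

app = Celery("tasks", broker="redis://localhost:6379/0")  # or reuse your existing app

LOCK_EXPIRE = 60 * 5  # seconds; should exceed the longest expected run time

@app.task
def my_periodic_task():
    if not cache.add("my-periodic-task-lock", "locked", LOCK_EXPIRE):
        return  # another instance still holds the lock; skip this run
    try:
        ...  # the actual work
    finally:
        cache.delete("my-periodic-task-lock")
```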
If you use a cron schedule or a time interval to run periodic tasks, you will still have this problem. You can always use a locking mechanism based on a DB, a cache, or even the filesystem, or you can schedule the next task from the previous one (a minimal sketch of that follows the link below), although it is maybe not the best approach.
This question can probably help you:
django celery: how to set task to run at specific interval programmatically
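And a minimal sketch of the schedule-the-next-run-from-the-previous-one approach mentioned above (the broker URL and the 60-second countdown are assumptions):

```python
# Sketch: the task re-enqueues itself when it finishes, so runs can never overlap.
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def recurring_task():
    try:
        ...  # do the work
    finally:
        recurring_task.apply_async(countdown=60)  # schedule the next run

# Kick off the first run once, e.g. from a shell or a deploy script:
#   recurring_task.delay()
```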
You can try adding a class field to the object that holds the function you're running, and use that field as a "some other guy is working or not" flag.
The lock is a good way with either beat or a cron.
But, be aware that beat jobs run at worker start time, not at beat run time.
This was causing me to hit a race condition even with a lock. Let's say the worker is off and beat throws 10 jobs into the queue. When Celery starts up with 4 processes, all 4 of them grab a task, and in my case 1 or 2 would get and set the lock at the same time.
Solution one is to use a cron with a lock, as a cron will execute at that time, not at worker start time.
Solution two is to use a slightly more advanced locking mechanism that handles race conditions. For Redis, look into SETNX, or the newer Redlock.
This blog post is really good, and includes a decorator pattern that uses redis-py's locking mechanism: http://loose-bits.com/2010/10/distributed-task-locking-in-celery.html.
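In the same spirit as the decorator in that post (though not the exact code from it), here is a hedged sketch using redis-py's built-in Lock; the lock name and timeout are assumptions:

```python
# Sketch: skip the task body if another instance currently holds the Redis lock.
import functools
from redis import Redis

redis_client = Redis(host="localhost", port=6379, db=0)

def single_instance(lock_name, timeout=300):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            lock = redis_client.lock(lock_name, timeout=timeout)
            if not lock.acquire(blocking=False):
                return None          # someone else is running; bail out
            try:
                return func(*args, **kwargs)
            finally:
                lock.release()       # the timeout protects against stale locks on crashes
        return wrapper
    return decorator

@single_instance("my-minutely-task")
def my_minutely_task():
    ...  # the actual work
```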