Django Celery background scraper - python

I need some help,
My setup is Django, Postgres, Celery, Redis – all dockerized. Besides the regular user-related features, the app should scrape info in the background.
What I need is to launch the scraping function manually from a management command like "python manage.py start_some_scrape --param1 --param2 ..etc" and know that the script keeps working in the background, informing me only via logs.
At the moment the script works without Celery, and only while the terminal connection is alive, which is not useful because the scraper should run for a long time – days, even.
Is Celery the right option for this?
How do I properly pass a task from a management command to Celery?
How do I prevent Celery from being blocked by this long-running task? Celery also has other tasks, both related and unrelated to the scraping script. Are there threads or some other way?
Thx for help!

A simple way would be to just send your task to the background of your shell. With nohup it shouldn't be terminated even if your shell session expires.
your-pc:~$ nohup python manage.py start_some_scrape --param1 --param2 > logfile.txt &

Your route to achieve what you want is:
Django --> Celery (Redis) --> SQLite <-- Scrapyd <-- Scrapy
If you want to use a shared DB like PostgreSQL, you need to patch Scrapyd, as it supports only SQLite.
There is a GitHub project, https://github.com/holgerd77/django-dynamic-scraper, that can do what you want, or at least show you how to pass Celery tasks.
Celery is an asynchronous task queue/job queue, so you don't get any kind of blocking on its side.
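To make the "pass a task from a management command" part concrete, here is a minimal sketch; the module, task, and queue names are made up for illustration:

# myapp/tasks.py
from celery import shared_task

@shared_task
def scrape_task(param1, param2):
    # long-running scraping loop; report progress via logging
    ...

# myapp/management/commands/start_some_scrape.py
from django.core.management.base import BaseCommand
from myapp.tasks import scrape_task

class Command(BaseCommand):
    help = "Queue the scraper to run in the background via Celery"

    def add_arguments(self, parser):
        parser.add_argument("--param1")
        parser.add_argument("--param2")

    def handle(self, *args, **options):
        # Send the job to a dedicated queue so the long scrape never
        # blocks the workers handling your other tasks.
        result = scrape_task.apply_async(
            args=[options["param1"], options["param2"]],
            queue="scrape",
        )
        self.stdout.write(f"Queued scrape task {result.id}")

The command returns immediately; a separate worker started with celery -A prj worker -Q scrape then picks the job up, while your other workers keep serving the default queue.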

Related

How do you schedule some python scripts to run regularly on a Windows PC?

I have some Python scripts that I'd like to run daily from a Windows PC.
My current workflow is:
The desktop PC stays on all day, every day, except for a weekly restart over the weekend.
After the restart I open VS Code and run a little bash script, ./start.sh, that kicks off the tasks.
The above works reasonably well, but it is also fairly painful. I need to re-run start.sh if I ever close VS Code (e.g. for an update). Also, the processes use some local Python libraries, so I need to stop them whenever I want to update those libraries.
With regards to how to do this properly, 4 tools came to mind:
Windows Scheduler
Airflow
Prefect (https://www.prefect.io/)
Rocketry (https://rocketry.readthedocs.io/en/stable/)
However, I can't quite get my head around the fundamental issue: if Prefect/Airflow/Rocketry run on my PC, then nothing will restart them after the PC reboots. I'm also not sure these tools will give me the isolation I'd prefer.
Docker came to mind: I could put each task into a Docker image and run them via some form of Docker swarm or something like that. But I'm not sure if I'm reinventing the wheel.
I'm 100% sure I'm not the first person in this situation. Could anyone point me to a guide on how this could be done well?
Note:
I am not considering running the python scripts in the cloud. They interact with local tools that are only licenced for my PC.
You can definitely use Prefect for that – it's very lightweight and seems to match what you're looking for. You install it with pip install prefect, start the Orion API server with prefect orion start, and once you create a Deployment and start an agent with prefect agent start -q default, you can even configure the schedule from the UI.
For more information about Deployments, check our FAQ section.
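For reference, a minimal flow might look roughly like this, assuming the Prefect 2.x @flow/@task API that ships with Orion (the function names are illustrative):

from prefect import flow, task

@task
def run_daily_job():
    ...  # your existing script logic goes here

@flow
def daily_flow():
    run_daily_job()

if __name__ == "__main__":
    daily_flow()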
It sounds like Rocketry could also be suitable. Rocketry can shut itself down using a task. You could create a task that:
Runs on the main thread and process (blocking starting new tasks)
Waits or terminates all the currently running tasks (use the session)
Calls session.shut_down(), which sets a shutdown flag on the scheduler.
There is also an app configuration, shut_cond, which is simply a condition: if it is true, the scheduler exits, so alternatively you can use that.
Then, after the line app.run(), you simply have a line that runs the shutdown -r (restart) command in a shell, using the subprocess library for example. You then need something that starts Rocketry again once the restart is completed. For this, perhaps this could be an answer: https://superuser.com/a/954957, or use the Windows Task Scheduler to set up a simple startup task that starts Rocketry.
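A rough sketch of that pattern, assuming Rocketry's documented Session argument and "main" execution mode (the daily schedule is just an example, and the wait-for-running-tasks step is simplified away):

import subprocess
from rocketry import Rocketry
from rocketry.args import Session

app = Rocketry()

# Runs on the main thread and process, so no new tasks start meanwhile.
@app.task("daily", execution="main")
def plan_restart(session=Session()):
    # Waiting for or terminating the currently running tasks is omitted;
    # the injected session object gives access to them.
    session.shut_down()  # sets the shutdown flag on the scheduler

if __name__ == "__main__":
    app.run()
    # The scheduler has exited at this point; trigger the restart.
    subprocess.run(["shutdown", "-r"])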
Especially if you had Linux machines (Raspberry Pis, for example), you could integrate Rocketry with FastAPI and make a small cluster in which the Rocketry apps communicate with each other; just set the Rocketry script up as a startup service. One machine could be a backup that calls another machine's API to run the Linux restart command. The backup then executes the tasks until the primary machine answers requests again (i.e. is back up and running).
As the author of the library I'm possibly biased toward my own projects, but Rocketry is very capable on complex scheduling problems – that's the purpose of the project.
You can use schtasks on Windows to schedule tasks like running a bash or Python script, and it's pretty reliable too.
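For example, something along these lines (the task name and script path are illustrative):
schtasks /create /tn "DailyPythonJob" /tr "python C:\scripts\start.py" /sc daily /st 09:00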

Celery - What pool should I use on Windows for heavy CPU processes, with a Redis backend for status tracking?

I am running Python 3.9, Windows 10, Celery 4.3, Redis as the backend, and AWS SQS as the broker. (I wasn't intending to use a backend, but it became more and more apparent that, due to the library's restrictions on Windows, I'd be better off using one if I could get it to work; otherwise I would've just used Redis as both broker and backend.)
To give you some context: I have a webpage the user interacts with to perform a resource-intensive task. If the user has a task running and decides to resend it, I need to kill the running task and use the new information sent by the user to create a new one.
The problem for me arrives after this line of thinking:
Me: "Hmmm, the prefork pool is used for heavy cpu background tasks... I want to use that..."
Me: Goes and configures settings.py,
updates the celery library,
sets the environment variable to allow windows to run prefork pool -
os.environ.setdefault('FORKED_BY_MULTIPROCESSING', '1'),
sets a few other configuration settings, etc,
runs the worker and it works.
Me: "Hey, hey. It works... Oh, I still can't revoke a task DESPITE RUNNING THE PREFORK POOL!?!?!
Oh, that's okay... I can just set a session variable to let me know if the user already started a task,
and if they have, just have celery tell me if the task that they started is finished
before I allow the user to request to run a task again."
Me: Goes and configures django sessions,
configures redis,
updates the views to include the session variable, etc,
Me: "Great! Everything is working, so far..."
Me: Runs a test to see if the redis server returns the status...
Celery: "PENDING"
Me: "Yo! Is my task done, yet!?"
Celery: "No - PENDING"
Celery: "PENDING"
Celery: "PENDING"
Celery: "PENDING"
Celery: "PENDING"
Celery: "PENDING"
Me: Searches Stack Overflow for why it's only PENDING...
Me: Finds out that you must use --pool=solo for the worker...
Me: Dies on the inside.
Ideally, I'd like to be able to use the prefork pool for the intense processing and to kill a task if need be. The thing is, everything I read tells me prefork is what I want, but solo is the only way I can get it to work.
Questions:
How bad is it for me to compromise on these desires and just go with solo, given that the tasks will be CPU-heavy and there will be many users? Assume hundreds if not thousands submitting tasks at once.
What other solutions should I consider?
In my experience on Windows I cannot use anything other than --pool=solo.
"What other solutions should I consider?"
The way I do it: I use the solo pool for Windows development and a bigger pool in production (Linux). At least in my case, using the solo pool for development is fine.
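Concretely, that amounts to something like this (the project name is a placeholder):
# Windows development: solo pool
celery -A proj worker --pool=solo -l info
# Linux production: prefork pool with real concurrency,
# where revoke(terminate=True) can actually kill the child process
celery -A proj worker --pool=prefork --concurrency=8 -l info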

Django: Celery worker in production, Ubuntu 18+

I'm learning Celery and I'd like to ask:
What is the absolute simplest way to get Celery to run automatically when Django starts on Ubuntu? Right now I start celery -A {prj name} worker -l INFO manually via the terminal.
Is there any configuration I can make so that Celery picks up changes in the tasks.py code without needing a restart? Right now I Ctrl+C and retype celery -A {prj name} worker -l INFO every time I change something in tasks.py. I can foresee a problem with this approach in production if Celery does start automatically: would I need to restart Ubuntu instead?
(Setup: VPS, Django, Ubuntu 18.10, no Docker, no external resources, using Redis, which starts automatically.)
I am aware this is a similar question to Django-Celery in production and How to ..., but it is still a bit unclear, as those refer to Amazon and also use shell scripts and crontabs. It seems a bit peculiar that these things wouldn't work out of the box.
I'll give the benefit of the doubt and assume I have misunderstood how Celery is meant to be set up.
I have a deploy script that launches Celery in production.
In production it's better to launch workers with:
celery multi stop 5
celery multi start 5 -A {prj name} -Q:1 default -Q:2,3 QUEUE1 -Q:4,5 QUEUE2 --pidfile="%n.pid"
This stops and then launches 5 workers for different queues.
At launch, Celery loads your application code and keeps using that instance of it, which means you need to relaunch the workers to apply modifications; you cannot add a file watcher in production (memory cost).
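To get the workers starting automatically on boot (the first question), a systemd unit is the usual approach on Ubuntu. A rough sketch modelled on the pattern in Celery's daemonization docs; the user, paths, and node name are placeholders to adapt:

[Unit]
Description=Celery worker
After=network.target redis-server.service

[Service]
Type=forking
User=celeryuser
WorkingDirectory=/path/to/project
RuntimeDirectory=celery
ExecStart=/path/to/venv/bin/celery multi start w1 -A prjname --pidfile=/run/celery/w1.pid --logfile=/var/log/celery/w1.log -l INFO
ExecStop=/path/to/venv/bin/celery multi stopwait w1 --pidfile=/run/celery/w1.pid

[Install]
WantedBy=multi-user.target

Save it as /etc/systemd/system/celery.service, enable it with systemctl enable --now celery, and after deploying changes to tasks.py run systemctl restart celery instead of restarting Ubuntu.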

Redis Flushall command in Heroku Scheduler (Python/Django Project)

For a Django application, I want Heroku Scheduler to perform the following commands:
heroku redis:cli
flushall
exit (ctrl-c)
I do this myself once a day now in the terminal, but it would be a lot easier if I could schedule these commands.
My question is: is it possible to put these commands in a Python script, or do I need to approach this another way? Does anyone have experience with this?
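It can be done in a short Python script that Heroku Scheduler runs (e.g. as python flush_redis.py; the file name is illustrative). A sketch using the redis-py package and the REDIS_URL config var that Heroku Redis sets:

import os
import redis

# Heroku Redis puts the connection string in the REDIS_URL config var.
# On plans that use TLS (rediss:// URLs) you may also need to pass
# ssl_cert_reqs=None to from_url.
r = redis.from_url(os.environ["REDIS_URL"])
r.flushall()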

Django cron job doesn't work?

I have followed these links for running a cron job:
django-cron
custom management commands
But all of these approaches work by running commands like:
python manage.py crontab add
or
python manage.py runcron
What I want is to avoid running the cron jobs separately from the Django server. I mean, I want to run the Django server and have it automatically call a certain function by itself while the server is running, for example every (say) 5 minutes.
If I understand correctly, you can't use django-cron if Django isn't running, so create a bash script to check whether Django is running and, if not, start it:
#!/bin/bash
MYPROG="myprog"
RESTART="myprog params"
PGREP="/usr/bin/pgrep"
# look for a running myprog pid
$PGREP "${MYPROG}"
# if it is not running, restart it
if [ $? -ne 0 ]
then
    $RESTART
fi
Then for your cron:
*/5 * * * * /myscript.sh
This is off the cuff and not individualized to your setup, but it's as good a place as any to start.
Let me see if I can elaborate:
On Linux you'll use cron; on Windows you'll use at. These are system services, not to be confused with Django – they are essentially scheduled tasks.
The custom management command is essentially a script that you point the cron job at.
You'll need to do some homework on what a cron job is (if you're using Linux) and see how to schedule a recurring task and how to have it issue the custom command. This is my understanding of what you're trying to do; if it's not, you need to clarify.
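For reference, a minimal custom management command that a cron entry could invoke might look like this (the app and command names are illustrative):

# myapp/management/commands/do_my_job.py
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Periodic job, meant to be run from cron"

    def handle(self, *args, **options):
        # call your every-5-minutes function here
        self.stdout.write("Job finished")

A matching crontab entry would then be something like:
*/5 * * * * cd /path/to/project && python manage.py do_my_job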
