I'm having trouble figuring out the best way to run a background process with Django that meets some specific requirements.
What I would like to be able to do:
Process(es) run(s) in an infinite loop once started (I will need 2 background processes, no more, no less)
Start/Stop/Get_Status of each process
Able to access the Postgres DB (which rules out the subprocess module, I think)
Even when no users have accessed the website, the process continues to run in the background if started.
Edit:
When the task that I need to run starts, it has to initialize itself with DB information in order to gather what it needs. After initialization, it compares new information with its prior results in order to get a delta value. Unfortunately, re-initializing each time the task runs defeats this purpose, so it must run in a continuous loop unless intentionally stopped by the user.
Options I have considered, but for which I haven't found reliable documentation on how to do what I want:
Celery
RQ
django-background-task
My requirements.txt in virtualenv (currently trying to get celery working):
amqp==1.4.7
anyjson==0.3.3
billiard==3.3.0.21
celery==3.1.19
Django==1.8.6
django-crispy-forms==1.5.2
kombu==3.0.29
psycopg2==2.6.1
pytz==2015.7
redis==2.10.5
requests==2.8.1
uWSGI==2.0.11.2
wheel==0.24.0
If I didn't supply enough information about my problem, I apologize in advance (this is my first time posting).
I think Celery is just what you need. You can take a look at periodic tasks for background work.
Also, it's very easy to start using Celery with Django. You can start learning it here.
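For reference, here is a minimal sketch of how a long-running task could be wired into a Django/Celery 3.1 project. The project/app names, the broker URL, and the cache-based stop flag are assumptions, not your code:

# proj/celery.py -- "proj" is a placeholder project name
from __future__ import absolute_import
import os
from celery import Celery

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'proj.settings')

app = Celery('proj', broker='redis://localhost:6379/0')  # broker URL is an assumption
app.config_from_object('django.conf:settings')
app.autodiscover_tasks()

# monitors/tasks.py -- "monitors" is a placeholder app name
import time
from celery import shared_task
from django.core.cache import cache

@shared_task(bind=True)
def watch_loop(self):
    state = {}                                        # initialize once from the DB here
    while not cache.get('stop-%s' % self.request.id):
        # compare new DB information against `state` and keep the delta
        time.sleep(10)

# starting / stopping / checking status, e.g. from a view:
# result = watch_loop.delay()              # start; store result.id in the DB or cache
# result.status                            # 'PENDING', 'STARTED', ... (needs a result backend)
# cache.set('stop-%s' % result.id, True)   # ask the loop to stop

Because the worker is its own process, the loop keeps running whether or not anyone visits the site, and you can start a second task (or worker) for the other background process.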
Related
I have a big script that retrieves a lot of data via an API (Magento invoices). This script is launched by a cron job every 20 minutes. But sometimes we need to refresh manually to get the latest invoiced orders. I have a dedicated page for this.
I would like to prevent a manual launch of the script when it's already running, because both the API and the script take a lot of resources and time.
I tried adding a "process" model with an is_active = True/False field that is tested to avoid re-launching the script if it's already active. At the start of the script, I switch the status to True and set it back to False when the script has finished.
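A rough sketch of that approach (the model, field, and function names here are placeholders, not my actual code):

from django.db import models

class Process(models.Model):
    is_active = models.BooleanField(default=False)

def fetch_invoices():
    pass    # stand-in for the long Magento API calls

def launch_import():
    proc, _ = Process.objects.get_or_create(pk=1)
    if proc.is_active:
        return                  # a run is (supposedly) already in progress
    proc.is_active = True
    proc.save()
    try:
        fetch_invoices()
    finally:
        proc.is_active = False
        proc.save()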
But it seems the 2nd instance of the script waits for the first one to finish before starting. In the end, both scripts run, because process.is_active is always False when it is checked.
I also tried with a request.session variable, but I hit the same issue.
I've spent a lot of time on this but haven't found a way to achieve my goal.
Has anyone faced a case like this before?
I'm running a python script remotely from a task machine and it creates a process that is supposed to be running for 3 hours. However, it seems to be terminating prematurely at exactly 2 hours. I don't believe it is a problem with the code because after the while loop ends, I am logging to a log file. The log file doesn't show that it exits out of that while loop successfully. Is there a specific setting on the machine that I need to look into that's interrupting my python process?
Is this perhaps a Scheduled Task? If so, have you checked the task's properties?
On my Windows 7 machine under the "Settings" tab is a checkbox for "Stop the task if it runs longer than:" with a box where you can specify the duration.
One of the suggested durations on my machine is "2 hours."
I have a bunch of .py scripts as part of a project. Some of them I want to start and leave running in the background while the others run through what they need to do.
For example, I have a script which takes a screenshot every 10 seconds until it is closed, and I wish to have this running in the background while the other scripts get called and run through to the finish.
Another example is a script which calculates the hash of every file in a designated folder. This has the potential to run for a fair amount of time, so it would be good if the rest of the scripts could be kicked off at the same time, so they do not have to wait for the hash script to finish before they are invoked.
Is multiprocessing the right method for this kind of processing, or is there another, better way to achieve these results, such as this answer: Run multiple python scripts concurrently?
You could also use something like Celery to run the tasks asynchronously; you'll be able to call tasks from within your Python code instead of through the shell.
It depends. With multiprocessing you can create a process manager that spawns the processes the way you want, but there are more flexible ways to do it without writing that coordination code yourself. Multiprocessing is usually hard.
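If you do go the multiprocessing route, a minimal sketch might look like this (the screenshot and hashing bodies below are simplified stand-ins for the real scripts):

import hashlib
import multiprocessing
import os
import time

def screenshot_loop():
    # stand-in for the real screenshot script: do something every 10 seconds forever
    while True:
        print("screenshot taken")
        time.sleep(10)

def hash_folder(path):
    # stand-in for the real hashing script: hash every file in the folder
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            with open(full, 'rb') as f:
                print(name, hashlib.sha256(f.read()).hexdigest())

if __name__ == '__main__':
    shots = multiprocessing.Process(target=screenshot_loop)
    shots.daemon = True          # dies when the main script exits
    shots.start()
    hasher = multiprocessing.Process(target=hash_folder, args=('.',))
    hasher.start()
    # ...kick off the remaining scripts here in the same way...
    hasher.join()                # wait only for the work you actually need to finish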
Check out circus; it's a process manager written in Python that you can use as a library, standalone, or via a remote API. You can define hooks to model dependencies between processes (see the docs).
A simple configuration could be:
[watcher:one-shot-script]
cmd = python script.py
numprocesses = 1
warmup_delay = 30
[watcher:snapshots]
cmd = python snapshots.py
numprocesses = 1
warmup_delay = 30
[watcher:hash]
cmd = python hashing.py
numprocesses = 1
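With a config like that saved as, say, circus.ini (the filename is up to you), you start everything with circusd circus.ini and can then start, stop, or check the status of individual watchers with circusctl.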
I'm starting to port a nightly data pipeline from a visual ETL tool to Luigi, and I really enjoy that there is a visualiser to see the status of jobs. However, I've noticed that a few minutes after the last job (named MasterEnd) completes, all of the nodes disappear from the graph except for MasterEnd. This is a little inconvenient, as I'd like to see that everything is complete for the day/past days.
Further, if in the visualiser I go directly to the last job's URL, it can't find any history that it ran: Couldn't find task MasterEnd(date=2015-09-17, base_url=http://aws.east.com/, log_dir=/home/ubuntu/logs/). I have verified that it ran successfully this morning.
One thing to note is that I have a cron that runs this pipeline every 15 minutes to check for a file on S3. If it exists, it runs, otherwise it stops. I'm not sure if that is causing the removal of tasks from the visualiser or not. I've noticed it generates a new PID every run, but I couldn't find a way to persist one PID/day in the docs.
So, my questions: Is it possible to persist the completed graph for the current day in the visualiser? And is there a way to see what has happened in the past?
Appreciate all the help
I'm not 100% positive this is correct, but it's what I would try first. When you call luigi.run, pass it --scheduler-remove-delay. I'm guessing this is how long the scheduler waits before forgetting a task after all of its dependents have completed. If you look through luigi's source, the default is 600 seconds. For example:
luigi.run(["--workers", "8", "--scheduler-remove-delay", "86400"], main_task_cls=task_name)
If you configure the remove_delay setting in your luigi.cfg then it will keep the tasks around for longer.
[scheduler]
record_task_history = True
state_path = /x/s/hadoop/luigi/var/luigi-state.pickle
remove_delay = 86400
Note, there is a typo in the documentation ("remove-delay" instead of "remove_delay") which is being fixed under https://github.com/spotify/luigi/issues/2133
I have a small Python script that basically connects to a SQL Server (Microsoft) database and gets users from there, then syncs them to another MySQL database; basically I'm just running queries to check whether a user exists and, if not, adding that user to the MySQL database.
The script usually takes around 1 minute to sync. I need the script to do its work every 5 minutes (for example), exactly once (one sync per 5 minutes).
What would be the best way to go about building this?
I have some test data for the users, but on the real site there are a lot more users, so I can't guarantee the script takes 1 minute to execute; it might even take 20 minutes. However, having an interval of, say, 15 minutes between executions would be ideal for the problem...
Update:
I have the connection params for the SQL Server (Windows) DB, so I'm using a small Ubuntu server to sync between the two databases, which live on different servers. So let's say db1 (Windows) and db2 (Linux) are the database servers; I'm using s1 (the Python server) with the pymssql and MySQL Python modules to sync.
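A simplified sketch of what one sync pass does, using pymssql for db1 and pymysql for db2 (MySQLdb works the same way); the table names and credentials here are placeholders:

import pymssql
import pymysql

def sync_users():
    src = pymssql.connect(server='db1', user='sync', password='secret', database='crm')
    dst = pymysql.connect(host='db2', user='sync', password='secret', db='site')
    try:
        s = src.cursor()
        d = dst.cursor()
        s.execute("SELECT username, email FROM users")
        for username, email in s.fetchall():
            d.execute("SELECT 1 FROM users WHERE username = %s", (username,))
            if d.fetchone() is None:
                d.execute("INSERT INTO users (username, email) VALUES (%s, %s)",
                          (username, email))
        dst.commit()
    finally:
        src.close()
        dst.close()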
Regards
I am not sure cron is right for the job. It seems to me that if you have it run every 15 minutes but a sync sometimes takes 20 minutes, you could have multiple processes running at once and possibly colliding.
If the driving force is a constant wait time between the variable execution times, then you might need a continuously running process with a wait.
import time

def main():
    loopInt = 0
    while loopInt < 10000:
        synchDatabase()    # your existing sync function
        loopInt += 1
        print("call #" + str(loopInt))
        time.sleep(300)    # sleep 5 minutes

main()
(Obviously not continuous, but long-running.) You can change the while condition to True and it will be continuous (and comment out loopInt += 1).
Edited to add: please see the note in the comments about monitoring the process, as you don't want the script to hang or crash without you being aware of it.
You might want to use a system that handles queues, for example RabbitMQ, and use Celery as the Python interface to implement it. With Celery, you can add tasks (like the execution of a script) to a queue, or run a schedule that performs a task after a given interval (just like cron).
Get started http://celery.readthedocs.org/en/latest/
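A minimal sketch of such a schedule (the module name, broker URL, and sync_users body are placeholders; on Celery 4+ the setting is spelled beat_schedule instead of CELERYBEAT_SCHEDULE):

# tasks.py
from celery import Celery

app = Celery('sync', broker='redis://localhost:6379/0')

@app.task
def sync_users():
    pass    # put the existing MSSQL -> MySQL sync code here

app.conf.CELERYBEAT_SCHEDULE = {
    'sync-every-5-minutes': {
        'task': 'tasks.sync_users',
        'schedule': 300.0,      # seconds between runs
    },
}

Run it with celery -A tasks worker --beat. Note that beat only enqueues the task every 5 minutes; if a sync can outlast the interval, you'd still want a lock (or a single-process worker, -c 1) so two syncs never overlap.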