I need to implement a scheduled task in our Django app. DBader's schedule seems to be a good candidate for the job; however, when I run it as part of a Django project, it doesn't seem to produce the desired effect.
Specifically, this works fine as an independent program:
import schedule
import time
import logging

log = logging.getLogger(__name__)

def handleAnnotationsWithoutRequests(settings):
    '''
    From settings passed in, grab job-ids list.
    For each job-id in that list, perform annotation group/set logic [for details, refer to
    handleAnnotationsWithRequests(requests, username), sans requests; those are obtained
    from the db based on job-id].
    '''
    print('Received settings: {}'.format(str(settings)))

def job():
    print("I'm working...")

#schedule.every(3).seconds.do(job)
#schedule.every(2).seconds.do(handleAnnotationsWithoutRequests, settings={'a': 'b'})

invoc_time = "10:33"
schedule.every().day.at(invoc_time).do(handleAnnotationsWithoutRequests, settings={'a': 'b'})

while True:
    schedule.run_pending()
    time.sleep(1)
But this (equivalent) code, run in a Django context, doesn't result in an invocation.
def handleAnnotationsWithoutRequests(settings):
    '''
    From settings passed in, grab job-ids list.
    For each job-id in that list, perform annotation group/set logic [for details, refer to
    handleAnnotationsWithRequests(requests, username), sans requests; those are obtained
    from the db based on job-id].
    '''
    log.info('Received settings: {}'.format(str(settings)))

def doSchedule(settings):
    '''
    With the scheduler library:
    based on the time specified in settings, invoke handleAnnotationsWithoutRequests(settings).
    '''
    #settings will need to be reconstituted from the DB first
    #settings = {}
    invocationTime = settings['running_at']
    import re
    invocationTime = re.sub(r'([AaPp][Mm])', "", invocationTime)
    log.info("Invocation time to be used: {}".format(invocationTime))
    schedule.every().day.at(invocationTime).do(handleAnnotationsWithoutRequests, settings=settings)
    while True:
        schedule.run_pending()
        time.sleep(1)
So the log output from handleAnnotationsWithoutRequests() never appears on the console.
Is this scheduling library compatible with Django? Are there any usage samples that one could refer me to?
I suspect some threading issues are at work here. Perhaps there are better alternatives to use? Suggestions are welcome.
Thank you in advance.
For web servers, you probably don't want something that runs in-process:
An in-process scheduler for periodic jobs [...]
https://github.com/Tivix/django-cron has proven to be a working solution.
There's also the heavyweight champion Celery and Celerybeat.
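For the Celery beat route, the periodic call from the question could be declared as a beat schedule entry. A minimal sketch, assuming Celery is already wired into the project and reads Django settings with the CELERY_ namespace (the task path below is a placeholder):

# settings.py -- minimal sketch only
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    'handle-annotations-daily': {
        'task': 'yourapp.tasks.handle_annotations_without_requests',  # hypothetical task path
        'schedule': crontab(hour=10, minute=33),  # every day at 10:33
        'kwargs': {'settings': {'a': 'b'}},
    },
}

The beat process (celery -A yourproject beat) then dispatches the task to a worker on that schedule.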
I do this a lot with Django management commands.
The pattern I use is to set up a new Django command in my app and then make it a long-running process inside a never-ending while loop.
The loop iterates continuously, with a custom-defined sleep timer.
The short version is here, with a bit of pseudo-code thrown in. You can see a working version of this pattern in my Django Reference Implementation.
from time import sleep

from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = 'My long-running job'

    def handle(self, *args, **options):
        self.stdout.write(self.style.SUCCESS('Starting long-running job.'))
        while True:
            if conditions_met_for_job:  # pseudo-code: your own trigger condition
                self.job()
            sleep(5)

    def job(self):
        self.stdout.write(self.style.SUCCESS('Running the job...'))
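Applied to the schedule library from the question, the same pattern can simply drive run_pending() inside handle(). A sketch, assuming the question's function is importable (the import path is made up):

import time

import schedule
from django.core.management.base import BaseCommand

from annotations.jobs import handleAnnotationsWithoutRequests  # hypothetical import path

class Command(BaseCommand):
    help = 'Run the annotation scheduler as a long-lived process'

    def handle(self, *args, **options):
        schedule.every().day.at("10:33").do(
            handleAnnotationsWithoutRequests, settings={'a': 'b'})
        self.stdout.write(self.style.SUCCESS('Scheduler started.'))
        while True:
            schedule.run_pending()
            time.sleep(1)

You would then run it with python manage.py <commandname> as its own process, separate from the web workers.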
This is the first time I'm using Celery, and honestly, I'm not sure I'm doing it right. My system has to run on Windows, so I'm using RabbitMQ as the broker.
As a proof of concept, I'm trying to create a single object where one task sets the value, another task reads the value, and I also want to show the current value of the object when I go to a certain url. However I'm having problems sharing the object between everything.
This is my celery.py
from __future__ import absolute_import, unicode_literals
import os
from celery import Celery
from django.conf import settings

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'cesGroundStation.settings')

app = Celery('cesGroundStation')
app.config_from_object('django.conf:settings')
app.autodiscover_tasks(lambda: settings.INSTALLED_APPS)

@app.task(bind=True)
def debug_task(self):
    print('Request: {0!r}'.format(self.request))
The object I'm trying to share is:
class SchedulerQ():
    item = 0

    def setItem(self, item):
        self.item = item

    def getItem(self):
        return self.item
This is my tasks.py
from celery import shared_task
from time import sleep
from scheduler.schedulerQueue import SchedulerQ

schedulerQ = SchedulerQ()

@shared_task()
def SchedulerThread():
    print("Starting Scheduler")
    counter = 0
    while(1):
        counter += 1
        if(counter > 100):
            counter = 0
        schedulerQ.setItem(counter)
        print("In Scheduler thread - " + str(counter))
        sleep(2)
    print("Exiting Scheduler")

@shared_task()
def RotatorsThread():
    print("Starting Rotators")
    while(1):
        item = schedulerQ.getItem()
        print("In Rotators thread - " + str(item))
        sleep(2)
    print("Exiting Rotators")

@shared_task()
def setSchedulerQ(schedulerQueue):
    schedulerQ = schedulerQueue

@shared_task()
def getSchedulerQ():
    return schedulerQ
I'm starting my tasks in my apps.py. I'm not sure if this is the right place, as the tasks/workers don't seem to run until I start the workers in a separate console, where I run celery -A cesGroundStation -l info.
from django.apps import AppConfig
from scheduler.schedulerQueue import SchedulerQ
from scheduler.tasks import SchedulerThread, RotatorsThread, setSchedulerQ, getSchedulerQ

class SchedulerConfig(AppConfig):
    name = 'scheduler'

    def ready(self):
        schedulerQ = SchedulerQ()
        setSchedulerQ.delay(schedulerQ)
        SchedulerThread.delay()
        RotatorsThread.delay()
In my views.py I have this:
def schedulerQ():
    queue = getSchedulerQ.delay()
    return HttpResponse("Your list: " + queue)
The Django app runs without errors; however, my output from "celery -A cesGroundStation -l info" is this (screenshot): Celery command output
First, it seems to start multiple "SchedulerThread" tasks; secondly, the "SchedulerQ" object isn't being passed to the Rotators, as it's not reading the updated value.
And if I go to the URL that shows the views.schedulerQ view, I get this error (screenshot):
Django views error
I have very, very little experience with Python, Django and Web Development in general, so I have no idea where to start with that last error. Solutions suggest using Redis to pass the object to the views, but I don't know how I'd do that using RabbitMQ. Later on the schedulerQ object will implement a queue and the scheduler and rotators will act as more of a producer/consumer dynamic with the view showing the contents of the queue, so I believe using the database might be too resource intensive. How can I share this object across all tasks, and is this even the right approach?
The right approach would be to use a persistence layer, such as a database or a result backend, to store the information you want to share between tasks (in this example, what you are currently putting in your class).
Celery operates on a distributed message-passing paradigm. A good way to distill that idea for this example is that your module will be executed independently every time a task is dispatched. Whenever a task is dispatched to Celery, you must assume it is running in a separate interpreter and loaded independently of other tasks. That SchedulerQ class is instantiated anew each time.
You can share information between tasks in the ways described in the docs linked previously, and some best-practice tips discuss data persistence concerns.
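For instance, a minimal sketch of sharing the counter through Django's cache instead of a module-level object (assumes the default cache is backed by something every worker can reach, e.g. Redis or memcached; the task names are placeholders):

# tasks.py -- hypothetical sketch, not the code from the question
from celery import shared_task
from django.core.cache import cache

@shared_task
def scheduler_tick():
    # Read, increment and persist the counter in a store every worker process can see.
    counter = cache.get('scheduler_counter', 0) + 1
    cache.set('scheduler_counter', counter, timeout=None)

@shared_task
def rotators_tick():
    # Any worker process reads the same shared value back.
    print('Rotators sees counter =', cache.get('scheduler_counter', 0))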
I am working on an application where I do heavy data processing to generate a completely new set of data, which is then saved to the database. The application takes a long time to process and save the data. I want to improve the user experience by redirecting the user to the result page first and then saving the data in the background (perhaps asynchronously). My problem is that in order to display the result page I need the newly processed data. Is there a way to run the data processing and data saving in the background so that, as soon as the processing part is complete (before saving to the database), the processed data is available on the result page?
Asynchronous tasks can be accomplished in Python using Celery. You can simply push the task to the Celery queue and it will be performed asynchronously. You can then poll from the result page to check whether it is completed, as sketched below.
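A minimal sketch of that flow, assuming a result backend is configured (the task body and helper names are placeholders):

# tasks.py
from celery import shared_task

@shared_task
def process_data(payload):
    return expensive_processing(payload)  # hypothetical: your own processing logic

# view-side helpers: dispatch once, then poll until the result is ready
from celery.result import AsyncResult

def start_processing(payload):
    return process_data.delay(payload).id  # hand this task id to the result page

def poll_processing(task_id):
    res = AsyncResult(task_id)
    return res.result if res.ready() else None  # the processed data, before it is saved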
Another alternative could be something like Tornado.
Another strategy is to write a threading class that starts up custom management commands you author to behave as worker threads. This is perhaps a little lighter weight than working with something like Celery, and of course has both advantages and disadvantages. I also used this technique to sequence/automate migration generation/application during application startup (because it lives in a pipeline). My gunicorn startup script then starts these threads in pre_exec() or when_ready(), etc., as appropriate, and then stops them in on_exit().
# Description: Asynchronous Worker Threading via Django Management Commands
# Lets you run an arbitrary Django management command, either a pre-baked one like migrate,
# or a custom one that you've created, as a worker thread, that can spin forever, or not.
# You can use this to take care of maintenance tasks at start-time, like db migration,
# db flushing, etc, or to run long-running asynchronous tasks.
# I sometimes find this to be a more useful pattern than using something like django-celery,
# as I can debug/use the commands I write from the shell as well, for administrative purposes.
import json
import os
import requests
import sys
import time
import uuid
import logging
import threading
import inspect
import ctypes

from django.core.management import call_command
from django.conf import settings


class DjangoWorkerThread(threading.Thread):
    """
    Initializes a separate thread for running an arbitrary Django management command. This is
    one (simple) way to make asynchronous worker threads. There exist richer, more complex
    ways of doing this in Django as well (django-celery).

    The advantage of this pattern is that you can run the worker from the command line as well,
    via manage.py, for the sake of rapid development, easy testing, debugging, management, etc.

    :param commandname: name of a properly created Django management command, which exists
        inside the app/management/commands folder in one of the apps in your project.
    :param arguments: string containing command line arguments formatted like you would
        when calling the management command via manage.py in a shell
    :param restartwait: integer seconds to wait before restarting worker if it dies,
        or if a once-through command, acts as a thread-loop delay timer
    """

    def __init__(self, commandname, arguments="", restartwait=10, logger=""):
        super(DjangoWorkerThread, self).__init__()
        self.commandname = commandname
        self.arguments = arguments
        self.restartwait = restartwait
        self.name = commandname
        self.event = threading.Event()
        if logger:
            self.l = logger
        else:
            self.l = logging.getLogger('root')

    def run(self):
        """
        Start the thread.
        """
        try:
            exceptioncount = 0
            exceptionlimit = 10
            while not self.event.is_set():
                try:
                    if self.arguments:
                        self.l.info('Starting ' + self.name + ' worker thread with arguments ' + self.arguments)
                        call_command(self.commandname, self.arguments)
                    else:
                        self.l.info('Starting ' + self.name + ' worker thread with no arguments')
                        call_command(self.commandname)
                    self.event.wait(self.restartwait)
                except Exception as e:
                    self.l.error(self.commandname + ' Unknown error: {}'.format(str(e)))
                    exceptioncount += 1
                    if exceptioncount > exceptionlimit:
                        self.l.error(self.commandname + " : " + self.arguments + " : Exceeded exception retry limit, aborting.")
                        self.event.set()
        finally:
            self.l.info('Stopping command: ' + self.commandname + " " + self.arguments)

    def stop(self):
        """Nice Stop

        Stop nicely by setting an event.
        """
        self.l.info("Sending stop event to self...")
        self.event.set()
        #then make sure it's dead...and schwack it harder if not.
        #kill it with fire! be mean to your software. it will make you write better code.
        self.l.info("Sent stop event, checking to see if thread died.")
        if self.is_alive():
            self.l.info("Still not dead, telling self to murder self...")
            time.sleep(0.1)
            os._exit(1)


def start_worker(command_name, command_arguments="", restart_wait=10, logger=""):
    """
    Starts a background worker thread running a Django management command.

    :param str command_name: the name of the Django management command to run,
        typically would be a custom command implemented in yourapp/management/commands,
        but could also be used to automate standard Django management tasks
    :param str command_arguments: a string containing the command line arguments
        to supply to the management command, formatted as if one were invoking
        the command from a shell
    """
    if logger:
        l = logger
    else:
        l = logging.getLogger('root')

    # Start the thread
    l.info("Starting worker: " + command_name + " : " + command_arguments + " : " + str(restart_wait))
    worker = DjangoWorkerThread(command_name, command_arguments, restart_wait, l)
    worker.start()
    l.info("Worker started: " + command_name + " : " + command_arguments + " : " + str(restart_wait))

    # Return the thread instance
    return worker

#<----------------------------------------------------------------------------->

def stop_worker(worker, logger=""):
    """
    Gracefully shuts down the worker thread.

    :param threading.Thread worker: the worker thread object
    """
    if logger:
        l = logger
    else:
        l = logging.getLogger('root')

    # Shutdown the thread
    l.info("Stopping worker: " + worker.commandname + " : " + worker.arguments + " : " + str(worker.restartwait))
    worker.stop()
    worker.join(worker.restartwait)
    l.info("Worker stopped: " + worker.commandname + " : " + worker.arguments + " : " + str(worker.restartwait))
The long-running task can be offloaded with Celery. You can still get all the updates and results. Your web application code should take care of polling for updates and results. http://blog.miguelgrinberg.com/post/using-celery-with-flask explains how one can achieve this.
Some useful steps:
Configure celery with result back-end.
Execute the long running task asynchronously.
Let the task update its state periodically, or when it completes some stage of the job.
Poll from web application to get the status/result.
Display the results on UI.
There is a need for bootstrapping it all together, but once done it can be reused and it is fairly performant.
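A rough sketch of steps 2-4, assuming a result backend is configured (task, view, and helper names are made up):

# tasks.py
from celery import shared_task

@shared_task(bind=True)
def generate_dataset(self, source_id):
    chunks = load_chunks(source_id)               # hypothetical: your own loader
    for i, chunk in enumerate(chunks):
        process(chunk)                            # hypothetical: your own processing
        # Step 3: report progress so the result page can poll it.
        self.update_state(state='PROGRESS', meta={'done': i + 1})
    return {'status': 'complete'}

# views.py -- step 4: the result page polls this endpoint
from celery.result import AsyncResult
from django.http import JsonResponse

def task_progress(request, task_id):
    res = AsyncResult(task_id)
    return JsonResponse({'state': res.state, 'info': res.info})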
It's the same process as a synchronous request. You will use a view that returns a JsonResponse. The 'tricky' part is on the client side, where you have to make the asynchronous call to the view.
The following script, which I'm using to listen to an IMAP connection using IMAP IDLE, depends heavily on threads. What's the easiest way for me to eliminate the thread calls and just use the main thread?
As a new Python developer, I tried editing the def __init__(self, conn): method, but just got more and more errors.
A code sample would help me a lot.
#!/usr/local/bin/python2.7

print "Content-type: text/html\r\n\r\n";

import socket, ssl, json, struct, re
import imaplib2, time
from threading import *

# enter gmail login details here
USER = "username#gmail.com"
PASSWORD = "password"

# enter device token here
deviceToken = 'my device token x x x x x'
deviceToken = deviceToken.replace(' ', '').decode('hex')

currentBadgeNum = -1

def getUnseen():
    (resp, data) = M.status("INBOX", '(UNSEEN)')
    print data
    return int(re.findall("UNSEEN (\d)*\)", data[0])[0])

def sendPushNotification(badgeNum):
    global currentBadgeNum, deviceToken
    if badgeNum != currentBadgeNum:
        currentBadgeNum = badgeNum
        thePayLoad = {
            'aps': {
                'alert': 'Hello world!',
                'sound': '',
                'badge': badgeNum,
            },
            'test_data': {'foo': 'bar'},
        }
        theCertfile = 'certif.pem'
        theHost = ('gateway.push.apple.com', 2195)
        data = json.dumps(thePayLoad)
        theFormat = '!BH32sH%ds' % len(data)
        theNotification = struct.pack(theFormat, 0, 32,
                                      deviceToken, len(data), data)
        ssl_sock = ssl.wrap_socket(socket.socket(socket.AF_INET,
                                                 socket.SOCK_STREAM), certfile=theCertfile)
        ssl_sock.connect(theHost)
        ssl_sock.write(theNotification)
        ssl_sock.close()
        print "Sent Push alert."

# This is the threading object that does all the waiting on
# the event
class Idler(object):
    def __init__(self, conn):
        self.thread = Thread(target=self.idle)
        self.M = conn
        self.event = Event()

    def start(self):
        self.thread.start()

    def stop(self):
        # This is a neat trick to make thread end. Took me a
        # while to figure that one out!
        self.event.set()

    def join(self):
        self.thread.join()

    def idle(self):
        # Starting an unending loop here
        while True:
            # This is part of the trick to make the loop stop
            # when the stop() command is given
            if self.event.isSet():
                return
            self.needsync = False

            # A callback method that gets called when a new
            # email arrives. Very basic, but that's good.
            def callback(args):
                if not self.event.isSet():
                    self.needsync = True
                    self.event.set()

            # Do the actual idle call. This returns immediately,
            # since it's asynchronous.
            self.M.idle(callback=callback)
            # This waits until the event is set. The event is
            # set by the callback, when the server 'answers'
            # the idle call and the callback function gets
            # called.
            self.event.wait()
            # Because the function sets the needsync variable,
            # this helps escape the loop without doing
            # anything if the stop() is called. Kinda neat
            # solution.
            if self.needsync:
                self.event.clear()
                self.dosync()

    # The method that gets called when a new email arrives.
    # Replace it with something better.
    def dosync(self):
        print "Got an event!"
        numUnseen = getUnseen()
        sendPushNotification(numUnseen)

# Had to do this stuff in a try-finally, since some testing
# went a little wrong.....
while True:
    try:
        # Set the following two lines to your creds and server
        M = imaplib2.IMAP4_SSL("imap.gmail.com")
        M.login(USER, PASSWORD)
        M.debug = 4
        # We need to get out of the AUTH state, so we just select
        # the INBOX.
        M.select("INBOX")
        numUnseen = getUnseen()
        sendPushNotification(numUnseen)

        typ, data = M.fetch(1, '(RFC822)')
        raw_email = data[0][1]

        import email
        email_message = email.message_from_string(raw_email)
        print email_message['Subject']

        #print M.status("INBOX", '(UNSEEN)')
        # Start the Idler thread
        idler = Idler(M)
        idler.start()

        # Sleep forever, one minute at a time
        while True:
            time.sleep(60)
    except imaplib2.IMAP4.abort:
        print("Disconnected. Trying again.")
    finally:
        # Clean up.
        #idler.stop()  # Commented out to see the real error
        #idler.join()  # Commented out to see the real error
        #M.close()     # Commented out to see the real error

        # This is important!
        M.logout()
As far as I can tell, this code is hopelessly confused because the author used the imaplib2 project library, which forces a threading model that this code then never uses.
Only one thread is ever created, which wouldn't need to be a thread but for the choice of imaplib2. However, as the imaplib2 documentation notes:
This module presents an almost identical API as that provided by the standard python library module imaplib, the main difference being that this version allows parallel execution of commands on the IMAP4 server, and implements the IMAP4rev1 IDLE extension. (imaplib2 can be substituted for imaplib in existing clients with no changes in the code, but see the caveat below.)
Which makes it appear that you should be able to throw out much of class Idler and just use the connection M. I recommend that you look at Doug Hellman's excellent Python Module Of The Week for module imaplib prior to looking at the official documentation. You'll need to reverse engineer the code to find out its intent, but it looks to me like:
Open a connection to GMail
Check for unseen messages in the Inbox
Count unseen messages from (2)
Send a dummy message to some service at gateway.push.apple.com
Wait for notice, goto (2)
Perhaps the most interesting thing about the code is that it doesn't appear to do anything, although what sendPushNotification (step 4) does is a mystery, and the one line that uses an imaplib2 specific service:
self.M.idle(callback=callback)
uses a named argument that I don't see in the module documentation. Do you know if this code ever actually ran?
Aside from unneeded complexity, there's another reason to drop imaplib2: it exists independently on SourceForge and PyPI, of which one maintainer claimed two years ago that "An attempt will be made to keep it up-to-date with the original". Which one do you have? Which would you install?
Don't do it
Since you are trying to remove the Thread usage solely because you didn't find how to handle the exceptions from the server, I don't recommend removing it: because of the async nature of the library itself, the Idler handles this more smoothly than a single thread could.
Solution
You need to wrap the self.M.idle(callback=callback) with try-except and then re-raise it in the main thread. Then you handle the exception by re-running the code in the main thread to restart the connection.
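A rough sketch of that pattern with generic names (not a drop-in patch for the script above): the worker thread catches the exception and pushes it onto a queue, and the main thread blocks on that queue and reacts.

import Queue        # "import queue" on Python 3
import threading

errors = Queue.Queue()

def worker():
    try:
        risky_imap_idle_loop()   # hypothetical: the code that may raise imaplib2.IMAP4.abort
    except Exception as e:
        errors.put(e)            # hand the exception over to the main thread

t = threading.Thread(target=worker)
t.start()

exc = errors.get()               # blocks until the worker reports a failure
t.join()
# Handle exc here in the main thread: log out, reconnect, start a new worker, etc.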
You can find more details of the solution and possible reasons in this answer: https://stackoverflow.com/a/50163971/1544154
Complete solution is here: https://www.github.com/Elijas/email-notifier
My Scrapy script seems to work just fine when I run it in 'one off' scenarios from the command line, but if I try running the code twice in the same python session I get this error:
"ReactorNotRestartable"
Why?
The offending code (last line throws the error):
crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()
# schedule spider
#crawler.crawl(MySpider())
spider = MySpider()
crawler.queue.append_spider(spider)
# start engine scrapy/twisted
crawler.start()
Close to Joël's answer, but I want to elaborate a bit more than is possible in the comments. If you look at the Crawler source code, you see that the CrawlerProcess class has a start, but also a stop function. This stop function takes care of cleaning up the internals of the crawling so that the system ends up in a state from which it can start again.
So, if you want to restart the crawling without leaving your process, call crawler.stop() at the appropriate time. Later on, simply call crawler.start() again to resume operations.
Edit: in retrospect, this is not possible (due to the Twisted reactor, as mentioned in a different answer); the stop just takes care of a clean termination. Looking back at my code, I happened to have a wrapper for the Crawler processes. Below you can find some (redacted) code to make it work using Python's multiprocessing module. In this way you can more easily restart crawlers. (Note: I found the code online last month, but I didn't include the source... so if someone knows where it came from, I'll update the credits for the source.)
from scrapy import project, signals
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from multiprocessing.queues import Queue
from multiprocessing import Process

class CrawlerWorker(Process):
    def __init__(self, spider, results):
        Process.__init__(self)
        self.results = results
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        self.spider = spider
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.results.put(self.items)

# The part below can be called as often as you want
results = Queue()
crawler = CrawlerWorker(MySpider(myArgs), results)
crawler.start()
for item in results.get():
    pass  # Do something with item
crawler.start() starts the Twisted reactor. There can be only one reactor.
If you want to run more spiders, use
another_spider = MyAnotherSpider()
crawler.queue.append_spider(another_spider)
I've used threads to start the reactor several times in one app and avoid the ReactorNotRestartable error.
Thread(target=process.start).start()
Here is the detailed explanation: Run a Scrapy spider in a Celery Task
Seems to me that you cannot use the crawler.start() command twice: you may have to re-create the crawler if you want it to run a second time.
I've got an event-driven chatbot and I'm trying to implement spam protection. I want to silence a user who is behaving badly for a period of time, without blocking the rest of the application.
Here's what doesn't work:
if user_behaving_badly():
    ban( user )
    time.sleep( penalty_duration ) # Bad! Blocks the entire application!
    unban( user )
Ideally, if user_behaving_badly() is true, I want to start a new thread which does nothing but ban the user, then sleep for a while, unban the user, and then the thread disappears.
According to this I can accomplish my goal using the following:
if user_behaving_badly():
    thread.start_new_thread( banSleepUnban, ( user, penalty ) )
"Simple" is usually an indicator of "good", and this is pretty simple, but everything I've heard about threads has said that they can bite you in unexpected ways. My question is: Is there a better way than this to run a simple delay loop without blocking the rest of the application?
Instead of starting a thread for each ban, put the bans in a priority queue and have a single thread do the sleeping and unbanning.
This code keeps two structures: a heapq, which lets it quickly find the soonest ban to expire, and a dict, which makes it possible to quickly check by name whether a user is banned.
import time
import threading
import heapq

class Bans():
    def __init__(self):
        self.lock = threading.Lock()
        self.event = threading.Event()
        self.heap = []
        self.dict = {}
        self.thread = threading.Thread(target=self.expiration_thread)
        self.thread.daemon = True
        self.thread.start()

    def ban_user(self, user, duration):
        with self.lock:
            now = time.time()
            expiration = (now + duration)
            heapq.heappush(self.heap, (expiration, user))
            self.dict[user] = expiration
            self.event.set()

    def is_user_banned(self, user):
        with self.lock:
            now = time.time()
            expiration = self.dict.get(user, None)
            return expiration is not None and expiration > now

    def expiration_thread(self):
        while True:
            self.event.wait()
            with self.lock:
                next_expiration, user = self.heap[0]
                now = time.time()
                duration = next_expiration - now
            if duration > 0:
                time.sleep(duration)
            with self.lock:
                if self.heap[0][0] == next_expiration:
                    heapq.heappop(self.heap)
                    del self.dict[user]
                if not self.heap:
                    self.event.clear()
and is used like this:
B = Bans()
B.ban_user("phil", 30.0)
B.is_user_banned("phil")
Use a threading timer object, like this:
t = threading.Timer(30.0, unban)
t.start() # after 30 seconds, unban will be run
Then only unban is run in the thread.
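If unban takes the user as an argument, it can be passed through the Timer's args parameter (a small sketch):

import threading

t = threading.Timer(30.0, unban, args=[user])
t.start()  # after 30 seconds, unban(user) runs in its own short-lived thread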
Why thread at all?
do_something(user):
    if(good_user(user)):
        # do it
    else:
        # don't

good_user(user):
    if(is_user_banned(user)):
        if(past_time_since_ban(user)):
            unban_user(user)
    elif(is_user_bad(user)):
        ban_user(user)

ban_user(user):
    # add a user/start time to a hash

is_user_banned(user):
    # check hash
    # could check if expired now too, or do it separately if you care about it

is_user_bad(user):
    # check params or set more values in a hash
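In Python, that lazy-expiry idea is just a dict of expiry times checked whenever the user acts; a minimal sketch:

import time

banned_until = {}

def ban_user(user, duration=30.0):
    banned_until[user] = time.time() + duration

def is_user_banned(user):
    # Lazily expire the ban the next time this user is checked.
    expiry = banned_until.get(user)
    if expiry is None:
        return False
    if time.time() >= expiry:
        del banned_until[user]
        return False
    return True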
This is language agnostic, but consider a dedicated thread to keep track of the bans. The thread keeps a data structure with something like "username" and "banned_until" in a table. The thread is always running in the background checking the table; when banned_until has expired, it unbans the user. Other threads go on normally.
If you're using a GUI, most GUI modules have a timer function which can abstract away all the yucky multithreading stuff and execute code after a given time, while still allowing the rest of the code to run. For instance, Tkinter has the 'after' function.
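A small sketch with Tkinter (ban and unban are assumed to be the application's own functions):

import tkinter as tk

root = tk.Tk()

def ban_temporarily(user, penalty_ms=30000):
    ban(user)
    # Schedule unban(user) on the Tk event loop after penalty_ms milliseconds,
    # without blocking the GUI.
    root.after(penalty_ms, unban, user)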