I'm working in python and was originally using someone's code to thread and render a map with mapnik. I've since tried to put it into a flask API, with celery as a backend.
Originally:
https://gist.github.com/Thetoxicarcade/57777a6714cb6fecaacf
"add an api":
https://gist.github.com/Thetoxicarcade/079cf03a3f3a061134f2
(yes I will edit this to make it shorter and better)
In general:
flask -> search(params) -> runCelery(params) -> returnOkay
celery worker (gimme params) -> fork a bunch of threads, render that place
*I may just rewrite this into a billion celery tasks?
except everything in runCelery is in the dark.
the worker task itself:
"""background working map algorithm"""
# I have no clue with these celery declarations.
CELERY_EAGER_PROPAGATES_EXCEPTIONS = True
CELERY_ALWAYS_EAGER = True
worker = Celery('tasks', backend='rpc://', broker='redis://localhost')
#worker.task(name="Renderer")#bind=True
def async_render(key,minz,maxz,fake):
from celery.utils.log import get_task_logger
logs = logging.getLogger('brettapi')
file = logging.FileHandler('/home/aristatek/log{}'.format(key))
style = logging.Formatter('%(asctime)s %(levelname)s %(message)s')
file.setFormatter(style)
logs.addHandler(file)
logs.setLevel(logging.DEBUG)
logs.debug('HELLO WORLD')
logs.debug('I AM {}'.format(key))
#osm file
home = os.environ['HOME']
tiledir = home + "/mapnik/mapnik/osm.xml"
#grab minz/maxz
if minz is None: minz = 6
if maxz is None: maxz = 17
name=reverseKey(key)
bounds=makeBorder(key)
file = home + "/mapnik/mapnik/tiles/" + name + "/"
print key,minz,maxz,fake,file, tiledir, name
print bounds
# polygons are defined by lowercase, spaceless name strings. (or they were.)
if fake is not True:
bbox = (-180, -90, 180, 90)
render_tiles(bbox, tiledir, file, 0, 5, "World")
render_tiles(bounds, tiledir, file, minz, maxz, name)
return "Finished"
Apparently, any way that I try to get these celery instances to respond with logs or process information, they refuse to do so, which makes debugging them really tough.
I started a celery worker and was able to start celeryflower. I cannot seem to queue the task at all, and see nothing happening. :/
Part of this may be that I'm not importing functions, but even using pdb isn't helpful because of the mysticism of celery objects not obeying anything I throw at them.
It's a vague question because I hardly understand it anyway. The "read the docs" pages for celery are about as vague as possible. Do they mean properties or functions or variables within celery?? within the workers??
I'd like to know a way to get celery to respond, which would be meaningful because it means I'm going in the right direction.
Any help would be appreciated.
Edit -
Turns out from most of this the names declared need to be their own:
def things(key):
thread_lots(key)
worker = Celery()
#worker.task(name='ThisTask')
def ThisTask(key):
from testme import things
things(key)
webserve = Flask()
#webserve.route('/bla')
def bla(key):
ThisTask.apply_async((params), task_id=key)
Related
I need to implement a scheduled task in our Django app. DBader's schedule seems to be a good candidate for the job, however when run it as part of a Django project, it doesn't seem to produce the desired effect.
Specifically, this works fine as an independent program:
import schedule
import time
import logging
log = logging.getLogger(__name__)
def handleAnnotationsWithoutRequests(settings):
'''
From settings passed in, grab job-ids list
For each job-id in that list, perform annotation group/set logic [for details, refer to handleAnnotationsWithRequests(requests, username)
sans requests, those are obtained from db based on job-id ]
'''
print('Received settings: {}'.format(str(settings)))
def job():
print("I'm working...")
#schedule.every(3).seconds.do(job)
#schedule.every(2).seconds.do(handleAnnotationsWithoutRequests, settings={'a': 'b'})
invoc_time = "10:33"
schedule.every().day.at(invoc_time).do(handleAnnotationsWithoutRequests, settings={'a': 'b'})
while True:
schedule.run_pending()
time.sleep(1)
But this (equivalent) code run in Django context doesn't result in an invocation.
def handleAnnotationsWithoutRequests(settings):
'''
From settings passed in, grab job-ids list
For each job-id in that list, perform annotation group/set logic [for details, refer to handleAnnotationsWithRequests(requests, username)
sans requests, those are obtained from db based on job-id ]
'''
log.info('Received settings: {}'.format(str(settings)))
def doSchedule(settings):
'''
with scheduler library
Based on time specified in settings, invoke .handleAnnotationsWithoutRequests(settings)
'''
#settings will need to be reconstituted from the DB first
#settings = {}
invocationTime = settings['running_at']
import re
invocationTime = re.sub(r'([AaPp][Mm])', "", invocationTime)
log.info("Invocation time to be used: {}".format(invocationTime))
schedule.every().day.at(invocationTime).do(handleAnnotationsWithoutRequests, settings=settings)
while True:
schedule.run_pending()
time.sleep(1)
so the log from handleAnnotationsWithoutRequests() doesn't appear on the console.
Is this scheduling library compatible with Django? Are there any usage samples that one could refer me to?
I'm suspecting some thread issues are at work here. Perhaps there are better alternatives to be used? Suggestions are welcome.
Thank you in advance.
For web servers, you probably don't want something that runs in-process:
An in-process scheduler for periodic jobs [...]
https://github.com/Tivix/django-cron has proven a working solution.
There's also the heavyweight champion Celery and Celerybeat.
I do this a lot with Django Commands
The pattern I use is to setup a new Django command in my app and then make it a long-running process inside a never-ending while() loop.
I the loop iterates continuously with a custom defined sleep(1) timer.
The short version is here, with a bit of pseudo-code thrown in. You can see a working version of this pattern in my Django Reference Implementation.
class Command(BaseCommand):
help = 'My Long running job'
def handle(self, *args, **options):
self.stdout.write(self.style.SUCCESS(f'Starting long-running job.'))
while True:
if conditions met for job:
self.job()
sleep(5)
def job(self):
self.stdout.write(self.style.SUCCESS(f'Running the job...'))
This is the first time I'm using Celery, and honestly, I'm not sure I'm doing it right. My system has to run on Windows, so I'm using RabbitMQ as the broker.
As a proof of concept, I'm trying to create a single object where one task sets the value, another task reads the value, and I also want to show the current value of the object when I go to a certain url. However I'm having problems sharing the object between everything.
This is my celery.py
from __future__ import absolute_import, unicode_literals
import os
from celery import Celery
from django.conf import settings
os.environ.setdefault('DJANGO_SETTINGS_MODULE','cesGroundStation.settings')
app = Celery('cesGroundStation')
app.config_from_object('django.conf:settings')
app.autodiscover_tasks(lambda: settings.INSTALLED_APPS)
#app.task(bind = True)
def debug_task(self):
print('Request: {0!r}'.format(self.request))
The object I'm trying to share is:
class SchedulerQ():
item = 0
def setItem(self, item):
self.item = item
def getItem(self):
return self.item
This is my tasks.py
from celery import shared_task
from time import sleep
from scheduler.schedulerQueue import SchedulerQ
schedulerQ = SchedulerQ()
#shared_task()
def SchedulerThread():
print ("Starting Scheduler")
counter = 0
while(1):
counter += 1
if(counter > 100):
counter = 0
schedulerQ.setItem(counter)
print("In Scheduler thread - " + str(counter))
sleep(2)
print("Exiting Scheduler")
#shared_task()
def RotatorsThread():
print ("Starting Rotators")
while(1):
item = schedulerQ.getItem()
print("In Rotators thread - " + str(item))
sleep(2)
print("Exiting Rotators")
#shared_task()
def setSchedulerQ(schedulerQueue):
schedulerQ = schedulerQueue
#shared_task()
def getSchedulerQ():
return schedulerQ
I'm starting my tasks in my apps.py...I'm not sure if this is the right place as the tasks/workers don't seem to work until I start the workers in a separate console where I run the celery -A cesGroundStation -l info.
from django.apps import AppConfig
from scheduler.schedulerQueue import SchedulerQ
from scheduler.tasks import SchedulerThread, RotatorsThread, setSchedulerQ, getSchedulerQ
class SchedulerConfig(AppConfig):
name = 'scheduler'
def ready(self):
schedulerQ = SchedulerQ()
setSchedulerQ.delay(schedulerQ)
SchedulerThread.delay()
RotatorsThread.delay()
In my views.py I have this:
def schedulerQ():
queue = getSchedulerQ.delay()
return HttpResponse("Your list: " + queue)
The django app runs without errors, however my output from "celery -A cesGroundStation -l info" is this: Celery command output
First it seems to start multiple "SchedulerThread" tasks, secondly the "SchedulerQ" object isn't being passed to the Rotators, as it's not reading the updated value.
And if I go to the url for which shows the views.schedulerQ view I get this error:
Django views error
I have very, very little experience with Python, Django and Web Development in general, so I have no idea where to start with that last error. Solutions suggest using Redis to pass the object to the views, but I don't know how I'd do that using RabbitMQ. Later on the schedulerQ object will implement a queue and the scheduler and rotators will act as more of a producer/consumer dynamic with the view showing the contents of the queue, so I believe using the database might be too resource intensive. How can I share this object across all tasks, and is this even the right approach?
The right approach would be to use some persistence layer, such as a database or results back end to store the information you want to share between tasks if you need to share information between tasks (in this example, what you are currently putting in your class).
Celery operates on a distributed message passing paradigm - a good way to distill that idea for this example, is that your module will be executed independently every time a task is dispatched. Whenever a task is dispatched to Celery, you must assume it is running in a seperate interpreter and loaded independently of other tasks. That SchedulerQ class is instantiated anew each time.
You can share information between tasks in ways described in the docs linked previously and some best practice tips discuss data persistence concerns.
I have some python code in which an APScheduler job is not firing. As context, I also have a handler that is looking a directory for file modifications in addition using eventlet/GreenPool to do multi-threading. Based on some troubleshooting, it seems like there's some sort of conflict between APScheduler and eventlet.
My output looks as follows:
2016-12-26 02:30:30 UTC (+0000): Finished Download Pass
2016-12-26 02:46:07 UTC (+0000): EXITING due to control-C or other exit signal
Jobstore default:
Time-Activated Download (trigger: interval[0:05:00], next run at: 2016-12-25 18:35:00 PST)
2016-12-26 02:46:07 UTC (+0000): 1
(18:35 PST = 02:35 UTC)...so it should have fired 11 minutes before I pressed control-C
from apscheduler import events ## pip install apscheduler
from apscheduler.schedulers.background import BackgroundScheduler
# Threading
from eventlet import patcher, GreenPool ## pip install eventlet
patcher.monkey_patch(all = True)
def setSchedule(scheduler, cfg, minutes = 60*2, hours = 0):
"""Set up the schedule of how frequently a download should be attempted.
scheduler object must already be declared.
will accept either minutes or hours for the period between downloads"""
if hours > 0:
minutes = 60*hours if minutes == 60 else 60*hours+minutes
handle = scheduler.add_job(processAllQueues,
trigger='interval',
kwargs={'cfg': cfg},
id='RQmain',
name='Time-Activated Download',
coalesce=True,
max_instances=1,
minutes=minutes,
start_date=dt.datetime.strptime('2016-10-10 00:15:00', '%Y-%m-%d %H:%M:%S') # computer's local time
)
return handle
def processAllQueues(cfg):
SQSpool = GreenPool(size=int(cfg.get('GLOBAL','Max_AWS_Connections')))
FHpool = GreenPool(size=int(cfg.get('GLOBAL','Max_Raw_File_Process')))
arSects = []
dGlobal = dict(cfg.items('GLOBAL'))
for sect in filter(lambda x: iz.notEqualz(x,'GLOBAL','RUNTIME'),cfg.sections()):
dSect = dict(cfg.items(sect)) # changes all key names to lowercase
n = dSect['sqs_queue_name']
nn = dSect['node_name']
fnbase = "{}_{}".format(nn,n)
dSect["no_ext_file_name"] = os.path.normpath(os.path.join(cfg.get('RUNTIME','Data_Directory'),fnbase))
arSects.append(mergeTwoDicts(dGlobal,dSect)) # section overrides global
arRes = []
for (que_data,spec_section) in SQSpool.imap(doQueueDownload,arSects):
if que_data: fileResult = FHpool.spawn(outputQueueToFiles,spec_section,que_data).wait()
else: fileResult = (False,spec_section['sqs_queue_name'])
arRes.append(fileResult)
SQSpool.waitall()
FHpool.waitall()
pr.ts_print("Finished Download Pass")
return None
def main():
cfgglob = readConfigs(cfgdir, datdir)
sched = BackgroundScheduler()
cron_job = setSchedule(sched, cfgglob, 5)
sched.start(paused=True)
try:
change_handle = win32file.FindFirstChangeNotification(cfgdir, 0, win32con.FILE_NOTIFY_CHANGE_FILE_NAME | win32con.FILE_NOTIFY_CHANGE_LAST_WRITE)
processAllQueues(cfgglob)
sched.resume() # turn the scheduler back on and monitor both wallclock and config directory.
cron_job.resume()
while 1:
SkipDownload = False
result = win32event.WaitForSingleObject(change_handle, 500)
if result == win32con.WAIT_OBJECT_0: # If the WaitForSO returned because of a notification rather than error/timing out
sched.pause() # make sure we don't run the job as a result of timestamp AND file modification
while 1:
try:
win32file.FindNextChangeNotification(change_handle) # rearm - done at start because of the loop structure here
cfgglob = None
cfgglob = readConfigs(cfgdir,datdir)
cron_job.modify(kwargs={'cfg': cfgglob}) # job_id="RQmain",
change_handle = win32file.FindFirstChangeNotification(cfgdir, 0, win32con.FILE_NOTIFY_CHANGE_FILE_NAME | win32con.FILE_NOTIFY_CHANGE_LAST_WRITE) # refresh handle
if not SkipDownload: processAllQueues(cfgglob)
sched.resume()
cron_job.resume()
break
except KeyboardInterrupt:
if VERBOSE | DEBUG: pr.ts_print("EXITING due to control-C or other exit signal")
finally:
sched.print_jobs()
pr.ts_print(sched.state)
sched.shutdown(wait=False)
If I comment out most of the processAllQueues function along with the eventlet includes at top, it fires appropriately. If I keep the
from eventlet import patcher, GreenPool ## pip install eventlet
patcher.monkey_patch(all = True)
but comment out processAllQueues up to the print line in the second-to-last line, it fails to fire the APScheduler, indicating that there's either a problem with importing patcher and GreenPool or with the monkey_patch statement. Commenting out the patcher.monkey_patch(all = True) makes it "work" again.
Does anyone know what an alternate monkey_patch statement would be that would work in my circumstances?
You have an explicit event loop watching for file changes. That blocks eventlet event loop from running. You have two options:
Wrap blocking calls (such as win32event.WaitForSingleObject()) in eventlet.tpool.execute()
Run eventlet.sleep() before/after blocking calls and make sure you don't block for too long.
eventlet.monkey_patch(thread=False) is shorter alternative to listing every other module as true. Generally you want thread=True when using locks or thread-local storage or threading API to spawn green threads. You may want thread=False if you genuinely use OS threads, like for funny GUI frameworks.
You shouldn't really consider Eventlet on Windows for running important projects. Performance is much inferior against POSIX. I didn't run tests on Windows since 0.17. It's rather for ease of development on popular desktop platform.
I am working in an application where i am doing a huge data processing to generate a completely new set of data which is then finally saved to database. The application is taking a huge time in processing and saving the data to data base. I want to improve the user experience to some extent by redirecting user to result page first and then doing the data saving part in background(may be in the asynchronous way) . My problem is that for displaying the result page i need to have the new set of processed data. Is there any way that i can do so that the data processing and data saving part is done in background and whenever the data processing part is completed(before saving to database) i would get the processed data in result page?.
Asynchronous tasks can be accomplished in Python using Celery. You can simply push the task to Celery queue and the task will be performed in an asynchronous way. You can then do some polling from the result page to check if it is completed.
Other alternative can be something like Tornado.
Another strategy is to writing a threading class that starts up custom management commands you author to behave as worker threads. This is perhaps a little lighter weight than working with something like celery, and of course has both advantages and disadvantages. I also used this technique to sequence/automate migration generation/application during application startup (because it lives in a pipeline). My gunicorn startup script then starts these threads in pre_exec() or when_ready(), etc, as appropriate, and then stops them in on_exit().
# Description: Asychronous Worker Threading via Django Management Commands
# Lets you run an arbitrary Django management command, either a pre-baked one like migrate,
# or a custom one that you've created, as a worker thread, that can spin forever, or not.
# You can use this to take care of maintenance tasks at start-time, like db migration,
# db flushing, etc, or to run long-running asynchronous tasks.
# I sometimes find this to be a more useful pattern than using something like django-celery,
# as I can debug/use the commands I write from the shell as well, for administrative purposes.
import json
import os
import requests
import sys
import time
import uuid
import logging
import threading
import inspect
import ctypes
from django.core.management import call_command
from django.conf import settings
class DjangoWorkerThread(threading.Thread):
"""
Initializes a seperate thread for running an arbitrary Django management command. This is
one (simple) way to make asynchronous worker threads. There exist richer, more complex
ways of doing this in Django as well (django-cerlery).
The advantage of this pattern is that you can run the worker from the command line as well,
via manage.py, for the sake of rapid development, easy testing, debugging, management, etc.
:param commandname: name of a properly created Django management command, which exists
inside the app/management/commands folder in one of the apps in your project.
:param arguments: string containing command line arguments formatted like you would
when calling the management command via manage.py in a shell
:param restartwait: integer seconds to wait before restarting worker if it dies,
or if a once-through command, acts as a thread-loop delay timer
"""
def __init__(self, commandname,arguments="",restartwait=10,logger=""):
super(DjangoWorkerThread, self).__init__()
self.commandname = commandname
self.arguments = arguments
self.restartwait = restartwait
self.name = commandname
self.event = threading.Event()
if logger:
self.l = logger
else:
self.l = logging.getLogger('root')
def run(self):
"""
Start the thread.
"""
try:
exceptioncount = 0
exceptionlimit = 10
while not self.event.is_set():
try:
if self.arguments:
self.l.info('Starting ' + self.name + ' worker thread with arguments ' + self.arguments)
call_command(self.commandname,self.arguments)
else:
self.l.info('Starting ' + self.name + ' worker thread with no arguments')
call_command(self.commandname)
self.event.wait(self.restartwait)
except Exception as e:
self.l.error(self.commandname + ' Unkown error: {}'.format(str(e)))
exceptioncount += 1
if exceptioncount > exceptionlimit:
self.l.error(self.commandname + " : " + self.arguments + " : Exceeded exception retry limit, aborting.")
self.event.set()
finally:
self.l.info('Stopping command: ' + self.commandname + " " + self.arguments)
def stop(self):
"""Nice Stop
Stop nicely by setting an event.
"""
self.l.info("Sending stop event to self...")
self.event.set()
#then make sure it's dead...and schwack it harder if not.
#kill it with fire! be mean to your software. it will make you write better code.
self.l.info("Sent stop event, checking to see if thread died.")
if self.isAlive():
self.l.info("Still not dead, telling self to murder self...")
time.sleep( 0.1 )
os._exit(1)
def start_worker(command_name, command_arguments="", restart_wait=10,logger=""):
"""
Starts a background worker thread running a Django management command.
:param str command_name: the name of the Django management command to run,
typically would be a custom command implemented in yourapp/management/commands,
but could also be used to automate standard Django management tasks
:param str command_arguments: a string containing the command line arguments
to supply to the management command, formatted as if one were invoking
the command from a shell
"""
if logger:
l = logger
else:
l = logging.getLogger('root')
# Start the thread
l.info("Starting worker: "+ command_name + " : " + command_arguments + " : " + str(restart_wait) )
worker = DjangoWorkerThread(command_name,command_arguments, restart_wait,l)
worker.start()
l.info("Worker started: "+ command_name + " : " + command_arguments + " : " + str(restart_wait) )
# Return the thread instance
return worker
#<----------------------------------------------------------------------------->
def stop_worker(worker,logger=""):
"""
Gracefully shutsdown the worker thread
:param threading.Thread worker: the worker thread object
"""
if logger:
l = logger
else:
l = logging.getLogger('root')
# Shutdown the thread
l.info("Stopping worker: "+ worker.commandname + " : " + worker.arguments + " : " + str(worker.restartwait) )
worker.stop()
worker.join(worker.restartwait)
l.info("Worker stopped: "+ worker.commandname + " : " + worker.arguments + " : " + str(worker.restartwait) )
The long running task can be offloaded with Celery. You can still get all the updates and results. Your web application code should take care of polling for updates and results. http://blog.miguelgrinberg.com/post/using-celery-with-flask
explains how one can achieve this.
Some useful steps:
Configure celery with result back-end.
Execute the long running task asynchronously.
Let the task update its state periodically or when it executes some stage in job.
Poll from web application to get the status/result.
Display the results on UI.
There is a need for bootstrapping it all together, but once done it can be reused and it is fairly performant.
It's the same process that a synchronous request. You will use a View that should return a JsonResponse. The 'tricky' part is on the client side, where you have to make the async call to the view.
I'm wondering how to setup a more specific logging system. All my tasks use
logger = logging.getLogger(__name__)
as a module-wide logger.
I want celery to log to "celeryd.log" and my tasks to "tasks.log" but I got no idea how to get this working. Using CELERYD_LOG_FILE from django-celery I can route all celeryd related log messages to celeryd.log but there is no trace of the log messages created in my tasks.
Note: This answer is outdated as of Celery 3.0, where you now use get_task_logger() to get your per-task logger set up. Please see the Logging section of the What's new in Celery 3.0 document for details.
Celery has dedicated support for logging, per task. See the Task documentation on the subject:
You can use the workers logger to add diagnostic output to the worker log:
#celery.task()
def add(x, y):
logger = add.get_logger()
logger.info("Adding %s + %s" % (x, y))
return x + y
There are several logging levels available, and the workers loglevel setting decides
whether or not they will be written to the log file.
Of course, you can also simply use print as anything written to standard out/-err will be
written to the log file as well.
Under the hood this is all still the standard python logging module. You can set the CELERYD_HIJACK_ROOT_LOGGER option to False to allow your own logging setup to work, otherwise Celery will configure the handling for you.
However, for tasks, the .get_logger() call does allow you to set up a separate log file per individual task. Simply pass in a logfile argument and it'll route log messages to that separate file:
#celery.task()
def add(x, y):
logger = add.get_logger(logfile='tasks.log')
logger.info("Adding %s + %s" % (x, y))
return x + y
Last but not least, you can just configure your top-level package in the python logging module and give it a file handler of it's own. I'd set this up using the celery.signals.after_setup_task_logger signal; here I assume all your modules live in a package called foo.tasks (as in foo.tasks.email and foo.tasks.scaling):
from celery.signals import after_setup_task_logger
import logging
def foo_tasks_setup_logging(**kw):
logger = logging.getLogger('foo.tasks')
if not logger.handlers:
handler = logging.FileHandler('tasks.log')
formatter = logging.Formatter(logging.BASIC_FORMAT) # you may want to customize this.
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.propagate = False
after_setup_task_logger.connect(foo_tasks_setup_logging)
Now any logger whose name starts with foo.tasks will have all it's messages sent to tasks.log instead of to the root logger (which doesn't see any of these messages because .propagate is False).
Just a hint: Celery has its own logging handler:
from celery.utils.log import get_task_logger
logger = get_task_logger(__name__)
Also, Celery logs all output from the task. More details at Celery docs for Task Logging
join
--concurrency=1 --loglevel=INFO
with the command to run celery worker
eg: python xxxx.py celery worker --concurrency=1 --loglevel=INFO
Better to set loglevel inside each python files too