I have a Django rest framework app that calls 2 huey tasks in succession in a serializer create method like so:
...
def create(self, validated_data):
    user = self.context['request'].user
    player_ids = validated_data.get('players', [])
    game = Game.objects.create()
    tasks.make_players_friends_task(player_ids)
    tasks.send_notification_task(user.id, game.id)
    return game
# tasks.py
@db_task()
def make_players_friends_task(ids):
    players = User.objects.filter(id__in=ids)
    # process players

@db_task()
def send_notification_task(user_id, game_id):
    user = User.objects.get(id=user_id)
    game = Game.objects.get(id=game_id)
    # send notifications
When running the huey process in the terminal, when I hit this endpoint, I can see that only one or the other of the tasks is ever called, but never both. I am running huey with the default settings (redis with 1 thread worker.)
If I alter the code so that I am passing in the objects themselves as parameters, rather than the ids, and remove the Django queries in the @db_task methods, things seem to work alright.
The reason I initially used the ids as parameters is because I assumed (or read somewhere) that huey uses json serialization as default, but after looking into it, pickle is actually the default serializer.
One theory is that since I am only running one worker, and also have a @db_periodic_task method in the app, the process can only handle listening for tasks or executing them at any time, but not both. This is the way celery seems to work, where you need a separate process for a scheduler and a worker each, but this isn't mentioned in huey's documentation.
If you run the huey consumer it will actually spawn a separate scheduler together with the amount of workers you've specified, so that's not going to be your problem.
You haven't given enough information to see exactly what is going wrong, so check the following:
If you run the huey consumer in the terminal, observe whether all your tasks show up as properly registered so that the consumer is actually capable of consuming them.
Check whether your redis process is running.
Try performing the tasks with a blocking call to see on which tasks it fails:
task_result = tasks.make_players_friends_task(player_ids)
task_result.get(blocking=True)
task_result = tasks.send_notification_task(user.id, game.id)
task_result.get(blocking=True)
Do this with a debugger or print statements to see whether it makes it to the end of your function or where it gets stuck.
Make sure to always restart your consumer when you change code. It doesn't automatically pick up new code like the Django dev server does. The fact that your code works as intended when you pickle whole objects instead of passing IDs could point to this, because it would be really weird for that change alone to fix it. On the other hand, you shouldn't pass in Django ORM objects. It makes way more sense to use your ID approach.
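To illustrate why passing IDs tends to be the safer choice, here is a minimal sketch that uses only the standard library. The `DATABASE` dict and the `enqueue_*` / `run_*` helpers are hypothetical stand-ins for the ORM and the task queue, not huey's API. A pickled object is a snapshot taken at enqueue time, while an ID is resolved when the worker actually runs.

```python
import pickle

# Hypothetical stand-in for a database table, keyed by primary key.
DATABASE = {1: {"id": 1, "name": "alice"}}


def enqueue_with_object(player):
    # The whole object is serialized, freezing its state at enqueue time.
    return pickle.dumps(player)


def enqueue_with_id(player_id):
    # Only the primary key travels through the queue.
    return pickle.dumps(player_id)


def run_with_object(message):
    player = pickle.loads(message)
    return player["name"]


def run_with_id(message):
    player_id = pickle.loads(message)
    # The worker looks the row up again, so it sees the current state.
    return DATABASE[player_id]["name"]


object_message = enqueue_with_object(DATABASE[1])
id_message = enqueue_with_id(1)

# The row changes after the task was queued but before the worker runs it.
DATABASE[1]["name"] = "alice_renamed"

print(run_with_object(object_message))  # alice
print(run_with_id(id_message))          # alice_renamed
```

The task that received the object works with stale data, whereas the task that received the ID picks up the rename.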
I have a basic Django project that I use as a front-end interface for a (Condor) computing cluster for generating simulations. From the Django app the users can start simulations (in Condor). The simulation-related metadata and the simulation state are kept in a DB.
I need to add a new feature: notification when (some) simulations are done.
Since I want a simple solution (and I am already using background tasks), I was thinking of using a repeating task that queries Condor about the tasks at fixed intervals, updates the DB and, if necessary, sends notifications.
So if I want to update the statuses every 10 minutes, I will have something like:
@background(schedule=1)
def check_simulations(repeat=600):
    # lookup simulation statuses
    simulation_list = get_Simulations()
    for sim in simulation_list:
        if sim.status == Simulation.DONE:
            user.email_user('Simulation Complete', 'You have been notified')

def initialize():
    check_simulations()
However this task (or better say the initialize() method) must be started (called once) to create and schedule the check_simulations() task (which will practically serialize the call and save it in the DB); after that the background-tasks thread will read it and execute and also reschedule it (if there is error)
My questions:
where should I put the call to the initialize() method to only be run once ?
One such place could be for instance the urls.py but this is an extremely ugly solution. Is there a better way ?
how to ensure that a server restart will not create and schedule a new task (if one already exist)
This may happen if a task is already scheduled (so a serialized task is in the background-tasks table) and the webserver is restarted so the initialize() method is called again so a new task is created and scheduled ...
I had a similar problem and solved it this way.
I initialize my task in urls.py (I don't know whether it can go anywhere else), and I added an if to check whether the task is already in the database:
from background_task.models import Task
if not Task.objects.filter(verbose_name="update_orders").exists():
    tasks.update_orders(repeat=300, verbose_name="update_orders")
I have tested it and it works fine. You can also look up the task by other fields, such as name or hash.
you can check the task model here: https://github.com/arteria/django-background-tasks/blob/master/background_task/models.py
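The check-before-schedule guard can be tried out without Django. The following is only a sketch: an in-memory SQLite table stands in for the background-tasks table, and `make_task_table` and `schedule_once` are hypothetical helper names, not part of django-background-tasks.

```python
import sqlite3


def make_task_table():
    # In-memory table standing in for the real task table.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task ("
        "id INTEGER PRIMARY KEY, "
        "verbose_name TEXT NOT NULL, "
        "repeat INTEGER NOT NULL)"
    )
    return conn


def schedule_once(conn, verbose_name, repeat):
    """Insert the task only if no row with this verbose_name exists yet."""
    exists = conn.execute(
        "SELECT 1 FROM task WHERE verbose_name = ? LIMIT 1", (verbose_name,)
    ).fetchone()
    if exists:
        return False  # already scheduled, e.g. before a server restart
    conn.execute(
        "INSERT INTO task (verbose_name, repeat) VALUES (?, ?)",
        (verbose_name, repeat),
    )
    conn.commit()
    return True


conn = make_task_table()
print(schedule_once(conn, "update_orders", 300))  # True
print(schedule_once(conn, "update_orders", 300))  # False
```

Note that a check followed by an insert is not atomic, so two server processes starting at exactly the same moment could still both schedule the task.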
I have to do some long work in my Flask app. And I want to do it async. Just start working, and then check status from javascript.
I'm trying to do something like:
@app.route('/sync')
def sync():
    p = Process(target=routine, args=('abc',))
    p.start()
    return "Working..."
But this creates defunct gunicorn workers.
How can it be solved? Should I use something like Celery?
There are many options. You can develop your own solution, use Celery or Twisted (I'm sure there are more already-made options out there but those are the most common ones).
Developing your in-house solution isn't difficult. You can use the multiprocessing module of the Python standard library:
When a task arrives you insert a row in your database with the task id and status.
Then launch a process to perform the work which updates the row status at finish.
You can have a view to check if the task is finished, which actually just checks the status in the corresponding row.
Of course you have to think where you want to store the result of the computation and what happens with errors.
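The steps above can be sketched roughly as follows. This is an illustration under assumptions, not a production setup: the database is a temporary SQLite file, `routine` is a placeholder for the real work, every helper name is made up for the example, and the "fork" start method is requested explicitly, so it only runs on Unix-like systems.

```python
import multiprocessing
import os
import sqlite3
import tempfile
import uuid

# Temporary SQLite file standing in for the application's database.
DB_PATH = os.path.join(tempfile.mkdtemp(), "jobs.sqlite3")


def _execute(sql, params=()):
    # Open a fresh connection per call so parent and child never share one.
    conn = sqlite3.connect(DB_PATH, timeout=30)
    try:
        with conn:  # commits on success, rolls back on error
            return conn.execute(sql, params).fetchall()
    finally:
        conn.close()


def init_db():
    _execute(
        "CREATE TABLE IF NOT EXISTS jobs (id TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )


def routine(payload):
    # Placeholder for the long-running work.
    return payload.upper()


def _worker(job_id, payload):
    # Runs in the child process and records the outcome in the job's row.
    try:
        routine(payload)
        _execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
    except Exception:
        _execute("UPDATE jobs SET status = 'failed' WHERE id = ?", (job_id,))


def start_job(payload):
    # Step 1: insert a row with the task id and status.
    job_id = uuid.uuid4().hex
    _execute("INSERT INTO jobs (id, status) VALUES (?, 'running')", (job_id,))
    # Step 2: launch a process that updates the row when it finishes.
    ctx = multiprocessing.get_context("fork")  # assumption: Unix-like OS
    proc = ctx.Process(target=_worker, args=(job_id, payload))
    proc.start()
    return job_id, proc


def job_status(job_id):
    # Step 3: what a status view would look up.
    rows = _execute("SELECT status FROM jobs WHERE id = ?", (job_id,))
    return rows[0][0] if rows else None


init_db()
job_id, proc = start_job("abc")
proc.join()  # waiting on the child also reaps it once it has exited
print(job_status(job_id))  # done
```

A child process that has exited but has not been waited on shows up as defunct, so a real application would also need some way to reap finished children instead of joining right away as this demo does.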
Going with Celery is also easy. It would look like the following.
To define a function to be executed asynchronously:
@celery.task
def mytask(data):
    ...  # do a lot of work
Then instead of calling the task directly, like mytask(data), which would execute it straight away, use the delay method:
result = mytask.delay(mydata)
Finally, you can check if the result is available or not with ready:
result.ready()
However, remember that to use Celery you have to run an external worker process.
I have never looked at Twisted, so I cannot tell you whether it is more or less complex than this (but it should be fine for what you want to do too).
In any case, any of those solutions should work fine with Flask. To check the result it doesn't matter at all if you use Javascript. Just make the view that checks the status return JSON (you can use Flask's jsonify).
I would use a message broker such as RabbitMQ or ActiveMQ. The Flask process would add jobs to the message queue, and a long-running worker process (or pool of worker processes) would take jobs off the queue to complete them. The worker process could update a database to allow the Flask server to know the current status of the job and pass this information to the clients.
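The shape of that design can be shown in a single process. In this sketch `queue.Queue` stands in for the broker, a thread stands in for the worker process, and a dict stands in for the status table, so the names `submit`, `worker` and `job_status` are purely illustrative. A real setup would use an actual broker and a separate process.

```python
import queue
import threading

job_queue = queue.Queue()      # stand-in for the message broker
job_status = {}                # stand-in for a status table in a database
status_lock = threading.Lock()


def set_status(job_id, status):
    with status_lock:
        job_status[job_id] = status


def submit(job_id, func):
    # What the web process would do: record the job and enqueue it.
    set_status(job_id, "queued")
    job_queue.put((job_id, func))


def worker():
    # What the long-running worker would do: take jobs off the queue.
    while True:
        job = job_queue.get()
        try:
            if job is None:  # sentinel telling the worker to stop
                break
            job_id, func = job
            set_status(job_id, "running")
            try:
                func()
                set_status(job_id, "done")
            except Exception:
                set_status(job_id, "failed")
        finally:
            job_queue.task_done()


worker_thread = threading.Thread(target=worker, daemon=True)
worker_thread.start()

submit("job-1", lambda: sum(range(1000)))
submit("job-2", lambda: 1 / 0)  # raises, so it ends up as "failed"

job_queue.join()     # wait until both jobs have been handled
job_queue.put(None)  # ask the worker to exit
worker_thread.join()

print(job_status)  # {'job-1': 'done', 'job-2': 'failed'}
```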
Using celery seems to be a nice way to do this.
In a Django Python app, I launch jobs with Celery (a task manager). When a job is launched, it returns an object (let's call it an instance of class X) that lets you check on the job and retrieve the return value or errors thrown.
Several people (someday, I hope) will be able to use this web interface at the same time; therefore, several instances of class X may exist at the same time, each corresponding to a job that is queued or running in parallel. It's difficult to come up with a way to hold onto these X objects because I cannot use a global variable (a dictionary that allows me to look up each X objects from a key); this is because Celery uses different processes, not just different threads, so each would modify its own copy of the global table, causing mayhem.
Subsequently, I received the great advice to use memcached to share the memory across the tasks. I got it working and was able to set and get integer and string values between processes.
The trouble is this: after a great deal of debugging today, I learned that memcached's set and get don't seem to work for classes. This is my best guess: Perhaps under the hood memcached serializes objects to the shared memory; class X (understandably) cannot be serialized because it points at live data (the status of the job), and so the serial version may be out of date (i.e. it may point to the wrong place) when it is loaded again.
Attempts to use a SQLite database were similarly fruitless; not only could I not figure out how to serialize objects as database fields (using my Django models.py file), I would be stuck with the same problem: the handles of the launched jobs need to stay in RAM somehow (or use some fancy OS tricks underneath), so that they update as the jobs finish or fail.
My best guess is that (despite the advice that thankfully got me this far) I should be launching each job in some external queue (for instance Sun/Oracle Grid Engine). However, I couldn't come up with a good way of doing that without using a system call, which I thought may be bad style (and potentially insecure).
How do you keep track of jobs that you launch in Django or Django Celery? Do you launch them by simply putting the job arguments into a database and then have another job that polls the database and runs jobs?
Thanks a lot for your help, I'm quite lost.
I think django-celery does this work for you. Have you had a look at the tables made by django-celery? For example, djcelery_taskstate holds all the data for a given task, such as state, worker_id and so on. For periodic tasks there is a table called djcelery_periodictask.
In a Django view you can access the TaskMeta object:
from djcelery.models import TaskMeta
task = TaskMeta.objects.get(task_id=task_id)
print(task.status)
I am creating a Django application that does various long computations with uploaded files. I don't want to make the user wait for the file to be handled - I just want to show the user a page reading something like 'file is being parsed'.
How can I make an asynchronous function call from a view?
Something that may look like that:
def view(request):
    ...
    if form.is_valid():
        form.save()
        async_call(handle_file)
        return render_to_response(...)
Rather than trying to manage this via subprocesses or threads, I recommend you separate it out completely. There are two approaches: the first is to set a flag in a database table somewhere, and have a cron job running regularly that checks the flag and performs the required operation.
The second option is to use a message queue. Your file upload process sends a message on the queue, and a separate listener receives the message and does what's needed. I've used RabbitMQ for this sort of thing, but others are available.
Either way, your user doesn't have to wait for the process to finish, and you don't have to worry about managing subprocesses.
I tried to do the same and failed after multiple attempts, because of the nature of Django and asynchronous calls.
The solution I came up with, which could be a bit over the top for you, is to have another asynchronous server in the background processing message queues from the web request and sending back chunked JavaScript, which the browser parses directly in an asynchronous way (i.e. Ajax).
Everything is made transparent to the end user via mod_proxy settings.
Unless you specifically need to use a separate process, which seems to be the gist of the other questions S.Lott is indicating as duplicate of yours, the threading module from the Python standard library (documented here) may offer the simplest solution. Just make sure that handle_file is not accessing any globals that might get modified, nor especially modifying any globals itself; ideally it should communicate with the rest of your process only through Queue instances; etc, etc, all the usual recommendations about threading;-).
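A minimal sketch of that threading approach, assuming `handle_file` is the long-running function from the question and that it reports back only through a `queue.Queue`. The `start_background` helper and the result tuple are made up for illustration.

```python
import queue
import threading


def handle_file(path, results):
    # Placeholder for the long computation. It touches no globals and
    # reports back only through the queue it was given.
    results.put((path, "parsed"))


def start_background(path, results):
    # What the view would call before returning its response.
    thread = threading.Thread(target=handle_file, args=(path, results))
    thread.start()
    return thread


results = queue.Queue()
thread = start_background("upload.csv", results)
thread.join()  # a real view would return immediately instead of joining
print(results.get_nowait())  # ('upload.csv', 'parsed')
```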
threading will break runserver, if I'm not mistaken. I've had good luck with multiprocessing in request handlers with mod_wsgi and runserver. Maybe someone can enlighten me as to why this is bad:
def _bulk_action(action, objs):
    # mean ponies here
    ...

def bulk_action(request, t):
    ...
    objs = model.objects.filter(pk__in=pks)
    if request.method == 'POST':
        objs.update(is_processing=True)
        from multiprocessing import Process
        p = Process(target=_bulk_action, args=(action, objs))
        p.start()
        return HttpResponseRedirect(next_url)
    context = {'t': t, 'action': action, 'objs': objs, 'model': model}
    return render_to_response(...)
http://docs.python.org/library/multiprocessing.html
New in 2.6