Tornado Subprocess confusion - python

I am trying to build a Tornado web server which takes requests from multiple clients. The request consists of:
a. For a given directory name passed through a URL, zip the files, etc., and FTP them out.
b. Providing a status of sorts if the task is completed.
So, rather than making it a synchronous and linear process, I wanted to break it down into multiple subtasks. The client will submit the URL request and then simply receive a response along the lines of 'job submitted'. A bit later, the client can come back and ask for the status of this job. During this time the job obviously has to finish its task.
I am confused about which modules to use: Tornado's Subprocess, the Popen constructor, subprocess.call, etc. I've read the Python docs but can't find anything that covers a task which runs for a long time while Tornado is not supposed to wait for it to finish. So, I need a mechanism to start a job, let it run its course while releasing the client, and then provide a status on it when the client asks.
Any help is appreciated. Thanks.

Python programmers widely use Celery, which runs a set of worker processes to manage a queue of tasks. Set up Celery with RabbitMQ and write a Celery worker (perhaps with Celery Canvas) that does the work you need: zips a directory, FTPs it somewhere, etc.
The Tornado-Celery integration package provides something that appears close to what you need to integrate your Tornado application with Celery.
This is all a lot of moving parts to install and configure at first, of course, but it will prepare you for a maintainable application architecture.
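The submit-then-poll flow from the question can also be tried without Celery. What follows is a minimal sketch, assuming jobs can live in an in-memory registry inside a single server process. The names submit_job, job_status, zip_directory and the jobs dict are hypothetical, and the FTP step is left out:

```python
import tempfile
import uuid
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# One shared pool; request handlers only submit to it and never wait on it.
executor = ThreadPoolExecutor(max_workers=2)
jobs = {}  # job id -> Future (hypothetical in-memory registry)


def zip_directory(directory):
    """Zip every file under `directory` and return the archive path."""
    src = Path(directory)
    archive = Path(tempfile.mkdtemp()) / (src.name + ".zip")
    with zipfile.ZipFile(str(archive), "w", zipfile.ZIP_DEFLATED) as zf:
        for path in src.rglob("*"):
            if path.is_file():
                zf.write(str(path), str(path.relative_to(src)))
    # An FTP upload (for example with ftplib) would follow here.
    return str(archive)


def submit_job(directory):
    """Start the work in the background and return a job id immediately."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = executor.submit(zip_directory, directory)
    return job_id


def job_status(job_id):
    """Report the state of a job without blocking."""
    future = jobs.get(job_id)
    if future is None:
        return "unknown"
    if not future.done():
        return "running"
    return "failed" if future.exception() else "finished"
```

A Tornado handler would call submit_job and answer 'job submitted' right away; a second handler would return job_status(job_id). The registry is lost when the server restarts, which is one reason the Celery setup is the more durable option.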

Related

Spawn Asynchronous Python Process From Flask [duplicate]

I have to do some long work in my Flask app. And I want to do it async. Just start working, and then check status from javascript.
I'm trying to do something like:
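    from multiprocessing import Process

    @app.route('/sync')
    def sync():
        # Start the long routine in a separate process and return immediately.
        p = Process(target=routine, args=('abc',))
        p.start()
        return "Working..."
But this creates defunct gunicorn workers.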
How can it be solved? Should I use something like Celery?
There are many options. You can develop your own solution, use Celery or Twisted (I'm sure there are more already-made options out there but those are the most common ones).
Developing your in-house solution isn't difficult. You can use the multiprocessing module of the Python standard library:
When a task arrives you insert a row in your database with the task id and status.
Then launch a process to perform the work which updates the row status at finish.
You can have a view to check if the task is finished, which actually just checks the status in the corresponding row.
Of course you have to think where you want to store the result of the computation and what happens with errors.
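The steps above could look roughly like the following. This is a sketch only, assuming SQLite as the database and a trivial placeholder for the real work; the table layout and the function names (init_db, create_task, run_task, launch, get_status) are made up for illustration:

```python
import sqlite3
from contextlib import contextmanager
from multiprocessing import Process


@contextmanager
def db(db_path):
    """Open a connection, commit on success, and always close it."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            yield conn
    finally:
        conn.close()


def init_db(db_path):
    with db(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS tasks (id INTEGER PRIMARY KEY, status TEXT)"
        )


def create_task(db_path):
    """Insert a row for the new task and return its id."""
    with db(db_path) as conn:
        return conn.execute(
            "INSERT INTO tasks (status) VALUES ('running')"
        ).lastrowid


def run_task(db_path, task_id):
    """Do the work, then record the outcome in the task's row."""
    try:
        sum(range(1000))  # placeholder for the real long-running work
        status = "finished"
    except Exception:
        status = "failed"
    with db(db_path) as conn:
        conn.execute("UPDATE tasks SET status = ? WHERE id = ?", (status, task_id))


def launch(db_path):
    """Create the task row, start a worker process, return the id at once."""
    task_id = create_task(db_path)
    Process(target=run_task, args=(db_path, task_id)).start()
    return task_id


def get_status(db_path, task_id):
    """What the status view would return for a task id."""
    with db(db_path) as conn:
        row = conn.execute(
            "SELECT status FROM tasks WHERE id = ?", (task_id,)
        ).fetchone()
    return row[0] if row else "unknown"
```

Finished child processes stay around as defunct entries until the parent reaps them. Calling multiprocessing.active_children() from time to time has the side effect of joining children that have already exited.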
Going with Celery is also easy. It would look like the following.
To define a function to be executed asynchronously:
    @celery.task
    def mytask(data):
        ...  # do a lot of work
Then instead of calling the task directly, like mytask(data), which would execute it straight away, use the delay method:
result = mytask.delay(mydata)
Finally, you can check if the result is available or not with ready:
result.ready()
However, remember that to use Celery you have to run an external worker process.
I have never taken a look at Twisted, so I cannot tell you if it is more or less complex than this (but it should be fine for what you want to do too).
In any case, any of those solutions should work fine with Flask. To check the result it doesn't matter at all if you use Javascript. Just make the view that checks the status return JSON (you can use Flask's jsonify).
I would use a message broker such as RabbitMQ or ActiveMQ. The Flask process would add jobs to the message queue and a long-running worker process (or a pool of worker processes) would take jobs off the queue to complete them. The worker process could update a database so that the Flask server knows the current status of each job and can pass this information to the clients.
Using Celery seems to be a nice way to do this.
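The shape of that design can be tried in a single process before installing a broker. In this sketch queue.Queue stands in for the RabbitMQ/ActiveMQ queue and a plain dict stands in for the status table. Both are assumptions made for illustration, as are the names enqueue, worker and set_status:

```python
import queue
import threading

job_queue = queue.Queue()  # stand-in for the broker queue
status = {}                # stand-in for the status table
status_lock = threading.Lock()


def set_status(job_id, value):
    with status_lock:
        status[job_id] = value


def enqueue(job_id, payload):
    """Web side: record the job and put it on the queue."""
    set_status(job_id, "queued")
    job_queue.put((job_id, payload))


def worker():
    """Worker side: take jobs off the queue until a None sentinel arrives."""
    while True:
        item = job_queue.get()
        if item is None:
            job_queue.task_done()
            break
        job_id, payload = item
        set_status(job_id, "running")
        try:
            payload()  # the job itself; here any zero-argument callable
            set_status(job_id, "finished")
        except Exception:
            set_status(job_id, "failed")
        finally:
            job_queue.task_done()
```

With a real broker the worker would be a separate long-running process and the payload a serialized message rather than a callable, but the division of labor is the same: the web process only enqueues and reads status.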

Running asynchronous python code in a Django web application

Is it OK to run certain pieces of code asynchronously in a Django web app? If so, how?
For example:
I have a search algorithm that returns hundreds or thousands of results. I want to enter into the database that these items were the result of the search, so I can see what users are searching most. I don't want the client to have to wait an extra hundred or thousand more database inserts. Is there a way I can do this asynchronously? Is there any danger in doing so? Is there a better way to achieve this?
As far as Django is concerned, yes.
The bigger concern is your web server and if it plays nice with threading. For instance, the sync workers of gunicorn are single threads, but there are other engines, such as greenlet. I'm not sure how well they play with threads.
Combining threading and multiprocessing can be an issue if you're forking from threads:
Status of mixing multiprocessing and threading in Python
http://bugs.python.org/issue6721
That being said, I know of popular performance analytics utilities that use threads to report metrics, so it seems to be an accepted practice.
In sum, it seems safest to use the threading.Thread object from the standard library, so long as whatever you do in it doesn't fork (as Python's multiprocessing library does).
https://docs.python.org/2/library/threading.html
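A minimal sketch of the threading.Thread approach, assuming a list stands in for the database table and a fake search produces the results; search_view and record_results are hypothetical names, not Django APIs:

```python
import threading

search_log = []  # stand-in for the database table
log_lock = threading.Lock()


def record_results(query, results):
    """Slow bookkeeping that the client should not have to wait for."""
    with log_lock:
        for item in results:
            search_log.append((query, item))


def search_view(query):
    """Hypothetical view: compute results, log them in the background, return."""
    results = [f"{query}-{i}" for i in range(3)]  # placeholder search
    thread = threading.Thread(target=record_results, args=(query, results))
    thread.start()
    # The thread is returned only so a caller or test can join it.
    return results, thread
```

In a real Django project each thread gets its own database connection, so the background function would also need to close that connection when it finishes.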
Offloading requests from the main thread is a common practice; as the end goal is to return a result to the client (browser) as quickly as possible.
As I am sure you are aware, an HTTP request blocks the client: until you return a response, the client cannot do anything with that request (it sits in a waiting state).
The de facto way of offloading requests is through Celery, which is a task queuing system.
I highly recommend you read the introduction to celery topic, but in summary here is what happens:
You mark certain pieces of code as "tasks". These are usually functions that you want to run asynchronously.
Celery manages workers - you can think of them as threads - that will run these tasks.
To communicate with the worker a message queue is required. RabbitMQ is the one often recommended.
Once you have all the components running (it takes but a few minutes), your workflow goes like this:
In your view, when you want to offload some work, you call the function that does that work through its .delay() method. This queues the task so that a worker can execute it in the background.
Your view then returns a response immediately.
You can then check for the result of the task, and take appropriate actions based on what needs to be done. There are ways to track progress as well.
It is also good practice to include caching - so that you are not executing expensive tasks unnecessarily. For example, you might choose to offload a request to do some analytics on search keywords that will be placed in a report.
Once the report is generated, I would cache the results (if applicable) so that the same report can be displayed if requested later - rather than be generated again.
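As a rough illustration of that caching idea, here is a sketch using functools.lru_cache from the standard library. build_report and the calls counter are hypothetical, and in a Django project the built-in cache framework would be the more natural home for this:

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the expensive work actually runs


@lru_cache(maxsize=128)
def build_report(keyword):
    """Hypothetical expensive report, cached per keyword."""
    calls["count"] += 1
    return f"report for {keyword!r}"
```

Requesting the same keyword twice runs the expensive body only once; the second call is served from the cache.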

Python: Django how to own a long running process?

So I have a background process that I need to expose/control as a web service. I have wrapped the process to be able to accept commands via a pipe, but now am trying to find out how to control it.
Requirements are as follows:
Need to be able to start the process via the web
Need to be able to send cmds
Need to be able to return results from cmds
Process once started is alive until killed
I think the main question is: how do I get Django to own the process? Own in the sense of keeping a valid reference to the pipe for future communication with the background process. Right now it's something along these lines (just an example):
    from multiprocessing import Pipe

    if __name__ == '__main__':
        to_process_pipe, process_pipe = Pipe()
        p = PFacade(process_pipe)  # wrapper around the background process
        p.start()
        to_process_pipe.send(['connect'])
        print(to_process_pipe.recv())
        p.killed = True
        p.join()
I think I need a better way to communicate, because I am not sure how I could store the Pipe in Django.
And please, if you are going to respond with "use Celery", please give me a good explanation of how.
ok, so you want a process to be up and running and accepting commands from the django workers?
In such a case Celery will not be a good solution, as it does not provide communication with a task after it is spawned.
IMHO a good solution would be to have a daemon (implemented as a Django management command) with an infinite main loop, some sleep between runs, listening for commands on a specific queue.
For the communication - kombu/django-kombu will be great (that was a part of celery).
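The main loop of such a daemon might look like the sketch below. It swaps in multiprocessing.Pipe (as used in the question) for a kombu queue so that it runs with the standard library alone. The command names and the command_loop function are assumptions for illustration:

```python
import threading
from multiprocessing import Pipe


def command_loop(conn):
    """Answer commands arriving on `conn` until told to stop."""
    state = {"connected": False}
    while True:
        command = conn.recv()  # blocks until a command arrives
        name = command[0]
        if name == "connect":
            state["connected"] = True
            conn.send("connected")
        elif name == "status":
            conn.send("up" if state["connected"] else "idle")
        elif name == "stop":
            conn.send("stopping")
            break
        else:
            conn.send(f"unknown command: {name}")


if __name__ == "__main__":
    # Demo: run the loop in a thread and talk to it over the pipe.
    to_process, process_end = Pipe()
    loop = threading.Thread(target=command_loop, args=(process_end,))
    loop.start()
    to_process.send(["connect"])
    print(to_process.recv())
    to_process.send(["stop"])
    print(to_process.recv())
    loop.join()
```

A pipe only connects the two ends that created it, so the Django workers could not share it across processes. That is the gap a named queue on a broker fills.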
My final solution was to write a custom "mailbox" based on pidbox.Mailbox. Their implementation was horribly broken but the algorithm was solid.
I basically stood up a REST API hosted via Django and then had that REST API send a message to an AMQP queue (QPID implementation).
I then had a process that sits, monitors the queues, and passes along any commands as they come in.
It worked well and was pretty awesome when it came together.
Maybe Celery (asynchronous task queue/job queue based on distributed message passing) fits your bill.

starting my own threads within python paste

I'm writing a web application using pylons and paste. I have some work I want to do after an HTTP request is finished (send some emails, write some stuff to the db, etc) that I don't want to block the HTTP request on.
If I start a thread to do this work, is that OK? I always see this stuff about paste killing off hung threads, etc. Will it kill my threads which are doing work?
What else can I do here? Is there a way I can make the request return but have some code run after it's done?
Thanks.
You could use a thread approach (maybe setting the Thread.daemon property would help, but I'm not sure).
However, I would suggest looking into a task queuing system. You can place a task on a queue (which is very fast), then a listener can handle the tasks asynchronously, allowing the HTTP request to return quickly. There are two task queues that I know of for Django:
Django Queue Service
Celery
You could also consider using a more "enterprise" messaging solution, such as RabbitMQ or ActiveMQ.
Edit: previous answer with some good pointers.
I think the best solution is a messaging system, because it can be configured not to lose the task if the pylons process goes down. I would always use processes over threads, especially in this case. If you are using Python 2.6+, use the built-in multiprocessing module, or you can always install the processing module, which you can find on PyPI (I can't post a link because I am a new user).
Take a look at gearman, it was specifically made for farming out tasks to 'workers' to handle. They can even handle it in a different language entirely. You can come back and ask if the task was completed, or just let it complete. That should work well for many tasks.
If you absolutely need to ensure it was completed, I'd suggest queuing tasks in a database or somewhere persistent, then have a separate process that runs through it ensuring each one gets handled appropriately.
To answer your basic question directly, you should be able to use threads just as you'd like. The "killing hung threads" part is paste cleaning up its own threads, not yours.
There are other packages that might help, etc, but I'd suggest you start with simple threads and see how far you get. Only then will you know what you need next.
(Note, "Thread.daemon" should be mostly irrelevant to you here. Setting that to true will ensure a thread you start will not prevent the entire process from exiting. Doing so would mean, however, that if the process exited "cleanly" (as opposed to being forced to exit) your thread would be terminated even if it hadn't finished its work. Whether that's a problem, and how you handle things like that, depend entirely on your own requirements and design.)
