starting my own threads within python paste - python

I'm writing a web application using pylons and paste. I have some work I want to do after an HTTP request is finished (send some emails, write some stuff to the db, etc) that I don't want to block the HTTP request on.
If I start a thread to do this work, is that OK? I always see this stuff about paste killing off hung threads, etc. Will it kill my threads which are doing work?
What else can I do here? Is there a way I can make the request return but have some code run after it's done?
Thanks.

You could use a thread approach (maybe setting the Thead.daemon property would help--but I'm not sure).
However, I would suggest looking into a task queuing system. You can place a task on a queue (which is very fast), then a listener can handle the tasks asynchronously, allowing the HTTP request to return quickly. There are two task queues that I know of for Django:
Django Queue Service
Celery
You could also consider using an more "enterprise" messaging solution, such as RabbitMQ or ActiveMQ.
Edit: previous answer with some good pointers.

I think the best solution is messaging system because it can be configured to not loose the task if the pylons process goes down. I would always use processes over threads especially in this case. If you are using python 2.6+ use the built in multiprocessing or you can always install the processing module which you can find on pypi (I can't post link because of I am a new user).

Take a look at gearman, it was specifically made for farming out tasks to 'workers' to handle. They can even handle it in a different language entirely. You can come back and ask if the task was completed, or just let it complete. That should work well for many tasks.
If you absolutely need to ensure it was completed, I'd suggest queuing tasks in a database or somewhere persistent, then have a separate process that runs through it ensuring each one gets handled appropriately.

To answer your basic question directly, you should be able to use threads just as you'd like. The "killing hung threads" part is paste cleaning up its own threads, not yours.
There are other packages that might help, etc, but I'd suggest you start with simple threads and see how far you get. Only then will you know what you need next.
(Note, "Thread.daemon" should be mostly irrelevant to you here. Setting that true will ensure a thread you start will not prevent the entire process from exiting. Doing so would mean, however, that if the process exited "cleanly" (as opposed to being forced to exit) your thread would be terminated even if it wasn't done its work. Whether that's a problem, and how you handle things like that, depend entirely on your own requirements and design.

Related

Is it a bad practice to use sleep() in a web server in production?

I'm working with Django1.8 and Python2.7.
In a certain part of the project, I open a socket and send some data through it. Due to the way the other end works, I need to leave some time (let's say 10 miliseconds) between each data that I send:
while True:
send(data)
sleep(0.01)
So my question is: is it considered a bad practive to simply use sleep() to create that pause? Is there maybe any other more efficient approach?
UPDATED:
The reason why I need to create that pause is because the other end of the socket is an external service that takes some time to process the chunks of data I send. I should also point out that it doesnt return anything after having received or let alone processed the data. Leaving that brief pause ensures that each chunk of data that I send gets properly processed by the receiver.
EDIT: changed the sleep to 0.01.
Yes, this is bad practice and an anti-pattern. You will tie up the "worker" which is processing this request for an unknown period of time, which will make it unavailable to serve other requests. The classic pattern for web applications is to service a request as-fast-as-possible, as there is generally a fixed or max number of concurrent workers. While this worker is continually sleeping, it's effectively out of the pool. If multiple requests hit this endpoint, multiple workers are tied up, so the rest of your application will experience a bottleneck. Beyond that, you also have potential issues with database locks or race conditions.
The standard approach to handling your situation is to use a task queue like Celery. Your web-application would tell Celery to initiate the task and then quickly finish with the request logic. Celery would then handle communicating with the 3rd party server. Django works with Celery exceptionally well, and there are many tutorials to help you with this.
If you need to provide information to the end-user, then you can generate a unique ID for the task and poll the result backend for an update by having the client refresh the URL every so often. (I think Celery will automatically generate a guid, but I usually specify one.)
Like most things, short answer: it depends.
Slightly longer answer:
If you're running it in an environment where you have many (50+ for example) connections to the webserver, all of which are triggering the sleep code, you're really not going to like the behavior. I would strongly recommend looking at using something like celery/rabbitmq so Django can dump the time delayed part onto something else and then quickly respond with a "task started" message.
If this is production, but you're the only person hitting the webserver, it still isn't great design, but if it works, it's going to be hard to justify the extra complexity of the task queue approach mentioned above.

Running asynchronous python code in a Django web application

Is it OK to run certain pieces of code asynchronously in a Django web app. If so how?
For example:
I have a search algorithm that returns hundreds or thousands of results. I want to enter into the database that these items were the result of the search, so I can see what users are searching most. I don't want the client to have to wait an extra hundred or thousand more database inserts. Is there a way I can do this asynchronously? Is there any danger in doing so? Is there a better way to achieve this?
As far as Django is concerned yes.
The bigger concern is your web server and if it plays nice with threading. For instance, the sync workers of gunicorn are single threads, but there are other engines, such as greenlet. I'm not sure how well they play with threads.
Combining threading and multiprocessing can be an issue if you're forking from threads:
Status of mixing multiprocessing and threading in Python
http://bugs.python.org/issue6721
That being said, I know of popular performance analytics utilities that have been using threads to report on metrics, so seems to be an accepted practice.
In sum, seems safest to use the threading.Thread object from the standard library, so long as whatever you do in it doesn't fork (python's multiprocessing library)
https://docs.python.org/2/library/threading.html
Offloading requests from the main thread is a common practice; as the end goal is to return a result to the client (browser) as quickly as possible.
As I am sure you are aware, HTTP is blocking - so until you return a response, the client cannot do anything (it is blocked, in a waiting state).
The de-facto way of offloading requests is through celery which is a task queuing system.
I highly recommend you read the introduction to celery topic, but in summary here is what happens:
You mark certain pieces of codes as "tasks". These are usually functions that you want to run asynchronously.
Celery manages workers - you can think of them as threads - that will run these tasks.
To communicate with the worker a message queue is required. RabbitMQ is the one often recommended.
Once you have all the components running (it takes but a few minutes); your workflow goes like this:
In your view, when you want to offload some work; you will call the function that does that work with the .delay() option. This will trigger the worker to start executing the method in the background.
Your view then returns a response immediately.
You can then check for the result of the task, and take appropriate actions based on what needs to be done. There are ways to track progress as well.
It is also good practice to include caching - so that you are not executing expensive tasks unnecessarily. For example, you might choose to offload a request to do some analytics on search keywords that will be placed in a report.
Once the report is generated, I would cache the results (if applicable) so that the same report can be displayed if requested later - rather than be generated again.

What's alternative choice for background worker, async defered task queue in django/wsgi besides celery?

Are there any pure wsgi implementation of background task?
I want to use local variables under the same context directly, not serialize/deserialize to another daemon process via a broker.
Is it possible to make this happen under the current wsgi infrastructure? E.g. after return response yield, run some callback functions?
This is a duplicate of question asked on the Python WEB-SIG. I reference the same page as provided in response to the question on the Python WEB-SIG so others can see it:
http://code.google.com/p/modwsgi/wiki/RegisteringCleanupCode
In doing this though, it ties up the request thread and so it would not be able to handle other requests until your task has finished.
Creating background threads at the end of a request is not a good idea unless you do it using a pooling mechanism such that you limit the number of worker threads for your tasks. Because the process can crash or be shutdown, you loose the job as only in memory and thus not persistent.
Better to use Celery, or if you think that is too heavy weight, have a look at Redis Queue (RQ) instead.
You could look at Django async. It uses an in-database queue and so handles transactions much better. All arguments need to be JSONable as does the return type. In some cases this means you may need to schedule a wrapper function, but that oughtn't to cause you any headaches.
http://pypi.python.org/pypi/django-async
You don't want to be doing this sort of thing inside the web server -- it's absolutely not the right place to do it. Django async provides a manage.py command for flushing the queue which you can run in a loop, possible on another machine from the web server.

python threading/fork?

I'm making a python script that needs to do 3 things simultaneously.
What is a good way to achieve this as do to what i've heard about the GIL i'm not so lean into using threads anymore.
2 of the things that the script needs to do will be heavily active, they will have lots of work to do and then i need to have the third thing reporting to the user over a socket when he asks (so it will be like a tiny server) about the status of the other 2 processes.
Now my question is what would be a good way to achieve this? I don't want to have three different script and also due to GIL using threads i think i won't get much performance and i'll make things worse.
Is there a fork() for python like in C so from my script so fork 2 processes that will do their job and from the main process to report to the user? And how can i communicate from the forked processes with the main process?
LE:: to be more precise 1thread should get email from a imap server and store them into a database, another thread should get messages from db that needs to be sent and then send them and the main thread should be a tiny http server that will just accept one url and will show the status of those two threads in json format. So are threads oK? will the work be done simultaneously or due to the gil there will be performance issues?
I think you could use the multiprocessing package that has an API similar to the threading package and will allow you to get a better performance with multiple cores on a single CPU.
To view the gain of performance using multiprocessing instead threading, check on this link about the average time comparison of the same program using multiprocessing x threading.
The GIL is really only something to care about if you want to do multiprocessing, that is spread the load over several cores/processors. If that is the case, and it kinda sounds like it from your description, use multiprocessing.
If you just need to do three things "simultaneously" in that way that you need to wait in the background for things to happen, then threads are just fine. That's what threads are for in the first place. 8-I)

Python "Task Server"

My question is: which python framework should I use to build my server?
Notes:
This server talks HTTP with it's clients: GET and POST (via pyAMF)
Clients "submit" "tasks" for processing and, then, sometime later, retrieve the associated "task_result"
submit and retrieve might be separated by days - different HTTP connections
The "task" is a lump of XML describing a problem to be solved, and a "task_result" is a lump of XML describing an answer.
When a server gets a "task", it queues it for processing
The server manages this queue and, when tasks get to the top, organises that they are processed.
the processing is performed by a long running (15 mins?) external program (via subprocess) which is feed the task XML and which produces a "task_result" lump of XML which the server picks up and stores (for later Client retrieval).
it serves a couple of basic HTML pages showing the Queue and processing status (admin purposes only)
I've experimented with twisted.web, using SQLite as the database and threads to handle the long running processes.
But I can't help feeling that I'm missing a simpler solution. Am I? If you were faced with this, what technology mix would you use?
I'd recommend using an existing message queue. There are many to choose from (see below), and they vary in complexity and robustness.
Also, avoid threads: let your processing tasks run in a different process (why do they have to run in the webserver?)
By using an existing message queue, you only need to worry about producing messages (in your webserver) and consuming them (in your long running tasks). As your system grows you'll be able to scale up by just adding webservers and consumers, and worry less about your queuing infrastructure.
Some popular python implementations of message queues:
http://code.google.com/p/stomper/
http://code.google.com/p/pyactivemq/
http://xph.us/software/beanstalkd/
I'd suggest the following. (Since it's what we're doing.)
A simple WSGI server (wsgiref or werkzeug). The HTTP requests coming in will naturally form a queue. No further queueing needed. You get a request, you spawn the subprocess as a child and wait for it to finish. A simple list of children is about all you need.
I used a modification of the main "serve forever" loop in wsgiref to periodically poll all of the children to see how they're doing.
A simple SQLite database can track request status. Even this may be overkill because your XML inputs and results can just lay around in the file system.
That's it. Queueing and threads don't really enter into it. A single long-running external process is too complex to coordinate. It's simplest if each request is a separate, stand-alone, child process.
If you get immense bursts of requests, you might want a simple governor to prevent creating thousands of children. The governor could be a simple queue, built using a list with append() and pop(). Every request goes in, but only requests that fit will in some "max number of children" limit are taken out.
My reaction is to suggest Twisted, but you've already looked at this. Still, I stick by my answer. Without knowing you personal pain-points, I can at least share some things that helped me reduce almost all of the deferred-madness that arises when you have several dependent, blocking actions you need to perform for a client.
Inline callbacks (lightly documented here: http://twistedmatrix.com/documents/8.2.0/api/twisted.internet.defer.html) provide a means to make long chains of deferreds much more readable (to the point of looking like straight-line code). There is an excellent example of the complexity reduction this affords here: http://blog.mekk.waw.pl/archives/14-Twisted-inlineCallbacks-and-deferredGenerator.html
You don't always have to get your bulk processing to integrate nicely with Twisted. Sometimes it is easier to break a large piece of your program off into a stand-alone, easily testable/tweakable/implementable command line tool and have Twisted invoke this tool in another process. Twisted's ProcessProtocol provides a fairly flexible way of launching and interacting with external helper programs. Furthermore, if you suddenly decide you want to cloudify your application, it is not all that big of a deal to use a ProcessProtocol to simply run your bulk processing on a remote server (random EC2 instances perhaps) via ssh, assuming you have the keys setup already.
You can have a look at celery
It seems any python web framework will suit your needs. I work with a similar system on a daily basis and I can tell you, your solution with threads and SQLite for queue storage is about as simple as you're going to get.
Assuming order doesn't matter in your queue, then threads should be acceptable. It's important to make sure you don't create race conditions with your queues or, for example, have two of the same job type running simultaneously. If this is the case, I'd suggest a single threaded application to do the items in the queue one by one.

Categories