The purpose is to implement a pool like database connection pool in my web application. My application is write by Django.
The problem is that every time a http request come, my code will be loaded and run through. So if I write some code to initiate a pool. These code will be run per http request. And the pool will be initiated per request. So it is meaningless.
So how should I write this?
Your understanding of how things work is wrong, unfortunately. The way Django runs is very much dependent on the way you are deploying it, but in almost all circumstances it does not load code or initiate globals on every request. Certainly, uWSGI does not behave that way; it runs a set of long-lived workers that persist across many requests.
In effect, uWSGI is already a connection pool. In other words, you are trying to solve a problem that does not exist.
Related
I am working on a Python web app that uses Celery to schedule and execute user job requests.
Most of the time the requests submitted by a user can't be resolved immediately and thus it makes sense to me to schedule them in a queue.
However, now that I have the whole queuing architecture in place, I'm confused about whether I should delegate all the request processing logic to the queue/workers or if I should leave some of the work to the webserver itself.
For example, apart from the job scheduling, there are times where a user only needs to perform a simple database query, or retrieve a static JSON file. Should I also delegate these "synchronous" requests to the queue/workers?
Right now, my webserver controllers don't do anything except validating incoming JSON request schemas and forwarding them to the queue. What are the pros and cons of having a dumb webserver like this?
I believe the way you have it right now plus giving the workers the small jobs now is good. That way the workers would be overloaded first in the event of an attack or huge request influx. :)
I'm working with Django1.8 and Python2.7.
In a certain part of the project, I open a socket and send some data through it. Due to the way the other end works, I need to leave some time (let's say 10 miliseconds) between each data that I send:
while True:
send(data)
sleep(0.01)
So my question is: is it considered a bad practive to simply use sleep() to create that pause? Is there maybe any other more efficient approach?
UPDATED:
The reason why I need to create that pause is because the other end of the socket is an external service that takes some time to process the chunks of data I send. I should also point out that it doesnt return anything after having received or let alone processed the data. Leaving that brief pause ensures that each chunk of data that I send gets properly processed by the receiver.
EDIT: changed the sleep to 0.01.
Yes, this is bad practice and an anti-pattern. You will tie up the "worker" which is processing this request for an unknown period of time, which will make it unavailable to serve other requests. The classic pattern for web applications is to service a request as-fast-as-possible, as there is generally a fixed or max number of concurrent workers. While this worker is continually sleeping, it's effectively out of the pool. If multiple requests hit this endpoint, multiple workers are tied up, so the rest of your application will experience a bottleneck. Beyond that, you also have potential issues with database locks or race conditions.
The standard approach to handling your situation is to use a task queue like Celery. Your web-application would tell Celery to initiate the task and then quickly finish with the request logic. Celery would then handle communicating with the 3rd party server. Django works with Celery exceptionally well, and there are many tutorials to help you with this.
If you need to provide information to the end-user, then you can generate a unique ID for the task and poll the result backend for an update by having the client refresh the URL every so often. (I think Celery will automatically generate a guid, but I usually specify one.)
Like most things, short answer: it depends.
Slightly longer answer:
If you're running it in an environment where you have many (50+ for example) connections to the webserver, all of which are triggering the sleep code, you're really not going to like the behavior. I would strongly recommend looking at using something like celery/rabbitmq so Django can dump the time delayed part onto something else and then quickly respond with a "task started" message.
If this is production, but you're the only person hitting the webserver, it still isn't great design, but if it works, it's going to be hard to justify the extra complexity of the task queue approach mentioned above.
Is it OK to run certain pieces of code asynchronously in a Django web app. If so how?
For example:
I have a search algorithm that returns hundreds or thousands of results. I want to enter into the database that these items were the result of the search, so I can see what users are searching most. I don't want the client to have to wait an extra hundred or thousand more database inserts. Is there a way I can do this asynchronously? Is there any danger in doing so? Is there a better way to achieve this?
As far as Django is concerned yes.
The bigger concern is your web server and if it plays nice with threading. For instance, the sync workers of gunicorn are single threads, but there are other engines, such as greenlet. I'm not sure how well they play with threads.
Combining threading and multiprocessing can be an issue if you're forking from threads:
Status of mixing multiprocessing and threading in Python
http://bugs.python.org/issue6721
That being said, I know of popular performance analytics utilities that have been using threads to report on metrics, so seems to be an accepted practice.
In sum, seems safest to use the threading.Thread object from the standard library, so long as whatever you do in it doesn't fork (python's multiprocessing library)
https://docs.python.org/2/library/threading.html
Offloading requests from the main thread is a common practice; as the end goal is to return a result to the client (browser) as quickly as possible.
As I am sure you are aware, HTTP is blocking - so until you return a response, the client cannot do anything (it is blocked, in a waiting state).
The de-facto way of offloading requests is through celery which is a task queuing system.
I highly recommend you read the introduction to celery topic, but in summary here is what happens:
You mark certain pieces of codes as "tasks". These are usually functions that you want to run asynchronously.
Celery manages workers - you can think of them as threads - that will run these tasks.
To communicate with the worker a message queue is required. RabbitMQ is the one often recommended.
Once you have all the components running (it takes but a few minutes); your workflow goes like this:
In your view, when you want to offload some work; you will call the function that does that work with the .delay() option. This will trigger the worker to start executing the method in the background.
Your view then returns a response immediately.
You can then check for the result of the task, and take appropriate actions based on what needs to be done. There are ways to track progress as well.
It is also good practice to include caching - so that you are not executing expensive tasks unnecessarily. For example, you might choose to offload a request to do some analytics on search keywords that will be placed in a report.
Once the report is generated, I would cache the results (if applicable) so that the same report can be displayed if requested later - rather than be generated again.
I have inherited a rather large code base that utilizes tornado to compute and serve big and complex data-types (imagine a 1 MB XML file). Currently there are 8 instances of tornado running to compute and serve this data.
That was a wrong design-decision from the start and I am facing many many timeouts from applications that access the servers.
I'd like to change as few lines of code as possible in the legacy code base because I do not want to break anything that has already been tested in the field. What can I do to transform this system into a threaded one that can execute more xml-computation in parallel?
transform this system into a threaded one that can execute more xml-computation in parallel
If there are enough Tornado instances to saturate the computational resources, moving to a threaded model will probably not gain much performance. Getting rid of blocking code however helps with connection timeouts.
Another option is getting rid of all asynchronous code and using tornado.wsgi.WSGIApplication. That way, you can run the application on a threaded WSGI server. Features that are not available in WSGI mode are listed here.
Use Tornado to just receive non-blocking requests. To do the actual XML processing you can then spawn another process or use an async task processor like celery. Using celery would facilitate easy scaling of your system in future. In fact with this model you'll just need one Tornado instance.
#Eren - I don't think that the computational resources are getting saturated. It would just be that more than 8 requests are not getting processed simultaneously as Tornado would right now be serving requests in blocking mode.
Are there any pure wsgi implementation of background task?
I want to use local variables under the same context directly, not serialize/deserialize to another daemon process via a broker.
Is it possible to make this happen under the current wsgi infrastructure? E.g. after return response yield, run some callback functions?
This is a duplicate of question asked on the Python WEB-SIG. I reference the same page as provided in response to the question on the Python WEB-SIG so others can see it:
http://code.google.com/p/modwsgi/wiki/RegisteringCleanupCode
In doing this though, it ties up the request thread and so it would not be able to handle other requests until your task has finished.
Creating background threads at the end of a request is not a good idea unless you do it using a pooling mechanism such that you limit the number of worker threads for your tasks. Because the process can crash or be shutdown, you loose the job as only in memory and thus not persistent.
Better to use Celery, or if you think that is too heavy weight, have a look at Redis Queue (RQ) instead.
You could look at Django async. It uses an in-database queue and so handles transactions much better. All arguments need to be JSONable as does the return type. In some cases this means you may need to schedule a wrapper function, but that oughtn't to cause you any headaches.
http://pypi.python.org/pypi/django-async
You don't want to be doing this sort of thing inside the web server -- it's absolutely not the right place to do it. Django async provides a manage.py command for flushing the queue which you can run in a loop, possible on another machine from the web server.