I'm learning my way around python and Django, and can't seem to find clear documentation for firing off a background process or thread, to perform some data processing (including pulling info from external websites/urls).
Where can I learn more about background processes/threads in Django?
I'm especially interested in tutorials that demo pulling and pushing data across multiple sites/servers/protocols.
Use Celery, a task queue that works well with Django. Add a delayed task in your view and Celery will handle it in a separate process. Tutorials are available on the Celery homepage.
Once you understand how to create tasks and add them to the queue, you can use standard Python modules like urllib2 to open URLs, or other specialized modules to work with REST APIs.
Under no circumstances should you try to create a new thread or start a subprocess in a view function. Always use delayed task management.
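For illustration, here is a minimal sketch of such a task, assuming Celery 3.1+ and the urllib2 module mentioned above; the function name and module layout are made up, not taken from any particular tutorial:

    # tasks.py -- a minimal sketch, not from the Celery docs
    from celery import shared_task
    import urllib2

    @shared_task
    def fetch_url(url):
        # Runs in a Celery worker process, outside the request/response cycle.
        return urllib2.urlopen(url).read()

In a view you would then queue it with fetch_url.delay("http://example.com/feed") and return a response to the user straight away.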
You can use Python's subprocess module (http://docs.python.org/library/subprocess.html):

    import subprocess
    subprocess.call(["ls", "-l"])
I have a python application running inside of a pod in kubernetes which subscribes to a Google Pub/Sub topic and on each message downloads a file from a google bucket.
The issue I have is that I can't process the workload quickly enough using a single threaded Python application. I would normally run a number of pods to handle the workload but the problem is that all the files have to end up on the same filesystem to be processed by another application.
I have tried spawning a new thread for each request but the volume is too great.
What I would like to do is:
1) Have a number of processes that can process new messages
2) Keep the processes alive and use them to respond to new requests coming in.
All the multiprocessing examples I can find for Python are single-workload examples, such as passing 10 numbers to a square function, which isn't what I'm trying to achieve.
I've used gunicorn in the past, which spawns a number of workers for a Flask application; what I want is to do something similar without Flask.
First, try to separate the IO-bound work (e.g. making requests, reading/writing files, etc.) from the CPU-bound work (e.g. parsing JSON/XML, calculations, etc.).
For the IO-bound part, use the threading module or a ThreadPoolExecutor, which reuses its worker threads automatically. Bear in mind that writing to disk is a blocking operation!
If you want true parallelism for the CPU-bound part, use multiprocessing or a ProcessPoolExecutor. To synchronize between processes you can use shared (proxy) objects, files, pipes, Redis, and so on.
Shared objects such as those provided by a Manager (Namespace, dict, etc.) are preferable if you want to stay in pure Python.
To avoid blocking on file work, use a dedicated thread or go asynchronous, as in the sketch below.
For asyncio, use the aiofile library.
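As a rough sketch of the long-lived pool pattern described above (get_next_message, download_file and parse_file are hypothetical placeholders, not code from the question):

    # Threads for the IO-bound downloads, processes for the CPU-bound parsing.
    # Both pools are created once and reused, so the workers stay alive
    # between messages instead of being spawned per request.
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
    import time

    def get_next_message():
        time.sleep(1)                      # stand-in for a blocking Pub/Sub pull
        return {"file": "example.dat"}

    def download_file(message):
        return "/tmp/" + message["file"]   # stand-in for the bucket download

    def parse_file(path):
        return len(path)                   # stand-in for the CPU-bound work

    def handle_message(message, process_pool):
        path = download_file(message)            # IO-bound, fine in a thread
        process_pool.submit(parse_file, path)    # CPU-bound, hand to a process

    def main():
        with ProcessPoolExecutor(max_workers=4) as processes, \
             ThreadPoolExecutor(max_workers=16) as threads:
            while True:
                threads.submit(handle_message, get_next_message(), processes)

    if __name__ == "__main__":
        main()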
I am trying to build a Tornado web server which takes requests from multiple clients. The request consists of:
a. For a given directory name passed through a URL, zip the files, etc. and FTP the archive out.
b. Providing a status of sorts if the task is completed.
So, rather than making it a synchronous and linear process, I wanted to break it down into multiple subtasks. The client will submit the URL request and then simply receive a response of sorts 'job submitted'. A bit later, the client can come along asking status on this job. During this time the job obviously has to finish its task.
I am confused about which modules to use - Tornado Subprocess, the Popen constructor, subprocess.call, etc. I've read the Python docs but can't find anything covering the case where the task runs for a long time and Tornado is not supposed to wait for it to finish. So I need a mechanism to start a job, let it run its course while releasing the client, and then provide a status on it when the client asks.
Any help is appreciated. Thanks.
Python programmers widely use Celery to manage a queue of tasks with a set of worker processes. Set up Celery with RabbitMQ and write a Celery worker (perhaps using Celery Canvas) that does the work you need: zips a directory, FTPs it somewhere, etc.
The Tornado-Celery integration package provides something that appears close to what you need to integrate your Tornado application with Celery.
This is all a lot of moving parts to install and configure at first, of course, but it will prepare you for a maintainable application architecture.
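As a rough sketch of that architecture (the broker URL, task body, and handler routes below are illustrative assumptions, and it uses plain Celery rather than the tornado-celery package):

    # tasks.py -- a minimal sketch, not a drop-in implementation
    import shutil
    from celery import Celery

    app = Celery("jobs", broker="amqp://localhost//", backend="rpc://")

    @app.task
    def zip_and_ship(directory):
        archive = shutil.make_archive(directory, "zip", directory)
        # ... FTP `archive` to its destination here ...
        return archive

    # web.py -- Tornado handlers that submit the job and report its status
    import tornado.ioloop
    import tornado.web
    from tasks import zip_and_ship

    class SubmitHandler(tornado.web.RequestHandler):
        def get(self, directory):
            result = zip_and_ship.delay(directory)   # returns immediately
            self.write({"job": result.id, "status": "job submitted"})

    class StatusHandler(tornado.web.RequestHandler):
        def get(self, job_id):
            result = zip_and_ship.AsyncResult(job_id)
            self.write({"job": job_id, "status": result.state})

    application = tornado.web.Application([
        (r"/submit/(.+)", SubmitHandler),
        (r"/status/(.+)", StatusHandler),
    ])

    if __name__ == "__main__":
        application.listen(8888)
        tornado.ioloop.IOLoop.current().start()

The handler that submits the job never blocks on the work itself; the client polls the status URL until result.state reports SUCCESS.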
Is it OK to run certain pieces of code asynchronously in a Django web app? If so, how?
For example:
I have a search algorithm that returns hundreds or thousands of results. I want to record in the database that these items were returned by the search, so I can see what users are searching for most. I don't want the client to have to wait for an extra hundred or thousand database inserts. Is there a way I can do this asynchronously? Is there any danger in doing so? Is there a better way to achieve this?
As far as Django is concerned, yes.
The bigger concern is your web server and whether it plays nicely with threading. For instance, gunicorn's sync workers are single-threaded, but there are other worker types, such as the greenlet-based ones, and I'm not sure how well they play with threads.
Combining threading and multiprocessing can be an issue if you're forking from threads:
Status of mixing multiprocessing and threading in Python
http://bugs.python.org/issue6721
That being said, I know of popular performance analytics utilities that have been using threads to report on metrics, so seems to be an accepted practice.
In sum, it seems safest to use the threading.Thread object from the standard library, so long as whatever you do in it doesn't fork (i.e. avoid Python's multiprocessing library):
https://docs.python.org/2/library/threading.html
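A minimal sketch of that approach, where SearchHit and run_search are hypothetical names standing in for your own model and search code:

    # views.py -- fire-and-forget recording of search results on a daemon thread
    import threading

    from django.http import JsonResponse
    from myapp.models import SearchHit   # hypothetical model

    def _record_results(query, items):
        # Runs on the background thread after the response has been returned.
        for item in items:
            SearchHit.objects.create(query=query, item_id=item.pk)

    def search(request):
        query = request.GET.get("q", "")
        items = run_search(query)            # hypothetical search function
        t = threading.Thread(target=_record_results, args=(query, items))
        t.daemon = True                      # don't block process shutdown
        t.start()
        return JsonResponse({"count": len(items)})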
Offloading requests from the main thread is a common practice, since the end goal is to return a result to the client (browser) as quickly as possible.
As I am sure you are aware, HTTP is blocking - so until you return a response, the client cannot do anything (it is blocked, in a waiting state).
The de facto way of offloading work is through Celery, which is a task queuing system.
I highly recommend you read the introductory Celery documentation, but in summary here is what happens:
You mark certain pieces of codes as "tasks". These are usually functions that you want to run asynchronously.
Celery manages workers - you can think of them as threads - that will run these tasks.
To communicate with the worker a message queue is required. RabbitMQ is the one often recommended.
Once you have all the components running (it takes but a few minutes); your workflow goes like this:
In your view, when you want to offload some work, you call the function that does that work with the .delay() method. This will trigger a worker to start executing the method in the background.
Your view then returns a response immediately.
You can then check for the result of the task, and take appropriate actions based on what needs to be done. There are ways to track progress as well.
It is also good practice to include caching - so that you are not executing expensive tasks unnecessarily. For example, you might choose to offload a request to do some analytics on search keywords that will be placed in a report.
Once the report is generated, I would cache the results (if applicable) so that the same report can be displayed if requested later - rather than be generated again.
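Putting those pieces together, a hedged sketch where the task name, expensive_analytics and the cache key scheme are illustrative assumptions:

    # tasks.py -- assumes Celery is already wired into the Django project
    from celery import shared_task
    from django.core.cache import cache

    @shared_task
    def build_keyword_report(keyword):
        report = expensive_analytics(keyword)            # placeholder for the real work
        cache.set("report:" + keyword, report, 60 * 60)  # keep it for an hour
        return report

    # views.py
    from django.core.cache import cache
    from django.http import JsonResponse
    from .tasks import build_keyword_report

    def report_view(request, keyword):
        cached = cache.get("report:" + keyword)
        if cached is not None:                           # reuse the cached report
            return JsonResponse({"status": "ready", "report": cached})
        result = build_keyword_report.delay(keyword)     # offload; returns at once
        return JsonResponse({"status": "queued", "task_id": result.id})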
I am trying to set up global variables that would be accessible by any of the threads in Django. I know there are endless posts on stackoverflow about this, and everyone says don't do it. I am writing a web application which does some file processing using the Acora Python module. The Acora module builds a tree of sorts based on some input data (strings). The process of building the tree takes some time, so I'd like to build the Acora structure at application start up time, so that when files are submitted to be processed, the Acora structures would be ready to go. This would shave 30 seconds from each file to be processed if I could pull this off.
I've tried a few methods, but for each request the data isn't available, and I think it's because each request is processed in a separate thread, so I need a cross-thread or shared-memory solution, or I have to find something other than Acora. Also, Acora can't be pickled or serialized, as it is a C module and doesn't expose its data to Python. I've tried the Django cache and cPickle without luck, because they both rely on pickling. Thoughts?
Pull the Acora work out of Django entirely. Use Twisted or some other event framework to create a standalone service that Django can talk to, either directly or via a task queue such as Celery, whenever it has files that need processing.
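A rough sketch of that standalone-service idea, using the standard library's socketserver instead of Twisted; the keyword list and the line protocol are illustrative assumptions:

    # acora_service.py -- build the Acora automaton once at startup, then answer
    # match requests over a socket, so nobody pays the build cost per request.
    import socketserver
    from acora import AcoraBuilder   # assumes the acora package is installed

    KEYWORDS = ["foo", "bar", "baz"]              # hypothetical input strings
    MATCHER = AcoraBuilder(*KEYWORDS).build()     # the expensive build happens once

    class MatchHandler(socketserver.StreamRequestHandler):
        def handle(self):
            text = self.rfile.read().decode("utf-8")       # client sends the file body
            hits = [kw for kw, pos in MATCHER.findall(text)]
            self.wfile.write("\n".join(hits).encode("utf-8"))

    if __name__ == "__main__":
        with socketserver.TCPServer(("localhost", 9999), MatchHandler) as server:
            server.serve_forever()

Django (or a Celery worker) then connects to the service with a plain socket, sends the file contents, and reads back the matches.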
Couldn't find a method that does that in the GAE python docs...
No.
Because of the way taskqueues work in App Engine, there is no way to do this using the built-in taskqueue library. Unfortunately, that's just the way it is.
See the Task Queue Python API Overview for details on the built-in taskqueues.
You can manage your queues in the Administration Console:
Manage task queues, allowing for pausing, purging, and deleting queues.
Manage individual tasks in a task queue, allowing for viewing, deleting, or running individual tasks immediately.
There is a library called asynctools that allows more programmatic access to queue status, though you will likely have to restructure your program to use it.