I am trying to set up global variables that would be accessible by any of the threads in Django. I know there are endless posts on Stack Overflow about this, and everyone says don't do it. I am writing a web application that does some file processing using the Acora Python module. The Acora module builds a tree of sorts based on some input data (strings). Building the tree takes some time, so I'd like to build the Acora structure at application start-up, so that when files are submitted for processing the Acora structures are ready to go. If I could pull this off, it would shave 30 seconds off the processing of each file.
I've tried a few methods, but for each request the data isn't available, and I think it's because each request is processed in a separate thread, so I need a cross-thread or shared-memory solution, or I have to find something other than Acora. Also, Acora can't be pickled or serialized, as it is a C module and doesn't expose its data to Python. I've tried the Django cache and cPickle without luck, because they both rely on pickling. Thoughts?
Pull the Acora task out of Django entirely. Use Twisted or some other event framework to create a service that Django can talk to either directly or via a message queue such as Celery whenever it has files that need processing.
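If you go the message-queue route, a minimal sketch might look like the following, assuming the newer Celery app API; the keyword list, broker URL and task body are placeholders for your real setup. The Acora tree is built once at module import, i.e. when the worker process starts, and every task reuses it:

    from acora import AcoraBuilder
    from celery import Celery

    app = Celery("fileproc", broker="amqp://localhost")

    def build_acora():
        builder = AcoraBuilder("keyword1", "keyword2")  # your real string set goes here
        return builder.build()                          # the slow step, done once per worker

    ACORA = build_acora()   # module global: lives for the life of the worker process

    @app.task
    def process_file(path):
        with open(path) as f:
            text = f.read()
        return ACORA.findall(text)   # list of (keyword, position) matches

The Django view then just calls process_file.delay(path) and returns immediately.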
I've been able to create objects that get created on every request by following this link: http://flask.pocoo.org/docs/appcontext/#locality-of-the-context.
I'm actually creating an API based off of http://blog.miguelgrinberg.com/post/designing-a-restful-api-using-flask-restful.
I want to be able to load an object once and just have it return a processed response, rather than loading it on every request. The object is not a DB; it just requires unpickling a large file.
I've looked through the documentation, but I'm still confused about the whole two-states thing in Flask.
The Flask contexts only apply per request. Use a module global to store data you only want to load once.
You could just load the data on startup, as a global:
some_global_name = load_data_from_pickle()
WSGI servers that support multiple processes either fork the process, or start a new Python interpreter as needed. When forking, globals are copied to the child process.
You can also use the before_first_request() hook to load that data into your process; this is only called if the process has to handle an actual request. That happens after the process fork, giving your child process its own copy of the data:
@app.before_first_request
def load_global_data():
    global some_global_name
    some_global_name = load_data_from_pickle()
I'm learning my way around Python and Django, and can't seem to find clear documentation on firing off a background process or thread to perform some data processing (including pulling info from external websites/URLs).
Where can I learn more about background processes/threads in Django?
I'm especially interested in tutorials that demo pulling and pushing data across multiple sites/servers/protocols.
Use Celery, a task queue that works well with Django. Add a delayed task in your view and Celery will handle it in a separate process. Tutorials are available on the Celery homepage.
Once you understand how to create tasks and add them to the queue, you can use standard Python modules like urllib2 to open URLs, or more specialized modules to work with REST APIs.
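For example, a minimal sketch of such a task, assuming the newer Celery app API; fetch_url is a hypothetical task name and the broker URL is a placeholder:

    import urllib2

    from celery import Celery

    app = Celery("tasks", broker="amqp://localhost")

    @app.task
    def fetch_url(url):
        # runs in a Celery worker process, outside the Django request/response cycle
        return urllib2.urlopen(url, timeout=10).read()

Your view would then call fetch_url.delay("http://example.com/data") and return immediately.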
Under no circumstances should you try to create a new thread or start a subprocess in a view function. Always use delayed task management.
You can use Python subprocesses (http://docs.python.org/library/subprocess.html):
import subprocess
subprocess.call(["ls", "-l"])  # runs the command and waits for it to finish
I have a pretty standard Django + RabbitMQ + Celery setup with 1 Celery task and 5 workers.
The task uploads the same big file (~100 MB; I'm simplifying a bit) asynchronously to a number of remote PCs.
Everything works fine, at the expense of using lots of memory, since every task/worker loads that big file into memory separately.
What I would like to do is to have some kind of cache, accessible to all tasks, i.e. load the file only once. Django caching based on locmem would be perfect, but as the documentation says, "each process will have its own private cache instance", and I need this cache accessible to all workers.
I tried playing with Celery signals as described in #2129820, but that's not what I need.
So the question is: is there a way I can define something global in Celery (like a class based on dict, where I could load the file or something)? Or is there a Django trick I could use in this situation?
Thanks.
Why not simply stream the upload(s) from disk instead of loading the whole file into memory?
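A minimal sketch of chunked streaming; send_chunk is a hypothetical stand-in for whatever transport the task uses to reach the remote PCs:

    CHUNK_SIZE = 1024 * 1024  # 1 MB per read instead of the whole ~100 MB file

    def stream_upload(path, send_chunk):
        # read and push the file piece by piece, so only one chunk is in memory at a time
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                send_chunk(chunk)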
It seems to me that what you need is a memcached backend for Django. That way each task in Celery will have access to it.
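A minimal sketch of the settings.py side, assuming a Django version with the dict-style CACHES setting and a memcached daemon on the default port; note that stock memcached caps individual values at 1 MB, so a ~100 MB file would have to be split or referenced rather than stored whole:

    CACHES = {
        "default": {
            "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
            "LOCATION": "127.0.0.1:11211",  # one shared cache daemon for all workers
        }
    }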
Maybe you can use threads instead of processes for this particular task. Since threads all share the same memory, you only need one copy of the data in memory, but you still get parallel execution.
(This means not using Celery for this task.)
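A minimal sketch of that idea; upload() and the target list are hypothetical placeholders, and because the uploads are I/O-bound the GIL is not a practical obstacle here:

    import threading

    with open("big_file.bin", "rb") as f:   # assumed filename
        PAYLOAD = f.read()                  # loaded once, shared read-only by every thread

    def upload(host, data):
        pass  # hypothetical: push `data` to `host` over your transport of choice

    TARGETS = ["pc1.example.com", "pc2.example.com"]
    threads = [threading.Thread(target=upload, args=(h, PAYLOAD)) for h in TARGETS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()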
I have a Python function (well, it's PHP now, but we're rewriting it) that takes some parameters (A and B) and computes some results (it finds the best path from A to B in a graph; the graph is read-only). In a typical scenario one call takes 0.1 s to 0.9 s to complete. The function is accessed by users as a simple REST web service (GET bestpath.php?from=A&to=B). The current implementation is quite stupid: a simple PHP script + Apache + mod_php + APC, where every request needs to load all the data (over 12 MB in PHP arrays), create all the structures, compute a path and exit. I want to change it.
I want a setup with N independent workers (X per server, across Y servers); each worker is a Python app running in a loop (get request -> process -> send reply -> get request...), and each worker can process one request at a time. I need something that will act as a frontend: get requests from users, manage a queue of requests (with a configurable timeout) and feed my workers one request at a time.
How should I approach this? Can you propose a setup? nginx + FastCGI or WSGI, or something else? HAProxy? As you can see, I'm a newbie with Python, reverse proxies, etc. I just need a starting point on the architecture (and data flow).
BTW, the workers use read-only data, so there is no need for locking or communication between them.
The typical way to handle this sort of arrangement using threads in Python is to use the standard library module Queue. An example of using the Queue module for managing workers can be found here: Queue Example
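A minimal sketch of that pattern; best_path is a hypothetical stand-in for the real graph search:

    import threading
    import Queue  # named `queue` on Python 3

    requests = Queue.Queue()
    results = Queue.Queue()

    def best_path(a, b):
        return [a, b]  # placeholder for the real shortest-path computation

    def worker():
        while True:
            a, b = requests.get()
            results.put(best_path(a, b))
            requests.task_done()

    for _ in range(4):                      # fixed pool of worker threads
        t = threading.Thread(target=worker)
        t.daemon = True
        t.start()

    requests.put(("A", "B"))
    requests.join()                         # block until all queued work is done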
It looks like you need the "workers" to be separate processes (at least some of them, so you might as well make them all separate processes rather than bunches of threads divided into several processes). The multiprocessing module in the standard library of Python 2.6 and later offers good facilities to spawn a pool of processes and communicate with them via FIFO "queues"; if for some reason you're stuck with Python 2.5 or earlier, there are versions of multiprocessing on PyPI that you can download and use with those older versions of Python.
The "frontend" can and should be pretty easily made to run with WSGI (with either Apache or Nginx), and it can deal with all communication to/from worker processes via multiprocessing, without the need to use HTTP, proxying, etc. for that part of the system; only the frontend would be a web app per se. The workers just receive, process and respond to units of work as requested by the frontend. This seems the soundest, simplest architecture to me.
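A minimal sketch of the worker side using multiprocessing; load_graph and best_path are hypothetical placeholders, and the frontend would own the pool and submit work to it from its request handler:

    from multiprocessing import Pool

    GRAPH = {}

    def load_graph():
        global GRAPH
        GRAPH = {"A": ["B"]}   # placeholder: read the ~12 MB read-only graph here

    def best_path(pair):
        a, b = pair
        return [a, b]          # placeholder: real shortest-path search over GRAPH

    if __name__ == "__main__":
        # each worker process loads its own copy of the read-only graph at start-up
        pool = Pool(processes=4, initializer=load_graph)
        results = pool.map(best_path, [("A", "B"), ("C", "D")])
        pool.close()
        pool.join()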
There are other distributed processing approaches available in third party packages for Python, but multiprocessing is quite decent and has the advantage of being part of the standard library, so, absent other peculiar restrictions or constraints, multiprocessing is what I'd suggest you go for.
There are many FastCGI modules for Python with a prefork mode and a WSGI interface; the best known is flup. My personal preference for such a task is superfcgi with nginx. Both will launch several processes and dispatch requests to them. 12 MB is not that much to load separately in each process, but if you'd like to share data among workers you need threads, not processes. Note that heavy math in Python with a single process and many threads won't use several CPUs/cores efficiently because of the GIL. Probably the best approach is to use several processes (as many as you have cores), each running several threads (the default mode in superfcgi).
The simplest solution in this case is to let the webserver do all the heavy lifting. Why should you handle threads and/or processes when the webserver will do all that for you?
The standard arrangement in deployments of Python is:
The webserver starts a number of processes, each running a complete Python interpreter and loading all your data into memory.
An HTTP request comes in and gets dispatched to one of the processes.
The process does your calculation and returns the result directly to the webserver and the user.
When you need to change your code or the graph data, you restart the webserver and go back to step 1.
This is the architecture used by Django and other popular web frameworks.
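A minimal sketch of one such process in WSGI terms; load_graph and best_path are hypothetical placeholders. The webserver imports the module once per process, so the expensive load happens once rather than per request:

    from urlparse import parse_qs  # `urllib.parse` on Python 3

    def load_graph():
        return {}          # placeholder: read the ~12 MB graph data here

    def best_path(graph, a, b):
        return [a, b]      # placeholder: the real shortest-path computation

    GRAPH = load_graph()   # module global: loaded once per webserver process

    def application(environ, start_response):
        qs = parse_qs(environ.get("QUERY_STRING", ""))
        a = qs.get("from", [""])[0]
        b = qs.get("to", [""])[0]
        body = ",".join(best_path(GRAPH, a, b))
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [body]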
I think you can configure mod_wsgi/Apache so it keeps several "hot" Python interpreters in separate processes, ready to go at all times, reuses them for new requests, and spawns a new one if they are all busy. In that case you could load all the preprocessed data as module globals: they would only get loaded once per process and be reused for each new request. In fact, I'm not sure this isn't the default configuration for mod_wsgi/Apache.
The main problem here is that you might end up consuming a lot of "core" memory (but that may not be a problem either).
I think you can also configure mod_wsgi for a single process with multiple threads, but in that case you may only be using one CPU because of the Python Global Interpreter Lock (the infamous GIL).
Don't be afraid to ask on the mod_wsgi mailing list; the people there are very responsive and friendly.
You could use the nginx load balancer to proxy to PythonPaste's paster (which serves WSGI, for example for Pylons); it launches each request in a separate thread anyway.
Another option is a queue table in the database.
The worker processes run in a loop or off cron and poll the queue table for new jobs.
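A minimal sketch of such a worker loop, using sqlite3 just to keep the example self-contained; the table layout, status values and process_job are hypothetical, and with several workers you would want to claim a job atomically (e.g. with a single UPDATE ... WHERE status = 'pending') rather than with the SELECT-then-UPDATE shown here:

    import sqlite3
    import time

    def process_job(payload):
        pass  # placeholder for the real work

    conn = sqlite3.connect("queue.db")
    while True:
        row = conn.execute(
            "SELECT id, payload FROM jobs WHERE status = 'pending' "
            "ORDER BY id LIMIT 1").fetchone()
        if row is None:
            time.sleep(5)              # queue is empty; poll again in a few seconds
            continue
        job_id, payload = row
        conn.execute("UPDATE jobs SET status = 'processing' WHERE id = ?", (job_id,))
        conn.commit()
        process_job(payload)
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
        conn.commit()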
I'm looking to write a daemon that:
reads a message from a queue (SQS, RabbitMQ, whatever...) containing a path to a zip file
updates a record in the database saying something like "this job is processing"
reads the aforementioned archive's contents and, for each file found, inserts a row into a database with information culled from the file's metadata
copies each file to S3
deletes the zip file
marks the job as "complete"
reads the next message in the queue, and repeats
This should run as a service, initiated by a message queued when someone uploads a file via the web frontend. The uploader doesn't need to see the results immediately, but the upload should be processed in the background fairly expediently.
I'm fluent with Python, so the very first thing that comes to mind is writing a simple server with Twisted to handle each request and carry out the process mentioned above. But I've never written anything like this that would run in a multi-user context. It's not going to service hundreds of uploads per minute or hour, but it'd be nice if it could handle several at a time, reasonably. I'm also not terribly familiar with writing multi-threaded applications and dealing with issues like blocking.
How have people solved this in the past? What are some other approaches I could take?
Thanks in advance for any help and discussion!
I've used Beanstalkd as a queueing daemon to very good effect (some near-real-time processing and image resizing; over 2 million jobs so far in the last few weeks). Throw a message into the queue with the zip filename (maybe from a specific directory); I serialise a command and its parameters as JSON. When you reserve the message in your worker client, no one else can get it unless you allow it to time out (at which point it goes back into the queue to be picked up).
The rest is the unzipping and uploading to S3, for which there are other libraries.
If you want to handle several zip files at once, run as many worker processes as you want.
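For context, a minimal sketch of the worker side with the beanstalkc client library; handle_zip is a hypothetical stand-in for the unzip/database/S3 steps:

    import json

    import beanstalkc

    def handle_zip(path):
        pass  # placeholder: unzip, record metadata, copy files to S3, delete the archive

    beanstalk = beanstalkc.Connection(host="localhost", port=11300)
    while True:
        job = beanstalk.reserve()          # blocks until a message is available
        message = json.loads(job.body)     # e.g. {"command": "process", "path": "/tmp/foo.zip"}
        try:
            handle_zip(message["path"])
            job.delete()                   # success: remove it from the queue
        except Exception:
            job.bury()                     # park it for inspection instead of retrying forever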
I would avoid doing anything multi-threaded and instead use the queue and the database to synchronize as many worker processes as you care to start up.
For this application I think Twisted, or any framework for creating server applications, is going to be overkill.
Keep it simple: a Python script starts up, checks the queue, does some work, checks the queue again. If you want a proper background daemon, you might just want to make sure you detach from the terminal, as described here: How do you create a daemon in Python?
Add some logging, and maybe a try/except block to email failures out to yourself.
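A minimal sketch of that shape; get_next_job and do_work are hypothetical stand-ins for your queue client and the actual processing:

    import logging
    import time

    logging.basicConfig(filename="worker.log", level=logging.INFO)

    def get_next_job():
        return None   # placeholder: pull the next message off the queue, or None

    def do_work(job):
        pass          # placeholder: unzip, insert rows, copy files to S3, mark complete

    while True:
        job = get_next_job()
        if job is None:
            time.sleep(10)   # nothing queued; check again shortly
            continue
        try:
            do_work(job)
            logging.info("processed %s", job)
        except Exception:
            logging.exception("job %s failed", job)  # or email the traceback to yourself here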
I opted to use a combination of Celery (http://ask.github.com/celery/introduction.html), RabbitMQ, and a simple Django view to handle uploads. The workflow looks like this:
the Django view accepts and stores the upload
a Celery Task is dispatched to process the upload; all work is done inside the Task
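A minimal sketch of those two pieces; the form field name, temporary path and task body are hypothetical, and the task uses the later shared_task decorator (older Celery versions used celery.decorators.task instead):

    from celery import shared_task
    from django.http import HttpResponse

    @shared_task
    def process_upload(path):
        pass  # placeholder: read the archive, push files to S3, update the job record

    def handle_upload(request):
        upload = request.FILES["archive"]   # assumed form field name
        path = "/tmp/%s" % upload.name      # store the upload somewhere first
        with open(path, "wb") as dest:
            for chunk in upload.chunks():
                dest.write(chunk)
        process_upload.delay(path)          # hand off to the worker; returns immediately
        return HttpResponse("upload accepted")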