I want to set up a communication link between Jython and Python. I have a Django app and Python scripts that I use as a front end and for system admin/automation tasks, and I use Jython for WebLogic 9/10. I want to be able to hand the Jython system a request, such as task A with args a, b, c, and get a message back when it is done.
I want to do this because WLST (Jython) is slow to start, which becomes a pain when I need to do a deploy or check the status of a server or servers (up to 100 right now). What would be the easiest way to share information back to the main script or Python class while keeping the Jython/WLST system alive, so that I can easily send it requests?
So far I have been using pickle: I gather all the data, dump it to a file, then load the file back into the Python app/script.
Have you considered Celery or some other standard queue/broker messaging system? django-celery is quite mature and well developed, and it is specifically designed for this kind of task.
Django -> Celery --> Worker Process (always running)
   ^            |-> Worker Process
   |            `-> Worker Process -,
   \_________ Job Complete ________/
The basic idea is that you have always-running worker processes (on one or more servers) waiting for messages to come in (these can be pickled objects, JSON, or whatever you want). These processes idle until Celery (and its RabbitMQ backend) dishes out a message/job to them. Once the message/job is processed, a notification comes back through the broker and calls a callback in Django, in which you update the status.
Celery is a task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.
The execution units, called tasks, are executed concurrently on a single or more worker servers. Tasks can execute asynchronously (in the background) or synchronously (wait until ready).
For this kind of messaging I like to use a real language-agnostic message queueing system that can be used over and over again in future projects. Look at AMQP if you can handle having a message-queue broker in the middle managing all the queues. Or if you don't want a 3rd party broker then look at ZeroMQ.
In both cases you can send messages using a pub-sub queue that can handle several workers per queue if needed. The messages can be simple text strings (see http://tnetstrings.org/), JSON objects, or even pickled Python objects along with code to execute, if you are careful. Personally I like using JSON objects (a subset of JSON) and unpacking them into Python dicts in order to use them.
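As a rough sketch of the JSON approach, here is how a worker might turn one message into a function call and a reply. The task names, the `TASKS` table and the message fields (`id`, `task`, `args`) are all made up for illustration; how the string travels (AMQP, ZeroMQ, a socket) is left out. It is written for CPython 3, and since the json module only arrived in Python 2.6, an older Jython would need a different encoder.

```python
import json

# Hypothetical task implementations; in a real worker these would do the
# actual deploy or status check.
def deploy(app, server):
    return "deployed %s to %s" % (app, server)

def status(server):
    return "%s is RUNNING" % server

# Dispatch table: message field "task" -> function to call.
TASKS = {"deploy": deploy, "status": status}

def handle_request(raw):
    """Decode one JSON request, run the named task, return a JSON reply."""
    request_id = None
    try:
        request = json.loads(raw)
        request_id = request.get("id")
        func = TASKS[request["task"]]
        result = func(*request.get("args", []))
        reply = {"id": request_id, "ok": True, "result": result}
    except Exception as exc:
        # Report the failure to the caller instead of killing the worker.
        reply = {"id": request_id, "ok": False, "error": repr(exc)}
    return json.dumps(reply)
```

The caller matches replies to requests by `id`, so the worker can stay alive and answer many requests in a row.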
I have used both AMQP and ZeroMQ in systems with around 20 communicating Python processes. It works well, and if you need to connect to something non-Python, you will find that there is already an AMQP module and a ZeroMQ library out there.
An interesting extension of your scenario is to have 3 kinds of worker processes written in Jython, CPython and IronPython. That way you can leverage 3rd party Java and .NET modules as well as binary CPython modules like lxml. Combine it with something like Redis so that the processes are completely decoupled and can run on multiple servers if necessary. The workers would put their results into Redis instead of gumming up the message queuing system with big messages interleaved with small ones. If necessary a worker can publish a message containing a Redis key so that another process can retrieve the value.
Pickling is fine; use cPickle for efficiency. However, you should not write it to a file. Instead use another IPC mechanism, such as sockets or pipes (e.g. see https://stackoverflow.com/search?q=python+named+pipes), which avoids the disk overhead.
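A minimal sketch of sending pickled objects over a socket instead of a file, assuming Python 3 (where the C implementation is used automatically by `pickle`). The 4-byte length prefix is just one simple framing choice, and the helper names `send_obj`/`recv_obj` are mine. Only unpickle data from a peer you trust.

```python
import pickle
import socket
import struct

HEADER = struct.Struct("!I")  # 4-byte big-endian length prefix

def send_obj(sock, obj):
    """Pickle obj and send it as one length-prefixed frame."""
    payload = pickle.dumps(obj, protocol=2)  # protocol 2 is also readable by Python 2
    sock.sendall(HEADER.pack(len(payload)) + payload)

def _recv_exact(sock, n):
    """Read exactly n bytes, since recv() may return fewer than requested."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise EOFError("peer closed the connection")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)

def recv_obj(sock):
    """Receive one length-prefixed frame and unpickle it."""
    (length,) = HEADER.unpack(_recv_exact(sock, HEADER.size))
    return pickle.loads(_recv_exact(sock, length))
```

To try it in a single process, `socket.socketpair()` gives two connected sockets: call `send_obj` on one and `recv_obj` on the other.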
I have a python application running inside of a pod in kubernetes which subscribes to a Google Pub/Sub topic and on each message downloads a file from a google bucket.
The issue I have is that I can't process the workload quickly enough using a single threaded Python application. I would normally run a number of pods to handle the workload but the problem is that all the files have to end up on the same filesystem to be processed by another application.
I have tried spawning a new thread for each request but the volume is too great.
What I would like to do is:
1) Have a number of processes that can process new messages
2) Keep the processes alive and use them to respond to new requests coming in.
All the examples for multiprocessing in python are single workload examples, for example providing 10 numbers to a square function, which isn't what I'm trying to achieve.
I've used gunicorn in the past, which spawns a number of workers for a Flask application; what I want is to do something similar without Flask.
First, try to separate the I/O-bound tasks (e.g. requests, reads/writes) from the CPU-bound tasks (parsing JSON/XML, calculations, etc.).
For the I/O-bound case, use threading or ThreadPoolExecutor so that worker threads are reused automatically. Keep in mind that writing to disk is a blocking call!
If you want parallelism for CPU-bound work, use multiprocessing or ProcessPoolExecutor. To synchronize the processes you can use a shared object (proxy object), a file, a pipe, Redis, etc.
Shared objects from a Manager (Namespace, dict, etc.) are preferred if you want to stay in pure Python.
To avoid blocking when working with files, use a separate thread or use async.
For asyncio, use the aiofile library.
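For the I/O-bound part, something along these lines might be a starting point. `download` is a hypothetical stand-in that just sleeps; a real handler would fetch the file named in the message. The pool's threads are created once and reused for every message submitted to it.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def download(name):
    """Hypothetical stand-in for fetching one file; returns (name, size)."""
    time.sleep(0.01)  # simulates waiting on the network
    return name, len(name)

def consume(messages, max_workers=8):
    """Hand each incoming message to a pool of reusable worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(download, m) for m in messages]
        # result() blocks until that download finishes and re-raises its errors.
        return [f.result() for f in futures]
```

`max_workers` caps the number of threads, which addresses the "one thread per request is too many" problem without changing the handler.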
Is it OK to run certain pieces of code asynchronously in a Django web app? If so, how?
For example:
I have a search algorithm that returns hundreds or thousands of results. I want to record in the database that these items were the result of the search, so I can see what users are searching for most. I don't want the client to have to wait for an extra hundred or thousand database inserts. Is there a way I can do this asynchronously? Is there any danger in doing so? Is there a better way to achieve this?
As far as Django is concerned, yes.
The bigger concern is your web server and whether it plays nicely with threading. For instance, gunicorn's sync workers are single-threaded, but there are other worker types, such as greenlet-based ones. I'm not sure how well they play with threads.
Combining threading and multiprocessing can be an issue if you're forking from threads:
Status of mixing multiprocessing and threading in Python
http://bugs.python.org/issue6721
That being said, I know of popular performance analytics utilities that use threads to report metrics, so it seems to be an accepted practice.
In sum, it seems safest to use the threading.Thread object from the standard library, so long as whatever you do in it doesn't fork (e.g. via Python's multiprocessing library).
https://docs.python.org/2/library/threading.html
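A minimal sketch of the threading.Thread approach. `SEARCH_LOG` is a plain list standing in for the database table, and `search_view` is a hypothetical function, not a real Django view; it returns the thread only so that a demo or test can wait for it.

```python
import threading

SEARCH_LOG = []              # stand-in for a database table
_log_lock = threading.Lock()

def record_results(query, result_ids):
    """Runs in the background thread: store one row per search result."""
    with _log_lock:
        for rid in result_ids:
            SEARCH_LOG.append((query, rid))

def search_view(query):
    """Return results right away; the logging happens in a separate thread."""
    result_ids = [1, 2, 3]   # pretend these came from the search algorithm
    worker = threading.Thread(target=record_results, args=(query, result_ids))
    worker.start()
    return result_ids, worker
```

One thing to keep in mind: if the server process is killed before the thread finishes, the pending writes are lost.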
Offloading work from the main thread is a common practice, as the end goal is to return a result to the client (browser) as quickly as possible.
As I am sure you are aware, HTTP is blocking - so until you return a response, the client cannot do anything (it is blocked, in a waiting state).
The de facto way of offloading work is Celery, which is a task queuing system.
I highly recommend you read the introduction to Celery, but in summary here is what happens:
You mark certain pieces of code as "tasks". These are usually functions that you want to run asynchronously.
Celery manages workers - you can think of them as threads - that will run these tasks.
To communicate with the workers, a message queue is required. RabbitMQ is the one often recommended.
Once you have all the components running (it takes but a few minutes), your workflow goes like this:
In your view, when you want to offload some work, you call the function that does that work through its .delay() method. This sends a message that triggers a worker to start executing the function in the background.
Your view then returns a response immediately.
You can then check for the result of the task, and take appropriate actions based on what needs to be done. There are ways to track progress as well.
It is also good practice to include caching - so that you are not executing expensive tasks unnecessarily. For example, you might choose to offload a request to do some analytics on search keywords that will be placed in a report.
Once the report is generated, I would cache the results (if applicable) so that the same report can be displayed if requested later - rather than be generated again.
Suppose that one is interested to write a python app where there should be communication between different processes. The communications will be done by sending strings and/or numpy arrays.
What are the considerations to prefer OpenMPI vs. a tool like RabbitMQ?
There is no single correct answer to such a question. It depends on a large number of factors. For example:
What kind of communications do you have? Are you sending large packets or small packets, do you need good bandwidth or low latency?
What kind of delivery guarantees do you need?
OpenMPI can instantly deliver messages only to a running process, while different MQ solutions can queue messages and allow fancy producer-consumer configurations.
What kind of network do you have? If you are running on localhost, something like ZeroMQ would probably be the fastest. If you are running on a set of hosts, it depends on the interconnects available; e.g. OpenMPI can use InfiniBand/Myrinet links.
What kind of processing are you doing? With MPI all processes are usually started at the same time, do the processing and terminate all at once.
This is exactly the scenario I was in a few months ago and I decided to use AMQP with RabbitMQ using topic exchanges, in addition to memcache for large objects.
The AMQP messages are all strings, in JSON object format so that it is easy to add attributes to a message (like number of retries) and republish it. JSON objects are a subset of JSON that correspond to Python dicts. For instance {"recordid": "272727"} is a JSON object with one attribute. I could have just pickled a Python dict but that would have locked us into only using Python with the message queues.
The large objects don't get routed by AMQP, instead they go into a memcache where they are available for another process to retrieve them. You could just as well use Redis or Tokyo Tyrant for this job. The idea is that we did not want short messages to get queued behind large objects.
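Roughly, the pattern looks like the sketch below. A plain dict (`STORE`) stands in for memcache/Redis, and the field names `blob_key` and `retries` are invented for the example; the message itself is the short JSON string that would go through AMQP.

```python
import json
import uuid

STORE = {}  # stand-in for memcache/Redis: key -> large payload

def publish_large(payload):
    """Park the big payload in the store and return a small JSON message."""
    key = "blob:" + uuid.uuid4().hex
    STORE[key] = payload
    return json.dumps({"blob_key": key, "retries": 0})

def fetch(message):
    """Consumer side: follow the key in the message to the real payload."""
    meta = json.loads(message)
    return STORE[meta["blob_key"]]

def retry(message):
    """Bump the retry counter and return the message ready to republish."""
    meta = json.loads(message)
    meta["retries"] += 1
    return json.dumps(meta)
```

The queued message stays a few dozen bytes no matter how large the payload is, so short messages never wait behind big ones.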
In the end, my Python processes ended up using both AMQP and ZeroMQ for two different aspects of the architecture. You may find that it makes sense to use both OpenMPI and AMQP but for different types of jobs.
In my case, a supervisor process runs forever and starts a whole flock of workers, which also run forever unless they die or hang, in which case the supervisor restarts them. The work constantly flows in as messages via AMQP, and each process handles just one step of the work, so that when we identify a bottleneck we can have multiple instances of the process, possibly on separate machines, to remove the bottleneck. In my case, I have 15 instances of one process, 4 of two others, and about 8 other single instances.
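This is not my actual supervisor, just a small sketch of the restart idea using subprocess. The worker is given as a Python one-liner for the demo, the loop runs a fixed number of rounds rather than forever, and it only notices workers that have exited; detecting a hung worker would need something extra, such as a heartbeat.

```python
import subprocess
import sys
import time

def start_worker(code):
    """Launch one worker as a separate Python process."""
    return subprocess.Popen([sys.executable, "-c", code])

def supervise(code, count, rounds, interval=0.2):
    """Keep `count` workers alive for `rounds` polling passes.

    Returns how many times a dead worker had to be restarted.
    """
    workers = [start_worker(code) for _ in range(count)]
    restarts = 0
    for _ in range(rounds):
        time.sleep(interval)
        for i, proc in enumerate(workers):
            if proc.poll() is not None:   # the worker has exited
                workers[i] = start_worker(code)
                restarts += 1
    for proc in workers:                  # clean up at the end of the demo
        proc.kill()
        proc.wait()
    return restarts
```

A real supervisor would loop indefinitely and start each worker from its own script or module rather than from an inline string.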
I have two Python programs and I want them to communicate.
Both of them are system services, and neither is forked by a parent process.
Is there any way to do this without using sockets?
(e.g. by creating some Queue -> serializing it -> deserializing it in the other process and communicating that way; or writing to a file the id of the process to communicate with, and then creating some magic structure which takes the process id and sends messages to that process...)
The solution should work on Linux and Windows.
Your best bet is ZeroMQ, which is designed for IPC and extremely fast at it (it also supports TCP and multicast messaging). The Python bindings are really nice and easy to work with. There is a nice introduction to ZeroMQ with Python here: http://nichol.as/zeromq-an-introduction. If you are planning to expand this across multiple machines, AMQP (a message queue protocol) would be good to look at; there are a lot of great libraries for working with AMQP from Python. I really like kombu and celery. You could also think about Twisted, which gives you a fairly insane number of options for communication, and a nice event loop to boot.
On Linux you can use a named pipe: http://en.wikipedia.org/wiki/Named_pipe. Just beware that the writing program/thread will block until the reader opens the pipe.
I think Windows supports them to some degree.
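A small sketch of the named-pipe idea on Linux. `os.mkfifo` is POSIX-only; named pipes on Windows go through a different API that this does not cover. For the demo, the writer runs in a thread of the same process; in practice it would be the other service. The path and messages are made up.

```python
import os
import tempfile
import threading

def writer(path, lines):
    # open() blocks here until a reader opens the other end of the pipe.
    with open(path, "w") as fifo:
        for line in lines:
            fifo.write(line + "\n")

def read_all(path):
    # Iteration ends when the writer closes its end.
    with open(path) as fifo:
        return [line.rstrip("\n") for line in fifo]

def demo():
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "jobs.fifo")
    os.mkfifo(path)  # create the named pipe on disk
    t = threading.Thread(target=writer, args=(path, ["status srv1", "deploy app1"]))
    t.start()
    received = read_all(path)
    t.join()
    os.remove(path)
    os.rmdir(directory)
    return received
```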
I have a Python (well, it's PHP now but we're rewriting) function that takes some parameters (A and B) and computes some results (finds the best path from A to B in a graph; the graph is read-only). In a typical scenario one call takes 0.1s to 0.9s to complete. This function is accessed by users as a simple REST web service (GET bestpath.php?from=A&to=B). The current implementation is quite stupid - it's a simple PHP script + Apache + mod_php + APC, and every request needs to load all the data (over 12 MB in PHP arrays), create all the structures, compute a path and exit. I want to change it.
I want a setup with N independent workers (X per server with Y servers), where each worker is a Python app running in a loop (getting request -> processing -> sending reply -> getting request...) and can process one request at a time. I need something that will act as a frontend: get requests from users, manage the queue of requests (with a configurable timeout) and feed my workers one request at a time.
How should I approach this? Can you propose a setup? nginx + FastCGI or WSGI or something else? HAProxy? As you can see, I'm a newbie in Python, reverse proxies, etc. I just need a starting point for the architecture (and data flow).
By the way, the workers use read-only data, so there is no need for locking or communication between them.
The typical way to handle this sort of arrangement using threads in Python is to use the standard library module Queue. An example of using the Queue module for managing workers can be found here: Queue Example
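For reference, a minimal version of the pattern might look like this in Python 3, where the module is named `queue` (it was `Queue` in Python 2). `handle` is a hypothetical placeholder for the real path computation, and `None` is used as the shutdown sentinel.

```python
import queue
import threading

def handle(job):
    """Hypothetical placeholder for the real work on one request."""
    src, dst = job
    return "%s->%s" % (src, dst)

def worker(jobs, results):
    """Loop forever: take a job, process it, put the result."""
    while True:
        job = jobs.get()
        if job is None:          # sentinel: time to shut down
            break
        results.put((job, handle(job)))

def run(job_list, num_workers=3):
    """Feed job_list to a fixed set of worker threads and collect results."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for job in job_list:
        jobs.put(job)
    for _ in threads:
        jobs.put(None)           # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(results.get() for _ in job_list)
```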
Looks like you need the "workers" to be separate processes (at least some of them, and therefore you might as well make them all separate processes rather than bunches of threads divided into several processes). The multiprocessing module in the standard library of Python 2.6 and later offers good facilities to spawn a pool of processes and communicate with them via FIFO "queues"; if for some reason you're stuck with Python 2.5 or even earlier, there are versions of multiprocessing on the PyPI repository that you can download and use with those older versions of Python.
The "frontend" can and should be pretty easily made to run with WSGI (with either Apache or Nginx), and it can deal with all communications to/from worker processes via multiprocessing, without the need to use HTTP, proxying, etc, for that part of the system; only the frontend would be a web app per se, the workers just receive, process and respond to units of work as requested by the frontend. This seems the soundest, simplest architecture to me.
There are other distributed processing approaches available in third party packages for Python, but multiprocessing is quite decent and has the advantage of being part of the standard library, so, absent other peculiar restrictions or constraints, multiprocessing is what I'd suggest you go for.
There are many FastCGI modules for Python with a preforked mode and a WSGI interface; the best known is flup. My personal preference for such a task is superfcgi with nginx. Both will launch several processes and dispatch requests to them. 12 MB is not too much to load separately in each process, but if you'd like to share data among workers you need threads, not processes. Note that heavy math in Python with a single process and many threads won't use several CPUs/cores efficiently because of the GIL. Probably the best approach is to use several processes (as many as you have cores), each running several threads (the default mode in superfcgi).
The simplest solution in this case is to use the webserver to do all the heavy lifting. Why should you handle threads and/or processes when the webserver will do all that for you?
The standard arrangement in deployments of Python is:
1) The webserver starts a number of processes, each running a complete Python interpreter and loading all your data into memory.
2) An HTTP request comes in and gets dispatched to one of the processes.
3) The process does your calculation and returns the result directly to the webserver and the user.
4) When you need to change your code or the graph data, you restart the webserver and go back to step 1.
This is the architecture used by Django and other popular web frameworks.
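To make that arrangement concrete, here is a sketch of a bare WSGI application with the data held in a module global. The three-node `GRAPH` and the breadth-first `best_path` are placeholders I made up; the real data set and algorithm would take their place. The `from`/`to` query parameters follow the URL in the question.

```python
from collections import deque
from urllib.parse import parse_qs

# Built once per process at import time, then reused by every request.
# Hypothetical tiny graph standing in for the real data set.
GRAPH = {"A": ["B"], "B": ["C"], "C": []}

def best_path(start, goal):
    """Breadth-first search; a placeholder for the real path algorithm."""
    pending = deque([[start]])
    seen = {start}
    while pending:
        path = pending.popleft()
        if path[-1] == goal:
            return path
        for nxt in GRAPH.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                pending.append(path + [nxt])
    return None

def application(environ, start_response):
    """WSGI entry point: answers requests like ?from=A&to=C."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    src = params.get("from", [""])[0]
    dst = params.get("to", [""])[0]
    path = best_path(src, dst)
    if path is None:
        status, body = "404 Not Found", "no path"
    else:
        status, body = "200 OK", "->".join(path)
    data = body.encode("utf-8")
    start_response(status, [("Content-Type", "text/plain"),
                            ("Content-Length", str(len(data)))])
    return [data]
```

Any WSGI server can host `application`; for a quick local try, `wsgiref.simple_server.make_server` from the standard library is enough.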
I think you can configure mod_wsgi/Apache so that it has several "hot" Python interpreters in separate processes ready to go at all times, reuses them for new accesses, and spawns a new one if they are all busy.
In this case you could load all the preprocessed data as module globals; they would only get loaded once per process and be reused for each new access. In fact, I'm not sure this isn't the default configuration for mod_wsgi/Apache.
The main problem here is that you might end up consuming a lot of "core" memory (but that may not be a problem either).
I think you can also configure mod_wsgi for single process/multiple threads, but in that case you may only be using one CPU because of the Python Global Interpreter Lock (the infamous GIL), I think.
Don't be afraid to ask on the mod_wsgi mailing list; they are very responsive and friendly.
You could use nginx load balancer to proxy to PythonPaste paster (which serves WSGI, for example Pylons), that launches each request as separate thread anyway.
Another option is a queue table in the database.
The worker processes run in a loop or off cron and poll the queue table for new jobs.