Is it possible to create a long-running process in Node.js to handle many background operations without interrupting the main thread, something like Celery in Python?
Hint: it's highly preferable to be able to manage that long-running process from outside the main process, in case it fails or needs to be restarted.
http://nodejs.org/api/child_process.html is the right API for creating long-running processes; you will have complete control over the child processes (access to stdin/stdout/stderr, the ability to send signals, etc.). This approach, however, requires that your Node process be the parent of those children. If you want a child to outlive the parent, take a look at options.detached during child creation (and the subsequent child.unref()).
Please note, however, that Node.js is extremely well suited to avoiding such an architecture. Typically a Node.js app does all the background work in the main thread. I've written apps with lots of traffic (thousands of requests per second) that access a database, Redis and RabbitMQ entirely from the main thread and without any child processes, and it worked fine, as it should, thanks to Node's evented IO system.
I generally use the child_process API only to launch separate executables (e.g. ffmpeg to transcode a video file); apart from such scenarios, separate processes are probably not what you want.
There is also the cluster API, which allows a single master to manage numerous worker processes, though I don't think it's what you're looking for either.
You can create a child process to handle your background operations, and then use messages to pass data between the new process and your main thread.
http://nodejs.org/api/child_process.html
Update
It looks like what you need is a server-side job queue, such as beanstalkd (http://kr.github.io/beanstalkd/) together with the fivebeans client (https://www.npmjs.com/package/fivebeans).
Related
I have a Python GUI application which can kick off any number of computation-intensive, long-running tasks that naturally belong in multiprocessing.Pool workers.
However, I'd like to be able to cancel these tasks, because later GUI input (such as changing a configuration variable) might render these tasks irrelevant.
Is there a popular pattern in Python for keeping track of which workers are working on what task, and interrupting them as needed?
The solutions I can think of are:
When a worker starts on a task, it "announces" through some shared state that it is working on that particular task; if we need to cancel that task, we look up which process is working on it and .terminate() it. There are many complexities here, though.
Use raw multiprocessing.Process objects and write a Pool-like manager that does exactly what we want (see the sketch after this list).
Use some alternative library such as Celery. A huge list is here.
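A minimal sketch of the second option, assuming one multiprocessing.Process per task, tracked in a dict keyed by a task id (TaskManager, run_task and the task ids are hypothetical names, not an existing API):

import multiprocessing

def run_task(task_id, config):
    ...  # hypothetical long-running, CPU-intensive work

class TaskManager:
    """Tracks which process runs which task so tasks can be cancelled."""
    def __init__(self):
        self._procs = {}  # task_id -> multiprocessing.Process

    def start(self, task_id, config):
        proc = multiprocessing.Process(target=run_task, args=(task_id, config))
        proc.start()
        self._procs[task_id] = proc

    def cancel(self, task_id):
        proc = self._procs.pop(task_id, None)
        if proc is not None and proc.is_alive():
            proc.terminate()  # hard kill; the task gets no chance to clean up
            proc.join()

    def reap(self):
        # forget tasks that finished on their own
        for task_id, proc in list(self._procs.items()):
            if not proc.is_alive():
                proc.join()
                del self._procs[task_id]

Cancelling by terminating the whole process sidesteps the usual problem that Python threads cannot be interrupted, but anything the task holds (locks, open files) dies with it.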
I have a Python application running inside a pod in Kubernetes which subscribes to a Google Pub/Sub topic and, on each message, downloads a file from a Google Cloud Storage bucket.
The issue I have is that I can't process the workload quickly enough using a single threaded Python application. I would normally run a number of pods to handle the workload but the problem is that all the files have to end up on the same filesystem to be processed by another application.
I have tried spawning a new thread for each request but the volume is too great.
What I would like to do is:
1) Have a number of processes that can process new messages
2) Keep the processes alive and use them to respond to new requests coming in.
All the examples for multiprocessing in Python are single-workload examples, for example passing 10 numbers to a square function, which isn't what I'm trying to achieve.
I've used Gunicorn in the past, which spawns a number of workers for a Flask application; what I want is to do something similar without Flask.
First, try to separate IO-bound tasks (e.g. network requests, reads/writes, etc.) from CPU-bound tasks (parsing JSON/XML, calculations, etc.).
For the IO-bound case, use threading or ThreadPoolExecutor primitives, which automatically reuse worker threads. Keep in mind that writing to disk is a blocking operation!
If you want parallelism for CPU-bound work, use multiprocessing or ProcessPoolExecutor. To synchronize them you can use a shared object (proxy object), a file, a pipe, Redis, etc.
Shared objects such as Managers (Namespaces, dicts, etc.) are preferred if you want to stay in pure Python.
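A minimal sketch of that combination, assuming a ProcessPoolExecutor for the CPU-bound part and a Manager dict as the shared proxy object (parse_payload and the payloads are illustrative stand-ins):

from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager

def parse_payload(shared, key, payload):
    # CPU-bound work (e.g. parsing); the result goes into the shared proxy dict
    shared[key] = payload.upper()  # stand-in for real parsing

if __name__ == "__main__":
    manager = Manager()
    shared = manager.dict()  # proxy object visible to all worker processes
    with ProcessPoolExecutor(max_workers=4) as pool:
        for i, payload in enumerate(["a", "b", "c"]):
            pool.submit(parse_payload, shared, i, payload)
    print(dict(shared))  # {0: 'A', 1: 'B', 2: 'C'}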
To work with files without blocking, use a dedicated thread or use async I/O.
For asyncio, use the aiofile library.
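If you would rather stay in the standard library than take on aiofile, a sketch of the same idea is to push the blocking write onto a worker thread with asyncio.to_thread (Python 3.9+), keeping the event loop free:

import asyncio

def write_file(path, data):
    # plain blocking write; it runs in a worker thread, not on the event loop
    with open(path, "wb") as f:
        f.write(data)

async def handle_message(path, data):
    await asyncio.to_thread(write_file, path, data)  # Python 3.9+

async def main():
    await asyncio.gather(
        handle_message("a.bin", b"first"),
        handle_message("b.bin", b"second"),
    )

if __name__ == "__main__":
    asyncio.run(main())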
I want to use ThreadPoolExecutor in a web app (Django).
All the examples I've seen use the thread pool like this:
with ThreadPoolExecutor(max_workers=1) as executor:
code
I tried to store the thread pool as a class member and to use the map function,
but I got a memory leak; the only way I could use it was with the with notation.
so I have 2 questions:
Each time I run with ThreadPoolExecutor, does it create the threads again and then release them? In other words, is this operation expensive?
If I avoid using with, how can I release the memory of the threads?
thanks
Normally, web applications are stateless. That means every object you create should live in a request and die at the end of the request. That includes your ThreadPoolExecutor. Having an executor at the application level may work, but it will be embedded into your web application instead of running as a separate group of processes.
So if you want to take the workers down or restart them, your web app will have to restart as well.
And there will be stability concerns, since there is no main process watching over the workers to detect which one has gone stale, so it takes a lot of code to get multiprocessing right.
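That said, if you do keep an application-level executor, the usual way to release its threads without a with block is an explicit shutdown(); a minimal sketch (the module and helper names are hypothetical):

# tasks.py -- one executor shared by the whole web process
import atexit
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

# shutdown(wait=True) joins and frees the worker threads; this is what the
# `with` block does implicitly when it exits
atexit.register(executor.shutdown)

def fire_and_forget(func, *args, **kwargs):
    return executor.submit(func, *args, **kwargs)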
Alternatively, if you want a persistent group of processes to listen to a job queue and run your tasks, there are several projects that do that for you. All you need to do is set up a server that takes care of queueing and locking, such as Redis or RabbitMQ, then point your project at that server and start the workers. Some projects even let you use the database as a job queue backend.
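For example, a minimal Celery setup (assuming a Redis broker on localhost; a RabbitMQ amqp:// URL works the same way) looks roughly like this:

# tasks.py
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def send_report(user_id):
    # the long-running work happens in a worker process, not in the web app
    ...

The view then just calls send_report.delay(user_id) and returns immediately, while the workers are started separately (celery -A tasks worker) and can be restarted without touching the web app.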
I have a Python (well, it's PHP now, but we're rewriting it) function that takes some parameters (A and B) and computes some results (it finds the best path from A to B in a graph; the graph is read-only). In a typical scenario one call takes 0.1s to 0.9s to complete. This function is accessed by users as a simple REST web service (GET bestpath.php?from=A&to=B). The current implementation is quite stupid: it's a simple PHP script + Apache + mod_php + APC; every request needs to load all the data (over 12 MB in PHP arrays), create all the structures, compute a path and exit. I want to change that.
I want a setup with N independent workers (X per server with Y servers), each worker is a python app running in a loop (getting request -> processing -> sending reply -> getting req...), each worker can process one request at a time. I need something that will act as a frontend: get requests from users, manage queue of requests (with configurable timeout) and feed my workers with one request at a time.
How should I approach this? Can you propose a setup? nginx + FastCGI or WSGI, or something else? HAProxy? As you can see, I'm a newbie with Python, reverse proxies, etc.; I just need a starting point for the architecture (and data flow).
By the way, the workers use read-only data, so there is no need for locking or communication between them.
The typical way to handle this sort of arrangement using threads in Python is to use the standard library module Queue (queue in Python 3). An example of using the Queue module for managing workers can be found here: Queue Example
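A minimal sketch of that pattern, using the Python 3 names and a hypothetical find_best_path stand-in for the real graph search:

import queue
import threading

NUM_WORKERS = 4
jobs = queue.Queue()

def find_best_path(a, b):
    return f"{a}->{b}"  # stand-in for the real graph search

def worker():
    while True:
        a, b = jobs.get()  # blocks until a request is available
        try:
            print(a, b, find_best_path(a, b))
        finally:
            jobs.task_done()

for _ in range(NUM_WORKERS):
    threading.Thread(target=worker, daemon=True).start()

for request in [("A", "B"), ("C", "D")]:
    jobs.put(request)

jobs.join()  # wait until every queued request has been processed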
Looks like you need the "workers" to be separate processes (at least some of them, and therefore might as well make them all separate processes rather than bunches of threads divided into several processes). The multiprocessing module in Python 2.6 and later's standard library offers good facilities to spawn a pool of processes and communicate with them via FIFO "queues"; if for some reason you're stuck with Python 2.5 or even earlier there are versions of multiprocessing on the PyPi repository that you can download and use with those older versions of Python.
The "frontend" can and should be pretty easily made to run with WSGI (with either Apache or Nginx), and it can deal with all communications to/from worker processes via multiprocessing, without the need to use HTTP, proxying, etc, for that part of the system; only the frontend would be a web app per se, the workers just receive, process and respond to units of work as requested by the frontend. This seems the soundest, simplest architecture to me.
There are other distributed processing approaches available in third party packages for Python, but multiprocessing is quite decent and has the advantage of being part of the standard library, so, absent other peculiar restrictions or constraints, multiprocessing is what I'd suggest you go for.
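A minimal sketch of that architecture, assuming the graph is loaded once per worker through a Pool initializer and the frontend calls apply_async per request (load_graph, compute_path and best_path are hypothetical names):

import multiprocessing

_graph = None

def load_graph():
    return {}  # stand-in for loading the ~12 MB read-only graph

def compute_path(graph, src, dst):
    return [src, dst]  # stand-in for the real search

def _init_worker():
    global _graph
    _graph = load_graph()  # loaded once per worker process, then reused

def best_path(src, dst):
    return compute_path(_graph, src, dst)

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=4, initializer=_init_worker)
    # the WSGI frontend would do something like this per request:
    result = pool.apply_async(best_path, ("A", "B"))
    print(result.get(timeout=5))  # configurable timeout, as in the question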
There are many FastCGI modules with a preforked mode and a WSGI interface for Python around; the best known is flup. My personal preference for such a task is superfcgi with nginx. Both will launch several processes and dispatch requests to them. 12 MB is not that much to load separately in each process, but if you'd like to share data among workers you need threads, not processes. Note that heavy math in Python with a single process and many threads won't use several CPUs/cores efficiently because of the GIL. Probably the best approach is to use several processes (as many as you have cores), each running several threads (the default mode in superfcgi).
The simplest solution in this case is to use the web server to do all the heavy lifting. Why should you handle threads and/or processes when the web server will do all of that for you?
The standard arrangement in deployments of Python is:
The web server starts a number of processes, each running a complete Python interpreter and loading all your data into memory.
An HTTP request comes in and gets dispatched to one of the processes.
The process does your calculation and returns the result directly to the web server and the user.
When you need to change your code or the graph data, you restart the webserver and go back to step 1.
This is the architecture used by Django and other popular web frameworks.
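As a minimal sketch of this arrangement, a bare WSGI application can hold the graph in a module global that is loaded once when the server process imports the module (GRAPH and find_best_path are illustrative stand-ins):

# bestpath_wsgi.py -- imported once per server process; the graph stays in memory
from urllib.parse import parse_qs

GRAPH = {}  # stand-in for the ~12 MB read-only graph, built at import time

def find_best_path(graph, src, dst):
    return f"{src}->{dst}"  # stand-in for the real search

def application(environ, start_response):
    qs = parse_qs(environ.get("QUERY_STRING", ""))
    src = qs.get("from", [""])[0]
    dst = qs.get("to", [""])[0]
    body = find_best_path(GRAPH, src, dst).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]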
I think you can configure mod_wsgi/Apache so it will have several "hot" Python interpreters in separate processes ready to go at all times, and also reuse them for new accesses (and spawn a new one if they are all busy). In this case you could load all the preprocessed data as module globals and they would only get loaded once per process and get reused for each new access. In fact I'm not sure this isn't the default configuration for mod_wsgi/Apache.
The main problem here is that you might end up consuming a lot of "core" memory (but that may not be a problem either).
I think you can also configure mod_wsgi for single process/multiple threads -- but in that case you may only be using one CPU because of the Python Global Interpreter Lock (the infamous GIL), I think.
Don't be afraid to ask on the mod_wsgi mailing list -- they are very responsive and friendly.
You could use an nginx load balancer to proxy to Python Paste's paster server (which serves WSGI, for example for Pylons), which launches each request as a separate thread anyway.
Another option is a queue table in the database.
The worker processes run in a loop or off cron and poll the queue table for new jobs.
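A minimal sketch of that approach, using sqlite3 only for illustration (the jobs table, its columns and handle_job are all hypothetical; any database with row-level updates will do, and with several workers you would also need to claim jobs atomically):

import sqlite3
import time

def handle_job(payload):
    print("processing", payload)  # stand-in for the real work

conn = sqlite3.connect("jobs.db", isolation_level=None)
conn.execute("""CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY,
    payload TEXT,
    status TEXT DEFAULT 'pending')""")

while True:  # run in a loop, or do a single pass from cron
    row = conn.execute(
        "SELECT id, payload FROM jobs WHERE status = 'pending' LIMIT 1"
    ).fetchone()
    if row is None:
        time.sleep(5)  # nothing to do; poll again later
        continue
    job_id, payload = row
    conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (job_id,))
    handle_job(payload)
    conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))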
I'm writing a web application using pylons and paste. I have some work I want to do after an HTTP request is finished (send some emails, write some stuff to the db, etc) that I don't want to block the HTTP request on.
If I start a thread to do this work, is that OK? I always see this stuff about paste killing off hung threads, etc. Will it kill my threads which are doing work?
What else can I do here? Is there a way I can make the request return but have some code run after it's done?
Thanks.
You could use a thread approach (maybe setting the Thread.daemon property would help -- but I'm not sure).
However, I would suggest looking into a task queuing system. You can place a task on a queue (which is very fast), then a listener can handle the tasks asynchronously, allowing the HTTP request to return quickly. There are two task queues that I know of for Django:
Django Queue Service
Celery
You could also consider using a more "enterprise" messaging solution, such as RabbitMQ or ActiveMQ.
Edit: previous answer with some good pointers.
I think the best solution is a messaging system, because it can be configured not to lose the task if the Pylons process goes down. I would always use processes over threads, especially in this case. If you are using Python 2.6+, use the built-in multiprocessing, or you can always install the processing module, which you can find on PyPI (I can't post the link because I am a new user).
Take a look at Gearman; it was made specifically for farming out tasks to 'workers' to handle. They can even handle the task in a different language entirely. You can come back and ask whether the task was completed, or just let it complete. That should work well for many tasks.
If you absolutely need to ensure it was completed, I'd suggest queuing tasks in a database or somewhere persistent, then have a separate process that runs through it ensuring each one gets handled appropriately.
To answer your basic question directly, you should be able to use threads just as you'd like. The "killing hung threads" part is paste cleaning up its own threads, not yours.
There are other packages that might help, etc, but I'd suggest you start with simple threads and see how far you get. Only then will you know what you need next.
(Note: Thread.daemon should be mostly irrelevant to you here. Setting it to true will ensure that a thread you start will not prevent the entire process from exiting. Doing so would mean, however, that if the process exited "cleanly" (as opposed to being forced to exit), your thread would be terminated even if it hadn't finished its work. Whether that's a problem, and how you handle things like that, depend entirely on your own requirements and design.)
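A minimal sketch of that approach, with hypothetical after_request_work and handle_request functions standing in for the Pylons controller:

import threading

def after_request_work(order_id):
    # follow-up work: send some emails, write some stuff to the db, etc.
    print(f"finishing up order {order_id}")

def handle_request(order_id):
    # hand the slow work to a thread, then return the HTTP response immediately
    worker = threading.Thread(target=after_request_work, args=(order_id,))
    # daemon is left False (the default) so a clean interpreter exit waits for
    # the work to finish, per the note above
    worker.start()
    return "202 Accepted"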