In my wsgi.py startup hooks I create a queue object which needs to be passed to the views module.
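import queue
import threading

# Create and start the thread for euclid (startEuclidServer is defined elsewhere).
q = queue.Queue()
euclidThread = threading.Thread(target=startEuclidServer,
                                kwargs={"msgq": q},
                                daemon=True)  # daemon thread exits with the main process
euclidThread.start()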
The queue is used for communication between my "euclid" thread and django.
My django project contains an app called "monitor" where my views need to be able to access the queue I create on startup.
Previously I did this by starting my thread and creating my queue in ../monitor/urls.py; however, this was problematic as it would only run upon the first HTTP request to that app.
Does anyone know the best way to do this, or should I be doing this in a completely different way? For the sake of simplicity I want to avoid using a dedicated queue such as RabbitMQ/Redis.
The Queue you are using here is designed for communication when all the threads are managed by one master process:
The Queue module implements multi-producer, multi-consumer queues. It
is especially useful in threaded programming when information must be
exchanged safely between multiple threads. The Queue class in this
module implements all the required locking semantics. It depends on
the availability of thread support in Python; see the threading
module.
This is not the case when you are doing web development.
You need to separate your queue process completely from your web process; the way you are doing it now I cannot even imagine how many issues it will cause in the future.
You need to have three separate processes:
Process that launches your queue.
Process that launches your wsgi process(es), which could be something like "runserver" if you are in development mode; or uwsgi+supervisord+circus or similar.
The worker(s) that will do the job that's posted on the queue.
Don't combine these.
Your views can then access the queue without worrying about thread issues; and your workers can also post updates without any issues.
Read up on Celery, which is the de facto standard way of getting all this done easily in Django.
Related
I have a python application running inside of a pod in kubernetes which subscribes to a Google Pub/Sub topic and on each message downloads a file from a google bucket.
The issue I have is that I can't process the workload quickly enough using a single threaded Python application. I would normally run a number of pods to handle the workload but the problem is that all the files have to end up on the same filesystem to be processed by another application.
I have tried spawning a new thread for each request but the volume is too great.
What I would like to do is:
1) Have a number of processes that can process new messages
2) Keep the processes alive and use them to respond to new requests coming in.
All the examples for multiprocessing in python are single workload examples, for example providing 10 numbers to a square function, which isn't what I'm trying to achieve.
I've used gunicorn in the past which spawns a number of worker threads for a flask application, what I want is to do something similar without flask.
First, try to separate IO-bound tasks (e.g. requests, reads/writes, etc.) from CPU-bound tasks (parsing JSON/XML, calculations, etc.).
For the IO-bound case, use threading or ThreadPoolExecutor, which reuses its worker threads automatically. Keep in mind that writing to disk is a blocking call!
If you want parallelism for CPU-bound work, use multiprocessing (Process) or ProcessPoolExecutor. To synchronize the processes you can use a shared object (proxy object), a file, a pipe, Redis, etc.
Shared objects from a Manager (Namespaces, dicts, etc.) are preferred if you want to stay in pure Python.
To avoid blocking when working with files, use a separate thread or use async.
For asyncio, use the aiofile library.
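As a rough illustration of the IO-bound advice above, here is a minimal sketch of one thread pool that is reused for a whole stream of messages. The names download_file and handle_messages are hypothetical, and the download is simulated with a short sleep instead of a real Pub/Sub or bucket call, so treat it as a starting point rather than a finished subscriber.

```python
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def download_file(name: str, target_dir: Path) -> Path:
    """Hypothetical stand-in for an IO-bound download (no real network call)."""
    time.sleep(0.01)  # simulated network latency
    path = target_dir / name
    path.write_text(f"contents of {name}")  # blocking disk write, done in a worker thread
    return path


def handle_messages(messages, target_dir: Path, max_workers: int = 8):
    """Feed every incoming message to one pool; its threads are reused between tasks."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(download_file, name, target_dir) for name in messages]
        return [future.result() for future in futures]


# All files land in the same directory, as the question requires.
with tempfile.TemporaryDirectory() as tmp:
    saved = handle_messages([f"file_{i}.txt" for i in range(20)], Path(tmp))
    names = sorted(p.name for p in saved)
    all_exist = all(p.exists() for p in saved)
```

With a real subscriber, messages would be the incoming message stream, so the pool stays alive for as long as the stream lasts. The pool size of 8 is an arbitrary assumption and would need tuning against the actual download latency.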
I want to use ThreadPoolExecutor on a webapp (django),
All examples that I saw are using the thread pool like that:
with ThreadPoolExecutor(max_workers=1) as executor:
code
I tried to store the thread pool as a class member and to use the map function,
but I got a memory leak; the only way I could use it was with the with notation.
So I have 2 questions:
Each time I run with ThreadPoolExecutor, does it create the threads again and then release them? In other words, is this operation expensive?
If I avoid using with, how can I release the memory of the threads?
Thanks
Normally, web applications are stateless. That means every object you create should live in a request and die at the end of the request. That includes your ThreadPoolExecutor. Having an executor at the application level may work, but it will be embedded into your web application instead of running as a separate group of processes.
So if you want to take the workers down or restart them, your web app will have to restart as well.
And there will be stability concerns, since there is no main process watching over the child processes and detecting which ones have gone stale, so it takes a lot of code to get multiprocessing right.
Alternatively, if you want a persistent group of processes to listen to a job queue and run your tasks, there are several projects that do that for you. All you need to do is to set up a server that takes care of queueing and locking, such as Redis or RabbitMQ, then point your project at that server and start the workers. Some projects even let you use the database as a job queue backend.
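On the two questions themselves: leaving a with block calls shutdown(wait=True) on the executor, so every with block builds a fresh pool and joins its threads on exit. If you keep one executor around instead, you release its threads by calling shutdown() yourself. A minimal sketch, assuming a module-level pool and two hypothetical functions, slow_square and handle_request:

```python
from concurrent.futures import ThreadPoolExecutor

# One pool created at import time and shared by every request handler.
EXECUTOR = ThreadPoolExecutor(max_workers=4)


def slow_square(n: int) -> int:
    """Hypothetical background job."""
    return n * n


def handle_request(n: int) -> int:
    """Hypothetical view body: hand the work to the shared pool and wait for it."""
    future = EXECUTOR.submit(slow_square, n)
    return future.result(timeout=5)


results = [handle_request(n) for n in range(5)]

# Same effect as leaving a `with` block: stop accepting work and join the threads.
EXECUTOR.shutdown(wait=True)
```

After shutdown() the executor rejects new work with a RuntimeError, so in a web app this call belongs in whatever hook runs when the server process stops. The caveat from the answer still applies: this pool lives and dies with the web process.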
Is there a difference in thread operation functionality between an application with 3 daemon threads that all pull from a multiprocessing Queue and 4 separate applications: a multiprocessing Queue/Pipe and 3 daemon thread applications that read from the Queue/Pipe application?
Neither application uses blocking/synchronisation. At the end of the day the operating system will decide when to allow a thread to run and for how long. Are there any other differences in functionality here or are they essentially the same?
Generic Application (no synchronisation or blocking):
'Stock Market Feed' Queue: StockTrade messages (dictionaries)
'TradingStrategy' 1 Daemon Thread: Pull from queue, inspect messages and perform trades
'TradingStrategy' 1 Daemon Thread: Pull from queue, inspect messages and perform trades
'TradingStrategy' 1 Daemon Thread: Pull from queue, inspect messages and perform trades
Alternate architecture:
Feed Application (no multi-threading):
'Stock Market Feed' Queue or Pipe: StockTrade messages (dictionaries). Can a Queue be accessed from another outside process? I know a named pipe can but can a queue?
Trading Application (no multi-threading):
'TradingStrategy': Interacts with feed (pipe?/queue), inspect messages and perform trades
Trading Application (no multi-threading):
'TradingStrategy': Interacts with feed (pipe?/queue), inspect messages and perform trades
Trading Application (no multi-threading):
'TradingStrategy': Interacts with feed (pipe?/queue), inspect messages and perform trades
Yes, the two options are quite different. But it gets complicated fast trying to explain the difference. You should research and read up on the differences between a thread and a process. Get that in your head straight first.
Now, given your specific scenario, assuming by "multiprocessing queue" you actually mean an instance of a python Queue in one thread of a process, since the queue is inside the same process as all the worker threads, the workers will be able to access, and share that same instance of the Queue.
However when the workers are all separate processes then they cannot access the Queue by shared memory and will need some form of interprocess communication to gain access to that queue.
In practice, I'd be thinking something like redis or zeromq to be your queue, then build a python program to talk to it, then scale up as few, or as many copies of it, as you need.
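For the single-process case described above, the sharing looks roughly like the sketch below: three daemon threads pull from the same queue.Queue instance. The trade dictionaries, the symbols and the trading_strategy function are made up for illustration, and a None sentinel is one assumed way of stopping the workers.

```python
import queue
import threading

feed = queue.Queue()             # lives in this process; every thread sees the same object
handled = []                     # (worker name, symbol) pairs, for inspection
handled_lock = threading.Lock()  # protects the shared list


def trading_strategy(name: str) -> None:
    """Hypothetical worker: pull trades until a None sentinel arrives."""
    while True:
        trade = feed.get()       # blocks until an item is available
        if trade is None:
            feed.task_done()
            break
        with handled_lock:
            handled.append((name, trade["symbol"]))
        feed.task_done()


workers = [
    threading.Thread(target=trading_strategy, args=(f"strategy-{i}",), daemon=True)
    for i in range(3)
]
for worker in workers:
    worker.start()

for symbol in ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"]:
    feed.put({"symbol": symbol, "price": 100.0})
for _ in workers:
    feed.put(None)               # one sentinel per worker

feed.join()                      # returns once every item, sentinels included, is marked done
for worker in workers:
    worker.join()
```

Each message is taken by exactly one worker, and which worker gets it is up to the scheduler. None of this carries over to separate applications: a queue.Queue object only exists inside the process that created it, which is why the multi-process layout needs a pipe, socket or external broker.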
I have a Python web application in which the client (Ember.js) communicates with the server via WebSocket (I am using Flask-SocketIO).
Apart from the WebSocket server, the backend does two more things that are worth mentioning:
Doing some image conversion (using graphicsmagick)
OCR incoming images from the client (using tesseract)
When the client submits an image its entity is created in the database and the id is put in an image conversion queue. The worker grabs it and does image conversion. After that the worker puts it in the OCR queue where it will be handled by the OCR queue worker.
So far so good. The WS requests are handled synchronously in separate threads (Flask-SocketIO uses Eventlet for that) and the heavy computational action happens asynchronously (in separate threads as well).
Now the problem: the whole application runs on a Raspberry Pi 3. If I do not make use of the 4 cores it has, I only have one ARMv8 core clocked at 1.2 GHz. This is very little power for OCR. So I decided to find out how to use multiple cores with Python. Although I read about the problems with the GIL, I found out about multiprocessing, whose documentation says: "The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads." Exactly what I wanted. So I instantly replaced the
from threading import Thread
thread = Thread(target=heavy_computational_worker_thread)
thread.start()
by
from multiprocessing import Process
process = Process(target=heavy_computational_worker_thread)
process.start()
The queue needed to be handled by the multiple cores as well, so I had to change
from queue import Queue
queue = Queue()
to
import multiprocessing
queue = multiprocessing.Queue()
as well. The problem: the queue and threading libraries are monkey patched by Eventlet. If I stop using the monkey patched versions of Thread and Queue and use the ones from multiprocessing instead, then the request thread started by Eventlet blocks forever when accessing the queue.
Now my question:
Is there any way I can make this application do the OCR and image conversion on a separate core?
I would like to keep using WebSocket and Eventlet if that's possible. The advantage I have is that the only communication interface between the processes would be the queue.
Ideas that I already had:
- Not using a Python implementation of a queue but rather using I/O. For example a dedicated Redis which the different subprocesses would access
- Going a step further: starting every queue worker as a separate Python process (e.g. python3 wsserver | python3 ocrqueue | python3 imgconvqueue). Then I would have to make sure myself that the access on the queue and on the database would be non-blocking
The best thing would be to keep the single process and make it work with multiprocessing, though.
Thank you very much in advance
Eventlet is currently incompatible with the multiprocessing package. There is an open issue for this work: https://github.com/eventlet/eventlet/issues/210.
The alternative that I think will work well in your case is to use Celery to manage your queue. Celery will start a pool of worker processes that wait for tasks provided by the main process via a message queue (RabbitMQ and Redis are both supported).
The Celery workers do not need to use eventlet, only the main server does, so this frees them to do whatever they need to do without the limitations imposed by eventlet.
If you are interested in exploring this approach, I have a complete example that uses it: https://github.com/miguelgrinberg/flack.
Is it possible to create a long-running process in Node.js to handle many background operations without interrupting the main thread, something like Celery in Python?
Hint, it's highly preferable to be able to manage that long-running process, in case of failure, or need to be restarted, away from the main process.
http://nodejs.org/api/child_process.html is the right API to create long-running processes; you will have complete control over the child processes (access to stdin/out/err, can send signals, etc). This approach, however, requires that your node process is the parent of those children. If you want the child to outlive the parent, take a look at options.detached during child creation (followed by child.unref()).
Please note, however, that Node.js is extremely well suited to avoiding such an architecture. Typically Node.js apps do all the background work in the main thread. I've been writing apps with lots of traffic (thousands of requests per second), with DB, Redis and RabbitMQ access all from the main thread and without any child processes, and it worked fine, as it should, thanks to Node's evented IO system.
I generally use the child_process API only to launch separate executables (e.g. ffmpeg to transcode a video file); apart from such scenarios, separate processes are probably not what you want.
There is also the cluster API, which allows a single master to manage numerous worker processes, though I don't think it is what you are looking for either.
You can create child process to handle your background operations. And then use messages to pass data between the new process and your main thread.
http://nodejs.org/api/child_process.html
Update
It looks like you need a server-side queue, something like beanstalkd http://kr.github.io/beanstalkd/ + https://www.npmjs.com/package/fivebeans.