Functionality difference between application that has 'internal'/'external' daemon threads

Functionality difference between application that has 'internal'/'external' daemon threads - python

Is there a difference in thread operation functionality between an application with 3 daemon threads that all pull from a multiprocessing Queue and 4 separate applications: a multiprocessing Queue/Pipe and 3 daemon thread applications that read from the Queue/Pipe application?
Neither application uses blocking/synchronisation. At the end of the day the operating system will decide when to allow a thread to run and for how long. Are there any other differences in functionality here or are they essentially the same?
Generic Application (no synchronisation or blocking):
'Stock Market Feed' Queue: StockTrade messages (dictonaries)
'TradingStrategy' 1 Daemon Thread: Pull from queue, inspect messages and perform trades
'TradingStrategy' 1 Daemon Thread: Pull from queue, inspect messages and perform trades
'TradingStrategy' 1 Daemon Thread: Pull from queue, inspect messages and perform trades
Alternate architecture:
Feed Application (no multi-threading):
'Stock Market Feed' Queue or Pipe: StockTrade messages (dictonaries). Can a Queue be accessed from another outside process? I know a named pipe can but can a queue?
Trading Application (no multi-threading):
'TradingStrategy': Interacts with feed (pipe?/queue), inspect messages and perform trades
Trading Application (no multi-threading):
'TradingStrategy': Interacts with feed (pipe?/queue), inspect messages and perform trades
Trading Application (no multi-threading):
'TradingStrategy': Interacts with feed (pipe?/queue), inspect messages and perform trades

Yes, the two options are quite different. But it gets complicated fast trying to explain the difference. You should research and read up on the differences between a thread and a process. Get that in your head straight first.
Now, given your specific scenario, assuming by "multiprocessing queue" you actually mean an instance of a python Queue in one thread of a process, since the queue is inside the same process as all the worker threads, the workers will be able to access, and share that same instance of the Queue.
However when the workers are all separate processes then they cannot access the Queue by shared memory and will need some form of interprocess communication to gain access to that queue.
In practice, I'd be thinking something like redis or zeromq to be your queue, then build a python program to talk to it, then scale up as few, or as many copies of it, as you need.

Related

How does Waitress handle concurrent tasks?

I'm trying to build a python webserver using Django and Waitress, but I'd like to know how Waitress handles concurrent requests, and when blocking may occur.
While the Waitress documentation mentions that multiple worker threads are available, it doesn't provide a lot of information on how they are implemented and how the python GIL affects them (emphasis my own):
When a channel determines the client has sent at least one full valid HTTP request, it schedules a "task" with a "thread dispatcher". The thread dispatcher maintains a fixed pool of worker threads available to do client work (by default, 4 threads). If a worker thread is available when a task is scheduled, the worker thread runs the task. The task has access to the channel, and can write back to the channel's output buffer. When all worker threads are in use, scheduled tasks will wait in a queue for a worker thread to become available.
There doesn't seem to be much information on Stackoverflow either. From the question "Is Gunicorn's gthread async worker analogous to Waitress?":
Waitress has a master async thread that buffers requests, and enqueues each request to one of its sync worker threads when the request I/O is finished.
These statements don't address the GIL (at least from my understanding) and it'd be great if someone could elaborate more on how worker threads work for Waitress. Thanks!

Here's how the event-driven asynchronous servers generally work:
Start a process and listen to incoming requests. Utilizing the event notification API of the operating system makes it very easy to serve thousands of clients from single thread/process.
Since there's only one process managing all the connections, you don't want to perform any slow (or blocking) tasks in this process. Because then it will block the program for every client.
To perform blocking tasks, the server delegates the tasks to "workers". Workers can be threads (running in the same process) or separate processes (or subprocesses). Now the main process can keep on serving clients while workers perform the blocking tasks.
How does Waitress handle concurrent tasks?
Pretty much the same way I just described above. And for workers it creates threads, not processes.
how the python GIL affects them
Waitress uses threads for workers. So, yes they are affected by GIL in that they aren't truly concurrent though they seem to be. "Asynchronous" is the correct term.
Threads in Python run inside a single process, on a single CPU core, and don't run in parallel. A thread acquires the GIL for a very small amount of time and executes its code and then the GIL is acquired by another thread.
But since the GIL is released on network I/O, the parent process will always acquire the GIL whenever there's a network event (such as an incoming request) and this way you can stay assured that the GIL will not affect the network bound operations (like receiving requests or sending response).
On the other hand, Python processes are actually concurrent: they can run in parallel on multiple cores. But Waitress doesn't use processes.
Should you be worried?
If you're just doing small blocking tasks like database read/writes and serving only a few hundred users per second, then using threads isn't really that bad.
For serving a large volume of users or doing long running blocking tasks, you can look into using external task queues like Celery. This will be much better than spawning and managing processes yourself.

Hint: Those were my comments to the accepted answer and the conversation below, moved to a separate answer for space reasons.
Wait.. The 5th request will stay in the queue until one of the 4 threads is done with their previous handling, and therefore gone back to the pool. One thread will only ever server one request at a time. "IO bound" tasks only help in that the threads waiting for IO will implicitly (e.g. by calling time.sleep) tell the scheduler (python's internal one) that it can pass the GIL along to another thread since there's currently nothing to do, so that the others will get more CPU time for their stuff. On thread level this is fully sequential, which is still concurrent and asynchronous on process level, just not parallel. Just to get some wording staight.
Also, Python threads are "standard" OS threads (like those in C). So they will use all CPU cores and make full use of them. The only thing restricting them is that they need to hold the GIL when calling Python C-API functions, because the whole API in general is not thread-safe. On the other hand, calls to non-Python functions, i.e. functions in C extensions like numpy for example, but also many database APIs, including anything loaded via ctypes, do not hold the GIL while running. Why should they, they are running external C binaries which don't know anything of the Python interpreter running in the parent process. Therefore, such tasks will run truely in parallel when called from a WSGI app hosted by waitress. And if you've got more cores available, turn the thread number up to that amount (threads=X kwarg on waitress.create_server).

ThreadPoolExecutor on long running process

I want to use ThreadPoolExecutor on a webapp (django),
All examples that I saw are using the thread pool like that:
with ThreadPoolExecutor(max_workers=1) as executor:
code
I tried to store the thread pool as a class member of a class and to use map fucntion
but I got memory leak, the only way I could use it is by the with notation
so I have 2 questions:
Each time I run with ThreadPoolExecutor does it creates threads again and then release them, in other word is this operation is expensive?
If I avoid using with how can I release the memory of the threads
thanks

Normally, web applications are stateless. That means every object you create should live in a request and die at the end of the request. That includes your ThreadPoolExecutor. Having an executor at the application level may work, but it will be embedded into your web application instead of running as a separate group of processes.
So if you want to take the workers down or restart them, your web app will have to restart as well.
And there will be stability concerns, since there is no main process watching over child processes detecting which one has gotten stale, so requires a lot of code to get multiprocessing right.
Alternatively, If you want a persistent group of processes to listen to a job queue and run your tasks, there are several projects that do that for you. All you need to do is to set up a server that takes care of queueing and locking such as redis or rabbitmq, then point your project at that server and start the workers. Some projects even let you use the database as a job queue backend.

Using multiple cores with Python and Eventlet

I have a Python web application in which the client (Ember.js) communicates with the server via WebSocket (I am using Flask-SocketIO).
Apart from the WebSocket server the backend does two more things that are worth to be mentioned:
Doing some image conversion (using graphicsmagick)
OCR incoming images from the client (using tesseract)
When the client submits an image its entity is created in the database and the id is put in an image conversion queue. The worker grabs it and does image conversion. After that the worker puts it in the OCR queue where it will be handled by the OCR queue worker.
So far so good. The WS requests are handled synchronously in separate threads (Flask-SocketIO uses Eventlet for that) and the heavy computational action happens asynchronously (in separate threads as well).
Now the problem: the whole application runs on a Raspberry Pi 3. If I do not make use of the 4 cores it has I only have one ARMv8 core clocked at 1.2 GHz. This is very little power for OCR. So I decided to find out how to use multiple cores with Python. Although I read about the problems with the GIL) I found out about multiprocessing where it says The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads.. Exactly what I wanted. So I instantly replaced the
from threading import Thread
thread = Thread(target=heavy_computational_worker_thread)
thread.start()
by
from multiprocessing import Process
process = Process(target=heavy_computational_worker_thread)
process.start()
The queue needed to be handled by the multiple cores as well So i had to change
from queue import Queue
queue = multiprocessing.Queue()
to
import multiprocessing
queue = multiprocessing.Queue()
as well. Problematic: the queue and the Thread libraries are monkey patched by Eventlet. If I stop using the monkey patched version of Thread and Queue and use the one from multiprocsssing instead then the request thread started by Eventlet blocks forever when accessing the queue.
Now my question:
Is there any way I can make this application do the OCR and image conversion on a separate core?
I would like to keep using WebSocket and Eventlet if that's possible. The advantage I have is that the only communication interface between the processes would be the queue.
Ideas that I already had:
- Not using a Python implementation of a queue but rather using I/O. For example a dedicated Redis which the different subprocesses would access
- Going a step further: starting every queue worker as a separate Python process (e.g. python3 wsserver | python3 ocrqueue | python3 imgconvqueue). Then I would have to make sure myself that the access on the queue and on the database would be non-blocking
The best thing would be to keep the single process and make it work with multiprocessing, though.
Thank you very much in advance

Eventlet is currently incompatible with the multiprocessing package. There is an open issue for this work: https://github.com/eventlet/eventlet/issues/210.
The alternative that I think will work well in your case is to use Celery to manage your queue. Celery will start a pool of worker processes that wait for tasks provided by the main process via a message queue (RabbitMQ and Redis are both supported).
The Celery workers do not need to use eventlet, only the main server does, so this frees them to do whatever they need to do without the limitations imposed by eventlet.
If you are interested in exploring this approach, I have a complete example that uses it: https://github.com/miguelgrinberg/flack.

Django passing objects to views

In my wsgi.py startup hooks I create a queue objects which I need to be passed to the views module.
# Create and start thread for euclid.
q = queue.Queue()
euclidThread = threading.Thread(target=startEuclidServer,
kwargs={"msgq":q})
euclidThread.setDaemon(True)
euclidThread.start()
The queue is used for communication between my "euclid" thread and django.
My django project contains an app called "monitor" where my views need to be able to access the queue I create on startup.
Previously I did this by starting my thread and creating my queue in ../monitor/urls.py however this was problematic as it would only run upon the first http request to that app.
Anyone know the best way to do this, or should I be doing this in a completely different way. For the sake of simplicity I want to avoid using a dedicated queue such as rabbitmq/redis.

The Queue you are using here is designed for communication when all the threads are managed by one master process:
The Queue module implements multi-producer, multi-consumer queues. It
is especially useful in threaded programming when information must be
exchanged safely between multiple threads. The Queue class in this
module implements all the required locking semantics. It depends on
the availability of thread support in Python; see the threading
module.
This is not the case when you are doing web development.
You need to separate your queue process completely from your web process; the way you are doing it now I cannot even imagine how many issues it will cause in the future.
You need to have three separate processes:
Process that launches your queue.
Process that launches your wsgi process(es), which could be something like "runserver" if you are in development mode; or uwsgi+supervisord+circus or similar.
The worker(s) that will do the job that's posted on the queue.
Don't combine these.
Your views can then access the queue without worrying about thread issues; and your workers can also post updates without any issues.
Read on celery which is the defacto standard way of getting all this done easily in django.

What's the best pattern to design an asynchronous RPC application using Python, Pika and AMQP?

The producer module of my application is run by users who want to submit work to be done on a small cluster. It sends the subscriptions in JSON form through the RabbitMQ message broker.
I have tried several strategies, and the best so far is the following, which is still not fully working:
Each cluster machine runs a consumer module, which subscribes itself to the AMQP queue and issues a prefetch_count to tell the broker how many tasks it can run at once.
I was able to make it work using SelectConnection from the Pika AMQP library. Both consumer and producer start two channels, one connected to each queue. The producer sends requests on channel [A] and waits for responses in channel [B], and the consumer waits for requests on channel [A] and send responses on channel [B]. It seems, however, that when the consumer runs the callback that calculates the response, it blocks, so I have only one task executed at each consumer at each time.
What I need in the end:
the consumer [A] subscribes his tasks (around 5k each time) to the cluster
the broker dispatches N messages/requests for each consumer, where N is the number of concurrent tasks it can handle
when a single task is finished, the consumer replies to the broker/producer with the result
the producer receives the replies, update the computation status and, in the end, prints some reports
Restrictions:
If another user submits work, all of his tasks will be queued after the previous user (I guess this is automatically true from the queue system, but I haven't thought about the implications on a threaded environment)
Tasks have an order to be submitted, but the order they are replied is not important
UPDATE
I have studied a bit further and my actual problem seems to be that I use a simple function as callback to the pika's SelectConnection.channel.basic_consume() function. My last (unimplemented) idea is to pass a threading function, instead of a regular one, so the callback would not block and the consumer can keep listening.

As you have noticed, your process blocks when it runs a callback. There are several ways to deal with this depending on what your callback does.
If your callback is IO-bound (doing lots of networking or disk IO) you can use either threads or a greenlet-based solution, such as gevent, eventlet, or greenhouse. Keep in mind, though, that Python is limited by the GIL (Global Interpreter Lock), which means that only one piece of python code is ever running in a single python process. This means that if you are doing lots of computation with python code, these solutions will likely not be much faster than what you already have.
Another option would be to implement your consumer as multiple processes using multiprocessing. I have found multiprocessing to be very useful when doing parallel work. You could implement this by either using a Queue, having the parent process being the consumer and farming out work to its children, or by simply starting up multiple processes which each consume on their own. I would suggest, unless your application is highly concurrent (1000s of workers), to simply start multiple workers, each of which consumes from their own connection. This way, you can use the acknowledgement feature of AMQP, so if a consumer dies while still processing a task, the message is sent back to the queue automatically and will be picked up by another worker, rather than simply losing the request.
A last option, if you control the producer and it is also written in Python, is to use a task library like celery to abstract the task/queue workings for you. I have used celery for several large projects and have found it to be very well written. It will also handle the multiple consumer issues for you with the appropriate configuration.

Your setup sounds good to me. And you are right, you can simply set the callback to start a thread and chain that to a separate callback when the thread finishes to queue the response back over Channel B.
Basically, your consumers should have a queue of their own (size of N, amount of parallelism they support). When a request comes in via Channel A, it should store the result in the queue shared between the main thread with Pika and the worker threads in the thread pool. As soon it is queued, pika should respond back with ACK, and your worker thread would wake up and start processing.
Once the worker is done with its work, it would queue the result back on a separate result queue and issue a callback to the main thread to send it back to the consumer.
You should take care and make sure that the worker threads are not interfering with each other if they are using any shared resources, but that's a separate topic.

Being unexperienced in threading, my setup would run multiple consumer processes (the number of which basically being your prefetch count). Each would connect to the two queues and they would process jobs happily, unknowning of eachother's existence.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.