Interprocess communication with SPSC queue in Python

I have multiple write-heavy Python applications (producer1.py, producer2.py, ...) and I'd like to implement an asynchronous, non-blocking writer (consumer.py) as a separate process, so that the producers are not blocked by disk access or contention.
To make this more easily optimizable, assume I just need to expose a logging call that passes a fixed-length string from a producer to the writer, and the written file does not need to be sorted by call time. The target platform can be assumed to be Linux-only. How should I implement this with minimal latency penalty on the calling thread?
This seems like an ideal setup for multiple lock-free SPSC queues but I couldn't find any Python implementations.
Edit 1
I could implement a circular buffer as a memory-mapped file on /dev/shm, but I'm not sure whether atomic CAS is available in Python.
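For what it's worth, a pure SPSC ring buffer doesn't strictly need CAS: each index has exactly one writer (the producer advances the head, the consumer advances the tail), so plain aligned stores are enough on x86-class hardware. Below is a minimal sketch over multiprocessing.shared_memory, with assumed names and sizes; CPython offers no explicit memory-ordering control, so treat it as a proof of concept rather than a production queue.

```python
import struct
from multiprocessing import shared_memory

RECORD_LEN = 128   # assumed fixed record length
SLOTS = 1024       # ring capacity
HDR = 16           # two 8-byte counters: head (producer-owned), tail (consumer-owned)

def create(name="spsc_log"):
    # Producer creates; consumer attaches with SharedMemory(name=name).
    return shared_memory.SharedMemory(name=name, create=True,
                                      size=HDR + SLOTS * RECORD_LEN)

def push(shm, record):
    assert len(record) == RECORD_LEN
    # Assumes aligned 8-byte loads/stores are atomic on the target platform.
    head, tail = struct.unpack_from("QQ", shm.buf, 0)
    if head - tail == SLOTS:
        return False                                   # full; caller decides policy
    off = HDR + (head % SLOTS) * RECORD_LEN
    shm.buf[off:off + RECORD_LEN] = record
    struct.pack_into("Q", shm.buf, 0, head + 1)        # publish after the write
    return True

def pop(shm):
    head, tail = struct.unpack_from("QQ", shm.buf, 0)
    if head == tail:
        return None                                    # empty
    off = HDR + (tail % SLOTS) * RECORD_LEN
    record = bytes(shm.buf[off:off + RECORD_LEN])
    struct.pack_into("Q", shm.buf, 8, tail + 1)        # release the slot
    return record
```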

The simplest way would be an async TCP/Unix socket server in consumer.py.
Using HTTP would add unnecessary overhead in this case.
A producer, acting as a TCP/Unix socket client, sends data to the consumer, and the consumer responds right away, before writing the data to the disk drive.
File I/O in the consumer is blocking, but as stated above it does not block the producers.
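Here is a minimal sketch of that approach, with assumed paths and record length: an asyncio Unix-socket server that acknowledges each fixed-length record before writing it to disk.

```python
import asyncio

SOCKET_PATH = "/tmp/logwriter.sock"   # assumed socket path
RECORD_LEN = 128                      # assumed fixed record length

async def handle_producer(reader, writer, log):
    while True:
        try:
            record = await reader.readexactly(RECORD_LEN)
        except asyncio.IncompleteReadError:
            break                          # producer disconnected
        writer.write(b"\x06")              # ack immediately, before touching the disk
        await writer.drain()
        log.write(record)                  # blocking, but only inside the consumer
    writer.close()

async def main():
    log = open("records.log", "ab")        # assumed output file
    server = await asyncio.start_unix_server(
        lambda r, w: handle_producer(r, w, log), path=SOCKET_PATH)
    async with server:
        await server.serve_forever()

asyncio.run(main())
```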

Related

Producer-consumer architecture over network in python?

I need a producer-consumer kind of architecture, where the producer puts data in a queue over and over, and then a consumer reads from that queue as fast as it can process the data.
For the producer and consumer running in separate processes we already have multiprocessing, with Queue, where you have put and get. So even if the producer runs at 2-3 times the speed of the consumer, all the data stays in the queue (assume memory use is not a problem) and the consumer just calls q.get whenever it needs to.
But I need the producer and consumer to be connected over a network, so probably through a socket (but I am open to other methods). The big problem with sockets is that they do not separate objects automatically like queues do.
For a multiprocessing.Queue if I call q.get I get the next object, the queue takes care of how many bytes to read and recreates the object for me, q.get just returns the object. With a socket I have to pickle.dumps to send it and then I need to be careful how many bytes to read from the socket (in case there is more than 1 object in the socket) and then pickle.loads the result. The main problem is keeping track of object sizes.
If I put 10 objects of different sizes that add up to 1000 bytes in a Queue, the queue takes care of how many bytes to read for every object when calling q.get. For a socket, if I pickle the 10 objects and send them, the socket has no idea how to split the big 1000-byte string inside it, and creating a mechanism for this means adding a lot of new code.
Is there some kind of... socket-based Queue or similar?
This is usually solved with external software that acts as a broker between the producer and consumer over the network. There are a few open-source projects you can look into:
RabbitMQ
Kafka
Redis
Celery
They are all different in their own way, but they all have Python libraries you can easily pip install to begin using them. All of them will require that a third process is running to serve as the broker of messages.
Similarly, there are paid products for this as well - typically hosted in one of the big cloud providers - like AWS SQS.
This is not to say that it is not possible to create a custom socket or server implementation to do this... but, a lot of times in programming, it's best not to try to reinvent the wheel.
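That said, if you do roll your own, the framing problem described in the question is usually solved with a length prefix: send the pickled object's size first, then its bytes. A minimal sketch (plain sockets, no broker):

```python
import pickle
import struct

def send_obj(sock, obj):
    payload = pickle.dumps(obj)
    # 4-byte big-endian length prefix, then the pickled bytes.
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock, n):
    # recv() may return fewer bytes than asked for, so loop until complete.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf += chunk
    return buf

def recv_obj(sock):
    (size,) = struct.unpack("!I", recv_exact(sock, 4))
    return pickle.loads(recv_exact(sock, size))
```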

How to use multiple processes in python for a continuous workload

I have a python application running inside of a pod in kubernetes which subscribes to a Google Pub/Sub topic and on each message downloads a file from a google bucket.
The issue I have is that I can't process the workload quickly enough using a single threaded Python application. I would normally run a number of pods to handle the workload but the problem is that all the files have to end up on the same filesystem to be processed by another application.
I have tried spawning a new thread for each request but the volume is too great.
What I would like to do is:
1) Have a number of processes that can process new messages
2) Keep the processes alive and use them to respond to new requests coming in.
All the examples for multiprocessing in python are single workload examples, for example providing 10 numbers to a square function, which isn't what I'm trying to achieve.
I've used gunicorn in the past which spawns a number of worker threads for a flask application, what I want is to do something similar without flask.
First, try to separate IO-bound tasks (e.g. requests, reads/writes) from CPU-bound tasks (parsing JSON/XML, calculations).
For the IO-bound case, use threading or ThreadPoolExecutor, which automatically reuse worker threads. Keep in mind that writing to disk is a blocking operation!
If you want parallelism for CPU-bound work, use multiprocessing or ProcessPoolExecutor. To synchronize them you can use a shared object (proxy object), a file, a pipe, Redis, etc.
Shared objects such as Managers (Namespaces, dicts, etc.) are preferred if you want to stay in pure Python.
To avoid blocking when working with files, use a dedicated thread or async I/O.
For asyncio, use the aiofile library.
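To tie this back to the question (keeping a fixed set of worker processes alive to handle a continuous stream of messages), here is a minimal sketch using a shared queue and sentinel shutdown; the handler name is hypothetical.

```python
import multiprocessing as mp

def download_file(msg):
    # Hypothetical per-message handler; stand-in for the real bucket download.
    print("processing", msg)

def worker(queue):
    # Long-lived worker: keeps pulling new messages instead of exiting after one.
    while True:
        msg = queue.get()
        if msg is None:          # sentinel: shut this worker down
            break
        download_file(msg)

if __name__ == "__main__":
    queue = mp.Queue()
    workers = [mp.Process(target=worker, args=(queue,)) for _ in range(4)]
    for w in workers:
        w.start()
    for msg in ["file-1", "file-2", "file-3"]:   # stand-in for Pub/Sub messages
        queue.put(msg)
    for _ in workers:
        queue.put(None)          # one sentinel per worker
    for w in workers:
        w.join()
```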

Memory bounds in twisted applications

Consider the following scenario: A process on the server is used to handle data from a network connection. Twisted makes this very easy with spawnProcess and you can easily connect the ProcessTransport with your protocol on the network side.
However, I was unable to determine how Twisted handles a situation where the data from the network arrives faster than the process performs reads on its standard input. As far as I can see, Twisted code mostly uses an internal buffer (self._buffer or similar) to store unconsumed data. Doesn't this mean that concurrent requests from a fast connection (e.g. over a local gigabit LAN) could fill up main memory and induce heavy swapping, making the situation even worse? How can this be prevented?
Ideally, the internal buffer would have an upper bound. As I understand it, the OS's networking code would automatically stall the connection/start dropping packets if the OS's buffers are full, which would slow down the client. (Yes I know, DoS on the network level is still possible, but this is a different problem). This is also the approach I would take if implementing it myself: just don't read from the socket if the internal buffer is full.
Restricting the maximum request size is also not an option in my case, as the service should be able to process files of arbitrary size.
The solution has two parts.
One part is called producers. Producers are objects that data comes out of. A TCP transport is a producer. Producers have a couple useful methods: pauseProducing and resumeProducing. pauseProducing causes the transport to stop reading data from the network. resumeProducing causes it to start reading again. This gives you a way to avoid building up an unbounded amount of data in memory that you haven't processed yet. When you start to fall behind, just pause the transport. When you catch up, resume it.
The other part is called consumers. Consumers are objects that data goes into. A TCP transport is also a consumer. More importantly for your case, though, a child process transport is also a consumer. Consumers have a few methods, and one in particular is useful to you: registerProducer. This tells the consumer which producer its data is coming from. The consumer can then call pauseProducing and resumeProducing according to its ability to process the data. When a transport (TCP or process) cannot send data as fast as a producer is asking it to send data, it will pause the producer. When it catches up, it will resume it again.
You can read more about producers and consumers in the Twisted documentation.
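As an illustration of the pause/resume pattern described above (a sketch, not code from the documentation): the TCP transport is the producer, and the protocol pauses it whenever its in-memory backlog passes a hypothetical high-water mark.

```python
from twisted.internet import protocol, reactor

class BoundedProtocol(protocol.Protocol):
    MAX_BUFFERED = 64 * 1024   # hypothetical high-water mark

    def connectionMade(self):
        self._buffer = b""

    def dataReceived(self, data):
        self._buffer += data
        if len(self._buffer) > self.MAX_BUFFERED:
            self.transport.pauseProducing()      # stop reading from the socket
            reactor.callLater(0.1, self._drain)  # pretend the consumer catches up

    def _drain(self):
        self._buffer = b""                       # hand off to the real consumer here
        self.transport.resumeProducing()         # start reading again

factory = protocol.Factory.forProtocol(BoundedProtocol)
reactor.listenTCP(8000, factory)
reactor.run()
```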

Should I use epoll or just blocking recv in threads?

I'm trying to write a scalable custom web server.
Here's what I have so far:
The main loop and request interpreter are in Cython. The main loop accepts connections and assigns the sockets to one of the processes in the pool (has to be processes, threads won't get any benefit from multi-core hardware because of the GIL).
Each process has a thread pool. The process assigns the socket to a thread.
The thread calls recv (blocking) on the socket and waits for data. When some shows up, it gets piped into the request interpreter, and then sent via WSGI to the application running in that thread.
Now I've heard about epoll and am a little confused. Is there any benefit to using epoll to get socket data and then pass that directly to the processes? Or should I just go the usual route of having each thread wait on recv?
PS: What is epoll actually used for? It seems like multithreading and blocking fd calls would accomplish the same thing.
If you're already using multiple threads, epoll doesn't offer you much additional benefit.
The point of epoll is that a single thread can listen for activity on many file descriptors simultaneously (and respond to events on each as they occur), and thus provide event-driven multitasking without requiring the spawning of additional threads. Threads are relatively cheap (compared to spawning processes), but each one does require some overhead (after all, they each have to maintain a call stack).
If you wanted to, you could rewrite your pool processes to be single-threaded using epoll, which would reduce your overall thread usage count, but of course you'd have to consider whether that's something you care about or not - in general, for low numbers of simultaneous requests on each worker, the overhead of spawning threads wouldn't matter, but if you want each worker to be able to handle 1000s of open connections, that overhead can become significant (and that's where epoll shines).
But...
What you're describing sounds suspiciously like you're basically reinventing the wheel - your:
main loop and request interpreter
pool of processes
sounds almost exactly like:
nginx (or any other load balancer/reverse proxy)
A pre-forking tornado app
Tornado is a single-threaded Python web server module using epoll, and it has pre-forking capability built in (meaning that it spawns multiple copies of itself as separate processes, effectively creating a process pool). Tornado is based on the tech created to power FriendFeed - they needed a way to handle huge numbers of open connections for long-polling clients looking for new real-time updates.
If you're doing this as a learning process, then by all means, reinvent away! It's a great way to learn. But if you're actually trying to build an application on top of these kinds of things, I'd highly recommend considering using the existing, stable, communally-developed projects - it'll save you a lot of time, false starts, and potential gotchas.
(P.S. I approve of your avatar. <3)
The epoll function (and the other functions in the same family, poll and select) lets you write single-threaded networking code that manages multiple network connections. Since there is no threading, there is no need for the synchronisation that would be required in a multi-threaded program (which can be difficult to get right).
On the other hand, you need an explicit state machine for each connection. In a threaded program, this state machine is implicit.
These functions just offer another way to multiplex multiple connections in a process. Sometimes it is easier not to use threads; other times you're already using threads, and thus it is easier just to use blocking sockets (which release the GIL in Python).
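For a concrete picture, here is a minimal single-threaded sketch using Python's selectors module (epoll-backed on Linux): one loop multiplexes all connections, and any per-connection state has to be kept explicitly.

```python
import selectors
import socket

sel = selectors.DefaultSelector()   # uses epoll where available

def serve(port=8000):
    server = socket.socket()
    server.bind(("", port))
    server.listen()
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ, data=None)  # data=None marks listener
    while True:
        for key, _ in sel.select():
            if key.data is None:                            # new connection
                conn, _ = key.fileobj.accept()
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ, data="client")
            else:
                conn = key.fileobj
                chunk = conn.recv(4096)
                if chunk:
                    # Trivial echo; a real server would buffer unsent bytes
                    # in an explicit per-connection state machine.
                    conn.send(chunk)
                else:
                    sel.unregister(conn)
                    conn.close()

serve()
```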

Multiple communication channels with Twisted Python

I am currently researching the Twisted framework as a way of implementing a network-based backup application, and I would like to achieve something that I cannot find any examples of on the net.
I plan to implement the system using the Perspective Broker, but I will also need a way of transferring binary files from the client to the server. I would like to be able to call a method on the PB, and then use some sort of UID to send the file over a separate data channel.
The reason for having these two separate communication channels is down to the fact that I would like to make the client multi-threaded (one thread scanning the directory tree, while another thread transfers the changed files to the server).
Is this possible with Twisted? I have read that having multiple threads calling methods on a reactor is bad news, so is this architecture doomed to failure?
I would appreciate any pointers in the right direction, as I mentioned I am still researching the possibilities - but I plan to use Django for this project, so Python is a must.
The reason for having these two separate communication channels is down to the fact that I would like to make the client multi-threaded (one thread scanning the directory tree, while another thread transfers the changed files to the server).
This reasoning doesn't follow. You can use a single protocol running over a single socket just fine, even if you have a thread wandering the filesystem looking for work to do.
There may be other reasons to want to send file data differently than you send metadata or other structured data between the client and server, though. However, the main one that comes to mind is that you might not want to force commands to wait for files to be completed, and this issue is relieved by PB's FilePager class.
The main thing to remember if you're going to have threads in a Twisted-using application is that whenever you want to invoke a Twisted API from any thread except the thread in which the reactor is running, you must use reactor.callFromThread (or an API built solely on that method, such as twisted.internet.threads.blockingCallFromThread).
callFromThread sends some work (in the form of an object to call) to the reactor thread where the reactor will arrange to call it "soon". Any other Twisted API you invoke from the wrong thread will have undefined results.
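A minimal sketch of that rule, with hypothetical names: a scanning thread that schedules PB calls on the reactor thread via reactor.callFromThread (pb_root is assumed to be a RemoteReference obtained elsewhere, and "fileChanged" is a hypothetical remote method).

```python
import threading
from twisted.internet import reactor

def scan_tree(pb_root):
    # Runs in an ordinary thread: no Twisted calls allowed here except
    # reactor.callFromThread, which hands work to the reactor thread.
    for path in ["a.txt", "b.txt"]:          # stand-in for a real directory walk
        reactor.callFromThread(pb_root.callRemote, "fileChanged", path)

def start(pb_root):
    # pb_root is assumed to be a twisted.spread.pb RemoteReference.
    threading.Thread(target=scan_tree, args=(pb_root,), daemon=True).start()
    reactor.run()
```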
