I need a producer-consumer kind of architecture, where the producer puts data in a queue over and over, and then a consumer reads from that queue as fast as it can process the data.
For a producer and consumer running in separate processes we already have multiprocessing, with Queue providing put and get. So even if the producer runs at 2-3 times the speed of the consumer, all the data is in the queue (assume memory use is not a problem) and the consumer just calls q.get whenever it needs to.
But I need the producer and consumer to be connected over a network, so probably through a socket (but I am open to other methods). The big problem with sockets is that they do not separate objects automatically like queues do.
For a multiprocessing.Queue if I call q.get I get the next object, the queue takes care of how many bytes to read and recreates the object for me, q.get just returns the object. With a socket I have to pickle.dumps to send it and then I need to be careful how many bytes to read from the socket (in case there is more than 1 object in the socket) and then pickle.loads the result. The main problem is keeping track of object sizes.
If I put 10 objects of different sizes that add up to 1000 bytes in a Queue, then the queue takes care of how many bytes to read for every object when calling q.get. With a socket, if I pickle the 10 objects and send them, the socket has no idea how to split the big 1000-byte string inside it, and creating a mechanism for this means adding a lot of new code.
Is there some kind of... socket-based Queue or similar?
This is usually solved with external software that acts as a broker between the producer and the consumer over the network. There are a few open-source projects you can look into:
RabbitMQ
Kafka
Redis
Celery
They are all different in their own way, but they all have Python libraries you can easily pip install to begin using them. All of them will require that a third process is running to serve as the broker of messages.
Similarly, there are paid products for this as well - typically hosted in one of the big cloud providers - like AWS SQS.
This is not to say that it is impossible to create a custom socket or server implementation to do this... but a lot of the time in programming, it's best not to reinvent the wheel.
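For reference, the custom-socket route is smaller than it may sound. Below is a minimal sketch of length-prefixed framing, assuming a 4-byte big-endian size header in front of every pickled object. The helper names send_obj and recv_obj are made up for this illustration and are not part of any library.

```python
import pickle
import socket
import struct

# Assumed wire format: 4-byte big-endian length, then the pickled payload.
HEADER = struct.Struct("!I")


def send_obj(sock, obj):
    """Pickle obj and send it preceded by its length."""
    payload = pickle.dumps(obj)
    sock.sendall(HEADER.pack(len(payload)) + payload)


def _recv_exactly(sock, n):
    """Read exactly n bytes; recv() alone may return fewer."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("socket closed in the middle of a message")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_obj(sock):
    """Read one length-prefixed message and unpickle it."""
    (size,) = HEADER.unpack(_recv_exactly(sock, HEADER.size))
    return pickle.loads(_recv_exactly(sock, size))
```

The consumer then calls recv_obj(sock) much like q.get(). Two caveats: unpickling data from an untrusted peer can execute arbitrary code, and the standard library's multiprocessing.connection module (Listener and Client) already offers send() and recv() of picklable objects over a socket, so it may be worth checking before writing your own.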
Related
I have multiple write-heavy Python applications (producer1.py, producer2.py, ...) and I'd like to implement an asynchronous, non-blocking writer (consumer.py) as a separate process, so that the producers are not blocked by disk access or contention.
To make this more easily optimizable, assume I just need to expose a logging call that passes a fixed length string from a producer to the writer, and the written file does not need to be sorted by call time. And the target platform can be Linux-only. How should I implement this with minimal latency penalty on the calling thread?
This seems like an ideal setup for multiple lock-free SPSC queues but I couldn't find any Python implementations.
Edit 1
I could implement a circular buffer as a memory-mapped file on /dev/shm, but I'm not sure if I'll have atomic CAS in Python?
The simplest way would be using an async TCP/Unix Socket server in consumer.py.
Using HTTP would add unnecessary overhead in this case.
A producer, acting as a TCP/Unix socket client, sends data to the consumer, and the consumer responds right away, before writing the data to disk.
File I/O in the consumer is blocking, but it will not block the producers, because the response is sent before the write happens.
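A rough sketch of that idea with asyncio, assuming fixed 16-byte records and a one-byte acknowledgement (both invented for this example). The handler queues each record and acknowledges it immediately; a separate task performs the blocking write in a thread. The demo function uses TCP on localhost and takes any callable as a stand-in for the real file write.

```python
import asyncio

RECORD_SIZE = 16  # assumed fixed record length in bytes


async def writer_task(queue, write_record):
    """Drain the queue; run the blocking write in the default thread pool."""
    loop = asyncio.get_running_loop()
    while True:
        record = await queue.get()
        if record is None:  # sentinel: stop
            break
        await loop.run_in_executor(None, write_record, record)


async def handle_producer(reader, writer, queue):
    """Per-connection handler: enqueue each record, then ack right away."""
    try:
        while True:
            record = await reader.readexactly(RECORD_SIZE)
            queue.put_nowait(record)
            writer.write(b"\x01")  # ack is sent before any disk I/O
            await writer.drain()
    except (asyncio.IncompleteReadError, ConnectionError):
        pass  # producer disconnected
    finally:
        writer.close()


async def demo(records, write_record):
    """Start the consumer, push records through one producer, shut down."""
    queue = asyncio.Queue()
    drain = asyncio.create_task(writer_task(queue, write_record))
    server = await asyncio.start_server(
        lambda r, w: handle_producer(r, w, queue), "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    for rec in records:
        writer.write(rec)
        await reader.readexactly(1)  # wait for the ack
    writer.close()
    await writer.wait_closed()

    server.close()
    await server.wait_closed()
    await queue.put(None)
    await drain
```

Note the trade-off: because the ack goes out before the write, a record that was acknowledged can still be lost if the consumer dies before the write completes. For a Unix socket, asyncio.start_unix_server and asyncio.open_unix_connection would take the place of the TCP calls.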
Consider the following scenario: A process on the server is used to handle data from a network connection. Twisted makes this very easy with spawnProcess and you can easily connect the ProcessTransport with your protocol on the network side.
However, I was unable to determine how Twisted handles a situation where the data from the network is available faster than the process performs reads on its standard input. As far as I can see, Twisted code mostly uses an internal buffer (self._buffer or similar) to store unconsumed data. Doesn't this mean that concurrent requests from a fast connection (eg. over local gigabit LAN) could fill up main memory and induce heavy swapping, making the situation even worse? How can this be prevented?
Ideally, the internal buffer would have an upper bound. As I understand it, the OS's networking code would automatically stall the connection/start dropping packets if the OS's buffers are full, which would slow down the client. (Yes I know, DoS on the network level is still possible, but this is a different problem). This is also the approach I would take if implementing it myself: just don't read from the socket if the internal buffer is full.
Restricting the maximum request size is also not an option in my case, as the service should be able to process files of arbitrary size.
The solution has two parts.
One part is called producers. Producers are objects that data comes out of. A TCP transport is a producer. Producers have a couple useful methods: pauseProducing and resumeProducing. pauseProducing causes the transport to stop reading data from the network. resumeProducing causes it to start reading again. This gives you a way to avoid building up an unbounded amount of data in memory that you haven't processed yet. When you start to fall behind, just pause the transport. When you catch up, resume it.
The other part is called consumers. Consumers are objects that data goes into. A TCP transport is also a consumer. More importantly for your case, though, a child process transport is also a consumer. Consumers have a few methods, and one in particular is useful to you: registerProducer. This tells the consumer which producer its data is coming from. The consumer can then call pauseProducing and resumeProducing according to its ability to process the data. When a transport (TCP or process) cannot send data as fast as a producer is asking it to send data, it will pause the producer. When it catches up, it will resume it again.
You can read more about producers and consumers in the Twisted documentation.
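To make the pause/resume handshake concrete, here is a library-free toy model. It is not Twisted code: ToyProducer, ToyConsumer, the high_water limit and the run helper are all invented for this illustration, and only the method names mirror the Twisted interfaces described above.

```python
from collections import deque


class ToyProducer:
    """Stand-in for a transport that reads from the network."""

    def __init__(self, chunks):
        self.chunks = deque(chunks)
        self.paused = False

    def pauseProducing(self):
        self.paused = True

    def resumeProducing(self):
        self.paused = False

    def produce_into(self, consumer):
        """Deliver chunks until paused or out of data."""
        while self.chunks and not self.paused:
            consumer.write(self.chunks.popleft())


class ToyConsumer:
    """Stand-in for a slow transport with a bounded buffer."""

    def __init__(self, high_water):
        self.buffer = deque()
        self.high_water = high_water  # assumed buffer limit, in chunks
        self.peak = 0
        self.producer = None

    def registerProducer(self, producer, streaming):
        self.producer = producer

    def write(self, chunk):
        self.buffer.append(chunk)
        self.peak = max(self.peak, len(self.buffer))
        if len(self.buffer) >= self.high_water:
            self.producer.pauseProducing()  # falling behind: stop reading

    def drain_one(self):
        """Simulate the slow side finishing one chunk."""
        chunk = self.buffer.popleft()
        if not self.buffer:
            self.producer.resumeProducing()  # caught up: read again
        return chunk


def run(chunks, high_water=3):
    producer = ToyProducer(chunks)
    consumer = ToyConsumer(high_water)
    consumer.registerProducer(producer, True)
    out = []
    while producer.chunks or consumer.buffer:
        producer.produce_into(consumer)
        out.append(consumer.drain_one())
    return out, consumer.peak
```

However fast the producer could deliver, the buffer never grows past high_water. In Twisted itself the transports do this bookkeeping for you once the producer is registered with the consumer.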
Suppose one wants to write a Python app in which different processes need to communicate. The communication consists of sending strings and/or NumPy arrays.
What are the considerations to prefer OpenMPI vs. a tool like RabbitMQ?
There is no single correct answer to such a question. It all depends on a large number of factors. For example:
What kind of communications do you have? Are you sending large packets or small packets, do you need good bandwidth or low latency?
What kind of delivery guarantees do you need?
OpenMPI can instantly deliver messages only to a running process, while different MQ solutions can queue messages and allow fancy producer-consumer configurations.
What kind of network do you have? If you are running on localhost, something like ZeroMQ would probably be the fastest. If you are running on a set of hosts, it depends on the interconnects available. E.g. OpenMPI can utilize InfiniBand/Myrinet links.
What kind of processing are you doing? With MPI all processes are usually started at the same time, do the processing and terminate all at once.
This is exactly the scenario I was in a few months ago and I decided to use AMQP with RabbitMQ using topic exchanges, in addition to memcache for large objects.
The AMQP messages are all strings, in JSON object format so that it is easy to add attributes to a message (like number of retries) and republish it. JSON objects are a subset of JSON that correspond to Python dicts. For instance {"recordid": "272727"} is a JSON object with one attribute. I could have just pickled a Python dict but that would have locked us into only using Python with the message queues.
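A small sketch of what that looks like on the message-body side, using only the json module. The "retries" attribute name, the MAX_RETRIES limit and both helper functions are assumptions made for this example. The actual publishing through an AMQP client is left out.

```python
import json

MAX_RETRIES = 3  # assumed policy, not from the original setup


def make_message(record_id):
    """Build the JSON string that would be published as the message body."""
    return json.dumps({"recordid": record_id})


def with_retry(body):
    """Return the body to republish with the retry count bumped.

    Returns None once the message has been retried MAX_RETRIES times.
    """
    msg = json.loads(body)
    msg["retries"] = msg.get("retries", 0) + 1
    if msg["retries"] > MAX_RETRIES:
        return None
    return json.dumps(msg)
```

Because the body is plain JSON, a consumer written in another language can read the same attributes, which is the point made above about not pickling the dict.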
The large objects don't get routed by AMQP, instead they go into a memcache where they are available for another process to retrieve them. You could just as well use Redis or Tokyo Tyrant for this job. The idea is that we did not want short messages to get queued behind large objects.
In the end, my Python processes ended up using both AMQP and ZeroMQ for two different aspects of the architecture. You may find that it makes sense to use both OpenMPI and AMQP but for different types of jobs.
In my case, a supervisor process runs forever and starts a whole flock of workers that also run forever unless they die or hang, in which case the supervisor restarts them. The work constantly flows in as messages via AMQP, and each process handles just one step of the work, so that when we identify a bottleneck we can run multiple instances of the process, possibly on separate machines, to remove the bottleneck. In my case, I have 15 instances of one process, 4 of two others, and about 8 other single instances.
I am receiving tweets at an extremely fast rate from a long-lived connection to the Twitter API Streaming Server. I then do some heavy text processing and save the tweets in my database.
I am using PyCurl for the connection, with a callback function that takes care of the text processing and saving to the db. See my approach below, which is not working properly.
I am not familiar with network programming, so would like to know:
How can I use threads, Queue or the Twisted framework to solve this problem?
import pycurl

def process_tweet(data):
    # do some heavy text processing on the chunk pycurl passes in
    pass

def open_stream_connection():
    connect = pycurl.Curl()
    connect.setopt(pycurl.URL, STREAMURL)
    connect.setopt(pycurl.WRITEFUNCTION, process_tweet)
    connect.setopt(pycurl.USERPWD, "%s:%s" % (TWITTER_USER, TWITTER_PASS))
    connect.perform()
You should have a number of threads receiving the messages as they come in. That number should probably be 1 if you are using pycurl, but should be higher if you are using httplib - the idea being you want to be able to have more than one query on the Twitter API at a time, so there is a steady amount of work to process.
When each Tweet arrives, it is pushed onto a Queue.Queue. The Queue ensures that there is thread-safety in the communications - each tweet will only be handled by one worker thread.
A pool of worker threads is responsible for reading from the Queue and dealing with the Tweet. Only the interesting tweets should be added to the database.
As the database is probably the bottleneck, there is a limit to the number of threads in the pool that are worth adding - more threads won't make it process faster, it'll just mean more threads are waiting in the queue to access the database.
This is a fairly common Python idiom. This architecture will scale only to a certain degree - i.e. what one machine can process.
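A minimal sketch of the receiver/queue/worker-pool layout described above. The run_pipeline function, its process callback and the None sentinel are choices made for this example. In Python 3 the module is spelled queue (Queue.Queue is the Python 2 name). The receiver is simulated by iterating over a list. In the real program the pycurl callback would call work.put(tweet) instead.

```python
import queue
import threading


def run_pipeline(tweets, process, num_workers=4):
    """Feed tweets through a pool of worker threads; return kept results."""
    work = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            tweet = work.get()
            if tweet is None:  # sentinel: no more work
                break
            out = process(tweet)  # heavy text processing goes here
            if out is not None:  # only interesting tweets are kept
                with lock:
                    results.append(out)  # stand-in for the database insert

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for tweet in tweets:  # stand-in for the receiving thread
        work.put(tweet)
    for _ in threads:
        work.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Each tweet is taken off the queue by exactly one worker, so the order of results is not guaranteed. num_workers is the knob the paragraph above is talking about: past the point where the database is saturated, raising it will not help.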
Here's a simple setup if you are OK with using a single machine.
1 thread accepts connections. After a connection is accepted, it passes the accepted connection to another thread for processing.
You can, of course, use processes (e.g., using multiprocessing) instead of threads, but I'm not familiar enough with multiprocessing to give advice. The setup would be the same: 1 process accepts connections, then passes them to subprocesses.
If you need to shard the processing across multiple machines, then the simple thing to do would be to stuff the message into the database, then notify the workers about the new record (this will require some sort of coordination/locking between the workers). If you want to avoid hitting the database, then you'll have to pipe messages from your network process to the workers (and I'm not well versed enough in low level networking to tell you how to do that :))
I suggest this organization:
one process reads Twitter, stuffs tweets into database
one or more processes read the database, process each tweet, and insert the result into a new database. Original tweets are either deleted or marked as processed.
That is, you have two or more processes/threads. The tweet database can be seen as a queue of work. Multiple worker processes take jobs (tweets) off the queue, and create data in the second database.
I have a "manager" process on a node, and several worker processes. The manager is the actual server who holds all of the connections to the clients. The manager accepts all incoming packets and puts them into a queue, and then the worker processes pull the packets out of the queue, process them, and generate a result. They send the result back to the manager (by putting them into another queue which is read by the manager), but here is where I get stuck: how do I send the result to a specific socket? When dealing with the processing of the packets on a single process, it's easy, because when you receive a packet you can reply to it by just grabbing the "transport" object in-context. But how would I do this with the method I'm using?
It sounds like you might need to keep a reference to the transport (or protocol) along with the bytes that just came in on that protocol in your 'event' object. That way responses to requests that came in on a connection go out on the same connection.
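Since the workers are separate processes and a transport typically cannot be handed across a process boundary, one way to apply this is to queue a connection id instead and keep the id-to-transport table in the manager. The sketch below assumes exactly that. The Manager class, its method names and worker_step are invented for illustration, and queue.Queue stands in for whatever inter-process queue you already use.

```python
import itertools
import queue


class Manager:
    """Owns every client connection; workers only ever see connection ids."""

    def __init__(self):
        self.requests = queue.Queue()  # manager -> workers
        self.results = queue.Queue()   # workers -> manager
        self._transports = {}          # conn_id -> transport
        self._ids = itertools.count(1)

    def connection_made(self, transport):
        conn_id = next(self._ids)
        self._transports[conn_id] = transport
        return conn_id

    def connection_lost(self, conn_id):
        self._transports.pop(conn_id, None)

    def data_received(self, conn_id, data):
        # The 'event' carries the id of the connection it arrived on.
        self.requests.put((conn_id, data))

    def flush_results(self):
        """Send every finished result back on the connection it belongs to."""
        while True:
            try:
                conn_id, result = self.results.get_nowait()
            except queue.Empty:
                return
            transport = self._transports.get(conn_id)
            if transport is not None:  # client may have disconnected
                transport.write(result)


def worker_step(manager, handle):
    """One unit of worker work: take a packet, return (conn_id, result)."""
    conn_id, data = manager.requests.get()
    manager.results.put((conn_id, handle(data)))
```

The worker never touches a socket. It just passes the id through untouched, and the manager looks the transport up again when the result comes back.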
If things don't need to be processed serially perhaps you should think about setting up functors that can handle the data in parallel to remove the need for queueing. Just keep in mind that you will need to protect critical sections of your code.
Edit:
Judging from your other question about evaluating your server design it would seem that processing in parallel may not be possible for your situation, so my first suggestion stands.