Exiting a multiprocessing program on error - python

I have an application that creates one multiprocessing process for communication and n for workers. I would need to find a way to exit the whole application when there is an error somewhere or the user presses control-C.
(The agents are connected via zmq.)

If the error occurs in the communication process, you could use your existing socket to send a special task that asks the receiver to close. The worker that receives it notifies the communication process (through some socket) that it is exiting, and then exits.
The communication process can basically fill the pipeline with n such messages if the workers do no queuing, or with k*n of them plus one every once in a while if they do queue, repeating until all workers are done.
An error at a worker could likewise be reported to the communication process through the channel mentioned above. The channel could for example be a PUSH/PULL socket, or maybe a PUB/SUB with a proxy. (I guess you are already using a PUSH/PULL socket to get the results back; in that case, use that.)
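A rough sketch of the shutdown-by-special-task idea, assuming pyzmq and a PUSH/PULL pipeline like the one described (the sentinel value and socket wiring here are illustrative):

STOP = b"__STOP__"   # special "task" that tells a worker to exit

# communication process: fill the pipeline with one stop message per worker
def shut_down_workers(task_socket, n_workers):
    for _ in range(n_workers):
        task_socket.send(STOP)

# worker loop: task_socket is the PULL end, notify_socket reports back
def worker_loop(task_socket, notify_socket):
    while True:
        task = task_socket.recv()
        if task == STOP:
            notify_socket.send(b"exiting")  # tell the communication process
            break
        # ... process the task as usual ...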

Related

How to pass messages/information between process using concurrent.futures.ProcessPoolExecutor

I am trying to pass messages between my processes. I have a process that is a client to a server. I want other processes to pass information to the client to pass it to the server. Think of the server as a data logger. I am using concurrent.futures.ProcessPoolExecutor. The processes are working continuously in a while True.
Any hint on how to connect the processes is appreciated,
Thank you.
Edit: I found the answer. All credit goes to whoever posted the following post here.
import multiprocessing
qm = multiprocessing.Manager()
pqueue = qm.Queue()
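A minimal sketch of how such a Manager queue might be shared with ProcessPoolExecutor workers (the worker/client names and message counts below are just illustrative):

import concurrent.futures
import multiprocessing

def worker(q, worker_id):
    # each worker pushes its messages onto the shared queue
    for i in range(3):
        q.put(f"worker {worker_id}: message {i}")

def client(q, total):
    # the client drains the queue and would forward each item to the server
    for _ in range(total):
        print("forwarding", q.get())

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    q = manager.Queue()  # unlike multiprocessing.Queue, this proxy can be passed to executor workers
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(worker, q, i) for i in range(2)]
        concurrent.futures.wait(futures)
    client(q, total=6)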
In general, processes are not able to talk to each other directly. The memory space for a process is allocated by the OS and does not overlap with that of other processes. Parent processes (like your original python process) have some limited visibility of child processes (like the processes created using ProcessPoolExecutor), but no built-in way to pass messages between them.
You have a couple of options here.
1) Use threads. If you don't actually need parallelism (e.g. using all cores of a multi-core system), this option might work. Threads all run within a single process and share its memory space, so you can write to an object from one thread and read it from another. concurrent.futures.ThreadPoolExecutor will let you do this (see the sketch after this list).
2) Use the underlying OS. In general, when two processes running on the same system need to communicate, they do it over ports. The process that is receiving messages binds to a port; the process that is sending messages then establishes a socket connection over that port and uses this connection to communicate. Refer to https://docs.python.org/3.7/howto/sockets.html for more details.
3) Use a message broker like RabbitMQ or Redis. These are external processes that facilitate communication between your processes.
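For example, option 1 boils down to sharing an ordinary object between threads; a rough sketch (the names here are illustrative):

import concurrent.futures
import queue

messages = queue.Queue()  # shared freely, since all threads live in one process

def producer_task(i):
    messages.put(f"update {i}")   # any thread can write to the shared queue

def client_task(expected):
    for _ in range(expected):
        print("forwarding", messages.get())   # the "client" thread reads and forwards

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(producer_task, i) for i in range(3)]
    futures.append(pool.submit(client_task, 3))
    concurrent.futures.wait(futures)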

If I am listening to a websocket in one thread and running a function in another thread is it possible to miss messages

Title says it all really. I am running a program on a Linux EC2 instance with 4 threads. Three of these are listening to different websockets and the final one is webscraping and calling off a set of other functions when needed.
Is it possible that if the GIL is owned by the 4th thread (i.e. it is currently running its calculation on the single core), websocket messages could be 'missed' by the threads listening?
I am beginning to think it isn't possible, but have no understanding as to why. I have looked around, but to little avail.
Not really. Even if your application is completely blocked, say by scheduling or simply sleeping, the operating system will queue the incoming network messages. You might lose messages if the TCP buffer starts to overflow, but I reckon that is unlikely in your case. You can test this by deliberately sleeping in the 4th thread for some time and checking whether messages are dropped.
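A rough way to convince yourself of this, using a plain TCP socket instead of a websocket (everything here is illustrative): messages sent while the receiving thread cannot run are buffered by the kernel and read later, not lost.

import socket
import threading
import time

def receiver(conn):
    time.sleep(2)  # pretend the GIL was held elsewhere for two seconds
    print("received after the pause:", conn.recv(4096))

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
host, port = server.getsockname()

client = socket.create_connection((host, port))
conn, _ = server.accept()

t = threading.Thread(target=receiver, args=(conn,))
t.start()

for i in range(5):
    client.sendall(f"msg {i};".encode())  # sent while the receiver is paused
    time.sleep(0.1)

t.join()
client.close(); conn.close(); server.close()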

ZeroMQ cleaning up PULL socket - half-close

My problem is kind of trying to half-close a zmq socket.
In simple terms I have a pair of PUSH/PULL sockets in Python.
The PUSH socket never stops sending, but the PULL socket should be able to clean itself up in a following way:
Stop accepting any additional messages to the queue
Process the messages still in the queue
Close the socket etc.
I don't want to affect the PUSH socket in any way; it can keep accumulating its own queue until another PULL socket comes around, or one that is already there takes over. The LINGER option doesn't seem to help with recv() (just with send()).
One option might be to have a broker in between with the broker PUSH and receiver PULL HWM set to zero. Then the broker's PULL would accumulate the messages. However, I'd rather not do this. Is there any other way?
I believe you are confusing which socket type will queue messages. According to the zmq_socket docs, a PUSH socket will queue its messages but a PULL socket doesn't have any type of queuing mechanism.
So what you're asking to be able to do would be something of the following:
1) Stop recv'ing any additional messages to the PULL socket.
2) Close the socket etc.
The PUSH socket will continue to 'queue' its messages automatically until either the HWM is met (at which it will then block and not queue any more messages) or a PULL socket comes along and starts recv'ing messages.
The case I think you're really concerned about is a slow PULL reader, where you would like to get all of the currently queued messages in the PUSH socket (at once?) and then quit. This isn't how zmq works; you get one message at a time.
To implement something of this sort, you'll have to wrap the PULL capability with your own queue. You 'continually' PULL the messages into your own queue (in a different thread?) until you want to stop, then process those messages and quit.
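Something along these lines, assuming pyzmq (the endpoint, timeout and stop mechanism are illustrative):

import queue
import threading
import zmq

local_queue = queue.Queue()
stop_event = threading.Event()

def pull_loop():
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.PULL)
    sock.connect("tcp://127.0.0.1:5000")
    poller = zmq.Poller()
    poller.register(sock, zmq.POLLIN)
    while not stop_event.is_set():
        # poll so the stop request is noticed without blocking forever
        if poller.poll(timeout=100):
            local_queue.put(sock.recv())
    sock.close(0)  # stop accepting new messages immediately

t = threading.Thread(target=pull_loop)
t.start()

# ... later, when it is time to shut down:
stop_event.set()
t.join()

# process whatever was already pulled into the local queue
while not local_queue.empty():
    msg = local_queue.get()
    # handle msg ...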

What is the right ZMQ architecture for a webserver sending fire-and-forget tasks to a bunch of webservers?

I have a website which sends out heavy processing tasks to a worker server. Right now, there is only one worker server however in the future more will be added. These jobs are quite time-consuming (takes 5mins - 1 hour). The idea is to have a configuration where just building a new worker server should suffice to increase the capacity of the whole system, without needing extra configuration in the webserver parts.
Currently, I've done a basic implementation using python-zeromq, with the PUSH/PULL architecture.
Every time there's a new job request, the webserver creates a socket, connects to one of the workers and sends the job (no reply needed, this is a fire-and-forget type of job):
context = zmq.Context()
socket = context.socket(zmq.PUSH)
socket.connect("tcp://IP:5000")
socket.send(msg)
And on the worker side this is running all the time:
context = zmq.Context()
socket = context.socket(zmq.PULL)
# bind to the port on its own IP
socket.bind("tcp://IP:5000")
print("Listening for messages...")
while True:
    msg = socket.recv()
    <do something>
Now I have looked more into this, and I think this is not quite the right way of doing it, since adding a new worker server would require adding its IP to the webserver script, connecting to all of the workers, and so on.
I would rather have the webserver keep one persistent socket open (instead of creating one every time), and have the workers connect to the webserver instead. Sort of like here:
https://github.com/taotetek/blog_examples/blob/master/python_multiprocessing_with_zeromq/workqueue_example.py
In short, as opposed to what is above, the webserver's socket binds to its own IP and the workers connect to it. I suppose jobs are then dispatched round-robin.
However, what I'm worried about is: what happens if the webserver gets restarted (something that happens quite often) or goes offline for a while? Using zeromq, will all the worker connections hang or somehow become invalid? If the webserver goes down, will the current queue disappear?
In the current setup things seem to run somewhat OK, but I'm not 100% sure what the right (and not too complex) way of doing this is.
From the ZeroMQ Guide:
Components can come and go dynamically and ØMQ will automatically reconnect.
If the underlying tcp connection is broken, ZeroMQ will repeatedly try to reconnect, sending your message once the connection succeeds.
Note that PAIR sockets are an exception. They don't automatically reconnect. (See the zmq_socket docs.)
Binding on the server might work. Are you sure you won't ever need more than one web server, though? I'd consider putting a broker between your server(s) and workers.
Either way, I think persistent sockets are the way to go.
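A rough sketch of the inverted arrangement, assuming pyzmq (the port and function names are illustrative): the webserver binds one long-lived PUSH socket and workers connect PULL sockets to it, so adding workers needs no changes on the webserver side.

import zmq

# webserver side: create once at startup and keep the socket around
ctx = zmq.Context.instance()
push = ctx.socket(zmq.PUSH)
push.bind("tcp://*:5000")

def submit_job(msg: bytes):
    # connected PULL workers receive jobs round-robin; if none are connected
    # (or the webserver just restarted), messages queue up to the HWM
    push.send(msg)

# worker side (each worker, possibly on another machine):
# ctx = zmq.Context.instance()
# pull = ctx.socket(zmq.PULL)
# pull.connect("tcp://WEBSERVER_IP:5000")  # workers reconnect automatically
# while True:
#     job = pull.recv()
#     ...  # do the heavy processing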

What's the best pattern to design an asynchronous RPC application using Python, Pika and AMQP?

The producer module of my application is run by users who want to submit work to be done on a small cluster. It sends the submissions in JSON form through the RabbitMQ message broker.
I have tried several strategies, and the best so far is the following, which is still not fully working:
Each cluster machine runs a consumer module, which subscribes itself to the AMQP queue and issues a prefetch_count to tell the broker how many tasks it can run at once.
I was able to make it work using SelectConnection from the Pika AMQP library. Both consumer and producer start two channels, one connected to each queue. The producer sends requests on channel [A] and waits for responses on channel [B], while the consumer waits for requests on channel [A] and sends responses on channel [B]. It seems, however, that when the consumer runs the callback that calculates the response, it blocks, so only one task is executed by each consumer at a time.
What I need in the end:
the producer [A] submits its tasks (around 5k each time) to the cluster
the broker dispatches N messages/requests for each consumer, where N is the number of concurrent tasks it can handle
when a single task is finished, the consumer replies to the broker/producer with the result
the producer receives the replies, updates the computation status and, in the end, prints some reports
Restrictions:
If another user submits work, all of their tasks will be queued after those of the previous user (I guess this follows automatically from the queue system, but I haven't thought about the implications in a threaded environment)
Tasks have an order to be submitted, but the order they are replied is not important
UPDATE
I have studied this a bit further and my actual problem seems to be that I use a simple function as the callback to pika's SelectConnection.channel.basic_consume() function. My last (unimplemented) idea is to pass a threaded function instead of a regular one, so the callback would not block and the consumer could keep listening.
As you have noticed, your process blocks when it runs a callback. There are several ways to deal with this depending on what your callback does.
If your callback is IO-bound (doing lots of networking or disk IO) you can use either threads or a greenlet-based solution, such as gevent, eventlet, or greenhouse. Keep in mind, though, that Python is limited by the GIL (Global Interpreter Lock), which means that only one piece of python code is ever running in a single python process. This means that if you are doing lots of computation with python code, these solutions will likely not be much faster than what you already have.
Another option would be to implement your consumer as multiple processes using multiprocessing. I have found multiprocessing very useful for parallel work. You could implement this either by using a Queue, with the parent process acting as the consumer and farming out work to its children, or by simply starting multiple processes which each consume on their own connection. Unless your application is highly concurrent (thousands of workers), I would suggest simply starting multiple workers, each consuming on its own connection. This way you can use the acknowledgement feature of AMQP, so if a consumer dies while still processing a task, the message is automatically returned to the queue and picked up by another worker, rather than the request simply being lost.
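A rough sketch of the "one consumer per process" variant, assuming pika >= 1.0 (the queue name, host and worker count are illustrative):

import multiprocessing
import pika

def consume(worker_id):
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue="tasks", durable=True)
    channel.basic_qos(prefetch_count=1)  # at most one unacked task per worker

    def on_message(ch, method, properties, body):
        # ... do the (possibly CPU-bound) work here ...
        ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only when done

    channel.basic_consume(queue="tasks", on_message_callback=on_message)
    channel.start_consuming()

if __name__ == "__main__":
    for i in range(4):  # number of processes ~ desired parallelism
        multiprocessing.Process(target=consume, args=(i,)).start()

Because each message is acknowledged only after the work finishes, a worker process dying mid-task means the broker redelivers that message to another worker.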
A last option, if you control the producer and it is also written in Python, is to use a task library like celery to abstract the task/queue workings for you. I have used celery for several large projects and have found it to be very well written. It will also handle the multiple consumer issues for you with the appropriate configuration.
Your setup sounds good to me. And you are right, you can simply set the callback to start a thread and chain that to a separate callback when the thread finishes to queue the response back over Channel B.
Basically, your consumers should have a queue of their own (of size N, the amount of parallelism they support). When a request comes in via Channel A, it should be stored in a queue shared between the main thread running Pika and the worker threads in the thread pool. As soon as it is queued, pika should respond with an ACK, and a worker thread would wake up and start processing.
Once a worker is done with its work, it would put the result on a separate result queue and issue a callback to the main thread to send it back over Channel B.
You should take care and make sure that the worker threads are not interfering with each other if they are using any shared resources, but that's a separate topic.
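A bare-bones sketch of that structure, with the Pika-specific parts reduced to comments (queue sizes and names are illustrative):

import queue
import threading

N = 4                                # amount of parallelism the consumer supports
requests = queue.Queue(maxsize=N)    # filled by the Pika thread (Channel A)
results = queue.Queue()              # drained by the Pika thread (Channel B)

def worker():
    while True:
        task = requests.get()
        if task is None:             # sentinel used to stop the worker
            break
        # ... compute the result for this task ...
        results.put(("result for", task))

workers = [threading.Thread(target=worker) for _ in range(N)]
for w in workers:
    w.start()

# In the Pika callback (main thread): requests.put(body), then ack the message.
# Back in the Pika loop: drain `results` and publish each item over Channel B.

for w in workers:
    requests.put(None)   # shut the pool down in this standalone sketch
for w in workers:
    w.join()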
Since I am not experienced with threading, my setup would instead run multiple consumer processes (the number basically being your prefetch count). Each would connect to the two queues and process jobs happily, unaware of the others' existence.
