I started using ZeroMQ this week, and when using the Request-Response pattern I am not sure how to have a worker safely "hang up" and close his socket without possibly dropping a message and causing the customer who sent that message to never get a response. Imagine a worker written in Python who looks something like this:
import zmq

c = zmq.Context()
s = c.socket(zmq.REP)
s.connect('tcp://127.0.0.1:9999')
for i in range(8):       # serve a fixed number of requests, then leave
    s.recv()             # wait for the next request
    s.send(b'reply')     # answer it
s.close()
I have been doing experiments and have found that a customer at 127.0.0.1:9999 of socket type zmq.REQ who makes a fair-queued request just might have the misfortune of having the fair-queuing algorithm choose the above worker right after the worker has done its last send() but before it runs the following close() method. In that case, it seems that the request is received and buffered by the ØMQ stack in the worker process, and that the request is then lost when close() throws out everything associated with the socket.
How can a worker detach "safely" — is there any way to signal "I don't want messages anymore", then (a) loop over any final messages that have arrived during transmission of the signal, (b) generate their replies, and then (c) execute close() with the guarantee that no messages are being thrown away?
Edit: I suppose the raw state that I would want to enter is a "half-closed" state, where no further requests could be received — and the sender would know that — but where the return path is still open so that I can check my incoming buffer for one last arrived message and respond to it if there is one sitting in the buffer.
Edit: In response to a good question, corrected the description to make the number of waiting messages plural, as there could be many connections waiting on replies.
You seem to think that you are trying to avoid a “simple” race condition such as in
... = zmq_recv(fd);
do_something();
zmq_send(fd, answer);
/* Let's hope a new request does not arrive just now, please close it quickly! */
zmq_close(fd);
but I think the problem is that fair queuing (round-robin) makes things even more difficult: you might even already have several requests queued at your worker. The sender will not wait for your worker to be free before sending a new request if it is your worker's turn to receive one, so by the time you call zmq_send other requests might already be waiting.
In fact, it looks like you might have selected the wrong data direction. Instead of having a requests pool send requests to your workers (even when you would prefer not to receive new ones), you might want to have your workers fetch a new request from a requests queue, take care of it, then send the answer.
Of course, it means using XREP/XREQ, but I think it is worth it.
Edit: I wrote some code implementing the other direction to explain what I mean.
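In rough outline, the reversed direction might look like this on the queue side: a minimal sketch, assuming a ROUTER/XREP socket on tcp://127.0.0.1:5555 that hands a job to whichever worker asks for one (the endpoint and message framing are illustrative, not a fixed protocol):

import zmq

ctx = zmq.Context()
frontier = ctx.socket(zmq.ROUTER)       # XREP in older ZeroMQ terminology
frontier.bind('tcp://127.0.0.1:5555')

jobs = [b'job-%d' % i for i in range(100)]

while jobs:
    # A worker's REQ request arrives as [identity, empty delimiter, payload];
    # the payload can be b'ready' or carry the result of the previous job.
    identity, _empty, payload = frontier.recv_multipart()
    frontier.send_multipart([identity, b'', jobs.pop(0)])

The key property is that a worker only ever receives a job it has explicitly asked for, so there is nothing queued at the worker when it decides to stop asking.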
I think the problem is that your messaging architecture is wrong. Your workers should use a REQ socket to send a request for work and that way there is only ever one job queued at the worker. Then to acknowledge completion of the work, you could either use another REQ request that doubles as ack for the previous job and request for a new one, or you could have a second control socket.
Some people do this using PUB/SUB for the control so that each worker publishes acks and the master subscribes to them.
You have to remember that with ZeroMQ there are 0 message queues. None at all! Just messages buffered in either the sender or receiver depending on settings like High Water Mark, and type of socket. If you really do need message queues then you need to write a broker app to handle that, or simply switch to AMQP where all communication is through a 3rd party broker.
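A minimal sketch of the worker side of that idea, where the request for the next job doubles as the ack/result for the previous one (the endpoint and framing are illustrative assumptions):

import zmq

ctx = zmq.Context()
queue = ctx.socket(zmq.REQ)
queue.connect('tcp://127.0.0.1:5555')

previous_result = b'ready'          # the first request just announces availability
while True:
    queue.send(previous_result)     # ack/result for the last job + ask for the next
    job = queue.recv()
    previous_result = job.upper()   # stand-in for real work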
I've been thinking about this as well. You may want to implement a CLOSE message which notifies the customer that the worker is going away. You could then have the worker drain for a period of time before shutting down. Not ideal, of course, but might be workable.
There is a conflict of interest between sending requests as rapidly as possible to workers and getting reliability in case a worker crashes or dies. There is an entire section of the ZeroMQ Guide that explains different answers to this question of reliability. Read that, it'll help a lot.
tl;dr workers can/will crash and clients need a resend functionality. The Guide provides reusable code for that, in many languages.
Wouldn't the simplest solution be to have the customer timeout when waiting for the reply and then retry if no reply is received?
Try sleeping before the call to close. This is fixed in 2.1 but not in 2.0 yet.
Related
I'm currently working on a Benchmark project, where I'm trying to stress the server out with zmq requests.
I was wondering what would be the best way to approach this. I was thinking of having one context create a socket for each thread and pushing each socket into its own thread, where I would send requests and wait for responses, but I'm not sure this is possible given Python's limitations.
Moreover, would it be the same socket for all threads? That is, if I'm waiting for a response on one thread (with its own socket), would it be possible for another thread to catch that response?
Thanks.
EDIT:
Test flow logic would be like this:
Client socket would use zmq.REQ.
Client sends message.
Client waits for a response.
If no response arrives, the client reconnects and tries again, up to a retry limit.
I'd like to scale this up to any number of clients, preferring not to deal with Processes unless the performance difference is significant.
How would you do this?
Q : "...can I have one context and use several sockets?"
Oh sure you can.
Moreover, you can have several Context()-instances, each one managing ... almost ... any number of Socket()-instances, where each Socket()-instance's methods may get called from one and only one python-thread ( a Zen-of-Zero rule: zero-sharing ).
Due to the known GIL-lock re-[SERIAL]-isation of all thread-based code-execution flow, every such call still has to, and will, wait to acquire the GIL-lock ownership, which in turn permits the GIL-lock owner ( and nobody else ) to execute a fixed amount of python instructions before it re-releases the GIL-lock to some other thread...
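For instance, a rough sketch of one shared Context() with one REQ-socket per thread, including the timeout-and-retry flow from the question (the endpoint, timeout, and retry limit are illustrative assumptions):

import threading
import zmq

ctx = zmq.Context()           # one Context can safely be shared by all threads

def client(client_id, n_requests=10, retries=3):
    sock = ctx.socket(zmq.REQ)            # ...but each Socket stays in one thread
    sock.connect('tcp://127.0.0.1:5555')  # illustrative endpoint
    sock.setsockopt(zmq.RCVTIMEO, 1000)   # 1 s receive timeout
    for i in range(n_requests):
        for attempt in range(retries):
            sock.send_string('client %d request %d' % (client_id, i))
            try:
                sock.recv_string()
                break                     # got a reply, move to the next request
            except zmq.Again:
                # No reply in time: REQ forbids a second send, so rebuild the socket.
                sock.close(linger=0)
                sock = ctx.socket(zmq.REQ)
                sock.connect('tcp://127.0.0.1:5555')
                sock.setsockopt(zmq.RCVTIMEO, 1000)
    sock.close()

threads = [threading.Thread(target=client, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()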
I am trying to build a GUI for existing Python code using Electron.
The data flow is actually straight-forward: The user interacts with the Electron app, which sends a request to the Python API, which processes the request and sends a reply.
So far, so good. I read different threads and blog posts:
ZeroRPC solutions:
https://medium.com/@abulka/electron-python-4e8c807bfa5e
https://github.com/fyears/electron-python-example
Spawn Python API as child process from node.js and communicate directly:
https://www.ahmedbouchefra.com/connect-python-3-electron-nodejs-build-desktop-apps/
This does not seem like the smartest solution to me, since using zeroRPC or zeroMQ makes it easier to change the frontend architecture without touching the backend code.
Use zeroMQ sockets (for example exclusive pair?)
https://zeromq.org/socket-api/#exclusive-pair-pattern
But in all three solutions, I struggle at the same point: I have to make asynchronous requests/replies, because request processing can take some time, and further requests can arrive during that time. This looks like a very common pattern to me, but I found nothing on SO; maybe I just don't know what exactly I am looking for.
Frontend Backend
| |
REQ1 |—————————————————————————————>|Process REQ1——--
| | |
REQ2 |—————————————————————————————>|Process REQ2 --|----—
| | | |
REP1 |<————————————————————————————-|REPLY1 <——————— |
| | |
REP2 |<————————————————————————————-|REPLY2 <———————————--
| |
The most flexible solution seems to me to be option 3, zeroMQ, but on the website and in the Python docs I found only minimal working examples, in which both send and receive are blocking.
Could anybody give me a hint?
If you're thinking of using ZeroMQ, you are entering into the world of Actor model programming. In actor model programming, sending a message happens independently of receiving that message (the two activities are asynchronous).
What ZeroMQ means by Blocking
When ZeroMQ talks about a send "blocking", what that means is that the internal buffer ZeroMQ uses to queue up messages prior to transmission is full, so it blocks the sending application until there is space available in this queue. The thing that empties the queue is the successful transfer of earlier messages to the receiver, whose receive buffer has to be emptied by the receiving application. The thing that actually transfers the messages is the management thread(s) belonging to the ZeroMQ context.
This management thread is the crucial part; it runs independently of your own application threads, and so it makes the communications between sender and receiver asynchronous.
What you likely want is to use ZeroMQ's reactor, zmq_poll(). Typically in actor model programming you have a loop, and at the top is a call to the reactor (zmq_poll() in this case). Zmq_poll() tells you when something has happened, but here you'd primarily be interested in it telling you that a message has arrived. Typically you'd then read that message, process it (which may involve sending out other ZeroMQ messages), and then loop back to zmq_poll().
Backend
So your backend would be something like:
while (forever)
{
zmq_poll(list of input sockets) // allows serving more than one socket
zmq_recv(socket that has a message ready to read) // will always succeed immediately because zmq_poll() told us there was a message waiting
decode req message
generate reply message
zmq_send(reply to original requester) // Socket should be in blocking mode to ensure that messages don't get lost if something is unexpectedly running slowly
}
If you don't want to serve more than one Front end, it's simpler:
while (forever)
{
zmq_recv(req) // Socket should be in blocking mode
decode req message
generate reply message
zmq_send(reply) // Socket should also be in blocking mode to ensure that messages don't get lost if something is unexpectedly running slow
}
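For reference, a rough Python sketch of the poll-based variant above, using pyzmq's Poller (the endpoint and JSON framing are illustrative assumptions):

import zmq

ctx = zmq.Context()
rep = ctx.socket(zmq.REP)
rep.bind('tcp://*:5555')          # illustrative endpoint

poller = zmq.Poller()
poller.register(rep, zmq.POLLIN)  # further sockets can be registered here too

while True:
    events = dict(poller.poll())  # block until at least one socket is readable
    if rep in events:
        request = rep.recv_json() # will not block: poll said a message is waiting
        reply = {'echo': request} # stand-in for decoding and processing the request
        rep.send_json(reply)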
Frontend
Your front end will be different. Basically, you'll need the Electron event loop handler to take over the role of zmq_poll(). A build of ZeroMQ for use within Electron will have taken care of that. But basically it will come down to GUI event callbacks sending ZeroMQ messages. You will also have to write a callback for Electron to run when a message arrives on the socket from the backend. There'll be no blocking in the front end between sending and receiving a message.
Timing
This means that the timing diagram you've drawn is wrong. The front end can send out as many requests as it wants, but there's no timing alignment between those requests departing and arriving in the backend (though assuming everything is running smoothly, the first one will arrive pretty much straight away). Having sent a request or requests, the front end simply returns to doing whatever it wants (which, for a User Interface, is often nothing but the event loop manager waiting for an event).
That backend will be in a loop of read/process/reply, read/process/reply, handling the requests one at a time. Again there is no timing alignment between those replies departing and subsequently arriving in the front end. When a reply does arrive back in the front end, it wakes up and deals with it.
I'm working with Django1.8 and Python2.7.
In a certain part of the project, I open a socket and send some data through it. Due to the way the other end works, I need to leave some time (let's say 10 milliseconds) between each piece of data that I send:
from time import sleep

while True:
    send(data)    # push the next chunk to the external service
    sleep(0.01)   # 10 ms pause between chunks
So my question is: is it considered bad practice to simply use sleep() to create that pause? Is there maybe a more efficient approach?
UPDATED:
The reason why I need to create that pause is that the other end of the socket is an external service that takes some time to process the chunks of data I send. I should also point out that it doesn't return anything after receiving, let alone processing, the data. Leaving that brief pause ensures that each chunk of data that I send gets properly processed by the receiver.
EDIT: changed the sleep to 0.01.
Yes, this is bad practice and an anti-pattern. You will tie up the "worker" which is processing this request for an unknown period of time, which will make it unavailable to serve other requests. The classic pattern for web applications is to service a request as-fast-as-possible, as there is generally a fixed or max number of concurrent workers. While this worker is continually sleeping, it's effectively out of the pool. If multiple requests hit this endpoint, multiple workers are tied up, so the rest of your application will experience a bottleneck. Beyond that, you also have potential issues with database locks or race conditions.
The standard approach to handling your situation is to use a task queue like Celery. Your web-application would tell Celery to initiate the task and then quickly finish with the request logic. Celery would then handle communicating with the 3rd party server. Django works with Celery exceptionally well, and there are many tutorials to help you with this.
If you need to provide information to the end-user, then you can generate a unique ID for the task and poll the result backend for an update by having the client refresh the URL every so often. (I think Celery will automatically generate a guid, but I usually specify one.)
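A minimal sketch of that hand-off (the task name, broker/backend URLs, and the send() stub are illustrative assumptions, not working configuration):

from time import sleep
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/1')

def send(chunk):
    """Placeholder for the socket send() from the question."""
    pass

@app.task
def push_chunks(chunks):
    for chunk in chunks:
        send(chunk)
        sleep(0.01)   # the 10 ms pacing now runs in the Celery worker
    return len(chunks)

# In the Django view: kick the task off and return immediately.
#   result = push_chunks.delay(list_of_chunks)
#   result.id can be handed to the client so it can poll for completion.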
Like most things, short answer: it depends.
Slightly longer answer:
If you're running it in an environment where you have many (50+ for example) connections to the webserver, all of which are triggering the sleep code, you're really not going to like the behavior. I would strongly recommend looking at using something like celery/rabbitmq so Django can dump the time delayed part onto something else and then quickly respond with a "task started" message.
If this is production, but you're the only person hitting the webserver, it still isn't great design, but if it works, it's going to be hard to justify the extra complexity of the task queue approach mentioned above.
I'm trying to implement a distributed PUSH/PULL (some kinda MapReduce) model with Python and ZMQ, as it's described here: http://taotetek.net/2011/02/02/python-multiprocessing-with-zeromq/ . In this example, result_manager knows exactly how many messages to wait and when to send "FINISHED" state to workers.
Let's assume I have a big but finite stream of data of unknown length. In this case I can't know exactly where to stop. I tried sending "FINISHED" from the ventilator at the end instead of from result_manager, but, of course, the workers receive it in the middle of processing (since it travels on a separate channel) and die immediately, so a lot of data is lost.
Alternatively, if I use the same work_message queue to send the "FINISHED" state, it is captured by the first available worker while the others hang, which is also expected.
Is there any other model I should use here? Or can you please point me to some best practices for this case?
Alternatively, if I use the same work_message queue to send the "FINISHED" state, it is captured by the first available worker while the others hang, which is also expected.
You can easily work around this.
Send "FINISH" from VENTILATOR to RESULT_MANAGER PULL socket.
RESULT_MANAGER receives "FINISH" and publishes this message to all WORKERS through PUB socket.
All WORKERS receive "FINISH" message on SUB sockets and kill themselves.
The ZMQ Guide's divide-and-conquer example shows how to send something from the VENTILATOR to the RESULT_MANAGER in this design pattern.
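On the worker side, this usually means polling the control SUB socket alongside the work PULL socket; a minimal sketch (the endpoints are illustrative assumptions):

import zmq

ctx = zmq.Context()

work = ctx.socket(zmq.PULL)
work.connect('tcp://127.0.0.1:5557')      # ventilator's PUSH endpoint
control = ctx.socket(zmq.SUB)
control.connect('tcp://127.0.0.1:5559')   # result_manager's PUB endpoint
control.setsockopt(zmq.SUBSCRIBE, b'')

poller = zmq.Poller()
poller.register(work, zmq.POLLIN)
poller.register(control, zmq.POLLIN)

while True:
    events = dict(poller.poll())
    if work in events:
        job = work.recv()
        # ... process the job and push the result to the result_manager ...
    if control in events and control.recv() == b'FINISH':
        # Drain anything already queued on the work socket before exiting.
        while True:
            try:
                job = work.recv(zmq.NOBLOCK)
                # ... process the remaining job ...
            except zmq.Again:
                break
        break

ctx.destroy()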
My problem is kind of trying to half-close a zmq socket.
In simple terms I have a pair of PUSH/PULL sockets in Python.
The PUSH socket never stops sending, but the PULL socket should be able to clean itself up in a following way:
Stop accepting any additional messages to the queue
Process the messages still in the queue
Close the socket etc.
I don't want to affect the PUSH socket in any way, it can keep accumulating its own queue until another PULL socket comes around or that might be there already. The LINGER option doesn't seem to work with recv() (just with send()).
One option might be to have a broker in between with the broker PUSH and receiver PULL HWM set to zero. Then the broker's PULL would accumulate the messages. However, I'd rather not do this. Is there any other way?
I believe you are confusing which socket type will queue messages. According to the zmq_socket docs, a PUSH socket will queue its messages but a PULL socket doesn't have any type of queuing mechanism.
So what you're asking to be able to do would be something of the following:
1) Stop recv'ing any additional messages on the PULL socket.
2) Close the socket etc.
The PUSH socket will continue to 'queue' its messages automatically until either the HWM is met (at which it will then block and not queue any more messages) or a PULL socket comes along and starts recv'ing messages.
The case I think you're really concerned about is a slow PULL reader, where you would like to get all of the currently queued messages from the PUSH socket (at once?) and then quit. That isn't how zmq works; you get one message at a time.
To implement something of this sort, you'll have to wrap the PULL capability with your own queue. You 'continually' PULL the messages into your personal queue (in a different thread?) until you want to stop, then process those messages and quit.
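A rough sketch of that wrapper (the endpoint and the stop signal are illustrative assumptions):

import queue
import threading
import zmq

ctx = zmq.Context()
local = queue.Queue()
stop = threading.Event()

def puller():
    sock = ctx.socket(zmq.PULL)
    sock.connect('tcp://127.0.0.1:5557')
    sock.setsockopt(zmq.RCVTIMEO, 100)    # wake up periodically to check `stop`
    while not stop.is_set():
        try:
            local.put(sock.recv())
        except zmq.Again:
            pass                          # no message this interval; check again
    sock.close(linger=0)                  # stop accepting anything further

t = threading.Thread(target=puller)
t.start()

# ... later, when it is time to "half-close" ...
stop.set()
t.join()
while not local.empty():                  # process what was already accepted
    msg = local.get()
    # ... process msg ...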