How to abort context.socket.recv() the right way in ZeroMQ? - python

I have a small program with a separate thread that waits for ZeroMQ messages. I am using ZeroMQ's PUB/SUB communication pattern.
Currently I am aborting that thread by setting a variable "cont_loop" to False.
But I discovered that when no messages arrive at the ZeroMQ subscriber, I cannot exit the thread (without taking down the whole program).
def __init__(self):
    Thread.__init__(self)
    self.cont_loop = True

def abort(self):
    self.cont_loop = False

def run(self):
    zmq_context = zmq.Context()
    zmq_socket = zmq_context.socket(zmq.SUB)
    zmq_socket.bind("tcp://*:%s" % 5556)
    zmq_socket.setsockopt(zmq.SUBSCRIBE, "")
    while self.cont_loop:
        data = zmq_socket.recv()
        print "Message: " + data
    zmq_socket.close()
    zmq_context.term()
    print "exit"
I tried moving socket.close() and context.term() to the abort() method, so that it shuts down the subscriber, but this killed the whole program.
What is the correct way to shut down the above program?

Q: What is the correct way to ... ?
A: There are many ways to achieve the set goal. Let me pick just one, as a mock-up example of how to handle distributed process-to-process messaging.
First, assume that a typical software design task has several priorities. Some are higher, some lower, and some so low that one can defer the execution of these low-priority sub-tasks, so that more time remains in the scheduler for the sub-tasks that cannot wait.
This said, let's view your code. The SUB-side .recv(), as used, does two things. The visible one: it performs a RECEIVE operation on a ZeroMQ socket with SUB behaviour. The less visible one: it remains hanging until it gets something "compatible" with the current state of the SUB behaviour (more on setting this later).
This means it also BLOCKS from the moment of the .recv() call UNTIL some unknown, locally uncontrollable coincidence of states/events makes it deliver a ZeroMQ message whose content is "compatible" with the locally pre-set state of this (still blocking) SUB instance.
That may take ages.
This is exactly why .recv() is better used inside a control loop, where the surrounding code gets both the chance and the responsibility to do what you want (including abort-related operations and a fair, graceful termination with proper release of resources).
The receive step becomes .recv( flags = zmq.NOBLOCK ) inside a try: / except: block. That way your local process does not lose control over the stream of events (including a NOP being one such).
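A minimal sketch of such a control loop, reusing the names, port and abort flag from the question (the concrete sleep interval is an assumption, not a recommendation):

import time
import zmq

def run(self):
    zmq_context = zmq.Context()
    zmq_socket = zmq_context.socket(zmq.SUB)
    zmq_socket.bind("tcp://*:5556")
    zmq_socket.setsockopt(zmq.SUBSCRIBE, "")
    while self.cont_loop:
        try:
            # non-blocking receive: returns at once or raises zmq.Again
            data = zmq_socket.recv(flags=zmq.NOBLOCK)
            print "Message: " + data
        except zmq.Again:
            # nothing arrived yet -- yield briefly and re-check the abort flag
            time.sleep(0.01)
    zmq_socket.close()
    zmq_context.term()
    print "exit"

With this shape, calling abort() from another thread lets run() leave the loop within one sleep interval and release the socket and context cleanly.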
The best next step?
Take your time and work through the great book of gems, "Code Connected, Volume 1", which Pieter HINTJENS, co-father of ZeroMQ, has published (also as a PDF).
Many of his thoughts, and the errors to avoid that he shared with us, are indeed worth your time.
Enjoy the powers of ZeroMQ. It's very powerful & worth getting mastered top-down.

Related

Why ZeroMQ SUBs are missing messages?

I built about 12,000 subscribers per computer with threading, as follows:
subscriber side:
def client(id):
    context = zmq.Context()
    subscriber = context.socket(zmq.SUB)
    subscriber.connect('ip:port')
    subscriber.setsockopt(zmq.SUBSCRIBE, (id + 'e').encode())
    while 1:
        signal = subscriber.recv_multipart()
        # write logs ...

for i in range(12000):
    # j is an arbitrary, unduplicated int
    threading.Thread(target=client, args=(str(i + j * 12000),)).start()
publisher side:
publisher = zmq.Context().socket(zmq.PUB)
publisher.bind('tcp://*:port')
while 1:
    for id in client_id:
        publisher.send_multipart([(id + 'e').encode()] + [message])
When I used more than one computer (by using a different j) to build subscribers, sometimes some subscribers could not receive messages at all.
If I restart the subscribers, those that could not receive messages become normal, but some that were normal become unable to receive messages.
The problem does not raise any errors; it can only be seen in my logs.
Does the excessive number of connections cause this problem?
As the counts of connections / messages / sizes grow larger and larger, some default guesstimates typically cease to suffice. Try to extend some of the otherwise-working defaults on the PUB-side configuration, where the problem seems to start choking (do not forget that since v3.x+ the subscription-list processing got transferred from the SUB-side(s) to the central PUB-side; that reduces the volume of data-flow, yet at some additional cost, here growing to remarkable amounts, on the PUB-side: RAM for buffers + CPU for TOPIC-list filtering).
So, let's start with these steps on the PUB-side :
aSock2SUBs = zmq.Context( _tweak_nIOthreads ).socket( zmq.PUB ) # MORE CPU POWER
aSock2SUBs.setsockopt( zmq.SNDBUF, _tweak_SIZE_with_SO_SNDBUF ) # ROOM IN SNDBUF
And last but not least, PUB-s silently drop any messages that do not "fit" under their current HighWaterMark level, so let's tweak that one too:
aSock2SUBs.setsockopt( zmq.SNDHWM, _tweak_HWM_till_no_DROPs ) # TILL NO DROPS
Other low-level parameter settings { TCP_* | TOS | RECONNECT_IVL* | BACKLOG | IMMEDIATE | HEARTBEAT_* | ... } may further help your herd of 12k+ SUB-s live in peace side by side with other (both friendly and hostile) traffic and make your application more robust than relying just on the pre-cooked API defaults.
Consult the ZeroMQ API documentation together with the O/S defaults, as many of these ZeroMQ low-level attributes also depend on the actual O/S configuration values.
You should also be warned that making 12k+ threads in Python still leaves purely [SERIAL] code execution, as the exclusive ownership of the central GIL-lock prevents (yes, principally prevents) any form of [CONCURRENT] co-execution: it re-[SERIAL]-ises any number of threads into a waiting queue and results in a plain sequence of chunks of execution. (By default, Python 2 switches threads every 100 instructions; since Python 3.2+, by default, the GIL is released after 5 milliseconds (5,000 [us]) so that another thread can have a chance to try to acquire the GIL-lock.) You can change these defaults if the war of 12k+ threads over GIL-lock ownership actually results in "almost blocking" any and all of the TCP/IP instrumentation for message buffering, stacking, sending and re-transmitting until an in-time confirmed reception. One may test it to the bleeding edge, yet choosing a safer ceiling might help if the other parameters have been well adjusted for robustness.
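For illustration only (plain CPython, nothing ZeroMQ-specific), the Python 3.2+ switch interval can be inspected and changed like this:

import sys

print(sys.getswitchinterval())     # default is ~0.005 [s], i.e. 5 [ms]
sys.setswitchinterval(0.001)       # ask the interpreter to hand the GIL over more often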
Last but not least, enjoy the Zen-of-Zero, the masterpiece of Martin SUSTRIK for distributed-computing, so well crafted for ultimately scalable, almost zero-latency, very comfortable, widely ported signalling & messaging framework.
Further to user3666197's answer, you may also have to consider the time taken for all of those clients to connect. The PUBlisher has no idea how many SUBscribers there are supposed to be, and will simply get on with the job of sending out messages to those SUBscribers presently connected, from when the very first connection is made. The PUBlisher socket does not hang on to messages it has sent just in case more SUBscribers connect at some undefined time in the future. Once a message has been transferred to 1 or more SUBscribers, it is dropped from the PUBlisher's queue. Also, the connections are not made instantaneously, and 12,000 is quite a few to get through.
It doesn't matter whether you start your PUBlisher or SUBscriber program first; your 12,000 connections will be made over a period of time once both programs are running, and this happens asynchronously with respect to your own thread(s). Some SUBscribers will start getting messages whilst others will still be unknown to the PUBlisher. When, finally, all 12,000 connections are made, it will smooth out.

How to know if a ZeroMQ socket is ready?

I have a simple PUSH/PULL ZeroMQ code in Python. It looks like below.
def zmqtest(self):
    print('zmq')
    Process(target=start_consumer, args=('1', 9999)).start()
    Process(target=start_consumer, args=('2', 9999)).start()
    ctx = zmq.Context()
    socket = ctx.socket(zmq.PUSH)
    socket.bind('tcp://127.0.0.1:9999')
    # sleep(.5) # I have to wait here...
    for i in range(5):
        socket.send_unicode('{}'.format(i))
The problem is that I have to wait more than 0.5 seconds before sending messages, otherwise only one consumer process receives a message. If I wait more than 0.5 seconds, everything looks fine.
I guess it takes a while for the socket binding to settle down, and that this happens asynchronously.
I wonder if there's a more reliable way to know when the socket is ready.
Sure it takes a while. Sure it is done asynchronously.
Let's first do a bit of damage to the terminology.
ZeroMQ is a great framework. Each distributed-system client willing to use it (except when using just the inproc:// transport class) first instantiates an async data-pumping engine: the Context() instance(s), as needed.
Each Scalable Formal Communication Pattern archetype { PUSH | PULL | ... | XSUB | SUB | PAIR } does not create a socket, but rather instantiates an access-point, which may later .connect() or .bind() to some counterparty (another access-point of a suitable type, in some Context() instance, be it local or not; again, the local-inproc://-only infrastructures are the known exception to this rule).
In this sense, an answer to the question "When is the socket ready?" requires an end-to-end investigation "across" the distributed system, handling all the elements that participate in the socket-like behaviour's implementation.
Testing a "local"-end access-point RTO-state:
For this, your agent may self-connect a receiving access-point (working as a PULL archetype), so as to "sniff" when the local-end Context() instance has reached an RTO-state and the .bind()-created O/S L3+ interface starts distributing the intended agent's PUSH-ed messages.
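A rough sketch of such a self-probe (the address, port and timeout are illustrative assumptions, not values from the question):

import zmq

ctx = zmq.Context()
push = ctx.socket(zmq.PUSH)
push.bind('tcp://127.0.0.1:9999')

probe = ctx.socket(zmq.PULL)                  # the local "sniffer" access-point
probe.connect('tcp://127.0.0.1:9999')

push.send_unicode('probe')
if probe.poll(timeout=1000):                  # the local end is RTO once the probe arrives
    probe.recv()
probe.close()                                 # remove the sniffer from the round-robin set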
Testing a "remote"-agent's RTO-state:
This part can use indirect or explicit testing. An indirect way may use a message-embedded index, containing a rising number (an ordinal), which carries a weak piece of ordering information. Given that the PUSH-side message-routing strategy is round-robin, the local agent can be sure that as long as its local PULL access-point receives all messages in a contiguous sequence of ordinals, there is no other "remote" PULL-ing agent in an RTO-state. Once the "local" PULL access-point sees a "gap" in the stream of ordinals, that means (sure, only if all the PUSH's .setsockopt()-s were set up properly) there is another, non-local, PULL-ing agent in an RTO-state.
Is this useful?
Maybe yes, maybe not. The point was to better understand the new challenges that any distributed-system has to somehow cope with.
The nature of multi-stage message queuing, the multi-layered implementation (local-PUSH-agent's code, local Context()-thread(s), local O/S, local kernel, LAN/WAN, remote kernel, remote O/S, remote Context()-thread(s), remote-PULL-agent's code, to name just a few) and multi-agent behaviour simply introduce many places where an operation may gain latency / block / deadlock / fail in some other manner.
Yes, a walk on a wild-side.
Nevertheless, one may opt to use much richer, explicit signalling (besides the initially intended raw-data transport) and help solve the context-specific, signalling-RTO-aware behaviour inside multi-agent worlds, which may better reflect the actual situations and survive the other issues that start to appear in the non-monolithic worlds of distributed systems.
Explicit signalling is one way to cope with this.
Fine-tune the ZeroMQ infrastructure. Forget using defaults. Always!
Recent API versions have started to add more options to fine-tune ZeroMQ behaviour for particular use-cases. Be sure to read carefully all the details available to set up the Context()-instance and to tweak the socket-instance access-point behaviour, so that it best matches your distributed-system signalling + transport needs:
.setsockopt( ZMQ_LINGER, 0 ) # always, indeed ALWAYS
.setsockopt( ZMQ_SNDBUF, .. ) # always, additional O/S + kernel rules apply ( read more about proper sizing )
.setsockopt( ZMQ_SNDHWM, .. ) # always, problem-specific data-engineered sizing
.setsockopt( ZMQ_TOS, .. ) # always, indeed ALWAYS for critical systems
.setsockopt( ZMQ_IMMEDIATE, .. ) # prevents "losing" messages pumped into incomplete connections
and many more. Without these, the design would remain nailed into a coffin in the real-world transactions' jungle.
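In pyzmq terms, and with purely illustrative values, that might look like this:

import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PUSH)
sock.setsockopt(zmq.LINGER, 0)        # never block indefinitely on close
sock.setsockopt(zmq.IMMEDIATE, 1)     # queue only onto completed connections
sock.setsockopt(zmq.SNDHWM, 1000)     # illustrative sizing -- tune to your data
sock.setsockopt(zmq.SNDBUF, 1048576)  # illustrative sizing -- O/S rules also apply
sock.bind('tcp://127.0.0.1:9999')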

Listening for GPIB events

I am controlling a test system using PyVISA/GPIB. The system is comprised of two separate testers (A and B) and a laptop. The laptop passively listens for a GPIB message from tester A; when it is received, the laptop triggers tester B.
I am using the following code to passively listen for events from tester A:
rm = visa.ResourceManager()
con = "GPIB0::3"
tester_A = rm.get_instrument(con, timeout=5000)
while True:
    event = None
    try:
        event = tester_A.read_raw()
    except VisaIOError:
        logger.warning("Timeout expired.")
    if event is not None:
        # Do something
        pass
Is there a better way to listen for and respond to events from tester A? Is there a better way to control this system via GPIB?
The approach you describe will work but, as you are experiencing, it is not ideal if you are not quite sure when the instrument is going to respond. The solution lies in using GPIB's service request (SRQ) functionality.
In brief, the GPIB connection also provides various status registers that allow you to quickly check, for example, whether the instrument is on, whether an error has occurred, etc. (pretty picture). Some of the bits in this register can be set so that they turn on or off after particular events, for example when an operation is complete. This means you can tell the instrument to execute a series of commands that you suspect will take a while, and then to flip a bit in the status register to indicate that it is done.
From within your software you can do a number of things to make use of this:
Keep looping through a while loop until the status bit indicates that the operation is complete - this is very crude and I wouldn't recommend it.
VISA has a viWaitOnEvent function that allows you to wait until the status bit indicates that the operation is complete - a good solution if you need all execution to stop until the instrument has taken a measurement.
VISA also allows you to create an event that occurs when the status bit has flipped - This is a particularly nice solution as it allows you to write an event handler to handle the event.
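A rough sketch of options 2/3 using pyvisa's event API (the resource name, the timeout, and the *SRE mask below are assumptions; the exact status-enable commands depend on your instrument):

import pyvisa
from pyvisa import constants

rm = pyvisa.ResourceManager()
tester_A = rm.open_resource("GPIB0::3::INSTR", timeout=5000)

# Tell the instrument when to assert SRQ (instrument-specific; *SRE 16 enables
# the Message Available bit as the SRQ source on many IEEE 488.2 instruments).
tester_A.write("*SRE 16")

event_type = constants.EventType.service_request
tester_A.enable_event(event_type, constants.EventMechanism.queue)

while True:
    try:
        tester_A.wait_on_event(event_type, 5000)   # block until SRQ or timeout [ms]
    except pyvisa.errors.VisaIOError:
        continue                                   # timeout expired, keep listening
    data = tester_A.read_raw()
    # ... trigger tester B here ...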

Why won't ZMQ drop messages?

I have an application which fetches messages from a ZeroMQ publisher, using a PUB/SUB setup. The reader is sometimes slow, so I set a HWM on both the sender and the receiver. I expect the receiver to fill its buffer, drop messages, and jump ahead to catch up when it recovers from processing slowdowns. But the behaviour I observe is that it never drops! ZeroMQ seems to be ignoring the HWM. Am I doing something wrong?
Here's a minimal example:
publisher.py
import zmq
import time

ctx = zmq.Context()
sock = ctx.socket(zmq.PUB)
sock.setsockopt(zmq.SNDHWM, 1)
sock.bind("tcp://*:5556")

i = 0
while True:
    sock.send(str(i))
    print i
    time.sleep(0.1)
    i += 1
subscriber.py
import zmq
import time

ctx = zmq.Context()
sock = ctx.socket(zmq.SUB)
sock.setsockopt(zmq.SUBSCRIBE, "")
sock.setsockopt(zmq.RCVHWM, 1)
sock.connect("tcp://localhost:5556")

while True:
    print sock.recv()
    time.sleep(0.5)
I believe there are a couple things at play here:
High Water Marks are not exact (see the last paragraph in the linked section) - typically this means the real queue size will be smaller than the number you set; I don't know how this behaves at 1.
Your PUB HWM will never drop messages... due to the way PUB sockets work, it always immediately processes the message whether there is an available subscriber or not. So unless it actually takes ZMQ 0.1 seconds to push the message through its queue, your HWM will never come into play on the PUB side.
What should be happening is something like the following (I'm assuming an order of operations that would allow you to actually receive the first published message):
Start up subscriber.py & wait a suitable period to make sure it's completely spun up (basically immediately)
Start up publisher.py
PUB processes and sends the first message, SUB receives and processes the first message
PUB sleeps for .1 seconds and processes & sends the second message
SUB sleeps for .5 seconds; the socket receives the second message, but it sits in the queue until the next call to sock.recv() processes it
PUB sleeps for .1 seconds and processes & sends the third message
SUB is still sleeping for another .3 seconds, so the third message should hit the queue behind the second message, which would make 2 messages in the queue, and the third one should drop due to the HWM
... etc etc etc.
I suggest the following changes to help troubleshoot the issue:
Remove the HWM on your publisher... it does nothing but add a variable we don't need to deal with in your test case, since we never expect it to change anything. If you need it for your production environment, add it back in and test it in a high volume scenario later.
Change the HWM on your subscriber to 50. It'll make the test take longer, but you won't be at the extreme edge case, and since the ZMQ documentation states that the HWM isn't exact, the extreme edge cases could cause unexpected behavior. Mind you, I believe your test (being small numbers) wouldn't do that, but I haven't looked at the code implementing the queues so I can't say with certainty, and it may be possible that your data is small enough that your effective HWM is actually larger.
Change your subscriber sleep time to 3 full seconds... in theory, if your queue holds exactly 50 messages, you'll saturate it within two loops (just as you do now), and then you'll have to wait 2.5 minutes to work through those messages before you see whether you start getting skips, which after the first 50 messages should start jumping over large groups of numbers. But I'd wait at least 5-10 minutes. If you find that you start skipping after 100 or 200 messages, then you're being bitten by the smallness of your data.
This of course doesn't address what happens if you still don't skip any messages... If you do that and still experience the same issue, then we may need to dig more into how high water marks actually work, there may be something we're missing.
I met exactly the same problem, and my demo is nearly the same as yours: the subscriber or publisher won't drop any messages after either zmq.RCVHWM or zmq.SNDHWM is set to 1.
I worked around it after referring to the Suicidal Snail pattern for slow-subscriber detection in Chapter 5 of the zguide. Hope it helps.
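For reference, a minimal subscriber-side sketch of that pattern (assuming the publisher prepends an increasing sequence number to every message, which the demo above does not do):

expected = None
while True:
    seq, payload = sock.recv_multipart()
    seq = int(seq)
    if expected is not None and seq != expected:
        raise SystemExit("sequence gap detected -- too slow, dying as the snail pattern suggests")
    expected = seq + 1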
BTW: would you please let me know if you've solved the zmq.HWM issue?

How to make socket.recv(500) not stop a while loop

I made an IRC bot which uses a while True loop to receive whatever is said.
To receive I use recv(500), but that stops the loop if there isn't anything to receive, and I need the loop to continue even when there is nothing to receive.
I need a makeshift timer that keeps running.
Example code:
# ... a lot of stuff ...
timer = 0
while 1:
    timer = timer + 1
    line = s.recv(500)  # If there is nothing to receive, the loop and thus the timer stop.
    # ... a lot of stuff ...
So either I need a way to stop recv() from halting the loop, or I need a better timer.
You can call settimeout on the socket so that recv() returns promptly (with a suitable exception, so you'll need a try/except around it) if nothing is there -- a timeout of 0.1 seconds actually works better than non-blocking sockets in most conditions.
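For example, a minimal sketch around the code from the question:

import socket

s.settimeout(0.1)            # make recv() give up after 0.1 s instead of blocking forever
timer = 0
while 1:
    timer = timer + 1
    try:
        line = s.recv(500)
        # ... handle the received data ...
    except socket.timeout:
        pass                 # nothing arrived within 0.1 s -- the loop and the timer keep running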
This is going to prove a bad way to design a network application. I recommend looking into Twisted, a networking library with an excellent implementation of the IRC protocol for making a client (like your bot) in twisted.words.protocols.irc.
http://www.habnabit.org/twistedex.html is an example of a very basic IRC bot written using twisted. With very little code, you are able to access a whole, correct, efficient, reconnecting implementation of IRC.
If you are intent on writing this at the socket level yourself, I still recommend studying a networking library like Twisted to learn how to implement network apps effectively. Your current technique will prove less effective than desired.
I usually use irclib which takes care of this sort of detail for you.
If you want to do this with low-level Python, consider using ready_sockets = select.select([s.fileno()], [], [], 0.1) -- this tests the socket s for readability. If your socket's file number is not returned in ready_sockets, then there is no data to read.
Be careful not to use a timeout of 0 if you are going to call select repeatedly in a loop that does not otherwise yield the CPU -- that would consume 100% of the CPU while the loop executes. I gave a 0.1-second timeout as an example; in this case, your timer variable would be counting tenths of a second.
Here's an example:
timer = 0
sockets_to_check = [s.fileno()]
while 1:
    ready_sockets = select.select(sockets_to_check, [], sockets_to_check, 0.1)
    if len(ready_sockets[2]) > 0:
        # Handle a socket error or closed connection here -- our socket appeared
        # in the 'exceptional sockets' return value, so something has happened to it.
        break
    elif len(ready_sockets[0]) > 0:
        line = s.recv(500)
    else:
        # Note that timer is not incremented if the select did not incur a full
        # 0.1-second delay, although we may have just waited for 0.0999 seconds
        # without accounting for that. If your timer must be perfect, you will
        # need to implement it differently. If it is used only for time-out
        # testing, this is fine.
        timer = timer + 1
Note that the above code takes advantage of the fact that your input lists contain only one socket. If you were to use this approach with multiple sockets, which select.select does support, the len(ready_sockets[x]) > 0 test would not reveal which socket is ready for reading or has an exception.
