I have a simple PUSH/PULL ZeroMQ program in Python. It looks like the code below.
import zmq
from multiprocessing import Process
from time import sleep

def zmqtest(self):
    print('zmq')
    Process(target=start_consumer, args=('1', 9999)).start()
    Process(target=start_consumer, args=('2', 9999)).start()
    ctx = zmq.Context()
    socket = ctx.socket(zmq.PUSH)
    socket.bind('tcp://127.0.0.1:9999')
    # sleep(.5) # I have to wait here...
    for i in range(5):
        socket.send_unicode('{}'.format(i))
The problem is that I have to wait more than 0.5 seconds before sending messages, otherwise only one consumer process receives anything. If I wait more than 0.5 seconds, everything looks fine.
I guess it takes a while for the socket binding to settle down, and that this happens asynchronously.
I wonder if there's a more reliable way to know when the socket is ready.
Sure, it takes a while. Sure, it is done asynchronously.
Let's first straighten out the terminology a bit.
ZeroMQ is a great framework. Each distributed-system's client that wants to use it first instantiates an async data-pumping engine: the Context() instance(s), as needed ( the only exception being clients that use just the inproc:// transport class ).
Each Scalable Formal Communication Pattern archetype { PUSH | PULL | ... | XSUB | SUB | PAIR } does not create a socket, but rather instantiates an access-point, which may later .connect() or .bind() to some counterparty ( another access-point of a suitable type, living in some Context() instance, be it local or not ( again, the local-inproc://-only infrastructures being the known exception to this rule ) ).
In this sense, an answer to the question "When is the socket ready?" requires an end-to-end investigation "across" the distributed-system, covering all the elements that participate in the socket-like behaviour's implementation.
Testing a "local"-end access-point RTO-state:
For this, your agent may self-connect a receiving access-point ( working as a PULL archetype ) to the very same endpoint, so as to "sniff" when the local-end Context() instance has reached an RTO-state and the .bind()-created O/S L3+ interface has started distributing the agent's PUSH-ed messages, as sketched below.
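A minimal, standalone sketch of that self-sniffing idea ( endpoint and names are illustrative; it assumes the sniffer is, at this moment, the only connected PULL peer ):

import zmq

ctx   = zmq.Context()
push  = ctx.socket(zmq.PUSH)
push.bind('tcp://127.0.0.1:9999')
probe = ctx.socket(zmq.PULL)                      # the local "sniffer" access-point
probe.connect('tcp://127.0.0.1:9999')             # self-connected to the very same endpoint
ready = False
while not ready:
    try:
        push.send_unicode('probe', zmq.NOBLOCK)   # raises zmq.Again while the PUSH side is still mute
    except zmq.Again:
        continue
    if probe.poll(timeout=100):                   # did the probe message round-trip back?
        probe.recv()                              # drain it
        ready = True
# from here on, the .bind()-created interface is demonstrably distributing messages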
Testing a "remote"-agent's RTO-state:
This part can have an indirect or an explicit test. An indirect way may use a message-embedded index: a rising number ( an ordinal ), which carries weak information about ordering. Given that the PUSH-side message-routing strategy is Round-robin, the local agent can be sure that, as long as its local PULL-access-point receives messages forming a contiguous sequence of ordinals, there is no other "remote" PULL-ing agent in an RTO-state. Once the "local" PULL-access-point observes a "gap" in the stream of ordinals, that means ( sure, only in the case that all the PUSH-side .setsockopt()-s were set up properly ) there is another, non-local, PULL-ing agent in an RTO-state; see the sketch below.
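A continuation of the previous sketch, purely illustrative, showing the ordinal-gap idea ( it reuses the push / probe sockets and relies on the PUSH round-robin routing ):

ordinal  = 0                                      # rising number embedded in each message
expected = 0                                      # next ordinal the sniffer should see, if it is the only PULL-er
remote_peer_seen = False
while not remote_peer_seen:
    push.send_unicode(str(ordinal))
    ordinal += 1
    while probe.poll(timeout=10):                 # drain whatever round-robin delivered locally
        got = int(probe.recv_string())
        if got != expected:                       # a "gap" => another, non-local PULL-er is in an RTO-state
            remote_peer_seen = True
        expected = got + 1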
Is this useful?
Maybe yes, maybe not. The point was to better understand the new challenges that any distributed-system has to somehow cope with.
The nature of multi-stage message queuing, multi-layered implementation ( local-PUSH-agent's-code, local Context()-thread(s), local-O/S, local-kernel, LAN/WAN, remote-kernel, remote-O/S, remote Context()-thread(s), remote-PULL-agent's-code to name just a few ) and multi-agent behaviour simply introduce many places, where an operation may gain latency / block / deadlock / fail in some other manner.
Yes, a walk on a wild-side.
Nevertheless, one may opt to use a much richer, explicit signalling ( besides the initially intended raw-data transport ) and help to solve the context-specific, signalling-RTO-aware behaviour inside the multi-agent world; that may better reflect the actual situations and also survive the other issues that start to appear in the non-monolithic worlds of distributed-systems.
Explicit signalling is one way to cope with this.
Fine-tune the ZeroMQ infrastructure. Forget using defaults. Always!
Recent API versions have added more and more options for fine-tuning ZeroMQ's behaviour for particular use-cases. Be sure to read carefully all the details available for setting up the Context() instance and for tweaking each socket access-point's behaviour, so that it best matches your distributed-system's signalling + transport needs:
.setsockopt( ZMQ_LINGER, 0 ) # always, indeed ALWAYS
.setsockopt( ZMQ_SNDBUF, .. ) # always, additional O/S + kernel rules apply ( read more about proper sizing )
.setsockopt( ZMQ_SNDHWM, .. ) # always, problem-specific data-engineered sizing
.setsockopt( ZMQ_TOS, .. ) # always, indeed ALWAYS for critical systems
.setsockopt( ZMQ_IMMEDIATE, .. ) # prevents "losing" messages pumped into incomplete connections
and many more. Without these, the design would remain nailed into a coffin in the real-world transactions' jungle ( a pyzmq sketch of these settings follows below ).
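In pyzmq terms the constants above drop the ZMQ_ prefix; a hedged sketch with a recent libzmq, where the numeric values are illustrative placeholders to be data-engineered per use-case:

import zmq

ctx  = zmq.Context()
sock = ctx.socket(zmq.PUSH)
sock.setsockopt(zmq.LINGER,    0)              # always, indeed ALWAYS
sock.setsockopt(zmq.SNDBUF,    1 * 1024**2)    # illustrative size, O/S + kernel rules apply
sock.setsockopt(zmq.SNDHWM,    10000)          # illustrative, problem-specific sizing
sock.setsockopt(zmq.TOS,       0x28)           # illustrative ToS value for critical systems
sock.setsockopt(zmq.IMMEDIATE, 1)              # do not queue onto incomplete connections
sock.bind('tcp://127.0.0.1:9999')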
Related
I built about 12,000 subscribers per computer with threading, as follows.
subscriber side:
import threading
import zmq

def client(id):
    context = zmq.Context()
    subscriber = context.socket(zmq.SUB)
    subscriber.connect('ip:port')
    subscriber.setsockopt(zmq.SUBSCRIBE, (id + 'e').encode())
    while 1:
        signal = subscriber.recv_multipart()
        # write logs...

for i in range(12000):
    threading.Thread(target=client, args=(str(i + j * 12000),)).start()
    # j is an arbitrary, unduplicated int
publisher side:
import zmq

publisher = zmq.Context().socket(zmq.PUB)
publisher.bind('tcp://*:port')
while 1:
    for id in client_id:
        publisher.send_multipart([(id + 'e').encode()] + [message])
When I used more than one computer ( by using a different j on each ) to build subscribers, sometimes some subscribers could not receive any messages at all.
If I restart the subscribers, those that could not receive messages become normal, but those that were normal become unable to receive messages.
This problem does not show any errors; it can only be seen in my logs.
Does an excessive number of connections cause this problem?
As the counts of connections / messages / sizes grow larger and larger, some default guesstimates typically cease to suffice. Try to extend some otherwise working defaults on the PUB-side configuration, where the problem seems to start choking ( do not forget that since v3.x the subscription-list processing got transferred from the SUB-side(s) to the central PUB-side; that reduces the volumes of data-flow, yet at some additional costs on the PUB-side, here growing to remarkable amounts: RAM for buffers plus CPU for topic-list filtering ).
So, let's start with these steps on the PUB-side :
aSock2SUBs = zmq.Context( _tweak_nIOthreads ).socket( zmq.PUB ) # MORE CPU POWER
aSock2SUBs.setsockopt( zmq.SNDBUF, _tweak_SIZE_with_SO_SNDBUF ) # ROOM IN SNDBUF
And last but not least, PUB-s silently drop any messages that do not "fit" under their current HighWaterMark level, so let's tweak this one too :
aSock2SUBs.setsockopt( zmq.SNDHWM, _tweak_HWM_till_no_DROPs ) # TILL NO DROPS
Other { TCP_* | TOS | RECONNECT_IVL* | BACKLOG | IMMEDIATE | HEARTBEAT_* | ... } low-level parameter settings may further help to make your herd of 12k+ SUB-s live in peace side by side with other ( both friendly & hostile ) traffic, and make your application more robust than if relying just on pre-cooked API defaults.
Consult the ZeroMQ API documentation together with the O/S defaults, as many of these ZeroMQ low-level attributes also rely on the actual O/S configuration values.
You shall also be warned that making 12k+ threads in Python still leaves a purely [SERIAL] code execution, as the central GIL-lock ownership is exclusive and avoids ( yes, principally avoids ) any form of [CONCURRENT] co-execution: it re-[SERIAL]-ises any amount of threads into a waiting queue and results in a plain sequence of chunks' execution. By default, Python 2 will switch threads every 100 instructions; since Python 3.2+, by default, the GIL is released after 5 milliseconds ( 5,000 [us] ) so that another thread can have a chance to try and also acquire the GIL-lock. You can change these defaults if the war of 12k+ threads over the ownership of the GIL-lock actually results in "almost-blocking" any and all of the TCP/IP instrumentation for message buffering, stacking, sending and re-transmitting until an in-time confirmed reception. One may test it up to a bleeding edge, yet choosing some safer ceiling might help if the other parameters have been well adjusted for robustness.
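For reference, the mentioned defaults can be inspected and changed from Python itself ( Python 3.2+ ); a tiny sketch:

import sys

print( sys.getswitchinterval() )       # 0.005 [s] by default on Python 3.2+
sys.setswitchinterval( 0.010 )         # coarser GIL hand-over, fewer switching overheads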
Last but not least, enjoy the Zen-of-Zero, the masterpiece of Martin SUSTRIK for distributed-computing, so well crafted into an ultimately scalable, almost zero-latency, very comfortable, widely ported signalling & messaging framework.
Further to user3666197's answer, you may also have to consider the time taken for all of those clients to connect. The PUBlisher has no idea how many SUBscribers there are supposed to be, and will simply get on with the job of sending out messages to those SUBscribers presently connected, from the moment the very first connection is made. The PUBlisher socket does not hang on to messages it has sent just in case more SUBscribers connect at some undefined time in the future. Once a message has been transferred to one or more SUBscribers, it is dropped from the PUBlisher's queue. Also, the connections are not made instantaneously, and 12,000 is quite a few to get through.
It doesn't matter whether you start your PUBlisher or SUBscriber program first; your 12,000 connections will be made over a period of time once both programs are running, and this happens asynchronously with respect to your own thread(s). Some SUBscribers will start getting messages whilst others are still unknown to the PUBlisher. When, finally, all 12,000 connections are made, it will smooth out.
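If that start-up transient matters, one well-known way to cope with it is to let every SUBscriber report in on a side-channel before the PUBlisher starts sending. This is only a sketch of the "node coordination" idea from the ZeroMQ Guide, not part of the answers above; the names, count and port numbers are illustrative:

import zmq

EXPECTED_SUBS = 12000                      # illustrative: how many peers we wait for

ctx  = zmq.Context()
pub  = ctx.socket(zmq.PUB)
pub.bind('tcp://*:5556')
sync = ctx.socket(zmq.PULL)                # side-channel used only for the hand-shake
sync.bind('tcp://*:5557')

# each subscriber, once connected and subscribed, PUSH-es one "ready" message to :5557
for _ in range(EXPECTED_SUBS):
    sync.recv()

pub.send_multipart([b'0e', b'payload'])    # only now is every expected peer known to the PUB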
The server loops through a list of objects whose data changes in real time. Every millisecond the server publishes all of the new data for those objects, e.g. ['Carrot', 'Banana', 'Mango', 'Eggplant'].
The client can subscribe to specific objects by name: self.sub_socket.setsockopt_string(zmq.SUBSCRIBE, 'Carrot')
On a thread, the client polls this data in real time as well:
while True:
    sockets = dict(self.poller.poll(poll_timeout))
    if self.sub_socket in sockets and sockets[self.sub_socket] == zmq.POLLIN:
        msg = self.sub_socket.recv_string(zmq.DONTWAIT)
        # do something with the msg...
The problem is that when I subscribe to multiple objects, say Carrot, Eggplant & Banana, I only receive the changes for Carrot, sometimes Banana, and very rarely Eggplant. I think this is because of the order of the server's publishing loop: maybe when the client polls, it receives Carrot and processes the data, but by the time it polls again the server has already gone through the rest of the list and is publishing Carrot again, so the client only ever receives Carrot.
So I thought of creating individual sockets for each subscription. Is that a solution? I'm pretty new to ZMQ.
Q : "Is that a solution?" ... creating individual sockets for each subscription?
No, unless motivated by some other reasons not known to me.
While the ZeroMQ message-passing infrastructure provides Zero Warranty of each message's delivery, that does not mean that messages evaporate or get lost somewhere after being sent. It just says: expect Zero Warranty for each one being delivered, and if one needs it, one can add the wanted warranty-mechanism overheads that others need not pay if they can work without them. Losing 1-in-1,000,000? 1-in-1,000,000,000? That depends on many factors, yet losing a message is not a common or random state of a distributed computing system ( and has some internal reasons, details of which go beyond the scope of this post ).
Still In Doubts ?
Make a test.
Design a simple test: the PUB-side sends uniformly distributed, trivial messages
import time, zmq
import numpy as np

PUB = zmq.Context().socket(zmq.PUB)
PUB.bind('tcp://127.0.0.1:5556')          # illustrative endpoint, not part of the original snippet
SAMPLEs  = int( 1E6 )
aMsgSIZE = 2048
TOPICs   = [ r'Carrot', r'Banana', r'Mango', r'Eggplant', r'' ]
MASK     = "{0:}" + aMsgSIZE * "_"
for i in range( SAMPLEs ):
    PUB.send_string( MASK.format( TOPICs[np.random.randint( len( TOPICs ) - 1 )] ) )
    time.sleep( 1E-3 )
Using this test, you shall receive a uniformly distributed sample, with the same amount of each of the subscribed TOPICs ( if all were subscribed ).
Growing the aMsgSIZE may ( under default Context()- and Socket()-instance settings ) cause some messages to get "lost", but again, this ought to be uniformly distributed. If not, there would be some trouble to dig into deeper.
The fraction of the SAMPLEs messages that is uniformly not delivered will demonstrate how big the need is to tweak the Context()- and Socket()-instances' parameters so as to provide enough resources to safely enqueue that amount of data-flow. Yet having more Socket()-s for individually subscribed Topic-strings will not solve this resource-management bottleneck, if present.
Do not hesitate to post the test results, whether the uniformly distributed mix of topics was or was not skewed, and how big a fraction was not received in the end.
Add platform details + ZeroMQ version, all that matters, as always :o)
I want to make a simple connection between a Python program and a Ruby program using ZeroMQ. I am trying to use a PAIR connection, but I have not been able to get it working.
This is my code in Python (the server):
import zmq
import time

port = "5553"
context = zmq.Context()
socket = context.socket(zmq.PAIR)
socket.bind("tcp://*:%s" % port)
while True:
    socket.send(b"Server message to client3")
    print("Enviado mensaje")
    time.sleep(1)
It does not display anything until I connect a client.
This is the code in Ruby (the client)
require 'ffi-rzmq'

context = ZMQ::Context.new
subscriber = context.socket ZMQ::PAIR
subscriber.connect "tcp://localhost:5553"
loop do
  address = ''
  subscriber.recv_string address
  puts "[#{address}]"
end
The Ruby script just freezes and does not print anything, while the Python script starts printing Enviado mensaje.
BTW: I am using Python 3.6.9 and Ruby 2.6.5.
What is the correct way to connect a zmq PAIR between Ruby and Python?
Welcome to the Zen of Zero!
In case one has never worked with ZeroMQ, one may enjoy first looking at "ZeroMQ Principles in less than Five Seconds" before diving into further details.
Q : It does not display anything until I connect a client.
Sure, it does not; your code imperatively asked it to block until the PAIR/PAIR delivery channel happens to become able to deliver a message. As the v4.2+ API defines it, the .send()-method will block for the whole duration of a "mute state":
When a ZMQ_PAIR socket enters the mute state due to having reached the high water mark for the connected peer, or if no peer is connected, then any zmq_send(3) operations on the socket shall block until the peer becomes available for sending; messages are not discarded.
You may try the non-blocking mode of sending ( always a sign of good engineering practice to avoid blocking, the more so in distributed-computing ), and better also include <aSocket>.close() and <aContext>.term() as a rule of thumb ( best with an explicit .setsockopt( zmq.LINGER, 0 ) ) for avoiding hang-ups and as a good engineering practice to explicitly close resources and release them back to the system:
socket.send( b"Server message #[_{0:_>10d}_] to client3".format( i ), zmq.NOBLOCK )
Last but not least :
Q : What is the correct way to connect a zmq PAIR between Ruby and Python?
as the API documentation explains:
ZMQ_PAIR sockets are designed for inter-thread communication across the zmq_inproc(7) transport and do not implement functionality such as auto-reconnection.
there is no best way to do this, as Python / Ruby processes are not a case of inter-thread communication. ZeroMQ has, since v2.1+, explicitly warned that the PAIR/PAIR archetype is experimental and ought to be used only with that in mind.
One may always substitute each such use-case with a tandem of PUSH/PULL simplex channels, providing the same comfort with a pair of .send()-only + .recv()-only channels, as sketched below.
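A minimal sketch of such a tandem on the Python side ( ports are illustrative; the Ruby peer would mirror it with a PULL connected to :5554 and a PUSH connected to :5555 ):

import zmq

ctx = zmq.Context()
tx  = ctx.socket(zmq.PUSH)            # a .send()-only simplex channel towards the Ruby peer
tx.bind("tcp://*:5554")
rx  = ctx.socket(zmq.PULL)            # a .recv()-only simplex channel from the Ruby peer
rx.bind("tcp://*:5555")

tx.send(b"Server message to client3")
reply = rx.recv()                     # blocks until the Ruby side PUSH-es something back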
I have a small software where I have a separate thread which is waiting for ZeroMQ messages. I am using the PUB/SUB communication protocol of ZeroMQ.
Currently I am aborting that thread by setting a variable "cont_loop" to False.
But I discovered that, when no messages arrive at the ZeroMQ subscriber, I cannot exit the thread (without taking down the whole program).
def __init__(self):
    Thread.__init__(self)
    self.cont_loop = True

def abort(self):
    self.cont_loop = False

def run(self):
    zmq_context = zmq.Context()
    zmq_socket = zmq_context.socket(zmq.SUB)
    zmq_socket.bind("tcp://*:%s" % 5556)
    zmq_socket.setsockopt(zmq.SUBSCRIBE, "")
    while self.cont_loop:
        data = zmq_socket.recv()
        print "Message: " + data
    zmq_socket.close()
    zmq_context.term()
    print "exit"
I tried to move socket.close() and context.term() to the abort() method, so that it shuts down the subscriber, but this killed the whole program.
What is the correct way to shut down the above program?
Q: What is the correct way to ... ?
A: There are many ways to achieve the set goal. Let me pick just one, as a mock-up example of how to handle distributed process-to-process messaging.
First. Assume there are multiple priorities in a typical software design task. Some are higher, some lower, and some even so low that one can defer the execution of these low-priority sub-tasks, so that more time remains in the scheduler for those sub-tasks that cannot handle waiting.
This said, let's view your code. The SUB-side .recv() instruction, as it was being used, causes two things. One is visible: it performs a RECEIVE operation on a ZeroMQ socket with SUB behaviour. The second, less visible one is that it remains hanging until it gets something "compatible" with the current state of the SUB behaviour ( more on setting this later ).
This means it also BLOCKS from the moment of such a .recv() method call UNTIL some unknown, locally uncontrollable coincidence of states/events makes it deliver a ZeroMQ message whose content is "compatible" with the locally pre-set state of this ( still blocking ) SUB-behaviour instance.
That may take ages.
This is exactly why .recv() is being rather used inside a control-loop, where external handling gets both the chance & the responsibility to do what you want ( including abort-related operations & a fair / graceful termination with proper resources' release(s) ).
The receive step becomes .recv( flags = zmq.NOBLOCK ), wrapped rather in a try: except: episode. This way, your local process does not lose its control over the stream-of-events ( incl. the NOP being one such ); see the sketch below.
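Applied to the run() method above, a minimal sketch of such a control-loop might look like this ( keeping the question's Python 2 syntax and attribute names, and assuming import time and import zmq at module level ):

def run(self):
    zmq_context = zmq.Context()
    zmq_socket  = zmq_context.socket(zmq.SUB)
    zmq_socket.bind("tcp://*:%s" % 5556)
    zmq_socket.setsockopt(zmq.SUBSCRIBE, "")
    while self.cont_loop:                              # the loop, not .recv(), stays in control
        try:
            data = zmq_socket.recv(flags=zmq.NOBLOCK)  # returns immediately or raises zmq.Again
            print "Message: " + data
        except zmq.Again:
            time.sleep(0.01)                           # a NOP turn: re-check the abort flag
    zmq_socket.close()
    zmq_context.term()
    print "exit"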
The best next step?
Take your time and get through a great book of gems, "Code Connected, Volume 1", which Pieter HINTJENS, co-father of ZeroMQ, has published ( also as a PDF ).
Many of his thoughts, and the errors to be avoided that he shared with us, are indeed worth your time.
Enjoy the powers of ZeroMQ. It's very powerful & worth mastering top-down.
Most Python Windows service examples based on win32serviceutil.ServiceFramework use win32event for synchronization.
For example:
http://tools.cherrypy.org/wiki/WindowsService (the example for cherrypy 3.0)
(sorry, I don't have the reputation to post more links, but many similar examples can be googled)
Can somebody clearly explain why the win32events are necessary (self.stop_event in the above example)?
I guess it's necessary to use win32event because different threads call SvcStop and SvcDoRun? But I'm getting confused; there are so many other things happening: the split between python.exe and pythonservice.exe, system vs. local threads (?), the Python GIL...
From the top of PythonService.cpp:
PURPOSE: An executable that hosts Python services.
This source file is used to compile 2 discrete targets:
* servicemanager.pyd - A Python extension that contains
all the functionality.
* PythonService.exe - This simply loads servicemanager.pyd, and
calls a public function. Note that PythonService.exe may one
day die - it is now possible for python.exe to directly host
services.
What exactly do you mean by system threads vs local threads? You mean threads created directly from C outside the GIL?
PythonService.cpp just relates the names to callable Python objects and a bunch of properties, like the accepted controls.
For example, the accepted controls from the ServiceFramework:
def GetAcceptedControls(self):
    # Setup the service controls we accept based on our attributes. Note
    # that if you need to handle controls via SvcOther[Ex](), you must
    # override this.
    accepted = 0
    if hasattr(self, "SvcStop"): accepted = accepted | win32service.SERVICE_ACCEPT_STOP
    if hasattr(self, "SvcPause") and hasattr(self, "SvcContinue"):
        accepted = accepted | win32service.SERVICE_ACCEPT_PAUSE_CONTINUE
    if hasattr(self, "SvcShutdown"): accepted = accepted | win32service.SERVICE_ACCEPT_SHUTDOWN
    return accepted
I suppose the events are recommended because that way you can signal the service from outside the GIL, even if Python is in a blocking call on the main thread, e.g. time.sleep(10); you can interrupt at those points outside the GIL and avoid having an unresponsive service.
Most of the win32 service calls sit between the Python C macros:
Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS
It may be that, being examples, they don't have anything otherwise interesting to do in SvcDoRun. SvcStop will be called from another thread, so using an event is just an easy way to do the cross-thread communication to have SvcDoRun exit at the appropriate time.
If there were some service-like functionality that blocks in SvcDoRun, they wouldn't necessarily need the events. Consider the second example in the CherryPy page that you linked to. It starts the web server in blocking mode, so there's no need to wait on an event.
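For completeness, the cross-thread pattern those examples rely on is tiny; a hedged sketch ( the service name and the empty run-loop are illustrative ):

import win32event
import win32service
import win32serviceutil

class SketchService(win32serviceutil.ServiceFramework):
    _svc_name_ = "SketchService"                  # illustrative name
    _svc_display_name_ = "Sketch Service"

    def __init__(self, args):
        win32serviceutil.ServiceFramework.__init__(self, args)
        # a Win32 event object, settable from whichever thread the SCM uses for SvcStop
        self.stop_event = win32event.CreateEvent(None, 0, 0, None)

    def SvcStop(self):
        self.ReportServiceStatus(win32service.SERVICE_STOP_PENDING)
        win32event.SetEvent(self.stop_event)      # signal the waiting SvcDoRun thread

    def SvcDoRun(self):
        # block the service's main thread until SvcStop signals the event
        win32event.WaitForSingleObject(self.stop_event, win32event.INFINITE)

if __name__ == '__main__':
    win32serviceutil.HandleCommandLine(SketchService)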