The following dispatch() function receives messages through a queue.Queue and sends them to an endpoint using a ZeroMQ PUSH socket.
I want this function to exit once it receives None through the queue, but if the socket's underlying message buffer holds any undelivered messages (e.g. the remote endpoint is down), the application won't terminate. Therefore, once the function receives None, it closes the socket with a specified linger.
With this approach, how can I detect whether the specified linger period expired before all messages were delivered? In particular, no exception is raised.
def dispatch(self):
    context = zmq.Context()
    socket = context.socket(zmq.PUSH)
    poller = zmq.Poller()
    socket.connect('tcp://127.0.0.1:5555')
    poller.register(socket, zmq.POLLOUT)
    while True:
        try:
            msg = self.dispatcher_queue.get(block=True, timeout=0.5)
        except queue.Empty:
            continue
        if msg is None:
            socket.close(linger=5000)
            break
        try:
            socket.send_json(msg)
        except Exception as exc:
            raise common.exc.WatchdogException(
                f'Failed to dispatch resource match to processor.\n{msg=}') from exc
Q : "How to detect whether linger was reached when closing socket using ZeroMQ?"
Well, not an easy thing.
ZeroMQ internally hides all these details from user-level code: the API was (since ever, till the recent v4.3) crafted with all the beauties of the art of Zen-of-Zero, for the sake of maximum performance, almost linear scaling and minimum latency - do zero steps that do not support (the less if they violate) this.
There are three principal directions of attack for solving this:
one may try to configure and use the event-observing overlay of zmq_socket_monitor() to analyse the actual sequence of events at the lowest achievable Level-of-Detail
one may also try the rather brute way of setting an infinite LINGER attribute on the zmq.Socket()-instance and directly killing the blocking operation by sending a SIGNAL after a set amount of (now) soft-"linger" has expired, be it using the new v4.3+ zmq_timers features (a [ms]-coarse framework of timers / callback utilities) or one's own
one may prefer to keep things clean and still meet the goal by instrumenting the call to zmq_ctx_term(), which as per the v4.3 documented API will block (be warned that this is not warranted to be so in other API versions, back and forth). This way you may indirectly detect the duration actually spent in the blocking state, like:
...
NOMINAL_LINGER_ON_CLOSE = 5000
MASK_a = "INF: .term()-ed ASAP, after {0:} [us] from {1:} [ms] granted"
MASK_b = "INF: .term()-ed ALAP, after {0:} [us] from {1:} [ms] granted"
...
socket.setsockopt( zmq.LINGER, NOMINAL_LINGER_ON_CLOSE ) # ____ BE EXPLICIT, ALWAYS
aClk = zmq.Stopwatch() # ______________________________________ note: deprecated/removed in recent pyzmq; time.perf_counter_ns() // 1000 gives the same [us] reading
aClk.start() #_________________________________________________ BoBlockingSECTION
context.term() # /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\ BLOCKING.......
_ = aClk.stop() #______________________________________________ EoBlockingSECTION
...
print( ( MASK_a if _ < ( NOMINAL_LINGER_ON_CLOSE * 1000 )
else MASK_b
).format( _, NOMINAL_LINGER_ON_CLOSE )
)
...
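For the first direction of attack, pyzmq exposes the zmq_socket_monitor() overlay via socket.get_monitor_socket(). A minimal sketch (the endpoint is an illustrative assumption; nothing needs to listen there) that drains the low-level transport events - seeing zmq.EVENT_DISCONNECTED while messages are still queued is a hint that the linger limit will bite:

```python
import zmq
from zmq.utils.monitor import recv_monitor_message

ctx = zmq.Context()
push = ctx.socket(zmq.PUSH)
monitor = push.get_monitor_socket()       # PAIR socket streaming this socket's transport events
push.connect("tcp://127.0.0.1:5555")      # hypothetical endpoint from the question

seen = []
for _ in range(8):                        # bounded drain; connect-retry events would otherwise loop forever
    if not monitor.poll(timeout=250):
        break
    event = recv_monitor_message(monitor) # dict with 'event', 'value', 'endpoint'
    seen.append(event['event'])           # e.g. zmq.EVENT_CONNECT_DELAYED, zmq.EVENT_DISCONNECTED

push.disable_monitor()
monitor.close()
push.close(linger=0)
ctx.term()
```

Correlating these events with the close timestamp tells you whether the peer was reachable while the socket lingered; the event stream is per-socket, so it composes cleanly with the dispatch() loop above.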
My evangelisation of always, indeed ALWAYS, being rather EXPLICIT is based on having seen the creeping "defaults" change from version to version (and it is fair to expect that to continue the same way forth). A responsible design shall therefore, indeed ALWAYS, imperatively re-enter those very values that we want to be in place: our code will survive both our current version and us, mortals, while we have Zero-warranty of which of our current assumptions will remain "defaults" in any future version / revision, and the same uncertainty about which version-mix will be present in the domain of our deployed piece of code (as of EoY-2020, there are still v2.1, v2.11, v3.x and v4.x running wild somewhere out there, so one never knows, does one?).
Related
I am new to using ZeroMQ, so I am struggling with some code.
If I run the following code, no error is shown:
import zmq.asyncio
ctx = zmq.asyncio.Context()
rcv_socket = ctx.socket(zmq.PULL)
rcv_socket.connect("ipc:///tmp/test")
rcv_socket.bind("ipc:///tmp/test")
But, if I try to use the function zmq_getsockopt(), it fails :
import zmq.asyncio
ctx = zmq.asyncio.Context()
rcv_socket = ctx.socket(zmq.PULL)
rcv_socket.connect("ipc:///tmp/test")
socket_path = rcv_socket.getsockopt(zmq.LAST_ENDPOINT)
rcv_socket.bind("ipc://%s" % socket_path)
Then I get :
zmq.error.ZMQError: No such file or directory for ipc path "b'ipc:///tmp/test'".
new to use ZeroMQ, so I am struggling with some code.
Well, you will be way, way better off if you start by first understanding The Rules of the Game, rather than by learning from crashes (yes, on the very contrary to what "wannabe-evangelisation-gurus" pump into the crowds, "just-coding" is not enough for doing indeed serious business).
This is why:
If you read the published API, it will still confuse you most of the time if you have no picture of the structure of the system and do not understand its internal and external behaviours (the Framework's Rules of the Game):
The ZMQ_LAST_ENDPOINT option shall retrieve the last endpoint bound for TCP and IPC transports. The returned value will be a string in the form of a ZMQ DSN. Note that if the TCP host is INADDR_ANY, indicated by a *, then the returned address will be 0.0.0.0 (for IPv4).
This states the point, yet without knowing the concept, the point remains hidden from view.
The Best Next Step
If you are indeed serious into low-latency and distributed computing, the best next step, after reading the link above, is to stop coding and first take some time to read and understand the fabulous Pieter Hintjens' book "Code Connected, Volume 1" before going any further - definitely worth your time.
Then, you will see why this will never fly:
import zmq.asyncio; ctx = zmq.asyncio.Context()
rcv_socket = ctx.socket( zmq.PULL )
rcv_socket.connect( "ipc:///tmp/test" )
socket_path = rcv_socket.getsockopt( zmq.LAST_ENDPOINT )
rcv_socket.bind( "ipc://%s" % socket_path )
whereas this one may (though the value returned is bytes, and no handling / decoding of that NULL-terminated character string is present here either... which is per se a sign of weak software design):
import zmq.asyncio; ctx = zmq.asyncio.Context()
rcv_socket = ctx.socket( zmq.PULL )
rcv_socket.bind( "ipc:///tmp/test" )
socket_path = rcv_socket.getsockopt( zmq.LAST_ENDPOINT )
rcv_socket.connect( "ipc://%s" % socket_path )
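The error message in the first snippet ("No such file or directory for ipc path "b'ipc:///tmp/test'"") points exactly at the missing bytes-handling: in Python 3, getsockopt(zmq.LAST_ENDPOINT) returns bytes, and %s-formatting bytes into a str embeds their b'...' repr verbatim. A stdlib-only demonstration (the endpoint value is just the one from the question):

```python
last_endpoint = b"ipc:///tmp/test"       # what getsockopt(zmq.LAST_ENDPOINT) hands back: bytes

broken = "ipc://%s" % last_endpoint      # bytes leak into the str as their repr
print(broken)                            # ipc://b'ipc:///tmp/test'  <- the invalid path from the error

fixed = last_endpoint.decode("ascii")    # decode first; the value is already a full DSN
print(fixed)                             # ipc:///tmp/test
```

Note also that the returned value already carries the ipc:// prefix, so prepending "ipc://" a second time is wrong even after decoding; connect(fixed) is all that is needed.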
I'm faced with a task: using a zmq socket to send and receive base64 strings (generated from 800x600 images). Currently I'm using a pub/sub connection to perform this task. But it looks like the messages are so large that the socket can't transfer them immediately, and later messages get stuck in the network buffer. Although I don't want to lose many messages, I must restrict the HWM value so that the socket works properly. So I have some questions:
Is there another effective library/way to perform my task? Or should I use other connection type that zmq provides (router/dealer and request/reply)?
To transfer image (that processed by OpenCV), is there an approach I can use to minimize the size of the sending image, except converting into base64 format?
If I must continue using zmq pub/sub connection, how can I limit the time for storing old messages, not the number of them, like that for 3 minutes?
Here my python code for the socket:
Publisher
import base64

import numpy as np
import zmq

context = zmq.Context()
footage_socket = context.socket(zmq.PUB)
footage_socket.setsockopt(zmq.SNDHWM, 1)
footage_socket.connect(<tcp address>)

def send_func(frame, camera_link):
    height, width, _ = frame.shape
    frame = np.ascontiguousarray(frame)
    base64_img = base64.b64encode(frame).decode('ascii')  # decode: bytes are not JSON-serialisable
    data = {"camera_link": camera_link, "base64_img": base64_img,
            "img_width": width, "img_height": height}
    footage_socket.send_json(data)
Subscriber
import base64
import json

import numpy as np
import zmq

context = zmq.Context()
footage_socket = context.socket(zmq.SUB)
footage_socket.bind(<tcp address>)
footage_socket.setsockopt(zmq.RCVHWM, 1)
footage_socket.setsockopt_string(zmq.SUBSCRIBE, "")  # without a subscription, SUB receives nothing

def rcv_func():
    while True:
        print("run socket")
        try:
            framex = footage_socket.recv_string()
            data = json.loads(framex)
            frame = data['base64_img']
            img = np.frombuffer(base64.b64decode(frame), np.uint8)
            img = img.reshape(int(data['img_height']), int(data['img_width']), 3)
        except Exception as e:
            print(e)
Before we start, let me take a few notes:
- avoid re-packing data into JSON if it were just for the ease of coding. JSON-re-serialised data "grow" in size without delivering a single piece of added value for ultra-fast and resources-efficient stream-processing. Professional systems "resort" to the JSON-format only if they have plenty of time and almost unlimited spare CPU-processing power to waste on re-packing the valuable data into just another box-of-data-inside-another-box-of-data; where that is feasible, they can pay all the costs and inefficiencies. Here, you would get nothing in exchange for the spent CPU-clocks, more than double the RAM needed for the re-packing itself, and you would also have to transport even larger data
- review whether the camera indeed provides image-data that "deserve" to become 8-Byte / 64-bit "deep"; if not, you have your first remarkable image-data reduction free of charge
Using sys.getsizeof() may surprise you:
>>> aa = np.ones( 1000 )
>>> sys.getsizeof( aa )
8096 <---------------------------- 8096 [B] object here "contains"
>>> (lambda array2TEST: array2TEST.itemsize * array2TEST.size )( aa )
8000 <---------------------------- 8000 [B] of data
>>> bb = aa.view() # a similar effect happen in smart VECTORISED computing
>>> sys.getsizeof( bb )
96 <------------------------------ 96 [B] object here "contains"
>>> (lambda array2TEST: array2TEST.itemsize * array2TEST.size )( bb )
8000 <---------------------------- 8000 [B] of data
>>> bb.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : False <-------------------------------||||||||||
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
>>> bb.dtype
dtype('float64') <-------------- 8 [B] per image-pixel even for {1|0} B/W
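The JSON+base64 overhead from the first note is just as easy to quantify: base64 inflates the payload by a third before the JSON quoting even starts. A stdlib-only check, assuming the 8-bit 800x600x3 frame from the question (the all-zero frame is a stand-in for real pixel data; the sizes are exact either way):

```python
import base64
import json

raw = bytes(800 * 600 * 3)                    # one 8-bit RGB frame: 1_440_000 B
b64 = base64.b64encode(raw)                   # 4/3 inflation: 1_920_000 B
boxed = json.dumps({"base64_img": b64.decode("ascii")})  # another box around the box

print(len(raw), len(b64), len(boxed))         # each step only ever grows the payload
```

Sending the raw frame bytes directly (with a small metadata frame alongside) avoids the whole 33%+ tax per message.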
Q : is there an approach I can use to minimize the size of the sending image...?
Yes, millions of [man * years] of R&D have already been spent on solving this problem, and the best-in-class methods for doing it are still evolving.
The best results, as anyone may have expected, were needed for the extreme corner-cases - satellite imagery transported from far away in deep space back home - like when JAXA was on its second asteroid rendezvous mission, this time visiting the asteroid Ryugu.
Your as-is code produces 800x600-image-frames at a so-far unspecified fps-rate and colour-depth. A brief view shows how much data this can easily generate within those said -3-minutes-, if the process is not handled with more attention and due care:
>>> (lambda a2T: a2T.itemsize * a2T.size )( np.ones( ( 800, 600, 3 ) ) ) / 1E6
11.52 <---- each 800x600-RGB-FRAME handled/processed this way takes ~ 11.5 [MB]
#~30 fps ~345.6 [MB/s]
~ 62.2 [GB/3min]
Solution? Take inspiration from the best-in-class know-how:
There you have limited power (both energy-wise and processing-wise - do not forget that the CPUs "inside" such a satellite were manufactured some 5-7 years before the Project launch; no one serious would dare to send a mission with bright, hot, new, but unproven COTS chips), limited RAM (again power plus weight limits, as the amount of fuel needed to lift off and fly "there" grows with every single gram of The Useful Payload), and, last but not least, the most limiting factor: very limited means of R/F-COMMs - a so "loooooooong"-wire (it takes almost half a day to get the first bit from "there" back "here", plus the same again if you try to ACK/NACK from "here", answering any remote request or requesting a re-send after an error was detected). The current DSN effective-telemetry data transport-speeds are about 6.4 ~ 9.6 kbps (yes, not more than about 7000 bits/sec).
Here, the brightest minds have put all the art of the human intellect, into making this happen:
ultimate means of image compression - never send a bit unless it is indeed vital & necessary
ultimate means of transcoded-image data error self-correction added - if anything is worth adding, mere error-detection is not (you would have to wait almost a day to get it "re-transmitted" again, hopefully without another error). Here we need a means of self-correction (a limited one - see the costs of sending a single bit above, so this has to be a very economic add-on), which can indeed repair some limited scope of the signal/data-transport errors that may and do appear while the R/F-COMMs signal travels from deep space back home. On larger errors, you have to wait a few days for a re-scheduled image-data error recovery, solved by another try to send a larger pack that was not recoverable from the "damaged" data by the capabilities engineered into the built-in error self-correction.
Where to start from?
If your use-case does not have a remarkable amount of "spare" CPU-power available (and it indeed takes plenty of "free" CPU+RAM resources to perform any such advanced image-data trans-coding and error-recovery re-processing, both in scale - the volumes of additional data for the trans-coding and re-processing are orders of magnitude larger than a single image-frame - and in time - the speed of the additional CPU-processing), there is no magic trick to get the ultimate image-data compression, and your story ends here.
If your use-case can spin up more CPU-power, your next enemy is time: both the time to design clever-enough image-processing, and the time to process each image-frame with your engineered image-data trans-coding within a reasonably short amount of time before sending it over to the recipient end. The former is manageable by your Project-resources (by finance - getting the right skilled engineers on board - and by the people who execute the actual design and engineering phases). The latter is not manageable; it depends on your Project's needs - how fast (fps), and bearing what latency (how late, in accumulated [ms] of delays), your Project can still survive while performing its intended function.
python is an easy prototyping eco-system, but once you need to boost throughput (ref. above) it will most probably become the first blocker of the show going on - 30+ years of experience make me awfully confident in saying this - even if you pull in add-on steroids, like moving into cython + C-extensions, to make the whole circus indeed a bit, but only a bit, faster, at the immense add-on cost of acquiring new skills (if not already on board - an expensive learning-curve duration, and growth in salaries for the well-skilled) and of having to re-engineer and re-factor your so-far well-prototyped code-base.
OpenCV can and will provide you some elementary image-manipulation tools to start from
image-data trans-coding and ordinary or ultimate data-compression have to follow, to reduce the data-size
ZeroMQ is the least problematic part - both performance-wise scalable and possessing unique low-latency throughput capabilities. Without going into details, one may forget about PUB/SUB unless any and all subscription-list processing is prevented and avoided (the costs of doing otherwise would cause immense side-effects of { central-node | network-dataflows + all remote-nodes }-overloads, with no practical benefit for the intended fast and right-sized image-data pipeline-processing).
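The trans-coding / compression step above can be sketched even with stdlib zlib as a lossless baseline; a real pipeline would reach for a lossy image codec instead (e.g. OpenCV's cv2.imencode into JPEG or WebP), but the shape of the step is the same. The flat test frame below is a made-up stand-in and compresses unrealistically well:

```python
import zlib

frame = bytes(800 * 600 * 3)            # hypothetical flat 8-bit RGB frame (all zeros)
packed = zlib.compress(frame, level=6)  # the trans-coding / compression step
restored = zlib.decompress(packed)      # receiver side undoes it

print(len(frame), len(packed))          # real camera frames shrink far less than this
assert restored == frame                # lossless: bit-exact round-trip
```

The design choice is where to spend the CPU-clocks: lossless codecs cost little and guarantee fidelity, lossy image codecs buy an order of magnitude more reduction at the price of tuning quality per use-case.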
Q : If I must continue using zmq pub/sub connection, how can I limit the time for storing old messages, not the number of them, like that for 3 minutes?
ZeroMQ is a smart tool, yet one has to understand its powers - zero-copy will help you keep a low RAM-profile in production, yet if you plan to store -3-minutes- of image-data streaming, you will need immense loads of RAM and CPU-power, and it all also heavily depends on the actual number of .recv()-ing peers.
ZeroMQ is a broker-less system, so you do not actually "store" messages; the .send()-method just tells the ZeroMQ infrastructure that the provided data are free-to-get-sent, whenever the ZeroMQ infrastructure sees a chance to dispatch 'em to the designated peer-recipient (be it locally, over the Atlantic, or over a satellite-connection). This means proper ZeroMQ configuration is a must if you plan to have the sending/receiving sides ready to enqueue / transmit / receive / dequeue ~3-minutes of even the most compressed image-data stream(s), potentially multiples of that in case 1:many-party communication appears in production.
Proper analysis and sound design decisions are the only chance for your Project to survive all these requirements, given the CPU, RAM and transport-means are a-priori known to be limited.
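Putting the notes together, a sketch of what the PUB/SUB hop can look like without the JSON/base64 detour: raw frame bytes in one frame of a multipart message, tiny metadata in another. The endpoint, metadata fields and frame size are illustrative assumptions; pyzmq is assumed, and inproc:// keeps the sketch self-contained:

```python
import json
import zmq

ctx = zmq.Context()

pub = ctx.socket(zmq.PUB)
pub.bind("inproc://frames")

sub = ctx.socket(zmq.SUB)
sub.setsockopt_string(zmq.SUBSCRIBE, "")     # PUB/SUB delivers nothing without a subscription
sub.connect("inproc://frames")

frame = bytes(800 * 600 * 3)                 # stand-in for one raw 8-bit RGB frame
meta = json.dumps({"w": 800, "h": 600, "c": 3}).encode()

for _ in range(100):                         # retry: PUB drops messages until the subscription lands
    pub.send_multipart([meta, frame])        # metadata frame + raw payload frame, no re-packing
    if sub.poll(timeout=50):
        break

got_meta, got_frame = sub.recv_multipart()
print(json.loads(got_meta.decode()), len(got_frame))

sub.close(linger=0)
pub.close(linger=0)
ctx.term()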
I have a publisher-subscriber architecture in ZeroMQ. I use Python.
I need to be able to tell when some queue is about to be too full, and preferably be able to do something about it.
I must be able to know if messages are lost.
I am, however, unable to find the relevant documentation on this subject.
Would love some help
Thanks!
Here is a snippet of what I did
def _flush_zmq_into_buffer(self):
    # poll zmq for new messages; if any are found, move all available items from zmq to self.buffer
    # if many readers, need to use a read-write lock with writer priority
    # http://code.activestate.com/recipes/577803-reader-writer-lock-with-priority-for-writers/
    while True:
        self._iteration += 1
        self._flush_zmq_once()
        sleep(0.005)
def _take_work_from_buffer(self):
    while True:
        try:
            if self._buffer.qsize() > 0:
                work_message = self._buffer.get(block=True)
                # can't have any heavy operation here! this has to be as lean as possible!
            else:
                sleep(0.01)
                continue
        except queue.Empty as ex:
            sleep(0.01)
            continue
        self._work_once(work_message)
def _flush_zmq_once(self):
    self.__tick()
    flushed_messages = 0
    for i in range(self.max_flush_rate):
        try:
            message = self._parse_single_zmq_message()
            self._buffer.put(message, block=True)  # must block. can't lose messages.
            flushed_messages = i + 1  # count successes, so a full batch is reported correctly too
        except zmq.Again:  # zmq empty
            break
    self._log_load(flushed_messages)
    self.__tock()
    self.__print_flushed(flushed_messages)
This allows me to flush the zmq buffer into my own buffer much faster than I am parsing messages, thus not losing any message, and paying with latency.
This also allows me to know how many messages are flushed from zmq every flushing cycle, thus having an idea about the load.
The reason this uses polling rather than events is that, for a high rate of incoming messages, an event system would be more costly than polling. That last claim is untested, but I believe it to be true.
I have a single client talking to a single server using a pair socket:
context = zmq.Context()
socket = context.socket(zmq.PAIR)
socket.setsockopt(zmq.SNDTIMEO, 1000)
socket.connect("tcp://%s:%i" % (host, port))
...
if msg is not None:
    try:
        socket.send(msg)
    except Exception as e:
        print(e, e.errno)
The program sends approximately one 10-byte message every second. We were seeing issues where the program would eventually start to hang infinitely waiting for a message to send, so we added a SNDTIMEO. However, now we are starting to get zmq.error.Again instead. Once we get this error, the resource never becomes available again. I'm looking into which error code exactly is occurring, but I was generally wondering what techniques people use to recover from zmq.error.Again inside their programs. Should I destroy the socket connection and re-establish it?
Fact#0: PAIR/PAIR is different from other ZeroMQ archetypes
RFC 31 explicitly defines:
Overall Goals of this Pattern
PAIR is not a general-purpose socket but is intended for specific use cases where the two peers are architecturally stable. This usually limits PAIR to use within a single process, for inter-thread communication.
Next, if the SNDHWM size is not set correctly - and, when the will is to use PAIR over the tcp:// transport-class, also all the O/S-related L3/L2 attributes - any next .send() will likewise yield an EAGAIN error.
There are a few additional counter-measures (CONFLATE, IMMEDIATE, HEARTBEAT_{IVL|TTL|TIMEOUT}), but the above-mentioned principal limit of PAIR/PAIR sets what not to expect to happen if using this archetype.
The main suspect:
Given the said design-side limits, upon a damaged transport-path the PAIR access-point will not re-negotiate the reconstruction of the socket into an RTO-state.
For this reason, if your code indeed wants to remain using PAIR/PAIR, it may be wise to assemble also an emergency SIG/flag path so as to allow the distributed-system robustly survive such L3/L2/L1-incidents, that the PAIR/PAIR is known not to auto-take care of.
Epilogue:
your code does not use the non-blocking .send()-mode, while the EAGAIN error-state is exactly how a blocked capability (an inability of the Access-Point to .send() at this very moment) is signalled.
Better use the published API details:
# note: pyzmq raises on error-states instead of returning -1 as the C API does
try:
    socket.send( msg, zmq.DONTWAIT ) # _________________ non-blocking .send()
except zmq.Again:                    # _________________ EAGAIN: cannot send right now
    ...                              # _________________ .HANDLE the blocked state
except zmq.ZMQError:                 # _________________ other ZeroMQ error-states
    ...                              # _________________ .HANDLE EXC
finally:
    ...
I'm running a python script on a Raspberry Pi that constantly checks a Yocto button; when it gets pressed, it puts data from a different sensor into a database.
A snippet of the code that constantly runs:
# when all set and done, run the program
Active = True
while Active:
    if ResponseType == "b":
        while Active:
            try:
                if GetButtonPressed(ResponseValue):
                    DoAllSensors()
                    time.sleep(5)
                else:
                    time.sleep(0.5)
            except KeyboardInterrupt:
                Active = False
            except Exception, e:
                print str(e)
                print "exception raised, continuing after 10 seconds"
                time.sleep(10)
the GetButtonPressed(ResponseValue) looks like the following:
def GetButtonPressed(number):
    global buttons
    if ModuleCheck():
        if buttons[number - 1].get_calibratedValue() < 300:
            return True
    else:
        print "module not online"
    return False

def ModuleCheck():
    global moduleb
    return moduleb.isOnline()
I'm not quite sure what might be going wrong, but it takes about an hour before the RPi runs out of memory.
The memory increases in size constantly and the button is only pressed once every 15 minutes or so.
That already tells me that the problem must be in the code displayed above.
The problem is that the yocto_api.YAPI object will continue to accumulate _Event objects in its _DataEvents dict (a class-wide attribute) until you call YAPI.YHandleEvents. If you're not using the API's callbacks, it's easy to think (I did, for hours) that you don't need to ever call this. The API docs aren't at all clear on the point:
If your program includes significant loops, you may want to include a call to this function to make sure that the library takes care of the information pushed by the modules on the communication channels. This is not strictly necessary, but it may improve the reactivity of the library for the following commands.
I did some playing around with API-level callbacks before I decided to periodically poll the sensors in my own code, and it's possible that some setting got left enabled in them that is causing these events to accumulate. If that's not the case, I can't imagine why they would say calling YHandleEvents is "not strictly necessary," unless they make ARM devices with unlimited RAM in Switzerland.
Here's the magic static method that thou shalt call periodically, no matter what. I'm doing so once every five seconds and that is taking care of the problem without loading down the system at all. API code that would accumulate unwanted events still smells to me, but it's time to move on.
# noinspection PyUnresolvedReferences
@staticmethod
def HandleEvents(errmsgRef=None):
    """
    Maintains the device-to-library communication channel.
    If your program includes significant loops, you may want to include
    a call to this function to make sure that the library takes care of
    the information pushed by the modules on the communication channels.
    This is not strictly necessary, but it may improve the reactivity
    of the library for the following commands.
    This function may signal an error in case there is a communication problem
    while contacting a module.
    @param errmsg : a string passed by reference to receive any error message.
    @return YAPI.SUCCESS when the call succeeds.
    On failure, throws an exception or returns a negative error code.
    """
    errBuffer = ctypes.create_string_buffer(YAPI.YOCTO_ERRMSG_LEN)
    # noinspection PyUnresolvedReferences
    res = YAPI._yapiHandleEvents(errBuffer)
    if YAPI.YISERR(res):
        if errmsgRef is not None:
            # noinspection PyAttributeOutsideInit
            errmsgRef.value = YByte2String(errBuffer.value)
        return res
    while len(YAPI._DataEvents) > 0:
        YAPI.yapiLockFunctionCallBack(errmsgRef)
        if not (len(YAPI._DataEvents)):
            YAPI.yapiUnlockFunctionCallBack(errmsgRef)
            break
        ev = YAPI._DataEvents.pop(0)
        YAPI.yapiUnlockFunctionCallBack(errmsgRef)
        ev.invokeData()
    return YAPI.SUCCESS
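The "call it periodically, no matter what" advice folds into a busy loop with a simple time-based throttle. The sketch below keeps the pattern library-agnostic: the callable stands in for the real YAPI.YHandleEvents / YAPI.HandleEvents call, so the throttling logic itself is testable without Yoctopuce hardware (PeriodicCall and its parameters are illustrative names, not part of the Yoctopuce API):

```python
import time

class PeriodicCall(object):
    """Invoke a callable at most once per `interval` seconds, from inside a busy loop."""
    def __init__(self, func, interval=5.0):
        self.func = func
        self.interval = interval
        self._last = float("-inf")   # so the first tick always fires

    def tick(self, now=None):
        now = time.monotonic() if now is None else now
        if now - self._last >= self.interval:
            self._last = now
            self.func()              # e.g. YAPI.YHandleEvents() to drain _DataEvents
            return True
        return False

# simulate a 12-second busy loop sampled once per second
calls = []
drain = PeriodicCall(lambda: calls.append("drained"), interval=5.0)
for second in range(12):
    drain.tick(now=float(second))

print(len(calls))                    # fires at t=0, t=5, t=10 -> 3 calls
```

In the button-polling loop above, a `drain.tick()` next to each `time.sleep()` is enough to keep the _DataEvents list from growing without bound.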