Recovering from zmq.error.Again on a zmq.PAIR socket - python

I have a single client talking to a single server using a pair socket:
context = zmq.Context()
socket = context.socket(zmq.PAIR)
socket.setsockopt(zmq.SNDTIMEO, 1000)
socket.connect("tcp://%s:%i"%(host,port))
...
if msg is not None:
    try:
        socket.send(msg)
    except Exception as e:
        print(e, e.errno)
The program sends approximately one 10-byte message every second. We were seeing issues where the program would eventually start to hang infinitely waiting for a message to send, so we added a SNDTIMEO. However, now we are starting to get zmq.error.Again instead. Once we get this error, the resource never becomes available again. I'm looking into which error code exactly is occurring, but I was generally wondering what techniques people use to recover from zmq.error.Again inside their programs. Should I destroy the socket connection and re-establish it?

Fact#0: PAIR/PAIR is different from other ZeroMQ archetypes
RFC 31 explicitly defines:
Overall Goals of this Pattern
PAIR is not a general-purpose socket but is intended for specific use cases where the two peers are architecturally stable. This usually limits PAIR to use within a single process, for inter-thread communication.
Next, if the SNDHWM size is not set correctly, or, when PAIR is made to operate over the tcp:// transport class and the O/S-related L3/L2 resources get exhausted, any next .send() will also yield an EAGAIN error.
There are a few additional counter-measures ( CONFLATE, IMMEDIATE, HEARTBEAT_{IVL|TTL|TIMEOUT} ), but there is the above-mentioned principal limit on PAIR/PAIR, which sets what not to expect to happen when using this archetype.
The main suspect:
Given the said design-side limits, once the transport path gets damaged, the PAIR access point will not re-negotiate the reconstruction of the connection back into a ready-to-operate state.
For this reason, if your code indeed wants to keep using PAIR/PAIR, it may be wise to also assemble an emergency SIG/flag path, so as to allow the distributed system to robustly survive such L3/L2/L1 incidents, which PAIR/PAIR is known not to take care of automatically.
Epilogue:
your code does not use the non-blocking .send() mode, while the EAGAIN error state is exactly what is used to signal a blocked capability ( the inability of the access point to .send() at this very moment ).
Better use the published API details:
try:    #___________________________________________ non-blocking .send()
    socket.send( msg, zmq.DONTWAIT )
except zmq.error.Again:  #__________________________ EAGAIN: cannot .send() now
    ...                  # ( the C-API returns -1 & sets errno; pyzmq raises instead )
except zmq.ZMQError:     #__________________________ .HANDLE other EXC
    ...
finally:
    ...
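If, as the asker suggests, the socket should be destroyed and re-established once EAGAIN persists, a minimal sketch could look like the following ( the retry threshold and both helper names are assumptions for illustration, not part of the original code ):

```python
import zmq


def rebuild_pair_socket(context, host, port, old_socket=None):
    """Tear down a (possibly wedged) PAIR socket and create a fresh one.

    Sketch only -- host/port mirror the question's connect string; the
    1000 ms SNDTIMEO matches the original setup.
    """
    if old_socket is not None:
        old_socket.close(linger=0)            # drop any queued, undeliverable messages
    socket = context.socket(zmq.PAIR)
    socket.setsockopt(zmq.SNDTIMEO, 1000)
    socket.connect("tcp://%s:%i" % (host, port))
    return socket


def send_with_recovery(context, socket, msg, host, port, max_retries=3):
    """Try to send; after max_retries consecutive EAGAINs, rebuild the socket."""
    failures = 0
    while True:
        try:
            socket.send(msg)
            return socket                     # success: hand back the live socket
        except zmq.error.Again:
            failures += 1
            if failures >= max_retries:
                socket = rebuild_pair_socket(context, host, port, socket)
                failures = 0
```

Note that rebuilding cures a wedged access point but not a still-broken transport path; the emergency SIG/flag path mentioned above remains advisable.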

Related

Python socket sendall blocks and I'm not sure how to handle bad clients / slow consumers

To simplify things, assume a TCP client-server app where the client sends a request and the server responds. The server uses sendall to respond to each client.
Now assume a bad client that sends requests to the server but doesn't really handle the responses. I.e. the client never calls socket.recv. (It doesn't have to be a bad client btw...it may be a slow consumer on the other end).
What ends up happening is that the server keeps sending responses using sendall until, I assume, a buffer gets full, at which point sendall blocks and never returns.
This seems like a common problem to me so what would be the recommended solution?
Is there something like a try-send that would raise or return an EWOULDBLOCK (or similar) if the recipient's buffer is full? I'd like to avoid non-blocking select type calls if possible (happy to go that way if there are no alternatives).
Thank you in advance.
Following rveed's comment, here's a solution that works for my case:
def send_to_socket(self, sock: socket.socket, message: bytes) -> bool:
    try:
        sock.settimeout(10.0)  # protect against bad clients / slow consumers by making this time out (instead of blocking)
        res = sock.sendall(message)
        sock.settimeout(None)  # put back to blocking (if needed for subsequent calls to recv, etc. using this socket)
        if res is not None:  # sendall() returns None on success
            return False
        return True
    except socket.timeout as st:
        # do whatever you need to here
        return False
    except Exception as ex:
        # handle other exceptions here
        return False
If needed, instead of setting the timeout to None afterwards (i.e. back to blocking), you can store the previous timeout value (using gettimeout) and restore to that.
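The gettimeout()/settimeout() round-trip mentioned above can be sketched like this ( the 10-second default and the boolean return convention mirror the helper above, but the function name is an illustrative assumption ):

```python
import socket


def send_with_timeout(sock: socket.socket, message: bytes, timeout: float = 10.0) -> bool:
    """Send with a temporary timeout, restoring whatever timeout was set before."""
    previous = sock.gettimeout()        # may be None (blocking) or a float
    sock.settimeout(timeout)
    try:
        sock.sendall(message)           # returns None on success, raises on failure
        return True
    except OSError:                     # socket.timeout is a subclass of OSError
        return False
    finally:
        sock.settimeout(previous)       # restore the caller's original mode
```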

How to detect whether linger was reached when closing socket using ZeroMQ?

The following dispatch() function receives messages through a queue.Queue and sends them using a ZeroMQ PUSH socket to an endpoint.
I want this function to exit, once it receives None through the queue, but if the socket's underlying message buffer has any undelivered messages (the remote endpoint is down), then the application won't terminate. Thus, once the function receives a None, it closes the socket with a specified linger.
Using this approach, how can I detect whether the specified linger was reached or not? In particular, no exception is raised.
def dispatch(self):
    context = zmq.Context()
    socket = context.socket(zmq.PUSH)
    poller = zmq.Poller()
    socket.connect('tcp://127.0.0.1:5555')
    poller.register(socket, zmq.POLLOUT)
    while True:
        try:
            msg = self.dispatcher_queue.get(block=True, timeout=0.5)
        except queue.Empty:
            continue
        if msg is None:
            socket.close(linger=5000)
            break
        try:
            socket.send_json(msg)
        except Exception as exc:
            raise common.exc.WatchdogException(
                f'Failed to dispatch resource match to processor.\n{msg=}') from exc
Q : "How to detect whether linger was reached when closing socket using ZeroMQ?"
Well, not an easy thing.
ZeroMQ internally hides all these details from user-level code, as the API was ( since ever, until the recent v4.3 ) crafted with all the beauty of the art of the Zen-of-Zero, for the sake of maximum performance, almost linear scaling and minimum latency. Do zero steps that do not support ( less still violate ) this.
There might be three principal directions of attack on solving this:
one may try to configure & use the event-observing overlay of zmq_socket_monitor() to analyse the actual sequence of events at the lowest Level-of-Detail achievable
one may also try a rather brute way: set an infinite LINGER attribute on the zmq.Socket() instance & directly kill the blocking operation by sending a SIGNAL after a set amount of (now) soft-"linger" has expired, be it using the new-in-v4.3+ zmq_timers features ( a [ms]-coarse framework of timers / callback utilities ) or one's own
one may prefer to keep things clean and still meet the goal by "surrounding" a call to zmq_ctx_term(), which as per the v4.3 documented API will block ( be warned that this is not warranted to be so in other API versions, back & forth ). This way may help you indirectly detect the duration actually spent in the blocking state, like :
...
NOMINAL_LINGER_ON_CLOSE = 5000
MASK_a = "INF: .term()-ed ASAP, after {0:} [us] from {1:} [ms] granted"
MASK_b = "INF: .term()-ed ALAP, after {0:} [us] from {1:} [ms] granted"
...
socket.setsockopt( zmq.LINGER, NOMINAL_LINGER_ON_CLOSE ) # ____ BE EXPLICIT, ALWAYS
aClk = zmq.Stopwatch()
aClk.start() #_________________________________________________ BoBlockingSECTION
context.term() # /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\ BLOCKING.......
_ = aClk.stop() #______________________________________________ EoBlockingSECTION
...
print( ( MASK_a if _ < ( NOMINAL_LINGER_ON_CLOSE * 1000 )
         else MASK_b
         ).format( _, NOMINAL_LINGER_ON_CLOSE )
       )
...
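The first of the three directions above, zmq_socket_monitor(), is exposed in pyzmq via Socket.get_monitor_socket() and the zmq.utils.monitor helpers. A minimal sketch ( the helper name, the silence timeout, and the per-event policy are illustrative assumptions ):

```python
import zmq
from zmq.utils.monitor import recv_monitor_message


def drain_events(monitor, timeout_ms=500):
    """Collect transport-level events (CONNECTED, DISCONNECTED, ...) from a
    socket's monitor channel until it stays silent for timeout_ms.

    The monitor must be obtained via sock.get_monitor_socket() *before*
    the events of interest occur.
    """
    poller = zmq.Poller()
    poller.register(monitor, zmq.POLLIN)
    events = []
    while poller.poll(timeout_ms):
        # each message decodes to {'event': ..., 'value': ..., 'endpoint': ...}
        events.append(recv_monitor_message(monitor))
    return events


# usage sketch:
#   monitor = sock.get_monitor_socket()      # start observing first
#   sock.connect("tcp://127.0.0.1:5555")
#   for evt in drain_events(monitor):
#       if evt['event'] == zmq.EVENT_DISCONNECTED:
#           ...                               # transport path was lost
```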
My evangelisation of always, indeed ALWAYS, being rather EXPLICIT is based on having seen creeping "defaults" that changed from version to version ( which it is fair to expect to continue the same way forth ). A responsible design shall therefore, indeed ALWAYS, imperatively re-enter those very values that we want to be in place, as our code will survive both our current version and us, mortals. We have zero warranty of which parts of our current assumptions will remain "defaults" in any future version / revision, and the same uncertainty about what version mix will be present in the domain of our deployed piece of code ( as of EoY-2020 there are still v2.1, v2.11, v3.x and v4.x running wild somewhere out there, so one never knows, does one? )

pySerial Capturing a long response

Hi guys, I'm working on a script that will get data from a host using the Data Communications Standard (developed by the Data Communication Standard Committee, Lens Processing Division of The Vision Council) over a serial port, and pass the data into the ModBus protocol for the device to perform its operations.
Since I don't physically have access to the host machine, I'm trying to develop a secondary script to emulate the host. I am currently at the stage where I need to read a lot of information from the serial port, and I get only part of the data. I was hoping to get the whole string sent by the send_job() function in my host emulator script.
Also, can any of you tell me whether this would be a good approach? The only thing the machine is supposed to do is grab 2 values from the host response and assign them to two ModBus holding registers.
NOTE: the initialization function is hard-coded because it will always be the same, and the actual response data will not matter except for status. The job request is also hard-coded; I only pass the job # that I get from a ModBus holding register. The exact logic of how the host resolves this should not matter; I only need to send the job number scanned from the device in this format.
main script:
def request_job_modbus(job):
    data = F'[06][1c]req=33[0d][0a]job={job}[0d][0a][1e][1d]'.encode('ascii')
    writer(data)

def get_job_from_serial():
    response = serial_client.read_all()
    resp = response.decode()
    return resp

# TODO : SEND INIT SEQUENCE ONCE AND VERIFY IF REQUEST status=0
initiation_request()
init_response_status = get_init_status()
print('init method being active')
print(get_init_status())

while True:
    # TODO: get job request data
    job_serial = get_job_from_serial()
    print(job_serial)
host emulation script:
def send_job():
    job_response = '''[06][1c]ans=33[0d]job=30925[0d]status=0;"ok"[0d]do=l[0d]add=;2.50[0d]ar=1[0d]
bcerin=;3.93[0d]bcerup=;-2.97[0d]crib=;64.00[0d]do=l[0d]ellh=;64.00[0d]engmask=;613l[0d]
erdrin=;0.00[0d]erdrup=;10.00[0d]ernrin=;2.00[0d]ernrup=;-8.00[0d]ersgin=;0.00[0d]
ersgup=;4.00[0d]gax=;0.00[0d]gbasex=;-5.30[0d]gcrosx=;-7.96[0d]kprva=;275[0d]kprvm=;0.55[0d]
ldpath=\\uscqx-tcpmain-at\lds\iot\do\800468.sdf[0d]lmatid=;151[0d]lmatname=;f50[0d]
lnam=;vsp_basic_fh15[0d]sgerin=;0.00[0d]sgerup=;0.00[0d]sval=;5.18[0d]text_11=;[0d]
text_12=;[0d]tind=;1.53[0d][1e][1d]'''.encode('ascii')
    writer(job_response)

def get_init_request():
    req = p.readline()
    print(req)
    request = req.decode()[4:11]
    # print(request)
    if request == 'req=ini':
        print('request == req=ini??? <<<<<<< condition met, sending the response')
        send_init_response()
        send_job()

while True:
    # print(get_init_request())
    get_init_request()
what I get in screen: main script
init method being active
bce
erd
condition was met init status=0
outside loop
ers
condition was met init status=0
inside while loop
trigger reset <<<--------------------
5782
`:lmatid=;151[0d]lmatname=;f50[0d]
lnam=;vsp_basic_fh15[0d]sgerin=;0.00[0d]sgerup=;0.00[0d]sval=;5.18[0d]text_11=;[0d]
text_12=;[0d]tind=;1.53[0d][1e][1d]
outside loop
condition was met init status=0
outside loop
what I get in screen: host emulation script
b'[1c]req=ini[0d][0a][1e][1d]'
request == req=ini??? <<<<<<< condition met, sending the response
b''
b'[06][1c]req=33[0d][0a]job=5782[0d][0a][1e][1d]'
b''
b''
b''
b''
b''
b''
I suspect you're trying to write too much at once to a hardware buffer that is fairly small. Especially when dealing with low-power hardware, assuming you can stuff an entire message into a buffer is often incorrect. Even fully modern PCs sometimes have very small buffers for legacy hardware like serial ports. You may find, when you switch from development to actual hardware, that the RTS and DTR lines need to be used to determine when to send or receive data. This will unfortunately be up to whoever designed the hardware, as these lines are also often ignored.
I would try chunking your data transfer into smaller bits as a test to see if the whole message gets through. This is a quick and dirty first attempt that may have bugs, but it should get you down the right path:
def get_job_from_serial():
    response = b''  # buffer for response
    while True:
        try:
            response += serial_client.read()  # read any available data or wait for timeout
            # this technically could only be reading 1 char at a time, but any
            # remotely modern pc should easily keep up with 9600 baud
        except serial.SerialTimeoutException:  # timeout probably means end of data
            # you could also presumably check the length of the buffer if it's always
            # a fixed length to determine if the entire message has been sent yet.
            break
    return response

def writer(command):
    written = 0      # how many bytes have we actually written
    chunksize = 128  # the smaller you go, the less likely to overflow
                     # a buffer, but the slower you go.
    while written < len(command):
        # you presumably might have to wait for p.dtr() == True or similar
        # though it's just as likely to not have been implemented.
        written += p.write(command[written:written + chunksize])
    p.flush()  # probably don't actually need this
P.S. I had to go to the source code for p.read_all (for some reason I couldn't find it online), and it does not do what I think you expect it to do. The exact code for it is:
def read_all(self):
    """\
    Read all bytes currently available in the buffer of the OS.
    """
    return self.read(self.in_waiting)
There is no concept of waiting for a complete message; it is just shorthand for grabbing everything currently available.

python socketserver occasionally stops sending (and receiving?) messages

I've been experiencing a problem with a socketserver I wrote where the socketserver will seem to stop sending and receiving data on one of the ports it uses (while the other port continues to handle data just fine). Interestingly, after waiting a minute (or up to an hour or so), the socketserver will start sending and receiving messages again without any observable intervention.
I am using the Eventlet socketing framework, python 2.7, everything running on an ubuntu aws instance with external apps opening persistent connections to the socketserver.
From some reading I've been doing, it looks like I may not be implementing my socket server correctly.
According to http://docs.python.org/howto/sockets.html:
fundamental truth of sockets: messages must either be fixed length (yuck), or be delimited (shrug), or indicate how long they are (much better), or end by shutting down the connection.
I am not entirely sure that I am using a fixed-length message here (or am I?)
This is how I am receiving my data:
def socket_handler(sock, socket_type):
    logg(1, "socket_handler:initializing")
    while True:
        recv = sock.recv(1024)
        if not recv:
            logg(1, "didn't receive anything")
            break
        if len(recv) > 5:
            logg(1, "socket handler: %s" % recv)
            plug_id, phone_sid, recv_json = parse_json(recv)
            send = 1
            if "success" in recv_json and recv_json["success"] == "true" and socket_type == "plug":
                send = 0
            if send == 1:
                send_wrapper(sock, message_relayer(recv, socket_type))
            else:
                logg(2, 'socket_handler:Ignoring received input: ' + str(recv))
    logg(1, 'Closing socket handle: [%s]' % str(sock))
    sock.shutdown(socket.SHUT_RDWR)
    sock.close()
"sock" is a socket object returned by the listener.accept() function.
The socket_handler function is called like so:
new_connection, address = listener.accept()
...<code omitted>...
pool.spawn_n(socket_handler, new_connection, socket_type)
Does my implementation look incorrect to anyone? Am I basically implementing a fixed length conversation protocol? What can I do to help investigate the issue or make my code more robust?
Thanks in advance,
T
You might be having buffering-related problems if you're requesting to receive more bytes at the server (1024) than you're actually sending from the client.
To fix the problem, what is usually done is to encode the length of the message first and then the message itself. This way, the receiver can get the length field (which is of known size) and then read the rest of the message based on the decoded length.
Note: The length field is usually as many bytes long as you need in your protocol. Some protocols are 4-byte aligned and use a 32 bit field for this, but if you find that you've got enough with 1 or 2 bytes, then you can use that. The point here is that both client and server know the size of this field.
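A minimal sketch of the length-prefixed framing described above, using a 4-byte big-endian length field ( the field width and the helper names are illustrative choices, not mandated by any standard ):

```python
import socket
import struct


def send_msg(sock: socket.socket, payload: bytes) -> None:
    """Prefix the payload with a 4-byte network-order length, then send both."""
    sock.sendall(struct.pack('!I', len(payload)) + payload)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because recv() may return short reads."""
    buf = b''
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError('peer closed mid-message')
        buf += chunk
    return buf


def recv_msg(sock: socket.socket) -> bytes:
    """Read the fixed-size length field, then exactly that many payload bytes."""
    (length,) = struct.unpack('!I', recv_exact(sock, 4))
    return recv_exact(sock, length)
```

Because both sides agree on the 4-byte field, the receiver never over- or under-reads a message boundary, even when several messages arrive back-to-back in one TCP segment.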

How to handle a broken pipe (SIGPIPE) in python?

I've written a simple multi-threaded game server in python that creates a new thread for each client connection. I'm finding that every now and then, the server will crash because of a broken-pipe/SIGPIPE error. I'm pretty sure it is happening when the program tries to send a response back to a client that is no longer present.
What is a good way to deal with this? My preferred resolution would simply close the server-side connection to the client and move on, rather than exit the entire program.
PS: This question/answer deals with the problem in a generic way; how specifically should I solve it?
Assuming that you are using the standard socket module, you should be catching the socket.error: (32, 'Broken pipe') exception (not IOError as others have suggested). This will be raised in the case that you've described, i.e. sending/writing to a socket for which the remote side has disconnected.
import socket, errno, time

# setup socket to listen for incoming connections
s = socket.socket()
s.bind(('localhost', 1234))
s.listen(1)

remote, address = s.accept()
print "Got connection from: ", address

while 1:
    try:
        remote.send("message to peer\n")
        time.sleep(1)
    except socket.error, e:
        if isinstance(e.args, tuple):
            print "errno is %d" % e[0]
            if e[0] == errno.EPIPE:
                # remote peer disconnected
                print "Detected remote disconnect"
            else:
                # determine and handle different error
                pass
        else:
            print "socket error ", e
        remote.close()
        break
    except IOError, e:
        # Hmmm, Can IOError actually be raised by the socket module?
        print "Got IOError: ", e
        break
Note that this exception will not always be raised on the first write to a closed socket - more usually the second write (unless the number of bytes written in the first write is larger than the socket's buffer size). You need to keep this in mind in case your application thinks that the remote end received the data from the first write when it may have already disconnected.
You can reduce the incidence (but not entirely eliminate) of this by using select.select() (or poll). Check for data ready to read from the peer before attempting a write. If select reports that there is data available to read from the peer socket, read it using socket.recv(). If this returns an empty string, the remote peer has closed the connection. Because there is still a race condition here, you'll still need to catch and handle the exception.
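A Python 3 sketch of that select-before-write check ( the function name and the 4096-byte drain size are assumptions; the race window remains, so the exception handling stays necessary ):

```python
import select
import socket


def safe_send(sock: socket.socket, data: bytes) -> bool:
    """Check for a peer close before writing; return False instead of crashing.

    select() with a zero timeout polls without blocking. recv() returning b''
    means the remote end closed the connection. A disconnect can still race in
    between the check and the write, so EPIPE-family errors are caught too.
    """
    readable, _, _ = select.select([sock], [], [], 0)
    if readable:
        if sock.recv(4096) == b'':        # drains (and discards) any pending data
            return False                  # peer already closed; don't write
    try:
        sock.sendall(data)
        return True
    except (BrokenPipeError, ConnectionResetError):
        return False                      # the race condition fired anyway
```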
Twisted is great for this sort of thing, however, it sounds like you've already written a fair bit of code.
Read up on the try: statement.
try:
    # do something
except socket.error, e:
    # A socket error
except IOError, e:
    if e.errno == errno.EPIPE:
        # EPIPE error
    else:
        # Other error
SIGPIPE (although I think maybe you mean EPIPE?) occurs on sockets when you shut down a socket and then send data to it. The simple solution is not to shut the socket down before trying to send it data. This can also happen on pipes, but it doesn't sound like that's what you're experiencing, since it's a network server.
You can also just apply the band-aid of catching the exception in some top-level handler in each thread.
Of course, if you used Twisted rather than spawning a new thread for each client connection, you probably wouldn't have this problem. It's really hard (maybe impossible, depending on your application) to get the ordering of close and write operations correct if multiple threads are dealing with the same I/O channel.
I faced the same question. But when I submitted the same code the next time, it just worked.
The first time it broke:
$ packet_write_wait: Connection to 10.. port 22: Broken pipe
The second time it works:
[1] Done nohup python -u add_asc_dec.py > add2.log 2>&1
I guess the reason may be related to the current server environment.
My answer is very close to S.Lott's, except I'd be even more particular:
import errno

try:
    # do something
except IOError, e:
    # ooops, check the attributes of e to see precisely what happened.
    if e.errno != errno.EPIPE:
        # I don't know how to handle this
        raise
where errno.EPIPE (32 on Linux) is the error number for a broken pipe. This way you won't attempt to handle a permissions error or anything else you're not equipped for.
