Why won't ZMQ drop messages?

I have an application which fetches messages from a ZeroMQ publisher, using a PUB/SUB setup. The reader is slow sometimes so I set a HWM on both the sender and receiver. I expect that the receiver will fill the buffer and jump to catch up when it recovers from processing slowdowns. But the behavior that I observe is that it never drops! ZeroMQ seems to be ignoring the HWM. Am I doing something wrong?
Here's a minimal example:
publisher.py
import zmq
import time

ctx = zmq.Context()
sock = ctx.socket(zmq.PUB)
sock.setsockopt(zmq.SNDHWM, 1)  # send-side high-water mark of 1 message
sock.bind("tcp://*:5556")
i = 0
while True:
    sock.send(str(i).encode())  # publish the counter as text
    print(i)
    time.sleep(0.1)
    i += 1
subscriber.py
import zmq
import time

ctx = zmq.Context()
sock = ctx.socket(zmq.SUB)
sock.setsockopt(zmq.SUBSCRIBE, b"")  # subscribe to everything
sock.setsockopt(zmq.RCVHWM, 1)  # receive-side high-water mark of 1 message
sock.connect("tcp://localhost:5556")
while True:
    print(sock.recv())
    time.sleep(0.5)  # simulate a slow reader

I believe there are a couple things at play here:
High-water marks are not exact (see the last paragraph in the linked section). Typically this means the real queue size will be smaller than the number you set; I don't know how this behaves at 1.
Your PUB HWM will never drop messages. Because of the way PUB sockets work, the socket always processes a message immediately, whether or not a subscriber is available. So unless it actually takes ZMQ 0.1 seconds to move a message through the queue, your HWM will never come into play on the PUB side.
What should be happening is something like the following (I'm assuming an order of operations that would allow you to actually receive the first published message):
Start up subscriber.py & wait a suitable period to make sure it's completely spun up (basically immediately)
Start up publisher.py
PUB processes and sends the first message, SUB receives and processes the first message
PUB sleeps for .1 seconds and processes & sends the second message
SUB sleeps for .5 seconds, the socket receives the second message but sits in queue until the next call to sock.recv() processes it
PUB sleeps for .1 seconds and processes & sends the third message
SUB is still sleeping for another .3 seconds, so the third message should hit the queue behind the second message, which would make 2 messages in the queue, and the third one should drop due to the HWM
... etc etc etc.
I suggest the following changes to help troubleshoot the issue:
Remove the HWM on your publisher... it does nothing but add a variable we don't need to deal with in your test case, since we never expect it to change anything. If you need it for your production environment, add it back in and test it in a high volume scenario later.
Change the HWM on your subscriber to 50. It'll make the test take longer, but you won't be at the extreme edge case, and since the ZMQ documentation states that the HWM isn't exact, the extreme edge cases could cause unexpected behavior. Mind you, I believe your test (being small numbers) wouldn't do that, but I haven't looked at the code implementing the queues so I can't say with certainty, and it may be possible that your data is small enough that your effective HWM is actually larger.
Change your subscriber sleep time to 3 full seconds... in theory, if your queue holds up to exactly 50 messages, you'll saturate that within two loops (just like you do now), and then you'll have to wait 2.5 minutes to work through those messages to see if you start getting skips, which after the first 50 messages should start jumping large groups of numbers. But I'd wait at least 5-10 minutes. If you find that you start skipping after 100 or 200 messages, then you're being bitten by the smallness of your data.
This of course doesn't address what happens if you still don't skip any messages. If you make these changes and still see the same issue, then we may need to dig further into how high-water marks actually work; there may be something we're missing.
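To make the skips this test looks for easy to spot, the subscriber can compare each received counter with the previous one. This is a minimal sketch, assuming the messages are the decimal counters sent by publisher.py; `find_gaps` is a hypothetical helper, not part of pyzmq.

```python
def find_gaps(received):
    """Return (last_seen, next_seen) pairs wherever counters were skipped."""
    gaps = []
    previous = None
    for raw in received:
        current = int(raw)  # publisher.py sends the counter as decimal text
        if previous is not None and current != previous + 1:
            gaps.append((previous, current))
        previous = current
    return gaps


# Example: counters 3 and 4 never arrived.
sample = [b"0", b"1", b"2", b"5", b"6"]
print(find_gaps(sample))
```

In the subscriber loop, the same comparison could be made one message at a time, logging a line whenever the new counter is not the previous one plus one.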

I ran into exactly the same problem, and my demo is nearly the same as yours: neither the subscriber nor the publisher drops any messages after zmq.RCVHWM or zmq.SNDHWM is set to 1.
I worked around it by using the Suicidal Snail pattern for slow-subscriber detection, from Chapter 5 of the zguide. Hope it helps.
BTW: would you please let me know if you've solved the zmq HWM problem?
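As described in the zguide, the Suicidal Snail idea is that the publisher timestamps each message and the subscriber gives up when a message arrives too late. Below is a rough sketch of just that check. The wire format (timestamp, a space, then the payload), the one-second limit, and the names `stamp`, `check` and `SlowSubscriberError` are all assumptions made for illustration, and the clocks on both machines are assumed to be reasonably synchronized.

```python
import time

MAX_ALLOWED_DELAY = 1.0  # seconds; an assumed threshold, tune for your system


class SlowSubscriberError(RuntimeError):
    """Raised when the subscriber has fallen too far behind the publisher."""


def stamp(payload, now=None):
    """Publisher side: prefix the payload with the send time."""
    now = time.time() if now is None else now
    return ("%.6f " % now).encode() + payload


def check(message, now=None):
    """Subscriber side: return the payload, or raise if it is too old."""
    now = time.time() if now is None else now
    sent_at, _, payload = message.partition(b" ")
    lag = now - float(sent_at.decode())
    if lag > MAX_ALLOWED_DELAY:
        raise SlowSubscriberError("message is %.2f s old" % lag)
    return payload
```

The publisher would call `sock.send(stamp(data))` and the subscriber `data = check(sock.recv())`, letting the exception terminate (or restart) a subscriber that cannot keep up.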

Related

Why ZeroMQ SUBs are missing messages?

I built about 12000 subscribers per computer with threading, as follows:
subscriber side:
import threading
import zmq

def client(id):
    context=zmq.Context()
    subscriber=context.socket(zmq.SUB)
    subscriber.connect('ip:port')
    subscriber.setsockopt(zmq.SUBSCRIBE,(id+'e').encode())
    while 1:
        signal=subscriber.recv_multipart()
        # write logs...

for i in range(12000):
    threading.Thread(target=client,args=(str(i+j*12000),)).start()
# j is an arbitrary int, different on each computer
publisher side:
publisher=zmq.Context().socket(zmq.PUB)
publisher.bind('tcp://*:port')
while 1:
    for id in client_id:
        publisher.send_multipart([(id+'e').encode()]+[message])
When I used more than one computer (by using a different j on each) to build subscribers, sometimes some subscribers could not receive messages at all.
If I restart the subscribers, those that could not receive messages become normal, but those that were normal become unable to receive messages.
This problem does not raise any errors; it can only be found in my logs.
Can an excessive number of connections cause this problem?

As the numbers of connections, messages, and sizes grow larger and larger, some default guesstimates typically cease to suffice. Try raising some of the otherwise working defaults in the PUB-side configuration, where the problem seems to start choking. Do not forget that since v3.?+ the subscription-list processing has been moved from the SUB-side(s) to the central PUB-side. That reduces the volume of data-flow, yet at some additional (and here remarkably growing) cost on the PUB-side: RAM-for-buffers + CPU-for-TOPIC-list-filtering.
So, let's start with these steps on the PUB-side :
aSock2SUBs = zmq.Context( _tweak_nIOthreads ).socket( zmq.PUB ) # MORE CPU POWER
aSock2SUBs.setsockopt( zmq.SNDBUF, _tweak_SIZE_with_SO_SNDBUF ) # ROOM IN SNDBUF
And last but not least, PUB sockets silently drop any messages that do not "fit" under their current HighWaterMark level, so let's tweak this one too :
aSock2SUBs.setsockopt( zmq.SNDHWM, _tweak_HWM_till_no_DROPs ) # TILL NO DROPS
Other { TCP_* | TOS | RECONNECT_IVL* | BACKLOG | IMMEDIATE | HEARTBEAT_* | ... } low-level parameter settings may help further to let your herd of 12k+ SUB-s live in peace side by side with other ( both friendly & hostile ) traffic, and make your application more robust than if it relied just on pre-cooked API-defaults.
Consult the ZeroMQ API documentation together with the O/S defaults, as many of these ZeroMQ low-level attributes also depend on the actual O/S configuration values.
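Put together, the PUB-side tweaks above might look like the following sketch. The three numbers are placeholders chosen only for illustration (they are not measured recommendations), and `make_publisher` is a hypothetical helper name.

```python
import zmq

# Illustrative values only (assumptions, not recommendations).
IO_THREADS = 4
SNDBUF_BYTES = 4 * 1024 * 1024
SNDHWM_MESSAGES = 100000


def make_publisher(endpoint):
    """Create a PUB socket with enlarged buffers and bind it to `endpoint`."""
    context = zmq.Context(io_threads=IO_THREADS)
    publisher = context.socket(zmq.PUB)
    # Set options before bind(): they apply to connections made afterwards.
    publisher.setsockopt(zmq.SNDBUF, SNDBUF_BYTES)
    publisher.setsockopt(zmq.SNDHWM, SNDHWM_MESSAGES)
    publisher.bind(endpoint)
    return context, publisher
```

The values would then be raised or lowered while watching the subscriber logs, until the missing-message symptom disappears or memory use on the PUB host becomes the limiting factor.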
Be warned, too, that creating 12k+ threads in Python still leaves a purely [SERIAL] code execution. Ownership of the central Python GIL-lock is exclusive, which prevents (yes, principally prevents) any form of [CONCURRENT] co-execution: it re-[SERIAL]-ises any number of threads into a waiting queue and results in a plain sequence of executed chunks. By default, Python 2 switches threads every 100 instructions. Since Python 3.2, by default, the GIL is released after 5 milliseconds ( 5,000 [us] ) so that another thread has a chance to try & acquire the GIL-lock. You can change these defaults if the war of 12k+ threads over ownership of the GIL-lock actually ends up "almost-blocking" the TCP/IP-instrumentation for message buffering, stacking, sending, and re-transmitting until reception is confirmed in time. One may test this up to the bleeding edge, yet choosing some safer ceiling might help once the other parameters have been well adjusted for robustness.
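On Python 3 the switching default mentioned above is exposed through `sys.getswitchinterval()` and `sys.setswitchinterval()`. The sketch below wraps them in a small context manager so an experiment can change the value and restore it afterwards; `switch_interval` is a hypothetical helper name and 0.001 is an arbitrary example value, not a recommendation.

```python
import sys
from contextlib import contextmanager


@contextmanager
def switch_interval(seconds):
    """Temporarily change the interval after which a thread is asked to release the GIL."""
    previous = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        yield previous
    finally:
        sys.setswitchinterval(previous)  # always restore the old setting


print(sys.getswitchinterval())  # current interval, in seconds
with switch_interval(0.001):
    pass  # start and run the subscriber threads here
```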
Last but not least, enjoy the Zen-of-Zero, the masterpiece of Martin SUSTRIK for distributed-computing, so well crafted for ultimately scalable, almost zero-latency, very comfortable, widely ported signalling & messaging framework.

Further to user3666197's answer, you may also have to consider the time taken for all of those clients to connect. The PUBlisher has no idea how many SUBscribers there are supposed to be, and will simply get on with the job of sending out messages to those SUBscribers presently connected, from when the very first connection is made. The PUBlisher socket does not hang on to messages it has sent just in case more SUBscribers connect at some undefined time in the future. Once a message has been transferred to 1 or more SUBscribers, it's dropped from the PUBlisher's queue. Also, the connections are not made instantaneously, and 12,000 is quite a few to get through.
It doesn't matter whether you start your PUBlisher or SUBscriber program first; your 12,000 connections will be made over a period of time once both programs are running, and this happens asynchronously with respect to your own thread(s). Some SUBscribers will start getting messages whilst others will still be unknown to the PUBlisher. When, finally, all 12,000 connections are made, it will smooth out.
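One possible way to let the sending side learn which subscribers have arrived is to bind an XPUB socket instead of PUB: an XPUB socket delivers subscription notifications (a first byte of 1 for subscribe, 0 for unsubscribe, followed by the topic) that the program can read. The following is only a sketch under that assumption; `wait_for_subscriptions` is a hypothetical helper, and it relies on every subscriber using a distinct topic, as in the question, because by default repeated subscriptions to an already-known topic are not reported.

```python
import zmq


def wait_for_subscriptions(xpub, expected, timeout_ms=1000):
    """Collect subscribe notifications until `expected` topics have arrived
    or no notification shows up within `timeout_ms`. Returns the topics."""
    poller = zmq.Poller()
    poller.register(xpub, zmq.POLLIN)
    topics = []
    while len(topics) < expected:
        if not poller.poll(timeout_ms):
            break  # nothing arrived in time; the caller decides what to do
        event = xpub.recv()
        if event[:1] == b"\x01":  # first byte 1 = subscribe, 0 = unsubscribe
            topics.append(event[1:])
    return topics
```

The publisher could call this before entering its send loop and compare the returned topics with the ids it expects, rather than assuming that all 12,000 connections are already in place.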

A Process to check if Infinite Loop is still running in Python3

I am unable to work out, in terms of general programming concepts, how to handle the following scenario:
Note: All data transmission in this scenario is done via UDP packets using the socket module of Python 3.
I have a server which sends a certain amount of data, say 300 packets, over a WiFi channel.
At the other end, I have a receiver which runs a decoding process to decode the data. This decoding process is a kind of infinite loop which returns a Boolean value, True or False, at every iteration, depending on certain aspects which can be neglected for now.
A rough code snippet (Python 3) is as follows:
# this if condition is inside an infinite loop:
# unless the if condition becomes True,
# keep consuming data in a forever for loop
incomingPacket = next(bringNextFromBuffer)
if decoder.consume_data(incomingPacket):
    print("Data has been received")
Everything works at the moment, since the server and client are in close proximity and the data can be decoded. But in practical scenarios I want to monitor the loop mentioned above. For instance, if the loop is still in its forever (infinite) state after a certain amount of time, I would like to send something back to the server to restart the data transmission.
I am not very clear on the concept of multithreading, but can I use a thread in this scenario?
For example:
Run a thread for a certain amount of time that keeps checking the decoder.consume_data() function; if the time expires and the output is still False, can I then send some kind of feedback to the server using struct.pack() over sockets?
Of course, the networking logic need not be addressed for now. But is Python capable of monitoring this infinite loop via a parallel thread or some other programming concept?
Caveats
Unfortunately, the receiver in question is a dumb receiver, i.e. no user control is available. The only thing the receiver can do is decode the data and perhaps send feedback to the server stating whether the data was received, and that is possible only when the loop mentioned above has completed.
What is a possible solution here?
(Would be happy to share more information on request)

Yes you can do this. Roughly it'll look like this:
from threading import Thread
from time import sleep

state = 'running'

def monitor():
    while True:
        if state == 'running':
            tell_client()
        sleep(1) # to prevent too much happening here

Thread(target=monitor).start()
while state == 'running':
    receive_data()
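A variation on the same idea adds the time limit the question asks about: a watcher thread waits on a `threading.Event` and reports only if decoding has not finished in time. This is a sketch rather than a drop-in solution; `run_with_watchdog`, `decode_step` and `on_timeout` are hypothetical names, where `decode_step` stands for one iteration of the decoding loop and `on_timeout` for whatever sends the feedback to the server.

```python
import threading
import time


def run_with_watchdog(decode_step, on_timeout, timeout_s):
    """Call decode_step() until it returns True; call on_timeout() once if
    that has not happened within timeout_s seconds."""
    done = threading.Event()

    def watchdog():
        # wait() returns False only if the timeout elapsed before done was set.
        if not done.wait(timeout_s):
            on_timeout()

    watcher = threading.Thread(target=watchdog, daemon=True)
    watcher.start()
    try:
        while not decode_step():
            pass  # decode_step is expected to block on the next packet
    finally:
        done.set()  # stop the watchdog if it is still waiting
        watcher.join()
```

Note that `on_timeout` runs in the watcher thread, so it fires even while the main loop is still blocked waiting for packets.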

How to abort context.socket.recv() the right way in ZeroMQ?

I have a small software where I have a separate thread which is waiting for ZeroMQ messages. I am using the PUB/SUB communication protocol of ZeroMQ.
Currently I am aborting that thread by setting a variable "cont_loop" to False.
But I discovered that, when no messages arrive to the ZeroMQ subscriber I cannot exit the thread (without taking down the whole program).
def __init__(self):
    Thread.__init__(self)
    self.cont_loop = True

def abort(self):
    self.cont_loop = False

def run(self):
    zmq_context = zmq.Context()
    zmq_socket = zmq_context.socket(zmq.SUB)
    zmq_socket.bind("tcp://*:%s" % 5556)
    zmq_socket.setsockopt(zmq.SUBSCRIBE, b"")
    while self.cont_loop:
        data = zmq_socket.recv()  # blocks until a message arrives
        print("Message: %s" % data)
    zmq_socket.close()
    zmq_context.term()
    print("exit")
I tried moving socket.close() and context.term() into the abort method so that it shuts down the subscriber, but this killed the whole program.
What is the correct way to shut down the above program?

Q: What is the correct way to ... ?
A: There are many ways to achieve the set goal. Let me pick just one, as a mock-up example on how to handle distributed process-to-process messaging.
First. Assume, there are more priorities in typical software design task. Some higher, some lower, some even so low, that one can defer an execution of these low-priority sub-tasks, so that there remains more time in the scheduler, to execute those sub-tasks, that cannot handle waiting.
That said, let's look at your code. The SUB-side call to .recv(), as used here, does two things. One is visible: it performs a RECEIVE operation on a ZeroMQ-socket with a SUB-behaviour. The second, less visible one is that it remains hanging until it gets something "compatible" with the current state of the SUB-behaviour ( more on setting this later ).
This means it also BLOCKS from the moment of the .recv() method call UNTIL some unknown, locally uncontrollable coincidence of states/events makes it deliver a ZeroMQ-message whose content is "compatible" with the locally pre-set state of this (still blocking) SUB-behaviour instance.
That may take ages.
This is exactly why .recv() is rather used inside a control-loop, where external handling gets both the chance & the responsibility to do what you want ( including abort-related operations & a fair / graceful termination with proper release(s) of resources ).
The receive call then becomes .recv( flags = zmq.NOBLOCK ), wrapped in a try: / except: block. That way your local process does not lose its control over the stream-of-events ( incl. the NOP being one such ).
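A minimal sketch of such a control-loop is shown below, assuming a `threading.Event` serves as the abort flag. With `zmq.NOBLOCK`, pyzmq raises `zmq.Again` when nothing is waiting, so the loop gets a chance to re-check the flag. `receive_until_stopped`, `handle` and the 10 ms idle sleep are illustrative choices, not part of the original code.

```python
import time
import zmq


def receive_until_stopped(socket, stop_event, handle, idle_sleep_s=0.01):
    """Receive from `socket` without blocking until `stop_event` is set."""
    while not stop_event.is_set():
        try:
            message = socket.recv(flags=zmq.NOBLOCK)
        except zmq.Again:
            # Nothing queued right now: yield briefly, then re-check the flag.
            time.sleep(idle_sleep_s)
            continue
        handle(message)
```

The thread's run() method would call this and then close the socket and terminate the context itself, while abort() only sets the event, so the socket is never touched from a second thread.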
The best next step?
Take your time and work through a great book of gems, "Code Connected, Volume 1", which Pieter HINTJENS, co-father of ZeroMQ, has published ( also as PDF ).
Many of his thoughts, and the errors to avoid that he shared with us, are indeed worth your time.
Enjoy the powers of ZeroMQ. It's very powerful & worth mastering top-down.

Python Game Server - Optimizing networking

I have a Python game server running. The game is turn-based. Players complain about short lag spikes which are several turns in duration (i.e. their client is stuck waiting for a while and then suddenly they receive several turns' worth of updates all at once). I'm hoping to find some way to improve the networking consistency, but I'm not sure what's left to be done.
Here's what I'm doing:
Asynchronous sockets
TCP_NODELAY flag is set
epoll for polling
Here's how I receive plays:
for (fileno, event) in events:
    if fileno == self.server_socket.fileno():
        self.add_new_client()
    elif event & select.EPOLLIN:
        c = self.clients[fileno]
        c.read() # reads and processes all input and generates all output
    ...
And here's how I send updates:
if turn_finished:
    for user in self.clients.itervalues():
        for msg in user.queued_messages:
            msg = self.encode(msg)
            bytes_sent = user.socket.send(msg)
            ...
Whenever I write to or read from sockets, I check that all bytes were sent and log any socket errors. These things almost never show up in the logs.
Would it be better if I only did one socket.send() call?
Is there anything I can check or tweak on the linux (Ubuntu) host?
There seems to be some issue with data arriving late to the clients. Can anyone give any suggestions for debugging this issue?
About the messages sent:
If a player is idle, they typically receive 0-3 updates a turn. If a player is in the middle of action with other players, they typically receive 2-3 updates per other player on their screen. Updates are typically <20 bytes long. There are a couple of updates (only sent to 1 player at a time) that are 256 and 500-600 bytes in size.
There are typically about 10-50 players active at a time, and no more than 10 on the same screen.
P.S. - I run with PyPy. I've profiled and everything looks good. All player moves are handled in <3 ms and the server idles the vast majority of the time.

It sounds like you need to debug the client and find out what it is doing while it is "stuck".
Only once you understand the cause will you be able to find a solution.

Receiving multiple messages via socketserver but one is sent

I have an application with two threads. It's a network-controlled game.
1. thread (Server)
Accept socket connections and receive messages
When message is sent, create an event and add it to the queue
Code:
class SingleTCPHandler(SocketServer.StreamRequestHandler):
    def handle(self):
        try:
            while True:
                sleep(0.06)
                message = self.rfile.readline().strip()
                my_event = pygame.event.Event(USEREVENT, {'control':message})
                print message
                pygame.event.post(my_event)
2. thread (pygame)
In charge of game rendering
Receives messages via event queue which Server populates
Renders the game based on messages every 60ms
This is how the game looks. The control messages are just speeds for the little square.
For debugging purposes, I connect to the server from a virtual machine with:
ncat 192.168.56.1 2000
And then send control messages. In production, these messages will be sent every 50ms by an Android device.
The problem
In my debug environment, I manually type messages a few seconds apart. During the time I don't type anything, the game gets rendered many times. What happens is that the game keeps being rendered with the previously received value of the message (from the server code).
I send the following:
1:0.5
On the console where the app is started, I receive the following, due to the line print message in the server code:
alan@alan ~/.../py $ python main.py
1:0.5
The game acts as if it were constantly receiving this value (at the period at which it renders, not every few seconds as I type).
Since that is happening, I would expect the print message inside while True to output constantly as well, so that the output is:
alan@alan ~/.../py $ python main.py
1:0.5
1:0.5
1:0.5
1:0.5
....
However that is not the case. Please advise (I'm also open for proposals to what to change the subject to if it isn't explanatory enough)

Your while True loop is polling the socket, which is only going to get messages when they are sent. It has no idea what the downstream event consumer is doing with those messages, and does not care; it is just going to dispatch an event for, and print the contents of, the next record on the socket queue every 0.06 seconds. If you want the game to print the current command on every render loop, you'll have to put the print statement in the render loop itself, not in the socket poller. Also, since you seem to want the last command to "stick" and not post a new event unless the user actually inputs something, you might want to put an if message: block around the event dispatch code in the socket handler you have here. Right now, you'll send an empty event every 0.06 seconds if the user hasn't provided any input since the last time you checked.
I also don't think it's advisable to put a sleep, or the loop you have for that matter, in your socket handler. The SocketServer is going to be calling it every time you receive data on the socket, so that loop is effectively being done for you, and all that doing it here will achieve, I think, is to open you up to overflowing the buffer. If you want to control how often you post events to pygame, you probably want to do that either by blocking events of a certain type from being added if there is already 1 queued, or by grabbing all events of a given type from the queue on each game loop and then ignoring all but the first or last one. You could also control it by checking in the handler whether some amount of time has passed since the last event was posted, but then you have to make sure the event consumer is capable of handling an event queue with multiple events waiting on it, and does the appropriate queue flushing when needed.
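The "ignore all but the last one" option can be sketched without pygame, using plain objects that carry a control attribute like the events posted by the handler. `latest_control` is a hypothetical helper; in the real game loop its first argument would be the list of queued events of that type (for example the result of pygame.event.get(USEREVENT)).

```python
from types import SimpleNamespace


def latest_control(events, previous):
    """Return the newest non-empty 'control' value, or `previous` if none."""
    for event in reversed(events):
        control = getattr(event, "control", None)
        if control:
            return control
    return previous


# Stand-ins for three queued events; the middle one is empty.
queued = [
    SimpleNamespace(control="1:0.5"),
    SimpleNamespace(control=""),
    SimpleNamespace(control="1:0.7"),
]
print(latest_control(queued, "0:0"))
```

With an empty queue the previous value is returned, which gives the "sticky" last command without posting anything new.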
Edit:
Docs:
The difference is that the readline() call in the second handler will call recv() multiple times until it encounters a newline character, while the single recv() call in the first handler will just return what has been sent from the client in one sendall() call.
So yes, reading the whole line is guaranteed. In fact, I don't think the try is necessary either, since this won't even be called unless there is input to handle.
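The readline() behaviour quoted above can be checked with the standard library alone. This sketch (Python 3) uses socket.socketpair() in place of a real client and server, and makefile("rb") as a stand-in for the handler's self.rfile; the line is deliberately sent in two pieces.

```python
import socket

left, right = socket.socketpair()
reader = right.makefile("rb")  # buffered file object, similar to self.rfile

# The sender delivers one line in two separate pieces.
left.sendall(b"1:")
left.sendall(b"0.5\n")

line = reader.readline().strip()  # keeps reading until the newline arrives
print(line)  # b'1:0.5'

reader.close()
left.close()
right.close()
```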
