Synopsis:
My program occasionally runs into a condition where it wants to send data over a socket, but that socket is blocked waiting for a response to a previous command that never came. Is there any way to unblock the socket and pick back up with it when this happens? If not that, how could I test whether the socket is blocked so I could close it and open a new one? (I need blocking sockets in the first place)
Details:
I'm connecting to a server over two sockets. Socket 1 is for general command communication. Socket 2 is for aborting running commands. Aborts can come at any time and frequently. Every command sent over socket 1 gets a response, such as:
socket1 send: set command data
socket1 read: set command ack
There is always some time between the send and the read, as the server doesn't send anything back until the command is finished executing.
To interrupt commands in progress, I connect over a another socket and issue an abort command. I then use socket 1 to issue a new command.
I am finding that occasionally commands issued over socket 1 after an abort are hanging the program. It appears that socket 1 is blocked waiting for a response to a previously issued command that never returned (and that got interrupted). While usually it works sometimes it doesn't (I didn't write the server).
In these cases, is there any way for me to check to see if socket 1 is blocked waiting for a read, and if so, abandon that read and move on? Or even any way to check at all so I can close that socket and start again?
thx!
UPDATE 1: thanks for the answers. As for why I'm using blocking sockets, it's because I'm controlling a CNC-type machine with this code, and I need to know when the command I've asked it to execute is done executing. The server returns the ACK when it's done, so that seems like a good way to handle it. I like the idea of refactoring for non-blocking but can't envision a way to get info on when the command is done otherwise. I'll look at select and the other options.
Not meaning to seem disagreeable, but you say you need blocking sockets and then go on to describe some very good reasons for needing non-blocking sockets. I would recommend refactoring to use non-blocking.
Aside from that, the only method I'm aware of to know if a socket is blocked is the fact that your program called recv or one of its variants and has not yet returned. Someone else may know an API that I don't, but setting a "blocked" boolean before the recv call and clearing it afterward is probably the best hack to get you that info. But you don't want to do that. Trust me, the refactor will be worth it in the long run.
The traditional solution to this problem is to use select. Before writing, test whether the socket will support writing, and if not, do something else (such as waiting for a response first). One level above select, Python provides the asyncore module to enable such processing. Two level above, Twisted is an entire framework dealing with asynchronous processing of messages.
Sockets should be full duplex. If Python blocks a thread from writing to a socket while another thread is reading from the same socket I would regard it as a major bug in Python. This doesn't occur in any other programming language I've used.
What you really what is to block on a select() or poll(). The only way to unblock a blocked socket is to receive data or a signal which is probably not acceptable. A select() or poll() call can block waiting for one or more sockets, either on reading or writing (waiting for buffer space). They can also take a timeout if you want to wait periodically to check on other things. Take a look at my answer to Block Socket with Unix and C/C++ Help
Related
In my (python) code I have a thread listening for changes from a couchdb feed (continuous changes). The changes request has a timeout parameter which is too big in certain circumstances (for example when a user wants to interrupt the program manually with ^C).
How can I abort a long-running blocking http request?
Is this possible, or do I need to reduce the timeout to make my program more responsive?
This would be unfortunate, because having a timeout small enough to make the program really responsive (say, 1s), means that there are lots of connections being created (one per second!), which defeats the purpose of listening to changes, and makes it very difficult to make sure that we are not missing any changes (in the re-connecting timespan we can indeed miss changes, so that special code is needed to handle that case)
The other option is to forcefully abort the thread, but that is not really an option in python.
If I understand correctly it looks like you are waiting too long between requests before deciding whether to respond to the users or not. You are right continuously closing and creating new connections will defeat the purpose of changes feed.
A solution could be to use heartbeat query parameter in which couchdb will keep sending newlines to tell the client that the connection is still alive.
http://localhost:5984/hello/_changes?feed=continuous&heartbeat=1000&include_docs=true
as long as you are getting heartbeats (newlines) you can be sure that you are getting new changes. A new line will indicate that no changes have occurred. Where as an actual change will be reported back. No need to close the connection. Respond to your clients if resp!="/n"
Blocking the thread execution in general prevents the thread from beeing terminated. You need to wait until the request timed out. But this is already clear.
Using a library that supports non blocking requests is maybe a solution, but I don't know if there is any.
Anyway ... you've mentioned that reducing the timeout will lead to more connections. I'd suggest to implement a waiting loop between requests that can be interrupted by an external signal to terminate the thread. with this loop you can control the number of requests independent from the timeout.
Not sure if I have myself a problem with a python script I'm using.
Basically, it spawns threads, each creating a tcp connections.
Well the script finishes, i even check if any threads are still functioning, all of them return False ( not active which is good).
The issue is that, if I check ( I use CurPorts from nirsoft ) some tcp connections ( between 1 and 9 sometimes ) are still in established status and sometimes in Sent status.
Is that a problem ?They die out eventually, but after several minutes.
IS that on python's side fault, or basic windows procedure?
I close the sockets with S.close, if that's of any help. I'm not using daemon threads, just simple threads, and I just wait until all of them finish (t.join())
I need to know i I should worry or just let them be. reason is that they are eating up the ephemeral port number, and besides that i do not know if its keeping resources from me.
Would appreciate a noob friendly response.
Last Edit: I do use S.shutdown() before I send S.close()
I do not do any error checking though, I have no idea how to go about doing that besides try:
The simple answer is that TCP connections need to be gracefully shutdown before being closed.
There are more details on how to call shutdown and close there:
How do you close TCP connections gracefully without exceptions?
socket.shutdown vs socket.close
http://msdn.microsoft.com/en-us/library/ms738547(VS.85).aspx
Sorry to bother everyone with this, but I've been stumped for a while now.
The problem is that I decided to reconfigure this chat program I had using sockets so that instead of a client and a sever/client, it would have a server, and then two separate clients.
I asked earlier as to how I might get my server to 'manage' these connections of the clients so that it could redirect the data between them. And I got a fantastic answer that provided me with exactly the code I would apparently need to do this.
The problem is I don't understand how it works, and I did ask in the comments but I didn't get much of a reply except for some links to documentation.
Here's what I was given:
connections = []
while True:
rlist,wlist,xlist = select.select(connections + [s],[],[])
for i in rlist:
if i == s:
conn,addr = s.accept()
connections.append(conn)
continue
data = i.recv(1024)
for q in connections:
if q != i and q != s:
q.send(data)
As far as I understand, the select module gives the ability to make waitable objects in the case of select.select.
I've got the rlist, the pending to be read list, the wlist, the pending to be written list, and then the xlist, the pending exceptional condition.
He's assigning the pending to be written list to "s" which in my part of the chat server, is the socket that is listening on the assigned port.
That's about as much as I feel I understand clearly enough. But I would really really like some explanation.
If you don't feel like I asked an appropriate question, tell me in the comments and I'll delete it. I don't want to violate any rules, and I'm pretty sure I am not duplicating threads as I did do research for a while before resorting to asking.
Thanks!
Note: my explanation here assumes you're talking about TCP sockets, or at least some type which is connection-based. UDP and other datagram (i.e. non-connection-based) sockets are similar in some ways, but the way you use select on them in slightly different.
Each socket is like an open file which can have data read and written to it. Data that you write goes into a buffer inside the system waiting to be sent out on the network. Data that arrives from the network is buffered inside the system until you read it. Lots of clever stuff is going on underneath, but when you're using a socket that's all you really need to know (at least initially).
It's often useful to remember that the system is doing this buffering in the explanation that follows, because you'll realise that the TCP/IP stack in the OS sends and receives data independently of your application - this is done so your application can have a simple interface (that's what the socket is, a way of hiding all the TCP/IP complexity from your code).
One way of doing this reading and writing is blocking. Using that system, when you call recv(), for example, if there is data waiting in the system then it will be returned immediately. However, if there is no data waiting then the call blocks - that is, your program halts until there is data to read. Sometimes you can do this with a timeout, but in pure blocking IO then you really can wait forever until the other end either sends some data or closes the connection.
This doesn't work too badly for some simple cases, but only where you're talking to one other machine - when you're talking on more than one socket, you can't just wait for data from one machine because the other one may be sending you stuff. There are also other issues which I won't cover in too much detail here - suffice to say it's not a good approach.
One solution is to use different threads for each connection, so the blocking is OK - other threads for other connections can be blocked without affecting each other. In this case you'd need two threads for each connection, one to read and one to write. However, threads can be tricky beasts - you need to carefully synchronise your data between them, which can make coding a little complicated. Also, they're somewhat inefficient for a simple task like this.
The select module allows you a single-threaded solution to this problem - instead of blocking on a single connection, it allows you a function which says "go to sleep until at least one of these sockets has some data I can read on it" (that's a simplification which I'll correct in a moment). So, once that call to select.select() returns, you can be certain that one of the connections you're waiting on has some data, and you can safely read it (even with blocking IO, if you're careful - since you're sure there's data there, you won't ever block waiting for it).
When you first start your application, you have only a single socket which is your listening socket. So, you only pass that in the call to select.select(). The simplification I made earlier is that actually the call accepts three lists of sockets for reading, writing and errors. The sockets in the first list are watched for reading - so, if any of them have data to read, the select.select() function returns control to your program. The second list is for writing - you might think you can always write to a socket, but actually if the other end of the connection isn't reading data fast enough then your system's write buffer can fill up and you can temporarily be unable to write. It looks like the person who gave you your code ignored this complexity, which isn't too bad for a simple example because usually the buffers are big enough you're unlikely to hit problems in simple cases like this, but it's an issue you should address in the future once the rest of your code works. The final list is watched for errors - this isn't widely used, so I'll skip it for now. Passing the empty list is fine here.
At this point someone connects to your server - as far as select.select() is concerned this counts as making the listen socket "readable", so the function returns and the list of readable sockets (the first return value) will include the listen socket.
The next part runs over all the connections which have data to read, and you can see the special case for your listen socket s. The code calls accept() on it which will take the next waiting new connection from the listen socket and turn it into a brand new socket for that connection (the listen socket continues to listen and may have other new connections also waiting on it, but that's fine - I'll cover this in a second). The brand new socket is added to the connections list and that's the end of handling the listen socket - the continue will move on to the next connection returned from select.select(), if any.
For other connections that are readable, the code calls recv() on them to recover the next 1024 bytes (or whatever is available if less than 1024 bytes). Important note - if you hadn't used select.select() to make sure the connection was readable, this call to recv() could block and halt your program until data arrived on that specific connection - hopefully this illustrates why the select.select() is required.
Once some data has been read the code runs over all the other connections (if any) and uses the send() method to copy the data down them. The code correctly skips the same connection as the data just arrived on (that's the business about q != i) and also skips s, but as it happens this isn't required since as far as I can see it's never actually added to the connections list.
Once all readable connections have been processed, the code returns to the select.select() loop to wait for more data. Note that if a connection still has data, the call returns immediately - this is why accepting only a single connection from the listen socket is OK. If there are more connections, select.select() will return again immediately and the loop can handle the next available connection. You can use non-blocking IO to make this a bit more efficient, but it makes things more complicated so let's keep things simple for now.
This is a reasonable illustration, but unfortunately it suffers from some problems:
As I mentioned, the code assumes you can always call send() safely, but if you have one connection where the other end isn't receiving properly (maybe that machine is overloaded) then your code here could fill up the send buffer and then hang when it tries to call send().
The code doesn't cope with connections closing, which will often result in an empty string being returned from recv(). This should result in the connection being closed and removed from the connections list, but this code doesn't do it.
I've updated the code slightly to try and solve these two issues:
connections = []
buffered_output = {}
while True:
rlist,wlist,xlist = select.select(connections + [s],buffered_output.keys(),[])
for i in rlist:
if i == s:
conn,addr = s.accept()
connections.append(conn)
continue
try:
data = i.recv(1024)
except socket.error:
data = ""
if data:
for q in connections:
if q != i:
buffered_output[q] = buffered_output.get(q, b"") + data
else:
i.close()
connections.remove(i)
if i in buffered_output:
del buffered_output[i]
for i in wlist:
if i not in buffered_output:
continue
bytes_sent = i.send(buffered_output[i])
buffered_output[i] = buffered_output[i][bytes_sent:]
if not buffered_output[i]:
del buffered_output[i]
I should point out here that I've assumed that if the remote end closes the connection, we also want to close immediately here. Strictly speaking this ignores the potential for TCP half-close, where the remote end has sent a request and closes its end, but still expects data back. I believe very old versions of HTTP used to sometimes do this to indicate the end of the request, but in practice this is rarely used any more and probably isn't relevant to your example.
Also it's worth noting that a lot of people make their sockets non-blocking when using select - this means that a call to recv() or send() which would otherwise block will instead return an error (raise an exception in Python terms). This is done partly for safety, to make sure a careless bit of code doesn't end up blocking the application; but it also allows some slightly more efficient approaches, such as reading or writing data in multiple chunks until there's none left. Using blocking IO this is impossible because the select.select() call only guarantees there's some data to read or write - it doesn't guarantee how much. So you can only safely call a blocking send() or recv() once on each connection before you need to call select.select() again to see whether you can do so again. The same applies to the accept() on a listening socket.
The efficiency savings are only generally a problem on systems which have a large number of busy connections, however, so in your case I'd keep things simple and not worry about blocking for now. In your case, if your application seems to hang up and become unresponsive then chances are you're doing a blocking call somewhere where you shouldn't.
Finally, if you want to make this code portable and/or faster, it might be worth looking at something like libev, which essentially has several alternatives to select.select() which work well on different platforms. The principles are broadly similar, however, so it's probably best to focus on select for now until you get your code running, and the investigate changing it later.
Also, I note that a commenter has suggested Twisted which is a framework which offers a higher-level abstraction so that you don't need to worry about all of the details. Personally I've had some issues with it in the past, such as it being difficult to trap errors in a convenient way, but many people use it very successfully - it's just an issue of whether their approach suits the way you think about things. Worth investigating at the very least to see whether its style suits you better than it does me. I come from a background writing networking code in C/C++ so perhaps I'm just sticking to what I know (the Python select module is quite close to the C/C++ version on which it's based).
Hopefully I've explained things sufficiently there - if you still have questions, let me know in the comments and I can add more detail to my answer.
I've been struggling along with sockets, making OK progress, but I keep running into problems, and feeling like I must be doing something wrong for things to be this hard.
There are plenty of tutorials out there that implement a TCP client and server, usually where:
The server runs in an infinite loop, listening for and echoing back data to clients.
The client connects to the server, sends a message, receives the same thing back, and then quits.
That I can handle. However, no one seems to go into the details of what you should and shouldn't be doing with sequential communication between the same two machines/processes.
I'm after the general sequence of function calls for doing multiple messages, but for the sake of asking a real question, here are some constraints:
Each event will be a single message client->server, and a single string response.
The messages are pretty short, say 100 characters max.
The events occur relatively slowly, max of say, 1 every 5 seconds, but usually less than half that speed.
and some specific questions:
Should the server be closing the connection after its response, or trying to hang on to the connection until the next communication?
Likewise, should the client close the connection after it receives the response, or try to reuse the connection?
Does a closed connection (either through close() or through some error) mean the end of the communication, or the end of the life of the entire object?
Can I reuse the object by connecting again?
Can I do so on the same port of the server?
Or do I have reinstantiate another socket object with a fresh call to socket.socket()?
What should I be doing to avoid getting 'address in use' errors?
If a recv() times out, is the socket reusable, or should I throw it away? Again, can I start a new connection with the same socket object, or do I need a whole new socket?
If you know that you will communicate between the two processes soon again, there is no need for closing the connection. If your server has to deal with other connections as well, you want to make it multithreaded, though.
The same. You know that both have to do the same thing, right?
You have to create a new socket on the client and you can also not reuse the socket on the server side: you have to use the new socket returned by the next (clientsocket, address) = serversocket.accept() call. You can use the same port. (Think of webservers, they always accept connections to the same port, from thousands of clients)
In both cases (closing or not closing), you should however have a message termination sign, for example a \n. Then you have to read from the socket until you have reached the sign. This usage is so common, that python has a construct for that: socket.makefile and file.readline
UPDATE:
Post the code. Probably you have not closed the connection correctly.
You can call recv() again.
UPDATE 2:
You should never assume that the connection is reliable, but include mechanisms to reconnect in case of errors. Therefore it is ok to try to use the same connection even if there are longer gaps.
As for errors you get: if you need specific help for your code, you should post small (but complete) examples.
I started using ZeroMQ this week, and when using the Request-Response pattern I am not sure how to have a worker safely "hang up" and close his socket without possibly dropping a message and causing the customer who sent that message to never get a response. Imagine a worker written in Python who looks something like this:
import zmq
c = zmq.Context()
s = c.socket(zmq.REP)
s.connect('tcp://127.0.0.1:9999')
while i in range(8):
s.recv()
s.send('reply')
s.close()
I have been doing experiments and have found that a customer at 127.0.0.1:9999 of socket type zmq.REQ who makes a fair-queued request just might have the misfortune of having the fair-queuing algorithm choose the above worker right after the worker has done its last send() but before it runs the following close() method. In that case, it seems that the request is received and buffered by the ØMQ stack in the worker process, and that the request is then lost when close() throws out everything associated with the socket.
How can a worker detach "safely" — is there any way to signal "I don't want messages anymore", then (a) loop over any final messages that have arrived during transmission of the signal, (b) generate their replies, and then (c) execute close() with the guarantee that no messages are being thrown away?
Edit: I suppose the raw state that I would want to enter is a "half-closed" state, where no further requests could be received — and the sender would know that — but where the return path is still open so that I can check my incoming buffer for one last arrived message and respond to it if there is one sitting in the buffer.
Edit: In response to a good question, corrected the description to make the number of waiting messages plural, as there could be many connections waiting on replies.
You seem to think that you are trying to avoid a “simple” race condition such as in
... = zmq_recv(fd);
do_something();
zmq_send(fd, answer);
/* Let's hope a new request does not arrive just now, please close it quickly! */
zmq_close(fd);
but I think the problem is that fair queuing (round-robin) makes things even more difficult: you might already even have several queued requests on your worker. The sender will not wait for your worker to be free before sending a new request if it is its turn to receive one, so at the time you call zmq_send other requests might be waiting already.
In fact, it looks like you might have selected the wrong data direction. Instead of having a requests pool send requests to your workers (even when you would prefer not to receive new ones), you might want to have your workers fetch a new request from a requests queue, take care of it, then send the answer.
Of course, it means using XREP/XREQ, but I think it is worth it.
Edit: I wrote some code implementing the other direction to explain what I mean.
I think the problem is that your messaging architecture is wrong. Your workers should use a REQ socket to send a request for work and that way there is only ever one job queued at the worker. Then to acknowledge completion of the work, you could either use another REQ request that doubles as ack for the previous job and request for a new one, or you could have a second control socket.
Some people do this using PUB/SUB for the control so that each worker publishes acks and the master subscribes to them.
You have to remember that with ZeroMQ there are 0 message queues. None at all! Just messages buffered in either the sender or receiver depending on settings like High Water Mark, and type of socket. If you really do need message queues then you need to write a broker app to handle that, or simply switch to AMQP where all communication is through a 3rd party broker.
I've been thinking about this as well. You may want to implement a CLOSE message which notifies the customer that the worker is going away. You could then have the worker drain for a period of time before shutting down. Not ideal, of course, but might be workable.
There is a conflict of interest between sending requests as rapidly as possible to workers, and getting reliability in case a worked crashes or dies. There is an entire section of the ZeroMQ Guide that explains different answers to this question of reliability. Read that, it'll help a lot.
tl;dr workers can/will crash and clients need a resend functionality. The Guide provides reusable code for that, in many languages.
Wouldn't the simplest solution be to have the customer timeout when waiting for the reply and then retry if no reply is received?
Try sleeping before the call to close. This is fixed in 2.1 but not in 2.0 yet.