How to know if the remote TCP device is powered off - python

In my Go code, I am establishing a TCP connection as below:
conn, err := net.Dial("tcp", "<remote_address>")
if err == nil {
    buf := make([]byte, 256)
    // Read returns the number of bytes read and an error, not the text itself
    _, err := conn.Read(buf)
    if err == io.EOF {
        // remote connection closed; handle it here
        fmt.Println("connection got reset by peer")
        panic(err)
    }
}
To simulate the other end, I am running a Python script on a different computer; it opens a socket and sends some random data to the socket the above code is reading from. Now my problem is: when I kill this Python script by pressing Ctrl+C, the remote-connection-closed event is recognised fine by the above code and I get a chance to handle it.
However, if I simply power off the remote computer (where the Python script is running), my code doesn't get notified at all.
In my case, the connection should stay open all the time and data may arrive at random intervals; only if the remote machine is powered off should my Go code be notified.
Can someone help me with this scenario: how would I get notified when the remote machine hosting the socket is powered off? How would my Go code get that trigger remotely?
PS: This seems to be a pretty common problem in real deployments, though not in a testing environment.

There is no way to determine the difference between a host that is powered off and a connection that has been broken, so you treat them the same way.
You can send a heartbeat message on your own, and close the connection when you reach some timeout period between heartbeat packets. The timeout can either be set manually by timing the packets, or you can use SetReadDeadline before each read to terminate the connection immediately when the deadline is reached.
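For example, here is a minimal sketch of the read-deadline approach, reusing the conn from the question and assuming the usual net, time, and fmt imports; the 30-second heartbeat timeout is an arbitrary choice:
buf := make([]byte, 256)
for {
    // The peer is expected to send something (a heartbeat or real data)
    // at least every 30 seconds; reset the deadline before every read.
    conn.SetReadDeadline(time.Now().Add(30 * time.Second))
    _, err := conn.Read(buf)
    if netErr, ok := err.(net.Error); ok && netErr.Timeout() {
        fmt.Println("no data within deadline: peer presumed dead")
        conn.Close()
        break
    }
    if err != nil { // io.EOF and friends: connection closed
        conn.Close()
        break
    }
}
A powered-off peer and a cut cable both show up the same way here: the deadline simply expires.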
You can also use TCP Keepalive to do this for you, using TCPConn.SetKeepAlive to enable it and TCPConn.SetKeepAlivePeriod to set the interval between keepalive packets. The time it takes to actually close the connection will be system dependent.
You should also set a timeout when dialing, since connecting to a down host isn't guaranteed to return an ICMP Host Unreachable response. You can use DialTimeout, a net.Dialer with the Timeout parameter set, or Dialer.DialContext.
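For example (a sketch, assuming the net, time, and context imports; the five-second timeout is arbitrary):
// Fails within 5s instead of hanging if the host is already down.
conn, err := net.DialTimeout("tcp", "<remote_address>", 5*time.Second)

// Equivalent, using a Dialer:
d := net.Dialer{Timeout: 5 * time.Second}
conn, err = d.DialContext(context.Background(), "tcp", "<remote_address>")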
Simply reading through the stdlib documentation should provide you with plenty of information: https://golang.org/pkg/net/

You need to add some kind of heartbeat message. Then, looking at the Go documentation, you can use DialTimeout instead of Dial, and reset the timeout each time you receive the heartbeat message or any other data.
Another alternative is to use TCP keepalive, which you can enable in Python using setsockopt. I can't really help you with Go, but this link seems like a good description of how to enable keepalive there:
http://felixge.de/2014/08/26/tcp-keepalive-with-golang.html
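On the Go side, enabling keepalive looks roughly like this sketch (conn is the net.Conn from the question; the 30-second period is an arbitrary choice):
if tcpConn, ok := conn.(*net.TCPConn); ok {
    tcpConn.SetKeepAlive(true)                   // enable TCP keepalive probes
    tcpConn.SetKeepAlivePeriod(30 * time.Second) // probe after 30s of idleness
}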

Related

Notification for FIN/ACK using python socket

I have a basic implementation of a TCP client using Python sockets; all the client does is connect to a server and send heartbeats every X seconds. The problem is that I don't want to send the server a heartbeat if the connection is closed, but I'm not sure how to detect this situation without actually sending a heartbeat and catching an exception.
When I turn off the server, in the traffic capture I see a FIN/ACK arriving and the client sends an ACK back; this is when I want my code to do something (or at least change some internal state of the connection). Currently, what happens is that after the server went down and X seconds passed since the last heartbeat, the client tries to send another heartbeat; only then do I see an RST packet in the traffic capture and get a broken-pipe exception (errno 32).
Clearly Python's socket handles the transport layer and the heartbeats are part of the application layer. The problem I want to solve is not to send the redundant heartbeat after the FIN/ACK arrived from the server. Is there any simple way to know the connection state with a Python socket?
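One way to structure this, shown as a Go sketch to match the main question at the top (the idea itself is language-independent): keep a reader blocked on the socket, so the read returns as soon as the FIN arrives, and have the heartbeat loop check that state before sending. This assumes an established conn plus the net, time, and sync/atomic imports; the peerClosed flag and five-second interval are illustrative:
var peerClosed atomic.Bool // set once the peer's FIN is seen (read fails/EOF)

go func() {
    buf := make([]byte, 256)
    for {
        if _, err := conn.Read(buf); err != nil { // io.EOF right after FIN/ACK
            peerClosed.Store(true)
            return
        }
    }
}()

for !peerClosed.Load() {
    conn.Write([]byte("heartbeat\n")) // no longer sent once the FIN was seen
    time.Sleep(5 * time.Second)
}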

Using sniffing with python elasticsearch client to solve dead TCP connection issues

The Python elasticsearch client in my application is having connectivity issues (refused connections) because idle TCP connections time out due to a firewall (I have no way to prevent this).
The easiest way for me to fix this would be to prevent the connection from going idle by sending some data over it periodically; the sniffing options in the elasticsearch client seem ideal for this, however they're not very well documented:
sniff_on_start – flag indicating whether to obtain a list of nodes from the cluster at startup time
sniffer_timeout – number of seconds between automatic sniffs
sniff_on_connection_fail – flag controlling if connection failure triggers a sniff
sniff_timeout – timeout used for the sniff request - it should be a fast api call and we are talking potentially to more nodes so we want to fail quickly. Not used during initial sniffing (if sniff_on_start is on) when the connection still isn’t initialized.
What I would like is for the client to sniff every (say) 5 minutes; should I be using the sniff_timeout or sniffer_timeout option? Also, should the sniff_on_start parameter be set to True?
I used the suggestion from #val and found that these settings solved my problem:
sniff_on_start=True
sniffer_timeout=60
sniff_on_connection_fail=True
The sniffing puts enough traffic on the TCP connections that they are never idle long enough for our firewall to kill the connection.

Python; Troubles controlling dead sockets through select

I have some code which connects to a host and does nothing but listen for incoming data until either the client is shut down or the host sends a close statement. For this my code works well.
However, when the host dies without sending a close statement, my client keeps listening for incoming data forever. To resolve this I made the socket time out every foo seconds and start the process of checking whether the connection is alive or not. From the Python socket howto I found this:
One very nasty problem with select: if somewhere in those input lists of sockets is one which has died a nasty death, the select will fail. You then need to loop through every single damn socket in all those lists and do a select([sock],[],[],0) until you find the bad one. That timeout of 0 means it won’t take long, but it’s ugly.
# Example code written for this question.
from select import select
from socket import socket, AF_INET, SOCK_STREAM

sock = socket(AF_INET, SOCK_STREAM)
sock.connect(('localhost', 12345))
socklist = [sock]
attempts = 0

def check_socklist(socks):
    # Probe each socket on its own with a zero timeout to find the bad one.
    for s in socks:
        (r, w, e) = select([s], [], [], 0)
        ...
        ...
        ...

while True:
    (r, w, e) = select(socklist, [], [], 60)
    for s in r:
        if s is sock:
            msg = s.recv(4096)
            if not msg:
                attempts += 1
                if attempts >= 10:
                    check_socklist(socklist)
                    break
            else:
                attempts = 0
                print(msg)
This text raises three questions.
I was taught that to check whether a connection is alive, one writes to the socket and sees if a response comes back; if not, the connection must be assumed dead. The text instead says that to check for bad connections, you single out each socket, pass it as select's first parameter, and set the timeout to zero. How does this confirm whether the socket is dead?
Why not test whether the socket is dead or alive by trying to write to it instead?
What am I looking for when the connection is alive versus dead? select will time out at once, so having no data there proves nothing.
I realize there are libraries like gevent, asyncore and twisted that can help me with this, but I have chosen to do this myself to get a better understanding of what is happening and to have more control over the source.
If a connected client crashes or exits, but its host OS and computer are still running, then its OS's TCP stack will send your server a FIN packet to let your computer's TCP stack know that the TCP connection has been closed. Your Python app will see this as select() indicating that the client's socket is ready-for-read, and then when you call recv() on the socket, recv() will return 0. When that happens, you should respond by closing the socket.
If the connected client's computer never gets a chance to send a FIN packet, on the other hand (e.g. because somebody reached over and yanked its Ethernet cord or power cable out of the socket), then your server won't realize that the TCP connection is defunct for quite a while -- possibly forever. The easiest way to avoid having a "zombie socket" is simply to have your server send some dummy data on the socket every so often, e.g. once per minute or something. The client should know to discard the dummy data. The benefit of sending the dummy data is that your server's TCP stack will then notice that it's not getting any ACK packets back for the data packet(s) it sent, and will resend them; and after a few resends your server's TCP stack will give up and decide that the connection is dead, at which point you'll see the same behavior that I described in my first paragraph.
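A sketch of that dummy-data approach in Go, to match the main question (assuming an established conn and the net, time, and fmt imports; the one-minute interval and payload are arbitrary):
ticker := time.NewTicker(time.Minute)
defer ticker.Stop()
for range ticker.C {
    if _, err := conn.Write([]byte("dummy\n")); err != nil {
        // Writes keep succeeding while the stack retransmits; they only
        // start failing once TCP has given up (or an RST came back).
        fmt.Println("peer presumed dead:", err)
        conn.Close()
        break
    }
}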
If you write something to a socket and then wait for an answer to check the connection, the server has to support this kind of "ping" message, which is not always the case; a server that doesn't expect such a message may misbehave or disconnect your client.
If select fails in the way you described, the socket framework knows which socket is dead; you just need to find it. But if a socket dies a nasty death like a server app crash, that doesn't necessarily mean the client's socket framework will detect it. E.g. when a client is waiting for messages from the server and the server crashes, in some cases the client can wait forever. For example, PuTTY, to avoid this scenario, can use an application-level ping (its SSH ping option) to check the connection; an SSH server can use TCP keepalive to check the connection and to prevent network equipment from dropping idle connections.
(See point 1.)
You are right that select's timeout with no data proves nothing. As the documentation says, you have to check every socket when select fails.

Socket Lose Connection

I know Twisted can do this well but what about just plain socket?
How would you tell if you randomly lost your connection with a plain socket? Say, if my internet went out for a second and came back on.
I'm assuming you're talking about TCP.
If your internet connection is out for a second, you might not lose the TCP connection at all, it'll just retransmit and resume operation.
There are of course hundreds of other reasons you could lose the connection (e.g. a NAT gateway in between decided to silently drop the connection, the other end gets hit by a nuke, your router burns up, the guy at the other end yanks out his network cable, etc.).
Here's what you should do if you need to detect dead peers/closed sockets etc.:
Read from the socket, or in any other way wait for events of incoming data on it. This lets you detect when the connection was gracefully closed or an error occurred on it (reading returns 0 or -1), at least if the other end is still able to send a TCP FIN/RST or ICMP packet to your host.
Write to the socket, e.g. send some heartbeats every N seconds. Just reading from the socket won't detect the problem when the other end fails silently; if that PC goes offline, it obviously can't tell you that it did, so you'll have to send it something and see if it responds.
If you don't want to write heartbeats every N seconds, you can at least turn on TCP keepalive, and you'll eventually get notified if the peer is dead. You still have to read from the socket, and keepalive probes are usually sent only every 2 hours by default. That's still better than keeping dead sockets around for months, though.
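Putting the read and write advice above together, a Go sketch to match the main question (assuming an established conn and the usual imports; the ten-second interval is arbitrary):
errc := make(chan error, 2)
go func() { // reader: sees a graceful close (EOF) or an RST as an error
    buf := make([]byte, 256)
    for {
        if _, err := conn.Read(buf); err != nil {
            errc <- err
            return
        }
    }
}()
go func() { // writer: heartbeats flush out silent failures eventually
    for {
        if _, err := conn.Write([]byte("ping\n")); err != nil {
            errc <- err
            return
        }
        time.Sleep(10 * time.Second)
    }
}()
fmt.Println("connection lost:", <-errc) // first error from either side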
If the internet comes and goes momentarily, you might not actually lose the TCP session. If you do, the socket API will throw some kind of exception, usually socket.timeout.

Monitoring a TCP port

For fun, I've been toying around with writing a load balancer in Python and have been trying to figure out the best (correct?) way to test whether a port is available and the remote host is still there.
I'm finding that, once connected, it becomes difficult to tell when the remote host goes down. I've turned keepalive on, but can't get it to recognize a downed connection sooner than a minute (I realize polling more often than once a minute might be overkill, but let's say I wanted to), even after setting the various TCP_KEEPALIVE options to their lowest.
When I use nonblocking sockets, I've noticed that a recv() returns an error ("resource temporarily unavailable") when it reads from a live socket with no data, but returns "" when reading from a dead one (a send and recv of 0 bytes, which might be the cause?). That seems like an odd way to test for connectedness, though, and it makes it impossible to tell whether the connection died after some data was sent.
Aside from connecting/disconnecting for every check, is there something I can do? Can I manually send a TCP keepalive, or can I establish a lower-level connection that will let me test connectivity without sending real data the remote server would potentially process?
I'd recommend not leaving your (single) test socket connected - make a new connection each time you need to poll. Every load balancer / server availability system I've ever seen uses this method instead of a persistent connection.
If the remote server hasn't responded within a reasonable amount of time (e.g. 10s) mark it as "down". Use timers and signals rather than function response codes to handle that timeout.
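In Go the per-poll check can be as small as this sketch (hostUp, the example address, and both timeouts are illustrative choices, assuming the net, time, and fmt imports):
func hostUp(addr string) bool {
    conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
    if err != nil {
        return false // refused, timed out, or unreachable
    }
    conn.Close() // we only cared that the port accepted the connection
    return true
}

// Poll each backend every 30 seconds:
for range time.Tick(30 * time.Second) {
    if !hostUp("192.0.2.10:80") {
        fmt.Println("backend down")
    }
}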
"it becomes difficult to tell when the remote host goes down"
Correct. This is a feature of TCP. The whole point of TCP is to have an enduring connection between ports. Theoretically an application can drop and reconnect to the port through TCP (the socket libraries don't provide a lot of support for this, but it's part of the TCP protocol).
ping was invented for that purpose
Also, you might be able to send malformed TCP packets to your destination. For example, in the TCP header there is a flag for acknowledging end of transmission: the FIN flag. If you send a message with ACK and FIN set, the remote host should complain with a return packet, and you'll be able to measure the round-trip time.
It is theoretically possible to spam keepalive packets, but to set very low intervals you may need to dig into raw sockets, and your host may ignore them if they come in too fast.
The best way to check whether a host is alive over a TCP connection is to send data and wait for it to be acknowledged. Note, though, that a successful send() only means the data was accepted by the local TCP stack, not that the peer ACKed it; the failure surfaces later, as an error on a subsequent call.
You can use Bash's pseudo-device files to open a TCP/UDP connection to a specific host and port, for example:
printf "" > /dev/tcp/example.com/80 && echo Works
This opens the connection but doesn't send anything. You can test it with:
nc -vl 1234 &
printf "" > /dev/tcp/localhost/1234
For simple monitoring, use cron with the above command, or use watch:
watch bash -c 'echo > /dev/tcp/localhost/1234 && echo Works || echo FAIL'
However, it's recommended to use tools specifically designed for this, such as Monit, Nagios, etc.
Monit
Here is an example rule using Monit:
# Verify host.
check host example with address example.com
    if failed
        port 80
        protocol http
    then alert
