Using sniffing with python elasticsearch client to solve dead TCP connection issues

The Python elasticsearch client in my application is having connectivity issues (refused connections) because idle TCP connections time out due to a firewall (which I have no way to prevent).
The easiest way for me to fix this would be to prevent the connection from going idle by sending some data over it periodically. The sniffing options in the elasticsearch client seem ideal for this; however, they're not very well documented:
sniff_on_start – flag indicating whether to obtain a list of nodes from the cluster at startup time
sniffer_timeout – number of seconds between automatic sniffs
sniff_on_connection_fail – flag controlling if connection failure triggers a sniff
sniff_timeout – timeout used for the sniff request - it should be a fast api call and we are talking potentially to more nodes so we want to fail quickly. Not used during initial sniffing (if sniff_on_start is on) when the connection still isn't initialized.
What I would like is for the client to sniff every (say) 5 minutes; should I be using the sniff_timeout or sniffer_timeout option? Also, should the sniff_on_start parameter be set to True?

I used the suggestion from @val and found that these settings solved my problem:
sniff_on_start=True
sniffer_timeout=60
sniff_on_connection_fail=True
The sniffing puts enough traffic on the TCP connections that they are never idle long enough for our firewall to kill the connection.
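For reference, a minimal sketch of how those settings are passed to the client constructor (the host URL is a placeholder; the keyword names are those of the 7.x elasticsearch-py client):

from elasticsearch import Elasticsearch

# Placeholder host; the sniffing settings are the ones listed above.
es = Elasticsearch(
    ["http://es-node1:9200"],
    sniff_on_start=True,            # fetch the node list at startup
    sniffer_timeout=60,             # re-sniff (and generate traffic) every 60 seconds
    sniff_on_connection_fail=True,  # re-sniff after a failed connection
)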

Related

Why is there a discrepancy between python sockets and tcp ping for the same IP:port destination?

My setup:
I am using an IP and port provided by portmap.io to allow me to perform port forwarding.
I have OpenVPN installed (as required by portmap.io), and I run a ready-made config file when I want to operate my project.
My main effort involves sending messages between a client and a server using sockets in Python.
I have installed a software called tcping, which basically allows me to ping an IP:port over a tcp connection.
Results I'm getting:
When I try to "ping" said IP with tcping, the average RTT is consistently around 30ms.
When I use the same IP for socket programming in Python, with a server script running on my machine and a client script on any other machine connecting to this IP, and I send a small message like "Hello" over the socket, I find that the message takes significantly longer to travel across, and inconsistently so: sometimes it takes 1 second, sometimes 400ms...
What is the reason for this discrepancy?
What is the reason for this discrepancy?
tcping just measures the time needed to establish the TCP connection. The connection establishment is usually done entirely in the OS kernel, so not even a switch to user space is involved.
Even a small data exchange at the application level is significantly more expensive. First the TCP handshake must complete; usually only then does the client start sending the payload. The payload then has to be delivered to the other side and put into the socket's read buffer; the user-space application has to be scheduled to run, read the data from the buffer, and process it; the response has to be created and handed to the peer's OS kernel; and the kernel has to deliver the response back to the local system, with more steps still before the local app finally gets the response and stops the timer.
Given how far that last measurement is from the pure RTT, I would assume the server system has low performance or high load, or that the application is badly written.
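To see the difference yourself, here is a rough sketch that times the bare TCP handshake separately from a full request/response on the same socket (the host, port, and HTTP request are placeholders):

import socket
import time

HOST, PORT = "example.com", 80  # placeholder target

# Time only the TCP handshake, roughly what tcping measures.
t0 = time.perf_counter()
s = socket.create_connection((HOST, PORT), timeout=5)
connect_ms = (time.perf_counter() - t0) * 1000

# Time a full application-level round trip on the same connection.
t0 = time.perf_counter()
s.sendall(b"HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n")
s.recv(4096)  # block until the first response bytes arrive
reply_ms = (time.perf_counter() - t0) * 1000
s.close()

print(f"connect: {connect_ms:.1f} ms, request/response: {reply_ms:.1f} ms")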

twisted - detection of lost connection takes more than 30 minutes

I've written a TCP client using Python and Twisted; it connects to a server and communicates using a simple string-based protocol (defined by the server manufacturer). The TCP/IP connection should persist and reconnect in case of failure.
When some sort of network error occurs (I assume on the server side or on some node along the way), it takes a very long time for the client to realize that and initiate a new connection, much more than a few minutes.
Is there a way to speed that up? Some sort of built in TCP/IP keep alive functionality that can detect the disconnect sooner?
I can implement a keep-alive mechanism myself and look for timeouts, but I'm not sure that's the best practice in this case. What do you think? Also, when using reactor.connectTCP() and reactor.run() with a ClientFactory, what's the best way to force a reconnection?
Application level keep-alives for TCP-based protocols are a good idea. You should probably implement this. This gives you complete and precise control over the timeout semantics you want from your application.
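A minimal sketch of such an application-level keepalive in Twisted, using twisted.internet.task.LoopingCall (the PING payload and 30-second interval are assumptions; use whatever no-op message your server's protocol permits):

from twisted.internet import task
from twisted.internet.protocol import Protocol

class KeepAliveProtocol(Protocol):
    def connectionMade(self):
        # Periodically write a small message so the connection never idles.
        self._pinger = task.LoopingCall(self.transport.write, b"PING\r\n")
        self._pinger.start(30.0, now=False)

    def connectionLost(self, reason):
        if self._pinger.running:
            self._pinger.stop()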
TCP itself has a keepalive mechanism. You can enable this with an ITCPTransport method call from your protocol. For example:
from twisted.internet.protocol import Protocol

class YourProtocol(Protocol):
    def connectionMade(self):
        self.transport.setTcpKeepAlive(True)
The exact semantics of this keepalive are platform- and configuration-dependent. It's entirely possible that it is already enabled and is what's detecting your connection loss. Thirty minutes is a pretty plausible amount of time for this mechanism to notice a lost connection.
As stated by Jean-Paul Calderone, you can either implement an application-level keepalive or use the TCP keepalive mechanism. The application-level keepalive is the preferred method, as it gives you more fine-grained control.
The TCP keepalive mechanism lives at the OS level and the defaults are OS-dependent, but configurable. For example, the default Linux TCP keepalive works in the following way:
After 2 hours send a keepalive probe.
If this fails, send another probe every 75 seconds.
After 9 consecutive failures, mark the connection as closed. This will be picked up by the server, and it will trigger whatever cleanup mechanisms it has in place.
See: https://en.wikipedia.org/wiki/Keepalive#TCP_keepalive and http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
So while the TCP keepalive will eventually reap your dead connections, it will take quite a long time to kick in.
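If you do need the OS mechanism to react faster, the Linux defaults can be lowered per socket. A sketch of the relevant setsockopt calls (the values are illustrative, and the TCP_KEEP* constants are Linux-specific):

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# First probe after 60s of idle, then every 10s; dead after 5 failed probes.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)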

Efficient way to send results every 1-30 seconds from one machine to another

Key points:
I need to send roughly ~100 float numbers every 1-30 seconds from one machine to another.
The first machine is catching those values through sensors connected to it.
The second machine is listening for them, passing them to an http server (nginx), a telegram bot and another program sending emails with alerts.
How would you do this and why?
Please be accurate. It's the first time I've worked with sockets and with Python, but I'm confident I can do this. Just give me the crucial details, lighten me up!
Some small portion (a few rows) of the code would be appreciated if you think it's a delicate part, but the main goal of my question is to see the big picture.
The main thing here is to decide on a connection design and to choose a protocol, i.e. whether you will hold a persistent connection to your server or connect each time new data is ready.
Then, will you use HTTP POST, WebSockets, or plain sockets? Will you rely exclusively on nginx, or will your data catcher be a separate serving service?
The most secure way, if other people will be connecting to nginx to view sites etc., is to write or use another server running on another port, for example another nginx process just for that. Then use SSL (i.e. HTTPS) with basic authentication to prevent anyone else from abusing the connection.
Then, on the client side, build a packet of all the data every x seconds (pickle.dumps(), json, or something similar), connect to your port with your credentials, and pass the packet. A Python script can wait for it on the server side.
Or you can write a socket server from scratch in Python (not especially hard) to wait for your packets.
The caveat here is that you have to implement your own protocol and security, but you gain some other benefits: it is much easier to maintain a persistent connection if you desire or need to. I don't think that is necessary, though, and the code for recovering from broken connections can become bulky.
No, just wait on some port for a connection. The client must clearly identify itself (else you instantly drop the connection), prove that it speaks your protocol, and then send the data.
Use SSL sockets so that you don't have to implement encryption yourself to protect the authentication data. You may even rely solely on keys built in advance for security, and then pass only data.
Do not worry about the speed. Sockets are handled by the OS, and if you are on a Unix-like system you can connect as many times as you want, in as short an interval as you need. Nothing short of a DoS attack will impact it much.
If you are on Windows, better use some finished server, because Windows sometimes does not release a socket in time, so you will be forced to wait or resort to workarounds (non-blocking sockets, SO_REUSEADDR, and some flow control will be needed).
Since your data is small, you don't have to worry much about the server protocol. I would use HTTPS myself, but I would write my own lightweight server in Python, or modify and run one of the examples from the internet. That's just me, though.
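A minimal sketch of that client side, assuming a TLS-wrapped socket and a JSON payload (the host name, port, and readings are placeholders):

import json
import socket
import ssl

# Placeholder endpoint; assumes the server presents a certificate
# trusted by the default CA store.
context = ssl.create_default_context()
with socket.create_connection(("collector.example.com", 8443)) as raw:
    with context.wrap_socket(raw, server_hostname="collector.example.com") as tls:
        packet = json.dumps({"readings": [20.1, 19.8, 21.4]}).encode()
        tls.sendall(packet + b"\n")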
The simplest thing that could possibly work would be to take your N floats, convert them to a binary message using struct.pack(), and then send them via a UDP socket to the target machine (if it's on a single LAN you could even use UDP multicast, then multiple receivers could get the data if needed). You can safely send a maximum of 60 to 170 double-precision floats in a single UDP datagram (depending on your network).
This requires no application protocol, is easily debugged at the network level using Wireshark, is efficient, and makes it trivial to implement other publishers or subscribers in any language.
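A minimal sketch of that approach, as two separate scripts (the addresses and sensor values are placeholders):

# sender.py, on the sensing machine: pack the floats as
# network-byte-order doubles and fire one datagram.
import socket
import struct

values = [20.1, 19.8, 21.4]  # placeholder readings (~100 floats in practice)
payload = struct.pack(f"!{len(values)}d", *values)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(payload, ("192.0.2.10", 5005))  # placeholder receiver address

# receiver.py, on the second machine: one recvfrom() per datagram,
# unpack back into floats.
import socket
import struct

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", 5005))
data, addr = sock.recvfrom(2048)
readings = struct.unpack(f"!{len(data) // 8}d", data)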

How to detect non-graceful disconnect of Twisted on Linux?

I wrote a server based on Twisted and encountered a problem: some clients disconnect non-gracefully, for example when the user pulls out the network cable.
After a while, the client on Windows notices the disconnect (its connectionLost is called; the client is also written with Twisted). On the Linux server side, however, my connectionLost is never triggered, even when the server tries to write data to the client after the connection is lost. Why can't Twisted detect these non-graceful disconnections (even when writing data to the client) on Linux? How can I make Twisted detect them? Because Twisted can't detect them, I have lots of zombie users on my server.
---- Update ----
I thought it might be a property of sockets on Unix-like OSes, so what is the behavior of a socket on a Unix-like system in this situation?
Thanks.
Victor Lin.
You're describing the behavior of TCP connections on an unreliable network. Twisted is merely exposing this behavior: after all, when you set up a TCP connection with Twisted, it is nothing more than a TCP connection.
You're mistaken when you say that the connectionLost callback isn't invoked even if you try to send data over it. After two minutes, the underlying TCP connection will disappear and Twisted will inform you of this by calling connectionLost.
If you need to detect this condition more quickly than that, then you can implement your own timeouts using reactor.callLater.
Seconding what Jean-Paul said, if you need more fine-grained TCP connection management, just use reactor.callLater. We have exactly that implementation on a Twisted/wxPython trading platform, and it works a treat. You might also want to tweak the behaviour of ReconnectingClientFactory to achieve the results I understand you're looking for.
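A minimal sketch of that kind of reactor.callLater watchdog (the 30-second timeout is an assumption):

from twisted.internet import reactor
from twisted.internet.protocol import Protocol

class WatchdogProtocol(Protocol):
    TIMEOUT = 30  # seconds of silence before we assume the link is dead

    def connectionMade(self):
        self._watchdog = reactor.callLater(
            self.TIMEOUT, self.transport.loseConnection)

    def dataReceived(self, data):
        # Any traffic from the peer pushes the deadline back.
        self._watchdog.reset(self.TIMEOUT)

    def connectionLost(self, reason):
        if self._watchdog.active():
            self._watchdog.cancel()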

Monitoring a tcp port

For fun, I've been toying around with writing a load balancer in Python and have been trying to figure out the best (correct?) way to test whether a port is available and the remote host is still there.
I'm finding that, once connected, it becomes difficult to tell when the remote host goes down. I've turned keepalive on, but can't get it to recognize a downed connection sooner than a minute, even after setting the various TCP_KEEPALIVE options to their lowest (I realize polling more often than once a minute might be overkill, but let's say I wanted to).
When I use non-blocking sockets, I've noticed that recv() returns an error ("resource temporarily unavailable") when reading from a live socket, but returns "" when reading from a dead one (a send and recv of 0 bytes, which might be the cause?). That seems like an odd way to test whether it's connected, though, and it makes it impossible to tell whether the connection died after some data was sent.
Aside from connecting and disconnecting for every check, is there something I can do? Can I manually send a TCP keepalive, or can I establish a lower-level connection that lets me test connectivity without sending real data that the remote server would potentially process?
I'd recommend not leaving your (single) test socket connected - make a new connection each time you need to poll. Every load balancer / server availability system I've ever seen uses this method instead of a persistent connection.
If the remote server hasn't responded within a reasonable amount of time (e.g. 10s) mark it as "down". Use timers and signals rather than function response codes to handle that timeout.
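A sketch of that kind of poll, using a fresh connection each time; here the socket module's own timeout stands in for the timers and signals mentioned above (the host, port, and 10-second limit are placeholders):

import socket

def port_is_up(host, port, timeout=10.0):
    # Open a throwaway connection; success within the timeout means "up".
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False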
"it becomes difficult to tell when the remote host goes down"
Correct. This is a feature of TCP. The whole point of TCP is to have an enduring connection between ports. Theoretically an application can drop and reconnect to the port through TCP (the socket libraries don't provide a lot of support for this, but it's part of the TCP protocol).
ping was invented for that purpose.
Also, you might be able to send malformed TCP packets to your destination. For example, the TCP header has a flag for acknowledging the end of a transmission: the FIN message. If you send a message with ACK and FIN set, the remote host should complain with a return packet, and you'll be able to evaluate the round-trip time.
It is theoretically possible to spam keepalive packets, but to set the interval very low you may need to dig into raw sockets. Also, your host may ignore them if they come in too fast.
The best way to check if a host is alive in a TCP connection is to send data, and wait for an ACK packet. If the ACK packet arrives, the SEND function will return non-zero.
You can use Bash pseudo-device files to open a TCP/UDP connection to a specific host and port, for example:
printf "" > /dev/tcp/example.com/80 && echo Works
This opens the connection but won't send anything. You can test it with:
nc -vl 1234 &
printf "" > /dev/tcp/localhost/1234
For simple monitoring, use cron with the above command, or use watch:
watch bash -c 'echo > /dev/tcp/localhost/1234 && echo Works || echo FAIL'
However, it's recommended to use tools specifically designed for this, such as Monit or Nagios.
Monit
Here is an example rule using Monit:
# Verify host.
check host example with address example.com
    if failed
        port 80
        protocol http
    then alert
