SQLAlchemy SSL SYSCALL timeout coping mechanism - python

I'm using a combination of SQLAlchemy and Postgres. Every once in a while my database cluster replaces a failing node; circle of life, I guess.
I was under the impression that by configuring my engine in the following manner:
engine = create_engine(
    env_config.pg_connection_string,
    echo=False,
    pool_size=env_config.pg_pool_size,
    pool_timeout=1,     # Number of seconds to wait before giving up on getting a connection from the pool
    pool_recycle=3600,  # Replace connections on CHECKOUT after 1 hour
    connect_args={
        'connect_timeout': 10,                 # Maximum wait for connection
        "options": "-c statement_timeout=30s"  # Maximum amount of time set for statements
    },
)
my queries would time out once they ran longer than 30 s, and connection attempts would time out after trying for 10 seconds.
What I'm noticing in practice is that when a db node is being replaced in my cluster, it can take about 15 minutes (900 s) before an exception like psycopg2.DatabaseError: SSL SYSCALL error: No route to host is raised. If a db transaction is active while the node is being replaced, it can take up to 16 minutes for the SYSCALL exception to be raised. All new transactions are handled well, and I guess routed to the right host? But existing sessions / transactions seem to block and hang for up to 16 minutes.
My explanation would be that an SSL SYSCALL error is neither a connection- nor a statement-related issue, so neither of the configured timeouts has any effect. My question remains: how do I stop or time out these SSL SYSCALL errors? I would rather fail quickly and retry the same query than spend 15 minutes in a blocking call. I'm not sure where to resolve this; I'm guessing either in my DB layer (Postgres, SQLAlchemy, or the db driver) or in a configuration in my network layer (CentOS).
Some more digging in my Postgres configuration reveals that the TCP-related settings tcp_keepalives_count and tcp_keepalives_interval are 6 and 10 respectively, which makes me wonder why the connection hasn't been killed after 60 seconds. Also, is it even possible to receive TCP ACKs when there is no 'route to host', as in the SSL SYSCALL error?

Unless someone else has a more fitting explanation, I'm convinced my issue is caused by a combination of the TCP tcp_retries2 setting and open db connections not being halted gracefully. Whenever my primary db node is replaced it is nuked from the cluster, and any established connections with that node are left open / in the ESTABLISHED state. With the current default TCP settings it can take up to 15 minutes before such a connection is dropped; I'm not really sure why this manifests as an SSL SYSCALL exception, though.
The issue that covers my problem is described really well in one of the issues / PRs on the PgBouncer repo: https://github.com/pgbouncer/pgbouncer/issues/138, i.e. TCP connections taking a long time before being marked / considered 'dead'.
I suggest reading that page to get a better understanding; my assumption is that my issue is also caused by the default TCP settings.
Long story short, I consider myself to have two options:
Manually tune the TCP settings on my host; this will affect all other TCP-using components on that machine (a per-connection variant of this is sketched after this list).
Set up something like PgBouncer so the TCP tuning can be done locally per service, without affecting anything else on that machine.
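For what it's worth, libpq also accepts TCP keepalive parameters per connection, and SQLAlchemy's connect_args are passed straight through to psycopg2/libpq, so option 1 can be scoped to just the database connections rather than the whole host. A minimal sketch under those assumptions (the values are illustrative, not tuned, and tcp_user_timeout needs a libpq from PostgreSQL 12 or newer):

from sqlalchemy import create_engine

# Sketch only: env_config comes from the snippet above; keepalive values are
# illustrative and should be tuned for the environment.
engine = create_engine(
    env_config.pg_connection_string,
    echo=False,
    pool_size=env_config.pg_pool_size,
    pool_timeout=1,
    pool_recycle=3600,
    connect_args={
        'connect_timeout': 10,
        "options": "-c statement_timeout=30s",
        # libpq TCP keepalive parameters: probe idle connections and drop them
        # after ~30 s of unanswered probes instead of the kernel defaults.
        "keepalives": 1,
        "keepalives_idle": 10,      # seconds of idleness before the first probe
        "keepalives_interval": 10,  # seconds between probes
        "keepalives_count": 3,      # unanswered probes before the connection is dropped
        # Requires libpq >= 12: upper bound (ms) on unacknowledged transmitted data.
        "tcp_user_timeout": 30000,
    },
)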

Related

Occasional 'temporary failure in name resolution' while connecting to AWS Aurora cluster

I am running an Amazon Web Services RDS Aurora 5.6 database cluster. There are a couple of lambdas talking to these database instances, all written in Python. Everything was running well, but since a couple of days ago the Python code sometimes throws the following error:
[ERROR] InterfaceError: 2003: Can't connect to MySQL server on 'CLUSTER-DOMAIN:3306' (-3 Temporary failure in name resolution)
This happens in roughly 1 of every 1000 new connections. What is interesting is that I haven't touched this whole service in the last couple of days (i.e. since it started happening). All lambdas use the official MySQL connector client and connect on every initialization with the following snippet:
import mysql.connector as mysql
import os

connection = mysql.connect(user=os.environ['DATABASE_USER'],
                           password=os.environ['DATABASE_PASSWORD'],
                           database=os.environ['DATABASE_NAME'],
                           host=os.environ['DATABASE_HOST'],
                           autocommit=True)
To rule out that this is a problem in the Python MySQL client I added the following to resolve the host:
import os
import socket
host = socket.gethostbyname(os.environ['DATABASE_HOST'])
Also here I sometimes get the following error:
[ERROR] gaierror: [Errno -2] Name or service not known
Now I suspect this has something to do with DNS, but since I'm just using the cluster endpoint there is not much I can do about that. What is interesting is that I recently encountered exactly the same problem in a different region, with the same setup (an Aurora 5.6 cluster, Python lambdas connecting to it), and the same thing happens there.
I've tried restarting all the machines in the cluster, but the problem still occurs. Is this really a DNS issue? What can I do to stop this from happening?
AWS Support have told me that this error is likely to be caused by a traffic quota in AWS's VPCs.
According to their documentation on DNS Quotas:
Each Amazon EC2 instance limits the number of packets that can be sent to the Amazon-provided DNS server to a maximum of 1024 packets per second per network interface. This quota cannot be increased. The number of DNS queries per second supported by the Amazon-provided DNS server varies by the type of query, the size of response, and the protocol in use. For more information and recommendations for a scalable DNS architecture, see the Hybrid Cloud DNS Solutions for Amazon VPC whitepaper.
It's important to note that the metric we're looking at here is packets per second, per ENI. What's important about this? Well, it may not be immediately obvious that although the actual number of packets per query varies, there are typically multiple packets per DNS query.
While these packets cannot be seen in VPC flow logs, upon reviewing my own packet captures, I can see some resolutions consisting of about 4 packets.
Unfortunately, I can't say much about the whitepaper; at this stage, I'm not really considering the implementation of a hybrid DNS service as a "good" solution.
Solutions
I'm looking into ways to alleviate the risk of this error occurring, and to limit its impact when it does occur. As I see it, there are a number of options to achieve this:
Force the Lambda functions to resolve the Aurora cluster's DNS name before doing anything else, use the private IP address for the connection, and handle failures with an exponential back-off. To minimise the cost of waiting for retries, I've set a total timeout of 5 seconds for DNS resolution; this number includes all back-off wait time (a rough sketch follows after this list).
Making many short-lived connections comes with potentially costly overhead, even if you're closing each connection. Consider using connection pooling on the client side; it is a common misconception that Aurora's connection pooling is sufficient to handle the overhead of many short-lived connections.
Try not to rely on DNS where possible. Aurora automatically handles failover and promotion/demotion of instances, so it's important to know that you're always connected to the "right" (or write, in some cases :P) instance. As updates to the Aurora cluster's DNS name can take time to propagate, even with its 5-second TTL, it might be better to make use of the INFORMATION_SCHEMA.REPLICA_HOST_STATUS table, in which MySQL exposes "in near-real-time" metadata about the DB instances. Note that the table "contains cluster-wide metadata". If you can't be bothered, have a look at option 4.
Use a smart driver, which:
is a database driver or connector with the ability to read DB cluster topology from the metadata table. It can route new connections to individual instance endpoints without relying on high-level cluster endpoints. A smart driver is also typically capable of load balancing read-only connections across the available Aurora Replicas in a round-robin fashion.
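A rough sketch of option 1, resolving the cluster endpoint with exponential back-off before connecting (the helper name and the timing values are my own, not from any AWS library):

import os
import socket
import time

def resolve_with_backoff(hostname, total_timeout=5.0, base_delay=0.1):
    # Retry DNS resolution with exponential back-off; the ~5 s budget covers
    # the back-off sleeps between attempts.
    deadline = time.monotonic() + total_timeout
    delay = base_delay
    while True:
        try:
            return socket.gethostbyname(hostname)
        except socket.gaierror:
            if time.monotonic() + delay > deadline:
                raise
            time.sleep(delay)
            delay *= 2

# Then connect to the resolved private IP instead of the DNS name, e.g.:
# connection = mysql.connect(..., host=resolve_with_backoff(os.environ['DATABASE_HOST']), ...)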
Not solutions
Initially, I thought it might be a good idea to create a CNAME that points to the cluster, but now I'm not so sure that caching Aurora DNS query results is wise. There are a few reasons for this, which are discussed in varying levels of detail in The Aurora Connection Management Handbook:
Unless you use a smart database driver, you depend on DNS record updates and DNS propagation for failovers, instance scaling, and load balancing across Aurora Replicas. Currently, Aurora DNS zones use a short Time-To-Live (TTL) of 5 seconds. Ensure that your network and client configurations don't further increase the DNS cache TTL.
Aurora's cluster and reader endpoints abstract the role changes (primary instance promotion/demotion) and topology changes (addition and removal of instances) occurring in the DB cluster.
I hope this helps!
I had the same error with an instance (and ruled out the DNS lookup limit). After some time I stumbled on an AWS support thread indicating that it could be a hardware problem.
The physical underlying host of your instance (i-3d124c6d) looks to have intermittently been having issues, some of which would have definitely caused service interruption.
Could you try stopping and starting this instance? Doing so will cause it to be brought up on new underlying hardware and then we could utilize your pingdom service to verify if further issues arise.
from: https://forums.aws.amazon.com/thread.jspa?threadID=171805.
Stopping and restarting the instance resolved the issue for me.

IBM db2 connection gets closed after some time

I'm trying to connect to DB2 (ibm_db). The connection is successful and I'm able to make changes in the db, but after a while the connection gets closed, even though I'm not closing the connection anywhere.
It throws this error:
[IBM][CLI Driver] CLI0106E Connection is closed. SQLSTATE=08003 SQLCODE=-99999
2019-04-11 03:11:20,558 - INFO - werkzeug - 9.46.72.43 - - [11/Apr/2019 03:11:20] POST 200
Here is my code (not exact, but something similar):
import ibm_db

conn = ibm_db.connect("database", "username", "password")

def update():
    stmt = ibm_db.exec_immediate(conn, "UPDATE employee SET bonus = '1000' WHERE job = 'MANAGER'")
How do I maintain the connection the whole time, i.e. for as long as the service is running?
Your design of only making a connection when the service starts is unsuitable for long-running services.
There's nothing you can do to stop the other end (i.e. the Db2-server, or any intervening gateway) from closing the connection. The connection can get closed for a variety of reasons. For example, the Db2-server may be configured to discard idle sessions, or sessions that break some site-specific workload-management rules. Network issues can cause connections to become unavailable. Service-management matters can cause connections to be forced off etc.
Check out the pconnect method to see if it helps you. Otherwise consider a better design such as connection-pooling, reconnect-on-demand etc.
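A rough sketch of the reconnect-on-demand idea (the connection string is a placeholder, and it assumes ibm_db.active() reports whether a connection handle is still alive):

import ibm_db

# Placeholder connection string.
CONN_STR = "DATABASE=mydb;HOSTNAME=myhost;PORT=50000;PROTOCOL=TCPIP;UID=user;PWD=secret;"

_conn = None

def get_connection():
    # Reconnect whenever the cached handle is missing or no longer active.
    global _conn
    if _conn is None or not ibm_db.active(_conn):
        _conn = ibm_db.connect(CONN_STR, "", "")
    return _conn

def update():
    ibm_db.exec_immediate(get_connection(),
                          "UPDATE employee SET bonus = '1000' WHERE job = 'MANAGER'")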

Using sniffing with python elasticsearch client to solve dead TCP connection issues

The Python Elasticsearch client in my application is having connectivity issues (refused connections) because idle TCP connections time out due to a firewall (which I have no way to prevent).
The easiest fix for me would be to prevent the connection from going idle by sending some data over it periodically. The sniffing options in the Elasticsearch client seem ideal for this, however they're not very well documented:
sniff_on_start – flag indicating whether to obtain a list of nodes from the cluster at startup time
sniffer_timeout – number of seconds between automatic sniffs
sniff_on_connection_fail – flag controlling if connection failure triggers a sniff
sniff_timeout – timeout used for the sniff request - it should be a fast API call and we are potentially talking to more nodes, so we want to fail quickly. Not used during initial sniffing (if sniff_on_start is on) when the connection still isn't initialized.
What I would like is for the client to sniff every (say) 5 minutes, should I be using the sniff_timeout or sniffer_timeout option? Also, should the sniff_on_start parameter be set to True?
I used the suggestion from #val and found that these settings solved my problem:
sniff_on_start=True
sniffer_timeout=60
sniff_on_connection_fail=True
The sniffing puts enough traffic on the TCP connections that they are never idle long enough for our firewall to kill the connection.
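For reference, the client construction ends up looking roughly like this (a sketch assuming the 7.x elasticsearch-py client; the host is a placeholder):

from elasticsearch import Elasticsearch

es = Elasticsearch(
    ["es-node1:9200"],              # placeholder host
    sniff_on_start=True,            # fetch the node list once at startup
    sniffer_timeout=60,             # re-sniff (and touch the connections) every 60 s
    sniff_on_connection_fail=True,  # re-sniff immediately when a connection dies
)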

Should a connection to Redis cluster be made on each Flask request?

I have a Flask API that connects to a Redis cluster for caching purposes. Should I be creating and tearing down a Redis connection on each Flask API call, or should I try to maintain a connection across requests?
My argument against the second option is that I should try to keep the API as stateless as possible, and I also don't know whether keeping something persistent across requests might cause thread race conditions or other side effects.
However, if I want to persist a connection, should it be saved on the session or on the application context?
This is about performance and scale. To get those 2 buzzwords buzzing you'll in fact need persistent connections.
Eventual race conditions will be no different than with a reconnect on every request so that shouldn't be a problem. Any RCs will depend on how you're using redis, but if it's just caching there's not much room for error.
I understand the desired statelessness of an API from the client's point of view, but I'm not so sure what you mean about the server side.
I'd suggest you put them in the application context, not the session (those could become too numerous), whereas the app context gives you the optimal one connection per process (created immediately at startup). Scaling this way becomes easy-peasy: you'll never have to worry about hitting the max connection count on the Redis box (and the less multiplexing the better).
It's a good idea from a performance standpoint to keep connections to a database open between requests. The reason is that opening and closing connections is not free and takes some time, which may become a problem when you have too many requests. Another issue is that a database can only handle up to a certain number of connections, and if you open more, database performance will degrade, so you need to control how many connections are open at the same time.
To solve both of these issues you may use a connection pool. A connection pool contains a number of opened database connections and provides access to them. When a database operation is to be performed, a connection is taken from the pool; when the operation is completed, the connection is returned to the pool. If a connection is requested while all connections are taken, the caller has to wait until some connections are returned to the pool. Since no new connections are opened in this process (they were all opened in advance), this ensures that the database is not overloaded with too many parallel connections.
If the connection pool is used correctly, a single connection will be used by only one thread at any moment.
Despite the fact that the connection pool has a state (it needs to track which connections are currently in use), your API will remain stateless. This is because, from the API perspective, "stateless" means that it has no state/side-effects visible to an API user. Your server can perform a number of operations that change its internal state, like writing to log files or writing to a cache, but since this does not influence the data returned in reply to API calls, it does not make the API "stateful".
You can see some examples of using Redis connection pool here.
Regarding where it should be stored, I would use the application context, since it fits its purpose better.
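A minimal sketch of that setup with redis-py and Flask (host, port, and route are placeholders): one pool per process, created at startup, with a cheap per-request client that borrows connections from it.

import redis
from flask import Flask

app = Flask(__name__)

# One pool per process, created at startup; connections are reused across requests.
redis_pool = redis.ConnectionPool(host="localhost", port=6379, db=0, max_connections=20)

def get_cache():
    # Cheap to call per request: the client only borrows connections from the
    # shared pool, so no new TCP connection is opened here.
    return redis.Redis(connection_pool=redis_pool)

@app.route("/cached/<key>")
def cached(key):
    value = get_cache().get(key)
    if value is None:
        return "miss", 404
    return value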

Why does PyMongo throw AutoReconnect?

While researching some strange issues with my Python web application (in particular, issues regarding MongoDB connectivity), I noticed something on the official PyMongo documentation page. My web application uses Flask, but this shouldn't influence the issue I'm facing.
The PyMongo driver does connection pooling, but it also throws an exception (AutoReconnect) when a connection is stale and a reconnect is due.
It states that (regarding the AutoReconnect exception):
In order to auto-reconnect you must handle this exception, recognizing that the operation which caused it has not necessarily succeeded. Future operations will attempt to open a new connection to the database (and will continue to raise this exception until the first successful connection is made).
I have noticed that this actually happens constantly (and it doesn't seem to be an error). Connections are closed by the MongoDB server after what seems like several minutes of inactivity, and need to be recreated by the web application.
What I don't understand is why the PyMongo driver throws an error when it reconnects (which the user of the driver needs to handle themselves), instead of doing it transparently. (There could even be an option a user could set so that AutoReconnect exceptions do get thrown, but wouldn't a sensible default be that these exceptions don't get thrown at all, and the connections are recreated seamlessly?)
I have never encountered this behavior using other database systems, which is why I'm a bit confused.
It's also worth mentioning that my web application's MongoDB connections never fail when connecting to my local development MongoDB server (I assume this has something to do with the fact that it's a local connection, done through a UNIX socket instead of a network socket, but I could be wrong).
You're misunderstanding AutoReconnect. It is raised when the driver attempts to communicate with the server (to send a command or other operation) and a network failure or similar problem occurs. The name of the exception is meant to communicate that you do not have to create a new instance of MongoClient, the existing client will attempt to reconnect automatically when your application tries the next operation. If the same problem occurs, AutoReconnect is raised again.
I suspect the reason you are seeing sockets time out (and AutoReconnect being raised) is that there is a load balancer between the server and your application that closes connections after some period of inactivity. For example, this apparently happens on Microsoft's Azure platform after 13 minutes of no activity on a socket. You might be able to fix this by using the socketKeepAlive option, added in PyMongo 2.8. Note that you will also have to set the keepalive interval on your application server to an appropriate value (the default on Linux is 2 hours). See here for more information.
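To make the "you must handle this exception" part concrete, here is a hedged sketch of a small retry wrapper (the URI, retry count, and delays are arbitrary placeholders, and it only makes sense for idempotent operations, since the attempt that raised may or may not have reached the server):

import time
from pymongo import MongoClient
from pymongo.errors import AutoReconnect

client = MongoClient("mongodb://localhost:27017/")  # placeholder URI
collection = client.mydb.mycollection

def with_retry(operation, retries=3, delay=0.5):
    # Retry on AutoReconnect with exponential back-off; re-raise after the
    # last attempt so callers still see persistent failures.
    for attempt in range(retries):
        try:
            return operation()
        except AutoReconnect:
            if attempt == retries - 1:
                raise
            time.sleep(delay * (2 ** attempt))

doc = with_retry(lambda: collection.find_one({"status": "active"}))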
