Should a connection to a Redis cluster be made on each Flask request? - python

I have a Flask API, it connects to a Redis cluster for caching purposes. Should I be creating and tearing down a Redis connection on each flask api call? Or, should I try and maintain a connection across requests?
My argument against the second option is that I should really try and keep the API as stateless as possible, and I also don't know whether keeping something persistent across requests might cause race conditions between threads or other side effects.
However, if I want to persist a connection, should it be saved on the session or on the application context?

This is about performance and scale. To get those two buzzwords buzzing you'll in fact need persistent connections.
Eventual race conditions will be no different than with a reconnect on every request, so that shouldn't be a problem. Any race conditions will depend on how you're using Redis, but if it's just caching there's not much room for error.
I understand the desired statelessness of an API from the client's point of view, but I'm not so sure what you mean about the server side.
I'd suggest you put the connections in the application context, not the sessions (those could become too numerous), whereas the app context gives you the optimal one connection per process (created immediately at startup). Scaling this way becomes easy-peasy: you'll never have to worry about hitting the max connection count on the Redis box (and the less multiplexing the better).

It's a good idea from a performance standpoint to keep connections to a database open between requests. The reason is that opening and closing connections is not free; it takes time, which can become a problem when you handle many requests. Another issue is that a database can only handle up to a certain number of connections, and if you open more, database performance will degrade, so you need to control how many connections are open at the same time.
To solve both of these issues you can use a connection pool. A connection pool contains a number of opened database connections and provides access to them. When a database operation is to be performed, a connection is taken from the pool. When the operation completes, the connection is returned to the pool. If a connection is requested while all connections are in use, the caller has to wait until a connection is returned to the pool. Since no new connections are opened in this process (they were all opened in advance), this ensures the database will not be overloaded with too many parallel connections.
If the connection pool is used correctly, a single connection will be used by only one thread at any moment.
Despite the fact that a connection pool has state (it must track which connections are currently in use), your API will still be stateless. This is because from the API's perspective, "stateless" means: no state or side effects visible to an API user. Your server can perform a number of operations that change its internal state, like writing to log files or writing to a cache, but since this does not influence the data returned in replies to API calls, it does not make the API "stateful".
You can see some examples of using Redis connection pool here.
Regarding where it should be stored, I would use the application context, since that fits its purpose better.

Related

Occasional 'temporary failure in name resolution' while connecting to AWS Aurora cluster

I am running an Amazon Web Services RDS Aurora 5.6 database cluster. There are a couple of lambdas talking to these database instances, all written in Python. Everything was running well, but then suddenly, a couple of days ago, the Python code sometimes started throwing the following error:
[ERROR] InterfaceError: 2003: Can't connect to MySQL server on 'CLUSTER-DOMAIN:3306' (-3 Temporary failure in name resolution)
This happens in about 1 in every 1,000 new connections. What is interesting is that I haven't touched this whole service in the last couple of days (since it started happening). All lambdas use the official MySQL connector client and connect on every initialization with the following snippet:
import mysql.connector as mysql
import os

connection = mysql.connect(user=os.environ['DATABASE_USER'],
                           password=os.environ['DATABASE_PASSWORD'],
                           database=os.environ['DATABASE_NAME'],
                           host=os.environ['DATABASE_HOST'],
                           autocommit=True)
To rule out that this is a problem in the Python MySQL client I added the following to resolve the host:
import os
import socket
host = socket.gethostbyname(os.environ['DATABASE_HOST'])
Also here I sometimes get the following error:
[ERROR] gaierror: [Errno -2] Name or service not known
Now I suspect this has something to do with DNS, but since I'm just using the cluster endpoint there is not much I can do about that. What is interesting is that I also recently encountered exactly the same problem in a different region, with the same setup (Aurora 5.6 cluster, Python lambdas connecting to it), and the same thing happens there.
I've tried restarting all the machines in the cluster, but the problem still occurs. Is this really a DNS issue? What can I do to stop this from happening?
AWS Support have told me that this error is likely to be caused by a traffic quota in AWS's VPCs.
According to their documentation on DNS Quotas:
Each Amazon EC2 instance limits the number of packets that can be sent to the Amazon-provided DNS server to a maximum of 1024 packets per second per network interface. This quota cannot be increased. The number of DNS queries per second supported by the Amazon-provided DNS server varies by the type of query, the size of response, and the protocol in use. For more information and recommendations for a scalable DNS architecture, see the Hybrid Cloud DNS Solutions for Amazon VPC whitepaper.
It's important to note that the metric we're looking at here is packets per second, per ENI. What's important about this? Well, it may not be immediately obvious that although the actual number of packets per query varies, there are typically multiple packets per DNS query.
While these packets cannot be seen in VPC flow logs, upon reviewing my own packet captures, I can see some resolutions consisting of about 4 packets.
Unfortunately, I can't say much about the whitepaper; at this stage, I'm not really considering the implementation of a hybrid DNS service as a "good" solution.
Solutions
I'm looking into ways to alleviate the risk of this error occurring, and to limit its impact when it does occur. As I see it, there are a number of options to achieve this:
Force Lambda functions to resolve the Aurora cluster's DNS name before doing anything else, use the private IP address for the connection, and handle failures with an exponential back-off. To minimise the cost of waiting for retries, I've set a total timeout of 5 seconds for DNS resolution. This number includes all back-off wait time.
Making many, short-lived connections comes with a potentially costly overhead, even if you're closing the connection. Consider using connection pooling on the client side, as it is a common misconception that Aurora's connection pooling is sufficient to handle the overhead of many short-lived connections.
Try not to rely on DNS where possible. Aurora automatically handles failover and promotion/demotion of instances, so it's important to know that you're always connected to the "right" (or write, in some cases :P) instance. As updates to the Aurora cluster's DNS name can take time to propagate, even with its 5-second TTLs, it might be better to make use of the INFORMATION_SCHEMA.REPLICA_HOST_STATUS table, in which MySQL exposes "in near-real-time" metadata about DB instances. Note that the table "contains cluster-wide metadata". If you cbf, have a look at option 4.
Use a smart driver, which:
is a database driver or connector with the ability to read DB cluster topology from the metadata table. It can route new connections to individual instance endpoints without relying on high-level cluster endpoints. A smart driver is also typically capable of load balancing read-only connections across the available Aurora Replicas in a round-robin fashion.
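Option 1 might be sketched like this (the back-off parameters and the 5-second total budget mirror the description above; the host name would be your cluster endpoint):

```python
# Sketch: resolve a host name up front with exponential back-off,
# capped at a total time budget, before attempting the DB connection.
import socket
import time

def resolve_with_backoff(host, total_timeout=5.0, base_delay=0.1):
    deadline = time.monotonic() + total_timeout
    delay = base_delay
    while True:
        try:
            return socket.gethostbyname(host)
        except socket.gaierror:
            # Give up once the next wait would blow the total budget.
            if time.monotonic() + delay > deadline:
                raise
            time.sleep(delay)
            delay *= 2  # exponential back-off
```

The connection itself would then be made against the returned private IP, with the same error handling wrapped around the connect call.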
Not solutions
Initially, I thought it might be a good idea to create a CNAME which points to the cluster, but now I'm not so sure that caching Aurora DNS query results is wise. There are a few reasons for this, which are discussed in varying levels of detail in The Aurora Connection Management Handbook:
Unless you use a smart database driver, you depend on DNS record updates and DNS propagation for failovers, instance scaling, and load balancing across Aurora Replicas. Currently, Aurora DNS zones use a short Time-To-Live (TTL) of 5 seconds. Ensure that your network and client configurations don't further increase the DNS cache TTL.
Aurora's cluster and reader endpoints abstract the role changes (primary instance promotion/demotion) and topology changes (addition and removal of instances) occurring in the DB cluster.
I hope this helps!
I had the same error with an instance (and ruled out the DNS lookup limit). After some time I stumbled on an AWS support thread indicating that it could be a hardware problem.
The physical underlying host of your instance (i-3d124c6d) looks to have intermittently been having a issues, some of which would have definitely caused service interruption.
Could you try stopping and starting this instance? Doing so will cause it to be brought up on new underlying hardware and then we could utilize your pingdom service to verify if further issues arise.
from: https://forums.aws.amazon.com/thread.jspa?threadID=171805.
Stopping and restarting the instance resolved the issue for me.

Python gRPC long polling

I have a server which must notify some clients across a gRPC connection.
Clients connect to the server without a timeout and wait for messages indefinitely. The server notifies clients when a new record is added to the database.
How can I manage the server for better performance with multithreading? Maybe I should use a monitor, so that when a record is added, the server-side gRPC code is notified to retrieve the data from the database and send it to the clients?
What do you think?
Thanks
We have some better plans for later in time, but today the best solution might be to implement something that presents the interface of concurrent.futures.Executor but gives you better efficiency.
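The monitor idea from the question can be sketched in plain Python (gRPC plumbing omitted; in a real server, each streaming RPC handler would call `wait_for` and yield the returned records to its client):

```python
# Sketch: a monitor (condition variable) that writers signal when a new
# record is stored; each waiting subscriber wakes and takes the new rows.
import threading

class RecordNotifier:
    def __init__(self):
        self._cond = threading.Condition()
        self._records = []

    def add(self, record):
        # Called by whatever inserts into the database.
        with self._cond:
            self._records.append(record)
            self._cond.notify_all()  # wake every waiting subscriber

    def wait_for(self, last_seen):
        # Block until there are records newer than index last_seen,
        # then return the new ones. Safe even if the record arrived
        # before we started waiting: the predicate is checked first.
        with self._cond:
            self._cond.wait_for(lambda: len(self._records) > last_seen)
            return self._records[last_seen:]
```

Each client connection stays cheap: the handler thread sleeps inside `wait_for` until there is actually something to send.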

How does Postgres Server know to keep a database connection open

I wonder how the Postgres server determines when to close a DB connection if I forget to close it on the Python side.
Does the Postgres server send a ping to the client? From my understanding, this is not possible.
PostgreSQL indeed does something like that, although it is not a ping.
PostgreSQL uses a TCP feature called keepalive. Once enabled for a socket, the operating system kernel will regularly send keepalive messages to the other party (the peer), and if it doesn't get an answer after a couple of tries, it closes the connection.
The default timeouts for keepalive are pretty long, in the vicinity of two hours. You can configure the settings in PostgreSQL, see the documentation for details.
The default values and possible values vary according to the operating system used.
There is a similar feature available for the client side, but it is less useful and not enabled by default.
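For example, libpq exposes client-side TCP keepalive settings as connection parameters, and psycopg2's `connect()` passes extra keyword arguments through to libpq. A sketch (the values are illustrative assumptions, not the defaults):

```python
# libpq's client-side TCP keepalive parameters; psycopg2.connect()
# forwards these keyword arguments straight to libpq.
keepalive_kwargs = {
    "keepalives": 1,            # enable TCP keepalive on the socket
    "keepalives_idle": 60,      # seconds of inactivity before the first probe
    "keepalives_interval": 10,  # seconds between unanswered probes
    "keepalives_count": 3,      # failed probes before the connection is dropped
}

# Hypothetical connection (host/dbname/user are placeholders):
# conn = psycopg2.connect(host="db.example.com", dbname="mydb",
#                         user="me", **keepalive_kwargs)
```

The server-side equivalents are the `tcp_keepalives_idle`, `tcp_keepalives_interval`, and `tcp_keepalives_count` settings in postgresql.conf.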
When your script quits your connection will close and the server will clean it up accordingly. Likewise, it's often the case in garbage collected languages like Python that when you stop using the connection and it falls out of scope it will be closed and cleaned up.
It is possible to write code that never releases these resources properly, that just perpetually creates new handles, something that can be problematic if you don't have something server-side that handles killing these after some period of idle time. Postgres doesn't do this by default, though it can be configured to, but MySQL does.
In short Postgres will keep a database connection open until you kill it either explicitly, such as via a close call, or implicitly, such as the handle falling out of scope and being deleted by the garbage collector.

Why do I constantly see "Resetting dropped connection" when uploading data to my database?

I'm uploading hundreds of millions of items via a REST API from a cloud server on Heroku to a database in AWS EC2. I'm using Python, and I constantly see the following INFO message in the logs.
[requests.packages.urllib3.connectionpool] [INFO] Resetting dropped connection: <hostname>
This "resetting of the dropped connection" seems to take many seconds (sometimes 30+ sec) before my code continues to execute again.
Firstly what exactly is happening here and why?
Secondly is there a way to stop the connection from dropping so that I am able to upload data faster?
Thanks for your help.
Andrew.
Requests uses Keep-Alive by default. Resetting dropped connection, from my understanding, means a connection that should be alive was dropped somehow. Possible reasons are:
Server doesn't support Keep-Alive.
There's no data transfer in established connections for a while, so server drops connections.
See https://stackoverflow.com/a/25239947/2142577 for more details.
The problem is really that the server has closed the connection even though the client has requested it be kept alive.
This is not necessarily because the server doesn't support keepalives; it could be that the server is configured to only allow a certain number of requests per connection. This could be done to help spread requests across different servers, but I think this practice is/was also common as a practical defence against poorly written server-side code (e.g. PHP) that doesn't clean up after itself after serving a request (perhaps due to an error condition, etc.).
If you think this is the case for you and you'd like to not see these logs (which are logged at INFO level), then you can add the following to quieten that part of the logging:
import logging

# Really don't need to hear about connections being brought up again after server has closed it
logging.getLogger("requests.packages.urllib3.connectionpool").setLevel(logging.WARNING)
This is common practice for services that expose RESTful APIs to avoid abuse (or DoS).
If you're stressing their API they'll drop your connection.
Try getting your script to sleep a bit every once in a while to avoid the drop.
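A minimal sketch of that advice, assuming a `requests`-style session object (the URL, batch size, and pause length are placeholders, not tuned values):

```python
# Sketch: reuse one keep-alive session and pause between batches so the
# server is less inclined to drop or throttle the connection.
import time

def batches(items, size):
    # Split the upload into fixed-size chunks.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def upload_items(session, items, url, batch_size=500, pause=0.5):
    # session is anything with a requests-style .post(), e.g. a
    # requests.Session(), which reuses the underlying TCP connection
    # across all batches instead of reconnecting every time.
    for batch in batches(items, batch_size):
        session.post(url, json=batch)
        time.sleep(pause)  # brief pause between batches
</```

Batching also shrinks the request count by orders of magnitude, which matters more for throughput than the per-request pause.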

pymongo connection pooling and client requests

I know pymongo is thread safe and has an inbuilt connection pool.
In a web app that I am working on, I am creating a new connection instance on every request.
My understanding is that since pymongo manages the connection pool, it isn't a wrong approach to create a new connection on each request, as at the end of the request the connection instance will be reclaimed and will be available to subsequent requests.
Am I correct here, or should I just create a single instance to use across multiple requests?
The "wrong approach" depends upon the architecture of your application. With pymongo being thread-safe and providing automatic connection pooling, the actual use of a single shared connection, or multiple connections, is going to "work". But the results will depend on what you expect the behavior to be. The documentation comments on both cases.
If your application is threaded, from the docs, each thread accessing a connection will get its own socket. So whether you create a single shared connection, or request a new one, it comes down to whether your requests are threaded or not.
When using gevent, you can have a socket per greenlet. This means you don't have to have a true thread per request. The requests can be async, and still get their own socket.
In a nutshell:
If your webapp requests are threaded, then it doesn't matter which way you access a new connection. The result will be the same (socket per thread)
If your webapp is async via gevent, then it doesn't matter which way you access a new connection. The result will be the same. (socket per greenlet)
If your webapp is async, but NOT via gevent, then you have to take into consideration the notes on the best suggested workflow.
