I have an app built using FastAPI, with SQLAlchemy handling all the DB-related stuff.
When the APIs are triggered via the frontend, I see that the connections are opened & they remain in IDLE state for a while. Is it possible to reduce the IDLE time via sqlalchemy?
I do the following to connect to the Postgres DB:
import sqlalchemy as db

eng = db.create_engine(<SQLALCHEMY_DATABASE_URI>)
conn = eng.connect()
metadata = db.MetaData()
table = db.Table(
    <table_name>,
    metadata,
    autoload=True,
    autoload_with=eng)
user_id = 1
try:
    if ids_by_user is None:
        query = db.select([table.columns.created_at]).where(
            table.columns.user_id == user_id,
        ).order_by(
            table.columns.created_at.desc()
        )
        result = conn.execute(query).fetchmany(1)
        time = result[0][0]
        time_filtering_query = db.select([table]).where(
            table.columns.created_at == time
        )
        time_result = conn.execute(time_filtering_query).fetchall()
        conn.close()
        return time_result
    else:
        output_by_id = []
        for i in ids_by_user:
            query = db.select([table]).where(
                db.and_(
                    table.columns.id == i,
                    table.columns.user_id == user_id
                )
            )
            result = conn.execute(query).fetchall()
            output_by_id.append(result)
        output_by_id = [output_by_id[j][0]
                        for j in range(len(output_by_id))
                        if output_by_id[j]]
        conn.close()
        return output_by_id
finally:
    eng.dispose()
Even after logging out of the app, the connections are still active & in idle state for a while and don't close immediately.
Edit 1
I tried using NullPool, and the connections are still idle and in ROLLBACK, the same as when I didn't use NullPool.
You can reduce connection idle time by setting a maximum lifetime per connection by using pool_recycle. Note that connections already checked out will not be terminated until they are no longer in use.
If you want both to reduce idle time and to keep the overall number of unused connections low, you can set a lower pool_size and then set max_overflow to allow more connections to be allocated when the application is under heavier load.
from sqlalchemy import create_engine
e = create_engine(<SQLALCHEMY_DATABASE_URI>,
    pool_recycle=3600,  # connections older than 1 hour are replaced the next time they are checked out
    pool_size=5,  # pool size under normal conditions
    max_overflow=5,  # additional connections when the pool size is exceeded
)
Google Cloud has a helpful guide for optimizing Postgres connection pooling that you might find useful.
I discovered that SQLAlchemy does not release the database connections (in my case) so this piles up to the point that it might crash the server. The connections are made from different threads.
Here is the simplified code
"""
Test to see DB connection allocation size while making call from multiple threads
"""
from time import sleep
from threading import Thread, current_thread
import uuid
from sqlalchemy import func, or_, desc
from sqlalchemy import event
from sqlalchemy import ForeignKey, Column, Integer, String, DateTime, UniqueConstraint
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.orm import scoped_session
from sqlalchemy.orm import relationship
from sqlalchemy.orm import scoped_session, Session
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.types import Integer, DateTime, String, Boolean, Text, Float
from sqlalchemy.engine import Engine
from sqlalchemy.pool import NullPool
# MySQL
SQLALCHEMY_DATABASE = 'mysql'
SQLALCHEMY_DATABASE_URI = 'mysql+pymysql://amalgam:amalgam@localhost/amalgam?charset=utf8mb4' # https://stackoverflow.com/questions/47419943/pymysql-warning-1366-incorrect-string-value-xf0-x9f-x98-x8d-t
SQLALCHEMY_ECHO = False
SQLALCHEMY_ENGINE_OPTIONS = {'pool_size': 40, 'max_overflow': 0}
SQLALCHEMY_ISOLATION_LEVEL = "AUTOCOMMIT"
# DB Engine
# engine = create_engine(SQLALCHEMY_DATABASE_URI, echo=SQLALCHEMY_ECHO, pool_recycle=3600,
# isolation_level= SQLALCHEMY_ISOLATION_LEVEL,
# **SQLALCHEMY_ENGINE_OPTIONS
# ) # Connect to server
engine = create_engine(SQLALCHEMY_DATABASE_URI,
                       echo=SQLALCHEMY_ECHO,
                       # poolclass=NullPool,
                       pool_recycle=3600,
                       isolation_level=SQLALCHEMY_ISOLATION_LEVEL,
                       **SQLALCHEMY_ENGINE_OPTIONS
                       ) # Connect to server
session_factory = sessionmaker(bind=engine)
Base = declarative_base()
# ORM Entity
class User(Base):
    LEVEL_NORMAL = 'normal'
    LEVEL_ADMIN = 'admin'
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=True)
    email = Column(String(100), nullable=True, unique=True)
    password = Column(String(100), nullable=True)
    level = Column(String(100), default=LEVEL_NORMAL)

# Workers
NO = 10
workers = []
_scoped_session_factory = scoped_session(session_factory)

def job(job_id):
    session = _scoped_session_factory()
    print("Job is {}".format(job_id))
    user = User(name='User {} {}'.format(job_id, uuid.uuid4()), email='who cares {} {}'.format(job_id, uuid.uuid4()))
    session.add(user)
    session.commit()
    session.close()
    print("Job {} done".format(job_id))
    sleep(10)

# Create worker threads
for i in range(NO):
    workers.append(Thread(target=job, kwargs={'job_id': i}))
# Start them
for worker in workers:
    worker.start()
# Join them
for worker in workers:
    worker.join()
# Allow some time to see MySQL's "show processlist;" command
sleep(10)
The moment the program reaches
sleep(10)
and I run the
show processlist;
it shows that all connections to the DB are still alive.
How can I force closing those connections?
Note: I could make use of
poolclass=NullPool
but I feel that this solution is too restrictive. I would still like to have access to a database pool while being able to close connections when I want to.
The following is from the documentation of the QueuePool constructor:
pool_size – The size of the pool to be maintained, defaults to 5. This
is the largest number of connections that will be kept persistently in
the pool. Note that the pool begins with no connections; once this
number of connections is requested, that number of connections will
remain. pool_size can be set to 0 to indicate no size limit; to
disable pooling, use a NullPool instead.
max_overflow – The maximum overflow size of the pool. When the number
of checked-out connections reaches the size set in pool_size,
additional connections will be returned up to this limit. When those
additional connections are returned to the pool, they are disconnected
and discarded. It follows then that the total number of simultaneous
connections the pool will allow is pool_size + max_overflow, and the
total number of “sleeping” connections the pool will allow is
pool_size. max_overflow can be set to -1 to indicate no overflow
limit; no limit will be placed on the total number of concurrent
connections. Defaults to 10.
SQLALCHEMY_ENGINE_OPTIONS = {'pool_size': 40, 'max_overflow': 0}
Given the above, this configuration is asking SQLAlchemy to keep up to 40 connections open.
If you don't like that, but want to keep some connections available you might try a configuration like this:
SQLALCHEMY_ENGINE_OPTIONS = {'pool_size': 10, 'max_overflow': 30}
This will keep 10 persistent connections in the pool and will burst up to 40 connections if they are requested concurrently. Any connections in excess of the configured pool size are closed immediately when they are checked back into the pool.
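The arithmetic in the quoted documentation can be checked with a small helper. This is only a sketch: pool_limits is a hypothetical function, not part of SQLAlchemy, and it simply encodes the rules quoted above (defaults of 5 and 10, pool_size=0 meaning no size limit, max_overflow=-1 meaning no overflow limit).

```python
def pool_limits(pool_size=5, max_overflow=10):
    """Return (max_simultaneous, max_idle) for a QueuePool configuration.

    None stands for "no limit". Hypothetical helper, not a SQLAlchemy API.
    """
    if pool_size == 0:
        # pool_size=0 means no size limit at all
        return None, None
    if max_overflow == -1:
        # unlimited overflow: only the number of idle connections is capped
        return None, pool_size
    return pool_size + max_overflow, pool_size


print(pool_limits(40, 0))   # the configuration from the question
print(pool_limits(10, 30))  # the suggested alternative
```

Both configurations allow 40 simultaneous connections, but the second one keeps at most 10 of them open while idle.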
I am using a SQL Server database. I've noticed that when executing the code below, a connection to the database is left over in 'sleeping' state with an 'AWAITING COMMAND' status.
engine = create_engine(url, connect_args={'autocommit': True})
res = engine.execute(f"CREATE DATABASE my_database")
res.close()
engine.dispose()
With a breakpoint after the engine.dispose() call, I can see an entry on the server in the EXEC sp_who2 table. This entry only disappears after I kill the process.
Probably Connection Pooling
Connection Pooling
A connection pool is a standard technique used to
maintain long running connections in memory for efficient re-use, as
well as to provide management for the total number of connections an
application might use simultaneously.
Particularly for server-side web applications, a connection pool is
the standard way to maintain a “pool” of active database connections
in memory which are reused across requests.
SQLAlchemy includes several connection pool implementations which
integrate with the Engine. They can also be used directly for
applications that want to add pooling to an otherwise plain DBAPI
approach.
I'm not sure if this is what gets in the way of my teardown method which drops the database
To drop a database that's possibly in use try:
USE master;
ALTER DATABASE mydb SET RESTRICTED_USER WITH ROLLBACK IMMEDIATE;
DROP DATABASE mydb;
You basically want to kill all the connections. You could use something like this:
For MS SQL Server 2012 and above
USE [master];
DECLARE @kill varchar(8000) = '';
SELECT @kill = @kill + 'kill ' + CONVERT(varchar(5), session_id) + ';'
FROM sys.dm_exec_sessions
WHERE database_id = db_id('MyDB')
EXEC(@kill);
For MS SQL Server 2000, 2005, 2008
USE master;
DECLARE @kill varchar(8000); SET @kill = '';
SELECT @kill = @kill + 'kill ' + CONVERT(varchar(5), spid) + ';'
FROM master..sysprocesses
WHERE dbid = db_id('MyDB')
EXEC(@kill);
Or something more script-like:
DECLARE @pid SMALLINT, @sql NVARCHAR(100)
-- @dbname is not declared in this snippet; it is assumed to hold the target database name
DECLARE curs CURSOR LOCAL FORWARD_ONLY FOR
    SELECT DISTINCT spid FROM master..sysprocesses WHERE dbid = DB_ID(@dbname)
OPEN curs
FETCH NEXT FROM curs INTO @pid
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = 'KILL ' + CONVERT(VARCHAR, @pid)
    EXEC(@sql)
    FETCH NEXT FROM curs INTO @pid
END
CLOSE curs
DEALLOCATE curs
More can be found here:
Script to kill all connections to a database (More than RESTRICTED_USER ROLLBACK)
Preface
I want to process tasks listed in a database table in parallel. Not looking for working code.
The Setup
1 PostgreSQL database server D
1 processing server P
1 User terminal T
using Python 3.6, psycopg2 2.7.6, PostgreSQL 11
D holds tables with data to be processed and a tasks table. A user at T ssh's into P, where the following command can be issued:
python -m core.utils.task
This task.py script is essentially a while loop that gets a task t with the status 'new' from the tasks table on D until there are no new tasks left. A task t is basically a set of arguments for another function called do_something(t). do_something(t) itself makes many connections to D to get the data that needs to be processed and sets the task's status to 'done' once it has finished; the while loop then starts over and gets a new task.
In order to run python -m core.utils.task multiple times I open multiple ssh connections. Not so good, I know; threading or multiprocessing would be better. But this is just for testing whether I can run the mentioned command twice.
There is a script called pgsql.py that manages all the database interactions; it is needed to get a task and is then used by do_something(t). I adapted a singleton pattern from this SE post.
Pseudo-Code (mostly)
task.py
import mymodule
import pgsql

def main():
    while True:
        r, c = pgsql.SQL.select_task() # rows and columns
        task = dotdict(dict(zip(c, r[0])))
        mymodule.do_something(task)

if __name__ == "__main__":
    main()
mymodule.py
import pgsql

def do_something(t):
    input = pgsql.SQL.get_images(t.table,t.schema,t.image_id,t.image_directory)
    some_other_function(input)
    pgsql.SQL.task_status(t.task_id,'done')
pgsql.py
import psycopg2 as pg

class Postgres(object):
    """Adapted from https://softwareengineering.stackexchange.com/a/358061/348371"""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = object.__new__(cls)
            db_config = {'dbname': 'dev01', 'host': 'XXXXXXXX',
                         'password': 'YYYYY', 'port': 5432, 'user': 'admin'}
            try:
                print('connecting to PostgreSQL database...')
                connection = Postgres._instance.connection = pg.connect(**db_config)
                connection.set_session(isolation_level='READ COMMITTED', autocommit=True)
            except Exception as error:
                print('Error: connection not established {}'.format(error))
                Postgres._instance = None
            else:
                print('connection established')
        return cls._instance

    def __init__(self):
        self.connection = self._instance.connection

    def query(self, query):
        try:
            with self.connection.cursor() as cur:
                cur.execute(query)
                rows = cur.fetchall()
                cols = [desc[0] for desc in cur.description]
        except Exception as error:
            print('error executing query "{}", error: {}'.format(query, error))
            return None
        else:
            return rows, cols

    def __del__(self):
        self.connection.close()

db = Postgres()
class SQL():
    def select_task():
        # No placeholders in this statement, so no .format() call is needed
        s = """
            UPDATE schema.tasks
            SET status = 'ready'
            WHERE task_id = ( SELECT task_id
                              FROM schema.tasks
                              WHERE tasks.status = 'new'
                              LIMIT 1)
            RETURNING *
            ;
            """
        return Postgres.query(db, s)

    def task_status(id,status):
        s = """
            UPDATE
                schema.tasks
            SET
                status = '{s}'
            WHERE
                tasks.task_id = '{id}'
            ;
            """.format(s=status,
                       id=id)
        return Postgres.query(db, s)
Problem
This works with one ssh connection. Tasks are retrieved from the database and processed, once finished the task is set to 'done'. Once I open a second ssh connection in a second terminal to run python -m core.utils.task (so to say, in parallel) the exact same rows of the tasks table are processed in both - ignoring that they have been updated.
Question
What are your suggestions to get this to work? There are millions of tasks and I need to run them in parallel. Before implementing threading or multiprocessing I wanted to test it with multiple ssh connections first, bad idea? I have fiddled around with the isolation levels and autocommit settings in psycopg2's set_session() but without luck. I checked the sessions in the Database server and can see that each process of python -m core.utils.task has its own PID, only connecting once, exactly like this singleton pattern should work. Any ideas or pointers how to deal with this are much appreciated!
The main problem is that performing a task is not an atomic operation. Therefore the same task can be processed several times in different ssh sessions.
In this implementation, you can try adding an "INPROGRESS" status so that tasks that are already being processed are not retrieved again. But be sure to use autocommit.
However, I would implement this using threads and a database connection pool, and would fetch tasks in batches using OFFSET and LIMIT. The do_something, select_task and task_status functions would then operate on a batch of tasks.
Also, there is no need to implement the Postgres class as a singleton.
Amended (see the comments below)
You can add FOR UPDATE SKIP LOCKED to the SQL query in the current implementation (see url).
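As a rough sketch of that suggestion, the claiming query from the question could be extended as follows. The table and column names are taken from the question; claim_task and CLAIM_TASK_SQL are hypothetical names, the ORDER BY is an added assumption to make the pick deterministic, and the connection is assumed to be in autocommit mode as in the question's set_session() call.

```python
# Assumed table and column names follow the question (schema.tasks, task_id, status).
CLAIM_TASK_SQL = """
UPDATE schema.tasks
SET status = 'ready'
WHERE task_id = (
    SELECT task_id
    FROM schema.tasks
    WHERE status = 'new'
    ORDER BY task_id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING *;
"""


def claim_task(conn):
    """Claim one 'new' task and return it as a dict, or None if none is left.

    conn is any DB-API connection whose cursors work as context managers
    (psycopg2 cursors do); autocommit is assumed to be on, as in the question.
    """
    with conn.cursor() as cur:
        cur.execute(CLAIM_TASK_SQL)
        row = cur.fetchone()
        if row is None:
            return None
        # Build column names from the cursor metadata, as query() does in pgsql.py
        columns = [desc[0] for desc in cur.description]
    return dict(zip(columns, row))
```

With SKIP LOCKED, a second process running the same statement skips the row that the first process has locked instead of waiting for it and then updating the same task again.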
If you want to work with batches, then separate the data by some serial column (well, or just sort the data in a table).
My implementation using batches.
This can be implemented using ThreadPoolExecutor and PersistentConnectionPool.
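The linked implementation is not reproduced here, so the following is only a minimal sketch of how a thread pool and a connection pool might be combined. chunked and run_batches are hypothetical helpers; the pool argument is assumed to be any object with getconn() and putconn(), such as psycopg2.pool.PersistentConnectionPool or ThreadedConnectionPool created with maxconn at least as large as max_workers, and process_batch stands in for a batch version of do_something.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_batches(pool, batches, process_batch, max_workers=4):
    """Run process_batch(conn, batch) for every batch on worker threads.

    pool is any object with getconn()/putconn(); process_batch is a
    caller-supplied function (hypothetical here) that does the real work.
    """
    def work(batch):
        conn = pool.getconn()
        try:
            return process_batch(conn, batch)
        finally:
            pool.putconn(conn)  # hand the connection back even if the batch failed

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves the order of the input batches
        return list(executor.map(work, batches))
```

Because every worker returns its connection in the finally block, the number of open connections stays bounded by the pool rather than by the number of tasks.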
I have a Python 3 program that updates a large list of rows based on their ids (in a table in a Postgres 9.5 database).
I use multiprocessing to speed up the process. As Psycopg's connections can’t be shared across processes, I create a connection for each row, then close it.
Overall, multiprocessing is faster than single processing (5 times faster with 8 CPUs). However, creating a connection is slow: I'd like to create just a few connections and keep them open as long as required.
Since .map() chops ids_list into a number of chunks which it submits to the process pool, would it be possible to share a database connection for all ids in the same chunk/process?
Sample code:
from multiprocessing import Pool
import psycopg2

def create_db_connection():
    conn = psycopg2.connect(database=database,
                            user=user,
                            password=password,
                            host=host)
    return conn

def my_function(item_id):
    conn = create_db_connection()
    # Other CPU-intensive operations are done here
    cur = conn.cursor()
    cur.execute("""
        UPDATE table
        SET
        my_column = 1
        WHERE id = %s;
        """,
        (item_id, ))
    cur.close()
    conn.commit()

if __name__ == '__main__':
    ids_list = [] # Long list of ids
    pool = Pool() # os.cpu_count() processes
    pool.map(my_function, ids_list)
Thanks for any help you can provide.
You can use the initializer parameter of the Pool constructor.
Set up the DB connection in the initializer function. Maybe pass the connection credentials as parameters.
Have a look at the docs: https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool
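A minimal sketch of that idea, assuming the connection factory (for example the question's create_db_connection) is a top-level function so it can be passed to the workers. init_worker, run and the module-level _conn are hypothetical names, and my_table is a placeholder table name.

```python
from multiprocessing import Pool

_conn = None  # set once per worker process by init_worker


def init_worker(connect):
    """Pool initializer: open one connection for the lifetime of this worker."""
    global _conn
    _conn = connect()


def my_function(item_id):
    """Update one row using the worker's own connection."""
    # Other CPU-intensive operations are done here
    cur = _conn.cursor()
    cur.execute("UPDATE my_table SET my_column = 1 WHERE id = %s;", (item_id,))
    cur.close()
    _conn.commit()
    return item_id


def run(ids_list, connect, processes=None):
    """Process all ids with one connection per worker instead of one per row."""
    with Pool(processes=processes, initializer=init_worker,
              initargs=(connect,)) as pool:
        return pool.map(my_function, ids_list)
```

It would be called as run(ids_list, create_db_connection) under the if __name__ == '__main__': guard. Each worker process opens exactly one connection, which is released when the worker exits.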
I'm developing on Heroku using their Postgres add-on with the Dev plan, which has a connection limit of 20. I'm new to Python and this may be trivial, but I find it difficult to abstract the database connection without causing OperationalError: (OperationalError) FATAL: too many connections for role.
Currently I have databeam.py:
import os
from flask import Flask
from flask_sqlalchemy import SQLAlchemy  # flask.ext.* imports were removed in Flask 1.0
from settings import databaseSettings

class Db(object):
    def __init__(self):
        self.app = Flask(__name__)
        self.app.config.from_object(__name__)
        self.app.config['SQLALCHEMY_DATABASE_URI'] = os.environ.get('DATABASE_URL', databaseSettings())
        self.db = SQLAlchemy(self.app)

db = Db()
And when I'm creating a controller for a page, I do this:
import databeam
db = databeam.db
locations = databeam.locations
templateVars = db.db.session.query(locations).filter(locations.parent == 0).order_by(locations.order.asc()).all()
This does produce what I want, but slowly, and at times it causes the error mentioned above. Since I come from a PHP background I have a certain mindset of how to deal with DB connections (i.e. like the example above), but I fear it doesn't fit well with Python.
What is the proper way of abstracting the db connection in one place and then just using the same connection in all imports?
With SQLAlchemy you can create a connection pool. The pool size you configure applies to each dyno. Since the Dev and Basic plans allow up to 20 connections, you could set it to 20 if you run 1 dyno, 10 if you run 2, and so on. To configure the pool, set it up on the engine:
engine = create_engine('postgresql://me@localhost/mydb',
                       pool_size=20, max_overflow=0)
This sets up your DB engine with a pool that connections are then pulled from automatically. You can also configure the pool manually; more details can be found in the SQLAlchemy pooling guide - http://docs.sqlalchemy.org/en/latest/core/pooling.html
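The per-dyno arithmetic can be written down as a small helper. This is a sketch only: engine_options is a hypothetical function, and the processes_per_dyno parameter is an added assumption that covers the case where each dyno runs several worker processes, each of which would hold its own pool.

```python
def engine_options(connection_limit, dynos, processes_per_dyno=1):
    """Split a database connection limit evenly across all application processes."""
    total_processes = dynos * processes_per_dyno
    if total_processes < 1:
        raise ValueError("at least one process is required")
    pool_size = connection_limit // total_processes
    if pool_size < 1:
        raise ValueError("more processes than available connections")
    # max_overflow=0 keeps each process from exceeding its share
    return {"pool_size": pool_size, "max_overflow": 0}


print(engine_options(20, 1))  # one dyno gets the whole limit
print(engine_options(20, 2))  # two dynos share it
```

The resulting dictionary can be passed on as keyword arguments, for example create_engine(url, **engine_options(20, 2)).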