I'm developing a Flask API, and I have the following code to set up a connection pool using psycopg2. Should I close the connection pool when the program terminates, and if so, how should I do it?
@contextmanager
def get_cursor():
    global connection_pool
    if not connection_pool:
        connection_pool = ThreadedConnectionPool(5, 25, dsn=PoolingWrap.generate_conn_string())
    con = connection_pool.getconn()
    try:
        yield con.cursor(cursor_factory=RealDictCursor)
    finally:
        connection_pool.putconn(con)
I believe that so long as you are closing the transactions in each connection correctly, you should not have any problem just leaving the global pool open.
I think the worst that can happen in that case is that some connections on the DB side take a short while to work out that they've been closed on the client side - but this should not be able to cause any data-consistency issues.
However, if you really want to close the connection pool before you exit, one way to do this is to register an atexit function.
import atexit

@atexit.register
def close_connection_pool():
    global connection_pool
    connection_pool.closeall()
I am running a Python app where, for various reasons, I have to host my program on a server in one part of the world and have my database in another.
I tested via a simple script, and from my home, which is in a country neighbouring the database server, the time to write and retrieve a row from the database is about 0.035 seconds (which is a nice speed imo), compared to 0.16 seconds when my Python server at the other end of the world performs the same action.
This is an issue, as I am trying to keep my Python app as fast as possible, so I was wondering if there is a smart way to do this?
As I am running my code synchronously, my program waits every time it has to write to the db, which is about 3 times a second, so the time adds up. Is it possible to run the connection to the database in a separate thread or something, so it doesn't halt the whole program while it tries to send data to the database? Or can this be done using asyncio (I have no experience with async code)?
I am really struggling figuring out a good way to solve this issue.
In advance, many thanks!
Yes, you can create a thread that does the writes in the background. In your case, it seems reasonable to have a queue where the main thread puts things to be written and the db thread gets and writes them. The queue can have a maximum depth so that when too much stuff is pending, the main thread waits. You could also do something different like drop things that happen too fast. Or, use a db with synchronization and write a local copy. You also may have an opportunity to speed up the writes a bit by committing multiple at once.
This is a sketch of a worker thread
import threading
import queue

class SqlWriterThread(threading.Thread):
    def __init__(self, db_connect_info, maxsize=8):
        super().__init__()
        self.db_connect_info = db_connect_info
        self.q = queue.Queue(maxsize)
        # TODO: Can expose q.put directly if you don't need to
        # intercept the call
        # self.put = self.q.put
        self.start()

    def put(self, statement):
        print(f"DEBUG: Putting\n{statement}")
        self.q.put(statement)

    def run(self):
        db_conn = None
        while True:
            # get all the statements you can, waiting on the first
            statements = [self.q.get()]
            try:
                while True:
                    statements.append(self.q.get(block=False))
            except queue.Empty:
                pass
            try:
                # early exit before connecting if channel is closed
                if statements[0] is None:
                    return
                if not db_conn:
                    db_conn = do_my_sql_connect()
                try:
                    print("DEBUG: Executing\n", "--------\n".join(f"{id(s)} {s}" for s in statements))
                    # TODO: need to detect a closed connection, then reconnect and restart the loop
                    cursor = db_conn.cursor()
                    for statement in statements:
                        if statement is None:
                            return
                        cursor.execute(*statement)
                finally:
                    db_conn.commit()
            finally:
                for _ in statements:
                    self.q.task_done()

sql_writer = SqlWriterThread(('user', 'host', 'credentials'))
sql_writer.put(('execute some stuff',))
I just can't figure out how to use the aiosqlite module so that I can keep the connection around for later use.
The example based on the aiosqlite project page
async with aiosqlite.connect('file.db') as conn:
    cursor = await conn.execute("SELECT 42;")
    rows = await cursor.fetchall()
    print('rows %s' % rows)
works fine, but I want to keep the connection around so that I can use it throughout my program.
Typically, with sqlite, I open a connection, squirrel it away and then use it throughout the life of the program.
I also tried things like:
conn = aiosqlite.connect('file.db')
c = await conn.__enter__()
AttributeError: 'Connection' object has no attribute '__enter__'
Is there a way to use this module without a context manager?
The "best" way would be for the entry-point of your application to create the aiosqlite connection using the context manager method, store a reference to the connection object somewhere, and then run the application's "run loop" method from within that context. This would ensure that when your application exits, the sqlite connection is cleaned up appropriately. This could look something like this:
async def main():
    async with aiosqlite.connect(...) as conn:
        # save conn somewhere
        await run_loop()
Alternately, you can await the appropriate enter/exit methods:
try:
    conn = aiosqlite.connect(...)
    await conn.__aenter__()
    # do stuff
finally:
    await conn.__aexit__(None, None, None)
Regardless, do beware that the asynchronous nature of aiosqlite does mean that shared connections will potentially result in overlap on transactions. If you need the assurance that concurrent queries take place with separate transactions, then you will need a separate connection per transaction.
According to the Python sqlite docs on sharing connections:
When using multiple threads with the same connection writing operations should be serialized by the user to avoid data corruption.
This applies equally to aiosqlite and asyncio. For example, the following code will potentially overlap both inserts into a single transaction:
async def one(db):
    await db.execute("insert ...")
    await db.commit()

async def two(db):
    await db.execute("insert ...")
    await db.commit()

async def main():
    async with aiosqlite.connect(...) as db:
        await asyncio.gather(one(db), two(db))
The correct solution here would be to either create a connection for each transaction, or use something like executescript to execute the entire transaction at once.
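For illustration, a minimal sketch of the first option, reusing the insert placeholders and 'file.db' from above: each coroutine opens its own short-lived connection, so its insert and commit cannot overlap with the other's transaction.

import asyncio
import aiosqlite

async def one():
    # dedicated connection: this insert/commit pair is its own transaction
    async with aiosqlite.connect('file.db') as db:
        await db.execute("insert ...")
        await db.commit()

async def two():
    async with aiosqlite.connect('file.db') as db:
        await db.execute("insert ...")
        await db.commit()

async def main():
    await asyncio.gather(one(), two())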
Two points. Firstly, a modifying statement locks the whole DB, as per:
When a database is accessed by multiple connections, and one of the processes modifies the database, the SQLite database is locked until that transaction is committed. The timeout parameter specifies how long the connection should wait for the lock to go away until raising an exception. The default for the timeout parameter is 5.0 (five seconds).
(https://docs.python.org/3/library/sqlite3.html#connection-objects)
so read-only statements will not be able to execute anyway, as long as a "write" is in progress. Consequently, I'd open a new connection for each write operation.
Secondly, for a file-based DB such as SQLite, the cost of opening a connection is obviously much lower than for a network-accessed DB. You may profile it; a simple test like:
x = 0
for i in range(1000):
    async with aiosqlite.connect('pic_db.db') as db:
        async with db.execute('SELECT 12', ()) as cursor:
            x += 1
shows that the two statements together take ~550 ms; if the for and the connect are interchanged, it takes ~150 ms; so each connect costs about 0.4 ms. If that's an issue, I'd try to reuse a connection for read-only statements (and open a new one for each modifying statement).
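A rough sketch of that split, assuming a small hypothetical wrapper (PicDB, read_one, write are made-up names) around the same pic_db.db file:

import aiosqlite

class PicDB:
    """Reuse one connection for reads; open a fresh one per write."""
    def __init__(self, path='pic_db.db'):
        self._path = path
        self._read_conn = None

    async def read_one(self, sql, params=()):
        # lazily open a single long-lived connection for read-only statements
        if self._read_conn is None:
            self._read_conn = await aiosqlite.connect(self._path)
        async with self._read_conn.execute(sql, params) as cursor:
            return await cursor.fetchone()

    async def write(self, sql, params=()):
        # a short-lived connection keeps each write in its own transaction
        async with aiosqlite.connect(self._path) as db:
            await db.execute(sql, params)
            await db.commit()

    async def close(self):
        if self._read_conn is not None:
            await self._read_conn.close()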
I have a function called connection which connects to the SQL databases. It can be simplified as:
import psycopg2 as pg

def connection():
    _conn_DB = pg.connect(dbname="h2", user="h2naya", host=h2_IP, password=DB_PW)
    _conn_DB.set_client_encoding('UTF8')
    _conn_ML = pg.connect(dbname="postgres", user="h2naya", host="localhost", password=ML_PW)
    _conn_ML.set_client_encoding('UTF8')
    return {"db": _conn_DB, "ml": _conn_ML}
Now I am trying to get data within a 7-day period starting from a specified date, so I create another function:
import pandas as pd

conns = connection()

def get_latest_7d(date_string):
    with conns["db"] as conn:
        # date_string is formatted into my_sql in real life
        fasting_bg_latest_7d = pd.read_sql(my_sql, conn)
    return fasting_bg_latest_7d
Now I can map get_latest_7d with a list of date strings, e.g. map(get_latest_7d, ["2018-07-01", "2018-07-02", "2018-07-03"]). It works.
As the list of date strings is really long, I'd like to use multiprocessing.Pool.map to accelerate the procedure. However, the code below gives me InterfaceError: connection already closed:
from multiprocessing import Pool

with Pool() as p:
    results = p.map(get_latest_7d, ["2018-07-01", "2018-07-02", "2018-07-03"])
I tried different ways and found the only working one is moving the line conns = connection() inside get_latest_7d, and not closing the connections before returning the data frame. By saying "closing the connections" I mean:
for conn_val in conns.values():
    conn_val.close()
It seems that I need to create connections for each process, and I am not allowed to close the connections before the process ends. I am curious about:
Why can't the connection be shared across the processes?
Why does closing a connection in one process affect other processes?
Is there any better practice for my purpose?
PS: It seems to be recommended to build connections in each process, but I am not sure that I fully understand what that means in practice.
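(For what it's worth, "build connections in each process" usually means something like the following sketch: let multiprocessing.Pool's initializer open a fresh set of connections inside each worker process, so no connection object ever crosses a process boundary. It reuses connection(), pd, my_sql, and the date strings from above.)

from multiprocessing import Pool

conns = None  # each worker process fills in its own copy

def init_worker():
    # runs once per worker process, after it has been spawned/forked
    global conns
    conns = connection()

def get_latest_7d(date_string):
    with conns["db"] as conn:
        fasting_bg_latest_7d = pd.read_sql(my_sql, conn)
    return fasting_bg_latest_7d

if __name__ == "__main__":
    with Pool(initializer=init_worker) as p:
        results = p.map(get_latest_7d, ["2018-07-01", "2018-07-02", "2018-07-03"])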
I have a PySpark job that updates some objects in HBase (Spark v1.6.0; happybase v0.9).
It sort-of works if I open/close an HBase connection for each row:
def process_row(row):
    conn = happybase.Connection(host=[hbase_master])
    # update HBase record with data from row
    conn.close()

my_dataframe.foreach(process_row)
After a few thousand upserts, we start to see errors like this:
TTransportException: Could not connect to [hbase_master]:9090
Obviously, it's inefficient to open/close a connection for each upsert. This function is really just a placeholder for a proper solution.
I then tried to create a version of the process_row function that uses a connection pool:
pool = happybase.ConnectionPool(size=20, host=[hbase_master])

def process_row(row):
    with pool.connection() as conn:
        pass  # update HBase record with data from row
For some reason, the connection pool version of this function returns an error (see complete error message):
TypeError: can't pickle thread.lock objects
Can you see what I'm doing wrong?
Update
I saw this post and suspect I'm experiencing the same issue: Spark attempts to serialize the pool object and distribute it to each of the executors, but this connection pool object cannot be shared across multiple executors.
It sounds like I need to split the dataset into partitions, and use one connection per partition (see design patterns for using foreachrdd). I tried this, based on an example in the documentation:
def persist_to_hbase(dataframe_partition):
    hbase_connection = happybase.Connection(host=[hbase_master])
    for row in dataframe_partition:
        pass  # persist data
    hbase_connection.close()

my_dataframe.foreachPartition(lambda dataframe_partition: persist_to_hbase(dataframe_partition))
Unfortunately, it still returns a "can't pickle thread.lock objects" error.
Down the line, happybase connections are just TCP connections, so they cannot be shared between processes. A connection pool is primarily useful for multi-threaded applications, and also proves useful for single-threaded applications that can use the pool as a global "connection factory" with connection reuse, which may simplify code because no connection objects need to be passed around. It also makes error recovery a bit easier.
In any case, a pool (which is just a group of connections) cannot be shared between processes. Trying to serialise it does not make sense for that reason. (Pools use locks, which causes serialisation to fail, but that is just a symptom.)
Perhaps you can use a helper that conditionally creates a pool (or connection) and stores it as a module-local variable, instead of instantiating it upon import, e.g.
_pool = None

def get_pool():
    global _pool
    if _pool is None:
        _pool = happybase.ConnectionPool(size=1, host=[hbase_master])
    return _pool

def process(...):
    with get_pool().connection() as connection:
        connection.table(...).put(...)
This instantiates the pool/connection upon first use instead of at import time.
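Tying that back to the Spark job above, a hedged sketch of how it might be used: each executor calls get_pool() itself inside foreachPartition, so nothing pool-related ever needs to be pickled (the table name and the row-to-put mapping here are made up and need adapting):

def persist_partition(rows):
    # executed inside the executor process, so the pool is created there
    with get_pool().connection() as connection:
        table = connection.table('my_table')  # hypothetical table name
        for row in rows:
            # hypothetical mapping from a dataframe row to an HBase put
            table.put(str(row.key), {'cf:value': str(row.value)})

my_dataframe.foreachPartition(persist_partition)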
How would one go about cancelling execution of a query statement using psycopg2 (the Python PostgreSQL driver)?
As an example, let's say I have the following code:
import psycopg2
cnx_string = "something_appropriate"
conn = psycopg2.connect(cnx_string)
cur = conn.cursor()
cur.execute("long_running_query")
Then I want to cancel the execution of that long running query from another thread - what method would I have to call on the connection/cursor objects to do this?
You can cancel a query by calling the pg_cancel_backend(pid) PostgreSQL function in a separate connection.
You can get the PID of the backend to cancel from the connection.get_backend_pid() method of psycopg2 (available from version 2.0.8).
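A hedged sketch of that approach, reusing the placeholders from the question: record the busy connection's backend PID up front, then cancel it from a second connection (here triggered by a timer thread; the 5-second timeout is made up).

import threading
import psycopg2

cnx_string = "something_appropriate"

conn = psycopg2.connect(cnx_string)
cur = conn.cursor()
pid = conn.get_backend_pid()  # PID of the backend that will run our query

def cancel_backend(pid):
    # must use a separate connection; the first one is busy executing
    other_conn = psycopg2.connect(cnx_string)
    try:
        other_conn.autocommit = True
        with other_conn.cursor() as other_cur:
            other_cur.execute("SELECT pg_cancel_backend(%s)", (pid,))
    finally:
        other_conn.close()

canceller = threading.Timer(5.0, cancel_backend, args=(pid,))
canceller.start()
try:
    cur.execute("long_running_query")  # raises QueryCanceled if the cancel lands
finally:
    canceller.cancel()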
The connection object has a cancel method. Using this and threading, you could use:
sqltimeout = threading.Timer(sql_timeout_seconds, conn.cancel)
sqltimeout.start()
When the timer expires, the cancel is sent to the connection and an exception will be raised by the server.
Don't forget to cancel the timer when the query finishes normally:
sqltimeout.cancel()
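Put together, with the conn and cur from the question and a made-up timeout value, that flow might look roughly like this; the cancelled query shows up as a QueryCanceled error on the executing cursor.

import threading
import psycopg2.extensions

sql_timeout_seconds = 30.0  # hypothetical timeout

sqltimeout = threading.Timer(sql_timeout_seconds, conn.cancel)
sqltimeout.start()
try:
    cur.execute("long_running_query")
except psycopg2.extensions.QueryCanceled:
    print("query was cancelled by the timer")
finally:
    sqltimeout.cancel()  # stop the timer if the query finished (or failed) in time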
psycopg2's async execution support has been removed.
If you can use py-postgresql and its transactions (it's py3k), the internal implementation is asynchronous and supports being interrupted.