Share DB connection in a process pool - python

I have a Python 3 program that updates a large list of rows based on their ids (in a table in a Postgres 9.5 database).
I use multiprocessing to speed up the process. As Psycopg's connections can’t be shared across processes, I create a connection for each row, then close it.
Overall, multiprocessing is faster than single processing (5 times faster with 8 CPUs). However, creating a connection is slow: I'd like to create just a few connections and keep them open as long as required.
Since .map() chops ids_list into a number of chunks which it submits to the process pool, would it be possible to share a database connection for all ids in the same chunk/process?
Sample code:
from multiprocessing import Pool
import psycopg2
def create_db_connection():
    conn = psycopg2.connect(database=database,
                            user=user,
                            password=password,
                            host=host)
    return conn

def my_function(item_id):
    conn = create_db_connection()
    # Other CPU-intensive operations are done here
    cur = conn.cursor()
    cur.execute("""
        UPDATE table
        SET
            my_column = 1
        WHERE id = %s;
        """,
        (item_id, ))
    cur.close()
    conn.commit()

if __name__ == '__main__':
    ids_list = []  # Long list of ids
    pool = Pool()  # os.cpu_count() processes
    pool.map(my_function, ids_list)
Thanks for any help you can provide.

You can use the initializer parameter of the Pool constructor.
Set up the DB connection in the initializer function. Maybe pass the connection credentials as parameters.
Have a look at the docs: https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool
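A minimal sketch of that approach, reusing the asker's placeholder names (database, user, password, host and ids_list are taken from the question):
import psycopg2
from multiprocessing import Pool

conn = None  # one connection per worker process, created by the initializer

def init_worker(database, user, password, host):
    # Runs once in every worker process; the connection is then reused
    # for each id that this worker handles.
    global conn
    conn = psycopg2.connect(database=database, user=user,
                            password=password, host=host)

def my_function(item_id):
    # Other CPU-intensive operations are done here
    with conn.cursor() as cur:
        cur.execute("UPDATE table SET my_column = 1 WHERE id = %s;", (item_id,))
    conn.commit()

if __name__ == '__main__':
    ids_list = []  # Long list of ids
    with Pool(initializer=init_worker,
              initargs=(database, user, password, host)) as pool:
        pool.map(my_function, ids_list)
This way each worker pays the connection cost once instead of once per row, and the connections are closed when the pool's processes exit.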

Related

How to reduce idle time of a Postgres connection using SQLAlchemy?

I have an app built using Fastapi & SQLAlchemy for handling all the DB-related stuff.
When the APIs are triggered via the frontend, I see that the connections are opened & they remain in IDLE state for a while. Is it possible to reduce the IDLE time via sqlalchemy?
I do the following to connect to the Postgres DB:
import sqlalchemy as db

eng = db.create_engine(<SQLALCHEMY_DATABASE_URI>)
conn = eng.connect()
metadata = db.MetaData()
table = db.Table(
    <table_name>,
    metadata,
    autoload=True,
    autoload_with=eng)

user_id = 1

try:
    if ids_by_user is None:
        query = db.select([table.columns.created_at]).where(
            table.columns.user_id == user_id,
        ).order_by(
            table.columns.created_at.desc()
        )
        result = conn.execute(query).fetchmany(1)
        time = result[0][0]
        time_filtering_query = db.select([table]).where(
            table.columns.created_at == time
        )
        time_result = conn.execute(time_filtering_query).fetchall()
        conn.close()
        return time_result
    else:
        output_by_id = []
        for i in ids_by_user:
            query = db.select([table]).where(
                db.and_(
                    table.columns.id == i,
                    table.columns.user_id == user_id
                )
            )
            result = conn.execute(query).fetchall()
            output_by_id.append(result)
        output_by_id = [output_by_id[j][0]
                        for j in range(len(output_by_id))
                        if output_by_id[j]]
        conn.close()
        return output_by_id
finally:
    eng.dispose()
Even after logging out of the app, the connections are still active & in idle state for a while and don't close immediately.
Edit 1
I tried using NullPool and the connections are still idle and in ROLLBACK, which is the same as when I didn't use NullPool.
You can reduce connection idle time by setting a maximum lifetime per connection with pool_recycle. Note that connections already checked out will not be terminated until they are no longer in use.
If you are interested in reducing both the idle time and keeping the overall number of unused connections low, you can set a lower pool_size and then set max_overflow to allow for more connections to be allocated when the application is under heavier load.
from sqlalchemy import create_engine

e = create_engine(<SQLALCHEMY_DATABASE_URI>,
                  pool_recycle=3600,  # idle connections will be terminated after 1 hour
                  pool_size=5,        # pool size under normal conditions
                  max_overflow=5      # additional connections when the pool size is exceeded
                  )
Google Cloud has a helpful guide to optimizing Postgres connection pooling that you might find useful.

Multiple database connections using UPDATE ... RETURNING seem not to update rows in the tasks table

Preface
I want to process tasks listed in a database table in parallel. Not looking for working code.
The Setup
1 PostgreSQL database server D
1 processing server P
1 User terminal T
using Python 3.6, psycopg2 2.7.6, PostgreSQL 11
D holds tables with data to be processed and a tasks table. A user at T ssh's into P, where the following command can be issued:
python -m core.utils.task
This task.py script is essentially a while loop that gets a task t with status 'new' from the tasks table on D until there are no new tasks left. A task t is basically a set of arguments for another function called do_something(t). do_something(t) itself will make many connections to D to get the data that needs to be processed, and sets the task's status to 'done' once it has finished; the while loop then starts all over and gets a new task.
In order to run python -m core.utils.task multiple times, I open multiple ssh connections. Not so good, I know; threading or multiprocessing would be better. But this is just for testing whether I can run the mentioned command twice.
There is a script that manages all the database interactions called pgsql.py, which is needed to get a task and is then used by do_something(t). I adapted a singleton pattern from this SE post.
Pseudo-Code (mostly)
task.py
import mymodule
import pgsql

def main():
    while True:
        r, c = pgsql.SQL.select_task()  # rows and columns
        task = dotdict(dict(zip(c, r[0])))
        mymodule.do_something(task)

if __name__ == "__main__":
    main()
mymodule.py
import pgsql

def do_something(t):
    input = pgsql.SQL.get_images(t.table, t.schema, t.image_id, t.image_directory)
    some_other_function(input)
    pgsql.SQL.task_status(t.task_id, 'done')
pgsql.py
import psycopg2 as pg

class Postgres(object):
    """Adapted from https://softwareengineering.stackexchange.com/a/358061/348371"""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = object.__new__(cls)
            db_config = {'dbname': 'dev01', 'host': 'XXXXXXXX',
                         'password': 'YYYYY', 'port': 5432, 'user': 'admin'}
            try:
                print('connecting to PostgreSQL database...')
                connection = Postgres._instance.connection = pg.connect(**db_config)
                connection.set_session(isolation_level='READ COMMITTED', autocommit=True)
            except Exception as error:
                print('Error: connection not established {}'.format(error))
                Postgres._instance = None
            else:
                print('connection established')
        return cls._instance

    def __init__(self):
        self.connection = self._instance.connection

    def query(self, query):
        try:
            with self.connection.cursor() as cur:
                cur.execute(query)
                rows = cur.fetchall()
                cols = [desc[0] for desc in cur.description]
        except Exception as error:
            print('error executing query "{}", error: {}'.format(query, error))
            return None
        else:
            return rows, cols

    def __del__(self):
        self.connection.close()


db = Postgres()


class SQL():
    def select_task():
        s = """
            UPDATE schema.tasks
            SET status = 'ready'
            WHERE task_id = ( SELECT task_id
                              FROM schema.tasks
                              WHERE tasks.status = 'new'
                              LIMIT 1)
            RETURNING *
            ;
            """
        return Postgres.query(db, s)

    def task_status(id, status):
        s = """
            UPDATE
                schema.tasks
            SET
                status = '{s}'
            WHERE
                tasks.task_id = '{id}'
            ;
            """.format(s=status,
                       id=id)
        return Postgres.query(db, s)
Problem
This works with one ssh connection. Tasks are retrieved from the database and processed; once finished, the task is set to 'done'. Once I open a second ssh connection in a second terminal to run python -m core.utils.task (in parallel, so to speak), the exact same rows of the tasks table are processed in both, ignoring that they have already been updated.
Question
What are your suggestions to get this to work? There are millions of tasks and I need to run them in parallel. Before implementing threading or multiprocessing I wanted to test it with multiple ssh connections first; bad idea? I have fiddled around with the isolation levels and autocommit settings in psycopg2's set_session(), but without luck. I checked the sessions on the database server and can see that each process of python -m core.utils.task has its own PID and connects only once, exactly as this singleton pattern should work. Any ideas or pointers on how to deal with this are much appreciated!
The main problem is that performing one task is not an atomic operation. Therefore, in different ssh sessions, the same task can be processed several times.
In this implementation, you can try to use an "INPROGRESS" status for a task so as not to retrieve tasks that are already being processed (those with the "INPROGRESS" status). But be sure to use autocommit.
But I would implement this using threads and a database connection pool, and would extract tasks in batches using OFFSET and LIMIT. The do_something, select_task and task_status functions would then operate on a batch of tasks.
Also, there is no need to implement the Postgres class as a singleton.
Amended (see the comments below)
You can add FOR UPDATE SKIP LOCKED to the SQL query in the current implementation; a sketch is shown below.
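For example, the query string in the asker's select_task could be amended along these lines (a sketch keeping the original table and status names):
s = """
    UPDATE schema.tasks
    SET status = 'ready'
    WHERE task_id = ( SELECT task_id
                      FROM schema.tasks
                      WHERE tasks.status = 'new'
                      LIMIT 1
                      FOR UPDATE SKIP LOCKED)
    RETURNING *
    ;
    """
With SKIP LOCKED, a second session simply skips the row a first session has already locked instead of waiting for it, so two sessions no longer claim the same task.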
If you want to work with batches, then separate the data by some serial column (or just sort the data in the table).
My implementation would use batches: this can be done with ThreadPoolExecutor and psycopg2's PersistentConnectionPool.
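A rough sketch of that batched idea, using psycopg2's ThreadedConnectionPool (the PersistentConnectionPool mentioned above would work similarly) and the asker's placeholder credentials; the per-task processing is left as a comment:
from concurrent.futures import ThreadPoolExecutor
from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(minconn=2, maxconn=8,
                              dbname='dev01', host='XXXXXXXX',
                              user='admin', password='YYYYY', port=5432)

def claim_batch(batch_size=100):
    # Borrow a connection, atomically claim a batch of 'new' tasks and return it.
    conn = pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute("""
                UPDATE schema.tasks
                SET status = 'ready'
                WHERE task_id IN (SELECT task_id
                                  FROM schema.tasks
                                  WHERE status = 'new'
                                  LIMIT %s
                                  FOR UPDATE SKIP LOCKED)
                RETURNING *;
                """, (batch_size,))
            rows = cur.fetchall()
        conn.commit()
        return rows
    finally:
        pool.putconn(conn)

def worker():
    while True:
        batch = claim_batch()
        if not batch:
            break
        for row in batch:
            pass  # build the task from `row` and call do_something(task) here

with ThreadPoolExecutor(max_workers=4) as executor:
    for _ in range(4):
        executor.submit(worker)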

Parallelize data import from mongodb python

How do I import data from MongoDB in a parallel manner? One approach would be to count all the documents in MongoDB, say there are 1000, then split that range, fetch the documents in batches of 100, and combine the batches again so that all 1000 are loaded.
Below is the code I use to import data from MongoDB into Python.
import pandas as pd
from pymongo import MongoClient

def _connect_mongo(host, port, username, password, db):
    """ A util for making a connection to mongo """
    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)
    return conn[db]

def read_mongo(db, collection, query={}, host='localhost', port=27017, username=None, password=None, no_id=True):
    """ Read from Mongo and Store into DataFrame """
    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)
    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query)
    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))
    # Delete the _id
    if no_id:
        del df['_id']
    return df
As I said in my comment, have you tried optimizing your database by using indexes? If the database is slow, I don't think parallelizing will improve it. If you still want to go parallel, call read_mongo from multiple threads.
For indexes you should check https://docs.mongodb.com/manual/indexes/
There's nothing code-related to add here; you just need to understand your database better.
As for the code, Python offers concurrency (threads) and parallelism (the multiprocessing package). You'd need to make your program call read_mongo with your already defined/split queries.
There are many examples out there. I'd try the indexes first, because they will help with the parallel work afterwards.
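A rough sketch of the threaded approach, reusing the question's read_mongo and assuming the documents can be split on a numeric field (the "seq" field and the ranges below are made up for illustration):
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def read_in_parallel(db, collection, queries, workers=4):
    # Call read_mongo once per pre-split query and concatenate the resulting frames.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        frames = list(executor.map(
            lambda q: read_mongo(db, collection, query=q), queries))
    return pd.concat(frames, ignore_index=True)

# Hypothetical split: four ranges over a "seq" field covering 1000 documents.
queries = [{'seq': {'$gte': lo, '$lt': lo + 250}} for lo in range(0, 1000, 250)]
df = read_in_parallel('mydb', 'mycollection', queries)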

Objects created in a thread can only be used in that same thread

I can't find the problem:
@app.route('/register', methods=['GET', 'POST'])
def register():
    form = RegisterForm(request.form)
    if request.method == 'POST' and form.validate():
        name = form.name.data
        email = form.email.data
        username = form.username.data
        password = sha256_crypt.encrypt(str(form.password.data))
        c.execute("INSERT INTO users(name,email,username,password) VALUES(?,?,?,?)",
                  (name, email, username, password))
        conn.commit()
        conn.close()
Error:
File "C:\Users\app.py", line 59, in register c.execute("INSERT INTO
users(name,email,username,password) VALUES(?,?,?,?)", (name, email,
username, password)) ProgrammingError: SQLite objects created in a
thread can only be used in that same thread.The object was created
in thread id 23508 and this is thread id 22640
Does this mean I can't use the name, email, username & password in an HTML file? How do I solve this?
Where you make your connection to the database, add the following.
conn = sqlite3.connect('your.db', check_same_thread=False)
Your cursor 'c' is not created in the same thread; it was probably initialized when the Flask app was run.
You probably want to generate the SQLite objects (the connection and the cursor) in the same method, such as:
@app.route('/')
def dostuff():
    with sql.connect("database.db") as con:
        name = "bob"
        cur = con.cursor()
        cur.execute("INSERT INTO students (name) VALUES (?)", (name,))
        con.commit()
        msg = "Done"
engine = create_engine(
    'sqlite:///restaurantmenu.db',
    connect_args={'check_same_thread': False}
)
Works for me
You can try this:
engine=create_engine('sqlite:///data.db', echo=True, connect_args={"check_same_thread": False})
It worked for me
In my case, I have the same issue with two Python files creating a SQLite engine and therefore possibly operating on different threads. Reading the SQLAlchemy docs, it seems it is better to use the singleton technique in both files:
# maintain the same connection per thread
from sqlalchemy.pool import SingletonThreadPool
engine = create_engine('sqlite:///mydb.db',
                       poolclass=SingletonThreadPool)
It does not solve all cases, meaning I occasionally get the same error, but I can easily work around it by refreshing the browser page. Since I'm only using this to debug my code, this is OK for me. For a more permanent solution, you should probably choose another database, like PostgreSQL.
As mentioned in https://docs.python.org/3/library/sqlite3.html and pointed out by @Snidhi Sofpro in a comment:
By default, check_same_thread is True and only the creating thread may use the connection. If set False, the returned connection may be shared across multiple threads. When using multiple threads with the same connection writing operations should be serialized by the user to avoid data corruption.
One way to achieve serialization:
import threading
import sqlite3
import queue
import traceback
import time
import random

work_queue = queue.Queue()

def sqlite_worker():
    con = sqlite3.connect(':memory:', check_same_thread=False)
    cur = con.cursor()
    cur.execute('''
        CREATE TABLE IF NOT EXISTS test (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            text TEXT,
            source INTEGER,
            seq INTEGER
        )
    ''')
    while True:
        try:
            (sql, params), result_queue = work_queue.get()
            res = cur.execute(sql, params)
            con.commit()
            result_queue.put(res)
        except Exception as e:
            traceback.print_exc()

threading.Thread(target=sqlite_worker, daemon=True).start()

def execute_in_worker(sql, params):
    # you might not really need the results if you only use this
    # for writing unless you use something like https://www.sqlite.org/lang_returning.html
    result_queue = queue.Queue()
    work_queue.put(((sql, params), result_queue))
    return result_queue.get(timeout=5)

def insert_test_data(seq):
    time.sleep(random.randint(0, 100) / 100)
    execute_in_worker(
        'INSERT INTO test (text, source, seq) VALUES (?, ?, ?)',
        ['foo', threading.get_ident(), seq]
    )

threads = []
for i in range(10):
    thread = threading.Thread(target=insert_test_data, args=(i,))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

for res in execute_in_worker('SELECT * FROM test', []):
    print(res)
# (1, 'foo', 139949462500928, 9)
# (2, 'foo', 139949496071744, 5)
# (3, 'foo', 139949479286336, 7)
# (4, 'foo', 139949487679040, 6)
# (5, 'foo', 139949854099008, 3)
# (6, 'foo', 139949470893632, 8)
# (7, 'foo', 139949862491712, 2)
# (8, 'foo', 139949845706304, 4)
# (9, 'foo', 139949879277120, 0)
# (10, 'foo', 139949870884416, 1)
As you can see, the data is inserted out of order but it's still all handled one by one in a while loop.
https://docs.python.org/3/library/queue.html
https://docs.python.org/3/library/threading.html#threading.Thread.join
https://docs.python.org/3/library/threading.html#threading.get_ident
I had the same problem and I fixed it by closing my connection after every call:
results = session.query(something, something).all()
session.close()
The error doesn't lie in the variables passed to your .execute(), but rather in the object instances that SQLite uses to access the DB.
I assume that you have:
conn = sqlite3.connect('your_database.db')
c = conn.cursor()
somewhere at the top of the Flask script, and these would be initialized when you first run the script. When the register function is called, a new thread, different from the initial script run, handles the process. Thus, in this new thread, you're utilizing object instances that come from a different thread, which SQLite flags as an error: rightfully so, because this may lead to data corruption if you expect your DB to be accessed by different threads while the app runs.
So, as a different method, instead of disabling SQLite's check_same_thread functionality, you could try initializing your DB connection and cursor within the HTTP methods that are being called.
With this, the SQLite objects & utilization will be on the same thread at runtime.
The code would be redundant, but it might save you in situations where the data is being accessed asynchronously, & will also prevent data corruption.
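A minimal sketch of that per-request approach, assuming the asker's Flask app, RegisterForm and sha256_crypt are already imported:
import sqlite3
from flask import request

@app.route('/register', methods=['GET', 'POST'])
def register():
    form = RegisterForm(request.form)
    if request.method == 'POST' and form.validate():
        # Connection and cursor are created inside the request-handling thread,
        # so SQLite's same-thread check is satisfied.
        conn = sqlite3.connect('your_database.db')
        c = conn.cursor()
        c.execute("INSERT INTO users(name,email,username,password) VALUES(?,?,?,?)",
                  (form.name.data, form.email.data, form.username.data,
                   sha256_crypt.encrypt(str(form.password.data))))
        conn.commit()
        conn.close()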
I was having this problem and I just used the answer in this post, which I will repost here:
creator = lambda: sqlite3.connect('file::memory:?cache=shared', uri=True)
engine = sqlalchemy.create_engine('sqlite://', creator=creator)
engine.connect()
This bypasses the problem that you can't give the string "file::memory:?cache=shared" as a URL to SQLAlchemy. I've seen a lot of answers, but this solved all my problems with using an in-memory SQLite database shared among multiple threads. I initialize the database by creating two tables with two threads for speed. Before this, the only way I could do that was with a file-backed DB, but that was giving me latency issues in a cloud deployment.
Create "database.py":
import sqlite3
def dbcon():
    return sqlite3.connect("your.db")
Then import:
from database import dbcon
db = dbcon()
db.execute("INSERT INTO users(name,email,username,password)
VALUES(?,?,?,?)", (name, email, username, password))
You probably won't need to close it because the thread will be killed right away.

Using Django DB connection in custom threaded scripts

There are many questions about whether the Django DB connection is thread-safe, but they all seem to be asking about the default request threads.
What if I am writing a custom script that uses a database connection in threads:
from django.db import connections
import threading

class Transform(object):
    def transform_data(self, listing):
        cursor = self.connection.cursor()
        cursor.execute('SELECT ... WHERE id = %s', listing.id)
        data = cursor.fetchall()
        ...

    def run(self):
        connection = self.connections['legacy']
        for listing in listings:
            threading.Thread(target=self.transform_data, args=[listing])
How safe is the data inside the transform_data thread, in the sense that the result from the cursor is not mixed up with other threads?
Ideally, each thread should be using its own connection. If you do that, then when you execute the select query inside transform_data you essentially get a snapshot of the data at that point in time. You can retrieve the rows without having to worry about them being updated or deleted by other threads, provided that the other threads have their own connections.
If all threads share the same connection, what exactly happens depends heavily on what database you are using and the transaction isolation level.
Each item in the connections object returns a thread-local connection to that database. By default, these connections cannot be shared between threads; attempting to do so will result in a DatabaseError.
Always use connections[alias] within the thread that executes your queries. Never access connections[alias] in the parent thread and pass the object to the child thread. This will ensure that every connection object you use is local to the current thread, avoiding any threading issues.
To fix your code and make it thread-safe, you would change it like this:
from django.db import connections
import threading

class Transform(object):
    def transform_data(self, listing):
        # Access the database connection on the global `connections` object
        # from within the child thread.
        cursor = connections['legacy'].cursor()
        cursor.execute('SELECT ... WHERE id = %s', listing.id)
        data = cursor.fetchall()
        ...

    def run(self):
        for listing in listings:
            threading.Thread(target=self.transform_data, args=[listing])
