I'm trying to determine the correct way to establish a DB connection within a Pool.map() worker. Do I establish the connection and close it inside the function that's being run, or outside of the Pool.map() call? Below is some pseudocode that's close to what I currently have.
from multiprocessing import Pool
import pandas as pd
import pyodbc

server = DB_SERVER
database = DATABASE
username = USERNAME
password = PASSWORD

def ref_worker(file):
    # driver is set in the __main__ block below
    conn = pyodbc.connect('DRIVER=' + driver + ';SERVER=' + server + ';PORT=1433;DATABASE=' + database + ';UID=' + username + ';PWD=' + password)
    temp_data = pd.read_csv(file, dtype=str)
    # logic here
    cursor = conn.cursor()
    sql_string = "insert into db (column) values (?)"  # placeholder SQL
    cursor.execute(sql_string, temp_data)
    cursor.commit()
    cursor.close()
    conn.close()
if __name__ == '__main__':
    driver = ''
    driver_names = [x for x in pyodbc.drivers() if x.endswith(' for SQL Server')]
    if driver_names:
        driver = driver_names[0]
    if driver:
        conn_str = 'DRIVER={}; ...'.format(driver)
    else:
        print('(No suitable driver found. Cannot connect.)')

    file_list = pd.read_csv('list_of_files.csv', header=None, dtype=str)

    pool = Pool(8)  # create a multiprocessing Pool
    pool.map(ref_worker, file_list)  # process the file list with the pool
    pool.close()
    pool.join()
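One common pattern, sketched below, is to open the connection once per worker process via the Pool's initializer instead of once per file; the connection string, table, and column names here are placeholders standing in for the question's real values, and the single-column file list is an assumption.

from multiprocessing import Pool
import pandas as pd
import pyodbc

conn = None  # one connection per worker process, created by init_worker

def init_worker(conn_str):
    # Runs once in each worker process when the pool starts.
    global conn
    conn = pyodbc.connect(conn_str)

def ref_worker(file):
    # Reuses the per-process connection instead of reconnecting for every file.
    temp_data = pd.read_csv(file, dtype=str)
    cursor = conn.cursor()
    # Placeholder insert mirroring the question; adjust table/columns to the real schema.
    cursor.executemany("insert into db (column) values (?)", temp_data.values.tolist())
    cursor.commit()
    cursor.close()

if __name__ == '__main__':
    conn_str = 'DRIVER=...;SERVER=...;PORT=1433;DATABASE=...;UID=...;PWD=...'  # fill in real values
    # assumes list_of_files.csv contains a single column of file paths
    file_list = pd.read_csv('list_of_files.csv', header=None, dtype=str)[0].tolist()
    with Pool(8, initializer=init_worker, initargs=(conn_str,)) as pool:
        pool.map(ref_worker, file_list)

Connecting and closing inside the worker, as in the question, also works; it simply pays the connection cost once per file rather than once per worker process.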
I have a list in Python with 1000 values. Each value has to be processed separately, and each update is independent of the others, so the question is how to run them as independent processes with multiprocessing. My current code:
import mysql.connector

for each in value_list:
    # Connect to the database
    mydb = mysql.connector.connect(
        host='localhost',
        database='College',
        user='root',
    )
    cs = mydb.cursor()
    # Update statement
    statement = "UPDATE STUDENT SET AGE = 23 WHERE Name = %s"
    cs.execute(statement, (each,))
    mydb.commit()
    # Disconnect from the database
    mydb.close()
The above code processes the values one by one. Since each update is independent of the others, how can I achieve this with multiprocessing?
Is there a way to use something like from joblib import Parallel, delayed?
Here's an example of how this could be done using multiprocessing.
from multiprocessing import Process, Queue
import mysql.connector as MYSQL

NPROCS = 5
value_list = []  # List of names to be updated

CONFIG = {
    'host': 'localhost',
    'user': 'root',
    #'passwd': 'secret',
    'database': 'College'
}

def process(queue):
    conn = None
    try:
        conn = MYSQL.connect(**CONFIG)
        while param := queue.get():
            cursor = conn.cursor()
            sql = f'UPDATE STUDENT SET AGE=23 WHERE Name="{param}"'
            cursor.execute(sql)
            cursor.close()
    except Exception as e:
        print(e)
    finally:
        if conn:
            conn.commit()
            conn.close()

def main():
    queue = Queue()
    procs = [Process(target=process, args=(queue,)) for _ in range(NPROCS)]
    for p in procs:
        p.start()
    for each in value_list:
        queue.put(each)
    for _ in range(NPROCS):
        queue.put(None)
    for p in procs:
        p.join()

if __name__ == '__main__':
    main()
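The question also asks about joblib. Reusing the CONFIG and value_list above, and assuming each task opens its own short-lived connection, a roughly equivalent sketch with Parallel and delayed could look like this:

from joblib import Parallel, delayed
import mysql.connector as MYSQL

def update_one(name):
    # Each task opens its own connection, runs one parameterised UPDATE, and commits.
    conn = MYSQL.connect(**CONFIG)
    try:
        cursor = conn.cursor()
        cursor.execute("UPDATE STUDENT SET AGE = 23 WHERE Name = %s", (name,))
        cursor.close()
        conn.commit()
    finally:
        conn.close()

Parallel(n_jobs=NPROCS, prefer="processes")(delayed(update_one)(name) for name in value_list)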
I am trying to insert values into a table in my Redshift cluster. The connection works, since I can read the table, but I can't insert into it. SELECT statements run fine, but when I try to insert values from a Lambda function, the query is aborted with no error or log information about why.
The query part is like this:
conn = psycopg2.connect(dbname='dev',
                        host='redshift-cluster-summaries.c0xcgwtgz65l.us-east-2.redshift.amazonaws.com',
                        port='5439',
                        user='****',
                        password='****%')
cur = conn.cursor()
cur.execute("INSERT INTO public.summaries(topic,summary) values('data', 'data_summary');")
#print(cur.fetchone())
cur.close()
conn.close()
As I said, there is no log information about why it was aborted, and it is not giving me any kind of error. When I just use a SELECT statement, it works.
Is there anyone who can guide me through what could be going on?
You forgot to do conn.commit()
conn = psycopg2.connect(dbname='dev',
                        host='redshift-cluster-summaries.c0xcgwtgz65l.us-east-2.redshift.amazonaws.com',
                        port='5439',
                        user='****',
                        password='****%')
cur = conn.cursor()
cur.execute("INSERT INTO public.summaries(topic,summary) values('data', 'data_summary');")
cur.close()
conn.commit()
conn.close()
A slightly improved way to run this:
from contextlib import contextmanager

@contextmanager
def cursor():
    with psycopg2.connect(dbname='dev',
                          host='redshift-cluster-summaries.c0xcgwtgz65l.us-east-2.redshift.amazonaws.com',
                          port='5439',
                          user='****',
                          password='****%') as conn:
        try:
            yield conn.cursor()
        finally:
            try:
                conn.commit()
            except psycopg2.InterfaceError:
                pass

def run_insert(query):
    with cursor() as cur:
        cur.execute(query)
        cur.close()

run_insert("INSERT INTO public.summaries(topic,summary) values('data', 'data_summary');")
The issue is that I have multiple databases which I want to share the same database pool in SQLAlchemy. This runs on a Lambda, and the pool is created when the Lambda is initialised; I want subsequent DB connections to use the existing pool.
What works just fine is the initial pool connection bpConnect and any subsequent queries to that connection.
What DOESN'T work is the companyConnect connection. I get the following error:
sqlalchemy.exc.StatementError: (builtins.AttributeError) 'XRaySession' object has no attribute 'cursor'
I have these for my connections:
# Pooling
import sqlalchemy.pool as pool

#################### Engines ###################################################

def bpGetConnection():
    engine_endpoint = f"mysql+pymysql://{os.environ['DB_USERNAME']}:{os.environ['DB_PASSWORD']}@{os.environ['DB_HOST']}:{str(os.environ['DB_PORT'])}/{os.environ['database']}"
    engine = create_engine(engine_endpoint, echo_pool=True)
    session = XRaySessionMaker(bind=engine, autoflush=True, autocommit=False)
    db = session()
    return db

bpPool = pool.StaticPool(bpGetConnection)

def companyGetConnection(database):
    engine_endpoint = f"mysql+pymysql://{os.environ['DB_USERNAME']}:{os.environ['DB_PASSWORD']}@{os.environ['DB_HOST']}:{str(os.environ['DB_PORT'])}/{database}"
    compEngine = create_engine(engine_endpoint, pool=bpPool)
    session = XRaySessionMaker(bind=compEngine, autoflush=True, autocommit=False)
    db = session()
    return db

#################### POOLING #############################################

def bpConnect():
    conn = bpPool.connect()
    return conn

def companyConnect(database):
    conn = companyGetConnection(database)
    return conn

#################################################################
They are called in this example:
from connections import companyConnect, bpConnect
from models import Company, Customers

def getCustomers(companyID):
    db = bpConnect()
    myQuery = db.query(Company).filter(Company.id == companyID).one()
    compDB = companyConnect(myQuery.database)
    customers = compDB.query(Customers).all()
    return customers
I figured out how to do it with dynamic pools on a lambda:
class DBRegistry(object):
    _db = {}

    def get(self, url, **kwargs):
        if url not in self._db:
            engine = create_engine(url, **kwargs)
            Session = XRaySessionMaker(bind=engine, autoflush=True, autocommit=False)
            session = scoped_session(Session)
            self._db[url] = session
        return self._db[url]

compDB = DBRegistry()

def bpGetConnection():
    engine_endpoint = f"mysql+pymysql://{os.environ['DB_USERNAME']}:{os.environ['DB_PASSWORD']}@{os.environ['DB_HOST']}:{str(os.environ['DB_PORT'])}/{os.environ['database']}?charset=utf8"
    engine = create_engine(engine_endpoint)
    session = XRaySessionMaker(bind=engine, autoflush=True, autocommit=False)
    db = session()
    return db

bpPool = pool.QueuePool(bpGetConnection, pool_size=500, timeout=11)

def bpConnect():
    conn = bpPool.connect()
    return conn

def companyConnect(database):
    engine_endpoint = f"mysql+pymysql://{os.environ['DB_USERNAME']}:{os.environ['DB_PASSWORD']}@{os.environ['DB_HOST']}:{str(os.environ['DB_PORT'])}/{database}?charset=utf8"
    conn = compDB.get(engine_endpoint, poolclass=pool.QueuePool)
    return conn
So basically it uses one pool for the constant connection needed on the main database, and a separate, dynamically chosen pool per company database. When a connection to one of those company databases is needed, it checks whether a pool for that database already exists in the registry of pools; if it does not, it creates one and registers it.
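For illustration, a call site using this registry might look like the sketch below; the helper name is hypothetical, Customers comes from the question's models, and session.remove() is shown as one way to hand the connection back to that database's pool when the invocation finishes.

from connections import companyConnect
from models import Customers

def getCompanyCustomers(database_name):
    # Hypothetical helper: looks up (or lazily creates) the scoped session for this database.
    session = companyConnect(database_name)
    try:
        return session.query(Customers).all()
    finally:
        session.remove()  # release the session and return its connection to the pool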
I'm trying to connect to SQL Server 2019 via SQLAlchemy. I'm using both mssql+pyodbc and mssql+pyodbc_mssql, but in both cases it cannot connect and always returns default_schema_name not defined.
I have already checked the database, the user schema defined, and everything else.
Example:
from sqlalchemy import create_engine
import urllib
server = 'server'
database = 'db'
username = 'user'
password = 'pass'
#cnxn = 'DRIVER={ODBC Driver 17 for SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password+';Trusted_Connection=yes'
cnxn = 'DSN=SQL Server;SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password+';Trusted_Connection=yes'
params = urllib.parse.quote_plus(cnxn)
engine = create_engine('mssql+pyodbc:///?odbc_connect=%s' % params)
cnxn = engine.connect()
return None, dialect.default_schema_name
AttributeError: 'MSDialect_pyodbc' object has no attribute 'default_schema_name'
TIA.....
Hopefully the following provides enough for a minimum viable sample. I'm using it in a larger script to move 12m rows 3x a day, and for that reason I've included an example of chunking that I pinched from elsewhere.
# Set up enterprise DB connection
# Enterprise DB to be used
import math
import pandas as pd
import sqlalchemy as sql

DRIVER = "ODBC Driver 17 for SQL Server"
USERNAME = "SQLUsername"
PSSWD = "SQLPassword"
SERVERNAME = "SERVERNAME01"
INSTANCENAME = r"\SQL_01"
DB = "DATABASE_Name"
TABLE = "Table_Name"

# Set up SQL database connection variable / path
# I have included this as an example that can be used to chunk data up
conn_executemany = sql.create_engine(
    f"mssql+pyodbc://{USERNAME}:{PSSWD}@{SERVERNAME}{INSTANCENAME}/{DB}?driver={DRIVER}", fast_executemany=True
)
# Used for SQL loading from a pandas DataFrame
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

# Used for SQL loading from a pandas DataFrame
def insert_with_progress(df, engine, table="", schema="dbo"):
    con = engine.connect()
    # Replace table
    #engine.execute(f"DROP TABLE IF EXISTS {schema}.{table};")  # This only works for SQL Server 2016 or greater
    try:
        engine.execute(f"DROP TABLE Temp_WeatherGrids;")
    except:
        print("Unable to drop temp table")
    try:
        engine.execute(f"CREATE TABLE [dbo].[Temp_WeatherGrids]([col_01] [int] NULL,[Location] [int] NULL,[DateTime] [datetime] NULL,[Counts] [real] NULL) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY];")
    except:
        print("Unable to create temp table")
    # Insert with progress
    SQL_SERVER_CHUNK_LIMIT = 250000
    chunksize = math.floor(SQL_SERVER_CHUNK_LIMIT / len(df.columns))
    for chunk in chunker(df, chunksize):
        chunk.to_sql(
            name=table,
            con=con,
            if_exists="append",
            index=False
        )
if __name__ == '__main__':
    # Initialise data. Example - make your own DataFrame. DateTime should be pandas datetime objects.
    data = {'Col_01': [0, 1, 2, 3],
            'Location': ['Bar', 'Pub', 'Brewery', 'Bottleshop'],
            'DateTime': ["1/1/2018", "1/1/2019", "1/1/2020", "1/1/2021"],
            'Counts': [1, 2, 3, 4]}

    # Create DataFrame
    df = pd.DataFrame(data)

    insert_with_progress(df, conn_executemany, table=TABLE)

    del [df]
I am curious to know whether this is the correct way of using cx_Oracle with contextlib and connection pooling via DRCP.
import cx_Oracle
import threading
import time

def get_connection():
    connection = cx_Oracle.connect(user='username', password='password', dsn='mydsn_name/service_name:pooled')
    return connection

def myfunc():
    with get_connection() as conn:
        cursor = conn.cursor()
        for _ in range(10):
            cursor.execute("select * from mytable")
            val = cursor.fetchone()
            time.sleep(60)
            print("Thread", threading.current_thread().name, "fetched sequence =", val)

results = []
for thread in range(0, 10):
    current_thread = threading.Thread(name=f'Thread {thread}', target=myfunc)
    results.append(current_thread)
    current_thread.start()
print('Started All Threads')

for thread in results:
    thread.join()

print("All done!")
I am not sure if I am doing the right thing here. I have no idea how to confirm that the connection is being returned to the connection pool, and that each thread is not opening a brand new connection to the database, although the cx_Oracle docs seem to indicate I am on the right path.
You'll get the most benefit if you also use a cx_Oracle connection pool at the same time as DRCP. You need to set cclass with DRCP, otherwise you will lose its benefits. You can then decide what level of session reuse (the 'purity') to use. Check the cx_Oracle tutorial. From solutions/connect_pool2.py:
pool = cx_Oracle.SessionPool("pythonhol", "welcome", "localhost/orclpdb:pooled",
                             min=2, max=5, increment=1, threaded=True)

def Query():
    con = pool.acquire(cclass="PYTHONHOL", purity=cx_Oracle.ATTR_PURITY_SELF)
    #con = pool.acquire(cclass="PYTHONHOL", purity=cx_Oracle.ATTR_PURITY_NEW)
    cur = con.cursor()
    for i in range(4):
        cur.execute("select myseq.nextval from dual")
        seqval, = cur.fetchone()
There are V$ views like V$CPOOL_STATS you can query to check whether DRCP is being used. Links to some resources are in https://oracle.github.io/node-oracledb/doc/api.html#drcp
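As a rough check that DRCP is actually being used (a sketch, assuming the pooled user has privileges to read the V$ views), you can query V$CPOOL_STATS through the same pool and watch the request/hit/miss counters move:

def check_drcp_stats():
    # Counters in V$CPOOL_STATS only change when DRCP pooled servers are in use.
    con = pool.acquire(cclass="PYTHONHOL", purity=cx_Oracle.ATTR_PURITY_SELF)
    cur = con.cursor()
    cur.execute("select num_requests, num_hits, num_misses from v$cpool_stats")
    for num_requests, num_hits, num_misses in cur:
        print("requests:", num_requests, "hits:", num_hits, "misses:", num_misses)
    pool.release(con)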