Python - pool.map_async pass params to the function

I'm getting data into a list of pandas.DataFrame objects.
Then I need to send this data to a DB. It takes ~2 min per iteration.
So I wanted to know if parallel processing is a good idea? (no lock issue or something?)
So I wanted to go from:
for df in df_list:
    # Send a DF's batch to the DB
    print('Sending DF\'s data to DB')
    df.to_sql('ga_canaux', engine, if_exists='append', index=False)
    db.check_data()
To something I found about multiprocessing:
with multiprocessing.Pool(processes=4) as pool:
    results = pool.map_async(df.to_sql(???), df_list)
    results.wait()
How can I pass the params I need to df.to_sql with map_async?
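One way to pass extra arguments with map_async is to wrap the call in a module-level helper and fix the constant arguments with functools.partial, keeping everything you pass picklable. A sketch under those assumptions (the connection URL and the small example frames are placeholders, not the question's real data):

import multiprocessing
from functools import partial

import pandas as pd
from sqlalchemy import create_engine

def send_df(df, table_name, db_url):
    # Build the engine inside the worker from a plain (picklable) URL string
    engine = create_engine(db_url)
    df.to_sql(table_name, engine, if_exists='append', index=False)
    engine.dispose()

if __name__ == '__main__':
    db_url = 'postgresql+psycopg2://user:password@host/dbname'  # assumption: your DB URL
    df_list = [pd.DataFrame({'x': [1, 2]}), pd.DataFrame({'x': [3, 4]})]  # example frames
    with multiprocessing.Pool(processes=4) as pool:
        # partial fixes table_name and db_url; the pool only varies df
        results = pool.map_async(partial(send_df, table_name='ga_canaux', db_url=db_url), df_list)
        results.wait()

pool.starmap_async(send_df, [(df, 'ga_canaux', db_url) for df in df_list]) is an equivalent spelling of the same idea.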
EDIT:
I tried something to pass N arguments, like:
pool = multiprocessing.Pool()
args = ((df, engine, db) for df in df_list)
results = pool.map(multiproc, args)
results.wait()
but I get the error TypeError: can't pickle _thread._local objects
EDIT 2:
I changed the way I'm doing multiprocessing a bit and it kind of works (179 s vs. 732 s with the same sample dataset). But I'm facing an error when I try to read from the DB inside the pool.
# Connect to the remote DB
global DB
DB = Database()
global ENGINE
ENGINE = DB.connect()

pool = mp.Pool(mp.cpu_count() - 1)
pool.map(multiproc, df_list)
pool.close()
pool.join()

def multiproc(df):
    print('Sending DF\'s data to DB')
    df.to_sql('ga_canaux', ENGINE, if_exists='append', index=False)
    DB.check_data()  # HERE
Error:
(psycopg2.OperationalError) SSL SYSCALL error: EOF detected
[SQL: 'SELECT COUNT(*) FROM ga_canaux'] (Background on this error at: http://sqlalche.me/e/e3q8)
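That EOF is often a sign that a connection or engine created in the parent process is being reused by the forked workers. A hedged sketch that gives each worker its own engine via the Pool initializer and leaves the read-back check to the parent (the connection URL and example frames are placeholders):

import multiprocessing as mp

import pandas as pd
from sqlalchemy import create_engine

ENGINE = None  # one engine per worker process

def init_worker(db_url):
    # Runs once in every child process, so nothing connection-related
    # is inherited from the parent.
    global ENGINE
    ENGINE = create_engine(db_url)

def multiproc(df):
    print('Sending DF\'s data to DB')
    df.to_sql('ga_canaux', ENGINE, if_exists='append', index=False)

if __name__ == '__main__':
    db_url = 'postgresql+psycopg2://user:password@host/dbname'  # assumption
    df_list = [pd.DataFrame({'x': [1, 2]}), pd.DataFrame({'x': [3, 4]})]  # example frames
    with mp.Pool(mp.cpu_count() - 1, initializer=init_worker, initargs=(db_url,)) as pool:
        pool.map(multiproc, df_list)
    # The question's DB.check_data() could run here, once, in the parent,
    # instead of inside every worker.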
EDIT 3:
When I try on a bigger sample I get a DB timeout: psycopg2.DatabaseError: SSL SYSCALL error: Operation timed out
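For the bigger sample, it may also be worth bounding each transaction by chunking the write; pandas.DataFrame.to_sql accepts a chunksize (the value below is an arbitrary assumption):

# Write in batches of 10,000 rows per INSERT instead of one huge statement
df.to_sql('ga_canaux', ENGINE, if_exists='append', index=False, chunksize=10000)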

Related

Python3 Loop Through Array of DBs Then Connect

I am trying to loop through a number of MySQL hosts which have the same connection info and execute the same query on each of them and fetch the results.
I'm still learning Python and am stuck on the following:
import pymysql

ENDPOINTS = ['endpoint01', 'endpoint02', 'endpoint03', 'endpoint04']
USER = "SOME_USER"
PASS = "SOME_PASSWORD"

print("Testing")

for x in ENDPOINTS:
    # Open database connection
    DATAB = pymysql.connect(x, USER, PASS)
    cursor = DATAB.cursor()
    cursor.execute("show databases like 'THIS%'")
    data = cursor.fetchall()
    print(data)
    DATAB.close()
And this is the error I receive:
DATAB = pymysql.connect(x,USER,PASS)
TypeError: __init__() takes 1 positional argument but 4 were given
You're passing the parameters incorrectly. Try
DATAB = pymysql.connect(host=x, user=USER, password=PASS)
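Put together, the corrected loop could look like this (same placeholder endpoints and credentials as in the question):

import pymysql

ENDPOINTS = ['endpoint01', 'endpoint02', 'endpoint03', 'endpoint04']
USER = "SOME_USER"
PASS = "SOME_PASSWORD"

for x in ENDPOINTS:
    # Keyword arguments avoid the "takes 1 positional argument" TypeError
    datab = pymysql.connect(host=x, user=USER, password=PASS)
    try:
        with datab.cursor() as cursor:
            cursor.execute("show databases like 'THIS%'")
            print(cursor.fetchall())
    finally:
        datab.close()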

Blob storage trigger timing out anywhere from ~10 seconds to a couple minutes

I'm getting quite a few timeouts as my blob storage trigger is running. It seems to time out whenever I'm inserting values into an Azure SQL DB. I have functionTimeout set to "00:40:00" in host.json, yet I'm seeing timeouts within a couple of minutes. Why would this be the case? My function app is on the ElasticPremium pricing tier.
System.TimeoutException message:
Exception while executing function: Functions.BlobTrigger2 The operation has timed out.
My connection to the db (I close it at the end of the script):
# urllib.parse.quote_plus for python 3
params = urllib.parse.quote_plus(fr'Driver={DRIVER};Server=tcp:{SERVER_NAME},1433;Database=newTestdb;Uid={USER_NAME};Pwd={PASSWORD};Encrypt=yes;TrustServerCertificate=no;Connection Timeout=0;')
conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine_azure = create_engine(conn_str,echo=True)
conn = engine_azure.connect()
This is the line of code that is run before the timeout happens (Inserting to db):
processed_df.to_sql(blob_name_file.lower(), conn, if_exists = 'append', index=False, chunksize=500)
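One adjustment often suggested when mssql+pyodbc inserts are slow enough to hit timeouts is pyodbc's fast_executemany, combined with passing the engine so pandas manages the connection per chunk. This is a sketch, not a confirmed fix for this case; driver name, server, credentials, DataFrame and blob name below are placeholders mirroring the question's format:

import urllib.parse

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection details (assumptions)
params = urllib.parse.quote_plus(
    r"Driver={ODBC Driver 17 for SQL Server};Server=tcp:myserver,1433;"
    r"Database=newTestdb;Uid=user;Pwd=secret;Encrypt=yes;TrustServerCertificate=no;"
)
engine_azure = create_engine(
    "mssql+pyodbc:///?odbc_connect={}".format(params),
    fast_executemany=True,  # batch the rows instead of one round trip per row
)

processed_df = pd.DataFrame({"col": [1, 2, 3]})  # stand-in for the blob's parsed data
blob_name_file = "ExampleTable"                  # stand-in for the trigger's blob name

# Passing the engine (rather than a held-open connection) lets pandas open and
# close connections per chunk; chunksize bounds each transaction.
processed_df.to_sql(blob_name_file.lower(), engine_azure,
                    if_exists="append", index=False, chunksize=500)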

SQLAlchemy fetch via pandas not completed when running in Airflow env but works when started manually

I have a function that connects to a MySQL DB and executes a query that takes quite long (approx. 10 min):
import pandas as pd
import sqlalchemy
from sqlalchemy.pool import QueuePool

def foo(connections_string):  # connection_string, something like "mysql://user:key@host/db"
    statement = "SELECT * FROM largtable"
    conn = None
    df = None
    try:
        engine = sqlalchemy.create_engine(
            connections_string,
            connect_args={
                "connect_timeout": 1500,
            },
            poolclass=QueuePool,
            pool_pre_ping=True,
            pool_size=10,
            pool_recycle=3600,
            pool_timeout=900,
        )
        conn = engine.connect()
        df = pd.read_sql_query(statement, conn)
    except Exception:
        raise Exception("could not load data")
    finally:
        if conn:
            conn.close()
    return df
When I run this in my local environment, it works and takes about 600 seconds. When I run it via Airflow, it fails after about 5 to 6 minutes with the error (_mysql_exceptions.OperationalError) (2013, 'Lost connection to MySQL server during query').
I have tried the suggestions on Stack Overflow for adjusting SQLAlchemy's timeouts (e.g., this and this) and those from the SQLAlchemy docs, which led to the additional pool_* and connect_args arguments for the create_engine() function. However, these didn't seem to have any effect at all.
I've also tried replacing sqlalchemy with pymysql, which led to the same error on Airflow. So I didn't try flask-sqlalchemy yet, since I expect the same result.
Since it works in basically the same environment (Python 3.7.x, SQLAlchemy 1.3.3 and pandas 1.3.x) when not run by Airflow, but fails when run by Airflow, I think there is some global variable that overrules my timeout settings. But I have no idea where to start the search.
And some additional info, in case somebody can work with it: I got it running with Airflow twice now during off-hours (5 am and Sundays), but not again since.
PS: unfortunately, pagination as suggested here is not an option, since the query runtime results from transformations and calculations.
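Since the suspicion is some setting overruling the client-side timeouts, one place to look is the server-side limits as seen from the Airflow worker. A small diagnostic sketch (the URL is the question's placeholder; @@max_execution_time assumes MySQL 5.7+):

import pandas as pd
import sqlalchemy

# "Lost connection to MySQL server during query" after a few minutes is often
# a server- or proxy-side limit (net_read_timeout, net_write_timeout,
# max_execution_time, wait_timeout) rather than anything create_engine controls.
engine = sqlalchemy.create_engine("mysql://user:key@host/db")  # placeholder from the question
with engine.connect() as conn:
    print(pd.read_sql_query("SHOW VARIABLES LIKE '%timeout%'", conn))
    print(pd.read_sql_query("SELECT @@max_execution_time AS max_execution_time", conn))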

AWS Lambda Connection To MySQL DB on RDS Times Out Intermittently

To get my feet wet with deploying Lambda functions on AWS, I wrote what I thought was a simple one: calculate the current time and write it to a table in RDS. A separate process triggers the lambda every 5 minutes. For a few hours it will work as expected, but after some time the connection starts to hang. Then, after a few hours, it will magically work again. Error message is
(pymysql.err.OperationalError) (2003, \"Can't connect to MySQL server on '<server info redacted>' (timed out)\")
(Background on this error at: http://sqlalche.me/e/e3q8)
I don't think the issue is with the VPC, or else the lambda wouldn't run at all. I have tried defining the connection outside the lambda handler (as many suggest), inside the handler, and closing the connection after every run; now I have the main code running in a separate helper function that the lambda handler calls. The connection is created, used, closed, and even explicitly deleted within a try/except block. Still, the intermittent connection issue persists. At this point, I am not sure what else to try. Any help would be appreciated; thank you! Code is below:
import pandas as pd
import numpy as np
import sqlalchemy
from sqlalchemy import event as eventSA

# This function is something I added to fix an issue with writing floats
def add_own_encoders(conn, cursor, query, *args):
    cursor.connection.encoders[np.float64] = lambda value, encoders: float(value)

def writeTestRecord(event):
    try:
        connStr = "mysql+pymysql://user:pwd@server.host:3306/db"
        engine = sqlalchemy.create_engine(connStr)
        eventSA.listen(engine, "before_cursor_execute", add_own_encoders)
        conn = engine.connect()
        timeRecorded = pd.Timestamp.now().tz_localize("GMT").tz_convert("US/Eastern").tz_localize(None)
        s = pd.Series([timeRecorded, ])
        s = s.rename("timeRecorded")
        s.to_sql('lambdastest', conn, if_exists='append', index=False,
                 dtype={'timeRecorded': sqlalchemy.types.DateTime()})
        conn.close()
        del conn
        del engine
        return {
            'success': 'true',
            'dateTimeRecorded': timeRecorded.strftime("%Y-%m-%d %H:%M:%S")
        }
    except:
        conn.close()
        del conn
        del engine
        return {
            'success': 'false'
        }

def lambda_handler(event, context):
    toReturn = writeTestRecord(event)
    return toReturn
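A pattern often suggested for Lambda functions talking to RDS is a single module-level engine with stale-connection detection, instead of building and tearing down an engine on every invocation. This is a sketch, not a confirmed fix for this case; the URL mirrors the question's placeholder and the pool_recycle value is an assumption:

import pandas as pd
import sqlalchemy

# Module scope: reused across warm invocations of the same Lambda container.
# pool_pre_ping tests a pooled connection before handing it out, so one that was
# silently dropped while the container sat idle is replaced instead of failing;
# pool_recycle proactively refreshes older connections.
ENGINE = sqlalchemy.create_engine(
    "mysql+pymysql://user:pwd@server.host:3306/db",  # placeholder from the question
    pool_pre_ping=True,
    pool_recycle=280,  # assumption: shorter than typical idle-timeout windows
)

def lambda_handler(event, context):
    time_recorded = pd.Timestamp.now().tz_localize("GMT").tz_convert("US/Eastern").tz_localize(None)
    s = pd.Series([time_recorded], name="timeRecorded")
    with ENGINE.connect() as conn:
        s.to_sql('lambdastest', conn, if_exists='append', index=False,
                 dtype={'timeRecorded': sqlalchemy.types.DateTime()})
    return {'success': 'true',
            'dateTimeRecorded': time_recorded.strftime("%Y-%m-%d %H:%M:%S")}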

Share DB connection in a process pool

I have a Python 3 program that updates a large list of rows based on their ids (in a table in a Postgres 9.5 database).
I use multiprocessing to speed up the process. As Psycopg's connections can’t be shared across processes, I create a connection for each row, then close it.
Overall, multiprocessing is faster than single processing (5 times faster with 8 CPUs). However, creating a connection is slow: I'd like to create just a few connections and keep them open as long as required.
Since .map() chops ids_list into a number of chunks which it submits to the process pool, would it be possible to share a database connection for all ids in the same chunk/process?
Sample code:
from multiprocessing import Pool
import psycopg2

def create_db_connection():
    conn = psycopg2.connect(database=database,
                            user=user,
                            password=password,
                            host=host)
    return conn

def my_function(item_id):
    conn = create_db_connection()
    # Other CPU-intensive operations are done here
    cur = conn.cursor()
    cur.execute("""
        UPDATE table
        SET
            my_column = 1
        WHERE id = %s;
        """,
        (item_id, ))
    cur.close()
    conn.commit()

if __name__ == '__main__':
    ids_list = []  # Long list of ids
    pool = Pool()  # os.cpu_count() processes
    pool.map(my_function, ids_list)
Thanks for any help you can provide.
You can use the initializer parameter of the Pool constructor.
Set up the DB connection in the initializer function, passing the connection credentials as parameters if needed.
Have a look at the docs: https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool
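A minimal sketch of that approach, keeping the question's table and column names (the credentials and example ids are placeholders):

from multiprocessing import Pool
import psycopg2

conn = None  # one connection per worker process

def init_worker(database, user, password, host):
    # Runs once in each worker: the connection opened here is reused for
    # every id that .map() sends to this process.
    global conn
    conn = psycopg2.connect(database=database, user=user,
                            password=password, host=host)

def my_function(item_id):
    with conn.cursor() as cur:
        # "table" / "my_column" as in the question
        cur.execute("UPDATE table SET my_column = 1 WHERE id = %s;", (item_id,))
    conn.commit()

if __name__ == '__main__':
    ids_list = [1, 2, 3]  # example ids
    creds = ("mydb", "myuser", "mypassword", "localhost")  # placeholders
    with Pool(initializer=init_worker, initargs=creds) as pool:
        pool.map(my_function, ids_list)

Since .map() hands each worker a chunk of ids_list, all ids in a chunk share that process's single connection, which is exactly the reuse asked about.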
