How to use parallel_apply on db calls - python

I was using the parallel_apply function from the pandarallel library. The snippet below (Function call) iterates over rows and fetches data from MongoDB. Executing it throws an EOFError and a MongoClient warning, both given below.
Mongo function:
def fetch_final_price(model_name, time, col_name):
    collection = database['col_name']
    price = collection.find({"$and": [{"Model": model_name}, {'time': time}]})
    price = price[0]['price']
    return price
Function call:
final_df['Price'] = df1.parallel_apply(lambda x: fetch_final_price(x['model_name'], x['purchase_date'], collection_name), axis=1)
MongoClient config:
client = pymongo.MongoClient(host=host,username=username,port=port,password=password,tlsCAFile=sslCAFile,retryWrites=False)
Error:
EOFError: Ran out of input
Mongo client warning:
"MongoClient opened before fork. Create MongoClient only "
How do I make DB calls inside parallel_apply?

First of all, the "MongoClient opened before fork" warning also provides a link to the documentation, which explains that under multiprocessing (which pandarallel is based on) you should create the MongoClient inside your function (fetch_final_price); otherwise it is likely to lead to a deadlock:
def fetch_final_price(model_name, time, col_name):
    client = pymongo.MongoClient(
        host=host,
        username=username,
        port=port,
        password=password,
        tlsCAFile=sslCAFile,
        retryWrites=False
    )
    database = client[db_name]  # db_name is assumed to be defined alongside host, username, etc.
    collection = database[col_name]  # use the col_name parameter rather than the literal string 'col_name'
    price = collection.find({"$and": [{"Model": model_name}, {'time': time}]})
    price = price[0]['price']
    return price
The second mistake, which leads to an exception inside the function and the subsequent EOFError, is applying the bracket operator to the result of find, which is a cursor (an iterator), not a list. Consider using find_one if you only need the first match (alternatively, you could call next(price) instead of indexing, but that is not a good way to do it):
def fetch_final_price(model_name, time, col_name):
    client = pymongo.MongoClient(
        host=host,
        username=username,
        port=port,
        password=password,
        tlsCAFile=sslCAFile,
        retryWrites=False
    )
    database = client[db_name]  # db_name is assumed to be defined alongside host, username, etc.
    collection = database[col_name]
    price = collection.find_one({"$and": [{"Model": model_name}, {'time': time}]})
    price = price['price']
    return price
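Opening a new MongoClient for every row adds noticeable overhead. A minimal sketch of one way to reduce it, assuming the same host/username/port/password/sslCAFile settings and a hypothetical db_name, is to create the client lazily once per worker process and reuse it:

import os
import pymongo

_client = None
_client_pid = None

def get_collection(col_name):
    # Re-create the client if we are in a new (forked) worker process,
    # so each pandarallel worker ends up with exactly one MongoClient.
    global _client, _client_pid
    if _client is None or _client_pid != os.getpid():
        _client = pymongo.MongoClient(
            host=host,
            username=username,
            port=port,
            password=password,
            tlsCAFile=sslCAFile,
            retryWrites=False,
        )
        _client_pid = os.getpid()
    return _client[db_name][col_name]  # db_name is a placeholder for your database name

def fetch_final_price(model_name, time, col_name):
    doc = get_collection(col_name).find_one(
        {"$and": [{"Model": model_name}, {"time": time}]}
    )
    return doc["price"]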

Related

How to make the copy command continue its run in redshift even after the lambda function which initiated it has timed out?

I am trying to run a copy command which loads around 100 GB of data from S3 to Redshift. I am using a Lambda function to initiate this copy command every day. This is my current code:
from datetime import datetime, timedelta
import dateutil.tz
import psycopg2
from config import *

def lambda_handler(event, context):
    con = psycopg2.connect(dbname=dbname, user=user, password=password, host=host, port=port)
    cur = con.cursor()
    try:
        query = """BEGIN TRANSACTION;
        COPY """ + table_name + """ FROM '""" + intermediate_path + """' iam_role '""" + iam_role + """' FORMAT AS parquet;
        END TRANSACTION;"""
        print(query)
        cur.execute(query)
    except Exception as e:
        subject = "Error emr copy: {}".format(str(datetime.now().date()))
        body = "Exception occured " + str(e)
        print(body)
    con.close()
This function runs fine, but the only problem is that after the 15-minute timeout of the Lambda function, the copy command also stops executing in Redshift. Therefore, I cannot finish loading my data from S3 to Redshift.
I also tried including the statement_timeout statement below, after the BEGIN statement and before the COPY command. It didn't help.
SET statement_timeout to 18000000;
Can someone suggest how I can solve this issue?
The AWS documentation isn't explicit about what happens when a timeout occurs, but I think it's safe to say that the function transitions into the "Shutdown" phase, at which point the runtime container is forcibly terminated by the environment.
What this means is that the socket connection used by the database connection will be closed, and the Redshift process that is listening to that socket will receive an end-of-file -- a client disconnect. The normal behavior of any database in this situation is to terminate any outstanding queries and rollback their transactions.
The reason that I gave that description is to let you know that you can't extend the life of a query beyond the life of the Lambda that initiates that query. If you want to stick with using a database connection library, you will need to use a service that doesn't timeout: AWS Batch or ECS are two options.
But, there's a better option: the Redshift Data API, which is supported by Boto3.
This API operates asynchronously: you submit a query to Redshift and get back a token that can be used to check the query's status. You can also instruct Redshift to send a message to Amazon EventBridge when the query completes or fails (so you can create another Lambda to take the appropriate action).
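For illustration, a minimal sketch of that submit-then-check pattern with boto3 (the cluster, database, user, table, and role names below are placeholders, not values from the question):

import time
import boto3

client = boto3.client('redshift-data')

# Submit the COPY asynchronously; execute_statement returns immediately with an Id.
resp = client.execute_statement(
    ClusterIdentifier='my-cluster',
    Database='my_db',
    DbUser='my_user',
    Sql="COPY my_table FROM 's3://my-bucket/prefix/' IAM_ROLE 'arn:aws:iam::123456789012:role/MyCopyRole' FORMAT AS PARQUET;",
)
statement_id = resp['Id']

# Either poll for completion like this, or skip polling and rely on EventBridge.
while True:
    desc = client.describe_statement(Id=statement_id)
    if desc['Status'] in ('FINISHED', 'FAILED', 'ABORTED'):
        print(desc['Status'], desc.get('Error', ''))
        break
    time.sleep(10)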
I recommend using the Redshift Data API in Lambda to load data into Redshift from S3.
You can get rid of the psycopg2 package and use the built-in boto3 package in Lambda.
This will run the copy query asynchronously, and the Lambda function won't take more than a few seconds to run it.
I use sentry_sdk to get notifications of runtime errors from Lambda.
import boto3
import sentry_sdk
from sentry_sdk.integrations.aws_lambda import AwsLambdaIntegration

sentry_sdk.init(
    "https://aaaaaa@aaaa.ingest.sentry.io/aaaaaa",
    integrations=[AwsLambdaIntegration(timeout_warning=True)],
    traces_sample_rate=0
)

def execute_redshift_query(sql):
    data_client = boto3.client('redshift-data')
    data_client.execute_statement(
        ClusterIdentifier='redshift-cluster-test',
        Database='db',
        DbUser='db_user',
        Sql=sql,
        StatementName='Test query',
        WithEvent=True,
    )

def handler(event, context):
    query = """
        copy schema.test_table
        from 's3://test-bucket/test.csv'
        IAM_ROLE 'arn:aws:iam::1234567890:role/TestRole'
        region 'us-east-1'
        ignoreheader 1 csv delimiter ','
    """
    execute_redshift_query(query)
    return True
And here is another Lambda function to send an error notification if the copy query fails.
You can add an EventBridge trigger for that Lambda using an appropriate rule (the screenshot of the rule is not reproduced here).
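If you would rather create that rule from code than from the console, a rough boto3 sketch is below; the source and detail-type strings are my assumption of what the Redshift Data API emits when WithEvent=True, so verify them against a real event before relying on this.

import json
import boto3

events = boto3.client('events')

# Assumed event pattern for Redshift Data API statement events --
# check the source/detail-type against an actual event in your account.
events.put_rule(
    Name='redshift-data-statement-status',
    EventPattern=json.dumps({
        "source": ["aws.redshift-data"],
        "detail-type": ["Redshift Data Statement Status Change"],
    }),
)

# Point the rule at the notification Lambda (the ARN is a placeholder);
# you also need lambda add_permission so EventBridge may invoke the function.
events.put_targets(
    Rule='redshift-data-statement-status',
    Targets=[{
        'Id': 'copy-error-notifier',
        'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:copy-error-notifier',
    }],
)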
Here is the Lambda code to send the error notification:
import sentry_sdk
from sentry_sdk.integrations.aws_lambda import AwsLambdaIntegration

sentry_sdk.init(
    "https://aaaa@aaa.ingest.sentry.io/aaaaa",
    integrations=[AwsLambdaIntegration(timeout_warning=True)],
    traces_sample_rate=0
)

def lambda_handler(event, context):
    try:
        if event["detail"]["state"] != "FINISHED":
            raise ValueError(str(event))
    except Exception as e:
        sentry_sdk.capture_exception(e)
    return True
You can identify which copy query failed by using the StatementName defined in the first Lambda function.
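If you also want to look those statements up from code rather than the console, a small sketch (reusing the 'Test query' name from the function above):

import boto3

client = boto3.client('redshift-data')

# List recent statements whose StatementName starts with 'Test query'
# and print their status, e.g. FINISHED or FAILED.
for stmt in client.list_statements(StatementName='Test query')['Statements']:
    print(stmt['Id'], stmt['Status'])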
Hope it is helpful.

AWS Lambda Connection To MySQL DB on RDS Times Out Intermittently

To get my feet wet with deploying Lambda functions on AWS, I wrote what I thought was a simple one: calculate the current time and write it to a table in RDS. A separate process triggers the Lambda every 5 minutes. For a few hours it will work as expected, but after some time the connection starts to hang. Then, after a few hours, it will magically work again. The error message is:
(pymysql.err.OperationalError) (2003, \"Can't connect to MySQL server on '<server info redacted>' (timed out)\")
(Background on this error at: http://sqlalche.me/e/e3q8)
The issue is not with the VPC, or else I don't think the Lambda would run at all. I have tried defining the connection outside the lambda handler (as many suggest), inside the handler, and closing the connection after every run; now I have the main code running in a separate helper function that the lambda handler calls. The connection is created, used, closed, and even explicitly deleted within a try/except block. Still, the intermittent connection issue persists. At this point, I am not sure what else to try. Any help would be appreciated; thank you! Code is below:
import pandas as pd
import numpy as np  # needed for np.float64 below
import sqlalchemy
from sqlalchemy import event as eventSA

# This function is something I added to fix an issue with writing floats
def add_own_encoders(conn, cursor, query, *args):
    cursor.connection.encoders[np.float64] = lambda value, encoders: float(value)

def writeTestRecord(event):
    try:
        connStr = "mysql+pymysql://user:pwd@server.host:3306/db"
        engine = sqlalchemy.create_engine(connStr)
        eventSA.listen(engine, "before_cursor_execute", add_own_encoders)
        conn = engine.connect()
        timeRecorded = pd.Timestamp.now().tz_localize("GMT").tz_convert("US/Eastern").tz_localize(None)
        s = pd.Series([timeRecorded, ])
        s = s.rename("timeRecorded")
        s.to_sql('lambdastest', conn, if_exists='append', index=False, dtype={'timeRecorded': sqlalchemy.types.DateTime()})
        conn.close()
        del conn
        del engine
        return {
            'success': 'true',
            'dateTimeRecorded': timeRecorded.strftime("%Y-%m-%d %H:%M:%S")
        }
    except:
        conn.close()
        del conn
        del engine
        return {
            'success': 'false'
        }

def lambda_handler(event, context):
    toReturn = writeTestRecord(event)
    return toReturn

AWS Lambda Python/Boto3/psycopg2 Redshift temporary credentials

I'm pretty new to AWS, so please let me know if what I'm trying to do is not a good idea, but the basic gist is that I have a Redshift cluster that I want to be able to query from Lambda (Python) using a combination of psycopg2 and boto3. I have assigned the Lambda function a role that allows it to get temporary credentials (get_cluster_credentials) from Redshift. I then use psycopg2 to pass those temporary credentials to create a connection. This works fine when I run it interactively from my Python console locally, but when it runs in Lambda I get the error:
OperationalError: FATAL: password authentication failed for user "IAMA:temp_user_cred:vbpread"
If I use the temporary credentials that Lambda produces directly in a connection statement from my Python console, they actually work (until they expire). I think I'm missing something obvious. My code is:
import boto3
import psycopg2

print('Loading function')

def lambda_handler(event, context):
    client = boto3.client('redshift')
    dbname = 'medsynpuf'
    dbuser = 'temp_user_cred'
    response = client.describe_clusters(ClusterIdentifier=dbname)
    pwresp = client.get_cluster_credentials(DbUser=dbuser, DbName=dbname, ClusterIdentifier=dbname, DurationSeconds=3600, AutoCreate=True, DbGroups=['vbpread'])
    dbpw = pwresp['DbPassword']
    dbusr = pwresp['DbUser']
    endpoint = response['Clusters'][0]['Endpoint']['Address']
    print(dbpw)
    print(dbusr)
    print(endpoint)
    con = psycopg2.connect(dbname=dbname, host=endpoint, port='5439', user=dbusr, password=dbpw)
    cur = con.cursor()
    query1 = open("001_copd_yearly_count.sql", "r")
    cur.execute(query1.read())
    query1_results = cur.fetchall()
    result = query1_results
    return result
I'm using Python 3.6.
Thanks!
Gerry
I was using a Windows compiled version of psycopg2 and needed Linux. Swapped it out for the one here: https://github.com/jkehler/awslambda-psycopg2

Objects created in a thread can only be used in that same thread

I can't find the problem:
@app.route('/register', methods=['GET', 'POST'])
def register():
    form = RegisterForm(request.form)
    if request.method == 'POST' and form.validate():
        name = form.name.data
        email = form.email.data
        username = form.username.data
        password = sha256_crypt.encrypt(str(form.password.data))
        c.execute("INSERT INTO users(name,email,username,password) VALUES(?,?,?,?)", (name, email, username, password))
        conn.commit()
        conn.close()
Error:
File "C:\Users\app.py", line 59, in register c.execute("INSERT INTO
users(name,email,username,password) VALUES(?,?,?,?)", (name, email,
username, password)) ProgrammingError: SQLite objects created in a
thread can only be used in that same thread.The object was created
in thread id 23508 and this is thread id 22640
Does this mean I can't use the name, email, username & password in an HTML file? How do I solve this?
Where you make your connection to the database, add the following:
conn = sqlite3.connect('your.db', check_same_thread=False)
Your cursor 'c' is not created in the same thread; it was probably initialized when the Flask app was run.
You probably want to generate the SQLite objects (the connection and the cursor) in the same method, such as:
@app.route('/')
def dostuff():
    with sql.connect("database.db") as con:
        name = "bob"
        cur = con.cursor()
        cur.execute("INSERT INTO students (name) VALUES (?)", (name,))
        con.commit()
        msg = "Done"
from sqlalchemy import create_engine

engine = create_engine(
    'sqlite:///restaurantmenu.db',
    connect_args={'check_same_thread': False}
)
Works for me
You can try this:
engine=create_engine('sqlite:///data.db', echo=True, connect_args={"check_same_thread": False})
It worked for me
In my case, I had the same issue with two Python files creating a SQLite engine and therefore possibly operating on different threads. Reading the SQLAlchemy documentation, it seems it is better to use the singleton technique in both files:
# maintain the same connection per thread
from sqlalchemy import create_engine
from sqlalchemy.pool import SingletonThreadPool

engine = create_engine('sqlite:///mydb.db', poolclass=SingletonThreadPool)
It does not solve all cases, meaning I occasionally get the same error, but I can easily overcome it by refreshing the browser page. Since I'm only using this to debug my code, this is OK for me. For a more permanent solution, you should probably choose another database, like PostgreSQL.
As mentioned in https://docs.python.org/3/library/sqlite3.html and pointed out by @Snidhi Sofpro in a comment:
By default, check_same_thread is True and only the creating thread may use the connection. If set False, the returned connection may be shared across multiple threads. When using multiple threads with the same connection writing operations should be serialized by the user to avoid data corruption.
One way to achieve serialization:
import threading
import sqlite3
import queue
import traceback
import time
import random

work_queue = queue.Queue()

def sqlite_worker():
    con = sqlite3.connect(':memory:', check_same_thread=False)
    cur = con.cursor()
    cur.execute('''
        CREATE TABLE IF NOT EXISTS test (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            text TEXT,
            source INTEGER,
            seq INTEGER
        )
    ''')
    while True:
        try:
            (sql, params), result_queue = work_queue.get()
            res = cur.execute(sql, params)
            con.commit()
            result_queue.put(res)
        except Exception as e:
            traceback.print_exc()

threading.Thread(target=sqlite_worker, daemon=True).start()

def execute_in_worker(sql, params):
    # you might not really need the results if you only use this
    # for writing unless you use something like https://www.sqlite.org/lang_returning.html
    result_queue = queue.Queue()
    work_queue.put(((sql, params), result_queue))
    return result_queue.get(timeout=5)

def insert_test_data(seq):
    time.sleep(random.randint(0, 100) / 100)
    execute_in_worker(
        'INSERT INTO test (text, source, seq) VALUES (?, ?, ?)',
        ['foo', threading.get_ident(), seq]
    )

threads = []
for i in range(10):
    thread = threading.Thread(target=insert_test_data, args=(i,))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

for res in execute_in_worker('SELECT * FROM test', []):
    print(res)
# (1, 'foo', 139949462500928, 9)
# (2, 'foo', 139949496071744, 5)
# (3, 'foo', 139949479286336, 7)
# (4, 'foo', 139949487679040, 6)
# (5, 'foo', 139949854099008, 3)
# (6, 'foo', 139949470893632, 8)
# (7, 'foo', 139949862491712, 2)
# (8, 'foo', 139949845706304, 4)
# (9, 'foo', 139949879277120, 0)
# (10, 'foo', 139949870884416, 1)
As you can see, the data is inserted out of order but it's still all handled one by one in a while loop.
https://docs.python.org/3/library/queue.html
https://docs.python.org/3/library/threading.html#threading.Thread.join
https://docs.python.org/3/library/threading.html#threading.get_ident
I had the same problem and I fixed it by closing my connection after every call:
results = session.query(something, something).all()
session.close()
The error doesn't lie in the variables passed to your .execute() call, but rather in the object instances that SQLite uses to access the DB.
I assume that you have:
conn = sqlite3.connect('your_database.db')
c = conn.cursor()
somewhere at the top of the Flask script, and this would be initialized when you first run the script. When the register function is called, a new thread, different from the one that ran the initial script, handles the process. Thus, in this new thread you're using object instances that belong to a different thread, which SQLite flags as an error: rightfully so, because this may lead to data corruption if your DB is accessed by different threads while the app runs.
So, as a different approach, instead of disabling SQLite's check_same_thread functionality, you could try initializing your DB connection and cursor within the HTTP methods that are being called.
With this, the SQLite objects & utilization will be on the same thread at runtime.
The code would be redundant, but it might save you in situations where the data is being accessed asynchronously, and it will also prevent data corruption.
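A minimal sketch of that per-request approach in Flask, assuming a users table like the one in the question, uses flask.g so each request (and therefore each handling thread) opens and closes its own connection:

import sqlite3
from flask import Flask, g, request

app = Flask(__name__)
DATABASE = 'your_database.db'

def get_db():
    # Open a connection the first time it is needed in this request;
    # flask.g is request-scoped, so every handling thread gets its own.
    if 'db' not in g:
        g.db = sqlite3.connect(DATABASE)
    return g.db

@app.teardown_appcontext
def close_db(exc):
    db = g.pop('db', None)
    if db is not None:
        db.close()

@app.route('/register', methods=['POST'])
def register():
    db = get_db()
    form = request.form
    db.execute(
        "INSERT INTO users(name, email, username, password) VALUES (?, ?, ?, ?)",
        (form['name'], form['email'], form['username'], form['password']),
    )
    db.commit()
    return "Done"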
I was having this problem and I just used the answer in this post, which I will repost here:
import sqlite3
import sqlalchemy

creator = lambda: sqlite3.connect('file::memory:?cache=shared', uri=True)
engine = sqlalchemy.create_engine('sqlite://', creator=creator)
engine.connect()
This bypasses the problem that you can't pass the string "file::memory:?cache=shared" as a URL to SQLAlchemy. I have seen a lot of answers, but this solved all my problems with using a SQLite in-memory database that is shared among multiple threads. I initialize the database by creating two tables with two threads for speed. Before this, the only way I could do it was with a file-backed DB, but that was giving me latency issues in a cloud deployment.
Create "database.py":
import sqlite3

def dbcon():
    return sqlite3.connect("your.db")
Then import:
from database import dbcon

db = dbcon()
db.execute("INSERT INTO users(name,email,username,password) VALUES(?,?,?,?)", (name, email, username, password))
You probably won't need to close it because the thread will be killed right away.

mySQLdb connection Returns a Truncated Output

I'm trying to connect remotely to a SQL server that runs a stored procedure and returns a huge result set as output.
When I run it locally on the SQL box it's fine and returns ~800,000 rows as expected, but when I try to run it using the MySQLdb library from Python, I receive a truncated output of only ~6,000 rows.
It runs fine for smaller data, so I'm guessing there's some result limit that's coming into play.
I'm sure there's some property that needs to be changed somewhere, but there doesn't seem to be any documentation on the PyPI page about it.
For explanatory purposes, I've included my code below:
import MySQLdb
import pandas as pd
connection = MySQLdb.connect(sql_server,sql_admin,sql_pw,sql_db)
sql_command = """call function(4)"""
return pd.read_sql(sql_command, connection)
I was able to solve this using cursors. The approach I took is shown below and will hopefully help anyone else.
connection = MySQLdb.connect(host=sql_server, user=sql_admin, passwd=sql_pw, db=sql_db)
cursor = connection.cursor()
cursor.execute("""call function(4)""")
data = cursor.fetchall()

frame = []
for row in data:
    frame.append(row)

cursor.close()
# close the connection
connection.close()
return pd.DataFrame(frame)
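If memory rather than a row cap ever becomes the concern with ~800,000 rows, a variation on the same cursor approach (a sketch, using MySQLdb's server-side SSCursor, which streams rows instead of buffering them all client-side) would be:

import MySQLdb
import MySQLdb.cursors
import pandas as pd

connection = MySQLdb.connect(
    host=sql_server, user=sql_admin, passwd=sql_pw, db=sql_db,
    cursorclass=MySQLdb.cursors.SSCursor,  # server-side cursor: rows are streamed
)
cursor = connection.cursor()
cursor.execute("""call function(4)""")

# Pull rows in manageable chunks instead of one huge fetchall().
rows = []
while True:
    chunk = cursor.fetchmany(10000)
    if not chunk:
        break
    rows.extend(chunk)

cursor.close()
connection.close()
frame = pd.DataFrame(rows)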
