AWS Lambda Python/Boto3/psycopg2 Redshift temporary credentials - python

I'm pretty new to AWS, so please let me know if what I'm trying to do is not a good idea. The basic gist is that I have a Redshift cluster that I want to query from Lambda (Python) using a combination of psycopg2 and boto3. I have assigned the Lambda function a role that allows it to get temporary credentials (get_cluster_credentials) from Redshift, and I then pass those temporary credentials to psycopg2 to create a connection. This works fine when I run it interactively from my Python console locally, but in Lambda I get the error:
OperationalError: FATAL: password authentication failed for user "IAMA:temp_user_cred:vbpread"
If I take the temporary credentials that Lambda produces and use them directly in a connection statement from my Python console, they actually work (until they expire). I think I'm missing something obvious. My code is:
import boto3
import psycopg2

print('Loading function')

def lambda_handler(event, context):
    client = boto3.client('redshift')
    dbname = 'medsynpuf'
    dbuser = 'temp_user_cred'
    response = client.describe_clusters(ClusterIdentifier=dbname)
    pwresp = client.get_cluster_credentials(DbUser=dbuser, DbName=dbname, ClusterIdentifier=dbname, DurationSeconds=3600, AutoCreate=True, DbGroups=['vbpread'])
    dbpw = pwresp['DbPassword']
    dbusr = pwresp['DbUser']
    endpoint = response['Clusters'][0]['Endpoint']['Address']
    print(dbpw)
    print(dbusr)
    print(endpoint)
    con = psycopg2.connect(dbname=dbname, host=endpoint, port='5439', user=dbusr, password=dbpw)
    cur = con.cursor()
    query1 = open("001_copd_yearly_count.sql", "r")
    cur.execute(query1.read())
    query1_results = cur.fetchall()
    result = query1_results
    return result
I'm using Python 3.6.
Thanks!
Gerry

I was using a Windows-compiled version of psycopg2 and needed a Linux build for Lambda. I swapped it out for the one here: https://github.com/jkehler/awslambda-psycopg2

Related

How to use python to write to s3 bucket

I have a Postgres database in AWS that I can query just fine using Python and psycopg2. My issue is writing to an S3 bucket; I do not know how to do that. Supposedly you have to use boto3 and AWS Lambda, but I am not familiar with those. I've been trying to find something online that outlines the code, but one link doesn't seem to have asked the question correctly: how do I send query data from postgres in AWS to an s3 bucket using python?. And with the other, I don't understand how the example works: A way to export psql table (or query) directly to AWS S3 as file (csv, json).
Here is what my code looks like at the moment:
import psycopg2
import boto3
import os
import io

# setting values to read
os.environ['AWS_ACCESS_KEY_ID'] = "XXXXXXXXXXXXX"
os.environ['AWS_SECRET_ACCESS_KEY'] = "XXXXXXXXXXXX"
endpoint = "rds_endpoint"
username = 'user_name'
etl_password = 'stored_pass'
database_name = 'db_name'
resource = boto3.resource('s3')
file_name = 'daily_export'
bucket = "my s3 bucket"

copy_query = '''select parent.brand as business_type
    , app.business_name as business_name
    from hdsn_rsp parent
    join apt_ds app
    on parent.id = app.id'''

def handle(event, context):
    try:
        connection = psycopg2.connect(user=username
                                      , password=etl_password
                                      , host=endpoint
                                      , port="5432"
                                      , database=database_name)
        cursor = connection.cursor()
        cursor.execute(copy_query)
        file = io.StringIO()
        #cursor.copy_expert(copy_query, file)
        #resource.Object(bucket, file_name+'.csv').put(Body=file.getvalue())
    except (Exception, psycopg2.Error) as error:
        print("Error connecting to postgres instance", error)
    finally:
        if connection:
            cursor.close()
            connection.close()
        return("query has executed and file in bucket")
If I leave those lines commented out, the code executes just fine (running from my local machine), but when I uncomment them, remove the handler and put it in a function, I get my success message back but nothing lands in my S3 bucket. I thought I had a permissions issue, so I created a new user without permissions to that bucket so it would fail, and it did, so permissions aren't the issue. But I don't understand what is going on with cursor.copy_expert(copy_query, file) and resource.Object(bucket, file_name+'.csv').put(Body=file.getvalue()), because the file is not being placed in my bucket.
I'm new here and new to writing code in AWS, so please be patient with me as I am not entirely sure how to ask the question properly. I know this is a big ask, but I am very confused about what to do. Could someone please point out what corrections I need to make?
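For reference, here is a minimal sketch of the pattern those commented lines seem to be aiming for, assuming the goal is to stream the query result into memory as CSV and then upload it with boto3 (the wrapper function and its parameters are hypothetical; the other names are the ones from the question):

import io
import boto3
import psycopg2

def export_query_to_s3(connection, copy_query, bucket, file_name):
    # copy_expert needs a COPY ... TO STDOUT statement, not a bare SELECT,
    # so the SELECT is wrapped before handing it to psycopg2.
    copy_sql = "COPY ({}) TO STDOUT WITH CSV HEADER".format(copy_query)
    buffer = io.StringIO()
    with connection.cursor() as cursor:
        cursor.copy_expert(copy_sql, buffer)
    # Upload the in-memory CSV; the object key here is just an illustration.
    boto3.resource('s3').Object(bucket, file_name + '.csv').put(Body=buffer.getvalue())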

How to make the copy command continue its run in redshift even after the lambda function which initiated it has timed out?

I am trying to run a copy command which loads around 100 GB of data from S3 to Redshift. I am using a Lambda function to initiate this copy command every day. This is my current code:
from datetime import datetime, timedelta
import dateutil.tz
import psycopg2
from config import *

def lambda_handler(event, context):
    con = psycopg2.connect(dbname=dbname, user=user, password=password, host=host, port=port)
    cur = con.cursor()
    try:
        query = """BEGIN TRANSACTION;
        COPY """ + table_name + """ FROM '""" + intermediate_path + """' iam_role '""" + iam_role + """' FORMAT AS parquet;
        END TRANSACTION;"""
        print(query)
        cur.execute(query)
    except Exception as e:
        subject = "Error emr copy: {}".format(str(datetime.now().date()))
        body = "Exception occured " + str(e)
        print(body)
    con.close()
This function runs fine, but the only problem is that after the 15 minute Lambda timeout, the copy command also stops executing in Redshift. Therefore, I cannot finish my copy load from S3 to Redshift.
I also tried including the statement_timeout statement below after the BEGIN statement and before the COPY command. It didn't help.
SET statement_timeout to 18000000;
Can someone suggest how I can solve this issue?
The AWS documentation isn't explicit about what happens when a timeout occurs, but I think it's safe to say that the function transitions into the "Shutdown" phase, at which point the runtime container is forcibly terminated by the environment.
What this means is that the socket used by the database connection will be closed, and the Redshift process that is listening on that socket will receive an end-of-file, i.e. a client disconnect. The normal behavior of any database in this situation is to terminate any outstanding queries and roll back their transactions.
The reason I gave that description is to let you know that you can't extend the life of a query beyond the life of the Lambda that initiates it. If you want to stick with a database connection library, you will need to use a service that doesn't time out: AWS Batch and ECS are two options.
But there's a better option: the Redshift Data API, which is supported by boto3.
This API operates asynchronously: you submit a query to Redshift and get back a token that can be used to check on the query's status. You can also instruct Redshift to send a message to Amazon EventBridge when the query completes or fails (so you can create another Lambda to take appropriate action).
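As a rough illustration of that asynchronous flow (a sketch only, with a hypothetical cluster identifier, not the poster's setup), the token returned by execute_statement can be polled with describe_statement until the statement finishes or fails:

import time
import boto3

client = boto3.client('redshift-data')

def run_copy_async(sql):
    # Submit the statement; Redshift keeps running it even after the caller returns.
    resp = client.execute_statement(
        ClusterIdentifier='my-cluster',  # hypothetical cluster name
        Database='db',
        DbUser='db_user',
        Sql=sql,
    )
    return resp['Id']

def wait_for_statement(statement_id, poll_seconds=5):
    # Optional polling, as an alternative (or addition) to the EventBridge notification.
    while True:
        desc = client.describe_statement(Id=statement_id)
        if desc['Status'] in ('FINISHED', 'FAILED', 'ABORTED'):
            return desc
        time.sleep(poll_seconds)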
I recommend using the Redshift Data API in Lambda to load data into Redshift from S3.
You can get rid of the psycopg2 package and use the built-in boto3 package in Lambda.
This runs the copy query asynchronously, so the Lambda function won't take more than a few seconds to run it.
I use sentry_sdk to get notifications of runtime errors from Lambda.
import boto3
import sentry_sdk
from sentry_sdk.integrations.aws_lambda import AwsLambdaIntegration

sentry_sdk.init(
    "https://aaaaaa#aaaa.ingest.sentry.io/aaaaaa",
    integrations=[AwsLambdaIntegration(timeout_warning=True)],
    traces_sample_rate=0
)

def execute_redshift_query(sql):
    data_client = boto3.client('redshift-data')
    data_client.execute_statement(
        ClusterIdentifier='redshift-cluster-test',
        Database='db',
        DbUser='db_user',
        Sql=sql,
        StatementName='Test query',
        WithEvent=True,
    )

def handler(event, context):
    query = """
        copy schema.test_table
        from 's3://test-bucket/test.csv'
        IAM_ROLE 'arn:aws:iam::1234567890:role/TestRole'
        region 'us-east-1'
        ignoreheader 1 csv delimiter ','
    """
    execute_redshift_query(query)
    return True
And here is another Lambda function that sends an error notification if the copy query fails.
You can trigger it with an EventBridge rule that matches the Redshift Data API statement status change event (WithEvent=True above makes Redshift emit that event).
Here is the Lambda code that sends the error notification.
import sentry_sdk
from sentry_sdk.integrations.aws_lambda import AwsLambdaIntegration

sentry_sdk.init(
    "https://aaaa#aaa.ingest.sentry.io/aaaaa",
    integrations=[AwsLambdaIntegration(timeout_warning=True)],
    traces_sample_rate=0
)

def lambda_handler(event, context):
    try:
        if event["detail"]["state"] != "FINISHED":
            raise ValueError(str(event))
    except Exception as e:
        sentry_sdk.capture_exception(e)
    return True
You can identify which copy query failed by using the StatementName defined in the first Lambda function.
Hope it is helpful.

Error in connecting Azure SQL database from Azure Machine Learning Service using python

I am trying to connect to an Azure SQL Database from the Azure Machine Learning service, but I get the error below:
**('IM002', '[IM002] [unixODBC][Driver Manager]Data source name not found and no default driver specified (0) (SQLDriverConnect)')**
Here is the code I have used for the database connection:
import sys
import pyodbc

class DbConnect:
    # This class is used for Azure database connection using pyodbc
    def __init__(self):
        try:
            self.sql_db = pyodbc.connect('SERVER=<servername>;PORT=1433;DATABASE=<databasename>;UID=<username>;PWD=<password>')
            get_name_query = "select name from contacts"
            names = self.sql_db.execute(get_name_query)
            for name in names:
                print(name)
        except Exception as e:
            print("Error in azure sql server database connection : ", e)
            sys.exit()

if __name__ == "__main__":
    class_obj = DbConnect()
Is there any way to solve the above error? Please let me know if there is any way.
I'd consider using azureml.dataprep over pyodbc for this task (the API may change, but this worked last time I tried):
import azureml.dataprep as dprep

ds = dprep.MSSQLDataSource(server_name=<server-name,port>,
                           database_name=<database-name>,
                           user_name=<username>,
                           password=<password>)
You should then be able to collect the result of an SQL query in pandas e.g. via
dataflow = dprep.read_sql(ds, "SELECT top 100 * FROM [dbo].[MYTABLE]")
dataflow.to_pandas_dataframe()
Alternatively, you can create a SQL datastore and create a dataset from that datastore.
Learn how:
https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-create-register-datasets#create-tabulardatasets
Sample code:
from azureml.core import Dataset, Datastore
# create tabular dataset from a SQL database in datastore
sql_datastore = Datastore.get(workspace, 'mssql')
sql_ds = Dataset.Tabular.from_sql_query((sql_datastore, 'SELECT * FROM my_table'))
@AkshayGodase Any particular reason that you want to use pyodbc?
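For what it's worth, the IM002 error in the question usually means no ODBC driver was named in the connection string. A minimal pyodbc sketch, assuming the Microsoft ODBC Driver 17 for SQL Server is installed on the compute (all other values are placeholders):

import pyodbc

# The DRIVER entry is the piece the original connection string was missing.
conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=<servername>,1433;'
    'DATABASE=<databasename>;'
    'UID=<username>;'
    'PWD=<password>'
)
cursor = conn.cursor()
for row in cursor.execute("select name from contacts"):
    print(row)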

How to query a (Postgres) RDS DB through an AWS Jupyter Notebook?

I'm trying to query an RDS (Postgres) database through Python, more specifically from a Jupyter Notebook. Overall, what I've tried so far is:
import boto3

client = boto3.client('rds-data')

response = client.execute_sql(
    awsSecretStoreArn='string',
    database='string',
    dbClusterOrInstanceArn='string',
    schema='string',
    sqlStatements='string'
)
The error I've been receiving is:
BadRequestException: An error occurred (BadRequestException) when calling the ExecuteSql operation: ERROR: invalid cluster id: arn:aws:rds:us-east-1:839600708595:db:zprime
In the end, it was much simpler than I thought, nothing fancy or specific. It was basically a solution I had used before when accessing one of my local DBs: simply import the library for your database type (Postgres, MySQL, etc.) and then connect to it in order to execute queries through Python.
I don't know if it is the best solution, since making queries through Python will probably be slower than running them directly, but it's what works for now.
import psycopg2

conn = psycopg2.connect(database='database_name',
                        user='user',
                        password='password',
                        host='host',
                        port='port')
cur = conn.cursor()
cur.execute('''
    SELECT *
    FROM table;
''')
cur.fetchall()
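Since this runs in a Jupyter notebook anyway, here is a small follow-up sketch, assuming pandas is available in the environment, that pulls the same kind of query straight into a DataFrame over that psycopg2 connection:

import pandas as pd

# Reuse the psycopg2 connection created above to load results into pandas.
df = pd.read_sql_query('SELECT * FROM table;', conn)
df.head()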

"Invalid credentials" error when accessing Redshift from Python

I am trying to write a Python script to access Amazon Redshift to create a table in Redshift and copy data from S3 to the Redshift table.
My code is:
import psycopg2
import os
#import pandas as pd
import requests
requests.packages.urllib3.disable_warnings()

redshift_endpoint = os.getenv("END-point")
redshift_user = os.getenv("user")
redshift_pass = os.getenv("PASSWORD")
port = 5439
dbname = 'DBNAME'

conn = psycopg2.connect(
    host="",
    user='',
    port=5439,
    password='',
    dbname='')
cur = conn.cursor()

aws_key = os.getenv("access_key")  # needed to access S3 Sample Data
aws_secret = os.getenv("secret_key")
#aws_iam_role= os.getenv('iam_role') #tried using this too

base_copy_string = """copy %s from 's3://mypath/%s'.csv
credentials 'aws_access_key_id= %s aws_access_secrect_key= %s'
delimiter '%s';"""  # the base COPY string that we'll be using

#easily generate each table that we'll need to COPY data from
tables = ["employee"]
data_files = ["test"]
delimiters = [","]

#the generated COPY statements we'll be using to load data;
copy_statements = []
for tab, f, delim in zip(tables, data_files, delimiters):
    copy_statements.append(base_copy_string % (tab, f, aws_key, aws_secret, delim)%)

#create Table
cur.execute(""" create table employee(empname varchar(30),empno integer,phoneno integer,email varchar(30))""")

for copy_statement in copy_statements:  # execute each COPY statement
    cur.execute(copy_statement)
conn.commit()

for table in tables + ["employee"]:
    cur.execute("select count(*) from %s;" % (table,))
    print(cur.fetchone())
conn.commit()  # make sure data went through and commit our statements permanently.
When I run this I get an error at cur.execute(copy_statement):
**Error:** error: Invalid credentials. Must be of the format: credentials 'aws_iam_role=...' or 'aws_access_key_id=...;aws_secret_access_key=...[;token=...]'
code: 8001
context:
query: 582
location: aws_credentials_parser.cpp:114
process: padbmaster [pid=18692]
Is there a problem in my code? Or is it an AWS access_key problem?
I even tried using an iam_role, but I get an error:
IAM role cannot assume role even in Redshift
I have a managed IAM role with the S3FullAccess policy attached.
There are some errors in your script.
1) Change base_copy_string as below:
base_copy_string= """copy %s from 's3://mypath/%s.csv' credentials
'aws_access_key_id=%s;aws_secret_access_key=%s' delimiter '%s';""" #
the base COPY string that we'll be using
A ; must be added between the access key id and the secret key inside the credentials string, and the single quotes were also misplaced: the .csv must be inside the quoted S3 path. It is aws_secret_access_key, not aws_access_secrect_key.
check this link for detailed info: http://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-access-permissions.html#copy-usage_notes-iam-permissions
I suggest you use iam-roles instead of credentials.
http://docs.aws.amazon.com/redshift/latest/dg/loading-data-access-permissions.html
2) Change copy_statements.append as below (remove the extra % at the end):
copy_statements.append(base_copy_string % (tab, f, aws_key, aws_secret, delim))
Correct these and try again.
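To illustrate the iam-role suggestion above, here is a sketch of what the COPY template could look like with a role instead of access keys (the role ARN is a placeholder, and the loop no longer needs the key variables):

base_copy_string = """copy %s from 's3://mypath/%s.csv'
iam_role 'arn:aws:iam::<account-id>:role/<redshift-copy-role>'
delimiter '%s';"""

copy_statements = []
for tab, f, delim in zip(tables, data_files, delimiters):
    copy_statements.append(base_copy_string % (tab, f, delim))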
To start with, NEVER, NEVER, NEVER hardcode access keys and secret keys in your code. That rules out your first approach. Now, coming to the right way of implementing things: you are right, an IAM role is the right way to do it. Unfortunately, I can't work out the exact error and use case from your description. As far as I understand, you are trying to run this Python file from your computer (local machine). Hence, you need to attach permissions to your IAM user so it has access to Redshift (and all the other services your code touches). Please correct me if my assumption is wrong.
Just in case you missed it:
Install AWS CLI
Run
aws configure
Put your credentials and region
Hope this helps.
