I am trying to write a Python script that connects to Amazon Redshift, creates a table, and copies data from S3 into that table.
My code is:
import psycopg2
import os
#import pandas as pd
import requests
requests.packages.urllib3.disable_warnings()
redshift_endpoint = os.getenv("END-point")
redshift_user = os.getenv("user")
redshift_pass = os.getenv("PASSWORD")
port = 5439
dbname = 'DBNAME'
conn = psycopg2.connect(
host="",
user='',
port=5439,
password='',
dbname='')
cur = conn.cursor()
aws_key = os.getenv("access_key") # needed to access S3 Sample Data
aws_secret = os.getenv("secret_key")
#aws_iam_role= os.getenv('iam_role') #tried using this too
base_copy_string= """copy %s from 's3://mypath/%s'.csv
credentials 'aws_access_key_id= %s aws_access_secrect_key= %s'
delimiter '%s';""" # the base COPY string that we'll be using
#easily generate each table that we'll need to COPY data from
tables = ["employee"]
data_files = ["test"]
delimiters = [","]
#the generated COPY statements we'll be using to load data;
copy_statements = []
for tab, f, delim in zip(tables, data_files, delimiters):
copy_statements.append(base_copy_string % (tab, f, aws_key, aws_secret, delim)%)
#create Table
cur.execute(""" create table employee(empname varchar(30),empno integer,phoneno integer,email varchar(30))""")
for copy_statement in copy_statements: # execute each COPY statement
cur.execute(copy_statement)
conn.commit()
for table in tables + ["employee"]:
cur.execute("select count(*) from %s;" % (table,))
print(cur.fetchone())
conn.commit() # make sure data went through and commit our statements permanently.
When I run this code I get an error at cur.execute(copy_statement):
**Error:** error: Invalid credentials. Must be of the format: credentials 'aws_iam_role=...' or 'aws_access_key_id=...;aws_secret_access_key=...[;token=...]'
code: 8001
context:
query: 582
location: aws_credentials_parser.cpp:114
process: padbmaster [pid=18692]
Is there a problem in my code, or is it an AWS access_key problem?
I even tried using an iam_role but I get an error:
IAM role cannot assume role even in Redshift
I have attached the S3FullAccess policy to the IAM role.
There are some errors in your script.
1) Change base_copy_string as below:
base_copy_string= """copy %s from 's3://mypath/%s.csv' credentials
'aws_access_key_id=%s;aws_secret_access_key=%s' delimiter '%s';""" #
the base COPY string that we'll be using
A ; must be added between the two credential parts, and there were other formatting issues with the single quotes. Also, the parameter is aws_secret_access_key, not aws_access_secrect_key.
Check this link for detailed info: http://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-access-permissions.html#copy-usage_notes-iam-permissions
I suggest you use IAM roles instead of access keys:
http://docs.aws.amazon.com/redshift/latest/dg/loading-data-access-permissions.html
2) Change copy_statements.append as below (remove the extra % at the end):
copy_statements.append(base_copy_string % (tab, f, aws_key, aws_secret, delim))
Correct these and try again.
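If it helps, here is a minimal sketch with both fixes applied (the S3 path, table names, and environment variable names are the same placeholders used in the question):

import os

# Keys come from the environment, as in the original script.
aws_key = os.getenv("access_key")
aws_secret = os.getenv("secret_key")

# Corrected template: '.csv' inside the quoted S3 path, ';' between the two
# credential parts, and the correct name aws_secret_access_key.
base_copy_string = """copy %s from 's3://mypath/%s.csv'
credentials 'aws_access_key_id=%s;aws_secret_access_key=%s'
delimiter '%s';"""

tables = ["employee"]
data_files = ["test"]
delimiters = [","]

copy_statements = []
for tab, f, delim in zip(tables, data_files, delimiters):
    # Plain %-interpolation only -- no trailing % as in the original script.
    copy_statements.append(base_copy_string % (tab, f, aws_key, aws_secret, delim))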
To start with, NEVER, NEVER, NEVER hardcode access keys and secret keys in your code. So that rules out your first query. Now, coming to the right way of implementing things: you are right, an IAM role is the right way of doing it. Unfortunately, I can't get the exact error and use case from your description. As far as I understand, you are trying to run this Python file from your computer (local machine). Hence, you need to attach permissions to your IAM user to have access to Redshift (and all other services your code is touching). Please correct me if my assumption is wrong.
Just in case you missed it:
1. Install the AWS CLI.
2. Run aws configure.
3. Enter your credentials and region.
Hope this helps.
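If you want to confirm which IAM identity your local script is actually picking up after aws configure, a quick check with boto3's STS client (assuming boto3 is installed):

import boto3

# Prints the account and ARN of whatever identity boto3 resolved
# (the profile from `aws configure`, environment variables, or a role).
sts = boto3.client("sts")
identity = sts.get_caller_identity()
print(identity["Account"], identity["Arn"])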
I continue to get an error: "ProgrammingError: 002003 (42502): SQL compilation error: Object 'Table' does not exist or not authorized." I am using the following code:
con = snowflake.connector.connect(
user = "user.name",
authenticator="externalbrowser",
warehouse = "ware house name",
database = "db name",
schema = "schema name"
)
cur = con.cursor()
sql = "select * from Table"
cur.execute(sql)
df = cur.fetch_pandas_all()
When I execute the code in Jupyter Notebook, the browser window opens and authenticates my creds, but when it gets to the SQL execute line the error is raised and tells me that the table does not exist. When I open up Snowflake in my browser I can see that the table does exist in the correct warehouse, database and schema I have in my code.
Has anyone else ever experienced this? Do I need to authorize my user to be able to access this table via Python and Jupyter Notebook?
It's likely your session doesn't have a role assigned to it (current role).
You can add the role to your list of connection session parameters, e.g. add something like the following:
role = 'RICH_ROLE',
You might want to consider setting a default role for your user.
ALTER USER userNameHere SET DEFAULT_ROLE = 'THE_BEST_ROLE';
docs link: https://docs.snowflake.com/en/sql-reference/sql/alter-user.html
Also, when all else fails, use the fully qualified table name (note this won't help much if the role isn't set):
sql = "select * from databaseName.schemaName.TableName"
I have a Postgres database in AWS that I can query just fine using Python and psycopg2. My issue is writing to an S3 bucket; I do not know how to do that. Supposedly, you have to use boto3 and AWS Lambda, but I am not familiar with those. I've been trying to find something online that outlines the code, but one link doesn't seem to have asked the question correctly: how do I send query data from postgres in AWS to an s3 bucket using python?. And for the other, I don't understand how the example works: A way to export psql table (or query) directly to AWS S3 as file (csv, json).
Here is what my code looks like at the moment:
import psycopg2
import boto3
import os
import io
#setting values to read
os.environ['AWS_ACCESS_KEY_ID'] = "XXXXXXXXXXXXX"
os.environ['AWS_SECRET_ACCESS_KEY'] = "XXXXXXXXXXXX"
endpoint = "rds_endpoint"
username = 'user_name'
etl_password = 'stored_pass'
database_name = 'db_name'
resource = boto3.resource('s3')
file_name = 'daily_export'
bucket = "my s3 bucket"
copy_query = '''select parent.brand as business_type
, app.business_name as business_name
from hdsn_rsp parent
join apt_ds app
on parent.id = app.id'''
def handle(event, context):
    try:
        connection = psycopg2.connect(user=username
                                      , password=etl_password
                                      , host=endpoint
                                      , port="5432"
                                      , database=database_name)
        cursor = connection.cursor()
        cursor.execute(copy_query)
        file = io.StringIO()
        #cursor.copy_expert(copy_query, file)
        #resource.Object(bucket, file_name+'.csv').put(Body=file.getvalue())
    except (Exception, psycopg2.Error) as error:
        print("Error connecting to postgres instance", error)
    finally:
        if connection:
            cursor.close()
            connection.close()
        return("query has executed and file in bucket")
If I leave the commented part in, then my code executes just fine (running from my local machine), but when I uncomment it, remove the handler, and put it in a function, I get my success message back but nothing is in my S3 bucket. I thought I had a permissions issue, so I created a new user and didn't give it permissions to that bucket so it would fail, and it did, so permissions aren't the issue. But I don't get what is going on with cursor.copy_expert(copy_query, file) and resource.Object(bucket, file_name+'.csv').put(Body=file.getvalue()), because it's not placing the file in my bucket.
I'm new here and new to writing code in AWS, so please be patient with me as I am not entirely sure how to ask the question properly. I know this is a big ask, but I am so confused as to what to do. Could someone please assist me with what corrections I need to make?
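For what it's worth, psycopg2's copy_expert expects an actual COPY statement rather than a plain SELECT, so a sketch of the intended flow might look like the following (the endpoint, credentials, bucket, and key names are placeholders, not a verified answer):

import io
import boto3
import psycopg2

connection = psycopg2.connect(user="user_name", password="stored_pass",
                              host="rds_endpoint", port="5432",
                              database="db_name")
cursor = connection.cursor()

# copy_expert needs a COPY command; wrap the SELECT in COPY ... TO STDOUT.
copy_sql = """COPY (select parent.brand as business_type
                         , app.business_name as business_name
                    from hdsn_rsp parent
                    join apt_ds app on parent.id = app.id)
              TO STDOUT WITH CSV HEADER"""

buffer = io.StringIO()
cursor.copy_expert(copy_sql, buffer)

# Upload the in-memory CSV to S3 (bucket and key are placeholders).
boto3.resource("s3").Object("my-s3-bucket", "daily_export.csv").put(Body=buffer.getvalue())

cursor.close()
connection.close()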
I am trying to run a copy command which loads around 100 GB of data from S3 to Redshift. I am using a Lambda function to initiate this copy command every day. This is my current code:
from datetime import datetime, timedelta
import dateutil.tz
import psycopg2
from config import *
def lambda_handler(event, context):
    con = psycopg2.connect(dbname=dbname, user=user, password=password, host=host, port=port)
    cur = con.cursor()
    try:
        query = """BEGIN TRANSACTION;
        COPY """ + table_name + """ FROM '""" + intermediate_path + """' iam_role '""" + iam_role + """' FORMAT AS parquet;
        END TRANSACTION;"""
        print(query)
        cur.execute(query)
    except Exception as e:
        subject = "Error emr copy: {}".format(str(datetime.now().date()))
        body = "Exception occured " + str(e)
        print(body)
    con.close()
This function runs fine, but the only problem is that after the 15-minute timeout of the Lambda function, the copy command also stops executing in Redshift. Therefore, I cannot finish my copy loading from S3 to Redshift.
I also tried to include the statement_timeout statement below after the begin statement and before the copy command. It didn't help.
SET statement_timeout to 18000000;
Can someone suggest how I can solve this issue?
The AWS documentation isn't explicit about what happens when a timeout occurs. But I think it's safe to say that the function transitions into the "Shutdown" phase, at which point the runtime container is forcibly terminated by the environment.
What this means is that the socket connection used by the database connection will be closed, and the Redshift process that is listening to that socket will receive an end-of-file -- a client disconnect. The normal behavior of any database in this situation is to terminate any outstanding queries and rollback their transactions.
The reason that I gave that description is to let you know that you can't extend the life of a query beyond the life of the Lambda that initiates that query. If you want to stick with using a database connection library, you will need to use a service that doesn't timeout: AWS Batch or ECS are two options.
But, there's a better option: the Redshift Data API, which is supported by Boto3.
This API operates asynchronously: you submit a query to Redshift, and get a token that can be used to check the query's operation. You can also instruct Redshift to send a message to AWS Eventbridge when the query completes/fails (so you can create another Lambda to take appropriate action).
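As a rough illustration of that asynchronous flow (the cluster name, database, and SQL below are placeholders, not values from the question):

import time
import boto3

client = boto3.client("redshift-data")

# Submit the COPY; execute_statement returns immediately with a statement Id.
resp = client.execute_statement(
    ClusterIdentifier="my-cluster",   # placeholder
    Database="db",                    # placeholder
    DbUser="db_user",                 # placeholder
    Sql="copy schema.test_table from 's3://bucket/prefix' iam_role 'arn:aws:iam::123456789012:role/TestRole' format as parquet;",
)
statement_id = resp["Id"]

# Poll for completion here, or rely on the EventBridge notification instead.
while True:
    desc = client.describe_statement(Id=statement_id)
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        print(desc["Status"], desc.get("Error", ""))
        break
    time.sleep(10)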
I recommend using the Redshift Data API in Lambda to load data into Redshift from S3.
You can get rid of the psycopg2 package and use the built-in boto3 package in Lambda.
This will run the copy query asynchronously, and the Lambda function won't take more than a few seconds to run it.
I use sentry_sdk to get notifications of runtime errors from Lambda.
import boto3
import sentry_sdk
from sentry_sdk.integrations.aws_lambda import AwsLambdaIntegration
sentry_sdk.init(
"https://aaaaaa#aaaa.ingest.sentry.io/aaaaaa",
integrations=[AwsLambdaIntegration(timeout_warning=True)],
traces_sample_rate=0
)
def execute_redshift_query(sql):
    data_client = boto3.client('redshift-data')
    data_client.execute_statement(
        ClusterIdentifier='redshift-cluster-test',
        Database='db',
        DbUser='db_user',
        Sql=sql,
        StatementName='Test query',
        WithEvent=True,
    )

def handler(event, context):
    query = """
        copy schema.test_table
        from 's3://test-bucket/test.csv'
        IAM_ROLE 'arn:aws:iam::1234567890:role/TestRole'
        region 'us-east-1'
        ignoreheader 1 csv delimiter ','
    """
    execute_redshift_query(query)
    return True
And another Lambda function to send an error notification if the copy query fails.
You can add an EventBridge Lambda trigger using a rule that matches the Redshift Data API statement status-change events.
Here is the Lambda code to send the error notification.
import sentry_sdk
from sentry_sdk.integrations.aws_lambda import AwsLambdaIntegration
sentry_sdk.init(
"https://aaaa#aaa.ingest.sentry.io/aaaaa",
integrations=[AwsLambdaIntegration(timeout_warning=True)],
traces_sample_rate=0
)
def lambda_handler(event, context):
    try:
        if event["detail"]["state"] != "FINISHED":
            raise ValueError(str(event))
    except Exception as e:
        sentry_sdk.capture_exception(e)
    return True
You can identify which copy query failed by using the StatementName defined in the first Lambda function.
Hope this is helpful.
I'm pretty new to AWS, so please let me know if what I'm trying to do is not a good idea, but the basic gist is that I have a Redshift cluster that I want to be able to query from Lambda (Python) using a combination of psycopg2 and boto3. I have assigned the Lambda function a role that allows it to get temporary credentials (get_cluster_credentials) from Redshift. I then use psycopg2 to pass those temporary credentials to create a connection. This works fine when I run it interactively from my Python console locally, but I get the error:
OperationalError: FATAL: password authentication failed for user "IAMA:temp_user_cred:vbpread"
If I use the temporary credentials that Lambda produces directly in a connection statement from my Python console, they actually work (until they expire). I think I'm missing something obvious. My code is:
import boto3
import psycopg2
print('Loading function')
def lambda_handler(event, context):
    client = boto3.client('redshift')
    dbname = 'medsynpuf'
    dbuser = 'temp_user_cred'
    response = client.describe_clusters(ClusterIdentifier=dbname)
    pwresp = client.get_cluster_credentials(DbUser=dbuser, DbName=dbname,
                                            ClusterIdentifier=dbname,
                                            DurationSeconds=3600, AutoCreate=True,
                                            DbGroups=['vbpread'])
    dbpw = pwresp['DbPassword']
    dbusr = pwresp['DbUser']
    endpoint = response['Clusters'][0]['Endpoint']['Address']
    print(dbpw)
    print(dbusr)
    print(endpoint)
    con = psycopg2.connect(dbname=dbname, host=endpoint, port='5439', user=dbusr, password=dbpw)
    cur = con.cursor()
    query1 = open("001_copd_yearly_count.sql", "r")
    cur.execute(query1.read())
    query1_results = cur.fetchall()
    result = query1_results
    return result
I'm using Python 3.6.
Thanks!
Gerry
I was using a Windows-compiled version of psycopg2 and needed a Linux build. I swapped it out for the one here: https://github.com/jkehler/awslambda-psycopg2
I am unable to connect to RDS using a Lambda function via the test example they provide.
This is the code:
import sys
import logging
import rds_config
import pymysql
#rds settings
rds_host = "connection_link"
name = rds_config.db_username
password = rds_config.db_password
db_name = rds_config.db_name
logger = logging.getLogger()
logger.setLevel(logging.INFO)
try:
conn = pymysql.connect(rds_host, user=name, passwd=password, db=db_name, connect_timeout=5)
except:
logger.error("ERROR: Unexpected error: Could not connect to MySql instance.")
sys.exit()
logger.info("SUCCESS: Connection to RDS mysql instance succeeded")
def handler(event, context):
"""
This function fetches content from mysql RDS instance
"""
item_count = 0
with conn.cursor() as cur:
cur.execute("create table Employee3 ( EmpID int NOT NULL, Name varchar(255) NOT NULL, PRIMARY KEY (EmpID))")
cur.execute('insert into Employee3 (EmpID, Name) values(1, "Joe")')
cur.execute('insert into Employee3 (EmpID, Name) values(2, "Bob")')
cur.execute('insert into Employee3 (EmpID, Name) values(3, "Mary")')
conn.commit()
cur.execute("select * from Employee3")
for row in cur:
item_count += 1
logger.info(row)
#print(row)
return "Added %d items from RDS MySQL table" %(item_count)
This is the structure of my deployment package
app/pymysql/...
app/app.py
app/rds_config.py
app/PyMySQL-0.7.11.dist-info/...
I have packaged all the files inside the app folder in a zip file.
This is the error I get:
"errorMessage": "RequestId: 96fb4cd2-79c1-11e7-a2dc-f97407196dbb Process exited before completing request"
I have already checked my RDS connection in MySQL Workbench; it's working fine.
Update:
Let's assume that your actual Python code is indented correctly, unlike the code posted above.
For some reason, your function cannot connect to your database. And instead of returning an error to the user, you basically told it to sys.exit(1), so that's the reason why Lambda says "Process exited before completing the request".
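A minimal sketch of what the Update means in practice: connect inside the handler (rather than at module level, as in the question) and raise the error instead of calling sys.exit, so the invocation fails with a normal error message. The rds_config field names follow the question; the host value is a placeholder.

import logging
import pymysql
import rds_config

rds_host = "connection_link"  # placeholder endpoint, as in the question

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    try:
        conn = pymysql.connect(host=rds_host, user=rds_config.db_username,
                               passwd=rds_config.db_password,
                               db=rds_config.db_name, connect_timeout=5)
    except pymysql.MySQLError as e:
        # Raising (instead of sys.exit) lets Lambda report the real failure
        # rather than "Process exited before completing request".
        logger.error("ERROR: Could not connect to MySQL instance: %s", e)
        raise

    with conn.cursor() as cur:
        cur.execute("select * from Employee3")
        return "Fetched %d rows from RDS MySQL table" % len(cur.fetchall())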
-- Original Answer --
That does not look like an AWS lambda handler.
You're supposed to write a function handler that accepts the lambda event and context as arguments.
Please read more about it from the AWS Lambda documentation.
As #MarkB mentioned in the comments, For connectivity you need to set VPC, Subnets and Security Group in your Lambda Function same as your RDS instance:
and also you need to check Protocol, Port and Source for Inbound and Outbound in security group to make sure it is open for your IP and port range.
I just ran into the same problem, and it turns out it is because I did not specify a name for the database on creation. Easy Create doesn't give you this option, so you will have to go with Standard Create, where one can specify the name under Additional Configuration. This name is what you should specify for db_name in rds_config.py.
Alternatively, you can connect with your tool of choice without a database name, perform a CREATE DATABASE xxx; where xxx is the name of the database, and then you can use that database going forward.
Source: https://serverfault.com/a/996423
This one is also relevant: Why don't I have access to the database from aws lambda but have from a local computer with the same login data?
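If you take the second route, a minimal sketch with pymysql (the endpoint, credentials, and database name are placeholders):

import pymysql

# Connect without selecting a database, create one, then use it going forward.
conn = pymysql.connect(host="your-rds-endpoint", user="admin",
                       passwd="your-password", connect_timeout=5)
with conn.cursor() as cur:
    cur.execute("CREATE DATABASE mydb;")  # 'mydb' is whatever name you choose
conn.commit()
conn.close()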
The problem is with your zip file.
I'm also using the same Lambda function.
Please follow the steps below:
1. Your app.py and rds_config.py files are good.
2. Then download pymysql from https://pypi.python.org/pypi/PyMySQL
3. Extract it.
4. Copy the pymysql folder somewhere. (Note: don't add all the contents inside the PyMySQL-0.7.11 folder; we just need the pymysql folder only.)
5. Then create a zip file with app.py, rds_config.py and the pymysql folder.