How to use Python to write to an S3 bucket

I have a Postgres database in AWS that I can query just fine using Python and psycopg2. My issue is writing to an S3 bucket; I do not know how to do that. Supposedly you have to use boto3 and AWS Lambda, but I am not familiar with that. I've been trying to find something online that outlines the code, but one link doesn't seem to have asked the question correctly: how do I send query data from postgres in AWS to an s3 bucket using python?. And for the other, I don't understand how the example works: A way to export psql table (or query) directly to AWS S3 as file (csv, json).
Here is what my code looks like at the moment:
import psycopg2
import boto3
import os
import io

# setting values to read
os.environ['AWS_ACCESS_KEY_ID'] = "XXXXXXXXXXXXX"
os.environ['AWS_SECRET_ACCESS_KEY'] = "XXXXXXXXXXXX"

endpoint = "rds_endpoint"
username = 'user_name'
etl_password = 'stored_pass'
database_name = 'db_name'

resource = boto3.resource('s3')
file_name = 'daily_export'
bucket = "my s3 bucket"

copy_query = '''select parent.brand as business_type
                     , app.business_name as business_name
                from hdsn_rsp parent
                join apt_ds app
                  on parent.id = app.id'''

def handle(event, context):
    try:
        connection = psycopg2.connect(user=username
                                      , password=etl_password
                                      , host=endpoint
                                      , port="5432"
                                      , database=database_name)
        cursor = connection.cursor()
        cursor.execute(copy_query)
        file = io.StringIO()
        #cursor.copy_expert(copy_query, file)
        #resource.Object(bucket, file_name + '.csv').put(Body=file.getvalue())
    except (Exception, psycopg2.Error) as error:
        print("Error connecting to postgres instance", error)
    finally:
        if connection:
            cursor.close()
            connection.close()
    return("query has executed and file in bucket")
If I leave those lines commented out, my code executes just fine (running from my local machine), but when I uncomment them, remove the handler, and put it in a function, I get my success message back but nothing shows up in my S3 bucket. I thought I had a permissions issue, so I created a new user without permissions to that bucket so it would fail, and it did, so permissions aren't the problem. But I don't get what is going on with cursor.copy_expert(copy_query, file) and resource.Object(bucket, file_name + '.csv').put(Body=file.getvalue()), because no file ever lands in my bucket.
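From the second link, I think the intended pattern is to wrap the SELECT in a COPY ... TO STDOUT statement, hand that to copy_expert so it fills the buffer, and only then upload the buffer; here is a minimal sketch of my understanding (connection details and the bucket/key names are placeholders, and I have not confirmed this exact version end to end):

import io
import boto3
import psycopg2

# Minimal sketch: COPY (query) TO STDOUT writes CSV into the in-memory buffer,
# which is then uploaded to S3. All connection and bucket details are placeholders.
copy_sql = """COPY (select parent.brand as business_type
                         , app.business_name as business_name
                    from hdsn_rsp parent
                    join apt_ds app on parent.id = app.id)
              TO STDOUT WITH CSV HEADER"""

def export_to_s3():
    connection = psycopg2.connect(user='user_name', password='stored_pass',
                                  host='rds_endpoint', port='5432',
                                  database='db_name')
    try:
        cursor = connection.cursor()
        buffer = io.StringIO()
        cursor.copy_expert(copy_sql, buffer)   # copy_expert expects a COPY statement, not a plain SELECT
        s3 = boto3.resource('s3')
        s3.Object('my-s3-bucket', 'daily_export.csv').put(Body=buffer.getvalue())
    finally:
        connection.close()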
I'm new here and new to writing code in aws so please be patient with me as I am not entirely sure how to ask the question properly. I know this is a big ask, but I am so confused as to what to do. Could someone please assist me on what corrections I need to make?

Related

How to make the copy command continue its run in redshift even after the lambda function which initiated it has timed out?

I am trying to run a COPY command which loads around 100 GB of data from S3 to Redshift. I am using a Lambda function to initiate this COPY command every day. This is my current code:
from datetime import datetime, timedelta
import dateutil.tz
import psycopg2
from config import *

def lambda_handler(event, context):
    con = psycopg2.connect(dbname=dbname, user=user, password=password, host=host, port=port)
    cur = con.cursor()
    try:
        query = """BEGIN TRANSACTION;
        COPY """ + table_name + """ FROM '""" + intermediate_path + """' iam_role '""" + iam_role + """' FORMAT AS parquet;
        END TRANSACTION;"""
        print(query)
        cur.execute(query)
    except Exception as e:
        subject = "Error emr copy: {}".format(str(datetime.now().date()))
        body = "Exception occured " + str(e)
        print(body)
    con.close()
This function runs fine, but the only problem is that after the 15-minute timeout of the Lambda function, the COPY command also stops executing in Redshift. Therefore, I cannot finish my copy loading from S3 to Redshift.
I also tried to include the statement_timeout statement below after the begin statement and before the copy command. It didn't help.
SET statement_timeout to 18000000;
Can someone suggest how I can solve this issue?
The AWS documentation isn't explicit about what happens when the timeout occurs, but I think it's safe to say that the function transitions into the "Shutdown" phase, at which point the runtime container is forcibly terminated by the environment.
What this means is that the socket connection used by the database connection will be closed, and the Redshift process that is listening to that socket will receive an end-of-file -- a client disconnect. The normal behavior of any database in this situation is to terminate any outstanding queries and roll back their transactions.
The reason that I gave that description is to let you know that you can't extend the life of a query beyond the life of the Lambda that initiates that query. If you want to stick with using a database connection library, you will need to use a service that doesn't timeout: AWS Batch or ECS are two options.
But, there's a better option: the Redshift Data API, which is supported by Boto3.
This API operates asynchronously: you submit a query to Redshift and get back a token (statement ID) that you can use to check on the query's progress. You can also instruct Redshift to send a message to Amazon EventBridge when the query completes or fails (so you can create another Lambda to take appropriate action).
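A minimal sketch of that flow with boto3 (the cluster, database, user, and query are placeholders; in a Lambda you would rely on the EventBridge notification rather than this polling loop, which is only here to show the token-based status check):

import time
import boto3

# Submit the COPY asynchronously, then poll the statement status using the
# returned Id. Cluster, database, user, and the query itself are placeholders.
client = boto3.client('redshift-data')

resp = client.execute_statement(
    ClusterIdentifier='my-cluster',
    Database='my_db',
    DbUser='my_user',
    Sql="copy my_schema.my_table from 's3://my-bucket/prefix/' "
        "iam_role 'arn:aws:iam::123456789012:role/MyRole' format as parquet;",
)

statement_id = resp['Id']
while True:
    desc = client.describe_statement(Id=statement_id)
    if desc['Status'] in ('FINISHED', 'FAILED', 'ABORTED'):
        break
    time.sleep(10)

print(desc['Status'], desc.get('Error', ''))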
I recommend using the Redshift Data API in Lambda to load data into Redshift from S3.
You can get rid of the psycopg2 package and use the built-in boto3 package in Lambda.
This will run the COPY query asynchronously, so the Lambda function won't take more than a few seconds to run.
I use sentry_sdk to get notifications of runtime errors from Lambda.
import boto3
import sentry_sdk
from sentry_sdk.integrations.aws_lambda import AwsLambdaIntegration

sentry_sdk.init(
    "https://aaaaaa#aaaa.ingest.sentry.io/aaaaaa",
    integrations=[AwsLambdaIntegration(timeout_warning=True)],
    traces_sample_rate=0
)

def execute_redshift_query(sql):
    data_client = boto3.client('redshift-data')
    data_client.execute_statement(
        ClusterIdentifier='redshift-cluster-test',
        Database='db',
        DbUser='db_user',
        Sql=sql,
        StatementName='Test query',
        WithEvent=True,
    )

def handler(event, context):
    query = """
        copy schema.test_table
        from 's3://test-bucket/test.csv'
        IAM_ROLE 'arn:aws:iam::1234567890:role/TestRole'
        region 'us-east-1'
        ignoreheader 1 csv delimiter ','
    """
    execute_redshift_query(query)
    return True
And another Lambda function sends an error notification if the COPY query fails.
You can add an EventBridge trigger for that Lambda using a rule that matches the Redshift Data API statement status-change events; a boto3 sketch of creating such a rule is below.
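If you want to create that rule with boto3 instead of the console, a sketch might look like this (the event-pattern strings, rule name, and ARNs are my assumptions; verify them against the events you actually receive in EventBridge):

import json
import boto3

# Sketch: wire the notification Lambda to Redshift Data API status-change events.
# All names and ARNs below are placeholders. You also need to grant EventBridge
# permission to invoke the function (e.g. with the Lambda add_permission API).
events = boto3.client('events')

events.put_rule(
    Name='redshift-data-statement-status',
    EventPattern=json.dumps({
        "source": ["aws.redshift-data"],
        "detail-type": ["Redshift Data Statement Status Change"],
    }),
    State='ENABLED',
)

events.put_targets(
    Rule='redshift-data-statement-status',
    Targets=[{
        'Id': 'notify-lambda',
        'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:copy-error-notifier',
    }],
)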
Here is the Lambda code that sends the error notification:
import sentry_sdk
from sentry_sdk.integrations.aws_lambda import AwsLambdaIntegration

sentry_sdk.init(
    "https://aaaa#aaa.ingest.sentry.io/aaaaa",
    integrations=[AwsLambdaIntegration(timeout_warning=True)],
    traces_sample_rate=0
)

def lambda_handler(event, context):
    try:
        if event["detail"]["state"] != "FINISHED":
            raise ValueError(str(event))
    except Exception as e:
        sentry_sdk.capture_exception(e)
    return True
You can identify which COPY query failed by using the StatementName defined in the first Lambda function.
Hope it is helpful.

Import MongoDB output result into S3 bucket using Python

MongoDB Database Name :- testdb,
Collection Name :- test_collection
MongoDB command that I want to execute :-
db.getCollection('test_collection').find({ request_time: { $gte: new Date('2018-06-22'), $lt: new Date('2018-06-26') }});
In the documents of test_collection, there is a key called request_time. I want to fetch the documents in the time range ('2018-06-22') and ('2018-06-26')
MongoDB username :- user
MongoDB Password :- password
MongoDB is running on port 27017.
I need help with two things. I can connect to the database, but how do I provide the username and password for authentication? This is my Python code:
from pymongo import Connection

connection = Connection()
connection = Connection('localhost', 27017)
db = connection.testdb
collection = db.testcollection
for post in collection.find():
    print post
The other thing is,
I have an S3 bucket called mongodoc. I want to run that Mongo query and import the resulting documents into the S3 bucket.
I can connect to the S3 bucket using a library called Boto:
from boto.s3.connection import S3Connection
conn = S3Connection(AWS_KEY, AWS_SECRET)
bucket = conn.get_bucket(mongodoc)
destination = bucket.new_key()
destination.name = filename
destination.set_contents_from_file(myfile)
destination.make_public()
What is the recommended way to achieve this ?
To authenticate, provide the username and password along with the host when creating the client (with current pymongo, use MongoClient; the old Connection class has been removed):
connection = MongoClient('mongodb://user:password@localhost:27017/')
And for the S3 connection, try using boto3 rather than boto; boto3 provides a wide variety of functionality through both its S3 client and resource interfaces. Once you have queried MongoDB, the results can be uploaded to your S3 bucket as files.
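A minimal sketch of that flow (the host, credentials, bucket, and key are placeholders):

import json
from datetime import datetime

import boto3
from pymongo import MongoClient

# Sketch: run the date-range query with authentication, serialize the documents
# to JSON, and upload the result to S3. Connection details and names are placeholders.
client = MongoClient('mongodb://user:password@localhost:27017/')
collection = client.testdb.test_collection

docs = list(collection.find({
    'request_time': {'$gte': datetime(2018, 6, 22), '$lt': datetime(2018, 6, 26)}
}))

body = json.dumps(docs, default=str)  # default=str handles ObjectId and datetime values

s3 = boto3.client('s3')
s3.put_object(Bucket='mongodoc', Key='test_collection_2018-06-22_2018-06-26.json', Body=body)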

AWS Lambda Python/Boto3/psycopg2 Redshift temporary credentials

I'm pretty new to AWS, so please let me know if what I'm trying to do is not a good idea, but the basic gist is this: I have a Redshift cluster that I want to query from Lambda (Python) using a combination of psycopg2 and boto3. I have assigned the Lambda function a role that allows it to get temporary credentials (get_cluster_credentials) from Redshift. I then pass those temporary credentials to psycopg2 to create a connection. This works fine when I run it interactively from my Python console locally, but in Lambda I get the error:
OperationalError: FATAL: password authentication failed for user "IAMA:temp_user_cred:vbpread"
If I use the temporary credentials that Lambda produces directly in a connection statement from my Python console, they actually work (until they expire). I think I'm missing something obvious. My code is:
import boto3
import psycopg2

print('Loading function')

def lambda_handler(event, context):
    client = boto3.client('redshift')
    dbname = 'medsynpuf'
    dbuser = 'temp_user_cred'
    response = client.describe_clusters(ClusterIdentifier=dbname)
    pwresp = client.get_cluster_credentials(DbUser=dbuser, DbName=dbname, ClusterIdentifier=dbname,
                                            DurationSeconds=3600, AutoCreate=True, DbGroups=['vbpread'])
    dbpw = pwresp['DbPassword']
    dbusr = pwresp['DbUser']
    endpoint = response['Clusters'][0]['Endpoint']['Address']
    print(dbpw)
    print(dbusr)
    print(endpoint)
    con = psycopg2.connect(dbname=dbname, host=endpoint, port='5439', user=dbusr, password=dbpw)
    cur = con.cursor()
    query1 = open("001_copd_yearly_count.sql", "r")
    cur.execute(query1.read())
    query1_results = cur.fetchall()
    result = query1_results
    return result
I'm using Python 3.6.
Thanks!
Gerry
I was using a Windows-compiled version of psycopg2 and needed a Linux build for Lambda. Swapped it out for the one here: https://github.com/jkehler/awslambda-psycopg2

"Invalid credentials" error when accessing Redshift from Python

I am trying to write a Python script to access Amazon Redshift to create a table in Redshift and copy data from S3 to the Redshift table.
My code is:
import psycopg2
import os
#import pandas as pd
import requests
requests.packages.urllib3.disable_warnings()

redshift_endpoint = os.getenv("END-point")
redshift_user = os.getenv("user")
redshift_pass = os.getenv("PASSWORD")
port = 5439
dbname = 'DBNAME'

conn = psycopg2.connect(
    host="",
    user='',
    port=5439,
    password='',
    dbname='')
cur = conn.cursor()

aws_key = os.getenv("access_key")  # needed to access S3 Sample Data
aws_secret = os.getenv("secret_key")
#aws_iam_role = os.getenv('iam_role')  # tried using this too

base_copy_string = """copy %s from 's3://mypath/%s'.csv
    credentials 'aws_access_key_id= %s aws_access_secrect_key= %s'
    delimiter '%s';"""  # the base COPY string that we'll be using

# easily generate each table that we'll need to COPY data from
tables = ["employee"]
data_files = ["test"]
delimiters = [","]

# the generated COPY statements we'll be using to load data;
copy_statements = []
for tab, f, delim in zip(tables, data_files, delimiters):
    copy_statements.append(base_copy_string % (tab, f, aws_key, aws_secret, delim)%)

# create table
cur.execute("""create table employee(empname varchar(30), empno integer, phoneno integer, email varchar(30))""")

for copy_statement in copy_statements:  # execute each COPY statement
    cur.execute(copy_statement)
conn.commit()

for table in tables + ["employee"]:
    cur.execute("select count(*) from %s;" % (table,))
    print(cur.fetchone())
conn.commit()  # make sure data went through and commit our statements permanently.
When I run this, I get an error at cur.execute(copy_statement):
Error: error: Invalid credentials. Must be of the format: credentials 'aws_iam_role=...' or 'aws_access_key_id=...;aws_secret_access_key=...[;token=...]'
code: 8001
context:
query: 582
location: aws_credentials_parser.cpp:114
process: padbmaster [pid=18692]
Is there a problem in my code, or is it an AWS access_key problem?
I even tried using an iam_role, but I get an error:
IAM role cannot assume role even in Redshift
I have granted the IAM role permissions by attaching the S3FullAccess policy.
There are some errors in your script.
1) Change base_copy_string as below:
base_copy_string= """copy %s from 's3://mypath/%s.csv' credentials
'aws_access_key_id=%s;aws_secret_access_key=%s' delimiter '%s';""" #
the base COPY string that we'll be using
A semicolon must separate the two keys inside credentials, the closing single quote has to come after the full .csv path, and the parameter is aws_secret_access_key, not aws_access_secrect_key.
Check this link for detailed info: http://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-access-permissions.html#copy-usage_notes-iam-permissions
I suggest you use IAM roles instead of access keys (see the sketch after these corrections):
http://docs.aws.amazon.com/redshift/latest/dg/loading-data-access-permissions.html
2) Change copy_statements.append as below (remove the extra % at the end):
copy_statements.append(base_copy_string % (tab, f, aws_key, aws_secret, delim))
Correct these and try again.
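If you go the IAM-role route from 1), the credentials clause is replaced by iam_role; a sketch with a placeholder role ARN (tables, data_files, and delimiters as defined in your script):

# Sketch of the same COPY template using an IAM role attached to the cluster
# instead of access keys; the role ARN is a placeholder.
base_copy_string = """copy %s from 's3://mypath/%s.csv'
    iam_role 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    delimiter '%s';"""

copy_statements = []
for tab, f, delim in zip(tables, data_files, delimiters):
    copy_statements.append(base_copy_string % (tab, f, delim))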
To start with: NEVER, NEVER, NEVER hardcode access keys and secret keys in your code. So that rules out your first approach. Now, coming to the right way of implementing things: you are right, an IAM role is the right way to do it. Unfortunately, I can't work out the exact error and use case from your description. As far as I understand, you are trying to run this Python file from your computer (local machine). Hence, you need to attach permissions to your IAM user so it has access to Redshift (and all the other services your code touches). Please correct me if my assumption is wrong.
Just in case you missed it:
Install the AWS CLI
Run aws configure
Enter your credentials and region
Hope this helps.

AWS Lambda function to connect to RDS error

I am unable to connect to RDS from a Lambda function using the test example they provide.
This is the code:
import sys
import logging
import rds_config
import pymysql

#rds settings
rds_host = "connection_link"
name = rds_config.db_username
password = rds_config.db_password
db_name = rds_config.db_name

logger = logging.getLogger()
logger.setLevel(logging.INFO)

try:
    conn = pymysql.connect(rds_host, user=name, passwd=password, db=db_name, connect_timeout=5)
except:
    logger.error("ERROR: Unexpected error: Could not connect to MySql instance.")
    sys.exit()

logger.info("SUCCESS: Connection to RDS mysql instance succeeded")

def handler(event, context):
    """
    This function fetches content from mysql RDS instance
    """
    item_count = 0
    with conn.cursor() as cur:
        cur.execute("create table Employee3 ( EmpID int NOT NULL, Name varchar(255) NOT NULL, PRIMARY KEY (EmpID))")
        cur.execute('insert into Employee3 (EmpID, Name) values(1, "Joe")')
        cur.execute('insert into Employee3 (EmpID, Name) values(2, "Bob")')
        cur.execute('insert into Employee3 (EmpID, Name) values(3, "Mary")')
        conn.commit()
        cur.execute("select * from Employee3")
        for row in cur:
            item_count += 1
            logger.info(row)
            #print(row)
    return "Added %d items from RDS MySQL table" % (item_count)
This is the structure of my deployment package
app/pymysql/...
app/app.py
app/rds_config.py
app/PyMySQL-0.7.11.dist-info/...
I have packaged all the files inside the app folder in a zip file.
This is the error I get:
"errorMessage": "RequestId: 96fb4cd2-79c1-11e7-a2dc-f97407196dbb Process exited before completing request"
I have already checked my RDS connection in MySQL Workbench and it's working fine.
Update:
Let's assume that your actual Python code is indented correctly (unlike the code as originally posted).
For some reason, your function cannot connect to your database, and instead of returning an error to the caller, you basically told it to sys.exit(1), which is why Lambda says "Process exited before completing the request".
-- Original Answer --
That does not look like an AWS lambda handler.
You're supposed to write a function handler that accepts the lambda event and context as arguments.
Please read more about it from the AWS Lambda documentation.
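Putting both parts of this answer together, a minimal handler sketch (connection details are placeholders) that accepts event and context and lets a connection failure propagate instead of calling sys.exit():

import logging
import pymysql

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    # Let the exception propagate so Lambda reports the real failure instead of
    # "Process exited before completing request". Connection details are placeholders.
    try:
        conn = pymysql.connect(host='your-rds-endpoint', user='user',
                               passwd='password', db='db_name', connect_timeout=5)
    except pymysql.MySQLError as e:
        logger.error("Could not connect to the MySQL instance: %s", e)
        raise
    with conn.cursor() as cur:
        cur.execute("select 1")
        return cur.fetchone()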
As @MarkB mentioned in the comments, for connectivity you need to set the VPC, subnets, and security group of your Lambda function to match your RDS instance, and you also need to check the protocol, port, and source of the inbound and outbound rules in the security group to make sure they are open for your IP and port range.
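If you prefer to script that VPC change rather than do it in the console, a boto3 sketch (the function name, subnet IDs, and security-group ID are placeholders):

import boto3

# Sketch: put the Lambda function in the same VPC/subnets/security group as the
# RDS instance. All identifiers below are placeholders.
lambda_client = boto3.client('lambda')

lambda_client.update_function_configuration(
    FunctionName='my-rds-function',
    VpcConfig={
        'SubnetIds': ['subnet-0123456789abcdef0', 'subnet-0fedcba9876543210'],
        'SecurityGroupIds': ['sg-0123456789abcdef0'],
    },
)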
I just ran into the same problem, and it turns out it is because I did not specify a name for the database on creation. Easy Create doesn't give you this option, so you will have to go with Standard Create, where one can specify the name under Additional Configuration. This name is what you should specify for db_name in rds_config.py.
Alternatively, you can connect with your tool of choice without a database name, perform a CREATE DATABASE xxx; where xxx is the name of the database, and then you can use that database going forward.
Source: https://serverfault.com/a/996423
This one is also relevant: Why don't I have access to the database from aws lambda but have from a local computer with the same login data?
The problem is with your zip file.
I'm also using the same Lambda function.
Please follow the steps below.
1. Your app.py and rds_config.py files are good.
2. Then download pymysql from https://pypi.python.org/pypi/PyMySQL
3. Extract it.
4. Copy the pymysql folder somewhere. (Note: don't add all the contents of the PyMySQL-0.7.11 folder; we just need the pymysql folder itself.)
5. Then create a zip file containing app.py, rds_config.py and the pymysql folder.
