Import MongoDB output result into S3 bucket using Python

MongoDB database name: testdb
Collection name: test_collection
MongoDB command that I want to execute:
db.getCollection('test_collection').find({ request_time: { $gte: new Date('2018-06-22'), $lt: new Date('2018-06-26') }});
The documents in test_collection have a key called request_time, and I want to fetch the documents whose request_time falls between '2018-06-22' and '2018-06-26'.
MongoDB username: user
MongoDB password: password
MongoDB is running on port 27017.
I need help with two things. I can connect to the database, but how do I provide a username and password for authentication? This is my Python code:
from pymongo import Connection

connection = Connection('localhost', 27017)
db = connection.testdb
collection = db.test_collection
for post in collection.find():
    print(post)
The other thing is:
I have an S3 bucket called mongodoc. I want to run that Mongo query and upload the resulting documents into the S3 bucket.
I can connect to the S3 bucket using a library called Boto:
from boto.s3.connection import S3Connection

conn = S3Connection(AWS_KEY, AWS_SECRET)
bucket = conn.get_bucket('mongodoc')
destination = bucket.new_key()
destination.name = filename
destination.set_contents_from_file(myfile)
destination.make_public()
What is the recommended way to achieve this ?

For authentication, you provide the username and password along with the host when you create the client. With a current pymongo this is done through MongoClient (the old Connection class has been removed):
connection = MongoClient('localhost', 27017, username='user', password='password', authSource='testdb')
For the S3 side, use boto3 rather than boto. boto3 exposes S3 both as a low-level client and as a higher-level resource. Once you have run the query, the MongoDB results can be serialized and uploaded to your S3 bucket as a file.
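A minimal end-to-end sketch of that flow, assuming pymongo and boto3 are installed and reusing the credentials, database, collection and bucket names from the question (the object key query_results.json is made up for illustration):

import json
from datetime import datetime

import boto3
from pymongo import MongoClient

# Connect with authentication (credentials from the question).
client = MongoClient('localhost', 27017,
                     username='user', password='password',
                     authSource='testdb')
collection = client.testdb.test_collection

# Equivalent of the shell query: request_time in [2018-06-22, 2018-06-26).
docs = list(collection.find({
    'request_time': {'$gte': datetime(2018, 6, 22),
                     '$lt': datetime(2018, 6, 26)}
}))

# Serialize to JSON; default=str takes care of ObjectId and datetime values.
body = json.dumps(docs, default=str)

# Upload the result as a single object to the mongodoc bucket.
s3 = boto3.resource('s3')
s3.Object('mongodoc', 'query_results.json').put(Body=body)

boto3 picks up AWS credentials from the environment or the shared credentials file, so the keys do not need to be hard-coded.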

Related

How to use python to write to s3 bucket

I have a Postgres database in AWS that I can query just fine using Python and psycopg2. My issue is writing to an S3 bucket; I do not know how to do that. Supposedly you have to use boto3 and AWS Lambda, but I am not familiar with that. I've been trying to find something online that outlines the code, but one link doesn't seem to have asked the question correctly: how do I send query data from postgres in AWS to an s3 bucket using python?. And for the other, I don't understand how the example works: A way to export psql table (or query) directly to AWS S3 as file (csv, json).
Here is what my code looks like at the moment:
import psycopg2
import boto3
import os
import io

#setting values to read
os.environ['AWS_ACCESS_KEY_ID'] = "XXXXXXXXXXXXX"
os.environ['AWS_SECRET_ACCESS_KEY'] = "XXXXXXXXXXXX"

endpoint = "rds_endpoint"
username = 'user_name'
etl_password = 'stored_pass'
database_name = 'db_name'

resource = boto3.resource('s3')
file_name = 'daily_export'
bucket = "my s3 bucket"

copy_query = '''select parent.brand as business_type
                     , app.business_name as business_name
                from hdsn_rsp parent
                join apt_ds app
                  on parent.id = app.id'''

def handle(event, context):
    try:
        connection = psycopg2.connect(user=username
                                      , password=etl_password
                                      , host=endpoint
                                      , port="5432"
                                      , database=database_name)
        cursor = connection.cursor()
        cursor.execute(copy_query)
        file = io.StringIO()
        #cursor.copy_expert(copy_query, file)
        #resource.Object(bucket, file_name+'.csv').put(Body=file.getvalue())
    except (Exception, psycopg2.Error) as error:
        print("Error connecting to postgres instance", error)
    finally:
        if connection:
            cursor.close()
            connection.close()
    return("query has executed and file in bucket")
If I leave the commented part in, my code executes just fine (running from my local machine), but when I uncomment it, remove the handler, and put it in a function, I get my success message back but nothing appears in my S3 bucket. I thought I had a permissions issue, so I created a new user without permissions on that bucket so it would fail, and it did, so permissions aren't the issue. But I don't get what is going on with cursor.copy_expert(copy_query, file) and resource.Object(bucket, file_name+'.csv').put(Body=file.getvalue()), because they are not placing the file in my bucket.
I'm new here and new to writing code in AWS, so please be patient with me as I am not entirely sure how to ask the question properly. I know this is a big ask, but I am confused about what to do. Could someone please tell me what corrections I need to make?
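One detail worth knowing with this pattern is that psycopg2's copy_expert expects a COPY statement rather than a plain SELECT, so wrapping the query in COPY (...) TO STDOUT is what actually fills the StringIO buffer. A sketch of how the export step could look, reusing the query, connection and bucket from the question (the object key daily_export.csv is chosen for illustration):

import io

import boto3
import psycopg2

def export_to_s3(connection, bucket_name, key_name):
    # copy_expert needs a COPY command; wrapping the SELECT makes Postgres
    # stream CSV (with a header row) straight into the buffer.
    copy_sql = '''COPY (select parent.brand as business_type
                             , app.business_name as business_name
                        from hdsn_rsp parent
                        join apt_ds app
                          on parent.id = app.id)
                  TO STDOUT WITH CSV HEADER'''
    buffer = io.StringIO()
    with connection.cursor() as cursor:
        cursor.copy_expert(copy_sql, buffer)
    # Upload the CSV text as a single object.
    boto3.resource('s3').Object(bucket_name, key_name).put(Body=buffer.getvalue())

# Usage inside the handler, with the connection from psycopg2.connect(...) above:
# export_to_s3(connection, "my s3 bucket", "daily_export.csv")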

AWS Glue read from RDS database that's in VPC

I have an RDS database that is sitting in a VPC. My ultimate goal is to run a nightly job that takes the data from RDS and stores it in Redshift. I am currently doing this using Glue and Glue connections. I am able to write to RDS/Redshift using connections with the following line:
datasource2 = DynamicFrame.fromDF(dfFinal, glueContext, "scans")
output = glueContext.write_dynamic_frame.from_jdbc_conf(frame = datasource2, catalog_connection = "MPtest", connection_options = {"database" : "app", "dbtable" : "scans"})
Where dfFinal is my final data frame after a bunch of transformations that are not essential to this post. That code works fine, however I would like to modify it so I could read a table from RDS into a data frame.
Since the RDS database is in a VPC, I would like to use the catalog_connection parameter, but the DynamicFrameReader class has no from_jdbc_conf method and thus no obvious way to use my glue connection.
I have seen posts that say you could use a method like this:
url = "jdbc:postgresql://host/dbName"
properties = {
    "user": "user",
    "password": "password"
}
df = spark.read.jdbc(url=url, table="table", properties=properties)
But when I try that it times out because it's not a publicly accessible database. Any suggestions?
You are on the right track with using Glue connections.
Define a Glue connection of type JDBC for your Postgres instance:
Type: JDBC
JDBC URL: jdbc:postgresql://<RDS ip>:<RDS port>/<database_name>
VPC Id: <VPC of RDS instance>
Subnet: <subnet of RDS instance>
Security groups: <security group allowed to connect to RDS>
Edit the Glue job and select the Glue connection so it appears under "Required Connections".
Create the connection options dictionary as:
options = {'url': connection.jdbc_url,
           'user': connection.username,
           'password': connection.password,
           'dbtable': table
           }
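The connection object used above is not defined in the answer; one way to obtain those values at runtime is to read the Glue connection's properties with boto3. A sketch, assuming the connection is named "MPtest" as in the question and stores its credentials in the connection properties:

import boto3

# Look up the Glue connection that is attached to the job.
glue = boto3.client('glue')
props = glue.get_connection(Name='MPtest', HidePassword=False)['Connection']['ConnectionProperties']

options = {'url': props['JDBC_CONNECTION_URL'],
           'user': props['USERNAME'],
           'password': props['PASSWORD'],
           'dbtable': 'scans'  # table name from the question
           }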
Use the options dictionary to create a dynamic frame that reads from the table:
table_ddf = glueContext.create_dynamic_frame.from_options(
    connection_type='postgresql',
    connection_options=options,
    transformation_ctx=transformation_ctx
)

Azure SQL Database Manipulation

Is it possible to create a table in an Azure SQL database using Python? I am pulling a list of things from an API and want to push them to a table in my Azure SQL db, but I cannot find a tutorial or guide on how to do so. Googling for it only led me to tutorials on how to pull data from my db. Thanks
If you are using Azure SQL Database, you could follow the official Azure tutorial that @Mohamed Elrashid provided for you: Azure SQL Database libraries for Python.
Example:
Create a SQL Database resource and restrict access to a range of IP addresses using a firewall rule.
from azure.common.client_factory import get_client_from_cli_profile
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.sql import SqlManagementClient
RESOURCE_GROUP = 'YOUR_RESOURCE_GROUP_NAME'
LOCATION = 'eastus' # example Azure availability zone, should match resource group
SQL_SERVER = 'yourvirtualsqlserver'
SQL_DB = 'YOUR_SQLDB_NAME'
USERNAME = 'YOUR_USERNAME'
PASSWORD = 'YOUR_PASSWORD'
# create resource client
resource_client = get_client_from_cli_profile(ResourceManagementClient)
# create resource group
resource_client.resource_groups.create_or_update(RESOURCE_GROUP, {'location': LOCATION})
sql_client = get_client_from_cli_profile(SqlManagementClient)
# Create a SQL server
server = sql_client.servers.create_or_update(
    RESOURCE_GROUP,
    SQL_SERVER,
    {
        'location': LOCATION,
        'version': '12.0',  # Required for create
        'administrator_login': USERNAME,  # Required for create
        'administrator_login_password': PASSWORD  # Required for create
    }
)
# Create a SQL database in the Basic tier
database = sql_client.databases.create_or_update(
    RESOURCE_GROUP,
    SQL_SERVER,
    SQL_DB,
    {
        'location': LOCATION,
        'collation': 'SQL_Latin1_General_CP1_CI_AS',
        'create_mode': 'default',
        'requested_service_objective_name': 'Basic'
    }
)
# Open access to this server for IPs
firewall_rule = sql_client.firewall_rules.create_or_update(
    RESOURCE_GROUP,
    SQL_SERVER,  # firewall rules are defined at the server level
    "firewall_rule_name_123.123.123.123",
    "123.123.123.123",  # Start ip range
    "167.220.0.235"  # End ip range
)
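The management SDK above only provisions the server and database. To actually create a table and push the rows you pulled from the API, you would typically connect to the database itself, for example with pyodbc. A minimal sketch, assuming the ODBC Driver 17 for SQL Server is installed and reusing the placeholder server, database and credentials from the example (the table and column names are made up for illustration):

import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:yourvirtualsqlserver.database.windows.net,1433;"
    "Database=YOUR_SQLDB_NAME;"
    "Uid=YOUR_USERNAME;Pwd=YOUR_PASSWORD;Encrypt=yes;")
cursor = conn.cursor()

# Create the target table if it does not exist yet.
cursor.execute("""
    IF OBJECT_ID('dbo.api_items', 'U') IS NULL
        CREATE TABLE dbo.api_items (
            id INT PRIMARY KEY,
            name NVARCHAR(200)
        )
""")

# Insert the rows pulled from the API.
items = [(1, 'first thing'), (2, 'second thing')]  # stand-in for the API response
cursor.executemany("INSERT INTO dbo.api_items (id, name) VALUES (?, ?)", items)

conn.commit()
conn.close()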
If you are using Azure Database for MySQL, please refer to this Azure tutorial: Python + Azure Database for MySQL.
Azure Database for MySQL and Python can be used together for data analysis: MySQL as the database engine and Python as the statistical tool. When dealing with large datasets that potentially exceed the memory of your machine, it is recommended to push the data into the database engine, where you can query it in smaller, digestible chunks.
In this article we will learn how to use Python to perform the following task:
Create an Azure Database for MySQL using the Azure Python SDK
Hope this helps.

Use Managed Identity to authenticate Azure App Service to SQL Database

I am trying to connect a Python Flask app running in Azure App Service Web App to an Azure SQL Database.
This works just fine when I use SQL authentication with a username and password.
Now I want to move to using the web app's managed identity.
I have activated the system-assigned managed identity, created a user for it in SQL, and added it to the db_datareader role.
I am connecting with SQLAlchemy using a connection string like this:
params = urllib.parse.quote_plus(os.environ['SQL_CONNECTION_STRING'])
conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine_azure = db.create_engine(conn_str,echo=True)
The connection string is stored as an application setting, and its value is
"Driver={ODBC Driver 17 for SQL Server};Server=tcp:<server>.database.windows.net,1433;Database=<database>;Authentication=ActiveDirectoryMsi;"
I expected this to be all I need to do, but now my app is not starting.
The logs report a timeout when connecting to the database.
How can I fix this?
I know this is quite an old post, but it may help people like me who are looking for a solution.
You could modify the connection string by adding the "Authentication" parameter set to "ActiveDirectoryMsi"; there is no need to call the identity endpoint and pass headers yourself.
(This works with Azure SQL; for other databases such as Postgres you may need to build the token struct instead.)
import pyodbc

pyodbc.connect(
    "Driver=" + driver
    + ";Server=" + server
    + ";PORT=1433;Database=" + database
    + ";Authentication=ActiveDirectoryMsi")
I wrote a quick article for those who are interested in Azure MSI:
https://hedihargam.medium.com/python-sql-database-access-with-managed-identity-from-azure-web-app-functions-14566e5a0f1a
If you want to connect to an Azure SQL database with an Azure managed identity (MSI) from a Python application, you can use pyodbc to implement it.
For example:
Enable the system-assigned identity for your Azure App Service.
Add the MSI as a contained database user in your database:
a. Connect to your SQL database as the Azure AD admin (I use SSMS to do it).
b. Run the following script in your database:
CREATE USER <your app service name> FROM EXTERNAL PROVIDER;
ALTER ROLE db_datareader ADD MEMBER <your app service name>
ALTER ROLE db_datawriter ADD MEMBER <your app service name>
ALTER ROLE db_ddladmin ADD MEMBER <your app service name>
Code
import os
import struct

import pyodbc
import requests

# get an access token from the App Service managed identity endpoint
identity_endpoint = os.environ["IDENTITY_ENDPOINT"]
identity_header = os.environ["IDENTITY_HEADER"]
resource_uri = "https://database.windows.net/"
token_auth_uri = f"{identity_endpoint}?resource={resource_uri}&api-version=2019-08-01"
head_msi = {'X-IDENTITY-HEADER': identity_header}

resp = requests.get(token_auth_uri, headers=head_msi)
access_token = resp.json()['access_token']

# pack the token into the structure the ODBC driver expects
accessToken = bytes(access_token, 'utf-8')
exptoken = b""
for i in accessToken:
    exptoken += bytes({i})
    exptoken += bytes(1)
tokenstruct = struct.pack("=i", len(exptoken)) + exptoken

# 1256 is SQL_COPT_SS_ACCESS_TOKEN
conn = pyodbc.connect("Driver={ODBC Driver 17 for SQL Server};Server=tcp:andyserver.database.windows.net,1433;Database=database2",
                      attrs_before={1256: bytearray(tokenstruct)})
cursor = conn.cursor()
cursor.execute("select @@version")
row = cursor.fetchall()
For more details, please refer to:
https://github.com/AzureAD/azure-activedirectory-library-for-python/wiki/Connect-to-Azure-SQL-Database
https://learn.microsoft.com/en-us/azure/app-service/overview-managed-identity
https://learn.microsoft.com/en-us/azure/sql-database/sql-database-aad-authentication-configure

AWS Lambda Python/Boto3/psycopg2 Redshift temporary credentials

I'm pretty new to AWS so please let me know if what I'm trying to do is not a good idea, but the basic gist of it is that I have a Redshift cluster that I want to be able to query from Lambda (Python) using a combination of psycopg2 and boto3. I have assigned the Lambda function a role that allows it to get temporary credentials (get_cluster_credentials) from Redshift. I then use psycopg2 to pass those temporary credentials to create a connection. This works fine when I run interactively from my Python console locally, but I get the error:
OperationalError: FATAL: password authentication failed for user "IAMA:temp_user_cred:vbpread"
If I use the temporary credentials that Lambda produces directly in a connection statement from my python console they actually work (until expired). I think I'm missing something obvious. My code is:
import boto3
import psycopg2

print('Loading function')

def lambda_handler(event, context):
    client = boto3.client('redshift')
    dbname = 'medsynpuf'
    dbuser = 'temp_user_cred'
    response = client.describe_clusters(ClusterIdentifier=dbname)
    pwresp = client.get_cluster_credentials(DbUser=dbuser,
                                            DbName=dbname,
                                            ClusterIdentifier=dbname,
                                            DurationSeconds=3600,
                                            AutoCreate=True,
                                            DbGroups=['vbpread'])
    dbpw = pwresp['DbPassword']
    dbusr = pwresp['DbUser']
    endpoint = response['Clusters'][0]['Endpoint']['Address']
    print(dbpw)
    print(dbusr)
    print(endpoint)
    con = psycopg2.connect(dbname=dbname, host=endpoint, port='5439', user=dbusr, password=dbpw)
    cur = con.cursor()
    query1 = open("001_copd_yearly_count.sql", "r")
    cur.execute(query1.read())
    query1_results = cur.fetchall()
    result = query1_results
    return result
I'm using Python 3.6.
Thanks!
Gerry
It turned out I was using a Windows-compiled build of psycopg2 and needed a Linux one. I swapped it out for the build here: https://github.com/jkehler/awslambda-psycopg2
