I am writing a Python script that runs a query through Athena, outputs the result to S3, and downloads it to my computer. I am able to run my query through Athena and output the result to S3. The next step, which I can't seem to figure out, is how to download the file to my computer without knowing the key name.
Is there a way to look up the object key within my Python script after Athena writes the output to S3?
What I have completed:
import time

import boto3

client = boto3.client('athena')

# Output location and DB
s3_output = 's3_output_here'
database = 'database_here'

# Function to run Athena query
def run_query(query, database, s3_output):
    while True:
        try:
            response = client.start_query_execution(
                QueryString=query,
                QueryExecutionContext={
                    'Database': database
                },
                ResultConfiguration={
                    'OutputLocation': s3_output,
                }
            )
            return response
        except client.exceptions.TooManyRequestsException as e:
            print('Too many requests, trying again after sleep')
            time.sleep(100)

# Our SQL query
query = """
SELECT *
FROM test
"""

print("Running query to Athena...")
res = run_query(query, database, s3_output)
I understand how to download a file with this code:
try:
    s3.Bucket(BUCKET_NAME).download_file(KEY, 'KEY_HERE')
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise
So how can I read the key name after running my first completed code?
You can get the key using the get_key method provided by the boto library. This is how I download things from S3:
import json

import boto

with open("path/aws-credentials.json") as f:
    data = json.load(f)

conn = boto.connect_s3(data["accessKeyId"], data["secretAccessKey"])
bucket = conn.get_bucket('your_bucket')
file_path = bucket.get_key('path/to/s3/file')
file_path.get_contents_to_filename('path/on/local/computer/filename')
You can hardcode your credentials into the code if you are just testing something out, but if you are planning on putting this into production, it's best to store your credentials externally in something like a json file.
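If you end up sticking with boto3 rather than the older boto library, one option worth noting is that start_query_execution returns a QueryExecutionId, and Athena normally writes the result as <QueryExecutionId>.csv under the configured OutputLocation. A minimal sketch based on the code above; the bucket name and prefix below are illustrative placeholders that should match whatever is inside s3_output:

import time

import boto3

# Illustrative placeholders: these should match the bucket/prefix used in s3_output above.
BUCKET_NAME = 'your-athena-results-bucket'
OUTPUT_PREFIX = 'athena-results/'

athena = boto3.client('athena')
s3 = boto3.resource('s3')

res = run_query(query, database, s3_output)
query_id = res['QueryExecutionId']

# The query runs asynchronously, so wait until it finishes before downloading.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

# Athena typically names the result object <QueryExecutionId>.csv in the output location.
key = OUTPUT_PREFIX + query_id + '.csv'
s3.Bucket(BUCKET_NAME).download_file(key, query_id + '.csv')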
Related
I am working with Athena from within my Python code, using boto3, as follows:
def query_athena(query, output_path):
    client = boto3.client('athena')
    client.start_query_execution(
        ResultConfiguration={'OutputLocation': output_path},
        QueryString=query
    )
As stated in the docs, start_query_execution may raise InternalServerException, InvalidRequestException or TooManyRequestsException. I'd like to treat this as follows:
def query_athena(query, output_path):
    client = boto3.client('athena')
    try:
        client.start_query_execution(
            ResultConfiguration={'OutputLocation': output_path},
            QueryString=query
        )
    except <AthenaException> as e:
        deal with e
where <AthenaException> is one of the three exceptions I mentioned or, better yet, their superclass.
My question is how do I import these exceptions? The docs show them as Athena.Client.exceptions.InternalServerException, but I can't seem to find this Athena.Client in any boto3 module.
I ran into the same confusion, but figured it out. The exceptions listed in the docs aren't internal to boto3, but rather contained in the response when boto3 throws a client error.
My first shot at a solution looks like this. It assumes you've handled s3 output location, a boto3 session, etc already:
import boto3
from botocore.exceptions import ClientError

try:
    client = session.client('athena')
    response = client.start_query_execution(
        QueryString=q,
        QueryExecutionContext={
            'Database': database
        },
        ResultConfiguration={
            'OutputLocation': s3_output,
        }
    )
    filename = response['QueryExecutionId']
    print('Execution ID: ' + response['QueryExecutionId'])
except ClientError as e:
    response = e.response
    code = response['Error']['Code']
    message = response['Error']['Message']
    if code == 'InvalidRequestException':
        print(f'Error in query, {code}:\n{message}')
        raise e
    elif code == 'InternalServerException':
        print(f'AWS {code}:\n{message}')
        raise e
    elif code == 'TooManyRequestsException':
        # Handle a wait, retry, etc. here
        pass
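A smaller point, and it depends on which boto3/botocore version you are on: newer boto3 clients also expose these modeled exceptions directly on the client object, so they can be caught without inspecting the error code strings. A rough sketch of the same handler in that style, reusing the session, q, database and s3_output names from above:

client = session.client('athena')

try:
    response = client.start_query_execution(
        QueryString=q,
        QueryExecutionContext={'Database': database},
        ResultConfiguration={'OutputLocation': s3_output}
    )
except client.exceptions.TooManyRequestsException:
    # Back off and retry here.
    pass
except (client.exceptions.InvalidRequestException,
        client.exceptions.InternalServerException) as e:
    raise e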
SITUATION: I verified a domain name in AWS and set up an S3 bucket that receives emails.
These emails contain a .csv file and are delivered to this bucket on a daily basis. I can verify the presence of the attachment by manually exploring the raw email. No problems.
DESIRED OUTCOME: I want to parse these emails, extract the attached .csv and send the .csv file to a destination S3 bucket (or destination folder within the same S3 bucket) so that I can later process it using a separate Python script.
ISSUE: I have written the Lambda function in Python and the logs show this executes successfully when testing yet no files appear in the destination folder.
There is an ObjectCreated trigger enabled on the source bucket which I believe should activate the function on arrival of a new email but this does not have any effect on the execution of the function.
See Lambda function code below:
import json
import urllib
import boto3
import os
import email
import base64

FILE_MIMETYPE = 'text/csv'

# destination folder
S3_OUTPUT_BUCKETNAME = 's3-bucketname/folder'

print('Loading function')

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # source email bucket
    inBucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.quote(event['Records'][0]['s3']['object']['key'].encode('utf8'))

    try:
        response = s3.get_object(Bucket=inBucket, Key=key)
        msg = email.message_from_string(response['Body'].read().decode('utf-8'))
    except Exception as e:
        print(e)
        print('Error retrieving object {} from source bucket {}. Verify existence and ensure bucket is in same region as function.'.format(key, inBucket))
        raise e

    attachment_list = []

    try:
        # scan each part of email
        for message in msg.get_payload():
            # Check filename and email MIME type
            if (msg.get_filename() != None and msg.get_content_type() == FILE_MIMETYPE):
                attachment_list.append({'original_msg_key': key, 'attachment_filename': msg.get_filename(), 'body': base64.b64decode(message.get_payload())})
    except Exception as e:
        print(e)
        print('Error processing email for CSV attachments')
        raise e

    # if multiple attachments send all to bucket
    for attachment in attachment_list:
        try:
            s3.put_object(Bucket=S3_OUTPUT_BUCKETNAME, Key=attachment['original_msg_key'] + '-' + attachment['attachment_filename'], Body=attachment['body'])
        except Exception as e:
            print(e)
            print('Error sending object {} to destination bucket {}. Verify existence and ensure bucket is in same region as function.'.format(attachment['attachment_filename'], S3_OUTPUT_BUCKETNAME))
            raise e

    return event
Unfamiliar territory so please let me know if further information is required.
EDIT
As per the comments I have checked the logs. It seems the function is being invoked, but the attachment is not being parsed and sent to the destination folder. It's possible that there's an error in the Python file itself.
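One hedged observation on the code above, in case it helps the debugging: inside the loop, the checks call msg.get_filename() and msg.get_content_type() on the whole message rather than on each part being iterated, so a CSV attachment on an individual part may never match. A minimal sketch of a per-part version of that loop, reusing the same key, FILE_MIMETYPE and attachment_list names:

# Hypothetical rework of the attachment loop above: inspect each MIME part, not the outer message.
for part in msg.walk():
    if part.get_filename() is not None and part.get_content_type() == FILE_MIMETYPE:
        attachment_list.append({
            'original_msg_key': key,
            'attachment_filename': part.get_filename(),
            # get_payload(decode=True) already handles base64 / quoted-printable transfer encodings.
            'body': part.get_payload(decode=True),
        })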
I am trying to detect labels of multiple images using AWS Rekognition in Python.
This process requires around 3 seconds for an image to get labelled. Is there any way I can label these images in parallel?
Since I have refrained from using boto3 sessions, please provide a code snippet if possible.
The best thing you can do is run your code in the cloud as a function instead of on your local machine. With AWS Lambda you can do this easily: just add S3 object upload as a trigger to your Lambda, and whenever an image is uploaded to your S3 bucket it will invoke the function, which calls detect_labels. You can then use those labels however you want; you can even store them in a DynamoDB table for later reference and fetch them from that table.
The best part is that if you upload multiple images simultaneously, each image is processed in parallel, since Lambda is highly scalable, and you get all the results at the same time.
Example code for the same:
from __future__ import print_function

import boto3
from decimal import Decimal
import json
import urllib

print('Loading function')

rekognition = boto3.client('rekognition')


# --------------- Helper Functions to call Rekognition APIs ------------------

def detect_labels(bucket, key):
    response = rekognition.detect_labels(Image={"S3Object": {"Bucket": bucket, "Name": key}})

    # Sample code to write response to DynamoDB table 'MyTable' with 'PK' as Primary Key.
    # Note: role used for executing this Lambda function should have write access to the table.
    #table = boto3.resource('dynamodb').Table('MyTable')
    #labels = [{'Confidence': Decimal(str(label_prediction['Confidence'])), 'Name': label_prediction['Name']} for label_prediction in response['Labels']]
    #table.put_item(Item={'PK': key, 'Labels': labels})

    return response


# --------------- Main handler ------------------

def lambda_handler(event, context):
    # Get the object from the event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key'].encode('utf8'))
    try:
        # Calls Rekognition DetectLabels API to detect labels in the S3 object
        response = detect_labels(bucket, key)
        print(response)
        return response
    except Exception as e:
        print(e)
        print("Error processing object {} from bucket {}. ".format(key, bucket) +
              "Make sure your object and bucket exist and your bucket is in the same region as this function.")
        raise e
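If the work does have to stay on your local machine rather than in Lambda, a hedged alternative is to fan the detect_labels calls out over a thread pool; a single boto3 client is generally safe to share across threads, and the per-image time is mostly network wait. A minimal sketch, with placeholder bucket and key names:

import concurrent.futures

import boto3

rekognition = boto3.client('rekognition')

# Placeholder inputs for illustration.
BUCKET = 'your-image-bucket'
keys = ['images/one.jpg', 'images/two.jpg', 'images/three.jpg']

def detect_labels(key):
    return rekognition.detect_labels(Image={'S3Object': {'Bucket': BUCKET, 'Name': key}})

# Run the per-image calls concurrently; latency dominates, so threads help here.
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    results = dict(zip(keys, pool.map(detect_labels, keys)))

for key, response in results.items():
    print(key, [label['Name'] for label in response['Labels']])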
I have a trial account with Azure and have uploaded some JSON files into CosmosDB. I am creating a Python program to review the data, but I am having trouble doing so. This is the code I have so far:
import pydocumentdb.documents as documents
import pydocumentdb.document_client as document_client
import pydocumentdb.errors as errors
url = 'https://ronyazrak.documents.azure.com:443/'
key = '' # primary key
# Initialize the Python DocumentDB client
client = document_client.DocumentClient(url, {'masterKey': key})
collection_link = '/dbs/test1/colls/test1'
collection = client.ReadCollection(collection_link)
result_iterable = client.QueryDocuments(collection)
query = { 'query': 'SELECT * FROM server s' }
I read somewhere that the key would be my primary key that I can find in my Azure account Keys. I have filled the key string with my primary key shown in the image but key here is empty just for privacy purposes.
I also read somewhere that the collection_link should be '/dbs/test1/colls/test1' if my data is in collection 'test1' Collections.
My code gets an error at the function client.ReadCollection().
This is the error I get: "pydocumentdb.errors.HTTPFailure: Status code: 401
{"code":"Unauthorized","message":"The input authorization token can't serve the request. Please check that the expected payload is built as per the protocol, and check the key being used. Server used the following payload to sign: 'get\ncolls\ndbs/test1/colls/test1\nmon, 29 may 2017 19:47:28 gmt\n\n'\r\nActivityId: 03e13e74-8db4-4661-837a-f8d81a2804cc"}"
Once this error is fixed, what is there left to do? I want to get the JSON files as a big dictionary so that I can review the data.
Am I on the right path? Am I approaching this the wrong way? How can I read documents that are in my database? Thanks.
According to your error information, it seems to be caused by authentication failing with your key, as the official explanation from here says.
So please check your key, but I think the key point is that pydocumentdb is being used incorrectly. The ids of a Database, Collection & Document are different from their links. APIs such as ReadCollection and QueryDocuments need to be passed the related link; you need to retrieve resources in Azure Cosmos DB via resource links, not resource ids.
According to your description, I think you want to list all documents under the collection id path /dbs/test1/colls/test1. For reference, here is my sample code:
from pydocumentdb import document_client
uri = 'https://ronyazrak.documents.azure.com:443/'
key = '<your-primary-key>'
client = document_client.DocumentClient(uri, {'masterKey': key})
db_id = 'test1'
db_query = "select * from r where r.id = '{0}'".format(db_id)
db = list(client.QueryDatabases(db_query))[0]
db_link = db['_self']
coll_id = 'test1'
coll_query = "select * from r where r.id = '{0}'".format(coll_id)
coll = list(client.QueryCollections(db_link, coll_query))[0]
coll_link = coll['_self']
docs = client.ReadDocuments(coll_link)
print(list(docs))
Please see the details of DocumentDB Python SDK from here.
For those using azure-cosmos, the current library as of 2019: I opened a doc bug and provided a sample on GitHub.
Sample
from azure.cosmos import cosmos_client
import json

CONFIG = {
    "ENDPOINT": "ENDPOINT_FROM_YOUR_COSMOS_ACCOUNT",
    "PRIMARYKEY": "KEY_FROM_YOUR_COSMOS_ACCOUNT",
    "DATABASE": "DATABASE_ID",  # Prolly looks more like a name to you
    "CONTAINER": "YOUR_CONTAINER_ID"  # Prolly looks more like a name to you
}

CONTAINER_LINK = f"dbs/{CONFIG['DATABASE']}/colls/{CONFIG['CONTAINER']}"
FEEDOPTIONS = {}
FEEDOPTIONS["enableCrossPartitionQuery"] = True
# There is also a partitionKey feed option, but I was unable to figure out how to use it.
QUERY = {
    "query": f"SELECT * from c"
}

# Initialize the Cosmos client
client = cosmos_client.CosmosClient(
    url_connection=CONFIG["ENDPOINT"], auth={"masterKey": CONFIG["PRIMARYKEY"]}
)

# Query for some data
results = client.QueryItems(CONTAINER_LINK, QUERY, FEEDOPTIONS)

# Look at your data
print(list(results))

# You can also use the list as JSON
json.dumps(list(results), indent=4)
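One caveat, depending on which azure-cosmos version you end up installing: in the newer 4.x SDK the client works with database and container names directly instead of links and feed options, so the equivalent query looks roughly like this (reusing the CONFIG placeholders above):

from azure.cosmos import CosmosClient

# Same placeholder endpoint/key/names as in CONFIG above.
client = CosmosClient(CONFIG["ENDPOINT"], credential=CONFIG["PRIMARYKEY"])
container = client.get_database_client(CONFIG["DATABASE"]).get_container_client(CONFIG["CONTAINER"])

items = list(container.query_items(
    query="SELECT * FROM c",
    enable_cross_partition_query=True
))
print(items)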
I'm building a web application for uploading and downloading files to and from MongoDB using Flask. First I search a particular MongoDB collection for a matching string, and if there is a matching string in any document, I need to create a dynamic URL (clickable from the search page) to download it using the ObjectId. Once I click the dynamic URL, it should retrieve the file stored in MongoDB for that particular ObjectId and download it. I tried changing response.headers['Content-Type'] and response.headers["Content-Dispostion"] to the original values, but for some reason the download is not working as expected.
route.py
@app.route('/download/<fileId>', methods=['GET', 'POST'])
def download(fileId):
    connection = pymongo.MongoClient()
    # get a handle to the test database
    db = connection.test
    uploads = db.uploads
    try:
        query = {'_id': ObjectId(fileId)}
        cursor = uploads.find(query)
        for doc in cursor:
            fileName = doc['fileName']
            response = make_response(doc['binFile'])
            response.headers['Content-Type'] = doc['fileType']
            response.headers['Content-Dispostion'] = "attachment; filename=" + fileName
            print response.headers
            return response
    except Exception as e:
        return render_template('Unsuccessful.html')
What should I do so that I can download the file (retrieved from MongoDB, which works as expected) with the same file name and data as I uploaded earlier?
Below is the log from a recent run.
The file (in this case "Big Data Workflows presentation 1.pptx") retrieved from MongoDB is downloading with the ObjectId as the file name, even though I'm changing the file name to the original file name.
Please let me know if I'm missing any detail. I'll update the post accordingly.
Thanks in advance,
Thank you @Bartek Jablonski for your input.
Finally I made this work by tweaking the code a little and creating a new collection in MongoDB (I got lucky this time, I guess).
@app.route('/download/<fileId>', methods=['GET', 'POST'])
def download(fileId):
    connection = pymongo.MongoClient()
    # get a handle to the nrdc database
    db = connection.nrdc
    uploads = db.uploads
    try:
        query = {'_id': ObjectId(fileId)}
        cursor = uploads.find(query)
        for doc in cursor:
            fileName = doc['fileName']
            response = make_response(doc['binFile'])
            response.headers['Content-Type'] = doc['fileType']
            # The header name must be spelled "Content-Disposition" for browsers to honor the filename.
            response.headers["Content-Disposition"] = "attachment; filename=\"%s\"" % fileName
            return response
    except Exception as e:
        # self.errorList.append("No results found." + type(e))
        return False
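As an aside rather than part of the fix above, Flask can also build the attachment response itself via send_file, which sets Content-Disposition for you; the filename argument is attachment_filename on older Flask releases and download_name on Flask 2.0+. A hypothetical variant of the route above under those assumptions (the route name and helper are made up for illustration, and app is the Flask app assumed elsewhere):

import io

import pymongo
from bson.objectid import ObjectId
from flask import send_file

@app.route('/download_v2/<fileId>')  # hypothetical alternate route, for illustration only
def download_v2(fileId):
    doc = pymongo.MongoClient().nrdc.uploads.find_one({'_id': ObjectId(fileId)})
    return send_file(
        io.BytesIO(doc['binFile']),
        mimetype=doc['fileType'],
        as_attachment=True,
        attachment_filename=doc['fileName'],  # use download_name=... on Flask >= 2.0
    )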