How do I paginate my query in Athena using Lambda and Boto3 (Python)?

I am querying my data in Athena from a Lambda function using Boto3.
The result is in JSON format.
When I run my Lambda function, I get the whole result set back.
How can I paginate this data?
I only want to fetch a small number of records per page and
send that small dataset to the UI to display.
Here is my Python code:
import boto3
def lambda_handler(event, context):
    athena = boto3.client('athena')
    s3 = boto3.client('s3')
    query = event['query']
    # Start the query; DATABASE and output are not shown in this snippet
    query_id = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={'Database': DATABASE},
        ResultConfiguration={'OutputLocation': output}
    )['QueryExecutionId']
I use Postman to pass my query and get the data.
I am aware of the SQL LIMIT and OFFSET clauses,
but I want to know if there is a better way to pass LIMIT and OFFSET parameters to my function.
Please help me with this.
Thanks.

A quick Google search turned up this answer in the Athena docs, which seems promising. Example from the docs:
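# athena is the boto3 Athena client; this paginator wraps get_query_results
paginator = athena.get_paginator('get_query_results')
response_iterator = paginator.paginate(
    QueryExecutionId='string',
    PaginationConfig={
        'MaxItems': 123,
        'PageSize': 123,
        'StartingToken': 'string'
    })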
I hope this helps!
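As a rough sketch of how this could look inside the Lambda, the function below fetches one page of rows from a finished query and returns the token for the next page, so the UI can send that token back to request the following page. The helper name `fetch_page`, the `page_size` default, and the returned dictionary shape are illustrative assumptions, not something from the docs. `athena` is expected to be a `boto3.client('athena')`, and the query is assumed to have already succeeded.

```python
def fetch_page(athena, query_execution_id, page_size=50, next_token=None):
    """Return one page of Athena result rows plus the token for the next page.

    `athena` is assumed to be a boto3 Athena client; the function name and
    the returned dict shape are hypothetical, chosen for illustration.
    """
    kwargs = {'QueryExecutionId': query_execution_id, 'MaxResults': page_size}
    if next_token:
        # Resume where the previous page stopped
        kwargs['NextToken'] = next_token

    response = athena.get_query_results(**kwargs)

    # Each row is a list of cells; a NULL cell has no 'VarCharValue' key
    rows = [
        [cell.get('VarCharValue') for cell in row['Data']]
        for row in response['ResultSet']['Rows']
    ]
    return {'rows': rows, 'next_token': response.get('NextToken')}
```

The caller would pass `next_token=None` for the first page and then echo back whatever `next_token` the previous call returned. For SELECT queries the first row of the first page usually holds the column names, so the caller may want to skip it.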

Related

Boto3 S3 Paginator not returning filtered results

I'm trying to list the items in my S3 bucket from the last few months. I'm able to get results from the normal paginator (page_iterator), but the filtered_iterator isn't yielding anything when I iterate over it.
I've been referencing the documentation here. I've double-checked my filter string using both the JMESPath site and the AWS CLI, and it works in both places. I'm at a loss as to what I need to do.
Current Code:
client = boto3.client('s3', region_name='us-west-2')
paginator = client.get_paginator('list_objects_v2')
operation_parameters = {'Bucket': self.bucket_name,
                        'Prefix': file_path_prefix}
page_iterator = paginator.paginate(**operation_parameters)
filtered_iterator = page_iterator.search("Contents[?LastModified>='2022-10-31'][]")
for key_data in filtered_iterator:
    print('page2', key_data)
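One possible explanation (a guess, the post does not confirm it) is that boto3 parses `LastModified` into a `datetime` object, so the JMESPath comparison against a plain string never matches inside Python even though it works in the CLI. A minimal sketch that sidesteps JMESPath by filtering in Python follows. The helper name `objects_modified_since` and the constant `CUTOFF` are hypothetical, and `page_iterator` can be any iterable of `list_objects_v2` pages.

```python
from datetime import datetime, timezone

def objects_modified_since(page_iterator, cutoff):
    """Yield S3 object entries whose LastModified is at or after `cutoff`.

    `cutoff` must be timezone-aware, because boto3 returns LastModified
    as a timezone-aware datetime.
    """
    for page in page_iterator:
        # A page with no matching objects has no 'Contents' key
        for obj in page.get('Contents', []):
            if obj['LastModified'] >= cutoff:
                yield obj

# Example cutoff matching the date used in the question (assumed to be UTC)
CUTOFF = datetime(2022, 10, 31, tzinfo=timezone.utc)
```

With the code from the question this would be used as `for obj in objects_modified_since(page_iterator, CUTOFF): print(obj['Key'])`.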

In AWS Organizations, use paginate to list all accounts

This is a general question about using paginate in boto3.
When there are a lot of accounts (100+) under AWS Organizations, calling list_accounts() directly without NextToken can't list all of them.
response = client.list_accounts()
I know the correct way is to pass NextToken and MaxResults, but that needs more code.
response = client.list_accounts(
    NextToken='string',
    MaxResults=123
)
So I switched to another method, paginate (see class Organizations.Paginator.ListAccounts). It reports more accounts than list_accounts(), but still doesn't list all of them.
Its request syntax has MaxItems and PageSize, similar to the parameters of list_accounts():
response_iterator = paginator.paginate(
    PaginationConfig={
        'MaxItems': 123,
        'PageSize': 123,
        'StartingToken': 'string'
    }
)
So I have two questions:
If paginate can't list all accounts, what is the point of it?
How can I list all accounts with paginate? Is there any sample code?
Any help will be appreciated.
So I was wrong, probably from the start.
The paginator does handle the loop automatically. I can't test Organizations list_accounts() because I don't have enough accounts, but I did run a test on S3 bucket objects.
Use the AWS CLI to count the S3 objects:
$ aws s3api list-objects --bucket bucket-demo |jq '.Contents|length'
8696
So I can confirm there are 8,000+ objects in this bucket.
Get the S3 objects with the Python SDK using a paginator:
>>> import boto3
>>> client = boto3.client('s3')
>>> paginator = client.get_paginator('list_objects')
>>> response_iterator = paginator.paginate(Bucket="bucket-demo")
>>> for i in response_iterator:
...     print(len(i['Contents']))
...
1000
1000
1000
1000
1000
1000
1000
1000
696
This proves the paginator does the loop automatically.
This shows the problem and why we need a paginator: a single call returns only the first 1,000 objects.
>>> import boto3
>>> client = boto3.client('s3')
>>> response = client.list_objects(Bucket="bucket-demo")
>>> len(response['Contents'])
1000
So a paginator can simplify your code a lot and saves you from writing your own NextToken loop.
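Applied back to the original Organizations question, a minimal sketch could look like the following. The function name `list_all_accounts` is hypothetical, and `org_client` is assumed to be a `boto3.client('organizations')`. It was not run against a real organization with 100+ accounts.

```python
def list_all_accounts(org_client):
    """Collect every account by letting the paginator follow NextToken.

    `org_client` is assumed to be boto3.client('organizations'); the
    function name is hypothetical.
    """
    paginator = org_client.get_paginator('list_accounts')
    accounts = []
    # No PaginationConfig: without MaxItems the paginator runs to the last page
    for page in paginator.paginate():
        accounts.extend(page['Accounts'])
    return accounts
```

If `MaxItems` was copied over from the request syntax template, it would cap the total number of accounts returned, which might explain the incomplete list seen earlier. That is a guess, since the exact call is not shown.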

Multiprocessing multiple images using Rekognition in Python

I am trying to detect labels of multiple images using AWS Rekognition in Python.
This process requires around 3 seconds for an image to get labelled. Is there any way I can label these images in parallel?
Since I have restrained using boto3 sessions, please provide the code snippet, if possible.
The best approach is to run your code in the cloud as a function instead of on your local machine. AWS Lambda makes this easy: add S3 object upload as a trigger for your Lambda function. Whenever an image is uploaded to your S3 bucket, it triggers the function, which calls detect_labels. You can then use the labels however you want, for example by storing them in a DynamoDB table for later reference.
Best of all, if you upload multiple images simultaneously, each image is processed in parallel, because Lambda scales out automatically, and you get all the results at roughly the same time.
Example code:
from __future__ import print_function
import boto3
from decimal import Decimal
import json
import urllib.parse
print('Loading function')
rekognition = boto3.client('rekognition')
# --------------- Helper Functions to call Rekognition APIs ------------------
def detect_labels(bucket, key):
    response = rekognition.detect_labels(Image={"S3Object": {"Bucket": bucket, "Name": key}})
    # Sample code to write response to DynamoDB table 'MyTable' with 'PK' as Primary Key.
    # Note: role used for executing this Lambda function should have write access to the table.
    #table = boto3.resource('dynamodb').Table('MyTable')
    #labels = [{'Confidence': Decimal(str(label_prediction['Confidence'])), 'Name': label_prediction['Name']} for label_prediction in response['Labels']]
    #table.put_item(Item={'PK': key, 'Labels': labels})
    return response
# --------------- Main handler ------------------
def lambda_handler(event, context):
    # Get the object from the event
    bucket = event['Records'][0]['s3']['bucket']['name']
    # Object keys arrive URL-encoded in S3 event notifications (Python 3 form of unquote_plus)
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
    try:
        # Calls rekognition DetectLabels API to detect labels in S3 object
        response = detect_labels(bucket, key)
        print(response)
        return response
    except Exception as e:
        print(e)
        print("Error processing object {} from bucket {}. ".format(key, bucket) +
              "Make sure your object and bucket exist and your bucket is in the same region as this function.")
        raise e
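If the images have to be labelled from a single script rather than through Lambda triggers, a thread pool is another option, since each DetectLabels call spends most of its time waiting on the network. The sketch below is only an illustration: the function name `detect_labels_parallel` and the `max_workers` default are assumptions, and `rekognition` is expected to be a `boto3.client('rekognition')` (boto3 documents its low-level clients as generally thread-safe).

```python
from concurrent.futures import ThreadPoolExecutor

def detect_labels_parallel(rekognition, bucket, keys, max_workers=8):
    """Call DetectLabels for several S3 images concurrently.

    Returns a dict mapping each key to its list of label names.
    The function name and defaults are hypothetical.
    """
    keys = list(keys)  # materialize so the keys can be iterated twice

    def label_one(key):
        response = rekognition.detect_labels(
            Image={'S3Object': {'Bucket': bucket, 'Name': key}}
        )
        return [label['Name'] for label in response['Labels']]

    # Threads suit this workload: each call mostly waits on the network
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs keys with results
        return dict(zip(keys, pool.map(label_one, keys)))
```

Any exception raised by one call propagates when its result is read, so a production version would probably want per-image error handling and some attention to Rekognition's request rate limits.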

How to read data from Azure's CosmosDB in python

I have a trial account with Azure and have uploaded some JSON files into CosmosDB. I am creating a python program to review the data but I am having trouble doing so. This is the code I have so far:
import pydocumentdb.documents as documents
import pydocumentdb.document_client as document_client
import pydocumentdb.errors as errors
url = 'https://ronyazrak.documents.azure.com:443/'
key = '' # primary key
# Initialize the Python DocumentDB client
client = document_client.DocumentClient(url, {'masterKey': key})
collection_link = '/dbs/test1/colls/test1'
collection = client.ReadCollection(collection_link)
result_iterable = client.QueryDocuments(collection)
query = { 'query': 'SELECT * FROM server s' }
I read somewhere that the key should be the primary key found under Keys in my Azure account. I filled the key string with the primary key shown in the image, but I left it empty here for privacy.
I also read somewhere that the collection_link should be '/dbs/test1/colls/test1' if my data is in the collection 'test1'.
My code gets an error at the function client.ReadCollection().
That's the error I have "pydocumentdb.errors.HTTPFailure: Status code: 401
{"code":"Unauthorized","message":"The input authorization token can't serve the request. Please check that the expected payload is built as per the protocol, and check the key being used. Server used the following payload to sign: 'get\ncolls\ndbs/test1/colls/test1\nmon, 29 may 2017 19:47:28 gmt\n\n'\r\nActivityId: 03e13e74-8db4-4661-837a-f8d81a2804cc"}"
Once this error is fixed, what is there left to do? I want to get the JSON files as a big dictionary so that I can review the data.
Am I on the right path, or am I approaching this the wrong way? How can I read the documents that are in my database? Thanks.
According to the error message, authentication with your key failed, as the official explanation here describes.
So please check your key, but I think the main problem is that pydocumentdb is being used incorrectly. The ids of a Database, Collection, or Document are different from their links. APIs such as ReadCollection and QueryDocuments must be passed the corresponding resource link. You need to retrieve resources in Azure CosmosDB via the resource link, not the resource id.
From your description, I think you want to list all documents under the collection with id path /dbs/test1/colls/test1. For reference, here is my sample code.
from pydocumentdb import document_client
uri = 'https://ronyazrak.documents.azure.com:443/'
key = '<your-primary-key>'
client = document_client.DocumentClient(uri, {'masterKey': key})
db_id = 'test1'
db_query = "select * from r where r.id = '{0}'".format(db_id)
db = list(client.QueryDatabases(db_query))[0]
db_link = db['_self']
coll_id = 'test1'
coll_query = "select * from r where r.id = '{0}'".format(coll_id)
coll = list(client.QueryCollections(db_link, coll_query))[0]
coll_link = coll['_self']
docs = client.ReadDocuments(coll_link)
print(list(docs))
Please see the DocumentDB Python SDK documentation here for details.
For those using azure-cosmos, the current library (2019): I opened a doc bug and provided a sample on GitHub.
Sample
from azure.cosmos import cosmos_client
import json
CONFIG = {
    "ENDPOINT": "ENDPOINT_FROM_YOUR_COSMOS_ACCOUNT",
    "PRIMARYKEY": "KEY_FROM_YOUR_COSMOS_ACCOUNT",
    "DATABASE": "DATABASE_ID",  # Probably looks more like a name to you
    "CONTAINER": "YOUR_CONTAINER_ID"  # Probably looks more like a name to you
}
CONTAINER_LINK = f"dbs/{CONFIG['DATABASE']}/colls/{CONFIG['CONTAINER']}"
FEEDOPTIONS = {}
FEEDOPTIONS["enableCrossPartitionQuery"] = True
# There is also a partitionKey Feed Option, but I was unable to figure out how to use it.
QUERY = {
    "query": "SELECT * from c"
}
# Initialize the Cosmos client
client = cosmos_client.CosmosClient(
    url_connection=CONFIG["ENDPOINT"], auth={"masterKey": CONFIG["PRIMARYKEY"]}
)
# Query for some data
results = client.QueryItems(CONTAINER_LINK, QUERY, FEEDOPTIONS)
# Materialize the results once so the list can be reused
items = list(results)
# Look at your data
print(items)
# You can also use the list as JSON
print(json.dumps(items, indent=4))
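Later 4.x releases of azure-cosmos changed the client surface, so the sample above may not run there unchanged. A minimal sketch for that newer API follows. The function names `open_container` and `read_all_items` are hypothetical, and the endpoint, key, and ids are placeholders you would supply yourself.

```python
def open_container(endpoint, key, database_id, container_id):
    """Return a container client (azure-cosmos 4.x style).

    Imported lazily so read_all_items below has no hard dependency
    on the azure-cosmos package.
    """
    from azure.cosmos import CosmosClient
    client = CosmosClient(endpoint, credential=key)
    database = client.get_database_client(database_id)
    return database.get_container_client(container_id)

def read_all_items(container, query="SELECT * FROM c"):
    """Run a cross-partition query and return the documents as a list of dicts."""
    items = container.query_items(
        query=query,
        enable_cross_partition_query=True,
    )
    # The result is a pageable iterator; materialize it once for reuse
    return list(items)
```

The returned list can be printed or passed to `json.dumps` directly, which covers the "big dictionary" the question asks for.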

Python BigQuery API: how to get data asynchronously?

I am getting started with the BigQuery API in Python, following the documentation.
This is my code, adapted from an example:
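# Imports assumed from the identifiers used below
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials
credentials = GoogleCredentials.get_application_default()
bigquery_service = build('bigquery', 'v2', credentials=credentials)
try:
    query_request = bigquery_service.jobs()
    query_data = {
        'query': (
            'SELECT * FROM [mytable] LIMIT 10;'
        )
    }
    query_response = query_request.query(
        projectId=project_id,
        body=query_data).execute()
    for row in query_response['rows']:
        print('\t'.join(field['v'] for field in row['f']))
except Exception as err:
    # The snippet was cut off before its except clause; this generic handler is a stand-in
    print(err)
    raise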
The problem I'm having is that I keep getting the response:
{u'kind': u'bigquery#queryResponse',
u'jobComplete': False,
u'jobReference': {u'projectId': 'myproject', u'jobId': u'xxxx'}}
So it has no rows field. Looking at the docs, I guess I need to take the jobId field and use it to check when the job is complete, and then get the data.
The problem I'm having is that the docs are a bit scattered and confusing, and I don't know how to do this.
I think I need to use this method to check the status of the job, but how do I adapt it for Python? And how often should I check / how long should I wait?
Could anyone give me an example?
There is code to do what you want here.
If you want more background on what it is doing, check out Google BigQuery Analytics chapter 7 (the relevant snippet is available here.)
TL;DR:
Your initial jobs.query() call is returning before the query completes; to wait for the job to be done you'll need to poll on jobs.getQueryResults(). You can then page through the results of that call.
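The polling described in the TL;DR could be sketched roughly as below. This is not the linked code: the function name `wait_for_rows`, the `poll_seconds` interval, and the `max_polls` limit are assumptions chosen for illustration. `bigquery_service` is the object returned by `build('bigquery', 'v2', ...)`, and `job_id` comes from `query_response['jobReference']['jobId']`.

```python
import time

def wait_for_rows(bigquery_service, project_id, job_id,
                  poll_seconds=1.0, max_polls=60):
    """Poll jobs.getQueryResults until the job completes, then collect all rows.

    The function name, polling interval, and retry limit are hypothetical.
    """
    jobs = bigquery_service.jobs()
    rows = []
    page_token = None
    polls = 0
    while True:
        kwargs = {'projectId': project_id, 'jobId': job_id}
        if page_token:
            kwargs['pageToken'] = page_token
        response = jobs.getQueryResults(**kwargs).execute()

        if not response.get('jobComplete'):
            # Job still running: wait and ask again, up to max_polls times
            polls += 1
            if polls >= max_polls:
                raise TimeoutError('query job did not complete in time')
            time.sleep(poll_seconds)
            continue

        rows.extend(response.get('rows', []))
        # A pageToken in the response means more result pages remain
        page_token = response.get('pageToken')
        if not page_token:
            return rows
```

How often to poll is a judgment call. A fixed one-second interval is a simple starting point, and a growing back-off may be kinder to quotas for long-running queries.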
