Boto3 Athena are not showing all tables - python

trying to get a list of table names in Athena Table using BOTO3 python.
this is my code; I think my attempts to do paginator is not correct. Any help is appreciated
import boto3
client = boto3.client('glue')
responseGetDatabases = client.get_databases()
databaseList = responseGetDatabases['DatabaseList']
for databaseDict in databaseList:
databaseName = databaseDict['Name']
if "dbName_" in databaseName:
print '\ndatabaseName: ' + databaseName
responseGetTables = client.get_tables( DatabaseName = databaseName )
paginator = client.get_paginator(['TableList'])
for page in paginator:
tableList = responseGetTables['TableList']
for tables in tableList:
print tables['Name']

The get_paginator function parameter must be the name of the operation. It looks like you're trying to paginate on the get_tables function so
paginator = client.get_paginator(['TableList'])
should be:
paginator = client.get_paginator('get_tables')
Once you have the paginator object, you need to call paginator.paginate to retrieve the iterator. You can send your database parameters like so:
page_iterator = paginator.paginate(
DatabaseName=databaseDict['Name'],
PaginationConfig={
'MaxItems': 123,
'PageSize': 123,
'StartingToken': 'string'
}
)
See the documention for this function here.
Now that you have the iterator, you can call a for loop by enumerating on it:
for page_index, page in enumerate(page_iterator):

Here's a full working example on how to do it using paginator.
Remember to provide region_name and database_name.
import boto3
region_name = '<PROVIDE_AWS_REGION_NAME>'
database_name = '<PROVIDE_YOUR_DATABASE_NAME>'
catalog_name = 'AwsDataCatalog'
athena = boto3.client('athena', region_name=region_name)
paginator = athena.get_paginator('list_table_metadata')
response_iterator = paginator.paginate(
CatalogName=catalog_name,
DatabaseName=database_name
)
table_names = []
for page in response_iterator:
table_names.extend(
(i['Name'] for i in page['TableMetadataList'])
)
print(table_names)

Related

How can I bulk upload JSON records to AWS OpenSearch index using a python client library?

I have a sufficiently large dataset that I would like to bulk index the JSON objects in AWS OpenSearch.
I cannot see how to achieve this using any of: boto3, awswrangler, opensearch-py, elasticsearch, elasticsearch-py.
Is there a way to do this without using a python request (PUT/POST) directly?
Note that this is not for: ElasticSearch, AWS ElasticSearch.
Many thanks!
I finally found a way to do it using opensearch-py, as follows.
First establish the client,
# First fetch credentials from environment defaults
# If you can get this far you probably know how to tailor them
# For your particular situation. Otherwise SO is a safe bet :)
import boto3
credentials = boto3.Session().get_credentials()
region='eu-west-2' # for example
auth = AWSV4SignerAuth(credentials, region)
# Now set up the AWS 'Signer'
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
auth = AWSV4SignerAuth(credentials, region)
# And finally the OpenSearch client
host=f"...{region}.es.amazonaws.com" # fill in your hostname (minus the https://) here
client = OpenSearch(
hosts = [{'host': host, 'port': 443}],
http_auth = auth,
use_ssl = True,
verify_certs = True,
connection_class = RequestsHttpConnection
)
Phew! Let's create the data now:
# Spot the deliberate mistake(s) :D
document1 = {
"title": "Moneyball",
"director": "Bennett Miller",
"year": "2011"
}
document2 = {
"title": "Apollo 13",
"director": "Richie Cunningham",
"year": "1994"
}
data = [document1, document2]
TIP! Create the index if you need to -
my_index = 'my_index'
try:
response = client.indices.create(my_index)
print('\nCreating index:')
print(response)
except Exception as e:
# If, for example, my_index already exists, do not much!
print(e)
This is where things go a bit nutty. I hadn't realised that every single bulk action needs an, er, action e.g. "index", "search" etc. - so let's define that now
action={
"index": {
"_index": my_index
}
}
You can read all about the bulk REST API, there.
The next quirk is that the OpenSearch bulk API requires Newline Delimited JSON (see https://www.ndjson.org), which is basically JSON serialized as strings and separated by newlines. Someone wrote on SO that this "bizarre" API looked like one designed by a data scientist - far from taking offence, I think that rocks. (I agree ndjson is weird though.)
Hideously, now let's build up the full JSON string, combining the data and actions. A helper fn is at hand!
def payload_constructor(data,action):
# "All my own work"
action_string = json.dumps(action) + "\n"
payload_string=""
for datum in data:
payload_string += action_string
this_line = json.dumps(datum) + "\n"
payload_string += this_line
return payload_string
OK so now we can finally invoke the bulk API. I suppose you could mix in all sorts of actions (out of scope here) - go for it!
response=client.bulk(body=payload_constructor(data,action),index=my_index)
That's probably the most boring punchline ever but there you have it.
You can also just get (geddit) .bulk() to just use index= and set the action to:
action={"index": {}}
Hey presto!
Now, choose your poison - the other solution looks crazily shorter and neater.
PS The well-hidden opensearch-py documentation on this are located here.
conn = wr.opensearch.connect(
host=self.hosts, # URL
port=443,
username=self.username,
password=self.password
)
def insert_index_data(data, index_name='stocks', delete_index_data=False):
""" Bulk Create
args: body [{doc1}{doc2}....]
"""
if delete_index_data:
index_name = 'symbol'
self.delete_es_index(index_name)
resp = wr.opensearch.index_documents(
self.conn,
documents=data,
index=index_name
)
print(resp)
return resp
I have used below code to bulk insert records from postgres into OpenSearch ( ES 7.2 )
import sqlalchemy as sa
from sqlalchemy import text
import pandas as pd
import numpy as np
from opensearchpy import OpenSearch
from opensearchpy.helpers import bulk
import json
engine = sa.create_engine('postgresql+psycopg2://postgres:postgres#127.0.0.1:5432/postgres')
host = 'search-xxxxxxxxxx.us-east-1.es.amazonaws.com'
port = 443
auth = ('username', 'password') # For testing only. Don't store credentials in code.
# Create the client with SSL/TLS enabled, but hostname verification disabled.
client = OpenSearch(
hosts = [{'host': host, 'port': port}],
http_compress = True,
http_auth = auth,
use_ssl = True,
verify_certs = True,
ssl_assert_hostname = False,
ssl_show_warn = False
)
with engine.connect() as connection:
result = connection.execute(text("select * from account_1_study_1.stg_pred where domain='LB'"))
records = []
for row in result:
record = dict(row)
record.update(record['item_dataset'])
del record['item_dataset']
records.append(record)
df = pd.DataFrame(records)
#df['Date'] = df['Date'].astype(str)
df = df.fillna("null")
print(df.keys)
documents = df.to_dict(orient='records')
#bulk(es ,documents, index='search-irl-poc-dump', raise_on_error=True)\
#response=client.bulk(body=documents,index='sample-index')
bulk(client, documents, index='search-irl-poc-dump', raise_on_error=True, refresh=True)

Create Lambda - DynamoDB Count Function

I am creating a SAM web app, with the backend being an API in front of a Python Lambda function with a DynamoDB table that maintains a count of the number of HTTP calls to the API. The API must also return this number. The yaml code itself loads normally. My problem is writing the Lambda function to iterate and return the count. Here is my code:
def lambda_handler(event, context):
dynamodb = boto3.resource("dynamodb")
ddbTableName = os.environ["databaseName"]
table = dynamodb.Table(ddbTableName)
# Update item in table or add if doesn't exist
ddbResponse = table.update_item(
Key={"id": "VisitorCount"},
UpdateExpression="SET count = count + :value",
ExpressionAttributeValues={":value": Decimal(context)},
ReturnValues="UPDATED_NEW",
)
# Format dynamodb response into variable
responseBody = json.dumps({"VisitorCount": ddbResponse["Attributes"]["count"]})
# Create api response object
apiResponse = {"isBase64Encoded": False, "statusCode": 200, "body": responseBody}
# Return api response object
return apiResponse
I can get VisitorCount to be a string, but not a number. I get this error: [ERROR] TypeError: lambda_handler() missing 1 required positional argument: 'cou    response = request_handler(event, lambda_context)le_event_request
What is going on?
[UPDATE] I found the original error, which was that the function was not properly received by the SAM app. Changing the name fixed this, and it is now being read. Now I have to troubleshoot the actual Python. New Code:
import json
import boto3
import os
dynamodb = boto3.resource("dynamodb")
ddbTableName = os.environ["databaseName"]
table = dynamodb.Table(ddbTableName)
Key = {"VisitorCount": { "N" : "0" }}
def handler(event, context):
# Update item in table or add if doesn't exist
ddbResponse = table.update_item(
UpdateExpression= "set VisitorCount = VisitorCount + :val",
ExpressionAttributeValues={":val": {"N":"1"}},
ReturnValues="UPDATED_NEW",
)
# Format dynamodb response into variable
responseBody = json.dumps({"VisitorCount": ddbResponse["Attributes"]["count"]})
# Create api response object
apiResponse = {"isBase64Encoded": False, "statusCode": 200,"body": responseBody}
# Return api response object
return apiResponse
I am getting a syntax error on Line 13, which is
UpdateExpression= "set VisitorCount = VisitorCount + :val",
But I can't tell where I am going wrong on this. It should update the DynamoDB table to increase the count by 1. Looking at the AWS guide it appears to be the correct syntax.
Not sure what the exact error is but ddbResponse will be like this:
ddbResponse = table.update_item(
Key={
'key1': aaa,
'key2': bbb
},
UpdateExpression= "set VisitorCount = VisitorCount + :val",
ExpressionAttributeValues={":val": Decimal(1)},
ReturnValues="UPDATED_NEW",
)
Specify item to be updated with Key (one item for one Lambda call)
Set Decimal(1) for ExpressionAttributeValues
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GettingStarted.Python.03.html#GettingStarted.Python.03.04

Paginator.getPaginate() not working in Python Boto3 for Redshift Data API

I am using Boto3 in python to get the data. I have created an API with the help of Flask in Python. I am firing a query and trying to retrieve the data, but it is too large hence trying to use Paginator but it returns NULL everytime.
paginator = DataClient.get_paginator('get_statement_result')
PaginatorResponse = paginator.paginate(
Id=request.args.get('Id'),
PaginationConfig={
'MaxItems': request.args.get('max-items'),
'StartingToken': request.args.get('start-token')
}
)
RESPONSE['redshift'] = PaginatorResponse
response = app.response_class(
response = json.dumps(RESPONSE ,sort_keys=True,indent=1,default=default),
status = STATUSCODE,
mimetype = 'application/json'
)
This is the paginator code which returns NULL on same query ID which I use in the get_statement_result directly. Here is the code for the same:
RecordsResponse = DataClient.get_statement_result(Id=request.args.get('Id'))
RESPONSE['redshift'] = RecordsResponse
RESPONSE['records'] = convertRecords(RecordsResponse)
response = app.response_class(
response = json.dumps(RESPONSE ,sort_keys=True,indent=1,default=default),
status = STATUSCODE,
mimetype = 'application/json'
)
Result ScreenShot from Get Statement Result
Result ScreenShot from Get Paginator
Not able to understand where I am going wrong, and how to get it done.

Add limit in Lambda DynamoDB function

I need to limit the number of results returned for the query
My lambda function is
import json
import boto3
from boto3.dynamodb.conditions import Key, Attr
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('users')
def lambda_handler(event, context):
response = table.scan(
FilterExpression=Attr('name').eq('test')
)
items = response['Items']
return items
Please help me how can i add a limit to the number of results it will return
You can try with pagination.
import boto3
def lambda_handler(event, context):
client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')
operation_parameters = {
'TableName': 'foo',
'FilterExpression': 'foo > :x',
'ExpressionAttributeValues': {}
}
page_iterator = paginator.paginate(**operation_parameters)
for page in page_iterator:
# do something
boto3 source source2
If you want just limiting techinique take a look at this solution but it uses query method not scan.
def full_query(table, **kwargs):
response = table.query(**kwargs)
items = response['Items']
while 'LastEvaluatedKey' in response:
response = table.query(ExclusiveStartKey=response['LastEvaluatedKey'], **kwards)
items.extend(response['Items'])
return items
full_query(Limit=37, KeyConditions={...})
Source : https://stackoverflow.com/a/56468281/9592801

Complete scan of dynamoDb with boto3

My table is around 220mb with 250k records within it. I'm trying to pull all of this data into python. I realize this needs to be a chunked batch process and looped through, but I'm not sure how I can set the batches to start where the previous left off.
Is there some way to filter my scan? From what I read that filtering occurs after loading and the loading stops at 1mb so I wouldn't actually be able to scan in new objects.
Any assistance would be appreciated.
import boto3
dynamodb = boto3.resource('dynamodb',
aws_session_token = aws_session_token,
aws_access_key_id = aws_access_key_id,
aws_secret_access_key = aws_secret_access_key,
region_name = region
)
table = dynamodb.Table('widgetsTableName')
data = table.scan()
I think the Amazon DynamoDB documentation regarding table scanning answers your question.
In short, you'll need to check for LastEvaluatedKey in the response. Here is an example using your code:
import boto3
dynamodb = boto3.resource('dynamodb',
aws_session_token=aws_session_token,
aws_access_key_id=aws_access_key_id,
aws_secret_access_key=aws_secret_access_key,
region_name=region
)
table = dynamodb.Table('widgetsTableName')
response = table.scan()
data = response['Items']
while 'LastEvaluatedKey' in response:
response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
data.extend(response['Items'])
DynamoDB limits the scan method to 1mb of data per scan.
Documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#DynamoDB.Client.scan
Here is an example loop to get all the data from a DynamoDB table using LastEvaluatedKey:
import boto3
client = boto3.client('dynamodb')
def dump_table(table_name):
results = []
last_evaluated_key = None
while True:
if last_evaluated_key:
response = client.scan(
TableName=table_name,
ExclusiveStartKey=last_evaluated_key
)
else:
response = client.scan(TableName=table_name)
last_evaluated_key = response.get('LastEvaluatedKey')
results.extend(response['Items'])
if not last_evaluated_key:
break
return results
# Usage
data = dump_table('your-table-name')
# do something with data
boto3 offers paginators that handle all the pagination details for you. Here is the doc page for the scan paginator. Basically, you would use it like so:
import boto3
client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')
for page in paginator.paginate():
# do something
Riffing off of Jordon Phillips's answer, here's how you'd pass a FilterExpression in with the pagination:
import boto3
client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')
operation_parameters = {
'TableName': 'foo',
'FilterExpression': 'bar > :x AND bar < :y',
'ExpressionAttributeValues': {
':x': {'S': '2017-01-31T01:35'},
':y': {'S': '2017-01-31T02:08'},
}
}
page_iterator = paginator.paginate(**operation_parameters)
for page in page_iterator:
# do something
Code for deleting dynamodb format type as #kungphu mentioned.
import boto3
from boto3.dynamodb.types import TypeDeserializer
from boto3.dynamodb.transform import TransformationInjector
client = boto3.client('dynamodb')
paginator = client.get_paginator('query')
service_model = client._service_model.operation_model('Query')
trans = TransformationInjector(deserializer = TypeDeserializer())
for page in paginator.paginate():
trans.inject_attribute_value_output(page, service_model)
Turns out that Boto3 captures the "LastEvaluatedKey" as part of the returned response. This can be used as the start point for a scan:
data= table.scan(
ExclusiveStartKey=data['LastEvaluatedKey']
)
I plan on building a loop around this until the returned data is only the ExclusiveStartKey
The 2 approaches suggested above both have problems: Either writing lengthy and repetitive code that handles paging explicitly in a loop, or using Boto paginators with low-level sessions, and foregoing the advantages of higher-level Boto objects.
A solution using Python functional code to provide a high-level abstraction allows higher-level Boto methods to be used, while hiding the complexity of AWS paging:
import itertools
import typing
def iterate_result_pages(function_returning_response: typing.Callable, *args, **kwargs) -> typing.Generator:
"""A wrapper for functions using AWS paging, that returns a generator which yields a sequence of items for
every response
Args:
function_returning_response: A function (or callable), that returns an AWS response with 'Items' and optionally 'LastEvaluatedKey'
This could be a bound method of an object.
Returns:
A generator which yields the 'Items' field of the result for every response
"""
response = function_returning_response(*args, **kwargs)
yield response["Items"]
while "LastEvaluatedKey" in response:
kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
response = function_returning_response(*args, **kwargs)
yield response["Items"]
return
def iterate_paged_results(function_returning_response: typing.Callable, *args, **kwargs) -> typing.Iterator:
"""A wrapper for functions using AWS paging, that returns an iterator of all the items in the responses.
Items are yielded to the caller as soon as they are received.
Args:
function_returning_response: A function (or callable), that returns an AWS response with 'Items' and optionally 'LastEvaluatedKey'
This could be a bound method of an object.
Returns:
An iterator which yields one response item at a time
"""
return itertools.chain.from_iterable(iterate_result_pages(function_returning_response, *args, **kwargs))
# Example, assuming 'table' is a Boto DynamoDB table object:
all_items = list(iterate_paged_results(ProjectionExpression = 'my_field'))
I had some problems with Vincent's answer related to the transformation being applied to the LastEvaluatedKey and messing up the pagination. Solved as follows:
import boto3
from boto3.dynamodb.types import TypeDeserializer
from boto3.dynamodb.transform import TransformationInjector
client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')
operation_model = client._service_model.operation_model('Scan')
trans = TransformationInjector(deserializer = TypeDeserializer())
operation_parameters = {
'TableName': 'tablename',
}
items = []
for page in paginator.paginate(**operation_parameters):
has_last_key = 'LastEvaluatedKey' in page
if has_last_key:
last_key = page['LastEvaluatedKey'].copy()
trans.inject_attribute_value_output(page, operation_model)
if has_last_key:
page['LastEvaluatedKey'] = last_key
items.extend(page['Items'])
If you are landing here looking for a paginated scan with some filtering expression(s):
def scan(table, **kwargs):
response = table.scan(**kwargs)
yield from response['Items']
while response.get('LastEvaluatedKey'):
response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'], **kwargs)
yield from response['Items']
Example usage:
table = boto3.Session(...).resource('dynamodb').Table('widgetsTableName')
items = list(scan(table, FilterExpression=Attr('name').contains('foo')))
I can't work out why Boto3 provides high-level resource abstraction but doesn't provide pagination. When it does provide pagination, it's hard to use!
The other answers to this question were good but I wanted a super simple way to wrap the boto3 methods and provide memory-efficient paging using generators:
import typing
import boto3
import boto3.dynamodb.conditions
def paginate_dynamodb_response(dynamodb_action: typing.Callable, **kwargs) -> typing.Generator[dict, None, None]:
# Using the syntax from https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/python/example_code/dynamodb/GettingStarted/scenario_getting_started_movies.py
keywords = kwargs
done = False
start_key = None
while not done:
if start_key:
keywords['ExclusiveStartKey'] = start_key
response = dynamodb_action(**keywords)
start_key = response.get('LastEvaluatedKey', None)
done = start_key is None
for item in response.get("Items", []):
yield item
## Usage ##
dynamodb_res = boto3.resource('dynamodb')
dynamodb_table = dynamodb_res.Table('my-table')
query = paginate_dynamodb_response(
dynamodb_table.query, # The boto3 method. E.g. query or scan
# Regular Query or Scan parameters
#
# IndexName='myindex' # If required
KeyConditionExpression=boto3.dynamodb.conditions.Key('id').eq('1234')
)
for x in query:
print(x)```

Categories