My table is around 220 MB with 250k records in it. I'm trying to pull all of this data into Python. I realize this needs to be a chunked batch process run in a loop, but I'm not sure how to make each batch start where the previous one left off.
Is there some way to filter my scan? From what I've read, filtering is applied after loading, and loading stops at 1 MB, so I wouldn't actually be able to scan in new objects.
Any assistance would be appreciated.
import boto3

dynamodb = boto3.resource('dynamodb',
    aws_session_token=aws_session_token,
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region
)

table = dynamodb.Table('widgetsTableName')
data = table.scan()
I think the Amazon DynamoDB documentation regarding table scanning answers your question.
In short, you'll need to check for LastEvaluatedKey in the response. Here is an example using your code:
import boto3

dynamodb = boto3.resource('dynamodb',
    aws_session_token=aws_session_token,
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region
)

table = dynamodb.Table('widgetsTableName')

response = table.scan()
data = response['Items']

while 'LastEvaluatedKey' in response:
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    data.extend(response['Items'])
DynamoDB limits the scan method to 1 MB of data per request.
Documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#DynamoDB.Client.scan
Here is an example loop to get all the data from a DynamoDB table using LastEvaluatedKey:
import boto3

client = boto3.client('dynamodb')

def dump_table(table_name):
    results = []
    last_evaluated_key = None
    while True:
        if last_evaluated_key:
            response = client.scan(
                TableName=table_name,
                ExclusiveStartKey=last_evaluated_key
            )
        else:
            response = client.scan(TableName=table_name)
        last_evaluated_key = response.get('LastEvaluatedKey')
        results.extend(response['Items'])

        if not last_evaluated_key:
            break
    return results
# Usage
data = dump_table('your-table-name')
# do something with data
boto3 offers paginators that handle all the pagination details for you. Here is the doc page for the scan paginator. Basically, you would use it like so:
import boto3

client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')

for page in paginator.paginate(TableName='widgetsTableName'):
    ...  # do something with page['Items']
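For example, collecting every item across pages might look like this minimal sketch (using the table name from the question):

items = []
for page in paginator.paginate(TableName='widgetsTableName'):
    items.extend(page['Items'])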
Riffing off of Jordon Phillips's answer, here's how you'd pass a FilterExpression in with the pagination:
import boto3

client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')

operation_parameters = {
    'TableName': 'foo',
    'FilterExpression': 'bar > :x AND bar < :y',
    'ExpressionAttributeValues': {
        ':x': {'S': '2017-01-31T01:35'},
        ':y': {'S': '2017-01-31T02:08'},
    }
}

page_iterator = paginator.paginate(**operation_parameters)
for page in page_iterator:
    ...  # do something with page['Items']
Code for removing the DynamoDB attribute-value format, as @kungphu mentioned:
import boto3
from boto3.dynamodb.types import TypeDeserializer
from boto3.dynamodb.transform import TransformationInjector

client = boto3.client('dynamodb')
paginator = client.get_paginator('query')
service_model = client._service_model.operation_model('Query')
trans = TransformationInjector(deserializer=TypeDeserializer())

for page in paginator.paginate():  # pass your usual Query parameters here
    trans.inject_attribute_value_output(page, service_model)
Turns out that Boto3 captures the "LastEvaluatedKey" as part of the returned response. This can be used as the start point for a scan:
data = table.scan(
    ExclusiveStartKey=data['LastEvaluatedKey']
)
I plan on building a loop around this that runs until the response no longer contains a LastEvaluatedKey.
The two approaches suggested above both have problems: either you write lengthy, repetitive code that handles paging explicitly in a loop, or you use Boto paginators with low-level clients and forgo the advantages of the higher-level Boto objects.
A solution using Python functional code to provide a high-level abstraction allows higher-level Boto methods to be used, while hiding the complexity of AWS paging:
import itertools
import typing


def iterate_result_pages(function_returning_response: typing.Callable, *args, **kwargs) -> typing.Generator:
    """A wrapper for functions using AWS paging, that returns a generator which yields a sequence of items for
    every response

    Args:
        function_returning_response: A function (or callable) that returns an AWS response with 'Items' and
            optionally 'LastEvaluatedKey'. This could be a bound method of an object.

    Returns:
        A generator which yields the 'Items' field of the result for every response
    """
    response = function_returning_response(*args, **kwargs)
    yield response["Items"]

    while "LastEvaluatedKey" in response:
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
        response = function_returning_response(*args, **kwargs)
        yield response["Items"]

    return


def iterate_paged_results(function_returning_response: typing.Callable, *args, **kwargs) -> typing.Iterator:
    """A wrapper for functions using AWS paging, that returns an iterator of all the items in the responses.
    Items are yielded to the caller as soon as they are received.

    Args:
        function_returning_response: A function (or callable) that returns an AWS response with 'Items' and
            optionally 'LastEvaluatedKey'. This could be a bound method of an object.

    Returns:
        An iterator which yields one response item at a time
    """
    return itertools.chain.from_iterable(iterate_result_pages(function_returning_response, *args, **kwargs))


# Example, assuming 'table' is a Boto DynamoDB table object:
all_items = list(iterate_paged_results(table.scan, ProjectionExpression='my_field'))
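The same wrappers work with the low-level client too, since any callable that returns 'Items' and optionally 'LastEvaluatedKey' will do. A quick sketch (the table name here is a placeholder):

import boto3

client = boto3.client('dynamodb')
for items in iterate_result_pages(client.scan, TableName='widgetsTableName'):
    print(len(items))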
I had some problems with Vincent's answer related to the transformation being applied to the LastEvaluatedKey and messing up the pagination. Solved as follows:
import boto3
from boto3.dynamodb.types import TypeDeserializer
from boto3.dynamodb.transform import TransformationInjector

client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')
operation_model = client._service_model.operation_model('Scan')
trans = TransformationInjector(deserializer=TypeDeserializer())

operation_parameters = {
    'TableName': 'tablename',
}

items = []
for page in paginator.paginate(**operation_parameters):
    has_last_key = 'LastEvaluatedKey' in page
    if has_last_key:
        last_key = page['LastEvaluatedKey'].copy()
    trans.inject_attribute_value_output(page, operation_model)
    if has_last_key:
        page['LastEvaluatedKey'] = last_key
    items.extend(page['Items'])
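If you would rather avoid boto3's private transform classes entirely, an alternative sketch is to deserialize each collected item yourself with TypeDeserializer (applied here to the items list built above):

from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()
plain_items = [
    {key: deserializer.deserialize(value) for key, value in item.items()}
    for item in items
]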
If you are landing here looking for a paginated scan with some filtering expression(s):
def scan(table, **kwargs):
    response = table.scan(**kwargs)
    yield from response['Items']
    while response.get('LastEvaluatedKey'):
        response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'], **kwargs)
        yield from response['Items']
Example usage:
from boto3.dynamodb.conditions import Attr

table = boto3.Session(...).resource('dynamodb').Table('widgetsTableName')
items = list(scan(table, FilterExpression=Attr('name').contains('foo')))
I can't work out why Boto3 provides high-level resource abstraction but doesn't provide pagination. When it does provide pagination, it's hard to use!
The other answers to this question were good but I wanted a super simple way to wrap the boto3 methods and provide memory-efficient paging using generators:
import typing

import boto3
import boto3.dynamodb.conditions


def paginate_dynamodb_response(dynamodb_action: typing.Callable, **kwargs) -> typing.Generator[dict, None, None]:
    # Using the syntax from https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/python/example_code/dynamodb/GettingStarted/scenario_getting_started_movies.py
    keywords = kwargs
    done = False
    start_key = None

    while not done:
        if start_key:
            keywords['ExclusiveStartKey'] = start_key
        response = dynamodb_action(**keywords)
        start_key = response.get('LastEvaluatedKey', None)
        done = start_key is None
        for item in response.get("Items", []):
            yield item


## Usage ##

dynamodb_res = boto3.resource('dynamodb')
dynamodb_table = dynamodb_res.Table('my-table')

query = paginate_dynamodb_response(
    dynamodb_table.query,  # The boto3 method, e.g. query or scan
    # Regular Query or Scan parameters
    #
    # IndexName='myindex'  # If required
    KeyConditionExpression=boto3.dynamodb.conditions.Key('id').eq('1234')
)

for x in query:
    print(x)
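The same wrapper works for scans; for example, a quick sketch (the filter attribute here is hypothetical):

scan_results = paginate_dynamodb_response(
    dynamodb_table.scan,
    FilterExpression=boto3.dynamodb.conditions.Attr('name').contains('foo')  # hypothetical filter
)

for item in scan_results:
    print(item)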
I'm trying to mock Textract using moto.
I have this lambda_function:
from textractcaller.t_call import call_textract, Textract_Features

def lambda_function(event, context):
    s3InputDocPath = "s3://test"
    jsonObject = call_textract(
        input_document=s3InputDocPath,
        features=[Textract_Features.FORMS, Textract_Features.TABLES],
        job_done_polling_interval=5,
    )
textract.py
import boto3
import pytest

@pytest.fixture
def callTextract():
    textract = boto3.client("textract", region_name="us-east-1")
    bucket = "inputBucket"
    textract.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": "test_doc.pdf"}},
        FeatureTypes=["TABLES", "FORMS"]
    )
Finally, my test file:
test_lambda_function.py
@mock_textract
@mock_s3
def test_lambda_function(event, callTextract, s3_object):
    callTextract()
    s3_object()
    result = lambda_function(event, None)
But the Textract calls are not getting mocked. I'm getting the error below:
NotImplementedError: The analyze_document has not been Implemented
Can anyone help, please?
You can patch unsupported features like this:
import boto3
import botocore
from unittest.mock import patch

orig = botocore.client.BaseClient._make_api_call

def mock_make_api_call(self, operation_name, kwarg):
    if operation_name == 'AnalyzeDocument':
        # Make whatever changes you expect to happen during this operation
        return {
            "expected": "response",
            "for this": "operation"
        }
    # If we don't want to patch the API call, call the original API
    return orig(self, operation_name, kwarg)

@mock_textract
@mock_s3
def test_lambda_function(event, callTextract, s3_object):
    with patch('botocore.client.BaseClient._make_api_call', new=mock_make_api_call):
        callTextract()
        s3_object()
        result = lambda_function(event, None)
This allows you to completely customize the boto3 behaviour, on a per-method basis.
Taken (and modified to fit your use case) from the Moto documentation here: http://docs.getmoto.org/en/latest/docs/services/patching_other_services.html
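If you want the stub to return something shaped like a real AnalyzeDocument result, you could swap the return value above for a minimal sketch like this (the block contents are made up; real responses carry many more fields per Block):

def mock_make_api_call(self, operation_name, kwarg):
    if operation_name == 'AnalyzeDocument':
        # Hypothetical minimal Textract-like response
        return {
            "DocumentMetadata": {"Pages": 1},
            "Blocks": [
                {"BlockType": "PAGE", "Id": "page-1"},
                {"BlockType": "LINE", "Id": "line-1", "Text": "hello world"},
            ],
        }
    return orig(self, operation_name, kwarg)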
I have a sufficiently large dataset that I would like to bulk index as JSON objects in AWS OpenSearch.
I cannot see how to achieve this using any of: boto3, awswrangler, opensearch-py, elasticsearch, elasticsearch-py.
Is there a way to do this without using a python request (PUT/POST) directly?
Note that this is not for: ElasticSearch, AWS ElasticSearch.
Many thanks!
I finally found a way to do it using opensearch-py, as follows.
First, establish the client:
# First fetch credentials from environment defaults
# If you can get this far you probably know how to tailor them
# for your particular situation. Otherwise SO is a safe bet :)
import boto3

credentials = boto3.Session().get_credentials()
region = 'eu-west-2'  # for example

# Now set up the AWS 'Signer'
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

auth = AWSV4SignerAuth(credentials, region)

# And finally the OpenSearch client
host = f"...{region}.es.amazonaws.com"  # fill in your hostname (minus the https://) here

client = OpenSearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)
Phew! Let's create the data now:
# Spot the deliberate mistake(s) :D
document1 = {
    "title": "Moneyball",
    "director": "Bennett Miller",
    "year": "2011"
}

document2 = {
    "title": "Apollo 13",
    "director": "Richie Cunningham",
    "year": "1994"
}

data = [document1, document2]
TIP! Create the index if you need to -
my_index = 'my_index'

try:
    response = client.indices.create(my_index)
    print('\nCreating index:')
    print(response)
except Exception as e:
    # If, for example, my_index already exists, don't do much!
    print(e)
This is where things go a bit nutty. I hadn't realised that every single bulk operation needs an, er, action, e.g. "index", "create", "delete", etc. - so let's define that now:
action = {
    "index": {
        "_index": my_index
    }
}
You can read all about the bulk REST API in the OpenSearch documentation.
The next quirk is that the OpenSearch bulk API requires Newline Delimited JSON (see https://www.ndjson.org), which is basically JSON serialized as strings and separated by newlines. Someone wrote on SO that this "bizarre" API looked like one designed by a data scientist - far from taking offence, I think that rocks. (I agree ndjson is weird though.)
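For the two documents above, the payload ends up looking roughly like this: one action line followed by one source line per document, each terminated by a newline:

{"index": {"_index": "my_index"}}
{"title": "Moneyball", "director": "Bennett Miller", "year": "2011"}
{"index": {"_index": "my_index"}}
{"title": "Apollo 13", "director": "Richie Cunningham", "year": "1994"}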
Hideously, we now need to build up the full JSON string, combining the data and actions. A helper function is at hand!
import json

def payload_constructor(data, action):
    # "All my own work"
    action_string = json.dumps(action) + "\n"

    payload_string = ""
    for datum in data:
        payload_string += action_string
        this_line = json.dumps(datum) + "\n"
        payload_string += this_line
    return payload_string
OK so now we can finally invoke the bulk API. I suppose you could mix in all sorts of actions (out of scope here) - go for it!
response = client.bulk(body=payload_constructor(data, action), index=my_index)
That's probably the most boring punchline ever but there you have it.
You can also just get (geddit) .bulk() to use index= and set the action to:
action={"index": {}}
Hey presto!
Now, choose your poison - the other solution looks crazily shorter and neater.
P.S. The well-hidden opensearch-py documentation on this is located here.
import awswrangler as wr

# Note: this snippet is excerpted from a class, so 'self' refers to that class's attributes.
conn = wr.opensearch.connect(
    host=self.hosts,  # URL
    port=443,
    username=self.username,
    password=self.password
)

def insert_index_data(data, index_name='stocks', delete_index_data=False):
    """ Bulk Create
    args: body [{doc1}{doc2}....]
    """
    if delete_index_data:
        index_name = 'symbol'
        self.delete_es_index(index_name)

    resp = wr.opensearch.index_documents(
        self.conn,
        documents=data,
        index=index_name
    )
    print(resp)
    return resp
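For reference, a self-contained sketch of the same awswrangler approach, with placeholder endpoint, credentials, and documents substituted for the class attributes above:

import awswrangler as wr

# Placeholder endpoint and credentials; substitute your own domain details
client = wr.opensearch.connect(
    host='my-domain.eu-west-2.es.amazonaws.com',
    port=443,
    username='my-user',
    password='my-password',
)

documents = [
    {'title': 'Moneyball', 'director': 'Bennett Miller', 'year': '2011'},
    {'title': 'Apollo 13', 'director': 'Ron Howard', 'year': '1995'},
]

# index_documents handles the bulk indexing under the hood
response = wr.opensearch.index_documents(client, documents=documents, index='movies')
print(response)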
I have used the code below to bulk insert records from Postgres into OpenSearch (ES 7.2).
import sqlalchemy as sa
from sqlalchemy import text
import pandas as pd
import numpy as np
from opensearchpy import OpenSearch
from opensearchpy.helpers import bulk
import json

engine = sa.create_engine('postgresql+psycopg2://postgres:postgres@127.0.0.1:5432/postgres')

host = 'search-xxxxxxxxxx.us-east-1.es.amazonaws.com'
port = 443
auth = ('username', 'password')  # For testing only. Don't store credentials in code.

# Create the client with SSL/TLS enabled, but hostname verification disabled.
client = OpenSearch(
    hosts=[{'host': host, 'port': port}],
    http_compress=True,
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    ssl_assert_hostname=False,
    ssl_show_warn=False
)

with engine.connect() as connection:
    result = connection.execute(text("select * from account_1_study_1.stg_pred where domain='LB'"))
    records = []
    for row in result:
        record = dict(row)
        record.update(record['item_dataset'])
        del record['item_dataset']
        records.append(record)
    df = pd.DataFrame(records)
    #df['Date'] = df['Date'].astype(str)
    df = df.fillna("null")
    print(df.keys)
    documents = df.to_dict(orient='records')
    #bulk(es, documents, index='search-irl-poc-dump', raise_on_error=True)
    #response = client.bulk(body=documents, index='sample-index')
    bulk(client, documents, index='search-irl-poc-dump', raise_on_error=True, refresh=True)
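If you need control over document IDs, the bulk helper also accepts explicit actions; a minimal sketch (the 'id' field here is a hypothetical unique key in each record):

actions = [
    {
        '_index': 'search-irl-poc-dump',
        '_id': doc['id'],  # hypothetical unique key
        '_source': doc,
    }
    for doc in documents
]
bulk(client, actions, raise_on_error=True, refresh=True)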
I need to limit the number of results returned by the query.
My lambda function is:
import json
import boto3
from boto3.dynamodb.conditions import Key, Attr

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('users')

def lambda_handler(event, context):
    response = table.scan(
        FilterExpression=Attr('name').eq('test')
    )
    items = response['Items']
    return items
Please help me: how can I add a limit to the number of results it will return?
You can try with pagination.
import boto3

def lambda_handler(event, context):
    client = boto3.client('dynamodb')
    paginator = client.get_paginator('scan')
    operation_parameters = {
        'TableName': 'foo',
        'FilterExpression': 'foo > :x',
        'ExpressionAttributeValues': {
            ':x': {'N': '0'}  # supply your own value for the :x placeholder
        }
    }
    page_iterator = paginator.paginate(**operation_parameters)
    for page in page_iterator:
        ...  # do something with page['Items']
See the boto3 pagination documentation for details.
If you just want a limiting technique, take a look at this solution, but note it uses the query method, not scan.
def full_query(table, **kwargs):
    response = table.query(**kwargs)
    items = response['Items']
    while 'LastEvaluatedKey' in response:
        response = table.query(ExclusiveStartKey=response['LastEvaluatedKey'], **kwargs)
        items.extend(response['Items'])
    return items

full_query(table, Limit=37, KeyConditions={...})
Source : https://stackoverflow.com/a/56468281/9592801
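Note that with scan, Limit caps how many items are examined per request, and a FilterExpression is applied after that, so a single call can return fewer than Limit matching items. A minimal sketch that keeps paging until it has collected the desired number (the function name and parameters are my own):

def scan_with_limit(table, max_items, **kwargs):
    # Page through the table, stopping once max_items matching items are collected
    items = []
    response = table.scan(Limit=max_items, **kwargs)
    items.extend(response['Items'])
    while 'LastEvaluatedKey' in response and len(items) < max_items:
        response = table.scan(
            Limit=max_items - len(items),
            ExclusiveStartKey=response['LastEvaluatedKey'],
            **kwargs
        )
        items.extend(response['Items'])
    return items[:max_items]

# e.g. scan_with_limit(table, 10, FilterExpression=Attr('name').eq('test'))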
I am creating a lambda function with DynamoDB to list items and trying to pass parameters to filter the data. Without passing any filter I am able to get the data with the scan method, but when passing the filter I get an error. Below is the code that I am trying:
from __future__ import print_function
import json
import boto3
from boto3.dynamodb.conditions import Key, Attr

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('table-all')

def scan_table_allpages(self, table_name, filter_key=None, filter_value=None):
    table = self.dynamodb_resource.Table(table)
    if filter_key and filter_value:
        filtering_exp = Key(filter_key).eq(filter_value)
        response = table.scan(FilterExpression=filtering_exp)
    else:
        response = table.scan()

    items = response['Items']
    while True:
        print(len(response['Items']))
        if response.get('LastEvaluatedKey'):
            response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
            items += response['Items']
        else:
            break

    return items

# def lambda_handler(event, context):
#     print(table.creation_date_time)
#     response = table.get_items(
#         Key={
#             'Country': event['pathParameters']['USA']
#         }
#     )
#     #response = table.query()
#     #print(response)
#     return response

scan_table_allpages(self.table, filter_key="Country", filter_value='USA')
I'm trying to get a list of table names in an Athena database using boto3 and Python.
This is my code; I think my attempt at using a paginator is not correct. Any help is appreciated.
import boto3

client = boto3.client('glue')

responseGetDatabases = client.get_databases()
databaseList = responseGetDatabases['DatabaseList']

for databaseDict in databaseList:
    databaseName = databaseDict['Name']
    if "dbName_" in databaseName:
        print '\ndatabaseName: ' + databaseName

        responseGetTables = client.get_tables(DatabaseName=databaseName)
        paginator = client.get_paginator(['TableList'])

        for page in paginator:
            tableList = responseGetTables['TableList']
            for tables in tableList:
                print tables['Name']
The get_paginator function parameter must be the name of the operation. It looks like you're trying to paginate on the get_tables function, so:
paginator = client.get_paginator(['TableList'])
should be:
paginator = client.get_paginator('get_tables')
Once you have the paginator object, you need to call paginator.paginate to retrieve the iterator. You can send your database parameters like so:
page_iterator = paginator.paginate(
    DatabaseName=databaseDict['Name'],
    PaginationConfig={
        'MaxItems': 123,
        'PageSize': 123,
        'StartingToken': 'string'
    }
)
See the documentation for this function here.
Now that you have the iterator, you can loop over it, enumerating it if you need the page index (see the sketch below):
for page_index, page in enumerate(page_iterator):
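Putting the pieces together, a minimal sketch of the corrected loop might look like this (reusing databaseName from the question's loop and printing each table name):

paginator = client.get_paginator('get_tables')
page_iterator = paginator.paginate(DatabaseName=databaseName)

for page_index, page in enumerate(page_iterator):
    for table in page['TableList']:
        print(table['Name'])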
Here's a full working example of how to do it using a paginator.
Remember to provide region_name and database_name.
import boto3

region_name = '<PROVIDE_AWS_REGION_NAME>'
database_name = '<PROVIDE_YOUR_DATABASE_NAME>'
catalog_name = 'AwsDataCatalog'

athena = boto3.client('athena', region_name=region_name)

paginator = athena.get_paginator('list_table_metadata')
response_iterator = paginator.paginate(
    CatalogName=catalog_name,
    DatabaseName=database_name
)

table_names = []
for page in response_iterator:
    table_names.extend(
        (i['Name'] for i in page['TableMetadataList'])
    )

print(table_names)