Kinesis Firehose Lambda Transformation and Dynamic Partitioning - Python

The following data is generated with the Faker library. I am trying to learn and implement dynamic partitioning in Kinesis Firehose.
Sample input payload:
{
    "name": "Dr. Nancy Mcmillan",
    "phone_numbers": "8XXXXX",
    "city": "Priscillaport",
    "address": "908 Mitchell Views SXXXXXXXX 42564",
    "date": "1980-07-11",
    "customer_id": "3"
}
Sample producer code:
import json
import random

import boto3
from faker import Faker


def main():
    AWS_ACCESS_KEY = "XXXXX"
    AWS_SECRET_KEY = "XXX"
    AWS_REGION_NAME = "us-east-1"

    # create the client once, outside the loop
    client = boto3.client(
        "kinesis",
        aws_access_key_id=AWS_ACCESS_KEY,
        aws_secret_access_key=AWS_SECRET_KEY,
        region_name=AWS_REGION_NAME,
    )

    faker = Faker()
    for i in range(1, 13):
        json_data = {
            "name": faker.name(),
            "phone_numbers": faker.phone_number(),
            "city": faker.city(),
            "address": faker.address(),
            "date": str(faker.date()),
            "customer_id": str(random.randint(1, 5))
        }
        print(json_data)

        hasher = MyHasher(key=json_data)  # MyHasher is defined with the Lambda code below
        res = hasher.get()

        response = client.put_record(
            StreamName='XXX',
            Data=json.dumps(json_data),
            PartitionKey='test',
        )
        print(response)
Here is the Lambda transformation code, which works fine:
try:
    import json
    import boto3
    import base64
    from dateutil import parser
except Exception as e:
    pass


class MyHasher(object):
    def __init__(self, key):
        self.key = key

    def get(self):
        keys = str(self.key).encode("UTF-8")
        keys = base64.b64encode(keys)
        keys = keys.decode("UTF-8")
        return keys


def lambda_handler(event, context):
    print("Event")
    print(event)

    output = []

    for record in event["records"]:
        dat = base64.b64decode(record["data"])
        serialize_payload = json.loads(dat)
        print("serialize_payload", serialize_payload)

        json_new_line = str(serialize_payload) + "\n"
        hasherHelper = MyHasher(key=json_new_line)
        hash = hasherHelper.get()

        partition_keys = {"customer_id": serialize_payload.get("customer_id")}

        _ = {
            "recordId": record["recordId"],
            "result": "Ok",
            "data": hash,
            'metadata': {
                'partitionKeys': partition_keys
            }
        }
        print(_)
        output.append(_)

    print("*****************")
    print(output)

    return {"records": output}
Sample screenshots show that the transformation works fine.
Here are the Firehose settings for dynamic partitioning.
For some reason, on AWS S3 I see an error folder and all my messages go into it.
I have successfully implemented the Lambda transformation and have made a video, which can be found below. I am currently stuck on the dynamic partitioning; I have tried reading several posts, but that didn't help.
https://www.youtube.com/watch?v=6wot9Z93vAY&t=231s
Thank you, and I look forward to hearing from you.
References
https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html
https://www.youtube.com/watch?v=HcOVAFn-KhM
https://www.youtube.com/watch?v=PoaKgHdJgCE
https://medium.com/@bv_subhash/kinesis-firehose-performs-partitioning-based-on-timestamps-and-creates-files-in-s3-but-they-would-13efd51f6d39
https://www.amazonaws.cn/en/new/2021/s3-analytics-dynamic-partitioning-kinesis-data-firehose/

There are two prefix options for dynamic partitioning: 1) partitionKeyFromQuery, 2) partitionKeyFromLambda. If you want Firehose to parse the record and extract the partition key itself (inline parsing), use the first option. If you want to provide the partition key from the transformation Lambda, use the second option.
As per your Firehose config, you are using the Lambda to provide the partition key (the second option), but the S3 prefix is written for the first option. To resolve this, either disable inline parsing and switch the Firehose prefix to !{partitionKeyFromLambda:customer_id}/, or remove the Lambda transformation and keep inline parsing.
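For reference, a rough sketch of the two prefix styles (the namespace names come from the dynamic partitioning docs; customer_id matches the partition key emitted in the Lambda metadata above):

# Option used by your setup: key supplied by the transformation Lambda
prefix_from_lambda = "customer_id=!{partitionKeyFromLambda:customer_id}/"

# Alternative: inline JQ parsing by Firehose itself, no Lambda metadata needed
prefix_from_query = "customer_id=!{partitionKeyFromQuery:customer_id}/"

# An error output prefix keeps failed records separate from the partitioned data
error_output_prefix = "errors/"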

Related

How can I bulk upload JSON records to AWS OpenSearch index using a python client library?

I have a sufficiently large dataset that I would like to bulk index the JSON objects in AWS OpenSearch.
I cannot see how to achieve this using any of: boto3, awswrangler, opensearch-py, elasticsearch, elasticsearch-py.
Is there a way to do this without using a Python request (PUT/POST) directly?
Note that this is not for: ElasticSearch, AWS ElasticSearch.
Many thanks!
I finally found a way to do it using opensearch-py, as follows.
First establish the client,
# First fetch credentials from environment defaults
# If you can get this far you probably know how to tailor them
# for your particular situation. Otherwise SO is a safe bet :)
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

credentials = boto3.Session().get_credentials()
region = 'eu-west-2'  # for example

# Now set up the AWS 'Signer'
auth = AWSV4SignerAuth(credentials, region)

# And finally the OpenSearch client
host = f"...{region}.es.amazonaws.com"  # fill in your hostname (minus the https://) here
client = OpenSearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)
Phew! Let's create the data now:
# Spot the deliberate mistake(s) :D
document1 = {
    "title": "Moneyball",
    "director": "Bennett Miller",
    "year": "2011"
}
document2 = {
    "title": "Apollo 13",
    "director": "Richie Cunningham",
    "year": "1994"
}
data = [document1, document2]
TIP! Create the index if you need to -
my_index = 'my_index'

try:
    response = client.indices.create(my_index)
    print('\nCreating index:')
    print(response)
except Exception as e:
    # If, for example, my_index already exists, don't do much
    print(e)
This is where things go a bit nutty. I hadn't realised that every single bulk action needs an, er, action e.g. "index", "search" etc. - so let's define that now
action = {
    "index": {
        "_index": my_index
    }
}
You can read all about the bulk REST API there.
The next quirk is that the OpenSearch bulk API requires Newline Delimited JSON (see https://www.ndjson.org), which is basically JSON serialized as strings and separated by newlines. Someone wrote on SO that this "bizarre" API looked like one designed by a data scientist - far from taking offence, I think that rocks. (I agree ndjson is weird though.)
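To make that concrete, here is roughly what the bulk payload for the two documents above ends up looking like: one JSON object per line, alternating action and document (this is just the serialized output, not extra API syntax):

{"index": {"_index": "my_index"}}
{"title": "Moneyball", "director": "Bennett Miller", "year": "2011"}
{"index": {"_index": "my_index"}}
{"title": "Apollo 13", "director": "Richie Cunningham", "year": "1994"}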
Hideously, now let's build up the full JSON string, combining the data and actions. A helper fn is at hand!
import json

def payload_constructor(data, action):
    # "All my own work"
    action_string = json.dumps(action) + "\n"
    payload_string = ""
    for datum in data:
        payload_string += action_string
        this_line = json.dumps(datum) + "\n"
        payload_string += this_line
    return payload_string
OK so now we can finally invoke the bulk API. I suppose you could mix in all sorts of actions (out of scope here) - go for it!
response = client.bulk(body=payload_constructor(data, action), index=my_index)
That's probably the most boring punchline ever but there you have it.
You can also get (geddit) .bulk() to use index= and set the action to:
action={"index": {}}
Hey presto!
Now, choose your poison - the other solution looks crazily shorter and neater.
PS The well-hidden opensearch-py documentation on this is located here.
A different approach uses awswrangler (this snippet is excerpted from a class, hence the self references):
import awswrangler as wr

conn = wr.opensearch.connect(
    host=self.hosts,  # URL
    port=443,
    username=self.username,
    password=self.password
)

def insert_index_data(data, index_name='stocks', delete_index_data=False):
    """ Bulk Create
    args: body [{doc1}{doc2}....]
    """
    if delete_index_data:
        index_name = 'symbol'
        self.delete_es_index(index_name)

    resp = wr.opensearch.index_documents(
        self.conn,
        documents=data,
        index=index_name
    )
    print(resp)
    return resp
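Outside a class, a rough standalone sketch of the same awswrangler approach might look like this (the host, credentials, index name, and documents below are placeholders):

import awswrangler as wr

# placeholder domain endpoint and credentials
client = wr.opensearch.connect(
    host='search-mydomain-xxxxxxxx.eu-west-2.es.amazonaws.com',
    username='my-user',
    password='my-password'
)

documents = [
    {"title": "Moneyball", "year": "2011"},
    {"title": "Apollo 13", "year": "1995"}
]

response = wr.opensearch.index_documents(
    client,
    documents=documents,
    index='my_index'
)
print(response)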
I have used the code below to bulk insert records from Postgres into OpenSearch (ES 7.2):
import json

import numpy as np
import pandas as pd
import sqlalchemy as sa
from sqlalchemy import text
from opensearchpy import OpenSearch
from opensearchpy.helpers import bulk

engine = sa.create_engine('postgresql+psycopg2://postgres:postgres@127.0.0.1:5432/postgres')

host = 'search-xxxxxxxxxx.us-east-1.es.amazonaws.com'
port = 443
auth = ('username', 'password')  # For testing only. Don't store credentials in code.

# Create the client with SSL/TLS enabled, but hostname verification disabled.
client = OpenSearch(
    hosts=[{'host': host, 'port': port}],
    http_compress=True,
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    ssl_assert_hostname=False,
    ssl_show_warn=False
)

with engine.connect() as connection:
    result = connection.execute(text("select * from account_1_study_1.stg_pred where domain='LB'"))
    records = []
    for row in result:
        record = dict(row)
        record.update(record['item_dataset'])
        del record['item_dataset']
        records.append(record)

df = pd.DataFrame(records)
#df['Date'] = df['Date'].astype(str)
df = df.fillna("null")
print(df.keys)
documents = df.to_dict(orient='records')

#bulk(es, documents, index='search-irl-poc-dump', raise_on_error=True)
#response = client.bulk(body=documents, index='sample-index')
bulk(client, documents, index='search-irl-poc-dump', raise_on_error=True, refresh=True)

Fastest way to make 800+ GET requests using the same URL but passing a different ID every time

What would be the fastest way to load Device IDs from an Excel sheet which contains 800+ Device IDs and pass these Device IDs in an HTTP GET request?
I'm fetching Device IDs from the Excel sheet, making an HTTP GET request to get the relevant data, dumping it into a list, and then saving it to an Excel file using:
import json

import openpyxl
import pandas as pd
import requests

if __name__ == '__main__':
    excel_file = openpyxl.load_workbook("D:\mypath\Book1.xlsx")
    active_sheet = excel_file.get_sheet_by_name("Sheet4")

    def iter_rows(active_sheet):
        for row in active_sheet.iter_rows():
            yield [cell.value for cell in row]

    res = iter_rows(active_sheet)
    keys = next(res)

    final_data_to_dump = []
    failed_data_dump = []

    for new in res:
        inventory_data = dict(zip(keys, new))
        if None in inventory_data.values():
            pass
        else:
            url_get_event = 'https://some_url&source={}'.format(inventory_data['DeviceID'])
            header_events = {
                'Authorization': 'Basic authkey_here'}
            print(inventory_data['DeviceID'])
            try:
                r3 = requests.get(url_get_event, headers=header_events)
                r3_json = json.loads(r3.content)
                if r3_json['events']:
                    for object in r3_json['events']:
                        dict_excel_data = {
                            "DeviceID": object['source']['id'],
                            "Device Name": object['source']['name'],
                            "Start 1": object['Start1'],
                            "Start 2": object['Start2'],
                            "Watering Mode": object['WateringMode'],
                            "Duration": object['ActuationDetails']['Duration'],
                            "Type": object['type'],
                            "Creation Time": object['creationTime']
                        }
                        final_data_to_dump.append(dict_excel_data)
                else:
                    no_dict_excel_data = {
                        "DeviceID": inventory_data["DeviceID"],
                        "Device Name": inventory_data["DeviceName"],
                        "Start 1": "",
                        "Start 2": "",
                        "Watering Mode": "",
                        "Duration": "",
                        "Type": "",
                        "Creation Time": ""
                    }
                    final_data_to_dump.append(no_dict_excel_data)
            except requests.ConnectionError:
                failed_dict_excel_data = {
                    "DeviceID": inventory_data['DeviceID'],
                    "Device Name": inventory_data["DeviceName"],
                    "Status": "Connection Error"
                }
                failed_data_dump.append(failed_dict_excel_data)

    df = pd.DataFrame.from_dict(final_data_to_dump)
    df2 = pd.DataFrame.from_dict(failed_data_dump)
    df.to_excel('D:\mypath\ReportReceived_10Apr.xlsx', sheet_name='Sheet1', index=False)
    df2.to_excel('D:\mypath\Failed_ReportReceived_10Apr.xlsx', sheet_name='Sheet1', index=False)
But this can take upwards of 10-15 minutes, as there are 800+ devices in the Book1 sheet, and the number is likely to increase. How can I make this process faster?
You can use an async library, but the easiest solution here would be to do something like:
from concurrent.futures import ThreadPoolExecutor

import requests

def get(device_id):
    url_get_event = 'https://some_url&source={}'.format(device_id)
    return requests.get(url_get_event)

with ThreadPoolExecutor() as exc:
    responses = exc.map(get, device_ids)

If the rest of your per-response work is small, you may instead want to submit the calls to the executor and use as_completed to handle the results in the main thread while other requests are still running.
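As a rough sketch of that second suggestion (the IDs and header below are placeholders mirroring the question's request):

from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

header_events = {'Authorization': 'Basic authkey_here'}  # same header as in the question
device_ids = ['id1', 'id2']  # placeholder: the IDs read from the Excel sheet

def get(device_id):
    url_get_event = 'https://some_url&source={}'.format(device_id)
    return device_id, requests.get(url_get_event, headers=header_events)

with ThreadPoolExecutor(max_workers=20) as exc:
    futures = [exc.submit(get, device_id) for device_id in device_ids]
    for future in as_completed(futures):
        device_id, response = future.result()
        # parse response.json() here, the way the original loop body does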

Create Lambda - DynamoDB Count Function

I am creating a SAM web app, with the backend being an API in front of a Python Lambda function and a DynamoDB table that maintains a count of the number of HTTP calls to the API. The API must also return this number. The YAML itself loads normally. My problem is writing the Lambda function to increment and return the count. Here is my code:
def lambda_handler(event, context):
    dynamodb = boto3.resource("dynamodb")
    ddbTableName = os.environ["databaseName"]
    table = dynamodb.Table(ddbTableName)

    # Update item in table or add if doesn't exist
    ddbResponse = table.update_item(
        Key={"id": "VisitorCount"},
        UpdateExpression="SET count = count + :value",
        ExpressionAttributeValues={":value": Decimal(context)},
        ReturnValues="UPDATED_NEW",
    )

    # Format dynamodb response into variable
    responseBody = json.dumps({"VisitorCount": ddbResponse["Attributes"]["count"]})

    # Create api response object
    apiResponse = {"isBase64Encoded": False, "statusCode": 200, "body": responseBody}

    # Return api response object
    return apiResponse
I can get VisitorCount to be a string, but not a number. I get this error:
[ERROR] TypeError: lambda_handler() missing 1 required positional argument: 'cou    response = request_handler(event, lambda_context)le_event_request
What is going on?
[UPDATE] I found the original error, which was that the function was not properly received by the SAM app. Changing the name fixed this, and it is now being read. Now I have to troubleshoot the actual Python. New Code:
import json
import boto3
import os

dynamodb = boto3.resource("dynamodb")
ddbTableName = os.environ["databaseName"]
table = dynamodb.Table(ddbTableName)
Key = {"VisitorCount": {"N": "0"}}

def handler(event, context):
    # Update item in table or add if doesn't exist
    ddbResponse = table.update_item(
        UpdateExpression="set VisitorCount = VisitorCount + :val",
        ExpressionAttributeValues={":val": {"N": "1"}},
        ReturnValues="UPDATED_NEW",
    )

    # Format dynamodb response into variable
    responseBody = json.dumps({"VisitorCount": ddbResponse["Attributes"]["count"]})

    # Create api response object
    apiResponse = {"isBase64Encoded": False, "statusCode": 200, "body": responseBody}

    # Return api response object
    return apiResponse
I am getting a syntax error on Line 13, which is
UpdateExpression= "set VisitorCount = VisitorCount + :val",
But I can't tell where I am going wrong on this. It should update the DynamoDB table to increase the count by 1. Looking at the AWS guide it appears to be the correct syntax.
Not sure what the exact error is, but ddbResponse should be built like this:
ddbResponse = table.update_item(
    Key={
        'key1': aaa,
        'key2': bbb
    },
    UpdateExpression="set VisitorCount = VisitorCount + :val",
    ExpressionAttributeValues={":val": Decimal(1)},
    ReturnValues="UPDATED_NEW",
)
Specify the item to be updated with Key (one item per Lambda call).
Use Decimal(1) for ExpressionAttributeValues; the resource-level Table API takes Python types such as Decimal, not the low-level {"N": "1"} wire format.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GettingStarted.Python.03.html#GettingStarted.Python.03.04
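Putting those two fixes together, a minimal sketch of the handler (assuming, as in your first snippet, that the table's partition key attribute is "id"; if_not_exists handles the very first call):

import json
import os
from decimal import Decimal

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(os.environ["databaseName"])

def handler(event, context):
    # Increment the counter, creating it on the first call
    ddbResponse = table.update_item(
        Key={"id": "VisitorCount"},
        UpdateExpression="SET VisitorCount = if_not_exists(VisitorCount, :zero) + :val",
        ExpressionAttributeValues={":val": Decimal(1), ":zero": Decimal(0)},
        ReturnValues="UPDATED_NEW",
    )
    count = int(ddbResponse["Attributes"]["VisitorCount"])
    return {
        "isBase64Encoded": False,
        "statusCode": 200,
        "body": json.dumps({"VisitorCount": count}),
    }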

Microsoft LUIS: unable to set time zone (datetimeReference) for datetimeV2 entities

I am using the V3 API to get predictions from a LUIS endpoint and I need a way to tell LUIS my time zone, so that relative time expressions (e.g. "in the past two hours", "in 10 minutes") are resolved properly by the datetimeV2 entity.
Everything works perfectly if I use the V2 API with the timezoneOffset option, but I am unable to make the V3 API work with the new option datetimeReference (which is supposed to replace timezoneOffset). Actually, I could not even figure out which value I should set for datetimeReference (an integer number? A datetime?).
Here are my attempts with Python. Can anyone tell me if there is anything wrong?
from datetime import datetime
import requests

appId = # my app id
subscriptionKey = # my subscription key
query = "tra 10 minuti"  # = "in 10 minutes" (my app speaks Italian)

# ATTEMPT 1
# based on https://learn.microsoft.com/en-us/azure/cognitive-services/luis/luis-concept-data-alteration?tabs=V2#change-time-zone-of-prebuilt-datetimev2-entity,
# assuming it works the same way as timezoneOffset
endpoint = 'https://westeurope.api.cognitive.microsoft.com/luis/prediction/v3.0/apps/{appId}/slots/staging/predict?datetimeReference=120&subscription-key={subscriptionKey}&query={query}'
endpoint = endpoint.format(appId=appId, subscriptionKey=subscriptionKey, query=query)
response = requests.get(endpoint)

# ATTEMPT 2
# according to https://learn.microsoft.com/en-us/azure/cognitive-services/luis/luis-migration-api-v3
endpoint = 'https://westeurope.api.cognitive.microsoft.com/luis/prediction/v3.0/apps/{appId}/slots/staging/predict?'
endpoint = endpoint.format(appId=appId)
json = {
    "query": query,
    "options": {
        "datetimeReference": datetime.now().strftime("%Y-%m-%dT%H:%M:%S"),  # e.g. "2020-05-07T13:54:33". Not clear if that's what it wants
        "preferExternalEntities": True
    },
    "externalEntities": [],
    "dynamicLists": []
}
response = requests.post(endpoint, json, headers={'Ocp-Apim-Subscription-Key': subscriptionKey})
UPDATE: the correct way of sending the request in ATTEMPT 2 is
response = requests.post(endpoint, json = json, headers = {'Ocp-Apim-Subscription-Key' : subscriptionKey})
As you've discovered, your JSON should go in the json argument and not the data argument:
response = requests.post(endpoint, json = json, headers = {'Ocp-Apim-Subscription-Key' : subscriptionKey})
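For completeness, a sketch of the corrected ATTEMPT 2 request with the body passed via json= (query, endpoint, and subscriptionKey are the variables from the question; whether datetimeReference accepts this datetime string remains the open question):

from datetime import datetime
import requests

body = {
    "query": query,
    "options": {
        "datetimeReference": datetime.now().strftime("%Y-%m-%dT%H:%M:%S"),
        "preferExternalEntities": True
    },
    "externalEntities": [],
    "dynamicLists": []
}
response = requests.post(
    endpoint,  # the V3 slot prediction endpoint built above
    json=body,  # json=, not the positional data argument
    headers={'Ocp-Apim-Subscription-Key': subscriptionKey}
)
print(response.json())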

Complete scan of DynamoDB with boto3

My table is around 220 MB with 250k records in it. I'm trying to pull all of this data into Python. I realize this needs to be a chunked batch process and looped through, but I'm not sure how I can set the batches to start where the previous one left off.
Is there some way to filter my scan? From what I've read, filtering occurs after loading, and loading stops at 1 MB, so I wouldn't actually be able to scan in new objects.
Any assistance would be appreciated.
import boto3

dynamodb = boto3.resource('dynamodb',
    aws_session_token=aws_session_token,
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region
)

table = dynamodb.Table('widgetsTableName')
data = table.scan()
I think the Amazon DynamoDB documentation regarding table scanning answers your question.
In short, you'll need to check for LastEvaluatedKey in the response. Here is an example using your code:
import boto3

dynamodb = boto3.resource('dynamodb',
    aws_session_token=aws_session_token,
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region
)

table = dynamodb.Table('widgetsTableName')

response = table.scan()
data = response['Items']

while 'LastEvaluatedKey' in response:
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    data.extend(response['Items'])
DynamoDB limits each scan call to 1 MB of data.
Documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#DynamoDB.Client.scan
Here is an example loop to get all the data from a DynamoDB table using LastEvaluatedKey:
import boto3

client = boto3.client('dynamodb')

def dump_table(table_name):
    results = []
    last_evaluated_key = None
    while True:
        if last_evaluated_key:
            response = client.scan(
                TableName=table_name,
                ExclusiveStartKey=last_evaluated_key
            )
        else:
            response = client.scan(TableName=table_name)
        last_evaluated_key = response.get('LastEvaluatedKey')
        results.extend(response['Items'])
        if not last_evaluated_key:
            break
    return results

# Usage
data = dump_table('your-table-name')
# do something with data
boto3 offers paginators that handle all the pagination details for you. Here is the doc page for the scan paginator. Basically, you would use it like so:
import boto3

client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')

for page in paginator.paginate(TableName='your-table-name'):
    # do something with page['Items']
Riffing off of Jordon Phillips's answer, here's how you'd pass a FilterExpression in with the pagination:
import boto3

client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')

operation_parameters = {
    'TableName': 'foo',
    'FilterExpression': 'bar > :x AND bar < :y',
    'ExpressionAttributeValues': {
        ':x': {'S': '2017-01-31T01:35'},
        ':y': {'S': '2017-01-31T02:08'},
    }
}

page_iterator = paginator.paginate(**operation_parameters)
for page in page_iterator:
    # do something
Code for stripping the DynamoDB format types, as @kungphu mentioned:
import boto3
from boto3.dynamodb.types import TypeDeserializer
from boto3.dynamodb.transform import TransformationInjector

client = boto3.client('dynamodb')
paginator = client.get_paginator('query')
service_model = client._service_model.operation_model('Query')
trans = TransformationInjector(deserializer=TypeDeserializer())

# pass your usual Query parameters (TableName, KeyConditionExpression, ...) to paginate()
for page in paginator.paginate():
    trans.inject_attribute_value_output(page, service_model)
It turns out that Boto3 returns the LastEvaluatedKey as part of the response. This can be used as the start point for the next scan:
data = table.scan(
    ExclusiveStartKey=data['LastEvaluatedKey']
)
I plan on building a loop around this until the response no longer contains a LastEvaluatedKey.
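A minimal sketch of that loop, using the same resource-level table object as in the question:

response = table.scan()
items = response['Items']
while 'LastEvaluatedKey' in response:
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    items.extend(response['Items'])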
The approaches suggested above both have problems: either you write lengthy and repetitive code that handles paging explicitly in a loop, or you use Boto paginators with low-level clients and forgo the advantages of the higher-level Boto objects.
A solution using Python functional code to provide a high-level abstraction allows the higher-level Boto methods to be used while hiding the complexity of AWS paging:
import itertools
import typing

def iterate_result_pages(function_returning_response: typing.Callable, *args, **kwargs) -> typing.Generator:
    """A wrapper for functions using AWS paging, that returns a generator which yields a sequence of items for
    every response

    Args:
        function_returning_response: A function (or callable), that returns an AWS response with 'Items' and optionally 'LastEvaluatedKey'
            This could be a bound method of an object.

    Returns:
        A generator which yields the 'Items' field of the result for every response
    """
    response = function_returning_response(*args, **kwargs)
    yield response["Items"]

    while "LastEvaluatedKey" in response:
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
        response = function_returning_response(*args, **kwargs)
        yield response["Items"]

    return

def iterate_paged_results(function_returning_response: typing.Callable, *args, **kwargs) -> typing.Iterator:
    """A wrapper for functions using AWS paging, that returns an iterator of all the items in the responses.
    Items are yielded to the caller as soon as they are received.

    Args:
        function_returning_response: A function (or callable), that returns an AWS response with 'Items' and optionally 'LastEvaluatedKey'
            This could be a bound method of an object.

    Returns:
        An iterator which yields one response item at a time
    """
    return itertools.chain.from_iterable(iterate_result_pages(function_returning_response, *args, **kwargs))

# Example, assuming 'table' is a Boto DynamoDB table object:
all_items = list(iterate_paged_results(table.scan, ProjectionExpression='my_field'))
I had some problems with Vincent's answer related to the transformation being applied to the LastEvaluatedKey and messing up the pagination. Solved as follows:
import boto3
from boto3.dynamodb.types import TypeDeserializer
from boto3.dynamodb.transform import TransformationInjector

client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')
operation_model = client._service_model.operation_model('Scan')
trans = TransformationInjector(deserializer=TypeDeserializer())
operation_parameters = {
    'TableName': 'tablename',
}

items = []
for page in paginator.paginate(**operation_parameters):
    has_last_key = 'LastEvaluatedKey' in page
    if has_last_key:
        last_key = page['LastEvaluatedKey'].copy()
    trans.inject_attribute_value_output(page, operation_model)
    if has_last_key:
        page['LastEvaluatedKey'] = last_key
    items.extend(page['Items'])
If you are landing here looking for a paginated scan with some filtering expression(s):
import boto3
from boto3.dynamodb.conditions import Attr

def scan(table, **kwargs):
    response = table.scan(**kwargs)
    yield from response['Items']
    while response.get('LastEvaluatedKey'):
        response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'], **kwargs)
        yield from response['Items']

Example usage:
table = boto3.Session(...).resource('dynamodb').Table('widgetsTableName')
items = list(scan(table, FilterExpression=Attr('name').contains('foo')))
I can't work out why Boto3 provides high-level resource abstraction but doesn't provide pagination. When it does provide pagination, it's hard to use!
The other answers to this question were good but I wanted a super simple way to wrap the boto3 methods and provide memory-efficient paging using generators:
import typing

import boto3
import boto3.dynamodb.conditions

def paginate_dynamodb_response(dynamodb_action: typing.Callable, **kwargs) -> typing.Generator[dict, None, None]:
    # Using the syntax from https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/python/example_code/dynamodb/GettingStarted/scenario_getting_started_movies.py
    keywords = kwargs
    done = False
    start_key = None
    while not done:
        if start_key:
            keywords['ExclusiveStartKey'] = start_key
        response = dynamodb_action(**keywords)
        start_key = response.get('LastEvaluatedKey', None)
        done = start_key is None
        for item in response.get("Items", []):
            yield item

## Usage ##
dynamodb_res = boto3.resource('dynamodb')
dynamodb_table = dynamodb_res.Table('my-table')

query = paginate_dynamodb_response(
    dynamodb_table.query,  # The boto3 method. E.g. query or scan
    # Regular Query or Scan parameters
    #
    # IndexName='myindex'  # If required
    KeyConditionExpression=boto3.dynamodb.conditions.Key('id').eq('1234')
)

for x in query:
    print(x)
