How to push the data from dynamodb through stream

How to push the data from dynamodb through stream - python

Below is the json file
[
{
"year": 2013,
"title": "Rush",
"actors": [
"Daniel Bruhl",
"Chris Hemsworth",
"Olivia Wilde"
]
},
{
"year": 2013,
"title": "Prisoners",
"actors": [
"Hugh Jackman",
"Jake Gyllenhaal",
"Viola Davis"
]
}
]
Below is the code to push to dynamodb. I have created testjsonbucket bucket name, moviedataten.json is the filename and saved above json.Create a dynamodb with Primary partition key as year (Number) and
Primary sort key as title (String).
import json
from decimal import Decimal
import json
import boto3
s3 = boto3.resource('s3')
obj = s3.Object('testjsonbucket', 'moviedataten.json')
body = obj.json
#def lambda_handler(event,context):
# print (body)
def load_movies(movies, dynamodb=None):
if not dynamodb:
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Movies')
for movie in movies:
year = int(movie['year'])
title = movie['title']
print("Adding movie:", year, title)
table.put_item(Item=movie)
def lambda_handler(event, context):
movie_list = json.loads(body, parse_float=Decimal)
load_movies(movie_list)
I want to push in to ElasticSearch from dynamodb.
I have created a Elastic Domain https://xx.x.x.com/testelas
I have gone through the link https://aws.amazon.com/blogs/compute/indexing-amazon-dynamodb-content-with-amazon-elasticsearch-service-using-aws-lambda/
I clicked Managestream also
My Requirement:
Any change in Dynamodb has to reflect in the Elasticsearch?

This lambda just writing the document to DynamoDb, and I will not recommend adding the code in this lambda to push the same object to Elastic search, as lambda function should perform a single task and pushing the same document to ELK should be managed as a DynamoDB stream.
What if ELK is down or not available how you will manage this in lambda?
What if you want to disable this in future? you will need to modify lambda instead of controlling this from AWS API or AWS console, all you need to just disable the stream when required no changes on above lambda side code
What if you want to move only modify or TTL item to elastic search?
So create Dyanodb Stream that pushes the document to another Lambda that is responsible to push the document to ELK, with this option you can also push old and new both items.
You can look into this article too that describe another approach data-streaming-from-dynamodb-to-elasticsearch
For above approach look into this GitHub project dynamodb-stream-elasticsearch.
const { pushStream } = require('dynamodb-stream-elasticsearch');
const { ES_ENDPOINT, INDEX, TYPE } = process.env;
function myHandler(event, context, callback) {
console.log('Received event:', JSON.stringify(event, null, 2));
pushStream({ event, endpoint: ES_ENDPOINT, index: INDEX, type: TYPE })
.then(() => {
callback(null, `Successfully processed ${event.Records.length} records.`);
})
.catch((e) => {
callback(`Error ${e}`, null);
});
}
exports.handler = myHandler;

DynamoDB has a built in feature (DynamoDB streams) that will handle the stream part of this question.
When you configure this you have the choice of the following configurations:
KEYS_ONLY — Only the key attributes of the modified item.
NEW_IMAGE — The entire item, as it appears after it was modified.
OLD_IMAGE — The entire item, as it appeared before it was modified.
NEW_AND_OLD_IMAGES — Both the new and the old images of the item.
This will produce an event that looks like the following
{
"Records":[
{
"eventID":"1",
"eventName":"INSERT",
"eventVersion":"1.0",
"eventSource":"aws:dynamodb",
"awsRegion":"us-east-1",
"dynamodb":{
"Keys":{
"Id":{
"N":"101"
}
},
"NewImage":{
"Message":{
"S":"New item!"
},
"Id":{
"N":"101"
}
},
"SequenceNumber":"111",
"SizeBytes":26,
"StreamViewType":"NEW_AND_OLD_IMAGES"
},
"eventSourceARN":"stream-ARN"
},
{
"eventID":"2",
"eventName":"MODIFY",
"eventVersion":"1.0",
"eventSource":"aws:dynamodb",
"awsRegion":"us-east-1",
"dynamodb":{
"Keys":{
"Id":{
"N":"101"
}
},
"NewImage":{
"Message":{
"S":"This item has changed"
},
"Id":{
"N":"101"
}
},
"OldImage":{
"Message":{
"S":"New item!"
},
"Id":{
"N":"101"
}
},
"SequenceNumber":"222",
"SizeBytes":59,
"StreamViewType":"NEW_AND_OLD_IMAGES"
},
"eventSourceARN":"stream-ARN"
},
{
"eventID":"3",
"eventName":"REMOVE",
"eventVersion":"1.0",
"eventSource":"aws:dynamodb",
"awsRegion":"us-east-1",
"dynamodb":{
"Keys":{
"Id":{
"N":"101"
}
},
"OldImage":{
"Message":{
"S":"This item has changed"
},
"Id":{
"N":"101"
}
},
"SequenceNumber":"333",
"SizeBytes":38,
"StreamViewType":"NEW_AND_OLD_IMAGES"
},
"eventSourceARN":"stream-ARN"
}
]
}
As you're already familiar with Lambda it makes sense to use a Lambda function to consume the records and then iterate through them to process them in the Elasticsearch format before adding them to your index.
When doing this make sure that you iterate through each record as there may be multiple depending on your configuration.
For more information on the steps required for the Lambda side of the function check out the Tutorial: Using AWS Lambda with Amazon DynamoDB streams page.

Related

How to save json data as it is without data type conversion in dynamo db using python

I want to store key-value JSON data in aws DynamoDB where key is a date string in YYYY-mm-dd format and value is entries which is a python dictionary. When I used boto3 client to save data there, it saved it as a data type object, which I don't want. My purpose is simple: Store JSON data against a key which is a date, so that later I will query the data by giving that date. I am struggling with this issue because I did not find any relevant link which says how to store JSON data and retrieve it without any conversion.
I need help to solve it in Python.
What I am doing now:
item = {
"entries": [
{
"path": [
{
"name": "test1",
"count": 1
},
{
"name": "test2",
"count": 2
}
],
"repo": "test3"
}
],
"date": "2022-10-11"
}
dynamodb_client = boto3.resource('dynamodb')
table = self.dynamodb_client.Table(table_name)
response = table.put_item(Item = item)
What actually saved:
[{"M":{"path":{"L":[{"M":{"name":{"S":"test1"},"count":{"N":"1"}}},{"M":{"name":{"S":"test2"},"count":{"N":"2"}}}]},"repo":{"S":"test3"}}}]
But I want to save exactly the same JSON data as it is, without any conversion at all.
When I retrieve it programmatically, you see the difference of single quote, count value change.
response = table.get_item(
Key={
"date": "2022-10-12"
}
)
Output
{'Item': {'entries': [{'path': [{'name': 'test1', 'count': Decimal('1')}, {'name': 'test2', 'count': Decimal('2')}], 'repo': 'test3'}], 'date': '2022-10-12} }
Sample picture:

Why not store it as a single attribute of type string? Then you’ll get out exactly what you put in, byte for byte.

When you store this in DynamoDB you get exactly what you want/have provided. Key is your date and you have a list of entries.
If you need it to store in a different format you need to provide the JSON which correlates with what you need. It's important to note that DynamoDB is a key-value store not a document store. You should also look up the differences in these.

I figured out how to solve this issue. I have two column name date and entries in my dynamo db (also visible in screenshot in ques).
I convert entries values from list to string then saved it in db. At the time of retrival, I do the same, create proper json response and return it.
I am also sharing sample code below so that anybody else dealing with the same situation can have atleast one option.
# While storing:
entries_string = json.dumps([
{
"path": [
{
"name": "test1",
"count": 1
},
{
"name": "test2",
"count": 2
}
],
"repo": "test3"
}
])
item = {
"entries": entries_string,
"date": "2022-10-12"
}
dynamodb_client = boto3.resource('dynamodb')
table = dynamodb_client.Table(<TABLE-NAME>)
-------------------------
# While fetching:
response = table.get_item(
Key={
"date": "2022-10-12"
}
)['Item']
entries_string=response['entries']
entries_dic = json.loads(entries_string)
response['entries'] = entries_dic
print(json.dumps(response))

Return empty columns in athena boto response object

When executing a select query against an athena table via boto3, the response object given is in the syntax:
{
"UpdateCount":0,
"ResultSet":{
"Rows":[
{
"Data":[
{
"VarCharValue":"site_name"
},
{
"VarCharValue":"volume_out_capacity"
},
{
"VarCharValue":"region"
},
{
"VarCharValue":"site_ref"
}
]
},
{
"Data":[
{
"VarCharValue":"ASSET 12"
},
{
"VarCharValue":"10"
},
{
"VarCharValue":"NORTH"
},
{
"VarCharValue":"RHW007777000138"
}
]
}
]
}
Is there an additional argument that can be passed so that the response object will contain columns that do not contain values? Something like:
{
"VarCharValue":"xyz"
}
]
},
{
"Data":[
{
"VarCharValue":None
}
I have looked through the documentation extensively but cannot find arguments that can describe how to format the response in get_query_results() or start_query_execution()

I do not see a option to get that data back directly from Athena. Alternatively if you use S3 Select instead of Athena, you'll get back a json object with all the columns whether they have data or are empty.
Sample Data:
name,number,city,state
coin,1,driggs,
michelle,,chicago,
shaniqua,2,,
marcos,3,stlouis,
S3 Select Result:
{"name":"coin","number":"1","city":"driggs","state":""}
{"name":"michelle","number":"","city":"chicago","state":""}
{"name":"shaniqua","number":"2","city":"","state":""}
{"name":"marcos","number":"3","city":"stlouis","state":""}
Code:
import boto3
session = boto3.session.Session(profile_name="<my-profile>")
s3 = session.client('s3')
resp = s3.select_object_content(
Bucket='<my-bucket>',
Key='<my-file>',
ExpressionType='SQL',
Expression="SELECT * FROM s3object",
InputSerialization={'CSV': {"FileHeaderInfo": "Use", "RecordDelimiter": '\r\n'}, 'CompressionType': 'NONE'},
OutputSerialization={'JSON': {}},
)
for event in resp['Payload']:
if 'Records' in event:
records = event['Records']['Payload'].decode('utf-8')
print(records)
elif 'Stats' in event:
statsDetails = event['Stats']['Details']
print("Stats details bytesScanned: ")
print(statsDetails['BytesScanned'])
print("Stats details bytesProcessed: ")
print(statsDetails['BytesProcessed'])
print("Stats details bytesReturned: ")
print(statsDetails['BytesReturned'])

How to parse nested JSON object?

I am working on a new project in HubSpot that returns nested JSON like the sample below. I am trying to access the associated contacts id, but am struggling to reference it correctly (the id I am looking for is the value '201' in the example below). I've put together this script, but this script only returns the entire associations portion of the JSON and I only want the id. How do I reference the id correctly?
Here is the output from the script:
{'contacts': {'paging': None, 'results': [{'id': '201', 'type': 'ticket_to_contact'}]}}
And here is the script I put together:
import hubspot
from pprint import pprint
client = hubspot.Client.create(api_key="API_KEY")
try:
api_response = client.crm.tickets.basic_api.get_page(limit=2, associations=["contacts"], archived=False)
for x in range(2):
pprint(api_response.results[x].associations)
except ApiException as e:
print("Exception when calling basic_api->get_page: %s\n" % e)
Here is what the full JSON looks like ('contacts' property shortened for readability):
{
"results": [
{
"id": "34018123",
"properties": {
"content": "Hi xxxxx,\r\n\r\nCan you clarify on how the blocking of script happens? Is it because of any CSP (or) the script will decide run time for every URL’s getting triggered from browser?\r\n\r\nRegards,\r\nLogan",
"createdate": "2019-07-03T04:20:12.366Z",
"hs_lastmodifieddate": "2020-12-09T01:16:12.974Z",
"hs_object_id": "34018123",
"hs_pipeline": "0",
"hs_pipeline_stage": "4",
"hs_ticket_category": null,
"hs_ticket_priority": null,
"subject": "RE: call followup"
},
"createdAt": "2019-07-03T04:20:12.366Z",
"updatedAt": "2020-12-09T01:16:12.974Z",
"archived": false
},
{
"id": "34018892",
"properties": {
"content": "Hi Guys,\r\n\r\nI see that we were placed back on the staging and then removed again.",
"createdate": "2019-07-03T07:59:10.606Z",
"hs_lastmodifieddate": "2021-12-17T09:04:46.316Z",
"hs_object_id": "34018892",
"hs_pipeline": "0",
"hs_pipeline_stage": "3",
"hs_ticket_category": null,
"hs_ticket_priority": null,
"subject": "Re: Issue due to server"
},
"createdAt": "2019-07-03T07:59:10.606Z",
"updatedAt": "2021-12-17T09:04:46.316Z",
"archived": false,
"associations": {
"contacts": {
"results": [
{
"id": "201",
"type": "ticket_to_contact"
}
]
}
}
}
],
"paging": {
"next": {
"after": "35406270",
"link": "https://api.hubapi.com/crm/v3/objects/tickets?associations=contacts&archived=false&hs_static_app=developer-docs-ui&limit=2&after=35406270&hs_static_app_version=1.3488"
}
}
}

You can do api_response.results[x].associations["contacts"]["results"][0]["id"].

Sorted this out, posting in case anyone else is struggling with the response from the HubSpot v3 Api. The response schema for this call is:
Response schema type: Object
String results[].id
Object results[].properties
String results[].createdAt
String results[].updatedAt
Boolean results[].archived
String results[].archivedAt
Object results[].associations
Object paging
Object paging.next
String paging.next.after
String paging.next.linkResponse schema type: Object
String results[].id
Object results[].properties
String results[].createdAt
String results[].updatedAt
Boolean results[].archived
String results[].archivedAt
Object results[].associations
Object paging
Object paging.next
String paging.next.after
String paging.next.link
So to access the id of the contact associated with the ticket, you need to reference it using this notation:
api_response.results[1].associations["contacts"].results[0].id
notes:
results[x] - reference the result in the index
associations["contacts"] -
associations is a dictionary object, you can access the contacts item
by it's name
associations["contacts"].results is a list - reference
by the index []
id - is a string

In my case type was ModelProperty or CollectionResponseProperty couldn't reach dict anyhow.
For the record this got me to go through the results.
for result in list(api_response.results):
ID = result.id

Using the Firestore REST API to Update a Document Field

I've been searching for a pretty long time but I can't figure out how to update a field in a document using the Firestore REST API. I've looked on other questions but they haven't helped me since I'm getting a different error:
{'error': {'code': 400, 'message': 'Request contains an invalid argument.', 'status': 'INVALID_ARGUMENT', 'details': [{'#type': 'type.googleapis.com/google.rpc.BadRequest', 'fieldViolations': [{'field': 'oil', 'description': "Error expanding 'fields' parameter. Cannot find matching fields for path 'oil'."}]}]}}
I'm getting this error even though I know that the "oil" field exists in the document. I'm writing this in Python.
My request body (field is the field in a document and value is the value to set that field to, both strings received from user input):
{
"fields": {
field: {
"integerValue": value
}
}
}
My request (authorizationToken is from a different request, dir is also a string from user input which controls the directory):
requests.patch("https://firestore.googleapis.com/v1beta1/projects/aethia-resource-management/databases/(default)/documents/" + dir + "?updateMask.fieldPaths=" + field, data = body, headers = {"Authorization": "Bearer " + authorizationToken}).json()

Based on the the official docs (1,2, and 3), GitHub and a nice article, for the example you have provided you should use the following:
requests.patch("https://firestore.googleapis.com/v1beta1/projects{projectId}/databases/{databaseId}/documents/{document_path}?updateMask.fieldPaths=field")
Your request body should be:
{
"fields": {
"field": {
"integerValue": Value
}
}
}
Also keep in mind that if you want to update multiple fields and values you should specify each one separately.
Example:
https://firestore.googleapis.com/v1beta1/projects/{projectId}/databases/{databaseId}/documents/{document_path}?updateMask.fieldPaths=[Field1]&updateMask.fieldPaths=[Field2]
and the request body would have been:
{
"fields": {
"field": {
"integerValue": Value
},
"Field2": {
"stringValue": "Value2"
}
}
}
EDIT:
Here is a way I have tested which allows you to update some fields of a document without affecting the rest.
This sample code creates a document under collection users with 4 fields, then tries to update 3 out of 4 fields (which leaves the one not mentioned unaffected)
from google.cloud import firestore
db = firestore.Client()
#Creating a sample new Document “aturing” under collection “users”
doc_ref = db.collection(u'users').document(u'aturing')
doc_ref.set({
u'first': u'Alan',
u'middle': u'Mathison',
u'last': u'Turing',
u'born': 1912
})
#updating 3 out of 4 fields (so the last should remain unaffected)
doc_ref = db.collection(u'users').document(u'aturing')
doc_ref.update({
u'first': u'Alan',
u'middle': u'Mathison',
u'born': 2000
})
#printing the content of all docs under users
users_ref = db.collection(u'users')
docs = users_ref.stream()
for doc in docs:
print(u'{} => {}'.format(doc.id, doc.to_dict()))
EDIT: 10/12/2019
PATCH with REST API
I have reproduced your issue and it seems like you are not converting your request body to a json format properly.
You need to use json.dumps() to convert your request body to a valid json format.
A working example is the following:
import requests
import json
endpoint = "https://firestore.googleapis.com/v1/projects/[PROJECT_ID]/databases/(default)/documents/[COLLECTION]/[DOCUMENT_ID]?currentDocument.exists=true&updateMask.fieldPaths=[FIELD_1]"
body = {
"fields" : {
"[FIELD_1]" : {
"stringValue" : "random new value"
}
}
}
data = json.dumps(body)
headers = {"Authorization": "Bearer [AUTH_TOKEN]"}
print(requests.patch(endpoint, data=data, headers=headers).json())

I found the official documentation to not to be of much use since there was no example mentioned. This is the API end-point for your firestore database
PATCH https://firestore.googleapis.com/v1beta1/projects/{YOUR_PROJECT_ID}/databases/(default)/documents/{COLLECTION_NAME}/{DOCUMENT_NAME}
the following code is the body of your API request
{
"fields": {
"first_name": {
"stringValue":"Kurt"
},
"name": {
"stringValue":"Cobain"
},
"band": {
"stringValue":"Nirvana"
}
}
}
The response you should get upon successful update of the database should look like
{
"name": "projects/{YOUR_PROJECT_ID}/databases/(default)/documents/{COLLECTION_ID/{DOC_ID}",
{
"fields": {
"first_name": {
"stringValue":"Kurt"
},
"name": {
"stringValue":"Cobain"
},
"band": {
"stringValue":"Nirvana"
}
}
"createTime": "{CREATE_TIME}",
"updateTime": "{UPDATE_TIME}"
Note that performing the above action would result in a new document being created, meaning that any fields that existed previously but have NOT been mentioned in the "fields" body will be deleted. In order to preserve fields, you'll have to add
?updateMask.fieldPaths={FIELD_NAME} --> to the end of your API call (for each individual field that you want to preserve).
For example:
PATCH https://firestore.googleapis.com/v1beta1/projects/{YOUR_PROJECT_ID}/databases/(default)/documents/{COLLECTION_NAME}/{DOCUMENT_NAME}?updateMask.fieldPaths=name&updateMask.fieldPaths=band&updateMask.fieldPaths=age. --> and so on

How to share data in `AWS Step Functions` without passing it between the steps

I use AWS Step Functions and have the following workflow
initStep - It's a lambda function handler, that gets some data and sends it to SQS for external service.
activity = os.getenv('ACTIVITY')
queue_name = os.getenv('QUEUE_NAME')
def lambda_handler(event, context):
event['my_activity'] = activity
data = json.dumps(event)
# Retrieving a queue by its name
sqs = boto3.resource('sqs')
queue = sqs.get_queue_by_name(QueueName=queue_name)
queue.send_message(MessageBody=data, MessageGroupId='messageGroup1' + str(datetime.time(datetime.now())))
return event
validationWaiting - It's an activity that waits for an answer from the external service that include the data.
complete - It's a lambda function handler, that uses the data from the initStep.
def lambda_handler(event, context):
email = event['email'] if 'email' in event else None
data = event['data'] if 'data' in event else None
client = boto3.client(service_name='ses')
to = email.split(', ')
message_conrainer = {'Subject': {'Data': 'Email from step functions'},
'Body': {'Html': {
'Charset': "UTF-8",
'Data': """<html><body>
<p>""" + data """</p>
</body> </html> """
}}}
destination = {'ToAddresses': to,
'CcAddresses': [],
'BccAddresses': []}
return client.send_email(Source=from_addresses,
Destination=destination,
Message=message_container)
It does work, but the problem is that I'm sending full data from the initStep to external service, just to pass it later to complete. Potentially more steps can be added.
I believe it would be better to share it as some sort of global data (of current step function), that way I could add or remove steps and data would still be available for all.

You can make use of InputPath and ResultPath. In initStep you would only send necessary data to external service (probably along with some unique identifier of Execution). In the ValidaitonWaiting step you can set following properties (in State Machine definition):
InputPath: What data will be provided to GetActivityTask. Probably you want to set it to something like $.execution_unique_id where execution_unique_id is field in your data that external service uses to identify Execution (to match it with specific request during initStep).
ResultPath: Where output of ValidationWaiting Activity will be saved in data. You can set it to $.validation_output and json result from external service will be present there.
This way you can send to external service only data that is actually needed by it and you won't lose access to any data that was previously (before ValidationWaiting step) in the input.
For example, you could have following definition of the State Machine:
{
"StartAt": "initStep",
"States": {
"initStep": {
"Type": "Pass",
"Result": {
"executionId": "some:special:id",
"data": {},
"someOtherData": {"value": "key"}
},
"Next": "ValidationWaiting"
},
"ValidationWaiting": {
"Type": "Pass",
"InputPath": "$.executionId",
"ResultPath": "$.validationOutput",
"Result": {
"validationMessages": ["a", "b"]
},
"Next": "Complete"
},
"Complete": {
"Type": "Pass",
"End": true
}
}
}
I've used Pass states for initStep and ValidationWaiting to simplify the example (I haven't run it, but it should work). Result field is specific to Pass task and it is equivalent to the result of your Lambda functions or Activity.
In this scenario Complete step would get following input:
{
"executionId": "some:special:id",
"data": {},
"someOtherData": {"value": key"},
"validationOutput": {
"validationMessages": ["a", "b"]
}
}
So the result of ValidationWaiting step has been saved into validationOutput field.

Based on the answer of Marcin Sucharski I've came up with my own solution.
I needed to use Type: Task since initStep is a lambda, which sends SQS.
I didn't needed InputPath in ValidationWaiting, but only ResultPath, which store the data received in activity.
I work with Serverless framework, here is my final solution:
StartAt: initStep
States:
initStep:
Type: Task
Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:init-step
Next: ValidationWaiting
ValidationWaiting:
Type: Task
ResultPath: $.validationOutput
Resource: arn:aws:states:#{AWS::Region}:#{AWS::AccountId}:activity:validationActivity
Next: Complete
Catch:
- ErrorEquals:
- States.ALL
ResultPath: $.validationOutput
Next: Complete
Complete:
Type: Task
Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:complete-step
End: true

Here a short and simple solution with InputPath and ResultPath. My Lambda Check_Ubuntu_Updates return a list of instance ready to be updated. This list of instances is received by the step Notify_Results, then it use this data. Remember that if you have several ResultPath in your Step Function and you need more than 1 input in a step you can use InputPath only with $.
{
"Comment": "A state machine that check some updates systems available.",
"StartAt": "Check_Ubuntu_Updates",
"States": {
"Check_Ubuntu_Updates": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:#############:function:Check_Ubuntu_Updates",
"ResultPath": "$.instances",
"Next": "Notify_Results"
},
"Notify_Results": {
"Type": "Task",
"InputPath": "$.instances",
"Resource": "arn:aws:lambda:us-east-1:#############:function:Notify_Results",
"End": true
}
}
}

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to push the data from dynamodb through stream - python

Related

How to save json data as it is without data type conversion in dynamo db using python

Return empty columns in athena boto response object

How to parse nested JSON object?

Using the Firestore REST API to Update a Document Field

How to share data in `AWS Step Functions` without passing it between the steps

Categories

Resources