AWS DMS inserts duplicate records into Kinesis and S3 - python

I have configured a DMS migration instance that replicates data from MySQL into an AWS Kinesis stream, but I noticed that when I process the Kinesis records I pick up duplicates. This does not happen for every record.
How do I prevent these duplicate records from being pushed to the kinesis data stream or the S3 bucket?
I'm using a Lambda function to process the records, so I thought of adding logic to de-duplicate the data, but I'm not sure how to do that without persisting the data somewhere. I need to process the data in real time, so persisting it would not be ideal.
Regards
Pragesan

I added a global variable that stores the primary key of the last record, so each invocation checks the previous primary key value, and if it is different I insert the record.
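For what it's worth, a minimal sketch of that idea, assuming the Kinesis records carry the DMS JSON payload and that the primary key lives in a field called id (both the field name and the payload shape are assumptions here):
import base64
import json

# A module-level variable only survives on a warm Lambda container, so this
# catches back-to-back duplicates rather than being a durable de-dup store.
last_pk = None

def lambda_handler(event, context):
    global last_pk
    for record in event['Records']:
        payload = json.loads(base64.b64decode(record['kinesis']['data']))
        pk = payload.get('data', {}).get('id')  # assumed primary key field
        if pk is not None and pk == last_pk:
            continue  # skip the duplicate
        last_pk = pk
        process(payload)  # your existing processing logic goes here

def process(payload):
    print(payload)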

Related

Azure function - multiple files coming into storage blob at the same time: how to process them sequentially

I'm building a solution that extracts data from invoice PDFs with the Microsoft Form Recognizer API and adds the information to an SQL database.
I built a parser and code that adds rows from the API response to a database when a PDF is successfully uploaded to my storage blob.
I'm looking for the easiest way to handle multiple PDFs arriving at the same time.
I'm facing deadlock issues when I test with multiple incoming PDFs because of process conflicts in SQL Server: if I upload 4 PDFs, all 4 are processed at the same time and their data is parsed and added to SQL simultaneously, which causes conflicts and potentially a non-logical arrangement of the database rows (I don't want to run an update grouped by invoice number after each process to rearrange the whole table).
Now I'm looking for a solution that can take every element arriving in the storage blob one after another, instead of all at the same time. Something like a for loop that iterates sequentially over every "source" blob and feeds it into my parsing function.
Or something like a queue that could work like this:
PDF1, PDF2, PDF3 arrive in the storage blob:
Make PDF2 and PDF3 wait, send PDF1 to the API for analysis, add the data to SQL, and when the last row is added, send PDF2 to the API, add the data to SQL, and when the last row is added, send PDF3, etc.
Thanks for your suggestions :)
You can route Azure blob storage events to an Azure Function. https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-event-overview
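If you need strict one-at-a-time processing, one possible (not the only) pattern is to have the blob/Event Grid trigger only enqueue the blob name into a Storage queue, and let a queue-triggered Python function do the Form Recognizer call and SQL insert; setting the queue batchSize to 1 (and newBatchThreshold to 0) in host.json makes each instance pull one message at a time. A rough sketch, with illustrative names:
# __init__.py of a queue-triggered function (binding and helper names are illustrative)
import azure.functions as func

def main(msg: func.QueueMessage) -> None:
    # A lightweight blob-created handler put the blob name on the queue.
    blob_name = msg.get_body().decode('utf-8')
    # parse_invoice_and_insert(blob_name)  # your existing parsing + SQL code
    print(f"Processing {blob_name}")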

Get item as soon as it is inserted in DynamoDB in python

Looking for an example.
We insert a single record in DynamoDB.
We need to retrieve that item as soon as it is inserted into the DynamoDB database, using Python.
Essentially, something that continuously watches for the latest item in the DB, so that once it is inserted it is retrieved.
You typically want an event-based solution without having to poll the database, which means using DynamoDB Streams and Lambda functions.
You can, however, also write a Python client that polls DynamoDB Streams.
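As a rough sketch of the Streams + Lambda route (assuming the table's stream is enabled with NEW_IMAGE and mapped to the function), the handler sees each insert like this:
def lambda_handler(event, context):
    for record in event['Records']:
        if record['eventName'] != 'INSERT':
            continue
        # The new item arrives in DynamoDB's typed JSON format,
        # e.g. {'pk': {'S': 'abc'}, 'amount': {'N': '42'}}
        new_image = record['dynamodb']['NewImage']
        print(new_image)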

Exporting data from DynamoDB to a CSV file

We want to export data from DynamoDB to a file. We have around 150,000 records; each record is 430 bytes. It would be a periodic activity, once a week. Can we do that with Lambda? Is it possible, given that Lambda has a maximum execution time of 15 minutes?
Is there a better option using Python or via the UI? I'm unable to export more than 100 records from the UI.
One really simple option is to use the Command Line Interface tools
aws dynamodb scan --table-name YOURTABLE --output text > outputfile.txt
This would give you tab-delimited output. You can run it as a cron job for regular exports.
The scan wouldn't take anything like 15 minutes (probably just a few seconds). So you wouldn't need to worry about your Lambda timing out if you did it that way.
You can export your data from DynamoDB in a number of ways.
The simplest way would be a full table scan:
import boto3

dynamodb = boto3.client('dynamodb')

response = dynamodb.scan(
    TableName=your_table,
    Select='ALL_ATTRIBUTES')
data = response['Items']

while 'LastEvaluatedKey' in response:
    response = dynamodb.scan(
        TableName=your_table,
        Select='ALL_ATTRIBUTES',
        ExclusiveStartKey=response['LastEvaluatedKey'])
    data.extend(response['Items'])

# save your data as csv here
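One possible way to fill in that last step, assuming you flatten DynamoDB's typed attribute values with boto3's TypeDeserializer (the output filename is arbitrary):
import csv
from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()
# Turn {'name': {'S': 'foo'}, ...} into plain Python values.
rows = [{k: deserializer.deserialize(v) for k, v in item.items()} for item in data]

fieldnames = sorted({key for row in rows for key in row})
with open('export.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)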
But if you want to do it every X days, what I would recommend is:
Create your first dump from your table with the code above.
Then, you can create a DynamoDB trigger to a Lambda function that will receive all your table changes (insert, update, delete), and then you can append the data to your CSV file. The code would be something like:
def lambda_handler(event, context):
    for record in event['Records']:
        # get the changes here and save them
Since you will receive only your table updates, you don't need to worry about Lambda's 15-minute execution limit.
You can read more about DynamoDB Streams and Lambda here: DynamoDB Streams and AWS Lambda Triggers
And if you want to work on your data, you can always create an AWS Glue job or an EMR cluster.
We resolved it using AWS Lambda: 150,000 records (each record is 430 bytes) are processed to a CSV file in 2.2 minutes using the maximum available memory (3008 MB). We created an event rule for it to run on a periodic basis. The time and size are mentioned so that anyone can calculate how much they can do with Lambda.
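For the scheduling part, a hedged sketch of creating such a rule with boto3 (the rule name and function ARN are placeholders, and the Lambda still needs a permission allowing events.amazonaws.com to invoke it):
import boto3

events = boto3.client('events')

# Run once a week; EventBridge/CloudWatch Events schedule expression.
events.put_rule(Name='weekly-dynamodb-export', ScheduleExpression='rate(7 days)')
events.put_targets(
    Rule='weekly-dynamodb-export',
    Targets=[{'Id': 'export-lambda',
              'Arn': 'arn:aws:lambda:eu-west-1:123456789012:function:export-to-csv'}])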
You can refer to an existing question on Stack Overflow about exporting a DynamoDB table as a CSV.

Automate file loading from S3 to Snowflake

New JSON files are dumped into an S3 bucket daily. I have to create a solution that picks up the latest file when it arrives, parses the JSON, and loads it into the Snowflake data warehouse. Could someone please share your thoughts on how we can achieve this?
There are a number of ways to do this depending on your needs. I would suggest creating an event to trigger a lambda function.
https://docs.aws.amazon.com/lambda/latest/dg/with-s3.html
Another option may be to create an SQS message when the file lands on S3 and have an EC2 instance poll the queue and process as necessary.
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/sqs-example-long-polling.html
Edit: Here is a more detailed explanation of how to create events from S3 and trigger Lambda functions. The documentation is provided by Snowflake:
https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe-rest-lambda.html
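For illustration only, the S3-triggered Lambda side can be as small as pulling the new object's location out of the event and handing it to whatever does the Snowflake load; load_into_snowflake is a placeholder:
import urllib.parse

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        load_into_snowflake(bucket, key)  # placeholder: e.g. Snowpipe call or COPY INTO

def load_into_snowflake(bucket, key):
    print(f"Would load s3://{bucket}/{key} into Snowflake")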
Look into Snowpipe; it lets you do that within the system, making it (possibly) much easier.
There are some aspects to be considered, such as whether it is batch or streaming data, whether you want to retry loading the file in case of wrong data or format, and whether you want to make it a generic process able to handle different file formats/types (CSV/JSON) and stages.
In our case we have built a generic S3-to-Snowflake load using Python and Luigi, and also implemented the same using SSIS, but for CSV/TXT files only.
In my case, I have a Python script which gets information about the bucket with boto.
Once I detect a change, I call the insertFiles REST endpoint on Snowpipe.
Phasing:
detect the S3 change
get the S3 object path
parse the content and transform it to CSV in S3 (same bucket, or another one Snowpipe can connect to)
call the Snowpipe REST API
What you need:
Create a user with a public key
Create your stage on Snowflake with AWS credentials in order to access S3
Create your pipe on Snowflake with your user role
Sign a JWT (a sketch follows below)
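For the last two points, a rough sketch using the snowflake-ingest Python package, which builds and signs the JWT from your private key for you; the account, user, pipe, staged file path, and key file below are all placeholders:
from snowflake.ingest import SimpleIngestManager, StagedFile

# Unencrypted PKCS#8 PEM key whose public half is registered on the Snowflake user.
with open('/path/to/rsa_key.p8') as f:
    private_key = f.read()

ingest_manager = SimpleIngestManager(
    account='myaccount',
    host='myaccount.snowflakecomputing.com',
    user='INGEST_USER',
    pipe='MYDB.MYSCHEMA.MYPIPE',
    private_key=private_key)

# Paths are relative to the pipe's stage; under the hood this calls insertFiles.
response = ingest_manager.ingest_files([StagedFile('2020/01/my_data.csv', None)])
print(response)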
I also tried with a Talend job with TOS BigData.
Hope it helps.

mongodb = trigger => python

I'm currently building a pipeline that reads data from MongoDB every time a new document gets inserted and sends it to an external data source after some preprocessing. The preprocessing and sending parts work well the way I designed them.
The problem, however, is that I can't read data from MongoDB. I'm trying to build a trigger that reads data from MongoDB when a certain collection gets updated and then sends it to Python. I'm not considering polling MongoDB since it's too resource-intensive.
I've found the library mongotriggers (https://github.com/drorasaf/mongotriggers/) and am now taking a look at it.
In summary, how can I build a trigger that sends data to Python from MongoDB when a new document gets inserted into a specific collection?
Any comment or feedback would be appreciated.
Thanks in advance.
Best
Gee
In MongoDB v3.6+, you can now use MongoDB Change Streams. Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a single collection, a database, or an entire deployment, and immediately react to them.
For example, to listen to the change stream when a new document gets inserted:
import logging

import pymongo

try:
    with db.collection.watch([{'$match': {'operationType': 'insert'}}]) as stream:
        for insert_change in stream:
            # Do something with the insert change event
            print(insert_change)
except pymongo.errors.PyMongoError:
    # The ChangeStream encountered an unrecoverable error or the
    # resume attempt failed to recreate the cursor.
    logging.error('...')
pymongo.collection.Collection.watch() is available from PyMongo 3.6.0+.
