We want to export data from DynamoDB to a file. We have around 150,000 records, each of about 430 bytes. It would be a periodic activity, once a week. Can we do that with Lambda? Is it feasible, given that Lambda has a maximum execution time of 15 minutes?
Is there a better option using Python or via the UI? I'm unable to export more than 100 records from the console UI.
One really simple option is to use the Command Line Interface tools:
aws dynamodb scan --table-name YOURTABLE --output text > outputfile.txt
This would give you tab-delimited output. You can run it as a cronjob for regular output.
The scan wouldn't take anything like 15 minutes (probably just a few seconds). So you wouldn't need to worry about your Lambda timing out if you did it that way.
You can export your data from DynamoDB in a number of ways.
The simplest way would be a full table scan:
import boto3

dynamodb = boto3.client('dynamodb')

# first page of results
response = dynamodb.scan(
    TableName=your_table,
    Select='ALL_ATTRIBUTES')
data = response['Items']

# keep paginating until the scan is exhausted
while 'LastEvaluatedKey' in response:
    response = dynamodb.scan(
        TableName=your_table,
        Select='ALL_ATTRIBUTES',
        ExclusiveStartKey=response['LastEvaluatedKey'])
    data.extend(response['Items'])

# save your data as csv here
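One way to fill in that last "save your data as csv here" step is a plain csv.writer over the scanned items. This is only a minimal sketch; the output.csv filename and the assumption that every item has the same attributes in the same order are mine, not part of the original answer:

import csv

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for item in data:
        # each attribute comes back in DynamoDB JSON, e.g. {'name': {'S': 'foo'}, 'count': {'N': '42'}}
        writer.writerow([list(value.values())[0] for value in item.values()])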
But if you want to do it every x days, what I would recommend is:
Create your first dump from your table with the code above.
Then, you can create a DynamoDB trigger to a Lambda function that will receive all your table changes (insert, update, delete), and append the data to your CSV file. The code would be something like:
def lambda_handler(event, context):
    for record in event['Records']:
        # get the changes here and save it
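A slightly fuller sketch of that handler, in case it helps; the bucket name, the key pattern, and the choice to write one CSV object per batch are assumptions on my part, not part of the original answer:

import csv
import io
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    rows = []
    for record in event['Records']:
        change = record['dynamodb']
        if record['eventName'] in ('INSERT', 'MODIFY'):
            item = change['NewImage']   # new attribute values, in DynamoDB JSON
        else:
            item = change['Keys']       # REMOVE events only carry the key
        rows.append([record['eventName'], item])

    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    # one object per invocation; merge or append downstream as needed
    s3.put_object(Bucket='my-export-bucket',                       # placeholder bucket
                  Key='changes/' + context.aws_request_id + '.csv',
                  Body=buf.getvalue())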
Since you will receive only your table updates, you don't need to worry about Lambda's 15-minute execution limit.
You can read more about DynamoDB Streams and Lambda here: DynamoDB Streams and AWS Lambda Triggers
And if you want to work on your data further, you can always create an AWS Glue job or an EMR cluster.
Guys, we resolved it using AWS Lambda: 150,000 records (each of 430 bytes) are processed to a CSV file in 2.2 minutes using the maximum available memory (3008 MB). We created an event rule to run it on a periodic basis. The time and size are given so that anyone can estimate how much they can do with Lambda.
You can refer to an existing question on Stack Overflow about exporting a DynamoDB table as a CSV.
I have configured a DMS migration instance that replicates data from MySQL into an AWS Kinesis stream, but I noticed that when I process the Kinesis records I pick up duplicates. This does not happen for every record.
How do I prevent these duplicate records from being pushed to the kinesis data stream or the S3 bucket?
I'm using a Lambda function to process the records, so I thought of adding logic to de-duplicate the data, but I'm not sure how to do that without persisting the data somewhere. I need to process the data in real time, so persisting it would not be ideal.
Regards
Pragesan
I added a global variable that stores the pk of each record, so each invocation checks the previous pk value, and if it is different I insert the value.
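For anyone looking for the shape of that approach, here is a minimal sketch. It assumes the Kinesis payload is JSON with an id field, and process() is a hypothetical stand-in for the downstream step; it works because module-level state survives between invocations while the Lambda container stays warm, so it only catches back-to-back duplicates:

import base64
import json

last_pk = None   # survives between invocations while the container stays warm

def lambda_handler(event, context):
    global last_pk
    for record in event['Records']:
        payload = json.loads(base64.b64decode(record['kinesis']['data']))
        pk = payload['id']            # assumed primary-key field
        if pk == last_pk:
            continue                  # skip the back-to-back duplicate
        last_pk = pk
        process(payload)              # hypothetical downstream processing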
I'm new to AWS and Glue.
I have a Glue job that uses a Python script to convert a data source into a JSON-formatted file. The new data is sent to us on a monthly basis, so my thought was to trigger the Glue job to run every time data is added to our S3 bucket.
I have the job set up to overwrite the file every time it runs, but it would be nice to capture the differences between the monthly files so that I can keep the historic info.
Here is the part of the code that writes the output:
s3.put_object(Body=output_file, Bucket='mys3', Key='outputfile.json')
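(As a sketch only, and purely an assumption on my part: if you wanted to keep each month's output rather than overwrite it, one option is to put a date into the key, e.g.

from datetime import date

# e.g. outputs/2024-01-01/outputfile.json, so earlier months are preserved
s3.put_object(Body=output_file,
              Bucket='mys3',
              Key='outputs/{}/outputfile.json'.format(date.today().isoformat()))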
Could a crawler help with keeping track of the history? For example, could I crawl for new data only and then store it somewhere?
I am viewing my outputs in Athena, but maybe I should start compiling this data into a database of its own?
Thanks in advance for any inputs!
I'm building a solution that extracts data from invoice PDFs with the Microsoft Form Recognizer API and adds the information to an SQL database.
I have successfully built a parser and code that adds rows from the API response to a database when a PDF is uploaded to my blob storage.
I'm looking for the easiest way to handle multiple PDFs arriving at the same time for my database.
I'm facing deadlock issues when I test with multiple incoming PDFs, because there are process conflicts in SQL Server: if I upload 4 PDFs, all 4 are processed at the same time, parsed, and their data added to SQL at the same time, which causes conflicts and a potentially non-logical arrangement of the database rows (I don't want to run an update grouped by invoice number after each process to rearrange the whole table).
Now, I'm looking for a solution that can take every element coming into blob storage one after another, instead of all at the same time. Something like a for loop that iterates sequentially over every "source" blob and feeds each one into my parsing function.
Or something like a queue that could work like this:
PDF1, PDF2, PDF3 arrive in blob storage:
Make PDF2 and PDF3 wait, send PDF1 to the API for analysis, add its data to SQL, and once the last row is added, send PDF2 to the API, add its data to SQL, and once the last row is added, send PDF3, and so on.
Thanks for your suggestion:)
You can route Azure blob storage events to an Azure Function. https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-event-overview
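As a rough illustration of what the receiving end of that could look like (Python, v1 programming model with an Event Grid trigger bound in function.json; parse_and_store() is a hypothetical stand-in for your existing parsing/SQL code):

import azure.functions as func

def main(event: func.EventGridEvent):
    blob_url = event.get_json()['url']   # URL of the PDF that was just uploaded
    # call Form Recognizer on blob_url and insert the resulting rows into SQL here
    parse_and_store(blob_url)            # hypothetical existing function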
Please pardon my ignorance if this question sounds silly to the expert audience here.
Currently, as per my use case, I am performing certain analysis on the data present in AWS Redshift tables and saving the results as CSV files in S3 buckets (the operation is somewhat similar to a pivot for the Redshift database), and after that I am loading the data back into the Redshift DB using the COPY command.
Currently, after performing the analysis (which is done in Python 3), about 200 CSV files are generated, which are saved to 200 different tables in Redshift.
The count of CSVs will keep increasing with time.
Currently the whole process takes about 50-60 minutes to complete:
25 minutes to generate the approx. 200 CSVs and upload them to S3 buckets
25 minutes to load the approx. 200 CSVs into 200 AWS Redshift tables
The size of the CSVs varies from a few MB to 1 GB.
I am looking for tools or AWS technologies that can help me reduce this time.
Additional info:
The structure of the CSVs keeps changing, hence I have to drop and recreate the tables each time.
This is a repetitive task and will be executed every 6 hours.
You can achieve a significant speed-up by:
Using multi-part upload of the CSV to S3: instead of waiting for a single file to upload serially, multi-part upload sends the file's parts to S3 in parallel, saving you considerable time. Read about it here and here. Here is the Boto3 reference for it.
Copying data into Redshift from S3 in parallel. If you split your file into multiple parts and then run the COPY command, the data will be loaded from the multiple files in parallel, instead of waiting for a single 1 GB file to load, which can be really slow. Read more about it here. (A sketch of both steps is shown below.)
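A minimal sketch of both steps; the file, bucket, prefix, table, and IAM role names are placeholders. boto3's managed transfer handles the multi-part upload, and a single COPY over the common prefix loads all the parts in parallel:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# 1) multi-part, parallel upload of a large CSV part
config = TransferConfig(multipart_threshold=8 * 1024 * 1024,   # switch to multi-part above 8 MB
                        multipart_chunksize=8 * 1024 * 1024,
                        max_concurrency=10)                     # parts are uploaded in parallel
s3.upload_file('analysis_part_00.csv', 'my-bucket', 'analysis/part_00.csv', Config=config)

# 2) one COPY over the shared prefix loads every part in parallel
copy_sql = """
    COPY my_table
    FROM 's3://my-bucket/analysis/part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    CSV;
"""
# run copy_sql through your usual Redshift connection (e.g. psycopg2)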
Hope this helps.
You should explore Athena. It's a tool that comes as part of the AWS suite and gives you the flexibility to query CSV (or even gzipped) files.
It'll save you the time you spend manually copying the data into the Redshift tables, and you'll be able to query the dataset from the CSVs themselves; Athena can query them directly from an S3 bucket.
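For example, once a table has been defined over the CSV location (via a Glue crawler or a CREATE EXTERNAL TABLE), you can query it straight from Python; the database, table, and bucket names below are placeholders:

import boto3

athena = boto3.client('athena')
resp = athena.start_query_execution(
    QueryString='SELECT * FROM my_analysis_table LIMIT 10',
    QueryExecutionContext={'Database': 'my_database'},
    ResultConfiguration={'OutputLocation': 's3://my-bucket/athena-results/'})
# poll get_query_execution(QueryExecutionId=resp['QueryExecutionId']) until it finishes,
# then fetch the rows with get_query_results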
However, it is still in a development phase, and you'll have to spend some time with it as it's not very user-friendly. A syntax error in your query can log you out of your AWS session rather than throwing a syntax error. Moreover, you won't find much documentation or many developer talks on the internet, since Athena is still largely unexplored.
Athena charges you based on the data your query scans and is thus more pocket-friendly. If the query fails to execute, Amazon doesn't charge you.
I have big data in S3 that I have to move into Redshift, and I have one table in Redshift. Since I use Python, I wrote a Python script and used psycopg2 to connect to Redshift. I succeeded in connecting to Redshift, but I failed to insert data from S3 into Redshift.
I checked the dashboard on the AWS website and found that Redshift received the query and loads something, but it does not insert anything, and the time consumed by this process is too long, over 3 minutes. There is no error log, so I can't find the reason.
Is there any possible cause for this?
EDIT
Added the COPY command I used:
copy table FROM 's3://example/2017/02/03/' access_key_id '' secret_access_key '' ignoreblanklines timeformat 'epochsecs' delimiter '\t';
Try querying the stl_load_errors table; it has the info on data load errors:
http://docs.aws.amazon.com/redshift/latest/dg/r_STL_LOAD_ERRORS.html
select * from stl_load_errors order by starttime desc limit 1
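Since you're already connecting with psycopg2, you can run that check from the same script; the connection details below are placeholders:

import psycopg2

conn = psycopg2.connect(host='my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com',
                        port=5439, dbname='mydb', user='myuser', password='...')
with conn.cursor() as cur:
    cur.execute("select * from stl_load_errors order by starttime desc limit 1")
    print(cur.fetchone())   # err_reason and raw_line usually explain why the COPY failed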