Automate File loading from s3 to snowflake - python

In s3 bucket daily new JSON files are dumping , i have to create solution which pick the latest file when it arrives PARSE the JSON and load it to Snowflake Datawarehouse. may someone please share your thoughts how can we achieve

There are a number of ways to do this depending on your needs. I would suggest creating an event to trigger a lambda function.
https://docs.aws.amazon.com/lambda/latest/dg/with-s3.html
Another option may be to create a SQS message when the file lands on s3 and have an ec2 instance poll the queue and process as necessary.
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/sqs-example-long-polling.html
edit: Here is a more detailed explanation on how to create events from s3 and trigger lambda functions. Documentation is provided by Snowflake
https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe-rest-lambda.html

Look into Snowpipe, it lets you do that within the system, making it (possibly) much easier.

There are some aspects to be considered such as is it a batch or streaming data , do you want retry loading the file in case there is wrong data or format or do you want to make it a generic process to be able to handle different file formats/ file types(csv/json) and stages.
In our case we have built a generic s3 to Snowflake load using Python and Luigi and also implemented the same using SSIS but for csv/txt file only.

In my case, I have a python script which get information about the bucket with boto.
Once I detect a change, I call the REST Endpoint Insertfiles on SnowPipe.
Phasing:
detect S3 change
get S3 object path
parse Content and transform to CSV in S3 (same bucket or other snowpipe can connect)
Call SnowPipe REST API
What you need:
Create a user with a public key
Create your stage on SnowFlake with AWS credential in order to access S3
Create your pipe on Snowflake with your user role
Sign a JWT
I also tried with a Talend job with TOS BigData.
Hope it helps.

Related

How Python AWS Lambda interacts specifically with the uploaded file?

I´m trying to do the following:
when I upload a file in my s3 storage, the lambda picks this json file and converts it into a csv file.
How can I specify in the lambda code which file must pick?
example of my code in local:
import pandas as pd
df = pd.read_json('movies.json')
df.to_csv('csv-movies.csv')
in this example, I provide the name of the file...but..how can I manage that on a Lambda?
I think I don´t understand how Lambda works...could you give me an example?
Lambda spins up execution environments to handle your requests. When it initialises these environments, it'll pull the code you uploaded, and execute it when invoked.
Execution environments have a concept of ephemeral (temporary) storage with a default size of 512mb.
Lambda doesn't have access to your files in S3 by default. You'd first need to download your file from S3 using something like the AWS SDK for Python. You can store it in the /tmp directory to make use of the ephemeral storage I mentioned earlier.
Once you've downloaded the file using the SDK, you can interact with it as you would if you were running this locally, like in your example.
On the flip side, you'd also need to use the SDK to upload the CSV back to S3 if you want to keep it beyond the lifecycle of that execution environment.
Something else you might want to explore in future is reading that file into memory and doing away with storing it in ephemeral storage altogether.
In order to achieve this you will need to use S3 as the event source for your Lambda, there's a useful tutorial for this provided by AWS themselves and has some sample python code to assist you, you can view it here.
To break it down slightly further and answer how you get the name of the file. The lambda handler will look similar to the following:
def lambda_handler(event, context)
What is important here is the event object. When your event source is the S3 bucket you will be given the name of the bucket and the s3 key in the object which is effectively the path to the file in the S3 bucket. With this information you can do some logic to decide if you want to download the file from that path. If you do, you can use the S3 get_object( ) api call as shown in the tutorial.
Once this file is downloaded it can be used like any other file you would have on your local machine, so you can then proceed to process the json to a CSV. Once it is converted you will presumably want to put it back in S3 and for this you can use the S3 put_object( ) call for this and reuse the information in the event object in order to specify the path.

How to continuously copy new S3 files to another S3 bucket

How can I continuously copy one S3 bucket to another? I want to copy the files every time a new file has been added.
I've tried using the boto3 copy_object however I require the key each time which won't work if I'm getting a new file each time.
From Replicating objects - Amazon Simple Storage Service:
To automatically replicate new objects as they are written to the bucket use live replication, such as Same-Region Replication (SRR) or Cross-Region Replication (CRR).
S3 Replication will automatically create new objects in another bucket as soon as they are created. (Well, it can take a few seconds.)
Alternatively, you could configure the S3 bucket to trigger an AWS Lambda function that uses the CopyObject() command to copy the object to another location. This method is useful if you want to selectively copy files, by having the Lambda function perform some logic before performing the copy (such as checking the file type).
Please look at this: https://aws.amazon.com/premiumsupport/knowledge-center/move-objects-s3-bucket/
You can use the aws cli s3 sync command to achieve this.

How to read and modify a csv file on one bucket in cloud storage and save the results in another bucket using Cloud Functions

I have CSV files coming to a folder on a Cloud Storage bucket and I want to create a Cloud Function that opens the CSV, adds a new column to it and then save the results to a new bucket as new.csv file.
Is there a way to do that using python Cloud Function???
Thanks in advance.
The idea that you’re trying to implement is totally possible and can be achieved using Google Cloud Functions.
For that, you would need to create a storage-triggered Cloud Function. More specifically, you can create your function in a way that it will respond to change notifications emerging from your Google Cloud Storage.
These notifications can be configured to respond to various events inside a bucket: object creation, deletion, archiving and metadata updates.
For the situation described, you will need to use the trigger google.storage.object.finalize.
This event is sent when a new object is created in the bucket or an existing object is overwritten, and a new generation of that object is created.
Here you can find a sample code of a storage-triggered Cloud Function written in Python, while this tutorial will give you a more detailed overview of the usage of storage-triggered functions.

Boto3 put_object() is very slow

TL;DR: Trying to put .json files into S3 bucket using Boto3, process is very slow. Looking for ways to speed it up.
This is my first question on SO, so I apologize if I leave out any important details. Essentially I am trying to pull data from Elasticsearch and store it in an S3 bucket using Boto3. I referred to this post to pull multiple pages of ES data using the scroll function of the ES Python client. As I am scrolling, I am processing the data and putting it in the bucket as a [timestamp].json format, using this:
s3 = boto3.resource('s3')
data = '{"some":"json","test":"data"}'
key = "path/to/my/file/[timestamp].json"
s3.Bucket('my_bucket').put_object(Key=key, Body=data)
While running this on my machine, I noticed that this process is very slow. Using line profiler, I discovered that this line is consuming over 96% of the time in my entire program:
s3.Bucket('my_bucket').put_object(Key=key, Body=data)
What modification(s) can I make in order to speed up this process? Keep in mind, I am creating the .json files in my program (each one is ~240 bytes) and streaming them directly to S3 rather than saving them locally and uploading the files. Thanks in advance.
Since you are potentially uploading many small files, you should consider a few items:
Some form of threading/multiprocessing. For example you can see How to upload small files to Amazon S3 efficiently in Python
Creating some form of archive file (ZIP) containing sets of your small data blocks and uploading them as larger files. This is of course dependent on your access patterns. If you go this route, be sure to use the boto3 upload_file or upload_fileobj methods instead as they will handle multi-part upload and threading.
S3 performance implications as described in Request Rate and Performance Considerations

parallell copy of buckets/keys from boto3 or boto api between 2 different accounts/connections

I want to copy keys from buckets between 2 different accounts using boto3 api's.
In boto3, I executed the following code and the copy worked
source = boto3.client('s3')
destination = boto3.client('s3')
destination.put_object(source.get_object(Bucket='bucket', Key='key'))
Basically I am fetching data from GET and pasting that with PUT in another account.
On Similar lines in boto api, I have done the following
source = S3Connection()
source_bucket = source.get_bucket('bucket')
source_key = Key(source_bucket, key_name)
destination = S3Connection()
destination_bucket = destination.get_bucket('bucket')
dist_key = Key(destination_bucket, source_key.key)
dist_key.set_contents_from_string(source_key.get_contents_as_string())
The above code achieves the purpose of copying any type of data.
But the speed is really very slow. I get around 15-20 seconds to copy data for 1GB. And I have to copy 100GB plus.
I tried python mutithreading wherein each thread does the copy operation. The performance was bad as it took 30 seconds to copy 1GB. I suspect GIL might be the issue here.
I did multiprocessing and I am getting the same result as of single process i.e. 15-20 seconds for 1GB file.
I am using a very high end server with 48 cores and 128GB RAM. The network speed in my environment is 10GBPS.
Most of the search results tell about copying data between buckets in same account and not across accounts. Can anyone please guide me here. Is my approach wrong? Does anyone have a better solution?
Yes, it is wrong approach.
You shouldn't download the file. You are using AWS infrastructure, so you should make use of the efficient AWS backend call to do the works. Your approach is wasting resources.
boto3.client.copy will do the job better than this.
In addition, you didn't describe what you are trying to achieve(e.g. is this some sort of replication requirement? ).
Because with proper understanding of your own needs, it is possible that you don't even need a server to do the job : S3 Bucket events trigger, lambda etc can all execute the copying job without a server.
To copy file between two different AWS account, you can checkout this link Copy S3 object between AWS account
Note :
S3 is a huge virtual object store for everyone, that's why the bucket name MUST be unique. This also mean, the S3 "controller" can done a lot of fancy work similar to a file server , e.g. replication,copy, move file in the backend, without involving network traffics.
As long as you setup the proper IAM permission/policies for the destination bucket, object can move across bucket without additional server.
This is almost similar to file server. User can copy file to each other without "download/upload", instead, one just create a folder with write permission for all, file copy from another user is all done within the file server, with fastest raw disk I/O performance. You don't need powerful instance nor high performance network using backend S3 copy API.
Your method is similar to attempt FTP download file from user using the same file server, which create unwanted network traffics.
You should check out the TransferManager in boto3. It will automatically handle the threading of multipart uploads in an efficient way. See the docs for more detail.
Basically you must have to use the upload_file method and TransferManager will take care of the rest.
import boto3
# Get the service client
s3 = boto3.client('s3')
# Upload tmp.txt to bucket-name at key-name
s3.upload_file("tmp.txt", "bucket-name", "key-name")

Categories