download file from s3 to local automatically - python

I am creating a Glue job (Python shell) to export data from Redshift and store it in S3. But how would I automate/trigger the file in S3 to be downloaded to the local network drive so the 3rd party vendor can pick it up?
Without Glue, I could write a Python utility that runs on a local server, extracts the data from Redshift to a file, and saves it to the local network drive, but I want to implement this framework in the cloud to avoid a dependency on a local server.
The AWS CLI sync command won't help, because once the vendor picks up a file I must not put it back in the local folder.
Please suggest good alternatives.

If the interface team can use the S3 API or CLI to get objects from S3 and put them on the SFTP server, granting them S3 access through an IAM user or role would probably be the simplest solution. The interface team could write a script that periodically lists the S3 objects created after a specified date and copies them to the SFTP server.
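A rough sketch of that kind of poller, assuming the IAM user has s3:ListBucket and s3:GetObject; the bucket name, prefix, and cutoff time are placeholders:
import boto3
from datetime import datetime, timezone

s3 = boto3.client('s3')  # credentials for the IAM user/role you grant them

cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)  # e.g. timestamp of the last successful run

# List objects under a prefix and keep only those created after the cutoff
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-export-bucket', Prefix='exports/'):
    for obj in page.get('Contents', []):
        if obj['LastModified'] > cutoff:
            local_name = obj['Key'].split('/')[-1]
            s3.download_file('my-export-bucket', obj['Key'], local_name)
            # ...then move local_name into the SFTP drop folder...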
If they can't use the S3 API or CLI, you could use presigned URLs. You'd still need to communicate the object URLs to the interface team; a queue would be a good solution for that. But if they can use an AWS SQS client, it's likely they could just use the S3 API to find and retrieve new objects.
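A minimal sketch of generating such a presigned URL (bucket, key, and expiry are placeholders); the URL could then be dropped onto a queue for the interface team:
import boto3

s3 = boto3.client('s3')

# Presigned GET URL the interface team can fetch without AWS credentials
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'my-export-bucket', 'Key': 'exports/file.csv'},
    ExpiresIn=3600,  # one hour
)
print(url)  # e.g. send this to an SQS queue for the interface team to consume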
It's not clear to me who controls the SFTP server, whether it's your interface team or the 3rd party vendor. If you can push files to the SFTP server yourself, you could create an S3 event notification that runs a Lambda function to copy each new object to the SFTP server as it is created in the bucket.
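A hedged sketch of that last option: an S3-triggered Lambda that pushes the new object to the SFTP server. It assumes paramiko is packaged with the function (e.g. as a Lambda layer) and that the Lambda can reach the server; the host, credentials, and remote path are placeholders.
import io
import boto3
import paramiko  # assumed to be packaged with the function, e.g. as a Lambda layer
from urllib.parse import unquote_plus

s3 = boto3.client('s3')

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = unquote_plus(record['s3']['object']['key'])  # keys in events are URL-encoded

        # Read the new object into memory (fine for modestly sized exports)
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()

        # Push it to the SFTP server the vendor polls; host/credentials are placeholders
        transport = paramiko.Transport(('sftp.example.com', 22))
        transport.connect(username='vendor_user', password='placeholder-password')
        try:
            sftp = paramiko.SFTPClient.from_transport(transport)
            sftp.putfo(io.BytesIO(body), '/outbound/' + key.split('/')[-1])
        finally:
            transport.close()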

Related

How to transfer a file from one S3 bucket to another with two different users

I need to write Python code to copy an S3 object from one bucket to another. The source bucket is in a different AWS account, and we use IAM user credentials to read from it. The code runs in the same account as the destination bucket, so it has write access through an IAM role. One way I can think of is to create an S3 client for the source account, read the whole file into memory (get_object?), and then create another S3 client for the destination bucket and write the previously read contents (put_object?). But that becomes very inefficient as the file size grows, so I'm wondering if there is a better way, preferably an AWS-managed way in boto3 that transfers the file without reading its contents into memory.
PS: I cannot add or modify roles or policies in the source account to give the destination account direct read access. The source account is owned by someone else, and they only provide a user that can read from the bucket.
Streaming is the standard solution for this kind of problem. You establish a source and a destination and then you stream from one to the other.
In fact, the boto3 get_object() and upload_fileobj() methods both support streams.
Your code is going to look something like this:
import boto3

# Source-account client uses the IAM user credentials you were given
src = boto3.client('s3', aws_access_key_id=src_access_key,
                   aws_secret_access_key=src_secret_key)
dst = boto3.client('s3')  # creds implicit through IAM role

# upload_fileobj streams the body in chunks, so the file never sits fully in memory
src_response = src.get_object(Bucket=src_bucket, Key=src_key)
dst.upload_fileobj(src_response['Body'], dst_bucket, dst_key)
This is just a suggestion that might provide an updated approach. Most tech articles about how to transfer S3 files from one account to another rely on the destination account to "pull" the files so that the destination account ends up owning the copied files.
However, per this article from AWS, you can now configure buckets with a Bucket owner enforced setting—and in fact this is the default for newly created buckets:
Objects in Amazon S3 are no longer automatically owned by the AWS account that uploads it. By default, any newly created buckets now have the Bucket owner enforced setting enabled.
On the destination bucket, you should be able to grant IAM permission for the source account user to "push" files to that bucket. Then with appropriate S3 commands or API calls, you should be able to copy files directly from the source to the destination without needing to read, buffer, and write data with your Python client.
You might want to test and verify the permissions configuration with the AWS CLI, and then determine how to implement it in Python.
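As a rough illustration of the "push" idea in Python, assuming the destination bucket policy grants the source-account user s3:PutObject (and the bucket uses the Bucket owner enforced setting, so the destination account owns the result); all bucket names, keys, and credentials are placeholders:
import boto3

src_user = boto3.client(
    's3',
    aws_access_key_id='SRC_ACCESS_KEY',          # credentials provided by the source account
    aws_secret_access_key='SRC_SECRET_KEY',
)

# Managed server-side copy: S3 moves the bytes internally (multipart for large
# objects), so nothing is streamed through this client.
src_user.copy(
    CopySource={'Bucket': 'source-bucket', 'Key': 'path/to/file'},
    Bucket='destination-bucket',
    Key='path/to/file',
)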

Automate File loading from s3 to snowflake

New JSON files are dumped into an S3 bucket daily. I have to create a solution that picks up the latest file when it arrives, parses the JSON, and loads it into the Snowflake data warehouse. Could someone please share your thoughts on how we can achieve this?
There are a number of ways to do this depending on your needs. I would suggest creating an event to trigger a lambda function.
https://docs.aws.amazon.com/lambda/latest/dg/with-s3.html
Another option may be to create an SQS message when the file lands on S3 and have an EC2 instance poll the queue and process as necessary.
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/sqs-example-long-polling.html
edit: Here is a more detailed explanation of how to create events from S3 and trigger Lambda functions, with documentation provided by Snowflake:
https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe-rest-lambda.html
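As a rough sketch of the SQS option above (an EC2-side poller reading the S3 event notifications), assuming the bucket is configured to publish ObjectCreated events to the queue; the queue URL is a placeholder:
import json
import boto3

sqs = boto3.client('sqs')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/s3-new-files'  # placeholder

while True:
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get('Messages', []):
        body = json.loads(msg['Body'])
        for record in body.get('Records', []):  # absent for S3's test event
            bucket = record['s3']['bucket']['name']
            key = record['s3']['object']['key']
            # ...parse the JSON file here and load it into Snowflake...
            print(f'New object: s3://{bucket}/{key}')
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])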
Look into Snowpipe; it lets you do that within the system, making it (possibly) much easier.
There are some aspects to consider, such as: is it batch or streaming data; do you want to retry loading a file when the data or format is wrong; and do you want a generic process able to handle different file formats/types (CSV/JSON) and stages.
In our case we built a generic S3-to-Snowflake load using Python and Luigi, and also implemented the same using SSIS, but for CSV/TXT files only.
In my case, I have a Python script that gets information about the bucket with boto.
Once I detect a change, I call the insertFiles REST endpoint on Snowpipe (a rough sketch follows the lists below).
Phasing:
detect S3 change
get S3 object path
parse the content and transform it to CSV in S3 (same bucket, or another one Snowpipe can connect to)
Call SnowPipe REST API
What you need:
Create a user with a public key
Create your stage on Snowflake with AWS credentials in order to access S3
Create your pipe on Snowflake with your user role
Sign a JWT
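If you would rather not hand-roll the JWT, a hedged sketch using Snowflake's snowflake-ingest Python package, which signs the JWT and calls insertFiles for you; the account, host, user, pipe, key path, and staged file name are all placeholders:
from snowflake.ingest import SimpleIngestManager, StagedFile

# Private key (PEM) for the Snowflake user that was created with a public key
with open('/path/to/rsa_key.p8') as f:
    private_key_text = f.read()

ingest_manager = SimpleIngestManager(account='myaccount',
                                     host='myaccount.snowflakecomputing.com',
                                     user='INGEST_USER',
                                     pipe='MYDB.MYSCHEMA.MYPIPE',
                                     private_key=private_key_text)

# Tell Snowpipe which staged file(s) to load (the insertFiles REST call)
resp = ingest_manager.ingest_files([StagedFile('exports/new_file.csv', None)])
print(resp)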
I also tried with a Talend job with TOS BigData.
Hope it helps.

Hosting a JSON API data feed on Amazon AWS, originally created in EC2

I run a daily Python script that currently outputs data to Google Sheets using gspread. I then use this Google Sheet with Sheetsu, which creates an API (that outputs JSON) for an app. However, as the feed gets many requests it can end up being expensive with Sheetsu ($25+ per month).
So I am going to tweak my Python script to output a JSON file instead. However, I need to host this JSON data somewhere. It needs to be fast, and maybe cached too (I currently use caching with Sheetsu, which is available on request).
What AWS service options are there for this? I see there is the AWS API Gateway, and I have seen people mention hosting JSON on S3 storage, but I'm unsure about caching, speed, etc. with that.
So I need some advice please on options for AWS and the code needed to implement the best option.
So I need some advice please on options for AWS and the code needed to implement the best option.
EC2 to S3 links
How to transfer files between AWS s3 and AWS ec2
http://tecadmin.net/install-s3cmd-manage-amazon-s3-buckets/#
How to move files amazon ec2 to s3 commandline
https://serverfault.com/questions/285905/how-to-upload-files-from-amazon-ec2-server-to-s3-bucket
Create an S3 bucket with static site hosting enabled. Copy the JSON file from EC2 to the S3 bucket using the Python AWS SDK (Boto) or the AWS CLI tool.
You mentioned you are concerned with caching and speed of hosting on S3. You can enable S3 transfer acceleration, or you can place a Content Delivery Network (CDN) like CloudFront in front of your S3 bucket.
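A minimal sketch of the upload step in the daily script, assuming the bucket already has static website hosting (or CloudFront) in front of it; the bucket name, key, payload, and cache lifetime are placeholders:
import json
import boto3

s3 = boto3.client('s3')

payload = {'updated': '2024-01-01', 'items': []}  # whatever the daily script produces

# Upload the feed with an explicit content type and a cache hint that S3/CloudFront
# will pass through to clients
s3.put_object(
    Bucket='my-json-feed-bucket',
    Key='feed.json',
    Body=json.dumps(payload).encode('utf-8'),
    ContentType='application/json',
    CacheControl='max-age=300',
)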

Access to Amazon S3 Bucket from EC2 instance

I have an EC2 instance and an S3 bucket in different regions. The bucket contains some files that are used regularly by my EC2 instance.
I want to programmatically download the files to my EC2 instance (using Python).
Is there a way to do that?
There are lots of ways to do this from within Python.
Boto has S3 modules which will do this. http://boto.readthedocs.org/en/latest/ref/s3.html
You could also just use the Python requests library to download over HTTP.
The AWS CLI also gives you the option to download from the shell:
aws s3 cp s3://bucket/folder/file.name file.name
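If you stick with Python, the newer boto3 SDK makes the download a one-liner; a minimal sketch where the bucket, key, and destination path are placeholders:
import boto3

s3 = boto3.client('s3')  # credentials come from the environment or an instance profile

# Download a single object to a local path
s3.download_file('bucket', 'folder/file.name', '/local/path/file.name')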
Adding to what @joeButler has said above...
Your instances need permission to access S3 using the APIs.
So you need to create an IAM role and instance profile, and your instance needs to have the instance profile assigned when it is created. See the topic "Using an IAM Role to Grant Permissions to Applications Running on Amazon EC2 Instances" (page 183) of the AWS IAM User Guide to understand the steps and procedure.
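As a quick, hedged illustration: once the instance profile is attached, boto3 picks up the role's temporary credentials from the instance metadata service, so no keys appear in code (the bucket, key, and local path below are placeholders):
import boto3

# Should print the ARN of the role attached via the instance profile
print(boto3.client('sts').get_caller_identity()['Arn'])

# S3 calls then work with no explicit credentials in the code
boto3.client('s3').download_file('my-bucket', 'some/key', '/tmp/some-key')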
I work for Minio; it's open-source, S3-compatible object storage written in Go.
You can use the minio-py client library; it's open source and compatible with AWS S3. Below is a simple get_object.py example:
from minio import Minio
from minio.error import ResponseError

client = Minio('s3.amazonaws.com',
               access_key='YOUR-ACCESSKEYID',
               secret_key='YOUR-SECRETACCESSKEY')

# Get a full object
try:
    data = client.get_object('my-bucketname', 'my-objectname')
    with open('my-testfile', 'wb') as file_data:
        for d in data:
            file_data.write(d)
except ResponseError as err:
    print(err)
You can also use the Minio client (mc); it comes with an mc mirror command that does the same thing, and you can add it to cron.
$ mc mirror s3/mybucket localfolder
Note:
s3 is an alias
mybucket is your AWS S3 bucket
localfolder is the directory on the EC2 machine that holds the backup.
Installing Minio Client:
GNU/Linux
Download mc for:
64-bit Intel from https://dl.minio.io/client/mc/release/linux-amd64/mc
32-bit Intel from https://dl.minio.io/client/mc/release/linux-386/mc
ARM from https://dl.minio.io/client/mc/release/linux-arm/mc
$ chmod 755 mc
$ ./mc --help
Adding your S3 credentials
$ ./mc config host add mys3 https://s3.amazonaws.com BKIKJAA5BMMU2RHO6IBB V7f1CwQqAcwo80UEIJEjc5gVQUSSx5ohQ9GSrr12
Note: Replace access & secret key with yours.
As mentioned above, you can do this with Boto. To make it more secure and not worry about the user credentials, you could use IAM to grant the EC2 machine access to the specific bucket only. Hope that helps.
If you want to use Python, you may want to use the newer boto3 API. I personally like it more than the original boto package. It works with both Python 2 and Python 3, and the differences are minimal.
You can specify a region when you create a new bucket (see the boto3.Client documentation), but bucket names are globally unique, so you shouldn't need to specify one just to connect to it. And you probably don't want to use a bucket in a different region than your instance, because you will pay for data transfer between regions.
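To illustrate the region point, a small sketch (the bucket name and region are placeholders); outside us-east-1 the LocationConstraint must match the client's region:
import boto3

# Create a bucket pinned to a specific region
s3 = boto3.client('s3', region_name='eu-west-1')
s3.create_bucket(Bucket='my-unique-bucket-name',
                 CreateBucketConfiguration={'LocationConstraint': 'eu-west-1'})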

mounting an s3 bucket in ec2 and using transparently as a mnt point

I have a webapp (call it myapp.com) that allows users to upload files. The webapp will be deployed on Amazon EC2 instance. I would like to serve these files back out to the webapp consumers via an s3 bucket based domain (i.e. uploads.myapp.com).
When the user uploads the files, I can easily drop them into a folder called "site_uploads" on the local EC2 instance. However, since my EC2 instance has finite storage, with a lot of uploads the file system will fill up quickly.
It would be great if the EC2 instance could mount an S3 bucket as the "site_upload" directory, so that uploads to it automatically end up on uploads.myapp.com (and my webapp can use template tags to make sure the links for this uploaded content use that S3-backed domain). This also gives me scalable file serving, as requests for files hit S3 and not my EC2 instance. It also makes it easy for my webapp to scale/resize the images that appear locally in "site_upload" but actually live on S3.
I'm looking at s3fs, but judging from the comments, it doesn't look like a fully baked solution. I'm looking for a non-commercial solution.
FYI, the webapp is written in Django, not that that changes the particulars much.
I'm not using EC2, but I do have my S3 bucket permanently mounted on my Linux server. The way I did it is with Jungledisk. It isn't a non-commercial solution, but it's very inexpensive.
First I set up Jungledisk as normal and make sure FUSE is installed. Mostly you just need to create the configuration file with your secret keys and such. Then add a line to your fstab, something like this:
jungledisk /path/to/mount/at fuse noauto,allow_other,config=/path/to/jungledisk/config/file.xml 0 0
Then just mount, and you're good to go.
For uploads, your users can upload directly to S3, as described here.
This way you won't need to mount S3.
When serving the files, you can also do that from S3 directly by marking the files public. I'd prefer to name the site "files.mydomain.com" or "images.mydomain.com", pointing to S3.
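A hedged sketch of generating that kind of direct-to-S3 upload form with boto3 (the bucket name and key are placeholders); the returned URL and fields go into the browser's upload form, so the file never touches the EC2 instance:
import boto3

s3 = boto3.client('s3')

# Generate a short-lived POST policy for one upload
post = s3.generate_presigned_post(Bucket='uploads.myapp.com',
                                  Key='site_uploads/user123/avatar.png',
                                  ExpiresIn=3600)
print(post['url'])     # the form's action URL
print(post['fields'])  # hidden fields to include alongside the file input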
I use s3fs, but there are no readily available distributions. I've got my build here for anyone who wants it easier.
Configuration documentation wasn't available, so I winged it until I got this in my fstab:
s3fs#{{ bucket name }} {{ /path/to/mount/point }} fuse allow_other,accessKeyId={{ key }},secretAccessKey={{ secret key }} 0 0
s3fs
This is a little snippet that I use on an Ubuntu system; I have not tested it elsewhere, so it will obviously need to be adapted for a Windows system. You'll also need to install s3-simple-fuse. If you wind up eventually moving your job to the cloud, I'd recommend Fabric to run the same command.
import os, subprocess

'''
Note: this is for Linux with s3cmd installed and libfuse2 installed
Run: 'fusermount -u mount_directory' to unmount
'''

def mountS3(aws_access_key_id, aws_secret_access_key, targetDir, bucketName=None):
    if bucketName is None:
        bucketName = 's3Bucket'
    mountDir = os.path.join(targetDir, bucketName)
    if not os.path.isdir(mountDir):
        os.mkdir(mountDir)  # os.path has no mkdir; use os.mkdir
    subprocess.call('s3-simple-fuse %s -o AWS_ACCESS_KEY_ID=%s,AWS_SECRET_ACCESS_KEY=%s,bucket=%s'
                    % (mountDir, aws_access_key_id, aws_secret_access_key, bucketName),
                    shell=True)  # single command string, so let the shell parse it
I'd suggest using a separately-mounted EBS volume. I tried doing the same thing for some movie files. Access to S3 was slow, and S3 has some limitations like not being able to rename files, no real directory structure, etc.
You can set up EBS volumes in a RAID5 configuration and add space as you need it.
