Access to Amazon S3 Bucket from EC2 instance - python

I have an EC2 instance and an S3 bucket in different regions. The bucket contains some files that are used regularly by my EC2 instance.
I want to programmatically download the files to my EC2 instance (using Python).
Is there a way to do that?

There are lots of ways to do this from within Python.
Boto has S3 modules which will do this: http://boto.readthedocs.org/en/latest/ref/s3.html
You could also just use the Python requests library to download over HTTP (provided the object is publicly readable).
The AWS CLI also gives you an option to download from the shell:
aws s3 cp s3://bucket/folder/file.name file.name
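For the requests approach, here is a minimal sketch (the bucket and key names are placeholders, and the object must be publicly readable for a plain HTTP GET to work):

import requests

# Hypothetical public object URL; adjust bucket, key, and region as needed
url = "https://my-bucket.s3.amazonaws.com/folder/file.name"

response = requests.get(url, stream=True)
response.raise_for_status()

# Stream the body to disk in chunks to avoid loading large files into memory
with open("file.name", "wb") as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)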

Adding to what @joeButler has said above...
Your instance needs permission to access S3 using the APIs.
So, you need to create an IAM role and an instance profile, and the instance profile needs to be assigned to your instance when it is created. See page 183 of the AWS IAM User Guide (as indicated at the bottom of the page; the topic name is "Using an IAM Role to Grant Permissions to Applications Running on Amazon EC2 Instances") to understand the steps and procedure.
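Once the role is attached, SDK code running on the instance needs no keys at all. A minimal boto3 sketch (bucket and key names here are placeholders):

import boto3

# With an instance profile attached, boto3 fetches temporary credentials
# from the instance metadata service automatically -- nothing hard-coded.
s3 = boto3.client("s3")
s3.download_file("my-bucket", "folder/file.name", "/tmp/file.name")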

I work for Minio; it's open-source, S3-compatible object storage written in Go.
You can use the minio-py client library; it's open source and compatible with AWS S3. Below is a simple example, get_object.py:
from minio import Minio
from minio.error import ResponseError

client = Minio('s3.amazonaws.com',
               access_key='YOUR-ACCESSKEYID',
               secret_key='YOUR-SECRETACCESSKEY')

# Get a full object
try:
    data = client.get_object('my-bucketname', 'my-objectname')
    with open('my-testfile', 'wb') as file_data:
        for d in data:
            file_data.write(d)
except ResponseError as err:
    print(err)
You can also use the Minio client, aka mc, which comes with an mc mirror command to do the same. You can add it to cron (see the example after the credential setup below).
$ mc mirror s3/mybucket localfolder
Note:
s3 is an alias
mybucket is your AWS S3 bucket
localfolder is the local directory on the EC2 machine used for the backup.
Installing Minio Client:
GNU/Linux
Download mc for:
64-bit Intel from https://dl.minio.io/client/mc/release/linux-amd64/mc
32-bit Intel from https://dl.minio.io/client/mc/release/linux-386/mc
ARM from https://dl.minio.io/client/mc/release/linux-arm/mc
$ chmod 755 mc
$ ./mc --help
Adding your S3 credentials
$ ./mc config host add mys3 https://s3.amazonaws.com BKIKJAA5BMMU2RHO6IBB V7f1CwQqAcwo80UEIJEjc5gVQUSSx5ohQ9GSrr12
Note: Replace the access key and secret key with your own.
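To automate the mirror as suggested above, a crontab entry along these lines would work (a sketch: the mc path and the local directory are assumptions, and mys3 is the alias configured above):

# Mirror the S3 bucket to a local directory on the EC2 instance every hour
0 * * * * /usr/local/bin/mc mirror mys3/mybucket /home/ec2-user/localfolder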

As mentioned above, you can do this with Boto. To make it more secure and avoid worrying about user credentials, you could use IAM to grant the EC2 machine access to the specific bucket only. Hope that helps.

If you want to use Python, you may want to use the newer boto3 API. I personally like it more than the original boto package. It works with both Python 2 and Python 3, and the differences are minimal.
You can specify a region when you create a new bucket (see the boto3 client documentation), but bucket names are globally unique, so you shouldn't need to specify one just to connect to a bucket. And you probably don't want to use a bucket in a different region than your instance, because you will pay for data transfer between regions.
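A minimal boto3 sketch of both points (the bucket name, key, and region below are placeholders):

import boto3

# The region only matters when creating the bucket; downloads need just the
# bucket name, since bucket names are globally unique.
s3 = boto3.client("s3", region_name="eu-west-1")
s3.create_bucket(
    Bucket="my-unique-bucket-name",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)
s3.download_file("my-unique-bucket-name", "folder/file.name", "file.name")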


Inconsistent access to subfolder of a bucket between gsutil and storage Client

To avoid managing a large number of buckets for data received from a lot of devices, I plan to have the devices write the files they capture into folders of a single bucket instead of having one bucket per device.
To make sure each device can only write to its own subfolder, I have set the IAM condition as described in this answer:
resource.name.startsWith('projects/_/buckets/dev_bucket/objects/test_folder')
My service account now has the Storage Object Creator and Storage Object viewer role with the condition above attached.
This is the output (truncated to only this service account) of the gcloud projects get-iam-policy <project> command:
- condition:
    expression: |-
      resource.name.startsWith("projects/_/buckets/dev_bucket/objects/test_folder/")
    title: only_test_subfolder
  members:
  - serviceAccount:myserviceaccount.iam.gserviceaccount.com
  role: roles/storage.objectCreator
- condition:
    expression: |-
      resource.name.startsWith("projects/_/buckets/dev_bucket/objects/test_folder/")
    title: only_test_subfolder
  members:
  - serviceAccount:myserviceaccount.iam.gserviceaccount.com
  role: roles/storage.objectViewer
When using the gsutil command, everything seems to work fine
# Set the authentication via the service account json key
gcloud auth activate-service-account --key-file=/path/to/my/key.json
# all of these commands work fine
gsutil ls gs://dev_bucket/test_folder
gsutil cp gs://dev_bucket/test_folder/distant_file.txt local_file.txt
# These ones get a 403 as expected
gsutil ls gs://dev_bucket/
gsutil ls gs://another_bucket
gsutil cp gs://dev_bucket/another_subfolder/somefile.txt local_file.txt
However, when I try to use the Google Cloud Storage Python client (v2.1.0) I cannot manage to make it work, mainly because I seem to have to get the bucket before getting an object inside that bucket.
import os
from google.cloud import storage
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="path/to/my/key.json"
client = storage.Client()
client.get_bucket("dev_bucket")
>>> Forbidden: 403 GET https://storage.googleapis.com/storage/v1/b/dev_bucket?projection=noAcl&prettyPrint=false: <Service account> does not have storage.buckets.get access to the Google Cloud Storage bucket.
I have also tried to list all files using the prefix argument, but get the same error:
client.list_blobs("dev_bucket", prefix="test_folder")
Is there a way to use the python storage client with this type of permissions ?
This is expected behavior!
You are doing:
gsutil ls gs://dev_bucket/test_folder
gsutil cp gs://dev_bucket/test_folder/distant_file.txt local_file.txt
Neither command requires any permission beyond object-level ones such as storage.objects.get, which your SA has from the Storage Object Viewer role,
but in your code you are trying to access the bucket details (the bucket itself, not the objects inside the bucket), so it won't work unless your SA has the storage.buckets.get permission.
this line:
client.get_bucket("dev_bucket")
will perform a GET request against the v1 buckets API, which requires the above-mentioned IAM permission.
So, you need to modify your code to read objects only without accessing bucket details.
Here is sample code for downloading objects from a bucket (sketched below).
Note: the method bucket(bucket_name, user_project=None) used in this sample will not perform any HTTP request, as quoted from the docs:
This will not make an HTTP request; it simply instantiates a bucket object owned by this client.
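A minimal sketch of that approach (the object name is the one from the question; adapt as needed):

from google.cloud import storage

client = storage.Client()

# bucket() only instantiates a local Bucket object -- no HTTP request is
# made, so storage.buckets.get is not needed.
bucket = client.bucket("dev_bucket")

# The download targets the object itself, which the conditional IAM
# binding on test_folder/ allows.
blob = bucket.blob("test_folder/distant_file.txt")
blob.download_to_filename("local_file.txt")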
BTW, you can try running something like:
gsutil ls -L -b gs://dev_bucket
I expect this command to give you the same error which you get from your code.
References:
https://cloud.google.com/storage/docs/access-control/iam-gsutil
https://cloud.google.com/storage/docs/access-control/iam-json

download file from s3 to local automatically

I am creating a Glue job (Python shell) to export data from Redshift and store it in S3. But how would I automate/trigger the file in S3 to be downloaded to the local network drive so the 3rd-party vendor can pick it up?
Without using Glue, I could create a Python utility that runs on a local server to extract data from Redshift as a file and save it to the local network drive, but I wanted to implement this framework in the cloud to avoid a dependency on the local server.
The AWS CLI sync function won't help because, once the vendor picks up a file, I should not put it in the local folder again.
Please suggest good alternatives.
If the interface team can use S3 API or CLI to get objects from S3 to put on the SFTP server, granting them S3 access through an IAM user or role would probably be the simplest solution. The interface team could write a script that periodically gets the list of S3 objects created after a specified date and copies them to the SFTP server.
If they can't use S3 API or CLI, you could use signed URLs. You'd still need to communicate the S3 object URLs to the interface team. A queue would be a good solution for that. But if they can use an AWS SQS client, I think it's likely they could just use the S3 API to find new objects and retrieve them.
It's not clear to me who controls the SFTP server, whether it's your interface team or the 3rd party vendor. If you can push files to the SFTP server yourself, you could create a S3 event notification that runs a Lambda function to copy the object to the SFTP server every time a new object is created in the S3 bucket.
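If signed URLs end up being the route, generating one with boto3 is straightforward (a sketch; the bucket and key names are placeholders):

import boto3

s3 = boto3.client("s3")

# Presigned URL valid for one hour; anyone holding it can GET the object
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-export-bucket", "Key": "exports/report.csv"},
    ExpiresIn=3600,
)
print(url)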

Accessing DynamoDB Local from boto3

I am doing the AWS tutorial on Python and DynamoDB. I downloaded and installed DynamoDB Local. I got the access key and secret access key. I installed boto3 for Python. The only step I have left is setting up authentication credentials. I do not have the AWS CLI installed, so where should I put the access key, the secret key, and the region?
Do I include them in my Python code?
Do I make a file in my directory where I put this info? Then should I write anything in my Python code so it can find it?
You can try passing the access key and secret key in your code like this:
import boto3
session = boto3.Session(
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
)
client = session.client('dynamodb')
OR
dynamodb = session.resource('dynamodb')
From the AWS documentation:
Before you can access DynamoDB programmatically or through the AWS
Command Line Interface (AWS CLI), you must configure your credentials
to enable authorization for your applications. Downloadable DynamoDB
requires any credentials to work, as shown in the following example.
AWS Access Key ID: "fakeMyKeyId"
AWS Secret Access Key:"fakeSecretAccessKey"
You can use the aws configure command of the AWS
CLI to set up credentials. For more information, see Using the AWS
CLI.
So, you need to create an .aws folder in your home directory.
There, create the credentials and config files.
Here's how to do this:
https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html
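For DynamoDB Local the values can simply be the fake ones from the documentation quote above; the region below is just an example (a sketch of the two files, not the only valid layout):

# ~/.aws/credentials
[default]
aws_access_key_id = fakeMyKeyId
aws_secret_access_key = fakeSecretAccessKey

# ~/.aws/config
[default]
region = us-west-2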
If you want to write portable code and keep in the spirit of developing 12-factor apps, consider using environment variables.
The advantage is that locally, both the CLI and the boto3 Python library in your code (and pretty much all the other official AWS SDKs: PHP, Go, etc.) are designed to look for these values.
An example using the official Docker image to quickly start DynamoDB local:
# Start a local DynamoDB instance on port 8000
docker run -p 8000:8000 amazon/dynamodb-local
Then in a terminal, set some defaults that the CLI and SDKs like boto3 are looking for.
Note that these will be available until you close your terminal session.
# Region doesn't matter, CLI will complain if not provided
export AWS_DEFAULT_REGION=us-east-1
# Set some dummy credentials, dynamodb local doesn't care what these are
export AWS_ACCESS_KEY_ID=abc
export AWS_SECRET_ACCESS_KEY=abc
You should then be able to run the following (in the same terminal session) if you have the CLI installed. Note the --endpoint-url flag.
# Create a new table in DynamoDB Local
aws dynamodb create-table \
    --endpoint-url http://127.0.0.1:8000 \
    --table-name tmp \
    --attribute-definitions AttributeName=id,AttributeType=S \
    --key-schema AttributeName=id,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST
You should then be able to list the tables with:
aws dynamodb list-tables --endpoint-url http://127.0.0.1:8000
And get a result like:
{
    "TableNames": [
        "tmp"
    ]
}
So how do we get the endpoint URL that we've been specifying in the CLI to work in Python? Unfortunately, there isn't a default environment variable for the endpoint URL in the boto3 codebase, so we'll need to pass it in when the code runs. The docs for .NET and Java are comprehensive, but for Python they are a bit more elusive. Based on the boto3 GitHub repo (and also see this great answer), we need to create a client or resource with the endpoint_url keyword. In the code below, we look for a custom environment variable called AWS_DYNAMODB_ENDPOINT_URL. The point is that if it's specified it will be used, and otherwise the code falls back to whatever the platform default is, making it portable.
# Run in the same shell as before
export AWS_DYNAMODB_ENDPOINT_URL=http://127.0.0.1:8000
# file test.py
import os
import boto3
# Get environment variable if it's defined
# Make sure to set the environment variable before running
endpoint_url = os.environ.get('AWS_DYNAMODB_ENDPOINT_URL', None)
# Using (high level) resource, same keyword for boto3.client
resource = boto3.resource('dynamodb', endpoint_url=endpoint_url)
tables = resource.tables.all()
for table in tables:
    print(table)
Finally, run this snippet with
# Run in the same shell as before
python3 test.py
# Should produce the following output:
# dynamodb.Table(name='tmp')

How to grant a certain service account access to a specific Google Storage bucket via Python?

I am automating the creation of storage buckets: it boils down to creating a bucket if it doesn't exist and applying the correct IAM policies.
In gsutil I can do it by:
gsutil acl ch -u john.doe@example.com:WRITE gs://example-bucket
gsutil acl ch -u john.doe@example.com:READ gs://example-bucket
The problem is that I don't understand how to do the same in Python. I looked through GitHub, Stack Overflow, and the official docs and don't see a way.
I create a bucket with client.create_bucket(bucket_name), based on the official Python library and the examples there.
I think you should use google-cloud-storage
Here is some example code:
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket(bucket_name)
acl = bucket.acl
# For user
acl.user("me@example.org").grant_read()
acl.all_authenticated().grant_write()
# For service account
acl.service_account("example#example.iam.gserviceaccount.com").grant_read()
acl.save()
print(list(acl))
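If you want to manage bucket IAM policies rather than legacy ACLs (the question mentions IAM), the same library exposes get_iam_policy/set_iam_policy. A hedged sketch, with placeholder bucket and service account names:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("example-bucket")

# Grant a service account read access to objects in this bucket
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.objectViewer",
    "members": {"serviceAccount:example@example.iam.gserviceaccount.com"},
})
bucket.set_iam_policy(policy)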

mounting an s3 bucket in ec2 and using transparently as a mnt point

I have a webapp (call it myapp.com) that allows users to upload files. The webapp will be deployed on an Amazon EC2 instance. I would like to serve these files back out to the webapp consumers via an S3-bucket-based domain (i.e. uploads.myapp.com).
When a user uploads files, I can easily drop them into a folder called "site_uploads" on the local EC2 instance. However, since my EC2 instance has finite storage, with a lot of uploads the EC2 file system will fill up quickly.
It would be great if the EC2 instance could mount an S3 bucket as the "site_upload" directory, so that uploads to the EC2 "site_upload" directory automatically end up on uploads.myapp.com (and my webapp can use template tags to make sure the links for this uploaded content are based on that S3-backed domain). This also gives me scalable file serving, as requests for files hit S3 and not my EC2 instance. It also makes it easy for my webapp to perform scaling/resizing of the images that appear locally in "site_upload" but are actually on S3.
I'm looking at s3fs, but judging from the comments, it doesn't look like a fully baked solution. I'm looking for a non-commercial solution.
FYI, the webapp is written in Django, not that that changes the particulars too much.
I'm not using EC2, but I do have my S3 bucket permanently mounted on my Linux server. The way I did it is with Jungledisk. It isn't a non-commercial solution, but it's very inexpensive.
First I set up Jungledisk as normal. Then I make sure FUSE is installed. Mostly you just need to create the configuration file with your secret keys and such. Then just add a line to your fstab, something like this:
jungledisk /path/to/mount/at fuse noauto,allow_other,config=/path/to/jungledisk/config/file.xml 0 0
Then just mount, and you're good to go.
For uploads, your users can upload directly to S3, as described here.
This way you won't need to mount S3.
When serving the files, you can also do that from S3 directly by marking the files public. I'd prefer to name the site "files.mydomain.com" or "images.mydomain.com" and point it at S3.
I use s3fs, but there are no readily available distributions. I've got my build here for anyone who wants it easier.
Configuration documentation wasn't available, so I winged it until I got this in my fstab:
s3fs#{{ bucket name }} {{ /path/to/mount/point }} fuse allow_other,accessKeyId={{ key }},secretAccessKey={{ secret key }} 0 0
s3fs
This is a little snippet that I use on an Ubuntu system; I have not tested it elsewhere, so it will obviously need to be adapted for an M$ system. You'll also need to install s3-simple-fuse. If you wind up eventually moving your job to the cloud, I'd recommend Fabric to run the same command.
import os, subprocess

'''
Note: this is for Linux with s3cmd installed and libfuse2 installed
Run: 'fusermount -u mount_directory' to unmount
'''

def mountS3(aws_access_key_id, aws_secret_access_key, targetDir, bucketName=None):
    if bucketName is None:
        bucketName = 's3Bucket'
    mountDir = os.path.join(targetDir, bucketName)
    if not os.path.isdir(mountDir):
        os.mkdir(mountDir)
    subprocess.call(
        's3-simple-fuse %s -o AWS_ACCESS_KEY_ID=%s,AWS_SECRET_ACCESS_KEY=%s,bucket=%s'
        % (mountDir, aws_access_key_id, aws_secret_access_key, bucketName),
        shell=True,
    )
I'd suggest using a separately-mounted EBS volume. I tried doing the same thing for some movie files. Access to S3 was slow, and S3 has some limitations like not being able to rename files, no real directory structure, etc.
You can set up EBS volumes in a RAID5 configuration and add space as you need it.
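As a rough illustration of the RAID5 suggestion (device names and mount point are placeholders; attach the EBS volumes first and adapt to your setup):

# Combine three attached EBS volumes into a single RAID5 array
sudo mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/xvdf /dev/xvdg /dev/xvdh
sudo mkfs.ext4 /dev/md0
sudo mount /dev/md0 /mnt/site_upload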
