AWS uploading file into wrong bucket - python

I am using AWS SageMaker and trying to upload a data folder from SageMaker into S3. What I am trying to do is upload my data into the s3_train_data directory (the directory already exists in S3). However, it does not upload into that bucket; instead it uploads into a default bucket that was created for the session, and in turn creates a new folder path built from the s3_train_data variable.
Code used for the upload:
import os
import sagemaker
from sagemaker import get_execution_role
sagemaker_session = sagemaker.Session()
role = get_execution_role()
bucket = <bucket name>
prefix = <folders1/folders2>
key = <input>
s3_train_data = 's3://{}/{}/{}/'.format(bucket, prefix, key)
#path 'data' is the folder in the Jupyter Instance, contains all the training data
inputs = sagemaker_session.upload_data(path= 'data', key_prefix= s3_train_data)
Is the problem in the code or more in how I created the notebook?
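The likely culprit is the `key_prefix` argument: `upload_data` treats it as a key prefix *inside* a bucket (the session's default bucket unless `bucket=` is passed), not as a full `s3://` URI. A rough stdlib sketch of how the resulting object path comes together (the bucket and prefix names below are placeholders, not the real SDK internals):

```python
# upload_data(path, bucket=None, key_prefix=...) produces objects at roughly
# s3://<bucket or session default bucket>/<key_prefix>/<filename>.
# Passing a full 's3://...' URI as key_prefix therefore nests that whole
# string inside the default bucket, which matches the behaviour described above.
def object_uri(bucket, key_prefix, filename):
    return 's3://{}/{}/{}'.format(bucket, key_prefix, filename)

# What the question's code effectively did (default bucket + URI-as-prefix):
wrong = object_uri('sagemaker-default-bucket',
                   's3://mybucket/folders1/folders2/input/', 'train.csv')
# What passing bucket='mybucket', key_prefix='folders1/folders2/input' gives:
right = object_uri('mybucket', 'folders1/folders2/input', 'train.csv')
```

So passing the bucket separately (`sagemaker_session.upload_data(path='data', bucket=bucket, key_prefix=...)`) and keeping `key_prefix` as a plain prefix should put the data where you expect.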

You could look at the sample notebooks to see how to upload data to an S3 bucket; there are many ways to do it, and I am just giving you hints. Also note that you never created a boto3 session/resource to access the S3 bucket. Here is one way to do it:
import os
import urllib.request
import boto3

def download(url):
    filename = url.split("/")[-1]
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)

def upload_to_s3(channel, file):
    # `bucket` is assumed to be defined earlier (your bucket name)
    s3 = boto3.resource('s3')
    data = open(file, "rb")
    key = channel + '/' + file
    s3.Bucket(bucket).put_object(Key=key, Body=data)

# caltech-256
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-train.rec')
upload_to_s3('train', 'caltech-256-60-train.rec')
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-val.rec')
upload_to_s3('validation', 'caltech-256-60-val.rec')
link : https://buildcustom.notebook.us-east-2.sagemaker.aws/notebooks/sample-notebooks/introduction_to_amazon_algorithms/imageclassification_caltech/Image-classification-fulltraining.ipynb
Another way to do it.
import io
import os
import boto3
import sagemaker.amazon.common as smac

bucket = '<your_s3_bucket_name_here>'  # your S3 bucket for data and model artifacts
prefix = 'sagemaker/breast_cancer_prediction'  # place to upload training files within the bucket

# do some processing, then prepare to push the data
# (train_X, train_y and train_file come from earlier cells of the notebook)
f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, train_X.astype('float32'), train_y.astype('float32'))
f.seek(0)
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', train_file)).upload_fileobj(f)
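The `f.seek(0)` line matters: `upload_fileobj` reads from the stream's current position, so without rewinding it would upload zero bytes. A minimal stdlib illustration:

```python
import io

f = io.BytesIO()
f.write(b'serialized training data')  # stand-in for write_numpy_to_dense_tensor
# Without seek(0), the read position sits at the end of the buffer:
assert f.read() == b''
f.seek(0)  # rewind so a consumer (e.g. upload_fileobj) sees all the bytes
assert f.read() == b'serialized training data'
```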
Link : https://buildcustom.notebook.us-east-2.sagemaker.aws/notebooks/sample-notebooks/introduction_to_applying_machine_learning/breast_cancer_prediction/Breast%20Cancer%20Prediction.ipynb
YouTube link: https://www.youtube.com/watch?v=-YiHPIGyFGo - how to pull data into an S3 bucket.

Related

Mock download file from s3 with actual file

I would like to write a test that mocks a function's download from S3 and replaces it locally with an actual file that exists on my machine. I took inspiration from this post. The idea is the following:
from moto import mock_s3
import boto3

def dl(src_f, dest_f):
    s3 = boto3.resource('s3')
    s3.Bucket('fake_bucket').download_file(src_f, dest_f)

@mock_s3
def _create_and_mock_bucket():
    # Create fake bucket and mock it
    bucket = "fake_bucket"
    # We need to create the bucket since this is all in Moto's 'virtual' AWS account
    file_path = "some_real_file.txt"
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket=bucket)
    s3.put_object(Bucket=bucket, Key=file_path, Body="")
    dl(file_path, 'some_other_real_file.txt')

_create_and_mock_bucket()
Now some_other_real_file.txt exists, but it is not a copy of some_real_file.txt. Any idea on how to do that?
If 'some_real_file.txt' already exists on your system, you should use upload_file instead:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_file
For your example:
bucket = "fake_bucket"
file_path = "some_real_file.txt"
s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket=bucket)
s3_resource = boto3.resource('s3')
s3_resource.meta.client.upload_file(file_path, bucket, file_path)
Your code currently creates an empty file in S3 (since Body=""), and that is exactly what is being downloaded to 'some_other_real_file.txt'.
Notice that if you change the Body parameter to contain some text, that exact content will be downloaded to 'some_other_real_file.txt'.
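A stdlib sketch of that point, simulating the store/download round trip with a plain dict standing in for the bucket (no real S3 involved):

```python
import os
import tempfile

# Toy stand-in for the bucket: key -> stored string.
fake_bucket = {'some_real_file.txt': ''}  # put_object(..., Body="") stored ""

def download_file(key, dest):
    # write the stored object verbatim, as a real download would
    with open(dest, 'w') as fh:
        fh.write(fake_bucket[key])

dest = os.path.join(tempfile.mkdtemp(), 'some_other_real_file.txt')
download_file('some_real_file.txt', dest)
# dest is now an empty file, mirroring the behaviour described above
```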

Django - AWS S3 - Moving Files

I am using AWS S3 as my default file storage system. I have a model with a file field like so:
class Segmentation(models.Model):
    file = models.FileField(...)
I am running image processing jobs on a second server that dumps processed images to a different AWS S3 bucket.
I want to save the processed-image in my Segmentation table.
Currently I am using boto3 to manually download the file to my "local" server (where my django-app lives) and then upload it to the local S3 bucket like so:
import os

import boto3
from django.core.files import File

def save_file(segmentation, foreign_s3_key):
    # set foreign bucket
    foreign_bucket = 'foreign-bucket'
    # create a temp file:
    temp_local_file = 'tmp/temp.file'
    # use boto3 to download foreign file locally:
    s3_client = boto3.client('s3')
    s3_client.download_file(foreign_bucket, foreign_s3_key, temp_local_file)
    # save file to segmentation:
    segmentation.file = File(open(temp_local_file, 'rb'))
    segmentation.save()
    # delete temp file:
    os.remove(temp_local_file)
This works fine but it is resource intensive. I have some jobs that need to process hundreds of images.
Is there a way to copy a file from the foreign bucket to my local bucket and set the segmentation.file field to the copied file?
I am assuming you want to move files from a source bucket to a destination bucket, as the title suggests, and do some processing in between.
import boto3

my_west_session = boto3.Session(region_name='us-west-2')
my_east_session = boto3.Session(region_name='us-east-1')
backup_s3 = my_west_session.resource("s3")
video_s3 = my_east_session.resource("s3")
local_bucket = backup_s3.Bucket('localbucket')
foreign_bucket = video_s3.Bucket('foreignbucket')

for obj in foreign_bucket.objects.all():
    # do some processing on objects
    copy_source = {
        'Bucket': foreign_bucket.name,  # the bucket *name*, not the Bucket object
        'Key': obj.key
    }
    local_bucket.copy(copy_source, obj.key)
See the boto3 documentation on Session configuration, and on the S3 resource copy or CopyObject methods, depending on your requirement.

How to download a file from Google Cloud Platform storage

I was reading the python documentation for google cloud storage and was successfully able to create a method that uploads files, however, I am not able to find a way to download files using a blob's URL. I was able to download the file using the filename, but that's not practical since the user could upload files with the same name. The blob is private. I have access to the blob's URL, so I was wondering if there is a way to download files using this link.
This is my upload code which works perfectly:
def upload_blob(bucket_name, filename, file_obj):
    if filename and file_obj:
        storage_client = storage.Client()
        bucket = storage_client.bucket('example-storage-bucket')
        blob = bucket.blob(filename)
        blob.upload_from_file(file_obj)  # binary file data
        form_logger.info('File {} uploaded'.format(filename))
        return blob
This code downloads the file, but I could only figure it out with the blob name, not URL:
def download_blob(bucket_name, url):
    if url:
        storage_client = storage.Client()
        bucket = storage_client.bucket('example-storage-bucket')
        blob = bucket.blob(url)
        blob.download_to_filename("example.pdf")
Any suggestions or thoughts on how to download the file using the blob's media link URL?
For example, bucket example-storage-bucket has file folder/example.pdf and its
Link URL is https://storage.cloud.google.com/example-storage-bucket/folder/example.pdf and
URI is gs://example-storage-bucket/folder/example.pdf
Use the function below to download a blob via its GCS link URL (if you are using Python 3.x):
import os
from urllib.parse import urlparse

from google.cloud import storage

def decode_gcs_url(url):
    p = urlparse(url)
    path = p.path[1:].split('/', 1)
    bucket, file_path = path[0], path[1]
    return bucket, file_path

def download_blob(url):
    if url:
        storage_client = storage.Client()
        bucket, file_path = decode_gcs_url(url)
        bucket = storage_client.bucket(bucket)
        blob = bucket.blob(file_path)
        blob.download_to_filename(os.path.basename(file_path))
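For example, applying that URL parsing to the sample link URL from above (stdlib only; the helper is restated so the snippet runs on its own):

```python
from urllib.parse import urlparse

def decode_gcs_url(url):
    # same parsing as above: strip the leading '/' and split off the bucket
    p = urlparse(url)
    path = p.path[1:].split('/', 1)
    return path[0], path[1]

url = 'https://storage.cloud.google.com/example-storage-bucket/folder/example.pdf'
bucket, file_path = decode_gcs_url(url)
# bucket is 'example-storage-bucket', file_path is 'folder/example.pdf'
```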
I think what you're saying is that you want to download the blob to a file whose name is based on the blob name, correct? If so, you can find the blob name in the blob.metadata, and then pick a filename based on that blob name.

Google cloud function to copy all data of source bucket to another bucket using python

I want to copy data from one bucket to another bucket using google cloud function. At this time I am able to copy only a single file to destination but I want to copy all files, folders, and sub-folders to my destination bucket.
from google.cloud import storage
def copy_blob(bucket_name="loggingforproject", blob_name="assestnbfile.json",
              destination_bucket_name="test-assest", destination_blob_name="logs"):
    """Copies a blob from one bucket to another with a new name."""
    storage_client = storage.Client()
    source_bucket = storage_client.bucket(bucket_name)
    source_blob = source_bucket.blob(blob_name)
    destination_bucket = storage_client.bucket(destination_bucket_name)
    blob_copy = source_bucket.copy_blob(
        source_blob, destination_bucket, destination_blob_name
    )
    print(
        "Blob {} in bucket {} copied to blob {} in bucket {}.".format(
            source_blob.name,
            source_bucket.name,
            blob_copy.name,
            destination_bucket.name,
        )
    )
Using gsutil cp is a good option. However, if you want to copy the files using Cloud Functions - it can be achieved as well.
At the moment, your function only copies a single file. In order to copy the whole content of your bucket you would need to iterate through the files within it.
Here is a code sample that I wrote for an HTTP Cloud Function and tested - you can use it for a reference:
MAIN.PY
from google.cloud import storage
def copy_bucket_files(request):
    """
    Copies the files from a specified bucket into the selected one.
    """
    # Check if the bucket's name was specified in the request
    if request.args.get('bucket'):
        bucketName = request.args.get('bucket')
    else:
        return "The bucket name was not provided. Please try again."
    try:
        # Initiate Cloud Storage client
        storage_client = storage.Client()
        # Define the origin bucket
        origin = storage_client.bucket(bucketName)
        # Define the destination bucket
        destination = storage_client.bucket('<my-test-bucket>')
        # Get the list of the blobs located inside the bucket which files you want to copy
        blobs = storage_client.list_blobs(bucketName)
        for blob in blobs:
            origin.copy_blob(blob, destination)
        return "Done!"
    except:
        return "Failed!"
REQUIREMENTS.TXT
google-cloud-storage==1.22.0
How to call that function:
It can be called via the URL provided for triggering the function, by appending that URL with /?bucket=<name-of-the-bucket-to-copy> (name without <, >):
https://<function-region>-<project-name>.cloudfunctions.net/<function-name>/?bucket=<bucket-name>
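Building that trigger URL programmatically might look like this (the region, project, and function names below are hypothetical placeholders; substitute your own):

```python
from urllib.parse import urlencode

# Hypothetical deployment values - replace with your function's region/project/name.
base = 'https://us-central1-my-project.cloudfunctions.net/copy_bucket_files'
url = '{}/?{}'.format(base, urlencode({'bucket': 'my-source-bucket'}))
# url == 'https://us-central1-my-project.cloudfunctions.net/copy_bucket_files/?bucket=my-source-bucket'
```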
You can use the gsutil cp command for this:
gsutil cp gs://first-bucket/* gs://second-bucket
See https://cloud.google.com/storage/docs/gsutil/commands/cp for more details
Here is my TypeScript code; I call it from my website when I need to move images.
exports.copiarImagen = functions.https.onCall(async (data, response) => {
    var origen = data.Origen;
    var destino = data.Destino;
    console.log('Files:');
    const [files] = await admin.storage().bucket("bucket's path").getFiles({ prefix: 'path where your images are' });
    files.forEach(async file => {
        var nuevaRuta = file.name;
        await admin.storage().bucket("posavka.appspot.com").file(file.name)
            .copy(admin.storage().bucket("posavka.appspot.com").file(nuevaRuta.replace(origen, destino)));
        await admin.storage().bucket("posavka.appspot.com").file(file.name).delete();
    });
});
First I get all files in a specific path, then I copy those files to the new path, and finally I delete the file on the old path
I hope it helps you :D

Python - Unable to upload file to defined folder in AWS S3 bucket

I am trying to upload a csv file from my local machine to an AWS S3 bucket. Given below is the code I am using, but it doesn't seem to upload the file to the S3 folder defined. Could anyone assist?
import boto3
from botocore.client import Config

ACCESS_KEY_ID = 'accesskeyid'
ACCESS_SECRET_KEY = 'secretkeyid'
BUCKET_NAME = 'bucketname'

data = open('/desktop/file.csv', 'rb')
s3 = boto3.resource(
    's3',
    aws_access_key_id=ACCESS_KEY_ID,
    aws_secret_access_key=ACCESS_SECRET_KEY,
    config=Config(signature_version='s3v4')
)
s3.Bucket(BUCKET_NAME).put_object(Key='/sub-folder/sub-folder2/file.csv', Body=data)
print("Uploaded successfully")
Could anyone help me in finding where I am going wrong. Thanks
You need to remove the / from the beginning of the Key argument.
Using your existing code, the file path will be: BUCKET_NAME//sub-folder/sub-folder2/file.csv.
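A stdlib sketch of why the leading slash matters: S3 keys are plain strings, so a leading `/` becomes an empty "folder" name in the object path:

```python
bucket = 'bucketname'
key = '/sub-folder/sub-folder2/file.csv'

path_with_slash = 's3://{}/{}'.format(bucket, key)             # note the double slash
path_fixed = 's3://{}/{}'.format(bucket, key.lstrip('/'))      # stripped leading '/'
# path_with_slash == 's3://bucketname//sub-folder/sub-folder2/file.csv'
# path_fixed      == 's3://bucketname/sub-folder/sub-folder2/file.csv'
```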
