I am using AWS S3 as my default file storage system. I have a model with a file field like so:
class Segmentation(models.Model):
    file = models.FileField(...)
I am running image processing jobs on a second server that dump processed images into a different AWS S3 bucket.
I want to save the processed image in my Segmentation table.
Currently I am using boto3 to manually download the file to my "local" server (where my Django app lives) and then upload it to the local S3 bucket, like so:
from django.core.files import File
import boto3
import os

def save_file(segmentation, foreign_s3_key):
    # set foreign bucket
    foreign_bucket = 'foreign-bucket'

    # create a temp file:
    temp_local_file = 'tmp/temp.file'

    # use boto3 to download foreign file locally:
    s3_client = boto3.client('s3')
    s3_client.download_file(foreign_bucket, foreign_s3_key, temp_local_file)

    # save file to segmentation:
    segmentation.file = File(open(temp_local_file, 'rb'))
    segmentation.save()

    # delete temp file:
    os.remove(temp_local_file)
This works fine but it is resource intensive. I have some jobs that need to process hundreds of images.
Is there a way to copy a file from the foreign bucket to my local bucket and set the segmentation.file field to the copied file?
I am assuming you want to move some files from one source bucket to some destination bucket, as the OP header suggests, and do some processing in between.
import boto3

my_west_session = boto3.Session(region_name='us-west-2')
my_east_session = boto3.Session(region_name='us-east-1')

backup_s3 = my_west_session.resource("s3")
video_s3 = my_east_session.resource("s3")

local_bucket = backup_s3.Bucket('localbucket')
foreign_bucket = video_s3.Bucket('foreignbucket')

for obj in foreign_bucket.objects.all():
    # do some processing
    # on objects
    copy_source = {
        'Bucket': foreign_bucket.name,  # the copy source expects the bucket name, not the Bucket resource
        'Key': obj.key
    }
    local_bucket.copy(copy_source, obj.key)
See Session configurations, and S3 resource Copy or CopyObject, depending on your requirement.
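Coming back to the original question about Segmentation.file: if both buckets live in S3 and the Django app uses an S3 storage backend (e.g. django-storages), a minimal sketch, assuming the copied key should simply become the FileField value and that 'local-bucket' is the bucket behind the default storage, could copy server side and assign the key without any local download:

import boto3

def save_file_by_copy(segmentation, foreign_s3_key):
    # Server-side copy from the foreign bucket into the bucket backing
    # the default storage; no bytes pass through the Django server.
    s3 = boto3.resource('s3')
    copy_source = {'Bucket': 'foreign-bucket', 'Key': foreign_s3_key}
    s3.Bucket('local-bucket').copy(copy_source, foreign_s3_key)

    # Point the FileField at the copied key instead of re-uploading the file.
    segmentation.file.name = foreign_s3_key
    segmentation.save()

This keeps the transfer entirely inside S3, which also avoids the temp-file download and cleanup.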
I have the following Lambda function, which uses an S3 trigger to copy files from a source to a destination bucket. This is working fine.
import os
import logging
import boto3

LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

DST_BUCKET = os.environ.get('DST_BUCKET')
REGION = os.environ.get('REGION')

s3 = boto3.resource('s3', region_name=REGION)

def handler(event, context):
    LOGGER.info('Event structure: %s', event)
    LOGGER.info('DST_BUCKET: %s', DST_BUCKET)

    for record in event['Records']:
        src_bucket = record['s3']['bucket']['name']
        src_key = record['s3']['object']['key']

        copy_source = {
            'Bucket': src_bucket,
            'Key': src_key
        }
        LOGGER.info('copy_source: %s', copy_source)
        bucket = s3.Bucket(DST_BUCKET)
        bucket.copy(copy_source, src_key)

    return {
        'status': 'ok'
    }
What I want to do now is modify the code above to delete the file(s) (not the folder) from the source bucket after a successful upload to the destination bucket.
Use case: a user uploads three files, two legitimate CSV files and one corrupted CSV file. The Lambda triggers on the source bucket and begins copying those files. It loops through the files, outputting success (or true) when done and false if there were issues, along with the filename, then deletes from the source bucket only the files that were successfully uploaded.
I've tried various try/except blocks for this, but it ends up either deleting the entire folder, or there are synchronization issues between the buckets where the file is deleted from the source before the upload completes, etc.
I don't want to do away with the loop above, so that if multiple files are uploaded it loops through all of them and likewise deletes each one once it has been successfully uploaded to the other bucket. I'm unsure whether a simple boolean would be sufficient for this use case or whether I need another flag of some sort. The flag would have to keep track of the specific key, though, so that it knows which uploads succeeded and which did not.
Before removing the file from the source bucket, you can verify that it was uploaded correctly using s3.Object(DST_BUCKET, src_key).load():
import os
import logging
import boto3

LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

DST_BUCKET = os.environ.get('DST_BUCKET')
REGION = os.environ.get('REGION')

s3 = boto3.resource('s3', region_name=REGION)

def handler(event, context):
    LOGGER.info(f'Event structure: {event}')
    LOGGER.info(f'DST_BUCKET: {DST_BUCKET}')

    for record in event['Records']:
        src_bucket = record['s3']['bucket']['name']
        src_key = record['s3']['object']['key']

        copy_source = {
            'Bucket': src_bucket,
            'Key': src_key
        }
        LOGGER.info(f'copy_source: {copy_source}')
        bucket = s3.Bucket(DST_BUCKET)
        bucket.copy(copy_source, src_key)

        try:
            # Check file
            s3.Object(DST_BUCKET, src_key).load()
            LOGGER.info(f"File {src_key} uploaded to Bucket {DST_BUCKET}")
            # Delete the file from the source bucket
            s3.Object(src_bucket, src_key).delete()
            LOGGER.info(f"File {src_key} deleted from Bucket {src_bucket}")
        except Exception as e:
            return {"error": str(e)}

    return {'status': 'ok'}
I've tested it with files in two different regions and it worked great for me.
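As a variation (a sketch, not part of the answer above; the helper name is hypothetical), you could catch botocore's ClientError so that a missing destination object (a 404 from load()) is reported separately from other failures, and the source is deleted only when the check passes:

import logging
import boto3
from botocore.exceptions import ClientError

LOGGER = logging.getLogger()

def verify_then_delete(dst_bucket, src_bucket, src_key):
    """Return True only if the copy exists in dst_bucket and the source was deleted."""
    s3 = boto3.resource('s3')
    try:
        s3.Object(dst_bucket, src_key).load()  # HEAD request on the copied object
    except ClientError as e:
        if e.response['Error']['Code'] == '404':
            LOGGER.error('Copy of %s not found in %s; keeping the source object', src_key, dst_bucket)
        else:
            LOGGER.error('Error checking %s: %s', src_key, e)
        return False
    s3.Object(src_bucket, src_key).delete()
    LOGGER.info('Deleted %s from %s', src_key, src_bucket)
    return True

Calling this per record inside the loop gives you the per-key success flag the question asks about.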
I would like to write a script that detects new files (CSV files in this case) added to a folder and then uploads those new files to my AWS S3 bucket, keeping their original names. Currently the script I have lets me manually select a file and then upload it with a name of my choice.
import boto3
import pandas as pd
from io import StringIO

hc = pd.read_csv(open(r'CSV PATH'))

s3 = boto3.client('s3', aws_access_key_id='ACCESSKEYID',
                  aws_secret_access_key='ACCESSKEY')

csv_buf = StringIO()
hc.to_csv(csv_buf, header=True, index=False)
csv_buf.seek(0)
s3.put_object(Bucket='BucketName', Body=csv_buf.getvalue(), Key='Original CSV Name from Above')
I assume I need the following sections in the code:
Code to monitor said location (but only when running the app - does not need to run 24/7)
Code to pull new file from said location
Code to upload to S3 Bucket
Any Tips?
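For reference, a minimal sketch of those three pieces, assuming a simple polling loop (the folder path, bucket name, and poll interval below are placeholders rather than values from the question):

import os
import time
import boto3

WATCH_DIR = r'FOLDER PATH'   # placeholder: local folder to monitor
BUCKET = 'BucketName'        # placeholder: target bucket

s3 = boto3.client('s3')
seen = set(os.listdir(WATCH_DIR))

while True:
    current = {f for f in os.listdir(WATCH_DIR) if f.endswith('.csv')}
    for new_file in current - seen:
        # upload_file keeps the original filename when it is reused as the key
        s3.upload_file(os.path.join(WATCH_DIR, new_file), BUCKET, new_file)
        print(f'Uploaded {new_file}')
    seen |= current
    time.sleep(10)  # poll every 10 seconds while the app is running

Since it only needs to run while the app is open, a plain loop like this is enough; a filesystem-watcher library would be an alternative.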
I have a requirement to copy files between two buckets, detailed below:
Bucket A / folder A is the source inbound box for daily files, which are created as f1_abc_20210304_000. I want to scan folder A for the latest file (10 files arrive every day), then copy it to Bucket B / folder B / FILE name (i.e. from the 10 files) / 2021/03/04, dropping the files into the 04 folder.
Any suggestions on how I should proceed with the design?
Thanks
RG
Did you want to do this copy task using Airflow?
If yes, Airflow provides GCSToGCSOperator.
One approach is to use the client libraries; for the example below I'm using the Python client library for Google Cloud Storage.
move.py
from google.cloud import storage
from google.oauth2 import service_account
import os

# as mentioned on https://cloud.google.com/docs/authentication/production
key_path = "credentials.json"
credentials = service_account.Credentials.from_service_account_file(key_path)
storage_client = storage.Client(credentials=credentials)

bucket_name = "source-bucket-id"
destination_bucket_name = "destination-bucket-id"

source_bucket = storage_client.bucket(bucket_name)

# prefix 'original_data' is the folder where I store the data
array_blobs = source_bucket.list_blobs(prefix='original_data')

filtered_dict = []
for blob in array_blobs:
    if str(blob.name).endswith('.csv'):
        # add additional logic to handle the files you want to ingest
        filtered_dict.append({'name': blob.name, 'time': blob.time_created})

orderedlist = sorted(filtered_dict, key=lambda d: d['time'], reverse=True)
latestblob = orderedlist[0]['name']

# prefix 'destination_data' is the folder where I want to move the data
destination_blob_name = "destination_data/{}".format(os.path.basename(latestblob))

source_blob = source_bucket.blob(latestblob)
destination_bucket = storage_client.bucket(destination_bucket_name)
blob_copy = source_bucket.copy_blob(source_blob, destination_bucket, destination_blob_name)

print(
    "Blob {} in bucket {} copied to blob {} in bucket {}.".format(
        source_blob.name,
        source_bucket.name,
        blob_copy.name,
        destination_bucket.name,
    )
)
For a bit of context on the code: I used the Google Cloud Storage Python client to log in, list the files in my source folder original_data inside the bucket source-bucket-id, and collect the relevant ones (you can modify the pick-up logic by adding your own criteria to fit your situation). After that I pick the latest file based on creation time and use its name to move it into destination-bucket-id. As a note, the destination_blob_name variable includes the folder where I want to place the file as well as the final filename.
UPDATE: I missed the airflow tag. In that case you should use the operator that comes with the Google provider, which is GCSToGCSOperator. The parameters to pass can be obtained with a Python task and handed to the operator. It will work like this:
@task(task_id="get_gcs_params")
def get_gcs_params(**kwargs):
    date = kwargs["next_ds"]
    # logic should be as displayed on move.py
    # ...
    return {"source_objects": source, "destination_object": destination}

gcs_params = get_gcs_params()

copy_file = GCSToGCSOperator(
    task_id='copy_single_file',
    source_bucket='data',
    source_objects=gcs_params.output['source_objects'],
    destination_bucket='data_backup',
    destination_object=gcs_params.output['destination_object'],
    gcp_conn_id=google_cloud_conn_id
)
For additional guidance you can check the Cloud Storage samples list; I used Copy an object between buckets as a reference.
I would like to write a test that mocks a function's download from S3 and replaces it with an actual file that exists on my machine. I took inspiration from this post. The idea is the following:
from moto import mock_s3
import boto3

def dl(src_f, dest_f):
    s3 = boto3.resource('s3')
    s3.Bucket('fake_bucket').download_file(src_f, dest_f)

@mock_s3
def _create_and_mock_bucket():
    # Create fake bucket and mock it
    bucket = "fake_bucket"
    # We need to create the bucket since this is all in Moto's 'virtual' AWS account
    file_path = "some_real_file.txt"
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket=bucket)
    s3.put_object(Bucket=bucket, Key=file_path, Body="")
    dl(file_path, 'some_other_real_file.txt')

_create_and_mock_bucket()
Now some_other_real_file.txt exists, but it is not a copy of some_real_file.txt. Any idea on how to do that?
If 'some_real_file.txt' already exists on your system, you should use upload_file instead:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_file
For your example:
file_path = "some_real_file.txt"
s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket=bucket)
s3_resource = boto3.resource('s3')
s3_resource.meta.client.upload_file(file_path, bucket, file_path)
Your code currently creates an empty file in S3 (since Body=""), and that is exactly what is being downloaded to 'some_other_real_file.txt'.
Notice that, if you change the Body-parameter to have some text in it, that exact content will be downloaded to 'some_other_real_file.txt'.
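Putting that together with the original test, a sketch (reusing the dl function and the file names from the question) might look like:

from moto import mock_s3
import boto3

@mock_s3
def _create_and_mock_bucket():
    bucket = "fake_bucket"
    file_path = "some_real_file.txt"

    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket=bucket)
    # Upload the real local file instead of putting an empty Body
    boto3.resource('s3').meta.client.upload_file(file_path, bucket, file_path)

    dl(file_path, 'some_other_real_file.txt')
    # The download should now be a copy of the original file
    assert open('some_other_real_file.txt').read() == open(file_path).read()

_create_and_mock_bucket()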
I am using AWS SageMaker and trying to upload a data folder into S3 from SageMaker. What I am trying to do is upload my data into the s3_train_data directory (the directory exists in S3). However, it doesn't upload into that bucket, but into the default bucket that has been created, where it in turn creates a new folder path built from the s3_train_data variable.
Code to upload into the directory:
import os
import sagemaker
from sagemaker import get_execution_role
sagemaker_session = sagemaker.Session()
role = get_execution_role()
bucket = <bucket name>
prefix = <folders1/folders2>
key = <input>
s3_train_data = 's3://{}/{}/{}/'.format(bucket, prefix, key)
#path 'data' is the folder in the Jupyter Instance, contains all the training data
inputs = sagemaker_session.upload_data(path= 'data', key_prefix= s3_train_data)
Is the problem in the code or more in how I created the notebook?
You could look at the sample notebooks for how to upload data to an S3 bucket.
There are many ways; I am just giving you hints for an answer.
Also, you forgot to create a boto3 session to access the S3 bucket.
This is one of the ways to do it:
import os
import urllib.request
import boto3

def download(url):
    filename = url.split("/")[-1]
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)

def upload_to_s3(channel, file):
    s3 = boto3.resource('s3')
    data = open(file, "rb")
    key = channel + '/' + file
    s3.Bucket(bucket).put_object(Key=key, Body=data)  # 'bucket' is defined elsewhere in the notebook

# caltech-256
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-train.rec')
upload_to_s3('train', 'caltech-256-60-train.rec')
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-val.rec')
upload_to_s3('validation', 'caltech-256-60-val.rec')
link : https://buildcustom.notebook.us-east-2.sagemaker.aws/notebooks/sample-notebooks/introduction_to_amazon_algorithms/imageclassification_caltech/Image-classification-fulltraining.ipynb
Another way to do it.
import io
import os
import boto3
import sagemaker.amazon.common as smac

bucket = '<your_s3_bucket_name_here>'  # enter your S3 bucket where you will copy data and model artifacts
prefix = 'sagemaker/breast_cancer_prediction'  # place to upload training files within the bucket

# do some processing then prepare to push the data (train_X, train_y and train_file come from earlier steps in the notebook)
f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, train_X.astype('float32'), train_y.astype('float32'))
f.seek(0)
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', train_file)).upload_fileobj(f)
Link : https://buildcustom.notebook.us-east-2.sagemaker.aws/notebooks/sample-notebooks/introduction_to_applying_machine_learning/breast_cancer_prediction/Breast%20Cancer%20Prediction.ipynb
YouTube link: https://www.youtube.com/watch?v=-YiHPIGyFGo - how to pull the data into an S3 bucket.
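Regarding why the data landed in the default bucket: upload_data takes the bucket name and key prefix as separate arguments rather than a full s3:// URI. A sketch (assuming the bucket, prefix, and key variables from the question) would be:

import sagemaker

sagemaker_session = sagemaker.Session()

# Pass the bucket and key prefix separately instead of a full s3:// URI;
# upload_data returns the S3 URI it uploaded to.
inputs = sagemaker_session.upload_data(
    path='data',
    bucket=bucket,
    key_prefix='{}/{}'.format(prefix, key),
)
print(inputs)  # e.g. s3://<bucket name>/<folders1/folders2>/<input>/...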