Upload new files to bucket from local drive - python

I would like to write a script that detects new files (CSV files in this case) added to a folder and then uploads them to my AWS S3 bucket, keeping their original names. Currently, the script I have only lets me manually select a file and then upload it under a name of my choice.
import boto3
import pandas as pd
from io import StringIO

hc = pd.read_csv(r'CSV PATH')
s3 = boto3.client('s3',
                  aws_access_key_id='ACCESSKEYID',
                  aws_secret_access_key='ACCESSKEY')
# Serialize the DataFrame to an in-memory buffer and upload it under a chosen key
csv_buf = StringIO()
hc.to_csv(csv_buf, header=True, index=False)
csv_buf.seek(0)
s3.put_object(Bucket='BucketName', Body=csv_buf.getvalue(), Key='Original CSV Name from Above')
I assume I need the following sections in the code:
Code to monitor said location (but only when running the app - does not need to run 24/7)
Code to pull new file from said location
Code to upload to S3 Bucket
Any Tips?
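A minimal sketch of one way to wire those pieces together, assuming the folder path, bucket name, and the uploaded_files.txt record below are placeholders of my own choosing: on each run, scan the folder for CSV files, skip anything recorded from previous runs, and send the rest with upload_file so the original filename becomes the S3 key.

import os
import boto3

WATCH_DIR = r'CSV FOLDER PATH'        # folder to scan (placeholder)
BUCKET = 'BucketName'                 # placeholder bucket name
UPLOADED_LOG = 'uploaded_files.txt'   # local record of files already sent (assumed helper file)

s3 = boto3.client('s3')               # or pass the access keys explicitly as above

# Load the names of files uploaded on previous runs
already_uploaded = set()
if os.path.exists(UPLOADED_LOG):
    with open(UPLOADED_LOG) as f:
        already_uploaded = set(f.read().splitlines())

for name in os.listdir(WATCH_DIR):
    if name.lower().endswith('.csv') and name not in already_uploaded:
        # upload_file keeps the original filename as the object key
        s3.upload_file(os.path.join(WATCH_DIR, name), BUCKET, name)
        already_uploaded.add(name)

# Persist the updated record for the next run
with open(UPLOADED_LOG, 'w') as f:
    f.write('\n'.join(sorted(already_uploaded)))

Since the script only needs to run on demand, a plain scan like this avoids a 24/7 watcher; if continuous monitoring ever becomes a requirement, a filesystem-events library such as watchdog would be the usual next step.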

Related

Option to upload and replace a file in google colab?

I'm trying to upload and replace a file using google colab.
Currently what I do is
from google.colab import files
upload = files.upload()
then if I need to modify the file, I do it locally on my computer. If I upload it again using the same cell, the new version of the file will be uploaded as "filename(1)". I would like the new version to replace the old one.
What I do then is
!rm "filename"
And then I run the first cell again. But it is not great.
Is there an option like the following?
upload = files.upload(replace=True)
My approach is as follows.
import os
import re
from google.colab import files

lsdi = os.listdir('/content')   # snapshot of the directory before uploading
uploaded = files.upload()
for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
    if fn in lsdi:
        os.remove(fn)           # remove the stale copy that was already on disk
    lsdi = os.listdir('/content')  # list is in arbitrary order
    for k in sorted(lsdi, reverse=True):  # sorted to get the most recent file name
        fil_dados = re.match(fn[:fn.rfind('.')], k)
        if fil_dados:
            fn = k
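A shorter variant, offered as a sketch under the assumption that you only need the contents under the original name: files.upload() returns a dict mapping each original filename to its bytes, so those bytes can simply be written back over the existing file, whatever name Colab gave the copy it saved to disk.

from google.colab import files

uploaded = files.upload()
for name, data in uploaded.items():
    # Overwrite (or create) the file under its original name with the new contents
    with open('/content/' + name, 'wb') as f:
        f.write(data)

Colab may still leave a suffixed duplicate on disk from the upload itself, which can be removed afterwards if it gets in the way.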

Process different file types separately in AWS S3 (python)

I'm looking to process files in an S3 location and re-upload them to a different location depending on the file type, which is specified in the key name. I want to use pagination so I can process in batches according to accumulated file size, but I'm unsure how to do this separately for each file type.
I establish a connection to the boto3 client and relevant S3 locations earlier in the code and check that the locations and objects within them exist.
My code is successfully giving me the number of each filetype in the location, but is not entering any of the for loops.
Should I be putting the paginator code within the if statements for each filetype instead?
if locationExists and filesFound:
    paginator = s3c.get_paginator("list_objects_v2")
    operation_parameters = {'Bucket': bucket,
                            'Prefix': path}
    page_iterator = paginator.paginate(**operation_parameters)
    lastresponse_objects = page_iterator.search("Contents[?contains(Key, `lastresponse`)][]")
    snapshot_objects = page_iterator.search("Contents[?contains(Key, `_snapshot_`)][]")
    snapshotupdate_objects = page_iterator.search("Contents[?contains(Key, `snapshotpriceupdate`)][]")
    if lastresponse_objects:
        total_LR_files = len(list(lastresponse_objects))
        print(f'Lastresponse files: {total_LR_files}')
        for file in lastresponse_objects:
            print(file.key)
            # processing logic for files containing 'lastresponse'
    if snapshot_objects:
        total_snapshot_files = len(list(snapshot_objects))
        print(f'Snapshot files: {total_snapshot_files}')
        for file in snapshot_objects:
            print(file.key)
            # processing logic for files containing '_snapshot_'
    if snapshotupdate_objects:
        total_snapshotupdate_files = len(list(snapshotupdate_objects))
        print(f'Snapshot update files: {total_snapshotupdate_files}')
        for file in snapshotupdate_objects:
            print(file.key)
            # processing logic for files containing 'snapshotpriceupdate'
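One likely explanation, offered as an assumption about how boto3 behaves here: PageIterator.search() returns a generator, so len(list(...)) consumes every result and leaves nothing for the for loop that follows, and the items it yields for a client paginator are plain dicts, read as file['Key'] rather than file.key. A sketch that materializes each filtered set into a list once and reuses it (shown for the lastresponse case; the other two follow the same pattern):

if locationExists and filesFound:
    paginator = s3c.get_paginator("list_objects_v2")
    page_iterator = paginator.paginate(Bucket=bucket, Prefix=path)

    # Materialize the JMESPath search results once; search() only yields a generator
    lastresponse_objects = list(page_iterator.search("Contents[?contains(Key, `lastresponse`)][]"))

    print(f'Lastresponse files: {len(lastresponse_objects)}')
    for obj in lastresponse_objects:
        print(obj['Key'])   # paginator results are dicts, not resource objects
        # processing logic for files containing 'lastresponse'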

Copy files between two GCS buckets, partitioned by date

I have a requirement to copy files between two buckets, detailed below:
Bucket A / Folder A is the source inbound box for daily files, which are created as f1_abc_20210304_000. I want to scan Folder A (about 10 files arrive every day), pick the latest file, and copy it into Bucket B / Folder B / FILE name / 2021/03/04, dropping the file in the 04 folder.
Any suggestion on how I should proceed with the design?
Thanks
RG
Did you want to do this copy task using Airflow?
If so, Airflow provides the GCSToGCSOperator.
One approach is to use the client libraries; in the example below I'm using the Python client library for Google Cloud Storage.
move.py
from google.cloud import storage
from google.oauth2 import service_account
import os

# as mentioned on https://cloud.google.com/docs/authentication/production
key_path = "credentials.json"
credentials = service_account.Credentials.from_service_account_file(key_path)
storage_client = storage.Client(credentials=credentials)

bucket_name = "source-bucket-id"
destination_bucket_name = "destination-bucket-id"
source_bucket = storage_client.bucket(bucket_name)

# prefix 'original_data' is the folder where I store the data
array_blobs = source_bucket.list_blobs(prefix='original_data')

filtered_dict = []
for blob in array_blobs:
    if str(blob.name).endswith('.csv'):
        # add additional logic to handle the files you want to ingest
        filtered_dict.append({'name': blob.name, 'time': blob.time_created})

orderedlist = sorted(filtered_dict, key=lambda d: d['time'], reverse=True)
latestblob = orderedlist[0]['name']

# prefix 'destination_data' is the folder where I want to move the data
destination_blob_name = "destination_data/{}".format(os.path.basename(latestblob))

source_blob = source_bucket.blob(latestblob)
destination_bucket = storage_client.bucket(destination_bucket_name)
blob_copy = source_bucket.copy_blob(source_blob, destination_bucket, destination_blob_name)

print(
    "Blob {} in bucket {} copied to blob {} in bucket {}.".format(
        source_blob.name,
        source_bucket.name,
        blob_copy.name,
        destination_bucket.name,
    )
)
For a bit of context on the code: it uses the Google Cloud Storage Python client to authenticate, list the files under the source folder original_data inside bucket source-bucket-id, and collect the relevant ones (you can modify the pick-up logic with whatever criteria fits your situation). It then picks the latest file based on creation time and uses that name to copy it into destination-bucket-id. As a note, the destination_blob_name variable includes both the folder where I want to place the file and the final filename.
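To get the date-partitioned layout the question describes (Folder B / file name / 2021/03/04), one hypothetical tweak to the destination name, built from the blob's creation time; the FolderB prefix and the repeated filename at the end are assumptions about the desired layout:

# Build a destination like: FolderB/f1_abc_20210304_000/2021/03/04/f1_abc_20210304_000
created = orderedlist[0]['time']          # creation time of the latest blob
filename = os.path.basename(latestblob)
destination_blob_name = "FolderB/{}/{}/{}".format(
    filename,
    created.strftime("%Y/%m/%d"),
    filename,
)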
UPDATE: I missed the airflow tag. In that case you should use the operator that comes with the Google provider, GCSToGCSOperator. The parameters to pass can be computed in a separate task and handed to the operator. It will work like this:
#task(task_id="get_gcs_params")
def get_gcs_params(**kwargs):
date = kwargs["next_ds"]
# logic should be as displayed on move.py
# ...
return {"source_objects":source,"destination_object":destination}
gcs_params = get_gcs_params()
copy_file = GCSToGCSOperator(
task_id='copy_single_file',
source_bucket='data',
source_objects= gcs_params.output['source_objects'],
destination_bucket='data_backup',
destination_object= gcs_params.output['destination_object'],
gcp_conn_id=google_cloud_conn_id
)
For additional guidance you can check the Cloud Storage examples list; I used the "Copy an object between buckets" sample as a reference.

Django - AWS S3 - Moving Files

I am using AWS S3 as my default file storage system. I have a model with a file field like so:
class Segmentation(models.Model):
    file = models.FileField(...)
I am running image processing jobs on a second server that dumps processed images into a different AWS S3 bucket.
I want to save the processed image in my Segmentation table.
Currently I am using boto3 to manually download the file to my "local" server (where my Django app lives) and then upload it to the local S3 bucket, like so:
import os

import boto3
from django.core.files import File

def save_file(segmentation, foreign_s3_key):
    # set foreign bucket
    foreign_bucket = 'foreign-bucket'
    # create a temp file:
    temp_local_file = 'tmp/temp.file'
    # use boto3 to download the foreign file locally:
    s3_client = boto3.client('s3')
    s3_client.download_file(foreign_bucket, foreign_s3_key, temp_local_file)
    # save file to segmentation:
    segmentation.file = File(open(temp_local_file, 'rb'))
    segmentation.save()
    # delete temp file:
    os.remove(temp_local_file)
This works fine but it is resource intensive. I have some jobs that need to process hundreds of images.
Is there a way to copy a file from the foreign bucket to my local bucket and set the segmentation.file field to the copied file?
I am assuming you want to move some files from one source bucket to some destination bucket, as the OP header suggests, and do some processing in between.
import boto3

my_west_session = boto3.Session(region_name='us-west-2')
my_east_session = boto3.Session(region_name='us-east-1')

backup_s3 = my_west_session.resource("s3")
video_s3 = my_east_session.resource("s3")

local_bucket = backup_s3.Bucket('localbucket')
foreign_bucket = video_s3.Bucket('foreignbucket')

for obj in foreign_bucket.objects.all():
    # do some processing
    # on objects
    copy_source = {
        'Bucket': foreign_bucket.name,   # copy source takes the bucket name, not the Bucket resource
        'Key': obj.key
    }
    local_bucket.copy(copy_source, obj.key)
See the boto3 documentation on Session configurations, and on the S3 resource copy or CopyObject, depending on your requirement.
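To skip the download/upload round trip entirely, here is a sketch under the assumption that your default Django storage points at the destination S3 bucket (for example via django-storages): copy the object server-side with boto3, then assign the new key to the FileField's name, since the field ultimately stores only the key. The bucket names and the save_file_by_copy helper below are illustrative.

import boto3

def save_file_by_copy(segmentation, foreign_s3_key):
    s3_client = boto3.client('s3')
    foreign_bucket = 'foreign-bucket'   # bucket written to by the processing server
    local_bucket = 'local-bucket'       # bucket backing the Django storage (assumption)
    destination_key = foreign_s3_key    # reuse the same key; adjust if your storage expects a prefix

    # Server-side copy: the object data never passes through the Django server
    s3_client.copy(
        {'Bucket': foreign_bucket, 'Key': foreign_s3_key},
        local_bucket,
        destination_key,
    )

    # FileField persists only the name/key, so point it at the copied object
    segmentation.file.name = destination_key
    segmentation.save()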

AWS uploading file into wrong bucket

I am using AWS SageMaker and trying to upload a data folder into S3 from SageMaker. What I am trying to do is upload my data into the s3_train_data directory (the directory exists in S3). However, it doesn't upload to that bucket; it uploads to a default bucket that has been created, and in turn creates a new folder path from the s3_train_data variable.
Code to upload the directory:
import os
import sagemaker
from sagemaker import get_execution_role
sagemaker_session = sagemaker.Session()
role = get_execution_role()
bucket = <bucket name>
prefix = <folders1/folders2>
key = <input>
s3_train_data = 's3://{}/{}/{}/'.format(bucket, prefix, key)
#path 'data' is the folder in the Jupyter Instance, contains all the training data
inputs = sagemaker_session.upload_data(path= 'data', key_prefix= s3_train_data)
Is the problem in the code or more in how I created the notebook?
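One likely cause, offered as an assumption: Session.upload_data treats key_prefix as a plain key prefix and uploads to its default bucket unless a bucket is passed explicitly, so handing it a full s3:// URI just embeds that URI in the key inside the default bucket. A sketch of the call with the bucket and prefix separated:

inputs = sagemaker_session.upload_data(
    path='data',                              # local folder in the notebook instance
    bucket=bucket,                            # target bucket, instead of the session default
    key_prefix='{}/{}'.format(prefix, key),   # plain prefix, no 's3://' scheme
)
print(inputs)   # s3://<bucket name>/<folders1/folders2>/<input>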
You could look at the sample notebooks for how to upload data to an S3 bucket; there are many ways, and I am just giving you hints toward the answer.
You also forgot to create a boto3 session to access the S3 bucket.
This is one of the ways to do it.
import os
import urllib.request
import boto3

def download(url):
    filename = url.split("/")[-1]
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)

def upload_to_s3(channel, file):
    s3 = boto3.resource('s3')
    data = open(file, "rb")
    key = channel + '/' + file
    s3.Bucket(bucket).put_object(Key=key, Body=data)   # 'bucket' is expected to be defined earlier

# caltech-256
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-train.rec')
upload_to_s3('train', 'caltech-256-60-train.rec')
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-val.rec')
upload_to_s3('validation', 'caltech-256-60-val.rec')
link : https://buildcustom.notebook.us-east-2.sagemaker.aws/notebooks/sample-notebooks/introduction_to_amazon_algorithms/imageclassification_caltech/Image-classification-fulltraining.ipynb
Another way to do it.
import io
import os
import boto3
import sagemaker.amazon.common as smac

bucket = '<your_s3_bucket_name_here>'           # enter your s3 bucket where you will copy data and model artifacts
prefix = 'sagemaker/breast_cancer_prediction'   # place to upload training files within the bucket

# do some processing then prepare to push the data
# (train_X, train_y and train_file come from the earlier preprocessing steps)
f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, train_X.astype('float32'), train_y.astype('float32'))
f.seek(0)
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', train_file)).upload_fileobj(f)
Link : https://buildcustom.notebook.us-east-2.sagemaker.aws/notebooks/sample-notebooks/introduction_to_applying_machine_learning/breast_cancer_prediction/Breast%20Cancer%20Prediction.ipynb
Youtube link : https://www.youtube.com/watch?v=-YiHPIGyFGo - how to pull the data in S3 bucket.
