I have an AWS S3 structure that looks like this:
bucket_1
|
|__folder_1
| |__file_1
| |__file_2
|
|__folder_2
|__file_1
|__file_2
bucket_2
And I am trying to find a "good way" (efficient and cost effective) to achieve the following:
bucket_1
|
|__folder_1
| |__file_1
| |__file_2
|
|__folder_2
|__file_1
|__file_2
bucket_2
|
|__folder_1_file_1
|__folder_2_file_1
|__processed_file_2
Where:
folder_1_file_1 and folder_2_file_1 are the original two file_1 objects that have been copied/renamed (prepending the folder path to the file name) into the new bucket.
processed_file_2 is a file that depends on the content of the two file_2 objects (e.g., if file_2 were text files, processed_file_2 might be a joint text file where the two original files are appended to each other; note that this is just an example).
I do have a Python script that does this for me locally (copy/rename files, process the other files and move them to a new folder), but I'm not sure which tools I should use to do this on AWS, without having to download the data, process it and re-upload it.
I have done some reading, and I've seen that AWS Lambda might be one way of doing this, but I'm not sure it's the ideal solution. I'm not even sure whether I should keep this as a Python script or look at other approaches (I'm open to other programming languages/tools, as long as they are a good fit for the problem).
As a plus, it would be useful to have this process triggered either every N days or when a certain threshold of files has been reached, but a semi-automated solution (where I manually run the script/tool) would also be acceptable.
[Move and Rename objects within s3 bucket using boto3]
import boto3
s3_resource = boto3.resource('s3')

# Copy object A as object B (CopySource is "source-bucket-name/source-key")
s3_resource.Object("bucket_name", "newpath/to/object_B.txt").copy_from(
    CopySource="bucket_name/path/to/your/object_A.txt")

# Delete the former object A
s3_resource.Object("bucket_name", "path/to/your/object_A.txt").delete()
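Applied to the layout in the question, a minimal sketch (the bucket names, the folder/file layout and the file_1 filter are assumptions based on the example above) that server-side copies each folder's file_1 into the second bucket with the folder name prepended could look like this:
import boto3

s3 = boto3.resource('s3')
src_bucket = 'bucket_1'   # hypothetical source bucket
dst_bucket = 'bucket_2'   # hypothetical destination bucket

for obj in s3.Bucket(src_bucket).objects.all():
    folder, _, filename = obj.key.partition('/')
    if filename == 'file_1':
        new_key = f"{folder}_{filename}"   # e.g. folder_1_file_1
        s3.Object(dst_bucket, new_key).copy_from(
            CopySource=f"{src_bucket}/{obj.key}")
Since copy_from issues a server-side copy, the objects are never downloaded and re-uploaded by the machine running the script.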
You could also move the files within S3 using the s3fs module.
import s3fs

path1 = 's3://bucket_name/folder1/sample_file.pkl'
path2 = 's3://bucket_name2/folder2/sample_file.pkl'

s3 = s3fs.S3FileSystem()
s3.move(path1, path2)
If you need to pass credentials explicitly, you can supply them via the client_kwargs argument of S3FileSystem, as shown below:
import s3fs

path1 = 's3://bucket_name/folder1/sample_file.pkl'
path2 = 's3://bucket_name/folder2/sample_file.pkl'

credentials = {}
credentials.setdefault("region_name", r_name)              # the region
credentials.setdefault("aws_access_key_id", a_key)         # the access_key_id
credentials.setdefault("aws_secret_access_key", s_a_key)   # the secret_access_key

s3 = s3fs.S3FileSystem(client_kwargs=credentials)
s3.move(path1, path2)
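For the processed_file_2 part of the original question (a new object built from the two file_2 objects), a minimal sketch in the same spirit, with all paths hypothetical and assuming simple concatenation is the required processing, could be:
import s3fs

s3 = s3fs.S3FileSystem()

sources = ['bucket_1/folder_1/file_2', 'bucket_1/folder_2/file_2']  # hypothetical paths
target = 'bucket_2/processed_file_2'

# Stream each source object into the target object, back to back
with s3.open(target, 'wb') as out:
    for src in sources:
        with s3.open(src, 'rb') as f:
            out.write(f.read())
The data still passes through the memory of whatever runs the script (e.g. a Lambda function), but nothing has to be written to local disk or re-uploaded manually.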
Currently I am trying to compare two S3 buckets with the goal of deleting files.
Problem definition:
-BucketA
-BucketB
The script is looking for files (same key name) in BucketB which are not available in BucketA.
Files which are only available in BucketB have to be deleted.
The buckets contain about 3-4 million files each.
Many thanks.
Kind regards,
Alexander
My solution idea:
The list filling is quite slow. Is there any way to speed it up?
# Filling the lists
# e.g. ListA for BucketA (same procedure for ListB / BucketB)
import boto3

s3_client = boto3.client("s3")
ListA = []
paginator = s3_client.get_paginator("list_objects_v2")
response = paginator.paginate(Bucket="BucketA", PaginationConfig={"PageSize": 2})
for page in response:
    files = page.get("Contents")
    for file in files:
        ListA.append(file['Key'])

# Finding the deletion values
diff = list(set(ListB) - set(ListA))

# Delete files from BucketB (building chunks, since delete_objects accepts max. 1000 objects at once)
for delete_list in group_elements(diff, 1000):
    delete_objects_from_bucket(delete_list)
The ListBucket() API call only returns 1000 objects at a time, so listing buckets with 100,000+ objects is very slow and best avoided. You have 3-4 million objects, so definitely avoid listing them!
Instead, use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects in a bucket. Activate it for both buckets, and then use the provided CSV files as input to make the comparison.
You can also use the CSV file generated by Amazon S3 Inventory as a manifest file for Amazon S3 Batch Operations. So, your code could generate a file that lists only the objects that you would like deleted, then get S3 Batch Operations to process those deletions.
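As a rough illustration of the comparison step (the local file names and the CSV column layout are assumptions; adjust them to your actual inventory configuration, and note that real inventory reports are usually split across several gzipped CSV files):
import csv

# Load the object keys from a bucket's S3 Inventory CSV
# (assumed layout: bucket name in column 0, key in column 1, no header row;
#  keys in inventory reports may be URL-encoded)
def load_keys(inventory_csv):
    with open(inventory_csv, newline='') as f:
        return {row[1] for row in csv.reader(f)}

keys_a = load_keys('bucketA_inventory.csv')   # hypothetical merged local copies
keys_b = load_keys('bucketB_inventory.csv')   # of the two inventory reports

# Keys present only in BucketB are the deletion candidates
to_delete = sorted(keys_b - keys_a)

# Write a bucket,key manifest that S3 Batch Operations can consume
with open('delete_manifest.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for key in to_delete:
        writer.writerow(['BucketB', key])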
I have an S3 bucket where my application saves some final result DataFrames as .csv files. I would like to download the latest 1000 files in this bucket, but I don't know how to do it.
I cannot do it manually, as the console doesn't let me sort the files by date because the bucket has more than 1000 elements.
I've seen some questions that could work using the AWS CLI, but I don't have enough user permissions to use the AWS CLI, so I have to do it with a boto3 Python script that I'm going to upload into a Lambda.
How can I do this?
If your application uploads files periodically, you could try this:
import boto3
import datetime

last_n_days = 250

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='bucket', Prefix='processed')
# S3 returns timezone-aware timestamps, so compare against an aware datetime
date_limit = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=last_n_days)
for page in pages:
    for obj in page['Contents']:
        if obj['LastModified'] >= date_limit and obj['Key'][-1] != '/':
            s3.download_file('bucket', obj['Key'], obj['Key'].split('/')[-1])
With the script above, all files modified in the last last_n_days days (250 here) will be downloaded. If your application uploads around 4 files per day, this should roughly cover the latest 1000 files.
The best solution is to redefine your problem: rather than retrieving the N most recent files, retrieve all files from the N most recent days. I think that you'll find this to be a better solution in most cases.
However, to make it work you'll need to adopt some form of date-stamped prefix for the uploaded files. For example, 2021-04-16/myfile.csv.
If you feel that you must retrieve N files, then you can use the prefix to retrieve only a portion of the list. Assuming that you know that you have approximately 100 files uploaded per day, then start your bucket listing with 2021-04-05/.
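A minimal sketch of that approach (the bucket name, the prefix format and the number of days are assumptions) might be:
import datetime
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# Assuming keys look like YYYY-MM-DD/myfile.csv, list only the last N days
n_days = 10
keys = []
for delta in range(n_days):
    day = datetime.date.today() - datetime.timedelta(days=delta)
    prefix = day.strftime('%Y-%m-%d/')
    for page in paginator.paginate(Bucket='my-bucket', Prefix=prefix):
        keys.extend(obj['Key'] for obj in page.get('Contents', []))

print(len(keys), 'objects found in the last', n_days, 'days')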
Basically, I want to iterate through the bucket and use the folder structure to classify each file by its run date (year).
So, I have an S3 bucket whose paths essentially look like:
file/archive/run=2017-10-07-06-13-21/folder_paths/version=1-0-0/part-00000-b.txt
file/archive/run=2018-11-07-06-13-21/folder_paths/version=1-0-0/part-00000-c.txt
The paths in the archive folder contain the run dates.
Ultimately, I want to be able to iterate over the files and write the part-000....txt files to a CSV file by date (year). So I want all the .txt files whose runs are in 2018 in one CSV file, all the 2017 .txt files in another, and the same for 2019.
I am new to boto3 and S3, so I am pretty confused about how to go about doing this.
Here is my code so far:
# Import boto3 module
import boto3
import logging
from botocore.exceptions import ClientError

# This is to list existing buckets for the AWS account
PREFIX = 'shredded/'

# Create a session to your AWS account
s3client = boto3.client(
    's3',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    region_name=REGION_NAME,
)

bucket = 'mybucket'
startAfter = '2020-00-00-00-00-00'

s3objects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter)
for object in s3objects['Contents']:
    print(object['Key'])
Any suggestions or ideas would help.
One way to approach this is something like this:
files_2017 = [obj for obj in s3objects['Contents'] if 'run=2017' in obj['Key']]
files_2018 = [obj for obj in s3objects['Contents'] if 'run=2018' in obj['Key']]
files_2019 = [obj for obj in s3objects['Contents'] if 'run=2019' in obj['Key']]
(Note that a Python variable name can't start with a digit, and the year has to be matched against each object's Key.)
This checks, for every item in s3objects['Contents'], whether its key matches the string condition run={year}.
Each of the variables, i.e. files_2017, files_2018 and files_2019, would then contain all the relevant paths.
From there, you could split the key on / and take the last piece, which would be part-00000-b.txt, for example.
To write to a .csv, check out Python's csv library (https://docs.python.org/3/library/csv.html) and how to use it; it's pretty solid.
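As a rough sketch of those last two steps (the output file names and the CSV columns are assumptions, since the question doesn't specify what each row should contain):
import csv

# Group the object lists built above by year and write one CSV per year
files_by_year = {'2017': files_2017, '2018': files_2018, '2019': files_2019}

for year, objects in files_by_year.items():
    with open(f'files_{year}.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['key', 'filename'])            # hypothetical header
        for obj in objects:
            key = obj['Key']
            writer.writerow([key, key.split('/')[-1]])  # e.g. part-00000-b.txt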
Post back with how you go!
I am using HTCondor to generate some data (txt, png). When my program runs, it creates a directory next to the .sub file, named datasets, where the datasets are stored. Unfortunately, Condor does not give me back this created data when it finishes. In other words, my goal is to get the created data into a "Datasets" subfolder next to the .sub file.
I tried:
1) Not putting the data under the datasets subfolder, and I obtained the files as expected. However, this is not a smooth solution, since I generate around 100 files, which end up mixed in with the .sub file and everything else.
2) I also tried to set this up in the sub file, leading to this:
notification = Always
should_transfer_files = YES
RunAsOwner = True
When_To_Transfer_Output = ON_EXIT_OR_EVICT
getenv = True
transfer_input_files = main.py
transfer_output_files = Datasets
universe = vanilla
log = log/test-$(Cluster).log
error = log/test-$(Cluster)-$(Process).err
output = log/test-$(Cluster)-$(Process).log
executable = Simulation.bat
queue
This time I get the error that Datasets was not found. Spelling has already been checked.
3) Another option would be to pack everything into a zip, but since I have to run hundreds of jobs, I do not want to unpack all these files afterwards.
I hope somebody comes up with a good idea on how to solve this.
Just for the record here: HTCondor does not transfer created directories or their contents at the end of the run. The best way to get the content back is to write a wrapper script that runs your executable and then compresses the created directory at the root of the working directory. This file will be transferred with all the other files. For example, create run.exe:
./Simulation.bat
tar zcf Datasets.tar.gz Datasets
and in your condor submission script put:
executable = run.exe
However, if you do not want to do this, and if HTCondor is using a common shared space like AFS, you can simply copy the whole directory out:
./Simulation.bat
cp -r Datasets <AFS location>
Another alternative is to define an initialdir, as described at the end of https://research.cs.wisc.edu/htcondor/manual/quickstart.html.
But one must create the directory structure by hand.
Also, look around page 65 of https://indico.cern.ch/event/611296/contributions/2604376/attachments/1471164/2276521/TannenbaumT_UserTutorial.pdf.
This document is, in general, a very useful one for beginners.
I am using the boto library to import data from S3 into Python, following these instructions: http://boto.cloudhackers.com/en/latest/s3_tut.html
The following code allows me to import all files in the main folder into Python, but replacing it with c.get_bucket('mainfolder/subfolder') does not work. Does anybody know how I can access a sub-folder and import its contents?
import boto
c = boto.connect_s3()
b = c.get_bucket('mainfolder')
The get_bucket method on the connection returns a Bucket object. To access individual files or directories within that bucket, you need to create a Key object with the file path, or use Bucket.list with a folder prefix to get the keys for all files under that path. Each Key object acts as a handle for a stored file. You then call methods on the keys to manipulate the stored files. For example:
import boto

connection = boto.connect_s3()
bucket = connection.get_bucket('myBucketName')

fileKey = bucket.get_key('myFileName.txt')
print fileKey.get_contents_as_string()

for key in bucket.list('myFolderName'):
    print key.get_contents_as_string()
The example here simply prints out the contents of each file (which is probably a bad idea!). Depending on what you want to do with the files, you may want to download them to a temporary directory, or read them to a variable etc. See http://boto.cloudhackers.com/en/latest/ref/s3.html#module-boto.s3.key for the documentation on what can be done with keys.
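For instance, a minimal sketch (the folder name is hypothetical) that downloads everything under a prefix into a temporary directory using Key.get_contents_to_filename:
import os
import tempfile

import boto

connection = boto.connect_s3()
bucket = connection.get_bucket('myBucketName')

tmp_dir = tempfile.mkdtemp()
for key in bucket.list('myFolderName'):
    # Skip "directory" placeholder keys that end with a slash
    if key.name.endswith('/'):
        continue
    local_path = os.path.join(tmp_dir, os.path.basename(key.name))
    key.get_contents_to_filename(local_path)
# The downloaded files now live in tmp_dir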