This is probably just a newbie question.
I have Python code using the boto3 SDK, and I need to delete all files from an S3 bucket except one file.
The issue is that the user keeps updating this S3 bucket, placing files into various folders. After I copy those files I need to delete them, but then the folders are deleted as well, since there is no real concept of folders on cloud providers. I need to keep the "folder" structure intact. I was thinking of placing a dummy file inside each "folder" and excluding that file from deletion.
Is this doable?
If you create a zero-byte object with the same name as the folder you want ending in a /, it will show up as an empty folder in the AWS Console, and other tools that enumerate objects delimited by prefix will see the prefix in their list of common prefixes:
s3 = boto3.client('s3')
s3.put_object(Bucket='example-bucket', Key='example/folder/', Body=b'')
Then, as you enumerate the list of objects to delete them, ignore any object that ends in a /, since this will just be the markers you're using for folders:
s3 = boto3.client('s3')
resp = s3.list_objects_v2(Bucket='example-bucket', Prefix='example/folder/')
for cur in resp['Contents']:
    if cur['Key'].endswith('/'):
        print("Ignoring folder marker: " + cur['Key'])
    else:
        print("Should delete object: " + cur['Key'])
        # TODO: Delete this object
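One way to fill in that TODO, as a sketch rather than a definitive implementation (same example-bucket and prefix as above), is to collect the non-marker keys and remove them in batches with delete_objects, which accepts up to 1,000 keys per call:
import boto3

s3 = boto3.client('s3')

# Collect every key under the prefix, skipping the zero-byte folder markers.
keys_to_delete = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='example-bucket', Prefix='example/folder/'):
    for obj in page.get('Contents', []):
        if not obj['Key'].endswith('/'):
            keys_to_delete.append(obj['Key'])

# delete_objects takes at most 1000 keys per request, so delete in batches.
for i in range(0, len(keys_to_delete), 1000):
    batch = keys_to_delete[i:i + 1000]
    s3.delete_objects(Bucket='example-bucket',
                      Delete={'Objects': [{'Key': k} for k in batch]})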
s3 = boto3.resource('s3')
for obj in s3.Bucket('example-bucket').objects.all():
    if obj.key != key_to_keep:
        obj.delete()
Could you follow this sort of logic in your script?
I started using the Google Cloud Storage Python API and ran into a weird bug.
Some folders aren't returned by the API calls; it's like they don't exist.
I tried the following codes:
• List the files/folders in the parent directory:
storage_client.list_blobs(bucket_or_name=bucket, prefix=path)
My folder is not listed in the iterator
• Check if it exists:
bucket.get_blob(path + "/my_folder").exists()
I get an AttributeError because NoneType doesn't have the attribute exists() (i.e., the blob couldn't be found)
• Try to list files inside of it:
storage_client.list_blobs(bucket_or_name=bucket, prefix=path + "/my_folder")
And I get a zero-length iterator
The folder's path is copied from the Google Cloud Console, and it definitely exists. So why can't I see it? Am I missing something?
Thanks to John Hanley, I realized my mistake. I was thinking about it wrong.
There are no folders in Google Cloud Storage; the "folders" the code returned are just empty files (and not every folder has an empty file to represent it).
So I wrote this code that returns a generator of the files (and "folders") in the storage:
from google.cloud.storage import Bucket

def _iterate_files(storage_client, bucket: Bucket, folder_path: str, iterate_subdirectories: bool = True):
    blobs = storage_client.list_blobs(bucket_or_name=bucket,
                                      prefix=folder_path.rstrip('/') + "/",
                                      delimiter='/')
    # First, yield all the files
    for blob in blobs:
        if not blob.name.endswith('/'):
            yield blob
    # Then, yield the subfolders
    for prefix in blobs.prefixes:
        yield bucket.blob(prefix)
        # And if required, yield back the files and folders in the subfolders.
        if iterate_subdirectories:
            yield from _iterate_files(storage_client, bucket, prefix, True)
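A quick usage sketch (hypothetical bucket name and folder, assuming the google-cloud-storage client):
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket('example-bucket')

# Walk everything under data/, including subfolders.
for blob in _iterate_files(storage_client, bucket, 'data'):
    print(blob.name)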
I have a temporary folder in a GCP bucket that I want to delete along with all its content. What I thought of is passing the path of this temp folder as a prefix, listing all blobs inside it and deleting every blob, but I had doubts about whether it would delete the blobs inside it without deleting the folder itself. Is that right? The goal is to end up with folder1/ empty, without the temp folder, but when I tried I found unexpected behavior: the folder that contains this temp folder was deleted too!
For example, if we have folder1/tempfolder/file1.csv and folder1/tempfolder/file2.csv, I found that folder1 is deleted after applying my changes. Here are the changes applied:
delete_folder_on_uri(storage_client, "gs://bucketname/folder1/tempfolder/")

def delete_folder_on_uri(storage_client, folder_gcs_uri):
    bucket_uri = BucketUri(folder_gcs_uri)
    delete_folder(storage_client, bucket_uri.name, bucket_uri.key)

def delete_folder(storage_client, bucket_name, folder):
    bucket = storage_client.get_bucket(bucket_name)
    blobs = bucket.list_blobs(prefix=folder)
    for blob in blobs:
        blob.delete()
PS: BucketUri is a class in which bucket_uri.name retrieves the bucket name and bucket_uri.key retrieves the path which is folder1/tempfolder/
There is no such thing as a folder in GCS at all. A "folder" is just a human-friendly representation of an object's (or file's) prefix. The full path looks like gs://bucket-name/prefix/another-prefix/object-name. As there are no folders (ontologically), there is nothing to delete.
Thus you might need to delete all objects (files) which start with some prefix in a given bucket.
I think you are doing everything correctly.
Here is an old example (similar to your code) - How to delete GCS folder from Python?
And here is an API description - Blobs / Objects - you might like to check the delete method.
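If the goal is for folder1/ to keep showing up as an empty folder in the console after the temp folder is cleared, one option (a sketch, reusing the bucket and prefix names from the question) is to recreate the zero-byte marker object after the deletes:
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('bucketname')

# Delete everything under the temp prefix.
for blob in bucket.list_blobs(prefix='folder1/tempfolder/'):
    blob.delete()

# Recreate a zero-byte placeholder so folder1/ still appears in the console.
bucket.blob('folder1/').upload_from_string(b'')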
Hi, I have an AWS S3 bucket in which a few folders and subfolders are defined.
I need to retrieve only the filename, whichever folder it is in. How do I go about it?
s3 bucket name - abc
path - s3://abc/ann/folder1/folder2/folder3/file1
path - s3://abc/ann/folder1/folder2/file2
Code tried so far:
s3 = boto3.client('s3')
lst_obj = s3.list_objects(Bucket='abc', Prefix='ann/')
lst_obj["Contents"]
I'm further looping to get all the contents
for file in lst_obj["Contents"]:
    # do something...
Here file["Key"] gives me the whole path, but I just need the filename.
Here is an example of how to get the filenames.
import boto3

s3 = boto3.resource('s3')
for obj in s3.Bucket(name='<your bucket>').objects.filter(Prefix='<prefix>'):
    filename = obj.key.split('/')[-1]
    print(filename)
You can just extract the name by splitting the file Key on the / symbol and taking the last element:
for file in lst_obj["Contents"]:
    name = file["Key"].split("/")[-1]
Using list objects, even with a prefix, simply filters objects that start with a specific prefix.
What you see as a path in S3 is actually part of the object's key; in fact, the key (which acts as a piece of metadata to identify the object) includes what might look like subfolders.
If you want the last part of the object key, you will need to split the key by the separator ('/').
You could do this with file['Key'].rsplit('/', 1)[-1], which would give you the filename.
I am trying to delete all the files inside a folder in an S3 bucket using the Python code below, but the folder is also getting deleted along with the files.
import boto3
s3 = boto3.resource('s3')
old_bucket_name='new-bucket'
old_prefix='data/'
old_bucket = s3.Bucket(old_bucket_name)
old_bucket.objects.filter(Prefix=old_prefix).delete()
S3 does not have folders. Object names can contain /, and the console will represent objects with common prefixes containing a / as folders, but the folder does not actually exist. If you're looking to have that visual representation, you can create a zero-length object whose key ends with a /, which is basically equivalent to what the console does if you create a folder via the UI.
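For example (a sketch reusing the bucket and prefix names from the question), you could recreate the "folder" marker after the delete:
import boto3

s3 = boto3.resource('s3')

# A zero-length object whose key ends with '/' acts as the "folder" marker.
s3.Bucket('new-bucket').put_object(Key='data/', Body=b'')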
Relevant docs can be found here
I have a bucket containing a number of folders, and each folder contains a number of images. Is it possible to list all the folders without iterating through all the keys (folders and images) in the bucket? I'm using Python and boto.
You can use list() with an empty prefix (first parameter) and a folder delimiter (second parameter) to achieve what you're asking for:
s3conn = boto.connect_s3(access_key, secret_key, security_token=token)
bucket = s3conn.get_bucket(bucket_name)
folders = bucket.list('', '/')
for folder in folders:
    print(folder.name)
Remark:
In S3 there is no such thing as "folders". All you have is buckets and objects.
The objects represent files. When you name a file name-of-folder/name-of-file, it will look as if a file name-of-file resides inside a folder name-of-folder, but in reality there's no such thing as the "folder".
You can also use the AWS CLI (Command Line Interface): the command aws s3 ls s3://<bucket-name>/ will list only the "folders" at the first level of the bucket.
Yes! You can list by using the prefix and delimiter of a key. Have a look at the following documentation.
http://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html
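For reference, here is a minimal sketch of the same idea with boto3 (hypothetical bucket name), where the Delimiter makes the listing return only the first-level "folders" as CommonPrefixes:
import boto3

s3 = boto3.client('s3')
resp = s3.list_objects_v2(Bucket='<bucket-name>', Delimiter='/')

# CommonPrefixes holds the first-level "folders" without listing every key.
for prefix in resp.get('CommonPrefixes', []):
    print(prefix['Prefix'])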