Issue in list() method of module boto - python

I am using the list method as:
all_keys = self.s3_bucket.list(self.s3_path)
The bucket "s3_path" contains files and folders. The return value of above line is confusing. It is returning:
Parent directory
A few directories not all
All the files in folder and subfolders.
I had assumed it would return files only.

There is actually no such thing as a folder in Amazon S3. It is just provided for convenience. Objects can be stored in a given path even if a folder with that path does not exist. The Key of the object is the full path plus the filename.
For example, this will copy a file even if the folder does not exist:
aws s3 cp file.txt s3://my-bucket/foo/bar/file.txt
This will not create the foo/bar folder. It simply creates an object with a Key of foo/bar/file.txt.
However, if folders are created in the S3 Management Console, a zero-length file is created with the name of the folder so that it appears in the console. When listing files, this will appear as the name of the directory, but it is actually the name of a zero-length file.
That is why some directories might appear but not others: it depends on whether they were specifically created, or whether objects were simply stored in that path.
Bottom line: Amazon S3 is an object storage system. It is really just a big Key/Value store -- the Key is the name of the Object, the Value is the contents of the object. Do not assume it works the same as a traditional file system.
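If you only want actual files, one option in boto 2 is to skip the zero-length keys whose names end in /, since those are the console-created folder markers. A minimal sketch (the bucket and prefix names are placeholders, not from your code):
import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('my-bucket')        # placeholder bucket name

for key in bucket.list('reports/'):          # yields Key objects; key.name is the full path
    if key.name.endswith('/'):               # zero-length console "folder" marker
        continue
    print(key.name, key.size)                # e.g. reports/2020/january.csv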

If you have a lot of items in the bucket, the results of a list_objects call will be paginated. By default, it will return up to 1000 items. See the Boto docs to learn how to use Marker to paginate through all items.
Oh, looks like you're on Boto 2. For you, it will be BucketListResultSet.
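For what it's worth, boto 2's BucketListResultSet pages for you as you iterate, issuing follow-up requests behind the scenes, so a plain loop should walk every key even past the first 1000. A hedged sketch with placeholder names:
import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('my-bucket')        # placeholder bucket name

count = 0
for key in bucket.list('some/prefix/'):      # result set fetches further pages as needed
    count += 1
print(count)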

Related

Attempting to delete files in s3 folder but the command is removing the entire directory itself

I have an S3 bucket which has 4 folders, one of which is input/.
After my Airflow DAG runs, there are a few lines at the end of the Python code which attempt to delete all files in input/.
response_keys = self._s3_hook.delete_objects(bucket=self.s3_bucket, keys=s3_input_keys)
deleted_keys = [x['Key'] for x in response_keys.get("Deleted", []) if x['Key'] not in ['input/']]
self.log.info("Deleted: %s", deleted_keys)
if "Errors" in response_keys:
errors_keys = [x['Key'] for x in response_keys.get("Errors", [])]
raise AirflowException("Errors when deleting: {}".format(errors_keys))
Now, this sometimes deletes all the files and sometimes deletes the directory itself. I am not sure why it deletes the directory even though I have specifically excluded it.
Is there any other way I can try to achieve the deletion?
PS: I tried using boto directly, but AWS security will not let both access the buckets, so the hook is all I have. Please help.
Directories do not exist in Amazon S3. Instead, the Key (filename) of an object includes the full path. For example, the Key might be invoices/january.xls, which includes the path.
When an object is created in a path, the directory magically appears. If all objects in a directory are deleted, then the directory magically disappears (because it never actually existed).
However, if you click the Create Folder button in the Amazon S3 management console, a zero-byte object is created with the name of the directory. This forces the directory to 'appear' since there is an object in that path. However, the directory does not actually exist!
So, your Airflow job might be deleting all the objects in a given path, which causes the directory to disappear. This is quite okay and nothing to be worried about. However, if the Create Folder button was used to create the folder, then the folder will still exist when all objects are deleted (assuming that the delete operation does not also delete the zero-length object).
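If you want the input/ placeholder to survive the clean-up, one approach is to filter the marker key out before calling delete_objects rather than inspecting the response afterwards. A sketch based on the names in the question (and assuming your S3Hook version exposes list_keys):
# List everything under input/, drop the zero-byte folder marker, then delete the rest
s3_input_keys = self._s3_hook.list_keys(bucket_name=self.s3_bucket, prefix='input/')
keys_to_delete = [k for k in s3_input_keys if k != 'input/']    # keep the marker object
if keys_to_delete:
    self._s3_hook.delete_objects(bucket=self.s3_bucket, keys=keys_to_delete)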

Deleting a folder inside a GCP bucket

I have a temporary folder in a GCP bucket which I want to delete with all its content. What I thought of is that I can pass the path of this temp folder as a prefix, list all blobs inside it, and delete every blob. But I had doubts whether it would delete the blobs inside it without deleting the folder itself. The goal is to end up with folder1/ empty, without the temp folder, but when I tried it I found unexpected behavior: the folder that contains this temp folder was deleted!
For example, if we have folder1/tempfolder/file1.csv and folder1/tempfolder/file2.csv, I found that folder1 is deleted after applying my changes. Here are the changes applied:
delete_folder_on_uri(storage_client, "gs://bucketname/folder1/tempfolder/")

def delete_folder_on_uri(storage_client, folder_gcs_uri):
    bucket_uri = BucketUri(folder_gcs_uri)
    delete_folder(storage_client, bucket_uri.name, bucket_uri.key)

def delete_folder(storage_client, bucket_name, folder):
    bucket = storage_client.get_bucket(bucket_name)
    blobs = bucket.list_blobs(prefix=folder)
    for blob in blobs:
        blob.delete()
PS: BucketUri is a class in which bucket_uri.name retrieves the bucket name and bucket_uri.key retrieves the path which is folder1/tempfolder/
There is no such thing as a folder in GCS at all. A "folder" is just a human-friendly representation of an object's (or file's) prefix, so the full path looks like gs://bucket-name/prefix/another-prefix/object-name. As there are no folders (ontologically), there is nothing to delete.
Thus you might need to delete all objects (files) which start with a given prefix in a given bucket.
I think you are doing everything correctly.
Here is an old example (similar to your code) - How to delete GCS folder from Python?
And here is an API description - Blobs / Objects - you might like to check the delete method.
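If you do want folder1/ to stay visible in the console after its contents are gone, one option (mirroring what the console's Create Folder button does; the bucket and folder names are taken from the question) is to keep or recreate a zero-byte placeholder object. A sketch only:
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('bucketname')    # bucket name from the question
placeholder = bucket.blob('folder1/')                # zero-byte object acts as the "folder"
placeholder.upload_from_string(b'')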

S3 boto3 delete files except a specific file

Probably this is just a newbie question.
I have Python code using the boto3 SDK and I need to delete all files from an S3 bucket except one file.
The issue is that the user is updating this S3 bucket and places some files into some folders. After I copy those files I need to delete them, and the problem is that the folders are deleted as well, since there is no concept of folders on cloud providers. I need to keep the "folder" structure intact. I was thinking of placing a dummy file inside each "folder" and excluding that file from deletion.
Is this something doable?
If you create a zero-byte object with the same name as the folder you want ending in a /, it will show up as an empty folder in the AWS Console, and other tools that enumerate objects delimited by prefix will see the prefix in their list of common prefixes:
s3 = boto3.client('s3')
s3.put_object(Bucket='example-bucket', Key='example/folder/', Body=b'')
Then, as you enumerate the list of objects to delete them, ignore any object that ends in a /, since this will just be the markers you're using for folders:
s3 = boto3.client('s3')
resp = s3.list_objects_v2(Bucket='example-bucket', Prefix='example/folder/')
for cur in resp['Contents']:
    if cur['Key'].endswith('/'):
        print("Ignoring folder marker: " + cur['Key'])
    else:
        print("Should delete object: " + cur['Key'])
        # TODO: Delete this object
for file in files_in_bucket:
    if file.name != file_name_to_keep:
        file.delete()
Could follow this sort of logic in your script?
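Putting the two ideas together, a rough boto3 sketch of the whole delete-everything-except flow (the bucket, prefix, and file to keep are all placeholders, and delete_objects accepts at most 1000 keys per call):
import boto3

s3 = boto3.client('s3')
bucket = 'example-bucket'                    # placeholder names
prefix = 'example/folder/'
key_to_keep = 'example/folder/keep-me.txt'

paginator = s3.get_paginator('list_objects_v2')
to_delete = []
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if key.endswith('/') or key == key_to_keep:   # skip folder markers and the kept file
            continue
        to_delete.append({'Key': key})

for i in range(0, len(to_delete), 1000):              # batch deletes in chunks of 1000
    s3.delete_objects(Bucket=bucket, Delete={'Objects': to_delete[i:i+1000]})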

Deleting s3 file also deletes the folder python code

I am trying to delete all the files inside a folder in an S3 bucket using the Python code below, but the folder is also getting deleted along with the files.
import boto3
s3 = boto3.resource('s3')
old_bucket_name='new-bucket'
old_prefix='data/'
old_bucket = s3.Bucket(old_bucket_name)
old_bucket.objects.filter(Prefix=old_prefix).delete()
S3 does not have folders. Object names can contain / and the console will represent objects with common prefixes that contain a / as folders, but the folder does not actually exist. If you're looking to have that visual representation, you can create a zero-length object that ends with a / which is basically equivalent to what the console does if you create a folder via the UI.
Relevant docs can be found here
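If the goal is to empty data/ while keeping the "folder" visible, a small variation on the question's code (same names) is to skip the zero-length marker object, or to put it back afterwards. This is a sketch, not the only way to do it:
import boto3

s3 = boto3.resource('s3')
old_bucket = s3.Bucket('new-bucket')
old_prefix = 'data/'

# Delete every object under the prefix except the marker object itself
for obj in old_bucket.objects.filter(Prefix=old_prefix):
    if obj.key != old_prefix:                # 'data/' is the zero-length marker, if present
        obj.delete()

# Alternatively, recreate the marker so the console keeps showing the folder
# s3.Object('new-bucket', 'data/').put(Body=b'')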

Can you list all folders in an S3 bucket?

I have a bucket containing a number of folders, and each folder contains a number of images. Is it possible to list all the folders without iterating through all the keys (folders and images) in the bucket? I'm using Python and boto.
You can use list() with an empty prefix (first parameter) and a folder delimiter (second parameter) to achieve what you're asking for:
s3conn = boto.connect_s3(access_key, secret_key, security_token=token)
bucket = s3conn.get_bucket(bucket_name)
folders = bucket.list('', '/')
for folder in folders:
    print(folder.name)
Remark:
In S3 there is no such thing as "folders". All you have is buckets and objects.
The objects represent files. When you name a file name-of-folder/name-of-file, it will look as if a file name-of-file resides inside a folder name-of-folder, but in reality there is no such thing as the "folder".
You can also use the AWS CLI (Command Line Interface): the command aws s3 ls s3://<bucket-name> will list the first-level "folders" (and any top-level objects) in the bucket.
Yes! You can list them by using the prefix and delimiter of a key. Have a look at the following documentation.
http://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html
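For anyone on boto3 rather than boto 2, the same idea uses the Delimiter parameter; the first-level "folders" come back under CommonPrefixes. A minimal sketch (the bucket name is a placeholder):
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

for page in paginator.paginate(Bucket='my-bucket', Delimiter='/'):
    for prefix in page.get('CommonPrefixes', []):    # each entry is a top-level "folder"
        print(prefix['Prefix'])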
