Python Google Cloud Storage doesn't list some folders

I started using the Google Cloud Storage Python API and ran into a weird bug. Some folders aren't returned by the API calls; it's like they don't exist.
I tried the following:
• List the files/folders in the parent directory:
storage_client.list_blobs(bucket_or_name=bucket, prefix=path)
My folder is not listed in the iterator
• Check if it exists:
bucket.get_blob(path + "/my_folder").exists()
Get an AttributeError because NoneType doesn't have the attribute exists() (i.e., the blob couldn't be found)
• Try to list files inside of it:
storage_client.list_blobs(bucket_or_name=bucket, prefix=path + "/my_folder")
And get a zero-length iterator
The folder's path is copied from the Google Cloud Console, and it definitely exists. So why can't I see it? Am I missing something?

Thanks to John Hanley, I realized my mistake. I was thinking about it wrong.
There are no folders in Google Cloud Storage, and the "folders" the code returned are just empty files (and not every folder has an empty file to represent it).
So I wrote this code, which returns a generator of the files (and "folders") in the bucket:
from google.cloud.storage.bucket import Bucket

def _iterate_files(storage_client, bucket: Bucket, folder_path: str, iterate_subdirectories: bool = True):
    """Yield every file blob (and "folder" placeholder blob) under folder_path."""
    blobs = storage_client.list_blobs(bucket_or_name=bucket,
                                      prefix=folder_path.rstrip('/') + "/",
                                      delimiter='/')
    # First, yield all the files.
    for blob in blobs:
        if not blob.name.endswith('/'):
            yield blob
    # Then, yield the subfolders. (list_blobs only fills in .prefixes once the
    # iterator above has been consumed.)
    for prefix in blobs.prefixes:
        yield bucket.blob(prefix)
        # And if required, yield the files and folders inside each subfolder.
        if iterate_subdirectories:
            yield from _iterate_files(storage_client, bucket, prefix, True)
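For completeness, here is a minimal usage sketch; the bucket and folder names are placeholders, not from the original question:
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")  # placeholder bucket name
for blob in _iterate_files(client, bucket, "path/to/my_folder"):
    print(blob.name)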

Related

Deleting a folder inside a GCP bucket

I have a temporary folder in a GCP bucket that I want to delete along with all of its contents. My idea was to pass the path of this temp folder as a prefix, list all the blobs inside it, and delete every blob. I had doubts about whether that would delete the blobs without deleting the folder itself. Is that right? The goal is to end up with folder1/ empty, without the temp folder. But when I tried it, I hit unexpected behavior: the folder that contains this temp folder was deleted too!
For example, with folder1/tempfolder/file1.csv and folder1/tempfolder/file2.csv, I found that folder1 was deleted after applying my changes. Here are the changes applied:
delete_folder_on_uri(storage_client, "gs://bucketname/folder1/tempfolder/")

def delete_folder_on_uri(storage_client, folder_gcs_uri):
    bucket_uri = BucketUri(folder_gcs_uri)
    delete_folder(storage_client, bucket_uri.name, bucket_uri.key)

def delete_folder(storage_client, bucket_name, folder):
    bucket = storage_client.get_bucket(bucket_name)
    blobs = bucket.list_blobs(prefix=folder)
    for blob in blobs:
        blob.delete()
PS: BucketUri is a class in which bucket_uri.name retrieves the bucket name and bucket_uri.key retrieves the path, which is folder1/tempfolder/.
There is no such thing as a folder in GCS at all. A "folder" is just a human-friendly representation of an object's (or file's) prefix, so the full path looks like gs://bucket-name/prefix/another-prefix/object-name. Since there are no folders (ontologically), there is nothing to delete.
Thus you might need to delete all objects (files) which start with some prefix in a given bucket.
I think you are doing everything correctly.
Here is an old example (similar to your code) - How to delete GCS folder from Python?
And here is an API description - Blobs / Objects - you might like to check the delete method.
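If it helps, here is a hedged sketch of that delete-by-prefix approach; the optional zero-byte placeholder at the end is only needed if you want the parent "folder" (e.g. folder1/) to keep showing up in the console. The function and parameter names are mine, not from the question or the library:
def delete_prefix(storage_client, bucket_name, prefix, keep_parent_marker=False):
    # Delete every object whose name starts with the prefix.
    bucket = storage_client.bucket(bucket_name)
    for blob in storage_client.list_blobs(bucket_or_name=bucket, prefix=prefix):
        blob.delete()
    if keep_parent_marker:
        # Optionally recreate a zero-byte placeholder so the parent "folder"
        # still appears in the console after the delete (illustrative choice).
        parent = prefix.rstrip('/').rsplit('/', 1)[0] + '/'
        bucket.blob(parent).upload_from_string(b'')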

S3 boto3 delete files except a specific file

Probably this is just a newbie question.
I have Python code using the boto3 SDK, and I need to delete all files from an S3 bucket except one file.
The issue is that the user keeps updating this S3 bucket, placing files into some "folders". After I copy those files I need to delete them, and the problem is that the folders are deleted as well, since there is no concept of folders on cloud providers. I need to keep the "folder" structure intact, so I was thinking of placing a dummy file inside each "folder" and excluding that file from deletion.
Is this something doable?
If you create a zero-byte object with the same name as the folder you want ending in a /, it will show up as an empty folder in the AWS Console, and other tools that enumerate objects delimited by prefix will see the prefix in their list of common prefixes:
s3 = boto3.client('s3')
s3.put_object(Bucket='example-bucket', Key='example/folder/', Body=b'')
Then, as you enumerate the list of objects to delete them, ignore any object that ends in a /, since this will just be the markers you're using for folders:
s3 = boto3.client('s3')
resp = s3.list_objects_v2(Bucket='example-bucket', Prefix='example/folder/')
for cur in resp['Contents']:
    if cur['Key'].endswith('/'):
        print("Ignoring folder marker: " + cur['Key'])
    else:
        print("Should delete object: " + cur['Key'])
        # TODO: Delete this object
for file in files_in_bucket:
    if file.name != file_name_to_keep:
        file.delete()
You could follow this sort of logic in your script.
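For instance, a hedged sketch combining both answers above (the bucket, prefix, and file names are placeholders): delete everything under a prefix except the folder markers and the one file you want to keep.
import boto3

s3 = boto3.client('s3')
file_to_keep = 'example/folder/keep-me.txt'  # placeholder for the protected file
resp = s3.list_objects_v2(Bucket='example-bucket', Prefix='example/folder/')
for obj in resp.get('Contents', []):
    key = obj['Key']
    if key.endswith('/') or key == file_to_keep:
        continue  # keep folder markers and the protected file
    s3.delete_object(Bucket='example-bucket', Key=key)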

How to create directories in Azure storage container without creating extra files?

I've created python code to create a range of folders and subfolders (for data lake) in an Azure storage container. The code works and is based on the documentation on Microsoft Azure. One thing though is that I'm creating a dummy 'txt' file in the folders in order to create the directory (which I can clean up later). I was wondering if there's a way to create the folders and subfolders without creating a file. I understand that the folders in Azure container storage are not hierarchical and are instead metadata and what I'm asking for may not be possible?
connection_string = config['azure_storage_connectionstring']
gen2_container_name = config['gen2_container_name']
container_client = ContainerClient.from_connection_string(connection_string, gen2_container_name)
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
# blob_service_client.create_container(gen2_container_name)

def create_folder(folder, sub_folder):
    blob_client = container_client.get_blob_client('{}/{}/start_here.txt'.format(folder, sub_folder))
    with open('test.txt', 'rb') as data:
        blob_client.upload_blob(data)

def create_all_folders():
    config = load_config()
    folder_list = config['folder_list']
    sub_folder_list = config['sub_folder_list']
    for folder in folder_list:
        for sub_folder in sub_folder_list:
            try:
                create_folder(folder, sub_folder)
            except Exception as e:
                print('Looks like something went wrong here trying to create this folder structure {}/{}. Maybe the structure already exists?'.format(folder, sub_folder))
No, for Blob Storage this is not possible: there is no way to create so-called "folders" on their own.
But you can use the Data Lake SDK to create a directory, like this:
from azure.storage.filedatalake import DataLakeServiceClient

# Account key shortened here; use your real connection string.
connect_str = "DefaultEndpointsProtocol=https;AccountName=0730bowmanwindow;AccountKey=xxxxxx;EndpointSuffix=core.windows.net"
datalake_service_client = DataLakeServiceClient.from_connection_string(connect_str)

myfilesystem = "test"
myfolder = "test1111111111"
myfile = "FileName.txt"

# Get the file system (container) client, then create the directory itself.
file_system_client = datalake_service_client.get_file_system_client(myfilesystem)
directory_client = file_system_client.create_directory(myfolder)
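The snippet defines myfile but never uses it; if you also want to put a file inside the new directory, a hedged continuation (reusing the same azure-storage-filedatalake client objects) could look like:
# Create a file inside the new directory and upload some bytes (illustrative only).
file_client = directory_client.create_file(myfile)
file_client.upload_data(b"hello", overwrite=True)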
Just to add some context, the reason this is not possible in Blob Storage is that folders/directories are not "real". Folders do not exist as standalone objects, they are only defined as part of a blob name.
For example, if you have a folder "mystuff" with a file (blob) "somefile.txt", the blob name actually includes the folder name and "/" character like mystuff/somefile.txt. The blob exists directly inside the container, not inside a folder. This naming convention can be nested many times over in a blob name like folder1/folder2/mystuff/anotherfolder/somefile.txt, but that blob still only exists directly in the container.
Folders can appear to exist in certain tooling (like Azure Storage Explorer) because the SDK permits blob name filtering: if you do so on the "/" character, you can mimic the appearance of a folder and its contents. But in order for a folder to even appear to exist, there must be blob in the container with the appropriate name. If you want to "force" a folder to exist, you can create a 0-byte blob with the correct folder path in the name, but the blob artifact will still need to exist.
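For example, a minimal sketch of that zero-byte marker trick with azure-storage-blob's ContainerClient (reusing the connection settings from the question; the folder name is a placeholder):
from azure.storage.blob import ContainerClient

container_client = ContainerClient.from_connection_string(connection_string, gen2_container_name)
# A zero-byte blob whose name ends with "/" makes most tooling show an "empty folder".
container_client.upload_blob(name="folder1/sub_folder1/", data=b"")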
The exception is Azure Data Lake Storage (ADLS) Gen 2, which is Blob Storage that implements a Hierarchical Namespace. This makes it more like a file system and so respects the concept of Directories as standalone objects. ADLS is built on Blob Storage, so there is a lot of parity between the two. If you absolutely must have empty directories, then ADLS is the way to go.

Deleting an S3 file also deletes the folder (Python)

I am trying to delete all the files inside a folder in an S3 bucket using the Python code below, but the folder is also getting deleted along with the files.
import boto3
s3 = boto3.resource('s3')
old_bucket_name='new-bucket'
old_prefix='data/'
old_bucket = s3.Bucket(old_bucket_name)
old_bucket.objects.filter(Prefix=old_prefix).delete()
S3 does not have folders. Object names can contain / and the console will represent objects with common prefixes that contain a / as folders, but the folder does not actually exist. If you're looking to have that visual representation, you can create a zero-length object that ends with a / which is basically equivalent to what the console does if you create a folder via the UI.
Relevant docs can be found here
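If you want the data/ "folder" to remain visible after deleting its contents, a hedged follow-up to the question's code (same bucket and prefix names) would be to recreate the zero-length marker:
import boto3

s3 = boto3.resource('s3')
# Recreate the zero-length marker object so data/ still shows up in the console.
s3.Object('new-bucket', 'data/').put(Body=b'')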

Issue in list() method of module boto

I am using the list method as:
all_keys = self.s3_bucket.list(self.s3_path)
The prefix s3_path contains both files and folders. The return value of the above line is confusing. It returns:
Parent directory
A few directories not all
All the files in folder and subfolders.
I had assumed it would return files only.
There is actually no such thing as a folder in Amazon S3. It is just provided for convenience. Objects can be stored in a given path even if a folder with that path does not exist. The Key of the object is the full path plus the filename.
For example, this will copy a file even if the folder does not exist:
aws s3 cp file.txt s3://my-bucket/foo/bar/file.txt
This will not create the /foo/bar folder. It simply creates an object with a Key of: /foo/bar/file.txt
However, if folders are created in the S3 Management Console, a zero-length file is created with the name of the folder so that it appears in the console. When listing files, this will appear as the name of the directory, but it is actually the name of a zero-length file.
That is why some directories might appear but not others: it depends on whether they were specifically created, or whether objects were simply stored in that path.
Bottom line: Amazon S3 is an object storage system. It is really just a big Key/Value store -- the Key is the name of the Object, the Value is the contents of the object. Do not assume it works the same as a traditional file system.
If you have a lot of items in the bucket, the results of a list_objects call will be paginated. By default, it will return up to 1000 items. See the Boto docs to learn how to use Marker to paginate through all items.
Oh, looks like you're on Boto 2. For you, it will be BucketListResultSet.
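As a hedged boto3 (not boto 2) sketch of both points above, here is how you might paginate through all keys under a prefix while skipping the zero-length folder markers; the bucket and prefix names are placeholders:
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket', Prefix='foo/bar/'):
    for obj in page.get('Contents', []):
        if not obj['Key'].endswith('/'):  # skip zero-length "folder" objects
            print(obj['Key'])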
