Google Cloud Storage: Python API get blob information with wildcard

I am trying to get blob information from a bucket, but I want to use wildcards in the blob name. Consider my bucket:
$ gsutil ls gs://myBucket/myPath/
gs://myBucket/myPath/
gs://myBucket/myPath/ranOn=2018-12-11/
gs://myBucket/myPath/ranOn=2018-12-12/
gs://myBucket/myPath/ranOn=2018-12-13/
gs://myBucket/myPath/ranOn=2018-12-14/
gs://myBucket/myPath/ranOn=2018-12-15/
gs://myBucket/myPath/ranOn=2019-02-18/
gs://myBucket/myPath/ranOn=2019-02-19/
gs://myBucket/myPath/ranOn=2019-02-20/
gs://myBucket/myPath/ranOn=2019-02-21/
Now, from the command line, I am able to do:
$ gsutil ls gs://myBucket/myPath/ranOn=2018*
gs://myBucket/myPath/
gs://myBucket/myPath/ranOn=2018-12-11/
gs://myBucket/myPath/ranOn=2018-12-12/
gs://myBucket/myPath/ranOn=2018-12-13/
gs://myBucket/myPath/ranOn=2018-12-14/
gs://myBucket/myPath/ranOn=2018-12-15/
and hence I can do the same for the size:
$ gsutil du -sh gs://myBucket/myPath/ranOn=2018*
2.7 G
Now, I want to do the same thing with the Python API. Here is what I tried:
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.get_bucket('myBucket')
blob = bucket.get_blob('myPath/ranOn=2018*')
print('Size: {} bytes'.format(blob.size))
Size: None bytes
Why is this not working? How can I use wildcards in blob paths with the Python API?

Unfortunately, get_blob only fetches a single object by its exact name; it does not expand wildcards.
You'll need to iterate over all the objects that match the prefix and sum their sizes to get the total size.
blobs = bucket.list_blobs(prefix="myPath/ranOn=2018")
total = sum([blob.size for blob in blobs])
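Putting it together, and converting the byte count into something comparable to the "2.7 G" that gsutil du -sh prints, here is a minimal sketch reusing the bucket name and prefix from the question (the GiB conversion is an approximation of gsutil's output):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('myBucket')
# list_blobs(prefix=...) does the server-side filtering a wildcard would do;
# summing .size over the results mirrors `gsutil du -s`.
blobs = bucket.list_blobs(prefix='myPath/ranOn=2018')
total = sum(blob.size for blob in blobs)
print('Size: {:.1f} GiB'.format(total / (1024 ** 3)))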

Related

Why does list_blobs return the virtual directory as well?

I thought it only returned the leaf blob files, and that had been the case earlier, but now all of a sudden it returns the virtual directory as well. Did I have the wrong impression, or did something change overnight?
Assume I have the following structure
container/dir0/dir1/blob1.json
container/dir0/dir1/blob2.json
And the following code
blobs = blob_service_client.list_blobs('container', 'dir0/')
for blob in blobs:
    print(blob.name)
will return
dir0/dir1
dir0/dir1//blob1.json
dir0/dir1//blob2.json
instead of
dir0/dir1//blob1.json
dir0/dir1//blob2.json
Is there any way to avoid having the virtual directory returned as one of the blobs in the list?
After reproducing this on my end, I was able to get this done by using the BlockBlobService class.
pip install azure-storage-blob==2.1.0
You can use either the list_blobs() or the list_blob_names() method to list the blobs inside the specified container. Below is the complete code that worked for me:
from azure.storage.blob import BlockBlobService
ACCOUNT_NAME = "<ACCOUNT_NAME>"
CONTAINER_NAME = "<CONTAINER_NAME>"
SAS_TOKEN='<SAS_TOKEN>'
block_blob_service = BlockBlobService(account_name=ACCOUNT_NAME,account_key=None,sas_token=SAS_TOKEN)
print("\nList of blobs in "+CONTAINER_NAME+"\n")
generator = block_blob_service.list_blobs(CONTAINER_NAME,'dir0/')
for blob in generator:
    print(blob.name)
RESULTS: (screenshot of the listed blob names omitted)
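If you would rather stay on the current v12 azure-storage-blob package, one possible workaround, assuming the directory entries come from a hierarchical-namespace (ADLS Gen2) account where folder placeholders carry the hdi_isfolder metadata flag, is to skip those placeholders while listing. This is an untested sketch with placeholder connection details:

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string("<CONNECTION_STRING>", "container")
# Request metadata so folder placeholder blobs can be recognised and skipped.
for blob in container.list_blobs(name_starts_with="dir0/", include=["metadata"]):
    if (blob.metadata or {}).get("hdi_isfolder") == "true":
        continue  # virtual directory entry
    print(blob.name)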

Base64 image not being sent correctly from Python to Firebase Storage

I've been trying to send an image to Firebase Storage, but when it gets to storage, Firebase can't render the image.
The image for now is pure base64.
versions:
Python 3.10.6
firebase==3.0.1
firebase-admin==6.0.1
Flask==2.0.3
See dontpad.com for the base64 string being used.
Code:
from datetime import date

def filePath(folderPath):
    return f'{folderPath}/{date.today()}'

def fileUpload(file, folderPath):
    fileName = filePath(folderPath)
    from firebase_admin import storage
    bucket = storage.bucket()
    blob = bucket.blob(fileName)
    blob.upload_from_string(file, 'image/jpg')
    blob.make_public()
    return blob.public_url
Additional info will be provided if asked.
Expected and actual results (screenshots omitted).
What did I try?
Alternative data formats to replace base64 have been considered in the project, but base64 is the only data I'm provided for the image, so alternative approaches have been discarded.
Most similar questions use JavaScript, which is not my case, and they use different libraries with different methods and parameters, so that hasn't helped.
Tried adding "data:image/jpeg;base64," to the start of the filename.
Tried replacing content type with "data_url" or "base64".
Tried uploading with and without the extension on the filename.
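Since upload_from_string stores exactly the bytes it is given, a plain base64 string ends up in the bucket as text rather than image data, which would explain why Firebase cannot render it. A hedged sketch of decoding before uploading (assuming the input is a bare base64 string with no data: prefix; the function name and .jpg suffix are illustrative):

import base64
from datetime import date

from firebase_admin import storage

def file_upload(b64_string, folder_path):
    file_name = f'{folder_path}/{date.today()}.jpg'  # extension is optional here
    # Decode the base64 text into raw JPEG bytes before uploading.
    image_bytes = base64.b64decode(b64_string)
    bucket = storage.bucket()
    blob = bucket.blob(file_name)
    blob.upload_from_string(image_bytes, content_type='image/jpeg')
    blob.make_public()
    return blob.public_url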

Firebase Storage: Programmatically delete files from /tmp/

I botched a Firebase cloud function and accidentally created 1.9 million images stored in gs://myapp.appspot.com//tmp/. That double slash is accurate--the server was writing to /tmp/, which I guess results in the path mentioned above.
I'm now wanting to delete those files (they're all nonsense). I tried using the Python wrapper like so:
export GOOGLE_APPLICATION_CREDENTIALS="../secrets/service-account.json"
Then:
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.get_bucket('tmp')
blobs = bucket.list_blobs(bucket='tmp', prefix='')
for blob in blobs:
    print(' * deleting', blob)
    blob.delete()
But this throws:
google.api_core.exceptions.Forbidden: 403 GET https://storage.googleapis.com/storage/v1/b/tmp?projection=noAcl: firebase-adminsdk-yr6f8@myapp.iam.gserviceaccount.com does not have storage.buckets.get access to tmp.
Does anyone know how to allow the admin credentials to delete from /tmp/? Any pointers would be hugely helpful!
I was able to reproduce this problem with gsutil command:
gsutil cp ~/<example-file> gs://<my-project-name>.appspot.com//tmp/
First of all, in my Firebase console I am able to do this with one tick (deleting the whole folder); not sure if you have considered this.
Anyway, if you want to have it done with the API, I have found the following solution.
I think (comparing to my test) the bucket name should be: myapp.appspot.com
If you print the blobs in Python you will get something like this: <Blob: <my-project-name>.appspot.com, /tmp/<example-file>, 1584700014023619>
The 2nd value is the name property of the blob object. I noticed that in this situation the blob names start with /tmp/.
Code that works on my side is:
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.get_bucket('myapp.appspot.com')
blobs = bucket.list_blobs()
for blob in blobs:
    if blob.name.startswith('/tmp/'):
        print(' * deleting', blob)
        blob.delete()
I don't think it's a very elegant solution, but for a one-time fix it may be good enough.
I hope it will help!
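A small refinement on the same idea: since the object names start with /tmp/, that string can be passed as the list prefix so the server filters the listing instead of returning the whole bucket. A sketch using the same bucket name as above:

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('myapp.appspot.com')
# The prefix is matched against the full object name, so the leading slash is intentional.
for blob in bucket.list_blobs(prefix='/tmp/'):
    print(' * deleting', blob.name)
    blob.delete()

With 1.9 million objects this will still take a while; grouping the deletes with storage_client.batch() or running gsutil -m rm in parallel can speed it up.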

Uploading image file to Google Bucket in Python

I am attempting to create a function in Python in which I pass a filename and an image object, which I want to be uploaded to a Google storage bucket. I have the bucket already created, I have all the credentials in an environment variable, but I'm confused about the whole process.
Currently I have the following setup:
import os
from google.cloud import storage

class ImageStorage:
    bucket_name = os.getenv('STORAGE_BUCKET_NAME')
    project_name = os.getenv('STORAGE_BUCKET_PROJECT_ID')
    client = storage.Client(project=project_name)
    bucket = client.get_bucket(bucket_name)

    def save_image(self, filename, image):
        blob = self.bucket.blob(filename)
        blob.upload_from_file(image)
But once I run this, I get the error:
total bytes could not be determined. Please pass an explicit size, or supply a chunk size for a streaming transfer.
I'm not sure how I can provide the byte size of this image object. Do I first need to create a file locally from the image object and then upload that?
As per the GitHub issue, you should provide the chunk_size parameter for a streaming upload.
blob = self.bucket.blob(filename, chunk_size=262144) # 256KB
blob.upload_from_file(image)
chunk_size (int) – The size of a chunk of data whenever iterating (in bytes). This must be a multiple of 256 KB per the API specification.
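The error message also points to a second option: pass an explicit size. A minimal sketch of that variant (assuming image is a seekable file-like object, for example a file upload stream from Flask):

import os
from google.cloud import storage

client = storage.Client(project=os.getenv('STORAGE_BUCKET_PROJECT_ID'))
bucket = client.get_bucket(os.getenv('STORAGE_BUCKET_NAME'))

def save_image(filename, image):
    # Work out the total size by seeking to the end of the file-like object.
    image.seek(0, os.SEEK_END)
    size = image.tell()
    image.seek(0)
    blob = bucket.blob(filename)
    # An explicit size avoids the "total bytes could not be determined" error.
    blob.upload_from_file(image, size=size)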

Saving a matplotlib figure to Azure Blob Storage

I have a word cloud created using the WordCloud class. I am plotting this word cloud using matplotlib. Now I want to save this figure to Azure Blob Storage, but I can't find a Python SDK to do so.
In order to use plt.savefig(), a path for blob storage is needed. Could anyone tell me how this path can be specified, or suggest some other way to store the figure in blob storage?
Code I am using is:
fig, ax = plt.subplots()
words = text.split()
word_cloud = WordCloud(width=8000, height=800,
                       background_color='black',
                       min_font_size=10).generate(str(text))
plt.imshow(word_cloud)
display(fig)
As per my research, you cannot save Matplotlib output to Azure Blob Storage directly.
You may follow the below steps to save Matplotlib output to Azure Blob Storage:
Step 1: You need to first save it to the Databricks File System (DBFS) and then copy it to Azure Blob Storage.
Saving Matplotlib output to Databricks File System (DBFS): We are using the below command to save the output to DBFS: plt.savefig('/dbfs/myfolder/Graph1.png')
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'fruits':['apple','banana'], 'count': [1,2]})
plt.close()
df.set_index('fruits',inplace = True)
df.plot.bar()
plt.savefig('/dbfs/myfolder/Graph1.png')
Step 2: Copy the file from Databricks File System to Azure Blob Storage.
There are two methods to copy a file from DBFS to Azure Blob Storage.
Method 1: Access Azure Blob storage directly
Access Azure Blob Storage directly by setting the storage account key with spark.conf.set, then copy the file from DBFS to Blob Storage.
spark.conf.set("fs.azure.account.key.<Blob Storage Name>.blob.core.windows.net", "<Azure Blob Storage Key>")
Use dbutils.fs.cp to copy file from DBFS to Azure Blob Storage:
dbutils.fs.cp('dbfs:/myfolder/Graph1.png', 'wasbs://<Container>@<Storage Name>.blob.core.windows.net/Azure')
Method 2: Mount Azure Blob storage containers to DBFS
You can mount a Blob storage container or a folder inside a container to Databricks File System (DBFS). The mount is a pointer to a Blob storage container, so the data is never synced locally.
dbutils.fs.mount(
    source = "wasbs://sampledata@chepra.blob.core.windows.net/Azure",
    mount_point = "/mnt/chepra",
    extra_configs = {"fs.azure.sas.sampledata.chepra.blob.core.windows.net": dbutils.secrets.get(scope = "azurestorage", key = "azurestoragekey")})
Use dbutils.fs.cp to copy the file to the Azure Blob Storage container:
dbutils.fs.cp('dbfs:/myfolder/Graph1.png', '/dbfs/mnt/chepra')
By following Method 1 or Method 2 you can successfully save the output to Azure Blob Storage.
Hope this helps. Do let us know if you have any further queries.
I'll assume you have mounted the blob storage (if not, please refer to the Databricks guide).
After that you can follow:
plt.figure(figsize=(20,35))
plt.pcolor(df, cmap="gray")
plt.yticks(np.arange(0.5, len(df.index), 1), df.index)
plt.xticks(np.arange(0.5, len(df.columns), 1), df.columns)
# Create the folder where the plot needs to sit - matplotlib cannot create folders,
# and even creating an empty file with dbutils.fs.put will not work
dbutils.fs.mkdirs('/mnt/...base_path.../folder/')
# save the file using /dbfs/ in front of the regular path
fig.savefig('/dbfs/mnt/...base_path.../folder/file_name.png')
Et voilà!
Have a good one.
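Outside Databricks (or if you prefer not to mount anything), a different route is to render the figure into an in-memory buffer and upload that buffer with the azure-storage-blob v12 SDK. This is a sketch with placeholder connection details, not tied to the word-cloud code above:

import io

import matplotlib.pyplot as plt
from azure.storage.blob import BlobClient

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 5, 6])

# Render the figure to an in-memory PNG instead of a filesystem path.
buffer = io.BytesIO()
fig.savefig(buffer, format='png')
buffer.seek(0)

blob = BlobClient.from_connection_string(
    "<CONNECTION_STRING>",           # placeholder
    container_name="<CONTAINER>",
    blob_name="wordcloud.png",
)
blob.upload_blob(buffer, overwrite=True)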
