Saving matplotlib figure on azure blob storage - python

I have a word cloud created using the WordCloud class, which I am plotting with matplotlib. Now I want to save this figure to Azure Blob Storage, but I can't find a Python SDK call to do so.
In order to use plt.savefig(), a path for blob storage is needed. Could anyone tell me how this path can be specified, or some other way to store the figure in a blob?
The code I am using is:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

fig, ax = plt.subplots()
words = text.split()
word_cloud = WordCloud(width=8000, height=800,
                       background_color='black',
                       min_font_size=10).generate(str(text))
plt.imshow(word_cloud)
display(fig)

As per my research, you cannot save Matplotlib output to Azure Blob Storage directly.
You may follow the steps below to save Matplotlib output to Azure Blob Storage:
Step 1: First save the output to the Databricks File System (DBFS); it can then be copied to Azure Blob Storage.
Saving Matplotlib output to the Databricks File System (DBFS): use the command below to save the output to DBFS: plt.savefig('/dbfs/myfolder/Graph1.png')
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'fruits':['apple','banana'], 'count': [1,2]})
plt.close()
df.set_index('fruits',inplace = True)
df.plot.bar()
plt.savefig('/dbfs/myfolder/Graph1.png')
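Applied to the word cloud from the question, Step 1 might look like the sketch below (a minimal sketch: the input text, the 800x800 size, and the folder /dbfs/myfolder/ are placeholders, and the folder is assumed to exist, e.g. created with dbutils.fs.mkdirs):
import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "azure blob storage matplotlib word cloud"  # placeholder input text

# Build the word cloud and render it without axes
word_cloud = WordCloud(width=800, height=800,
                       background_color='black',
                       min_font_size=10).generate(text)
fig, ax = plt.subplots()
ax.imshow(word_cloud)
ax.axis('off')

# Write to DBFS through the local file API (note the /dbfs/ prefix)
fig.savefig('/dbfs/myfolder/wordcloud.png')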
Step 2: Copy the file from Databricks File System to Azure Blob Storage.
There are two methods to copy a file from DBFS to Azure Blob Storage.
Method 1: Access Azure Blob Storage directly
Access Azure Blob Storage directly by setting the account key with spark.conf.set, then copy the file from DBFS to Blob Storage.
spark.conf.set("fs.azure.account.key.<Storage Account Name>.blob.core.windows.net", "<Azure Blob Storage Key>")
Use dbutils.fs.cp to copy the file from DBFS to Azure Blob Storage:
dbutils.fs.cp('dbfs:/myfolder/Graph1.png', 'wasbs://<Container>@<Storage Account Name>.blob.core.windows.net/Azure')
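Putting Step 1 and Method 1 together for the word-cloud file from the sketch above, the copy might look like this (the storage account name, container name, and key are placeholders):
# Placeholder storage account name and key
spark.conf.set("fs.azure.account.key.mystorageaccount.blob.core.windows.net",
               "<storage-account-key>")

# Copy the figure saved in Step 1 from DBFS into a placeholder container
dbutils.fs.cp('dbfs:/myfolder/wordcloud.png',
              'wasbs://mycontainer@mystorageaccount.blob.core.windows.net/wordcloud.png')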
Method 2: Mount Azure Blob storage containers to DBFS
You can mount a Blob storage container or a folder inside a container to Databricks File System (DBFS). The mount is a pointer to a Blob storage container, so the data is never synced locally.
dbutils.fs.mount(
    source = "wasbs://sampledata@chepra.blob.core.windows.net/Azure",
    mount_point = "/mnt/chepra",
    extra_configs = {"fs.azure.sas.sampledata.chepra.blob.core.windows.net": dbutils.secrets.get(scope = "azurestorage", key = "azurestoragekey")})
Use dbutils.fs.cp to copy the file to the mounted Azure Blob Storage container:
dbutils.fs.cp('dbfs:/myfolder/Graph1.png', 'dbfs:/mnt/chepra/Graph1.png')
By following Method 1 or Method 2 you can successfully save the output to Azure Blob Storage.
Hope this helps. Do let us know if you have any further queries.

I'll assume you have mounted the blob storage (if not, please refer to the Databricks guide).
After that you can follow:
import numpy as np
import matplotlib.pyplot as plt

# df is an existing pandas DataFrame to plot
fig = plt.figure(figsize=(20, 35))
plt.pcolor(df, cmap="gray")
plt.yticks(np.arange(0.5, len(df.index), 1), df.index)
plt.xticks(np.arange(0.5, len(df.columns), 1), df.columns)

# create the folder where the plot needs to sit - matplotlib cannot create folders,
# and even creating an empty file with dbutils.fs.put will not work
dbutils.fs.mkdirs('/mnt/...base_path.../folder/')

# save the file using /dbfs/ in front of the regular path
fig.savefig('/dbfs/mnt/...base_path.../folder/file_name.png')
Et voilà!
Have a good one.

Related

Why do I get a green check mark in Google Colab when exporting an image to Google Drive but no file is found in Drive?

I have tested and successfully exported files from Google Colab to Google Drive. However, when I try to export a file of data (as a .tif) after performing calculations on Landsat data in Google Colab, I get a green checkmark after running the code, but then there is no file in Drive. Why do I get the green check mark if the code is not executing? Or am I trying to export the information incorrectly? I have tried the two options below, based on this.
# Export the image, specifying scale and region.
class1export = class1.select('field')
task = ee.batch.Export.image.toDrive(**{
    'image': class1export,
    'description': 'class1export',
    'folder': 'GEE_Colab',
    'scale': 30,
    'region': shp
})
task.start()
# Export the image, specifying scale and region.
class1export = class1export.select('field')
task = ee.batch.Export.image.toDrive(
    image=class1export,
    scale=30,
    fileFormat='GeoTIFF',
    description='class1export.tif',
    folder='GEE_Colab',
    maxPixels=1e9)
task.start()
In case anyone else is experiencing this issue, I found a solution.
import os
import ee
import geemap

# class1export is the image prepared above
shp = ee.FeatureCollection('users/xyz/shp')
out_dir = os.path.join(os.path.expanduser('~'), 'Downloads')
filename = os.path.join(out_dir, 'class1export.tif')
roi = shp.geometry()
geemap.ee_export_image(class1export, filename=filename, scale=30, region=roi, file_per_band=False)
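As for the green check mark itself: ee.batch.Export tasks run asynchronously on Earth Engine's servers, so task.start() returns immediately and the Colab cell finishes even if the export is still pending or later fails. A small sketch for checking on a submitted task (it assumes the task object from the code in the question):
import time

# Poll the export task until it finishes on the Earth Engine side
while task.active():
    print('Export state:', task.status()['state'])
    time.sleep(30)
print('Final status:', task.status())  # e.g. COMPLETED, FAILED, or CANCELLED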

Firebase Storage: Programatically delete files from /tmp/

I botched a Firebase cloud function and accidentally created 1.9 million images stored in gs://myapp.appspot.com//tmp/. That double slash is accurate: the server was writing to /tmp/, which I guess results in the path mentioned above.
I'm now wanting to delete those files (they're all nonsense). I tried using the Python wrapper like so:
export GOOGLE_APPLICATION_CREDENTIALS="../secrets/service-account.json"
Then:
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('tmp')
blobs = bucket.list_blobs(bucket='tmp', prefix='')
for blob in blobs:
    print(' * deleting', blob)
    blob.delete()
But this throws:
google.api_core.exceptions.Forbidden: 403 GET https://storage.googleapis.com/storage/v1/b/tmp?projection=noAcl: firebase-adminsdk-yr6f8@myapp.iam.gserviceaccount.com does not have storage.buckets.get access to tmp.
Does anyone know how to allow the admin credentials to delete from /tmp/? Any pointers would be hugely helpful!
I was able to reproduce this problem with gsutil command:
gsutil cp ~/<example-file> gs://<my-project-name>.appspot.com//tmp/
First of all, in the Firebase console I am able to do it with one tick (for the whole folder); not sure if you have considered this.
Anyway, if you want to do it with the API, I have found the following solution.
I think (comparing to my test) the bucket name should be: myapp.appspot.com
If you print the blobs in Python you will get something like this: <Blob: <my-project-name>.appspot.com, /tmp/<example-file>, 1584700014023619>
The second value is the name property of the blob object. I noticed that in this situation the blob names start with /tmp/.
Code that works on my side is:
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('myapp.appspot.com')
blobs = bucket.list_blobs()
for blob in blobs:
    if blob.name.startswith('/tmp/'):
        print(' * deleting', blob)
        blob.delete()
I don't think it's a very elegant solution, but for a one-time fix it may be good enough.
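With 1.9 million objects, one request per delete will also be slow. A rough, untested sketch of a faster variant: list only the offending /tmp/ prefix and group the deletes into batch requests via client.batch() (batch requests are capped at roughly 100 calls, hence the chunking):
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('myapp.appspot.com')

# Listing with the prefix avoids walking the entire bucket
blobs = bucket.list_blobs(prefix='/tmp/')

chunk = []
for blob in blobs:
    chunk.append(blob)
    if len(chunk) == 100:
        # Requests issued inside the context are sent as one batch on exit
        with storage_client.batch():
            for b in chunk:
                b.delete()
        chunk = []

if chunk:  # delete any remainder
    with storage_client.batch():
        for b in chunk:
            b.delete()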
I hope it will help!

Uploading image file to Google Bucket in Python

I am attempting to create a function in Python in which I pass a filename and an image object, which I want to be uploaded to a Google storage bucket. I have the bucket already created, I have all the credentials in an environment variable, but I'm confused about the whole process.
Currently I have the following setup:
import os
from google.cloud import storage

class ImageStorage:
    bucket_name = os.getenv('STORAGE_BUCKET_NAME')
    project_name = os.getenv('STORAGE_BUCKET_PROJECT_ID')
    client = storage.Client(project=project_name)
    bucket = client.get_bucket(bucket_name)

    def save_image(self, filename, image):
        blob = self.bucket.blob(filename)
        blob.upload_from_file(image)
But once I run this, I get the error:
total bytes could not be determined. Please pass an explicit size, or supply a chunk size for a streaming transfer.
I'm not sure how I can provide the byte size of this image object. Do I first need to create a local file from the image object and then upload that?
As per the GitHub issue, you should provide the chunk_size parameter for a streaming upload.
blob = self.bucket.blob(filename, chunk_size=262144) # 256KB
blob.upload_from_file(image)
chunk_size (int) – The size of a chunk of data whenever iterating (in bytes). This must be a multiple of 256 KB per the API specification.
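Alternatively, the error message's other suggestion (an explicit size) also works when the image is an in-memory, seekable buffer whose length you know. A sketch under that assumption (the bucket name, object name, and image bytes are placeholders):
import io
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my-bucket')      # placeholder bucket name
image_bytes = b'...image data...'            # placeholder image content

blob = bucket.blob('images/example.png')     # placeholder object name

# Passing size= tells the client the total length up front,
# so it does not need chunked streaming to finish the upload.
blob.upload_from_file(io.BytesIO(image_bytes), size=len(image_bytes),
                      content_type='image/png')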

Google Cloud Storage : Python API get blob information with wildcard

I am trying to get blob information from a bucket, but I want to use wildcards in the blob name. Consider my bucket:
$ gsutil ls gs://myBucket/myPath/
gs://myBucket/myPath/
gs://myBucket/myPath/ranOn=2018-12-11/
gs://myBucket/myPath/ranOn=2018-12-12/
gs://myBucket/myPath/ranOn=2018-12-13/
gs://myBucket/myPath/ranOn=2018-12-14/
gs://myBucket/myPath/ranOn=2018-12-15/
gs://myBucket/myPath/ranOn=2019-02-18/
gs://myBucket/myPath/ranOn=2019-02-19/
gs://myBucket/myPath/ranOn=2019-02-20/
gs://myBucket/myPath/ranOn=2019-02-21/
Now, from the command line, I am able to do:
$ gsutil ls gs://myBucket/myPath/ranOn=2018*
gs://myBucket/myPath/
gs://myBucket/myPath/ranOn=2018-12-11/
gs://myBucket/myPath/ranOn=2018-12-12/
gs://myBucket/myPath/ranOn=2018-12-13/
gs://myBucket/myPath/ranOn=2018-12-14/
gs://myBucket/myPath/ranOn=2018-12-15/
and hence I can do the same for the size:
$ gsutil du -sh gs://myBucket/myPath/ranOn=2018*
2.7 G
Now, I want to do the same thing with the Python API. Here is what I tried:
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.get_bucket('myBucket')
blob = bucket.get_blob('myPath/ranOn=2018*')
print('Size: {} bytes'.format(blob.size))
Size: None bytes
Why is this not working? How can I use wildcards in blob paths with the Python API?
Unfortunately get_blob is just for getting individual files, not multiple files.
You'll need to iterate over all the files that match the prefix and sum their sizes to get the total size.
blobs = bucket.list_blobs(prefix="myPath/ranOn=2018")
total = sum([blob.size for blob in blobs])
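Note that a prefix only covers a trailing wildcard. For patterns elsewhere in the name, one option (a sketch, not part of the client API) is to list by the fixed part of the path and filter client-side with fnmatch:
import fnmatch
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('myBucket')

# List with the fixed prefix, then apply the wildcard pattern locally
pattern = 'myPath/ranOn=2018-12-1*'   # hypothetical pattern
blobs = [b for b in bucket.list_blobs(prefix='myPath/ranOn=2018')
         if fnmatch.fnmatch(b.name, pattern)]

total = sum(b.size for b in blobs)
print('Size: {} bytes'.format(total))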

Can I download from Google storage blobs into a VM as an n-d array?

I have a Google Compute Engine VM and am trying to grab data from my cloud storage, which is stored as a blob, and turn it into a NumPy array with the same shape as it had when stored.
Currently the only way I can get this working is by downloading to a file and then loading it into a NumPy array, which seems sub-optimal.
I have tried downloading as a string directly and converting it into a NumPy array, but the dimensions are not maintained (they are flattened).
I could move all the files to the VM instead, but would rather read them 'on the fly' if possible.
Current code:
def __getitem__(self, index):
    index = int(self.indexes[int(index)])
    blob = bucket.blob(self.data_path + 'case_' + str(index) + '_volume.npy')
    blob.download_to_filename('im.npy')
    image = np.load('im.npy')
    return image
If you have enough RAM to store the entire file in memory (while it is also loaded into numpy), you can do the read into a BytesIO object, seek back to the beginning of the buffer, then hand it to numpy.load(). Adapt this as necessary to your particular function:
import io
import numpy as np
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('my-bucket')
blob = bucket.blob('my-file.npy')

with io.BytesIO() as in_memory_file:
    blob.download_to_file(in_memory_file)
    in_memory_file.seek(0)
    image = np.load(in_memory_file)
    # then, for example:
    print(image)
At least for now there doesn't appear to be a way to actually stream the read out of GCS without writing the necessary client library yourself.
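As a small variation on the same idea (a sketch, assuming a reasonably recent google-cloud-storage release), the object can also be pulled down as bytes and wrapped in a BytesIO, which avoids managing the buffer manually:
import io
import numpy as np
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my-bucket')
blob = bucket.blob('my-file.npy')

# download_as_bytes() returns the whole object in memory;
# older releases call this download_as_string()
image = np.load(io.BytesIO(blob.download_as_bytes()))
print(image.shape)  # np.load preserves the original array shape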
