Azure Storage Account blob stream with Python - python

Using the latest azure.storage.blob (12.4.0) Python library, I need to open a stream on a blob without downloading it completely into memory.
I have HDF5 files stored in a storage account. Using h5py (2.10.0), I need to extract some information and read data without loading the whole file into memory. The files can contain many gigabytes of data.
container_client = blob_service_client.get_container_client('sample')
blob = container_client.get_blob_client('SampleHdF5.hdf5')
stream = BytesIO()
downloader = blob.download_blob()
# this downloads the entire file into memory,
# and the file can be many gigabytes: a big problem
downloader.readinto(stream)
# works fine to open the stream and read data
f = h5py.File(stream, 'r')
Maybe there's another service more appropriate for this kind of need on Azure.

get_blob_to_stream from azure.storage.blob.baseblobservice can be used, as described here. Note that this module belongs to the legacy SDK (before version 12). These are the packages I used.
from azure.storage.blob.baseblobservice import BaseBlobService
import io
connection_string = ""
container_name = ""
blob_name = ""
blob_service = BaseBlobService(connection_string=connection_string)
with io.BytesIO() as input_io:
    blob_service.get_blob_to_stream(container_name=container_name, blob_name=blob_name, stream=input_io)
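
Note that get_blob_to_stream still copies the whole blob into the BytesIO buffer. As a hedged sketch of an alternative for the original question, one could wrap the v12 blob client in a small seekable file object that fetches only the byte ranges actually requested. It assumes download_blob(offset=..., length=...) and get_blob_properties().size from azure-storage-blob 12.x; the class name BlobRangeReader is hypothetical and not part of any SDK.

```python
import io


class BlobRangeReader(io.RawIOBase):
    """Read-only, seekable file object that fetches byte ranges on demand.

    `blob_client` is assumed to behave like azure.storage.blob.BlobClient (v12).
    """

    def __init__(self, blob_client):
        super().__init__()
        self._blob = blob_client
        # One metadata request to learn the total blob length.
        self._size = blob_client.get_blob_properties().size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == io.SEEK_END:
            new_pos = self._size + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        if new_pos < 0:
            raise ValueError("negative seek position")
        self._pos = new_pos
        return self._pos

    def readinto(self, buffer):
        # Never request bytes past the end of the blob.
        length = min(len(buffer), self._size - self._pos)
        if length <= 0:
            return 0  # end of blob
        data = self._blob.download_blob(offset=self._pos, length=length).readall()
        buffer[: len(data)] = data
        self._pos += len(data)
        return len(data)
```

Assuming h5py's support for Python file-like objects (available since h5py 2.9), something like h5py.File(io.BufferedReader(BlobRangeReader(blob), buffer_size=4 * 1024 * 1024), 'r') should then read only the parts of the file that are accessed. Each unbuffered read is one HTTP request, so the buffer size is a trade-off worth tuning.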

Related

How to download all files from a blob container using python

I have to mention that I barely know anything about Python. I use an application that has no native support for downloading data from blobs, but it does support Python.
I have found a way to list all blobs within the container.
But I have no clue how to download them.
from azure.storage.blob import BlobServiceClient, ContainerClient
import io
from io import StringIO
import pandas as pd
from csv import reader
sas_url = r'https://ubftp.blob.core.windows.netxxxxxxxxxxxxxxxx'
container = ContainerClient.from_container_url(sas_url, delimiter='/')
blob_list = container.list_blobs()
for blob in blob_list:
    # print(list(blob.keys()))
    # assign the name first, then print it
    blob_name = blob['name']
    print(type(blob_name), blob_name)
It lists all the blobs within every subfolder.
What do I add to the code to download them?
Or read them into a dataframe?
Kind regards
This may be what you are looking for:
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python?tabs=managed-identity%2Croles-azure-portal%2Csign-in-azure-cli#download-blobs
# Download the blob to a local file
# Add 'DOWNLOAD' before the .txt extension so you can see both files in the data directory
download_file_path = os.path.join(local_path, str.replace(local_file_name ,'.txt', 'DOWNLOAD.txt'))
container_client = blob_service_client.get_container_client(container= container_name)
print("\nDownloading blob to \n\t" + download_file_path)
with open(file=download_file_path, mode="wb") as download_file:
    download_file.write(container_client.download_blob(blob.name).readall())
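
Combining the listing loop from the question with the quickstart snippet above, a minimal sketch for downloading every blob could look like the following. The function name download_all_blobs and the dest_dir parameter are illustrative assumptions; container_client is expected to be an azure-storage-blob v12 ContainerClient, and the sketch does not handle directory marker entries that some accounts expose.

```python
import os


def download_all_blobs(container_client, dest_dir):
    """Download every blob in the container below dest_dir; return local paths."""
    saved = []
    for blob in container_client.list_blobs():
        # Blob names use "/" as the virtual folder separator.
        local_path = os.path.join(dest_dir, *blob.name.split("/"))
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        with open(local_path, "wb") as download_file:
            download_file.write(container_client.download_blob(blob.name).readall())
        saved.append(local_path)
    return saved
```

To load a CSV blob into a dataframe instead of saving it, the bytes returned by readall() could be wrapped in io.BytesIO and passed to pandas.read_csv, assuming the blob really contains CSV text.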

Gzip a file in Python before uploading to Cloud Storage

I have the following Python function to write the given content to a bucket in Cloud Storage:
import gzip
from google.cloud import storage
def upload_to_cloud_storage(json):
    """Write to Cloud Storage."""
    # The contents to upload as a JSON string.
    contents = json
    storage_client = storage.Client()
    # Path and name of the file to upload (file doesn't yet exist).
    destination = "path/to/name.json.gz"
    # Gzip the contents before uploading
    with gzip.open(destination, "wb") as f:
        f.write(contents.encode("utf-8"))
    # Bucket
    my_bucket = storage_client.bucket('my_bucket')
    # Blob (content)
    blob = my_bucket.blob(destination)
    blob.content_encoding = 'gzip'
    # Write to storage
    blob.upload_from_string(contents, content_type='application/json')
However, I receive an error when running the function:
FileNotFoundError: [Errno 2] No such file or directory: 'path/to/name.json.gz'
Highlighting this line as the cause:
with gzip.open(destination, "wb") as f:
I can confirm that the bucket and path both exist although the file itself is new and to be written.
I can also confirm that removing the Gzipping part sees the file successfully written to Cloud Storage.
How can I gzip a new file and upload to Cloud Storage?
Other answers I've used for reference:
https://stackoverflow.com/a/54769937
https://stackoverflow.com/a/67995040
Although @David's answer wasn't complete at the time of solving my problem, it got me on the right track. Here's what I ended up using, along with explanations I found out along the way.
import gzip
from google.cloud import storage
from google.cloud.storage import fileio
def upload_to_cloud_storage(json_string):
    """Gzip and write to Cloud Storage."""
    storage_client = storage.Client()
    bucket = storage_client.bucket('my_bucket')
    # Filename (include path)
    blob = bucket.blob('path/to/file.json')
    # Set blob metadata for decompressive transcoding
    blob.content_encoding = 'gzip'
    blob.content_type = 'application/json'
    writer = fileio.BlobWriter(blob)
    # Must write as bytes
    gz = gzip.GzipFile(fileobj=writer, mode="wb")
    # When writing as bytes we must encode our JSON string.
    gz.write(json_string.encode('utf-8'))
    # Close connections
    gz.close()
    writer.close()
We use the GzipFile() class instead of the convenience function (compress) so that we can wrap the BlobWriter and pass in the mode. GzipFile only works in binary mode, so passing a str to .write() raises the error:
TypeError: memoryview: a bytes-like object is required, not 'str'
So we open it in binary mode (wb) and encode our JSON string before writing. This can be done using str.encode() with utf-8. Failing to do this results in the same error.
Finally, I wanted to enable decompressive transcoding, where the requester (a browser in my case) receives the uncompressed version of the file when requested. To enable this, google.cloud.storage.blob lets you set metadata including content_type and content_encoding, so we can follow best practices.
This way the JSON object in memory is written to your chosen destination in Cloud Storage in a compressed format and decompressed on the fly (without needing to download a gzip archive).
Thanks also to @JohnHanley for the troubleshooting advice.
The best solution is not to write the gzip to a file at all, and directly compress and stream to GCS.
import gzip
from google.cloud import storage
from google.cloud.storage import fileio
storage_client = storage.Client()
bucket = storage_client.bucket('my_bucket')
blob = bucket.blob('my_object')
writer = fileio.BlobWriter(blob)
gz = gzip.GzipFile(fileobj=writer, mode="wb") # GzipFile is binary-only
gz.write(contents) # contents must be bytes, e.g. text.encode("utf-8")
gz.close()
writer.close()
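
The FileNotFoundError in the question most likely arises because gzip.open(destination, "wb") opens a path on the local filesystem, where the directory path/to does not exist; it never touches the bucket. For payloads that fit comfortably in memory, a simpler variant of the answers above is to compress the bytes with gzip.compress and upload the result in one call. This is only a sketch: the function names are illustrative, and bucket is assumed to be a google.cloud.storage Bucket.

```python
import gzip


def gzip_json(json_string):
    """Return the UTF-8 encoded JSON text compressed with gzip."""
    return gzip.compress(json_string.encode("utf-8"))


def upload_gzipped_json(bucket, object_name, json_string):
    """Upload compressed JSON; `bucket` is assumed to be a storage Bucket."""
    blob = bucket.blob(object_name)
    # Metadata that lets Cloud Storage serve the object decompressed on request.
    blob.content_encoding = "gzip"
    blob.upload_from_string(gzip_json(json_string), content_type="application/json")
    return blob
```

For very large payloads the streaming BlobWriter approach shown in the answers avoids holding the compressed copy in memory.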

Azure SDK for Python: Reading blobs without downloading

I'm currently using the Azure Blob Storage SDK for Python. For my project I want to read/load the data from a specific blob without having to download it / store it on disk before accessing.
According to the documentation, loading a specific blob works for me with:
blob_client = BlobClient(blob_service_client.url,
container_name,
blob_name,
credential)
data_stream = blob_client.download_blob()
data = data_stream.readall()
The last readall() call returns the bytes of the blob content (in my case an image).
With:
with open(local_path, "wb") as local_file:
    data_stream.readinto(local_file)
it is possible to save the blob content to disk (the classic download operation).
BUT:
Is it also possible to convert the byte data from data = data_stream.readall() directly into an image?
I already tried image_data = Image.frombytes(mode="RGB", data=data, size=(1080, 1920))
but it returns the error "not enough image data".
Here is the sample code for reading the text without downloading the file.
from azure.storage.blob import BlockBlobService, PublicAccess
accountname="xxxx"
accountkey="xxxx"
blob_service_client = BlockBlobService(account_name=accountname,account_key=accountkey)
container_name="test2"
blob_name="a5.txt"
#get the length of the blob file, you can use it if you need a loop in your code to read a blob file.
blob_property = blob_service_client.get_blob_properties(container_name,blob_name)
print("the length of the blob is: " + str(blob_property.properties.content_length) + " bytes")
print("**********")
#get the first 10 bytes of data (start_range and end_range are both inclusive)
b1 = blob_service_client.get_blob_to_text(container_name,blob_name,start_range=0,end_range=9)
#you can use the method below to read into a stream object instead
#blob_service_client.get_blob_to_stream(container_name,blob_name,stream,start_range=0,end_range=9)
print(b1.content)
print("*******")
#get the next range of data
b2=blob_service_client.get_blob_to_text(container_name,blob_name,start_range=10,end_range=49)
print(b2.content)
print("********")
#get the next range of data
b3=blob_service_client.get_blob_to_text(container_name,blob_name,start_range=50,end_range=199)
print(b3.content)
For complete information, check the documentation of the Python libraries.
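
On the image part of the question: Image.frombytes expects raw, uncompressed pixel data, whereas a blob holding a PNG or JPEG file contains encoded bytes, which would explain the "not enough image data" error. A hedged sketch, assuming Pillow is installed: check what the bytes look like, then let Image.open decode them from an in-memory buffer. The helper names are illustrative only.

```python
import io

# Magic numbers of a few common encoded image formats.
_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_format(data):
    """Return a format name if the bytes start with a known signature, else None."""
    for magic, name in _SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def decode_image(data):
    """Decode encoded image bytes (e.g. from readall()) into a PIL image."""
    from PIL import Image  # Pillow; imported lazily so the helper above has no dependency
    return Image.open(io.BytesIO(data))
```

With the question's variables this would be used as image = decode_image(data_stream.readall()), with nothing written to disk.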

Download Blob To Local Storage using Python

I'm trying to download a blob file and store it locally on my machine. The file format is HDF5 (a format I have limited/no experience of so far).
So far I've been successful in downloading something using the script below. The key issue is that it doesn't seem to be the full file. When downloading the file directly from Storage Explorer it is circa 4,000 KB. The HDF5 file I save is 2 KB.
What am I doing wrong? Am I missing a readall() somewhere?
It's my first time working with blob storage and HDF5 files, so I'm a little stuck right now. A lot of the old questions seem to use deprecated commands, as the azure.storage.blob module has been updated.
from azure.storage.blob import BlobServiceClient
from io import StringIO, BytesIO
import h5py
# Initialise client
blob_service_client = BlobServiceClient.from_connection_string("my_conn_str")
# Initialise container
blob_container_client = blob_service_client.get_container_client("container_name")
# Get blob
blob_client = blob_container_client.get_blob_client("file_path")
# Download
download_stream = blob_client.download_blob()
# Create empty stream
stream = BytesIO()
# Read downloaded blob into stream
download_stream.readinto(stream)
# Create new empty hdf5 file
hf = h5py.File('data.hdf5', 'w')
# Write stream into empty HDF5
hf.create_dataset('dataset_1',stream)
# Close Blob (& save)
hf.close()
I tried to reproduce the scenario on my system and hit the same issue with the code you tried.
So I tried another approach: read the HDF5 file as a stream and write its contents into another HDF5 file.
Try this solution. I used some dummy data for testing purposes.
from azure.storage.blob import BlobServiceClient
from io import BytesIO
import numpy as np
import h5py
# Initialise client
blob_service_client = BlobServiceClient.from_connection_string("Connection String")
# Initialise container ("test" is the container name)
blob_container_client = blob_service_client.get_container_client("test")
# Get blob ("test.hdf5" is the blob name)
blob_client = blob_container_client.get_blob_client("test.hdf5")
# Download the whole blob into an in-memory stream
# (the entire file is held in memory, which is a problem for multi-gigabyte files)
stream = BytesIO()
downloader = blob_client.download_blob()
downloader.readinto(stream)
print("downloaded the blob")
# dummy data (replaced below once the file has been read)
data_matrix = np.random.uniform(-1, 1, size=(10, 3))
# Open the stream as an HDF5 file and read the first dataset
with h5py.File(stream, "r") as f:
    # List all top-level keys
    print("Keys: %s" % f.keys())
    a_group_key = list(f.keys())[0]
    # Get the data
    data = list(f[a_group_key])
    data_matrix = data
    print(data)
# Write the data into a new local HDF5 file
with h5py.File("file1.hdf5", "w") as data_file:
    data_file.create_dataset("group_name", data=data_matrix)
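
If the goal is only to store the blob locally, as the question states, the HDF5 file does not need to be parsed at all: the question's script builds a new HDF5 file instead of copying the blob's bytes. A minimal sketch under that assumption writes the downloaded chunks straight to disk and checks the size. It assumes the v12 calls download_blob().chunks() and get_blob_properties().size; save_blob_to_file is an illustrative name.

```python
def save_blob_to_file(blob_client, local_path):
    """Stream a blob to local_path chunk by chunk; return the bytes written."""
    expected = blob_client.get_blob_properties().size
    written = 0
    with open(local_path, "wb") as fh:
        # chunks() yields the blob piece by piece, so memory use stays bounded.
        for chunk in blob_client.download_blob().chunks():
            fh.write(chunk)
            written += len(chunk)
    if written != expected:
        raise IOError(f"expected {expected} bytes, wrote {written}")
    return written
```

The resulting file can then be opened locally with h5py.File(local_path, "r") like any other HDF5 file.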

how to transfer file to azure blob storage in chunks without writing to file using python

I need to transfer files from google cloud storage to azure blob storage.
Google gives a code snippet to download files to byte variable like so:
# Get Payload Data
req = client.objects().get_media(
    bucket=bucket_name,
    object=object_name,
    generation=generation) # optional
# The BytesIO object may be replaced with any io.Base instance.
fh = io.BytesIO()
downloader = MediaIoBaseDownload(fh, req, chunksize=1024*1024)
done = False
while not done:
    status, done = downloader.next_chunk()
    if status:
        print('Download %d%%.' % int(status.progress() * 100))
print('Download Complete!')
print(fh.getvalue())
I was able to modify this to store to file by changing the fh object type like so:
fh = open(object_name, 'wb')
Then I can upload to azure blob storage using blob_service.put_block_blob_from_path.
I want to avoid writing to local file on machine doing the transfer.
I gather Google's snippet loads the data into the io.BytesIO() object a chunk at a time. I reckon I should probably use this to write to blob storage a chunk at a time.
I experimented with reading the whole thing into memory and then uploading using put_block_blob_from_bytes, but I got a memory error (the file is probably too big, at ~600 MB).
Any suggestions?
According to the source code of blobservice.py for Azure Storage and of BlobReader for Google Cloud Storage, you can try the Azure function blobservice.put_block_blob_from_file to write the stream, because the GCS class BlobReader provides a read function and so behaves like a stream; see below.
So, referring to the code from https://cloud.google.com/appengine/docs/python/blobstore/#Python_Using_BlobReader, you can try something like this:
from google.appengine.ext import blobstore
from azure.storage.blob import BlobService
blob_key = ...
blob_reader = blobstore.BlobReader(blob_key)
blob_service = BlobService(account_name, account_key)
container_name = ...
blob_name = ...
blob_service.put_block_blob_from_file(container_name, blob_name, blob_reader)
After looking through the SDK source code, something like this could work:
from azure.storage.blob import _chunking
from azure.storage.blob import BlobService
# See _BlobChunkUploader
class PartialChunkUploader(_chunking._BlockBlobChunkUploader):
    def __init__(self, blob_service, container_name, blob_name, progress_callback = None):
        super(PartialChunkUploader, self).__init__(blob_service, container_name, blob_name, -1, -1, None, False, 5, 1.0, progress_callback, None)
    def process_chunk(self, chunk_offset, chunk_data):
        '''chunk_offset is the integer offset. chunk_data is an array of bytes.'''
        return self._upload_chunk_with_retries(chunk_offset, chunk_data)
blob_service = BlobService(account_name='myaccount', account_key='mykey')
uploader = PartialChunkUploader(blob_service, "container", "foo")
# while (...):
# uploader.process_chunk(...)
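
Both answers above target the legacy BlobService API. With the newer azure-storage-blob v12 client, a similar chunk-at-a-time transfer can be sketched without touching SDK internals, because upload_blob is documented to accept a file-like object or an iterable of bytes. The helper names below are illustrative, and source is assumed to be any readable binary stream.

```python
def iter_chunks(fileobj, chunk_size=4 * 1024 * 1024):
    """Yield successive chunks read from a binary file-like object."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk


def copy_stream_to_blob(source, azure_blob_client, chunk_size=4 * 1024 * 1024):
    """Upload `source` to Azure without buffering it all in memory or on disk.

    `azure_blob_client` is assumed to be an azure.storage.blob.BlobClient (v12).
    """
    azure_blob_client.upload_blob(iter_chunks(source, chunk_size), overwrite=True)
```

On the Google side, recent versions of google-cloud-storage expose blob.open("rb"), which returns a readable stream that could serve as source here; that pairing is an assumption and has not been verified against the setup in the question.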
