I'm currently using the Azure Blob Storage SDK for Python. For my project I want to read/load the data from a specific blob without having to download it / store it on disk before accessing.
According to the documentation loading a specfic blob works for my with:
blob_client = BlobClient(blob_service_client.url,
container_name,
blob_name,
credential)
data_stream = blob_client.download_blob()
data = data_stream.readall()
The last readall() command returns me the byte information of the blob content (in my case a image).
With:
with open(loca_path, "wb") as local_file:
data_stream.readinto(my_blob)
it is possible to save the blob content on disk (classic downloading operation)
BUT:
Is it also possible to convert the byte data from data = data_stream.readall() directly into an image?
It already tried image_data = Image.frombytes(mode="RGB", data=data, size=(1080, 1920))
but it returns me an error not enough image data
Here is the sample code for reading the text without downloading the file.
from azure.storage.blob import BlockBlobService, PublicAccess
accountname="xxxx"
accountkey="xxxx"
blob_service_client = BlockBlobService(account_name=accountname,account_key=accountkey)
container_name="test2"
blob_name="a5.txt"
#get the length of the blob file, you can use it if you need a loop in your code to read a blob file.
blob_property = blob_service_client.get_blob_properties(container_name,blob_name)
print("the length of the blob is: " + str(blob_property.properties.content_length) + " bytes")
print("**********")
#get the first 10 bytes data
b1 = blob_service_client.get_blob_to_text(container_name,blob_name,start_range=0,end_range=10)
#you can use the method below to read stream
#blob_service_client.get_blob_to_stream(container_name,blob_name,start_range=0,end_range=10)
print(b1.content)
print("*******")
#get the next range of data
b2=blob_service_client.get_blob_to_text(container_name,blob_name,start_range=10,end_range=50)
print(b2.content)
print("********")
#get the next range of data
b3=blob_service_client.get_blob_to_text(container_name,blob_name,start_range=50,end_range=200)
print(b3.content)
For complete information you can check the document with Python libraries.
Related
I am sending multiple images to azureblobstorage using python code now I need to perform azure ocr on those multiple images and save the results and confidence level in Excel sheet
I don't know what is next step I mean how to extract ocr from multiple images I searched many documentation which was provided by Microsoft but still I didn't find the answer
I am trying to import BlobServiceClient but I am getting import error like BlobServiceClient cannot be imported from azure.blob.storage
Here I installed the following packaged
azure-storage-blob
azure-cognitiveservices-vision-computervisio
using pip install <Package Name>
Now you can use the following code which is printing the line extracted and the confidence for each line of the image.
// Get the following values from portal :
blob_storage_Connection_String = ""
Cognitive_Service_key = ""
Cognitive_Service_endpoint = ""
container_name = "testcontainer"
storage_account_name = ""
blob_service_client = BlobServiceClient.from_connection_string(blob_storage_Connection_String)
computervision_client = ComputerVisionClient(Cognitive_Service_endpoint, CognitiveServicesCredentials(Cognitive_Service_key))
container_client = blob_service_client.get_container_client(container_name)
// List of Blobs
blob_list = container_client.list_blobs()
for blob in blob_list:
blob_url = "https://"+storage_account_name+".blob.core.windows.net/"+container_name+"/"+blob.name
read_response = computervision_client.read(blob_url, raw=True)
read_operation_location = read_response.headers["Operation-Location"]
operation_id = read_operation_location.split("/")[-1]
while True:
read_result = computervision_client.get_read_result(operation_id)
if read_result.status not in ['notStarted', 'running']:
break
time.sleep(1)
for textresult in read_result.analyze_result.read_results:
for line in textresult.lines:
print("Text: ",line.text)
print("Confidence: ", line.appearance.style.confidence)
Here I have 2 images in the azure storage container thus there are two sets of results
Output :
Further you can add the line.text and line.appearance.style.confidence in excel sheet by using xlwt module. Refer this Article by aishwarya.27 on adding data to excel.
I'm trying to download a blob file & store it locally on my machine. The file format is HDF5 (a format I have limited/no experience of so far).
So far I've been successful in downloading something using the scripts below. The key issue is it doesn't seem to be the full file. When downloading the file directly from storage explorer it is circa 4,000kb. The HDF5 file I save is 2kb.
What am I doing wrong? Am I missing a readall() somewhere?
My first time working with blob storage & HDF5's, so coming a little stuck right now. A lot of the old questions seem to be using deprecated commands as the azure.storage.blob module has been updated.
from azure.storage.blob import BlobServiceClient
from io import StringIO, BytesIO
import h5py
# Initialise client
blob_service_client = BlobServiceClient.from_connection_string("my_conn_str")
# Initialise container
blob_container_client = blob_service_client.get_container_client("container_name")
# Get blob
blob_client = blob_container_client.get_blob_client("file_path")
# Download
download_stream = blob_client.download_blob()
# Create empty stream
stream = BytesIO()
# Read downloaded blob into stream
download_stream.readinto(stream)
# Create new empty hdf5 file
hf = h5py.File('data.hdf5', 'w')
# Write stream into empty HDF5
hf.create_dataset('dataset_1',stream)
# Close Blob (& save)
hf.close()
I tried to reproduce the scenario in my system facing with same issue with code you tried
So I tried the another solution read the hdf5 file as stream and write it inside another hdf5 file
Try with this solution .Taken some dummy data for testing purpose.
from azure.storage.blob import BlobServiceClient
from io import StringIO, BytesIO
import numpy as np
import h5py
# Initialise client
blob_service_client = BlobServiceClient.from_connection_string("Connection String")
# Initialise container
blob_container_client = blob_service_client.get_container_client("test//Container name")
# Get blob
blob_client = blob_container_client.get_blob_client("test.hdf5 //Blob name")
print("downloaded the blob ")
# Download
download_stream = blob_client.download_blob()
stream = BytesIO()
downloader = blob_client.download_blob()
# download the entire file in memory here
# file can be many giga bytes! Big problem
downloader.readinto(stream)
# works fine to open the stream and read data
f = h5py.File(stream, 'r')
//dummy data
data_matrix = np.random.uniform(-1, 1, size=(10, 3))
with h5py.File(stream, "r") as f:
# List all groups
print("Keys: %s" % f.keys())
a_group_key = list(f.keys())[0]
# Get the data
data = list(f[a_group_key])
data_matrix=data
print(data)
with h5py.File("file1.hdf5", "w") as data_file:
data_file.create_dataset("group_name", data=data_matrix)
OUTPUT
I have multiple methods on my Python script to work with a csv file. It's working on my local machine but it does not when I am working with the same csv file stored inside a Google Cloud Storage bucket. I need to keep track of my current_position in the file so this is why I am using seek() and tell(). I tried to use the pandas library but there is no such methods. Does anyone has a basic example of a Python script to read a csv stored in a GCP bucket with those methods?
def read_line_from_csv(position):
#df = pandas.read_csv('gs://trends_service_v1/your_path.csv')
with open('keywords.csv') as f:
f.seek(position)
keyword = f.readline()
position = f.tell()
f.close()
return position, keyword
def save_new_position(current_positon):
f = open("position.csv", "w")
f.write(str(current_positon))
f.close()
update_csv_bucket("position.csv")
def get_position_reader():
try:
with open('position.csv') as f:
return int(f.readline())
except OSError as e:
print(e)
Official library do not have such capabilities I think.
You can download file first than open it and work normally.
Apart from official one you can use gcsfs which implements missing functionality
import gcsfs
fs = gcsfs.GCSFileSystem(project='my-google-project')
with fs.open('my-bucket/my-file.txt', 'rb') as f:
print(f.seek(location))
Another way other than #emil-gi's suggestions would be to use the method mentioned here
#Download the contents of this blob as a bytes object
blob.download_as_string()
Where blob is the object associated with your CSV in your GCS bucket.
If you need to create the connection to the blob first (I don't know what you do in other parts of the code), use the docs
You can use Google Cloud Storage fileio.
For instance:
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(file_path) #folder/filename.csv
#Instantiate a BlobReader
blobReader=storage.fileio.BlobReader(blob)
#Get current position in your file
print(blobReader.tell())
#Read line by line
print(blobReader.readline().decode('utf-8')) #read and print row 1
print(blobReader.readline().decode('utf-8')) #read and print row 2
#Read chunk of X bytes
print(blobReader.read(1000).decode('utf-8')) #read next 1000 bytes
#To seek a specific position.
blobReader.seek(position)
Using the latest azure.storage.blob (12.4.0) python library, I need to open a stream on a blob without downloading it completely in memory.
I have hdf5 files stored in storage account, using h5py (2.10.0) I need to extract some information, read data without having the file loaded in memory. The files can contains many giga bytes of data.
container_client = blob_service_client.get_container_client('sample')
blob = container_client.get_blob_client('SampleHdF5.hdf5')
stream = BytesIO()
downloader = blob.download_blob()
# download the entire file in memory here
# file can be many giga bytes! Big problem
downloader.readinto(stream)
# works fine to open the stream and read data
f = h5py.File(stream, 'r')
Maybe there's another service more appropriate for this kind of need on Azure.
get_blob_to_stream can be used with azure.storage.blob.baseblobservice refering to here. There are packages that I used.
from azure.storage.blob.baseblobservice import BaseBlobService
import io
connection_string = ""
container_name = ""
blob_name = ""
blob_service = BaseBlobService(connection_string=connection_string)
with io.BytesIO() as input_io:
blob_service.get_blob_to_stream(container_name=container_name, blob_name=blob_name, stream=input_io)
I feel kind of stupid right now. I have been reading numerous documentations and stackoverflow questions but I can't get it right.
I have a file on Google Cloud Storage. It is in a bucket 'test_bucket'. Inside this bucket there is a folder, 'temp_files_folder', which contains two files, one .txt file named 'test.txt' and one .csv file named 'test.csv'. The two files are simply because I try using both but the result is the same either way.
The content in the files is
hej
san
and I am hoping to read it into python the same way I would do on a local with
textfile = open("/file_path/test.txt", 'r')
times = textfile.read().splitlines()
textfile.close()
print(times)
which gives
['hej', 'san']
I have tried using
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('test_bucket')
blob = bucket.get_blob('temp_files_folder/test.txt')
print(blob.download_as_string)
but it gives the output
<bound method Blob.download_as_string of <Blob: test_bucket, temp_files_folder/test.txt>>
How can I get the actual string(s) in the file?
download_as_string is a method, you need to call it.
print(blob.download_as_string())
More likely, you want to assign it to a variable so that you download it once and can then print it and do whatever else you want with it:
downloaded_blob = blob.download_as_string()
print(downloaded_blob)
do_something_else(downloaded_blob)
The method 'download_as_string()' will read in the content as byte.
Find below an example to process a .csv file.
import csv
from io import StringIO
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.get_bucket(YOUR_BUCKET_NAME)
blob = bucket.blob(YOUR_FILE_NAME)
blob = blob.download_as_string()
blob = blob.decode('utf-8')
blob = StringIO(blob) #tranform bytes to string here
names = csv.reader(blob) #then use csv library to read the content
for name in names:
print(f"First Name: {name[0]}")
According to the documentation (https://googleapis.dev/python/storage/latest/blobs.html), As of the time of writing (2021/08), the download_as_string method is a depreciated alias for the download_as_byte method which - as suggested by the name - returns a byte object.
You can instead use the download_as_text method to return a str object.
For instances, to download the file MYFILE from bucket MYBUCKET and store it as an utf-8 encoded string:
from google.cloud.storage import Client
client = Client()
bucket = client.get_bucket(MYBUCKET)
blob = bucket.get_blob(MYFILE)
downloaded_file = blob.download_as_text(encoding="utf-8")
You can then also use this in order to read different file formats. For json, replace the last line to
import json
downloaded_json_file = json.loads(blob.download_as_text(encoding="utf-8"))
For yaml files, replace the last line to :
import yaml
downloaded_yaml_file = yaml.safe_load(blob.download_as_text(encoding="utf-8"))
DON'T USE: blob.download_as_string()
USE: blob.download_as_text()
blob.download_as_text() does indeed return a string.
blob.download_as_string() is deprecated and returns a bytes object instead of a string object.
Works out when reading a docx / text file
from google.cloud import storage
# create storage client
storage_client = storage.Client.from_service_account_json('**PATH OF JSON FILE**')
bucket = storage_client.get_bucket('**BUCKET NAME**')
# get bucket data as blob
blob = bucket.blob('**SPECIFYING THE DOXC FILENAME**')
downloaded_blob = blob.download_as_string()
downloaded_blob = downloaded_blob.decode("utf-8")
print(downloaded_blob)