loading file contents from Azure blob storage [duplicate] - python

I need to download a PDF from a blob container in Azure as a download stream (StorageStreamDownloader) and open it in both PDFPlumber and PDFMiner.
I developed all the requirements loading them as a file, but I can't manage to receive a download stream (StorageStreamDownloader) and open it successfully.
I was opening the PDFs like this:
pdf = pdfplumber.open(pdfpath)  # for pdfplumber
fp = open('Pdf/' + fileGlob, 'rb')  # for pdfminer
parser = PDFParser(fp)
document = PDFDocument(parser)
However, I need to be able to download a stream. Here is the code snippet that downloads the PDF as a file:
blob_client = container.get_blob_client(remote_file)
with open(local_file_path, "wb") as local_file:
    download_stream = blob_client.download_blob()
    local_file.write(download_stream.readall())
I tried several options, even using a temp file with no luck.
Any ideas?
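For what it's worth, both libraries accept a file-like object in place of a path, so one way to skip the temp file is to wrap the downloaded bytes in a BytesIO. A minimal sketch, with dummy bytes standing in for the real blob_client.download_blob().readall() call:

```python
from io import BytesIO

def open_pdf_stream(pdf_bytes: bytes) -> BytesIO:
    # In real code pdf_bytes would be blob_client.download_blob().readall();
    # the resulting stream can be passed to pdfplumber.open() or to
    # pdfminer's PDFParser() in place of an open file.
    stream = BytesIO(pdf_bytes)
    stream.seek(0)  # make sure reads start at the beginning
    return stream

# dummy payload standing in for a real PDF download
stream = open_pdf_stream(b"%PDF-1.4 dummy content")
print(stream.read(8))
```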

download_blob() returns a StorageStreamDownloader, and this class has a download_to_stream method; with it you can get the blob as a stream.
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
from io import BytesIO
import PyPDF2

filename = "test.pdf"
container_name = "test"

blob_service_client = BlobServiceClient.from_connection_string("connection string")
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(filename)

streamdownloader = blob_client.download_blob()
stream = BytesIO()
streamdownloader.download_to_stream(stream)

fileReader = PyPDF2.PdfFileReader(stream)
print(fileReader.numPages)
This is my result: it prints the number of pages in the PDF.

It seems download_to_stream() is now deprecated; readinto() should be used instead.
from azure.storage.blob import BlobClient

conn_string = ''
container_name = ''
blob_name = ''

blob_obj = BlobClient.from_connection_string(
    conn_str=conn_string, container_name=container_name,
    blob_name=blob_name
)

with open(blob_name, 'wb') as f:
    b = blob_obj.download_blob()
    b.readinto(f)
This will create a file in the working directory with the downloaded data.

Simply call readall() on the result of download_blob(), which reads the data as bytes.
from azure.storage.blob import BlobClient

conn_string = ''
container_name = ''
blob_name = ''

blob_obj = BlobClient.from_connection_string(conn_string, container_name, blob_name)

with open(blob_name, 'wb') as f:
    b = blob_obj.download_blob().readall()
    f.write(b)
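The difference between readinto() and readall() in the two answers above can be sketched with a hypothetical stand-in class (this is not the real Azure StorageStreamDownloader, it only mimics the two methods being discussed):

```python
from io import BytesIO

class FakeDownloader:
    """Hypothetical stand-in for StorageStreamDownloader."""
    def __init__(self, data: bytes):
        self._data = data

    def readall(self) -> bytes:
        # returns the whole blob as one bytes object
        return self._data

    def readinto(self, stream) -> int:
        # writes the blob into an already-open stream, returning the byte count
        return stream.write(self._data)

d = FakeDownloader(b"blob payload")
buf = BytesIO()
n = d.readinto(buf)   # stream style: caller supplies the sink
print(n, d.readall())
```

readall() suits small payloads you want in memory; readinto() suits writing straight into an open file without holding the whole blob as a separate bytes object.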

Related

Download Blob To Local Storage using Python

I'm trying to download a blob file and store it locally on my machine. The file format is HDF5 (a format I have limited/no experience with so far).
So far I've been successful in downloading something using the script below. The key issue is it doesn't seem to be the full file: when downloading the file directly from Storage Explorer it is circa 4,000 KB, but the HDF5 file I save is 2 KB.
What am I doing wrong? Am I missing a readall() somewhere?
This is my first time working with blob storage and HDF5, so I'm coming up a little stuck right now. A lot of the old questions seem to use deprecated commands, as the azure.storage.blob module has been updated.
from azure.storage.blob import BlobServiceClient
from io import StringIO, BytesIO
import h5py
# Initialise client
blob_service_client = BlobServiceClient.from_connection_string("my_conn_str")
# Initialise container
blob_container_client = blob_service_client.get_container_client("container_name")
# Get blob
blob_client = blob_container_client.get_blob_client("file_path")
# Download
download_stream = blob_client.download_blob()
# Create empty stream
stream = BytesIO()
# Read downloaded blob into stream
download_stream.readinto(stream)
# Create new empty hdf5 file
hf = h5py.File('data.hdf5', 'w')
# Write stream into empty HDF5
hf.create_dataset('dataset_1',stream)
# Close Blob (& save)
hf.close()
I tried to reproduce the scenario in my system and faced the same issue with the code you tried.
So I tried another solution: read the HDF5 file as a stream and write its contents into another HDF5 file.
Try this solution (I took some dummy data for testing purposes).
from azure.storage.blob import BlobServiceClient
from io import BytesIO
import h5py

# Initialise client
blob_service_client = BlobServiceClient.from_connection_string("Connection String")
# Initialise container
blob_container_client = blob_service_client.get_container_client("test")  # container name
# Get blob
blob_client = blob_container_client.get_blob_client("test.hdf5")  # blob name

# Download the entire blob into an in-memory stream
# (note: the file can be many gigabytes, which is a problem for big files)
stream = BytesIO()
downloader = blob_client.download_blob()
downloader.readinto(stream)

# Works fine to open the stream and read data
with h5py.File(stream, "r") as f:
    # List all groups
    print("Keys: %s" % f.keys())
    a_group_key = list(f.keys())[0]
    # Get the data
    data_matrix = list(f[a_group_key])
    print(data_matrix)

# Write the data into a new HDF5 file
with h5py.File("file1.hdf5", "w") as data_file:
    data_file.create_dataset("group_name", data=data_matrix)
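One general gotcha in this pattern (not specific to h5py): readinto() leaves the stream cursor at the end of what was written, so it is usually worth rewinding with seek(0) before handing the buffer to a parser. A stdlib-only sketch, with BytesIO.write standing in for downloader.readinto:

```python
from io import BytesIO

def download_to_buffer(payload: bytes) -> BytesIO:
    stream = BytesIO()
    stream.write(payload)  # stands in for downloader.readinto(stream)
    stream.seek(0)         # rewind; otherwise subsequent reads start at EOF
    return stream

buf = download_to_buffer(b"\x89HDF dummy bytes")
print(buf.read(4))
```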

Stream Files to Zip File in Azure Blob Storage using Python?

I have the following problem in Python:
I am looking to create a zipfile in Blob Storage consisting of files from an array of URLs but I don't want to create the entire zipfile in memory and then upload it. I ideally want to stream the files to the zipfile in blob storage. I found this write up for C# https://andrewstevens.dev/posts/stream-files-to-zip-file-in-azure-blob-storage/
as well as this answer also in C# https://stackoverflow.com/a/54767264/10550055 .
I haven't been able to find equivalent functionality in the python azure blob SDK and python zipfile library.
Try this:
from zipfile import ZipFile
from azure.storage.blob import BlobServiceClient
import os, requests

tempPath = '<temp path>'
if not os.path.isdir(tempPath):
    os.mkdir(tempPath)

zipFileName = 'test.zip'
storageConnstr = ''
container = ''
blob = BlobServiceClient.from_connection_string(storageConnstr).get_container_client(container).get_blob_client(zipFileName)

fileURLs = {'https://cdn.pixabay.com/photo/2015/04/23/22/00/tree-736885__480.jpg',
            'http://1812.img.pp.sohu.com.cn/images/blog/2009/11/18/18/8/125b6560a6ag214.jpg',
            'http://513.img.pp.sohu.com.cn/images/blog/2009/11/18/18/27/125b6541abcg215.jpg'}

def download_url(url, save_path, chunk_size=128):
    r = requests.get(url, stream=True)
    with open(save_path, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=chunk_size):
            fd.write(chunk)

# download each file and write it to the zip
zipObj = ZipFile(tempPath + zipFileName, 'w')
for url in fileURLs:
    localFilePath = tempPath + os.path.basename(url)
    download_url(url, localFilePath)
    zipObj.write(localFilePath)
zipObj.close()

# upload zip
with open(tempPath + zipFileName, 'rb') as stream:
    blob.upload_blob(stream)
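If the goal is to avoid temp files entirely, the zip can also be built in memory with the standard zipfile module and the resulting bytes passed straight to upload_blob(). A sketch under that assumption, with literal byte payloads standing in for the downloaded URL contents:

```python
from io import BytesIO
from zipfile import ZipFile, ZIP_DEFLATED

def zip_in_memory(files: dict) -> bytes:
    # files maps archive names to byte payloads; in real code each payload
    # would come from requests.get(url).content, and the returned bytes
    # would go to blob.upload_blob() instead of a local file.
    buf = BytesIO()
    with ZipFile(buf, "w", ZIP_DEFLATED) as zf:
        for name, payload in files.items():
            zf.writestr(name, payload)
    return buf.getvalue()

archive = zip_in_memory({"a.txt": b"hello", "b.txt": b"world"})
print(len(archive))
```

Note this still buffers the whole archive in memory, so it is not true streaming; genuinely chunked uploads would need the SDK's block-level APIs (stage_block/commit_block_list), which is closer to what the linked C# answers do.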

Read Json files from Azure blob using python?

I need to read a JSON file from a blob container in Azure to do some transformation on top of the JSON files. I have seen a few documentation pages and StackOverflow answers and developed Python code that reads the files from the blob.
I have tried the below script from one of the StackOverflow answers to read the JSON file, but I get the error below:
"TypeError: the JSON object must be str, bytes or byte array, not BytesIO"
I am new to Python programming, so I'm not sure of the issue in the code. I also tried download_stream.content_as_text(), but that doesn't read the file without an error.
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
from io import BytesIO
import requests
from pandas import json_normalize
import json
filename = "sample.json"
container_name="test"
constr = ""
blob_service_client = BlobServiceClient.from_connection_string(constr)
container_client=blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(filename)
streamdownloader=blob_client.download_blob()
stream = BytesIO()
streamdownloader.download_to_stream(stream)
# with open(stream) as j:
# contents = json.loads(j)
fileReader = json.loads(stream)
print(filereader)
You can use the readall() function. Please try this code:
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
import json
filename = "sample.json"
container_name="test"
constr = ""
blob_service_client = BlobServiceClient.from_connection_string(constr)
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(filename)
streamdownloader = blob_client.download_blob()
fileReader = json.loads(streamdownloader.readall())
print(fileReader)
Result: the parsed JSON is printed as a Python dict.
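As a side note on why readall() works here: json.loads() accepts bytes directly (since Python 3.6), so the downloaded payload does not need an explicit decode. A quick stdlib-only check with a dummy payload:

```python
import json

payload = b'{"name": "sample", "count": 2}'  # stands in for streamdownloader.readall()
data = json.loads(payload)  # bytes are accepted directly, no .decode() needed
print(data["count"])
```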


Azure Blob - Read using Python

Can someone tell me if it is possible to read a CSV file directly from Azure blob storage as a stream and process it using Python? I know it can be done using C#.NET (shown below) but wanted to know the equivalent library in Python to do this.
CloudBlobClient client = storageAccount.CreateCloudBlobClient();
CloudBlobContainer container = client.GetContainerReference("outfiles");
CloudBlob blob = container.GetBlobReference("Test.csv");
Yes, it is certainly possible to do so. Check out Azure Storage SDK for Python
from azure.storage.blob import BlockBlobService
block_blob_service = BlockBlobService(account_name='myaccount', account_key='mykey')
block_blob_service.get_blob_to_path('mycontainer', 'myblockblob', 'out-sunset.png')
You can read the complete SDK documentation here: http://azure-storage.readthedocs.io.
Here's a way to do it with the new version of the SDK (12.0.0):
from azure.storage.blob import BlobClient

blob = BlobClient(account_url="https://<account_name>.blob.core.windows.net",
                  container_name="<container_name>",
                  blob_name="<blob_name>",
                  credential="<account_key>")

with open("example.csv", "wb") as f:
    data = blob.download_blob()
    data.readinto(f)
See here for details.
One can stream from a blob with Python like this:
from tempfile import NamedTemporaryFile
from azure.storage.blob.blockblobservice import BlockBlobService

entry_path = conf['entry_path']
container_name = conf['container_name']
blob_service = BlockBlobService(
    account_name=conf['account_name'],
    account_key=conf['account_key'])

def get_file(filename):
    local_file = NamedTemporaryFile()
    blob_service.get_blob_to_stream(container_name, filename, stream=local_file,
                                    max_connections=2)
    local_file.seek(0)
    return local_file
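The seek(0) in get_file() above is the important detail: after writing into the temp file the cursor sits at the end, so without the rewind any subsequent read returns nothing. A stdlib-only sketch of the same pattern, with dummy bytes in place of get_blob_to_stream():

```python
from tempfile import NamedTemporaryFile

def get_file_local(data: bytes):
    local_file = NamedTemporaryFile()
    local_file.write(data)   # stands in for blob_service.get_blob_to_stream(...)
    local_file.seek(0)       # rewind so callers read from the start
    return local_file

f = get_file_local(b"col1,col2\n1,2\n")
print(f.read())
```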
Provide your Azure storage account name and secret key as the account key here:
block_blob_service = BlockBlobService(account_name='$$$$$$', account_key='$$$$$$')
This will get the blob and save it in the current location as 'output.jpg':
block_blob_service.get_blob_to_path('your-container-name', 'your-blob', 'output.jpg')
This will get the text/item from the blob:
blob_item = block_blob_service.get_blob_to_bytes('your-container-name', 'blob-name')
blob_item.content
I recommend using smart_open.
import os
from azure.storage.blob import BlobServiceClient
from smart_open import open
connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
transport_params = {
    'client': BlobServiceClient.from_connection_string(connect_str),
}

# stream from Azure Blob Storage
with open('azure://my_container/my_file.txt', transport_params=transport_params) as fin:
    for line in fin:
        print(line)

# stream content *into* Azure Blob Storage (write mode):
with open('azure://my_container/my_file.txt', 'wb', transport_params=transport_params) as fout:
    fout.write(b'hello world')
Since I wasn't able to find what I needed on this thread, I wanted to follow up on #SebastianDziadzio's answer to retrieve the data without downloading it as a local file, which is what I was trying to find for myself.
Replace the with statement with the following:
from io import BytesIO
import pandas as pd

with BytesIO() as input_blob:
    blob_client_instance.download_blob().download_to_stream(input_blob)
    input_blob.seek(0)
    df = pd.read_csv(input_blob, compression='infer', index_col=0)
Here is a simple way to read a CSV using pandas from a blob:
import os
import pandas as pd
from azure.storage.blob import BlobServiceClient

service_client = BlobServiceClient.from_connection_string(os.environ['AZURE_STORAGE_CONNECTION_STRING'])
client = service_client.get_container_client("your_container")
bc = client.get_blob_client(blob="your_folder/yourfile.csv")
data = bc.download_blob()
with open("file.csv", "wb") as f:
    data.readinto(f)
df = pd.read_csv("file.csv")
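When pandas is not available (or a temp file is unwanted), the same idea works fully in memory with the standard library's csv module; here dummy bytes stand in for bc.download_blob().readall():

```python
import csv
from io import BytesIO, TextIOWrapper

csv_bytes = b"id,name\n1,alice\n2,bob\n"  # stands in for the downloaded blob

# wrap the bytes so the csv module can read them without touching disk
rows = list(csv.DictReader(TextIOWrapper(BytesIO(csv_bytes), encoding="utf-8")))
print(rows[0]["name"])
```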
To read from Azure Blob:
I wanted to get a CSV from Azure blob storage into an openpyxl xlsx workbook.
from io import BytesIO
import os
import openpyxl
from azure.storage.blob import BlobClient

conn_str = os.environ.get('BLOB_CONN_STR')
container_name = os.environ.get('CONTAINER_NAME')
blob = BlobClient.from_connection_string(conn_str, container_name=container_name,
                                         blob_name="YOUR BLOB PATH HERE FROM AZURE BLOB")
data = blob.download_blob()
workbook_obj = openpyxl.load_workbook(filename=BytesIO(data.readall()))
To write to Azure Blob:
I struggled a lot with this and don't want anyone to do the same. If you are using openpyxl and want to write directly from an Azure Function to blob storage, the following will achieve what you are seeking. Thanks, HMU if you need any help.
blob = BlobClient.from_connection_string(conn_str=conString, container_name=container_name,
                                         blob_name=r'YOUR_PATH/test1.xlsx')
blob.upload_blob(save_virtual_workbook(wb))
I know this is an old post, but if someone wants to do the same, I was able to access the blobs as per the code below.
Note: you need to set AZURE_STORAGE_CONNECTION_STRING, which can be obtained from the Azure Portal -> go to your storage account -> Settings -> Access keys, and you will find the connection string there.
For Windows:
setx AZURE_STORAGE_CONNECTION_STRING ""
For Linux:
export AZURE_STORAGE_CONNECTION_STRING=""
For macOS:
export AZURE_STORAGE_CONNECTION_STRING=""
import os
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, __version__
connect_str = os.getenv('AZURE_STORAGE_CONNECTION_STRING')
print(connect_str)
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
container_client = blob_service_client.get_container_client("Your Container Name Here")
try:
    print("\nListing blobs...")
    # List the blobs in the container
    blob_list = container_client.list_blobs()
    for blob in blob_list:
        print("\t" + blob.name)
except Exception as ex:
    print('Exception:')
    print(ex)
