Azure Blob Storage: downloading ORC files in Python

This is my first time using Azure Storage and ORC.
Here is what I have learned so far: I am able to download an ORC blob from Azure Storage, save it to disk, and then iterate over the file with the pyorc library in Python. The files are mostly small and fit comfortably in memory. My question is: instead of writing to a file, can I keep the blob in memory and iterate over it there, avoiding the disk entirely? I can download the blob into a stream, but I am not sure how to use pyorc with that stream and could not find documentation for it.
I appreciate any help and best practices for downloading from Azure Storage.

Regarding the issue, please refer to the following steps:
import pyorc
import io
from azure.storage.blob import BlobClient

key = 'account key'
blob_client = BlobClient(account_url='https://<accountname>.blob.core.windows.net',
                         container_name='test',
                         blob_name='my.orc',
                         credential=key)

with io.BytesIO() as f:
    # Download the blob's content straight into the in-memory buffer
    blob_client.download_blob().readinto(f)
    # Rewind so the reader starts from the beginning of the stream
    f.seek(0)
    reader = pyorc.Reader(f)
    print(next(reader))

I want to thank Jim Xu for his solution. I slightly modified it to fit my needs, in case anyone is interested:
from azure.storage.blob import ContainerClient, BlobClient
from io import BytesIO
import pyorc

containerClient = ContainerClient.from_connection_string(azureConnString, container_name=azureContainer)
blobList = containerClient.list_blobs(azureBlobFolder)

for fileNo, blob in enumerate(blobList):
    blobClient = containerClient.get_blob_client(blob=blob.name)
    with BytesIO() as f:
        blobClient.download_blob().readinto(f)
        f.seek(0)  # rewind before handing the stream to pyorc
        reader = pyorc.Reader(f)
        print(next(reader))
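Since the goal elsewhere in this thread is usually a pandas DataFrame, here is a minimal sketch (not part of the original answers) that builds one from the in-memory reader. It assumes the blob has already been downloaded into a rewound BytesIO buffer as above, and that the ORC file's top-level type is a struct, so reader.schema.fields yields the column names and each row comes back as a tuple.
import pandas as pd
import pyorc
from io import BytesIO

def orc_buffer_to_dataframe(buffer: BytesIO) -> pd.DataFrame:
    reader = pyorc.Reader(buffer)
    # For a struct schema, .fields maps column names to their types
    columns = list(reader.schema.fields)
    # pyorc yields each row as a plain tuple by default, which pandas can consume directly
    return pd.DataFrame(reader, columns=columns)

# e.g. call as df = orc_buffer_to_dataframe(f) inside the with-block above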

Related

Reading excel files from "input" blob storage container and exporting to csv in "output" container with python

I'm trying to develop a Python script that reads an .xlsx file from a blob storage container called "source", converts it to .csv and stores it in a new container (I'm testing the script locally; if it works, I will include it in an ADF pipeline). So far I have managed to access the blob storage, but I'm having problems reading the file content.
from azure.storage.blob import BlobServiceClient, ContainerClient, BlobClient
import pandas as pd
conn_str = "DefaultEndpointsProtocol=https;AccountName=XXXXXX;AccountKey=XXXXXX;EndpointSuffix=core.windows.net"
container = "source"
blob_name = "prova.xlsx"
container_client = ContainerClient.from_connection_string(
    conn_str=conn_str,
    container_name=container
)
# Download blob as StorageStreamDownloader object (stored in memory)
downloaded_blob = container_client.download_blob(blob_name)
df = pd.read_excel(downloaded_blob)
print(df)
I get the following error:
ValueError: Invalid file path or buffer object type: <class 'azure.storage.blob._download.StorageStreamDownloader'>
I tried with a .csv file as input, parsing it as follows:
df = pd.read_csv(StringIO(downloaded_blob.content_as_text()))
and it works.
Any suggestion on how to modify the code so that the excel file becomes readable?
I summarize the solution below.
The pandas method pd.read_excel() needs bytes (or a file path) as input, but download_blob() returns an azure.storage.blob.StorageStreamDownloader. We therefore need to call readall() or content_as_bytes() on the downloader to convert it to bytes. For more details, please refer to the documentation.
Change
df = pd.read_excel(downloaded_blob)
to
df = pd.read_excel(downloaded_blob.content_as_bytes())
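Put together, a minimal sketch of the corrected flow might look like the following. The connection string, container, and blob names are the placeholders from the question; readall() is used here to get the downloader's bytes (content_as_bytes() behaves the same), and the bytes are wrapped in BytesIO, which pd.read_excel accepts across pandas versions.
from io import BytesIO

import pandas as pd
from azure.storage.blob import ContainerClient

conn_str = "your connection string"  # placeholder
container_client = ContainerClient.from_connection_string(
    conn_str=conn_str,
    container_name="source"
)

# download_blob() returns a StorageStreamDownloader; readall() gives bytes
blob_bytes = container_client.download_blob("prova.xlsx").readall()

df = pd.read_excel(BytesIO(blob_bytes))
print(df)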

Load CSV stored as an Azure Blob directly into a Pandas data frame without saving to disk first

The article Explore data in Azure blob storage with pandas shows how to load data from Azure Blob Storage into a pandas data frame.
It does so by first downloading the blob to a local CSV file and then loading that CSV file into a data frame.
import pandas as pd
from azure.storage.blob import BlockBlobService
blob_service = BlockBlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)
dataframe_blobdata = pd.read_csv(LOCALFILE)
Is there a way to pull the blob directly into a data frame without saving it to local disk first?
You could try something like this (using StringIO):
import pandas as pd
from azure.storage.blob import BlockBlobService
from io import StringIO

blob_service = BlockBlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
# get_blob_to_text returns a Blob object; its text lives in the .content attribute
blob_string = blob_service.get_blob_to_text(CONTAINERNAME, BLOBNAME)
dataframe_blobdata = pd.read_csv(StringIO(blob_string.content))
Be aware that the whole file is held in memory, so a large file can cause a MemoryError (you can try to del blob_string to free the memory once the data is in the dataframe).
I've done more or less the same thing with Azure Data Lake Storage Gen2 (which is built on Azure Blob Storage).
Hope it helps.
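BlockBlobService comes from the legacy azure-storage SDK; if you are on the current azure-storage-blob (v12) package, a roughly equivalent sketch (account URL, container, blob, and key are placeholders) would be:
from io import BytesIO

import pandas as pd
from azure.storage.blob import BlobClient

blob_client = BlobClient(
    account_url="https://<accountname>.blob.core.windows.net",  # placeholder
    container_name="<container>",
    blob_name="<blob>.csv",
    credential="<account key>",
)

# readall() pulls the whole blob into memory as bytes
csv_bytes = blob_client.download_blob().readall()
dataframe_blobdata = pd.read_csv(BytesIO(csv_bytes))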

How do I read a gzipped parquet file from S3 into Python using Boto3?

I have a file called data.parquet.gzip in my S3 bucket and I can't figure out why I can't read it. I've normally worked with StringIO, but I don't know how to fix this. I want to import the file from S3 into my Python Jupyter notebook session using pandas and boto3.
The solution is actually quite straightforward.
import boto3             # For reading from / pushing to the S3 bucket
import pandas as pd      # Parsing the parquet data
from io import BytesIO   # Wrapping raw bytes in a file-like object
import pyarrow           # Parquet engine used by pandas under the hood

# Set up your S3 client.
# Ideally your Access Key and Secret Access Key are already stored in a credentials file
# so you don't have to specify these parameters explicitly.
s3 = boto3.client('s3',
                  aws_access_key_id=ACCESS_KEY_HERE,
                  aws_secret_access_key=SECRET_ACCESS_KEY_HERE)

# Fetch the object from the bucket
s3_response_object = s3.get_object(Bucket=BUCKET_NAME_HERE, Key=KEY_TO_GZIPPED_PARQUET_HERE)

# Read the streaming body into raw bytes
parquet_bytes = s3_response_object['Body'].read()

# Wrap the bytes in BytesIO and let pandas parse the parquet data
df = pd.read_parquet(BytesIO(parquet_bytes))
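The import comment above mentions pushing to S3 as well; a minimal sketch of the reverse direction (writing the DataFrame back to the bucket as parquet, reusing the same client and placeholder names, with a hypothetical destination key) could look like this:
from io import BytesIO

# Serialize the DataFrame to parquet in memory
out_buffer = BytesIO()
df.to_parquet(out_buffer, index=False)

# Upload the buffer's contents to S3
s3.put_object(Bucket=BUCKET_NAME_HERE,
              Key='output/data.parquet',  # hypothetical destination key
              Body=out_buffer.getvalue())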
If you are using an IDE on your laptop/PC to connect to AWS S3, you may refer to Corey's first solution:
import boto3
import pandas as pd
import io

s3 = boto3.resource(service_name='s3', region_name='XXXX',
                    aws_access_key_id='YYYY', aws_secret_access_key='ZZZZ')

buffer = io.BytesIO()
object = s3.Object(bucket_name='bucket_name', key='path/to/your/file.parquet')
object.download_fileobj(buffer)
df = pd.read_parquet(buffer)
If you are using a Glue job, you may refer to Corey's second solution in the Glue script:
df = pd.read_parquet(path='s3://bucket_name/path/to/your/file.parquet')
In case you want to read a .json file (using an IDE on your laptop/PC):
object = s3.Object(bucket_name='bucket_name',
                   key='path/to/your/file.json').get()['Body'].read().decode('utf-8')
df = pd.read_json(object, lines=True)

How to load pickled dataframes from GCS into App Engine

I'm trying to load a pickled pandas dataframe from Google Cloud Storage into App Engine.
I have been using blob.download_to_file() to read the byte stream into pandas, but I encounter the following error:
UnpicklingError: invalid load key, m
I have tried seeking to the beginning to no avail and am pretty sure something fundamental is missing from my understanding.
When attempting to pass an open file object and read from there, I get an
UnsupportedOperation: write
error
from io import BytesIO

import pandas as pd
from google.cloud import storage
from google.oauth2 import service_account  # needed for Credentials.from_service_account_file

def get_byte_fileobj(project, bucket, path) -> BytesIO:
    blob = _get_blob(bucket, path, project)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream

def _get_blob(bucket_name, path, project):
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob

fileobj = get_byte_fileobj(projectid, 'backups', 'Matches/Matches.pickle')
pd.read_pickle(fileobj)
Ideally pandas would just read from pickle since all of my GCS backups are in that format, but I'm open to suggestions.
The pandas.read_pickle() method takes a file path string as its argument, not a file handle/object:
pandas.read_pickle(path, compression='infer')
    Load pickled pandas object (or any object) from file.
    path : str
        File path where the pickled object will be loaded.
If you're on the 2nd generation standard environment or the flexible environment, you could try using a real /tmp file instead of BytesIO.
Otherwise you'd have to find another way of loading the data into pandas, one that supports a file object/descriptor. The general approach is described in How to restore Tensorflow model from Google bucket without writing to filesystem? (the context is different, but the general idea is the same).
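One workaround along those lines, sketched under the assumption that the get_byte_fileobj() helper from the question already returns a rewound BytesIO and that the pickle was written without compression: bypass pandas.read_pickle() and unpickle the stream with the standard pickle module, which accepts a file object.
import pickle

# get_byte_fileobj() is the helper defined in the question above
fileobj = get_byte_fileobj(projectid, 'backups', 'Matches/Matches.pickle')

# pickle.load works on any readable binary file object, no temp file needed
df = pickle.load(fileobj)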

How to read parquet file into pandas from Azure blob store

I need to read and write parquet files from an Azure blob store within the context of a Jupyter notebook running Python 3 kernel.
I have seen code for working strictly with parquet files in Python, and other code for grabbing from/writing to an Azure blob store, but nothing yet that puts it all together.
Here is some sample code I'm playing with:
from azure.storage.blob import BlockBlobService
block_blob_service = BlockBlobService(account_name='testdata', account_key='key-here')
block_blob_service.get_blob_to_text(container_name='mycontainer', blob_name='testdata.parquet')
This last line throws an encoding-related error.
I've played with storefact but have come up short there.
Thanks for any help.
To access the file, you first need to connect to Azure Blob Storage (this example uses a Spark session, e.g. in Databricks).
storage_account_name = "your storage account name"
storage_account_access_key = "your storage account access key"
Read the path of the parquet file into a variable:
commonLOB_mst_source = "Parquet file path"
file_type = "parquet"
Connect to blob storage:
spark.conf.set(
    "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
    storage_account_access_key)
Read the parquet file into a dataframe:
df = spark.read.format(file_type).option("inferSchema", "true").load(commonLOB_mst_source)
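The answer above assumes a Spark session. Since the question asks about a plain Python 3 Jupyter kernel with pandas, here is a minimal sketch of an alternative using the azure-storage-blob (v12) package instead; the account URL, credential, and the output blob name testdata_out.parquet are placeholders/assumptions (the other names come from the question's snippet), and pd.read_parquet needs pyarrow or fastparquet installed.
from io import BytesIO

import pandas as pd
from azure.storage.blob import BlobClient

# Read: download the parquet blob into memory and hand the bytes to pandas
in_blob = BlobClient(account_url="https://testdata.blob.core.windows.net",  # placeholder
                     container_name="mycontainer",
                     blob_name="testdata.parquet",
                     credential="key-here")
df = pd.read_parquet(BytesIO(in_blob.download_blob().readall()))

# Write: serialize the dataframe to parquet in memory and upload it
out_blob = BlobClient(account_url="https://testdata.blob.core.windows.net",  # placeholder
                      container_name="mycontainer",
                      blob_name="testdata_out.parquet",  # hypothetical output name
                      credential="key-here")
out_buffer = BytesIO()
df.to_parquet(out_buffer, index=False)
out_blob.upload_blob(out_buffer.getvalue(), overwrite=True)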
