I'm using the code below to read a JSON file from Azure Storage into a dataframe in Python.
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
import json
import pandas as pd
from pandas import DataFrame
from datetime import datetime
import uuid
filename = "raw/filename.json"
container_name="test"
constr = ""
blob_service_client = BlobServiceClient.from_connection_string(constr)
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(filename)
streamdownloader = blob_client.download_blob()
fileReader = json.loads(streamdownloader.readall())
df = pd.DataFrame(fileReader)
rslt_df = df[df['ID'] == 'f2a8141f-f1c1-42c3-bb57-910052b78110']
rslt_df.head()
This works fine, but I want to read multiple files into a dataframe. Is there any way to pass a pattern in the file name, like the one below, to read multiple files from Azure Storage recursively?
filename = "raw/filename*.json"
Thank you
I tried this in my environment and read multiple JSON files successfully:
from azure.storage.blob import BlobServiceClient
import json
import pandas as pd

service_client = BlobServiceClient.from_connection_string("<CONNECTION STRING>")
container_client = service_client.get_container_client("container1")
# name_starts_with acts as a prefix filter; the Blob service does not support wildcards
blob_list = container_client.list_blobs(name_starts_with="directory1")
for blob in blob_list:
    print()
    print("The file " + blob.name + " contains:")
    blob_client = container_client.get_blob_client(blob.name)
    downloader = blob_client.download_blob()
    file_reader = json.loads(downloader.readall())
    dataframe = pd.DataFrame(file_reader)
    print(dataframe.to_string())
I uploaded three JSON files to my container, as you can see below:
Output:
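Since the original goal was a single dataframe, here is a minimal sketch (my own addition, not from the answer above) that concatenates every blob matching the raw/filename prefix into one frame, reusing the container_client from the snippet above:

import json
import pandas as pd

frames = []
# name_starts_with is a prefix filter; wildcards are not supported by the service
for blob in container_client.list_blobs(name_starts_with="raw/filename"):
    downloader = container_client.get_blob_client(blob.name).download_blob()
    frames.append(pd.DataFrame(json.loads(downloader.readall())))

combined_df = pd.concat(frames, ignore_index=True)
rslt_df = combined_df[combined_df['ID'] == 'f2a8141f-f1c1-42c3-bb57-910052b78110']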
The scenario is:
I have a CSV file in Azure Storage. I want to process a column of this file (for example, split the records into a separate file for each minute in which they were received), and then store the new files in another Azure Storage container.
In the code below I read the file, process it, and create the separate files, but when I try to upload them I get this error: [Errno 2] No such file or directory.
My code is:
import os, uuid
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, __version__
import pandas as pd
try:
    print("Azure Blob Storage v" + __version__ + " - Python quickstart sample")
    accountName = "***"
    accountKey = "*****"
    containerName = "sourcedata"
    blobName = "testdataset.csv"
    urlblob = "https://***.blob.core.windows.net/sorcedata/testdataset.csv"
    connect_str = "******"
    blobService = BlobServiceClient(account_name=accountName, account_key=accountKey, account_url=urlblob)
    # Create the BlobServiceClient object which will be used to create a container client
    blob_service_client = BlobServiceClient.from_connection_string(connect_str)
    # Create a unique name for the container
    container_name = str(uuid.uuid4())
    # Create the container
    container_client = blob_service_client.create_container(container_name)
    df = pd.read_csv(urlblob)
    # create datetime column
    df['datetime'] = pd.to_datetime(df.received_time, format='%M:%S.%f')
    # groupby with Grouper, and save to csv
    for g, d in df.groupby(pd.Grouper(key='datetime', freq='1min')):
        # Create file name
        filename = str(g.time()).replace(':', '')
        # remove datetime column and save CSV file
        d.iloc[:, :-1].to_csv(f'{filename}.csv', index=False)
        # Create a blob client using the local file name as the name for the blob
        blob_client = blob_service_client.get_blob_client(container=container_name, blob=filename)
        print("\nUploading to Azure Storage as blob:\n\t" + filename)
        # Upload the created file
        with open(filename, "rb") as data:
            blob_client.upload_blob(data)
except Exception as ex:
    print('Exception:')
    print(ex)
You should write your code like this:
with open(filename + ".csv", "rb") as data:
filename is just the file name without its .csv suffix, so the path is incomplete and Python cannot find the file when it tries to open it.
Result image:
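For clarity, here is a sketch of the corrected tail of the loop, reusing the question's variables (I also pass the suffixed name as the blob name so the uploaded blob keeps its .csv extension, which goes slightly beyond the original fix):

# save and upload with the same, fully qualified file name
d.iloc[:, :-1].to_csv(f'{filename}.csv', index=False)
blob_client = blob_service_client.get_blob_client(container=container_name, blob=filename + ".csv")
with open(filename + ".csv", "rb") as data:
    blob_client.upload_blob(data)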
I need to read a JSON file from a blob container in Azure to do some transformation on top of the JSON files. I have looked at a few documentation pages and StackOverflow answers and developed Python code that reads the files from the blob.
I tried the script below, from one of the StackOverflow answers, to read the JSON file, but I get the following error:
"TypeError: the JSON object must be str, bytes or byte array, not BytesIO"
I am new to Python programming, so I am not sure what the issue in the code is. I also tried download_stream.content_as_text(), but it does not read the file and gives no error.
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
from io import BytesIO
import requests
from pandas import json_normalize
import json
filename = "sample.json"
container_name="test"
constr = ""
blob_service_client = BlobServiceClient.from_connection_string(constr)
container_client=blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(filename)
streamdownloader=blob_client.download_blob()
stream = BytesIO()
streamdownloader.download_to_stream(stream)
# with open(stream) as j:
# contents = json.loads(j)
fileReader = json.loads(stream)
print(fileReader)
You can use the readall function. Please try this code:
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
import json
filename = "sample.json"
container_name="test"
constr = ""
blob_service_client = BlobServiceClient.from_connection_string(constr)
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(filename)
streamdownloader = blob_client.download_blob()
fileReader = json.loads(streamdownloader.readall())
print(fileReader)
Result:
I am trying to read a large (~1.5 GB) .txt file from an Azure blob in Python, which gives a MemoryError. Is there a way I can read this file efficiently?
Below is the code that I am trying to run:
from azure.storage.blob import BlockBlobService
import pandas as pd
from io import StringIO
import time
STORAGEACCOUNTNAME= '*********'
STORAGEACCOUNTKEY= "********"
CONTAINERNAME= '******'
BLOBNAME= 'path/to/blob'
blob_service = BlockBlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
start = time.time()
blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME).content
df = pd.read_csv(StringIO(blobstring))
end = time.time()
print("Time taken = ",end-start)
Below are last few lines of the error:
---> 16 blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME)
17
18 #df = pd.read_csv(StringIO(blobstring))
~/anaconda3_420/lib/python3.5/site-packages/azure/storage/blob/baseblobservice.py in get_blob_to_text(self, container_name, blob_name, encoding, snapshot, start_range, end_range, validate_content, progress_callback, max_connections, lease_id, if_modified_since, if_unmodified_since, if_match, if_none_match, timeout)
2378 if_none_match,
2379 timeout)
-> 2380 blob.content = blob.content.decode(encoding)
2381 return blob
2382
MemoryError:
How can I read a file of size ~1.5 GB in Python from a Blob container? Also, I want to have an optimum runtime for my code.
Assuming there is enough memory on your machine, and per the pandas.read_csv API reference, you can read the CSV blob content directly into a pandas dataframe via the blob URL with a SAS token.
Here is my sample code as a reference for you.
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import BlobPermissions
from datetime import datetime, timedelta
import pandas as pd
account_name = '<your storage account name>'
account_key = '<your storage account key>'
container_name = '<your container name>'
blob_name = '<your csv blob name>'
url = f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}"
service = BaseBlobService(account_name=account_name, account_key=account_key)
# Generate the sas token for your csv blob
token = service.generate_blob_shared_access_signature(container_name, blob_name, permission=BlobPermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1),)
# Directly read the csv blob content into dataframe by the url with sas token
df = pd.read_csv(f"{url}?{token}")
print(df)
This avoids copying the content in memory several times, which is what happens when you read the text content and then convert it to a file-like buffer.
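If even that runs out of memory, a different route (my own suggestion, not part of the answer above) is to download the blob to disk with the legacy SDK and let pandas iterate over it in chunks, reusing the variables defined above; 'large_blob.csv' is just a placeholder for a local path:

import pandas as pd
from azure.storage.blob import BlockBlobService

blob_service = BlockBlobService(account_name=account_name, account_key=account_key)
# download to disk first instead of decoding the whole blob in memory
blob_service.get_blob_to_path(container_name, blob_name, 'large_blob.csv')

row_count = 0
for chunk in pd.read_csv('large_blob.csv', chunksize=100_000):
    # replace this with your own per-chunk processing
    row_count += len(chunk)
print("Total rows:", row_count)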
Hope it helps.
I have uploaded an Excel file to an AWS S3 bucket and now I want to read it in Python. Any help would be appreciated. Here is what I have achieved so far:
import boto3
import os
aws_id = 'aws_id'
aws_secret = 'aws_secret_key'
client = boto3.client('s3', aws_access_key_id=aws_id, aws_secret_access_key=aws_secret)
bucket_name = 'my_bucket'
object_key = 'my_excel_file.xlsm'
object_file = client.get_object(Bucket=bucket_name, Key=object_key)
body = object_file['Body']
data = body.read()
What do I need to do next in order to read this data and work on it?
I spent quite some time on it, and here's how I got it working:
import boto3
import io
import pandas as pd
import json
aws_id = ''
aws_secret = ''
bucket_name = ''
object_key = ''
s3 = boto3.client('s3', aws_access_key_id=aws_id, aws_secret_access_key=aws_secret)
obj = s3.get_object(Bucket=bucket_name, Key=object_key)
data = obj['Body'].read()
df = pd.read_excel(io.BytesIO(data), encoding='utf-8')
You can read an xls file directly from S3 without having to download or save it locally. The xlrd module can create a workbook object from raw file contents.
Following is the code snippet.
from boto3 import Session
from xlrd.book import open_workbook_xls
aws_id = ''
aws_secret = ''
bucket_name = ''
object_key = ''
s3_session = Session(aws_access_key_id=aws_id, aws_secret_access_key=aws_secret)
bucket_object = s3_session.resource('s3').Bucket(bucket_name).Object(object_key)
content = bucket_object.get()['Body'].read()
workbook = open_workbook_xls(file_contents=content)
You can directly read excel files using awswrangler.s3.read_excel. Note that you can pass any pandas.read_excel() arguments (sheet name, etc) to this.
import awswrangler as wr
df = wr.s3.read_excel(path=s3_uri)
Python doesn't support Excel files natively. You could use the pandas library's read_excel functionality.
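For example, a minimal sketch combining it with the boto3 calls from the answers above (the bucket and key names are placeholders):

import io
import boto3
import pandas as pd

s3 = boto3.client('s3')  # or pass aws_access_key_id / aws_secret_access_key explicitly
obj = s3.get_object(Bucket='my-bucket', Key='my_excel_file.xlsx')
# read_excel accepts any file-like object, so wrap the downloaded bytes
df = pd.read_excel(io.BytesIO(obj['Body'].read()))
print(df.head())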
Can someone tell me if it is possible to read a csv file directly from Azure blob storage as a stream and process it using Python? I know it can be done using C#.Net (shown below) but wanted to know the equivalent library in Python to do this.
CloudBlobClient client = storageAccount.CreateCloudBlobClient();
CloudBlobContainer container = client.GetContainerReference("outfiles");
CloudBlob blob = container.GetBlobReference("Test.csv");
Yes, it is certainly possible to do so. Check out Azure Storage SDK for Python
from azure.storage.blob import BlockBlobService
block_blob_service = BlockBlobService(account_name='myaccount', account_key='mykey')
block_blob_service.get_blob_to_path('mycontainer', 'myblockblob', 'out-sunset.png')
You can read the complete SDK documentation here: http://azure-storage.readthedocs.io.
Here's a way to do it with the new version of the SDK (12.0.0):
from azure.storage.blob import BlobClient

blob = BlobClient(account_url="https://<account_name>.blob.core.windows.net",
                  container_name="<container_name>",
                  blob_name="<blob_name>",
                  credential="<account_key>")

with open("example.csv", "wb") as f:
    data = blob.download_blob()
    data.readinto(f)
See here for details.
One can stream from a blob with Python like this:
from tempfile import NamedTemporaryFile
from azure.storage.blob.blockblobservice import BlockBlobService
entry_path = conf['entry_path']
container_name = conf['container_name']
blob_service = BlockBlobService(
    account_name=conf['account_name'],
    account_key=conf['account_key'])

def get_file(filename):
    local_file = NamedTemporaryFile()
    blob_service.get_blob_to_stream(container_name, filename, stream=local_file,
                                    max_connections=2)
    local_file.seek(0)
    return local_file
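Usage might then look like this (the blob name is a placeholder of mine; get_file returns a temporary file rewound to the start):

import pandas as pd

with get_file("your_folder/yourfile.csv") as local_file:
    df = pd.read_csv(local_file)
print(df.head())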
Provide your Azure Storage account name and secret key as the account key here:
block_blob_service = BlockBlobService(account_name='$$$$$$', account_key='$$$$$$')
This will get the blob and save it in the current location as 'output.jpg':
block_blob_service.get_blob_to_path('you-container_name', 'your-blob', 'output.jpg')
This will get the text/item from the blob:
blob_item= block_blob_service.get_blob_to_bytes('your-container-name','blob-name')
blob_item.content
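If the goal is the CSV-to-dataframe workflow from the question, the downloaded bytes can be parsed without touching disk; a small sketch, assuming the blob holds CSV data:

import io
import pandas as pd

df = pd.read_csv(io.BytesIO(blob_item.content))
print(df.head())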
I recommend using smart_open.
import os
from azure.storage.blob import BlobServiceClient
from smart_open import open
connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
transport_params = {
'client': BlobServiceClient.from_connection_string(connect_str),
}
# stream from Azure Blob Storage
with open('azure://my_container/my_file.txt', transport_params=transport_params) as fin:
for line in fin:
print(line)
# stream content *into* Azure Blob Storage (write mode):
with open('azure://my_container/my_file.txt', 'wb', transport_params=transport_params) as fout:
fout.write(b'hello world')
Since I wasn't able to find what I needed in this thread, I wanted to follow up on @SebastianDziadzio's answer and show how to retrieve the data without downloading it as a local file, which is what I was trying to do myself.
Replace the with statement with the following:
from io import BytesIO
import pandas as pd

with BytesIO() as input_blob:
    blob_client_instance.download_blob().download_to_stream(input_blob)
    input_blob.seek(0)
    df = pd.read_csv(input_blob, compression='infer', index_col=0)
Here is a simple way to read a CSV from a blob using pandas:
import os
import pandas as pd
from azure.storage.blob import BlobServiceClient

service_client = BlobServiceClient.from_connection_string(os.environ['AZURE_STORAGE_CONNECTION_STRING'])
client = service_client.get_container_client("your_container")
bc = client.get_blob_client(blob="your_folder/yourfile.csv")
data = bc.download_blob()
with open("file.csv", "wb") as f:
    data.readinto(f)
df = pd.read_csv("file.csv")
To Read from Azure Blob
I wanted to load a file from Azure Blob Storage into an openpyxl xlsx workbook.
import os
import openpyxl
from io import BytesIO
from azure.storage.blob import BlobClient

conn_str = os.environ.get('BLOB_CONN_STR')
container_name = os.environ.get('CONTAINER_NAME')
blob = BlobClient.from_connection_string(conn_str, container_name=container_name,
                                         blob_name="YOUR BLOB PATH HERE FROM AZURE BLOB")
data = blob.download_blob()
workbook_obj = openpyxl.load_workbook(filename=BytesIO(data.readall()))
To write to Azure Blob
I struggled a lot with this and don't want anyone else to go through the same.
If you are using openpyxl and want to write directly from an Azure Function to Blob Storage, the following will achieve what you are looking for.
Thanks. HMU if you need any help.
blob=BlobClient.from_connection_string(conn_str=conString,container_name=container_name, blob_name=r'YOUR_PATH/test1.xlsx')
blob.upload_blob(save_virtual_workbook(wb))
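Note that save_virtual_workbook has been deprecated in newer openpyxl releases; here is a sketch of the same upload using an in-memory buffer instead, reusing the blob and wb objects above:

from io import BytesIO

# write the workbook into an in-memory buffer instead of using save_virtual_workbook
buffer = BytesIO()
wb.save(buffer)
buffer.seek(0)
blob.upload_blob(buffer, overwrite=True)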
I know this is an old post, but in case someone wants to do the same:
I was able to access the blobs with the code below.
Note: you need to set the AZURE_STORAGE_CONNECTION_STRING environment variable, which you can obtain from the Azure Portal -> go to your storage account -> Settings -> Access keys, where you will find the connection string.
For Windows:
setx AZURE_STORAGE_CONNECTION_STRING ""
For Linux:
export AZURE_STORAGE_CONNECTION_STRING=""
For macOS:
export AZURE_STORAGE_CONNECTION_STRING=""
import os
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, __version__

connect_str = os.getenv('AZURE_STORAGE_CONNECTION_STRING')
print(connect_str)
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
container_client = blob_service_client.get_container_client("your-container-name")
try:
    print("\nListing blobs...")
    # List the blobs in the container
    blob_list = container_client.list_blobs()
    for blob in blob_list:
        print("\t" + blob.name)
except Exception as ex:
    print('Exception:')
    print(ex)
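To then read one of the listed blobs straight into pandas without saving it locally, here is a sketch reusing the container_client above (the blob name is a placeholder, and the CSV assumption matches the original question):

import io
import pandas as pd

downloader = container_client.get_blob_client("your_folder/yourfile.csv").download_blob()
df = pd.read_csv(io.BytesIO(downloader.readall()))
print(df.head())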