Writing pandas DataFrame to Azure Blob Storage from Azure Function - python

I am writing a simple Azure Function to read an input blob, create a pandas DataFrame from it, and then write it back to Blob Storage as a CSV. I have the code below to read the file and convert it into a DataFrame:
import logging
import pandas as pd
import io
import azure.functions as func

def main(inputBlob: func.InputStream):
    logging.info(f"Python blob trigger function processed blob \n"
                 f"Name: {inputBlob.name}\n"
                 f"Blob Size: {inputBlob.length} bytes")
    df = pd.read_csv(io.BytesIO(inputBlob.read()), sep='#', encoding='unicode_escape', header=None, names=range(16))
    logging.info(df.head())
How can I write this DataFrame out to Blob Storage?

I have uploaded the file with the code below; target is the container and target.csv is the blob we want to write to.
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(CONN_STR)

# Get a blob client for the output file
output_file_dest = blob_service_client.get_blob_client(container="target", blob="target.csv")

# Initialize the output
output_str = ""

# Store column headers
data = list()
data.append(list(["column1", "column2", "column3", "column4"]))

# Add data to the output string. Here you can pass the input blob data instead.
# Also look at the upload_blob parameters that fit your requirement.
output_str += ('"' + '","'.join(data[0]) + '"\n')

output_file_dest.upload_blob(output_str, overwrite=True)
In the code above, you can ignore the column-header section and replace it with the data you already read into the DataFrame with pandas.
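For the DataFrame itself, a minimal sketch of the same idea (assuming the CONN_STR connection string and the target / target.csv names from the snippet above): serialize the DataFrame to CSV in memory with to_csv and hand the resulting string to upload_blob.
from io import StringIO
from azure.storage.blob import BlobServiceClient

# df is the DataFrame produced in the blob-triggered function above
output_client = BlobServiceClient.from_connection_string(CONN_STR).get_blob_client(
    container="target", blob="target.csv"
)

# Serialize the DataFrame to CSV in memory and upload it as the blob's content
csv_buffer = StringIO()
df.to_csv(csv_buffer, index=False)
output_client.upload_blob(csv_buffer.getvalue(), overwrite=True)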

Related

Reading multiple json files from Azure storage into Python dataframe

I'm using the code below to read a json file from Azure storage into a dataframe in Python.
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
import json
import pandas as pd
from pandas import DataFrame
from datetime import datetime
import uuid
filename = "raw/filename.json"
container_name="test"
constr = ""
blob_service_client = BlobServiceClient.from_connection_string(constr)
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(filename)
streamdownloader = blob_client.download_blob()
fileReader = json.loads(streamdownloader.readall())
df = pd.DataFrame(fileReader)
rslt_df = df[df['ID'] == 'f2a8141f-f1c1-42c3-bb57-910052b78110']
rslt_df.head()
This works fine, but I want to read multiple files into a dataframe. Is there any way to pass a pattern in the file name, like below, to read multiple files from Azure storage recursively?
filename = "raw/filename*.json"
Thank you
I tried this in my environment and was able to read multiple json files successfully:
ServiceClient = BlobServiceClient.from_connection_string("<CONNECTION STRING>")
ContainerClient = ServiceClient.get_container_client("container1")
BlobList = ContainerClient.list_blobs(name_starts_with="directory1")
for blob in BlobList:
    print()
    print("The file " + blob.name + " contains:")
    blob_client = ContainerClient.get_blob_client(blob.name)
    downloaderpath = blob_client.download_blob()
    fileReader = json.loads(downloaderpath.readall())
    dataframe = pd.DataFrame(fileReader)
    print(dataframe.to_string())
I uploaded three json files to my container; the output prints each file's contents as a separate DataFrame.
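If the goal is a single DataFrame rather than one per file, the same listing loop can collect the pieces and concatenate them at the end. A sketch along the same lines, assuming the connection string and "test" container from the question, and that the name_starts_with prefix matches the raw/filename pattern:
import json
import pandas as pd
from azure.storage.blob import BlobServiceClient

service_client = BlobServiceClient.from_connection_string("<CONNECTION STRING>")
container_client = service_client.get_container_client("test")

# Collect one DataFrame per matching blob, then combine them into one
frames = []
for blob in container_client.list_blobs(name_starts_with="raw/filename"):
    downloader = container_client.get_blob_client(blob.name).download_blob()
    frames.append(pd.DataFrame(json.loads(downloader.readall())))

combined_df = pd.concat(frames, ignore_index=True)
print(combined_df.head())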

Writing a simple text file with no key value pair to Cloud Storage

My requirement is to export the data from BQ to GCS in a particular sorted order, which I am not able to get using the automatic export, so I am trying to write a manual export for this.
File format is like below:
HDR001||5378473972abc||20101|182082||
DTL001||436282798101|
DTL002||QS
DTL005||3733|8
DTL002||QA
DTL005||3733|8
DTL002||QP
DTL005||3733|8
DTL001||436282798111|
DTL002||QS
DTL005||3133|2
DTL002||QA
DTL005||3133|8
DTL002||QP
DTL005||3133|0
I am very new to this and am able to write the file to local disk, but I am not sure how I can write this file to GCS. I tried to use write_to_file but I seem to be missing something.
import pandas as pd
import pickle as pkl
import tempfile
from google.colab import auth
from google.cloud import bigquery, storage
#import cloudstorage as gcs

auth.authenticate_user()

df = pd.DataFrame(data=job)
sc = storage.Client(project='temp-project')
with tempfile.NamedTemporaryFile(mode='w+b', buffering=-1, prefix='test', suffix='temp') as fh:
    with open(fh.name, 'w+', newline='') as f:
        dfAsString = df.to_string(header=" ", index=False)
        fh.name = fh.write(dfAsString)
        fh.close()
    bucket = sc.get_bucket('my-bucket')
    target_fn = 'test.csv'
    source_fn = fh.name
    destination_blob_name = bucket.blob('test.csv')
    bucket.blob(destination_blob_name).upload_from_file(source_fn)
Can someone please help?
Thank You.
I would suggest uploading the object through the Cloud Storage bucket. Instead of upload_from_file, you need to use upload_from_filename. Your code should look like this:
bucket.blob(destination_blob_name).upload_from_filename(source_fn)
Here are the links to the documentation on how to upload an object to a Cloud Storage bucket and to the client library docs.
EDIT:
The reason you're getting that error is that somewhere in your code you're passing a Blob object rather than a string. Currently your destination variable is a Blob object; change it to a string instead:
destination_blob_name = bucket.blob('test.csv')
to
destination_blob_name = 'test.csv'
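Putting both fixes together, the upload section of the original snippet would look something like the sketch below (assuming the my-bucket bucket and the temporary file written above):
bucket = sc.get_bucket('my-bucket')
destination_blob_name = 'test.csv'  # a plain string, not a Blob object
source_fn = fh.name                 # local path of the temporary file

# upload_from_filename takes a local file path and uploads its contents
bucket.blob(destination_blob_name).upload_from_filename(source_fn)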

Perform Text Sentiment analysis and Keyphrase extraction from excel and store in Azure Blob storage

I want to perform sentiment analysis and keyphrase extraction on text data stored in an Excel format. The sentiments and the extracted keyphrases also need to be appended to the same Excel file, and the final file needs to be stored in Azure Blob storage. Finally, this needs to be made into a Flask app. I would be grateful if anyone can help me with this. Thanks in advance.
The scope of your question is quite wide, so I wrote a simple demo for you.
Just try the code below to read data from a .csv, run sentiment analysis on it, write the results back to the .csv and upload it to blob storage; the only thing you need is to integrate the code with your Flask app:
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient
from azure.storage.blob import BlobClient
import pandas as pd

region = ''
key = ''
excelFilePath = "<local file path>/test.csv"
storageConnStr = '<storage conn str>'
containerName = '<container name>'
destBlob = 'test-upload.csv'

# Read the source file and pull out the text column
csv = pd.read_csv(excelFilePath)
data = csv['text']
documents = data.array

blob = BlobClient.from_connection_string(storageConnStr, containerName, destBlob)

credential = AzureKeyCredential(key)
text_analytics_client = TextAnalyticsClient(endpoint="https://" + region + ".api.cognitive.microsoft.com/", credential=credential)

# Analyze the sentiment of each row and append the results as a new column
response = text_analytics_client.analyze_sentiment(documents, language="en")
sentiments = [res.sentiment for res in response]
csv.insert(1, "sentiment", sentiments)

# Write the updated file back to disk and upload it to blob storage
csv.to_csv(excelFilePath, index=False)
blob.upload_blob(open(excelFilePath, 'rb').read())
Result: after running, the sentiment column is appended to the .csv and the file is uploaded to the storage container.
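The question also asks for keyphrase extraction. The same client exposes extract_key_phrases, so a minimal, untested extension of the demo (reusing the csv DataFrame and text_analytics_client from above) could append a second column before writing and uploading:
# Extract key phrases for each row and append them as another column
keyphrase_response = text_analytics_client.extract_key_phrases(documents, language="en")
keyphrases = [", ".join(res.key_phrases) for res in keyphrase_response]
csv.insert(2, "key_phrases", keyphrases)

# overwrite=True because the demo above already uploaded this blob
csv.to_csv(excelFilePath, index=False)
blob.upload_blob(open(excelFilePath, 'rb').read(), overwrite=True)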

Azure Function - Pandas dataframe to Excel, write to outputBlob stream

I am trying to write a DataFrame to an outputBlob from an Azure Function. I'm having trouble figuring out which io stream to use.
My function looks like this:
import logging
import io
import xlrd
import pandas as pd
import azure.functions as func

def main(myblob: func.InputStream, outputBlob: func.Out[func.InputStream]):
    logging.info(f"Python blob trigger function processed blob \n"
                 f"Name: {myblob.name}\n"
                 f"Blob Size: {myblob.length} bytes")
    input_file = xlrd.open_workbook(file_contents=myblob.read())
    df = pd.read_excel(input_file)
    if not df.empty:
        output = io.BytesIO()
        outputBlob.set(df.to_excel(output))
How do we save the DataFrame to a stream that is recognisable by the Azure Function to write the excel to a Storage Container?
If you want to save a DataFrame as Excel to Azure Blob Storage, please refer to the following example.
SDK
azure-functions==1.3.0
numpy==1.19.0
pandas==1.0.5
python-dateutil==2.8.1
pytz==2020.1
six==1.15.0
xlrd==1.2.0
XlsxWriter==1.2.9
Code
import logging
import io
import xlrd
import pandas as pd
import xlsxwriter
import azure.functions as func
async def main(myblob: func.InputStream, outputblob: func.Out[func.InputStream]):
    logging.info(f"Python blob trigger function processed blob \n"
                 f"Name: {myblob.name}\n")
    input_file = xlrd.open_workbook(file_contents=myblob.read())
    df = pd.read_excel(input_file)
    if not df.empty:
        xlb = io.BytesIO()
        writer = pd.ExcelWriter(xlb, engine='xlsxwriter')
        df.to_excel(writer, index=False)
        writer.save()
        xlb.seek(0)
        outputblob.set(xlb)
        logging.info("OK")
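One portability note: writer.save() matches the pinned pandas==1.0.5 above, but it was removed in later pandas releases. With a newer pandas, the equivalent writing step can use ExcelWriter as a context manager, which closes the writer and finalizes the workbook (a sketch of just that part):
xlb = io.BytesIO()
# The context manager closes the writer and writes the workbook into the buffer
with pd.ExcelWriter(xlb, engine='xlsxwriter') as writer:
    df.to_excel(writer, index=False)
xlb.seek(0)
outputblob.set(xlb)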

GCS - Read a text file from Google Cloud Storage directly into python

I feel kind of stupid right now. I have been reading numerous documentations and stackoverflow questions but I can't get it right.
I have a file on Google Cloud Storage. It is in a bucket 'test_bucket'. Inside this bucket there is a folder, 'temp_files_folder', which contains two files, a .txt file named 'test.txt' and a .csv file named 'test.csv'. There are two files simply because I tried using both, but the result is the same either way.
The content in the files is
hej
san
and I am hoping to read it into python the same way I would do on a local with
textfile = open("/file_path/test.txt", 'r')
times = textfile.read().splitlines()
textfile.close()
print(times)
which gives
['hej', 'san']
I have tried using
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('test_bucket')
blob = bucket.get_blob('temp_files_folder/test.txt')
print(blob.download_as_string)
but it gives the output
<bound method Blob.download_as_string of <Blob: test_bucket, temp_files_folder/test.txt>>
How can I get the actual string(s) in the file?
download_as_string is a method; you need to call it.
print(blob.download_as_string())
More likely, you want to assign it to a variable so that you download it once and can then print it and do whatever else you want with it:
downloaded_blob = blob.download_as_string()
print(downloaded_blob)
do_something_else(downloaded_blob)
The method download_as_string() will read in the content as bytes.
Find below an example that processes a .csv file.
import csv
from io import StringIO
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket(YOUR_BUCKET_NAME)
blob = bucket.blob(YOUR_FILE_NAME)
blob = blob.download_as_string()
blob = blob.decode('utf-8')
blob = StringIO(blob)  # wrap the decoded string in a file-like object
names = csv.reader(blob)  # then use the csv library to read the content
for name in names:
    print(f"First Name: {name[0]}")
According to the documentation (https://googleapis.dev/python/storage/latest/blobs.html), as of the time of writing (2021/08) the download_as_string method is a deprecated alias for the download_as_bytes method, which, as suggested by the name, returns a bytes object.
You can instead use the download_as_text method to return a str object.
For instance, to download the file MYFILE from the bucket MYBUCKET and store it as a utf-8 encoded string:
from google.cloud.storage import Client
client = Client()
bucket = client.get_bucket(MYBUCKET)
blob = bucket.get_blob(MYFILE)
downloaded_file = blob.download_as_text(encoding="utf-8")
You can then also use this to read different file formats. For json, replace the last line with:
import json
downloaded_json_file = json.loads(blob.download_as_text(encoding="utf-8"))
For yaml files, replace the last line with:
import yaml
downloaded_yaml_file = yaml.safe_load(blob.download_as_text(encoding="utf-8"))
DON'T USE: blob.download_as_string()
USE: blob.download_as_text()
blob.download_as_text() does indeed return a string.
blob.download_as_string() is deprecated and returns a bytes object instead of a string object.
This works when reading a docx / text file:
from google.cloud import storage
# create storage client
storage_client = storage.Client.from_service_account_json('**PATH OF JSON FILE**')
bucket = storage_client.get_bucket('**BUCKET NAME**')
# get bucket data as blob
blob = bucket.blob('**SPECIFYING THE DOXC FILENAME**')
downloaded_blob = blob.download_as_string()
downloaded_blob = downloaded_blob.decode("utf-8")
print(downloaded_blob)
