Using openpyxl with lambda - python

Python rookie here. I have a requirement that I have been researching for a couple of days now. It goes as follows.
I have an S3 location with a few Excel sheets containing unformatted data. I am writing a Lambda function to format them and convert them to CSV. I already have the code for this, but it works on my local machine, where I pick the Excel files from a local directory, format/transform them, and put them in a target folder. We are using the openpyxl package for the transformation. Now I am migrating this to AWS, and there comes the problem: instead of local directories, the source and target will be S3 locations.
The data transformation logic is far too lengthy and I really don't want to rewrite it.
Is there a way I can handle these Excel files just like we do on a local machine?
For instance,
wb = openpyxl.load_workbook(r'C:\User\test.xlsx', data_only=True)
How can I recreate this statement, or what it does, in a Lambda with Python?

You can do this with BytesIO like so:
from io import BytesIO
import openpyxl

file = readS3('test.xlsx')  # load file bytes with boto3
wb = openpyxl.load_workbook(BytesIO(file), data_only=True)
With readS3() implemented, for example, like this:
import boto3

bucket = 'my-bucket'  # your bucket name

def readS3(file):
    s3 = boto3.client('s3')
    s3_data = s3.get_object(Bucket=bucket, Key=file)
    return s3_data['Body'].read()
Configure boto3 as described here: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html
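Putting the pieces together, the whole Lambda flow can be sketched as below. The conversion helpers are the only part exercised here; the S3 read/write calls at the bottom are illustrative and assume the readS3() helper above plus a hypothetical target bucket name:

```python
import csv
import io

import openpyxl


def sheet_to_csv(ws) -> str:
    """Serialize an openpyxl worksheet to a CSV string (values only)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in ws.iter_rows(values_only=True):
        writer.writerow(row)
    return buf.getvalue()


def xlsx_bytes_to_csv(data: bytes) -> str:
    """Load workbook bytes (e.g. fetched from S3) and convert the active sheet."""
    wb = openpyxl.load_workbook(io.BytesIO(data), data_only=True)
    return sheet_to_csv(wb.active)


# In the Lambda handler you would combine this with the S3 helpers, e.g.:
#   csv_text = xlsx_bytes_to_csv(readS3('test.xlsx'))
#   s3.put_object(Bucket='my-target-bucket', Key='test.csv', Body=csv_text)
```

Because everything stays in memory, the existing local transformation logic can be reused unchanged between load_workbook and the final put_object.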

Related

What is the correct process for loading an excel workbook from blob storage?

I have an Excel file stored in an Azure blob storage container. I need this file in order to generate another Excel file based on it. The issue I keep running into is with the following line of code, which I use with openpyxl to load the workbook:
wb = load_workbook(filename=)
I am not sure what to put in the area filename=. I thought it might need the URL of the excel blob inside the container.
That URL looks something like this: 'https://mystorage.blob.core.cloudapi.net/excel-files/myexcelfile.xlsx'
When I run that code inside my Azure notebook, it throws this error:
FileNotFoundError: [Errno 2]: No such file or directory: 'https://mystorage.blob.core.cloudapi.net/excel-files/myexcelfile.xlsx'
Other solutions I read online said to use local memory, but this option will not work. I need to be able to do everything within the Azure ecosystem. If anyone knows how to load a workbook using openpyxl, or another way, for a file that exists inside an Azure storage container, I could use your assistance.
I am able to access the Excel files and load them as pandas DataFrames by using the connection string, container name, and blob name, then connecting to the container_client and downloading like so:
conn_str = "abc123"
container = "a_container"
xl_blob = "a_xl"
download_blob = container_client.download_blob(xl_blob)
df = pd.read_excel(download_blob.readall(), index=1)
Through this I can see the excel as a pandas df, but loading it through the workbook is tricky.
When I used that same "download_blob" variable in place of the filename= part, it throws the error:
TypeError: expected str, bytes or os.PathLike object, not StorageStreamDownloader
Thanks
download_blob is of type StorageStreamDownloader, so passing it into load_workbook is not going to work. Even passing download_blob.readall(), which is of type bytes, into load_workbook is not going to work. You need to write the bytes into an io.BytesIO, which is a file-like object, and pass that into load_workbook.
An io.BytesIO object is like a file that exists only in memory.
Something like this should work:
import io
import openpyxl

conn_str = "abc123"
container = "a_container"
xl_blob = "a_xl"
# container_client created from conn_str and container, as in the question
download_blob = container_client.download_blob(xl_blob)
file = io.BytesIO(download_blob.readall())
wb = openpyxl.load_workbook(file)
ws = wb.active
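Going the other direction, writing the generated workbook back into the container, follows the same in-memory pattern. Only the in-memory round trip is exercised here; the upload_blob call and the "generated.xlsx" name are assumptions based on the question's setup:

```python
import io

import openpyxl


def workbook_to_bytes(wb) -> bytes:
    """Save an openpyxl workbook into memory instead of onto disk."""
    buf = io.BytesIO()
    wb.save(buf)
    return buf.getvalue()


# Hypothetical upload, mirroring the download in the answer above:
#   container_client.upload_blob(name="generated.xlsx",
#                                data=workbook_to_bytes(wb),
#                                overwrite=True)
```

This keeps everything within the Azure ecosystem, with no local file ever written.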

Databricks - pyspark.pandas.Dataframe.to_excel does not recognize abfss protocol

I want to save a DataFrame (pyspark.pandas.DataFrame) as an Excel file on Azure Data Lake Gen2 using Azure Databricks in Python.
I've switched to pyspark.pandas.DataFrame because it has been the recommended one since Spark 3.2.
There's a method called to_excel (here the doc) that allows saving a file to a container in ADL, but I'm facing problems with the file-system access protocols.
From the same class I use the methods to_csv and to_parquet with abfss, and I'd like to do the same for the Excel file.
So when I try to save it using:
import pyspark.pandas as ps

# Omit the df initialization
file_name = "abfss://CONTAINER@SERVICEACCOUNT.dfs.core.windows.net/FILE.xlsx"
sheet = "test"
df.to_excel(file_name, sheet_name=sheet)
I get the error from fsspec:
ValueError: Protocol not known: abfss
Can someone please help me?
Thanks in advance!
pandas DataFrames do not support the abfss protocol. It seems that on Databricks you can only access and write files on abfss via Spark DataFrames, so the solution is to write the file locally and move it to abfss manually. See this answer here.
You cannot save it directly, but you can write it to a temp location and then move it to your directory. My code is:
import xlsxwriter
import pandas as pd1

workbook = xlsxwriter.Workbook('data_checks_output.xlsx')
worksheet = workbook.add_worksheet('top_rows')

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd1.ExcelWriter('data_checks_output.xlsx', engine='xlsxwriter')
row_number = 0  # start writing at the first row
output = dataset.limit(10)
output = output.toPandas()
output.to_excel(writer, sheet_name='top_rows', startrow=row_number)
writer.save()
After writer.save(), run the code below, which simply moves the file from its temp location to your designated location:
%sh
sudo mv file_name.xlsx /dbfs/mnt/fpmount/
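The same move can be done from Python instead of a %sh cell. A generic stdlib sketch, where the /dbfs mount directory is just illustrative (here stood in for by temp directories so the round trip can run anywhere):

```python
import shutil
import tempfile
from pathlib import Path


def move_to_mount(local_path: str, mount_dir: str) -> str:
    """Move a locally written file into a target directory
    (e.g. /dbfs/mnt/fpmount on Databricks) and return the new path."""
    dest = Path(mount_dir) / Path(local_path).name
    shutil.move(local_path, str(dest))
    return str(dest)


# Illustrative round trip with temp dirs standing in for the driver disk and the mount:
src_dir = tempfile.mkdtemp()
dst_dir = tempfile.mkdtemp()
local_file = Path(src_dir) / "data_checks_output.xlsx"
local_file.write_bytes(b"stub workbook bytes")
moved = move_to_mount(str(local_file), dst_dir)
```

On Databricks you would pass the real mount point as mount_dir, mirroring the `sudo mv` above.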

Reading excel files from "input" blob storage container and exporting to csv in "output" container with python

I'm trying to develop a Python script to read an .xlsx file from a blob storage container called "source", convert it to .csv, and store it in a new container (I'm testing the script locally; if it works, I will include it in an ADF pipeline). So far I have managed to access the blob storage, but I'm having problems reading the file content.
from azure.storage.blob import BlobServiceClient, ContainerClient, BlobClient
import pandas as pd
conn_str = "DefaultEndpointsProtocol=https;AccountName=XXXXXX;AccountKey=XXXXXX;EndpointSuffix=core.windows.net"
container = "source"
blob_name = "prova.xlsx"
container_client = ContainerClient.from_connection_string(
    conn_str=conn_str,
    container_name=container
)
# Download blob as StorageStreamDownloader object (stored in memory)
downloaded_blob = container_client.download_blob(blob_name)
df = pd.read_excel(downloaded_blob)
print(df)
I get following error:
ValueError: Invalid file path or buffer object type: <class 'azure.storage.blob._download.StorageStreamDownloader'>
I tried with a .csv file as input and writing the parsing code as follows:
df = pd.read_csv(StringIO(downloaded_blob.content_as_text()) )
and it works.
Any suggestion on how to modify the code so that the excel file becomes readable?
I summarize the solution below.
When we use the method pd.read_excel() from pandas, we need to provide bytes (or a file-like object) as input. But when we use download_blob to download the Excel file from Azure blob storage, we get an azure.storage.blob.StorageStreamDownloader. So we need to use the method readall() or content_as_bytes() to convert it to bytes. For more details, please refer to the respective documentation.
Change
df = pd.read_excel(downloaded_blob)
to
df = pd.read_excel(downloaded_blob.content_as_bytes())
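The underlying distinction, text streams for read_csv versus byte streams for read_excel, can be sketched with the stdlib alone. Here plain bytes stand in for the downloaded blob content:

```python
import csv
import io

# Stand-in for downloaded_blob.content_as_bytes()
raw = b"col1,col2\n1,2\n"

# CSV parsers want text, so decode the bytes into a StringIO first
# (this is why StringIO(downloaded_blob.content_as_text()) worked)...
rows = list(csv.reader(io.StringIO(raw.decode("utf-8"))))

# ...whereas binary formats like .xlsx want the bytes themselves,
# e.g. pd.read_excel(downloaded_blob.content_as_bytes()).
```

Passing the StorageStreamDownloader object itself satisfies neither case, hence the ValueError.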

How to load pickled dataframes from GCS into App Engine

I'm trying to load a pickled pandas dataframe from Google Cloud Storage into App Engine.
I have been using blob.download_to_file() to read the byte stream into pandas; however, I encounter the following error:
UnpicklingError: invalid load key, 'm'
I have tried seeking to the beginning, to no avail, and am pretty sure something fundamental is missing from my understanding.
When attempting to pass an open file object and read from there, I get an UnsupportedOperation: write error.
from io import BytesIO

from google.cloud import storage
from google.oauth2 import service_account

def get_byte_fileobj(project, bucket, path) -> BytesIO:
    blob = _get_blob(bucket, path, project)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream

def _get_blob(bucket_name, path, project):
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob

fileobj = get_byte_fileobj(projectid, 'backups', 'Matches/Matches.pickle')
pd.read_pickle(fileobj)
Ideally pandas would just read from pickle since all of my GCS backups are in that format, but I'm open to suggestions.
The pandas.read_pickle() method takes a file path string as its argument, not a file handler/object:
pandas.read_pickle(path, compression='infer')
Load pickled pandas object (or any object) from file.
path : str
File path where the pickled object will be loaded.
If you're in the 2nd generation standard or the flexible environment you could try to use a real /tmp file instead of BytesIO.
Otherwise you'd have to figure out another method of loading the data into pandas that supports a file object/descriptor. The general approach is described in How to restore Tensorflow model from Google bucket without writing to filesystem? (the context is different, but the general idea is the same).
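One such method, if a /tmp file is not an option, is to bypass pandas and unpickle the BytesIO directly with the stdlib, since pickle.load accepts any file object. A sketch with an in-memory round trip standing in for the GCS download (for an actual DataFrame pickle this generally restores the same object read_pickle would):

```python
import io
import pickle


def unpickle_fileobj(fileobj):
    """Unpickle from a file-like object, e.g. the BytesIO from get_byte_fileobj()."""
    fileobj.seek(0)  # rewind: download_to_file leaves the cursor at the end
    return pickle.load(fileobj)


# In-memory round trip standing in for blob.download_to_file():
stream = io.BytesIO(pickle.dumps({"team": "Matches", "rows": 3}))
restored = unpickle_fileobj(stream)
```

The explicit seek(0) also explains the asker's "invalid load key" error: without rewinding, pickle reads from the end of the stream.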

Save pandas dataframe to s3

I want to save BIG pandas DataFrames to S3 using boto3.
Here is what I am doing now:
import io

import boto3

s3 = boto3.client('s3')
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)
s3.put_object(Bucket="bucket-name", Key="file-name", Body=csv_buffer.getvalue())
This generates a file with the following permissions:
---------- 1 root root file-name
How can I change that so that the file is owned by the user that executes the script, i.e. user "ubuntu" on an AWS instance?
This is what I want:
-rw-rw-r-- 1 ubuntu ubuntu file-name
Another thing: has anyone tried this method with big DataFrames (millions of rows), and does it perform well?
How does it compare to saving the file locally and using the boto3 file-copy method?
Thanks a lot.
AWS S3 requires multipart upload for files above 5 GB.
You can implement it with boto3 using a reusable configuration object from the TransferConfig class:
import boto3
from boto3.s3.transfer import TransferConfig

# default multipart_threshold is 8 MB (the size above which multipart upload is used)
config = TransferConfig(multipart_threshold=5 * 1024 ** 3)  # 5 GB

# Perform the transfer
s3 = boto3.client('s3')
s3.upload_file('FILE_NAME', 'BUCKET_NAME', 'OBJECT_NAME', Config=config)
Boto3 documentation here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html
There are also some limits documented here: https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html
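The limits page boils down to a bit of arithmetic: each part except the last must be at least 5 MiB, and a multipart upload may have at most 10,000 parts. A small stdlib helper to sanity-check a planned part size before uploading (the limit constants are taken from the linked AWS page):

```python
import math

MIN_PART = 5 * 1024 ** 2   # 5 MiB minimum part size (except the last part)
MAX_PARTS = 10_000         # maximum number of parts per multipart upload


def plan_parts(total_size: int, part_size: int) -> int:
    """Return how many parts a multipart upload of total_size bytes would need,
    raising ValueError if part_size violates S3's documented limits."""
    if part_size < MIN_PART:
        raise ValueError("part size is below the 5 MiB minimum")
    n = math.ceil(total_size / part_size)
    if n > MAX_PARTS:
        raise ValueError("too many parts; increase part_size")
    return n
```

For example, a 10 GiB object with boto3's default 8 MiB chunks needs 1,280 parts, comfortably under the limit; only very large objects force a bigger multipart_chunksize.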
