I want to save a Dataframe (pyspark.pandas.Dataframe) as an Excel file on the Azure Data Lake Gen2 using Azure Databricks in Python.
I've switched to the pyspark.pandas.Dataframe because it is the recommended one since Spark 3.2.
There's a method called to_excel (here the doc) that allows to save a file to a container in ADL but I'm facing problems with the file system access protocols.
From the same class I use the methods to_csv and to_parquet using abfss and I'd like to use the same for the excel.
So when I try so save it using:
import pyspark.pandas as ps
# Omit the df initialization
file_name = "abfss://CONTAINER#SERVICEACCOUNT.dfs.core.windows.net/FILE.xlsx"
sheet = "test"
df.to_excel(file_name, test)
I get the error from fsspec:
ValueError: Protocol not known: abfss
Can someone please help me?
Thanks in advance!
The pandas dataframe does not support the protocol. It seems on Databricks you can only access and write the file on abfss via Spark dataframe. So, the solution is to write file locally and manually move to abfss. See this answer here.
You can not save it directly but you can have it as its stored in temp location and move it to your directory. My code piece is:
import xlsxwriter import pandas as pd1
workbook = xlsxwriter.Workbook('data_checks_output.xlsx')
worksheet = workbook.add_worksheet('top_rows')
Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd1.ExcelWriter('data_checks_output.xlsx', engine='xlsxwriter')
output = dataset.limit(10)
output = output.toPandas()
output.to_excel(writer, sheet_name='top_rows',startrow=row_number)
writer.save()
After write.save
run below code, which is nothing but moves temp location of file to your desginated location .
Below code does the work of moving files.
%sh
sudo mv file_name.xlsx /dbfs/mnt/fpmount/
Related
I have an excel file stored in a azure blob storage container. I need this file in order to generate another excel file based on it. The issue I keep running into is when using openpyxl I use the following line of code to load the workbook:
wb = load_workbook(filename=)
I am not sure what to put in the area filename=. I thought it might need the URL of the excel blob inside the container.
That URL looks something like this: 'https://mystorage.blob.core.cloudapi.net/excel-files/myexcelfile.xlsx'
When I ran that code inside my azure notebook it throws this error:
FileNotFoundError: [Errno 2]: No such file or directory: 'https://mystorage.blob.core.cloudapi.net/excel-files/myexcelfile.xlsx'
Other solutions I read online said to use local memory, but this option will not work. I need to be able to do everything within the Azure ecosystem. If anyone knows how to load a workbook using openpyxl or another way for a file that exists inside an azure storage container I could use your assistance.
I am able to access the excel files and load them as panda df's by using the connection string, container name, and blob name and then connecting to the container_client and downloading like so:
conn_str = "abc123"
container = "a_container"
xl_blob = "a_xl"
download_blob = container_client.download_blob(a_xl)
df = pd.read_excel(download_blob.readall(), index=1)
Through this I can see the excel as a pandas df, but loading it through the workbook is tricky.
When I used that same "download_blob" variable in place of the filename= part, it throws the error:
TypeError: expected str, bytes or os.PathLike object, not StorageStreamDownloader
Thanks
download_blob is of type StorageStreamDownloader, so passing that into load_workbook is not going to work. Even passing download_blob.readall(), which is of type bytes, into load_workbook is not going to work. You need to write the bytes into a io.BytesIO, which is a file-like object, and pass that into load_workbook.
A io.BytesIO object is like a file that exists in memory only.
Something like this should work:
import io
import openpyxl
conn_str = "abc123"
container = "a_container"
xl_blob = "a_xl"
download_blob = container_client.download_blob(a_xl)
file = io.BytesIO(download_blob.readall())
wb = openpyxl.load_workbook(file)
ws = wb.active
In my Azure Function, I'm currently writing data frame into CSV using df.to_csv() and passing this object to the append blob function.
output_data = df.to_csv(index=False, encoding="utf=8")
blob_client.append_block(output)
Now I want to store data into .xlsx format and then add auto filter to excel file using xlsxwriter
This is what, I tried but was unable to understand what is wrong here
writer = io.BytesIO()
df.to_excel(writer, index=False)
writer.seek(0)
blob_client.upload_blob(writer.getvalue())
I have already tried the following solution but it didn't work for me.
Either file is created but empty or file not readable in excel apps is happening
Azure Function - Pandas Dataframe to Excel is Empty
Writing pandas dataframe as xlsx file to an azure blob storage without creating a local file
I managed to read data from a Google Sheet file using this method:
# ACCES GOOGLE SHEET
googleSheetId = 'myGoogleSheetId'
workSheetName = 'mySheetName'
URL = 'https://docs.google.com/spreadsheets/d/{0}/gviz/tq?tqx=out:csv&sheet={1}'.format(
googleSheetId,
workSheetName
)
df = pd.read_csv(URL)
However, after generating a pd.DataFrame that fetches info from the web using selenium, I need to append that data to the Google Sheet.
Question: Do you know a way to export that DataFrame to Google Sheets?
Yes, there is a module called "gspread". Just install it with pip and import it into your script.
Here you can find the documentation:
https://gspread.readthedocs.io/en/latest/
In particular their section on Examples of gspread with pandas.
worksheet.update([dataframe.columns.values.tolist()] + dataframe.values.tolist())
This might be a little late answer to the original author but will be of a help to others. Following is a utility function which can help write any python pandas dataframe to gsheet.
import pygsheets
def write_to_gsheet(service_file_path, spreadsheet_id, sheet_name, data_df):
"""
this function takes data_df and writes it under spreadsheet_id
and sheet_name using your credentials under service_file_path
"""
gc = pygsheets.authorize(service_file=service_file_path)
sh = gc.open_by_key(spreadsheet_id)
try:
sh.add_worksheet(sheet_name)
except:
pass
wks_write = sh.worksheet_by_title(sheet_name)
wks_write.clear('A1',None,'*')
wks_write.set_dataframe(data_df, (1,1), encoding='utf-8', fit=True)
wks_write.frozen_rows = 1
Steps to get service_file_path, spreadsheet_id, sheet_name:
Click Sheets API | Google Developers
Create new project under Dashboard (provide relevant project name and other required information)
Go to Credentials
Click on “Create Credentials” and Choose “Service Account”. Fill in all required information viz. Service account name, id, description et. al.
Go to Step 2 and 3 and Click on “Done”
Click on your service account and Go to “Keys”
Click on “Add Key”, Choose “Create New Key” and Select “Json”. Your Service Json File will be downloaded. Put this under your repo folder and path to this file is your service_file_path.
In that Json, “client_email” key can be found.
Create a new google spreadsheet. Note the url of the spreadsheet.
Provide an Editor access to the spreadsheet to "client_email" (step 8) and Keep this service json file while running your python code.
Note: add json file to .gitignore without fail.
From url (e.g. https://docs.google.com/spreadsheets/d/1E5gTTkuLTs4rhkZAB8vvGMx7MH008HjW7YOjIOvKYJ1/) extract part between /d/ and / (e.g. 1E5gTTkuLTs4rhkZAB8vvGMx7MH008HjW7YOjIOvKYJ1 in this case) which is your spreadsheet_id.
sheet_name is the name of the tab in google spreadsheet. By default it is "Sheet1" (unless you have modified it.
Google Sheets has a nice api you can use from python (see the docs here), which allows you to append single rows or entire batch updates to a Sheet.
Another way of doing it without that API would be to export the data to a csv file using the python csv library, and then you can easily import that csv file into a Google Sheet.
I am facing the problem, that I would like to push data (one large dataframe and one image) from my python web app (running on Tornado Webserver and Ubuntu) into a spreadsheet, calculate, save as pdf and the deliver to the frontend.
I took a look at several libs like openpyxl for writing Sheets in MS Excel, but that would solve just one part. I was thinking about using LibreOffice and pyoo, but it seems that I need the same python version on my backend as shipped with LibeOffice when importing pyuno.
Does somebody has solved a similar issue and have a recommendation how to solve this?
Thanks
I came up to a let's say not pretty, but rare solution that works very flexible for me.
use openpyxl to open an existing Excel workbook that includes layout (Template)
insert the dataframe into a separate sheet in that workbook
use openpyxl to save as temporary_file.xlsx
call LibeOffice with --headless --convert-to pdf temporary_file.xlsx
While executing the last call, all integrated formulas are recalculated/updated and the pdf is created (you have to configure calc so that auto calc is enabled when files are opened)
deliver pdf to frontend or process as you like
delete temporary_file.xlsx
import openpyxl
import pandas as pd
from subprocess import call
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
now = datetime.datetime.now().strftime("%Y%m%d_%H%M_%f")
wb_template_name = 'Template.xlsx'
wb_temp_name = now + wb_template_name
wb = openpyxl.load_workbook(wb_template_name)
ws = wb['dataframe_sheet']
pdf_convert_cmd = 'soffice --headless --convert-to pdf ' + wb_temp_name
for r in dataframe_to_rows(df, index=True, header=True):
ws.append(r)
wb.save(wb_temp_name)
call(pdf_convert_cmd, shell=True)
The reason why I'm doing this, is that I would like to be able to style the layout of the pdf independently from the data. I use named ranges or lookups that are referenced to the separate dataframe-sheet in excel.
I didn't try the image insertion yet, but this should work similar. I think there could be a way to increase the performance while simply dump the dataframe into the xlsx file (which is a zipped file of xmls), so that you don't need openpyxl.
I am trying to add data to an existing excel file, the problem I am facing is that the data is getting imported but the equation and the format is being deleted in original file.
I attached my code below
import xlwt
import xlrd
from xlutils.copy import copy
#open the excel file
rb=xlrd.open_workbook('Voltage_T.xlsx')
#make a writable copy of the opened excel file
wb=copy(rb)
#read the first sheet to write to within the writable copy
w_sheet=wb.get_sheet(0)
#write or modify the value at 2nd row first column
w_sheet.write(0,1,'WWW.GOOGLE.COM')
#the last step saving the work book
wb.save('Voltage_WW.xls')
You need to set formatting_info to true
rb=xlrd.open_workbook('Voltage_T.xlsx', formatting_info = True)
However xlrd doesn't support xlsx with formatting_info at the moment. So if you really have to use .xlsx you will need another library.
I didn't used it myself so I can't tell you if it's a good library but thanks to a quick search on google XlsxWriter seems to answer your needs.