Reading .xlsx format in python

I have to read an .xlsx file every 10 minutes in Python.
What is the most efficient way to do this?
I've tried using xlrd, but it doesn't read .xlsx. According to the documentation it does, but I can't get it to work: I keep getting "Unsupported format, or corrupt file" exceptions.
What is the best way to read xlsx?
I need to read comments in cells too.

xlrd hasn't yet released a version that reads xlsx. Until then, you can use openpyxl, a package built by Eric Gazoni that reads xlsx files and does limited writing of them.

Use openpyxl. Some basic examples:
import openpyxl
# Open the workbook; data_only=True returns stored values instead of formulas
wb = openpyxl.load_workbook(filename='example.xlsx', data_only=True)
# Get all sheet names (get_sheet_names() is deprecated in favour of sheetnames)
a_sheet_names = wb.sheetnames
print(a_sheet_names)
# Get a sheet object by name (get_sheet_by_name() is deprecated in favour of indexing)
o_sheet = wb["Sheet1"]
print(o_sheet)
# Get cell values
o_cell = o_sheet['A1']
print(o_cell.value)
o_cell = o_sheet.cell(row=2, column=1)
print(o_cell.value)
o_cell = o_sheet['H1']
print(o_cell.value)
# Highest row and column that contain data
print(o_sheet.max_row)
print(o_sheet.max_column)

There are multiple ways to read XLSX files using Python. Two are illustrated below. Both require openpyxl, and if you want to parse directly into pandas you also need pandas, e.g. pip install pandas openpyxl
Option 1: pandas direct
Primary use case: load just the data for further processing.
The read_excel() function in pandas would be your best choice. Note that pandas should fall back to openpyxl automatically, but in the event of format issues it's best to specify the engine directly.
import pandas as pd
df_pd = pd.read_excel("path/file_name.xlsx", engine="openpyxl")
Option 2: openpyxl direct
Primary use case: getting or editing specific Excel document elements such as comments (requested by OP), formatting properties or formulas.
Use load_workbook(), then read the comment attribute of each cell, as follows.
from openpyxl import load_workbook
wb = load_workbook(filename = "path/file_name.xlsx")
ws = wb.active
ws["A1"].comment # <- loop through row & columns to extract all comments
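The loop hinted at in the comment above could look like the following sketch. It assumes the workbook is opened in openpyxl's default (not read-only) mode. The helper names collect_comments and read_comments are hypothetical and not part of openpyxl.

```python
from openpyxl import load_workbook


def collect_comments(ws):
    """Return {cell coordinate: comment text} for every commented cell in ws."""
    comments = {}
    for row in ws.iter_rows():
        for cell in row:
            # cell.comment is None when the cell carries no comment
            if cell.comment is not None:
                comments[cell.coordinate] = cell.comment.text
    return comments


def read_comments(path):
    """Open the workbook at path and collect comments from its active sheet."""
    wb = load_workbook(filename=path)
    return collect_comments(wb.active)
```

Calling read_comments("path/file_name.xlsx") would then give a dictionary such as {"A1": "some note"} for a sheet with one commented cell.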

Related

How to split an Excel workbook by worksheet while preserving grouping

I am doing some excel reports for work and am given a book exported from SSRS daily. The book is nicely set up, with groupings applied to every sheet for an effect similar to pivot tables.
However the book comes with 32 sheets, and I eventually need to send out each sheet individually as a distinct report. Right now I am splitting them up manually, but I am wondering if there is a way to automate this while preserving the grouping.
I previously tried something like:
import xlrd
import pandas as pd
targetWorkbook = xlrd.open_workbook(r'report.xlsx', on_demand=True)
xlsxDoc = pd.ExcelFile('report.xlsx')
for sheet in targetWorkbook.sheet_names():
    reportDF = pd.read_excel(xlsxDoc, sheet)
    reportDF.to_excel("report - {}.xlsx".format(sheet))
However, since I'm converting each sheet to a pandas DataFrame, the grouping is lost.
There are multiple ways to read/interact with excel docs in python, but I can't find a clear way to pick out a sheet and save it as its own document without losing the grouping.

This is my full answer. I have used the Worksheets().Move() method. The main idea is to use win32com.client library.
This was tested and works on my Windows 10 system with Excel 2013 installed, and Python 3.7. The grouping format was moved intact with the worksheets. I am still working on getting the looping to work. I will revise my answer again when I get the looping to work.
My example has 3 worksheets, each with different grouping (subtotal) formats.
#
# Refined .Move() method, save new file using Active Worksheet property.
#
import win32com.client as win32
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb0 = excel.Workbooks.Open(r'C:\python\so\original.xlsx')
excel.Visible = True
# Move sheet1.
wb0.Worksheets(1).Move()
excel.Application.ActiveWorkbook.SaveAs(r'C:\python\so\sheet1.xlsx')
# Move sheet2, which is now the front sheet.
wb0.Worksheets(1).Move()
excel.Application.ActiveWorkbook.SaveAs(r'C:\python\so\sheet2.xlsx')
# Save single remaining sheet as sheet3.
wb0.SaveAs(r'C:\python\so\sheet3.xlsx')
wb0.Close()
excel.Application.Quit()
You would also need to install pywin32, which is not a standard library item.
https://github.com/mhammond/pywin32
pip install pywin32
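The looping that the answer leaves unfinished could be sketched as below. This is an untested sketch under several assumptions: it uses Worksheets(i).Copy() rather than Move() so the original workbook stays unchanged, it assumes that copying keeps the grouping the same way moving did, and the names report_filename and split_workbook are hypothetical.

```python
import os


def report_filename(prefix, sheet_name):
    """Build the output file name for one sheet (hypothetical naming scheme)."""
    return "{} - {}.xlsx".format(prefix, sheet_name)


def split_workbook(source_path, output_dir, prefix="report"):
    """Save every worksheet of source_path as its own workbook in output_dir."""
    # Imported here because pywin32 is only available on Windows.
    import win32com.client as win32

    excel = win32.gencache.EnsureDispatch('Excel.Application')
    wb0 = excel.Workbooks.Open(os.path.abspath(source_path))
    try:
        # COM collections are 1-based.
        for index in range(1, wb0.Worksheets.Count + 1):
            sheet = wb0.Worksheets(index)
            target = os.path.join(os.path.abspath(output_dir),
                                  report_filename(prefix, sheet.Name))
            # Copy() without arguments places the copy in a new workbook,
            # which then becomes the active workbook.
            sheet.Copy()
            new_wb = excel.ActiveWorkbook
            new_wb.SaveAs(target)
            new_wb.Close(False)
    finally:
        wb0.Close(False)  # close the source without saving changes
        excel.Quit()
```

A call such as split_workbook(r'C:\python\so\original.xlsx', r'C:\python\so') would then write one file per sheet, named after the sheet.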

Python transfer excel formatting between two Excel documents

I'd like to copy the formatting between two Excel sheets in python.
Here is the situation:
I have a script that effectively "alters" (i.e. overwrites) an Excel file by opening it using pd.ExcelWriter, then updates values in the rows. Finally, the file is overwritten using ExcelWriter.
The Excel file is printed/shared/read by humans between updates done by the code. Humans will do things like change number formatting, turn on/off word wrap, and alter column widths.
My goal is the code updates should only alter the content of the file, not the formatting of the columns.
Is there a way I can read/store/write the sheet format within python so the output file has the same column formatting as the input file?
Here's the basic idea of what I am doing right now:
df_in= pd.read_excel("myfile.xlsx")
# Here is where I'd like to read in format of the first sheet of this file
xlwriter = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
df_out = do_update(df_in)
df_out.to_excel(xlwriter,'sheet1')
# Here is where I'd like to apply the format I read earlier to the sheet
xlwriter.save()
Note: I have played with xlsxwriter.set_column and add_format. As far as I can tell, these don't help me read the format from the current file

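Older versions of pandas used the xlrd package to parse Excel documents into DataFrames; current versions use xlrd only for .xls files and openpyxl for .xlsx.
Interoperability between xlrd and other xlsx packages can be problematic when it comes to the data structures used to represent formatting information.
I suggest using openpyxl as your engine when instantiating pandas.ExcelWriter. It comes with reader and writer classes that are interoperable.
import io
import pandas as pd
from openpyxl.styles.stylesheet import apply_stylesheet
from openpyxl.reader.excel import ExcelReader
# Read the data and keep a copy of the original archive in memory
# before the writer reopens the same path for writing.
df_in = pd.read_excel("myfile.xlsx")
with open("myfile.xlsx", "rb") as f:
    xlreader = ExcelReader(io.BytesIO(f.read()), read_only=True)
df_out = do_update(df_in)  # do_update is the function from the question
xlwriter = pd.ExcelWriter('myfile.xlsx', engine='openpyxl')
df_out.to_excel(xlwriter, sheet_name='sheet1')
# Copy the workbook-level stylesheet of the original file into the new book
apply_stylesheet(xlreader.archive, xlwriter.book)
xlwriter.close()  # ExcelWriter.save() was removed in pandas 2.0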
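If the stylesheet alone does not bring back column widths or per-cell settings, a more explicit route is to copy them between worksheets with openpyxl. The sketch below is an assumption about what "column formatting" covers here (column width, number format, alignment including word wrap). The helper name copy_column_layout is hypothetical.

```python
from copy import copy


def copy_column_layout(src_ws, dst_ws):
    """Copy column widths, number formats and alignment from src_ws to dst_ws."""
    # Column widths are stored per column letter on the worksheet.
    for letter, dimension in src_ws.column_dimensions.items():
        dst_ws.column_dimensions[letter].width = dimension.width

    # Number format and alignment (which includes wrap_text) are stored per cell.
    for row in src_ws.iter_rows():
        for cell in row:
            if cell.has_style:
                target = dst_ws[cell.coordinate]
                target.number_format = cell.number_format
                # Style objects are copied before being assigned to another cell.
                target.alignment = copy(cell.alignment)
```

The source worksheet would come from load_workbook() on a copy of the file made before the update, and the destination could be the worksheet that pandas has just written.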

Excel formatting in python without loading workbook

I am trying to format an excel document within python that I am creating in the same script. All of the answers I have found have involved loading an existing workbook into python and formatting from there. In my script, I am currently writing the entire unformatted excel sheet, saving the file, then immediately reloading the document in to python to format. This is the only workaround I can find so that I can have an active sheet.
writer=pd.ExcelWriter(file_name, engine='openpyxl')
writer.save()#saving my file
wb=load_workbook(file_name) #reloading file to format
ws=wb.active
ws.column_dimensions['A'].width=33
ws.column_dimensions['B'].width=16
wb.save(file_name)
This works to change aspects such as column width, but I would like a way to format the page without saving and reloading. Is there a way to get around the need for an active sheet when there is no file_name written yet? I want a way to remove line 2 and 3, however that may be.

The object that pandas creates in ExcelWriter depends on the engine you give it. In this case you are passing "openpyxl", so ExcelWriter makes an openpyxl.Workbook() object. You can create a new workbook in openpyxl yourself using Workbook(), like so:
https://openpyxl.readthedocs.io/en/default/tutorial.html#create-a-workbook
It is created with one active sheet. Basically:
import openpyxl
wb = openpyxl.Workbook()
ws=wb.active
ws.column_dimensions['A'].width=33
ws.column_dimensions['B'].width=16
wb.save(file_name)
...would do the job
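When the data still comes from a DataFrame, the worksheet that pandas creates can be reached through the writer before the file is saved, so no reload is needed. The following is a minimal sketch, assuming the openpyxl engine. The function name write_formatted and its widths parameter are hypothetical.

```python
import pandas as pd


def write_formatted(df, file_name, widths, sheet_name="Sheet1"):
    """Write df to file_name and set column widths in the same pass."""
    with pd.ExcelWriter(file_name, engine="openpyxl") as writer:
        df.to_excel(writer, sheet_name=sheet_name, index=False)
        # writer.sheets maps sheet names to the openpyxl worksheets just written
        ws = writer.sheets[sheet_name]
        for column_letter, width in widths.items():
            ws.column_dimensions[column_letter].width = width
    # Leaving the with-block saves the file once, already formatted.
```

For the widths in the question this would be called as write_formatted(df, file_name, {"A": 33, "B": 16}).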

Your title is misleading: you're working in Pandas and dumping to Excel. Pandas does allow some formatting for this but, because it tries to support different Python libraries (openpyxl, xlsxwriter and xlwt) there are restrictions on this.
For full control openpyxl provides support for Pandas' DataFrame objects: http://openpyxl.readthedocs.io/en/latest/pandas.html
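The DataFrame support mentioned here is the dataframe_to_rows() helper in openpyxl.utils.dataframe. Below is a minimal sketch of how it might be combined with the column widths from the question. The function name dataframe_to_workbook and its widths parameter are hypothetical.

```python
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows


def dataframe_to_workbook(df, widths=None):
    """Return a new openpyxl workbook holding df, with optional column widths."""
    wb = Workbook()
    ws = wb.active
    # The header row comes first, then one row per DataFrame record.
    for row in dataframe_to_rows(df, index=False, header=True):
        ws.append(row)
    for column_letter, width in (widths or {}).items():
        ws.column_dimensions[column_letter].width = width
    return wb
```

The returned workbook can be formatted further and then written once with wb.save(file_name).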

pandas read excel values not formulas

Is there a way to have pandas read in only the values from excel and not the formulas? It reads the formulas in as NaN unless I go in and manually save the excel file before running the code. I am just working with the basic read excel function of pandas,
import pandas as pd
df = pd.read_excel(filename, sheet_name="Sheet1")
This will read the values if I have gone in and saved the file prior to running the code. But after running the code to update a new sheet, if I don't go in and save the file after doing that and try to run this again, it will read the formulas as NaN instead of just the values. Is there a work around that anyone knows of that will just read values from excel with pandas?

That is strange. The normal behaviour of pandas is to read values, not formulas. The problem is likely in your Excel files: either your formulas point to other files, or they return a value that pandas sees as NaN.
In the first case, the sheet needs to be updated and there is nothing pandas can do about that (but read on).
In the second case, you could solve it by setting explicit NaN values in read_excel:
pd.read_excel(path, sheet_name="Sheet1", na_values = [your na identifiers])
As for the first case, and as a workaround solution to make your work easier, you can automate what you are doing by hand using xlwings:
import pandas as pd
import xlwings as xl
def df_from_excel(path):
    # Open the file in a hidden Excel instance and save it so the values are stored
    app = xl.App(visible=False)
    book = app.books.open(path)
    book.save()
    app.kill()
    return pd.read_excel(path)
df = df_from_excel(path)  # path: the path to your file
If you want to keep those formulas in your excel file just save the file in a different location (book.save(different location)). Then you can get rid of the temporary files with shutil.
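One further cause worth ruling out, offered as an assumption about the asker's setup and not as a diagnosis: if the file was last written by a Python library such as openpyxl instead of by Excel, formula cells carry no stored result until Excel recalculates and saves the file, so the values come back empty. The sketch below uses openpyxl's data_only flag to show the formula text and the stored value side by side. The helper name inspect_cell is hypothetical.

```python
from openpyxl import load_workbook


def inspect_cell(path, coordinate, sheet_name=None):
    """Return (formula or literal value, stored value) for one cell."""
    wb_formulas = load_workbook(path)                # formulas as text
    wb_values = load_workbook(path, data_only=True)  # values Excel last stored
    ws_formulas = wb_formulas[sheet_name] if sheet_name else wb_formulas.active
    ws_values = wb_values[sheet_name] if sheet_name else wb_values.active
    return ws_formulas[coordinate].value, ws_values[coordinate].value
```

A result such as ("=SUM(A1:A2)", None) means the file holds the formula but no stored value, which is the situation that opening and saving the file in Excel resolves.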

I had this problem and resolved it by moving a graph below the first row I was reading. It looks like the position of graphs may cause problems.

You can use xlrd to read the values.
First you should refresh your Excel sheet, since you are also updating the values automatically with Python. You can use the function below:
import os
import xlrd
import win32com.client
file = "myxl.xls"
def refresh_file(file):
    # Open the file in a separate Excel instance, refresh it and save the recalculated values
    xlapp = win32com.client.DispatchEx("Excel.Application")
    path = os.path.abspath(file)
    wb = xlapp.Workbooks.Open(path)
    wb.RefreshAll()
    xlapp.CalculateUntilAsyncQueriesDone()
    wb.Save()
    xlapp.Quit()
After the file refresh, you can start reading the content.
workbook = xlrd.open_workbook(file)
worksheet = workbook.sheet_by_index(0)
for rowid in range(worksheet.nrows):
    row = worksheet.row(rowid)
    for colid, cell in enumerate(row):
        print(cell.value)
You can loop through the data however you need and add conditions while you are reading it, which gives you a lot more flexibility.

How do I write data to excel file without changing the format cells in original file using python? [duplicate]

This question already has answers here:
Preserving styles using python's xlrd,xlwt, and xlutils.copy
(2 answers)
Closed 7 years ago.
I am using xlwt and xlrd with Python to get some data from a source and write it into an xls file, but I need to write only the data to the Excel file, without changing the cell formatting in the original file.
Here's the code for that:
from xlrd import open_workbook
from xlwt import easyxf
from xlutils.copy import copy
data = [['2008','2009'],['Jan','Feb','Feb'],[100,200,300]]
rb = open_workbook('this1.xls')
rs = rb.sheet_by_index(0)
wb = copy(rb)
ws = wb.get_sheet(0)
# I want write data to excel file, just the data, without changing the format cells in (original file)
for i, row in enumerate(data):
    for j, cell in enumerate(row):
        ws.write(j, i, cell)

You could try the openpyxl module, which gives you the load_workbook option for opening existing Excel documents and keeping their formatting. I was using xlrd and xlwt, but when I found that xlutils was not supported in Python 3.5 and there was no other way to add to a spreadsheet with those modules, I switched to openpyxl. Keep in mind that if you do switch over, openpyxl only supports .xlsx documents, so you would have to convert your document to an .xlsx file. If you would like to read more about openpyxl, take a look at this website: https://automatetheboringstuff.com/chapter12/ You will find a lot of good information there on how to use openpyxl. Hopefully this helps!
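As a rough illustration of that approach, the sketch below loads an existing .xlsx file, overwrites cell values only, and saves it again. It assumes the file has already been converted from .xls to .xlsx. The function name write_values_in_place is hypothetical. The orientation follows the loop in the question, where each inner list fills one column.

```python
from openpyxl import load_workbook


def write_values_in_place(path, data, sheet_index=0):
    """Overwrite cell values in an existing .xlsx file, leaving cell styles alone."""
    wb = load_workbook(path)
    ws = wb.worksheets[sheet_index]
    # Each inner list of data fills one column, top to bottom (openpyxl is 1-based).
    for col_idx, column_values in enumerate(data, start=1):
        for row_idx, value in enumerate(column_values, start=1):
            # Assigning a value through cell() does not reset the cell's style.
            ws.cell(row=row_idx, column=col_idx, value=value)
    wb.save(path)
```

With the data from the question, the call would be write_values_in_place('this1.xlsx', data).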

Writing to a cell removes its style information, but you can preserve it and reassign it after the write. I've done this in the past with the undocumented internal rows and _Row__cells data structures:
def write_with_style(ws, row, col, value):
    if ws.rows[row]._Row__cells[col]:
        old_xf_idx = ws.rows[row]._Row__cells[col].xf_idx
        ws.write(row, col, value)
        ws.rows[row]._Row__cells[col].xf_idx = old_xf_idx
    else:
        ws.write(row, col, value)
Obviously this is vulnerable to version incompatibilities, since it uses internal data structures.
