How to convert OpenDocument spreadsheets to a pandas DataFrame?

The Python library pandas can read Excel spreadsheets and convert them to a pandas.DataFrame with the pandas.read_excel(file) command. Under the hood it uses the xlrd library, which does not support .ods files.
Is there an equivalent of pandas.read_excel for .ods files? If not, how can I do the same for an OpenDocument formatted spreadsheet (.ods file)? ODF is used by LibreOffice and OpenOffice.

This is available natively in pandas 0.25. As long as you have odfpy installed (conda install odfpy or pip install odfpy), you can do
pd.read_excel("the_document.ods", engine="odf")

You can read ODF (Open Document Format .ods) documents in Python using the following modules:
odfpy / read-ods-with-odfpy
ezodf
pyexcel / pyexcel-ods
py-odftools
simpleodspy
Using ezodf, a simple ODS-to-DataFrame converter could look like this:
import pandas as pd
import ezodf

doc = ezodf.opendoc('some_odf_spreadsheet.ods')

print("Spreadsheet contains %d sheet(s)." % len(doc.sheets))
for sheet in doc.sheets:
    print("-"*40)
    print("   Sheet name : '%s'" % sheet.name)
    print("Size of Sheet : (rows=%d, cols=%d)" % (sheet.nrows(), sheet.ncols()))

# convert the first sheet to a pandas.DataFrame
sheet = doc.sheets[0]
df_dict = {}
for i, row in enumerate(sheet.rows()):
    # row is a list of cells
    # assume the header is on the first row
    if i == 0:
        # columns as lists in a dictionary
        df_dict = {cell.value: [] for cell in row}
        # create index for the column headers
        col_index = {j: cell.value for j, cell in enumerate(row)}
        continue
    for j, cell in enumerate(row):
        # use header instead of column index
        df_dict[col_index[j]].append(cell.value)

# and convert to a DataFrame
df = pd.DataFrame(df_dict)
P.S.
ODF spreadsheet (*.ods files) support has been requested on the pandas issue tracker: https://github.com/pydata/pandas/issues/2311, but it is still not implemented.
ezodf was used in the unfinished PR9070 to implement ODF support in pandas. That PR is now closed (read the PR for a technical discussion), but it is still available as an experimental feature in this pandas fork.
There are also some brute-force methods that read the XML content directly (here); a sketch follows.
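For instance, a minimal sketch of that brute-force route, assuming only that an .ods file is a zip archive whose content.xml holds the sheet data in ODF namespaces:

import zipfile
from xml.etree import ElementTree

# An .ods file is a zip archive; the sheet data lives in content.xml.
with zipfile.ZipFile('some_odf_spreadsheet.ods') as z:
    root = ElementTree.fromstring(z.read('content.xml'))

# Tables sit in the ODF "table" namespace; list the sheet names.
NS = '{urn:oasis:names:tc:opendocument:xmlns:table:1.0}'
for table in root.iter(NS + 'table'):
    print(table.get(NS + 'name'))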

Here is a quick and dirty hack which uses the ezodf module:
import pandas as pd
import ezodf

def read_ods(filename, sheet_no=0, header=0):
    tab = ezodf.opendoc(filename=filename).sheets[sheet_no]
    return pd.DataFrame({col[header].value: [x.value for x in col[header+1:]]
                         for col in tab.columns()})
Test:
In [92]: df = read_ods(filename='fn.ods')
In [93]: df
Out[93]:
     a    b    c
0  1.0  2.0  3.0
1  4.0  5.0  6.0
2  7.0  8.0  9.0
NOTES:
all other useful parameters like header, skiprows, index_col, parse_cols are NOT implemented in this function - feel free to update this question if you want to implement them
ezodf depends on lxml; make sure you have it installed

pandas now supports .ods files. You must install the odfpy module first; then it will work like a normal .xls file.
conda install -c conda-forge odfpy
then
pd.read_excel('FILE_NAME.ods', engine='odf')

Edit: Happily, the answer below is now out of date, if you can update to a recent pandas version.
If you'd still like to work from a Pandas version of your data, and update it from ODS only when needed, read on.
It seems the answer is No!
And I would characterize the tools for reading ODS as still ragged.
If you're on POSIX, maybe the strategy of exporting to xlsx on the fly before using Pandas' very nice importing tools for xlsx is an option:
unoconv -f xlsx -o tmp.xlsx myODSfile.ods
Altogether, my code looks like:
import pandas as pd
import os
if fileOlderThan('tmp.xlsx', 'myODSfile.ods'):
    os.system('unoconv -f xlsx -o tmp.xlsx myODSfile.ods ')

xl_file = pd.ExcelFile('tmp.xlsx')
dfs = {sheet_name: xl_file.parse(sheet_name)
       for sheet_name in xl_file.sheet_names}
df = dfs['Sheet1']
Here fileOlderThan() is a function (see http://github.com/cpbl/cpblUtilities) which returns true if tmp.xlsx does not exist or is older than the .ods file.

Another option: read-ods-with-odfpy. This module takes an OpenDocument Spreadsheet as input, and returns a list, out of which a DataFrame can be created.
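A minimal sketch of the final step; the rows list below stands in for the module's return value (a hypothetical example here), with the header in the first row:

import pandas as pd

# Hypothetical returned list of rows, header first.
rows = [['a', 'b', 'c'], [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
df = pd.DataFrame(rows[1:], columns=rows[0])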

If you only have a few .ods files to read, I would just open them in OpenOffice and save them as Excel files. If you have a lot of files, you could use the unoconv command on Linux to convert the .ods files to .xls programmatically (with bash, or from Python as sketched below).
Then it's really easy to read them in with pd.read_excel('filename.xls')
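A sketch of that batch conversion driven from Python, assuming unoconv is on the PATH:

import glob
import subprocess

# Convert every .ods file in the current directory to .xls via unoconv.
for path in glob.glob('*.ods'):
    subprocess.run(['unoconv', '-f', 'xls', path], check=True)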

I've had good luck with pandas read_clipboard.
Select the cells and copy them from Excel or an OpenDocument spreadsheet.
Then, in Python, run the following:
import pandas as pd
data = pd.read_clipboard()
Pandas will do a good job based on the cells copied.

Some responses have pointed out that odfpy or other external packages are needed to get this functionality, but note that recent versions of pandas (current is 1.1, August 2020) support the ODS format in functions like pd.ExcelWriter() and pd.read_excel(). You only need to specify the proper engine, "odf", to be able to work with OpenDocument file formats (.odf, .ods, .odt).
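For example, a small round trip under pandas >= 1.1 (the file name is just an illustration):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Both the writer and the reader accept engine="odf".
with pd.ExcelWriter('example.ods', engine='odf') as writer:
    df.to_excel(writer, index=False)

df2 = pd.read_excel('example.ods', engine='odf')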

Based heavily on the answer by davidovitch (thank you), I have put together a package that reads in a .ods file and returns a DataFrame. It's not a full implementation in pandas itself, such as his PR, but it provides a simple read_ods function that does the job.
You can install it with pip install pandas_ods_reader. It's also possible to specify whether the file contains a header row or not, and to specify custom column names.
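A sketch of typical usage; the keyword names follow the package description above and may differ between versions:

from pandas_ods_reader import read_ods

# Read the first sheet; header handling and column names are configurable.
df = read_ods('the_document.ods', 1)
df = read_ods('the_document.ods', 1, headers=False)
df = read_ods('the_document.ods', 1, columns=['A', 'B', 'C'])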

There is support for reading Excel files in pandas (both .xls and .xlsx); see the read_excel command. You can use OpenOffice to save the spreadsheet as .xlsx. The conversion can also be done automatically on the command line, apparently, using the convert-to command line parameter (sketched below).
Reading the data from xlsx avoids some of the issues (date formats, number formats, unicode) that you may run into when you convert to CSV first.
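A sketch of that command-line route driven from Python, assuming the office suite's soffice binary is on the PATH:

import subprocess

# Invoke the office suite's own converter via its convert-to parameter.
subprocess.run(
    ['soffice', '--headless', '--convert-to', 'xlsx', 'myODSfile.ods'],
    check=True,
)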

If possible, save as CSV from the spreadsheet application and then use pandas.read_csv(). IIRC, an .ods spreadsheet file is actually a zipped bundle of XML which also contains quite some formatting information. So, if it's about tabular data, extract this raw data first to an intermediate file (CSV, in this case), which you can then parse with other programs, such as Python/pandas.
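The reading step is then a one-liner (the file name here is illustrative):

import pandas as pd

# After exporting the sheet to CSV from LibreOffice/OpenOffice:
df = pd.read_csv('exported_sheet.csv')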

Related

Save DataFrame to CSV or Text to Alicloud OSS

I would like to know how to export a pandas DataFrame as a csv/txt file to Alicloud OSS. From the documentation at https://www.alibabacloud.com/help/en/doc-detail/88426.html,
the closest way I can find is to export it as csv/txt locally on my computer and copy the file to OSS. E.g.
import oss2
auth = oss2.Auth('yourAccessKeyId', 'yourAccessKeySecret')
bucket = oss2.Bucket(auth, 'yourEndpoint', 'examplebucket')
bucket.put_object_from_file('exampleobject.txt', 'D:\\localpath\\examplefile.txt')
Hence I would like to know if there is a way to upload the file directly to OSS, without the need to export it to my computer first. Thank you!
Managed to solve this in the end. Calling to_csv() without a filename makes pandas return the CSV content as a string rather than writing a file, so the DataFrame can be uploaded directly:
bucket.put_object('example.txt', df.to_csv(index = False, encoding = 'utf-8-sig'))

How to use pybind11 to return a DataFrame?

I am writing a Python module using pybind11 and Modern C++.
How do I return a DataFrame from C++ to Python?
It is possible by returning an Apache Arrow table, which can be converted to a pandas DataFrame with one line of Python (see the sketch after the links below).
For an example of an existing Python library that uses this:
See the Turbodbc docs.
See the Turbodbc github repo and the source code with methods to pass tables from C++ to Python using pybind11.
Other links
How to convert PyArrow table to Arrow table when interfacing between PyArrow in python and Arrow in C++
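On the Python side, that one-line conversion is pyarrow's Table.to_pandas(); a sketch, where my_cpp_module and get_table are hypothetical names for your pybind11 extension:

import my_cpp_module  # hypothetical pybind11 extension built against Arrow

table = my_cpp_module.get_table()  # a pyarrow.Table assembled in C++
df = table.to_pandas()             # the one-line conversion to a DataFrame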
Sometimes, it's useful to do a quick'n'dirty transfer of a DataFrame from PyBind11/C++ to Python for logging purposes. We don't want speed, we want ease of use.
Construct a string that represents a .csv file in C++, return that, then convert that into a DataFrame on the Python side:
import pandas as pd
from io import StringIO

logCsv = 'A,B\n2.3,4.5\n' # This string could be generated in PyBind11/C++.
LOGDATA = StringIO(logCsv)
df = pd.read_csv(LOGDATA, sep=",")
df
Output:
     A    B
0  2.3  4.5
Once we have this data in a DataFrame, we can save it in any format including Excel and Parquet. Once the data is in Excel, it becomes easier to debug.
If the cells are tab separated, then the data can be pasted straight from the log into Excel, and it will correctly divide into multiple cells.
import pandas as pd
from io import StringIO

logCsv = 'A\tB\n'
logCsv += '2.3\t4.5\n' # This string could be generated in PyBind11/C++.
LOGDATA = StringIO(logCsv)
df = pd.read_csv(LOGDATA, sep="\t")
print(df)
# Can now paste output straight into Excel.

How to split an Excel workbook by worksheet while preserving grouping

I am doing some Excel reports for work and am given a workbook exported from SSRS daily. The workbook is nicely set up, with groupings applied to every sheet for an effect similar to pivot tables.
However the book comes with 32 sheets, and I eventually need to send out each sheet individually as a distinct report. Right now I am splitting them up manually, but I am wondering if there is a way to automate this while preserving the grouping.
I previously tried something like:
import xlrd
import pandas as pd

targetWorkbook = xlrd.open_workbook(r'report.xlsx', on_demand=True)
xlsxDoc = pd.ExcelFile('report.xlsx')
for sheet in targetWorkbook.sheet_names():
    reportDF = pd.read_excel(xlsxDoc, sheet)
    reportDF.to_excel("report - {}.xlsx".format(sheet))
However, since I'm converting each sheet to a pandas DataFrame, the grouping is lost.
There are multiple ways to read/interact with excel docs in python, but I can't find a clear way to pick out a sheet and save it as its own document without losing the grouping.
This is my full answer. I have used the Worksheets().Move() method. The main idea is to use win32com.client library.
This was tested and works on my Windows 10 system with Excel 2013 installed, and Python 3.7. The grouping format was moved intact with the worksheets. I am still working on getting the looping to work and will revise my answer when I do; a possible approach is sketched after the code below.
My example has 3 worksheets, each with different grouping (subtotal) formats.
#
# Refined .Move() method, save new file using Active Worksheet property.
#
import win32com.client as win32
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb0 = excel.Workbooks.Open(r'C:\python\so\original.xlsx')
excel.Visible = True
# Move sheet1.
wb0.Worksheets(1).Move()
excel.Application.ActiveWorkbook.SaveAs(r'C:\python\so\sheet1.xlsx')
# Move sheet2, which is now the front sheet.
wb0.Worksheets(1).Move()
excel.Application.ActiveWorkbook.SaveAs(r'C:\python\so\sheet2.xlsx')
# Save single remaining sheet as sheet3.
wb0.SaveAs(r'C:\python\so\sheet3.xlsx')
wb0.Close()
excel.Application.Quit()
You would also need to install pywin32, which is not a standard library item.
https://github.com/mhammond/pywin32
pip install pywin32
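A hedged sketch of how that loop might look, following the same Move()-then-SaveAs pattern (untested; sheet names are captured up front because each Move() renumbers the collection):

import win32com.client as win32

excel = win32.gencache.EnsureDispatch('Excel.Application')
wb0 = excel.Workbooks.Open(r'C:\python\so\original.xlsx')
names = [ws.Name for ws in wb0.Worksheets]

# Each Move() pops the front sheet into a new workbook, so always take index 1.
for name in names[:-1]:
    wb0.Worksheets(1).Move()
    excel.Application.ActiveWorkbook.SaveAs(r'C:\python\so\%s.xlsx' % name)

# The last remaining sheet stays in the original workbook object.
wb0.SaveAs(r'C:\python\so\%s.xlsx' % names[-1])
wb0.Close()
excel.Application.Quit()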

Python xlwt produces AttributeError when searching for empty cell in Excel spreadsheet file

I have an Excel file and I am using Python to fill its rows and columns.
I want to use the following function to find the first empty row in the table and fill it:
from xlwt import Workbook, easyxf

def next_available_row(sheet):
    str_list = filter(None, sheet.col_values(1)) # error
    return str(len(str_list)+1)

wb = Workbook()
sheet = wb.add_sheet('sheet1')
sheet.write(0, 0, 'item')
sheet.write(0, 1, 'cost')
sheet.write(next_available_row(sheet), 0, 'potato')
sheet.write(next_available_row(sheet), 1, 4)
but I get the following error:
AttributeError: 'sheet' object has no attribute 'col_values'
What should I do?
The library you are using, xlwt, is for writing .xls spreadsheets only, and does not have the method col_values (to read its contents), as the error message already states (correctly).
The function next_available_row() (from How to find the first empty row of a google spread sheet using python GSPREAD?) that you want to use to search for an empty cell is based on a different library, gspread, and that is apparently not for Excel files (e.g. .xls, note there are several versions of this file type).
So you probably are looking for an entirely different library, one that reads and writes Excel files.
http://www.python-excel.org/ lists several libraries (including your xlwt):
https://pypi.python.org/pypi/xlrd
https://pypi.python.org/pypi/xlwt
https://pypi.python.org/pypi/XlsxWriter
https://pypi.python.org/pypi/openpyxl
Or maybe try to manage something by reading the file first, e.g. with xlwt's sister project, xlrd.
It seems there is no col_values method in the xlwt API: http://xlwt.readthedocs.io/en/latest/api.html
Maybe by using xlrd together with it you can reach your goal.
http://xlrd.readthedocs.io/en/latest/api.html?highlight=col_values#xlrd-sheet
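If reading the sheet back is not actually required, a minimal workaround sketch is to track the next free row yourself while writing; this uses only xlwt calls shown above, plus save():

from xlwt import Workbook

wb = Workbook()
sheet = wb.add_sheet('sheet1')
next_row = 0  # track the first empty row ourselves instead of reading it back

def write_row(values):
    global next_row
    for col, value in enumerate(values):
        sheet.write(next_row, col, value)
    next_row += 1

write_row(['item', 'cost'])
write_row(['potato', 4])
wb.save('items.xls')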

Reading .xlsx format in python

I've got to read an .xlsx file every 10 minutes in Python.
What is the most efficient way to do this?
I've tried using xlrd, but it doesn't read .xlsx - according to the documentation it does, but I can't manage it - I get Unsupported format, or corrupt file exceptions.
What is the best way to read xlsx?
I need to read comments in cells too.
xlrd hasn't yet released a version that reads .xlsx. Until then, there is a package Eric Gazoni built called openpyxl - it reads .xlsx files and does limited writing of them.
Use openpyxl; some basic examples:
import openpyxl
# Open Workbook
wb = openpyxl.load_workbook(filename='example.xlsx', data_only=True)
# Get All Sheets
a_sheet_names = wb.get_sheet_names()
print(a_sheet_names)
# Get Sheet Object by names
o_sheet = wb.get_sheet_by_name("Sheet1")
print(o_sheet)
# Get Cell Values
o_cell = o_sheet['A1']
print(o_cell.value)
o_cell = o_sheet.cell(row=2, column=1)
print(o_cell.value)
o_cell = o_sheet['H1']
print(o_cell.value)
# Sheet Maximum filled Rows and columns
print(o_sheet.max_row)
print(o_sheet.max_column)
There are multiple ways to read XLSX-formatted files using Python. Two are illustrated below; both require that you install openpyxl, and if you want to parse into pandas directly you will also need pandas, e.g. pip install pandas openpyxl
Option 1: pandas direct
Primary use case: load just the data for further processing.
Using the read_excel() function in pandas would be your best choice. Note that pandas should fall back to openpyxl automatically, but in the event of format issues it's best to specify the engine directly.
df_pd = pd.read_excel("path/file_name.xlsx", engine="openpyxl")
Option 2 - openpyxl direct
Primary use case: getting or editing specific Excel document elements such as comments (requested by OP), formatting properties or formulas.
Use load_workbook(), then extract each cell's comment through its comment attribute:
from openpyxl import load_workbook
wb = load_workbook(filename = "path/file_name.xlsx")
ws = wb.active
ws["A1"].comment # <- loop through row & columns to extract all comments
