Exporting a multi-index pandas DataFrame to Excel

I'm trying the following example from this (closed) GitHub issue: https://github.com/pandas-dev/pandas/issues/2701
import pandas as pd
m = pd.MultiIndex.from_tuples([(1,1),(1,2)], names=['a','b'])
df = pd.DataFrame([[1,2],[3,4]], columns=m)
df.to_excel('test.xls')
When I open test.xls, row 3 is blank (screenshot not preserved).
The example image in the GitHub issue doesn't have this blank row.
Is this a bug? And is there a workaround for writing multi-index dataframes to Excel? I'd rather not go the CSV route, since pandas does the merge-and-center for me when writing to Excel.
Using pandas version 0.19.2 on Ubuntu 14.04 and Windows 10.

I can reproduce exactly what you describe. This is most likely a bug.
I know of no easy way around it other than reading the file back in and deleting that row. Please add this to the closed GitHub issue and ask for it to be reopened.
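If the merged header cells are not essential, one possible workaround is to flatten the column MultiIndex into a single header row before exporting. This is only a sketch: it sidesteps the multi-row header (which the blank row appears to be tied to) rather than fixing it, and it gives up the merge-and-center layout. The helper name flatten_columns and the "_" separator are assumptions made for illustration, not part of pandas.

```python
import pandas as pd


def flatten_columns(df, sep="_"):
    """Return a copy of df whose MultiIndex columns are joined into single strings."""
    out = df.copy()
    if isinstance(out.columns, pd.MultiIndex):
        # Join the levels of each column tuple, e.g. (1, 2) -> "1_2"
        out.columns = [sep.join(str(level) for level in col) for col in out.columns]
    return out


m = pd.MultiIndex.from_tuples([(1, 1), (1, 2)], names=["a", "b"])
df = pd.DataFrame([[1, 2], [3, 4]], columns=m)

flat = flatten_columns(df)
print(flat.columns.tolist())

# Writing is left commented out because it needs an Excel writer package installed:
# flat.to_excel("test.xlsx")
```

The flattened frame has an ordinary single-level header, so the export no longer needs the extra header rows.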

Related

GUI for editing and saving a python pandas dataframe

In a python function I want to show the user a pandas dataframe and let the user edit the cells in the dataframe. My function should use the edited values in that dataframe (i.e. they should be saved).
I've tried pandasgui, but it does not seem to return the edits to the function.
Is there a function/library I can use for this?
I recently solved this problem with dtale:
import pandas as pd
import dtale
df = pd.read_csv('table_data.csv')
dt = dtale.show(df)  # create a dtale instance for our df
dt.open_browser()  # open it in a new browser tab (optional)
df = dt.data  # bind df to dtale's data so that GUI edits are reflected in it
# (any cell you edit in the GUI is then immediately saved in df)
Yesterday I ran into some bugs while using dtale: filtering broke my changes and created some new rows I don't need.
I usually use dtale and pandasgui together.
Hope it helps!

deleting some rows from .csv file cause adding NaN columns to it

python version: 3.7.11
pandas version: 1.1.3
IDE: Jupyter Notebook
Software for opening and resaving the .csv file: Microsoft Excel
I have a .csv file. You can download it from here: https://icedrive.net/0/35CvwH7gqr
In the .csv file, I looked for rows that have blank cells and deleted them. To do this I followed the steps below:
I opened the .csv file with Microsoft Excel.
I pressed F5, typed "A1:E9030" in the "Reference" field, then clicked OK.
I pressed F5 again, clicked the "Special..." button, selected "Blanks", then clicked OK.
In the "Home" tab, under "Cells", I clicked "Delete", then "Delete Sheet Rows".
I saved the file and closed it.
This is the file after deleting some rows: https://icedrive.net/0/cfG1dT6bBr
But when I run the code below, it seems that extra columns have been added after deleting the rows.
import pandas as pd
# The file doesn't have any header.
my_file = pd.read_csv(path_to_my_file, header=None)
my_file.head()
print(my_file.shape)
The output:
(9024, 244)
You can also see the difference by opening the file with notepad:
.csv file before deleting some rows:
.csv file after deleting some rows:
Before deleting the rows, my_file.shape reports 5 columns, but after deleting some rows it reports 244 columns.
Question:
How can I remove rows in Excel, or in some other way, without ending up with this problem?
Note: I can't remove these rows with pandas, because pandas does not seem to take these rows into account, so I have to do this manually.
Thanks in advance for any help.
I am not familiar with the Excel operation in the first part of your question, but I suggest a different approach. pandas treats NaN (np.nan) and None as missing values, but not empty strings. So we can load the .csv file into pandas first and replace any empty-string cells with np.nan:
>>> import pandas as pd
>>> import numpy as np
>>> my_file = pd.read_csv(path_to_my_file, header=None)
>>> my_file = my_file.replace('', np.nan)  # no inplace=True here: with it, replace() returns None
Then we can ask pandas to drop every row that contains np.nan:
>>> my_file = my_file.dropna()  # again without inplace=True, so the result can be assigned
This should give you the desired output. I think it is a good habit to work on data frames directly from your IDE. Hope this helps!
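For a file that Excel has already re-saved, the extra columns are presumably empty cells that were padded onto every row. A minimal sketch, assuming that is the cause: drop the columns that are entirely empty first, then drop the rows that still contain a missing value. The inline sample data below is invented for illustration and stands in for the real file.

```python
import io

import pandas as pd

# Invented sample: 5 real columns followed by 3 empty padding columns per row.
raw = "1,2,3,4,5,,,\n6,,8,9,10,,,\n11,12,13,14,15,,,\n"

df = pd.read_csv(io.StringIO(raw), header=None)
print(df.shape)  # shape including the padding columns

# Remove columns in which every value is missing (the padding).
df = df.dropna(axis=1, how="all")

# Remove rows that still have at least one missing value.
df = df.dropna(axis=0, how="any")
print(df.shape)
```

Dropping the all-empty columns before the rows matters: if the rows were dropped first, every row would be removed, because each one has missing values in the padding columns.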

read_excel from Pandas not reading all data (missing columns from first row)

I have an extremely simple .xlsx file and pandas is not reading the first row completely. It's very strange since it only reads one of the columns, and the others are blank. After A LOT of trial and error, it seems there's something hidden in the Excel file itself, since if I remove completely the row, and I just type it all again, then it works.
However, there's nothing visual that I can see. If I export the file to .csv then pandas works as well.
I'm using python 3.7 with pandas 1.1.5. I tried upgrading pandas but I can't, pip tells me I'm using the latest available version, even though I see that pandas 1.3 is available. Not sure if this is already fixed in a new version, and if it is, how do I get it installed (I'm using the app both on Mac and on Windows via Anaconda).
The xlsx file showing the problem is here:
https://docs.google.com/spreadsheets/d/1Xze2DNCyIARG7vdGFh0aUGHnhfgkciV5/edit?usp=sharing&ouid=117900420544251849196&rtpof=true&sd=true
It just contains the header and a row. That's it.
The script to read it is this:
import pandas as pd
print(f"pandas version is {pd.__version__}")
df = pd.read_excel('Book1.xlsx', dtype=str)
df = df.fillna('')
print(f"columns are {df.columns.tolist()}")
print(df)
And the output is this:
anibal#~/PycharmProjects/CIUSSS$ python3 test.py
pandas version is 1.1.5
columns are ['Source']
Source
SNOMED CT 115161005 Genus Abiotrophia (organism) Abiotrophia Genus Abiotrophia
Where it should be:
anibal#~/PycharmProjects/CIUSSS$ python3 test.py
pandas version is 1.1.5
columns are ['Source', 'f2', 'f3', 'f4', 'f5']
Source f2 f3 f4 f5
0 SNOMED CT 115161005 Genus Abiotrophia (organism) Abiotrophia Genus Abiotrophia
Can somebody please tell me if there's something different that I should be doing in the API to be able to read this? Or if I need to have a newer version of pandas, how do I get a newer version with pip (and then in anaconda)?
Update: the issue was indeed the version. I tried the exact same file with python 3.9.9 and pandas 1.3.4 and everything looks good.

how to read a data file including "pandas.core.frame, numpy.core.multiarray"

I came across a DF file stored in a binary format. When I open it in Vim, I can still see strings like "pandas.core.frame" and "numpy.core.multiarray", so I guess it is related to Python. However, I know little about Python. I have tried the pandas and numpy modules, but I failed to read the file. Could you give any suggestions? Thank you in advance. Here is the Dropbox link to the DF file: https://www.dropbox.com/s/b22lez3xysvzj7q/flux.df
It looks like a DataFrame stored with pickle; use read_pickle() to read it:
import pandas as pd
df = pd.read_pickle('flux.df')
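To illustrate why the file looks binary yet contains readable module names, here is a small round trip with an invented DataFrame: to_pickle() writes the kind of file that read_pickle() loads back. The file name example.df and the data are assumptions for the sketch. The first-byte check is only a heuristic for pickle protocol 2 and later, and pickle files should be loaded only from sources you trust, since unpickling can execute arbitrary code.

```python
import os
import tempfile

import pandas as pd

original = pd.DataFrame({"flux": [0.1, 0.2, 0.3]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.df")  # hypothetical file name

    # Serialize the DataFrame with pickle.
    original.to_pickle(path)

    # Pickle protocol 2 and later start with the byte 0x80 (a heuristic check).
    with open(path, "rb") as fh:
        looks_like_pickle = fh.read(1) == b"\x80"

    # Load it back into a DataFrame.
    restored = pd.read_pickle(path)

print(looks_like_pickle)
print(restored)
```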

How to convert OpenDocument spreadsheets to a pandas DataFrame?

The Python library pandas can read Excel spreadsheets and convert them to a pandas.DataFrame with pandas.read_excel(file) command. Under the hood, it uses xlrd library which does not support ods files.
Is there an equivalent of pandas.read_excel for ods files? If not, how can I do the same for an Open Document Formatted spreadsheet (ods file)? ODF is used by LibreOffice and OpenOffice.
This is available natively in pandas 0.25. So long as you have odfpy installed (conda install odfpy OR pip install odfpy) you can do
pd.read_excel("the_document.ods", engine="odf")
You can read ODF (Open Document Format .ods) documents in Python using the following modules:
odfpy / read-ods-with-odfpy
ezodf
pyexcel / pyexcel-ods
py-odftools
simpleodspy
Using ezodf, a simple ODS-to-DataFrame converter could look like this:
import pandas as pd
import ezodf
doc = ezodf.opendoc('some_odf_spreadsheet.ods')
print("Spreadsheet contains %d sheet(s)." % len(doc.sheets))
for sheet in doc.sheets:
    print("-"*40)
    print(" Sheet name : '%s'" % sheet.name)
    print("Size of Sheet : (rows=%d, cols=%d)" % (sheet.nrows(), sheet.ncols()) )
# convert the first sheet to a pandas.DataFrame
sheet = doc.sheets[0]
df_dict = {}
for i, row in enumerate(sheet.rows()):
    # row is a list of cells
    # assume the header is on the first row
    if i == 0:
        # columns as lists in a dictionary
        df_dict = {cell.value:[] for cell in row}
        # create index for the column headers
        col_index = {j:cell.value for j, cell in enumerate(row)}
        continue
    for j, cell in enumerate(row):
        # use header instead of column index
        df_dict[col_index[j]].append(cell.value)
# and convert to a DataFrame
df = pd.DataFrame(df_dict)
P.S.
ODF spreadsheet (*.ods files) support has been requested on the pandas issue tracker: https://github.com/pydata/pandas/issues/2311, but it is still not implemented.
ezodf was used in the unfinished PR9070 to implement ODF support in pandas. That PR is now closed (read the PR for a technical discussion), but it is still available as an experimental feature in this pandas fork.
there are also some brute force methods to read directly from the XML code (here)
Here is a quick and dirty hack which uses ezodf module:
import pandas as pd
import ezodf
def read_ods(filename, sheet_no=0, header=0):
    tab = ezodf.opendoc(filename=filename).sheets[sheet_no]
    return pd.DataFrame({col[header].value:[x.value for x in col[header+1:]]
                         for col in tab.columns()})
Test:
In [92]: df = read_ods(filename='fn.ods')
In [93]: df
Out[93]:
     a    b    c
0  1.0  2.0  3.0
1  4.0  5.0  6.0
2  7.0  8.0  9.0
NOTES:
all other useful parameters like header, skiprows, index_col, parse_cols are NOT implemented in this function - feel free to update this question if you want to implement them
ezodf depends on lxml; make sure you have it installed
pandas now supports .ods files. You must install the odfpy module first; then reading works much as it does for a normal .xls file.
conda install -c conda-forge odfpy
then
pd.read_excel('FILE_NAME.ods', engine='odf')
Edit: Happily, this answer below is now out of date, if you can update to a recent Pandas version.
If you'd still like to work from a Pandas version of your data, and update it from ODS only when needed, read on.
It seems the answer is No!
And I would characterize the tools to read in ODS still ragged.
If you're on POSIX, maybe the strategy of exporting to xlsx on the fly before using Pandas' very nice importing tools for xlsx is an option:
unoconv -f xlsx -o tmp.xlsx myODSfile.ods
Altogether, my code looks like:
import pandas as pd
import os
if fileOlderThan('tmp.xlsx','myODSfile.ods'):
    os.system('unoconv -f xlsx -o tmp.xlsx myODSfile.ods ')
xl_file = pd.ExcelFile('tmp.xlsx')
dfs = {sheet_name: xl_file.parse(sheet_name)
       for sheet_name in xl_file.sheet_names}
df=dfs['Sheet1']
Here fileOlderThan() is a function (see http://github.com/cpbl/cpblUtilities) which returns true if tmp.xlsx does not exist or is older than the .ods file.
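If you would rather not pull in that package, a helper with the described behavior can be sketched with the standard library alone. This is not the cpblUtilities implementation; the name file_older_than and its exact semantics are assumptions based only on the description above. The temporary files exist just to exercise the function.

```python
import os
import tempfile


def file_older_than(target, source):
    """Return True if target does not exist or was modified before source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(target) < os.path.getmtime(source)


with tempfile.TemporaryDirectory() as tmp_dir:
    source = os.path.join(tmp_dir, "myODSfile.ods")
    target = os.path.join(tmp_dir, "tmp.xlsx")

    open(source, "w").close()
    missing_result = file_older_than(target, source)  # target not created yet

    open(target, "w").close()
    os.utime(source, (1_000, 1_000))  # set explicit modification times
    os.utime(target, (2_000, 2_000))
    fresh_result = file_older_than(target, source)  # target newer than source

    os.utime(target, (500, 500))
    stale_result = file_older_than(target, source)  # target older than source

print(missing_result, fresh_result, stale_result)
```

Comparing modification times keeps the conversion from re-running on every call; it only triggers when the .ods file has changed since the last export.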
Another option: read-ods-with-odfpy. This module takes an OpenDocument Spreadsheet as input, and returns a list, out of which a DataFrame can be created.
If you only have a few .ods files to read, I would just open it in openoffice and save it as an excel file. If you have a lot of files, you could use the unoconv command in Linux to convert the .ods files to .xls programmatically (with bash)
Then it's really easy to read it in with pd.read_excel('filename.xls')
I've had good luck with pandas' read_clipboard.
Select the cells in Excel or an OpenDocument spreadsheet and copy them.
Then run the following in Python:
import pandas as pd
data = pd.read_clipboard()
Pandas will do a good job based on the cells copied.
Some responses have pointed out that odfpy or other external packages are needed to get this functionality, but note that recent versions of pandas (1.1 is current as of August 2020) support the ODS format directly in functions like pd.ExcelWriter() and pd.read_excel(). You only need to specify the proper engine, "odf", to work with OpenDocument file formats (.odf, .ods, .odt). The "odf" engine itself still relies on odfpy being installed.
Based heavily on the answer by davidovitch (thank you), I have put together a package that reads in a .ods file and returns a DataFrame. It's not a full implementation in pandas itself, such as his PR, but it provides a simple read_ods function that does the job.
You can install it with pip install pandas_ods_reader. It's also possible to specify whether the file contains a header row or not, and to specify custom column names.
There is support for reading Excel files in Pandas (both xls and xlsx), see the read_excel command. You can use OpenOffice to save the spreadsheet as xlsx. The conversion can also be done automatically on the command line, apparently, using the convert-to command line parameter.
Reading the data from xlsx avoids some of the issues (date formats, number formats, unicode) that you may run into when you convert to CSV first.
If possible, save as CSV from the spreadsheet application and then use pandas.read_csv(). IIRC, an 'ods' spreadsheet file actually is an XML file which also contains quite some formatting information. So, if it's about tabular data, extract this raw data first to an intermediate file (CSV, in this case), which you can then parse with other programs, such as Python/pandas.
