Pandas and xlrd error while reading excel files

Pandas and xlrd error while reading excel files - python

I've been working on a Python script that deals with creating Pandas data frames from Excel files. For the past few days, the Pandas method worked perfectly with the usual pd.read_excel() method.
Today I've been trying to run the same code, but am running into errors. I've tried using the following code on a small test document (just two columns, 5 rows with simple integers):
import pandas as pd
pd.read_excel("tstr.xlsx")
I'm getting this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\util\_decorators.py", line 296, in wrapper
return func(*args, **kwargs)
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_base.py", line 304, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_base.py", line 867, in __init__
self._reader = self._engines[engine](self._io)
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_xlrd.py", line 22, in __init__
super().__init__(filepath_or_buffer)
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_base.py", line 353, in __init__
self.book = self.load_workbook(filepath_or_buffer)
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_xlrd.py", line 37, in load_workbook
return open_workbook(filepath_or_buffer)
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\xlrd\__init__.py", line 130, in open_workbook
bk = xlsx.open_workbook_2007_xml(
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\xlrd\xlsx.py", line 812, in open_workbook_2007_xml
x12book.process_stream(zflo, 'Workbook')
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\xlrd\xlsx.py", line 266, in process_stream
for elem in self.tree.iter() if Element_has_iter else self.tree.getiterator():
AttributeError: 'ElementTree' object has no attribute 'getiterator'
I get the exact same issue when trying to load excel files with xlrd directly. I've tried with several different excel files, and all of my pip installations are up-to-date.
I haven't made any changes to my system since pd.read_excel was last working perfectly (I did reboot my system, but it didn't involve any updates). I'm using a Windows 10 machine, if that's relevant.
Has anyone else had this issue? Any advice on how to proceed?

There can be many different reasons that cause this error, but you should try add engine='xlrd' or other possible values (mostly "openpyxl"). It may solve your issue, as it depends more on the excel file rather then your code.
Also, try to add full path to the file instead of relative one.

openpyxl.utils.exceptions.InvalidFileException: openpyxl does not support the old .xls file format, please use xlrd to read this file, or convert it to the more recent .xlsx file format.
So for me the argument:
engine="xlrd" worked on .xls
engine="openpyxl" worked on .xlsx

This works for me
#Back to linux prompt and install openpyxl
pip install openpyxl
#Add engine='openpyxl' in the python argument
data = pd.read_excel(path, sheet_name='Sheet1', parse_dates=True, engine='openpyxl')

Related

Unable to read_excel using pandas on CentOS Stream 9 VM: zipfile.BadZipFile: Bad magic number for file header

I've been running a script for several months now where I read and concat several excel exports using the following code:
files = os.listdir(os.path.abspath('exports/'))
for file in files:
if file.startswith('ap_statistics_') and file.endswith('.xlsx'):
excel_list.append(pd.read_excel('exports/' + file, sheet_name='Access Points'))
df = pd.concat(excel_list, axis=0, ignore_index=True)
This has worked just fine until this Saturday when I uploaded new exports to the CentOS Stream 9 VM where I have a cronjob running the script every hour.
Now I always get this error:
Traceback (most recent call last):
File "/root/projects/beacon_check_v8/main.py", line 310, in <module>
ap_check()
File "/root/projects/beacon_check_v8/main.py", line 260, in ap_check
siteaps_result = getaps()
File "/root/projects/beacon_check_v8/main.py", line 30, in getaps
excel_list.append(pd.read_excel('exports/' + file, sheet_name='Access Points'))
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/pandas/io/excel/_base.py", line 457, in read_excel
io = ExcelFile(io, storage_options=storage_options, engine=engine)
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/pandas/io/excel/_base.py", line 1419, in __init__
self._reader = self._engines[engine](self._io, storage_options=storage_options)
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/pandas/io/excel/_openpyxl.py", line 525, in __init__
super().__init__(filepath_or_buffer, storage_options=storage_options)
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/pandas/io/excel/_base.py", line 518, in __init__
self.book = self.load_workbook(self.handles.handle)
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/pandas/io/excel/_openpyxl.py", line 536, in load_workbook
return load_workbook(
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/openpyxl/reader/excel.py", line 317, in load_workbook
reader.read()
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/openpyxl/reader/excel.py", line 277, in read
self.read_strings()
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/openpyxl/reader/excel.py", line 143, in read_strings
with self.archive.open(strings_path,) as src:
File "/usr/lib64/python3.9/zipfile.py", line 1523, in open
raise BadZipFile("Bad magic number for file header")
zipfile.BadZipFile: Bad magic number for file header
I develop on my Windows 10 notebook using PyCharm with a Python 3.9 venv, same as on the VM, where the script continued to work just fine.
When researching online all I found was that sometimes .pyc files can cause issues so I created a completely new venv on the VM, installed all libraries (netmiko, pandas, openpyxl, etc.) and tried running the script again before and after deleting all .pyc files in the directory but no luck.
I have extracted the Excel file header using the following code:
with open('exports/' + file, 'rb') as myexcel:
print(myexcel.read(4))
Unfortunately it comes back as the same values on both my Windows venv as well as the CentOS venv:
b'PK\x03\x04'
I don't know if this header value is correct or not but I can read the files on my Windows notebook just fine using pandas or excel.
Any help would be greatly appreciated.

The issue was actually the program I used to transfer the files between my notebook and the VM, WinSCP. I don't know why or how this caused the error but I was able to fix it by transferring directly over pscp.

Pandas read_excel openpyxl generates a ValueError

Recently my code broke, like a week ago. Looks like something is going wrong with the openpyxl dependency. Hoping someone else has this issue and can tell me it's not just me being a bad programmer lol
Edit1:
The excel file I'm reading is generated as a .xlsx from Seeking Alpha's Portfolio Excel Export feature.
Edit2:
The Excel Export file now contains an added row with 2 empty conditionally formatted cells to a sheet that I don't even pass through to the sheet_name arg. The problem seems to be that openpyxl can't parse empty cells that have conditional formatting. How can I make read_excel only parse the sheets I listed? Or maybe drop the row that's causing problems?
After adding a "-" or removing conditional formatting, my script works. But I'd like to not have to do this every time I export the excel file. Also, the following warning appears now.
/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/bin/python3.9 /Users/marcoucolon/Documents/GitHub/opt-portfolio/analysis.py
/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/openpyxl/worksheet/_reader.py:308:
UserWarning: Conditional Formatting extension is not supported and will be removed
warn(msg)
My line of code that causes error
dic = pd.read_excel(path, sheet_name=sheet_names)
Error message
ValueError: Value must be one of {'equal', 'greaterThanOrEqual', 'containsText', 'beginsWith', 'notEqual', 'greaterThan', 'between', 'endsWith', 'notContains', 'lessThan', 'lessThanOrEqual', 'notBetween'}
Complete Log
/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/bin/python3.9 /Users/marcoucolon/Documents/GitHub/opt-portfolio/analysis.py
Traceback (most recent call last):
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/analysis.py", line 862, in <module>
df = excel_data(path_excel, sheet_names)
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/analysis.py", line 129, in excel_data
dic = pd.read_excel(path, sheet_name=sheet_names)
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/pandas/util/_decorators.py", line 299, in wrapper
return func(*args, **kwargs)
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 344, in read_excel
data = io.parse(
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 1170, in parse
return self._reader.parse(
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 492, in parse
data = self.get_sheet_data(sheet, convert_float)
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/pandas/io/excel/_openpyxl.py", line 548, in get_sheet_data
for row_number, row in enumerate(sheet.rows):
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/openpyxl/worksheet/_read_only.py", line 79, in _cells_by_row
for idx, row in parser.parse():
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/openpyxl/worksheet/_reader.py", line 145, in parse
dispatcher[tag_name](element)
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/openpyxl/worksheet/_reader.py", line 288, in parse_formatting
cf = ConditionalFormatting.from_tree(element)
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/openpyxl/descriptors/serialisable.py", line 87, in from_tree
obj = desc.expected_type.from_tree(el)
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/openpyxl/descriptors/serialisable.py", line 103, in from_tree
return cls(**attrib)
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/openpyxl/formatting/rule.py", line 201, in __init__
self.operator = operator
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/openpyxl/descriptors/base.py", line 143, in __set__
super(NoneSet, self).__set__(instance, value)
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/openpyxl/descriptors/base.py", line 128, in __set__
raise ValueError(self.__doc__)
ValueError: Value must be one of {'equal', 'greaterThanOrEqual', 'containsText', 'beginsWith', 'notEqual', 'greaterThan', 'between', 'endsWith', 'notContains', 'lessThan', 'lessThanOrEqual', 'notBetween'}

I have same problem with Seeking Alpha (SA) xlsx. pandas 1.1.5 does read ok but current pandas 1.3.2 has this problem.
A possible hack workaround is to install older pandas in a different location and import that pandas version 1.1.5 as pd1 just to import the xlsx .
So pd is pandas 1.3.2
and pd1 is pandas 1.1.5
The following has some hints Supporting multiple Python module versions (with the same version of Python)
And use the target option in pip import as explained in Supporting multiple Python module versions (with the same version of Python)
Edited to add sample code
Pandas is a large subsystem and multiple versions are a bit demanding to install. A simpler solution was to create a different environment ( I used conda) and execute a command ( .bat in windows .sh in unix)
sample SA_oldpandas.bat
echo old pandas 1.1.5
conda activate env-oldpandas
cd {code directory}
echo ---GOING----
python SA_OldPandas.py
echo ---FINISHED---- %date% %time%
start excel SA_check.xlsx
basically the old Pandas code writes the data to a excel and the new code uses this
os.getcwd()
os.system('SA_oldpandas.bat')

Pandas cant open csv file :FileNotFoundError: [Errno 2] File xyz.csv does not exist:

import pandas as pd
df=pd.read_csv('Catalogue.csv')
print(df)
I downloaded my earthquake csv file. And pandas dont see the file. I use VS Code and Python 3.8.3 I added csv file in the same py file where I write my code.
Even if I used the same code (csv was in the same folder where my code file was) in Jupyter Notebook folder the result were the same.
I guess if it is excel pip instal xlrd is written. I did pip install python-csv but couldnt achieve installing. Is it needed though? Or do I need to fixe the csv file (commas or spaces)?
total result:
Traceback (most recent call last):
File "c:/Users/Fatma Elik/Documents/VS Code/BTK/CSVCSV.py", line 2, in <module>
df=pd.read_csv('Catalogue.csv')
File "C:\Users\Fatma Elik\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\Fatma Elik\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py", line 448, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "C:\Users\Fatma Elik\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py", line 880, in __init__
self._make_engine(self.engine)
File "C:\Users\Fatma Elik\AppData\Local\Programs\Python\Python38-32\lib\der(src, **kwds)
File "pandas\_libs\parsers.pyx", line 374, in pandas._libs.parsers.TextReader.__cinit__
File "pandas\_libs\parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File Catalogue.csv does not exist: 'Catalogue.csv'
Thanks everyone!

Please try
import pandas as pd
df=pd.read_csv("c://Users//Fatma Elik//Documents//VS Code//BTK//Catalogue.csv")
print(df)

There are quite a few scenarios that this situation might occur. Perhaps I can offer a few common suggestions for you to try.
Case 1 - Location where you run Python
Your file path is correct with respect to the location of the .py file, but incorrect with respect to the location from which you call python.
For example, let's say CSVCSV.py is located in ~/script/, and Catalogue.csv is located in ~/script/.
If you run python script/CSVCSV.py from ~/ , you will get the FileNotFound error. However, if from ~/script/ you run python CSVCSV.py, it will work.
In your case specifically, are you perhaps running python from .../BTK or .../VS Code ? I might take a guess that you are running python c:/Users/Fatma Elik/Documents/VS Code/BTK/CSVCSV.py.
Case 2 - Try using full directory path
Have you tried df = pd.read_csv("C://Users//Fatma Elik//Documents//VS Code//BTK//Catalogue.csv") ?

This situation usually occurs when you try to write a file in a particular directory, but the directory is not available. Let's say, you are trying to write records in data/hist/sign_seqs.csv but hist directory is not present.

Python: 'NoneType' object has no attribute 'decompressobj'

I'm using Python 2.7.11 on Ubuntu.
I'm trying to open an Excel file (.xlsx) in Python using xlrd package. However I get the following error when I try to use the open_workbook() function from the package to open my Excel file:
Traceback (most recent call last):
File "TileInserter.py", line 15, in <module>
book = open_workbook(sheetPath, on_demand=True)
File "/usr/local/lib/python2.7/site-packages/xlrd/__init__.py", line 422, in open_workbook
ragged_rows=ragged_rows,
File "/usr/local/lib/python2.7/site-packages/xlrd/xlsx.py", line 761, in open_workbook_2007_xml
zflo = zf.open(component_names['xl/_rels/workbook.xml.rels'])
File "/usr/local/lib/python2.7/zipfile.py", line 1010, in open
close_fileobj=should_close)
File "/usr/local/lib/python2.7/zipfile.py", line 526, in __init__
self._decompressor = zlib.decompressobj(-15)
AttributeError: 'NoneType' object has no attribute 'decompressobj'
I tried to google the cause of this error and found that this could happen if the zlib library is not installed. But when I checked using PHP's phpinfo() function, it shows that zlib is installed. And that too the latest version (version 1.2.8).
So I'm kinda stuck now. Does anyone know how to solve this issue?
EDIT: My actual code in TileInserter.py goes like this (TileInserter.py and TileList.xlsx being in the same directory):
from xlrd import open_workbook
sheetPath = "TileList.xlsx"
#some more variables
#Open Excel file
book = open_workbook(sheetPath, on_demand=True)
for name in book.sheet_names():
if name.endswith('1'):
sheet = book.sheet_by_name(name)

I see on http://www.python-excel.org/ that there's a library openpyxl that is recommended for working with .xlsx files. This may be what you need instead of xlrd.

xlrd cannot read xlsx file downloaded from email attachment

This is a very very strange issue. I have quite a large excel file (the contents of which I cannot discuss as it is sensitive data) that is a .xlsx and IS a valid excel file.
When I download it from my email and save it on my desktop and try to open the workbook using xlrd, xlrd throws an AssertionError and does not show me what went wrong.
When I open the file using my file browser, then save it (without making any changes), it works perfectly with xlrd.
Has anyone faced this issue before? I tried passing in various flags to the open_workbook function to no avail and I tried googling for the error. So far I haven't found anything.
The method I used was as follows
file = open('bigexcelfile.xlsx')
fileString = file.read()
wb = open_workbook(file_contents=filestring)
Please help! The error is as follows
Traceback (most recent call last):
File "./varify/samples/resources.py", line 354, in post
workbook = xlrd.open_workbook(file_contents=fileString)
File "/home/vagrant/varify-env/lib/python2.7/site-packages/xlrd/__init__.py", line 416, in open_workbook
ragged_rows=ragged_rows,
File "/home/vagrant/varify-env/lib/python2.7/site-packages/xlrd/xlsx.py", line 791, in open_workbook_2007_xml
x12sheet.process_stream(zflo, heading)
File "/home/vagrant/varify-env/lib/python2.7/site-packages/xlrd/xlsx.py", line 528, in own_process_stream
self_do_row(elem)
File "/home/vagrant/varify-env/lib/python2.7/site-packages/xlrd/xlsx.py", line 722, in do_row
assert tvalue is not None
AssertionError

rename or Save as your Excel file as .xls instead of .xlsx
Thank You

Use pyopenxl, not xlrd, for this format: https://openpyxl.readthedocs.org/en/latest/

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas and xlrd error while reading excel files - python

There can be many different reasons that cause this error, but you should try add engine='xlrd' or other possible values (mostly "openpyxl"). It may solve your issue, as it depends more on the excel file rather then your code. Also, try to add full path to the file instead of relative one.

openpyxl.utils.exceptions.InvalidFileException: openpyxl does not support the old .xls file format, please use xlrd to read this file, or convert it to the more recent .xlsx file format. So for me the argument: engine="xlrd" worked on .xls engine="openpyxl" worked on .xlsx

This works for me #Back to linux prompt and install openpyxl pip install openpyxl #Add engine='openpyxl' in the python argument data = pd.read_excel(path, sheet_name='Sheet1', parse_dates=True, engine='openpyxl')

Related

Unable to read_excel using pandas on CentOS Stream 9 VM: zipfile.BadZipFile: Bad magic number for file header

Pandas read_excel openpyxl generates a ValueError

Pandas cant open csv file :FileNotFoundError: [Errno 2] File xyz.csv does not exist:

Python: 'NoneType' object has no attribute 'decompressobj'

xlrd cannot read xlsx file downloaded from email attachment

Categories

Resources