Why does csvkit's in2csv convert dates to integers? - python

I have a .xlsx file downloaded from this URL. When I attempt to convert this file using in2csv, from csvkit version 0.9.1, I get output lines such as:
3,,0.625,42185,42916,912828XJ4,1900-01-26,0.9119,-----,...
4,,0.875,41835,42931,912828WT3,1900-01-27,0.9122,-----,...
Instead of entries recognizable as dates, we get integers. The integers seem to be the number of days between 1900-01-01 and the corresponding date in the xlsx. Additionally, values that should be integers ($26 and $27) show up in date format! Is there a simple way to get in2csv to output these dates in a format where they are recognizable as such?

In Short
Just upgrade the openpyxl package. This is a known openpyxl bug that has since been fixed.
pip install --upgrade openpyxl
After upgrade:
3,,0.625,2015-06-30,2017-06-30,912828XJ4,26,0.9119,-----,...
4,,0.875,2014-07-15,2017-07-15,912828WT3,27,0.9122,-----,...
In Long
I copied a typical line of the table into a newly created .xlsx file and got the following error when converting:
list index out of range
Trace the exception:
>>> from csvkit import convert
>>> convert.convert(open('test.xlsx', 'rb'), 'xlsx')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.4/site-packages/csvkit/convert/__init__.py", line 39, in convert
return xlsx2csv(f, **kwargs)
File "/usr/local/lib/python3.4/site-packages/csvkit/convert/xlsx.py", line 51, in xlsx2csv
book = load_workbook(f, use_iterators=True, data_only=True)
File "/usr/local/lib/python3.4/site-packages/openpyxl/reader/excel.py", line 154, in load_workbook
_load_workbook(wb, archive, filename, read_only, keep_vba)
File "/usr/local/lib/python3.4/site-packages/openpyxl/reader/excel.py", line 209, in _load_workbook
parsed_styles = read_style_table(archive)
File "/usr/local/lib/python3.4/site-packages/openpyxl/reader/style.py", line 200, in read_style_table
p.parse()
File "/usr/local/lib/python3.4/site-packages/openpyxl/reader/style.py", line 56, in parse
self.parse_cell_styles()
File "/usr/local/lib/python3.4/site-packages/openpyxl/reader/style.py", line 138, in parse_cell_styles
self._parse_xfs(node)
File "/usr/local/lib/python3.4/site-packages/openpyxl/reader/style.py", line 160, in _parse_xfs
format_code = self.number_formats[numFmtId-165]
IndexError: list index out of range
list index out of range
So the failure happens inside the openpyxl package, which is used to read/write Excel 2010 xlsx/xlsm files.
The issue has been reported and fixed in the latest version of openpyxl. However, csvkit's requirements.txt pins:
openpyxl==2.2.0-b1
According to this issue, the pin was only a workaround at the time, so I think you can simply upgrade openpyxl (currently 2.2.5) and you are good.
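If a file has already been converted with the buggy version, the integer columns can be turned back into dates by hand. A minimal sketch, assuming the workbook uses Excel's default 1900 date system; excel_serial_to_date is a hypothetical helper, not part of csvkit or openpyxl:

```python
from datetime import date, timedelta

# Excel's 1900 date system counts a non-existent 1900-02-29, so for
# serials from 61 onward the effective day zero is 1899-12-30.
EXCEL_DAY_ZERO = date(1899, 12, 30)


def excel_serial_to_date(serial):
    """Convert an Excel serial day number (1900 date system) to a date."""
    serial = int(serial)
    if serial < 61:
        # Earlier serials fall before the phantom leap day and need a
        # different offset; this sketch does not handle them.
        raise ValueError("serials below 61 are not handled by this sketch")
    return EXCEL_DAY_ZERO + timedelta(days=serial)


# Serials taken from the in2csv output in the question.
for serial in (42185, 42916, 41835, 42931):
    print(serial, excel_serial_to_date(serial).isoformat())
```

The four serials map to 2015-06-30, 2017-06-30, 2014-07-15 and 2017-07-15, which matches the output shown after the upgrade.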

Related

Pandas read_excel openpyxl generates a ValueError

Recently my code broke, like a week ago. Looks like something is going wrong with the openpyxl dependency. Hoping someone else has this issue and can tell me it's not just me being a bad programmer lol
Edit1:
The excel file I'm reading is generated as a .xlsx from Seeking Alpha's Portfolio Excel Export feature.
Edit2:
The Excel Export file now contains an added row with 2 empty conditionally formatted cells to a sheet that I don't even pass through to the sheet_name arg. The problem seems to be that openpyxl can't parse empty cells that have conditional formatting. How can I make read_excel only parse the sheets I listed? Or maybe drop the row that's causing problems?
After adding a "-" or removing conditional formatting, my script works. But I'd like to not have to do this every time I export the excel file. Also, the following warning appears now.
/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/bin/python3.9 /Users/marcoucolon/Documents/GitHub/opt-portfolio/analysis.py
/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/openpyxl/worksheet/_reader.py:308:
UserWarning: Conditional Formatting extension is not supported and will be removed
warn(msg)
My line of code that causes error
dic = pd.read_excel(path, sheet_name=sheet_names)
Error message
ValueError: Value must be one of {'equal', 'greaterThanOrEqual', 'containsText', 'beginsWith', 'notEqual', 'greaterThan', 'between', 'endsWith', 'notContains', 'lessThan', 'lessThanOrEqual', 'notBetween'}
Complete Log
/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/bin/python3.9 /Users/marcoucolon/Documents/GitHub/opt-portfolio/analysis.py
Traceback (most recent call last):
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/analysis.py", line 862, in <module>
df = excel_data(path_excel, sheet_names)
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/analysis.py", line 129, in excel_data
dic = pd.read_excel(path, sheet_name=sheet_names)
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/pandas/util/_decorators.py", line 299, in wrapper
return func(*args, **kwargs)
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 344, in read_excel
data = io.parse(
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 1170, in parse
return self._reader.parse(
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 492, in parse
data = self.get_sheet_data(sheet, convert_float)
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/pandas/io/excel/_openpyxl.py", line 548, in get_sheet_data
for row_number, row in enumerate(sheet.rows):
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/openpyxl/worksheet/_read_only.py", line 79, in _cells_by_row
for idx, row in parser.parse():
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/openpyxl/worksheet/_reader.py", line 145, in parse
dispatcher[tag_name](element)
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/openpyxl/worksheet/_reader.py", line 288, in parse_formatting
cf = ConditionalFormatting.from_tree(element)
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/openpyxl/descriptors/serialisable.py", line 87, in from_tree
obj = desc.expected_type.from_tree(el)
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/openpyxl/descriptors/serialisable.py", line 103, in from_tree
return cls(**attrib)
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/openpyxl/formatting/rule.py", line 201, in __init__
self.operator = operator
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/openpyxl/descriptors/base.py", line 143, in __set__
super(NoneSet, self).__set__(instance, value)
File "/Users/marcoucolon/Documents/GitHub/opt-portfolio/.venv/lib/python3.9/site-packages/openpyxl/descriptors/base.py", line 128, in __set__
raise ValueError(self.__doc__)
ValueError: Value must be one of {'equal', 'greaterThanOrEqual', 'containsText', 'beginsWith', 'notEqual', 'greaterThan', 'between', 'endsWith', 'notContains', 'lessThan', 'lessThanOrEqual', 'notBetween'}
I have the same problem with Seeking Alpha (SA) xlsx files. pandas 1.1.5 reads them fine, but the current pandas 1.3.2 has this problem.
A possible hack is to install the older pandas in a different location and import that version (1.1.5) as pd1, just to read the xlsx.
So pd is pandas 1.3.2
and pd1 is pandas 1.1.5
For hints, see Supporting multiple Python module versions (with the same version of Python), which explains how to use pip's target option for this.
Edited to add sample code
pandas is a large package, and installing multiple versions side by side is demanding. A simpler solution was to create a separate environment (I used conda) and run the old-pandas script from a command file (.bat on Windows, .sh on Unix).
Sample SA_oldpandas.bat:
echo old pandas 1.1.5
conda activate env-oldpandas
cd {code directory}
echo ---GOING----
python SA_OldPandas.py
echo ---FINISHED---- %date% %time%
start excel SA_check.xlsx
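Basically, the old pandas code writes the data to an Excel file, and the new code runs the batch file and then uses that file:
import os
os.getcwd()  # the .bat file must be in this directory (or on PATH)
os.system('SA_oldpandas.bat')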
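The same hand-off can also be driven from Python without a batch file. A sketch, assuming conda is on PATH and reusing the environment and script names from the batch file above; build_command and run_old_pandas are hypothetical helper names:

```python
import subprocess


def build_command(env_name, script):
    """Build a `conda run` command that executes `script` inside `env_name`."""
    return ["conda", "run", "-n", env_name, "python", script]


def run_old_pandas(env_name="env-oldpandas", script="SA_OldPandas.py"):
    """Run the old-pandas script; raises CalledProcessError if it fails."""
    subprocess.run(build_command(env_name, script), check=True)


# Usage (requires conda on PATH):
# run_old_pandas()
```

Unlike os.system, subprocess.run with check=True raises an exception when the helper script exits with an error, so the new code does not silently continue with a stale file.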

Pandas and xlrd error while reading excel files

I've been working on a Python script that deals with creating Pandas data frames from Excel files. For the past few days, the Pandas method worked perfectly with the usual pd.read_excel() method.
Today I've been trying to run the same code, but am running into errors. I've tried using the following code on a small test document (just two columns, 5 rows with simple integers):
import pandas as pd
pd.read_excel("tstr.xlsx")
I'm getting this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\util\_decorators.py", line 296, in wrapper
return func(*args, **kwargs)
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_base.py", line 304, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_base.py", line 867, in __init__
self._reader = self._engines[engine](self._io)
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_xlrd.py", line 22, in __init__
super().__init__(filepath_or_buffer)
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_base.py", line 353, in __init__
self.book = self.load_workbook(filepath_or_buffer)
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_xlrd.py", line 37, in load_workbook
return open_workbook(filepath_or_buffer)
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\xlrd\__init__.py", line 130, in open_workbook
bk = xlsx.open_workbook_2007_xml(
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\xlrd\xlsx.py", line 812, in open_workbook_2007_xml
x12book.process_stream(zflo, 'Workbook')
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\xlrd\xlsx.py", line 266, in process_stream
for elem in self.tree.iter() if Element_has_iter else self.tree.getiterator():
AttributeError: 'ElementTree' object has no attribute 'getiterator'
I get the exact same issue when trying to load excel files with xlrd directly. I've tried with several different excel files, and all of my pip installations are up-to-date.
I haven't made any changes to my system since pd.read_excel was last working perfectly (I did reboot my system, but it didn't involve any updates). I'm using a Windows 10 machine, if that's relevant.
Has anyone else had this issue? Any advice on how to proceed?
This error can have many different causes, but you should try adding engine='xlrd' or another possible value (most often "openpyxl"). It may solve your issue, since the right engine depends more on the Excel file than on your code.
Also, try using the full path to the file instead of a relative one.
openpyxl.utils.exceptions.InvalidFileException: openpyxl does not support the old .xls file format, please use xlrd to read this file, or convert it to the more recent .xlsx file format.
So for me the argument:
engine="xlrd" worked on .xls
engine="openpyxl" worked on .xlsx
This works for me:
# At the shell prompt, install openpyxl
pip install openpyxl
# Then pass engine='openpyxl' to read_excel
data = pd.read_excel(path, sheet_name='Sheet1', parse_dates=True, engine='openpyxl')
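The extension-to-engine rule from these answers can be wrapped in a small helper. A minimal sketch; pick_engine and read_any_excel are hypothetical names, and the mapping covers only the two cases reported above:

```python
from pathlib import Path

# Mapping taken from the answers above; other extensions are left to pandas.
ENGINE_BY_SUFFIX = {".xls": "xlrd", ".xlsx": "openpyxl"}


def pick_engine(path):
    """Return the read_excel engine for `path`, or None to let pandas decide."""
    return ENGINE_BY_SUFFIX.get(Path(path).suffix.lower())


def read_any_excel(path, **kwargs):
    """Read an Excel file with an engine chosen from its extension."""
    import pandas as pd  # imported here so pick_engine works without pandas

    return pd.read_excel(path, engine=pick_engine(path), **kwargs)
```

For example, read_any_excel("tstr.xlsx") would call pd.read_excel with engine="openpyxl", provided openpyxl is installed.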

Pandas cant open csv file :FileNotFoundError: [Errno 2] File xyz.csv does not exist:

import pandas as pd
df=pd.read_csv('Catalogue.csv')
print(df)
I downloaded my earthquake csv file, but pandas doesn't see the file. I use VS Code and Python 3.8.3. I put the csv file in the same folder as the py file where I write my code.
Even when I used the same code in Jupyter Notebook (the csv was in the same folder as my code file), the result was the same.
I understand that for Excel files you need pip install xlrd. I tried pip install python-csv but the installation failed. Is it needed though? Or do I need to fix the csv file (commas or spaces)?
Full output:
Traceback (most recent call last):
File "c:/Users/Fatma Elik/Documents/VS Code/BTK/CSVCSV.py", line 2, in <module>
df=pd.read_csv('Catalogue.csv')
File "C:\Users\Fatma Elik\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\Fatma Elik\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py", line 448, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "C:\Users\Fatma Elik\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py", line 880, in __init__
self._make_engine(self.engine)
File "C:\Users\Fatma Elik\AppData\Local\Programs\Python\Python38-32\lib\der(src, **kwds)
File "pandas\_libs\parsers.pyx", line 374, in pandas._libs.parsers.TextReader.__cinit__
File "pandas\_libs\parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File Catalogue.csv does not exist: 'Catalogue.csv'
Thanks everyone!
Please try
import pandas as pd
df=pd.read_csv("c://Users//Fatma Elik//Documents//VS Code//BTK//Catalogue.csv")
print(df)
There are quite a few scenarios in which this situation might occur. Here are a few common suggestions for you to try.
Case 1 - Location where you run Python
Your file path is correct with respect to the location of the .py file, but incorrect with respect to the location from which you call python.
For example, let's say CSVCSV.py is located in ~/script/, and Catalogue.csv is located in ~/script/.
If you run python script/CSVCSV.py from ~/ , you will get the FileNotFound error. However, if from ~/script/ you run python CSVCSV.py, it will work.
In your case specifically, are you perhaps running python from .../BTK or .../VS Code ? I might take a guess that you are running python c:/Users/Fatma Elik/Documents/VS Code/BTK/CSVCSV.py.
Case 2 - Try using full directory path
Have you tried df = pd.read_csv("C://Users//Fatma Elik//Documents//VS Code//BTK//Catalogue.csv") ?
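Both cases can be avoided by building the path from the script's own location, so the result does not depend on where python is started. A minimal sketch; sibling_path is a hypothetical helper name:

```python
from pathlib import Path


def sibling_path(script_file, filename):
    """Return the absolute path of `filename` located next to `script_file`."""
    return Path(script_file).resolve().parent / filename


# In CSVCSV.py, pass __file__ so the CSV is found next to the script:
# import pandas as pd
# df = pd.read_csv(sibling_path(__file__, "Catalogue.csv"))
```

With this, running python script/CSVCSV.py from ~/ and running python CSVCSV.py from ~/script/ both look for the same file.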
This error also occurs when you try to write a file into a directory that does not exist. Let's say you are trying to write records to data/hist/sign_seqs.csv, but the hist directory is not present.
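For the write case, the missing directories can be created before opening the file. A minimal sketch using only the standard library; open_for_writing is a hypothetical helper name:

```python
from pathlib import Path


def open_for_writing(path):
    """Create the missing parent directories of `path`, then open it for writing."""
    target = Path(path)
    # parents=True creates data/ and data/hist/ as needed;
    # exist_ok=True makes the call a no-op when they already exist.
    target.parent.mkdir(parents=True, exist_ok=True)
    return target.open("w", newline="")


# Usage:
# with open_for_writing("data/hist/sign_seqs.csv") as fh:
#     fh.write("id,sequence\n")
```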

read xlsx file with sheet named as None. pandas xlrd

I have a bunch of xlsx files with sheets named None (empty string).
When I try to read the files using pandas, xlrd raises a list index out of range error.
Traceback (most recent call last):
File "/usr/local/bin/runxlrd.py", line 332, in main
ragged_rows=options.ragged_rows,
File "/Library/Python/2.7/site-packages/xlrd/__init__.py", line 416, in open_workbook
ragged_rows=ragged_rows,
File "/Library/Python/2.7/site-packages/xlrd/xlsx.py", line 791, in open_workbook_2007_xml
x12sheet.process_stream(zflo, heading)
File "/Library/Python/2.7/site-packages/xlrd/xlsx.py", line 528, in own_process_stream
self_do_row(elem)
File "/Library/Python/2.7/site-packages/xlrd/xlsx.py", line 667, in do_row
value = self.sst[int(tvalue)]
IndexError: list index out of range
I found this issue on the xlrd GitHub that I think is related.
If I change the name of the sheet, pandas successfully reads the file.
I can't share the files as an example (privacy issue), and when I tried to create a demo file with None as the sheet name, Excel raised an invalid name error.
Package versions:
pkg_resources.get_distribution("xlrd").version
Out[3]: '1.1.0'
pd.__version__
Out[4]: '0.23.0'
Is there a way to read this file with pandas or a script (in any language) that can change the sheet names?
This works for me using Python 2.7, pandas 0.23.3 and xlrd 1.1.0
>>> import xlrd
>>> import pandas
>>> xlrd_workbook = xlrd.open_workbook("test.xlsx")
>>> pandas.read_excel(xlrd_workbook, engine='xlrd')
A B C
0 123 456 789
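Since an .xlsx file is a zip archive, the sheet names can be inspected without pandas or xlrd, which may help confirm what the files actually contain. A minimal sketch using only the standard library; sheet_names is a hypothetical helper, and it assumes the workbook uses the usual SpreadsheetML namespace:

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main SpreadsheetML parts inside an .xlsx archive.
MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def sheet_names(xlsx):
    """Return the sheet names stored in xl/workbook.xml of an .xlsx file.

    `xlsx` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(xlsx) as archive:
        root = ET.fromstring(archive.read("xl/workbook.xml"))
    return [sheet.get("name") for sheet in root.iter(MAIN_NS + "sheet")]


# Usage:
# print(sheet_names("test.xlsx"))
```

An empty string in the returned list would confirm that a sheet really has an empty name.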

xlutils.copy [python 2.7 - excel]

I'm new to python (and to programming in general).
I am having a problem using xlrd, xlwt and xlutils for accessing an xlsx workbook (it is a common question, but I did not find any solution that works for me).
Should I change my package for py-excel?
In that case, which one?
Here is my code:
import xlrd
import xlwt
from xlutils.copy import copy as xlutils_copy
rd = xlrd.open_workbook("x:/PROJECTS/Papers/2014_Pasture/a.xlsx")
rdsh = rd.sheet_by_name("FR_PASTURE")
wrb = xlutils_copy(rd)
ws = wrb.get_sheet_by_name("FR_PASTURE")
And the error I am receiving:
Traceback (most recent call last):
File "X:\PROJECTS\Papers\2014_Pasture\AdjustXLSStats.py", line 28, in <module>
wrb = xlutils_copy(rd)
File "C:\Python27\lib\site-packages\xlutils-1.7.0-py2.7.egg\xlutils\copy.py", line 19, in copy
w
File "C:\Python27\lib\site-packages\xlutils-1.7.0-py2.7.egg\xlutils\filter.py", line 937, in process
reader(chain[0])
File "C:\Python27\lib\site-packages\xlutils-1.7.0-py2.7.egg\xlutils\filter.py", line 68, in __call__
filter.cell(row_x,col_x,row_x,col_x)
File "C:\Python27\lib\site-packages\xlutils-1.7.0-py2.7.egg\xlutils\filter.py", line 573, in cell
wtrow.set_cell_number(wtcolx, cell.value, style)
File "build\bdist.win-amd64\egg\xlwt\Row.py", line 203, in set_cell_number
self.__adjust_bound_col_idx(colx)
File "build\bdist.win-amd64\egg\xlwt\Row.py", line 78, in __adjust_bound_col_idx
raise ValueError("column index (%r) not an int in range(256)" % arg)
ValueError: column index (256) not an int in range(256)
Version of xlutils installed : 1.7.0
OS: windows 8
excel: office 20113
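xlrd, xlwt, and xlutils are for accessing xls files, and have not been updated for use with xlsx files, which results in your multiple errors. In this traceback the ValueError is raised by xlwt, most likely because the .xls format allows at most 256 columns and the sheet being copied has more.
As a workaround, there is now a Python library, openpyxl, which can easily read and write Excel xlsx/xlsm/xltx/xltm files.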
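A rough openpyxl equivalent of the code in the question might look like this. It is a sketch rather than a drop-in replacement for xlutils.copy; update_cell is a hypothetical helper, and the cell address and value in the usage line are placeholders:

```python
from openpyxl import load_workbook


def update_cell(path, sheet_name, cell, value):
    """Open an .xlsx workbook, set one cell on the named sheet, and save it."""
    workbook = load_workbook(path)
    sheet = workbook[sheet_name]  # raises KeyError if the sheet is missing
    sheet[cell] = value
    workbook.save(path)  # overwrites the original file


# Usage:
# update_cell("x:/PROJECTS/Papers/2014_Pasture/a.xlsx", "FR_PASTURE", "A1", 42)
```

openpyxl reads and writes the same workbook object, so no separate copy step is needed, and sheets wider than 256 columns are not a problem in the xlsx format.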
