I'm new to Docker and need to run a Flask app that reads an xlsx file.
When I build my Docker image, the Excel file is included, but the app can't read it.
Here is the part of my code that uses the xlsx file:
try:
    # Page PayPlug
    infos_pay = pd.read_excel('./app-infos.xlsx', sheet_name='payplug')
    secret_key = infos_pay.loc[0].tolist()
    payplug.set_secret_key(secret_key[0])
    # Page mail
    feuille_mail = pd.read_excel('./app-infos.xlsx', sheet_name='mail')
    infos_mail = feuille_mail.loc[0].tolist()
except:
    print("Can't read file")
And my Dockerfile:
FROM python:3.8
WORKDIR /code
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD [ "python", "./web_AuSpot.py" ]
When I run ls in the container's shell, I can see the xlsx file, so why can't pandas read it?
Thank you, I hope my explanation is clear.
pandas does not need Excel support for its core operation, so the Excel reader is an optional dependency. You therefore also need to list the xlrd package in your requirements file. If that doesn't fix it, please add the contents of requirements.txt and the Docker error message to your question.
Other possibilities are:
It could be a file location problem.
It could be that pandas wasn't imported as pd (I don't see the import in the code shown).
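For the file-location possibility, here is a minimal sketch of resolving the workbook path relative to the script rather than the process's working directory. The helper name `resource_path` is hypothetical; the file names are taken from the question.

```python
from pathlib import Path


def resource_path(script_file, name):
    """Return the absolute path of `name` in the same directory as `script_file`."""
    return Path(script_file).resolve().parent / name


# In web_AuSpot.py this could be used as:
#     xlsx_path = resource_path(__file__, "app-infos.xlsx")
#     infos_pay = pd.read_excel(xlsx_path, sheet_name="payplug")
```

With this approach the lookup no longer depends on which directory the container starts the process in.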
I removed the try/except and now I get this error:
Traceback (most recent call last):
File "./web_AuSpot.py", line 46, in <module>
infos_pay = pd.read_excel('/app-infos.xlsx', sheet_name='payplug')
File "/usr/local/lib/python3.8/site-packages/pandas/util/_decorators.py", line 296, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 304, in read_excel
io = ExcelFile(io, engine=engine)
File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 867, in __init__
self._reader = self._engines[engine](self._io)
File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_xlrd.py", line 21, in __init__
import_optional_dependency("xlrd", extra=err_msg)
File "/usr/local/lib/python3.8/site-packages/pandas/compat/_optional.py", line 110, in import_optional_dependency
raise ImportError(msg) from None
ImportError: Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd.
(I couldn't fit this in a comment, so I'm posting it here.)
EDIT :
I added xlrd to the requirements and changed the path of the xlsx file, and now I get this:
Traceback (most recent call last):
File "./web_AuSpot.py", line 46, in <module>
infos_pay = pd.read_excel('app-infos.xlsx', sheet_name='payplug')
File "/usr/local/lib/python3.8/site-packages/pandas/util/_decorators.py", line 296, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 304, in read_excel
io = ExcelFile(io, engine=engine)
File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 867, in __init__
self._reader = self._engines[engine](self._io)
File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_xlrd.py", line 22, in __init__
super().__init__(filepath_or_buffer)
File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 353, in __init__
self.book = self.load_workbook(filepath_or_buffer)
File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_xlrd.py", line 37, in load_workbook
return open_workbook(filepath_or_buffer)
File "/usr/local/lib/python3.8/site-packages/xlrd/__init__.py", line 170, in open_workbook
raise XLRDError(FILE_FORMAT_DESCRIPTIONS[file_format]+'; not supported')
xlrd.biffh.XLRDError: Excel xlsx file; not supported
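This last error is what xlrd 2.0 and later raise for .xlsx files, because those releases read only the legacy .xls format. A minimal sketch of one way around it, assuming openpyxl is added to requirements.txt: choose the pandas engine from the file's leading bytes. The helper name `excel_engine_for` is hypothetical, and the result is only a guess based on the container format.

```python
ZIP_MAGIC = b"PK\x03\x04"                          # xlsx files are ZIP archives
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"   # legacy xls container


def excel_engine_for(path):
    """Guess which pandas Excel engine fits the file, based on its signature."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(ZIP_MAGIC):
        return "openpyxl"
    if head.startswith(OLE2_MAGIC):
        return "xlrd"
    raise ValueError(f"{path!r} does not look like an Excel workbook")


# Usage inside the app (pandas and openpyxl must be installed):
#     path = "app-infos.xlsx"
#     infos_pay = pd.read_excel(path, sheet_name="payplug",
#                               engine=excel_engine_for(path))
```

Passing `engine` explicitly avoids relying on the default, which differs between pandas versions.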
Related
I am trying to read an xls file from a URL into a pandas DataFrame, but I get the assertion error below.
Traceback (most recent call last):
File "c:\Sample_project\venv\excel_read_v1.py", line 17, in <module>
df = pd.read_excel("file.xls",
File "C:\Sample_project\venv\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Sample_project\venv\lib\site-packages\pandas\io\excel\_base.py", line 457, in read_excel
io = ExcelFile(io, storage_options=storage_options, engine=engine)
File "C:\Sample_project\venv\lib\site-packages\pandas\io\excel\_base.py", line 1419, in __init__
self._reader = self._engines[engine](self._io, storage_options=storage_options)
File "C:\Sample_project\venv\lib\site-packages\pandas\io\excel\_xlrd.py", line 25, in __init__
super().__init__(filepath_or_buffer, storage_options=storage_options)
File "C:\Sample_project\venv\lib\site-packages\pandas\io\excel\_base.py", line 518, in __init__
self.book = self.load_workbook(self.handles.handle)
File "C:\Sample_project\venv\lib\site-packages\pandas\io\excel\_xlrd.py", line 38, in load_workbook
return open_workbook(file_contents=data)
File "C:\Sample_project\venv\lib\site-packages\xlrd\__init__.py", line 172, in open_workbook
bk = open_workbook_xls(
File "C:\Sample_project\venv\lib\site-packages\xlrd\book.py", line 104, in open_workbook_xls
bk.parse_globals()
File "C:\Sample_project\venv\lib\site-packages\xlrd\book.py", line 1211, in parse_globals
self.handle_sst(data)
File "C:\Sample_project\venv\lib\site-packages\xlrd\book.py", line 1178, in handle_sst
self._sharedstrings, rt_runlist = unpack_SST_table(strlist, uniquestrings)
File "C:\Sample_project\venv\lib\site-packages\xlrd\book.py", line 1472, in unpack_SST_table
assert _unused_i == nstrings - 1
AssertionError
I read some suggestions on Stack Overflow saying that it works if the last few empty rows are removed from the Excel file. I tried that by downloading the file to a local folder, removing the last 2 empty rows, and then reading the file from the local folder, and this works. But I need the code to handle this while reading from the URL so that we can automate the process.
I have tried using openpyxl and xlrd to read the file.
Code snapshot below:
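import openpyxl
import xlrd
from xlrd import open_workbook
import requests
import pandas as pd
url = url  # real URL masked in this post
r = requests.get(url)
# Save the download; the same file name must be used when reading it back
with open('maskedfile.xls', 'wb') as output:
    output.write(r.content)
df = pd.read_excel("maskedfile.xls", sheet_name="maskedsheetname")
# Raw string so the backslashes in the Windows path are not treated as escapes
df.to_csv(r"C:\Sample_project\maskedfile.csv", index=False)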
I've been running a script for several months now where I read and concat several excel exports using the following code:
import os
import pandas as pd
excel_list = []
files = os.listdir(os.path.abspath('exports/'))
for file in files:
    # Only pick up the Access Point statistics exports
    if file.startswith('ap_statistics_') and file.endswith('.xlsx'):
        excel_list.append(pd.read_excel('exports/' + file, sheet_name='Access Points'))
df = pd.concat(excel_list, axis=0, ignore_index=True)
This has worked just fine until this Saturday when I uploaded new exports to the CentOS Stream 9 VM where I have a cronjob running the script every hour.
Now I always get this error:
Traceback (most recent call last):
File "/root/projects/beacon_check_v8/main.py", line 310, in <module>
ap_check()
File "/root/projects/beacon_check_v8/main.py", line 260, in ap_check
siteaps_result = getaps()
File "/root/projects/beacon_check_v8/main.py", line 30, in getaps
excel_list.append(pd.read_excel('exports/' + file, sheet_name='Access Points'))
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/pandas/io/excel/_base.py", line 457, in read_excel
io = ExcelFile(io, storage_options=storage_options, engine=engine)
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/pandas/io/excel/_base.py", line 1419, in __init__
self._reader = self._engines[engine](self._io, storage_options=storage_options)
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/pandas/io/excel/_openpyxl.py", line 525, in __init__
super().__init__(filepath_or_buffer, storage_options=storage_options)
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/pandas/io/excel/_base.py", line 518, in __init__
self.book = self.load_workbook(self.handles.handle)
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/pandas/io/excel/_openpyxl.py", line 536, in load_workbook
return load_workbook(
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/openpyxl/reader/excel.py", line 317, in load_workbook
reader.read()
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/openpyxl/reader/excel.py", line 277, in read
self.read_strings()
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/openpyxl/reader/excel.py", line 143, in read_strings
with self.archive.open(strings_path,) as src:
File "/usr/lib64/python3.9/zipfile.py", line 1523, in open
raise BadZipFile("Bad magic number for file header")
zipfile.BadZipFile: Bad magic number for file header
I develop on my Windows 10 notebook using PyCharm with a Python 3.9 venv (the same version as on the VM), and there the script continues to work just fine.
When researching online all I found was that sometimes .pyc files can cause issues so I created a completely new venv on the VM, installed all libraries (netmiko, pandas, openpyxl, etc.) and tried running the script again before and after deleting all .pyc files in the directory but no luck.
I have extracted the Excel file header using the following code:
with open('exports/' + file, 'rb') as myexcel:
    print(myexcel.read(4))
Unfortunately, it returns the same value in both my Windows venv and the CentOS venv:
b'PK\x03\x04'
I don't know if this header value is correct or not but I can read the files on my Windows notebook just fine using pandas or excel.
Any help would be greatly appreciated.
The issue was actually the program I used to transfer the files between my notebook and the VM, WinSCP. I don't know why or how this caused the error but I was able to fix it by transferring directly over pscp.
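The header `b'PK\x03\x04'` is the normal start of a ZIP archive, which is what an xlsx file is, so the first four bytes alone cannot reveal damage further into the file. A minimal sketch, using only the standard library, of two checks that could confirm a transfer problem: compare a checksum on both machines, and let `zipfile` verify every archive member. The function names are hypothetical.

```python
import hashlib
import zipfile


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_bad_member(path):
    """Return the name of the first corrupt archive member, or None if all pass.

    Raises zipfile.BadZipFile if the file cannot be opened as a ZIP at all.
    """
    with zipfile.ZipFile(path) as archive:
        return archive.testzip()
```

If `sha256_of` gives different digests for the same export on the notebook and on the VM, the file changed in transit, whatever the header says.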
I have a folder with many xlsx files that I'd like to convert to csv files.
During my research, I found several threads about this topic. Based on them, I put together the following code using glob and pandas:
import glob
import os
import pandas as pd
path = r'/Users/.../xlsx files'
excel_files = glob.glob(path + '/*.xlsx')
for excel in excel_files:
    # splitext strips only the extension, even if the folder path contains dots
    out = os.path.splitext(excel)[0] + '.csv'
    df = pd.read_excel(excel) # error occurs here
    df.to_csv(out)
Unfortunately, I got the following error message, which I could not interpret in this context or work out how to fix:
Traceback (most recent call last):
File "<input>", line 11, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/util/_decorators.py", line 299, in wrapper
return func(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 336, in read_excel
io = ExcelFile(io, storage_options=storage_options, engine=engine)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 1131, in __init__
self._reader = self._engines[engine](self._io, storage_options=storage_options)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/io/excel/_openpyxl.py", line 475, in __init__
super().__init__(filepath_or_buffer, storage_options=storage_options)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 391, in __init__
self.book = self.load_workbook(self.handles.handle)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/io/excel/_openpyxl.py", line 486, in load_workbook
return load_workbook(
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/reader/excel.py", line 317, in load_workbook
reader.read()
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/reader/excel.py", line 281, in read
apply_stylesheet(self.archive, self.wb)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/styles/stylesheet.py", line 198, in apply_stylesheet
stylesheet = Stylesheet.from_tree(node)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/styles/stylesheet.py", line 103, in from_tree
return super(Stylesheet, cls).from_tree(node)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/descriptors/serialisable.py", line 87, in from_tree
obj = desc.expected_type.from_tree(el)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/descriptors/serialisable.py", line 87, in from_tree
obj = desc.expected_type.from_tree(el)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/openpyxl/descriptors/serialisable.py", line 103, in from_tree
return cls(**attrib)
TypeError: __init__() got an unexpected keyword argument 'xfid'
Does anyone know how to fix this? Thanks a lot for your help!
I had the same problem. After some hours of thinking and searching, I realized the problem was actually the file itself. I opened it in MS Excel and saved it again, and the problem was solved.
The file had been downloaded, so I thought it was a "security" error or just an error in how the file was created.
EDIT:
It's not a security problem but an error in how the file was generated. The working file is twice the size (in KB) of the broken one.
One solution: if the file can be opened with xlrd==1.2.0, you can then pass the resulting Book (the file opened by xlrd) to read_excel.
import pandas as pd
import xlrd  # needs xlrd==1.2.0; later releases no longer read xlsx
# Gives the same result as df = pd.read_excel('TabelaPrecos.xlsx') when that call works
a = xlrd.open_workbook('TabelaPrecos.xlsx')
b = pd.read_excel(a)
I am trying to read an xlsx file into a DataFrame:
itut_ir = pd.read_excel('C:\\Users\\Administrator\\Downloads\\reportdata.xlsx')
print(itut_ir.to_string())
I receive this:
Traceback (most recent call last):
File "C:\Users\Administrator\eclipse-workspace\Reports\GOW\Report.py", line 44, in <module>
df = pd.read_excel('C:\Users\Administrator\Downloads\reportdata.xlsx')
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel\_base.py", line 304, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel\_base.py", line 824, in __init__
self._reader = self._engines[engine](self._io)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel\_xlrd.py", line 21, in __init__
super().__init__(filepath_or_buffer)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel\_base.py", line 353, in __init__
self.book = self.load_workbook(filepath_or_buffer)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel\_xlrd.py", line 36, in load_workbook
return open_workbook(filepath_or_buffer)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\xlrd\__init__.py", line 117, in open_workbook
zf = zipfile.ZipFile(filename)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\zipfile.py", line 1222, in __init__
self._RealGetContents()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\zipfile.py", line 1289, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
Does anybody have an idea? The file does not seem to be broken; I can open it with Excel.
thanks!
*** UPDATE ***
The file producing the error is downloaded from FTP. Opening the original file works, if that gives you a hint :) Thanks.
I had the same issue just a little bit ago with an XLSX that I created in LibreOffice.
The solution was to check the XLSX to make sure it wasn't corrupted. In my case, loading a previous version of the XLSX file corrected the problem.
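Since the copy downloaded from FTP fails while the original opens, the transfer itself is a likely suspect. One common cause, which the posts here do not confirm, is a text-mode (ASCII) transfer altering bytes in a binary file. A minimal sketch using the standard library, assuming `ftp` is an already connected `ftplib.FTP` object; the function name and the host in the usage comment are hypothetical.

```python
import zipfile


def download_xlsx(ftp, remote_name, local_path):
    """Fetch remote_name in binary mode; return True if the result is a ZIP archive."""
    with open(local_path, "wb") as fh:
        # retrbinary passes the bytes through unchanged
        ftp.retrbinary(f"RETR {remote_name}", fh.write)
    return zipfile.is_zipfile(local_path)


# Usage (host and login are placeholders):
#     from ftplib import FTP
#     with FTP("ftp.example.com") as ftp:
#         ftp.login()
#         ok = download_xlsx(ftp, "reportdata.xlsx", "reportdata.xlsx")
```

A `False` return means the downloaded file is not a valid ZIP container, so `read_excel` would fail on it as well.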
I am writing a small script that needs to merge many one-page pdf files. I want the script to run with Python3 and to have as few dependencies as possible.
For the PDF merging part, I tried using PyPdf. However, the Python 3 support seems to be buggy; it can't handle Inkscape-generated PDF files (which I need). I have the current git version of PyPdf installed, and the following test script doesn't work:
import PyPDF2
output_pdf = PyPDF2.PdfFileWriter()
with open("testI.pdf", "rb") as input:
    input_pdf = PyPDF2.PdfFileReader(input)
    output_pdf.addPage(input_pdf.getPage(0))
with open("test.pdf", "wb") as output:
    output_pdf.write(output)
It throws the following stack trace:
Traceback (most recent call last):
File "test.py", line 7, in <module>
output.addPage(input.getPage(0))
File "/usr/lib/python3.3/site-packages/pyPdf/pdf.py", line 420, in getPage
self._flatten()
File "/usr/lib/python3.3/site-packages/pyPdf/pdf.py", line 574, in _flatten
self._flatten(page.getObject(), inherit)
File "/usr/lib/python3.3/site-packages/pyPdf/generic.py", line 165, in getObject
return self.pdf.getObject(self).getObject()
File "/usr/lib/python3.3/site-packages/pyPdf/pdf.py", line 616, in getObject
retval = readObject(self.stream, self)
File "/usr/lib/python3.3/site-packages/pyPdf/generic.py", line 66, in readObject
return DictionaryObject.readFromStream(stream, pdf)
File "/usr/lib/python3.3/site-packages/pyPdf/generic.py", line 526, in readFromStream
value = readObject(stream, pdf)
File "/usr/lib/python3.3/site-packages/pyPdf/generic.py", line 57, in readObject
return ArrayObject.readFromStream(stream, pdf)
File "/usr/lib/python3.3/site-packages/pyPdf/generic.py", line 152, in readFromStream
obj = readObject(stream, pdf)
File "/usr/lib/python3.3/site-packages/pyPdf/generic.py", line 86, in readObject
return NumberObject.readFromStream(stream)
File "/usr/lib/python3.3/site-packages/pyPdf/generic.py", line 231, in readFromStream
return FloatObject(name.decode("ascii"))
File "/usr/lib/python3.3/site-packages/pyPdf/generic.py", line 207, in __new__
return decimal.Decimal.__new__(cls, str(value), context)
TypeError: optional argument must be a context
The same script, however, works flawlessly with Python 2.7.
What am I doing wrong here? Is it a bug in the library? Can I work around it without touching the PyPDF library?
So I found the answer. The decimal.Decimal class in Python 3.3 shows some weird behaviour. This is the corresponding Stack Overflow question: "Instantiate Decimal class". I added a workaround to the PyPDF2 library and submitted a pull request.
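The traceback shows the library forwarding a `context` argument straight to `decimal.Decimal.__new__`. A minimal sketch of one possible workaround, assuming the failure comes from passing `None` as that context; this is not the patch that was actually submitted, and the class name `SafeFloat` is hypothetical.

```python
import decimal


class SafeFloat(decimal.Decimal):
    """Decimal subclass that only forwards `context` when one was supplied."""

    def __new__(cls, value="0", context=None):
        if context is None:
            # Let Decimal fall back to the current thread's context
            return decimal.Decimal.__new__(cls, str(value))
        return decimal.Decimal.__new__(cls, str(value), context)
```

The subclass behaves like a normal `Decimal`, for example `SafeFloat("1.5") + 1` gives a `Decimal` result.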
Just to make sure you are aware of already existing tools that do exactly this:
PDFtk
PDFjam (my favourite, requires LaTeX though)
Directly with GhostScript:
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=finished.pdf file1.pdf file2.pdf
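Because the question asks for a script, here is a minimal sketch that wraps the GhostScript command above with `subprocess`, assuming the `gs` executable is on the PATH. The function names are hypothetical; the flags are exactly those of the command above.

```python
import subprocess


def build_gs_merge_command(output_pdf, input_pdfs):
    """Build the GhostScript argument list that concatenates input_pdfs into output_pdf."""
    return [
        "gs", "-dBATCH", "-dNOPAUSE", "-q",
        "-sDEVICE=pdfwrite",
        f"-sOutputFile={output_pdf}",
        *input_pdfs,
    ]


def merge_pdfs(output_pdf, input_pdfs):
    """Run GhostScript; raises CalledProcessError if gs exits with a non-zero status."""
    subprocess.run(build_gs_merge_command(output_pdf, input_pdfs), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.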