XLRD unsupported format found - python

Using version 0.9.2 of XLRD and Python 2.7 on Windows.
I'm creating a temporary file and then reading it with XLRD:
data = self.excel_file
path = default_storage.save('temp/temp.xls', ContentFile(data.read()))
tmp_file = os.path.join(settings.MEDIA_ROOT, path)
workbook = xlrd.open_workbook(tmp_file)
As soon as I try to open the XLS file, it fails with the error:
xlrd.biffh.XLRDError: Unsupported format found '\xd0\xcf\x11\xe0\xa1\xb1\x00\x00'
So I guess the file is not being saved correctly above. I'm not sure what to do about this, as it works when the file is uploaded through the Django admin in a file field.
Saving the file to a Django model from Python like this produces the error above:
from django.core.files import File
p = Foo()
p.excel_file.save(file_name, File(data))
p.save()

It looks to me like this could be a Unicode issue. I'm guessing there are non-ASCII characters in your strings? Try using .encode("utf-8") on your strings when you save them to the XLS file.
EDIT: this was a guess; after more investigation by @Harry, it looks like it's not the correct solution.
EDIT 2: If the file cannot be opened by Excel, as discussed below, then the data itself is probably the problem.
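One way to test whether the saved data itself is the problem is to compare the first bytes of the temporary file with the signature that starts every genuine .xls (OLE2 compound) file, D0 CF 11 E0 A1 B1 1A E1. The bytes in the error message above match only the first six of these. The following is a minimal sketch; the function name `looks_like_ole2` is illustrative and is not part of xlrd or Django.

```python
# Signature that starts every OLE2 compound file, the container used by legacy .xls.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def looks_like_ole2(path):
    """Return True if the file at `path` starts with the OLE2 signature."""
    with open(path, "rb") as handle:  # binary mode: bytes are read unchanged
        return handle.read(len(OLE2_MAGIC)) == OLE2_MAGIC


# Bytes reported in the error message above, for comparison.
reported = b"\xd0\xcf\x11\xe0\xa1\xb1\x00\x00"
print(reported == OLE2_MAGIC)          # False
print(reported[:6] == OLE2_MAGIC[:6])  # True
```

If the check fails on the temporary file but passes on the original upload, the bytes were altered while being copied. On Windows, a file handle opened in text mode rather than binary mode is one possible cause; that is an assumption, not something the question confirms.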

Related

Python: Unsupported format, or corrupt file

I am trying to write a Python program that downloads an XLS file from a website, in this case https://www.blackrock.com/uk/individual/products/291392/, and loads it as a dataframe in pandas with the correct data structure.
The issue is that when I try to load it via pandas, it gives me an error: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf\xef\xbb\xbf<?'
I am not quite sure what is causing this error, but presumably it is something with the file. I can open the file in Excel, even though I get a warning that the file format and the file extension do not match and that the file might be dangerous. If I click yes to open it anyway, the data is displayed correctly. If I use Excel to save the file as .xlsx, I can open it in pandas, but I would rather have a solution that doesn't require manually opening Excel and saving the file.
I have tried renaming the file extension to xlsx, but this does not work, as it won't allow me to open the file with that extension.
I have tried many different extensions, but none of them work, unfortunately.
I am at a loss.
I hope, you can help.
EDIT: The code I use is:
download_path = 'https://www.blackrock.com/uk/individual/products/291392/fund/1527484370694.ajax?fileType=xls&fileName=iShares-MSCI-World-SRI-UCITS-ETF-USD-Dist_fund&dataType=fund'
testing = pd.read_excel(download_path, engine='xlrd', sheet_name = 'Holdings', skiprows = 3)
The actual problem is that the file is in the SpreadsheetML format, which was used only briefly, between 2003 and 2006, before being superseded by XLSX. Because it was in use for a short time and a while ago, most packages do not support loading or saving it. More about the format can be found here: https://learn.microsoft.com/en-us/previous-versions/office/developer/office-xp/aa140066(v=office.10)?redirectedfrom=MSDN
For this reason, neither pandas nor a generic XML parser (e.g. ElementTree) will load it properly out of the box. The regular MS Office software still loads it correctly. As far as I know, you can deal with SpreadsheetML files using the aspose-cells package: https://products.aspose.com/cells/python-java/
For your case:
# Import packages
import jpype
import asposecells
jpype.startJVM()
from asposecells.api import Workbook, FileFormatType
from asposecells.api import HtmlSaveOptions
# Read Workbook
workbook = Workbook('iShares-MSCI-World-SRI-UCITS-ETF-USD-Dist_fund.xls')
worksheet = workbook.getWorksheets().get(0)
# Accessing a cell using its name
cells = worksheet.getCells()
cell = cells.get("A1")
# Print Message
print("Cell Value: " + str(cell.getValue())) # Prints Cell Value: 17-Nov-2021
# To save SpreadSheetML in different format (HTML)
saveOptions = HtmlSaveOptions()
saveOptions.setDisableDownlevelRevealedComments(True)
workbook.save("iShares-MSCI-World-SRI-UCITS-ETF-USD-Dist_fund.html", saveOptions)
As mentioned by Slybot, this is not a real xls file.
If you inspect the contents in a plain text editor, or a hex editor, the header starts:
<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
which confirms this is an XML document, and not a zipped Office 2007 xlsx document.
Your next steps depend on whether you have Excel installed on the machine that will be running this code or not, and if not, what other libraries you have access to and are willing to pay for - Slybot has mentioned aspose for example.
The easiest solution - Excel
If you are running this on a Windows machine with Excel installed, you have the free and capable option of automating Excel to open the file and save it as xlsx. This can be done with the win32com module, as described in this answer:
Attempting to Parse an XLS (XML) File Using Python
Alternatively, save your Excel styled XML as xlsx with Workbook.SaveAs method using win32com (only for Windows users) and read in with pandas.read_excel skipping appropriate rows.
The XML solution
You could read in the raw XML and digest it. The relevant nodes are:
<ss:Workbook>
<ss:Worksheet ss:Name="Holdings">
<ss:Table>
<ss:Row>
<ss:Cell ss:StyleID="Left">
<ss:Data ss:Type="String">iShares MSCI World SRI UCITS ETF</ss:Data>
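The node structure above can be walked with the standard library alone. The sketch below assumes that, once the leading byte-order marks are removed (the error message shows two of them before `<?`), the rest of the file is well-formed XML; this has not been verified against the real download. The function name `read_spreadsheetml` and the embedded sample are illustrative.

```python
import xml.etree.ElementTree as ET

SS_NS = "urn:schemas-microsoft-com:office:spreadsheet"
BOM = b"\xef\xbb\xbf"


def read_spreadsheetml(raw, sheet_name):
    """Return the rows of one worksheet as lists of cell strings."""
    # Remove every leading UTF-8 byte-order mark before parsing.
    while raw.startswith(BOM):
        raw = raw[len(BOM):]
    root = ET.fromstring(raw)
    ns = {"ss": SS_NS}
    rows = []
    for sheet in root.findall("ss:Worksheet", ns):
        if sheet.get("{%s}Name" % SS_NS) != sheet_name:
            continue
        for row in sheet.findall("ss:Table/ss:Row", ns):
            # A cell without a Data child is recorded as an empty string.
            rows.append([cell.findtext("ss:Data", default="", namespaces=ns)
                         for cell in row.findall("ss:Cell", ns)])
    return rows


# Tiny stand-in for the downloaded file, including the doubled BOM.
SAMPLE = BOM + BOM + b"""<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<ss:Worksheet ss:Name="Holdings">
<ss:Table>
<ss:Row>
<ss:Cell ss:StyleID="Left"><ss:Data ss:Type="String">iShares MSCI World SRI UCITS ETF</ss:Data></ss:Cell>
</ss:Row>
</ss:Table>
</ss:Worksheet>
</ss:Workbook>"""

print(read_spreadsheetml(SAMPLE, "Holdings"))
# [['iShares MSCI World SRI UCITS ETF']]
```

The resulting list of rows can be passed to pandas.DataFrame, dropping the leading rows as needed. The sketch ignores attributes such as ss:Index and ss:MergeAcross, so sparse or merged rows would need extra handling.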
The Third-party library solution
I am not familiar with any libraries which provide this functionality, and can't advise on this option.

Loading xls files with pandas fails

I am trying to load an xls file with pandas using:
pd.read_excel(fi_name, sheet_name=None, engine=None)
But I get this error:
"XLRDError: Workbook is encrypted"
The file is not encrypted, though: I can open it with Excel and read the file's text with the tika package.
Does anyone know how I can solve this?
Also, does anyone know of a Python package that reads all Excel file formats, even when pandas fails?
Thanks
I think I found something for your problem:
import msoffcrypto
# If the file opens in Excel without a password prompt, the default password is 'VelvetSweatshop'
with open('encrypted.xls', 'rb') as source, open('decrypted.xls', 'wb') as target:
    office_file = msoffcrypto.OfficeFile(source)  # read the original file
    office_file.load_key(password='VelvetSweatshop')  # fill in the password
    office_file.decrypt(target)  # save a decrypted copy as a new file
After that, you can use xlrd to open and work with the decrypted file normally.
You can install the package with
pip install msoffcrypto-tool
and you can see the full documentation here
There are 2 possible reasons for this:
The file that you are getting is not in the same file format as the file extension says.
Either the whole workbook or a sheet of it is password protected and hence the data being read from it is encrypted to protect the data.

Corrupt excel file when serving file using Django and openpyxl

I have an issue with trying to serve my Excel file as an HttpResponse: it comes out almost double in size and is corrupt when I try to open it.
My setup is as follows:
I have an excel template which I make a copy of using shutil2
I then open the copy using load_workbook() in openpyxl
I populate the workbook with data
I then (as per the openpyxl docs) save the workbook as a stream and save it as an httpresponse object to be returned to my website through django
with NamedTemporaryFile() as tmp:
    wb.save(tmp.name)
    tmp.seek(0)
    stream = tmp.read()
response = HttpResponse(content=stream, content_type="application/ms-excel")
# I've also tried with application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
response['Content-Disposition'] = 'attachment; filename=Excel.xls'
return response
In my template I have some jquery code which downloads the excel for the user. This is done following the answer from this question: Get excel file (.xlsx) from server response in ajax
This all works fine, but the downloaded file is corrupt, and while I am expecting it to be around 7 MB in size, it's actually 16 MB.
Does anyone have any idea what I am doing wrong?
Have you tried opening it in binary mode? That helped me. I believe the code below should help:
with NamedTemporaryFile(mode='rb') as tmp:
..
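An alternative that avoids the temporary file altogether is to save the workbook into an in-memory buffer. This is only a sketch: it assumes the workbook's `save()` method accepts a binary file-like object, as openpyxl's does, and the helper name `workbook_to_bytes` is illustrative.

```python
from io import BytesIO


def workbook_to_bytes(workbook):
    """Serialise a workbook to bytes without touching the file system."""
    buffer = BytesIO()
    workbook.save(buffer)      # assumed to accept a binary file-like object
    return buffer.getvalue()   # full content, regardless of stream position


# Hypothetical use inside a Django view (not run here):
#   response = HttpResponse(
#       workbook_to_bytes(wb),
#       content_type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
#   )
#   response["Content-Disposition"] = "attachment; filename=Excel.xlsx"
```

Because openpyxl writes the xlsx format, the hypothetical usage names the download Excel.xlsx. If the download is still about twice the expected size, the bytes may be getting decoded as text somewhere between the view and the browser, for example in the Ajax handler; that is a guess, not something the question confirms.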

How to import an old Excel file with extension .xls?

I have a file from SAS that is exported as an older Excel .xls file. I would like to import this file into Python 3.5.
When I do:
import pandas as pd
Filewant = pd.read_excel("Filepath\\Filename.xls")
I get a bunch of error messages culminating in
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<html xm'
If I open the file, manually save it as a current .xlsx file, and use the same command:
Filewant = pd.read_excel("Filepath\\Filename.xlsx")
then the file is imported into Python properly. However, I want the process to be more automated so I don't have to manually save the file in .xlsx format to make it work.
SAS tech support told me that this won't work and that I'll need to convert the .xls SAS output into a .xlsx file:
Unfortunately, the MSOffice2K destination creates an HTML file even though it uses the .XLS extension here which allows the file to be opened with excel.
You can use VBScript to convert the file to .XLSX, however, there is no way to do this using the MSoffice2K destination.
The error message tells you the problem: found b'<html xm'. Your file is an HTML file, not an XLS file. This was commonly done with "old" SAS, since it did not support writing XLS files but Excel did support reading HTML files.
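Since the content is HTML, it can be read as HTML instead of being converted first. pandas.read_html is one option when an HTML parser such as lxml is installed. The sketch below uses only the standard library and assumes the export contains simple, non-nested tables; the class and function names are illustrative, and the sample markup stands in for the SAS output, which was not shown in the question.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside <tr>
        self._cell = None   # text fragments of the cell being read, or None

    def _close_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()       # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()      # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()


def html_table_rows(text):
    """Return all table rows found in an HTML string as lists of strings."""
    parser = TableRows()
    parser.feed(text)
    parser.close()
    return parser.rows


sample = ("<html><body><table><tr><th>Name</th><th>Age</th></tr>"
          "<tr><td>Alice</td><td>13</td></tr></table></body></html>")
print(html_table_rows(sample))
# [['Name', 'Age'], ['Alice', '13']]
```

The rows can then be handed to pandas.DataFrame, using the first row as column names. Rows from several tables are returned in one flat list, so a file with more than one table would need them separated.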

Reading a file from Google Cloud Storage with XLRD (python)

I'm trying to read a file stored in one of my buckets in GAE.
The file is stored in a public bucket
I've tried to:
archivo=cloudstorage.open('/bucket/workbook.xlsx')
wb = xlrd.open_workbook(filename=archivo)
but xlrd expects to open the file by itself, so I get a TypeError:
TypeError: coercing to Unicode: need string or buffer, ReadBuffer found
Is there any way to give xlrd an open file so I can read it without having to change xlrd.py?
I should read the documentation with more attention before asking stuff...
To provide xlrd with an open file instead of a filename, I have to pass the file's contents.
This is done by:
archivo=cloudstorage.open('/bucket/workbook.xlsx')
wb = xlrd.open_workbook(file_contents=archivo.read())
