Python: Unsupported format, or corrupt file - python

I am trying to make a python program that downloads and XLS file from a website, in this case website is: https://www.blackrock.com/uk/individual/products/291392/, and loads it as a dataframe in pandas, with the correct data structure.
The issue is that when I try to load it via pandas, it gives me an error: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf\xef\xbb\xbf<?'
I am not quite sure what is causing this error, but presumable something with the file. I can open the file in Excel, even though I get a warning that the file and the file extension do not match, and that the file might be dangerous etc. If I click yes to opening it anyway, it opens up with data displayed correctly. If I use Excel to save the file as .xlsx i can open it in pandas, but I would rather a solution that didn't require manually opening Excel and saving the file.
I have tried renaming the file extension to xlsx, but this does not work, as it won't allow me to open the file with that extension.
I have tried many different extension, but non of them bite - unfortunately.
I am at a loss.
I hope, you can help.
EDIT: The code I use is:
download_path = 'https://www.blackrock.com/uk/individual/products/291392/fund/1527484370694.ajax?fileType=xls&fileName=iShares-MSCI-World-SRI-UCITS-ETF-USD-Dist_fund&dataType=fund'
testing = pd.read_excel(download_path, engine='xlrd', sheet_name = 'Holdings', skiprows = 3)

The actual problem is that the file format is SpreadSheetML which has only been used briefly between 2003 and 2006. It has been overtaken by the XLSX format. Since, it has been around for a short time and while ago, most packages do not support for load/save operations. More about the format can be found here: https://learn.microsoft.com/en-us/previous-versions/office/developer/office-xp/aa140066(v=office.10)?redirectedfrom=MSDN
For this reason, the Pandas or any other XML parser (e.g Etree) will not be able to load properly. The regular MS Office software would still load it correctly. As far as I know, you can deal with SpreadSheetML files using aspose-cells package: https://products.aspose.com/cells/python-java/
For your case:
# Import packages
import jpype
import asposecells
jpype.startJVM()
from asposecells.api import Workbook, FileFormatType
from asposecells.api import HtmlSaveOptions
# Read Workbook
workbook = Workbook('iShares-MSCI-World-SRI-UCITS-ETF-USD-Dist_fund.xls')
worksheet = workbook.getWorksheets().get(0)
# Accessing a cell using its name
cells = worksheet.getCells()
cell = cells.get("A1")
# Print Message
print("Cell Value: " + str(cell.getValue())) # Prints Cell Value: 17-Nov-2021
# To save SpreadSheetML in different format (HTML)
saveOptions = HtmlSaveOptions()
saveOptions.setDisableDownlevelRevealedComments(True)
workbook.save("iShares-MSCI-World-SRI-UCITS-ETF-USD-Dist_fund.html", saveOptions)

As mentioned by Slybot, this is not a real xls file.
If you inspect the contents in a plain text editor, or a hex editor, the header starts:
<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
which confirms this is an xml document, and not an Office 2007 zipped xlsx office document.
Your next steps depend on whether you have Excel installed on the machine that will be running this code or not, and if not, what other libraries you have access to and are willing to pay for - Slybot has mentioned aspose for example.
The easiest solution - Excel
If you are running this on a Windows machine with Excel installed, you have the free and capable option of automating the operation of opening Excel and saving as xlsx. This is by using Win32com module, described in this answer:
Attempting to Parse an XLS (XML) File Using Python
Alternatively, save your Excel styled XML as xlsx with Workbook.SaveAs method using win32com (only for Windows users) and read in with pandas.read_excel skipping appropriate rows.
The XML solution
You could read in the raw XML and digest it. The relevant nodes are:
<ss:Workbook>
<ss:Worksheet ss:Name="Holdings">
<ss:Table>
<ss:Row>
<ss:Cell ss:StyleID="Left">
<ss:Data ss:Type="String">iShares MSCI World SRI UCITS ETF</ss:Data>
The Third-party library solution
I am not familiar with any libraries which provide this functionality, and can't advise on this option.

Related

Python: How to write data to an Excel file using pd.ExcelWriter?

The question:
I'm trying to write data to an Excel file using Python, specifically using the ExcelWriter function provided py Pandas as described here in the docs. I think I've onto something here, but I'm only able to achieve one of two outcomes:
1. If the Excel file is open, access permission is denied.
2. If the Excel file is closed, the code seems to be running just fine, but the following error message is provided when trying to open the file Excel file after execution:
Excel cannot open the file excelTest.xlsm because the file format or
file extension is not valid. Verify that the file has not been
corrupted and that the file extenstion matches the format of the file
Does anyone know what's going on here? Or is there perhaps a better way to do this than using pd.ExcelWrite?
The details:
I've got three files in the directory C:\pythontest:
1. input.txt
2. excelTest.xlsm
1. pythonTest.py
input.txt is a comma separated text file with this content:
A,B,C
1,4,6
2,5,5
3,5,6
excelTest.xlsm is an Excel file that is completely empty with the exception of of one empty sheet named Sheet1.
pythonTest.py is a script where I'm trying to read the txt file using Python, and then write a pandas dataframe to the Excel file:
import os
import pandas as pd
os.getcwd()
os.chdir('C:/pythonTest')
os.listdir(os.getcwd())
df = pd.read_csv('C:\\pythonTest\\input.txt')
writer = pd.ExcelWriter('excelTest.xlsm')
df.to_excel(writer,'Sheet2')
writer.save()
But as I've mentioned, it fails spectacularly. Any suggestions?
System info:
Windows 7, 64 bit
Excel Version 1803
Python 3.6.6 | Anaconda custom (64-bit) |
Pandas 0.23.4
EDIT 1 - print(df) output as requested in the comments:
Pandas requires that a workbook name ends in .xls or .xlsx. It uses the extension to choose which Excel engine to use.
So the problem you've got is the extension, due to "extension hardening" Excel won't open this file since it knows that it doesn't contain a macro and isn't actually an xlsm file. Writing to excelTest.xlsx should work!

How to import an old Excel file with extension.xls?

I have a file from SAS that is exported as an older Excel .xls file. I would like to import this file into python 3.5.
when I do:
import pandas as pd
Filewant = pd.read_excel("Filepath\\\Filename.xls")
I get a bunch of error messages culminating in
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<html xm'
if I open up the file and manually save it in a current .xlsx file and us the same command line using:
Filewant =pd.read_excel("Filepath\\Filename\.xlsx")
then the file is imported into Python properly. However, I want the process to be more automated so I don't to have to manually save the file to .xlsx format to make it work.
SAS tech support told me that this won't work and that I'll need to convert the .xls SAS output into a .xlsx file:
Unfortunately, the MSOffice2K destination creates an HTML file even though it uses the .XLS extension here which allows the file to be opened with excel.
You can use VBScript to convert the file to .XLSX, however, there is no way to do this using the MSoffice2K destination.
The error message tells you the problem. found b'<html xm' Your file is an HTML file and not an XLS file. This was commonly done with "old" SAS since it did not support writing XLS files, but Excel did support reading HTML files.

python excel processing error

I am working on the excel processing using python.
I am using xlrd module (version 0.6.1) for the same.
I am abe to fetch most of the excel files but for some excel files it gives me error as :
XLRDError: Expected BOF record; found 0x213c
Can anyone let me know about how to solve this issue?
thanks in advance.
What you have is most probably an "XML Spreadsheet 2003 (*.xml)" file ... "<!" aka "\x3c\x21" (which is what XML streams start with) is being interpreted as the little-endian number 0x213c.
Notepad: First two lines:
<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
You can also check this by opening the file with Excel and then click on Save As and look at the file-type that is displayed. While you are there, save it as an XLS file so that your xlrd can read it.
Note: this XML file is NOT the Excel 2007+ XLSX file. An XLSX is actually a ZIP file (starts with "PK", not "<?") containing a bunch of XML streams.

Python: Writing to Excel 2007+ files (.xlsx files)

Is there a Python module that writes Excel 2007+ files?
I'm interested in writing a file longer than 65535 lines and only Excel 2007+ supports it.
Take a look at Eric' Gazoni's openpyxl project. The code can be found on bitbucket.
There are two libraries you can take a look at.
Python-xlsx and
PyXLSX
EDIT: As the comments mention, for writing you check out openpyxl
You should take a look at xlsxcessive. It's for writing xlsx files, and is, perhaps, a bit more pythonic.
The XlsxWriter Python module writes 2007+ xlsx files.
If you are on Windows and have Excel 2007+ installed, you should be able to use pywin32 and COM to write XLSX files using almost the same code as you would would to write XLS files ... just change the "save as ...." part at the end.
Probably, you can also write XLSX files using Excel 2003 with the freely downloadable add-on kit but the number of rows per sheet would be limited to 64K.
Pyvot: http://packages.python.org/Pyvot/tutorial.html, although it is only for Excel 2010+
This should give an idea how to do that:
import xlsxwriter
workbook = xlsxwriter.Workbook("output_file.xlsx"))
# new sheet
worksheet = workbook.add_worksheet("sheet_name")
# Add a number format: 3 digits precision
precision = workbook.add_format({'num_format': '0.000'})
for(iterate your data):
worksheet.write(row, col, title)
worksheet.write_number(row, col, value, precision)
...
workbook.close()
So you want to write xlsx file, into my mind the Microsoft.office.excel.interop dll come to my mind, but not use it on a server.
I know you can call dll from python : http://msdn.microsoft.com/en-us/library/microsoft.office.interop.excel(office.11).aspx

How do i extract specific lines of data from a huge Excel sheet using Python?

I need to get specific lines of data that have certain key words in them (names) and write them to another file. The starting file is a 1.5 GB Excel file. I can't just open it up and save it as a different format. How should I handle this using python?
I'm the author and maintainer of xlrd. Please edit your question to provide answers to the following questions. [Such stuff in SO comments is VERY hard to read]
How big is the file in MB? ["Huge" is not a useful answer]
What software created the file?
How much memory do you have on your computer?
Exactly what happens when you try to open the file using Excel? Please explain "I can open it partially".
Exactly what is the error message that you get when you try to open "C:\bigfile.xls" with your script using xlrd.open_workbook? Include the script that you ran, the full traceback, and the error message
What operating system, what version of Python, what version of xlrd?
Do you know how many worksheets there are in the file?
It sounds to me like you have a spreadsheet that was created using Excel 2007 and you have only Excel 2003.
Excel 2007 can create worksheets with 1,048,576 rows by 16,384 columns while Excel 2003 can only work with 65,536 rows by 256 columns. Hence the reason you can't open the entire worksheet in Excel.
If the workbook is just bigger in dimension then xlrd should work for reading the file, but if the file is actually bigger than the amount of memory you have in your computer (which I don't think is the case here since you can open the file with EditPad lite) then you would have to find an alternate method because xlrd reads the entire workbook into memory.
Assuming the first case:
import xlrd
wb_path = r'c:\bigfile.xls'
output_path = r'c:\output.txt'
wb = xlrd.open(wb_path)
ws = wb.sheets()[0] # assuming you want to work with the first sheet in the workbook
with open(output_path, 'w') as output_file:
for i in xrange(ws.nrows):
row = [cell.value for cell in ws.row(i)]
# ... replace the following if statement with your own conditions ...
if row[0] == u'interesting':
output_file.write('\t'.join(row) + '\r\n')
This will give you a tab-delimited output file that should open in Excel.
Edit:
Based on your answer to John Machin's question 5, make sure there is a file called 'bigfile.xls' located in the root of your C drive. If the file isn't there, change the wb_path to the correct location of the file you want to open.
I haven't used it, but xlrd looks like it does a good job reading Excel data.
Your problem is that you are using Excel 2003 .. You need to use a more recent version to be able to read this file. 2003 will not open files bigger than 1M rows.

Categories