How to open an XLS file from stream input in Python? - python

I'm using a software product that opens an XLS file and sends its contents as this byte string:
PK\x03\x04\x14\x00\x06\x00\b\x00\x00\x00!\x00{\x92..
I've tried the code below and get this error:
Unsupported format, or corrupt file: Expected BOF record; found b'<\xac\xc2\xa2{^\x9e\xd4' [file '/home/vflow/.local/lib/python3.6/site-packages/xlrd/book.py', line 1278]
The question is: how do I open this as an .xlsx workbook in Python when I receive this string?
The XLS file can be opened in MS Office. As a second try, I created a simple .xlsx file with the word 'test' in one column. The software we have serves the above string (PK...). Note that the string starts with PK, which indicates a ZIP file (.xlsx files are ZIP archives, so the byte string looks valid).
The approach below comes from this similar question:
Read byte string as xls file
import xlrd
import base64
myvar = data.body # contains : PK\x03\x04\x14\x00\x06\x00\b\x00\x00\x00!\x00{\x92..
decoded_bytes = base64.b64decode(myvar) # contains : <\xac¢{^\x9e\xd4\xf2\xa5\xeb1\x9aY\xf4\x13[PPI\xb2..
x = xlrd.open_workbook(file_contents=decoded_bytes)
# Unsupported format, or corrupt file: Expected BOF record; found b'<\xac\xc2\xa2{^\x9e\xd4' [file '/home/vflow/.local/lib/python3.6/site-packages/xlrd/book.py', line 1278]
x = xlrd.open_workbook(file_contents=data.body) # use data.body instead
# Excel xlsx file; not supported [file '/home/vflow/.local/lib/python3.6/site-packages/xlrd/__init__.py', line 170]
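Taken together, the two errors suggest that the payload is already raw .xlsx bytes: base64-decoding it turns a valid ZIP stream into garbage, and the second message indicates that this xlrd installation does not read .xlsx. A minimal sketch, using only the standard library, that checks this in memory follows. The helper name `looks_like_xlsx` is hypothetical, and the check is a heuristic, not a full validation. Reading the cells would then need an .xlsx-capable reader, for example `openpyxl.load_workbook(io.BytesIO(payload))`, which is not exercised here.

```python
import io
import zipfile


def looks_like_xlsx(payload: bytes) -> bool:
    """Heuristic: is payload a ZIP container that holds Excel workbook parts?"""
    # .xlsx files are ZIP archives, so they start with the local file header "PK\x03\x04".
    if not payload.startswith(b"PK\x03\x04"):
        return False
    try:
        # Wrap the bytes in a file-like object; no temporary file is needed.
        with zipfile.ZipFile(io.BytesIO(payload)) as archive:
            names = archive.namelist()
    except zipfile.BadZipFile:
        return False
    # Office Open XML packages carry a content-types part; Excel parts live under xl/.
    return "[Content_Types].xml" in names and any(n.startswith("xl/") for n in names)
```

If this returns True for `data.body`, the bytes should be passed to the reader unchanged, without `base64.b64decode`.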

Related

How to parse uploaded Excel file?

I'm trying to use Pandas to parse an Excel file someone uploaded to a Flask web application but not having much success.
I save the raw stream to a temporary file and then try to read it but pandas complains about the raw byte array:
tmpfile = tempfile.NamedTemporaryFile()
tmpfile.write(request.file['spreadsheet'].read())
sheet = pandas.ExcelFile(tmpfile.name)
results in the error:
*** XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '\\xd0\\xcf'
Is there a way to do this?
Found the answer: I need to tmpfile.seek(0) before being able to read from it.
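The fix can be sketched with the standard library alone. After a write, the file position sits at the end of the data and some bytes may still be buffered, so a reader sees nothing (or a partial file) until the handle is flushed and rewound. The function name `spool_to_tempfile` is hypothetical, and the upload is represented here by a plain `bytes` argument rather than a Flask request object.

```python
import tempfile


def spool_to_tempfile(raw: bytes):
    """Write raw bytes to a temporary file and rewind it so it can be read."""
    tmp = tempfile.NamedTemporaryFile()
    tmp.write(raw)
    tmp.flush()   # push buffered bytes to disk so readers opening tmp.name see them
    tmp.seek(0)   # move the position back to the start for readers using this handle
    return tmp
```

The returned handle can be given to any reader that accepts a file-like object; the temporary file is deleted when the handle is closed.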

python: converting corrupt xls file

I have downloaded a few sales datasets from an SAP application. SAP automatically converted the data to .XLS files. Whenever I open one using the Pandas library I get the following error:
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '\xff\xfe\r\x00\n\x00\r\x00'
When I open the .XLS file in MS Excel, it shows a popup saying that the file is corrupt or has an unsupported extension and asks whether I want to continue. When I click 'Yes' it shows the correct data. After saving the file again as .xls from MS Excel, I am able to read it with Pandas.
So I tried renaming the file using os.rename(), but it didn't work. I also tried opening the file and removing \xff\xfe\r\x00\n\x00\r\x00, but that didn't work either.
The only solution so far is to open MS Excel and save the file again as .xls manually. Is there any way to automate this? Kindly help.
Finally I converted the corrupt .xls to a correct .xls file. The following is the code:
# Changing the data types of all strings in the module at once
from __future__ import unicode_literals
# Used to save the file as excel workbook
# Need to install this library
from xlwt import Workbook
# Used to open to corrupt excel file
import io
filename = r'SALEJAN17.xls'
# Opening the file using 'utf-16' encoding
file1 = io.open(filename, "r", encoding="utf-16")
data = file1.readlines()
# Creating a workbook object
xldoc = Workbook()
# Adding a sheet to the workbook object
sheet = xldoc.add_sheet("Sheet1", cell_overwrite_ok=True)
# Iterating and saving the data to sheet
for i, row in enumerate(data):
    # Two things are done here:
    # removing the '\n' that comes from reading the file with io.open,
    # and getting the values by splitting on '\t'
    for j, val in enumerate(row.replace('\n', '').split('\t')):
        sheet.write(i, j, val)
# Saving the file as an excel file
xldoc.save('myexcel.xls')
import pandas as pd
df = pd.ExcelFile('myexcel.xls').parse('Sheet1')
No errors.
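The leading bytes \xff\xfe in the error are a UTF-16 little-endian byte order mark, which is why opening the file with encoding "utf-16" works: the export is tab-separated text, not a binary workbook. Under that assumption, the rows can also be read directly without writing an intermediate .xls. A minimal sketch, with the hypothetical helper name `rows_from_utf16_tsv`:

```python
import csv
import io


def rows_from_utf16_tsv(raw: bytes) -> list:
    """Decode UTF-16 tab-separated text and return its non-empty rows."""
    # The "utf-16" codec reads the byte order mark and strips it.
    text = raw.decode("utf-16")
    # newline="" leaves line endings untouched so the csv module can handle them.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t")
    # Blank lines (the file in question starts with some) come back as empty lists.
    return [row for row in reader if row]
```

For a file on disk, the bytes would come from `open(filename, "rb").read()`; every value is returned as a string, so numeric columns still need converting.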
The other way to solve this problem is using win32com.client library:
import win32com.client
import os
o = win32com.client.Dispatch("Excel.Application")
o.Visible = False
filename = os.getcwd() + '/' + 'SALEJAN17.xls'
output = os.getcwd() + '/' + 'myexcel.xlsx'
wb = o.Workbooks.Open(filename)
wb.ActiveSheet.SaveAs(output,51)
In my example you save to .xlsx format but you can save as .xls as well.

How to import an old Excel file with extension.xls?

I have a file from SAS that is exported as an older Excel .xls file. I would like to import this file into python 3.5.
When I do:
import pandas as pd
Filewant = pd.read_excel("Filepath\\Filename.xls")
I get a bunch of error messages culminating in
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<html xm'
If I open the file and manually save it as a current .xlsx file, then use the same command:
Filewant = pd.read_excel("Filepath\\Filename.xlsx")
the file is imported into Python properly. However, I want the process to be more automated so that I don't have to manually save the file in .xlsx format to make it work.
SAS tech support told me that this won't work and that I'll need to convert the .xls SAS output into a .xlsx file:
Unfortunately, the MSOffice2K destination creates an HTML file even though it uses the .XLS extension here which allows the file to be opened with excel.
You can use VBScript to convert the file to .XLSX, however, there is no way to do this using the MSoffice2K destination.
The error message tells you the problem. found b'<html xm' Your file is an HTML file and not an XLS file. This was commonly done with "old" SAS since it did not support writing XLS files, but Excel did support reading HTML files.
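Since the file is really an HTML document, it can be read with an HTML parser instead of an Excel reader. A minimal sketch using only the standard library follows; the class name `TableRows` and the function `rows_from_html` are hypothetical, and the sketch assumes simple, non-nested tables. `pandas.read_html` is an alternative that returns DataFrames directly, provided an HTML parsing backend such as lxml is installed.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def rows_from_html(text: str) -> list:
    """Return all table rows found in an HTML document as lists of strings."""
    parser = TableRows()
    parser.feed(text)
    parser.close()
    return parser.rows
```

The file contents would be read as text first (the right encoding depends on the export and is not known here) and then passed to `rows_from_html`.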

how to convert Excel file to CSV and prevent UTF-8 encoding

I have 5 Excel files that have to be compiled into one csv file that can be uploaded to our website for our affiliated stores database. Until now we've had someone manually cut and paste the rows of each file into one master csv file in Excel then they upload that file to the website.
I've been trying to use Python to consolidate the files so the user would just have to run the Python script that would do this for her. The problem is that the Excel files are encoded in Shift-JIS and when I use CSV writer in Python they get converted to UTF-8. The website we upload them to will only accept files in Shift-JIS, so I have to keep all of this data in Shift-JIS.
Since DOS automatically defaults to ascii encoding, I first have to run this:
import codecs, sys, xlrd, csv
reload(sys)
sys.setdefaultencoding('shift_jis')
Here is a sample of the code for one of the Excel files, which has data on 2 separate worksheets:
with xlrd.open_workbook('Circle.xls') as wb:
    for sheet in wb.sheets():
        fn = 'store-'
        print "Converting files.."
        with open(fn + sheet.name + ".csv","wb") as f:
            c = csv.writer(f,dialect="excel")
            for r in range(sheet.nrows):
                c.writerow(sheet.row_values(r))
The conversion runs until it finds a Unicode character that doesn't exist in Shift-JIS, then it errors out.
Is there a way to convert from Excel to a csv purely in shift-JIS?
(If my question has a flaw, please ask me to edit it before marking it down! I will edit it!)
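One possible approach, sketched here for Python 3 rather than the Python 2 used in the question, is to let the output file do the encoding and to decide explicitly what happens to characters that Shift-JIS cannot represent. The function name `write_sjis_csv` is hypothetical. With `errors="replace"` such characters are written as "?", which loses data, so whether that is acceptable is an assumption to check against the website's requirements. Python also ships a "cp932" codec, Microsoft's variant of Shift-JIS, which may be the better match for files that originate in Excel.

```python
import csv


def write_sjis_csv(path, rows):
    """Write rows to a CSV file encoded as Shift-JIS."""
    # newline="" lets the csv module control line endings itself.
    # errors="replace" writes "?" for characters that Shift-JIS cannot encode
    # instead of raising UnicodeEncodeError partway through the file.
    with open(path, "w", encoding="shift_jis", errors="replace", newline="") as f:
        writer = csv.writer(f, dialect="excel")
        writer.writerows(rows)
```

In the loop from the question, `rows` would be `(sheet.row_values(r) for r in range(sheet.nrows))`, and the `sys.setdefaultencoding` workaround is not needed.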

python excel processing error

I am working on Excel processing using Python.
I am using the xlrd module (version 0.6.1) for this.
I am able to read most of the Excel files, but for some of them it gives me this error:
XLRDError: Expected BOF record; found 0x213c
Can anyone let me know how to solve this issue?
Thanks in advance.
What you have is most probably an "XML Spreadsheet 2003 (*.xml)" file ... "<!" aka "\x3c\x21" (which is what XML streams start with) is being interpreted as the little-endian number 0x213c.
Notepad: First two lines:
<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
You can also check this by opening the file with Excel and then click on Save As and look at the file-type that is displayed. While you are there, save it as an XLS file so that your xlrd can read it.
Note: this XML file is NOT the Excel 2007+ XLSX file. An XLSX is actually a ZIP file (starts with "PK", not "<?") containing a bunch of XML streams.
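The distinction above can be checked before handing a file to any reader by looking at its first bytes. A minimal sketch follows; the function name `sniff_spreadsheet` and the returned labels are hypothetical. The signatures used are the ZIP local file header ("PK\x03\x04"), the OLE2 compound file header that genuine binary .xls files start with (\xd0\xcf\x11\xe0), the UTF-16 byte order marks, and a leading "<" for XML or HTML markup.

```python
def sniff_spreadsheet(head: bytes) -> str:
    """Guess the real container format from the first bytes of a file."""
    if head.startswith(b"\xd0\xcf\x11\xe0"):
        return "ole2"      # binary .xls, the format xlrd expects
    if head.startswith(b"PK\x03\x04"):
        return "zip"       # possibly .xlsx; needs an .xlsx-capable reader
    if head.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf16"     # text export, often tab-separated
    if head.lstrip().startswith(b"<"):
        return "markup"    # XML Spreadsheet 2003 or an HTML table
    return "unknown"
```

Reading the first eight bytes of the file in binary mode is enough for these checks, for example `sniff_spreadsheet(open(path, "rb").read(8))` with a hypothetical `path`.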
