How to parse uploaded Excel file? - python

I'm trying to use pandas to parse an Excel file that someone uploaded to a Flask web application, but I'm not having much success.
I save the raw stream to a temporary file and then try to read it, but pandas complains about the raw byte array:
tmpfile = tempfile.NamedTemporaryFile()
tmpfile.write(request.files['spreadsheet'].read())
sheet = pandas.ExcelFile(tmpfile.name)
results in the error:
*** XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '\xd0\xcf'
Is there a way to do this?

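Found the answer: I need to call tmpfile.seek(0) after writing and before reading from the file. The call flushes the buffered write and moves the file position back to the start.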
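For reference, the fix can be sketched as below. This is a minimal illustration rather than the asker's exact code: the helper name write_upload_to_tempfile and the sample payload are made up, and the pandas call is left as a comment so the snippet runs with only the standard library.

```python
import tempfile


def write_upload_to_tempfile(raw_bytes):
    """Write uploaded bytes to a temporary file and rewind it (hypothetical helper)."""
    tmpfile = tempfile.NamedTemporaryFile(suffix=".xls")
    tmpfile.write(raw_bytes)
    # seek(0) flushes the buffered write and moves the file position
    # back to the start, so a later read sees the whole content.
    tmpfile.seek(0)
    return tmpfile


# Stand-in payload: the first bytes of an OLE2 (.xls) file, then padding.
payload = b"\xd0\xcf\x11\xe0" + b"\x00" * 16
tmp = write_upload_to_tempfile(payload)
first_bytes = tmp.read(2)  # the two bytes quoted in the error message

# In the Flask view, the next step would be (assumption, not run here):
#     sheet = pandas.ExcelFile(tmp.name)
```

The temporary file is deleted when tmp is closed, so the workbook has to be parsed while the object is still open.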

Related

How to open XLS via stream input in python?

I'm using a software product that opens an XLS file and sends it as this string:
PK\x03\x04\x14\x00\x06\x00\b\x00\x00\x00!\x00{\x92..
I've tried the code below and get this error:
Unsupported format, or corrupt file: Expected BOF record; found b'<\xac\xc2\xa2{^\x9e\xd4' [file '/home/vflow/.local/lib/python3.6/site-packages/xlrd/book.py', line 1278]
The question is how to open this as an xlsx in Python when it arrives as this string.
The XLS can be opened using MS Office. As a second attempt, I created a simple xlsx with the word 'test' in one column. The software we have serves the string above (PK...). Note that the string starts with PK, which indicates a zip file (xlsx files are zipped, so it looks fine from a byte-string point of view).
The approach below comes from a similar question: Read byte string as xls file
import xlrd
import base64
myvar = data.body # contains : PK\x03\x04\x14\x00\x06\x00\b\x00\x00\x00!\x00{\x92..
decoded_bytes = base64.b64decode(myvar) # contains : <\xac¢{^\x9e\xd4\xf2\xa5\xeb1\x9aY\xf4\x13[PPI\xb2..
x = xlrd.open_workbook(file_contents=decoded_bytes)
# Unsupported format, or corrupt file: Expected BOF record; found b'<\xac\xc2\xa2{^\x9e\xd4' [file '/home/vflow/.local/lib/python3.6/site-packages/xlrd/book.py', line 1278]
x = xlrd.open_workbook(file_contents=data.body) # use data.body instead
# Excel xlsx file; not supported [file '/home/vflow/.local/lib/python3.6/site-packages/xlrd/__init__.py', line 170]
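One possible way forward, given that the payload already starts with PK: use the bytes as they are (Base64-decoding data that is not Base64 only garbles it) and confirm they form an .xlsx package before handing them to a reader that supports that format. The sketch below uses only the standard library. looks_like_xlsx is a hypothetical helper, the sample archive is a stand-in, and the closing openpyxl call is an assumption about which reader to use, shown only as a comment.

```python
import io
import zipfile


def looks_like_xlsx(data):
    """Return True if the bytes are a zip archive containing xl/workbook.xml."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False
    with zipfile.ZipFile(buffer) as archive:
        return "xl/workbook.xml" in archive.namelist()


# Build a tiny stand-in archive so the example runs without a real workbook.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")
sample_bytes = sample.getvalue()

is_xlsx = looks_like_xlsx(sample_bytes)
# An OLE2 (.xls) style header is not a zip archive, so the check fails.
is_xlsx_for_xls_header = looks_like_xlsx(b"\xd0\xcf\x11\xe0" + b"\x00" * 32)

# If the check passes, a reader with .xlsx support can take the same bytes,
# for example (assumption: openpyxl is installed):
#     workbook = openpyxl.load_workbook(io.BytesIO(data.body))
```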

Unable to Read Corrupted pdf using PyPDF2

I am trying to merge two PDFs and write the result to a single PDF using PyPDF2.
Below is the code that reads the file content:
output = PyPDF2.PdfFileWriter()
for pdffile in self.files:
    input = PyPDF2.PdfFileReader(pdffile, strict=False)
where self.files is a list of file objects.
I get the error below when trying to read one particular PDF file:
TypeError: 'NumberObject' object has no attribute '__getitem__'
When I ran Ghostscript on the PDF file, I found that the file is corrupted, and I can read the repaired copy without errors. Is there any way to read the corrupted PDF file without errors using PyPDF2 alone?
Thanks in Advance.

Pandas.read_excel: Unsupported format, or corrupt file: Expected BOF record

I'm trying to use pandas.read_excel to read in .xls files. It succeeds on most of my .xls files, but then for some it errors out with the following error message:
Unsupported format, or corrupt file: Expected BOF record; found '\x00\x05\x16\x07\x00\x02\x00\x00'
I've been trying to research why this is happening to some, but not all files. The xlrd version is 1.0.0. I tried to manually read in with xlrd.open_workbook and I get the same result.
Does anyone know what file type this BOF record is referring to?
There are various reasons why that error message can appear, but the most likely one is the Excel file itself. Sometimes, especially if you're pulling an Excel file from a reporting portal, the file is corrupt, so the best thing is to open it in Excel, save it as a new .xls file, and then retry pandas.read_excel.
Lemme know if it works.
I solved this problem by loading the file with pd.read_table (which loads everything into one column):
df = pd.read_table('path/to/xls_file/' + 'my_file.xls')
then I split this column with
df = df['column_name'].str.split("your_separator", expand=True)
Please check that the file has the right extension, either xlsx or csv. A file whose extension does not match its actual format can cause this issue.
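Since the error message quotes the first eight bytes that xlrd found, those bytes can be compared against a few well-known file signatures to see what the file really is. The sketch below is one way to do that with the standard library. The function name and the returned labels are made up for illustration, and the signature list is deliberately short, not complete. The bytes quoted in the question, \x00\x05\x16\x07, match the AppleDouble signature that macOS uses for its '._' companion files.

```python
def describe_excel_candidate(header):
    """Guess what a file is from its first bytes (illustrative, not exhaustive)."""
    if header.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "OLE2 compound file (typical for a real .xls)"
    if header.startswith(b"PK\x03\x04"):
        return "zip archive (typical for .xlsx)"
    if header.startswith(b"\x00\x05\x16\x07"):
        return "AppleDouble metadata file (macOS '._' companion file)"
    if header.lstrip().lower().startswith((b"<html", b"<!doctype", b"<?xml")):
        return "HTML or XML text saved with an Excel extension"
    return "unknown (possibly delimited text)"


def describe_file(path):
    """Read the first eight bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return describe_excel_candidate(handle.read(8))


# The header quoted in the error message from the question.
kind = describe_excel_candidate(b"\x00\x05\x16\x07\x00\x02\x00\x00")
```

Running describe_file over every file that fails to load would show whether the failures share one cause, such as metadata files picked up by a directory listing.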

How to import an old Excel file with extension .xls?

I have a file from SAS that is exported as an older Excel .xls file. I would like to import this file into Python 3.5.
When I do:
import pandas as pd
Filewant = pd.read_excel("Filepath\\Filename.xls")
I get a bunch of error messages culminating in
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<html xm'
If I open the file, manually save it as a current .xlsx file, and use the same command:
Filewant = pd.read_excel("Filepath\\Filename.xlsx")
then the file is imported into Python properly. However, I want the process to be more automated so I don't have to manually save the file in .xlsx format to make it work.
SAS tech support told me that this won't work and that I'll need to convert the .xls SAS output into a .xlsx file:
Unfortunately, the MSOffice2K destination creates an HTML file even though it uses the .XLS extension here which allows the file to be opened with excel.
You can use VBScript to convert the file to .XLSX, however, there is no way to do this using the MSoffice2K destination.
The error message tells you the problem: found b'<html xm'. Your file is an HTML file, not an XLS file. This was commonly done with "old" SAS, since it did not support writing XLS files but Excel did support reading HTML files.
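Because the file is really HTML, an HTML table reader can load it without the manual re-save. pandas.read_html is one option when an HTML parser backend such as lxml is installed. The sketch below shows the same idea using only the standard library. The TableRows class and the sample markup are invented for illustration, and the parser handles only simple, non-nested tables.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of <td>/<th> cells, row by row (simple tables only)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up stand-in for the HTML that SAS wrote under an .xls name.
sample = """<html xmlns:x="urn:schemas-microsoft-com:office:excel"><body><table>
<tr><th>name</th><th>value</th></tr>
<tr><td>a</td><td>1</td></tr>
</table></body></html>"""

parser = TableRows()
parser.feed(sample)
rows = parser.rows  # first row holds the header cells
```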

Reading a file from Google Cloud Storage with XLRD (python)

I'm trying to read a file stored in one of my buckets in GAE.
The file is stored in a public bucket.
I've tried:
archivo=cloudstorage.open('/bucket/workbook.xlsx')
wb = xlrd.open_workbook(filename=archivo)
but xlrd expects to open the file by itself, so I get a TypeError:
TypeError: coercing to Unicode: need string or buffer, ReadBuffer found
Is there any way to give xlrd an open file so I can read it without having to change xlrd.py?
I should read the documentation with more attention before asking stuff...
To provide xlrd with an open file instead of a filename, I have to pass the file's contents through the file_contents argument.
This is done by:
archivo=cloudstorage.open('/bucket/workbook.xlsx')
wb = xlrd.open_workbook(file_contents=archivo.read())
