Reading a file from Google Cloud Storage with XLRD (Python)

I'm trying to read a file stored in one of my buckets on GAE.
The file is stored in a public bucket.
I've tried:
archivo=cloudstorage.open('/bucket/workbook.xlsx')
wb = xlrd.open_workbook(filename=archivo)
but xlrd expects to open the file itself, so I get a TypeError:
TypeError: coercing to Unicode: need string or buffer, ReadBuffer found
Is there any way to give xlrd an already-open file, so I can read it without having to change xlrd.py?

I should have read the documentation more carefully before asking...
To give xlrd an open file instead of a filename, I have to pass the file's contents through the file_contents argument.
This is done by:
archivo=cloudstorage.open('/bucket/workbook.xlsx')
wb = xlrd.open_workbook(file_contents=archivo.read())
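A minimal sketch of the same idea wrapped in a helper, assuming the stream was opened in binary mode. The names read_all_bytes, open_workbook_from_stream and the opener parameter are made up for illustration; only xlrd.open_workbook(file_contents=...) comes from the answer above.

```python
def read_all_bytes(stream):
    """Return the full contents of an open binary file-like object."""
    data = stream.read()
    if not isinstance(data, bytes):
        # A text-mode stream would hand xlrd decoded text instead of raw bytes.
        raise TypeError("expected a binary stream, got %s" % type(data).__name__)
    return data


def open_workbook_from_stream(stream, opener=None):
    """Pass the stream's bytes to xlrd via file_contents instead of filename.

    `opener` is a hypothetical hook: it defaults to xlrd.open_workbook and
    lets the helper be exercised without xlrd installed.
    """
    if opener is None:
        import xlrd  # imported lazily so the module loads without xlrd
        opener = xlrd.open_workbook
    return opener(file_contents=read_all_bytes(stream))
```

With the bucket from the question, the call would presumably be open_workbook_from_stream(cloudstorage.open('/bucket/workbook.xlsx')). Note that xlrd 2.0 and later only read .xls files, so an .xlsx workbook needs an older xlrd or a different reader.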

Related

Read PDF tables from memory with Python

I'm trying to read a PDF file extracted from a zip file in memory to get the tables inside the file. Camelot seems a good way to do it, but I'm getting the following error:
AttributeError: '_io.StringIO' object has no attribute 'lower'
Is there some way to read the file and extract the tables with camelot, or should I use another library?
z = zipfile.ZipFile(self.zip_file)
for file in z.namelist():
    if file.endswith(".pdf"):
        pdf = z.read(file).decode(encoding="latin-1")
        pdf = StringIO(pdf)
        pdf = camelot.read_pdf(pdf, codec='utf-8')
camelot.read_pdf(filepath, ...) accepts a file path as its first parameter, so it is a poor match for an in-memory object. You may need a different library.
In any case, StringIO(pdf) returns a StringIO object rather than the data itself:
<_io.StringIO object at 0x000002592DD33E20>
For starters, to get the contents back out of a StringIO, call its read() method:
pdf = StringIO(pdf)
pdf.read()
That returns the file contents themselves, as a str rather than bytes since this is a StringIO. Next, think about which input and encoding the library will accept.
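Since the reader wants a path, one workaround is to write each PDF member to a temporary file and pass that path along. The sketch below is only an illustration of that idea: extract_pdfs_to_temp is a made-up helper name, the archive is a placeholder built in memory, and the camelot call is left as a comment because it is not exercised here.

```python
import io
import os
import tempfile
import zipfile


def extract_pdfs_to_temp(zip_source, dest_dir):
    """Write every .pdf member of a zip archive into dest_dir; return the paths."""
    paths = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".pdf"):
                continue
            # Keep the raw bytes: a PDF is binary, so there is no decode() step.
            data = archive.read(name)
            target = os.path.join(dest_dir, os.path.basename(name))
            with open(target, "wb") as handle:
                handle.write(data)
            paths.append(target)
    return paths


# Build a tiny placeholder archive in memory to stand in for self.zip_file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("reports/sample.pdf", b"%PDF-1.4 placeholder")
    archive.writestr("notes.txt", b"ignored")
buffer.seek(0)

with tempfile.TemporaryDirectory() as tmp:
    pdf_paths = extract_pdfs_to_temp(buffer, tmp)
    extracted_names = [os.path.basename(p) for p in pdf_paths]
    with open(pdf_paths[0], "rb") as handle:
        first_bytes = handle.read()
    # Each path could be handed to a path-only reader at this point,
    # e.g. camelot.read_pdf(path), before the directory is cleaned up.
```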

Unable to Read Corrupted pdf using PyPDF2

I am trying to merge two PDFs into one using PyPDF2.
Below is the code that reads the file content:
output = PyPDF2.PdfFileWriter()
for pdffile in self.files:
    input = PyPDF2.PdfFileReader(pdffile, strict=False)
where self.files is a list of file objects.
I get the error below when trying to read one particular PDF file:
TypeError: 'NumberObject' object has no attribute '__getitem__'
When I ran Ghostscript on the PDF file, I found that the file is corrupted, and I can read the repaired copy without errors. Is there any way to read the corrupted PDF file without errors using PyPDF2 alone?
Thanks in Advance.
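This is not a PyPDF2-only answer, but the Ghostscript repair mentioned in the question can be scripted as a pre-processing step. A hedged sketch: build_gs_repair_command and repair_pdf are hypothetical helper names, the file names are placeholders, and the binary is assumed to be called gs (on Windows it is typically gswin64c).

```python
import subprocess


def build_gs_repair_command(src_path, dst_path, gs_binary="gs"):
    """Build a Ghostscript command that rewrites src_path as a fresh PDF."""
    return [
        gs_binary,
        "-o", dst_path,        # output file; -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",   # re-emit the document through the PDF writer
        src_path,
    ]


def repair_pdf(src_path, dst_path, gs_binary="gs"):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_gs_repair_command(src_path, dst_path, gs_binary), check=True)
    return dst_path


# Placeholder file names, shown only to illustrate the resulting command.
command = build_gs_repair_command("broken.pdf", "repaired.pdf")
```

The repaired copy returned by repair_pdf could then be opened with PdfFileReader in place of the original file.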

Loading xls files with pandas fails

I am trying to load an xls file with pandas using:
pd.read_excel(fi_name, sheet_name=None, engine=None)
But I get this error:
"XLRDError: Workbook is encrypted"
But the file is not encrypted: I can open it with Excel and read its text with the tika package.
Does anyone know how I can solve this?
Also, does anyone know of a Python package that reads all Excel file formats,
even when pandas fails?
Thanks
I think I found something for your problem:
import msoffcrypto
# 'VelvetSweatshop' is Excel's default password; it applies when the
# file opens in Excel without a password prompt.
with open('encrypted.xls', 'rb') as src, open('decrypted.xls', 'wb') as dst:
    file = msoffcrypto.OfficeFile(src)  # read the original file
    file.load_key(password='VelvetSweatshop')
    file.decrypt(dst)  # save the decrypted copy as a new file
After that, you can open the decrypted file with xlrd and work with it normally.
You can install the package with
pip install msoffcrypto-tool
See the package documentation for full details.
There are two possible reasons for this:
The file is not actually in the format that its extension claims.
Either the whole workbook or one of its sheets is password protected, so the data read from it is encrypted.
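To check the first reason, it can help to look at the first few bytes of the file rather than its extension. A rough sketch, assuming only three cases matter; sniff_spreadsheet_format and its return labels are invented for illustration:

```python
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file (legacy .xls)
ZIP_MAGIC = b"PK\x03\x04"                          # zip container (.xlsx)


def sniff_spreadsheet_format(head):
    """Guess the container format from the first bytes of a file."""
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    # HTML or XML saved under a spreadsheet extension starts with markup.
    if head.lstrip().lower().startswith((b"<html", b"<!doctype", b"<?xml")):
        return "markup"
    return "unknown"


def sniff_file(path):
    """Read the first 16 bytes of path and classify them."""
    with open(path, "rb") as handle:
        return sniff_spreadsheet_format(handle.read(16))
```

A result of "markup" or "zip" for a file named .xls would point to a mismatched extension rather than real encryption.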

How to import an old Excel file with extension .xls?

I have a file from SAS that is exported as an older Excel .xls file. I would like to import this file into Python 3.5.
When I do:
import pandas as pd
Filewant = pd.read_excel("Filepath\\Filename.xls")
I get a bunch of error messages culminating in
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<html xm'
If I open the file and manually save it as a current .xlsx file, then run the same command:
Filewant = pd.read_excel("Filepath\\Filename.xlsx")
the file is imported into Python properly. However, I want the process to be more automated, so that I don't have to manually save the file in .xlsx format to make it work.
SAS tech support told me that this won't work and that I'll need to convert the .xls SAS output into a .xlsx file:
Unfortunately, the MSOffice2K destination creates an HTML file even though it uses the .XLS extension, which allows the file to be opened with Excel.
You can use VBScript to convert the file to .XLSX; however, there is no way to do this using the MSOffice2K destination.
The error message tells you the problem: found b'<html xm'. Your file is an HTML file, not an XLS file. This was commonly done with "old" SAS, since it did not support writing XLS files but Excel did support reading HTML files.
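Because the file is really HTML, it can be parsed as HTML instead of being converted first. pandas.read_html is the usual route when an HTML parser backend is installed; the sketch below sticks to the standard library so that it runs anywhere. The TableRows class, the html_table_rows helper and the sample markup are illustrative assumptions, not the actual SAS output, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()   # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()


def html_table_rows(markup):
    """Return all table rows in the markup as lists of cell strings."""
    parser = TableRows()
    parser.feed(markup)
    parser.close()
    return parser.rows


sample = ("<html><body><table><tr><th>name</th><th>age</th></tr>"
          "<tr><td>Ann</td><td>31</td></tr></table></body></html>")
rows = html_table_rows(sample)
```

For a real export, the markup would come from reading the .xls file as text, and the resulting rows could be passed to pd.DataFrame.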

XLRD unsupported format found

Using version 0.9.2 of XLRD and Python 2.7 on Windows...
I'm creating a temporary file and then reading it with XLRD:
data = self.excel_file
path = default_storage.save('temp/temp.xls', ContentFile(data.read()))
tmp_file = os.path.join(settings.MEDIA_ROOT, path)
workbook = xlrd.open_workbook(tmp_file)
As soon as I try to open the XLS file, it fails with the error
xlrd.biffh.XLRDError: Unsupported format found '\xd0\xcf\x11\xe0\xa1\xb1\x00\x00'
So I guess the file is not saved correctly above, or something like that. I'm not sure what to do about this, since it works when the file is uploaded through a file field in Django admin.
Saving the file to a Django model from Python like this, however, produces the issue above:
from django.core.files import File
p = Foo()
p.excel_file.save(file_name, File(data))
p.save()
It looks to me like this could be a Unicode issue. I'm guessing there are non-ASCII characters in your strings? Try using .encode("utf-8") on your strings when you save them to the xls.
EDIT: this was a guess; after more investigation by @Harry, it looks like it's not the correct solution.
EDIT 2: If the file cannot be opened by Excel, as discussed below, then the data itself is probably the problem.
