I am trying to merge 2 PDFs and write them out as one PDF using PyPDF2.
Below is the code that reads the file content:
output = PyPDF2.PdfFileWriter()
for pdffile in self.files:
    input = PyPDF2.PdfFileReader(pdffile, strict=False)
where self.files is a list of file objects.
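For context, the full merge step I intend to run looks roughly like this (merged.pdf is just a placeholder name):

import PyPDF2

output = PyPDF2.PdfFileWriter()
for pdffile in self.files:
    input = PyPDF2.PdfFileReader(pdffile, strict=False)
    # copy every page of the current reader into the writer
    for page_num in range(input.getNumPages()):
        output.addPage(input.getPage(page_num))

# write the merged result to a single PDF
with open("merged.pdf", "wb") as merged:
    output.write(merged)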
I am getting the below error when trying to read one particular PDF file:
TypeError: 'NumberObject' object has no attribute '__getitem__'
When I ran Ghostscript on the PDF file, I found that the file is corrupted; the repaired copy I am able to read without errors. I wanted to check if there is any way I can read the corrupted PDF file using PyPDF2 only, without errors?
Thanks in Advance.
Related
I'm trying to read a PDF file extracted from a zip file in memory to get the tables inside the file. Camelot seems a good way to do it, but I'm getting the following error:
AttributeError: '_io.StringIO' object has no attribute 'lower'
Is there some way to read the file and extract the tables with camelot, or should I use another library?
import zipfile
from io import StringIO
import camelot

z = zipfile.ZipFile(self.zip_file)
for file in z.namelist():
    if file.endswith(".pdf"):
        pdf = z.read(file).decode(encoding="latin-1")
        pdf = StringIO(pdf)
        pdf = camelot.read_pdf(pdf, codec='utf-8')
camelot.read_pdf(filepath, ...)
accepts a file path as its first parameter, so it appears to be a bad match for your requirements. Search for another library.
In any case, StringIO(pdf) will return the following:
<_io.StringIO object at 0x000002592DD33E20>
For starters, when you read a file from StringIO, do it by calling the read() function
pdf = StringIO(pdf)
pdf.read()
That will indeed return the file contents themselves. Next, think about the encoding that the library will accept.
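If you want to stay with Camelot anyway, one possible workaround (just a sketch, assuming read_pdf only takes a path) is to dump each PDF from the zip to a temporary file and pass that file's path in:

import tempfile
import zipfile
import camelot

z = zipfile.ZipFile(self.zip_file)
for name in z.namelist():
    if name.endswith(".pdf"):
        # write the raw PDF bytes to a temp file Camelot can open by path
        with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
            tmp.write(z.read(name))
        tables = camelot.read_pdf(tmp.name)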
I'm trying to use Pandas to parse an Excel file someone uploaded to a Flask web application but not having much success.
I save the raw stream to a temporary file and then try to read it, but pandas complains about the raw byte array:
tmpfile = tempfile.NamedTemporaryFile()
tmpfile.write(request.files['spreadsheet'].read())
sheet = pandas.ExcelFile(tmpfile.name)
results in the error:
*** XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '\xd0\xcf'
Is there a way to do this?
Found the answer: I need to call tmpfile.seek(0) before being able to read from it.
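For completeness, the working version of the snippet above (still assuming the upload arrives under the 'spreadsheet' key):

import tempfile
import pandas
from flask import request

tmpfile = tempfile.NamedTemporaryFile()
tmpfile.write(request.files['spreadsheet'].read())
tmpfile.seek(0)  # rewind the temp file before pandas reads it
sheet = pandas.ExcelFile(tmpfile.name)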
I have a file from SAS that is exported as an older Excel .xls file. I would like to import this file into Python 3.5.
When I do:
import pandas as pd
Filewant = pd.read_excel("Filepath\\Filename.xls")
I get a bunch of error messages culminating in
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<html xm'
If I open up the file and manually save it as a current .xlsx file and use the same command with:
Filewant = pd.read_excel("Filepath\\Filename.xlsx")
then the file is imported into Python properly. However, I want the process to be more automated so I don't have to manually save the file to .xlsx format to make it work.
SAS tech support told me that this won't work and that I'll need to convert the .xls SAS output into a .xlsx file:
Unfortunately, the MSOffice2K destination creates an HTML file even though it uses the .XLS extension here, which allows the file to be opened with Excel.
You can use VBScript to convert the file to .XLSX, however, there is no way to do this using the MSoffice2K destination.
The error message tells you the problem: found b'<html xm'. Your file is an HTML file, not an XLS file. This was commonly done with "old" SAS, since it did not support writing XLS files, but Excel did support reading HTML files.
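Since the file is really HTML, one way to avoid the manual re-save (a sketch, assuming the SAS output stores the data in an HTML table) is to let pandas parse it as HTML:

import pandas as pd

# read_html returns a list of DataFrames, one per <table> in the file
tables = pd.read_html("Filepath\\Filename.xls")
Filewant = tables[0]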
I am trying to convert a CSV to the .xlsx file format because I have code that is meant to read an Excel file, but I ended up getting a CSV. Is there a way to convert a CSV file to a TEMP Excel file and have it not destroyed until the reading process is done? I have tried using openpyxl, but it ends up not working and throws an error saying it's not a good zip file. I even tried converting the CSV to text and then storing it in a dictionary, but writing to Excel using the xlrd package did not work as well. I was wondering if there is a way to do it in a cc
Seems like you opened the file in text mode. Try this to open the file:
open('sample.csv', "rt", encoding="utf8")
or
open('sample.csv', "rt", encoding="ascii")
depending on the encoding of the file.
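If the end goal is simply a temporary .xlsx that the existing reader can open, a possible sketch with pandas (assuming pandas and openpyxl are installed; sample.csv is a placeholder name):

import tempfile
import pandas as pd

df = pd.read_csv("sample.csv", encoding="utf8")
# delete=False keeps the temp file around until you remove it yourself
tmp = tempfile.NamedTemporaryFile(suffix=".xlsx", delete=False)
tmp.close()
df.to_excel(tmp.name, index=False)  # needs an xlsx writer such as openpyxl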
I'm trying to read a file stored in one of my buckets in GAE.
The file is stored in a public bucket.
I've tried to:
archivo = cloudstorage.open('/bucket/workbook.xlsx')
wb = xlrd.open_workbook(filename=archivo)
but xlrd expects to open the file by itself, so I get a TypeError:
TypeError: coercing to Unicode: need string or buffer, ReadBuffer found
Is there any way to give xlrd an open file so I can read it without having to change xlrd.py?
I should read the documentation with more attention before asking stuff...
To provide xlrd with an open file, I have to give it the file contents instead of a filename.
This is done by:
archivo = cloudstorage.open('/bucket/workbook.xlsx')
wb = xlrd.open_workbook(file_contents=archivo.read())
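Continuing from the snippet above, the workbook then behaves like any other xlrd workbook, for example:

sheet = wb.sheet_by_index(0)      # first worksheet
print(sheet.nrows, sheet.ncols)   # row and column counts
print(sheet.cell_value(0, 0))     # value of the top-left cell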