Getting AssertionError while reading a PDF file in Python (PyPDF2)

I am getting the below error when I try to read a PDF file.
Code:
from PyPDF2 import PdfFileReader
import os
os.chdir("Path to dir")
pdf_document = 'sample.pdf'
pdf = PdfFileReader(pdf_document,'rb') #Error here
Error:
Traceback (most recent call last):
File "/home/krishna/PycharmProjects/sample/sample.py", line 9, in <module>
pdf = PdfFileReader(filehandle)
File "/home/krishna/PycharmProjects/AI_DRC/venv/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1084, in __init__
self.read(stream)
File "/home/krishna/PycharmProjects/AI_DRC/venv/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1838, in read
assert start >= last_end
AssertionError
NOTE: File is 18 MB in size

This works for me. The PDF is in the same folder as the script; you can also build the path as a string with os.
import PyPDF2
pdf_file = PyPDF2.PdfFileReader("Sample.pdf")  # open the file by path; a path built with os works as well
page_content = pdf_file.getPage(0).extractText()  # get the first page (index 0) and extract its text
print(page_content)  # do whatever you want with the text from here
I think the problem in your program is the "rb" argument. That is a mode for open() in ordinary file handling; the second positional argument of PdfFileReader is strict, not a file mode. PyPDF2 has its own classes for this: PdfFileReader, PdfFileWriter and PdfFileMerger.
Hope this helps.
If you run into any problem, just mention it and I will try to get back to you.
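To make the point about "rb" concrete, here is a minimal sketch in which the mode goes to open() and the reader only receives the resulting file object. It assumes a PyPDF2 release older than 3.0 (where PdfFileReader, getPage and extractText exist); looks_like_pdf and first_page_text are hypothetical helper names, not part of PyPDF2. The header check is just a cheap sanity test for a damaged file and will not catch every problem that can trip the parser.

```python
def looks_like_pdf(path):
    """Return True if the file starts with the PDF magic bytes."""
    # "rb" belongs here, on open(): it selects binary mode for the file object.
    with open(path, "rb") as fh:
        return fh.read(5) == b"%PDF-"


def first_page_text(path):
    """Extract the text of page 0 (assumes the legacy PyPDF2 API, < 3.0)."""
    from PyPDF2 import PdfFileReader  # imported lazily so the check above needs no PyPDF2

    with open(path, "rb") as fh:
        reader = PdfFileReader(fh)  # no mode argument: the stream is already binary
        return reader.getPage(0).extractText()
```

If looks_like_pdf("sample.pdf") is False, the file is probably not a valid PDF at all, and no reader argument will fix that.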

Related

Error while pdf parsing PyPDF2 and textract

I'm trying to build a program that looks for specific words or short phrases in a pdf file. The files load well but I have a problem when searching through the pdf, when the page changes. Here's my code:
import PyPDF2
import glob, os, shutil
import textract
os.chdir(r"C:\Users\Dani\Desktop\patent")
goodfiles=[]
for file in glob.glob("*.pdf"):
    pdfFileObj = open(file, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)
    search_word_main = "isophthalic"
    word_main=[]
    search_word_sub = ["acid index", "acid value", "acid number", "acidity index","acidity"]
    word_sub=[]
    for pageNum in range(1, pdfReader.numPages):
        pageObj = pdfReader.getPage(pageNum)
        text = pageObj.extractText().encode('utf-8')
        search_text = text.lower().split()
        toprange=len(search_text)-1
        for len in range (toprange):
            if search_word_main in search_text[len].decode("utf-8"):
                print(search_word_main)
            for key in search_word_sub:
                if key in search_text[len].decode("utf-8") + " " + search_text[len+1].decode("utf-8"):
                    print(key)
On the first page of the PDF everything works, but as soon as it moves to the second page I get this error:
Traceback (most recent call last):
File "test.py", line 19, in <module>
toprange=len(search_text)-1
TypeError: 'int' object is not callable
I don't understand why whenever the page changes, this happens. If I just try to print the toprange variable rather than adding it to the loop, there's no problem and I get the values for toprange. There seems to be a problem with the for loop but I don't seem to find where. Could you help me solve this?
Thanks in advance.
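The traceback points at the cause: the inner loop uses len as its loop variable, so after the first page the name len refers to an integer rather than the built-in function, and len(search_text) fails on the next page. A minimal sketch of the same search with a different index name follows; find_terms is a hypothetical helper, and it works on a plain list of words so it can be tried without a PDF.

```python
def find_terms(words, main_term, sub_terms):
    """Return (main_hits, sub_hits) found in a list of lowercase words."""
    main_hits = []
    sub_hits = []
    # Use `i` as the loop index so the built-in len() is never rebound.
    # As in the original loop, the last word is only seen as the second half of a pair.
    for i in range(len(words) - 1):
        pair = words[i] + " " + words[i + 1]
        if main_term in words[i]:
            main_hits.append(main_term)
        for key in sub_terms:
            if key in pair:
                sub_hits.append(key)
    return main_hits, sub_hits


words = "the isophthalic resin has an acid value of ten".split()
print(find_terms(words, "isophthalic", ["acid value", "acid number"]))
# prints (['isophthalic'], ['acid value'])
```

In the original script the equivalent change is to rename the loop variable, for example for i in range(toprange), and use i wherever len was used as an index.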

Python PIL can't open PDFs for some reason

So my program is able to open PNGs but not PDFs, so I made this just to test, and it still isn't able to open even a simple PDF. And I don't know why.
from PIL import Image
with Image.open(r"Adams, K\a.pdf") as file:
    print file
Traceback (most recent call last):
File "C:\Users\Hayden\Desktop\Scans\test4.py", line 3, in <module>
with Image.open(r"Adams, K\a.pdf") as file:
File "C:\Python27\lib\site-packages\PIL\Image.py", line 2590, in open
% (filename if filename else fp))
IOError: cannot identify image file 'Adams, K\\a.pdf'
After trying PyPDF2 as suggested (Thanks for the link by the way), I am getting this error with my code.
import PyPDF2
pdf_file= open(r"Adams, K (6).pdf", "rb")
read_pdf= PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
print number_of_pages
Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
Following this article: https://www.geeksforgeeks.org/convert-pdf-to-image-using-python/ you can use the pdf2image package to convert the pdf to a PIL object.
This should solve your problem:
from pdf2image import convert_from_path
fname = r"Adams, K\a.pdf"
pil_image_lst = convert_from_path(fname) # This returns a list even for a 1 page pdf
pil_image = pil_image_lst[0]
I just tried this out with a one page pdf.
As pointed out by @Kevin (see comment below), PIL supports writing PDFs but not reading them.
To read a PDF you will need another library. Here is a tutorial on handling PDFs with PyPDF2:
https://pythonhosted.org/PyPDF2/?utm_source=recordnotfound.com

Randomly shuffle the pages of a PDF file using pyPDF or PyPDF2

I'm not very experienced in programming. What I'm trying to do is to randomly shuffle the pages of a pdf and output it to another pdf.
Searching online I found the following two solutions (source 1, source 2):
#!/usr/bin/env python2
import random, sys
from PyPDF2 import PdfFileWriter, PdfFileReader
input = PdfFileReader(sys.stdin)
output = PdfFileWriter()
pages = range(input.getNumPages())
random.shuffle(pages)
for i in pages:
    output.addPage(input.getPage(i))
output.write(sys.stdout)
And this one:
#!/usr/bin/python
import sys
import random
from pyPdf import PdfFileWriter, PdfFileReader
# read input pdf and instantiate output pdf
output = PdfFileWriter()
input1 = PdfFileReader(file(sys.argv[1],"rb"))
# construct and shuffle page number list
pages = list(range(input1.getNumPages()))
random.shuffle(pages)
# display new sequence
print 'Reordering pages according to sequence:'
print pages
# add the new sequence of pages to output pdf
for page in pages:
    output.addPage(input1.getPage(page))
# write the output pdf to file
outputStream = file(sys.argv[1]+'-mixed.pdf','wb')
output.write(outputStream)
outputStream.close()
I tried both (and both with PyPDF2 and pyPdf) and both indeed create a new pdf file, but this file is simply empty (and has 0KB) (when I enter, let's say "shuffle.py new.pdf").
I'm using PyCharm, and one problem I encounter (and do not really understand) is that it says: "Cannot find reference 'PdfFileWriter'".
I would appreciate any help understanding what I'm doing wrong :)
EDIT:
As suggested by Tom Dalton, I'm posting what happens when I run the first one:
C:\Users\Anwender\AppData\Local\Temp\shuffle.py\venv\Scripts\python.exe "E:/Shuffle PDF/shuffle.py"
PdfReadWarning: PdfFileReader stream/file object is not in binary mode. It may not be read correctly. [pdf.py:1079]
Traceback (most recent call last):
File "E:/Shuffle PDF/shuffle.py", line 5, in <module>
input = PdfFileReader(sys.stdin)
File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1084, in __init__
self.read(stream)
File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1689, in read
stream.seek(-1, 2)
IOError: [Errno 22] Invalid argument
Process finished with exit code 1
From the comments I infer that the fact that a new PDF is created is only due to me typing "shuffle.py newfile.pdf" into the terminal :D
EDIT 2: I now figured it out; this now works:
from PyPDF2 import PdfFileReader, PdfFileWriter
import random, sys
output = PdfFileWriter()
input = PdfFileReader(file("test.pdf", "rb"))
pages = range(input.getNumPages())
random.shuffle(pages)
for i in pages:
    output.addPage(input.getPage(i))
outputStream = file(r"output2.pdf", "wb")
output.write(outputStream)
outputStream.close()
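The script in EDIT 2 only runs on Python 2: on Python 3, file() no longer exists and range() returns an object that random.shuffle cannot reorder. Below is a rough Python 3 sketch of the same idea, assuming a PyPDF2 release older than 3.0 where PdfFileReader and PdfFileWriter are still available. The function names shuffled_order and shuffle_pdf and the seed parameter are my own additions for illustration.

```python
import random


def shuffled_order(page_count, seed=None):
    """Return the page indices 0..page_count-1 in random order."""
    order = list(range(page_count))  # must be a real list before shuffling on Python 3
    random.Random(seed).shuffle(order)
    return order


def shuffle_pdf(src_path, dst_path, seed=None):
    """Write a copy of src_path with its pages in random order."""
    # Assumes the legacy PyPDF2 API (< 3.0); imported lazily on purpose.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(src_path, "rb") as src:
        reader = PdfFileReader(src)
        writer = PdfFileWriter()
        for index in shuffled_order(reader.getNumPages(), seed):
            writer.addPage(reader.getPage(index))
        # Write while the input is still open: page data is read from it on demand.
        with open(dst_path, "wb") as dst:
            writer.write(dst)
```

Calling shuffle_pdf("test.pdf", "output2.pdf") would mirror the file names used above. Passing a seed makes the order reproducible, which can help when debugging.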

Getting NotImplementedError while fetching table data from PDF using Python pdftables

I am using Python pdftables to fetch table data from a PDF, and I followed the instructions given in the GitHub repository
https://github.com/drj11/pdftables
but when I run the code
filepath = 'tests.pdf'
fileobj = open(filepath,'rb')
from pdftables.pdf_document import PDFDocument
doc = PDFDocument.from_fileobj(fileobj)
I get an error like this:
File "<stdin>", line 1, in <module>
File "pdftables/pdf_document.py", line 53, in from_fileobj
raise NotImplementedError
Can anyone help me with this problem?
If you look at the file implementing the from_fileobj function you can see the following comment:
# TODO(pwaller): For now, put fh into a temporary file and call
# .from_path. Future: when we have a working stream input function for
# poppler, use that.
If I understand it correctly you should instead use the from_path function as from_fileobj is not implemented yet. This is easy with your current code:
filepath = 'tests.pdf'
from pdftables.pdf_document import PDFDocument
doc = PDFDocument.from_path(filepath)
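If you only have a file object and no path (the situation the TODO comment describes), the workaround it mentions can be sketched with the standard library: copy the stream to a temporary file and hand that path to from_path. fileobj_to_temp_path is a hypothetical helper, not part of pdftables.

```python
import shutil
import tempfile


def fileobj_to_temp_path(fileobj, suffix=".pdf"):
    """Copy a binary file object to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        shutil.copyfileobj(fileobj, tmp)
        return tmp.name  # the caller is responsible for deleting this file


# Hypothetical usage with the call from the answer above:
# with open("tests.pdf", "rb") as fh:
#     doc = PDFDocument.from_path(fileobj_to_temp_path(fh))
```

The temporary file is kept on disk (delete=False) so that the library can reopen it by name; remove it with os.remove once the document has been processed.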

Converting rtf to pdf using python

I am new to Python and have been given the task of converting RTF to PDF with it. I googled and found some code (not exactly RTF to PDF), tried working with it and changed it to fit my requirement, but I am not able to get it working.
I have used the below code:
import sys
import os
import comtypes.client
#import win32com.client
rtfFormatPDF = 17
in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])
rtf= comtypes.client.CreateObject('Rtf.Application')
rtf.Visible = True
doc = rtf.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=rtfFormatPDF)
doc.Close()
rtf.Quit()
But it's throwing the error below:
Traceback (most recent call last):
File "C:/Python34/Lib/idlelib/rtf_to_pdf.py", line 12, in <module>
word = comtypes.client.CreateObject('Rtf.Application')
File "C:\Python34\lib\site-packages\comtypes\client\__init__.py", line 227, in CreateObject
clsid = comtypes.GUID.from_progid(progid)
File "C:\Python34\lib\site-packages\comtypes\GUID.py", line 78, in from_progid
_CLSIDFromProgID(str(progid), byref(inst))
File "_ctypes/callproc.c", line 920, in GetResult
OSError: [WinError -2147221005] Invalid class string
Can anyone help me with this?
I would really appreciate it if someone could suggest a better and faster way of doing it. I have around 200,000 files to convert.
Anisha
I used Marks's advice and changed it back to Word.Application, with my source pointing to the RTF files. Works perfectly! The process was slow, but still faster than the Java application my team was using. I have attached the final code to my question.
Final code:
This is the code that works with the Word application:
import sys
import os,os.path
import comtypes.client
wdFormatPDF = 17
input_dir = 'input directory'
output_dir = 'output directory'
for subdir, dirs, files in os.walk(input_dir):
    for file in files:
        in_file = os.path.join(subdir, file)
        output_file = file.split('.')[0]
        out_file = output_dir+output_file+'.pdf'
        word = comtypes.client.CreateObject('Word.Application')
        doc = word.Documents.Open(in_file)
        doc.SaveAs(out_file, FileFormat=wdFormatPDF)
        doc.Close()
        word.Quit()
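Two details of the path handling above are worth a second look: file.split('.')[0] cuts a name like report.v2.rtf down to report, and output_dir+output_file only works if output_dir already ends with a path separator. A small sketch of how the paths could be built with os.path instead; pdf_path_for and iter_rtf_files are hypothetical helper names, and the COM calls themselves are left out.

```python
import os


def pdf_path_for(in_file, output_dir):
    """Map an input document path to a .pdf path inside output_dir."""
    # splitext only removes the final extension, so dots in the name survive.
    stem, _ext = os.path.splitext(os.path.basename(in_file))
    return os.path.join(output_dir, stem + ".pdf")


def iter_rtf_files(input_dir):
    """Yield the full path of every .rtf file below input_dir."""
    for subdir, _dirs, files in os.walk(input_dir):
        for name in files:
            if name.lower().endswith(".rtf"):
                yield os.path.join(subdir, name)
```

In the loop above, out_file = pdf_path_for(in_file, output_dir) would replace the two lines that compute output_file and out_file.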
If you have LibreOffice installed on your system, it gives you the best solution.
import os
os.system('soffice --headless --convert-to pdf filename.rtf')
# os.system('libreoffice --headless -convert-to pdf filename.rtf')
# os.system('libreoffice6.3 --headless -convert-to pdf filename.rtf')
The command may vary between versions and platforms, but this is the best solution I have found.
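For a large batch it may be safer to call the converter through subprocess with an argument list rather than os.system, so that file names with spaces need no quoting and a failed conversion raises an error. This is only a sketch: the binary name soffice and the use of --outdir are assumptions about your installation, and build_convert_command and convert_to_pdf are hypothetical helper names.

```python
import subprocess


def build_convert_command(src_path, out_dir, binary="soffice"):
    """Return the argument list for a headless LibreOffice PDF conversion."""
    return [binary, "--headless", "--convert-to", "pdf", "--outdir", out_dir, src_path]


def convert_to_pdf(src_path, out_dir, binary="soffice"):
    """Run the conversion; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_convert_command(src_path, out_dir, binary), check=True)
```

For example, convert_to_pdf("filename.rtf", "pdf_out") would write the PDF into pdf_out, provided soffice is on the PATH.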
