Error while pdf parsing PyPDF2 and textract

I'm trying to build a program that looks for specific words or short phrases in a PDF file. The files load fine, but the search fails as soon as the page changes. Here's my code:
import PyPDF2
import glob, os, shutil
import textract
os.chdir(r"C:\Users\Dani\Desktop\patent")
goodfiles=[]
for file in glob.glob("*.pdf"):
    pdfFileObj = open(file, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)
    search_word_main = "isophthalic"
    word_main=[]
    search_word_sub = ["acid index", "acid value", "acid number", "acidity index","acidity"]
    word_sub=[]
    for pageNum in range(1, pdfReader.numPages):
        pageObj = pdfReader.getPage(pageNum)
        text = pageObj.extractText().encode('utf-8')
        search_text = text.lower().split()
        toprange=len(search_text)-1
        for len in range (toprange):
            if search_word_main in search_text[len].decode("utf-8"):
                print(search_word_main)
            for key in search_word_sub:
                if key in search_text[len].decode("utf-8") + " " + search_text[len+1].decode("utf-8"):
                    print(key)
On the first page of the PDF everything works, but as soon as it moves to the second page I get this error:
Traceback (most recent call last):
  File "test.py", line 19, in <module>
    toprange=len(search_text)-1
TypeError: 'int' object is not callable
I don't understand why this happens whenever the page changes. If I just print the toprange variable rather than using it in the loop, there's no problem and I get the values for toprange. There seems to be a problem with the for loop, but I can't find where. Could you help me solve this?
Thanks in advance.
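The traceback points at the cause: the inner loop `for len in range (toprange)` rebinds the name `len` to an integer, so when the next page is processed, `len(search_text)` tries to call that integer. The first page works only because `len` is still the built-in at that point. A minimal sketch of the fix follows, assuming the search runs on already-extracted page text; `find_terms` is a hypothetical helper, not part of PyPDF2, and the sample sentences are made up for illustration.

```python
def find_terms(page_text, main_word, sub_phrases):
    """Return the main word and sub-phrases found in one page of text.

    Hypothetical helper: it takes plain text, not a PyPDF2 page object.
    """
    words = page_text.lower().split()
    found = []
    # Use `i` as the index so the built-in len() is never shadowed.
    for i in range(len(words)):
        if main_word in words[i]:
            found.append(main_word)
        # Join the word with its successor (if any) to match two-word phrases.
        pair = " ".join(words[i:i + 2])
        for phrase in sub_phrases:
            if phrase in pair:
                found.append(phrase)
    return found


pages = ["Isophthalic acid was added.", "The acid value was measured."]
for text in pages:
    # len() still works on every page because it was never reassigned.
    print(find_terms(text, "isophthalic", ["acid value", "acidity"]))
```

Renaming the loop variable (here to `i`) is the whole fix; the rest of the original loop can stay as it is.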

Related

I'm writing a script in Python that merges a list of PDFs. The code works in many cases but fails for a few PDFs

With this script I want to create a single PDF file that combines all the PDFs in a folder. The folder is passed from the terminal and read in the line input_folder_pdf = sys.argv[1]; the output folder is created if it does not exist.
This code worked:
import PyPDF2
import sys
import os
input_folder_pdf = sys.argv[1]
output_folder_file = sys.argv[2]
if not os.path.exists(output_folder_file):
    os.makedirs(output_folder_file)
    print(output_folder_file, 'folder created!')
input_name = input('name the combined pdf : ')
pdf_inputs = []
for filename in os.listdir(input_folder_pdf):
    name = f'{input_folder_pdf}{filename}'
    pdf_inputs.append(name)
merger = PyPDF2.PdfFileMerger(strict=False)
for pdf in pdf_inputs:
    merger.append(pdf)
    print(pdf, ' added!!')
combined_name = str(output_folder_file) + str(input_name) + '.pdf'
merger.write(combined_name)
print('Done!')
I ran it like this in PowerShell:
python3.9.exe {this_script.py} {.\the_path_folder_of_the_pdfs\} {.\the_output_folder\}
Giving a folder name that contains spaces causes a problem with the sys.argv[1] index for me.
Here is a link to files that work well with this code:
folder of pdfs that works well with this code
Here is the link to the files that cause the problem:
pdf1
pdf2
This error occurs:
C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\PyPDF2\_reader.py:993: PdfReadWarning: Invalid stream (index 35) within object 96 0: Stream has ended unexpectedly
  warnings.warn(
Traceback (most recent call last):
  File "F:\download telegram desktop\Manga\jujustu_kaisen\Pdf_combiner.py", line 22, in <module>
    merger.write(final_name)
  File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\PyPDF2\generic.py", line 800, in read_from_stream
    data["__streamdata__"] = stream.read(length)
TypeError: argument should be integer or None, not 'NullObject'
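Independently of the PyPDF2 error above, the path handling in the script can be made less fragile. A minimal sketch, assuming the goal is to collect only the .pdf files directly inside the input folder: `collect_pdf_paths` is a hypothetical helper, and `os.path.join` replaces the f-string so the folder argument no longer needs a trailing separator. It does not address the `NullObject` error itself.

```python
import os
import tempfile


def collect_pdf_paths(folder):
    """Return sorted full paths of the .pdf files directly inside folder."""
    paths = []
    for filename in sorted(os.listdir(folder)):
        if filename.lower().endswith(".pdf"):
            paths.append(os.path.join(folder, filename))
    return paths


# Demonstration with a temporary folder whose name contains a space.
with tempfile.TemporaryDirectory(prefix="my pdfs ") as folder:
    for name in ("b.pdf", "a.PDF", "notes.txt"):
        open(os.path.join(folder, name), "wb").close()
    print([os.path.basename(p) for p in collect_pdf_paths(folder)])
```

On the command line, a folder name that contains spaces has to be quoted so that it arrives as a single `sys.argv` entry.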

Having maximum of two page pdfs in a directory?

from PyPDF2 import PdfFileReader, PdfFileWriter
import os as os
listdir = os.listdir(r"C:\Users\Max12\Desktop\xml\pdfminer\UiPath\attachments\75090058\Status\Verwerking")
for file in listdir:
    if file.endswith(".pdf"):
        pdf_file_path = 'Unknown.pdf'
        file_base_name = file.replace('.pdf', '')
        pdf = PdfFileReader(file)
        pages = [0, 1] # page 1, 2
        pdfWriter = PdfFileWriter()
    else:
        pass
for page_num in pages:
    pdfWriter.addPage(pdf.getPage(page_num))
with open('{0}_subset.pdf'.format(file_base_name), 'wb') as f:
    pdfWriter.write(f)
    f.close()
Hi all,
I want to trim the PDF files in a directory to at most two pages: any file with more than two pages should be cut down to its first two. I've written the code above.
However, my IDE reports the following error:
Traceback (most recent call last):
  File "file.py", line 16, in <module>
    pdfWriter.addPage(pdf.getPage(page_num))
  File "C:\Python38\lib\site-packages\PyPDF2\pdf.py", line 1177, in getPage
    return self.flattenedPages[pageNumber]
IndexError: list index out of range
I don't know what I'm doing wrong. Can anyone help me?

The code fails if a PDF file has only one page, because pages = [0, 1] then asks for a second page that does not exist. Since pdf.getNumPages() returns the number of pages in the file, you can replace pages = [0, 1] with pages = range(min(2, pdf.getNumPages())) to fix this.
Additionally, you iterate over the PDF files in the directory but then process only the last one, which is not what you want. The second for loop and the with statement should be inside the if block.
Overall, the following should work:
from PyPDF2 import PdfFileReader, PdfFileWriter
import os
folder = r"C:\Users\Max12\Desktop\xml\pdfminer\UiPath\attachments\75090058\Status\Verwerking"
for file in os.listdir(folder):
    if file.endswith(".pdf"):
        file_base_name = file.replace('.pdf', '')
        # os.listdir returns bare file names, so join them with the folder
        pdf = PdfFileReader(os.path.join(folder, file))
        # Take at most the first two pages, even if the file has only one
        pages = range(min(2, pdf.getNumPages()))
        pdfWriter = PdfFileWriter()
        for page_num in pages:
            pdfWriter.addPage(pdf.getPage(page_num))
        # The with block closes the file, so no explicit close() is needed
        with open('{0}_subset.pdf'.format(file_base_name), 'wb') as f:
            pdfWriter.write(f)
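The clamping step can be checked on its own, without any PDF file. A small sketch, where `first_page_indices` is a hypothetical helper name and not part of PyPDF2:

```python
def first_page_indices(num_pages, limit=2):
    """Return the zero-based indices of at most `limit` leading pages."""
    return list(range(min(limit, num_pages)))


# Try a few page counts, including an empty and a one-page document.
for n in (0, 1, 2, 5):
    print(n, first_page_indices(n))
```

Because the range never exceeds the real page count, getPage is only ever asked for pages that exist.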

Getting Assertion error while reading the PDF file python - pypdf2

I am getting the below error when I try to read a PDF file.
Code:
from PyPDF2 import PdfFileReader
import os
os.chdir("Path to dir")
pdf_document = 'sample.pdf'
pdf = PdfFileReader(pdf_document,'rb') #Error here
Error:
Traceback (most recent call last):
  File "/home/krishna/PycharmProjects/sample/sample.py", line 9, in <module>
    pdf = PdfFileReader(filehandle)
  File "/home/krishna/PycharmProjects/AI_DRC/venv/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1084, in __init__
    self.read(stream)
  File "/home/krishna/PycharmProjects/AI_DRC/venv/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1838, in read
    assert start >= last_end
AssertionError
NOTE: File is 18 MB in size

I wrote the following and it works for me. The PDF is in the same folder as the script; you can also build the path string with os.
import PyPDF2
pdf_file = PyPDF2.PdfFileReader("Sample.pdf")  # open the file; a path built with os works as well
page_content = pdf_file.getPage(0).extractText()  # get page one (index zero) and extract its text
print(page_content)  # then do whatever you want with it
I think the problem with your program is the "rb" argument. That is a mode for open() in normal file handling, not something PdfFileReader expects. PyPDF2 provides the classes PdfFileReader, PdfFileWriter and PdfFileMerger, and PdfFileReader accepts the file path directly.
Hope it helps.
If you encounter any problem, just mention it and I will try to get back to you.

word count PDF files when walking directory

Hello Stack Overflow community!
I'm trying to build a Python program that walks a directory (and all sub-directories) and keeps an accumulated word count across all .html, .txt, and .pdf files. Reading a .pdf file requires a little something extra (PdfFileReader) to parse the file. When parsing .pdf files I get the following error and the program stops:
AttributeError: 'PdfFileReader' object has no attribute 'startswith'
When not parsing .pdf files, the program completes successfully.
CODE
#!/usr/bin/python
import re
import os
import sys
import os.path
import fnmatch
import collections
from PyPDF2 import PdfFileReader
ignore = [<lots of words>]
def extract(file_path, counter):
    words = re.findall('\w+', open(file_path).read().lower())
    counter.update([x for x in words if x not in ignore and len(x) > 2])
def search(path):
    print path
    counter = collections.Counter()
    if os.path.isdir(path):
        for root, dirs, files in os.walk(path):
            for file in files:
                if file.lower().endswith(('.html', '.txt')):
                    print file
                    extract(os.path.join(root, file), counter)
                if file.lower().endswith(('.pdf')):
                    file_path = os.path.abspath(os.path.join(root, file))
                    print file_path
                    with open(file_path, 'rb') as f:
                        reader = PdfFileReader(f)
                        extract(os.path.join(root, reader), counter)
                        contents = reader.getPage(0).extractText().split('\n')
                        extract(os.path.join(root, contents), counter)
                        pass
    else:
        extract(path, counter)
    print(counter.most_common(50))
search(sys.argv[1])
The full error
Traceback (most recent call last):
  File line 50, in <module>
    search(sys.argv[1])
  File line 36, in search
    extract(os.path.join(root, reader), counter)
  File line 68, in join
    if b.startswith('/'):
AttributeError: 'PdfFileReader' object has no attribute 'startswith'
It appears there is a failure when calling the extract function with the .pdf file. Any help/guidance would be greatly appreciated!
Expected Results (works w/out .pdf files)
[('cyber', 5101), ('2016', 5095), ('date', 4912), ('threat', 4343)]

The problem is that this line
reader = PdfFileReader(f)
returns an object of type PdfFileReader. You then pass this object to os.path.join() and on to the extract() function, both of which expect a file path string and not a PdfFileReader object.
My suggestion is to move the PDF-related processing that you currently have in the search() function into the extract() function instead. Then, in extract(), you check whether the file is a PDF and act accordingly. So, something like this:
def extract(file_path, counter):
    if file_path.lower().endswith('.pdf'):
        # Pass the path string, not a reader object
        reader = PdfFileReader(file_path)
        # Only the first page is read here, as in the question
        text = reader.getPage(0).extractText().lower()
        words = re.findall(r'\w+', text)
        counter.update([x for x in words if x not in ignore and len(x) > 2])
    elif file_path.lower().endswith(('.html', '.txt')):
        words = re.findall(r'\w+', open(file_path).read().lower())
        counter.update([x for x in words if x not in ignore and len(x) > 2])
    else:
        pass  # some other file type...
Haven't tested the code snippet above but hopefully you should get the idea.
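The same idea can be sketched in Python 3 under a few assumptions: `count_words` and `walk_and_count` are hypothetical names, the `IGNORE` set is a placeholder for the real ignore list, and PDF text extraction is passed in as a callable (`pdf_to_text`) so that the sketch itself does not depend on PyPDF2. With PyPDF2, that callable could wrap the getPage(...).extractText() calls shown above.

```python
import collections
import os
import re

IGNORE = {"the", "and"}  # placeholder for the real ignore list


def count_words(text, counter):
    """Add the words of text to counter, skipping short and ignored words."""
    words = re.findall(r"\w+", text.lower())
    counter.update(w for w in words if w not in IGNORE and len(w) > 2)


def walk_and_count(path, pdf_to_text=None):
    """Walk path and count words in .html, .txt and (optionally) .pdf files.

    pdf_to_text is an assumed callable: file path in, extracted text out.
    """
    counter = collections.Counter()
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            lower = name.lower()
            if lower.endswith((".html", ".txt")):
                with open(file_path, encoding="utf-8", errors="ignore") as f:
                    count_words(f.read(), counter)
            elif lower.endswith(".pdf") and pdf_to_text is not None:
                # Always pass the path (a string), never the reader object.
                count_words(pdf_to_text(file_path), counter)
    return counter
```

A call such as `walk_and_count(sys.argv[1]).most_common(50)` would then mirror the output of the original script. Keeping the file-type dispatch in one place means os.path.join only ever sees strings.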

Python: data to file then data from text file to list - TypeError: must be str, not bytes

I'm a beginner in programming and have decided to teach myself Python. After a few days, I decided to code a little piece. It's pretty simple. It records:
date of today
page I am at (I'm reading a book)
how I feel
Then I add the data to a file. Every time I launch the program, it adds a new line of data to the file.
Then I extract the data to make a list of lists.
The truth is, I wanted to rewrite my program to pickle a list and then unpickle the file. However, as I've hit an error I can't handle, I really want to understand how to solve it. I hope you will be able to help me out :)
I've been struggling for the past few hours with this apparently simple and stupid problem, but I can't find the solution. Here is the error and the code:
ERROR:
Traceback (most recent call last):
  File "dailyshot.py", line 25, in <module>
    SaveData(todaysline)
  File "dailyshot.py", line 11, in SaveData
    mon_pickler.dump(datatosave)
TypeError: must be str, not bytes
CODE:
import pickle
import datetime
def SaveData(datatosave):
    with open('journey.txt', 'wb') as thefile:
        my_pickler = pickle.Pickler(thefile)
        my_pickler.dump(datatosave)
    thefile.close()
todaylist = []
today = datetime.date.today()
todaylist.append(today)
page = input('Page Number?\n')
feel = input('How do you feel?\n')
todaysline = today.strftime('%d, %b %Y') + "; " + page + "; " + feel + "\n"
print('Thanks and Good Bye!')
SaveData(todaysline)
print('let\'s make a list now...')
thefile = open('journey.txt','rb')
thelist = [line.split(';') for line in thefile.readlines()]
thefile.close()
print(thelist)
Thanks a looot!

OK, so there are a few things to comment on here:
When you use a with statement, you don't have to close the file explicitly. Python does that for you at the end of the with block (line 8).
You don't use todaylist for anything. You create it, add an element and then just discard it, so it's probably useless :)
Why are you pickling a string object? If you have strings, just write them to the file as they are.
If you pickle data on write, you have to unpickle it on read. You shouldn't write pickled data and then read the file back as plain text.
Use mode 'a' (append) when you are just adding items to the file; 'w' will overwrite the whole file.
What I would suggest is just writing a plain text file in which every line is one entry.
import datetime
def save(data):
    with open('journey.txt', 'a') as f:
        f.write(data + '\n')
today = datetime.date.today()
page = input('Page Number: ')
feel = input('How do you feel: ')
todaysline = ';'.join([today.strftime('%d, %b %Y'), page, feel])
print('Thanks and Good Bye!')
save(todaysline)
print('let\'s make a list now...')
with open('journey.txt','r') as f:
    for line in f:
        print(line.strip().split(';'))
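If pickling is still wanted, a minimal sketch of the round trip follows, assuming the entries are kept as a list of lists in a single file. The helper names `load_entries` and `add_entry` are hypothetical, as is any file name passed to them; the point is that a file written with pickle.dump in binary mode is read back with pickle.load, not with readlines().

```python
import os
import pickle


def load_entries(path):
    """Return the pickled list of entries, or an empty list if no file yet."""
    if not os.path.exists(path):
        return []
    with open(path, "rb") as f:  # binary mode for reading pickled data
        return pickle.load(f)


def add_entry(path, entry):
    """Append one entry and rewrite the whole pickled list."""
    entries = load_entries(path)
    entries.append(entry)
    with open(path, "wb") as f:  # binary mode for writing pickled data
        pickle.dump(entries, f)
    return entries
```

Each program run would call `add_entry` once with the new `[date, page, feel]` list and could then print `load_entries(...)` to show the full list of lists. The whole list is rewritten on every run, which is fine for a small journal but is a design choice rather than a requirement.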

Are you sure you posted the right code? That error can occur if you leave out the "b" when you open the file,
e.g.
with open('journey.txt', 'w') as thefile:
>>> with open('journey.txt', 'w') as thefile:
...     pickler = pickle.Pickler(thefile)
...     pickler.dump("some string")
...
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
TypeError: must be str, not bytes
The file should be opened in binary mode:
>>> with open('journey.txt', 'wb') as thefile:
...     pickler = pickle.Pickler(thefile)
...     pickler.dump("some string")
...
>>>
