Batch splitting PDFs by first table of contents level? - python

I'm looking to extract texts from PDFs for a data-mining task.
The PDFs I'm looking at contain multiple reports, each report has its own first level entry in the documents table of contents. Also, there is a written table of contents at the beginning of the PDF, which contains page numbers for each report ("from page - to page").
I'm looking for a way to either:
Split the PDF into the individual reports, in order to dump each of those into a .txt file.
Dump each section of the PDF into a .txt directly.
So far, I have been able to dump to entire file into a .txt using PDFminer (python), as follows:
# Not all imports are needed for this task
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
import sys
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO
def myparse(data):
fp = file(data, 'rb')
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in PDFPage.get_pages(fp):
interpreter.process_page(page)
#fp.close()
#device.close()
str = retstr.getvalue()
#retstr.close()
return str
t1 = myparse("part2.pdf")
text_file = open("part2.txt", "w")
text_file.write(t1)
text_file.close()
Also, this returns the entire structure of the table of contents:
# Open a PDF document.
fp = open('solar.pdf', 'rb')
parser = PDFParser(fp)
password = ""
document = PDFDocument(parser, password)
# Get the outlines of the document.
outlines = document.get_outlines()
for (level,title,dest,a,se) in outlines:
print (level, title, a)
Any idea how to go ahead from here? Any tools using python, R or bash would be easiest to use for me personally, but as long as it enables batch splitting based on the first outline level of the document, any solution would be great.
Thank you,
Matthias

I've found a straightforward solution for this using sejda-console:
from subprocess import call
import os
pdfname = "example.pdf"
outdir = "C:\\out\\%s" % pdfname
if not os.path.exists(outdir):
os.makedirs(outdir)
sejda = 'C:\\sejda\\bin\\sejda-console.bat'
call = sejda
call += ' splitbybookmarks'
call += ' --bookmarkLevel 1'
call += ' -f "%s"' % pdfname
call += ' -o "%s"' % outdir
print '\n', call
subprocess.call(call)
print "PDFs have been written to out-directory"
Abviously this requires the sejda programme: http://www.sejda.org/

Related

PyPDF2 - PdfFileReader - cannot extract text

I am looping through a directory and reading in numerous PDFs. I am extracting all text information from each page using a loop.
5/13 PDFs are throwing an error when trying to use .getNumPages(): Exception has occurred: ValueError invalid literal for int() with base 10: b''. I believe this error is occurring because the object (PyPDF2) is showing numPages: 0.
Current Code
dir = os.listdir(directory)
for f in dir:
object = PyPDF2.PdfFileReader(directory + '\\' + f)
NumPages = object.getNumPages()
text_output = "" # Initiate Variable
# Loop through all pages and extract/merge text
with open(directory + '\\' + f, mode='rb') as FileName:
reader = PyPDF2.PdfFileReader(FileName)
for p_num in range(0, NumPages):
page = reader.getPage(p_num)
text_output = text_output + '\n' + 'PAGE: ' + \
str(p_num + 1) + '\n' + page.extractText()
I added an image showing the object data where numPages: 0
I cannot figure out why only certain PDFs are having this issue. Any help would be appreciated!!
I have tested few pdf libraries and I have noticed PyMuPDF is best in reading pdf files.
Here code example:
import fitz
doc = fitz.open("file.pdf")
for page in doc:
text = page.getText()
print(text)
Well I also faced the same issues with PyPDF2 , so I used another python library named slate
Install the library
pip install slate3k
Then use the below code
import slate3k as slate
with open(file.pdf, 'rb') as f:
extracted_text = slate.PDF(f)
print(extracted_text)
I use pdfminer to extract pdf.
You can refer example code.
#pip install pdfminer.six
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
def convert_pdf_to_txt(path):
'''Convert pdf content from a file path to text
:path the file path
'''
rsrcmgr = PDFResourceManager()
codec = 'utf-8'
laparams = LAParams()
with io.StringIO() as retstr:
with TextConverter(rsrcmgr, retstr, codec=codec,
laparams=laparams) as device:
with open(path, 'rb') as fp:
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0
caching = True
pagenos = set()
for page in PDFPage.get_pages(fp,
pagenos,
maxpages=maxpages,
password=password,
caching=caching,
check_extractable=True):
interpreter.process_page(page)
return retstr.getvalue()
if __name__ == "__main__":
print(convert_pdf_to_txt('test.pdf'))
For more information about this lid. You can refer link below
PDFminer
Please check and respond to me if have any issue occur.

How to get text from local PDF file using Python

Please do not use "tika" for an answer.
I have already tried answers from this question:
How to extract text from a PDF file?
I have this PDF file, https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing , and I would like to copy the text.
import PyPDF2
pdfFileObject = open('C:\\Path\\To\\Local\\File\\Test_PDF.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
page = pdfReader.getPage(i)
print(page.extractText())
The output is "Date Submitted: 2019-10-21 16:03:36.093 | Form Key: 5544" which is only part of the text. The next line of text starts with "Exhibit A to RFA...."
I have never used PYPDF2 myself so can't really input my knowledge to find out exactly what's going wrong. But the following from the documentation states the following about the function extractText()
Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.
Here's an alternative way to get around this and also exaplains what maybe going wrong. I would also recommend using pdftotext. This has worked reliably for me many times; this answer will also prove helpful in that.
Found a solution.
#pip install pdfminer.six
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
def convert_pdf_to_txt(path):
'''Convert pdf content from a file path to text
:path the file path
'''
rsrcmgr = PDFResourceManager()
codec = 'utf-8'
laparams = LAParams()
with io.StringIO() as retstr:
with TextConverter(rsrcmgr, retstr, codec=codec,
laparams=laparams) as device:
with open(path, 'rb') as fp:
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0
caching = True
pagenos = set()
for page in PDFPage.get_pages(fp,
pagenos,
maxpages=maxpages,
password=password,
caching=caching,
check_extractable=True):
interpreter.process_page(page)
return retstr.getvalue()
if __name__ == "__main__":
print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf'))

Redirect output of a function to another folder in python

I am using python 3. My code uses pdfminer to convert pdf to text. I want to get the output of these files in a new folder. Currently it's coming in the existing folder from which it does the conversion to .txt using pdfminer. How do I redirect the output to a different folder. I want the output in a folder called "D:\extracted_text" Code till now:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
import glob
import os
def convert(fname, pages=None):
if not pages:
pagenums = set()
else:
pagenums = set(pages)
output = StringIO()
manager = PDFResourceManager()
converter = TextConverter(manager, output, laparams=LAParams())
interpreter = PDFPageInterpreter(manager, converter)
infile = open(fname, 'rb')
for page in PDFPage.get_pages(infile, pagenums):
interpreter.process_page(page)
infile.close()
converter.close()
text = output.getvalue()
output.close
savepath = 'D:/extracted_text/'
outfile = os.path.splitext(fname)[0] + '.txt'
comp_name = os.path.join(savepath,outfile)
print(outfile)
with open(comp_name, 'w', encoding = 'utf-8') as pdf_file:
pdf_file.write(text)
return text
directory = glob.glob(r'D:\files\*.pdf')
for myfiles in directory:
convert(myfiles)

Path not printing string values

I recently found this really handy library for pdf conversion. I am trying to convert a pdf to string values. In order to parse the data and convert to a csv file. I want to automate this for future so I cannot use Tabula.
I am calling some modules in order to convert pdf to string.
The part for string conversion is not working. (pdf2string.py)
Here is part for the pdf conversion to string.
I get no error. Success. But, there is no output.
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import HTMLConverter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
import re
import csv
import sys
def convert_pdf_to_html(path):
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = HTMLConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
fp = file(path, 'rb')
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0 #is for all
caching = True
pagenos=set()
for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
interpreter.process_page(page)
fp.close()
device.close()
str = retstr.getvalue()
retstr.close()
return str
print str
if __name__ == '__main__':
if len(sys.argv) == 2:
path = sys.argv[1]
convert_pdf_to_html(path)
This is my bash.
python pdf2string.py example.pdf
Script is pdf2string.py and path is example.pdf.
I am also new to high-level logic in python.
Edit: you are returning before printing - remove return str, or remove print str and use the advice below.
You're not printing the output of convert_pdf_to_html(), or saving it somewhere.
print convert_pdf_to_html(path)

Extract Text Using PdfMiner and PyPDF2 Merges columns

I am trying to parse the pdf file text using pdfMiner, but the extracted text gets merged. I am using the pdf file from the following link [edit: link was broken / pointed to potential malware]
I am good with any type of output (file/string). Here is the code which returns the extracted text as string for me but for some reason, columns are merged.
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
import StringIO
def convert_pdf(filename):
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec)
fp = file(filename, 'rb')
process_pdf(rsrcmgr, device, fp)
fp.close()
device.close()
str = retstr.getvalue()
retstr.close()
return str
I have also tried PyPdf2, but faced the same issue. Here is the sample code for PyPDF2
from PyPDF2 import PdfReader
import StringIO
def get_data_using_pypdf2(filename):
reader = PdfReader(filename)
content = ""
for page in reader.pages:
extracted_text = page.extract_text()
content += extracted_text + "\n"
content = " ".join(content.replace("\xa0", " ").strip().split())
return content.encode("ascii", "ignore")
I have also tried pdf2txt.py but unable to get the formatted output.
I recently struggled with a similar problem, although my pdf had slightly simpler structure.
PDFMiner uses classes called "devices" to parse the pages in a pdf fil. The basic device class is the PDFPageAggregator class, which simply parses the text boxes in the file. The converter classes , e.g. TextConverter, XMLConverter, and HTMLConverter also output the result in a file (or in a string stream as in your example) and do some more elaborate parsing for the contents.
The problem with TextConverter (and PDFPageAggregator) is that they don't recurse deep enough to the structure of the document to properly extract the different columns. The two other converters require some information about the structure of the document for display purposes, so they gather more detailed data. In your example pdf both of the simplistic devices only parse (roughly) the entire text box containing the columns, which makes it impossible (or at least very difficult) to correctly separate the different rows. The solution to this that I found works pretty well, is to either
Create a new class that inherits from PDFPageAggregator, or
Use XMLConverter and parse the resulting XML document using e.g. Beautifulsoup
In both cases you would have to combine the different text segments to rows using their bounding box y-coordinates.
In the case of a new device class ('tis more eloquent, I think) you would have to override the method receive_layout that get's called for each page during the rendering process. This method then recursively parses the elements in each page. For example, something like this might get you started:
from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTPage, LTChar, LTAnno, LAParams, LTTextBox, LTTextLine
class PDFPageDetailedAggregator(PDFPageAggregator):
def __init__(self, rsrcmgr, pageno=1, laparams=None):
PDFPageAggregator.__init__(self, rsrcmgr, pageno=pageno, laparams=laparams)
self.rows = []
self.page_number = 0
def receive_layout(self, ltpage):
def render(item, page_number):
if isinstance(item, LTPage) or isinstance(item, LTTextBox):
for child in item:
render(child, page_number)
elif isinstance(item, LTTextLine):
child_str = ''
for child in item:
if isinstance(child, (LTChar, LTAnno)):
child_str += child.get_text()
child_str = ' '.join(child_str.split()).strip()
if child_str:
row = (page_number, item.bbox[0], item.bbox[1], item.bbox[2], item.bbox[3], child_str) # bbox == (x1, y1, x2, y2)
self.rows.append(row)
for child in item:
render(child, page_number)
return
render(ltpage, self.page_number)
self.page_number += 1
self.rows = sorted(self.rows, key = lambda x: (x[0], -x[2]))
self.result = ltpage
In the code above, each found LTTextLine element is stored in an ordered list of tuples containing the page number, coordinates of the bounding box, and the text contained in that particular element. You would then do something similar to this:
from pprint import pprint
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
fp = open('pdf_doc.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
doc.initialize('password') # leave empty for no password
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageDetailedAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.create_pages(doc):
interpreter.process_page(page)
# receive the LTPage object for this page
device.get_result()
pprint(device.rows)
The variable device.rows contains the ordered list with all the text lines arranged using their page number and y-coordinates. You can loop over the text lines and group lines with the same y-coordinates to form the rows, store the column data etc.
I tried to parse your pdf using the above code and the columns are mostly parsed correctly. However, some of the columns are so close together that the default PDFMiner heuristics fail to separate them into their own elements. You can probably get around this by tweaking the word margin parameter (the -W flag in the command line tool pdf2text.py). In any case, you might want to read through the (poorly documented) PDFMiner API as well as browse through the source code of PDFMiner, which you can obtain from github. (Alas, I cannot paste the link because I do not have sufficient rep points :'<, but you can hopefully google the correct repo)
I tried your first block of code and got a bunch of results that look like this:
MULTIPLE DWELLING AGARDEN COMPLEX 14945010314370 TO 372WILLOWRD W MULTIPLE DWELLING AGARDEN COMPLEX 14945010314380 TO 384WILLOWRD W MULTIPLE DWELLING AGARDEN COMPLEX 149450103141000 TO 1020WILLOWBROOKRD MULTIPLE DWELLING AROOMING HOUSE 198787
I am guessing you are in a similar position as this answer and that all the whitespace is used to position the words in the proper place, not as actual printable space characters. The fact that you have tried with with other pdf libraries makes me think that this might be an issue that is difficult for any pdf library to parse.
Solution provided by #hlindblo gave pretty good results. To further group the extracted text chunks by page and paragraph, here are the simple commands I used.
from collections import OrderedDict
grouped_text = OrderedDict()
for p in range(1000): # max page nb is 1000
grouped_text[p] = {}
for (page_nb, x_min, y_min, x_max, y_max, text) in device.rows:
x_min = round(x_min)//10 # manipulate the level of aggregation --> x_min might be slitghly different
try:
grouped_text[page_nb][x_min]+= " " + text
except:
grouped_text[page_nb][x_min] = text

Categories