I am currently using the class provided in the answer here:
How to extract text and text coordinates from a pdf file?
The class provided is very helpful in that I can get the position of every text box in a PDF. The class given also inserts a '_' every time there is a new line within the textbox.
I was wondering whether there was some way to get the position of each line of text within the textbox as well?
Found it: the solution is to recurse even when the object is a TextBox, until a text line is found. The class below should print the x and y coordinates of every line of text in a PDF when the parsepdf method is called.
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer
class pdfPositionHandling:

    def parse_obj(self, lt_objs):
        # loop over the object list
        for obj in lt_objs:
            if isinstance(obj, pdfminer.layout.LTTextLine):
                print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text().replace('\n', '_')))
            # if it's a textbox, also recurse
            if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
                self.parse_obj(obj._objs)
            # if it's a container, recurse
            elif isinstance(obj, pdfminer.layout.LTFigure):
                self.parse_obj(obj._objs)

    def parsepdf(self, filename, startpage, endpage):
        # Open a PDF file.
        fp = open(filename, 'rb')
        # Create a PDF parser object associated with the file object.
        parser = PDFParser(fp)
        # Create a PDF document object that stores the document structure.
        # Password for initialization as 2nd parameter
        document = PDFDocument(parser)
        # Check if the document allows text extraction. If not, abort.
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed
        # Create a PDF resource manager object that stores shared resources.
        rsrcmgr = PDFResourceManager()
        # Create a PDF device object.
        device = PDFDevice(rsrcmgr)
        # BEGIN LAYOUT ANALYSIS
        # Set parameters for analysis.
        laparams = LAParams()
        # Create a PDF page aggregator object.
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # Create a PDF interpreter object.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        i = 0
        # loop over all pages in the document
        for page in PDFPage.create_pages(document):
            if i >= startpage and i <= endpage:
                # read the page into a layout object
                interpreter.process_page(page)
                layout = device.get_result()
                # extract text from this object
                self.parse_obj(layout._objs)
            i += 1
This code is a Python script to convert a PDF file to a .docx file. It performs the following steps:
Import the necessary libraries and modules, including codecs, chardet, pdfminer, and python-docx.
Detect the text encoding of the PDF file by opening it in binary mode and passing its contents to the chardet library's detect function. The function returns a dictionary of encoding information, and the script stores the value of the "encoding" key in the "encoding" variable.
Use pdfminer to convert the PDF file to text. PDFResourceManager is used to store shared resources such as fonts or images used by multiple pages. PDFPageInterpreter is used to process each page of the PDF and extract the text. The extracted text is stored in a StringIO object named "retstr".
Decode the extracted text using the codecs.decode function and the detected encoding, and store the result in the "text" variable.
Create a new Document object from the python-docx library, add a paragraph containing the converted text, and save the .docx file as "output.docx".
I have attached my experimental Python code below:
import codecs
import chardet
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
from docx import Document

# Detect the text encoding of the PDF file
with open("input.pdf", "rb") as pdf_file:
    result = chardet.detect(pdf_file.read())
    encoding = result["encoding"]

# Convert the PDF file to text using pdfminer
rsrcmgr = PDFResourceManager()
retstr = StringIO()
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, laparams=laparams)
with open("input.pdf", "rb") as pdf_file:
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(pdf_file):
        interpreter.process_page(page)
text = retstr.getvalue()

# Convert the text to Unicode using the detected encoding
text = codecs.decode(text, encoding)

# Save the converted text to a .docx file
doc = Document()
doc.add_paragraph(text)
doc.save("output.docx")
But I am getting an error on line 27 of the code.
TypeError: decode() argument 'encoding' must be str, not None
After updating the line 27 code to text = text.decode(encoding) I am now getting
AttributeError: 'str' object has no attribute 'decode'
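A likely explanation, sketched with only the standard library: in Python 3 the value returned by StringIO.getvalue() is already a str, so there is nothing left to decode. The None encoding is presumably chardet failing to guess an encoding for the raw bytes of a PDF, which are largely binary (that part is an assumption, not something verified against this file). The stand-in string below is hypothetical; in the real script it would be written by TextConverter.

```python
import codecs
from io import StringIO

# Stand-in for the buffer that TextConverter writes into (hypothetical content).
retstr = StringIO()
retstr.write("Text extracted from the PDF")
text = retstr.getvalue()

# getvalue() already returns a Unicode str, so no decoding step is needed.
print(type(text).__name__)

# Reproduce the first error: an encoding of None is rejected by codecs.decode.
first_error = None
try:
    codecs.decode(text, None)
except TypeError as exc:
    first_error = str(exc)
    print("TypeError:", first_error)

# Reproduce the second error: str has no decode() method in Python 3.
has_decode = hasattr(text, "decode")
print("str has decode():", has_decode)
```

If that is indeed the cause, dropping both the chardet step and the codecs.decode line, and passing retstr.getvalue() straight to doc.add_paragraph(), would probably be enough.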
I have been going round in circles trying to get this to work and for the life of me I cannot. I would greatly appreciate it if someone could help me out with this.
I am trying to create a Python application that can scan a folder of documents, and any subfolders, for PDF files. It will then go through each document and look for a specific phrase. When it finds the phrase, it will add an entry to a .txt file with the document name and the page number where the phrase occurs. Once it is complete, the .txt file will give the user a report of which documents contain the phrase and where it is located.
I am using Python and PDFminer.six and Tkinter
So far my code is as follows.
import tkinter as tk
from tkinter import filedialog
import os
import re
from tqdm import tqdm
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import PDFPageAggregator

root = tk.Tk()
root.withdraw()

def extract_text_from_pdf(pdf_path):
    resource_manager = PDFResourceManager()
    device = PDFPageAggregator(resource_manager)
    interpreter = PDFPageInterpreter(resource_manager, device)
    laparams = LAParams()
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            interpreter.process_page(page)
            layout = device.get_result()
            for element in layout:
                if element.get_text() and (element.get_text()).strip():
                    yield element.get_text(), element.bbox

# Function to scan through a folder and its subfolders
def scan_folder(folder_path, phrase):
    # Create a dictionary to store the results
    results = {}
    # Iterate through the selected folder and its subfolders
    for root, dirs, files in os.walk(folder_path):
        for file in tqdm(files):
            file_path = os.path.join(root, file)
            # Check if the file is a pdf file
            if file.endswith(".pdf"):
                matches = []
                for text, bbox, page_id in extract_text_from_pdf(file_path):
                    # Search for the specified phrase
                    match = re.search(phrase, text)
                    if match:
                        matches.append({'word': phrase, 'location': bbox, 'page': page_id+1})
                results[file_path] = matches
    # prompt the user to select a location to save the text file
    file_path = filedialog.asksaveasfilename(defaultextension=".txt",
                                             initialfile="results.txt",
                                             initialdir=folder_path)
    # write the results to the selected text file
    with open(file_path, 'w') as f:
        for key, value in results.items():
            for match in value:
                f.write(key + " : " + match['word'] + " found at " + str(match['location']) + " on page " + str(match['page']) + '\n')
    return results

# Example usage
folder_path = filedialog.askdirectory(initialdir = "C:/", title = "Select folder")
phrase = input("Enter the phrase you want to search for: ")
results = scan_folder(folder_path, phrase)
But I keep running into problem after problem, the latest one being this
Exception has occurred: AttributeError
'LTCurve' object has no attribute 'get_text'
  File "C:\Users\Edward Baker\OneDrive - Folley Electrical\Desktop\Document scanning project\PDF_Scanner_New.py", line 31, in extract_text_from_pdf
    if element.get_text() and (element.get_text()).strip():
       ^^^^^^^^^^^^^^^^
  File "C:\Users\Edward Baker\OneDrive - Folley Electrical\Desktop\Document scanning project\PDF_Scanner_New.py", line 47, in scan_folder
    for text, bbox, page_id in extract_text_from_pdf(file_path):
  File "C:\Users\Edward Baker\OneDrive - Folley Electrical\Desktop\Document scanning project\PDF_Scanner_New.py", line 66, in <module>
    results = scan_folder(folder_path, phrase)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'LTCurve' object has no attribute 'get_text'
I would love some help on this little side project
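One possible way around the AttributeError, as a rough sketch: a page layout also contains non-text elements such as LTCurve, so only elements that actually expose get_text() should be queried. The version below is duck-typed and uses stand-in classes (FakeTextBox and FakeCurve are hypothetical, as is the helper name iter_text_elements) so that it runs without pdfminer; with pdfminer.six, an isinstance check against pdfminer.layout.LTTextContainer is probably the more explicit choice. It also yields the page index, which the three-value unpacking in scan_folder expects.

```python
def iter_text_elements(layout, page_id):
    """Yield (text, bbox, page_id) for layout elements that carry text."""
    for element in layout:
        get_text = getattr(element, "get_text", None)
        if get_text is None:
            continue  # curves, lines, rectangles, images, ...
        text = get_text()
        if text and text.strip():
            yield text, element.bbox, page_id


# Hypothetical stand-ins for pdfminer layout objects, for demonstration only.
class FakeTextBox:
    def __init__(self, text, bbox):
        self._text = text
        self.bbox = bbox

    def get_text(self):
        return self._text


class FakeCurve:
    bbox = (0, 0, 10, 10)  # has a bounding box but no get_text()


layout = [
    FakeTextBox("Hello\n", (1, 2, 3, 4)),
    FakeCurve(),
    FakeTextBox("  \n", (5, 6, 7, 8)),  # whitespace only, skipped
]
found = list(iter_text_elements(layout, page_id=0))
print(found)
```

In extract_text_from_pdf this could be driven by something like for page_id, page in enumerate(PDFPage.get_pages(fh)). Note also that laparams is created there but never passed to PDFPageAggregator; without it no layout analysis runs, so the page would hold individual characters rather than text boxes.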
So I've just played around with PDFMiner and can now extract text from a PDF and throw it into an html or textfile.
pdf2txt.py -o outputfile.txt -t txt inputfile.pdf
I have then written a simple script to extract all lines containing a certain string:
with open('outputfile.txt', 'r') as searchfile:
    for line in searchfile:
        if 'HELLO' in line:
            print(line)
And now I can use all these strings containing the word HELLO to add to my database, if that is what I wanted.
My question is:
Is this the only way, or can PDFMiner filter content conditionally before it even writes it out to txt, html, or straight into the database?
Well, yes, you can: PDFMiner has an API.
The basic example says:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
# Open a PDF file.
fp = open('mypdf.pdf', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization.
document = PDFDocument(parser, password)
# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
device = PDFDevice(rsrcmgr)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    # do stuff with the page here
and in the loop you should go with
# receive the LTPage object for the page.
layout = device.get_result()
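and then work with the LTTextBox objects it contains; you have to walk the layout yourself. Note that get_result() is provided by PDFPageAggregator rather than the plain PDFDevice, so the device above would need to be created as PDFPageAggregator(rsrcmgr, laparams=LAParams()). There's no full example in the docs, but you can check the pdf2txt.py source, which will help you find the missing pieces (although it does much more, since it parses options and applies them).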
And once you have the code that extracts the text, you can do what you want before saving a file. Including searching for certain parts of text.
PS looks like this was, in a way, asked before: How do I use pdfminer as a library which should be helpful, too.
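As a small illustration of the point about searching before saving, here is a sketch that filters extracted text in memory before anything is written out. The function name find_lines and the sample text are made up for the example; in practice the string would come from whatever extraction code you end up with.

```python
def find_lines(text, keyword):
    """Return (line_number, line) pairs for the lines that contain keyword."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if keyword in line
    ]


# Sample text standing in for the output of the extraction step.
extracted = "HELLO world\nnothing here\nsay HELLO again\n"
hits = find_lines(extracted, "HELLO")
for number, line in hits:
    print(number, line)
```

The resulting list can then be written to a file or inserted into a database directly, without the intermediate txt file.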
I'm looking to extract texts from PDFs for a data-mining task.
The PDFs I'm looking at contain multiple reports, and each report has its own first-level entry in the document's table of contents. There is also a written table of contents at the beginning of the PDF, which contains page numbers for each report ("from page - to page").
I'm looking for a way to either:
Split the PDF into the individual reports, in order to dump each of those into a .txt file.
Dump each section of the PDF into a .txt directly.
So far, I have been able to dump the entire file into a .txt using PDFMiner (Python), as follows:
# Not all imports are needed for this task
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
import sys
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO
def myparse(data):
    fp = file(data, 'rb')
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page contained in the document.
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    #fp.close()
    #device.close()
    str = retstr.getvalue()
    #retstr.close()
    return str

t1 = myparse("part2.pdf")
text_file = open("part2.txt", "w")
text_file.write(t1)
text_file.close()
Also, this returns the entire structure of the table of contents:
# Open a PDF document.
fp = open('solar.pdf', 'rb')
parser = PDFParser(fp)
password = ""
document = PDFDocument(parser, password)
# Get the outlines of the document.
outlines = document.get_outlines()
for (level, title, dest, a, se) in outlines:
    print (level, title, a)
Any idea how to go ahead from here? Any tools using python, R or bash would be easiest to use for me personally, but as long as it enables batch splitting based on the first outline level of the document, any solution would be great.
Thank you,
Matthias
I've found a straightforward solution for this using sejda-console:
import subprocess
import os

pdfname = "example.pdf"
outdir = "C:\\out\\%s" % pdfname
if not os.path.exists(outdir):
    os.makedirs(outdir)

sejda = 'C:\\sejda\\bin\\sejda-console.bat'
# build the command line as a single string
cmd = sejda
cmd += ' splitbybookmarks'
cmd += ' --bookmarkLevel 1'
cmd += ' -f "%s"' % pdfname
cmd += ' -o "%s"' % outdir
print('\n' + cmd)
subprocess.call(cmd)
print("PDFs have been written to out-directory")
Obviously this requires the sejda programme: http://www.sejda.org/
I am trying to parse the pdf file text using pdfMiner, but the extracted text gets merged. I am using the pdf file from the following link [edit: link was broken / pointed to potential malware]
I am good with any type of output (file/string). Here is the code which returns the extracted text as string for me but for some reason, columns are merged.
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from StringIO import StringIO

def convert_pdf(filename):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec)
    fp = file(filename, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()
    return str
I have also tried PyPDF2, but faced the same issue. Here is the sample code for PyPDF2:
from PyPDF2 import PdfReader

def get_data_using_pypdf2(filename):
    reader = PdfReader(filename)
    content = ""
    for page in reader.pages:
        extracted_text = page.extract_text()
        content += extracted_text + "\n"
    # collapse runs of whitespace and replace non-breaking spaces
    content = " ".join(content.replace("\xa0", " ").strip().split())
    return content.encode("ascii", "ignore")
I have also tried pdf2txt.py, but was unable to get the formatted output.
I recently struggled with a similar problem, although my pdf had slightly simpler structure.
PDFMiner uses classes called "devices" to parse the pages in a PDF file. The basic device class is the PDFPageAggregator class, which simply parses the text boxes in the file. The converter classes, e.g. TextConverter, XMLConverter, and HTMLConverter, also output the result to a file (or to a string stream, as in your example) and do some more elaborate parsing of the contents.
The problem with TextConverter (and PDFPageAggregator) is that they don't recurse deep enough into the structure of the document to properly extract the different columns. The two other converters require some information about the structure of the document for display purposes, so they gather more detailed data. In your example PDF, both of the simplistic devices only parse (roughly) the entire text box containing the columns, which makes it impossible (or at least very difficult) to correctly separate the different rows. The solution that I found works pretty well is to either:
Create a new class that inherits from PDFPageAggregator, or
Use XMLConverter and parse the resulting XML document using e.g. Beautifulsoup
In both cases you would have to combine the different text segments to rows using their bounding box y-coordinates.
In the case of a new device class ('tis more eloquent, I think) you would have to override the method receive_layout, which gets called for each page during the rendering process. This method then recursively parses the elements in each page. For example, something like this might get you started:
from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTPage, LTChar, LTAnno, LAParams, LTTextBox, LTTextLine
class PDFPageDetailedAggregator(PDFPageAggregator):
    def __init__(self, rsrcmgr, pageno=1, laparams=None):
        PDFPageAggregator.__init__(self, rsrcmgr, pageno=pageno, laparams=laparams)
        self.rows = []
        self.page_number = 0

    def receive_layout(self, ltpage):
        def render(item, page_number):
            if isinstance(item, LTPage) or isinstance(item, LTTextBox):
                for child in item:
                    render(child, page_number)
            elif isinstance(item, LTTextLine):
                child_str = ''
                for child in item:
                    if isinstance(child, (LTChar, LTAnno)):
                        child_str += child.get_text()
                child_str = ' '.join(child_str.split()).strip()
                if child_str:
                    row = (page_number, item.bbox[0], item.bbox[1], item.bbox[2], item.bbox[3], child_str) # bbox == (x1, y1, x2, y2)
                    self.rows.append(row)
                for child in item:
                    render(child, page_number)
            return
        render(ltpage, self.page_number)
        self.page_number += 1
        self.rows = sorted(self.rows, key = lambda x: (x[0], -x[2]))
        self.result = ltpage
In the code above, each found LTTextLine element is stored in an ordered list of tuples containing the page number, coordinates of the bounding box, and the text contained in that particular element. You would then do something similar to this:
from pprint import pprint
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
fp = open('pdf_doc.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser, password='') # pass the password here; leave empty for no password
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageDetailedAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.create_pages(doc):
    interpreter.process_page(page)
    # receive the LTPage object for this page
    device.get_result()
pprint(device.rows)
The variable device.rows contains the ordered list with all the text lines arranged using their page number and y-coordinates. You can loop over the text lines and group lines with the same y-coordinates to form the rows, store the column data etc.
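The grouping step might look roughly like this. It assumes rows shaped like device.rows, i.e. (page, x1, y1, x2, y2, text) tuples; group_rows and its tolerance parameter are illustrative names, and the sample data is invented.

```python
from collections import defaultdict


def group_rows(rows, tolerance=2.0):
    """Group text lines that share a page and (roughly) the same y-coordinate.

    rows: iterable of (page, x1, y1, x2, y2, text) tuples.
    Returns a list of (page, [text, ...]) with cells ordered left to right.
    """
    buckets = defaultdict(list)
    for page, x1, y1, x2, y2, text in rows:
        # Snap y1 to a grid so nearly equal values land in the same bucket.
        key = (page, round(y1 / tolerance))
        buckets[key].append((x1, text))

    grouped = []
    # Pages ascending, then top of the page first (larger y is higher up).
    for (page, _), cells in sorted(buckets.items(), key=lambda kv: (kv[0][0], -kv[0][1])):
        cells.sort()  # left to right by x1
        grouped.append((page, [text for _, text in cells]))
    return grouped


# Invented sample: two table rows whose cells were extracted out of order.
sample_rows = [
    (0, 200.0, 700.2, 260.0, 712.0, "Price"),
    (0, 50.0, 700.0, 120.0, 712.0, "Item"),
    (0, 50.0, 680.0, 120.0, 692.0, "Widget"),
    (0, 200.0, 680.1, 260.0, 692.0, "9.99"),
]
table = group_rows(sample_rows)
for page, cells in table:
    print(page, cells)
```

Snapping to a grid is the simplest approach but can split two values that straddle a grid boundary; comparing each line against the previous row's y-coordinate is a possible alternative if that becomes a problem.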
I tried to parse your PDF using the above code and the columns are mostly parsed correctly. However, some of the columns are so close together that the default PDFMiner heuristics fail to separate them into their own elements. You can probably get around this by tweaking the word margin parameter (the -W flag in the command line tool pdf2txt.py). In any case, you might want to read through the (poorly documented) PDFMiner API as well as browse through the source code of PDFMiner, which you can obtain from GitHub. (Alas, I cannot paste the link because I do not have sufficient rep points :'<, but you can hopefully google the correct repo)
I tried your first block of code and got a bunch of results that look like this:
MULTIPLE DWELLING AGARDEN COMPLEX 14945010314370 TO 372WILLOWRD W MULTIPLE DWELLING AGARDEN COMPLEX 14945010314380 TO 384WILLOWRD W MULTIPLE DWELLING AGARDEN COMPLEX 149450103141000 TO 1020WILLOWBROOKRD MULTIPLE DWELLING AROOMING HOUSE 198787
I am guessing you are in a similar position as this answer and that all the whitespace is used to position the words in the proper place, not as actual printable space characters. The fact that you have tried with other PDF libraries makes me think that this might be an issue that is difficult for any PDF library to parse.
The solution provided by @hlindblo gave pretty good results. To further group the extracted text chunks by page and paragraph, here are the simple commands I used.
from collections import OrderedDict
grouped_text = OrderedDict()
for p in range(1000): # max page nb is 1000
    grouped_text[p] = {}
for (page_nb, x_min, y_min, x_max, y_max, text) in device.rows:
    x_min = round(x_min)//10 # manipulate the level of aggregation --> x_min might be slightly different
    try:
        grouped_text[page_nb][x_min] += " " + text
    except KeyError:
        # first chunk seen for this page and x bucket
        grouped_text[page_nb][x_min] = text