I am trying to use pdfminer.six in a production context to extract the text from a pdf. At the moment, for my benchmark 44 page document, it is taking approximately 18 seconds. I would like to reduce this as much as possible.
So far I have managed to reduce the time by 3 seconds, by turning caching = False. Does anyone have suggestions for how I can optimise this further? As far as I can tell using a module like multiprocessing to process the pages in parallel would not work because the underlying methods/functions are not abled to be pickled.
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
path = "PATH/TO/MYPDF.pdf"
rsrcmgr = PDFResourceManager()
retstr = io.StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams, showpageno= True)
fp = open(path, 'rb')
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = None
caching = False
pagenos=set()
for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
interpreter.process_page(page)
text = retstr.getvalue()
fp.close()
device.close()
retstr.close()
I am using pdfminer on python 3.8. I have an application that manipulates contents of a pdf document and though it is quite a chore to assemble words/tokens and determine where they occur in a tabular document, I had this all running fine in python 2.7, but moving to py3 and the latest version of pdfminer, it ran so slow that it was un acceptable. So after a lot of digging and profiling my code I found that because all of the print statements from the older version have been converted to log statements and the loggers created by pdfminer modules are all default to level.DEBUG, and since I have assigned handlers against the root logger to write log messages to a file the overall speed was highly impacted.
Buy adding the following code after import of pdfminer modules and before instantiating any of the classes or calling them it now runs acceptably fast.
# set all pdfminer logging to WARN
pdflogs = [logging.getLogger(name) for name in logging.root.manager.loggerDict if name.startswith('pdfminer')]
for ll in pdflogs:
ll.setLevel(logging.WARNING)
Related
Please do not use "tika" for an answer.
I have already tried answers from this question:
How to extract text from a PDF file?
I have this PDF file, https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing , and I would like to copy the text.
import PyPDF2
pdfFileObject = open('C:\\Path\\To\\Local\\File\\Test_PDF.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
page = pdfReader.getPage(i)
print(page.extractText())
The output is "Date Submitted: 2019-10-21 16:03:36.093 | Form Key: 5544" which is only part of the text. The next line of text starts with "Exhibit A to RFA...."
I have never used PYPDF2 myself so can't really input my knowledge to find out exactly what's going wrong. But the following from the documentation states the following about the function extractText()
Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.
Here's an alternative way to get around this and also exaplains what maybe going wrong. I would also recommend using pdftotext. This has worked reliably for me many times; this answer will also prove helpful in that.
Found a solution.
#pip install pdfminer.six
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
def convert_pdf_to_txt(path):
'''Convert pdf content from a file path to text
:path the file path
'''
rsrcmgr = PDFResourceManager()
codec = 'utf-8'
laparams = LAParams()
with io.StringIO() as retstr:
with TextConverter(rsrcmgr, retstr, codec=codec,
laparams=laparams) as device:
with open(path, 'rb') as fp:
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0
caching = True
pagenos = set()
for page in PDFPage.get_pages(fp,
pagenos,
maxpages=maxpages,
password=password,
caching=caching,
check_extractable=True):
interpreter.process_page(page)
return retstr.getvalue()
if __name__ == "__main__":
print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf'))
I have been trying to come up with a solution to parse a PDF into an HTML so, later I'll use beautiful soup to extract all the headings, subitems and paragraph respectively in a tree structure.
I have searched a few options available on the internet but so far no success. Here's a code I've used to parse a PDF to HTML using PDFMiner.six
import sys
from pdfminer.pdfdocument import PDFDocument
from pdfminer.layout import LTContainer, LTComponent, LTRect, LTLine, LAParams, LTTextLine
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice, TagExtractor
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.image import ImageWriter
from io import StringIO, BytesIO
from bs4 import BeautifulSoup
import re
import io
def convert_pdf_to_html(path):
rsrcmgr = PDFResourceManager()
retstr = StringIO()
outfp = BytesIO()
codec = 'utf-8'
laparams = LAParams()
device = HTMLConverter(rsrcmgr, outfp, imagewriter=ImageWriter('out'))
fp = open(path, 'rb')
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0 #is for all
caching = True
pagenos=set()
for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
interpreter.process_page(page)
fp.close()
device.close()
str = retstr.getvalue()
retstr.close()
return str
convert_pdf_to_html('PDF - Remraam Ph 1 Mosque.pdf')
However, the above code returns the following error which I'm unable to fix, would appreciate any help, thank you.
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pdfminer\pdftypes.py in decode(self)
293 data = ccittfaxdecode(data, params)
294 elif f == LITERAL_CRYPT:
--> 295 raise PDFNotImplementedError('Crypt filter is unsupported')
296 else:
297 raise PDFNotImplementedError('Unsupported filter: %r' % f)
TypeError: not all arguments converted during string formatting
The pdfminer.six package does not support pdf's with a Crypt filter. It does support other encryption methods. The difference with the Crypt filter is that this one defines the decription algorithm as a parameters, instead of a fixed filter.
From the Pdf reference manual:
The Crypt filter (PDF 1.5) allows the document-level security handler (see Section 3.5, “Encryption”) to determine which algorithms should be used to decrypt the input data. The Name parameter in the decode parameters dictionary for this
filter (see Table 3.12) specifies which of the named crypt filters in the document
(see Section 3.5.4, “Crypt Filters”) should be used.
If you need this feature you can create a github issue.
A quick update, I have fixed this issue, by just uninstalling and installing the Anaconda and then I've installed the pdfminer.six by coda. I guess the pip install doesn't work properly for me. Any way install the package using coda install .. package name
I am running an acceptance test from Command Line, which internally calls pdfminer python script method for conversion of Pdf into Text. I have provided the PDF2TextLibrary which has the code to convert Pdf into text using pdfminer library.
But while I run the test i get the error :
ClassFormatError: Invalid method Code length 85551 in class file pdfminer/glyphlist$py
I don't think you need to have a class if you are using only one function. You can save code and make it easier to read:
pdf2text.py
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def convert_pdf_to_txt(path):
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
fp = file(path, 'rb')
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0
caching = True
pagenos=set()
for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
interpreter.process_page(page)
fp.close()
device.close()
textstr = retstr.getvalue()
retstr.close()
return textstr
And this is how I use it:
*** Settings ***
Library pdf2text
*** Test Cases ***
pdfconvert
${pdftext}= Convert Pdf To txt <path_to_pdf>
>
The issue was resolved by dividing the file into smaller chunks. And the reason was that Java implementation has limit of 64KB for a class file. So in my case the class was evaluating to a size of 446KB.
I've been using pdfMiner to read values off of graphs and so far its been working great!
However there is one area in which the correct data is read correctly but in an unpredictable manner, meaning it will read all the graphs values correctly, in a completely different order than they appear.
This is not entirely a problem because as long as i know, say the last graph will always be read first, i can structure my program around that. Except it seems that pdfMiner is almost totally unpredicatable in the way it is reading this data, I can find no discernable pattern.
This is most probably because I am quite unfamiliar with pdfMiner so i am not entirely sure how it works. So yeah it would be really helpful if somone could just point me in the right direction.
Here is my data
And here is the conversion code i'm using:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
global values
print "Getting readable PDF"
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
fp = file("graphExtraction.pdf", 'rb')
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0
caching = True
pagenos=set()
for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
interpreter.process_page(page)
fp.close()
device.close()
str = retstr.getvalue()
retstr.close()
values = str
Use the bounding box information to follow the flow of your documents and figure out what comes first.
Since I want to move from python 2 to 3, I tried to work with pdfmine.3kr in python 3.4. It seems like they have edited everything. Their change logs do not reflect the changes they have done but I had no success in parsing pdf with pdfminer3k. For example:
They have moved PDFDocument into pdfparser (sorry, if I spell incorrectly). PDFPage used to have create_pages method which is gone now. All I can see inside PDFPage are internal methods. Does anybody has a working example of pdfminer3k? It seems like there is no new documentation to reflect any of the changes.
If you are interested in reading text from a pdf file the following code works with pdfminer3k using python 3.4.
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
fp = open('file.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize('')
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in doc.get_pages():
interpreter.process_page(page)
layout = device.get_result()
for lt_obj in layout:
if isinstance(lt_obj, LTTextBox) or isinstance(lt_obj, LTTextLine):
print(lt_obj.get_text())
fp.close()
Perhaps,you could use pdfminer.six.
It's description:
fork of PDFMiner using six for Python 2+3 compatibility
After installing it using pip:
pip install pdfminer.six
The usage of it is just like pdfminer, at least in my code.
Hope this could save your day :)