pdfminer3k has no method named create_pages in PDFPage - python

Since I want to move from python 2 to 3, I tried to work with pdfmine.3kr in python 3.4. It seems like they have edited everything. Their change logs do not reflect the changes they have done but I had no success in parsing pdf with pdfminer3k. For example:
They have moved PDFDocument into pdfparser (sorry, if I spell incorrectly). PDFPage used to have create_pages method which is gone now. All I can see inside PDFPage are internal methods. Does anybody has a working example of pdfminer3k? It seems like there is no new documentation to reflect any of the changes.

If you are interested in reading text from a pdf file the following code works with pdfminer3k using python 3.4.
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
fp = open('file.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize('')
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in doc.get_pages():
interpreter.process_page(page)
layout = device.get_result()
for lt_obj in layout:
if isinstance(lt_obj, LTTextBox) or isinstance(lt_obj, LTTextLine):
print(lt_obj.get_text())
fp.close()

Perhaps,you could use pdfminer.six.
It's description:
fork of PDFMiner using six for Python 2+3 compatibility
After installing it using pip:
pip install pdfminer.six
The usage of it is just like pdfminer, at least in my code.
Hope this could save your day :)

Related

How to get text from local PDF file using Python

Please do not use "tika" for an answer.
I have already tried answers from this question:
How to extract text from a PDF file?
I have this PDF file, https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing , and I would like to copy the text.
import PyPDF2
pdfFileObject = open('C:\\Path\\To\\Local\\File\\Test_PDF.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
page = pdfReader.getPage(i)
print(page.extractText())
The output is "Date Submitted: 2019-10-21 16:03:36.093 | Form Key: 5544" which is only part of the text. The next line of text starts with "Exhibit A to RFA...."
I have never used PYPDF2 myself so can't really input my knowledge to find out exactly what's going wrong. But the following from the documentation states the following about the function extractText()
Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.
Here's an alternative way to get around this and also exaplains what maybe going wrong. I would also recommend using pdftotext. This has worked reliably for me many times; this answer will also prove helpful in that.
Found a solution.
#pip install pdfminer.six
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
def convert_pdf_to_txt(path):
'''Convert pdf content from a file path to text
:path the file path
'''
rsrcmgr = PDFResourceManager()
codec = 'utf-8'
laparams = LAParams()
with io.StringIO() as retstr:
with TextConverter(rsrcmgr, retstr, codec=codec,
laparams=laparams) as device:
with open(path, 'rb') as fp:
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0
caching = True
pagenos = set()
for page in PDFPage.get_pages(fp,
pagenos,
maxpages=maxpages,
password=password,
caching=caching,
check_extractable=True):
interpreter.process_page(page)
return retstr.getvalue()
if __name__ == "__main__":
print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf'))

PDFMiner TypeError: not all arguments converted during string formatting

I have been trying to come up with a solution to parse a PDF into an HTML so, later I'll use beautiful soup to extract all the headings, subitems and paragraph respectively in a tree structure.
I have searched a few options available on the internet but so far no success. Here's a code I've used to parse a PDF to HTML using PDFMiner.six
import sys
from pdfminer.pdfdocument import PDFDocument
from pdfminer.layout import LTContainer, LTComponent, LTRect, LTLine, LAParams, LTTextLine
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice, TagExtractor
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.image import ImageWriter
from io import StringIO, BytesIO
from bs4 import BeautifulSoup
import re
import io
def convert_pdf_to_html(path):
rsrcmgr = PDFResourceManager()
retstr = StringIO()
outfp = BytesIO()
codec = 'utf-8'
laparams = LAParams()
device = HTMLConverter(rsrcmgr, outfp, imagewriter=ImageWriter('out'))
fp = open(path, 'rb')
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0 #is for all
caching = True
pagenos=set()
for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
interpreter.process_page(page)
fp.close()
device.close()
str = retstr.getvalue()
retstr.close()
return str
convert_pdf_to_html('PDF - Remraam Ph 1 Mosque.pdf')
However, the above code returns the following error which I'm unable to fix, would appreciate any help, thank you.
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pdfminer\pdftypes.py in decode(self)
293 data = ccittfaxdecode(data, params)
294 elif f == LITERAL_CRYPT:
--> 295 raise PDFNotImplementedError('Crypt filter is unsupported')
296 else:
297 raise PDFNotImplementedError('Unsupported filter: %r' % f)
TypeError: not all arguments converted during string formatting
The pdfminer.six package does not support pdf's with a Crypt filter. It does support other encryption methods. The difference with the Crypt filter is that this one defines the decription algorithm as a parameters, instead of a fixed filter.
From the Pdf reference manual:
The Crypt filter (PDF 1.5) allows the document-level security handler (see Section 3.5, “Encryption”) to determine which algorithms should be used to decrypt the input data. The Name parameter in the decode parameters dictionary for this
filter (see Table 3.12) specifies which of the named crypt filters in the document
(see Section 3.5.4, “Crypt Filters”) should be used.
If you need this feature you can create a github issue.
A quick update, I have fixed this issue, by just uninstalling and installing the Anaconda and then I've installed the pdfminer.six by coda. I guess the pip install doesn't work properly for me. Any way install the package using coda install .. package name

Optimising pdfminer

I am trying to use pdfminer.six in a production context to extract the text from a pdf. At the moment, for my benchmark 44 page document, it is taking approximately 18 seconds. I would like to reduce this as much as possible.
So far I have managed to reduce the time by 3 seconds, by turning caching = False. Does anyone have suggestions for how I can optimise this further? As far as I can tell using a module like multiprocessing to process the pages in parallel would not work because the underlying methods/functions are not abled to be pickled.
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
path = "PATH/TO/MYPDF.pdf"
rsrcmgr = PDFResourceManager()
retstr = io.StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams, showpageno= True)
fp = open(path, 'rb')
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = None
caching = False
pagenos=set()
for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
interpreter.process_page(page)
text = retstr.getvalue()
fp.close()
device.close()
retstr.close()
I am using pdfminer on python 3.8. I have an application that manipulates contents of a pdf document and though it is quite a chore to assemble words/tokens and determine where they occur in a tabular document, I had this all running fine in python 2.7, but moving to py3 and the latest version of pdfminer, it ran so slow that it was un acceptable. So after a lot of digging and profiling my code I found that because all of the print statements from the older version have been converted to log statements and the loggers created by pdfminer modules are all default to level.DEBUG, and since I have assigned handlers against the root logger to write log messages to a file the overall speed was highly impacted.
Buy adding the following code after import of pdfminer modules and before instantiating any of the classes or calling them it now runs acceptably fast.
# set all pdfminer logging to WARN
pdflogs = [logging.getLogger(name) for name in logging.root.manager.loggerDict if name.startswith('pdfminer')]
for ll in pdflogs:
ll.setLevel(logging.WARNING)

importError: cannot import name opendocx

I'm trying to make a txt file from docx using this code:
from subprocess import Popen, PIPE
from docx import opendocx, getdocumenttext
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def convert_pdf_to_txt(path):
...
def document_to_text(filename, file_path):
...
elif filename[-5:] == ".docx":
document = opendocx(file_path)
paratextlist = getdocumenttext(document)
newparatextlist = []
for paratext in paratextlist:
newparatextlist.append(paratext.encode("utf-8"))
return '\n\n'.join(newparatextlist)
elif filename[-4:] == ".odt":
...
elif filename[-4:] == ".pdf":
...
document_to_text('1.docx','D:\Nucho\Python\AntiPlagiat\1.docx')
However, I see only: ImportError: cannot import name opendocx
Some text '.......' to post question.
pls read,
The 'opendocx()' function is no longer part of the latest version of python-docx. Starting with v0.3.0, python-docx has been completely re-written and the API is not backwardly compatible. The new call would be something like:
document = Document(docx_file_path)
Documentation on the new version is available here:
http://python-docx.readthedocs.org/
If you want the prior API, you should install docx rather than python-docx, e.g.:
pip install docx
The package name changed between the two versions, so folks can still access the legacy version if that's what they want. You should uninstall python-docx before installing docx, and vice versa, to avoid confusion over which is being imported.
Let me know if you need more.
ref:https://groups.google.com/forum/#!msg/python-docx/otp6hq4kJ5c/tfQB88Mfx2gJ

Convert PDF to Text - Keep rows of table - Python

I have tables in pdf documents that I want to convert to text. I found the following code which converts the pdf to text. However, when it converts, it does not keep the data in the correct rows. It places everything in one long line of string. Is there any way to preserve rows in a table when converting to text from PDF using Python?
from pdfminer.pdfparser import PDFDocument, PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, process_pdf
from pdfminer.pdfdevice import PDFDevice, TagExtractor
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.cmapdb import CMapDB
from pdfminer.layout import LAParams
from cStringIO import StringIO
def convert_pdf(path):
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
fp = file(path, 'rb')
process_pdf(rsrcmgr, device, fp)
fp.close()
device.close()
str = retstr.getvalue()
retstr.close()
return str
Pdfminer comes with text extraction tool called pdf2txt.py, which has the ability to analyze layouts. You can try using that, or study it to see how it works.
A-PDF to Text convert better PDF with tables as other tools !

Categories