I have tables in PDF documents that I want to convert to text. I found the following code, which converts a PDF to text, but it does not keep the data in the correct rows: everything ends up in one long string. Is there any way to preserve the rows of a table when converting a PDF to text using Python?
from pdfminer.pdfparser import PDFDocument, PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, process_pdf
from pdfminer.pdfdevice import PDFDevice, TagExtractor
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.cmapdb import CMapDB
from pdfminer.layout import LAParams
from cStringIO import StringIO
def convert_pdf(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()
    return str
PDFMiner comes with a text extraction tool called pdf2txt.py, which can analyze layouts. You can try using it, or study its source to see how it works.
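As a rough illustration of what layout analysis buys you for tables, here is a minimal Python 3 sketch that rebuilds rows from positioned text fragments. It assumes you already have (x0, y0, text) tuples from a layout-aware extractor (for example the coordinates of pdfminer's text line objects). The function name group_into_rows and the tolerance value are hypothetical and not part of pdfminer.

```python
def group_into_rows(fragments, tolerance=2.0):
    """Group (x0, y0, text) fragments into table rows.

    PDF coordinates grow upward, so a larger y0 is higher on the page.
    Fragments whose y0 is within `tolerance` of the first fragment of
    the current row are treated as belonging to that row.
    """
    rows = []
    # Walk the fragments from the top of the page to the bottom.
    for x0, y0, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1]["y"] - y0) <= tolerance:
            rows[-1]["cells"].append((x0, text))
        else:
            rows.append({"y": y0, "cells": [(x0, text)]})
    # Inside each row, order the cells from left to right.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# Made-up sample data standing in for extracted fragments.
fragments = [
    (50.0, 700.0, "Name"), (200.0, 700.5, "Qty"),
    (50.0, 680.0, "Apple"), (200.0, 679.6, "3"),
]
for row in group_into_rows(fragments):
    print("\t".join(row))
```

The tolerance is there because text on the same visual row rarely shares exactly the same baseline; it would need tuning per document.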
A-PDF to Text converts PDFs with tables better than other tools.
Please do not use "tika" for an answer.
I have already tried answers from this question:
How to extract text from a PDF file?
I have this PDF file, https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing , and I would like to copy the text.
import PyPDF2
pdfFileObject = open('C:\\Path\\To\\Local\\File\\Test_PDF.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    print(page.extractText())
The output is "Date Submitted: 2019-10-21 16:03:36.093 | Form Key: 5544" which is only part of the text. The next line of text starts with "Exhibit A to RFA...."
I have never used PyPDF2 myself, so I can't say exactly what is going wrong. But the documentation says the following about the function extractText():
Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.
Here's an alternative way to get around this, which also explains what may be going wrong. I would also recommend using pdftotext, which has worked reliably for me many times; this answer should also prove helpful.
Found a solution.
#pip install pdfminer.six
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
def convert_pdf_to_txt(path):
    '''Convert pdf content from a file path to text
    :path the file path
    '''
    rsrcmgr = PDFResourceManager()
    codec = 'utf-8'
    laparams = LAParams()
    with io.StringIO() as retstr:
        with TextConverter(rsrcmgr, retstr, codec=codec,
                           laparams=laparams) as device:
            with open(path, 'rb') as fp:
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                password = ""
                maxpages = 0
                caching = True
                pagenos = set()
                for page in PDFPage.get_pages(fp,
                                              pagenos,
                                              maxpages=maxpages,
                                              password=password,
                                              caching=caching,
                                              check_extractable=True):
                    interpreter.process_page(page)
        return retstr.getvalue()
if __name__ == "__main__":
    print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf'))
I recently found this really handy library for PDF conversion. I am trying to convert a PDF to a string so that I can parse the data and write it to a CSV file. I want to automate this in the future, so I cannot use Tabula.
I am calling some modules in order to convert the PDF to a string.
The string conversion part (pdf2string.py) is not working.
Here is the part that converts the PDF to a string.
I get no error and the script finishes successfully, but there is no output.
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import HTMLConverter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
import re
import csv
import sys
def convert_pdf_to_html(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = HTMLConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means all pages
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()
    return str
    print str
if __name__ == '__main__':
    if len(sys.argv) == 2:
        path = sys.argv[1]
        convert_pdf_to_html(path)
This is how I run it from bash:
python pdf2string.py example.pdf
The script is pdf2string.py and the path is example.pdf.
I am also new to high-level logic in Python.
Edit: you are returning before printing - remove return str, or remove print str and use the advice below.
You're not printing the output of convert_pdf_to_html(), or saving it somewhere.
print convert_pdf_to_html(path)
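To illustrate the point about returning before printing, here is a minimal Python 3 sketch. The functions convert_wrong and convert_right are hypothetical stand-ins for convert_pdf_to_html and do not touch pdfminer at all; they only show the control flow.

```python
def convert_wrong(text):
    result = text.upper()
    return result
    print(result)  # never runs: the function has already returned


def convert_right(text):
    # Only return the value; let the caller decide what to do with it.
    return text.upper()


convert_wrong("pdf text")         # the return value is discarded, nothing is shown
print(convert_right("pdf text"))  # the caller prints the returned string
```

Keeping the function free of print statements also makes it easier to reuse, for example to write the result to a file instead of the terminal.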
In my project I need to convert PDFs in double-column format to text files. I used pdfminer, but the ordering is a complete mess for double-column documents (e.g. IEEE papers). I also tried converting a double-column Word (docx) file to text with docx, and it works almost fine, at least for text (not for tables and equations).
That is why I am wondering whether I can first convert the PDF to Word while maintaining the complete order, as some online tools (e.g. Nitro Cloud) do. But I need to do this conversion in Python, using Python packages.
Can anyone please give some insights?
Code using pdfminer (which I tried initially).
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()
    return str
p1="C:\\sample\\samp.pdf"
c1=convert_pdf_to_txt(p1)
(pdf sample:http://www.iracst.org/ijacea/papers/vol2no62013/1vol2no6.pdf.)
I am running an acceptance test from the command line, which internally calls a Python method that uses pdfminer to convert a PDF into text. I have provided PDF2TextLibrary, which contains the code to convert a PDF into text using the pdfminer library.
But when I run the test I get this error:
ClassFormatError: Invalid method Code length 85551 in class file pdfminer/glyphlist$py
I don't think you need to have a class if you are using only one function. You can save code and make it easier to read:
pdf2text.py
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    textstr = retstr.getvalue()
    retstr.close()
    return textstr
And this is how I use it:
*** Settings ***
Library    pdf2text

*** Test Cases ***
pdfconvert
    ${pdftext}=    Convert Pdf To txt    <path_to_pdf>
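The issue was resolved by dividing the file into smaller chunks. The reason was that the JVM limits the bytecode of a single method to 64 KB, which is what the "Invalid method Code length 85551" message refers to. Jython compiles each Python module into a Java class, and in my case the class was evaluating to a size of 446 KB.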
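One way such a split could look, as a sketch only: instead of one huge dictionary literal, generate a module in which each slice of the data is built by its own small function, on the assumption that each function body is compiled into a separate, much smaller method. The helper render_chunked_module and the variable name glyphs are hypothetical, and the right chunk size would have to be found by experiment.

```python
def render_chunked_module(name, mapping, chunk_size):
    """Return Python source that builds `name` from several small functions.

    Each function holds at most `chunk_size` entries, so no single
    function body contains the whole literal.
    """
    items = sorted(mapping.items())
    lines = []
    part_names = []
    for index, start in enumerate(range(0, len(items), chunk_size)):
        func = "_part%d" % index
        part_names.append(func)
        lines.append("def %s():" % func)
        lines.append("    return {")
        for key, value in items[start:start + chunk_size]:
            lines.append("        %r: %r," % (key, value))
        lines.append("    }")
        lines.append("")
    # The module-level code only merges the parts.
    lines.append("%s = {}" % name)
    for func in part_names:
        lines.append("%s.update(%s())" % (name, func))
    return "\n".join(lines) + "\n"


# Tiny made-up mapping standing in for a large lookup table.
sample = {"A": "\u0041", "B": "\u0042", "C": "\u0043"}
source = render_chunked_module("glyphs", sample, chunk_size=2)
print(source)
```

The generated source can be written to a .py file that replaces the original oversized module, as long as it exposes the same variable name.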
I've been using PDFMiner to read values off of graphs, and so far it's been working great!
However, there is one area in which the data is read correctly but in an unpredictable order: it reads all the graph values correctly, but in a completely different order than they appear.
This would not be a problem if I knew that, say, the last graph would always be read first, because I could structure my program around that. But PDFMiner seems almost totally unpredictable in the way it reads this data, and I can find no discernible pattern.
This is most probably because I am quite unfamiliar with PDFMiner, so I am not entirely sure how it works. It would be really helpful if someone could point me in the right direction.
Here is my data
And here is the conversion code I'm using:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
global values
print "Getting readable PDF"
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
fp = file("graphExtraction.pdf", 'rb')
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0
caching = True
pagenos=set()
for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)
fp.close()
device.close()
str = retstr.getvalue()
retstr.close()
values = str
Use the bounding box information to follow the flow of your documents and figure out what comes first.
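A minimal Python 3 sketch of that idea, assuming pdfminer.six is installed (its extract_pages function and the bbox attribute of layout objects). The helper names reading_order_key, sort_by_position and collect_text_boxes are hypothetical, and "top to bottom, then left to right" is only one possible notion of document flow; graphs laid out in a grid may need a different key.

```python
def reading_order_key(bbox):
    """Sort key for a bounding box (x0, y0, x1, y1).

    PDF coordinates grow upward, so the top edge y1 is negated to put
    boxes near the top of the page first; ties go left to right.
    """
    x0, y0, x1, y1 = bbox
    return (-y1, x0)


def sort_by_position(items):
    """Return the texts of (bbox, text) pairs in top-to-bottom order."""
    ordered = sorted(items, key=lambda item: reading_order_key(item[0]))
    return [text for _, text in ordered]


def collect_text_boxes(path):
    """Collect (bbox, text) pairs for each page of the PDF at `path`."""
    # Imported lazily so the sorting helpers work without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    pages = []
    for page_layout in extract_pages(path):
        boxes = [(element.bbox, element.get_text().strip())
                 for element in page_layout
                 if isinstance(element, LTTextContainer)]
        pages.append(boxes)
    return pages


# Made-up boxes standing in for what collect_text_boxes() would return.
boxes = [
    ((300, 100, 400, 120), "bottom right"),
    ((50, 700, 150, 720), "top left"),
    ((300, 700, 400, 720), "top right"),
]
print(sort_by_position(boxes))
```

With a real file, the usage would be along the lines of `for page in collect_text_boxes("graphExtraction.pdf"): print(sort_by_position(page))`.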