How to convert a double-column PDF to Word using Python?

In my project I need to convert PDFs that use a double-column layout. Basically, I need to convert them to text files. I used pdfminer, but the ordering of the text is a complete mess when it comes to double columns (e.g. IEEE papers). I also tried converting a double-column Word (docx) file to text with docx, and it works almost fine, at least for the text (not for tables and equations).
That is why I am wondering whether I can first convert the PDF to Word while preserving the order, as some online tools (e.g. Nitro Cloud) do. But I need to do this conversion in Python, using Python packages.
Can anyone please give some insights?
Code using pdfminer (which I tried initially):
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()
    return str

p1="C:\\sample\\samp.pdf"
c1=convert_pdf_to_txt(p1)
(PDF sample: http://www.iracst.org/ijacea/papers/vol2no62013/1vol2no6.pdf)
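One way to think about the ordering problem is to sort the text boxes yourself instead of trusting the extraction order. The following is only a minimal sketch, assuming each text box is available as (x0, y0, x1, y1, text) in PDF coordinates (origin at the bottom-left) and that the page really has two columns split at the middle. The names order_two_columns and boxes_from_pdf are hypothetical; boxes_from_pdf assumes pdfminer.six is installed and uses its extract_pages and LTTextContainer. Elements that span both columns (title, abstract, wide figures) would need separate handling.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text); PDF coordinates put the origin at the bottom-left,
# so a larger y1 means the box sits higher on the page.
Box = Tuple[float, float, float, float, str]


def order_two_columns(boxes: List[Box], page_width: float) -> List[str]:
    """Return box texts in reading order: left column first, then right."""
    middle = page_width / 2.0
    left, right = [], []
    for box in boxes:
        x0, _y0, x1, _y1, _text = box
        center = (x0 + x1) / 2.0
        # Assign the box to a column by the horizontal position of its center.
        (left if center < middle else right).append(box)

    def top_first(box: Box) -> float:
        # Negate the top edge so higher boxes sort first.
        return -box[3]

    ordered = sorted(left, key=top_first) + sorted(right, key=top_first)
    return [box[4] for box in ordered]


def boxes_from_pdf(path: str, page_number: int = 0):
    """Hypothetical helper: collect the text boxes of one page (pdfminer.six)."""
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for index, page in enumerate(extract_pages(path)):
        if index == page_number:
            boxes = [(*element.bbox, element.get_text())
                     for element in page
                     if isinstance(element, LTTextContainer)]
            return boxes, page.width
    raise IndexError("page not found")
```

With a real file, something like `boxes, width = boxes_from_pdf(p1)` followed by `order_two_columns(boxes, width)` would give the texts of the first page column by column.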

Related

Converting a Hindi PDF to Editable Text in Docx with Proper Text Encoding Detection and Conversion

This code is a Python script to convert a PDF file to a .docx file. It performs the following steps:
1. Import the necessary libraries and modules, including codecs, chardet, pdfminer, and python-docx.
2. Detect the text encoding of the PDF file by opening it in binary mode and passing its contents to the chardet library's detect function. The function returns a dictionary of encoding information, and the script stores the value of the "encoding" key in the "encoding" variable.
3. Use pdfminer to convert the PDF file to text. PDFResourceManager is used to store shared resources such as fonts or images used by multiple pages. PDFPageInterpreter is used to process each page of the PDF and extract the text. The extracted text is stored in a StringIO object named "retstr".
4. Decode the extracted text using the codecs.decode function and the detected encoding, and store the result in the "text" variable.
5. Create a new Document object from the python-docx library, add a paragraph containing the converted text, and save the .docx file as "output.docx".
My experimental Python code is below:
import codecs
import chardet
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
from docx import Document

# Detect the text encoding of the PDF file
with open("input.pdf", "rb") as pdf_file:
    result = chardet.detect(pdf_file.read())
encoding = result["encoding"]

# Convert the PDF file to text using pdfminer
rsrcmgr = PDFResourceManager()
retstr = StringIO()
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, laparams=laparams)
with open("input.pdf", "rb") as pdf_file:
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(pdf_file):
        interpreter.process_page(page)
text = retstr.getvalue()

# Convert the text to Unicode using the detected encoding
text = codecs.decode(text, encoding)

# Save the converted text to a .docx file
doc = Document()
doc.add_paragraph(text)
doc.save("output.docx")
But I am getting an error on line 27 of the code:
TypeError: decode() argument 'encoding' must be str, not None
After changing line 27 to text = text.decode(encoding), I now get:
AttributeError: 'str' object has no attribute 'decode'
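Both messages are consistent with ordinary Python 3 behaviour: io.StringIO.getvalue() already returns a str, which has no decode() method, and chardet.detect() can report None under the "encoding" key when it cannot identify an encoding. A minimal sketch of a guard follows, assuming you only want to decode when the value is actually bytes. The name ensure_text and the UTF-8 fallback are my own assumptions, not part of pdfminer or chardet.

```python
from typing import Optional, Union


def ensure_text(data: Union[str, bytes], encoding: Optional[str] = None) -> str:
    """Return data as str, decoding only when it is actually bytes."""
    if isinstance(data, str):
        # Already Unicode (e.g. the result of io.StringIO.getvalue()),
        # so there is nothing to decode.
        return data
    # Assumption: fall back to UTF-8 when no encoding was detected (None).
    return data.decode(encoding or "utf-8", errors="replace")
```

In the script above this would amount to using the value of retstr.getvalue() directly, e.g. `text = ensure_text(retstr.getvalue(), encoding)`, which leaves the string untouched.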

How to get text from local PDF file using Python

Please do not use "tika" for an answer.
I have already tried answers from this question:
How to extract text from a PDF file?
I have this PDF file, https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing , and I would like to copy the text.
import PyPDF2
pdfFileObject = open('C:\\Path\\To\\Local\\File\\Test_PDF.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    print(page.extractText())
The output is "Date Submitted: 2019-10-21 16:03:36.093 | Form Key: 5544" which is only part of the text. The next line of text starts with "Exhibit A to RFA...."
I have never used PyPDF2 myself, so I can't say exactly what is going wrong. But the documentation states the following about the function extractText():
Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.
Here's an alternative way to get around this, which also explains what may be going wrong. I would also recommend using pdftotext. It has worked reliably for me many times; this answer should also prove helpful.
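As a rough illustration of the pdftotext suggestion, here is a sketch that assumes "pdftotext" means the command-line tool shipped with Poppler/Xpdf and that it is available on your PATH (there is also a Python package of the same name, which is not used here). The function names pdftotext_command and pdf_to_text are hypothetical; -layout asks the tool to keep the physical layout, and "-" as the output file sends the text to stdout.

```python
import subprocess
from typing import List


def pdftotext_command(pdf_path: str, keep_layout: bool = True) -> List[str]:
    """Build the argument list for the pdftotext command-line tool."""
    command = ["pdftotext"]
    if keep_layout:
        command.append("-layout")  # try to preserve the physical page layout
    command += [pdf_path, "-"]     # "-" writes the extracted text to stdout
    return command


def pdf_to_text(pdf_path: str, keep_layout: bool = True) -> str:
    """Run pdftotext and return its output; requires the tool to be installed."""
    result = subprocess.run(pdftotext_command(pdf_path, keep_layout),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Usage would be along the lines of `print(pdf_to_text('C:\\Path\\To\\Local\\File\\Test_PDF.pdf'))`.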
Found a solution.
#pip install pdfminer.six
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
def convert_pdf_to_txt(path):
    '''Convert pdf content from a file path to text
    :path the file path
    '''
    rsrcmgr = PDFResourceManager()
    codec = 'utf-8'
    laparams = LAParams()
    with io.StringIO() as retstr:
        with TextConverter(rsrcmgr, retstr, codec=codec,
                           laparams=laparams) as device:
            with open(path, 'rb') as fp:
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                password = ""
                maxpages = 0
                caching = True
                pagenos = set()
                for page in PDFPage.get_pages(fp,
                                              pagenos,
                                              maxpages=maxpages,
                                              password=password,
                                              caching=caching,
                                              check_extractable=True):
                    interpreter.process_page(page)
        return retstr.getvalue()

if __name__ == "__main__":
    print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf'))

Path not printing string values

I recently found this really handy library for PDF conversion. I am trying to convert a PDF to string values in order to parse the data and convert it to a CSV file. I want to automate this in the future, so I cannot use Tabula.
I am calling some modules in order to convert the PDF to a string.
The string conversion part (pdf2string.py) is not working.
Here is the part that converts the PDF to a string.
I get no error and the script finishes successfully, but there is no output.
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import HTMLConverter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
import re
import csv
import sys
def convert_pdf_to_html(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = HTMLConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0 #is for all
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()
    return str
    print str

if __name__ == '__main__':
    if len(sys.argv) == 2:
        path = sys.argv[1]
        convert_pdf_to_html(path)
This is my bash command:
python pdf2string.py example.pdf
The script is pdf2string.py and the path is example.pdf.
I am also new to high-level logic in python.
Edit: you are returning before printing - remove return str, or remove print str and use the advice below.
You're not printing the output of convert_pdf_to_html(), or saving it somewhere.
print convert_pdf_to_html(path)
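To make the point about return concrete, here is a tiny sketch in Python 3 syntax (the question uses Python 2, where print is a statement). The function names and the fake "conversion" are placeholders of mine, not the real pdfminer call: anything written after return inside a function never runs, so the caller has to print or store the returned value.

```python
def convert_unreachable(path):
    result = "<html>" + path + "</html>"  # stand-in for the real conversion
    return result
    print(result)  # never executed: the function has already returned


def convert(path):
    # Only return the value; the caller decides what to do with it.
    return "<html>" + path + "</html>"


# The caller prints (or saves) the result explicitly:
# print(convert("example.pdf"))
```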

How to use pdfMiner in Python to predictably read values

I've been using pdfMiner to read values off of graphs, and so far it's been working great!
However, there is one area in which the data is read correctly but in an unpredictable order: it reads all the graphs' values correctly, but in a completely different order than they appear.
This would not be a problem if I knew that, say, the last graph is always read first, because I could structure my program around that. But pdfMiner seems almost totally unpredictable in the way it reads this data, and I can find no discernible pattern.
This is most probably because I am quite unfamiliar with pdfMiner, so I am not entirely sure how it works. It would be really helpful if someone could point me in the right direction.
Here is my data
And here is the conversion code i'm using:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
global values
print "Getting readable PDF"
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
fp = file("graphExtraction.pdf", 'rb')
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0
caching = True
pagenos=set()
for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)
fp.close()
device.close()
str = retstr.getvalue()
retstr.close()
values = str
Use the bounding box information to follow the flow of your documents and figure out what comes first.
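A rough sketch of what using the bounding boxes could look like, assuming each piece of text has already been collected as (x0, y0, x1, y1, text) in PDF coordinates, where y grows upward from the bottom of the page. The name reading_order and the row_tolerance value are assumptions of mine. Top edges are quantised so that boxes on roughly the same row are ordered left to right; boxes whose tops fall just either side of a quantisation boundary may still end up in different rows.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float, str]  # (x0, y0, x1, y1, text)


def reading_order(boxes: List[Box], row_tolerance: float = 5.0) -> List[str]:
    """Sort boxes top-to-bottom, then left-to-right within a row."""
    def sort_key(box: Box):
        x0, _y0, _x1, y1, _text = box
        # Quantise the top edge so nearly aligned boxes share a row index;
        # negate it because a larger y means higher on the page.
        return (-round(y1 / row_tolerance), x0)

    return [box[4] for box in sorted(boxes, key=sort_key)]
```

The result no longer depends on the order in which the text was drawn in the file, only on where each box sits on the page.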

Convert PDF to Text - Keep rows of table - Python

I have tables in PDF documents that I want to convert to text. I found the following code, which converts the PDF to text. However, it does not keep the data in the correct rows: it places everything in one long string. Is there any way to preserve the rows of a table when converting from PDF to text using Python?
from pdfminer.pdfparser import PDFDocument, PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, process_pdf
from pdfminer.pdfdevice import PDFDevice, TagExtractor
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.cmapdb import CMapDB
from pdfminer.layout import LAParams
from cStringIO import StringIO
def convert_pdf(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()
    return str
Pdfminer comes with a text extraction tool called pdf2txt.py, which has the ability to analyze layouts. You can try using that, or study it to see how it works.
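If you end up doing the row reconstruction yourself, the idea can be sketched as follows. This is not how pdf2txt.py works internally; it is just a simple illustration, assuming each text fragment is known as (x0, y_top, text) in PDF coordinates (y grows upward). The name group_into_rows and the tolerance value are hypothetical. Fragments whose top edges are within the tolerance of the first fragment in the current row are treated as the same table row, and the cells of each row are joined with tabs.

```python
from typing import List, Tuple

Fragment = Tuple[float, float, str]  # (x0, y_top, text)


def group_into_rows(fragments: List[Fragment], tolerance: float = 3.0) -> List[str]:
    """Group fragments into rows by vertical position; one string per row."""
    rows: List[List[Fragment]] = []
    # Visit fragments from the top of the page downwards.
    for fragment in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0][1] - fragment[1]) <= tolerance:
            rows[-1].append(fragment)  # close to the current row's first cell
        else:
            rows.append([fragment])    # vertical gap: start a new row
    # Inside each row, order the cells left to right and join them with tabs.
    return ["\t".join(text for _x0, _y, text in sorted(row)) for row in rows]
```

For a small table this would turn fragments such as ("Item", "Price") and ("Apple", "1.20") into one tab-separated line per row instead of one long string.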
A-PDF to Text converts PDFs with tables better than other tools.
