I'm trying to make a txt file from docx using this code:
from subprocess import Popen, PIPE
from docx import opendocx, getdocumenttext
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def convert_pdf_to_txt(path):
    ...
def document_to_text(filename, file_path):
    ...
    elif filename[-5:] == ".docx":
        document = opendocx(file_path)
        paratextlist = getdocumenttext(document)
        newparatextlist = []
        for paratext in paratextlist:
            newparatextlist.append(paratext.encode("utf-8"))
        return '\n\n'.join(newparatextlist)
    elif filename[-4:] == ".odt":
        ...
    elif filename[-4:] == ".pdf":
        ...
# Raw string so the backslashes in the Windows path are not treated as escapes
document_to_text('1.docx', r'D:\Nucho\Python\AntiPlagiat\1.docx')
However, I see only: ImportError: cannot import name opendocx
The opendocx() function is no longer part of the latest version of python-docx. Starting with v0.3.0, python-docx was completely rewritten and the API is not backward compatible. The new call would be something like:
document = Document(docx_file_path)
Documentation on the new version is available here:
http://python-docx.readthedocs.org/
If you want the prior API, you should install docx rather than python-docx, e.g.:
pip install docx
The package name changed between the two versions, so folks can still access the legacy version if that's what they want. You should uninstall python-docx before installing docx, and vice versa, to avoid confusion over which is being imported.
Let me know if you need more.
ref:https://groups.google.com/forum/#!msg/python-docx/otp6hq4kJ5c/tfQB88Mfx2gJ
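As a rough sketch of what the .docx branch could look like with the rewritten API described above (python-docx v0.3.0 or later): the function names docx_to_text and paragraphs_to_text are made up for illustration, and the sketch assumes plain paragraph text is all that is needed (text inside tables is not included).

```python
def paragraphs_to_text(paragraphs):
    """Join the .text of each paragraph-like object, separated by blank lines."""
    return "\n\n".join(paragraph.text for paragraph in paragraphs)


def docx_to_text(docx_file_path):
    """Return the paragraph text of a .docx file (hypothetical helper)."""
    # Imported here so paragraphs_to_text() can be used without python-docx installed.
    from docx import Document  # provided by the python-docx package, v0.3.0+

    document = Document(docx_file_path)
    # document.paragraphs holds the top-level paragraphs of the document body.
    return paragraphs_to_text(document.paragraphs)
```

Calling docx_to_text(file_path) would then take the place of the opendocx / getdocumenttext pair in the question.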
import os
import glob
import comtypes.client
from PyPDF2 import PdfFileMerger
def docxs_to_pdf():
    """Converts all word files in pdfs and append them to pdfslist"""
    word = comtypes.client.CreateObject('Word.Application')
    pdfslist = PdfFileMerger()
    x = 0
    for f in glob.glob("*.docx"):
        input_file = os.path.abspath(f)
        output_file = os.path.abspath("demo" + str(x) + ".pdf")
        # loads each word document
        doc = word.Documents.Open(input_file)
        doc.SaveAs(output_file, FileFormat=16+1)
        doc.Close() # Closes the document, not the application
        pdfslist.append(open(output_file, 'rb'))
        x += 1
    word.Quit()
    return pdfslist
def joinpdf(pdfs):
    """Unite all pdfs"""
    with open("result.pdf", "wb") as result_pdf:
        pdfs.write(result_pdf)
def main():
    """docxs to pdfs: Open Word, create pdfs, close word, unite pdfs"""
    pdfs = docxs_to_pdf()
    joinpdf(pdfs)
main()
I am using Jupyter Notebook and the code throws an error. What should I do?
I want to convert many .doc files into one PDF. Please help, I am a beginner in this field.
Make sure you have all the dependencies installed in your environment. comtypes.client is part of the comtypes package, which you can install with pip by running this in your terminal:
pip install comtypes
You can download _ctypes from sourceforge:
https://sourceforge.net/projects/ctypes/files/ctypes/1.0.2/ctypes-1.0.2.tar.gz/download?use_mirror=deac-fra
Using docx2pdf does seem easier for your task though. After you have converted the files, you can use PyPDF2 to append them.
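A minimal sketch of that docx2pdf plus PyPDF2 route, assuming the inputs are .docx files in one folder and that Microsoft Word is installed (docx2pdf drives Word to do the conversion). The function names plan_conversions and docx_folder_to_single_pdf are invented for this example.

```python
from pathlib import Path


def plan_conversions(folder, pattern="*.docx"):
    """Pair each Word file in folder with the PDF path it will be converted to."""
    sources = sorted(Path(folder).glob(pattern))
    return [(source, source.with_suffix(".pdf")) for source in sources]


def docx_folder_to_single_pdf(folder, result_name="result.pdf"):
    """Convert every .docx in folder to PDF, then merge the PDFs into one file."""
    # Third-party imports stay local so plan_conversions() works without them.
    from docx2pdf import convert
    try:
        from PyPDF2 import PdfMerger  # newer PyPDF2 releases
    except ImportError:
        from PyPDF2 import PdfFileMerger as PdfMerger  # older name, as in the question

    merger = PdfMerger()
    for source, pdf in plan_conversions(folder):
        convert(str(source), str(pdf))  # writes the PDF next to the .docx
        merger.append(str(pdf))
    with open(Path(folder) / result_name, "wb") as result_pdf:
        merger.write(result_pdf)
    merger.close()
```

Sorting the file names keeps the page order of the merged PDF predictable, which glob alone does not guarantee.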
I have been trying to come up with a way to convert a PDF into HTML so that I can later use Beautiful Soup to extract all the headings, subitems, and paragraphs in a tree structure.
I have searched through the options available on the internet, but so far without success. Here's the code I've used to convert a PDF to HTML using pdfminer.six:
import sys
from pdfminer.pdfdocument import PDFDocument
from pdfminer.layout import LTContainer, LTComponent, LTRect, LTLine, LAParams, LTTextLine
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice, TagExtractor
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.image import ImageWriter
from io import StringIO, BytesIO
from bs4 import BeautifulSoup
import re
import io
def convert_pdf_to_html(path):
    rsrcmgr = PDFResourceManager()
    outfp = BytesIO()  # HTMLConverter writes the encoded HTML here
    codec = 'utf-8'
    laparams = LAParams()
    device = HTMLConverter(rsrcmgr, outfp, codec=codec, laparams=laparams, imagewriter=ImageWriter('out'))
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means all pages
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    # Read the result from the buffer the converter actually wrote to
    html = outfp.getvalue().decode(codec)
    outfp.close()
    return html
convert_pdf_to_html('PDF - Remraam Ph 1 Mosque.pdf')
However, the above code raises the following error, which I'm unable to fix. I would appreciate any help, thank you.
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pdfminer\pdftypes.py in decode(self)
293 data = ccittfaxdecode(data, params)
294 elif f == LITERAL_CRYPT:
--> 295 raise PDFNotImplementedError('Crypt filter is unsupported')
296 else:
297 raise PDFNotImplementedError('Unsupported filter: %r' % f)
TypeError: not all arguments converted during string formatting
The pdfminer.six package does not support PDFs with a Crypt filter. It does support other encryption methods. The difference with the Crypt filter is that it defines the decryption algorithm as a parameter, instead of using a fixed filter.
From the PDF reference manual:
The Crypt filter (PDF 1.5) allows the document-level security handler (see Section 3.5, “Encryption”) to determine which algorithms should be used to decrypt the input data. The Name parameter in the decode parameters dictionary for this
filter (see Table 3.12) specifies which of the named crypt filters in the document
(see Section 3.5.4, “Crypt Filters”) should be used.
If you need this feature, you can create a GitHub issue.
A quick update: I fixed this issue by uninstalling and reinstalling Anaconda and then installing pdfminer.six with conda. I guess pip install doesn't work properly for me. In any case, install the package using conda install <package name>.
I am trying to convert the text in a PDF file to plain text or HTML, but this error keeps occurring:
'cannot import name 'process_pdf' from 'pdfminer.pdfinterp' '
How can I fix this?
I have tried this code in the visual basic studio, but there I got an indentation error due to spaces, so I tried it in a Jupyter notebook and got this error.
from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager , process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
def to_txt(pdf_path):
    input_ = open(pdf_path, 'rb')
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    process_pdf(manager, converter, input_)
    return output.getvalue()
b = to_txt(rb"C:\Users\Jasvinder Singh\Desktop\HACK-IN REPORT.docx")
ImportError: cannot import name 'process_pdf' from 'pdfminer.pdfinterp' (C:\Users\Jasvinder Singh\Anaconda3\lib\site-packages\pdfminer\pdfinterp.py)
Please see the documentation and this comment on a bug.
The process_pdf method has been replaced by PDFPage.get_pages().
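A minimal sketch of the to_txt helper rewritten around PDFPage.get_pages(), assuming pdfminer.six on Python 3. The looks_like_pdf check is an extra added for this example, not part of pdfminer: it only verifies that the file begins with the %PDF- header, which catches cases like the question's call, where a .docx path is handed to a PDF parser.

```python
from io import StringIO


def looks_like_pdf(path):
    """Return True if the file starts with the '%PDF-' header (hypothetical helper)."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


def to_txt(pdf_path, password=""):
    """Extract the text of a PDF using PDFPage.get_pages() instead of process_pdf."""
    if not looks_like_pdf(pdf_path):
        raise ValueError(f"not a PDF file: {pdf_path!r}")

    # pdfminer.six imports are kept local so the header check works without it.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    with open(pdf_path, "rb") as input_:
        # get_pages() yields one PDFPage per page; the interpreter feeds each
        # page to the converter, which writes the extracted text to output.
        for page in PDFPage.get_pages(input_, password=password):
            interpreter.process_page(page)
    converter.close()
    return output.getvalue()
```

The loop over get_pages() plus interpreter.process_page() does the work that the single process_pdf() call used to do.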
So I've just played around with PDFMiner and can now extract text from a PDF and throw it into an html or textfile.
pdf2txt.py -o outputfile.txt -t txt inputfile.pdf
I have then written a simple script to extract certain strings:
with open('output.txt', 'r') as searchfile:
    for line in searchfile:
        if 'HELLO' in line:
            print(line)
And now I can take all these strings containing the word HELLO and add them to my database if that is what I want.
My question is:
Is this the only way, or can PDFMiner grab things conditionally before even writing them out to the txt, html, or even straight into the database?
Well, yes, you can: PDFMiner has an API.
The basic example says:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
# Open a PDF file.
fp = open('mypdf.pdf', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization.
document = PDFDocument(parser, password)
# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
device = PDFDevice(rsrcmgr)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    # do stuff with the page here
and in the loop you should go with
# receive the LTPage object for the page.
layout = device.get_result()
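and then use the LTTextBox objects it contains. You have to analyse that. Note that get_result() is provided by PDFPageAggregator (from pdfminer.converter, created with an LAParams object), so use that as the device instead of the plain PDFDevice shown above. There's no full example in the docs, but you can check out the pdf2txt.py source, which will help you find the missing pieces (although it does much more, since it parses options and applies them).
Once you have the code that extracts the text, you can do whatever you want before saving a file, including searching for certain parts of the text.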
PS looks like this was, in a way, asked before: How do I use pdfminer as a library which should be helpful, too.
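To make the idea concrete, here is a hedged sketch that filters text while it is being extracted, rather than after writing a text file. It assumes pdfminer.six and uses PDFPageAggregator as the device; the names matching_lines, iter_text_boxes, and find_in_pdf are invented for this example.

```python
def matching_lines(text_blocks, keyword):
    """Yield each stripped line that contains keyword, from an iterable of text blocks."""
    for block in text_blocks:
        for line in block.splitlines():
            if keyword in line:
                yield line.strip()


def iter_text_boxes(pdf_path, password=""):
    """Yield the text of every LTTextBox on every page of the PDF."""
    # pdfminer.six imports are local so matching_lines() works without it.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(pdf_path, "rb") as fp:
        for page in PDFPage.get_pages(fp, password=password):
            interpreter.process_page(page)
            layout = device.get_result()  # LTPage object for this page
            for lt_obj in layout:
                if isinstance(lt_obj, LTTextBox):
                    yield lt_obj.get_text()


def find_in_pdf(pdf_path, keyword):
    """Return all lines of the PDF that contain keyword."""
    return list(matching_lines(iter_text_boxes(pdf_path), keyword))
```

With this shape, something like find_in_pdf('mypdf.pdf', 'HELLO') returns the matching lines directly, and they can be written to a file or a database without an intermediate text file.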
Since I want to move from Python 2 to 3, I tried to work with pdfminer3k in Python 3.4. It seems like they have changed everything. Their change logs do not reflect the changes they have made, and I have had no success in parsing a PDF with pdfminer3k. For example:
They have moved PDFDocument into pdfparser (sorry if I spell it incorrectly). PDFPage used to have a create_pages method, which is gone now. All I can see inside PDFPage are internal methods. Does anybody have a working example of pdfminer3k? There seems to be no new documentation reflecting any of the changes.
If you are interested in reading text from a pdf file the following code works with pdfminer3k using python 3.4.
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
fp = open('file.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize('')
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in doc.get_pages():
    interpreter.process_page(page)
    layout = device.get_result()
    for lt_obj in layout:
        if isinstance(lt_obj, LTTextBox) or isinstance(lt_obj, LTTextLine):
            print(lt_obj.get_text())
fp.close()
Perhaps you could use pdfminer.six.
Its description:
fork of PDFMiner using six for Python 2+3 compatibility
After installing it using pip:
pip install pdfminer.six
The usage of it is just like pdfminer, at least in my code.
Hope this could save your day :)