How to open PDF file with Docx in Python? - python

I want to open a PDF file on my Mac, but I get this error:
'This file can't be opened. It's possible damaged or has a document structure which Preview doesn't recognize.'
This is the code I'm using:
from docx import Document
# open the document
doc = Document('./testDoc.docx')
a = input('Whats your name ')
b = input('Whats your date of birth ')
Dictionary = {"name": a, "dob": b}
for i in Dictionary:
    for p in doc.paragraphs:
        if p.text.find(i) >= 0:
            p.text = p.text.replace(i, Dictionary[i])
# save changed document
doc.save('/my/path/contract{}.pdf'.format(a))
Does anyone know what is going wrong?

Unfortunately, the docx module does not work with PDFs, and there is nothing in its documentation about them. It only reads and writes .docx files, so doc.save() with a .pdf file name still writes a Word document, which Preview cannot open. You can use the docx2pdf module instead: https://pypi.org/project/docx2pdf/
Here's the simple how-to from its documentation:
from docx2pdf import convert
convert("input.docx", "output.pdf")
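One way to confirm this diagnosis is to look at the first bytes of the file instead of its name. The sketch below is only an illustration (the helper name `sniff_format` is hypothetical and uses the standard library only): a PDF normally starts with `%PDF-`, while a .docx file is a ZIP archive.

```python
import zipfile


def sniff_format(path):
    """Guess a file's real format from its content, ignoring the extension."""
    with open(path, "rb") as fh:
        head = fh.read(5)
    if head == b"%PDF-":          # signature at the start of a PDF file
        return "pdf"
    if zipfile.is_zipfile(path):  # .docx files are ZIP archives
        return "zip"
    return "unknown"
```

If `sniff_format` returns `"zip"` for the contract file saved by the script in the question, the file is a Word document with a misleading extension rather than a damaged PDF.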

The docx module cannot convert a Word document to PDF.
On Windows with Microsoft Word installed, you can use the pywin32 module instead.
import win32com.client
def wordToPdf(input_path, output_path):
    word = win32com.client.Dispatch("Word.Application")
    doc = word.Documents.Open(str(input_path))
    doc.SaveAs(str(output_path), FileFormat=17)  # 17 = wdFormatPDF
    doc.Close()
    word.Quit()

Related

Is it possible to work in activedocument from docx module

I am using python-docx and the win32 module to interact with MS Word. The idea is to access the selection using win32 and loop through each paragraph. If I only use the win32 module, there are no problems. However, if I want to use the docx functionality, I have to open the document from a file, which prevents me from live-editing the active document. Is it possible to pass the active document to the docx module and work within the selected paragraphs?
import win32com.client
import docx
import re
word = win32com.client.Dispatch("Word.Application")
word.Visible = True
doc = word.ActiveDocument
selection = word.Selection
start = selection.Start
end = selection.End
if start != end:
    range = doc.Range(start, end)
    paragraphs = range.Paragraphs
def search_pattern(text):
    # raw string so the regex escapes reach the re module unchanged
    pattern = re.compile(r'^(?:\s{1,5})?(?:\()?([ixv]{1,3}|[a-z]{1}|[0-9]{1,2}|=|\t)?(\)|\.)?(?: {1,3})?(\t)?(?: {1,3})?([a-z]|[A-Z]|[0-9])?')
    mch = pattern.search(text)
    return mch
for paragraph in paragraphs:
    text = paragraph.Range.Text
    result = search_pattern(text)
    if result:
        start = paragraph.Range.Start + result.start()
        end = paragraph.Range.Start + result.end()
        range = doc.Range(start, end)
        range.Select()
Now, if I want to access the methods of docx, I have to provide the path of the document, open it, and carry out the tasks there. Is there any way to work with the doc object within the selection and still access the docx functionality? Any suggestions are welcome. Thanks in advance.

Save doc file as pdf file using python

I want to save a Word document as a PDF using Python. I tried many solutions but couldn't find one that works.
This is my code. I tried to write the output as a PDF file, but the resulting file won't open. Any help is highly appreciated:
def replace_string(filenameInput, filenameOutput):
    doc = Document(filenameInput)
    for p in doc.paragraphs:
        for d in J['data']:
            if p.text.find(d['variable']) >= 0:
                p.text = p.text.replace(d['variable'], d['value'])
    doc.save(filenameOutput)
replace_string('test.docx', 'test2.pdf')
import docx2pdf
def convert_file(filenameInput, filenameOutput):
    docx2pdf.convert(filenameInput, filenameOutput)
convert_file('test.docx', 'test2.pdf')
There is a Python package called docx2pdf. You can use it to convert a .docx file into a PDF.
Here is a link to the package: https://pypi.org/project/docx2pdf/
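Putting the two halves together, one possible arrangement is to save the filled-in document as .docx first and only then convert that file. This is a rough sketch rather than a definitive recipe: the helper names `fill_text`, `pdf_path_for`, and `fill_and_convert` are hypothetical, the replacements are assumed to be a plain dict, and docx2pdf is assumed to have Microsoft Word available on Windows or macOS.

```python
import os


def fill_text(text, replacements):
    """Replace every placeholder key found in text with its value."""
    for placeholder, value in replacements.items():
        text = text.replace(placeholder, value)
    return text


def pdf_path_for(docx_path):
    """Return the same path with its extension swapped for .pdf."""
    return os.path.splitext(docx_path)[0] + ".pdf"


def fill_and_convert(template_path, filled_docx_path, replacements):
    """Fill a .docx template, save it as .docx, then convert that file to PDF."""
    # Imported lazily so the two helpers above work without these packages.
    from docx import Document
    from docx2pdf import convert

    doc = Document(template_path)
    for p in doc.paragraphs:
        new_text = fill_text(p.text, replacements)
        if new_text != p.text:
            p.text = new_text  # assigning p.text collapses the paragraph into one run
    doc.save(filled_docx_path)          # python-docx writes .docx only
    out_pdf = pdf_path_for(filled_docx_path)
    convert(filled_docx_path, out_pdf)  # assumption: Microsoft Word is installed
    return out_pdf
```

A call such as `fill_and_convert('test.docx', 'test2.docx', {'name': 'Ann'})` would then leave both `test2.docx` and `test2.pdf` on disk.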

ImportError: cannot import name 'COMError' from '_ctypes' (/usr/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so)

import os
import glob
import comtypes.client
from PyPDF2 import PdfFileMerger
def docxs_to_pdf():
    """Converts all word files in pdfs and append them to pdfslist"""
    word = comtypes.client.CreateObject('Word.Application')
    pdfslist = PdfFileMerger()
    x = 0
    for f in glob.glob("*.docx"):
        input_file = os.path.abspath(f)
        output_file = os.path.abspath("demo" + str(x) + ".pdf")
        # loads each word document
        doc = word.Documents.Open(input_file)
        doc.SaveAs(output_file, FileFormat=16+1)
        doc.Close() # Closes the document, not the application
        pdfslist.append(open(output_file, 'rb'))
        x += 1
    word.Quit()
    return pdfslist
def joinpdf(pdfs):
    """Unite all pdfs"""
    with open("result.pdf", "wb") as result_pdf:
        pdfs.write(result_pdf)
def main():
    """docxs to pdfs: Open Word, create pdfs, close word, unite pdfs"""
    pdfs = docxs_to_pdf()
    joinpdf(pdfs)
main()
I am using Jupyter Notebook and it throws the error shown in the title. What should I do?
I want to convert many .doc files into one PDF. Please help me; I am a beginner in this field.
Make sure you have all the dependencies installed in your environment. The comtypes.client module is part of the comtypes package, which you can install with pip by running this in your terminal:
pip install comtypes
You can download _ctypes from sourceforge:
https://sourceforge.net/projects/ctypes/files/ctypes/1.0.2/ctypes-1.0.2.tar.gz/download?use_mirror=deac-fra
Using docx2pdf does seem easier for your task, though. After you have converted the files, you can use PyPDF2 to append them.
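One detail worth checking first: the path in the error message ends in `linux-gnu`, which suggests the notebook runs on Linux, and `COMError` exists in `_ctypes` only on Windows because COM is a Windows technology. The sketch below picks a conversion route from the platform. The helper names are hypothetical, and the LibreOffice route is an alternative that does not appear in the answers above; it assumes the `soffice` executable is installed and on the PATH.

```python
import sys


def pick_converter(platform=None):
    """Suggest a conversion backend for a given sys.platform value."""
    platform = sys.platform if platform is None else platform
    if platform == "win32":
        return "word-com"     # comtypes or pywin32 driving Microsoft Word
    if platform == "darwin":
        return "docx2pdf"     # drives Microsoft Word for Mac
    return "libreoffice"      # assumption: LibreOffice is installed


def libreoffice_command(input_path, out_dir):
    """Build (but do not run) a headless LibreOffice conversion command."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", out_dir, input_path]
```

The list returned by `libreoffice_command` can be passed to `subprocess.run(cmd, check=True)`, and the resulting PDFs can then be merged with PyPDF2 as in the question.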

Python- Receiving error as AttributeError: Property 'Word.Application.visible' can not be set while converting .doc to .docx format

Hi, I am trying to convert .doc files to .docx format. The code below was working fine, but now I see this error. How can I handle it? Please suggest a fix.
#Converting all .doc files into docx format
import glob
import os
import win32com.client
word = win32com.client.Dispatch("Word.Application")
word.visible = 0
for i, doc in enumerate(glob.iglob("*.doc")):
    in_file = os.path.abspath(doc)
    wb = word.Documents.Open(in_file)
    out_file = os.path.abspath("{}.docx".format(doc.strip('.doc')))
    wb.SaveAs2(out_file, FileFormat=16) # file format for docx
    wb.Close()
word.Quit()
Error output:
AttributeError: Property 'Word.Application.visible' can not be set.
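Separately from the AttributeError, the `doc.strip('.doc')` call in the snippet deserves a second look: `str.strip` removes any of the listed characters from both ends of the string, not a suffix, so file names that begin or end with `d`, `o`, `c`, or `.` get mangled. A minimal sketch of a safer way to build the output name (the helper name `docx_name_for` is hypothetical):

```python
import os


def docx_name_for(doc_name):
    """Swap the final extension of a file name for .docx."""
    stem, _ext = os.path.splitext(doc_name)
    return stem + ".docx"
```

For a file called `doc_notes.doc`, `strip('.doc')` leaves `_notes`, while `docx_name_for` returns `doc_notes.docx`.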

Is there any way to convert Pdf file to Docx using python

I am wondering if there is a way in Python (a tool, a function, etc.) to convert my PDF file to doc or docx.
I am aware of online converters, but I need this in Python code.
If you have a PDF with a lot of pages, the code below will extract the text from every page:
import PyPDF2
# PyPDF2 3.x renames these calls: PdfReader, len(reader.pages), reader.pages[i], extract_text()
path="C:\\ .... "
text=""
pdf_file = open(path, 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
c = read_pdf.numPages
for i in range(c):
    page = read_pdf.getPage(i)
    text+=(page.extractText())
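The snippet above stops once the text is in a string. A minimal sketch of the remaining step follows, assuming python-docx is installed and that plain paragraphs are acceptable (layout, images, and tables from the PDF are not carried over). The helper names `split_paragraphs` and `text_to_docx` are hypothetical.

```python
def split_paragraphs(text):
    """Split extracted text into cleaned, non-empty lines."""
    paragraphs = []
    for line in text.splitlines():
        # Drop non-printable characters, which a .docx file cannot store.
        cleaned = "".join(ch for ch in line if ch.isprintable()).strip()
        if cleaned:
            paragraphs.append(cleaned)
    return paragraphs


def text_to_docx(text, out_path):
    """Write each cleaned line of text as a paragraph in a new .docx file."""
    from docx import Document  # python-docx, imported lazily

    doc = Document()
    for para in split_paragraphs(text):
        doc.add_paragraph(para)
    doc.save(out_path)
```

Calling `text_to_docx(text, 'output.docx')` after the extraction loop would produce a text-only Word document.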
If you happen to have MS Word, there is a really simple way to do this using COM.
Here is a script I wrote that can convert pdf to docx by calling the Word application.
import glob
import win32com.client
import os
word = win32com.client.Dispatch("Word.Application")
word.visible = 0
pdfs_path = "" # folder where the .pdf files are stored
reqs_path = "" # folder where the .docx files are written (not defined in the original script)
for i, doc in enumerate(glob.iglob(pdfs_path+"*.pdf")):
    print(doc)
    filename = doc.split('\\')[-1]
    in_file = os.path.abspath(doc)
    print(in_file)
    wb = word.Documents.Open(in_file)
    out_file = os.path.abspath(reqs_path + filename[0:-4] + ".docx")
    print("outfile\n",out_file)
    wb.SaveAs2(out_file, FileFormat=16) # file format for docx
    print("success...")
    wb.Close()
word.Quit()
