.doc to pdf using python


I'm tasked with converting tons of .doc files to .pdf, and the only way my supervisor wants me to do this is through MS Word 2010. I know I should be able to automate this with Python COM automation. The only problem is that I don't know how or where to start. I tried searching for tutorials but was not able to find any (maybe I did, but I don't know what I'm looking for).
Right now I'm reading through this. I don't know how useful it is going to be.

A simple example using comtypes, converting a single file, input and output filenames given as commandline arguments:
import sys
import os
import comtypes.client
wdFormatPDF = 17
in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])
word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
You could also use pywin32, which would be the same except for:
import win32com.client
and then:
word = win32com.client.Dispatch('Word.Application')

You can use the docx2pdf python package to bulk convert docx to pdf. It can be used as both a CLI and a python library. It requires Microsoft Office to be installed and uses COM on Windows and AppleScript (JXA) on macOS.
from docx2pdf import convert
convert("input.docx")
convert("input.docx", "output.pdf")
convert("my_docx_folder/")
pip install docx2pdf
docx2pdf input.docx output.pdf
Disclaimer: I wrote the docx2pdf package. https://github.com/AlJohri/docx2pdf

I have tested many solutions, but none of them works efficiently on Linux distributions.
I recommend this solution:
import sys
import subprocess
import re
def convert_to(folder, source, timeout=None):
    args = [libreoffice_exec(), '--headless', '--convert-to', 'pdf', '--outdir', folder, source]
    process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
    filename = re.search('-> (.*?) using filter', process.stdout.decode())
    return filename.group(1)
def libreoffice_exec():
    # TODO: Provide support for more platforms
    if sys.platform == 'darwin':
        return '/Applications/LibreOffice.app/Contents/MacOS/soffice'
    return 'libreoffice'
and you call your function:
result = convert_to('TEMP Directory', 'Your File', timeout=15)
All resources:
https://michalzalecki.com/converting-docx-to-pdf-using-python/
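One thing to watch in convert_to(): when the conversion fails, re.search finds nothing and filename.group(1) raises AttributeError. A minimal sketch of a more defensive parser is below. parse_converted_path is a hypothetical helper name, and the sample output line is only an assumption about what LibreOffice prints, modelled on the pattern used in the answer.

```python
import re

# Same pattern as in convert_to() above.
_CONVERTED = re.compile(r'-> (.*?) using filter')

def parse_converted_path(stdout_text):
    """Return the output path reported by the converter, or None if absent."""
    match = _CONVERTED.search(stdout_text)
    return match.group(1) if match else None

# Assumed shape of a success line; real output may differ between versions.
sample = "convert /tmp/in/report.docx -> /tmp/out/report.pdf using filter : writer_pdf_Export"
print(parse_converted_path(sample))
print(parse_converted_path("Error: source file could not be loaded"))
```

The caller can then decide whether a None result should raise, be logged, or be retried.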

I worked on this problem for half a day, so I think I should share some of my experience. Steven's answer is right, but it fails on my computer. There are two key points to fix it:
(1). The first time I create the 'Word.Application' object, I have to make it (the Word app) visible before opening any documents. (Actually, I cannot explain why this works. If I do not do this on my computer, the program crashes when I try to open a document in invisible mode, and then the 'Word.Application' object is deleted by the OS.)
(2). After doing (1), the program sometimes works but still fails often. The error "COMError: (-2147418111, 'Call was rejected by callee.', (None, None, None, 0, None))" means that the COM server may not be able to respond quickly enough, so I add a delay before I try to open a document.
After these two steps, the program works with no more failures on my machine. The demo code is below. If you have encountered the same problems, try following these two steps. Hope it helps.
import os
import comtypes.client
import time
wdFormatPDF = 17
# absolute path is needed
# be careful about the slash '\', use '\\' or '/' or raw string r"..."
in_file=r'absolute path of input docx file 1'
out_file=r'absolute path of output pdf file 1'
in_file2=r'absolute path of input docx file 2'
out_file2=r'absolute path of output pdf file 2'
# print out filenames
print(in_file)
print(out_file)
print(in_file2)
print(out_file2)
# create COM object
word = comtypes.client.CreateObject('Word.Application')
# key point 1: make word visible before open a new document
word.Visible = True
# key point 2: wait for the COM Server to prepare well.
time.sleep(3)
# convert docx file 1 to pdf file 1
doc=word.Documents.Open(in_file) # open docx file 1
doc.SaveAs(out_file, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 1
word.Visible = False
# convert docx file 2 to pdf file 2
doc = word.Documents.Open(in_file2) # open docx file 2
doc.SaveAs(out_file2, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 2
word.Quit() # close Word Application
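A fixed time.sleep(3) works, but it is a guess about how long the COM server needs. As an alternative sketch (not part of the original answer), the failing call can be retried a few times with a short pause. call_with_retry is a hypothetical helper, and the demo uses a stand-in function that raises OSError so it runs without Word installed.

```python
import time

def call_with_retry(func, attempts=5, delay=1.0, retry_on=(Exception,)):
    """Call func(); on a listed exception wait `delay` seconds and retry.

    The last failure is re-raised once `attempts` calls have been made.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(delay)

# Stand-in that is "rejected" twice before succeeding.
calls = {"n": 0}
def flaky_open():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("Call was rejected by callee.")
    return "document"

result = call_with_retry(flaky_open, attempts=5, delay=0.01, retry_on=(OSError,))
print(result, calls["n"])
```

With comtypes, the call would presumably look like call_with_retry(lambda: word.Documents.Open(in_file), retry_on=(comtypes.COMError,)); treat that as an untested assumption.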

unoconv (written in Python) and OpenOffice running as a headless daemon.
https://github.com/unoconv/unoconv
http://dag.wiee.rs/home-made/unoconv/
Works very nicely for doc, docx, ppt, pptx, xls, xlsx.
Very useful if you need to convert docs or save/convert to certain formats on a server.

As an alternative to the SaveAs function, you could also use ExportAsFixedFormat which gives you access to the PDF options dialog you would normally see in Word. With this you can specify bookmarks and other document properties.
doc.ExportAsFixedFormat(OutputFileName=pdf_file,
    ExportFormat=17,       # 17 = PDF output, 18 = XPS output
    OpenAfterExport=False,
    OptimizeFor=0,         # 0 = Print (higher res), 1 = Screen (lower res)
    CreateBookmarks=1,     # 0 = no bookmarks, 1 = heading bookmarks only, 2 = bookmarks match Word bookmarks
    DocStructureTags=True
)
The full list of function arguments is: 'OutputFileName', 'ExportFormat', 'OpenAfterExport', 'OptimizeFor', 'Range', 'From', 'To', 'Item', 'IncludeDocProps', 'KeepIRM', 'CreateBookmarks', 'DocStructureTags', 'BitmapMissingFonts', 'UseISO19005_1', 'FixedFormatExtClassPtr'
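To keep the magic numbers out of the call site, the options can be built from named values first. This is only a sketch: export_options and the three lookup tables are hypothetical names, and the numeric values are copied from the comments in the call above.

```python
# Numeric values taken from the comments in the ExportAsFixedFormat call above.
EXPORT_FORMAT = {"pdf": 17, "xps": 18}
OPTIMIZE_FOR = {"print": 0, "screen": 1}
BOOKMARKS = {"none": 0, "headings": 1, "word": 2}

def export_options(output, fmt="pdf", quality="print", bookmarks="headings"):
    """Build keyword arguments for doc.ExportAsFixedFormat(**options)."""
    return {
        "OutputFileName": output,
        "ExportFormat": EXPORT_FORMAT[fmt],
        "OpenAfterExport": False,
        "OptimizeFor": OPTIMIZE_FOR[quality],
        "CreateBookmarks": BOOKMARKS[bookmarks],
        "DocStructureTags": True,
    }

options = export_options(r"C:\out\report.pdf", quality="screen", bookmarks="word")
print(options)
```

The dictionary would then be passed as doc.ExportAsFixedFormat(**options).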

It's worth noting that Steven's answer works, but if you use a for loop to export multiple files, make sure to place the CreateObject or Dispatch statement before the loop. It only needs to be created once. See my problem: Python win32com.client.Dispatch looping through Word documents and export to PDF; fails when next loop occurs
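A minimal sketch of that structure: the Word object is created once by the caller, every document is closed after its export, and Quit is called exactly once at the end. convert_all and the Fake* classes are hypothetical; the fakes exist only so the sketch runs without Word installed.

```python
def convert_all(word, jobs, pdf_format=17):
    """Convert (in_file, out_file) pairs using one already-created Word object.

    `word` is whatever CreateObject/Dispatch returned; it is quit exactly
    once, even if a conversion raises.
    """
    done = []
    try:
        for in_file, out_file in jobs:
            doc = word.Documents.Open(in_file)
            try:
                doc.SaveAs(out_file, FileFormat=pdf_format)
                done.append(out_file)
            finally:
                doc.Close()
    finally:
        word.Quit()
    return done

# Stand-ins that record the calls made on them.
class FakeDoc:
    def __init__(self, log):
        self.log = log
    def SaveAs(self, path, FileFormat):
        self.log.append(("save", path, FileFormat))
    def Close(self):
        self.log.append(("close",))

class FakeDocuments:
    def __init__(self, log):
        self.log = log
    def Open(self, path):
        self.log.append(("open", path))
        return FakeDoc(self.log)

class FakeWord:
    def __init__(self):
        self.log = []
        self.Documents = FakeDocuments(self.log)
    def Quit(self):
        self.log.append(("quit",))

fake = FakeWord()
converted = convert_all(fake, [("a.doc", "a.pdf"), ("b.doc", "b.pdf")])
print(converted)
```

In real use, `word` would be the object returned by comtypes.client.CreateObject('Word.Application') or win32com.client.Dispatch('Word.Application'), and the paths would be absolute.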

If you don't mind using PowerShell, have a look at this Hey, Scripting Guy! article. The code presented could be adapted to use the wdFormatPDF enumeration value of WdSaveFormat (see here).
This blog article presents a different implementation of the same idea.

I have modified it for ppt support as well. My solution supports all the extensions listed below.
word_extensions = [".doc", ".odt", ".rtf", ".docx", ".dotm", ".docm"]
ppt_extensions = [".ppt", ".pptx"]
My Solution: Github Link
I have modified code from Docx2PDF

I tried the accepted answer but wasn't particularly keen on the bloated PDFs Word was producing, which were usually an order of magnitude bigger than expected. After looking into how to disable the dialogs when using a virtual PDF printer, I came across Bullzip PDF Printer and I've been rather impressed with its features. It has now replaced the other virtual printers I used previously. You'll find a "free community edition" on their download page.
The COM API can be found here and a list of the usable settings can be found here. The settings are written to a "runonce" file which is used for one print job only and then removed automatically. When printing multiple PDFs we need to make sure one print job completes before starting another to ensure the settings are used correctly for each file.
import os, re, time, datetime, win32com.client
def print_to_Bullzip(file):
    util = win32com.client.Dispatch("Bullzip.PDFUtil")
    settings = win32com.client.Dispatch("Bullzip.PDFSettings")
    settings.PrinterName = util.DefaultPrinterName # make sure we're controlling the right PDF printer
    outputFile = re.sub(r"\.[^.]+$", ".pdf", file)
    statusFile = re.sub(r"\.[^.]+$", ".status", file)
    settings.SetValue("Output", outputFile)
    settings.SetValue("ConfirmOverwrite", "no")
    settings.SetValue("ShowSaveAS", "never")
    settings.SetValue("ShowSettings", "never")
    settings.SetValue("ShowPDF", "no")
    settings.SetValue("ShowProgress", "no")
    settings.SetValue("ShowProgressFinished", "no") # disable balloon tip
    settings.SetValue("StatusFile", statusFile) # created after print job
    settings.WriteSettings(True) # write settings to the runonce.ini
    util.PrintFile(file, util.DefaultPrinterName) # send to Bullzip virtual printer
    # wait until print job completes before continuing
    # otherwise settings for the next job may not be used
    timestamp = datetime.datetime.now()
    while (datetime.datetime.now() - timestamp).seconds < 10:
        if os.path.exists(statusFile) and os.path.isfile(statusFile):
            error = util.ReadIniString(statusFile, "Status", "Errors", '')
            if error != "0":
                raise IOError("PDF was created with errors")
            os.remove(statusFile)
            return
        time.sleep(0.1)
    raise IOError("PDF creation timed out")

I was working with this solution, but I needed to find all .docx, .dotm, .docm, .odt, .doc or .rtf files and then convert them all to .pdf (Python 3.7.5). Hope it works...
import os
import win32com.client
wdFormatPDF = 17
for root, dirs, files in os.walk(r'your directory here'):
    for f in files:
        if f.endswith(".doc") or f.endswith(".odt") or f.endswith(".rtf"):
            try:
                print(f)
                in_file = os.path.join(root, f)
                word = win32com.client.Dispatch('Word.Application')
                word.Visible = False
                doc = word.Documents.Open(in_file)
                # strip the 4-character extension; Word appends .pdf
                doc.SaveAs(os.path.join(root, f[:-4]), FileFormat=wdFormatPDF)
                doc.Close()
                word.Quit()
                print('done')
                os.remove(os.path.join(root, f))
            except Exception:
                print('could not open')
                # os.remove(os.path.join(root,f))
        elif f.endswith(".docx") or f.endswith(".dotm") or f.endswith(".docm"):
            try:
                print(f)
                in_file = os.path.join(root, f)
                word = win32com.client.Dispatch('Word.Application')
                word.Visible = False
                doc = word.Documents.Open(in_file)
                # strip the 5-character extension; Word appends .pdf
                doc.SaveAs(os.path.join(root, f[:-5]), FileFormat=wdFormatPDF)
                doc.Close()
                word.Quit()
                print('done')
                os.remove(os.path.join(root, f))
            except Exception:
                print('could not open')
                # os.remove(os.path.join(root,f))
        else:
            pass
The try/except is there for documents that can't be read, so that the script doesn't exit before it reaches the last document.
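The two branches above differ only in how many characters they slice off the filename. A possible simplification (a sketch, with the hypothetical names WORD_EXTENSIONS and iter_conversion_jobs) is to let os.path.splitext separate the extension and to check it against a set, which also handles upper-case extensions.

```python
import os
import tempfile

WORD_EXTENSIONS = {".doc", ".docx", ".docm", ".dotm", ".odt", ".rtf"}

def iter_conversion_jobs(top):
    """Yield (source, target_pdf) for every Word-like file below `top`."""
    for root, _dirs, files in os.walk(top):
        for name in files:
            stem, ext = os.path.splitext(name)
            if ext.lower() in WORD_EXTENSIONS:
                yield os.path.join(root, name), os.path.join(root, stem + ".pdf")

# Demo on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.DOCX", "b.rtf", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    jobs = sorted(iter_conversion_jobs(tmp))
    names = [(os.path.basename(s), os.path.basename(t)) for s, t in jobs]
print(names)
```

Each (source, target) pair can then be fed to a single Word instance for Open and SaveAs.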

You should start by investigating so-called virtual PDF print drivers.
As soon as you find one, you should be able to write a batch file that prints your DOC files to PDF files. You can probably do this in Python too (set up the printer driver output and issue the document/print command in MS Word; the latter can be done from the command line, AFAIR).

import docx2txt
from win32com import client
import os
files_from_folder = r"c:\\doc"
directory = os.fsencode(files_from_folder)
amount = 1
word = client.DispatchEx("Word.Application")
word.Visible = True
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    print(filename)
    if filename.endswith('docx'):
        text = docx2txt.process(os.path.join(files_from_folder, filename))
        print(f'{filename} transfered ({amount})')
        amount += 1
        new_filename = filename.split('.')[0] + '.txt'
        try:
            with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
                t.write(text)
        except:
            os.mkdir(files_from_folder + r'\txt_files')
            with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
                t.write(text)
    elif filename.endswith('doc'):
        doc = word.Documents.Open(os.path.join(files_from_folder, filename))
        text = doc.Range().Text
        doc.Close()
        print(f'{filename} transfered ({amount})')
        amount += 1
        new_filename = filename.split('.')[0] + '.txt'
        try:
            with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
                t.write(text)
        except:
            os.mkdir(files_from_folder + r'\txt_files')
            with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
                t.write(text)
word.Quit()
The Source Code, see here:
https://neculaifantanaru.com/en/python-full-code-how-to-convert-doc-and-docx-files-to-pdf-from-the-folder.html

I would suggest ignoring your supervisor and using OpenOffice, which has a Python API. OpenOffice has built-in support for Python, and someone created a library specifically for this purpose (PyODConverter).
If he isn't happy with the output, tell him it could take you weeks to do it with Word.

Related

How do I get the path of an open Microsoft Office program from the command line?

I'm writing something in python that needs to know which specific files/programs are open. I've mapped the list of running processes to find the executable paths of the processes I'm looking for. This works for most things, but all Microsoft Office programs run under general processes like WINWORD.exe or EXCEL.exe etc. I've also tried getting a list of open windows and their titles to see what file is being edited, but the window titles are relative paths not absolute paths to the file being edited.
Here's a sample:
import wmi
f = wmi.WMI()
pid_map = {}
PID = 4464 #pid of Microsoft Word
for process in f.Win32_Process():
    if not process.Commandline: continue
    pid_map[process.ProcessID] = process.Commandline
pid_map[PID]
Outputs:
'"C:\\Program Files\\Microsoft Office\\root\\Office16\\WINWORD.EXE" '
How do I get the path of the file actually being edited?

I figured it out. Here is a function that will return the files being edited.
import re
import pythoncom
def get_office(): # creates doctype: docpath dictionary
    context = pythoncom.CreateBindCtx(0)
    files = {}
    dupl = 1
    patt2 = re.compile(r'(?i)(\w:)((\\|\/)+([\w\-\.\(\)\{\}\s]+))+'+r'(\.\w+)') #matches files, can change the latter to only get specific files
    #look for path in ROT
    for moniker in pythoncom.GetRunningObjectTable():
        name = moniker.GetDisplayName(context, None)
        checker = re.search(patt2,name)
        if checker:
            match = checker.group(5) #extension
            if match in ('.XLAM','.xlam'): continue #These files aren't useful
            try:
                files[match[1:]] # check to see if the file type was already documented
                match += str(dupl)
                dupl += 1
            except KeyError:
                pass
            files[match[1:]] = name #add doctype: doc path pairing to dictionary
    return files

How to avoid MS-Word dialog box of a .docx file containing comments to pause python execution at saving?

Problem:
I need to batch some Word files with python to:
check if they are .doc files
if so change their name
save them as .docx files
So that I can then extract some info from the tables contained in the document with docx lib.
I encounter an issue when trying to save docx files containing comments, since a popup appears asking me to confirm that I want to save the file with comments. It pauses the code execution until an operator manually confirms by clicking OK in the popup.
This prevents the code from running automatically without any operator input.
Note: The comments don't need to be kept in the .docx files since I won't use them for further computation.
What I do:
Here's the code I have right now. It stops before the end of execution until you confirm in Word that you accept keeping the comments (in case your doc file contains some):
import win32com.client
doc_file = "path\\of\\document.doc"
docx_file = "path\\of\\new_document.docx"
word = win32com.client.Dispatch("Word.application")
#get the file extension
file_extension = '.'+doc_file.split('\\').pop().split('.').pop()
#test file extension and convert it to docx if original document is a .doc
if file_extension.lower() == '.doc':
    wordDoc = word.Documents.Open(doc_file, False, False, False)
    wordDoc.SaveAs2(docx_file, FileFormat = 12)
    wordDoc.Close()
#test file extension and print a message in the console if not a .doc document
else:
    print('Extension of document {0} is not .doc, will not be treated'.format(doc_file))
word.Quit()
What I've tried:
I tried to look for solutions to remove the comments before saving since I do not use them later in the .docx file I created, but I didn't find any satisfying solution.
Maybe I'm just using the wrong approach and there's a super simple way to dismiss the dialog box or something, but somehow didn't find it.
Thanks!

This seems to do the job, but removes all comments:
import win32com.client
doc_file = "path\\of\\document.doc"
docx_file = "path\\of\\new_document.docx"
word = win32com.client.Dispatch("Word.application")
#get the file extension
file_extension = '.'+doc_file.split('\\').pop().split('.').pop()
#test file extension and convert it to docx if original document is a .doc
if file_extension.lower() == '.doc':
    wordDoc = word.Documents.Open(doc_file, False, False, False)
    # Accept all revisions
    word.ActiveDocument.Revisions.AcceptAll()
    # Delete all comments
    if word.ActiveDocument.Comments.Count >= 1:
        word.ActiveDocument.DeleteAllComments()
    wordDoc.SaveAs2(docx_file, FileFormat = 12)
    wordDoc.Close()
#test file extension and print a message in the console if not a .doc document
else:
    print('Extension of document {0} is not .doc, will not be treated'.format(doc_file))
word.Quit()
I just added the part below, which accepts the revisions and removes the comments, to the original code:
# Accept all revisions
word.ActiveDocument.Revisions.AcceptAll()
# Delete all comments
if word.ActiveDocument.Comments.Count >= 1:
    word.ActiveDocument.DeleteAllComments()
I found the solution here: Python - Using win32com.client to accept all changes in Word Documents
But it still doesn't fully answer the initial question, because it just gets rid of the comments, which I don't need in my own situation. If you need to keep the comments, I still don't know how to proceed.

I stumbled upon this today:
import win32com.client
doc_file = "path\\of\\document.doc"
docx_file = "path\\of\\new_document.docx"
word = win32com.client.Dispatch("Word.application")
#Disable save with comments warning
word.Options.WarnBeforeSavingPrintingSendingMarkup = False
#get the file extension
file_extension = '.'+doc_file.split('\\').pop().split('.').pop()
#test file extension and convert it to docx if original document is a .doc
if file_extension.lower() == '.doc':
    wordDoc = word.Documents.Open(doc_file, False, False, False)
    wordDoc.SaveAs2(docx_file, FileFormat = 12)
    wordDoc.Close()
#test file extension and print a message in the console if not a .doc document
else:
    print('Extension of document {0} is not .doc, will not be treated'.format(doc_file))
word.Quit()
An even easier solution is to use wordconv.exe which is located in your office installation beside the WinWord.exe
The commandline is like this:
wordconv.exe -oice -nme inputfilePath outputFilePath
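Since WarnBeforeSavingPrintingSendingMarkup is an application-level option, it may be worth putting the previous value back once the batch is done. A rough sketch using a context manager follows; markup_warning_disabled is a hypothetical name, and the SimpleNamespace object is a stand-in so the example runs without Word installed.

```python
from contextlib import contextmanager
from types import SimpleNamespace

@contextmanager
def markup_warning_disabled(word):
    """Temporarily switch off the 'save with comments' prompt."""
    options = word.Options
    previous = options.WarnBeforeSavingPrintingSendingMarkup
    options.WarnBeforeSavingPrintingSendingMarkup = False
    try:
        yield word
    finally:
        # Restore whatever the user had configured before.
        options.WarnBeforeSavingPrintingSendingMarkup = previous

# Stand-in for the object returned by win32com.client.Dispatch(...).
fake_word = SimpleNamespace(
    Options=SimpleNamespace(WarnBeforeSavingPrintingSendingMarkup=True)
)
with markup_warning_disabled(fake_word):
    inside = fake_word.Options.WarnBeforeSavingPrintingSendingMarkup
after = fake_word.Options.WarnBeforeSavingPrintingSendingMarkup
print(inside, after)
```

The Open/SaveAs2/Close calls from the answer would go inside the with block.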

How to convert docx to pdf on Mac OS with Python?

I've looked up several SO and other web pages but I haven't found anything that works.
The script I wrote, opens a docx, changes some words and then saves it in a certain folder as a docx. However, I want it to save it as a pdf but I don't know how to.
This is an example of the code I'm working with:
# Opening the original document
doc = Document('./myDocument.docx')
# Some code which changes the doc
# Saving the changed doc as a docx
doc.save('/my/folder/myChangedDocument.docx')
The things I tried to do for it to save as a pdf:
from docx2pdf import convert
# This after it was saved as a docx
convert('/my/folder/myChangedDocument.docx', '/my/folder/myChangedDocument.pdf')
But it says that Word needs permission to open the saved file and I have to select the file to give it the permission. After that, it just says:
0%| | 0/1 [00:03<?, ?it/s]
{'input': '/my/folder/contractsomeVariable.docx', 'output': '/my/folder/contractsomeVariable.pdf', 'result': 'error', 'error': 'Error: An error has occurred.'}
And I tried to simply put .pdf instead of .docx after the document name when I saved it but that didn't work either as the module docx can't do that.
So does someone know how I can save a docx as a pdf using Python?

You can use docx2pdf by making the changes first and then converting.
Use pip to install it on Mac (I am guessing you already have it, but it is still good to include).
pip install docx2pdf
Once docx2pdf is installed, put the path of your docx file in inputFile and create an empty .pdf file at outputFile.
from docx2pdf import convert
inputFile = "document.docx"
outputFile = "document2.pdf"
file = open(outputFile, "w")
file.close()
convert(inputFile, outputFile)

A simple way, you can use libreoffice
Ref: https://www.libreoffice.org/get-help/install-howto/macos/
And script sample:
import os
import re
import subprocess
# Assumption: default macOS install location; adjust for your system
LIBREOFFICE_BINARY_PATH = '/Applications/LibreOffice.app/Contents/MacOS/soffice'
def convert_word_to_pdf_local(folder, source, timeout=None):
    args = [
        LIBREOFFICE_BINARY_PATH,
        '--headless',
        '--convert-to',
        'pdf',
        '--outdir',
        folder,
        source,
    ]
    if check_libreoffice_exists() is False:
        raise Exception('Libreoffice not found')
    process = subprocess.run(
        args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout
    )
    filename = re.search('-> (.*?) using filter', process.stdout.decode())
    if filename is None:
        raise Exception('Libreoffice is not working')
    else:
        filename = filename.group(1)
    pdf_file = open(filename, 'rb')
    return pdf_file
def check_libreoffice_exists():
    s = os.system(f'{LIBREOFFICE_BINARY_PATH} --version')
    if s != 0:
        return False
    return True
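As an alternative to running the binary with --version just to see whether it exists, the standard library's shutil.which can look it up without starting a process. A small sketch (find_office_binary and the candidate names are assumptions, not part of the answer):

```python
import shutil

def find_office_binary(candidates=("soffice", "libreoffice")):
    """Return the full path of the first candidate found, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None

# Prints a path if LibreOffice is on PATH, otherwise None.
print(find_office_binary())
missing = find_office_binary(("no-such-office-binary-xyz",))
print(missing)
```

shutil.which also accepts a path containing a directory, so the macOS location /Applications/LibreOffice.app/Contents/MacOS/soffice can be passed as one of the candidates.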

How do I write a python script that can read doc/docx files and convert them to txt?

Basically I have a folder with plenty of .doc/.docx files. I need them in .txt format. The script should iterate over all the files in a directory, convert them to .txt files and store them in another folder.
How can I do it?
Does there exist a module that can do this?

I figured this would make an interesting quick programming project. This has only been tested on a simple .docx file containing "Hello, world!", but the train of logic should give you a place to work from to parse more complex documents.
from shutil import copyfile, rmtree
import sys
import os
import zipfile
from lxml import etree
# command format: python3 docx_to_txt.py Hello.docx
# let's get the file name
zip_dir = sys.argv[1]
# cut off the .docx, make it a .zip
zip_dir_zip_ext = os.path.splitext(zip_dir)[0] + '.zip'
# make a copy of the .docx and put it in .zip
copyfile(zip_dir, zip_dir_zip_ext)
# unzip the .zip
zip_ref = zipfile.ZipFile(zip_dir_zip_ext, 'r')
zip_ref.extractall('./temp')
# get the xml out of /word/document.xml
data = etree.parse('./temp/word/document.xml')
# we'll want to go over all 't' elements in the xml node tree.
# note that MS office uses namespaces and that the w must be defined in the namespaces dictionary args
# each :t element is the "text" of the file. that's what we're looking for
# result is a list filled with the text of each t node in the xml document model
result = [node.text.strip() for node in data.xpath("//w:t", namespaces={'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'})]
# dump result into a new .txt file
with open(os.path.splitext(zip_dir)[0]+'.txt', 'w') as txt:
    # join the elements of result together since txt.write can't take lists
    joined_result = '\n'.join(result)
    # write it into the new file
    txt.write(joined_result)
# close the zip_ref file
zip_ref.close()
# get rid of our mess of working directories
rmtree('./temp')
os.remove(zip_dir_zip_ext)
I'm sure there's a more elegant or pythonic way to accomplish this. You'll need to have the file you want to convert in the same directory as the python file. Command format is python3 docx_to_txt.py file_name.docx
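One possibly simpler variant of the same idea: zipfile can read word/document.xml straight out of the .docx, so the copy, the temporary directory and the cleanup are not needed. The sketch below uses only the standard library (xml.etree instead of lxml) and joins the text runs of each paragraph. docx_to_text is a hypothetical name, and the in-memory archive is a stand-in for a real .docx; documents with tables or text boxes may need more care.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def docx_to_text(docx_file):
    """Return the text of a .docx, one line per paragraph.

    `docx_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for para in root.iter("{%s}p" % W_NS):
        runs = [t.text or "" for t in para.iter("{%s}t" % W_NS)]
        lines.append("".join(runs))
    return "\n".join(lines)

# Build a tiny stand-in .docx in memory (only the part this function reads).
document_xml = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world!</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
    '</w:body></w:document>' % W_NS
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", document_xml)
buffer.seek(0)
text = docx_to_text(buffer)
print(text)
```

Passing a filename instead of the buffer works the same way, so the script no longer needs to sit next to the file being converted.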

conda install -c conda-forge python-docx
from docx import Document
doc = Document(file)
for p in doc.paragraphs:
    print(p.text)

Thought I would share my approach, basically boils down to two commands that convert either .doc or .docx to a string, both options require a certain package:
import docx
import os
import glob
import subprocess
import sys
# .docx (pip3 install python-docx)
doctext = "\n".join(i.text.encode("utf-8").decode("utf-8") for i in docx.Document(infile).paragraphs)
# .doc (apt-get install antiword)
doctext = subprocess.check_output(["antiword", infile]).decode("utf-8")
I then wrap these solutions up in a function, that can either return the result as a python string, or write to a file (with the option of appending or replacing).
import docx
import os
import glob
import subprocess
import sys
def doc2txt(infile, outfile, return_string=False, append=False):
    if os.path.exists(infile):
        if infile.endswith(".docx"):
            try:
                doctext = "\n".join(i.text.encode("utf-8").decode("utf-8") for i in docx.Document(infile).paragraphs)
            except Exception as e:
                print("Exception in converting .docx to str: ", e)
                return None
        elif infile.endswith(".doc"):
            try:
                doctext = subprocess.check_output(["antiword", infile]).decode("utf-8")
            except Exception as e:
                print("Exception in converting .doc to str: ", e)
                return None
        else:
            print("{0} is not .doc or .docx".format(infile))
            return None
        if return_string == True:
            return doctext
        else:
            writemode = "a" if append==True else "w"
            with open(outfile, writemode) as f:
                f.write(doctext)
    else:
        print("{0} does not exist".format(infile))
        return None
I then would call this function via something like:
files = glob.glob("/path/to/filedir/**/*.doc*", recursive=True)
outfile = "/path/to/out.txt"
for file in files:
    doc2txt(file, outfile, return_string=False, append=True)
It's not often I need to perform this operation, but up until now the script has worked for all my needs, if you find this function has a bug let me know in a comment.

Python win32com.client.Dispatch looping through Word documents and export to PDF; fails when next loop occurs

Based on the script here: .doc to pdf using python I've got a semi-working script to export .docx files to pdf from C:\Export_to_pdf into a new folder.
The problem is that it gets through the first couple of documents and then fails with:
(-2147352567, 'Exception occurred.', (0, u'Microsoft Word', u'Command failed', u'wdmain11.chm', 36966, -2146824090), None)
This is, apparently, an unhelpful general error message. If I step through it slowly using pdb, I can loop through all files and export successfully. If I also keep an eye on the processes in Windows Task Manager, I can see that WINWORD starts and ends when it is supposed to, but on the larger files it takes longer for the memory usage to stabilise. This makes me think that the script is tripping up when WINWORD doesn't have time to initialize or quit before the next method is called on the client.Dispatch object.
Is there a way with win32com or comtypes to identify and wait for a process to start or finish?
My script:
import os
from win32com import client
folder = "C:\\Export_to_pdf"
file_type = 'docx'
out_folder = folder + "\\PDF"
os.chdir(folder)
if not os.path.exists(out_folder):
    print 'Creating output folder...'
    os.makedirs(out_folder)
    print out_folder, 'created.'
else:
    print out_folder, 'already exists.\n'
for files in os.listdir("."):
    if files.endswith(".docx"):
        print files
print '\n\n'
try:
    for files in os.listdir("."):
        if files.endswith(".docx"):
            out_name = files.replace(file_type, r"pdf")
            in_file = os.path.abspath(folder + "\\" + files)
            out_file = os.path.abspath(out_folder + "\\" + out_name)
            word = client.Dispatch("Word.Application")
            doc = word.Documents.Open(in_file)
            print 'Exporting', out_file
            doc.SaveAs(out_file, FileFormat=17)
            doc.Close()
            word.Quit()
except Exception, e:
    print e
The working code: I just replaced the try block with this. Note that I moved the DispatchEx statement outside the for loop and put word.Quit() in a finally block to ensure Word closes.
try:
    word = client.DispatchEx("Word.Application")
    for files in os.listdir("."):
        if files.endswith(".docx") or files.endswith('doc'):
            out_name = files.replace(file_type, r"pdf")
            in_file = os.path.abspath(folder + "\\" + files)
            out_file = os.path.abspath(out_folder + "\\" + out_name)
            doc = word.Documents.Open(in_file)
            print 'Exporting', out_file
            doc.SaveAs(out_file, FileFormat=17)
            doc.Close()
except Exception, e:
    print e
finally:
    word.Quit()
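One detail in the script worth a second look: files.replace(file_type, "pdf") replaces every occurrence of "docx" in the name, and does nothing at all for a file ending in .doc, so the output name keeps its .doc extension. A small sketch of a safer alternative follows; pdf_name is a hypothetical helper, and ntpath is used so the Windows path rules apply on any platform.

```python
import ntpath  # Windows path rules regardless of the platform running this

def pdf_name(filename):
    """Swap the final extension for .pdf, leaving the rest of the name alone."""
    stem, _ext = ntpath.splitext(filename)
    return stem + ".pdf"

# str.replace touches the whole name, not just the extension:
print("report.doc".replace("docx", "pdf"))       # report.doc
print("docx_notes.docx".replace("docx", "pdf"))  # pdf_notes.pdf

# splitext only touches the extension:
print(pdf_name("report.doc"))       # report.pdf
print(pdf_name("docx_notes.docx"))  # docx_notes.pdf
```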

This might not be the problem, but dispatching a separate Word instance and then closing it within each iteration is not necessary and may be the cause of the strange memory problem you are seeing. You only need to open the instance once, and within that instance you can open and close all the documents you need. Like the following:
try:
    word = client.DispatchEx("Word.Application") # Using DispatchEx for an entirely new Word instance
    word.Visible = True # Added this in here so you can see what I'm talking about with the movement of the dispatch and Quit lines.
    for files in os.listdir("."):
        if files.endswith(".docx"):
            out_name = files.replace(file_type, r"pdf")
            in_file = os.path.abspath(folder + "\\" + files)
            out_file = os.path.abspath(out_folder + "\\" + out_name)
            doc = word.Documents.Open(in_file)
            print 'Exporting', out_file
            doc.SaveAs(out_file, FileFormat=17)
            doc.Close()
    word.Quit()
except Exception, e:
    print e
Note: Be careful using try/except when opening win32com instances and files. If an error occurs after you open them but before you close them, they won't be closed (because that command is never reached).
Also, you may want to consider using DispatchEx instead of just Dispatch. DispatchEx opens a new instance (an entirely new .exe), whereas I believe Dispatch will try to look for an open instance to latch onto, but the documentation on this is foggy. Use DispatchEx if you in fact want more than one instance (i.e. open one file in one and one file in another).
As for waiting, the program should just wait on that line when more time is needed, but I'm not sure.
Oh! Also, you can use word.Visible = True if you want to see the instance and files actually open (it might be useful to see the problem visually, but turn it off when fixed because it will definitely slow things down ;-) ).
