Using Adobe Reader's Export as text function in Python


I want to convert lots of PDFs into text files.
The formatting is very important and only Adobe Reader seems to get it right (PDFMiner or PyPDF2 do not.)
Is there a way to automate the "export as text" function from Adobe Reader?

The following code will do what you want for one file. I recommend organizing the script into a few little functions and then calling the functions in a loop to process many files. You'll need to install the keyboard library using pip, or some other tool.
import pathlib as pl
import os
import keyboard
import time
import io
KILL_KEY = 'esc'
read_path = pl.Path("C:/Users/Sam/Downloads/WS-1401-IP.pdf")
####################################################################
write_path = pl.Path(str(read_path.parent/read_path.stem) + ".txt")
overwrite_file = os.path.exists(write_path)
# alt -- activate keyboard shortcuts
# `F` -- open file menu
# `v` -- select "save as text" option
# keyboard.write(write_path)
# `alt+s` -- save button
# `ctrl+w` -- close file
os.startfile(read_path)
time.sleep(1)
keyboard.press_and_release('alt')
time.sleep(1)
keyboard.press_and_release('f') # -- open file menu
time.sleep(1)
keyboard.press_and_release('v') # -- select "save as text" option
time.sleep(1)
keyboard.write(str(write_path))
time.sleep(1)
keyboard.press_and_release('alt+s')
time.sleep(2)
if overwrite_file:
    keyboard.press_and_release('y')
# wait for program to finish saving
waited_too_long = True
for _ in range(5):
    time.sleep(1)
    if os.path.exists(write_path):
        waited_too_long = False
        break
if waited_too_long:
    with io.StringIO() as ss:
        print(
            "program probably saved to somewhere other than",
            write_path,
            file = ss
        )
        msg = ss.getvalue()
    raise ValueError(msg)
keyboard.press_and_release('ctrl+w') # close the file
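The advice above about splitting the script into small functions and looping over many files could look roughly like the sketch below. It is only an outline under assumptions: `text_path_for`, `wait_for_file` and `export_many` are illustrative names, and `export_one` is a hypothetical callback that would wrap the key-press sequence from the script above.

```python
import time
from pathlib import Path


def text_path_for(pdf_path):
    """Return the .txt path next to the PDF (same folder, same stem)."""
    return Path(pdf_path).with_suffix(".txt")


def wait_for_file(path, timeout=5.0, poll=0.5):
    """Poll until `path` exists; return True if it appeared within `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if Path(path).exists():
            return True
        time.sleep(poll)
    return Path(path).exists()


def export_many(pdf_paths, export_one, timeout=5.0):
    """Call `export_one(pdf, txt)` for each PDF; return the PDFs that produced no file.

    `export_one` is a hypothetical callback that would hold the key presses
    from the script above (open the PDF, File menu, save as text, close).
    """
    failed = []
    for pdf in map(Path, pdf_paths):
        txt = text_path_for(pdf)
        export_one(pdf, txt)
        if not wait_for_file(txt, timeout=timeout):
            failed.append(pdf)
    return failed
```

Collecting the failures instead of raising on the first one lets a long batch run to the end, after which the failed files can be retried by hand.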

Related

I want to listen to a PDF file as audio. My question is how to count down for 10 seconds with time.sleep() and have the countdown spoken through the speaker.

Here is my code:
import pyttsx3 as audio_converter
import PyPDF2 # version 1.26.0
import time
import sys
book = open("OOP.pdf", "rb")
pdfreader = PyPDF2.PdfFileReader(book)
pages = pdfreader.numPages # to find out number of pages
print(f'PDF name : "Object Oriented Programing (OOP)\nPage number : {pages}"')
speaker = audio_converter.init() # to init the package of pyttsx3
page = pdfreader.getPage(7) # pdf page number + 1 (index 0)
text = page.extractText()
speaker.say(f"Bismillah, Let's read the pdf {text}")
# pdf audio will start after the following commands
for i in range(5, 0, -1):
    sys.stdout.write(str(i)+' ')
    sys.stdout.flush()
    time.sleep(1)
speaker.runAndWait() # this line will read existing pdf only
This code reads my PDF aloud. I want the countdown in the for loop to be spoken as well, using speaker.say().
I expect the countdown from 10 to 0 to be pronounced through the speaker.
On the first line inside the for loop,
sys.stdout.write(str(i)+' ')
where the countdown is printed, I want to add a call such as
speaker.say(str(i)) # but this gives an error
Please assist me.

Updating the response:
To focus on the countdown: you need to call say() and then runAndWait() to make the library play the sound.
import pyttsx3 as audio_converter
import time
import sys
speaker = audio_converter.init() # to init the package of pyttsx3
# pdf audio will start after the following commands
for i in range(5, 0, -1):
    sys.stdout.write(str(i)+' ')
    sys.stdout.flush()
    speaker.say(str(i))
    speaker.runAndWait()
    time.sleep(1)
After that, you can do the same to read the pdf text
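As a rough sketch of how the countdown and the reading could be kept apart, the loop can be wrapped in a small function. The name `spoken_countdown` is made up for illustration; the function only assumes that the object passed in has the `say()` and `runAndWait()` methods used above.

```python
import time


def spoken_countdown(speaker, start=5, pause=1.0):
    """Say each number from `start` down to 1, waiting for each one to be spoken."""
    spoken = []
    for i in range(start, 0, -1):
        speaker.say(str(i))
        speaker.runAndWait()  # play the queued number before moving on
        spoken.append(i)
        time.sleep(pause)
    return spoken
```

With pyttsx3 this would be called as `spoken_countdown(audio_converter.init(), start=10)`, followed by `speaker.say(text)` and `speaker.runAndWait()` for the page text.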

Open and save a PDF file using Acrobat Reader in the background using Python

Following the Opening pdf file question
I am looking for a way to also command Adobe Acrobat Reader to save the file programmatically using Python.
I am not looking for the pikepdf way of saving the file.
Reason: This PDF file, created with fill-pdf, needs to go through special formatting done by Acrobat Reader upon opening. Upon exit Acrobat Reader asks whether to save the formatting it did, I need this "Yes, Save" to be via code.
Edit: How to proceed from here using pywinauto?
import time
from pywinauto.application import Application
pdf_file = r'C:\Path\To\Total.pdf'
acrobat_path = r"C:\Path\To\Acrobat.exe"
app = Application(backend=u'uia').start(cmd_line = acrobat_path + ' ' + pdf_file)
print("started")
time.sleep(1)
app = Application(backend=u'uia').connect(path=acrobat_path)
print("connected")

Solution with pyautogui:
import os
import time
import pyautogui as pg
# path is the full path to the PDF file (it was not defined in the original snippet)
os.startfile(path)
filename = os.path.basename(path)
focus = pg.getWindowsWithTitle(filename)
while len(focus) == 0:
    focus = pg.getWindowsWithTitle(filename)
focus[0].show() # show() for python3.9, activate() for python3.7
time.sleep(1)
pg.hotkey('ctrl', 's')
time.sleep(1)
pg.hotkey('ctrl', 'q')
print("Blessed Be God")

Print multiple pdf files in python

What I want to achieve is to print multiple PDF files from a folder, close Adobe Acrobat afterwards, and finally map the files to new folders using another of my scripts.
The mapping works as intended, but I'm struggling with printing all files. I can print either only one of the files or none, and Adobe still remains open. I've been playing around with asyncio, but I don't know if it gets the job done.
The code is not very well documented or of outstanding quality; it just has to get the task done and will very likely never be touched again. Here it is:
import os
import sys
import keyboard
import asyncio
import psutil
from win32 import win32api, win32print
import map_files
import utility
def prepareFilesToPrint(folder):
    # Scans folder for files with naming convention and puts them in a separate array to print
    filesToPrint = []
    for file in os.listdir(folder.value):
        if utility.checkFileName(file):
            filesToPrint.append(file)
    return filesToPrint
def preparePrinter():
    # Opens the printer and defines attributes such as duplex mode
    name = win32print.GetDefaultPrinter()
    printdefaults = {"DesiredAccess": win32print.PRINTER_ALL_ACCESS}
    handle = win32print.OpenPrinter(name, printdefaults)
    attributes = win32print.GetPrinter(handle, 2)
    attributes['pDevMode'].Duplex = 2 # flip on long edge
    win32print.SetPrinter(handle, 2, attributes, 0)
    return handle
async def printFiles(filesToPrint):
    for file in filesToPrint:
        await win32api.ShellExecute(
            0, "print", file, '"%s"' % win32print.GetDefaultPrinter(), ".", 0)
def cleanup(handle):
    # Closes Adobe after printing ALL files (!working)
    win32print.ClosePrinter(handle)
    for p in psutil.process_iter():
        if 'AcroRd' in str(p):
            p.kill()
async def printTaskFiles():
    # Iterates over files in downloads folder and prints them if they are task sheets (!working)
    os.chdir("C:/Users/Gebker/Downloads/")
    filesToPrint = prepareFilesToPrint(utility.Folder.DOWNLOADS)
    if filesToPrint.__len__() == 0:
        print("No Files to print. Exiting...")
        sys.exit()
    print("=============================================================")
    print("The following files will be printed:")
    for file in filesToPrint:
        print(file)
    print("=============================================================")
    input("Press ENTER to print. Exit with ESC")
    while True:
        try:
            if keyboard.is_pressed('ENTER'):
                print("ENTER pressed. Printing...")
                handle = preparePrinter()
                await printFiles(filesToPrint)
                cleanup(handle)
                print("Done printing. Mapping files now...")
                # map_files.scanFolders()
                break
            elif keyboard.is_pressed('ESC'):
                print("ESC pressed. Exiting...")
                sys.exit()
        except:
            break
if __name__ == "__main__":
    asyncio.run(printTaskFiles())

Selenium give file name when downloading

I am working with a Selenium script where I am trying to download an Excel file and give it a specific name.
Is there any way that I can give the file being downloaded a specific name?
Code:
#!/usr/bin/python
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
profile = FirefoxProfile()
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "text/plain, application/vnd.ms-excel, text/csv, text/comma-separated-values, application/octet-stream")
profile.set_preference("browser.download.dir", "C:\\Downloads" )
browser = webdriver.Firefox(firefox_profile=profile)
browser.get('https://test.com/')
browser.find_element_by_partial_link_text("Excel").click() # Download file

Here is another simple solution, where you can wait until the download completed and then get the downloaded file name from chrome downloads.
Chrome:
# method to get the downloaded file name
def getDownLoadedFileName(waitTime):
    driver.execute_script("window.open()")
    # switch to new tab
    driver.switch_to.window(driver.window_handles[-1])
    # navigate to chrome downloads
    driver.get('chrome://downloads')
    # define the endTime
    endTime = time.time()+waitTime
    while True:
        try:
            # get downloaded percentage
            downloadPercentage = driver.execute_script(
                "return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('#progress').value")
            # check if downloadPercentage is 100 (otherwise the script will keep waiting)
            if downloadPercentage == 100:
                # return the file name once the download is completed
                return driver.execute_script("return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('div#content #file-link').text")
        except:
            pass
        time.sleep(1)
        if time.time() > endTime:
            break
Firefox:
def getDownLoadedFileName(waitTime):
    driver.execute_script("window.open()")
    WebDriverWait(driver,10).until(EC.new_window_is_opened)
    driver.switch_to.window(driver.window_handles[-1])
    driver.get("about:downloads")
    endTime = time.time()+waitTime
    while True:
        try:
            fileName = driver.execute_script("return document.querySelector('#contentAreaDownloadsView .downloadMainArea .downloadContainer description:nth-of-type(1)').value")
            if fileName:
                return fileName
        except:
            pass
        time.sleep(1)
        if time.time() > endTime:
            break
Once you click on the download link/button, just call the above method.
# click on download link
browser.find_element_by_partial_link_text("Excel").click()
# get the downloaded file name
latestDownloadedFileName = getDownLoadedFileName(180) #waiting 3 minutes to complete the download
print(latestDownloadedFileName)
JAVA + Chrome:
Here is the method in java.
public String waitUntilDonwloadCompleted(WebDriver driver) throws InterruptedException {
// Store the current window handle
String mainWindow = driver.getWindowHandle();
// open a new tab
JavascriptExecutor js = (JavascriptExecutor)driver;
js.executeScript("window.open()");
// switch to new tab
// Switch to new window opened
for(String winHandle : driver.getWindowHandles()){
driver.switchTo().window(winHandle);
}
// navigate to chrome downloads
driver.get("chrome://downloads");
JavascriptExecutor js1 = (JavascriptExecutor)driver;
// wait until the file is downloaded
Long percentage = (long) 0;
while ( percentage!= 100) {
try {
percentage = (Long) js1.executeScript("return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('#progress').value");
//System.out.println(percentage);
}catch (Exception e) {
// Nothing to do just wait
}
Thread.sleep(1000);
}
// get the latest downloaded file name
String fileName = (String) js1.executeScript("return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('div#content #file-link').text");
// get the latest downloaded file url
String sourceURL = (String) js1.executeScript("return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('div#content #file-link').href");
// file downloaded location
String donwloadedAt = (String) js1.executeScript("return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('div.is-active.focus-row-active #file-icon-wrapper img').src");
System.out.println("Download deatils");
System.out.println("File Name :-" + fileName);
System.out.println("Donwloaded path :- " + donwloadedAt);
System.out.println("Downloaded from url :- " + sourceURL);
// print the details
System.out.println(fileName);
System.out.println(sourceURL);
// close the downloads tab2
driver.close();
// switch back to main window
driver.switchTo().window(mainWindow);
return fileName;
}
This is how to call it in your Java code.
// download triggering step
downloadExe.click();
// now waituntil download finish and then get file name
System.out.println(waitUntilDonwloadCompleted(driver));
Output:
Download deatils
File Name :-RubyMine-2019.1.2 (7).exe
Donwloaded path :- chrome://fileicon/C%3A%5CUsers%5Csupputuri%5CDownloads%5CRubyMine-2019.1.2%20(7).exe?scale=1.25x
Downloaded from url :- https://download-cf.jetbrains.com/ruby/RubyMine-2019.1.2.exe
RubyMine-2019.1.2 (7).exe

You cannot specify name of download file through selenium. However, you can download the file, find the latest file in the downloaded folder, and rename as you want.
Note: borrowed methods from google searches may have errors. but you get the idea.
import os
import shutil
filename = max([Initial_path + "\\" + f for f in os.listdir(Initial_path)],key=os.path.getctime)
shutil.move(filename,os.path.join(Initial_path,r"newfilename.ext"))
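The same idea can be written as a small function with pathlib. This is a sketch, not a tested drop-in: `rename_newest_file` is an illustrative name, and it assumes the download has already finished when it is called.

```python
import os
import shutil
from pathlib import Path


def rename_newest_file(download_dir, new_name):
    """Move the most recently created regular file in `download_dir` to `new_name`."""
    folder = Path(download_dir)
    # Ignore subdirectories so that only real files are compared.
    candidates = [p for p in folder.iterdir() if p.is_file()]
    if not candidates:
        raise FileNotFoundError(f"no files in {folder}")
    newest = max(candidates, key=os.path.getctime)
    target = folder / new_name
    shutil.move(str(newest), str(target))
    return target
```

Working with full paths throughout avoids the problem of calling `os.path.getctime` on a bare file name.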

Hope this snippet is not that confusing. It took me a while to create this and is really useful, because there has not been a clear answer to this problem, with just this library.
import os
import time
def tiny_file_rename(newname, folder_of_download):
    filename = max([f for f in os.listdir(folder_of_download)], key=lambda xa : os.path.getctime(os.path.join(folder_of_download,xa)))
    if '.part' in filename:
        time.sleep(1)
        os.rename(os.path.join(folder_of_download, filename), os.path.join(folder_of_download, newname))
    else:
        os.rename(os.path.join(folder_of_download, filename),os.path.join(folder_of_download,newname))
Hope this saves someone's day, cheers.
EDIT: Thanks to @Om Prakash editing my code, it made me remember that I didn't explain the code thoroughly.
Using the max([]) function could lead to a race condition, leaving you with an empty or corrupted file (I know it from experience). You want to check whether the file is completely downloaded in the first place. This is because Selenium doesn't wait for the file download to complete, so when you check for the last created file, an incomplete file will show up in your generated list and it will try to move that file. And even then, you are better off waiting a little bit for the file to be free from Firefox.
EDIT 2: More Code
I was asked if 1 second was enough time and mostly it is, but in case you need to wait more than that you could change the above code to this:
import os
import time
def tiny_file_rename(newname, folder_of_download, time_to_wait=60):
    time_counter = 0
    filename = max([f for f in os.listdir(folder_of_download)], key=lambda xa : os.path.getctime(os.path.join(folder_of_download,xa)))
    while '.part' in filename:
        time.sleep(1)
        time_counter += 1
        if time_counter > time_to_wait:
            raise Exception('Waited too long for file to download')
    filename = max([f for f in os.listdir(folder_of_download)], key=lambda xa : os.path.getctime(os.path.join(folder_of_download,xa)))
    os.rename(os.path.join(folder_of_download, filename), os.path.join(folder_of_download, newname))
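A variation on the '.part' check is to re-list the folder on every pass and wait until no partial files remain. The sketch below makes some assumptions: `wait_for_downloads` is an illustrative name, and the suffix list covers only the temporary extensions commonly seen with Firefox (.part) and Chrome (.crdownload).

```python
import time
from pathlib import Path

# Assumed list of in-progress suffixes: Firefox writes .part, Chrome writes .crdownload.
PARTIAL_SUFFIXES = (".part", ".crdownload")


def wait_for_downloads(download_dir, timeout=60.0, poll=1.0):
    """Block until the folder has files and none of them is a partial download."""
    folder = Path(download_dir)
    deadline = time.monotonic() + timeout
    while True:
        # Re-read the directory on every pass so finished files are noticed.
        files = [p for p in folder.iterdir() if p.is_file()]
        partial = [p for p in files if p.suffix in PARTIAL_SUFFIXES]
        if files and not partial:
            return files
        if time.monotonic() >= deadline:
            raise TimeoutError(f"downloads in {folder} did not finish in {timeout}s")
        time.sleep(poll)
```

Once it returns, the rename step can run without racing the browser for the file.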

There is something I would correct in @parishodak's answer:
the filename there is only the relative path (here the name of the file), not the absolute path.
That is why @FreshRamen got the following error:
File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/genericpath.py",
line 72, in getctime return os.stat(filename).st_ctime OSError:
[Errno 2] No such file or directory: '.localized'
Here is the corrected code:
import os
import shutil
filepath = r'c:\downloads'
# build full paths so that os.path.getctime can find each file
filename = max([os.path.join(filepath, f) for f in os.listdir(filepath)], key=os.path.getctime)
# filename is already a full path; newfilename is the destination path of your choice
shutil.move(filename, newfilename)

I've come up with a different solution. Since you only care about the last downloaded file, why not download it into a dummy_dir, so that it is the only file in that directory? Once it's downloaded, you can move it to your destination_dir and change its name at the same time.
Here is an example that works with Firefox:
import os
import shutil
import time
def rename_last_downloaded_file(dummy_dir, destination_dir, new_file_name):
    def get_last_downloaded_file_path(dummy_dir):
        """ Return the last modified -in this case last downloaded- file path.
        This function is going to loop as long as the directory is empty.
        """
        while not os.listdir(dummy_dir):
            time.sleep(1)
        return max([os.path.join(dummy_dir, f) for f in os.listdir(dummy_dir)], key=os.path.getctime)
    while '.part' in get_last_downloaded_file_path(dummy_dir):
        time.sleep(1)
    shutil.move(get_last_downloaded_file_path(dummy_dir), os.path.join(destination_dir, new_file_name))
You can fiddle with the sleep time and add a TimeoutException as well, as you see fit.

Here is the code sample I used to download pdf with a specific file name. First you need to configure chrome webdriver with required options. Then after clicking the button (to open pdf popup window), call a function to wait for download to finish and rename the downloaded file.
import os
import time
import shutil
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
# function to wait for download to finish and then rename the latest downloaded file
def wait_for_download_and_rename(newFilename):
    # function to wait for all chrome downloads to finish
    def chrome_downloads(drv):
        if not "chrome://downloads" in drv.current_url: # if 'chrome downloads' is not current tab
            drv.execute_script("window.open('');") # open a new tab
            drv.switch_to.window(driver.window_handles[1]) # switch to the new tab
            drv.get("chrome://downloads/") # navigate to chrome downloads
        return drv.execute_script("""
            return document.querySelector('downloads-manager')
            .shadowRoot.querySelector('#downloadsList')
            .items.filter(e => e.state === 'COMPLETE')
            .map(e => e.filePath || e.file_path || e.fileUrl || e.file_url);
            """)
    # wait for all the downloads to be completed
    dld_file_paths = WebDriverWait(driver, 120, 1).until(chrome_downloads) # returns list of downloaded file paths
    # Close the current tab (chrome downloads)
    if "chrome://downloads" in driver.current_url:
        driver.close()
    # Switch back to original tab
    driver.switch_to.window(driver.window_handles[0])
    # get latest downloaded file name and path
    dlFilename = dld_file_paths[0] # latest downloaded file from the list
    # wait till downloaded file appears in download directory
    time_to_wait = 20 # adjust timeout as per your needs
    time_counter = 0
    while not os.path.isfile(dlFilename):
        time.sleep(1)
        time_counter += 1
        if time_counter > time_to_wait:
            break
    # rename the downloaded file
    shutil.move(dlFilename, os.path.join(download_dir,newFilename))
    return
# specify custom download directory
download_dir = r'c:\Downloads\pdf_reports'
# for configuring chrome pdf viewer for downloading pdf popup reports
chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option('prefs', {
    "download.default_directory": download_dir, # Set own Download path
    "download.prompt_for_download": False, # Do not ask for download at runtime
    "download.directory_upgrade": True, # Also needed to suppress download prompt
    "plugins.plugins_disabled": ["Chrome PDF Viewer"], # Disable this plugin
    "plugins.always_open_pdf_externally": True, # Enable this plugin
})
# get webdriver with options for configuring chrome pdf viewer
driver = webdriver.Chrome(options = chrome_options)
# open desired webpage
driver.get('https://mywebsite.com/mywebpage')
# click the button to open pdf popup
driver.find_element_by_id('someid').click()
# call the function to wait for download to finish and rename the downloaded file
wait_for_download_and_rename('My file.pdf')
# close the browser windows
driver.quit()
Set timeout (120) to the wait time as per your needs.

I am using the following function.
It checks for a file in the download location that you specify for Chrome/Selenium, and renames it only if the file was created at most 10 seconds ago (max_old_time). Otherwise, it waits for a maximum of 60 seconds (max_waiting_time).
Not sure if it is the best way, but it worked for me.
import os, shutil, time
from datetime import datetime
def rename_last_file(download_folder,destination_folder,newfilename):
    # Waits at most max_waiting_time seconds for a new file in the folder.
    max_waiting_time=60
    # Renames the file only if it was created less than max_old_time seconds before the call.
    max_old_time=10
    start_time=datetime.now().timestamp()
    while True:
        last_file=None
        last_file_time=0
        # find the most recently created regular file
        for current_file in os.listdir(download_folder):
            current_file_fullpath=os.path.join(download_folder, current_file)
            if os.path.isfile(current_file_fullpath):
                current_file_time=os.path.getctime(current_file_fullpath)
                if current_file_time>last_file_time:
                    last_file=current_file
                    last_file_time=current_file_time
        if last_file is not None and start_time-last_file_time<max_old_time:
            last_file_fullpath=os.path.join(download_folder, last_file)
            shutil.move(last_file_fullpath,os.path.join(destination_folder,newfilename))
            print(last_file_fullpath)
            return(0)
        elif (datetime.now().timestamp()-start_time)>max_waiting_time:
            print("exit")
            return(1)
        else:
            print("waiting file...")
            time.sleep(5)

Using @dmb's trick, I've made just one correction: after the .part check, below time.sleep(1), we must request the filename again. Otherwise, the line below will try to rename a .part file that no longer exists.

Here is a browser-agnostic solution that waits for the download to finish then returns the file name.
import os
import time
from datetime import datetime, timedelta
# DOWNLOAD_DIR must be set to your browser's download directory
def wait_for_download_and_get_file_name():
    print(f'Waiting for download to finish', end='')
    while True:
        # Get the name of the file with the latest creation time
        newest_file_name = max([os.path.join(DOWNLOAD_DIR, f) for f in os.listdir(DOWNLOAD_DIR)], key=os.path.getctime)
        # Get the creation time of the file
        file_creation_time = datetime.fromtimestamp(os.path.getctime(newest_file_name))
        five_seconds_ago = datetime.now() - timedelta(seconds=5)
        if file_creation_time < five_seconds_ago:
            # The file with the latest creation time is too old to be the file that we're waiting for
            print(f'.', end='')
            time.sleep(0.5)
        else:
            print(f'\nFinished downloading "{newest_file_name}"')
            break
    return newest_file_name
Caveat: this will not work if you have more than one thread or process downloading files to the same directory at the same time.
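One way to depend less on timestamps is to record the folder contents before triggering the download and then wait for a name that was not there before. This is a different technique from the function above, sketched under assumptions: `snapshot` and `wait_for_new_file` are illustrative names, and the partial-download suffixes are the ones commonly used by Firefox and Chrome. It still cannot tell concurrent downloads apart.

```python
import time
from pathlib import Path

# Assumed temporary suffixes for downloads that are still in progress.
PARTIAL_SUFFIXES = (".part", ".crdownload")


def snapshot(download_dir):
    """Return the set of file names currently in the download directory."""
    return {p.name for p in Path(download_dir).iterdir() if p.is_file()}


def wait_for_new_file(download_dir, before, timeout=30.0, poll=0.5):
    """Return the path of a finished file whose name was not in `before`."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        new_names = snapshot(download_dir) - before
        partial = [n for n in new_names if n.endswith(PARTIAL_SUFFIXES)]
        finished = sorted(n for n in new_names if not n.endswith(PARTIAL_SUFFIXES))
        # Only accept a new name once no new partial file is left beside it.
        if finished and not partial:
            return Path(download_dir) / finished[0]
        time.sleep(poll)
    raise TimeoutError("no new file appeared in the download directory")
```

The intended order is: call `snapshot()`, click the download link, then call `wait_for_new_file()` with the saved set.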

In my case I am downloading and renaming .csv files. I use files that have '_' in the name as a reference, but you can change '_' to suit your specific usage.
Add this block after the download in your Selenium script.
import os
string = 'SOMETHING_OR_VARIABLE'
path = r'PATH_WHERE_FILE_ARE_BEING_DOWNLOAD'
files = [i for i in os.listdir(path) if os.path.isfile(os.path.join(path,i)) and \
         '_' in i]
if files != []:
    print(files[0])
    os.rename(path + '\\' +files[0], path + '\\' +f'{string}.csv')
else:
    print('error')

You can download the file and name it at the same time using urlretrieve (in Python 3 it lives in urllib.request):
import urllib.request
url = browser.find_element_by_partial_link_text("Excel").get_attribute('href')
urllib.request.urlretrieve(url, "/choose/your/file_name.xlsx")

.doc to pdf using python

I am tasked with converting tons of .doc files to .pdf. The only way my supervisor wants me to do this is through MS Word 2010. I know I should be able to automate this with Python COM automation. The only problem is that I don't know how and where to start. I tried searching for some tutorials but was not able to find any (maybe I did, but I don't know what I'm looking for).
Right now I'm reading through this. I don't know how useful it is going to be.

A simple example using comtypes, converting a single file, input and output filenames given as commandline arguments:
import sys
import os
import comtypes.client
wdFormatPDF = 17
in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])
word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
You could also use pywin32, which would be the same except for:
import win32com.client
and then:
word = win32com.client.Dispatch('Word.Application')

You can use the docx2pdf python package to bulk convert docx to pdf. It can be used as both a CLI and a python library. It requires Microsoft Office to be installed and uses COM on Windows and AppleScript (JXA) on macOS.
from docx2pdf import convert
convert("input.docx")
convert("input.docx", "output.pdf")
convert("my_docx_folder/")
pip install docx2pdf
docx2pdf input.docx output.pdf
Disclaimer: I wrote the docx2pdf package. https://github.com/AlJohri/docx2pdf

I have tested many solutions, but none of them works efficiently on Linux distributions.
I recommend this solution:
import sys
import subprocess
import re
def convert_to(folder, source, timeout=None):
    args = [libreoffice_exec(), '--headless', '--convert-to', 'pdf', '--outdir', folder, source]
    process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
    filename = re.search('-> (.*?) using filter', process.stdout.decode())
    return filename.group(1)
def libreoffice_exec():
    # TODO: Provide support for more platforms
    if sys.platform == 'darwin':
        return '/Applications/LibreOffice.app/Contents/MacOS/soffice'
    return 'libreoffice'
and you call your function:
result = convert_to('TEMP Directory', 'Your File', timeout=15)
All resources:
https://michalzalecki.com/converting-docx-to-pdf-using-python/

I have worked on this problem for half a day, so I think I should share some of my experience. Steven's answer is right, but it fails on my computer. There are two key points to fix it:
(1). When I first create the 'Word.Application' object, I should make it (the Word app) visible before opening any documents. (Actually, I cannot explain why this works. If I do not do this on my computer, the program crashes when I try to open a document in invisible mode, and then the 'Word.Application' object is deleted by the OS.)
(2). After doing (1), the program sometimes works but still fails often. The error "COMError: (-2147418111, 'Call was rejected by callee.', (None, None, None, 0, None))" means that the COM server may not be able to respond so quickly. So I add a delay before trying to open a document.
After these two steps, the program works without failures. The demo code is below. If you have encountered the same problems, try following these two steps. Hope it helps.
import os
import comtypes.client
import time
wdFormatPDF = 17
# absolute path is needed
# be careful about the slash '\', use '\\' or '/' or raw string r"..."
in_file=r'absolute path of input docx file 1'
out_file=r'absolute path of output pdf file 1'
in_file2=r'absolute path of input docx file 2'
out_file2=r'absolute path of outputpdf file 2'
# print out filenames
print(in_file)
print(out_file)
print(in_file2)
print(out_file2)
# create COM object
word = comtypes.client.CreateObject('Word.Application')
# key point 1: make word visible before open a new document
word.Visible = True
# key point 2: wait for the COM Server to prepare well.
time.sleep(3)
# convert docx file 1 to pdf file 1
doc=word.Documents.Open(in_file) # open docx file 1
doc.SaveAs(out_file, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 1
word.Visible = False
# convert docx file 2 to pdf file 2
doc = word.Documents.Open(in_file2) # open docx file 2
doc.SaveAs(out_file2, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 2
word.Quit() # close Word Application

unoconv (written in Python) and OpenOffice running as a headless daemon.
https://github.com/unoconv/unoconv
http://dag.wiee.rs/home-made/unoconv/
Works very nicely for doc, docx, ppt, pptx, xls, xlsx.
Very useful if you need to convert docs or save/convert to certain formats on a server.

As an alternative to the SaveAs function, you could also use ExportAsFixedFormat which gives you access to the PDF options dialog you would normally see in Word. With this you can specify bookmarks and other document properties.
doc.ExportAsFixedFormat(OutputFileName=pdf_file,
ExportFormat=17, #17 = PDF output, 18=XPS output
OpenAfterExport=False,
OptimizeFor=0, #0=Print (higher res), 1=Screen (lower res)
CreateBookmarks=1, #0=No bookmarks, 1=Heading bookmarks only, 2=bookmarks match word bookmarks
DocStructureTags=True
);
The full list of function arguments is: 'OutputFileName', 'ExportFormat', 'OpenAfterExport', 'OptimizeFor', 'Range', 'From', 'To', 'Item', 'IncludeDocProps', 'KeepIRM', 'CreateBookmarks', 'DocStructureTags', 'BitmapMissingFonts', 'UseISO19005_1', 'FixedFormatExtClassPtr'

It's worth noting that Steven's answer works, but when using a for loop to export multiple files, make sure to place the CreateObject or Dispatch statement before the loop, since the object only needs to be created once. See my problem: Python win32com.client.Dispatch looping through Word documents and export to PDF; fails when next loop occurs
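A minimal sketch of that structure follows, with the Word object created once by the caller and reused for every file. `convert_all` is an illustrative name, and the function only relies on the `Documents.Open`, `SaveAs` and `Close` calls already used in the answers above.

```python
from pathlib import Path

WD_FORMAT_PDF = 17  # same value as wdFormatPDF in the answers above


def convert_all(doc_paths, word):
    """Convert each document to PDF using one already-created Word COM object."""
    converted = []
    for doc_path in doc_paths:
        src = Path(doc_path).resolve()  # Word needs absolute paths
        dst = src.with_suffix(".pdf")
        doc = word.Documents.Open(str(src))
        try:
            doc.SaveAs(str(dst), FileFormat=WD_FORMAT_PDF)
        finally:
            doc.Close()  # close the document even if SaveAs fails
        converted.append(dst)
    return converted
```

On Windows the caller would create the object once with `word = win32com.client.Dispatch('Word.Application')`, call `convert_all(paths, word)`, and call `word.Quit()` only after the loop has finished.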

If you don't mind using PowerShell, have a look at this Hey, Scripting Guy! article. The code presented could be adapted to use the wdFormatPDF enumeration value of WdSaveFormat (see here).
This blog article presents a different implementation of the same idea.

I have modified it for ppt support as well. My solution supports all the extensions listed below.
word_extensions = [".doc", ".odt", ".rtf", ".docx", ".dotm", ".docm"]
ppt_extensions = [".ppt", ".pptx"]
My Solution: Github Link
I have modified code from Docx2PDF

I tried the accepted answer but wasn't particularly keen on the bloated PDFs Word was producing which was usually an order of magnitude bigger than expected. After looking how to disable the dialogs when using a virtual PDF printer I came across Bullzip PDF Printer and I've been rather impressed with its features. It's now replaced the other virtual printers I used previously. You'll find a "free community edition" on their download page.
The COM API can be found here and a list of the usable settings can be found here. The settings are written to a "runonce" file which is used for one print job only and then removed automatically. When printing multiple PDFs we need to make sure one print job completes before starting another to ensure the settings are used correctly for each file.
import os, re, time, datetime, win32com.client

def print_to_Bullzip(file):
    util = win32com.client.Dispatch("Bullzip.PDFUtil")
    settings = win32com.client.Dispatch("Bullzip.PDFSettings")
    settings.PrinterName = util.DefaultPrinterName  # make sure we're controlling the right PDF printer
    # swap the file extension for .pdf / .status
    outputFile = re.sub(r"\.[^.]+$", ".pdf", file)
    statusFile = re.sub(r"\.[^.]+$", ".status", file)
    settings.SetValue("Output", outputFile)
    settings.SetValue("ConfirmOverwrite", "no")
    settings.SetValue("ShowSaveAS", "never")
    settings.SetValue("ShowSettings", "never")
    settings.SetValue("ShowPDF", "no")
    settings.SetValue("ShowProgress", "no")
    settings.SetValue("ShowProgressFinished", "no")  # disable balloon tip
    settings.SetValue("StatusFile", statusFile)  # created after print job
    settings.WriteSettings(True)  # write settings to the runonce.ini
    util.PrintFile(file, util.DefaultPrinterName)  # send to Bullzip virtual printer
    # wait until print job completes before continuing
    # otherwise settings for the next job may not be used
    timestamp = datetime.datetime.now()
    while (datetime.datetime.now() - timestamp).total_seconds() < 10:
        if os.path.exists(statusFile) and os.path.isfile(statusFile):
            error = util.ReadIniString(statusFile, "Status", "Errors", '')
            if error != "0":
                raise IOError("PDF was created with errors")
            os.remove(statusFile)
            return
        time.sleep(0.1)
    raise IOError("PDF creation timed out")
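The "wait for the status file" step is the part that keeps one print job from overlapping the next. Here is a minimal sketch of that polling idea on its own, with no Bullzip or COM calls involved. The helper name wait_for_file and its parameters are my own, not part of the Bullzip API.

```python
import os
import time


def wait_for_file(path, timeout=10.0, poll_interval=0.1):
    """Return True once `path` exists as a file, or False after `timeout` seconds."""
    # time.monotonic() is not affected by system clock changes,
    # which makes it a safer choice for measuring a timeout.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.isfile(path):
            return True
        time.sleep(poll_interval)
    # one last check in case the file appeared during the final sleep
    return os.path.isfile(path)
```

When converting a batch, you would call something like this after sending each job and only move on to the next file once it returns True.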

I was working with this solution but I needed to search all .docx, .dotm, .docm, .odt, .doc or .rtf and then turn them all to .pdf (python 3.7.5). Hope it works...
import os
import win32com.client

wdFormatPDF = 17  # Word's constant for the PDF save format
extensions = (".doc", ".odt", ".rtf", ".docx", ".dotm", ".docm")

# start Word once and reuse it for every document
word = win32com.client.Dispatch('Word.Application')
word.Visible = False
for root, dirs, files in os.walk(r'your directory here'):
    for f in files:
        if not f.endswith(extensions):
            continue
        print(f)
        in_file = os.path.join(root, f)
        # splitext handles both 3- and 4-letter extensions
        out_file = os.path.splitext(in_file)[0] + ".pdf"
        try:
            doc = word.Documents.Open(in_file)
            doc.SaveAs(out_file, FileFormat=wdFormatPDF)
            doc.Close()
            print('done')
            os.remove(in_file)  # delete the original after a successful conversion
        except Exception:
            print('could not open')
# quit only after the last document; Word cannot be used after Quit()
word.Quit()
The try/except is there for the documents Word couldn't open: instead of exiting on the first failure, the script keeps going until the last document.
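If you also want to know afterwards which files failed, the same try/except pattern can record the failures instead of only printing them. This is just a sketch: convert_all is a hypothetical helper name, and convert stands for whatever function does the actual Word conversion for one file.

```python
def convert_all(paths, convert):
    """Call convert(path) for every path and keep going when one fails.

    Returns two lists: the paths that converted, and (path, error message)
    pairs for the ones that raised an exception.
    """
    converted, failed = [], []
    for path in paths:
        try:
            convert(path)
        except Exception as exc:
            # remember the failure and carry on with the next document
            failed.append((path, str(exc)))
        else:
            converted.append(path)
    return converted, failed
```

At the end of the run you can print the failed list, or write it to a log, and retry those documents by hand.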

You should start by investigating so-called virtual PDF print drivers.
As soon as you find one, you should be able to write a batch file that prints your DOC files to PDF files. You can probably do this in Python too (set up the printer driver output and issue the document/print command in MS Word; the latter can be done from the command line, as far as I remember).
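One way to issue the print command from Python is os.startfile with the "print" verb, which asks Windows to print a file with whatever application is registered for it. The sketch below assumes that the virtual PDF printer is set as the default printer and that it has been configured not to prompt for an output file name. print_documents and its start parameter are names I made up for illustration.

```python
import os


def print_documents(paths, start=None):
    """Send each path to the default printer using the Windows "print" verb."""
    if start is None:
        # os.startfile only exists on Windows
        start = getattr(os, "startfile", None)
    if start is None:
        raise OSError("os.startfile is only available on Windows")
    for path in paths:
        # equivalent to right-clicking the file in Explorer and choosing Print
        start(path, "print")
```

Note that os.startfile returns immediately, so this does not tell you when a job has finished. You would still need to wait for the output file before relying on it.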

import docx2txt
from win32com import client
import os

files_from_folder = r"c:\doc"
txt_folder = os.path.join(files_from_folder, "txt_files")
os.makedirs(txt_folder, exist_ok=True)  # create the output folder once, up front

amount = 1
word = client.DispatchEx("Word.Application")
word.Visible = True
for filename in os.listdir(files_from_folder):
    print(filename)
    if filename.endswith('docx'):
        # .docx: extract the text directly, without Word
        text = docx2txt.process(os.path.join(files_from_folder, filename))
    elif filename.endswith('doc'):
        # .doc: let Word open the file and read its full text range
        doc = word.Documents.Open(os.path.join(files_from_folder, filename))
        text = doc.Range().Text
        doc.Close()
    else:
        continue
    print(f'{filename} transferred ({amount})')
    amount += 1
    # splitext keeps dots inside the file name, unlike split('.')[0]
    new_filename = os.path.splitext(filename)[0] + '.txt'
    with open(os.path.join(txt_folder, new_filename), 'w', encoding='utf-8') as t:
        t.write(text)
word.Quit()
The Source Code, see here:
https://neculaifantanaru.com/en/python-full-code-how-to-convert-doc-and-docx-files-to-pdf-from-the-folder.html

I would suggest ignoring your supervisor and using OpenOffice, which has a Python API. OpenOffice has built-in support for Python, and someone created a library specifically for this purpose (PyODConverter).
If he isn't happy with the output, tell him it could take you weeks to do it with Word.
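As a rough illustration of the office-suite route, here is a sketch that uses LibreOffice's headless command line rather than PyODConverter or the UNO API mentioned above, so treat it as a swapped-in alternative. It assumes LibreOffice is installed and that its soffice executable is on the PATH. The function names are my own.

```python
import subprocess


def build_convert_command(input_path, output_dir, soffice="soffice"):
    """Build the LibreOffice command line that converts one document to PDF."""
    return [
        soffice,
        "--headless",            # run without opening a window
        "--convert-to", "pdf",   # target format
        "--outdir", output_dir,  # folder that receives the PDF
        input_path,
    ]


def convert_to_pdf(input_path, output_dir, soffice="soffice"):
    """Run the conversion; raises CalledProcessError if soffice reports failure."""
    subprocess.run(build_convert_command(input_path, output_dir, soffice), check=True)
```

The layout of the resulting PDF may differ from what Word would produce, so check a few documents before converting a whole batch.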
