I'm trying to convert Word documents to PDF using Python. I'm currently on Python 3.8.
".doc to pdf using python"
import sys
import os
import comtypes.client
wdFormatPDF = 17
in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])
word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
The above code answers my question, but it works only on my machine. When I deploy it onto a VM and run it remotely using a scheduler, I get the error below:
File "File Name", line 10, in <module>
    word = comtypes.client.CreateObject('Word.Application')
File "Path\__init__.py", line 1225, in CoCreateInstance
----------------------------------------------------
File "_ctypes/callproc.c", line 930, in GetResult
PermissionError: [WinError -2147024891] Access is denied
FYI, Word is installed on the VM.
Thanks
Related
I am trying to read a pdf file which I have uploaded on an Azure storage account. I am trying to do this using python.
I have tried using the SAS token/URL of the file and passing it through PDFMiner, but I am not able to get a path to the file that PDFMiner will accept. I am using something like the code below:
from azure.storage.filedatalake import DataLakeServiceClient
from azure.storage.filedatalake import generate_file_sas
import os
storage_account_name = "mystorageaccount"
storage_account_key = "mystoragekey"
container_name = "mycontainer"
directory_name = 'mydirectory'
service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
"https", storage_account_name), credential=storage_account_key)
file_system_client = service_client.get_file_system_client(file_system=container_name)
directory_client = file_system_client.get_directory_client(directory_name)
file_client = directory_client.get_file_client('XXX.pdf')
download = file_client.download_file()
downloaded_bytes = download.readall()
file_sas = generate_file_sas(account_name=storage_account_name, file_system_name=container_name, directory_name=directory_name, file_name='XXX.pdf', credential=storage_account_key)
from pdfminer.pdfpage import PDFPage
with open(downloaded_bytes, 'rb') as infile:
    PDFPage.get_pages(infile, check_extractable=False)
from pdfminer.pdfpage import PDFPage
with open(file_sas, 'rb') as infile:
    PDFPage.get_pages(infile, check_extractable=False)
Neither option works.
Initially the input_dir was set up locally, so the code was able to fetch the PDF file and read it.
Is there a different way to pass the URL/path of the file from the storage account to the pdf's read function?
Any help is appreciated.
I tried this in my environment and got the results below.
Initially, I tried the same process without downloading the PDF file from the Azure Data Lake storage account and got no results. As far as I know, downloading the file first is the workable way to read it.
I then used the code below to read the PDF file with the PyPDF2 module, and it printed the content successfully.
Code:
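from azure.storage.filedatalake import DataLakeFileClient
import PyPDF2
# Client for a single file in the "test" file system
file_client = DataLakeFileClient.from_connection_string("<your storage connection string>", file_system_name="test", file_path="dem.pdf")
# Download the remote file to a local copy
with open("dem.pdf", 'wb') as file:
    data = file_client.download_file()
    data.readinto(file)
# Read the local copy (PyPDF2 1.x/2.x names; PyPDF2 3.x renames these to
# PdfReader, len(reader.pages) and page.extract_text())
with open("dem.pdf", 'rb') as pdf_file:
    pdfread = PyPDF2.PdfFileReader(pdf_file)
    print("Number of pages:", pdfread.numPages)
    pageObj = pdfread.getPage(0)
    print(pageObj.extractText())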
You can also read the PDF file in a browser using the file URL with the SAS token appended:
https://<storage account name>.dfs.core.windows.net/test/dem.pdf?<sas-token>
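On the two attempts in the question: `open()` expects a file path, so it cannot take the downloaded bytes or a SAS token directly. A minimal sketch of one alternative, assuming pdfminer.six is installed, wraps the downloaded bytes in an in-memory stream so that no local copy is needed. The helper names `as_pdf_stream` and `count_pages` are made up for illustration.

```python
import io


def as_pdf_stream(downloaded_bytes: bytes) -> io.BytesIO:
    """Wrap raw bytes in a seekable, binary file-like object."""
    return io.BytesIO(downloaded_bytes)


def count_pages(downloaded_bytes: bytes) -> int:
    """Count the pages of a PDF held in memory (assumes pdfminer.six)."""
    # Imported lazily so as_pdf_stream stays usable without pdfminer.
    from pdfminer.pdfpage import PDFPage

    stream = as_pdf_stream(downloaded_bytes)
    return sum(1 for _ in PDFPage.get_pages(stream, check_extractable=False))
```

With the question's variables, this would be called as `count_pages(downloaded_bytes)` right after `download.readall()`.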
import os
import glob
import comtypes.client
from PyPDF2 import PdfFileMerger
def docxs_to_pdf():
    """Converts all word files in pdfs and append them to pdfslist"""
    word = comtypes.client.CreateObject('Word.Application')
    pdfslist = PdfFileMerger()
    x = 0
    for f in glob.glob("*.docx"):
        input_file = os.path.abspath(f)
        output_file = os.path.abspath("demo" + str(x) + ".pdf")
        # loads each word document
        doc = word.Documents.Open(input_file)
        doc.SaveAs(output_file, FileFormat=16+1)
        doc.Close()  # Closes the document, not the application
        pdfslist.append(open(output_file, 'rb'))
        x += 1
    word.Quit()
    return pdfslist
def joinpdf(pdfs):
    """Unite all pdfs"""
    with open("result.pdf", "wb") as result_pdf:
        pdfs.write(result_pdf)
def main():
    """docxs to pdfs: Open Word, create pdfs, close word, unite pdfs"""
    pdfs = docxs_to_pdf()
    joinpdf(pdfs)
main()
I am using Jupyter Notebook and this code throws an error. What should I do?
This is the error message:
I want to convert many .doc files into one PDF. Please help me; I am a beginner in this field.
Make sure you have all the dependencies installed in your environment. comtypes.client is part of the comtypes package, which you can install with pip by running this in your terminal:
pip install comtypes
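You can download _ctypes from SourceForge (note that ctypes has been part of the Python standard library since Python 2.5, so this is normally only needed for very old interpreters):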
https://sourceforge.net/projects/ctypes/files/ctypes/1.0.2/ctypes-1.0.2.tar.gz/download?use_mirror=deac-fra
Using docx2pdf does seem easier for your task though. After you converted the files you can use PyPDF2 to append them.
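A minimal sketch of that suggestion, assuming docx2pdf and PyPDF2 2.x or later are installed and that Microsoft Word is available (docx2pdf drives Word to do the conversion). The function names `planned_pdf_paths` and `convert_and_merge` and the default `result.pdf` are illustrative choices, not part of either library.

```python
from pathlib import Path


def planned_pdf_paths(folder, pattern="*.docx"):
    """Pair every matching Word file with the PDF path it will be written to."""
    return [(p, p.with_suffix(".pdf")) for p in sorted(Path(folder).glob(pattern))]


def convert_and_merge(folder, result="result.pdf"):
    """Convert each .docx in folder to PDF, then append them into one file."""
    # Third-party imports are kept local so the planning helper has no dependencies.
    from docx2pdf import convert
    from PyPDF2 import PdfMerger

    merger = PdfMerger()
    for src, dst in planned_pdf_paths(folder):
        convert(str(src), str(dst))  # one PDF per Word document
        merger.append(str(dst))      # queue it for the combined file
    with open(result, "wb") as fh:
        merger.write(fh)
    merger.close()
```

Sorting the file names keeps the page order of the merged PDF predictable between runs.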
I am wondering if there is a way in python (tool or function etc.) to convert my pdf file to doc or docx?
I am aware of online converters but I need this in Python code.
If your PDF has many pages, the code below extracts the text from every page:
import PyPDF2
path = "C:\\ .... "
text = ""
# Open the PDF and concatenate the text of all pages
with open(path, 'rb') as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    c = read_pdf.numPages
    for i in range(c):
        page = read_pdf.getPage(i)
        text += page.extractText()
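The snippet above only collects text; it does not write a Word file. One possible follow-up step, sketched here under the assumption that the python-docx package is installed, saves that text as paragraphs in a .docx file. Layout, fonts and images from the PDF are not preserved by this approach, and the helper names `split_paragraphs` and `save_text_as_docx` are hypothetical.

```python
def split_paragraphs(text):
    """Split extracted text on blank lines and normalise inner whitespace."""
    chunks = [" ".join(part.split()) for part in text.split("\n\n")]
    return [chunk for chunk in chunks if chunk]


def save_text_as_docx(text, out_path):
    """Write each paragraph of text into a new .docx file (assumes python-docx)."""
    from docx import Document  # imported lazily; provided by python-docx

    document = Document()
    for paragraph in split_paragraphs(text):
        document.add_paragraph(paragraph)
    document.save(out_path)
```

For example, `save_text_as_docx(text, "output.docx")` could be called once the loop has finished.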
If you happen to have MS Word, there is a really simple way to do this using COM.
Here is a script I wrote that can convert pdf to docx by calling the Word application.
import glob
import win32com.client
import os
word = win32com.client.Dispatch("Word.Application")
word.visible = 0
pdfs_path = ""  # folder where the .pdf files are stored
reqs_path = ""  # output folder for the .docx files (not defined in the original script; assumed)
for doc in glob.iglob(os.path.join(pdfs_path, "*.pdf")):
    print(doc)
    filename = os.path.basename(doc)
    in_file = os.path.abspath(doc)
    print(in_file)
    wb = word.Documents.Open(in_file)
    # Same base name as the PDF, with a .docx extension
    out_file = os.path.abspath(os.path.join(reqs_path, filename[0:-4] + ".docx"))
    print("outfile\n", out_file)
    wb.SaveAs2(out_file, FileFormat=16)  # file format for docx
    print("success...")
    wb.Close()
word.Quit()
I am new to Python and have been given the task of converting RTF to PDF using Python. I googled and found some code (not exactly for RTF to PDF), which I tried to adapt to my requirement, but I have not been able to get it working.
I have used the below code:
import sys
import os
import comtypes.client
#import win32com.client
rtfFormatPDF = 17
in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])
rtf= comtypes.client.CreateObject('Rtf.Application')
rtf.Visible = True
doc = rtf.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=rtfFormatPDF)
doc.Close()
rtf.Quit()
But it throws the error below:
Traceback (most recent call last):
File "C:/Python34/Lib/idlelib/rtf_to_pdf.py", line 12, in <module>
word = comtypes.client.CreateObject('Rtf.Application')
File "C:\Python34\lib\site-packages\comtypes\client\__init__.py", line 227, in CreateObject
clsid = comtypes.GUID.from_progid(progid)
File "C:\Python34\lib\site-packages\comtypes\GUID.py", line 78, in from_progid
_CLSIDFromProgID(str(progid), byref(inst))
File "_ctypes/callproc.c", line 920, in GetResult
OSError: [WinError -2147221005] Invalid class string
Can anyone help me with this?
I would really appreciate it if someone could suggest a better and faster way of doing this. I have around 200,000 files to convert.
Anisha
I followed Mark's advice, changed it back to Word.Application and pointed my source at the RTF files. It works perfectly! The process was slow, but still faster than the Java application my team was using. I have attached the final code to my question.
Final code:
I got it done using the following code, which works with the Word application:
import os
import comtypes.client
wdFormatPDF = 17
input_dir = 'input directory'
output_dir = 'output directory'
# Start Word once and reuse it for every file
word = comtypes.client.CreateObject('Word.Application')
try:
    for subdir, dirs, files in os.walk(input_dir):
        for file in files:
            # Word needs absolute paths
            in_file = os.path.abspath(os.path.join(subdir, file))
            output_file = os.path.splitext(file)[0]
            out_file = os.path.abspath(os.path.join(output_dir, output_file + '.pdf'))
            doc = word.Documents.Open(in_file)
            doc.SaveAs(out_file, FileFormat=wdFormatPDF)
            doc.Close()
finally:
    word.Quit()
If you have LibreOffice installed on your system, it offers a very good solution.
import os
os.system('soffice --headless --convert-to pdf filename.rtf')
# os.system('libreoffice --headless -convert-to pdf filename.rtf')
# os.system('libreoffice6.3 --headless -convert-to pdf filename.rtf')
The command may vary between versions and platforms, but this is the best solution I have found.
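The same call can be sketched with `subprocess.run`, which passes the arguments as a list (so file names with spaces need no quoting) and raises an error if the conversion fails. This assumes the `soffice` binary is on the PATH; as noted above, the binary name may differ on your system. The function names are illustrative.

```python
import subprocess
from pathlib import Path


def build_soffice_command(src, out_dir, binary="soffice"):
    """Build the argument list for a headless LibreOffice PDF conversion."""
    return [binary, "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(src)]


def convert_to_pdf(src, out_dir, binary="soffice"):
    """Run the conversion and return the path where the PDF is expected."""
    subprocess.run(build_soffice_command(src, out_dir, binary), check=True)
    return Path(out_dir) / (Path(src).stem + ".pdf")
```

For example, `convert_to_pdf("filename.rtf", "out")` would be expected to produce `out/filename.pdf`.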
I need to insert an image into a Word document and then delete the image from the hard drive.
from win32com import client
import os
client.pythoncom.CoInitialize()
word = client.Dispatch("Word.Application")
doc = word.Documents.Open(r"C:\R1234.docx")
newShape = doc.Bookmarks("layout1").Range.InlineShapes.AddPicture(r"C:\picture.png", False, True)
doc.Close()
word.Quit()
os.remove(r"C:\picture.png")
Looks very simple but returns an error trying to remove the picture:
WindowsError: [Error 32] The process cannot access the file because it
is being used by another process: 'C:\picture.png'