.doc to pdf using python errors - python

I'm trying to convert word documents to PDF using python.Currently on python 3.8.
".doc to pdf using python"
import sys
import os
import comtypes.client
wdFormatPDF = 17
in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])
word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
above code answers my question but it works only on my machine when I deploy it onto VM and run it using scheduler remotely, I get below error
File "File Name", line 10, in <module>
word = comtypes.client.CreateObject('Word.Application')
File "Path", init__.py",line1225, in CoCreateInstance
----------------------------------------------------
File"_ctypes/callproc.c",line 930, in getResult
PermissionError:[WinError -2147024891] Access is denied
FYI, word is installed on VM
Thanks

Related

Read pdf file from storage account (Azure Data lake) without downloading it using python

I am trying to read a pdf file which I have uploaded on an Azure storage account. I am trying to do this using python.
I have tried using the SAS token/URL of the file and pass it thorugh PDFMiner but I am not able get the path of the file which will be accepted by PDFMiner. I am using something like the below code:
from azure.storage.filedatalake import DataLakeServiceClient
from azure.storage.filedatalake import generate_file_sas
import os
storage_account_name = "mystorageaccount"
storage_account_key = "mystoragekey"
container_name = "mycontainer"
directory_name = 'mydirectory'
service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
"https", storage_account_name), credential=storage_account_key)
file_system_client = service_client.get_file_system_client(file_system=container_name)
directory_client = file_system_client.get_directory_client(directory_name)
file_client = directory_client.get_file_client('XXX.pdf')
download = file_client.download_file()
downloaded_bytes = download.readall()
file_sas = generate_file_sas(account_name= storage_account_name,file_system_name= container_name,directory_name= directory_name,file_name= dir_name,credential= storage_account_key)
from pdfminer.pdfpage import PDFPage
with open(downloaded_bytes, 'rb') as infile:
PDFPage.get_pages(infile, check_extractable=False)
from pdfminer.pdfpage import PDFPage
with open(file_sas, 'rb') as infile:
PDFPage.get_pages(infile, check_extractable=False)
Neither of the options are working.
Initially the input_dir was setup locally, so the code was able to fetch the pdf file and read it.
Is there a different way to pass the URL/path of the file from the storage account to the pdf's read function?
Any help is appreciated.
I tried in my environment and got below results:
Initially, I tried with same process without downloading the Pdf files from azure Datalake storage account and got no results. But AFAIK, to read the pdf file with downloading is possible way.
I tried with below code to read pdf file with Module PyPDF2, and it executed with content successfully.
Code:
from azure.storage.filedatalake import DataLakeFileClient
import PyPDF2
service_client = DataLakeFileClient.from_connection_string("<your storage connection string>",file_system_name="test",file_path="dem.pdf")
with open("dem.pdf", 'wb') as file:
data = service_client.download_file()
data.readinto(file)
object=open("dem.pdf",'rb')
pdfread=PyPDF2.PdfFileReader(object)
print("Number of pages:",pdfread.numPages)
pageObj = pdfread.getPage(0)
print(pageObj.extractText())
Console:
You can also read the pdf file through browser using file URL:
https://<storage account name >.dfs.core.windows.net/test/dem.pdf+? sas-token
Browser:

ImportError: cannot import name 'COMError' from '_ctypes' (/usr/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so)

import os
import glob
import comtypes.client
from PyPDF2 import PdfFileMerger
def docxs_to_pdf():
"""Converts all word files in pdfs and append them to pdfslist"""
word = comtypes.client.CreateObject('Word.Application')
pdfslist = PdfFileMerger()
x = 0
for f in glob.glob("*.docx"):
input_file = os.path.abspath(f)
output_file = os.path.abspath("demo" + str(x) + ".pdf")
# loads each word document
doc = word.Documents.Open(input_file)
doc.SaveAs(output_file, FileFormat=16+1)
doc.Close() # Closes the document, not the application
pdfslist.append(open(output_file, 'rb'))
x += 1
word.Quit()
return pdfslist
def joinpdf(pdfs):
"""Unite all pdfs"""
with open("result.pdf", "wb") as result_pdf:
pdfs.write(result_pdf)
def main():
"""docxs to pdfs: Open Word, create pdfs, close word, unite pdfs"""
pdfs = docxs_to_pdf()
joinpdf(pdfs)
main()
I am using jupyter notebook and it throw an error what should I do :
this is error message
I am going to convert many .doc file to one pdf. Help me I am beginner in this field.
Make sure you have all the dependencies installed in your environment. You can use pip to install comtypes.client, simply pass this in your terminal:
pip install comtypes
You can download _ctypes from sourceforge:
https://sourceforge.net/projects/ctypes/files/ctypes/1.0.2/ctypes-1.0.2.tar.gz/download?use_mirror=deac-fra
Using docx2pdf does seem easier for your task though. After you converted the files you can use PyPDF2 to append them.

Is there any way to convert Pdf file to Docx using python

I am wondering if there is a way in python (tool or function etc.) to convert my pdf file to doc or docx?
I am aware of online converters but I need this in Python code.
If you have pdf with lot of pages..below code will work:
import PyPDF2
path="C:\\ .... "
text=""
pdf_file = open(path, 'rb')
text =""
read_pdf = PyPDF2.PdfFileReader(pdf_file)
c = read_pdf.numPages
for i in range(c):
page = read_pdf.getPage(i)
text+=(page.extractText())
If you happen to have MS Word, there is a really simple way to do this using COM.
Here is a script I wrote that can convert pdf to docx by calling the Word application.
import glob
import win32com.client
import os
word = win32com.client.Dispatch("Word.Application")
word.visible = 0
pdfs_path = "" # folder where the .pdf files are stored
for i, doc in enumerate(glob.iglob(pdfs_path+"*.pdf")):
print(doc)
filename = doc.split('\\')[-1]
in_file = os.path.abspath(doc)
print(in_file)
wb = word.Documents.Open(in_file)
out_file = os.path.abspath(reqs_path +filename[0:-4]+ ".docx".format(i))
print("outfile\n",out_file)
wb.SaveAs2(out_file, FileFormat=16) # file format for docx
print("success...")
wb.Close()
word.Quit()

Converting rtf to pdf using python

I am new to the python language and I am given a task to convert rtf to pdf using python. I googled and found some code- (not exactly rtf to pdf) but I tried working on it and changed it according to my requirement. But I am not able to solve it.
I have used the below code:
import sys
import os
import comtypes.client
#import win32com.client
rtfFormatPDF = 17
in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])
rtf= comtypes.client.CreateObject('Rtf.Application')
rtf.Visible = True
doc = rtf.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=rtfFormatPDF)
doc.Close()
rtf.Quit()
But its throwing the below error
Traceback (most recent call last):
File "C:/Python34/Lib/idlelib/rtf_to_pdf.py", line 12, in <module>
word = comtypes.client.CreateObject('Rtf.Application')
File "C:\Python34\lib\site-packages\comtypes\client\__init__.py", line 227, in CreateObject
clsid = comtypes.GUID.from_progid(progid)
File "C:\Python34\lib\site-packages\comtypes\GUID.py", line 78, in from_progid
_CLSIDFromProgID(str(progid), byref(inst))
File "_ctypes/callproc.c", line 920, in GetResult
OSError: [WinError -2147221005] Invalid class string
Can anyone help me with this?
I would really appreciate if someone can find the better and fast way of doing it. I have around 200,000 files to convert.
Anisha
I used Marks's advice and changed it back to Word.Application and my source pointing to rtf files. Works perfectly! - the process was slow but still faster than the JAVA application which my team was using. I have attached the final code in my question.
Final Code:
Got it done using the code which works with Word application :
import sys
import os,os.path
import comtypes.client
wdFormatPDF = 17
input_dir = 'input directory'
output_dir = 'output directory'
for subdir, dirs, files in os.walk(input_dir):
for file in files:
in_file = os.path.join(subdir, file)
output_file = file.split('.')[0]
out_file = output_dir+output_file+'.pdf'
word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
If you have Libre Office in your system, you got the best solution.
import os
os.system('soffice --headless --convert-to pdf filename.rtf')
# os.system('libreoffice --headless -convert-to pdf filename.rtf')
# os.system('libreoffice6.3 --headless -convert-to pdf filename.rtf')
Commands may vary to different versions and platforms. But this would be the best solution ever I had.

can't delete picture from folder after inserting it in ms-word using win32com

I need to insert an image to word document and then delete it from the hard drive.
from win32com import client
import os
client.pythoncom.CoInitialize()
word = client.Dispatch("Word.Application")
doc = word.Documents.Open("C:\R1234.docx")
newShape = doc.Bookmarks("layout1").Range.InlineShapes.AddPicture("C:\picture.png", False, True)
doc.Close()
word.Quit()
os.remove("C:\picture.png")
Looks very simple but returns an error trying to remove the picture:
WindowsError: [Error 32] The process cannot access the file because it
is being used by another process: 'C:\picture.png'

Categories