I'm using comtypes with Python 3.6 and trying to read Office documents, as I need to extract the text from these files.
I understand that for Word and PowerPoint this is how to open files using comtypes:
import comtypes.client
word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(filename)
ppt = comtypes.client.CreateObject('PowerPoint.Application')
prs = ppt.Presentations.Open(filename)
What about Outlook files (.msg)? I tried the following code, but it doesn't work:
ol = comtypes.client.CreateObject('Outlook.Application')
msg = ol.MailItem.Open(filename)
I've resorted to using the approach from this thread instead of what I was testing in my question.
Related
I have a MS Word document that consists of several hundred pages.
Each page is identical apart from the name of a person which is unique across each page. (One page is one user).
I would like to take this Word document and automate saving each page individually, so that I end up with several hundred Word documents, one per person, rather than one document containing everyone. I can then distribute them to the different people.
I have been using the module python-docx found here: https://python-docx.readthedocs.io/en/latest/
I am struggling with how to achieve this task.
As far as I have researched, it is not possible to loop over each page, because pages are not defined in the .docx file itself but are generated by the program that renders it, i.e. Microsoft Word.
However, python-docx can read the text, and since each page is the same, can I not tell Python: when you see this text (the last piece of text on a given page), treat it as the end of a page, and treat anything after it as a new page?
Ideally I could write a loop that looks for such a point, creates a document up to that point, and repeats this over all pages. It would need to carry over all formatting and pictures as well.
I am not against other methods, such as converting to PDF first if that is an option.
Any thoughts?
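One way to picture the marker idea in plain Python, leaving python-docx out of it: the sketch below groups a flat list of paragraph strings into pages, closing a page whenever a paragraph contains the end-of-page text. The function name split_on_marker, the sample paragraphs and the marker string are all made up for illustration. It only handles text; carrying formatting and pictures across would need more work.

```python
def split_on_marker(paragraphs, marker):
    """Group paragraph strings into pages; a page ends at a paragraph containing marker."""
    pages, current = [], []
    for text in paragraphs:
        current.append(text)
        if marker in text:
            pages.append(current)
            current = []
    if current:  # trailing paragraphs after the last marker
        pages.append(current)
    return pages


# Hypothetical sample data standing in for the text of doc.paragraphs
paragraphs = [
    "Dear Alice", "Your details are below.", "Kind regards, HR",
    "Dear Bob", "Your details are below.", "Kind regards, HR",
]
pages = split_on_marker(paragraphs, "Kind regards")
for number, page in enumerate(pages, start=1):
    print(number, page[0])
```

With a real document, the input list could be built from the text of each paragraph, and each resulting group would then be written to its own file.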
I had the exact same problem. Unfortunately I could not find a way to split a .docx by page. The solution was to first use python-docx or docx2python (whichever you prefer) to iterate over the document, extract the unique (person) information from each page, and put it in a list, so you end up with:
people = ['person_A', 'person_B', 'person_C', ....]
Then save the .docx as a PDF, split the PDF by page, and save each page as person_A.pdf, etc., like this:
# PyPDF2 >= 2.0 names; older releases used PdfFileReader/PdfFileWriter
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("document.pdf")
# people[i] must hold the name that belongs to page i
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"{people[i]}.pdf", "wb") as output_stream:
        writer.write(output_stream)
The result is a set of one-page PDFs saved as person_A.pdf, person_B.pdf, etc.
Hope that helps.
I would suggest another package, aspose-words-cloud, to split a Word document into separate pages. Currently it works with cloud storage (Aspose cloud storage, Amazon S3, Dropbox, Google Drive Storage, Google Cloud Storage, Windows Azure Storage and FTP Storage). However, in the near future it will support processing files from the request body (streams).
P.S. I am a developer evangelist at Aspose.
# For complete examples and data files, please go to https://github.com/aspose-words-cloud/aspose-words-cloud-python
import os
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile
# Please get your Client ID and Secret from https://dashboard.aspose.cloud.
client_id='xxxxx-xxxxx-xxxx-xxxxx-xxxxxxxxxxx'
client_secret='xxxxxxxxxxxxxxxxxx'
words_api = asposewordscloud.WordsApi(client_id,client_secret)
words_api.api_client.configuration.host='https://api.aspose.cloud'
remoteFolder = 'Temp'
localFolder = 'C:/Temp'
localFileName = '02_pages.docx'
remoteFileName = '02_pages.docx'
#upload file
words_api.upload_file(asposewordscloud.models.requests.UploadFileRequest(open(localFolder + '/' + localFileName,'rb'),remoteFolder + '/' + remoteFileName))
#Split DOCX pages as a zip file
request = asposewordscloud.models.requests.SplitDocumentRequest(name=remoteFileName, format='docx', folder=remoteFolder, zip_output= 'true')
result = words_api.split_document(request)
print("Result {}".format(result.split_result.zipped_pages.href))
#download file
request_download=asposewordscloud.models.requests.DownloadFileRequest(result.split_result.zipped_pages.href)
response_download = words_api.download_file(request_download)
copyfile(response_download, 'C:/'+ result.split_result.zipped_pages.href)
For a DOCX document I do:
import zipfile
from bs4 import BeautifulSoup
document = zipfile.ZipFile(path)
soup = BeautifulSoup(document.read('word/document.xml'), 'html.parser')
How do I do this for a DOC document?
You don't.
DOCX files are tough enough to process, even though they're XML-based and documented by international standards organizations. DOC files are binary and proprietary.
Don't try to process DOC files directly. Convert them to DOCX first.
See:
Convert .doc to .docx using C#
Automation: how to automate transforming .doc to .docx?
multiple .doc to .docx file conversion using python
Python & MS Word: Convert .doc to .docx?
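If Word itself is not available, one possible route (an assumption on my part, not necessarily what the linked threads describe) is LibreOffice's headless converter. The sketch below only wraps the soffice command line. The helper names build_convert_command and convert_doc_to_docx and the file paths are invented for illustration, and it assumes soffice is installed and on the PATH.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir):
    """Return the soffice argument list that converts doc_path to .docx inside out_dir."""
    return [
        "soffice", "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_doc_to_docx(doc_path, out_dir):
    """Run the conversion and return the path where the .docx is expected to appear."""
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".docx")


# Hypothetical paths; building and printing the command has no side effects
command = build_convert_command("report.doc", "converted")
print(" ".join(command))
```

Once the .docx exists, the zipfile approach from the question applies unchanged.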
Below is a simple example that writes an XML file and tries to read it back. The writing works OK, but I am not sure how to read the file back. Below is some sample code. How do I get these values from the XML file?
file1 = 'result1.xml'
fs = cv2.FileStorage(file1, cv2.FILE_STORAGE_WRITE)
fs.write('var1', 1)
fs.write('var2', 2)
fs = cv2.FileStorage(file1,cv2.FILE_STORAGE_READ)
fn = fs.real
Python ships with its own library for parsing XML data, in every version.
Here is where you can find the documentation: XML Library
Be careful when using it: as the documentation page warns, this library is not secure against maliciously constructed XML data.
Here is another useful website: How to parse XML files using Python?
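As a rough illustration of the standard-library route, the sketch below parses an XML string with xml.etree.ElementTree. The XML layout is my assumption of what cv2.FileStorage writes for two scalar values; it was not captured from a real run, so check it against the actual result1.xml. To read from disk, ET.parse('result1.xml').getroot() would take the place of ET.fromstring.

```python
import xml.etree.ElementTree as ET

# Assumed layout of result1.xml for two scalar values (not captured from a real run)
xml_text = """<?xml version="1.0"?>
<opencv_storage>
<var1>1</var1>
<var2>2</var2>
</opencv_storage>
"""

root = ET.fromstring(xml_text)
# Map each child tag to its numeric value
values = {child.tag: float(child.text) for child in root}
print(values)
```

Staying inside OpenCV is probably more direct: as far as I know, calling fs.release() after writing and then fs.getNode('var1').real() on a FileStorage opened with cv2.FILE_STORAGE_READ returns the stored number.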
I have a docx object generated using the python-docx module. How can I convert it to PDF directly?
The following works for computers that have Word installed (I think Word 2007 and above, but do not hold me to that). I am not sure this works on everything, but it seems to work for me on .doc, .docx, and .rtf files. I think it should work on all files that Word can open.
# Imports =============================================================
import comtypes.client
import time
# Variables and Inputs=================================================
File = r'C:\path\filename.docx'  # Or .doc, rtf files
outFile = r'C:\path\newfilename.pdf'
# Functions ============================================================
def convert_word_to_pdf(inputFile, outputFile):
    ''' The lines that are commented out are items that others shared with me; they used them when
    running loops to stop some exceptions and errors, but I have not had to use them yet (knock on wood). '''
    word = comtypes.client.CreateObject('Word.Application')
    #word.visible = True
    #time.sleep(3)
    doc = word.Documents.Open(inputFile)
    doc.SaveAs(outputFile, FileFormat=17)  # 17 is wdFormatPDF
    doc.Close()
    #word.visible = False
    word.Quit()
# Main Body=================================================================
convert_word_to_pdf(File, outFile)
Does anyone know a python library to read docx files?
I have a word document that I am trying to read data from.
There are a couple of packages that let you do this.
Check
python-docx.
docx2txt (note that it does not seem to work with .doc). As per this, it seems to get more info than python-docx.
From original documentation:
import docx2txt
# extract text
text = docx2txt.process("file.docx")
# extract text and write images in /tmp/img_dir
text = docx2txt.process("file.docx", "/tmp/img_dir")
textract (which works via docx2txt).
Since .docx files are simply .zip files with a changed extension, this shows how to access the contents.
This is a significant difference from .doc files, and the reason why some (or all) of the above do not work with .doc files.
In this case, you would likely have to convert doc -> docx first. antiword is an option.
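To make the zip point concrete, here is a minimal sketch that reads paragraph text straight out of word/document.xml using only the standard library. The function name read_docx_paragraphs is made up, and the archive built in memory is a stripped-down stand-in rather than a complete file that Word would open. A real document also has tables, headers and footnotes that this ignores.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_docx_paragraphs(source):
    """Return the text of each w:p paragraph in word/document.xml."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more w:t nodes
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


# Build a stripped-down stand-in archive in memory (not a complete, Word-valid .docx)
document_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", document_xml)
buffer.seek(0)

paragraphs = read_docx_paragraphs(buffer)
print(paragraphs)
```

Passing a path such as 'file.docx' instead of the in-memory buffer should work the same way, since zipfile.ZipFile accepts either.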
python-docx can read as well as write.
import docx
doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
    allText.append(docpara.text)
Now all paragraphs will be in the list allText.
Thanks to "Automate the Boring Stuff with Python" by Al Sweigart for the pointer.
See this library that allows for reading docx files https://python-docx.readthedocs.io/en/latest/
You should use the python-docx library, available on PyPI. Then you can use the following:
import docx
doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
    allText.append(docpara.text)
A quick search of PyPI turns up the docx package.
import docx
def main():
    try:
        doc = docx.Document('test.docx')  # Creating word reader object.
        data = ""
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        data = '\n'.join(fullText)
        print(data)
    except IOError:
        print('There was an error opening the file!')
        return
if __name__ == '__main__':
    main()
And don't forget to install python-docx (pip install python-docx).