Read Docx files via python

Does anyone know a python library to read docx files?
I have a word document that I am trying to read data from.

There are a couple of packages that let you do this. Check:
python-docx.
docx2txt (note that it does not seem to work with .doc). It reportedly extracts more information than python-docx.
From the original documentation:
import docx2txt
# extract text
text = docx2txt.process("file.docx")
# extract text and write images in /tmp/img_dir
text = docx2txt.process("file.docx", "/tmp/img_dir")
textract (which works via docx2txt).
Since .docx files are simply .zip files with a changed extension, their contents can also be accessed with any zip tool.
This is a significant difference from .doc files, and the reason why some (or all) of the above do not work with .doc files.
In that case, you would likely have to convert .doc to .docx first, or extract the text directly with a tool such as antiword.
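The zip layout mentioned above can be sketched with the standard library alone. This is a minimal sketch, not a replacement for the packages listed: it assumes the main body lives at word/document.xml (the conventional location), and the function name read_docx_paragraphs is made up for illustration. It ignores tabs, line breaks, headers, and footers, and text inside nested text boxes may be counted more than once.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_docx_paragraphs(path):
    """Return the text of each paragraph in a .docx file (hypothetical helper)."""
    # A .docx file is a zip archive; the body is assumed to be at word/document.xml.
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # Each <w:p> element is a paragraph; its visible text sits in <w:t> elements.
    for para in root.iter("{%s}p" % W_NS):
        pieces = [node.text or "" for node in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(pieces))
    return paragraphs
```

Calling read_docx_paragraphs("file.docx") returns a list of strings, one per paragraph, including empty strings for empty paragraphs.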

python-docx can read as well as write.
import docx
doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
    allText.append(docpara.text)
Now all paragraphs will be in the list allText.
Thanks to "How to Automate the Boring Stuff with Python" by Al Sweigart for the pointer.

See this library, which allows reading docx files: https://python-docx.readthedocs.io/en/latest/
You should use the python-docx library available on PyPI. Then you can use the following:
import docx
doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
    allText.append(docpara.text)

A quick search of PyPI turns up the docx package.

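import docx
from docx.opc.exceptions import PackageNotFoundError

def main():
    try:
        doc = docx.Document('test.docx')  # Create the Word reader object.
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        # Join the paragraph texts into one newline-separated string.
        data = '\n'.join(fullText)
        print(data)
    except (IOError, PackageNotFoundError):
        # python-docx raises PackageNotFoundError for a missing or non-docx file.
        print('There was an error opening the file!')
        return

if __name__ == '__main__':
    main()
Don't forget to install python-docx first (pip install python-docx).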

Related

How to check if a docx or doc file is empty

I have a script that converts a docx file to JSON, and I was wondering how I can detect whether a file is empty.
One solution I found is this one:
https://thispointer.com/python-three-ways-to-check-if-a-file-is-empty/
Using:
os.stat(file_path).st_size == 0
os.path.exists(file_path)
os.path.getsize(path)
Unfortunately, an empty docx file does not have a size of 0 bytes, so I can't use those methods.
Is there any other solution?

What if you use the docx module?
According to its documentation, you can read the paragraphs and then check how many there are:
import docx
doc = docx.Document("E:/my_word_file.docx")
all_paras = doc.paragraphs
len(all_paras)
If the length is equal to 0, you can assume the document is empty.
However, as far as I can see, this only works for .docx files.
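One caveat with the paragraph count: a blank document saved from Word typically still contains a single empty paragraph, so the count may be 1 rather than 0. A slightly more forgiving check, sketched below, looks at the paragraph text instead. The helper name is_effectively_empty is hypothetical; it only assumes the object exposes .paragraphs (items with .text) and .tables, as a python-docx Document does. It does not look at images, headers, or footers.

```python
def is_effectively_empty(document):
    """Return True if the document has no visible paragraph text and no tables.

    `document` is assumed to behave like a python-docx Document:
    `.paragraphs` yields objects with a `.text` string, `.tables` is a list.
    """
    # Whitespace-only paragraphs count as empty.
    has_text = any(para.text.strip() for para in document.paragraphs)
    has_tables = len(document.tables) > 0
    return not has_text and not has_tables
```

With python-docx installed, it could be called as is_effectively_empty(docx.Document("E:/my_word_file.docx")).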

What is the best way to convert a docx object to a pdf in python

I have a docx object generated using the python docx module. How would I be able to convert it to pdf directly?

The following works on computers that have Word installed (I think Word 2007 and above, but do not hold me to that). I am not sure it works on everything, but it works for me on .doc, .docx, and .rtf files, and I think it should work on any file that Word can open.
# Imports =============================================================
import comtypes.client
import time
# Variables and Inputs ================================================
File = r'C:\path\filename.docx'  # Or .doc, .rtf files
outFile = r'C:\path\newfilename.pdf'
# Functions ===========================================================
def convert_word_to_pdf(inputFile, outputFile):
    ''' The lines that are commented out are items that others shared with me; they used them when
    running loops to stop some exceptions and errors, but I have not had to use them yet (knock on wood). '''
    word = comtypes.client.CreateObject('Word.Application')
    #word.Visible = True
    #time.sleep(3)
    doc = word.Documents.Open(inputFile)
    doc.SaveAs(outputFile, FileFormat=17)  # 17 is wdFormatPDF
    doc.Close()
    #word.Visible = False
    word.Quit()
# Main Body ===========================================================
convert_word_to_pdf(File, outFile)

Python comtypes read Outlook file

I'm using comtypes with Python 3.6 and trying to read Office documents, as I need to extract the text from these files.
I understand that for Word and PowerPoint, this is how to open files using comtypes:
word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(filename)
ppt = comtypes.client.CreateObject('PowerPoint.Application')
prs = ppt.Presentations.Open(filename)
How about Outlook files (.msg)? I tried the following code, but it doesn't work:
ol = comtypes.client.CreateObject('Outlook.Application')
msg = ol.MailItem.Open(filename)

I ended up using the approach from this thread instead of what I was testing in my question.

Reading core_properties using python-docx

I'm trying to read the last_saved_by attribute on docx files. I've followed the comments on GitHub and in this question. It seems that support has been added, but the documentation isn't very clear to me.
I've entered the following code into my script (Notepad++):
import docx
document = Document()
core_properties = document.core_properties
core_properties.author = 'Foo B. Baz'
document.save('new-filename.docx')
I only get an error message at the end:
NameError: name 'Document' is not defined
I'm not sure where I'm going wrong. :(
When I enter it line by line through python itself, the problem seems to come up from the second line.
I'm using Python 3.4, and docx 0.8.6

Figured out where I was going wrong, for those that want to know:
from docx import Document
import docx
document = Document('mine.docx')
core_properties = document.core_properties
print(core_properties.author)
There'll be a more succinct way of doing this, I'm sure (importing docx twice seems redundant for a start) - but it works, so I'm happy! :)

If the only thing you need from the docx module is Document, then you only need to use
from docx import Document
If you use more than that, you could use
import docx
document = docx.Document()
Importing specific names from the docx module is your choice; either way, you don't need two lines importing from (or importing) docx, although it's not expensive to have both.
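For the property the question originally asked about: python-docx exposes it as core_properties.last_modified_by rather than last_saved_by. The same values can also be read without python-docx, since they are stored as XML inside the zip. The sketch below assumes the core properties part sits at the conventional docProps/core.xml location (strictly, the package relationships decide this); the function name read_core_properties is made up for illustration, and a missing part raises KeyError.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used in the core properties part of an Office Open XML package
CORE_NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_core_properties(path):
    """Return author and last-modified-by from a .docx (hypothetical helper).

    Values are None when the corresponding element is absent.
    """
    # Assumption: the part is stored under its conventional name.
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    return {
        "author": root.findtext("dc:creator", default=None, namespaces=CORE_NS),
        "last_modified_by": root.findtext(
            "cp:lastModifiedBy", default=None, namespaces=CORE_NS
        ),
    }
```

Calling read_core_properties('mine.docx') returns a small dict with the two values.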

Open a word document with python using windows

I am trying to open a Word document with Python on Windows, but I am unfamiliar with Windows.
My code is as follows:
import docx as dc
doc = dc.Document(r'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx')
Through another post, I learned that I had to put the r in front of my string to convert it to a raw string or it would interpret the \U as an escape sequence.
The error I get is
PackageNotFoundError: Package not found at 'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx'
I'm unsure why it cannot find my document, 01100-Allergan-UD1314-SUMMARY OF WORK.docx. The path is correct, as I copied it directly from the file system.
Any help is appreciated, thanks.

Try this:
from io import BytesIO
from docx import Document
file = r'H:\myfolder\wordfile.docx'
# A .docx file is binary, so open it in 'rb' mode and wrap the bytes in a stream.
with open(file, 'rb') as f:
    source_stream = BytesIO(f.read())
document = Document(source_stream)
source_stream.close()
http://python-docx.readthedocs.io/en/latest/user/documents.html
Also, with regard to debugging the file-not-found error, simplify your directory and file names. Rename the file to 'file' instead of referring to a long path with spaces, etc.
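When narrowing down a PackageNotFoundError, it can help to check the path in stages before handing it to python-docx. As far as I can tell, python-docx raises that error both when nothing exists at the path and when the file exists but is not a zip archive (for example, a legacy .doc file with a .docx extension). The helper below is a sketch using only the standard library; the name diagnose_docx_path and its return strings are made up for illustration.

```python
import os
import zipfile


def diagnose_docx_path(path):
    """Return a short explanation of why a .docx path may fail to open (hypothetical helper)."""
    if not os.path.exists(path):
        return "missing: nothing exists at this path"
    if not os.path.isfile(path):
        return "not a file: the path points to a directory or similar"
    if not zipfile.is_zipfile(path):
        return "not a zip archive: possibly a .doc file or a corrupted download"
    # A valid .docx is assumed to carry its body at word/document.xml.
    with zipfile.ZipFile(path) as archive:
        if "word/document.xml" not in archive.namelist():
            return "zip archive without word/document.xml"
    return "ok"
```

Printing diagnose_docx_path(r'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx') would show which stage fails for the path in the question.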

If you want to open the document in Microsoft Word, try using os.startfile().
In your example it would be:
import os
os.startfile(r'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx')
This will open the document in Word on your computer.
