How to check if a docx or doc file is empty - python

I have a script that can convert a docx file to a json and I was wondering how can I detect if a file is empty.
A solution I found is that one:
https://thispointer.com/python-three-ways-to-check-if-a-file-is-empty/
Using:
os.stat(file_path).st_size == 0:
os.path.exists(file_path)
os.path.getsize(path)
Unfortunately since an empty docx is not equal to 0. I can't use those methods.
Any other solution?

what if you use the docx module?
you can check it here, according to that documentation, you can read the paragraphs and after check the length:
import docx
doc = docx.Document("E:/my_word_file.docx")
all_paras = doc.paragraphs
len(all_paras)
If the lenght is equal to 0 you can assume this is empty.
However this only works for .docx files for what I can see

Related

How to open/save docx file to be aditted as xml and save the result as docx after editting using python

I have a docx file in which I need to edit its paragraphs (the paragraphs might contain equations). I tried to do these jobs using python-docx but it was not successful since editing the text of each paragraph and replacing it with the edited new paragraph needs to call p.add_paragraphs(editText(paragraph.text)) which ignores and omits any mathematical equation.
By searching for a method to gain this goal I found that this job is possible through XML codes by finding <w:t> tags and editing their content like this:
tree= ET.parse(filename)
root=tree.getroot()
for par in root.findall('w:p'):
if par.find('w:r'):
myText= par.find('w:r').find('w:t')
myText.text= editText(myText.text)
Then I must save the result as docx.
My quation is: what the format of filename is? should it be a document.xml file? If so, how can I reach that from my original document.docx file? and one more question is that how can I save the result as a .docx file again?
For saving docx as xml, I have given a try to save it by document.save('Document2.xml'). But the content of the result was not correct.
Would you give me some advice how to do them?
Not experienced with this at all, but perhaps this is what you were looking for?
https://virantha.com/2013/08/16/reading-and-writing-microsoft-word-docx-files-with-python/
From the article:
import zipfile
from lxml import etree
class DocsWriter:
def __init__(self, docx_file):
self.zipfile = zipfile.ZipFile(docx_file)
def _write_and_close_docx (self, xml_content, output_filename):
""" Create a temp directory, expand the original docx zip.
Write the modified xml to word/document.xml
Zip it up as the new docx
"""
tmp_dir = tempfile.mkdtemp()
self.zipfile.extractall(tmp_dir)
with open(os.path.join(tmp_dir,'word/document.xml'), 'w') as f:
xmlstr = etree.tostring (xml_content, pretty_print=True)
f.write(xmlstr)
# Get a list of all the files in the original docx zipfile
filenames = self.zipfile.namelist()
# Now, create the new zip file and add all the filex into the archive
zip_copy_filename = output_filename
with zipfile.ZipFile(zip_copy_filename, "w") as docx:
for filename in filenames:
docx.write(os.path.join(tmp_dir,filename), filename)
# Clean up the temp dir
shutil.rmtree(tmp_dir)
From what I can tell, this code block writes an xml document as .docx. Refer to the article for more context.
Python is not the best tool for this. Use VBA if you need to automate something in a Word document, or multiple Word documents. I can't tell what you are even trying to do here, but let's start at the beginning, with something simple. If, for instance, you want to loop through all paragraphs in your Word document, and select only the equations, you can run the code below to do just that.
Sub SelectAllEquations()
Dim xMath As OMath
Dim I As Integer
With ActiveDocument
.DeleteAllEditableRanges wdEditorEveryone
For I = 1 To .OMaths.Count
Set xMath = .OMaths.Item(I)
xMath.Range.Paragraphs(1).Range.Editors.Add wdEditorEveryone
Next
.SelectAllEditableRanges wdEditorEveryone
.DeleteAllEditableRanges wdEditorEveryone
End With
End Sub
Again, I don't know what your end game is, but I think it's worthwhile to start with something like that, and build on your foundation.

How can I create a word (.docx) document if not found using python and write in it?

How can I create a word (.docx) document if not found using python and write in it?
I certainly cannot do either of the following:
file = open(file_name, 'r')
file = open(file_name, 'w')
or, to create or append if found:
f = open(file_name, 'a+')
Also I cannot find any related info in python-docx documentation at:
https://python-docx.readthedocs.io/en/latest/
NOTE:
I need to create an automated report via python with text and pie charts, graphs etc.
Probably the safest way to open (and truncate) a new file for writing is using 'xb' mode. 'x' will raise a FileExistsError if the file is already there. 'b' is necessary because a word document is fundamentally a binary file: it's a zip archive with XML and other files inside it. You can't compress and decompress a zip file if you convert bytes through character encoding.
Document.save accepts streams, so you can pass in a file object opened like that to save your document.
Your work-flow could be something like this:
doc = docx.Document(...)
...
# Make your document
...
with open('outfile.docx', 'xb') as f:
doc.save(f)
It's a good idea to use with blocks instead of raw open to ensure the file gets closed properly even in case of an error.
In the same way that you can't simply write to a Word file directly, you can't append to it either. The way to "append" is to open the file, load the Document object, and then write it back, overwriting the original content. Since the word file is a zip archive, it's very likely that appended text won't even be at the end of the XML file it's in, much less the whole docx file:
doc = docx.Document('file_to_append.docx')
...
# Modify the contents of doc
...
doc.save('file_to_append.docx')
Keep in mind that the python-docx library may not support loading some elements, which may end up being permanently discarded when you save the file this way.
Looks like I found an answer:
The important point here was to create a new file, if not found, or
otherwise edit the already present file.
import os
from docx import Document
#checking if file already present and creating it if not present
if not os.path.isfile(r"file_path"):
#Creating a blank document
document = Document()
#saving the blank document
document.save('file_name.docx')
#------------editing the file_name.docx now------------------------
#opening the existing document
document = Document('file_name.docx')
#editing it
document.add_heading("hello world" , 0)
#saving document in the end
document.save('file_name.docx')
Further edits/suggestions are welcome.

What is the best way to convert a docx object to a pdf in python

I have a docx object generated using the python docx module. How would I be able to convert it to pdf directly?
The following works for computers that have word installed (I think word 2007 and above, but do not hold me to that). I am not sure this works on everything but it seems to work for me on doc, docx, and .rtf files. I think it should work on all files that word can open.
# Imports =============================================================
import comtypes.client
import time
# Variables and Inputs=================================================
File = r'C:\path\filename.docx' # Or .doc, rtf files
outFile = r'C:\path\newfilename.pdf'
# Functions ============================================================
def convert_word_to_pdf(inputFile,outputFile):
''' the following lines that are commented out are items that others shared with me they used when
running loops to stop some exceptions and errors, but I have not had to use them yet (knock on wood) '''
word = comtypes.client.CreateObject('Word.Application')
#word.visible = True
#time.sleep(3)
doc = word.Documents.Open(inputFile)
doc.SaveAs(outputFile, FileFormat = 17)
doc.close()
#word.visible = False
word.Quit()
# Main Body=================================================================
convert_word_to_pdf(File,outFile)

Open a word document with python using windows

I am trying to open a word document with python in windows, but I am unfamiliar with windows.
My code is as follows.
import docx as dc
doc = dc.Document(r'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx')
Through another post, I learned that I had to put the r in front of my string to convert it to a raw string or it would interpret the \U as an escape sequence.
The error I get is
PackageNotFoundError: Package not found at 'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx'
I'm unsure of why it cannot find my document, 01100-Allergan-UD1314-SUMMARY OF WORK.docx. The pathway is correct as I copied it directly from the file system.
Any help is appreciated thanks.
try this
import StringIO
from docx import Document
file = r'H:\myfolder\wordfile.docx'
with open(file) as f:
source_stream = StringIO(f.read())
document = Document(source_stream)
source_stream.close()
http://python-docx.readthedocs.io/en/latest/user/documents.html
Also, in regards to debugging the file not found error, simplify your directory names and files names. Rename the file to 'file' instead of referring to a long path with spaces, etc.
If you want to open the document in Microsoft Word try using os.startfile().
In your example it would be:
os.startfile(r'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx')
This will open the document in word on your computer.

Read Docx files via python

Does anyone know a python library to read docx files?
I have a word document that I am trying to read data from.
There are a couple of packages that let you do this.
Check
python-docx.
docx2txt (note that it does not seem to work with .doc). As per this, it seems to get more info than python-docx.
From original documentation:
import docx2txt
# extract text
text = docx2txt.process("file.docx")
# extract text and write images in /tmp/img_dir
text = docx2txt.process("file.docx", "/tmp/img_dir")
textract (which works via docx2txt).
Since .docx files are simply .zip files with a changed extension, this shows how to access the contents.
This is a significant difference with .doc files, and the reason why some (or all) of the above do not work with .docs.
In this case, you would likely have to convert doc -> docx first. antiword is an option.
python-docx can read as well as write.
doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
allText.append(docpara.text)
Now all paragraphs will be in the list allText.
Thanks to "How to Automate the Boring Stuff with Python" by Al Sweigart for the pointer.
See this library that allows for reading docx files https://python-docx.readthedocs.io/en/latest/
You should use the python-docx library available on PyPi. Then you can use the following
doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
allText.append(docpara.text)
A quick search of PyPI turns up the docx package.
import docx
def main():
try:
doc = docx.Document('test.docx') # Creating word reader object.
data = ""
fullText = []
for para in doc.paragraphs:
fullText.append(para.text)
data = '\n'.join(fullText)
print(data)
except IOError:
print('There was an error opening the file!')
return
if __name__ == '__main__':
main()
and dont forget to install python-docx using (pip install python-docx)

Categories