Reading core_properties using python-docx - python

I'm trying to read the last_saved_by attribute on docx files. I've followed the comments on GitHub and in this question. It seems that support has been added, but the documentation isn't very clear to me.
I've entered the following code into my script (Notepad++):
import docx
document = Document()
core_properties = document.core_properties
core_properties.author = 'Foo B. Baz'
document.save('new-filename.docx')
I only get an error message at the end:
NameError: name 'Document' is not defined
I'm not sure where I'm going wrong. :(
When I enter it line by line through python itself, the problem seems to come up from the second line.
I'm using Python 3.4, and docx 0.8.6

Figured out where I was going wrong, for those that want to know:
from docx import Document
import docx
document = Document('mine.docx')
core_properties = document.core_properties
print(core_properties.author)
There'll be a more succinct way of doing this, I'm sure (importing docx twice seems redundant for a start) - but it works, so I'm happy! :)

If the only thing you need from the docx module is Document, then you only need to use
from docx import Document
If you use more than that, you could use
import docx
document = docx.Document()
Importing specific names from the docx module is your choice; either way, you don't need two lines importing from (or importing) docx, although it's not expensive to have both.
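For the attribute the question originally asked about: as far as I know, python-docx exposes it as core_properties.last_modified_by rather than last_saved_by. The sketch below does not use python-docx at all; it reads the same value with the standard library only, assuming the core-properties part sits at its conventional location docProps/core.xml. The names read_core_properties and make_demo_package are hypothetical helpers made up for this illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used inside the core-properties part of an OPC package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_core_properties(source, part_name="docProps/core.xml"):
    """Return (author, last_modified_by) from a .docx path or file-like object.

    part_name is an assumption: it is the conventional location of the
    core-properties part, not one the format guarantees.
    """
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read(part_name))
    author = root.findtext("dc:creator", default="", namespaces=NS)
    last_modified_by = root.findtext("cp:lastModifiedBy", default="", namespaces=NS)
    return author, last_modified_by


def make_demo_package():
    """Build a tiny in-memory zip that holds only a core-properties part."""
    core_xml = (
        '<cp:coreProperties xmlns:cp="%s" xmlns:dc="%s">'
        "<dc:creator>Foo B. Baz</dc:creator>"
        "<cp:lastModifiedBy>Jane Doe</cp:lastModifiedBy>"
        "</cp:coreProperties>" % (NS["cp"], NS["dc"])
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("docProps/core.xml", core_xml)
    buffer.seek(0)
    return buffer


author, last_modified_by = read_core_properties(make_demo_package())
print(author, "/", last_modified_by)
```

With a real file you would pass its path instead of the demo buffer, for example read_core_properties('mine.docx').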

Related

Inserting txt file info into docx

I am using python 3.5, python-docx, and TKinter. Is there any way to be able to type a .txt file name into an entry box and have python insert the contents of that .txt file into a specific place in docx? I know how to get information from Entry boxes in tkinter and how to convert them into a string.
I want to be able to enter someone's name (which would also be the name of the text file) into the entry box and have Python insert the content of the .txt file.
Thanks
Update: here is the code I'm trying:
from tkinter import *
from tkinter import ttk
import tkinter as tk
from docx import Document
root=Tk()
def make_document():
    testbox=ProjectEngineerEntry.get()
    TestBox=str(testbox)
    def projectengineer():
        with open(+TestBox+'.txt') as f:
            for line in f:
                document.add_paragraph(line)
    document=Document()
    h1=document.add_heading('engineer test',level=1)
    h1a=document.add_heading('Project Engineer',level=2)
    projectengineer()
    document.save('test.docx')
notebook=ttk.Notebook()
notebook.pack(fill=BOTH)
frame1=ttk.Frame(notebook)
notebook.add(frame1,text='Tester',sticky=W)
tester1=Label(frame1,text='Test1')
ProjectEngineerEntry=Entry(frame1)
tester1.pack()
ProjectEngineerEntry.pack()
save=Button(frame1,text='Save',command=make_document).pack()
As you can see, I am trying to take the information from the entry box, convert it to a string, and then use that string to open a text file with that specific name. However, I keep getting
TypeError: bad operand type for unary +: 'str'
I don't understand what's going on here. In the actual document I used the ++ method when saving the file (it saves it as the current date and time).
python-docx doesn't have this functionality and it's unlikely it ever will. Typically this sort of thing is done to suit by the script or application using python-docx. Something like this:
from docx import Document
document = Document()
with open('textfile.txt') as f:
    for line in f:
        document.add_paragraph(line)
document.save('wordfile.docx')
You'll need to deal with the particulars, like how a paragraph is separated (perhaps a blank line) and so on, but the code doesn't need to be much longer than this.
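One of those particulars, treating a blank line as the paragraph separator, could be handled roughly like this. It is a minimal sketch using only the standard library; split_paragraphs is a hypothetical helper name, and joining wrapped lines with a single space is an assumption about how the text file is laid out.

```python
import re


def split_paragraphs(text):
    """Split plain text into paragraphs separated by one or more blank lines."""
    blocks = re.split(r"\n\s*\n", text.strip())
    # Collapse line wraps and repeated whitespace inside each paragraph.
    return [" ".join(block.split()) for block in blocks if block.strip()]


sample = "First line\nstill first.\n\nSecond.\n\n\nThird.\n"
for paragraph in split_paragraphs(sample):
    print(repr(paragraph))
```

Each returned string could then be passed to document.add_paragraph() in place of the raw line, which also avoids carrying the trailing newline of every line into the document.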
Take a look at the following python library: https://python-docx.readthedocs.io/en/latest/
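On the TypeError from the question: in open(+TestBox+'.txt') the first + has nothing on its left, so Python parses it as the unary plus operator applied to a string, which str does not support. Dropping the leading + leaves ordinary concatenation. The sketch below shows both points; text_filename and unary_plus_fails are hypothetical names used only for illustration, and stripping whitespace from the entry text is an added assumption.

```python
def text_filename(name):
    """Build '<name>.txt' from the text typed into an entry box."""
    return name.strip() + ".txt"


def unary_plus_fails(value):
    """Return True if applying unary + to value raises TypeError."""
    try:
        +value
    except TypeError:
        return True
    return False


print(text_filename(" alice "))
print(unary_plus_fails("alice"))  # strings do not support unary +
print(unary_plus_fails(3))        # numbers do
```

In the question's code that would mean open(TestBox + '.txt'), or open(text_filename(TestBox)) with the helper above.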
Just a tidied-up adaptation of scanny's answer:
%pip install python-docx
from docx import Document
def to_docx(file_path_plain_text, file_path_docx):
    document = Document()
    with open(file_path_plain_text) as f:
        for line in f:
            document.add_paragraph(line)
    document.save(file_path_docx)
to_docx('my_file.txt', 'report.docx')
Charmap and encoding errors may arise when reading the plain-text file, but writing a robust reader for that is outside the scope of this topic.
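For the encoding errors mentioned above, one common approach is to try a short list of encodings in order. This is only a sketch: read_text_with_fallback is a hypothetical helper, and the default list (UTF-8 first, then Windows-1252) is an assumption that suits many Windows-created text files but not all of them.

```python
def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return the file's text, decoded with the first encoding that works."""
    for encoding in encodings:
        try:
            with open(path, encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            # This encoding cannot decode the file; try the next one.
            continue
    raise ValueError("could not decode %r with any of %r" % (path, encodings))
```

The result can be split into lines or paragraphs and handed to document.add_paragraph() as in the function above. Note that a wrong encoding can still decode without an error and produce garbled characters, so the order of the list matters.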

Open a word document with python using windows

I am trying to open a Word document with Python on Windows, but I am unfamiliar with Windows.
My code is as follows.
import docx as dc
doc = dc.Document(r'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx')
Through another post, I learned that I had to put the r in front of my string to convert it to a raw string or it would interpret the \U as an escape sequence.
The error I get is
PackageNotFoundError: Package not found at 'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx'
I'm unsure why it cannot find my document, 01100-Allergan-UD1314-SUMMARY OF WORK.docx. The path is correct, as I copied it directly from the file system.
Any help is appreciated, thanks.
Try this:
from io import BytesIO
from docx import Document
file = r'H:\myfolder\wordfile.docx'
# Read the file as bytes and give python-docx an in-memory stream.
with open(file, 'rb') as f:
    source_stream = BytesIO(f.read())
document = Document(source_stream)
source_stream.close()
http://python-docx.readthedocs.io/en/latest/user/documents.html
Also, to debug the file-not-found error, simplify your directory and file names. Rename the file to 'file' instead of referring to a long path with spaces, etc.
If you want to open the document in Microsoft Word, try using os.startfile() (available on Windows only).
In your example it would be:
import os
os.startfile(r'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx')
This will open the document in Word on your computer.
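Before renaming anything, it can help to check what is actually at the path. As far as I can tell, python-docx reports PackageNotFoundError both when nothing exists at the path and when the file there is not a zip package (for instance a legacy .doc file renamed to .docx). The sketch below separates those cases using only the standard library; diagnose_docx_path is a hypothetical helper, not part of python-docx.

```python
import os
import zipfile


def diagnose_docx_path(path):
    """Classify a path as 'missing', 'directory', 'not a zip package' or 'ok'."""
    if not os.path.exists(path):
        return "missing"
    if os.path.isdir(path):
        return "directory"
    if not zipfile.is_zipfile(path):
        # A .docx file is a zip archive; anything else cannot be opened as one.
        return "not a zip package"
    return "ok"
```

Calling diagnose_docx_path() with the same raw string that is passed to Document() shows whether the problem is the path itself or the contents of the file.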

Read Docx files via python

Does anyone know a python library to read docx files?
I have a word document that I am trying to read data from.
There are a couple of packages that let you do this.
Check
python-docx.
docx2txt (note that it does not seem to work with .doc). As per this, it seems to get more info than python-docx.
From original documentation:
import docx2txt
# extract text
text = docx2txt.process("file.docx")
# extract text and write images in /tmp/img_dir
text = docx2txt.process("file.docx", "/tmp/img_dir")
textract (which works via docx2txt).
Since .docx files are simply .zip files with a changed extension, this shows how to access the contents.
This is a significant difference from .doc files, and the reason why some (or all) of the above do not work with them.
In that case, you would likely have to convert doc -> docx first. antiword is an option.
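To illustrate the point that a .docx file is a zip archive, here is a minimal standard-library sketch that pulls paragraph text straight out of the main document part. It assumes that part is at the conventional location word/document.xml, and it ignores headers, footnotes and similar parts. docx_paragraph_texts and make_demo_docx are hypothetical helpers written for this example; the demo archive is not a complete, Word-openable document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraph_texts(source, part_name="word/document.xml"):
    """Return the text of every w:p element in the main document part."""
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read(part_name))
    paragraphs = []
    for paragraph in root.iter("{%s}p" % W_NS):
        # A paragraph's text is spread over one or more w:t elements.
        runs = [node.text or "" for node in paragraph.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(runs))
    return paragraphs


def make_demo_docx():
    """Build an in-memory zip holding only a tiny word/document.xml."""
    document_xml = (
        '<w:document xmlns:w="%s"><w:body>'
        "<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>" % W_NS
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", document_xml)
    buffer.seek(0)
    return buffer


print(docx_paragraph_texts(make_demo_docx()))
```

The libraries listed above handle many more cases; this is mainly useful for seeing what is inside the archive.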
python-docx can read as well as write.
import docx
doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
    allText.append(docpara.text)
Now all paragraphs will be in the list allText.
Thanks to "Automate the Boring Stuff with Python" by Al Sweigart for the pointer.
See this library that allows for reading docx files https://python-docx.readthedocs.io/en/latest/
You should use the python-docx library, available on PyPI. Then you can use the following:
import docx
doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
    allText.append(docpara.text)
A quick search of PyPI turns up the docx package.
import docx
def main():
    try:
        doc = docx.Document('test.docx')  # Creating word reader object.
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        # Join once, after all paragraphs have been collected.
        data = '\n'.join(fullText)
        print(data)
    except IOError:
        print('There was an error opening the file!')
        return
if __name__ == '__main__':
    main()
And don't forget to install python-docx (pip install python-docx).

Pulling data out of MS Word with pywin32

I am running python 3.3 in Windows and I need to pull strings out of Word documents. I have been searching far and wide for about a week on the best method to do this. Originally I tried to save the .docx files as .txt and parse through using RE's, but I had some formatting problems with hidden characters - I was using a script to open a .docx and save as .txt. I am wondering if I did a proper File>SaveAs>.txt would it strip out the odd formatting and then I could properly parse through? I don't know but I gave up on this method.
I tried to use the docx module but I've been told it is not compatible with python 3.3. So I am left with using pywin32 and the COM. I have used this successfully with Excel to get the data I need but I am having trouble with Word because there is FAR less documentation and reading through the object model on Microsoft's website is over my head.
Here is what I have so far to open the document(s):
import win32com.client as win32
import glob, os
word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = True
for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
So at this point I can do something like
print(doc.Content.Text)
And see the contents of the files, but it still looks like there is some odd formatting in there and I have no idea how to actually parse through to grab the data I need. I can create RE's that will successfully find the strings that I'm looking for, I just don't know how to implement them into the program using the COM.
The code I have so far was mostly found through Google. I don't even think this is that hard, it's just that reading through the object model on Microsoft's website is like reading a foreign language. Any help is MUCH appreciated. Thank you.
Edit: code I was using to save the files from docx to txt:
for path, dirs, files in os.walk(r'mypath'):
    for doc in [os.path.abspath(os.path.join(path, filename)) for filename in files if fnmatch.fnmatch(filename, '*.docx')]:
        print("processing %s" % doc)
        wordapp.Documents.Open(doc)
        docastxt = doc.rstrip('docx') + 'txt'
        wordapp.ActiveDocument.SaveAs(docastxt,FileFormat=win32com.client.constants.wdFormatText)
        wordapp.ActiveDocument.Close()
If you don't want to learn the complicated way Word models documents, and then how that's exposed through the Office object model, a much simpler solution is to have Word save a plain-text copy of the file.
There are a lot of options here. Use tempfile to create temporary text files and then delete them, or store permanent ones alongside the doc files for later re-use? Use Unicode (which, in Microsoft speak, means UTF-16-LE with a BOM) or encoded text? And so on. So, I'll just pick something reasonable, and you can look at the Document.SaveAs, WdSaveFormat, etc. docs to modify it.
wdFormatUnicodeText = 7
for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
    # Same base name as the .docx file, with a .txt extension.
    txtpath = os.path.splitext(infile)[0] + '.txt'
    doc.SaveAs(txtpath, wdFormatUnicodeText)
    doc.Close()
    with open(txtpath, encoding='utf-16') as f:
        process_the_file(f)
As noted in your comments, what this does to complex things like tables, multi-column text, etc. may not be exactly what you want. In that case, you might want to consider saving as, e.g., wdFormatFilteredHTML, which Python has nice parsers for. (It's a lot easier to BeautifulSoup a table than to win32com-Word it.)
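As a rough idea of what parsing the saved HTML could look like, the sketch below collects table cells with the standard library's html.parser. That is a swap made for the sake of a dependency-free example; the answer itself suggests BeautifulSoup. TableExtractor and extract_tables are hypothetical names, nested tables are not handled, and real Word-generated HTML will be considerably noisier than the sample.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect tables as lists of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Normalise whitespace inside the finished cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def extract_tables(html):
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


sample = (
    "<table><tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>Bolt</td><td> 4 </td></tr></table>"
)
print(extract_tables(sample))
```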
oodocx is my fork of the python-docx module that is fully compatible with Python 3.3. You can use the replace method to do regular expression searches. Your code would look something like:
from oodocx import oodocx
d = oodocx.Docx('myfile.docx')
d.replace('searchstring', 'replacestring')
d.save('mynewfile.docx')
If you just want to remove strings, you can pass an empty string to the "replace" parameter.
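On the regular-expression part of the question: the text read from doc.Content.Text is an ordinary Python string, so the re module can be applied to it directly. The odd formatting is, as far as I know, mostly Word's control characters (carriage returns for paragraph marks, \x07 for table cell markers, and so on); treat the mapping below as an assumption to check against your own documents. clean_word_text and find_all are hypothetical helpers.

```python
import re

# Assumed mapping of Word control characters to plain-text equivalents.
_WORD_MARKS = {
    "\r": "\n",    # paragraph mark
    "\x0b": "\n",  # manual line break
    "\x0c": "\n",  # page or section break
    "\x07": "\t",  # end-of-cell marker in tables
}


def clean_word_text(text):
    """Replace Word's control characters with newlines and tabs."""
    return text.translate(str.maketrans(_WORD_MARKS))


def find_all(pattern, text):
    """Run a regular expression over the cleaned text."""
    return re.findall(pattern, clean_word_text(text))


raw = "Name: Ann\rID: 42\x07"
print(repr(clean_word_text(raw)))
print(find_all(r"ID: (\d+)", raw))
```

In the COM loop from the question this would be called as find_all(your_pattern, doc.Content.Text).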

Extracting tables from a DOCX Word document in python

I'm trying to extract the content of tables in a DOCX Word document, and boy, I'm new to XML/XPath.
from docx import *
document = opendocx('someFile.docx')
tableList = document.xpath('/w:tbl')
This triggers "XPathEvalError: Undefined namespace prefix" error. I'm sure it's just the first one to expect while developing the script. Unfortunately, I couldn't find a tutorial for python-docx.
Could you kindly provide an example of table extraction?
After some back and forth, we found out that a namespace was needed for this to work correctly. The xpath method is the appropriate solution, it just needs to have the document namespace passed in first.
The lxml xpath method has the details for namespace stuff. Look down the page in the link for passing a namespaces dictionary, and other details.
As explained by mgierdal in his comment above:
tblList = document.xpath('//w:tbl', namespaces=document.nsmap) works
like a dream. So, as I understand it w: is a shorthand that has to be
expanded to the full namespace name, and the dictionary for that is
provided by document.nsmap.
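The same prefix-to-namespace idea can be shown without lxml. The sketch below uses the standard library's xml.etree.ElementTree (a swap for illustration, not the lxml call quoted above) with a hand-written mapping for the w: prefix. table_cell_texts and SAMPLE_XML are made up for this example, and nested tables are not treated specially.

```python
import xml.etree.ElementTree as ET

# The "w" prefix is only shorthand; queries need the full namespace URI.
NSMAP = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}


def table_cell_texts(document_xml):
    """Return every table as a list of rows, each row a list of cell texts."""
    root = ET.fromstring(document_xml)
    tables = []
    for tbl in root.findall(".//w:tbl", NSMAP):
        rows = []
        for tr in tbl.findall("w:tr", NSMAP):
            row = []
            for tc in tr.findall("w:tc", NSMAP):
                # Cell text is spread over the w:t elements inside the cell.
                row.append("".join(t.text or "" for t in tc.findall(".//w:t", NSMAP)))
            rows.append(row)
        tables.append(rows)
    return tables


def _cell(text):
    return "<w:tc><w:p><w:r><w:t>%s</w:t></w:r></w:p></w:tc>" % text


SAMPLE_XML = (
    '<w:document xmlns:w="%s"><w:body><w:tbl>'
    "<w:tr>%s%s</w:tr><w:tr>%s%s</w:tr>"
    "</w:tbl></w:body></w:document>"
    % (NSMAP["w"], _cell("A1"), _cell("B1"), _cell("A2"), _cell("B2"))
)

print(table_cell_texts(SAMPLE_XML))
```

Without the mapping, the w: prefix cannot be resolved, which is the same situation the "Undefined namespace prefix" error describes.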
You can extract the table from docx using python-docx. Check the following code:
from docx import Document
document = Document(file_path)  # file_path: path to your .docx file
tables = document.tables
First install python-docx, as mentioned by @abdulsaboor:
pip install python-docx
Then this code should do:
from docx import Document
document = Document('myfile.docx')
for table in document.tables:
    print()
    for row in table.rows:
        for cell in row.cells:
            print(cell.text, end=' ')
