Inserting txt file info into docx - python

I am using Python 3.5, python-docx, and Tkinter. Is there a way to type a .txt file name into an Entry box and have Python insert the contents of that file at a specific place in a docx? I know how to get information from Entry boxes in Tkinter and how to convert it to a string.
I want to be able to enter someone's name (which would also be the name of the text file) into the Entry box and have Python insert the content of the .txt file.
Thanks
Update: here is the code I'm trying:
from tkinter import *
from tkinter import ttk
import tkinter as tk
from docx import Document
root=Tk()
def make_document():
    testbox=ProjectEngineerEntry.get()
    TestBox=str(testbox)
    def projectengineer():
        with open(+TestBox+'.txt') as f:
            for line in f:
                document.add_paragraph(line)
    document=Document()
    h1=document.add_heading('engineer test',level=1)
    h1a=document.add_heading('Project Engineer',level=2)
    projectengineer()
    document.save('test.docx')
notebook=ttk.Notebook()
notebook.pack(fill=BOTH)
frame1=ttk.Frame(notebook)
notebook.add(frame1,text='Tester',sticky=W)
tester1=Label(frame1,text='Test1')
ProjectEngineerEntry=Entry(frame1)
tester1.pack()
ProjectEngineerEntry.pack()
save=Button(frame1,text='Save',command=make_document).pack()
As you can see, I am trying to take the information from the entry box, convert it to a string, and then use that string to open a text file with that specific name. However, I keep getting
TypeError: bad operand type for unary +: 'str'
I don't understand what's going on here. In the actual document I used the same +...+ concatenation when saving the file (it saves under the current date and time).

python-docx doesn't have this functionality and it's unlikely it ever will. Typically this sort of thing is done to suit by the script or application using python-docx. Something like this:
from docx import Document
document = Document()
with open('textfile.txt') as f:
    for line in f:
        document.add_paragraph(line)
document.save('wordfile.docx')
You'll need to deal with the particulars, like how a paragraph is separated (perhaps a blank line) and so on, but the code doesn't need to be much longer than this.
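For the code in the question, the TypeError comes from the leading + in open(+TestBox+'.txt'): with nothing to its left, Python reads it as a unary plus applied to a string. Below is a minimal sketch of the reading side, assuming paragraphs in the text file are separated by blank lines. The helper name read_paragraphs and its folder parameter are made up for illustration.

```python
import os


def read_paragraphs(name, folder="."):
    """Return the paragraphs of <folder>/<name>.txt as a list of strings.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    # Plain concatenation: no leading "+" in front of the first operand.
    path = os.path.join(folder, name + ".txt")
    paragraphs = []
    current = []
    with open(path) as f:
        for line in f:
            stripped = line.strip()
            if stripped:
                current.append(stripped)
            elif current:
                # A blank line closes the paragraph collected so far.
                paragraphs.append(" ".join(current))
                current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Each returned string can then be passed to document.add_paragraph() in place of the raw line, so the trailing newline of every line is not carried into the document.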

Take a look at the following python library: https://python-docx.readthedocs.io/en/latest/

Just a tidied-up adaptation of scanny's answer:
%pip install python-docx
from docx import Document
def to_docx(file_path_plain_text, file_path_docx):
    document = Document()
    with open(file_path_plain_text) as f:
        for line in f:
            document.add_paragraph(line)
    document.save(file_path_docx)
to_docx('my_file.txt', 'report.docx')
Charmap and other encoding errors may arise when reading the content of the plain-text file, but writing a robust function for that is outside the scope of this topic.
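One way to blunt such encoding errors is to state the encoding explicitly instead of relying on the platform default, which is typically where charmap errors on Windows come from. A small sketch, assuming the text files are UTF-8; the helper name read_lines is hypothetical.

```python
def read_lines(path, encoding="utf-8"):
    """Read a text file and return its lines without trailing newlines.

    Assumption: the file is UTF-8. Bytes that cannot be decoded are
    replaced with U+FFFD instead of raising UnicodeDecodeError.
    """
    with open(path, encoding=encoding, errors="replace") as f:
        return [line.rstrip("\n") for line in f]
```

Whether replacing undecodable bytes is acceptable depends on the data; dropping errors="replace" makes the function fail loudly instead.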

Related

Convert pdf files to raw text in new directory

Here is what I'm trying:
import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader
import re
import config
import sys
import os
with open(config.ENCRYPTED_FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    if reader.isEncrypted:
        reader.decrypt('Password123')
    print(f"Number of page: {reader.getNumPages()}")
    for i in range(reader.numPages):
        output = PdfFileWriter()
        output.addPage(reader.getPage(i))
        with open("./pdfs/document-page%s.pdf" % i, "wb") as outputStream:
            output.write(outputStream)
        print(outputStream)
        for page in output.pages: # failing here
            print page.extractText() # failing here
The program decrypts a large PDF file from one location and splits it into a separate PDF file per page in a new directory. This part works fine. However, after this I would like to convert each page to a raw .txt file in a new directory, i.e. /txt_versions/ (which I'll use later).
Ideally, I can use my current imports, i.e. PyPDF2, without importing or installing more modules. Any thoughts?
You have not described how the last two lines are failing, but extractText does not function well on some PDFs:
def extractText(self):
    """
    Locate all text drawing commands, in the order they are provided in the
    content stream, and extract the text. This works well for some PDF
    files, but poorly for others, depending on the generator used. This will
    be refined in the future. Do not rely on the order of text coming out of
    this function, as it will change if this function is made more
    sophisticated.
    :return: a unicode string object.
    """
One thing to do is to see if there is text in your pdf. Just because you can see words doesn't mean they have been OCR'd or otherwise encoded in the file as text. Attempt to highlight the text in the pdf and copy/paste it into a text file to see what sort of text can even be extracted.
If you can't get your solution working you'll need to use another package like Tika.
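If extractText() does return usable text, writing it out needs nothing beyond the standard library. A rough sketch, assuming page_texts is a list of strings already obtained from the reader (for example via reader.getPage(i).extractText() from the question's code). The helper name and the file naming pattern are made up here.

```python
import os


def write_pages_as_text(page_texts, out_dir="txt_versions"):
    """Write one .txt file per page into out_dir and return the paths."""
    # Create the output directory if it is not there yet.
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, text in enumerate(page_texts):
        path = os.path.join(out_dir, "document-page%s.txt" % i)
        with open(path, "w", encoding="utf-8") as out:
            out.write(text)
        paths.append(path)
    return paths
```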

How to encapsulate method to print console out to Word

I want to print my summary stats sometimes to console, and also other times to Word.
I don't want my code to be littered with lines calling to Word, because then I'd need to find and comment out like 100 lines each time I just wanted the console output.
I've thought about using a flag variable up at the front and changing it to false when I wanted to print versus not, but that's also a hassle.
The best solution I came up with was to write a separate script that opens a document, writes by calling my first summary stats script, and then closes the document:
import sys
import RunSummaryStats
from docx import Document
filename = "demo.docx"
document = Document()
document.save(filename)
f = open(filename, 'w')
sys.stdout = f
# actually call my summary stats script here: Call RunSummaryStats etc.
print("5")
f.close()
However, when I tried the above with python-docx, opening the docx file gave me the error "We're sorry, we can't open this document because some parts are missing or invalid." As you can see, the code above just printed one number, so it can't be a problem with the data I'm trying to write.
Finally, it needs to go to Word and not other file formats, to format some data tables.
By the way, this is an excerpt of RunSummaryStats. You can see how it's already filled with print lines which are helpful when I'm still exploring the data, and which I don't want to get rid of/replace with adding into a list:
The easy thing is to let cStringIO do the work, and separate collecting all your data from writing it into a file. That is:
import cStringIO  # Python 2 module
import sys
import RunSummaryStats
# first, collect all your data into a StringIO object
orig_stdout = sys.stdout
stat_buffer = cStringIO.StringIO()
sys.stdout = stat_buffer
try:
    # actually call my summary stats script here: Call RunSummaryStats etc.
    print("5")
finally:
    sys.stdout = orig_stdout
# then, write the StringIO object's contents to a Document.
from docx import Document
filename = "demo.docx"
document = Document()
document.add_paragraph(stat_buffer.getvalue())
document.save(filename)
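cStringIO exists only on Python 2. On Python 3 the same idea can be sketched with io.StringIO and contextlib.redirect_stdout. The function name capture_output and the stand-in demo_stats function are assumptions for illustration, not part of RunSummaryStats.

```python
import io
from contextlib import redirect_stdout


def capture_output(func, *args, **kwargs):
    """Call func and return everything it printed to stdout as one string."""
    buffer = io.StringIO()
    # stdout is restored automatically when the with block exits,
    # even if func raises.
    with redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()


def demo_stats():
    # Stand-in for the real summary stats script.
    print("5")


captured = capture_output(demo_stats)
```

The captured string can then be handed to document.add_paragraph(), while calling demo_stats() directly still prints to the console as before.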
The Document() constructor essentially creates the .docx file package (this is actually a .zip archive of lots of XML and other stuff, which later the Word Application parses and renders etc.).
This statement f = open(filename, 'w') opens that file object (NB: this does not open Word Application, nor does it open a Word Document instance) and then you dump your stdout into that object. That is 100% of the time going to result in a corrupted Word Document; because you simply cannot write to a word document that way. You're basically creating a plain text file with a docx extension, but none of the underlying "guts" that make a docx a docx. As a result, Word Application doesn't know what to do with it.
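The point about the missing "guts" can be checked with the standard library alone: a real .docx is a zip archive, while text written through open(filename, 'w') is not. A small sketch (the temporary file location is arbitrary):

```python
import os
import tempfile
import zipfile

path = os.path.join(tempfile.mkdtemp(), "demo.docx")

# Write plain text under a .docx name, as the question's code ends up doing.
with open(path, "w") as f:
    f.write("5\n")

# False: the file has the right extension but is not a zip package.
looks_like_package = zipfile.is_zipfile(path)
```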
Modify your code so that this "summary" procedure returns an iterable (the items in this iterable will be whatever you want to put in the Word Document). Then you can use something like the add_paragraph method to add each item to the Word Document.
def get_summary_stats(console=False):
    """
    if console==True, results will be printed to console
    returns a list of strings for use elsewhere
    """
    # hardcoded, but presume you will actually *get* this information somehow, modify as needed:
    stats = ["some statistics about something", "more stuff about things"]
    if console:
        for s in stats:
            print(s)
    return stats
Then:
from docx import Document
filename = "demo.docx"
document = Document()
# True will cause this function to print to console
for stat in get_summary_stats(True):
    document.add_paragraph(stat)
document.save(filename)
So maybe there was a better way to do it, but in the end I:
created a single function, def run_summary, out of my summary stats script;
created a function, def print_word, based on @Charles Duffy's answer, where StringIO reads from RunSummaryStats.run_summary(filepath, filename);
called print_word in my final module. There I set the variables for path, filename, and raw data source like so:
PrintScriptToWord.print_word(ATSpath, RSBSfilename, curr_file + ".docx")
I welcome any suggestions to improve this or other approaches.

How can I create a word (.docx) document if not found using python and write in it?

How can I create a word (.docx) document if not found using python and write in it?
I certainly cannot do either of the following:
file = open(file_name, 'r')
file = open(file_name, 'w')
or, to create or append if found:
f = open(file_name, 'a+')
Also I cannot find any related info in python-docx documentation at:
https://python-docx.readthedocs.io/en/latest/
NOTE:
I need to create an automated report via python with text and pie charts, graphs etc.
Probably the safest way to create a new file for writing is 'xb' mode. 'x' raises a FileExistsError if the file is already there, so nothing gets truncated by accident. 'b' is necessary because a Word document is fundamentally a binary file: it's a zip archive with XML and other files inside it. You can't compress and decompress a zip file if its bytes are converted through a character encoding.
Document.save accepts streams, so you can pass in a file object opened like that to save your document.
Your work-flow could be something like this:
import docx
doc = docx.Document(...)
...
# Make your document
...
with open('outfile.docx', 'xb') as f:
    doc.save(f)
It's a good idea to use with blocks instead of raw open to ensure the file gets closed properly even in case of an error.
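How 'x' mode behaves can be seen without python-docx at all. A minimal sketch, using made-up bytes in place of a real document:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "outfile.docx")

# First attempt: the file does not exist yet, so 'xb' creates it.
with open(path, "xb") as f:
    f.write(b"placeholder bytes, not a real document")

# Second attempt: the file is already there, so open() refuses
# instead of silently overwriting it.
try:
    with open(path, "xb") as f:
        f.write(b"this is never written")
    overwritten = True
except FileExistsError:
    overwritten = False
```

Catching FileExistsError is the place to decide whether to pick another name or to load and modify the existing document instead.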
In the same way that you can't simply write to a Word file directly, you can't append to it either. The way to "append" is to open the file, load the Document object, and then write it back, overwriting the original content. Since the word file is a zip archive, it's very likely that appended text won't even be at the end of the XML file it's in, much less the whole docx file:
doc = docx.Document('file_to_append.docx')
...
# Modify the contents of doc
...
doc.save('file_to_append.docx')
Keep in mind that the python-docx library may not support loading some elements, which may end up being permanently discarded when you save the file this way.
Looks like I found an answer:
The important point here was to create a new file if not found, or otherwise edit the already present file.
import os
from docx import Document
file_name = 'file_name.docx'
# check whether the file is already present and create it if not
if not os.path.isfile(file_name):
    # create a blank document
    document = Document()
    # save the blank document
    document.save(file_name)
# ------------ editing file_name.docx now ------------
# open the existing document
document = Document(file_name)
# edit it
document.add_heading("hello world", 0)
# save the document in the end
document.save(file_name)
Further edits/suggestions are welcome.

Reading core_properties using python-docx

I'm trying to read the last_saved_by attribute on docx files. I've followed the comments on Github, and from this question. It seems that support has been added, but the documentation isn't very clear for me.
I've entered the following code into my script (Notepad++):
import docx
document = Document()
core_properties = document.core_properties
core_properties.author = 'Foo B. Baz'
document.save('new-filename.docx')
I only get an error message at the end:
NameError: name 'Document' is not defined
I'm not sure where I'm going wrong. :(
When I enter it line by line through python itself, the problem seems to come up from the second line.
I'm using Python 3.4, and docx 0.8.6
Figured out where I was going wrong, for those that want to know:
from docx import Document
import docx
document = Document('mine.docx')
core_properties = document.core_properties
print(core_properties.author)
There'll be a more succinct way of doing this, I'm sure (importing docx twice seems redundant for a start) - but it works, so I'm happy! :)
If the only thing you need from the docx module is Document, then you only need to use
from docx import Document
If you use more than that, you could use
import docx
document = docx.Document()
Importing specific names from the docx module is your choice; either way, you don't need two lines importing from (or importing) docx, although it's not expensive to have both.
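On the last_saved_by part of the question: python-docx exposes that value as core_properties.last_modified_by. Since a .docx is a zip archive, the same field can also be read with the standard library alone. The sketch below is that alternative, not python-docx's own method, and it assumes the core properties sit at the usual docProps/core.xml location. The function name is made up.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the core-properties part of an OPC package.
CP_NS = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"


def read_last_modified_by(docx_path):
    """Return the lastModifiedBy value of a .docx file, or None if absent.

    Assumption: the core properties are stored at docProps/core.xml,
    which is where Word and python-docx normally put them.
    """
    with zipfile.ZipFile(docx_path) as package:
        if "docProps/core.xml" not in package.namelist():
            return None
        root = ET.fromstring(package.read("docProps/core.xml"))
    element = root.find("{%s}lastModifiedBy" % CP_NS)
    return element.text if element is not None else None
```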

Open a word document with python using windows

I am trying to open a word document with python in windows, but I am unfamiliar with windows.
My code is as follows.
import docx as dc
doc = dc.Document(r'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx')
Through another post, I learned that I had to put the r in front of my string to convert it to a raw string or it would interpret the \U as an escape sequence.
The error I get is
PackageNotFoundError: Package not found at 'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx'
I'm unsure of why it cannot find my document, 01100-Allergan-UD1314-SUMMARY OF WORK.docx. The pathway is correct as I copied it directly from the file system.
Any help is appreciated thanks.
Try this:
from io import BytesIO
from docx import Document
file = r'H:\myfolder\wordfile.docx'
# a .docx is binary, so read it in 'rb' mode
with open(file, 'rb') as f:
    source_stream = BytesIO(f.read())
document = Document(source_stream)
source_stream.close()
http://python-docx.readthedocs.io/en/latest/user/documents.html
Also, in regards to debugging the file not found error, simplify your directory names and files names. Rename the file to 'file' instead of referring to a long path with spaces, etc.
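To narrow down a PackageNotFoundError, it can help to check separately whether the path exists and whether the file is a zip archive, since a .docx has to be both. A small diagnostic sketch using only the standard library; the function name and messages are made up, and this does not reproduce python-docx's own checks.

```python
import os
import zipfile


def diagnose_docx_path(path):
    """Return a short description of what is wrong with path, if anything."""
    if not os.path.exists(path):
        folder = os.path.dirname(path) or "."
        if not os.path.isdir(folder):
            return "folder does not exist: " + folder
        return "no such file in " + folder
    if not zipfile.is_zipfile(path):
        # For example a legacy .doc file renamed to .docx.
        return "file exists but is not a zip archive"
    return "looks like a zip package"
```

If the file is reported missing although it is visible in Explorer, listing the folder with os.listdir() shows the exact names Python sees, including extensions that Explorer may hide.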
If you want to open the document in Microsoft Word try using os.startfile().
In your example it would be:
import os
os.startfile(r'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx')
This will open the document in word on your computer.
