read ms word with python - python

I'm trying to read an MS Word file with StringIO, but the output comes out as a strange string:
from docx import Document
import StringIO
import cStringIO
files = "D:/Workspace/Python scripting/test.docx"
document = Document(files)
f = cStringIO.StringIO()
document.save(f)
contents = f.getvalue()
print contents
Thanks for any help in advance

document.save(f) writes the document into the in-memory buffer in .docx format, which is binary (a zip archive), not text. You're then reading that buffer back, which gives you essentially the same thing as open(files, 'rb').read(). If you want the text in the document, you should use python-docx's API for that. I haven't used it before, but the documentation is here:
https://python-docx.readthedocs.org/en/latest/index.html
It looks like you could use something like this:
paragraphs=document.paragraphs
This is the list of Paragraph objects in the document. You can get the text of those paragraphs like this:
text="\n".join([paragraph.text for paragraph in paragraphs])
text will then contain the text of the document.
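Putting those two steps together, a minimal sketch might look like the following. The helper names document_text and read_docx_text are made up for illustration; Document, document.paragraphs and paragraph.text are the python-docx names used above. The python-docx import is kept inside the function so that the joining helper can be tried on its own.

```python
def document_text(document):
    """Join the text of every paragraph in a Document-like object.

    Only `document.paragraphs` and `paragraph.text` are used, so this works
    with a python-docx Document or any stand-in with the same attributes.
    """
    return "\n".join(paragraph.text for paragraph in document.paragraphs)


def read_docx_text(path):
    """Open a .docx file with python-docx and return its paragraph text."""
    from docx import Document  # third-party: pip install python-docx

    return document_text(Document(path))
```

With the file from the question this would be called as print(read_docx_text("D:/Workspace/Python scripting/test.docx")). Note that document.paragraphs only covers body paragraphs, so text inside tables would need to be collected separately.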

Related

save text to a docx file python

I have some structured text in Python that I want to save to a docx file.
Something like
text = "A simple text\n Structured"
with open('docx_file.docx', 'w') as f:
    f.write(text)
Check python-docx package:
from docx import Document
document = Document()
document.add_heading('A simple text', level=1)
document.add_paragraph('some more text ... ')
document.save('docx_file.docx')
A docx file is not a plain text file, so unless you want to avoid using a library for this, I recommend the guide at https://grokonez.com/python/how-to-read-write-word-docx-files-in-python-docx-module, which covers the python-docx module.
Unless you need to use a "fancy" format like docx, I would recommend just writing plain text to a txt file.

Convert pdf files to raw text in new directory

Here is what I'm trying:
import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader
import re
import config
import sys
import os
with open(config.ENCRYPTED_FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    if reader.isEncrypted:
        reader.decrypt('Password123')
        print(f"Number of page: {reader.getNumPages()}")
    for i in range(reader.numPages):
        output = PdfFileWriter()
        output.addPage(reader.getPage(i))
        with open("./pdfs/document-page%s.pdf" % i, "wb") as outputStream:
            output.write(outputStream)
        print(outputStream)
        for page in output.pages: # failing here
            print page.extractText() # failing here
The program decrypts a large PDF file from one location and splits it into a separate PDF file per page in a new directory. That part works fine. After this, however, I would like to convert each page to a raw .txt file in another new directory, i.e. /txt_versions/ (which I'll use later).
Ideally, I can use my current imports, i.e. PyPDF2, without importing or installing more modules. Any thoughts?
You have not described how the last two lines are failing, but extractText does not work well on some PDFs, as its own docstring says:
def extractText(self):
    """
    Locate all text drawing commands, in the order they are provided in the
    content stream, and extract the text. This works well for some PDF
    files, but poorly for others, depending on the generator used. This will
    be refined in the future. Do not rely on the order of text coming out of
    this function, as it will change if this function is made more
    sophisticated.
    :return: a unicode string object.
    """
One thing to do is to see if there is text in your pdf. Just because you can see words doesn't mean they have been OCR'd or otherwise encoded in the file as text. Attempt to highlight the text in the pdf and copy/paste it into a text file to see what sort of text can even be extracted.
If you can't get your solution working you'll need to use another package like Tika.
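If extractText does return usable text for your file, the per-page .txt step could be sketched roughly like this. It assumes the older PyPDF2 API used in the question (getPage, numPages, extractText); the helper names and the page-<n>.txt naming are hypothetical. The file-writing helper takes plain strings, so it can be tried without a PDF at hand.

```python
import os


def write_page_texts(page_texts, out_dir):
    """Write each page's text to out_dir/page-<n>.txt and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, text in enumerate(page_texts):
        path = os.path.join(out_dir, "page-%s.txt" % i)
        with open(path, "w", encoding="utf-8") as txt_file:
            txt_file.write(text)
        paths.append(path)
    return paths


def extract_page_texts(reader):
    """Collect extractText() output for every page of a PdfFileReader.

    Assumes the older PyPDF2 API shown in the question.
    """
    return [reader.getPage(i).extractText() for i in range(reader.numPages)]
```

Inside the question's with block, after decrypting, this would be called as write_page_texts(extract_page_texts(reader), "./txt_versions"). Separately, since the script uses an f-string it must be running on Python 3, where print is a function, so print page.extractText() is a syntax error there and needs to be print(page.extractText()).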

How to encapsulate method to print console out to Word

I want to print my summary stats sometimes to console, and also other times to Word.
I don't want my code to be littered with lines calling to Word, because then I'd need to find and comment out like 100 lines each time I just wanted the console output.
I've thought about using a flag variable up at the front and changing it to false when I wanted to print versus not, but that's also a hassle.
The best solution I came up with was to write a separate script that opens a document, writes by calling my first summary stats script, and then closes the document:
import sys
import RunSummaryStats
from docx import Document
filename = "demo.docx"
document = Document()
document.save(filename)
f = open(filename, 'w')
sys.stdout = f
# actually call my summary stats script here: Call RunSummaryStats etc.
print("5")
f.close()
However, when I tried the above with python-docx, opening the resulting .docx file gave me the error "We're sorry, we can't open this document because some parts are missing or invalid." As you can see, the code above prints just one number, so it can't be a problem with the data I'm trying to write.
Finally, it needs to go to Word and not other file formats, to format some data tables.
By the way, this is an excerpt of RunSummaryStats. You can see how it's already filled with print lines which are helpful when I'm still exploring the data, and which I don't want to get rid of/replace with adding into a list:
The easy thing is to let cStringIO do the work, and separate collecting all your data from writing it into a file. That is:
import cStringIO  # Python 2 only; on Python 3 use io.StringIO instead
import sys
import RunSummaryStats
# first, collect all your data into a StringIO object
orig_stdout = sys.stdout
stat_buffer = cStringIO.StringIO()
sys.stdout = stat_buffer
try:
    # actually call my summary stats script here: Call RunSummaryStats etc.
    print("5")
finally:
    # always restore the real stdout, even if the stats script raises
    sys.stdout = orig_stdout
# then, write the StringIO object's contents to a Document.
from docx import Document
filename = "demo.docx"
document = Document()
document.add_paragraph(stat_buffer.getvalue())
document.save(filename)
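On Python 3, where cStringIO no longer exists, the same capture step can be sketched with io.StringIO and contextlib.redirect_stdout from the standard library. The names capture_output and print_stats are hypothetical; in the real script the callable would be whatever runs RunSummaryStats.

```python
import contextlib
import io


def capture_output(func, *args, **kwargs):
    """Run func and return everything it printed to stdout as a string."""
    buffer = io.StringIO()
    # redirect_stdout restores sys.stdout on exit, even if func raises
    with contextlib.redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()


def print_stats():
    """Stand-in for the summary stats script; it only prints."""
    print("5")


captured = capture_output(print_stats)
```

The captured string can then be passed to document.add_paragraph(captured) before saving, and the print calls in the stats script stay untouched for console use.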
The Document() constructor essentially creates the .docx file package (this is actually a .zip archive of lots of XML and other stuff, which later the Word Application parses and renders etc.).
This statement f = open(filename, 'w') opens that file object (NB: this does not open Word Application, nor does it open a Word Document instance) and then you dump your stdout into that object. That is 100% of the time going to result in a corrupted Word Document; because you simply cannot write to a word document that way. You're basically creating a plain text file with a docx extension, but none of the underlying "guts" that make a docx a docx. As a result, Word Application doesn't know what to do with it.
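To see the difference for yourself, the standard library's zipfile module can tell a real package from a text file that merely has a .docx name. This is only a sketch: looks_like_docx is a hypothetical helper, and checking for [Content_Types].xml is an assumption about what a quick sanity check should look for, not a full validation of the file.

```python
import zipfile


def looks_like_docx(path):
    """Return True if path is a zip archive containing [Content_Types].xml.

    This is only a rough sanity check, not a full validation of the package.
    """
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as package:
        return "[Content_Types].xml" in package.namelist()
```

A file produced by open(filename, 'w') plus print fails this check, because opening with 'w' truncates whatever document.save wrote and leaves only plain text behind.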
Modify your code so that this "summary" procedure returns an iterable (the items in this iterable will be whatever you want to put in the Word Document). Then you can use something like the add_paragraph method to add each item to the Word Document.
def get_summary_stats(console=False):
    """
    if console==True, results will be printed to console
    returns a list of strings for use elsewhere
    """
    # hardcoded, but presume you will actually *get* this information somehow, modify as needed:
    stats = ["some statistics about something", "more stuff about things"]
    if console:
        for s in stats:
            print(s)
    return stats
Then:
from docx import Document
filename = "demo.docx"
document = Document()
# True will cause this function to print to console
for stat in get_summary_stats(True):
    document.add_paragraph(stat)
document.save(filename)
So maybe there is a better way to do it, but in the end I:
created a single function, run_summary, out of my summary stats script;
created a function, print_word, based on @Charles Duffy's answer, in which StringIO captures the output of RunSummaryStats.run_summary(filepath, filename);
called print_word in my final module. There I set the variables for path, filename, and raw data source like so:
PrintScriptToWord.print_word(ATSpath, RSBSfilename, curr_file + ".docx")
I welcome any suggestions to improve this or other approaches.

Inserting txt file info into docx

I am using python 3.5, python-docx, and TKinter. Is there any way to be able to type a .txt file name into an entry box and have python insert the contents of that .txt file into a specific place in docx? I know how to get information from Entry boxes in tkinter and how to convert them into a string.
I want to be able to enter someone's name (which would also be the name of the text file) into the entry box and have Python insert the content of that .txt file.
Thanks
Update: here is the code I'm trying:
from tkinter import *
from tkinter import ttk
import tkinter as tk
from docx import Document
root=Tk()
def make_document():
    testbox=ProjectEngineerEntry.get()
    TestBox=str(testbox)
    def projectengineer():
        with open(+TestBox+'.txt') as f:
            for line in f:
                document.add_paragraph(line)
    document=Document()
    h1=document.add_heading('engineer test',level=1)
    h1a=document.add_heading('Project Engineer',level=2)
    projectengineer()
    document.save('test.docx')
notebook=ttk.Notebook()
notebook.pack(fill=BOTH)
frame1=ttk.Frame(notebook)
notebook.add(frame1,text='Tester',sticky=W)
tester1=Label(frame1,text='Test1')
ProjectEngineerEntry=Entry(frame1)
tester1.pack()
ProjectEngineerEntry.pack()
save=Button(frame1,text='Save',command=make_document).pack()
As you can see, I am trying to take the information from the entry box, convert it to a string, and then use that string to open a text file with that specific name. However, I keep getting a TypeError: bad operand type for unary +: 'str'
I don't understand what's going on here. In the actual document I used the ++ method when saving the file (it is saved under the current date and time).
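The error most likely comes from the leading + in open(+TestBox+'.txt'). With nothing on its left, Python parses that first + as the unary plus operator applied to the string, which str does not support. The ++ pattern works in the middle of an expression, such as 'a' + name + 'b', because each + there has an operand on both sides. A minimal sketch of the fix, with text_file_name and unary_plus_fails as hypothetical helper names:

```python
def text_file_name(name):
    """Build '<name>.txt' from the string typed into the entry box."""
    # binary +: an operand on each side, so this is plain concatenation
    return name + ".txt"


def unary_plus_fails(value):
    """Return True if applying unary + to value raises TypeError."""
    try:
        +value  # unary +: what open(+TestBox+'.txt') applies to TestBox first
    except TypeError:
        return True
    return False
```

In make_document the call would then read open(text_file_name(TestBox)), or simply open(TestBox + '.txt').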
python-docx doesn't have this functionality and it's unlikely it ever will. Typically this sort of thing is done to suit by the script or application using python-docx. Something like this:
from docx import Document
document = Document()
with open('textfile.txt') as f:
    for line in f:
        document.add_paragraph(line)
document.save('wordfile.docx')
You'll need to deal with the particulars, like how a paragraph is separated (perhaps a blank line) and so on, but the code doesn't need to be much longer than this.
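As one example of those particulars, here is a rough sketch that treats blank lines as paragraph separators. split_paragraphs is a hypothetical helper, and collapsing line breaks inside a paragraph into single spaces is an assumption about the desired layout rather than anything python-docx requires.

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs wherever one or more blank lines occur.

    Line breaks inside a paragraph are collapsed to single spaces.
    """
    blocks = re.split(r"\n\s*\n", text.strip())
    return [" ".join(block.split()) for block in blocks if block.strip()]
```

The loop then becomes: for paragraph in split_paragraphs(f.read()): document.add_paragraph(paragraph). This also avoids passing along the trailing newline that each line keeps when you iterate over the file directly.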
Take a look at the following python library: https://python-docx.readthedocs.io/en/latest/
Just a tidied-up adaptation of scanny's answer:
%pip install python-docx
from docx import Document
def to_docx(file_path_plain_text, file_path_docx):
    document = Document()
    with open(file_path_plain_text) as f:
        for line in f:
            document.add_paragraph(line)
    document.save(file_path_docx)
to_docx('my_file.txt', 'report.docx')
Charmap and other encoding errors may arise when reading the content of the plain-text file, but writing a function that is robust against them is beyond the scope of this topic.
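For readers who do run into those errors, one possible approach is to try a short list of encodings and only then fall back to replacing undecodable bytes. This is a sketch under assumptions: read_text_lines is a hypothetical helper, and the candidate encodings (utf-8, then cp1252) are a guess that should be adjusted to wherever the text files come from.

```python
def read_text_lines(path, encodings=("utf-8", "cp1252")):
    """Read a text file, trying each encoding in turn.

    Falls back to the first encoding with undecodable bytes replaced, so the
    call never fails on a decoding error.
    """
    for encoding in encodings:
        try:
            with open(path, encoding=encoding) as f:
                return f.read().splitlines()
        except UnicodeDecodeError:
            continue
    with open(path, encoding=encodings[0], errors="replace") as f:
        return f.read().splitlines()
```

Inside to_docx the loop would then iterate over read_text_lines(file_path_plain_text) instead of over the open file.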

Modifying text in docx through Python

I am currently trying to open an existing Word document, modify a line by appending raw input to the end of it, then save it, overwriting the original "DOCX". I am using the "DOCX" module. I'm able to create a new document, write in it, then save it. However, I cannot figure out how to modify an existing line in an existing "DOCX".
doc = docx.Document()
paragraph = doc.add_paragraph()
So far, I've tried this. The problem is that the paragraph I need to modify is paragraph 0, and this code places my text in a new paragraph at the bottom of the page.
import docx
doc = docx.Document(r"C:\Users\xxx\Desktop\test.docx")
doc.add_paragraph("hello")
# <docx.text.paragraph.Paragraph object at 0x03697170>
doc.save(r"C:\Users\xxx\Desktop\test.docx")
How can I go about instructing python to write at the end of an existing string in a paragraph then save it overwriting the original?
import docx
# raw strings keep the backslashes in the Windows path from being read as escapes
doc = docx.Document(r"C:\Users\xxx\Desktop\test.docx")
doc.paragraphs[0].add_run("hello")
doc.save(r"C:\Users\xxx\Desktop\test.docx")
Do you mean something like this?
from docx import Document
existing_docx = Document(r'path_to_existing.docx')
for paragraph in existing_docx.paragraphs:
    paragraph.text = paragraph.text + your_text
existing_docx.save(r'same_path_or_another.docx')
With the code above, the new docx will have your text appended to every paragraph.
To get paragraph 0 try:
para = doc.paragraphs[0]
