Modifying text in docx through Python


I am trying to open an existing Word document, modify a line by appending user input (raw input) to the end of it, and then save the result over the original .docx file. I am using the docx module. I can create a new document, write to it, and save it, but I cannot figure out how to modify an existing line in an existing .docx file.
doc = docx.Document()
paragraph = doc.add_paragraph()
So far I've tried the code below. The problem is that the paragraph I need to modify is paragraph 0, and this code places my text at the bottom of the page in a new paragraph.
import docx
doc = docx.Document(r"C:\Users\xxx\Desktop\test.docx")
doc.add_paragraph("hello")
# the interpreter echoes: <docx.text.paragraph.Paragraph object at 0x03697170>
doc.save(r"C:\Users\xxx\Desktop\test.docx")
How can I tell Python to append to the existing text of a paragraph and then save the file, overwriting the original?

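import docx
# Raw strings keep the backslashes in the Windows path literal
doc = docx.Document(r"C:\Users\xxx\Desktop\test.docx")
# add_run() appends text to the end of the existing first paragraph
doc.paragraphs[0].add_run("hello")
doc.save(r"C:\Users\xxx\Desktop\test.docx")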

Do you mean something like this?
from docx import Document
your_text = "hello"  # placeholder for the text you want to append
existing_docx = Document(r'path_to_existing.docx')
for paragraph in existing_docx.paragraphs:
    # Assigning to paragraph.text replaces the paragraph's runs with a single run
    paragraph.text = paragraph.text + your_text
existing_docx.save(r'same_path_or_another.docx')
With the code above, the new .docx will have your text appended to the end of every paragraph.

To get paragraph 0 try:
para = doc.paragraphs[0]

Related

How to remove blank page in word file using python docx or any other library available in python?

I have a Word document in .docx format that has multiple blank pages. I need to delete all the blank pages in the document using Python. I have tried to delete the empty lines in the document, but it is not working. I tried the code below:
from docx import Document
document = Document("\\doc_path")
z = len(document.paragraphs)
for i in range(0, z):
    if document.paragraphs[i].text == "":
        # clear() removes a paragraph's content but keeps the paragraph itself
        document.paragraphs[i].clear()
The above-mentioned code is not working and deleting all the intermediate empty lines on a page.

How to open a docx file to be edited as XML and save the result as docx after editing using Python

I have a docx file whose paragraphs I need to edit (the paragraphs might contain equations). I tried to do this with python-docx, but without success: replacing each paragraph with its edited text requires a call like p.add_paragraphs(editText(paragraph.text)), which ignores and omits any mathematical equations.
While searching for a way to do this, I found that it is possible at the XML level by finding the <w:t> tags and editing their content, like this:
import xml.etree.ElementTree as ET  # assuming ET refers to the standard library ElementTree
tree = ET.parse(filename)
root = tree.getroot()
for par in root.findall('w:p'):
    if par.find('w:r'):
        myText = par.find('w:r').find('w:t')
        myText.text = editText(myText.text)
Then I must save the result as docx.
My question is: what format should filename have? Should it be a document.xml file? If so, how can I get it from my original document.docx file? And how can I save the result as a .docx file again?
To save the docx as XML, I tried document.save('Document2.xml'), but the content of the result was not correct.
Could you give me some advice on how to do this?

Not experienced with this at all, but perhaps this is what you were looking for?
https://virantha.com/2013/08/16/reading-and-writing-microsoft-word-docx-files-with-python/
From the article:
import os
import shutil
import tempfile
import zipfile
from lxml import etree
class DocsWriter:
    def __init__(self, docx_file):
        self.zipfile = zipfile.ZipFile(docx_file)
    def _write_and_close_docx(self, xml_content, output_filename):
        """ Create a temp directory, expand the original docx zip.
            Write the modified xml to word/document.xml
            Zip it up as the new docx
        """
        tmp_dir = tempfile.mkdtemp()
        self.zipfile.extractall(tmp_dir)
        # etree.tostring() returns bytes, so the file is opened in binary mode
        with open(os.path.join(tmp_dir, 'word/document.xml'), 'wb') as f:
            xmlstr = etree.tostring(xml_content, pretty_print=True)
            f.write(xmlstr)
        # Get a list of all the files in the original docx zipfile
        filenames = self.zipfile.namelist()
        # Now, create the new zip file and add all the files into the archive
        zip_copy_filename = output_filename
        with zipfile.ZipFile(zip_copy_filename, "w") as docx:
            for filename in filenames:
                docx.write(os.path.join(tmp_dir, filename), filename)
        # Clean up the temp dir
        shutil.rmtree(tmp_dir)
From what I can tell, this code block writes an xml document as .docx. Refer to the article for more context.
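For comparison, here is a minimal standard-library sketch of the same idea: a .docx file is a zip archive, the body text lives in word/document.xml, and the edited XML has to be zipped back together with the other archive members. The function name edit_docx_text and its parameters are hypothetical, not part of any library. One caveat to keep in mind: ElementTree may rewrite namespace prefixes when serializing, which documents produced by Word can be sensitive to, so lxml (as in the article above) is probably the safer choice for real files. This sketch has only been reasoned through against a minimal hand-built archive.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by the w: prefix in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def edit_docx_text(src, dst, edit_text):
    """Copy the docx at `src` to `dst`, passing every <w:t> text through edit_text.

    `src` and `dst` may be paths or binary file-like objects.
    """
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for name in zin.namelist():
            data = zin.read(name)
            if name == "word/document.xml":
                root = ET.fromstring(data)
                # Only the text nodes are touched; other elements
                # (such as equations) are left in place.
                for node in root.iter("{%s}t" % W_NS):
                    if node.text:
                        node.text = edit_text(node.text)
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            # Every other archive member is copied through unchanged
            zout.writestr(name, data)
```

A call such as edit_docx_text("document.docx", "edited.docx", editText) would then answer the "what is filename" part of the question: the XML file is read from inside the archive rather than opened directly.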

Python is not the best tool for this. Use VBA if you need to automate something in a Word document, or multiple Word documents. I can't tell what you are even trying to do here, but let's start at the beginning, with something simple. If, for instance, you want to loop through all paragraphs in your Word document, and select only the equations, you can run the code below to do just that.
Sub SelectAllEquations()
    Dim xMath As OMath
    Dim I As Integer
    With ActiveDocument
        .DeleteAllEditableRanges wdEditorEveryone
        For I = 1 To .OMaths.Count
            Set xMath = .OMaths.Item(I)
            xMath.Range.Paragraphs(1).Range.Editors.Add wdEditorEveryone
        Next
        .SelectAllEditableRanges wdEditorEveryone
        .DeleteAllEditableRanges wdEditorEveryone
    End With
End Sub
Again, I don't know what your end game is, but I think it's worthwhile to start with something like that, and build on your foundation.

Trying to replace text in a table in a .docx file based on a regex

I am trying to replace text in a table in a .docx file using Python. I am fairly new to Python, so here is the code, which I explain below:
from typing import List, Any
from docx import Document
import re
import sys
label_name = sys.argv[1:][0]
file_name = "MyDocFile.docx"
doc = Document(file_name)
cell_text_array = []
target_index = 0
def index_cells(doc_obj):
    global cell_text_array
    for table in doc_obj.tables:
        for row in table.rows:
            for cell in row.cells:
                cell_text_array.append(cell.text)
def docx_replace_regex(doc_obj, regex, replace):
    global cell_text_array
    for p in doc_obj.paragraphs:
        if regex.search(p.text):
            inline = p.runs
            # Loop added to work with runs (strings with same style)
            for i in range(len(inline)):
                if regex.search(inline[i].text):
                    text = regex.sub(replace, inline[i].text)
                    inline[i].text = text
    for table in doc_obj.tables:
        for row in table.rows:
            for cell in row.cells:
                docx_replace_regex(cell, regex, replace)
# index the cells in the document
index_cells(doc)
# everything after: /myregex/
target_index = cell_text_array.index('myregex')
# the text that I actually need is 3 spots after 'myregex'
target_index += 3
former_label = cell_text_array[target_index]
# find regex and replace
regex1 = re.compile(re.escape(r"" + former_label))
replace1 = r"" + label_name
print(regex1)
print(replace1)
# call the replace function and save what has been replaced
docx_replace_regex(doc, regex1, replace1)
doc.save('result1.docx')
The first function, index_cells(), goes through every table in MyDocFile.docx and saves the text of each cell in cell_text_array. I took the next function from the internet because I don't usually code in Python but am forced to in this case (I can't use Ruby's 'docx' module for various reasons). docx_replace_regex() does what its name suggests: it finds the text that needs to be replaced and replaces it with replace, whether that text is in a body paragraph or in a table.
What I'm trying to do is pass a new name/label/tag (whatever you want to call it) as a command-line parameter, replace the old name/label/tag in the .docx file with it, and save the edited document as a new .docx file.
This code works fine if the name/tag/label I'm trying to replace does not contain any dots. In fact, I tested it on other strings from the tables and it worked just fine. Since this name/tag/label contains dots, I used re.compile(re.escape()) so that the dots would not be treated as special characters. I thought that should work, but for some reason nothing is changed in the generated file.
I printed regex1 and replace1 to see their values. regex1 is shown as re.compile('tag\\.name\\.label'), while replace1 is just tag.name.label without any quotes. I think this might be the cause of the misbehaviour, but I'm not sure since I'm very new to Python.
Can anyone help me with this? Is there something I'm missing?
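A note on the printed values: the doubled backslashes in re.compile('tag\\.name\\.label') are just how Python displays a single backslash in a string, so the escaped pattern itself looks fine. One possible cause, which is an assumption and not something the question confirms, is that Word stored the label across several runs. In that case p.text contains the full label, but no single run does, so the per-run search never matches. The sketch below illustrates this with plain strings standing in for run texts; replace_across_runs is a hypothetical helper, and the way the label is split is made up for illustration.

```python
import re


def replace_across_runs(run_texts, regex, replacement):
    """Return new run texts with `regex` replaced in the joined text.

    The whole result is placed in the first run and the remaining runs are
    emptied, so character formatting of the later runs is not preserved.
    """
    joined = "".join(run_texts)
    if not run_texts or not regex.search(joined):
        return list(run_texts)
    return [regex.sub(replacement, joined)] + [""] * (len(run_texts) - 1)


# Plain strings stand in for run.text values; this split is hypothetical.
runs = ["tag.", "name", ".label"]
pattern = re.compile(re.escape("tag.name.label"))

# The per-run search from the question finds nothing in this situation...
per_run_hit = any(pattern.search(text) for text in runs)
# ...while searching the joined text does.
new_runs = replace_across_runs(runs, pattern, "new.label")
```

With python-docx, the same idea would mean reading each run's text from p.runs, computing the new texts, and assigning them back to each run's text attribute.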

How to copy certain strings from txt file to Word doc using Python?

I want to copy a word or string from a txt file into a specific table cell in a Word file.
Can someone guide me on how to do it?
Best Regards,
Usman

If your question is how to write a Word (.docx) file, there is a library called python-docx (imported as docx). It can be installed with pip:
pip install python-docx
Here is a short example that writes a docx file for you.
from docx import Document
document = Document()
document.add_heading('Document Title', 0)
p = document.add_paragraph('A plain paragraph having some text')
document.save('demo.docx')
Here is a script that will read a text file and add the lines that match a condition to a Word file. Change the matches_my_condition function to suit your own needs.
from docx import Document
def matches_my_condition(line):
    """ Returns True if the given line should be added to the document """
    # True when the word cake appears in the line
    return 'cake' in line
# Prepare document
document = Document()
with open('my_text_file.txt', 'r') as textfile:
    for line in textfile:
        if matches_my_condition(line):
            # Strip the trailing newline so it does not end up inside the paragraph
            document.add_paragraph(line.rstrip('\n'))
document.save('my_cake_file.docx')

read ms word with python

I'm trying to read an MS Word file with StringIO, but the output is a strange string.
from docx import Document
import StringIO
import cStringIO
files = "D:/Workspace/Python scripting/test.docx"
document = Document(files)
f = cStringIO.StringIO()
document.save(f)
contents = f.getvalue()
print contents
Thanks for any help in advance

document.save(f) saves the file to a string, formatted as a .docx file. You're then reading that string, which will do exactly the same thing as f=open(files).read(). If you want the text in the document, you should use python-docx's API for that. I haven't used it before, but the documentation is here:
https://python-docx.readthedocs.org/en/latest/index.html
It looks like you could use something like this:
paragraphs=document.paragraphs
This is the list of Paragraph objects in the document. You can get the text of those paragraphs like this:
text="\n".join([paragraph.text for paragraph in paragraphs])
text will then contain the text of the document.
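To see why the printed contents looked strange: the bytes that document.save(f) writes are a zip archive, with the body text stored as XML in word/document.xml. As a rough illustration that uses only the standard library instead of python-docx, the sketch below pulls paragraph text out of such bytes. The function name docx_bytes_to_text is hypothetical, and the sketch ignores headers, footnotes and other parts of the archive.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by the w: prefix in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_bytes_to_text(data):
    """Return the paragraph texts found in raw .docx bytes, joined by newlines."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for para in root.iter("{%s}p" % W_NS):
        # A paragraph's text may be split over several <w:t> elements (runs)
        lines.append("".join(t.text or "" for t in para.iter("{%s}t" % W_NS)))
    return "\n".join(lines)
```

In Python 3 the in-memory buffer from the question would be an io.BytesIO object, and its getvalue() result could be passed to this function.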
