Text-Replace in docx and save the changed file with python-docx - python

I'm trying to use the python-docx module to replace a word in a file and save the new file with the caveat that the new file must have exactly the same formatting as the old file, but with the word replaced. How am I supposed to do this?
The docx module has a savedocx that takes 7 inputs:
document
coreprops
appprops
contenttypes
websettings
wordrelationships
output
How do I keep everything in my original file the same except for the replaced word?

This worked for me:
import zipfile

def docx_replace(old_file, new_file, rep):
    """Copy old_file to new_file, applying the replacements in rep to word/document.xml."""
    zin = zipfile.ZipFile(old_file, 'r')
    zout = zipfile.ZipFile(new_file, 'w')
    for item in zin.infolist():
        buffer = zin.read(item.filename)
        # Only the main document part is modified; every other part is copied as-is.
        if item.filename == 'word/document.xml':
            res = buffer.decode("utf-8")
            for r in rep:
                res = res.replace(r, rep[r])
            buffer = res.encode("utf-8")
        zout.writestr(item, buffer)
    zout.close()
    zin.close()

It seems that the docx module for Python is not meant to store a full docx with images, headers, and so on; it only handles the inner content of the document. So there's no simple way to do this.
However, here is how you could do it:
First, have a look at the docx tag wiki:
It explains how a docx file can be unzipped. Here's what a typical file looks like:
+--docProps
|  + app.xml
|  \ core.xml
+ res.log
+--word //this folder contains most of the files that control the content of the document
|  + document.xml //the actual content of the document
|  + endnotes.xml
|  + fontTable.xml
|  + footer1.xml //contains the elements in the footer of the document
|  + footnotes.xml
|  +--media //this folder contains all images embedded in the document
|  |  \ image1.jpeg
|  + settings.xml
|  + styles.xml
|  + stylesWithEffects.xml
|  +--theme
|  |  \ theme1.xml
|  + webSettings.xml
|  \--_rels
|     \ document.xml.rels //this file tells Word where the images are located
+ [Content_Types].xml
\--_rels
   \ .rels
Docx only gets one part of the document, in the method opendocx:
def opendocx(file):
    '''Open a docx file, return a document XML tree'''
    mydoc = zipfile.ZipFile(file)
    xmlcontent = mydoc.read('word/document.xml')
    document = etree.fromstring(xmlcontent)
    return document
It only gets the document.xml file.
What I recommend you do is:
Get the content of the document with opendocx
Modify the document.xml content with the advReplace method
Open the docx as a zip, and replace the content of document.xml with the new XML content
Close and output the zipped file (renaming it to output.docx)
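The last two steps can be sketched with the standard library alone. This is a minimal sketch, not the docx module's own code: replace_document_xml is a hypothetical helper that takes the bytes of a .docx file and the new contents of word/document.xml (however you produced them) and returns the bytes of a new archive in which every other part is copied unchanged.

```python
import io
import zipfile

def replace_document_xml(src_bytes, new_document_xml):
    """Return a copy of the archive with word/document.xml swapped out."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(src_bytes)) as zin, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            # Swap only the main document part; styles, media, headers
            # and relationships are copied byte for byte.
            if item.filename == "word/document.xml":
                data = new_document_xml
            zout.writestr(item, data)
    return out.getvalue()
```

Because headers, footers, styles and images live in their own parts of the archive, they are untouched by this approach.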
If you have node.js installed, note that I have worked on DocxGenJS, which is a templating engine for docx documents; the library is in active development and will be released soon as a node module.

Are you using the docx module from here?
If yes, then the docx module already exposes methods like replace, advReplace etc which can help you achieve your task. Refer to the source code for more details of the exposed methods.

from docx import Document

file_path = 'C:/tmp.docx'
document = Document(file_path)

def docx_replace(doc_obj, data: dict):
    """example: data=dict(order_id=123), result: {order_id} -> 123"""
    for paragraph in doc_obj.paragraphs:
        for key, val in data.items():
            key_name = '{{{}}}'.format(key)
            if key_name in paragraph.text:
                paragraph.text = paragraph.text.replace(key_name, str(val))
    # Table cells contain their own paragraphs (and nested tables), so recurse into them.
    for table in doc_obj.tables:
        for row in table.rows:
            for cell in row.cells:
                docx_replace(cell, data)

docx_replace(document, dict(order_id=123, year=2018, payer_fio='payer_fio', payer_fio1='payer_fio1'))
document.save(file_path)
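Note that assigning to paragraph.text, as the function above does, replaces the paragraph's runs with a single run, so character-level formatting (bold, italic, font) inside that paragraph is lost. A minimal sketch of a run-level alternative follows. It assumes only that a paragraph exposes a runs list whose items have a writable text attribute, as python-docx paragraphs do; replace_in_runs is a hypothetical helper name, and the SimpleNamespace objects are stand-ins so the snippet runs without a document. It only finds text that sits entirely inside one run.

```python
from types import SimpleNamespace

def replace_in_runs(paragraph, old, new):
    """Replace `old` with `new` run by run; return the number of runs changed."""
    changed = 0
    for run in paragraph.runs:
        if old in run.text:
            # Only the text of the run is rewritten; its other attributes stay as they were.
            run.text = run.text.replace(old, new)
            changed += 1
    return changed

# Stand-in for a paragraph made of one plain run and one bold run.
demo = SimpleNamespace(runs=[
    SimpleNamespace(text="Order ", bold=False),
    SimpleNamespace(text="{order_id}", bold=True),
])
changed = replace_in_runs(demo, "{order_id}", "123")
print(changed, [run.text for run in demo.runs])
```

If Word has split the placeholder across several runs, this sketch will not find it, so it may help to type placeholders in one go in the template.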

The problem with the methods above is that they lose the existing formatting. Please see my answer which performs the replace and retains formatting.
There is also python-docx-template which allows jinja2 style templating within a docx template. Here's a link to the documentation

I've forked a repo of python-docx here, which preserves all of the preexisting data in a docx file, including formatting. Hopefully this is what you're looking for.

In addition to @ramil's answer, you have to escape some characters before placing them as string values into the XML, so this worked for me:
def escape(escapee):
    # "&" must be replaced first so the entities added below are not escaped again.
    escapee = escapee.replace("&", "&amp;")
    escapee = escapee.replace("<", "&lt;")
    escapee = escapee.replace(">", "&gt;")
    escapee = escapee.replace("\"", "&quot;")
    escapee = escapee.replace("'", "&apos;")
    return escapee
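The standard library has a helper for this as well. A small sketch, assuming the replacement values are plain strings: xml.sax.saxutils.escape handles &, < and >, and its optional second argument adds the two quote characters. escape_xml is a hypothetical wrapper name.

```python
from xml.sax.saxutils import escape

def escape_xml(value):
    """Escape the five characters that are special in XML text and attributes."""
    return escape(value, {'"': "&quot;", "'": "&apos;"})

print(escape_xml('Tom & "Jerry" <cat>'))
```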

We can use python-docx and keep the images in the docx.
python-docx treats an image as a paragraph,
but for this paragraph the text is empty.
So you can skip such paragraphs like this:
paragraphs = document.paragraphs
for paragraph in paragraphs:
    if paragraph.text == '':
        continue

Related

Python script doesn't write Chinese characters to XML file

I'm making a mod for a game where the majority of the files are XMLs, the text of which is Simplified Chinese. My goal is to replace all of the Simplified Chinese in the files with Traditional, followed by an English translation. I'm using the Cloud Translate API from Google to do that part, and it all works fine. At first I was just doing a find and replace on the Chinese text and then adding English to the end of string, but the issue with that is that I'm getting extra English translations whenever the Chinese text occurs more than once.
In an effort to fix that I read more of the XML documentation for Python, and I started trying to use tree.write, but that's where I'm getting stuck. When I use it, the XML file has the UTF codes for the Chinese characters, rather than the actual characters. If I open the file in a web browser, the characters render correctly, but at this point I'm just unsure if they'll still work with the game if they're not writing into the XML normally.
Here's an example XML I'm working with:
<Texts Type="Story">
  <List>
    <Text Name="TradeAuction">
      <DisplayName>拍卖会</DisplayName>
      <Desc>[NAME]来到了[PLACE],发现此地有个拍卖行。</Desc>
      <Selections.0.Display>参与拍卖</Selections.0.Display>
      <Selections.1.Display>离去</Selections.1.Display>
    </Text>
  </List>
</Texts>
My code, which works but sometimes duplicates English translations:
import lxml.etree as ET
from google.cloud import translate_v2 as translate
import pinyin

translator = translate.Client()
tgt = "zh-TW"
tt = "en"
with open('/home/dave/zh-TW-final/Settings/MapStories/MapStory_Auction.xml', 'r', encoding="utf-8") as f:
    tree = ET.parse(f)
    root = tree.getroot()
    for elem in root.iter('Text'):
        print(elem.text)
        for child in elem:
            txt = child.text
            ttxt = translator.translate(txt, target_language=tgt)
            etxt = translator.translate(txt, target_language=tt)
            with open('/home/dave/zh-TW-final/Settings/MapStories/MapStory_Auction.xml', 'r') as n:
                new = n.read().replace(txt, ttxt['translatedText'] + '(' + etxt['translatedText'] + ')', 1)
            with open('/home/dave/zh-TW-final/Settings/MapStories/MapStory_Auction.xml', 'w') as n:
                n.write(new)
The output of that looks like this:
<Texts Type="Story">
  <List>
    <Text Name="TradeAuction">
      <DisplayName>拍賣會(auctions)</DisplayName>
      <Desc>[NAME]來到了[PLACE],發現此地有個拍賣行。([NAME] came to [PLACE] and found an auction house here.)</Desc>
      <Selections.0.Display>參與拍賣(Participate in the auction)</Selections.0.Display>
      <Selections.1.Display>離去(Leave)</Selections.1.Display>
    </Text>
  </List>
</Texts>
And here's my tree.write code:
import lxml.etree as ET
from google.cloud import translate_v2 as translate
import pinyin

translator = translate.Client()
tgt = "zh-TW"
tt = "en"
with open('/home/dave/zh-TW/Settings/MapStories/MapStory_Auction.xml', 'r', encoding="utf-8") as f:
    tree = ET.parse(f)
    root = tree.getroot()
    for elem in root.iter('Text'):
        print(elem.text)
        for child in elem:
            print(child.text)
            txt = child.text
            ttxt = translator.translate(txt, target_language=tgt)
            etxt = translator.translate(txt, target_language=tt)
            child.text = ttxt['translatedText'] + "(" + etxt['translatedText'] + ")"
    tree.write('/home/dave/zh-TW-final/Settings/MapStories/MapStory_Auction.xml')
And the output from that looks like this:
<Texts Type="Story">
<List>
<Text Name="TradeAuction">
<DisplayName>拍賣會(auctions)</DisplayName>
<Desc>[NAME]來到了[PLACE],發現此地有個拍賣行。([NAME] came to [PLACE] and found an auction house here.)</Desc>
<Selections.0.Display>參與拍賣(Participate in the auction)</Selections.0.Display>
<Selections.1.Display>離去(Leave)</Selections.1.Display>
</Text>
</List>
</Texts>
Any help would be appreciated. I think once I figure this out I should be able to fly through the rest of the translating.
tree.write('/home/dave/zh-TW-final/Settings/MapStories/MapStory_Auction.xml')
Per the documentation:
write(file, encoding="us-ascii", xml_declaration=None, default_namespace=None, method="xml", *, short_empty_elements=True)
...
The output is either a string (str) or binary (bytes). This is controlled by the encoding argument. If encoding is "unicode", the output is a string; otherwise, it’s binary. Note that this may conflict with the type of file if it’s an open file object; make sure you do not try to write a string to a binary stream and vice versa.
So we just need to set the encoding parameter appropriately. Writing as ASCII means that non-ASCII characters need to be entity-escaped (&#25293; etc.). (It still writes to the file without a problem, of course, because the UTF-8 encoding specified for the file is ASCII-transparent.)
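As a small illustration of the difference, here is a sketch using the standard library's xml.etree.ElementTree (the module the quoted documentation describes) rather than lxml. It serializes one element from the example twice, once with the default encoding and once with encoding="utf-8"; serialize is a hypothetical helper that writes into an in-memory buffer instead of a file.

```python
import io
import xml.etree.ElementTree as ET

root = ET.fromstring("<DisplayName>拍卖会</DisplayName>")
tree = ET.ElementTree(root)

def serialize(tree, **write_kwargs):
    """Run tree.write() against an in-memory binary buffer and return the bytes."""
    buffer = io.BytesIO()
    tree.write(buffer, **write_kwargs)
    return buffer.getvalue()

# Default encoding is us-ascii: non-ASCII characters become character references.
ascii_bytes = serialize(tree)
# With utf-8 the characters themselves are written.
utf8_bytes = serialize(tree, encoding="utf-8", xml_declaration=True)

print(ascii_bytes)
print(utf8_bytes.decode("utf-8"))
```

Both outputs describe the same document to an XML parser; the second is simply the one that keeps the characters readable in a text editor.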

Python open .doc file

I'm working on a project in which I need to read the text from multiple doc and docx files. The docx files were easily done with the docx2txt module but I cannot for the love of me make it work for doc files. I've tried with textract, but it doesn't seem to work on Windows. I just need the text in the file, no pictures or anything like that. Any ideas?
I found that this seems to work:
import win32com.client
text = win32com.client.Dispatch("Word.Application")
text.visible = False
wb = text.Documents.Open("myfile.doc")
document = text.ActiveDocument
print(document.Range().Text)
I had a similar issue, the following function worked for me.
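from pathlib import Path

# special_chars is not shown in the original answer. It maps str(byte) keys to
# replacement text; the single entry below is a hypothetical placeholder.
special_chars = {"b'\\r'": '\n'}

def get_string(path: Path) -> str:
    string = ''
    with open(path, 'rb') as stream:
        stream.seek(2560)  # skip the header; 2560 is the offset used in this answer
        current_stream = stream.read(1)
        # Stop at the first NUL byte, or at end of file (read returns b'').
        while current_stream and not (str(current_stream) == "b'\\x00'"):
            if str(current_stream) in special_chars.keys():
                string += special_chars[str(current_stream)]
            else:
                try:
                    char = current_stream.decode('UTF-8')
                    if char.isalnum() or char == ' ':
                        string += char
                except UnicodeDecodeError:
                    string += ''
            current_stream = stream.read(1)
    return string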
I tested it on a .doc file looking like the following:
picture of .doc file
The output from:
string = get_string(filepath)
print(string)
is:
The big red fox jumped over the small barrier to get to the chickens on the other side
And the chickens ran about but had no luck in surviving the day
this||||that||||The other||||

Python: Import text from HTML or text document into Word

I've been looking at some of the documentation, but all of the work I've seen around docx is primarily directed towards working with text already in a Word document. What I'd like to know is whether there is a simple way to take text from either an HTML or a text document and import all of it into a Word document wholesale. It doesn't seem to like the string; it's too long.
My understanding of the documentation is that you have to work with text on a paragraph-by-paragraph basis. The task that I'd like to do is relatively simple; however, it's beyond my Python skills. I'd like to set up the margins on the Word document, and then import the text into the Word document so that it adheres to the margins that I previously specified.
Does anyone have any thoughts? None of the previous posts that I have found have been very helpful.
import textwrap
import requests
from bs4 import BeautifulSoup
from docx import Document
from docx.shared import Inches

class DocumentWrapper(textwrap.TextWrapper):
    def wrap(self, text):
        split_text = text.split('\n\n')
        lines = [line for para in split_text for line in textwrap.TextWrapper.wrap(self, para)]
        return lines

page = requests.get("http://classics.mit.edu/Aristotle/prior.mb.txt")
soup = BeautifulSoup(page.text,"html.parser")
#we are going to pull in the text wrap extension that we have added.
#The typical width that we want tow
text_wrap_extension = DocumentWrapper(width=82,initial_indent="",fix_sentence_endings=True)
new_string = text_wrap_extension.fill(page.text)
final_document = "Prior_Analytics.txt"
with open(final_document, "w") as f:
    f.writelines(new_string)
document = Document(final_document)
### Specified margin specifications
sections = document.sections
for section in sections:
    section.top_margin = (Inches(1.00))
    section.bottom_margin = (Inches(1.00))
    section.right_margin = (Inches(1.00))
    section.left_margin = (Inches(1.00))
document.save(final_document)
The error that I get thrown is below:
docx.opc.exceptions.PackageNotFoundError: Package not found at 'Prior_Analytics.txt'
This error simply means there is no .docx file at the location you specified. So you can modify your code to create the file if it doesn't exist.
final_document = "Prior_Analytics.txt"
with open(final_document, "w+") as f:
    f.writelines(new_string)
You are providing a relative path. How do you know what Python's current working directory is? That's where the relative path you give will start from.
A couple lines of code like this will tell you:
import os
print(os.path.realpath('./'))
Note that:
docx is used to open .docx files
I got it.
from docx import Document
from docx.shared import Inches

document = Document()
sections = document.sections
for section in sections:
    section.top_margin = Inches(2)
    section.bottom_margin = Inches(2)
    section.left_margin = Inches(2)
    section.right_margin = Inches(2)
# Add your text here. add_paragraph accepts text of whatever size.
document.add_paragraph(new_string)
# output_name is a placeholder: the name of the document goes here, as a string.
document.save(output_name)

Create hyperlinks from urls in text file using QTextBrowser

I have a text file with some basic text:
For more information on this topic, go to (http://moreInfo.com)
This tool is available from (https://www.someWebsite.co.uk)
Contacts (https://www.contacts.net)
I would like the urls to show up as hyperlinks in a QTextBrowser, so that when clicked, the web browser will open and load the website. I have seen this post which uses:
Bar
but as the text file can be edited by anyone (i.e. they might include text which does not provide a web address), I would like it if these addresses, if any, can be automatically hyperlinked before being added to the text browser.
This is how I read the text file:
def info(self):
    text_browser = self.dockwidget.text_browser
    file_path = 'path/to/text.txt'
    f = open(file_path, 'r')
    text = f.read()
    text_browser.setText(text)
    text_browser.setOpenExternalLinks(True)
    self.dockwidget.show()
Edit:
Made some headway and managed to get the hyperlinks using (assuming the links are inside parentheses):
import re

def info(self):
    text_browser = self.dockwidget.text_browser
    file_path = 'path/to/text.txt'
    f = open(file_path, 'r')
    text = f.read()
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
    for x in urls:
        if x in text:
            # wrap each URL in an anchor tag
            text = text.replace(x, x.replace('http', '<a href="http') + '">' + x + '</a>')
    text_browser.setHtml(text)
    text_browser.setOpenExternalLinks(True)
    self.dockwidget.show()
However, it all appears in one line and not in the same format as in the text file. How could I solve this?
Matching URLs correctly is more complex than your current solution might suggest. For a full breakdown of the issues, see: What is the best regular expression to check if a string is a valid URL?
The other problem is much easier to solve. To preserve newlines, you can use this:
text = '<br>'.join(text.splitlines())
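Putting the two parts together, here is a minimal sketch using only the standard library. The pattern is a deliberately simple assumption (anything from http:// or https:// up to whitespace, a parenthesis, an angle bracket or a double quote), not a full URL validator, and linkify is a hypothetical helper name. The returned string is what you would pass to setHtml.

```python
import html
import re

# Simplified URL pattern (an assumption, see the caveat about URL matching above).
URL_PATTERN = re.compile(r'https?://[^\s()<>"]+')

def linkify(text):
    """Escape plain text, wrap URLs in anchor tags and keep the line breaks."""
    escaped = html.escape(text, quote=False)
    linked = URL_PATTERN.sub(
        lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), escaped)
    return '<br>'.join(linked.splitlines())

print(linkify("Contacts (https://www.contacts.net)\nNo link on this line"))
```

Escaping first means that stray < or & characters typed into the text file are shown literally instead of being interpreted as markup.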

Search and Replace not working in header? Python docx

I'm using the python-docx module to do some edits on a large number of documents. They all contain a header in which I need to replace a number, but every time I do this the document won't open, with the error that the content is unreadable. Does anyone have any ideas as to why this is happening, or sample working code snippets? Thanks.
from docx import *
#document = yourdocument.docx
filename = "NUR-ADM-2001"
relationships = relationshiplist()
document = opendocx("C:/Users/ai/My Documents/Nursing docs/" + filename + ".docx")
docbody = document.xpath('/w:document/w:body',namespaces=nsprefixes)[0]
advReplace(docbody, "NUR-NPM 101", "NUR-NPM 202")
# Create our properties, contenttypes, and other support files
coreprops = coreproperties(title='Nursing Doc',subject='Policies',creator='IA',keywords=['Policy'])
appprops = appproperties()
contenttypes = contenttypes()
websettings = websettings()
wordrelationships = wordrelationships(relationships)
# Save our document
savedocx(document,coreprops,appprops,contenttypes,websettings, wordrelationships,"C:/Users/ai/My Documents/Nursing docs/" + filename + ".docx")
Edit: So it eventually can open the document, but it says some content cannot be displayed and the headers have vanished... thoughts?
I don't know this module, but in general you should not edit a file in place. Open file "A", write file "/tmp/A". Close both files and make sure you have no errors, then move "/tmp/A" to "A". Otherwise you risk clobbering your file if something goes wrong during the write.
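That advice can be sketched with the standard library as follows. It is a minimal sketch under the assumption that the new file contents are already available as bytes; safe_write is a hypothetical helper name. The temporary file is created in the same directory as the target so that os.replace is a rename on one filesystem rather than a copy.

```python
import os
import tempfile

def safe_write(path, data):
    """Write data to a temporary file, then move it over path in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # Only reached if the write succeeded; the original stays intact otherwise.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If anything fails while writing, the original file is left as it was and the temporary file is removed.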
