Using python RE to replace a string in Word Document? - python

I am trying to go through a Word document and replace every occurrence of the text string 'aaa' (just as an example) with a value taken from user input. I have been working through several answers on Stack Overflow and came across regular expressions, which I have never used before; even after following a tutorial, I can't get my head around them.
This is all the code I have tried, but I can't get Python to actually change the text string in the Word document.
from docx import Document
import re
signature = Document ('test1.docx')
person = raw_input('Name?')
person = person+('.docx')
save = signature.save(person)
name_change = raw_input('Change name?')
line = re.sub('[a]{3}',name_change,signature)
print line
save
for line in signature.paragraphs:
    line = re.sub('[a]{3}',name_change,signature)
for table in signature.tables:
    for cell in table.cells:
        for paragraph in cell.paragraphs:
            if 'aaa' in paragraph.text:
                print paragraph.text
                paragraph.text= replace('aaa',name_change)
save
Thank you in advance for any help.

for line in signature.paragraphs:
    line = re.sub('[a]{3}',name_change,signature)
The above code has no effect: re.sub returns a new string and the assignment only rebinds the loop variable line, so the original object is never updated, as shown below:
data = ['aaa', 'baaa']
for item in data:
    item = re.sub('[a]{3}', 't', item)
print(data)
#['aaa', 'baaa']
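Also, you are iterating over signature.paragraphs but calling re.sub on the whole signature object every time, and re.sub expects a string, not a Document. Try something like this, which updates each paragraph's text and then saves the document:
for paragraph in signature.paragraphs:
    paragraph.text = re.sub('[a]{3}', name_change, paragraph.text)
signature.save(person)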
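The rebinding behaviour described above can be reproduced with plain strings. The following is a minimal sketch using only the standard library and made-up sample data; it contrasts rebinding the loop variable with writing the result back through the index.

```python
import re

data = ['aaa', 'baaa']

# Rebinding the loop variable leaves the list untouched.
for item in data:
    item = re.sub('[a]{3}', 't', item)
unchanged = list(data)

# Writing the result back through the index updates the list in place.
for i, item in enumerate(data):
    data[i] = re.sub('[a]{3}', 't', item)

print(unchanged)
print(data)
```

The same idea applies to python-docx paragraphs: the new string has to be assigned back to an attribute of the object (for example paragraph.text), not to the loop variable.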

Related

Why are MailMerge objects unable to be converted to Unicode? Is there a reliable way to print templated docs in Python?

I am trying to read a text file which is formatted like a spreadsheet into a list, and then use the parameters from said columns as merge fields and insert them into a template Word document using Python. However, I get a TypeError saying "Objects of type 'MailMerge' can not be converted to Unicode" when trying to print the docs using win32api.ShellExecute(). We normally do this process manually: sort the main text file into three others based on whether the members are 15 days, 25 days, or 30 days delinquent, then use mail merge in Word to select this text file as a list of recipients. I've written the code in this example separately from the main program to get the mechanics working before putting it together.
In the main program I have the sorting feature completed. I also was able to use document.merge() and insert information into one document. However to do this for several pages in one document I need to use MailMerge() and pass it dictionaries as arguments. I tried using a for loop with the respective row indices as values for the keys, but since that function needs more than one dictionary it did not work. I then came up with using a for loop to insert each member and their info into one merge, write it to the output file, print that output file, and do that for each member until it's done. I've tried using win32api.ShellExecute(0, "print", document, '/d:"%s"' % win32print.GetDefaultPrinter(), ".", 0) but get the error stated above. The only other print method I came across was converting it to a pdf but I don't want to lose the text and formatting of the document either.
from __future__ import print_function
from mailmerge import MailMerge
import os
import win32api
import win32print
# Formatting the list so that the indices match the columns in the text doc.
text_file = open(r'J:\cpletters15.txt')
courtesy_pay_list = text_file.readlines()
text_file.close()
delimeted_text = [line.split('\t') for line in courtesy_pay_list]
# Opening the template
template = r'J:\courtesy_pay_15_test.docx'
document = MailMerge(template)
# This should merge and then print each member on their own letter.
for row in delimeted_text[1:]:
    document.merge(
        SHARE_NBR = row[2],
        MEMBER_NBR = row[1],
        CITY = row[11],
        ADDRESS1 = row[9],
        ADDRESS2 = row[10],
        LAST_NAME = row[8],
        STATE = row[12],
        FIRST_NAME = row[7],
        ZIP = row[13],
        LETTER_DATE = row[4],
        BALANCE = row[6]
    )
    document.write(r'J:\output.docx')
    win32api.ShellExecute(
        0, "print", document, '/d:"%s"' % win32print.GetDefaultPrinter(),
        ".", 0)
I am expecting that this code should print each document after merging the fields, but instead I get the error "Objects of type 'MailMerge' can not be converted to Unicode."
(Sorry if this is too wordy I've never posted here before).
I eventually found in the source code for mailmerge.py that there is another function called merge_templates that is supposed to replace mail_merge, but it did not exist in the version I installed using pip. After consulting some Python coders on Discord, it proved better to make a list of dictionaries from the lines:
values = delimeted_text[1:]
header = delimeted_text[0]
big_Dict = [{head: val for head, val in zip(header, row)} for row in values]
This was then accepted by merge_templates and successfully printed as such:
document.merge_templates(big_Dict, 'nextPage_section')
document.write('J:\\output.docx')
os.startfile('J:\\output.docx', 'print')
I ended up just overwriting the mailmerge.py with the source code from the link and then everything worked out.
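As a rough sketch of the list-of-dictionaries step, the standard library's csv.DictReader can build the same structure directly from tab-delimited text and also takes care of the trailing newline on each row. The sample data and the helper name rows_as_dicts are illustrative assumptions, not taken from the original post.

```python
import csv
import io

# Hypothetical sample standing in for the tab-delimited text file.
sample = (
    "MEMBER_NBR\tFIRST_NAME\tLAST_NAME\tBALANCE\n"
    "1001\tAda\tLovelace\t25.00\n"
    "1002\tAlan\tTuring\t40.50\n"
)


def rows_as_dicts(text_stream):
    """Return one dict per data row, keyed by the header row."""
    reader = csv.DictReader(text_stream, delimiter="\t")
    return [dict(row) for row in reader]


records = rows_as_dicts(io.StringIO(sample))
for record in records:
    print(record)
```

With a real file, the io.StringIO object would be replaced by an open file handle; the resulting list has the shape that was passed to merge_templates above.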

How do I set a variable to a regex string in Python Script for notepad++?

I am trying to set a variable (x) to a string found in a text file using a regular expression.
The file I am searching contains several lines, and one of those lines is always a ticket number of the form WS########. It looks like this:
~File~
out.test
WS12345678
something here
pineapple joe
etc.
~Code~
def foundSomething(m):
    console.write('{0}\n'.format(m.group(0), str(m.span(0))))
editor1.research(r'([W][S]\d\d\d\d\d\d\d\d)', foundSomething)
Through my research I've managed to get the above code to work; it outputs WS12345678 to the console when the corresponding text exists within a file.
How do I put WS12345678 into a variable so I can save that file with the corresponding number?
EDIT
To put it in pseudo code I am trying to
x = find "WS\d\d\d\d\d\d\d\d"
file.save(x)
Solution
Thank you @Kasra AD for the solution. I was able to create a workaround.
import re #import regular expression
test = editor1.getText() #variable = all the text in an editor window
savefilename = re.search(r"WS\d{8}",test).group(0) #setting the savefile variable
console.write(savefilename) #checking the variable
To find a specific string within a file in Notepad++ using the PythonScript plugin, you can pull everything from one editor into a string and run a regex search on that.
You need to return the result from your function and then assign it to a variable:
def foundSomething(m):
    console.write('{0}\n'.format(m.group(0)))
    return m.group(0)
my_var = foundSomething(input_arg)
To extract the desired string, you can use the following regex:
>>> s="""out.test
... WS12345678
... something here
... pineapple joe"""
>>> import re
>>> re.search(r'WS\d{8}',s).group(0)
'WS12345678'
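One caveat with the search above: re.search returns None when nothing matches, so calling .group(0) directly would raise an AttributeError on a file without a ticket number. A minimal sketch of a guarded version follows; the helper name ticket_filename and the .txt extension are assumptions made for illustration.

```python
import re


def ticket_filename(text, extension=".txt"):
    """Return 'WS########' plus extension for the first ticket number, or None."""
    match = re.search(r"WS\d{8}", text)
    if match is None:
        return None
    return match.group(0) + extension


sample = "out.test\nWS12345678\nsomething here\npineapple joe"
print(ticket_filename(sample))
print(ticket_filename("no ticket here"))
```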

How to use python-docx to replace text in a Word document and save

The oodocx module mentioned in the same page refers the user to an /examples folder that does not seem to be there.
I have read the documentation of python-docx 0.7.2, plus everything I could find in Stackoverflow on the subject, so please believe that I have done my “homework”.
Python is the only language I know (beginner+, maybe intermediate), so please do not assume any knowledge of C, Unix, xml, etc.
Task : Open a ms-word 2007+ document with a single line of text in it (to keep things simple) and replace any “key” word in Dictionary that occurs in that line of text with its dictionary value. Then close the document keeping everything else the same.
Line of text (for example) “We shall linger in the chambers of the sea.”
from docx import Document
document = Document('/Users/umityalcin/Desktop/Test.docx')
Dictionary = {'sea': "ocean"}
sections = document.sections
for section in sections:
    print(section.start_type)
#Now, I would like to navigate, focus on, get to, whatever to the section that has my
#single line of text and execute a find/replace using the dictionary above.
#then save the document in the usual way.
document.save('/Users/umityalcin/Desktop/Test.docx')
I am not seeing anything in the documentation that allows me to do this—maybe it is there but I don’t get it because everything is not spelled-out at my level.
I have followed other suggestions on this site and have tried to use earlier versions of the module (https://github.com/mikemaccana/python-docx) that is supposed to have "methods like replace, advReplace" as follows: I open the source-code in the python interpreter, and add the following at the end (this is to avoid clashes with the already installed version 0.7.2):
document = opendocx('/Users/umityalcin/Desktop/Test.docx')
words = document.xpath('//w:r', namespaces=document.nsmap)
for word in words:
    if word in Dictionary.keys():
        print "found it", Dictionary[word]
        document = replace(document, word, Dictionary[word])
savedocx(document, coreprops, appprops, contenttypes, websettings,
    wordrelationships, output, imagefiledict=None)
Running this produces the following error message:
NameError: name 'coreprops' is not defined
Maybe I am trying to do something that cannot be done—but I would appreciate your help if I am missing something simple.
If this matters, I am using the 64 bit version of Enthought's Canopy on OSX 10.9.3
UPDATE: There are a couple of paragraph-level functions that do a good job of this and can be found on the GitHub site for python-docx.
This one will replace a regex-match with a replacement str. The replacement string will appear formatted the same as the first character of the matched string.
This one will isolate a run such that some formatting can be applied to that word or phrase, like highlighting each occurrence of "foobar" in the text or perhaps making it bold or appear in a larger font.
The current version of python-docx does not have a search() function or a replace() function. These are requested fairly frequently, but an implementation for the general case is quite tricky and it hasn't risen to the top of the backlog yet.
Several folks have had success though, getting done what they need, using the facilities already present. Here's an example. It has nothing to do with sections by the way :)
for paragraph in document.paragraphs:
    if 'sea' in paragraph.text:
        print(paragraph.text)
        paragraph.text = 'new text containing ocean'
To search in Tables as well, you would need to use something like:
for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                if 'sea' in paragraph.text:
                    paragraph.text = paragraph.text.replace("sea", "ocean")
If you pursue this path, you'll probably discover pretty quickly what the complexities are. If you replace the entire text of a paragraph, that will remove any character-level formatting, like a word or phrase in bold or italic.
By the way, the code from @wnnmaw's answer is for the legacy version of python-docx and won't work at all with versions after 0.3.0.
I needed something to replace regular expressions in docx.
I took scanny's answer.
To handle style I used the answer from:
Python docx Replace string in paragraph while keeping style
I added a recursive call to handle nested tables
and came up with something like this:
import re
from docx import Document

def docx_replace_regex(doc_obj, regex , replace):
    for p in doc_obj.paragraphs:
        if regex.search(p.text):
            inline = p.runs
            # Loop added to work with runs (strings with same style)
            for i in range(len(inline)):
                if regex.search(inline[i].text):
                    text = regex.sub(replace, inline[i].text)
                    inline[i].text = text
    for table in doc_obj.tables:
        for row in table.rows:
            for cell in row.cells:
                docx_replace_regex(cell, regex , replace)

regex1 = re.compile(r"your regex")
replace1 = r"your replace string"
filename = "test.docx"
doc = Document(filename)
docx_replace_regex(doc, regex1 , replace1)
doc.save('result1.docx')
To iterate over dictionary:
for word, replacement in dictionary.items():
    word_re=re.compile(word)
    docx_replace_regex(doc, word_re , replacement)
Note that this solution will replace a regex match only if the whole match has the same style in the document.
Also, if the text is edited after saving, text with the same style might end up in separate runs.
For example, if you open a document that contains the string "testabcd", change it to "test1abcd" and save, then even though it is all the same style there are 3 separate runs, "test", "1", and "abcd"; in this case replacement of test1 won't work.
This is for tracking changes in the document. To merge it into one run, in Word you need to go to "Options", "Trust Center" and in "Privacy Options" untick "Store random numbers to improve combine accuracy" and save the document.
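The run-splitting problem described above can be sketched without python-docx by modelling a paragraph's runs as a plain list of strings. The helper replace_across_runs below is a hypothetical name introduced for illustration; it is not part of python-docx. Under that assumption, it replaces the first match in the joined text, puts the replacement into the first run the match touches, and trims the matched characters from the remaining runs, so the number of runs stays the same.

```python
def replace_across_runs(run_texts, old, new):
    """Replace the first occurrence of `old` in the joined run texts.

    Returns a new list with one entry per original run, so each run could
    keep its own formatting.
    """
    full = "".join(run_texts)
    start = full.find(old)
    if start == -1 or not old:
        return list(run_texts)
    end = start + len(old)

    result = []
    pos = 0
    inserted = False
    for text in run_texts:
        run_start, run_end = pos, pos + len(text)
        pos = run_end
        if run_end <= start or run_start >= end:
            result.append(text)  # run lies entirely outside the match
            continue
        # Keep whatever part of this run sits before or after the match.
        before = text[:start - run_start] if start > run_start else ""
        after = text[end - run_start:] if end < run_end else ""
        middle = "" if inserted else new  # replacement goes in once only
        inserted = True
        result.append(before + middle + after)
    return result


runs = ["test", "1", "abcd"]
per_run = [text.replace("test1", "demo") for text in runs]
merged = replace_across_runs(runs, "test1", "demo")
print(per_run)
print(merged)
```

With python-docx, the returned strings could then be assigned back one by one to the text attribute of the corresponding runs. The replacement takes on the formatting of the first run it touches, which is a design choice of this sketch rather than anything the library prescribes.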
Sharing a small script I wrote. It helps me generate legal .docx contracts with variables while preserving the original style.
pip install python-docx
Example:
from docx import Document
import os

def main():
    template_file_path = 'employment_agreement_template.docx'
    output_file_path = 'result.docx'
    variables = {
        "${EMPLOEE_NAME}": "Example Name",
        "${EMPLOEE_TITLE}": "Software Engineer",
        "${EMPLOEE_ID}": "302929393",
        "${EMPLOEE_ADDRESS}": "דרך השלום מנחם בגין דוגמא",
        "${EMPLOEE_PHONE}": "+972-5056000000",
        "${EMPLOEE_EMAIL}": "example@example.com",
        "${START_DATE}": "03 Jan, 2021",
        "${SALARY}": "10,000",
        "${SALARY_30}": "3,000",
        "${SALARY_70}": "7,000",
    }
    template_document = Document(template_file_path)
    for variable_key, variable_value in variables.items():
        for paragraph in template_document.paragraphs:
            replace_text_in_paragraph(paragraph, variable_key, variable_value)
        for table in template_document.tables:
            for col in table.columns:
                for cell in col.cells:
                    for paragraph in cell.paragraphs:
                        replace_text_in_paragraph(paragraph, variable_key, variable_value)
    template_document.save(output_file_path)

def replace_text_in_paragraph(paragraph, key, value):
    if key in paragraph.text:
        inline = paragraph.runs
        for item in inline:
            if key in item.text:
                item.text = item.text.replace(key, value)

if __name__ == '__main__':
    main()
I got a lot of help from the earlier answers, but for me the code below works the way the simple find-and-replace function in Word does. Hope this helps.
#!pip install python-docx
#start from here if python-docx is installed
from docx import Document
#open the document
doc=Document('./test.docx')
Dictionary = {"sea": "ocean", "find_this_text":"new_text"}
for i in Dictionary:
    for p in doc.paragraphs:
        if p.text.find(i)>=0:
            p.text=p.text.replace(i,Dictionary[i])
#save changed document
doc.save('./test.docx')
The above solution has limitations: 1) the paragraph containing "find_this_text" will become plain text without any formatting, 2) content controls that are in the same paragraph as "find_this_text" will be deleted, and 3) "find_this_text" inside either content controls or tables will not be changed.
For the table case, I had to modify @scanny's answer to:
for table in doc.tables:
    for col in table.columns:
        for cell in col.cells:
            for p in cell.paragraphs:
to make it work. Indeed, this does not seem to work with the current state of the API:
for table in document.tables:
    for cell in table.cells:
Same problem with the code from here: https://github.com/python-openxml/python-docx/issues/30#issuecomment-38658149
The Office Dev Centre has an entry in which a developer has published (MIT licensed at this time) a description of a couple of algorithms that appear to suggest a solution for this (albeit in C#, so they would require porting): MS Dev Centre posting
The library python-docx-template is pretty useful for this. It's perfect to edit Word documents and save them back to .docx format.
The problem with your second attempt is that you haven't defined the parameters that savedocx needs. You need to do something like this before you save:
relationships = docx.relationshiplist()
title = "Document Title"
subject = "Document Subject"
creator = "Document Creator"
keywords = []
coreprops = docx.coreproperties(title=title, subject=subject, creator=creator,
                                keywords=keywords)
app = docx.appproperties()
content = docx.contenttypes()
web = docx.websettings()
word = docx.wordrelationships(relationships)
output = r"path\to\where\you\want\to\save"
He changed the API in python-docx again...
For the sanity of everyone coming here:
import datetime
import os
from decimal import Decimal
from typing import NamedTuple
from docx import Document
from docx.document import Document as nDocument

class DocxInvoiceArg(NamedTuple):
    invoice_to: str
    date_from: str
    date_to: str
    project_name: str
    quantity: float
    hourly: int
    currency: str
    bank_details: str

class DocxService():
    tokens = [
        '#INVOICE_TO#',
        '#IDATE_FROM#',
        '#IDATE_TO#',
        '#INVOICE_NR#',
        '#PROJECTNAME#',
        '#QUANTITY#',
        '#HOURLY#',
        '#CURRENCY#',
        '#TOTAL#',
        '#BANK_DETAILS#',
    ]

    def __init__(self, replace_vals: DocxInvoiceArg):
        total = replace_vals.quantity * replace_vals.hourly
        invoice_nr = replace_vals.project_name + datetime.datetime.strptime(replace_vals.date_to, '%Y-%m-%d').strftime('%Y%m%d')
        self.replace_vals = [
            {'search': self.tokens[0], 'replace': replace_vals.invoice_to },
            {'search': self.tokens[1], 'replace': replace_vals.date_from },
            {'search': self.tokens[2], 'replace': replace_vals.date_to },
            {'search': self.tokens[3], 'replace': invoice_nr },
            {'search': self.tokens[4], 'replace': replace_vals.project_name },
            {'search': self.tokens[5], 'replace': replace_vals.quantity },
            {'search': self.tokens[6], 'replace': replace_vals.hourly },
            {'search': self.tokens[7], 'replace': replace_vals.currency },
            {'search': self.tokens[8], 'replace': total },
            {'search': self.tokens[9], 'replace': 'asdfasdfasdfdasf'},
        ]
        self.doc_path_template = os.path.dirname(os.path.realpath(__file__))+'/docs/'
        self.doc_path_output = self.doc_path_template + 'output/'
        self.document: nDocument = Document(self.doc_path_template + 'invoice_placeholder.docx')

    def save(self):
        for p in self.document.paragraphs:
            self._docx_replace_text(p)
        tables = self.document.tables
        self._loop_tables(tables)
        self.document.save(self.doc_path_output + 'testiboi3.docx')

    def _loop_tables(self, tables):
        for table in tables:
            for index, row in enumerate(table.rows):
                for cell in table.row_cells(index):
                    if cell.tables:
                        self._loop_tables(cell.tables)
                    for p in cell.paragraphs:
                        self._docx_replace_text(p)
            # for cells in column.
            # for cell in table.columns:

    def _docx_replace_text(self, p):
        print(p.text)
        for el in self.replace_vals:
            if (el['search'] in p.text):
                inline = p.runs
                # Loop added to work with runs (strings with same style)
                for i in range(len(inline)):
                    print(inline[i].text)
                    if el['search'] in inline[i].text:
                        text = inline[i].text.replace(el['search'], str(el['replace']))
                        inline[i].text = text
                print(p.text)
Test case:
from django.test import SimpleTestCase
from docx.table import Table, _Rows
from toggleapi.services.DocxService import DocxService, DocxInvoiceArg

class TestDocxService(SimpleTestCase):
    def test_document_read(self):
        ds = DocxService(DocxInvoiceArg(invoice_to="""
WAW test1
Multi myfriend
""",date_from="2019-08-01", date_to="2019-08-30", project_name='WAW', quantity=10.5, hourly=40, currency='USD',bank_details="""
Paypal to:
bippo@bippsi.com"""))
        ds.save()
Have the folders docs and docs/output/ in the same folder where you have DocxService.py.
Be sure to parameterize and replace stuff.
As some fellow users noted above, one of the challenges in finding and replacing text in a Word document is retaining styles when the word spans multiple runs. This can happen if the word has several styles or if it was edited multiple times while the document was being created. Simple code that assumes a word will be found completely within a single run is therefore often wrong, so the python-docx based code shared above may not work in many scenarios.
You can try the following API
https://rapidapi.com/more.sense.tech#gmail.com/api/document-filter1
It has generic code to deal with these scenarios. The API currently addresses only paragraph text; table text is not yet supported, and I will try that soon.
import docx2txt as d2t
from docx import Document
from docx.text.paragraph import Paragraph
document = Document()
# Extract the plain text of the source document (formatting is lost)
all_text = d2t.process("mydata.docx")
# print(all_text)
words=["hey","wow"]
for word in words:
    all_text=all_text.replace(word,"your word variable")
# Write the updated text into a new document
document.add_paragraph(all_text + "\n")
print(all_text)
document.save('data.docx')

Regex line by line over large string

I have a lot of rows like below in a file:
{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}
I first tried importing this as a dictionary with the json module so I could just print the values of the keys. The problem is some of the lines are missing the right curly bracket or have other issues and the fields aren't in the same order per line. That is preventing the import.
So now I am trying to do this with a regex. I have this:
fo = open("c:\\newgoodtestsample.txt", "r")
x = fo.read()
match1 = re.search('first_name"(.*?)"(.*?)"', x)
if match1:
    print match1.group(2)
That returns the value of just the name. I would like to be able to return other fields as well. This worked in a regex tester but I can't get it to work in my code:
(first_name|last_name|age)"(.*?)"(.*?)"
Lastly, once that is figured out, I need to read each line in the file (not just the first one) and print the requested regex data from each line into a file. I have tried inserting a for loop but I keep getting the first line repeated over and over so I must be inserting it incorrectly. Any assistance is appreciated.
The following seems to do what you want. The regex should give you back, as matching groups, each keyword from the JSON together with the value stored under it.
I also encourage you to use the with context manager, as that will close the file handle automatically after all lines have been read, which is easily done with a for loop.
import re
with open("c:\\newgoodtestsample.txt", "r") as fo:
    for line in fo:
        # Collect every "key":value pair found on the line into a dict
        d = {k:v for k,v in re.findall(r'"(\w*?)":"?(\w*)"?', line)}
        if 'first_name' in d:
            # print first_name into file
            print(d['first_name'])
        else:
            # print empty first_name field
            print('')
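As a small self-contained sketch of the per-line approach, the following runs the same kind of pattern over two sample lines, one of them missing its closing brace and with the fields in a different order. The helper name parse_line and the sample data are assumptions for illustration; the pattern only handles values made of word characters, so values containing spaces or punctuation would need a different expression.

```python
import re

# Matches "key":"value" or "key":value where the value is made of word characters.
PAIR = re.compile(r'"(\w+)":"?(\w*)"?')


def parse_line(line):
    """Return a dict of the key/value pairs found on one (possibly malformed) line."""
    return dict(PAIR.findall(line))


lines = [
    '{"first_name":"John","last_name":"Smith","age":30}',
    '{"last_name":"Johnson","age":34,"first_name":"Tim"',  # missing closing brace
]
for line in lines:
    record = parse_line(line)
    print(record.get("first_name", ""), record.get("last_name", ""), record.get("age", ""))
```

Because every value is captured as text, the age comes back as the string '30' rather than the integer 30.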

Python turn text file into dictionary

I'm writing a spell checking function and I have a text file that looks like this
teh the
cta cat
dgo dog
dya day
frmo from
memeber member
The incorrect spelling is on the left (which will be my key) and the correct spelling is on the right (my value).
def spell():
    corrections=open('autoCorrect.txt','r')
    dictCorrect={}
    for line in corrections:
        corrections[0]=[1]
        list(dictCorrect.items())
I know what I want my function to do but can't figure out how to execute it.
Use this:
with open('dictionary.txt') as f:
    d = dict(line.strip().split(None, 1) for line in f)
d is the dictionary.
Disclaimer:
This will work for the simple structure you have illustrated above; for more complex file structures you will need to do much more complex parsing.
You probably want to use split to get the words, then map the misspelled word to the correctly spelled one:
def spell():
    dictCorrect={}
    with open('autoCorrect.txt','r') as corrections:
        for line in corrections:
            # split() with no argument also drops the trailing newline
            wrong, right = line.split()
            dictCorrect[wrong] = right
    return dictCorrect
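To show how such a dictionary might then be used, here is a minimal sketch that loads the corrections from an in-memory stand-in for the file and applies them word by word. The names load_corrections and correct, and the sample sentence, are hypothetical and only serve the illustration.

```python
import io


def load_corrections(lines):
    """Build a {misspelling: correction} dict from 'wrong right' lines."""
    corrections = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        wrong, right = parts
        corrections[wrong] = right
    return corrections


def correct(sentence, corrections):
    """Replace each known misspelling in a whitespace-separated sentence."""
    return " ".join(corrections.get(word, word) for word in sentence.split())


sample_file = io.StringIO("teh the\ncta cat\n\ndgo dog\n")
table = load_corrections(sample_file)
print(correct("teh cta chased teh dgo", table))
```

Skipping lines that do not split into exactly two parts avoids the ValueError that tuple unpacking would raise on a blank line.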
