Python docx module to modify multiple files - python

I am trying to use small, nested for-loops to iterate through a list of values, replace items in a Word doc with the docx module, changing one value each loop, and saving a new doc each time and naming it as the changed value's name. Everything works except the value I'm trying to change with each loop is not updating on the Word files. The first value in my list is the value that shows up on every Word doc. When I output/save the files, the filename is correctly updating - but the content of the value on the document is always the same. Here's the code:
def makeWord(ident):  # 'ident' is a list of alphanumeric I.D. numbers
    wordpath = path + 'Word_Template.docx'
    outpath = path + 'Word Files\\'
    wordfile = Document(wordpath)
    for cn in ident:
        myvars = {'$MO': datamonth, '$YEAR': datayear, '$idnum': cn}  # data month/year are constants
        for key, val in myvars.items():
            for para in wordfile.paragraphs:
                replace_text_in_paragraph(para, key, val)
        saveword(wordfile, cn, outpath)
def replace_text_in_paragraph(paragraph, key, value):
    if key in paragraph.text:
        inline = paragraph.runs
        for item in inline:
            if key in item.text:
                item.text = item.text.replace(key, value)
def saveword(wordfile, idnum, outp):
    wordfile.save(outp + idnum + '_' + datamonth + '_' + datayear + '.docx')
My list of values is in the "ident" list. If the first value in the list is "B25", that value is placed in all of my Word docs, though the filename of the Word docs appropriately changes as the loop runs. I have been stuck for two days on this. Any help is greatly appreciated!

If your goal is to replace words in docx files with variables, my advice would be to use the docxtpl library.
You can open the file you want to edit, then use a dictionary to replace values. It's much cleaner.
Here's an example that gives you an idea of how powerful this is:
from docxtpl import DocxTemplate
# If you have multiple forms to do in a list
for form in forms:
    # Document
    doc = DocxTemplate(form[0])
    # Feed variables in as a dict; Jinja2 will convert them on the page
    context = {
        'text_in_doc': "something",
        'text_in_doc2': "something",
        'text_in_doc3': "something",
    }
    # Render new Doc with Jinja2 placeholders swapped
    doc.render(context)
    # Save new Doc after Jinja2 conversion
    doc.save("/save/to/directory/sample.docx")
GL
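As for why the question's python-docx code keeps writing the first ID: a likely cause is that the template is opened once, before the loop. The first pass replaces `$idnum` in the in-memory document, so later passes find no placeholder left and only the filename changes. Below is a minimal sketch of re-opening the template on every pass, assuming python-docx is installed. The names `fill_placeholders` and `make_word_files` and their parameters are illustrative and not from the original post.

```python
import os


def fill_placeholders(text, mapping):
    """Return `text` with every placeholder key replaced by its value."""
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text


def make_word_files(template_path, out_dir, idents, month, year):
    """Write one document per ID, starting from a fresh template each time."""
    from docx import Document  # third-party package: python-docx

    for cn in idents:
        # Re-open the template inside the loop so the placeholders are
        # still present; a document opened once keeps the first ID.
        doc = Document(template_path)
        mapping = {'$MO': month, '$YEAR': year, '$idnum': cn}
        for para in doc.paragraphs:
            for run in para.runs:
                new_text = fill_placeholders(run.text, mapping)
                if new_text != run.text:
                    run.text = new_text
        doc.save(os.path.join(out_dir, f"{cn}_{month}_{year}.docx"))
```

Like the original helper, this only catches placeholders that sit entirely inside one run. Word sometimes splits a token such as `$idnum` across several runs, in which case it will not be found.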

Related

Using python and regex to find strings and put them together to replace the filename of a .pdf - the rename fails when using more than one group

I have several thousand pdfs which I need to re-name based on the content. The layouts of the pdfs are inconsistent. To re-name them I need to locate a specific string "MEMBER". I need the value after the string "MEMBER" and the values from the two lines above MEMBER, which are Time and Date values respectively.
So:
STUFF
STUFF
STUFF
DD/MM/YY
HH:MM:SS
MEMBER ######
STUFF
STUFF
STUFF
I have been using regex101.com and have ((.*(\n|\r|\r\n)){2})(MEMBER.\S+) which matches all of the values I need. But it puts them across four groups with group 3 just showing a carriage return.
What I have so far looks like this:
import fitz
from os import DirEntry, curdir, chdir, getcwd, rename
from glob import glob as glob
import re
failed_pdfs = []
count = 0
pdf_regex = r'((.*(\n|\r|\r\n)){2})(MEMBER.\S+)'
text = ""
get_curr = getcwd()
directory = 'PDF_FILES'
chdir(directory)
pdf_list = glob('*.pdf')
for pdf in pdf_list:
    with fitz.open(pdf) as pdf_obj:
        for page in pdf_obj:
            text += page.get_text()
    new_file_name = re.search(pdf_regex, text).group().strip().replace(":","").replace("-","") + '.pdf'
    text = ""
    #clean_new_file_name = new_file_name.translate({(":"): None})
    print(new_file_name)
    # Tries to rename a pdf. If the filename doesn't already exist
    # then rename. If it does exist then throw an error and add to failed list
    try:
        rename(pdf, new_file_name )
    except WindowsError:
        count += 1
        failed_pdfs.append(str(count) + ' - FAILED TO RENAME: [' + pdf + " ----> " + str(new_file_name) + "]")
If I specify a group in the re.search call, for instance group 4, which contains the MEMBER ###### value, then the file renames successfully with just that value. Similarly, group 2 renames with the TIME value. I think the multiple lines are preventing it from using all of the values I need. When I run it with group(), the printed value shows as
DATE
TIME
MEMBER ######.pdf
And the log count reflects the failures.
I am very new at this, and stumbling around trying to piece together how things work. Is the issue with how I built the regex or with the re.search portion? Or everything?
I have tried re-doing the Regular Expression, but I end up with multiple lines in the results, which seems connected to the rename failure.
The strategy is to read the page's text word by word, sorted by position on the page.
If we then find "MEMBER", the word following it represents the hashes ####, and the two preceding words must be the date and the time, respectively.
found = False
for page in pdf_obj:
    words = page.get_text("words", sort=True)
    # all space-delimited strings on the page, sorted vertically,
    # then horizontally
    for i, word in enumerate(words):
        if word[4] == "MEMBER":
            hashes = words[i+1][4]  # this must be the word after MEMBER!
            time_string = words[i-1][4]  # the time
            date_string = words[i-2][4]  # the date
            found = True
            break
    if found:  # no need to look in other pages
        break
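On the regex route from the question: `group()` with no argument returns the whole match, including the line breaks between the date, the time and the MEMBER line. That is why the name prints on three lines and the rename fails. A minimal sketch that captures the three values separately and joins them on one line, assuming the layout shown in the question. The function name `build_pdf_name` and the underscore separator are illustrative choices.

```python
import re

# One named group per line of interest: date, time, then the MEMBER line.
PDF_REGEX = re.compile(
    r'(?P<date>[^\r\n]+)\r?\n'
    r'(?P<time>[^\r\n]+)\r?\n'
    r'MEMBER[ \t]+(?P<member>\S+)'
)


def build_pdf_name(text):
    """Return a single-line file name, or None if MEMBER is not found."""
    match = PDF_REGEX.search(text)
    if match is None:
        return None
    parts = [match.group('date'), match.group('time'), match.group('member')]
    # Drop characters that Windows does not allow in file names.
    cleaned = [re.sub(r'[\\/:*?"<>|]', '', part.strip()) for part in parts]
    return '_'.join(cleaned) + '.pdf'
```

Checking for `None` before renaming also avoids the `AttributeError` that `re.search(...).group()` raises when a PDF has no MEMBER line.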

How do I use a list or set as keys in file renaming

Is something like this possible? I'd like to use a dictionary or set as the key for my file renamer. I have a lot of keywords that I'd like to filter out of the file names, but the only way I've found so far is to search by a single string such as key720 = "720". This works correctly but creates bloat: I have to have a copy of the code at the bottom for each keyword I want to remove.
How do I get the list to work as keys in the search?
I tried to take the list and make it a string with:
str1 = ""
keyres = (str1.join(keys))
This was closer, but I think it makes one string of all the entries, and it didn't pick up any keywords.
So I've come to this at the moment:
keys = ["720p", "720", "1080p", "1080"]
for filename in os.listdir(dirName):
    if keys in filename:
        filepath = os.path.join(dirName, filename)
        newfilepath = os.path.join(dirName, filename.replace(keys,""))
        os.rename(filepath, newfilepath)
Is there a way to go by index and increment it one at a time? Would that allow the strings in the list to be used as strings?
What I'm trying to do is take a file name and rename it by removing all occurrences of the keywords.
How about using Regular Expressions, specifically the sub function?
from re import sub
KEYS = ["720p", "720", "1080p", "1080"]
old_filename = "filename1080p.jpg"
new_filename = sub('|'.join(KEYS),'',old_filename)
print(new_filename)
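One caveat with joining the keys directly: regex alternation tries the alternatives from left to right, so the result depends on the order of the list, and a key containing a character such as "." or "(" would be read as regex syntax. A small sketch that sorts the keys longest first and escapes them. The helper name `strip_keywords` is made up for illustration.

```python
import re


def strip_keywords(filename, keys):
    """Remove every occurrence of any keyword from a file name."""
    # Longest first so "1080p" wins over "1080"; re.escape guards against
    # keywords that contain regex metacharacters such as "." or "(".
    ordered = sorted(keys, key=len, reverse=True)
    pattern = '|'.join(re.escape(key) for key in ordered)
    return re.sub(pattern, '', filename)
```

Inside the `os.listdir` loop from the question, the new name would be `strip_keywords(filename, keys)`, and `os.rename` only needs to run when that differs from `filename`.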

Adding new keys/values to nested dictionary in geojson file

I have a .js containing data used to plot points on a map. In the file, I have a single line var statement which contains nested dictionaries.
var json_Project = {"type":"FeatureCollection","name":"project","crs":{"type":"name","properties":{"name":"urn:ogc:def:crs:OGC:1.3:CRS84"}},"features":[{"type":"Feature","properties":{"ID":"001","field1":"someText","field2":"someText"},"geometry":{"type":"Point","coordinates":[-1.14,60.15]}},{"type":"Feature","properties":{"ID":"002","field1":"someText","field2":"someText"},"geometry":{"type":"Point","coordinates":[-1.14,60.15]}}]}
One of these contains the "features" key which stores a list of dictionaries for each point (see below for readability).
var json_Project = {"type":"FeatureCollection","name":"project","crs":{"type":"name","properties":{"name":"urn:ogc:def:crs:OGC:1.3:CRS84"}},"features":[
{"type":"Feature","properties":{"ID":"001","field1":"someText","field2":"someText"},"geometry":{"type":"Point","coordinates":[-1.14,60.15]}},
{"type":"Feature","properties":{"ID":"002","field1":"someText","field2":"someText"},"geometry":{"type":"Point","coordinates":[-1.14,60.15]}}]}
The "properties" for the first point has "ID":"001", the second point has "ID":"002", etc.
I would like to insert a new property for each point with "INFORMATION:" and get the value from another dictionary using the keys from "ID".
info_dict = {'001': 'This is Project 1',
'002': 'This is Project 2'}
Specifically, I would like this new property to be inserted before "field2" so that it looks like:
var json_Project = {"type":"FeatureCollection","name":"project","crs":{"type":"name","properties":{"name":"urn:ogc:def:crs:OGC:1.3:CRS84"}},"features":[
{"type":"Feature","properties":{"ID":"001","field1":"someText","INFORMATION":"This is Project 1","field2":"someText"},"geometry":{"type":"Point","coordinates":[-1.14,60.15]}},
{"type":"Feature","properties":{"ID":"002","field1":"someText","INFORMATION":"This is Project 2","field2":"someText"},"geometry":{"type":"Point","coordinates":[-1.14,60.15]}}]}
My futile attempt involved converting the one-liner to a string and then extracting everything after the "features": string, as that is unique in the file. But I am not sure how to add the new key/value pair.
import re
with open('project.js') as f:
    contents = f.read().split()
    contents_toString = ''.join(contents)
    new_contents = re.findall('(?<="features":\[).*$', contents_toString)
EDIT
Thanks to @TenaciousB, I can read the geojson file and add in the "INFORMATION" property with the correct values from the dictionary:
import json
with open('project.js') as f:
    contents = f.read()
    x = json.loads(contents)
    for y in x['features']:
        key = y['properties']['ID']
        description = info_dict[key]
        y['properties']['INFORMATION'] = description
I am still unsure how to:
1. Remove var json_Project = programmatically before the file is read;
2. Place the "INFORMATION" property before "field2";
3. Re-insert var json_Project = , save and ensure the single-line format is retained.
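The three open points can be sketched with the standard `json` module alone. This is a hedged illustration that assumes the file holds exactly one `var json_Project = {...}` statement and that every feature's "ID" is present in the lookup dictionary. The function name `add_information` and its `before_key` parameter are not from the original post.

```python
import json

PREFIX = 'var json_Project = '


def add_information(js_text, info, before_key='field2'):
    """Insert an INFORMATION property ahead of `before_key` in every feature."""
    # Drop the JavaScript assignment (and an optional trailing semicolon)
    # so that only the JSON object is left.
    body = js_text.strip()
    if body.startswith(PREFIX):
        body = body[len(PREFIX):]
    data = json.loads(body.rstrip(';'))

    for feature in data['features']:
        old = feature['properties']
        new = {}
        # Rebuild the dict so the new key lands just before `before_key`.
        for key, value in old.items():
            if key == before_key:
                new['INFORMATION'] = info[old['ID']]
            new[key] = value
        # If `before_key` is absent, append the property at the end.
        new.setdefault('INFORMATION', info[old['ID']])
        feature['properties'] = new

    # Compact separators keep the single-line format without extra spaces.
    return PREFIX + json.dumps(data, separators=(',', ':'))
```

Reading `project.js` with `f.read()`, passing the text and `info_dict` through this function, and writing the result back would keep the file as one line. The ordering relies on dictionaries preserving insertion order, which Python guarantees from 3.7 on.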

Need some help in deleting the data from list and again append it in same list

I have developed a Django app where users can upload multiple files. The paths of all uploaded files are stored in a MySQL database as a single comma-separated string. For example, I have uploaded three files:
1. Logging a Defect.docx,
2. Mocks (1).pptx and
3. Mocksv2.pptx
and they get stored in the database as follows (each file path is put into a list, and joining all the paths results in this form):
FileStore/client/Logging a Defect.docx,FileStore/client/Mocks (1).pptx,FileStore/client/Mocksv2.pptx,
Now I need help with deleting a particular file. For example, when I delete Logging a Defect.docx, I should delete only the first element of the list and retain the other two paths. I'll be sending only the name of the document.
I retrieve the paths as a list, and then I have to check whether the document name being passed appears in each element of the list. If it matches, I should delete that element and keep the other elements intact. How should I approach this? It sounds like more of a Python question than a Django question.
Use a list comprehension to filter the split text, and rebuild the string using the join function:
>>> db_path = 'FileStore/client/Logging a Defect.docx,FileStore/client/Mocks (1).pptx,FileStore/client/Mocksv2.pptx'
>>> file_to_delete = 'Logging a Defect.docx'
>>> file_separator = ","
>>> new_db_path = [
...     path.strip()
...     for path in db_path.split(file_separator)
...     if path.strip() and file_to_delete not in path
... ]
>>> string_to_save = file_separator.join(new_db_path)
>>> string_to_save
'FileStore/client/Mocks (1).pptx,FileStore/client/Mocksv2.pptx'
You can read the text from your database, use the remove method of the Python list, and then write the new value back into the database:
text = "FileStore/client/Logging a Defect.docx,FileStore/client/Mocks (1).pptx,FileStore/client/Mocksv2.pptx,"
splitted = text.split(',')
# filename is the one you want to delete
entry = "FileStore/client/{filename}".format(filename="Mocks (1).pptx")
if entry in splitted:
    splitted.remove(entry)
newtext = ""
for s in splitted:
    if s:  # skip the empty piece left by the trailing comma
        newtext += s
        newtext += ','
Now write newtext back to the database.
Not boasting or anything, but I came up with my own logic for my question. It looks far less complicated, but it works fine.
db_path = 'FileStore/client/Logging a Defect.docx,FileStore/client/Mocks (1).pptx,FileStore/client/Mocksv2.pptx'
path_list = db_path.split(",")
doc = 'Logging a Defect.docx'
for i in path_list[:]:  # iterate over a copy while removing from the list
    if doc in i:
        path_list.remove(i)
new_path = ",".join(path_list)
print(new_path)
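All three answers match on a substring, so deleting `Mocks.pptx` would also drop a hypothetical `Old Mocks.pptx`. A stricter variant compares the final path component exactly. This is a sketch that assumes the stored paths always use forward slashes, as in the example. The function name `remove_file` is illustrative.

```python
import posixpath


def remove_file(db_path, file_name, sep=','):
    """Return the stored path string without the entry named `file_name`."""
    kept = [
        path.strip()
        for path in db_path.split(sep)
        # Skip empty pieces (trailing separator) and the exact file name.
        if path.strip() and posixpath.basename(path.strip()) != file_name
    ]
    return sep.join(kept)
```

Note that a comma-separated column breaks down as soon as a file name itself contains a comma, whichever variant is used.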

Pyparsing: How can I parse data and then edit a specific value in a .txt file?

My data is located in a .txt file (no, I can't change it to a different format) and it looks like this:
variablename = value
something = thisvalue
youget = the_idea
Here is my code so far (taken from the examples in Pyparsing):
from pyparsing import Word, alphas, alphanums, Literal, restOfLine, OneOrMore, \
    empty, Suppress, replaceWith
input = open("text.txt", "r")
src = input.read()
# simple grammar to match #define's
ident = Word(alphas + alphanums + "_")
macroDef = ident.setResultsName("name") + "= " + ident.setResultsName("value") + Literal("#") + restOfLine.setResultsName("desc")
for t,s,e in macroDef.scanString(src):
    print(t.name, "=", t.value)
So how can I tell my script to edit a specific value for a specific variable?
Example:
I want to change the value of variablename, from value to new_value.
So essentially variable = (the data we want to edit).
I probably should make it clear that I don't want to go directly into the file and change the value by changing value to new_value but I want to parse the data, find the variable and then give it a new value.
Even though you have already selected another answer, let me answer your original question, which was how to do this using pyparsing.
If you are trying to make selective changes in some body of text, then transformString is a better choice than scanString (although scanString or searchString are fine for validating your grammar expression by looking for matching text). transformString will apply token suppression or parse action modifications to your input string as it scans through the text looking for matches.
from pyparsing import Word, alphanums, ParseException
# alphas + alphanums is unnecessary, since alphanums includes all alphas
ident = Word(alphanums + "_")
# I find this shorthand form of setResultsName is a little more readable
macroDef = ident("name") + "=" + ident("value")
# define values to be updated, and their new values
valuesToUpdate = {
    "variablename" : "new_value"
}
# define a parse action to apply value updates, and attach to macroDef
def updateSelectedDefinitions(tokens):
    if tokens.name in valuesToUpdate:
        newval = valuesToUpdate[tokens.name]
        return "%s = %s" % (tokens.name, newval)
    else:
        # rejecting the match leaves this definition unchanged
        raise ParseException("no update defined for this definition")
macroDef.setParseAction(updateSelectedDefinitions)
# now let transformString do all the work!
print(macroDef.transformString(src))
Gives:
variablename = new_value
something = thisvalue
youget = the_idea
For this task you do not need a special utility or module.
What you need is to read the lines and split each one into a list, so that the first index is the left side and the second index is the right side.
If you need these values later, you might want to store them in a dictionary.
Here is a simple way, for somebody new to Python. Uncomment the lines with print to use them for debugging.
f = open("conf.txt", "r")
txt = f.read()  # all text is in txt
f.close()
fwrite = open("modified.txt", "w")
splitedlines = txt.splitlines()
#print(splitedlines)
for line in splitedlines:
    #print(line)
    conf = line.split('=')
    #conf[0] is what is on the left and conf[1] is what is on the right
    #print(conf)
    if conf[0].strip() == "youget":  # strip the space before '='
        #we get this
        conf[1] = " the_super_idea"  #the_idea is now the_super_idea
    #join conf with '=' and write
    newline = '='.join(conf)
    #print(newline)
    fwrite.write(newline + "\n")
fwrite.close()
Actually, you should have a look at the ConfigParser module,
which parses exactly your syntax (you only need to add a [section] header at the beginning).
If you insist on your implementation, you can create a dictionary:
dictt = {}
for t,s,e in macroDef.scanString(src):
    dictt[t.name] = t.value
dictt[variable] = new_value
ConfigParser
import ConfigParser
config = ConfigParser.RawConfigParser()
config.read('example.txt')
variablename = config.get('variablename', 'float')
It'll yell at you if you don't have a [section] header, though, but it's ok, you can fake one.
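Faking the section header can be sketched as follows for Python 3, where the module is named `configparser`. The section name `root` and the helper `update_value` are made up for illustration. The sketch assumes the file holds only `name = value` lines, and any comments in the file would be lost when the text is rebuilt.

```python
import configparser

FAKE_SECTION = 'root'  # hypothetical name; any section name works


def update_value(text, name, new_value):
    """Parse 'name = value' lines, change one value, return the new text."""
    parser = configparser.RawConfigParser()
    parser.optionxform = str  # keep the original case of the variable names
    # Prepend a section header because configparser requires one.
    parser.read_string('[' + FAKE_SECTION + ']\n' + text)
    parser.set(FAKE_SECTION, name, new_value)
    return '\n'.join(
        key + ' = ' + value for key, value in parser.items(FAKE_SECTION)
    )
```

Called as `update_value(src, 'variablename', 'new_value')`, this returns the same lines with only that one value changed, without the fake header.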
