Python pptx (Power Point) Find and replace text (ctrl + H)

Question in Short: How can I use the find and replace option (Ctrl+H) using the Python-pptx module?
Example Code:
from pptx import Presentation
nameOfFile = "NewPowerPoint.pptx"  # Replace this with: path name on your computer + name of the new file.
def open_PowerPoint_Presentation(oldFileName, newFileName):
    prs = Presentation(oldFileName)
    prs.save(newFileName)
open_PowerPoint_Presentation('Template.pptx', nameOfFile)
I have a PowerPoint document named "Template.pptx". With my Python program I add some slides and put some pictures in them. Once all the pictures are in the document, it is saved as another PowerPoint presentation.
The problem is that "Template.pptx" contains all the old week numbers, like "Week 20". I want Python to find all of these and replace them with "Week 25" (for example).

Posting code from my own project because none of the other answers quite managed to handle strings with complex text spread over multiple paragraphs without losing formatting:
from pptx import Presentation
prs = Presentation('blah.pptx')
# To get shapes in your slides
shapes = []
for slide in prs.slides:
    for shape in slide.shapes:
        shapes.append(shape)
def replace_text(replacements: dict, shapes: list):
    """Takes dict of {match: replacement, ... } and replaces all matches.
    Currently not implemented for charts or graphics.
    """
    for shape in shapes:
        for match, replacement in replacements.items():
            if shape.has_text_frame:
                if shape.text.find(match) != -1:
                    for paragraph in shape.text_frame.paragraphs:
                        for run in paragraph.runs:
                            # Replace per run so each run keeps its own formatting
                            run.text = run.text.replace(str(match), str(replacement))
            # getattr guards shape types that may not define has_table
            if getattr(shape, "has_table", False):
                for row in shape.table.rows:
                    for cell in row.cells:
                        if match in cell.text:
                            cell.text = cell.text.replace(match, replacement)
replace_text({'string to replace': 'replacement text'}, shapes)

For those of you who just want some code to copy and paste into your program that finds and replaces text in a PowerPoint while KEEPING formatting (just like I did), here you go:
from pptx import Presentation
def search_and_replace(search_str, repl_str, input_path, output_path):
    """Search and replace text in PowerPoint while preserving formatting."""
    # Useful links ;)
    # https://stackoverflow.com/questions/37924808/python-pptx-power-point-find-and-replace-text-ctrl-h
    # https://stackoverflow.com/questions/45247042/how-to-keep-original-text-formatting-of-text-with-python-powerpoint
    prs = Presentation(input_path)
    for slide in prs.slides:
        for shape in slide.shapes:
            if shape.has_text_frame and shape.text.find(search_str) != -1:
                # Edit run by run so each run keeps its own formatting
                for paragraph in shape.text_frame.paragraphs:
                    for run in paragraph.runs:
                        run.text = run.text.replace(str(search_str), str(repl_str))
    prs.save(output_path)
The above is a combination of many answers, but it gets the job done. It replaces search_str with repl_str wherever search_str occurs within a single run.
In the scope of this question, you would use:
search_and_replace('Week 20', 'Week 25', "Template.pptx", "NewPowerPoint.pptx")

Merging the responses above and others in a way that worked well for me (Python 3). All the original formatting was kept:
from pptx import Presentation
def replace_text(replacements, shapes):
    """Takes dict of {match: replacement, ... } and replaces all matches.
    Currently not implemented for charts or graphics.
    """
    for shape in shapes:
        for match, replacement in replacements.items():
            if shape.has_text_frame:
                if (shape.text.find(match)) != -1:
                    text_frame = shape.text_frame
                    for paragraph in text_frame.paragraphs:
                        # Join all runs, replace once, then keep only the first run
                        whole_text = "".join(run.text for run in paragraph.runs)
                        whole_text = whole_text.replace(str(match), str(replacement))
                        for idx, run in enumerate(paragraph.runs):
                            if idx != 0:
                                p = paragraph._p
                                p.remove(run._r)
                        if bool(paragraph.runs):
                            paragraph.runs[0].text = whole_text
if __name__ == '__main__':
    prs = Presentation('input.pptx')
    # To get shapes in your slides
    slides = [slide for slide in prs.slides]
    shapes = []
    for slide in slides:
        for shape in slide.shapes:
            shapes.append(shape)
    replaces = {
        '{{var1}}': 'text 1',
        '{{var2}}': 'text 2',
        '{{var3}}': 'text 3'
    }
    replace_text(replaces, shapes)
    prs.save('output.pptx')

You would have to visit each slide on each shape and look for a match using the available text features. It might not be pretty because PowerPoint has a habit of splitting runs up into what may seem like odd chunks. It does this to support features like spell checking and so forth, but its behavior there is unpredictable.
So finding the occurrences with things like Shape.text will probably be the easy part. Replacing them without losing any font formatting they have might be more difficult, depending on the particulars of your situation.
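To make the run problem concrete, here is a minimal sketch that works on plain strings rather than on a real presentation. The function name replace_across_runs is hypothetical and not part of python-pptx; it assumes you have collected the run texts of one paragraph into a list. The replacement goes into the run where the match starts, and the matched characters are trimmed out of any later runs, so the number of runs (and therefore each run's formatting) stays unchanged.

```python
def replace_across_runs(run_texts, search, repl):
    """Replace `search` with `repl` in a paragraph given as a list of run strings.

    Returns a new list with the same number of runs. The replacement text is
    written into the run where a match starts; characters of the match that
    fall into later runs are removed from those runs.
    """
    runs = list(run_texts)
    if not search:
        return runs
    start = "".join(runs).find(search)
    while start != -1:
        end = start + len(search)
        pos = 0
        inserted = False
        for i, text in enumerate(runs):
            run_start, run_end = pos, pos + len(text)
            pos = run_end
            # Overlap between this run and the match span [start, end)
            lo = max(start, run_start)
            hi = min(end, run_end)
            if lo >= hi:
                continue
            before = text[: lo - run_start]
            after = text[hi - run_start:]
            if not inserted:
                runs[i] = before + repl + after
                inserted = True
            else:
                runs[i] = before + after
        # Resume the search after the inserted text so it is never re-matched
        start = "".join(runs).find(search, start + len(repl))
    return runs


# "Week 20" is split over three runs here, as PowerPoint might do
print(replace_across_runs(["Wee", "k 2", "0 report"], "Week 20", "Week 25"))
```

With python-pptx you would presumably build the list from `[run.text for run in paragraph.runs]` and then assign each returned string back to the matching `run.text`. The replacement text takes on the formatting of the run where the match started.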

I know this question is old, but I have just finished a project that uses Python to update a PowerPoint daily. Basically, every morning the Python script is run and it pulls the data for that day from a database, places the data in the PowerPoint, and then launches PowerPoint Viewer to play it.
To answer your question, you would have to loop through all the shapes on the page and check if the string you're searching for is in shape.text. You can check whether the shape has text by checking if shape.has_text_frame is true. This avoids errors.
Here is where things get tricky. If you were to just replace the string in shape.text with the text you want to insert, you would probably lose formatting. shape.text is actually a concatenation of all the text in the shape. That text may be split into lots of 'runs', and all of those runs may have different formatting that will be lost if you write over shape.text or replace part of the string.
On the slide you have shapes, and shapes can have a text_frame, and text_frames have paragraphs (at least one, always, even when it's blank), and paragraphs can have runs. Any level can have formatting, and you have no way of determining how many runs your string is split over.
In my case I made sure that any string that was going to be replaced was in its own shape. You still have to drill all the way down to the run and set the text there so that all formatting is preserved. Also, the string you match in shape.text may actually be spread across multiple runs, so when setting the text in the first run, I also set the text in all other runs in that paragraph to blank.
Random code snippet:
from pptx import Presentation
testString = '{{thingToReplace}}'
replaceString = 'this will be inserted'
ppt = Presentation('somepptxfile.pptx')
def replaceText(shape, string, replaceString):
    # This is the hard part: you know the string is in there,
    # but it may be spread across many runs.
    for paragraph in shape.text_frame.paragraphs:
        whole_text = ''.join(run.text for run in paragraph.runs)
        if string in whole_text and paragraph.runs:
            # Put the result in the first run and blank all the others
            paragraph.runs[0].text = whole_text.replace(string, replaceString)
            for run in paragraph.runs[1:]:
                run.text = ''
for slide in ppt.slides:
    for shape in slide.shapes:
        if shape.has_text_frame:
            if shape.text.find(testString) != -1:
                replaceText(shape, testString, replaceString)
Sorry if there are any typos. I'm at work.....

I encountered a similar issue where the formatted placeholder was spread over multiple run objects. I wanted to keep the formatting, so I could not do the replacement at the paragraph level. Eventually I figured out a way to replace the placeholder.
import re
# NOTE: data (placeholder name -> value) and replace_variable_with() are not
# defined in this snippet; they are assumed to exist elsewhere.
variable_pattern = re.compile(r"{{(\w+)}}")
def process_shape_with_text(shape, variable_pattern):
    if not shape.has_text_frame:
        return
    whole_paragraph = shape.text
    matches = variable_pattern.findall(whole_paragraph)
    if len(matches) == 0:
        return
    is_found = False
    # Easy case: the whole placeholder sits inside a single run
    for paragraph in shape.text_frame.paragraphs:
        for run in paragraph.runs:
            matches = variable_pattern.findall(run.text)
            if len(matches) == 0:
                continue
            replace_variable_with(run, data, matches)
            is_found = True
    if not is_found:
        print("Not found the matched variables in the run segment but in the paragraph, target -> %s" % whole_paragraph)
        matches = variable_pattern.finditer(whole_paragraph)
        space_prefix = re.match(r"^\s+", whole_paragraph)
        # re.match returns None when there is no leading whitespace
        prefix = space_prefix.group(0) if space_prefix else ""
        match_container = [x for x in matches]
        need_modification = {}
        for i in range(len(match_container)):
            m = match_container[i]
            path_recorder = prefix
            (start_0, end_0) = m.span(0)
            (start_1, end_1) = m.span(1)
            if (i + 1) > len(match_container) - 1:
                right = end_0 + 1
            else:
                right = match_container[i + 1].start(0)
            # Walk the runs, tracking how much text has been seen so far
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    segment = run.text
                    path_recorder += segment
                    if len(path_recorder) >= start_0 + 1 and len(path_recorder) <= right:
                        print("find it")
                        if len(path_recorder) <= start_1:
                            need_modification[run] = run.text.replace('{', '')
                        elif len(path_recorder) <= end_1:
                            need_modification[run] = data[m.group(1)]
                        elif len(path_recorder) <= right:
                            need_modification[run] = run.text.replace('}', '')
        if len(need_modification) > 0:
            for key, value in need_modification.items():
                key.text = value

Since PowerPoint splits the text of a paragraph into seemingly random runs (and on top of that each run carries its own, possibly different, character formatting), you cannot just look for the text in every run: the text may be distributed over several runs, and in each of those you'll find only part of the text you are looking for.
Doing it at the paragraph level is possible, but you'll lose all character formatting of that paragraph, which might mess up your presentation quite a bit.
Taking the text at the paragraph level, doing the replacement, and assigning the result to the paragraph's first run while removing the other runs is better, but it changes the character formatting of all runs to that of the first one, again altering formatting in places where it shouldn't.
Therefore I wrote a rather comprehensive script that can be installed with
python -m pip install python-pptx-text-replacer
and that creates a command python-pptx-text-replacer that you can use to do those replacements from the command line, or you can use the class TextReplacer in that package in your own Python scripts. It is able to change text in tables, charts and wherever else some text might appear, while preserving any character formatting specified for that text.
Read the README.md at https://github.com/fschaeck/python-pptx-text-replacer for more detailed information on usage. And open an issue there if you got any problems with the code!
Also see my answer at python-pptx - How to replace keyword across multiple runs? for an example of how the script deals with character formatting...

Here's some code that could help. I found it here:
search_str = '{{{old text}}}'
repl_str = 'changed Text'
ppt = Presentation('Presentation1.pptx')
for slide in ppt.slides:
    for shape in slide.shapes:
        if shape.has_text_frame:
            shape.text = shape.text.replace(search_str, repl_str)
ppt.save('Presentation1.pptx')

Related

Removing '\n' from a string without using .translate, .replace or strip()

I'm making a simple text-based game as a learning project. I'm trying to add a feature where the user can input 'save' and their stats will be written onto a txt file named 'save.txt' so that after the program has been stopped, the player can then upload their previous stats and play from where they left off.
Here is the code for the saving:
# user inputs 'save' and class attributes are saved onto the text file as text, one line at a time
# (the value of player1.char_type is 'Wizard')
elif first_step == 'save':
    f = open("save.txt", "w")
    f.write(f'''{player1.name}
{player1.char_type}
{player1.life}
{player1.energy}
{player1.strength}
{player1.money}
{player1.weapon_lvl}
{player1.wakefulness}
{player1.days_left}
{player1.battle_count}''')
    f.close()
But, I also need the user to be able to load their saved stats next time they run the game. So they would enter 'load' and their stats will be updated.
I'm trying to read the text file one line at a time and then the value of that line would become the value of the relevant class attribute in order, one at a time. If I do this without converting it first to a string I get issues, such as some lines being skipped as python is reading 2 lines as one and putting them altogether as a list.
So, I tried the following:
In the example below, I'm only showing the data from the class attributes 'player1.name' and 'player1.char_type' seen above, to keep this question as short as possible.
elif first_step == 'load':
    f = open("save.txt", 'r')
    # reads the first line of the text file and assigns its value to player1.name_saved
    player1.name_saved = f.readline()
    # converts the value of player1.name_saved to a string and saves that string in player1.name_saved2
    player1.name_saved2 = str(player1.name_saved)
    # assigns the value of player1.name_saved2 to the class attribute player1.name
    player1.name = player1.name_saved2
    # reads the second line of the txt file and saves it in player1.char_type_saved
    player1.char_type_saved = f.readlines(1)
    # converts the value of player1.char_type_saved into a string and assigns that value to player1.char_type_saved2
    player1.char_type_saved2 = str(player1.char_type_saved)
At this point, I would assign the value of player1.char_type_saved2 to the class attribute player1.char_type so that the value of player1.char_type enables the player to load the previous character type from the last time they played the game. This should make the value of player1.char_type = 'Wizard' but I'm getting '['Wizard\n']'
I tried the following to remove the brackets and \n:
final_player1.char_type = player1.char_type_saved2.translate({ord(c): None for c in "[']\n" }) #this is intended to remove everything from the string except for Wizard
For some reason, the above only removes the square brackets and punctuation marks but not \n from the end.
I then tried the following to remove \n:
final_player1.char_type = final_player1.char_type.replace("\n", "")
final_player1.char_type is still 'Wizard\n'
I've also tried using strip() but I've been unsuccessful.
If anyone could help me with this I would greatly appreciate it. Sorry if I have overcomplicated this question but it's hard to articulate it without lots of info. Let me know if this is too much or if more info is needed to answer.

If '\n' is always at the end it may be best to use:
s = 'wizard\n'
s = s[:-1]
print(s, s)
Output:
wizard wizard
But I still think strip() is best:
s = 'wizard\n'
s = s.strip()
print(s, s)
Output:
wizard wizard

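Normally it should work with just replace(), as long as you assign the result back, because strings are immutable and replace() returns a new string:
char_type = "Wizard\n"
char_type = char_type.replace("\n", "")
print(char_type)
The output will be "Wizard"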
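For the save file in the question, the '['Wizard\n']' value most likely comes from calling str() on the list that readlines() returns: the result is the list's printed form, in which the newline shows up as a backslash followed by the letter n, so neither replace("\n", "") nor strip() has a real newline to remove. A minimal sketch of reading the values back one per line; the helper names save_stats and load_stats and the sample values are made up for illustration:

```python
import os
import tempfile


def save_stats(path, values):
    """Write one value per line, without a trailing newline."""
    with open(path, "w") as f:
        f.write("\n".join(str(v) for v in values))


def load_stats(path):
    """Return the saved lines as strings with the line endings removed."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f]


# str() on a list gives its printed form: a backslash and an n, not a newline
as_text = str(["Wizard\n"])

path = os.path.join(tempfile.mkdtemp(), "save.txt")
save_stats(path, ["Merlin", "Wizard", 100])  # hypothetical sample stats
name, char_type, life = load_stats(path)
print(repr(char_type))  # 'Wizard'
```

Everything comes back as a string, so numeric stats such as life would still need int(life) before use.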

Python Readline Loop and Subloop

I'm trying to loop through some unstructured text data in python. End goal is to structure it in a dataframe. For now I'm just trying to get the relevant data in an array and understand the line, readline() functionality in python.
This is what the text looks like:
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
This same format is repeated for lots of text articles in the same file. So far I've figured out how to pull out lines that include certain text. For example, I can loop through it and put all of the article titles in a list like this:
a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            titleList.append(line)
Now I want to do the below:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
            1. Concatenate this line with each line after it, until I reach the line that includes "Subject:". Ignore the "Subject:" line, stop the "Full text:" subloop, add the concatenated full text to the list array.
            2. Continue the for loop within which all of this sits
As a Python beginner, I'm spinning my wheels searching google on this topic. Any pointers would be much appreciated.

If you want to stick with your for-loop, you're probably going to need something like this:
titles = []
texts = []
subjects = []
with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass
(A couple of things: Python is weird about names, and when you write list = [], you're actually overwriting the label for the list class, which can cause you problems later. You should really treat list, set, and so on like keywords - even though Python technically doesn't - just to save yourself the headache. Also, the startswith method is a little more precise here, given your description of the data.)
Alternatively, you could wrap the file object in an iterator (i = iter(f), and then next(i)), but that's going to cause some headaches with catching StopIteration exceptions - but it would let you use a more classic while-loop for the whole thing. For myself, I would stick with the state-machine approach above, and just make it sufficiently robust to deal with all your reasonably expected edge-cases.
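For comparison, here is a rough sketch of the iterator variant. It assumes the Title / Full text / Subject layout from the question, and the function name parse_articles and the dictionary keys are made up for illustration. Passing a default to next(), as in next(it, None), sidesteps the StopIteration handling, because the end of input simply shows up as None.

```python
import io


def parse_articles(lines):
    """Group Title / Full text / Subject lines into one dict per article."""
    it = iter(lines)
    articles = []
    line = next(it, None)
    while line is not None:
        if line.startswith("Title:"):
            articles.append({
                "title": line[len("Title:"):].strip(),
                "full_text": "",
                "subject": "",
            })
            line = next(it, None)
        elif line.startswith("Full text:") and articles:
            parts = [line[len("Full text:"):].strip()]
            line = next(it, None)
            # Sub-loop: keep consuming lines until the next field starts
            while line is not None and not line.startswith(("Subject:", "Title:")):
                parts.append(line.strip())
                line = next(it, None)
            articles[-1]["full_text"] = " ".join(parts)
        elif line.startswith("Subject:") and articles:
            articles[-1]["subject"] = line[len("Subject:"):].strip()
            line = next(it, None)
        else:
            line = next(it, None)  # skip anything unexpected
    return articles


sample = io.StringIO(
    "Title: first article\n"
    "Full text: line one,\n"
    "line two.\n"
    "Subject: Python\n"
    "Title: second article\n"
    "Full text: only line.\n"
    "Subject: Python\n"
)
articles = parse_articles(sample)
print(articles[0]["full_text"])  # line one, line two.
```

A list of dicts like this can be passed straight to pandas.DataFrame if a dataframe is the end goal.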

As your goal is to construct a DataFrame, here is a re+numpy+pandas solution:
import re
import pandas as pd
import numpy as np
# read all file
with open('sample.txt', encoding="utf8") as f:
    text = f.read()
keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)
# split text on keys
chunks = re.split(regex, text)[1:]
# reshape flat list of records to group key/value and infos on the same article
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
Title Full text Subject
0 title of an article unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three.. Python
1 title of another article again unfortunately the full text of each article,\nis on numerous lines. Python

Is there a better way than using python-docx to extract text-chunks from a high amount of unstructured MS Word-documents?

For a text-classification problem I need to extract a huge number of text chunks from Word documents. I need to write these chunks of text into a jsonlines file so the text can be annotated using an annotation tool. The function that I wrote is this one:
import re
from docx import Document
# path2, writer and getDocID are defined elsewhere (not shown)
def writeParagraphsToFile(filename):
    textblocks = []
    paragraphs = []
    txtblock = ""
    para_ID = 0
    lineNr = 0
    startNum = re.compile(r'\A\d+')
    startsWithWhitespaces = re.compile(r'\A\s+')
    d = Document(path2 + filename)
    amountOfPara = len(d.paragraphs)
    for p in d.paragraphs:
        lineNr += 1
        if (lineNr == 1):
            titel = p.text
        elif (lineNr > 1 and startNum.match(p.text)):
            # A numbered subtitle closes the current text block
            TuTitel = ""
            if txtblock != "":
                textblocks.append(txtblock)
                txtblock = ""
            for run in p.runs:
                if run.bold and run.underline:
                    TuTitel += " "+run.text
        elif (lineNr > 1):
            if p.text == "":
                txtblock += "\n"
            elif len(p.text) > 6:
                txtblock += " "+p.text
        if lineNr == amountOfPara-1:
            textblocks.append(txtblock)
    for tb in textblocks:
        paragraphs.append(tb.strip().splitlines())
    paragrafen_nieuw = [p for p in paragraphs if p]
    for r in paragrafen_nieuw:
        for t in r:
            if t != "":
                para_ID += 1
                writer.write({"text": t.lstrip(), "meta": {"Bestandsnaam": filename, "Doc_id": getDocID(filename), "Para_id": para_ID}})
This code works for 1 specific kind of Word Documents, namely those ones that start with 1 line which is the document title and where the subtitles start with a number and are written in "bold" & "underline" font.
The problem is: I have MANY documents, with MANY layouts. Some start with a title, some don't, some start with a title consisting of 2 lines, some have actual "heading 1", "heading 2" attributes for subtitles, but most of them don't. Most of the subtitles are manually put in "bold" and/or "underline" font (sometimes only one or the other). Some documents have a subtitle starting with a number and a subsubtitle starting with a letter.
And then there is me, who barely has experience with all of this. I think I'll need to write a lot of "if-else" statements to be able to extract the text chunks from all of these documents, no matter which document the program gets as input, but I'm really wondering if there is another (better) point of view from someone who's more experienced?
Any help would be greatly appreciated.
Here are some sample documents, all with a (slightly) different layout:
From this file I can already parse the textblocks with the code you can find above.
The next links are from other 'types of' documents. These are not the only ones but there are too many different kinds because different people worked on these documents and everybody has his way / knowledge on how to write a document.
https://www.scribd.com/document/436242588/example-1
https://www.scribd.com/document/436242590/example-2
https://www.scribd.com/document/436242592/example-3
https://www.scribd.com/document/436242589/example-4
https://www.scribd.com/document/436242591/example-5
https://www.scribd.com/document/436242593/example-6
I don't need subtitles or subsubtitles in my JSON strings. I just need chunks of normal text, small "blocks" of text basically. For example in B, from subtitle 3, I would get 3 JSON strings with 1 textblock in each of them. In example 4, all the parts that are separated by an empty line could be 1 textblock and so 1 JSON string. I hope this explanation makes my original question clearer.
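One possible direction, sketched under stated assumptions rather than offered as a tested solution for these documents: instead of one branch per layout, decide per paragraph whether it looks like a heading, and treat both headings and empty paragraphs as chunk separators. The FakeRun and FakeParagraph classes below are stand-ins so the sketch runs without any Word files; with python-docx you would presumably read p.style.name, run.bold and run.underline from the real objects. The heading rule itself (heading style, or every visible run bold or underlined) is a guess at what fits the layouts described.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRun:
    """Stand-in for a python-docx run (illustration only)."""
    text: str
    bold: bool = False
    underline: bool = False


@dataclass
class FakeParagraph:
    """Stand-in for a python-docx paragraph (illustration only)."""
    runs: list = field(default_factory=list)
    style_name: str = "Normal"

    @property
    def text(self):
        return "".join(r.text for r in self.runs)


def looks_like_heading(p):
    """Heuristic: a heading style, or every visible run is bold or underlined."""
    if not p.text.strip():
        return False
    if p.style_name.lower().startswith("heading"):
        return True
    visible = [r for r in p.runs if r.text.strip()]
    return bool(visible) and all(r.bold or r.underline for r in visible)


def text_chunks(paragraphs):
    """Join consecutive body paragraphs; headings and blank lines end a chunk."""
    chunks, current = [], []
    for p in paragraphs:
        text = p.text.strip()
        if not text or looks_like_heading(p):
            if current:
                chunks.append(" ".join(current))
                current = []
        else:
            current.append(text)
    if current:
        chunks.append(" ".join(current))
    return chunks


paras = [
    FakeParagraph([FakeRun("1. Introduction", bold=True, underline=True)]),
    FakeParagraph([FakeRun("First block, "), FakeRun("line two.")]),
    FakeParagraph([]),
    FakeParagraph([FakeRun("Second block.")]),
    FakeParagraph([FakeRun("Real heading")], style_name="Heading 1"),
    FakeParagraph([FakeRun("Third block.")]),
]
print(text_chunks(paras))
```

The heuristic would also drop a body paragraph that happens to be entirely bold, so it would need checking against the real documents.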

How to extract specific line in text file

I am text mining a large document. I want to extract a specific line.
CONTINUED ON NEXT PAGE CONTINUATION SHEET REFERENCE NO. OF DOCUMENT BEING CONTINUED: PAGE 4 OF 16 PAGES
SPE2DH-20-T-0133 SECTION B
PR: 0081939954 NSN/MATERIAL: 6530015627381
ITEM DESCRIPTION
BOTTLE, SAFETY CAP
BOTTLE, SAFETY CAP RPOO1: DLA PACKAGING REQUIREMENTS FOR PROCUREMENT
RAQO1: THIS DOCUMENT INCORPORATES TECHNICAL AND/OR QUALITY REQUIREMENTS (IDENTIFIED BY AN 'R' OR AN 'I' NUMBER) SET FORTH IN FULL TEXT IN THE DLA MASTER LIST OF TECHNICAL AND QUALITY REQUIREMENTS FOUND ON THE WEB AT:
I want to extract the description immediately under ITEM DESCRIPTION.
I have tried many unsuccessful attempts.
My latest attempt was:
for line in text:
    if 'ITEM' and 'DESCRIPTION'in line:
        print ('Possibe Descript:\n', line)
But it did not find the text.
Is there a way to find ITEM DESCRIPTION and get the line after it or something similar?

The following function finds the description on the line below some given pattern, e.g. "ITEM DESCRIPTION", and also ignores any blank lines that may be present in between. However, beware that the function does not handle the special case when the pattern exists, but the description does not.
txt = '''
CONTINUED ON NEXT PAGE CONTINUATION SHEET REFERENCE NO. OF DOCUMENT BEING CONTINUED: PAGE 4 OF 16 PAGES
SPE2DH-20-T-0133 SECTION B
PR: 0081939954 NSN/MATERIAL: 6530015627381
ITEM DESCRIPTION
BOTTLE, SAFETY CAP
BOTTLE, SAFETY CAP RPOO1: DLA PACKAGING REQUIREMENTS FOR PROCUREMENT
RAQO1: THIS DOCUMENT INCORPORATES TECHNICAL AND/OR QUALITY REQUIREMENTS (IDENTIFIED BY AN 'R' OR AN 'I' NUMBER) SET FORTH IN FULL TEXT IN THE DLA MASTER LIST OF TECHNICAL AND QUALITY REQUIREMENTS FOUND ON THE WEB AT:
'''
I've assumed you have your text as a single string, so the function below splits it into a list of lines.
pattern = "ITEM DESCRIPTION"  # to search for
def find_pattern_in_txt(txt, pattern):
    lines = [line for line in txt.split("\n") if line]  # remove empty lines
    if pattern in lines:
        return lines[lines.index(pattern)+1]
    return None
print(find_pattern_in_txt(txt, pattern))  # prints: "BOTTLE, SAFETY CAP"
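The special case mentioned above (the pattern is present but nothing follows it) can be covered with a small variation. This is only a sketch: the name line_after is made up, and unlike the function above it matches the marker anywhere in a line rather than requiring the whole line to equal it.

```python
def line_after(lines, marker):
    """Return the first non-blank line after the line containing `marker`.

    Returns None if the marker is missing or nothing non-blank follows it.
    """
    it = iter(lines)
    for line in it:
        if marker in line:
            # Keep reading from the same iterator, skipping blank lines
            for candidate in it:
                if candidate.strip():
                    return candidate.strip()
            return None
    return None


sample = """PR: 0081939954 NSN/MATERIAL: 6530015627381
ITEM DESCRIPTION

BOTTLE, SAFETY CAP
BOTTLE, SAFETY CAP RPOO1: DLA PACKAGING REQUIREMENTS FOR PROCUREMENT"""

print(line_after(sample.splitlines(), "ITEM DESCRIPTION"))  # BOTTLE, SAFETY CAP
```

Because both loops share one iterator, this also works directly on an open file object without reading the whole file into memory.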

Test like this:
description = False
for line in text:
    if 'ITEM DESCRIPTION' in line:
        description = True
    if description:
        print(line)
Now this will work, but you need something to stop reading the description, maybe another title, like this:
description = False
for line in text:
    if 'ITEM DESCRIPTION' in line:
        description = True
    if description:
        print(line)
    if "END OF SOMETHING" in line:
        description = False

Use the string method 'find', as in the following. 'find' returns the index of the string you are looking for, or -1 if it is not present, so any result other than -1 means you have found it (0 means the match is at the very start).
code:
txt = "Hello, welcome to my world."
x = txt.find("welcome")
if x != -1:
    print(x)
output:
7

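with open("aa.txt", "r") as f:
    a = [line.split() for line in f]
t1 = 0
for j in range(len(a)):
    # stop one word early so a[j][i+1] is always a valid index
    for i in range(len(a[j]) - 1):
        if a[j][i] == "ITEM" and a[j][i+1] == "DESCRIPTION":
            t1 = j
# print every word on the lines after the ITEM DESCRIPTION line
for i in range(t1+1, len(a)):
    for j in range(len(a[i])):
        print(a[i][j], end=" ")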

Use regex
import re
# \s* after the newline also skips any blank lines before the information
pattern = re.compile(r"ITEM DESCRIPTION\s*\n\s*(.+)")
# read the whole file at once so the pattern can span more than one line
with open('file.txt') as f:
    text = f.read()
for match in pattern.finditer(text):
    line_no = text.count('\n', 0, match.start(1)) + 1
    print('Found on line %s: %s' % (line_no, match.group(1)))

Highlighting substrings in print statements

I have been trying to write a short script to be run directly in jupyter notebook. It simply scrolls through texts (400 words on avg.) in pandas df and asks user for a label.
I am struggling with finding an elegant solution that would highlight all substrings 'eu' in the text to be printed out.
In another thread, I found this printmd function, which I use to highlight the "eu" substring. However, this only works for the first occurrence and breaks the lines as well.
import sys
from IPython.display import clear_output
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))
printmd('**bold**')
labels = []
for i in range(0,len(SampleDf)):
    clear_output()  # clear the output before displaying another article
    print(SampleDf.loc[i]['article_title'])
    lc = SampleDf.loc[i]['article_body'].lower()  # the search is case sensitive
    pos = lc.find('eu')  # where is the 'eu' mentioned
    print(SampleDf.loc[i]['article_body'][:pos])
    printmd('**eu**')
    print(SampleDf.loc[i]['article_body'][pos+2:])
    var = input("press y if the text is irrelevant" )
    if var == 'y':
        label = 0  # 0 for trash
    else:
        label = 1  # 1 for relevant
    labels.append(label)
I would love to get rid of the line breaks introduced by the separate print statements and highlight all mentions of the "eu".

Look at this as string processing, not an output problem. If I'm understanding your needs properly, this is a simple replace usage:
new_text = old_text.replace("eu", "**eu**")
If you still need your single-token mode, then suppressing a line feed is a simple matter of using the print parameter for that purpose:
print('**eu**', end='')
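Since the question lowercases the text before searching, a case-insensitive variant may be closer to what is wanted. A small sketch using re.sub; the function name highlight is made up for illustration. It wraps every occurrence in ** and keeps the original capitalisation of each match:

```python
import re


def highlight(text, term="eu"):
    """Wrap every case-insensitive occurrence of `term` in Markdown bold."""
    pattern = re.compile(re.escape(term), flags=re.IGNORECASE)
    return pattern.sub(lambda m: "**" + m.group(0) + "**", text)


print(highlight("The EU and eu rules"))  # The **EU** and **eu** rules
```

The result can be passed to the question's printmd in a single call, which avoids the extra line breaks. Note that it also matches inside longer words such as "Europe"; adding \b word boundaries around the pattern would restrict it to the standalone token.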
