I have been trying to write a short script to be run directly in jupyter notebook. It simply scrolls through texts (400 words on avg.) in pandas df and asks user for a label.
I am struggling with finding an elegant solution that would highlight all substrings 'eu' in the text to be printed out.
In another thread, I found this printmd function, which I use to highlight the "eu" substring. However, it only works for the first occurrence and introduces line breaks as well.
import sys
from IPython.display import clear_output
from IPython.display import Markdown, display

def printmd(string):
    display(Markdown(string))

printmd('**bold**')
labels = []
for i in range(0, len(SampleDf)):
    clear_output()  # clear the output before displaying another article
    print(SampleDf.loc[i]['article_title'])
    lc = SampleDf.loc[i]['article_body'].lower()  # lowercase copy, so the search is case-insensitive
    pos = lc.find('eu')  # where is the 'eu' mentioned
    print(SampleDf.loc[i]['article_body'][:pos])
    printmd('**eu**')
    print(SampleDf.loc[i]['article_body'][pos+2:])
    var = input("press y if the text is irrelevant")
    if var == 'y':
        label = 0  # 0 for trash
    else:
        label = 1  # 1 for relevant
    labels.append(label)
I would love to get rid of the line breaks introduced by the separate print statements and highlight all mentions of the "eu".
Look at this as string processing, not an output problem. If I'm understanding your needs properly, this is a simple replace usage:
new_text = old_text.replace("eu", "**eu**")
If you still need your single-token mode, then suppressing a line feed is a simple matter of using the print parameter for that purpose:
print('**eu**', end='')
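A plain replace is case-sensitive, so it would miss "EU" or "Eu". A minimal sketch of a case-insensitive variant follows, assuming a helper named highlight (not part of the question) built on re.sub. It keeps the original casing of each match:

```python
import re

def highlight(text, term="eu"):
    """Wrap every case-insensitive occurrence of `term` in Markdown bold markers."""
    pattern = re.compile(re.escape(term), flags=re.IGNORECASE)
    # m.group(0) is the matched text itself, so the original casing is kept
    return pattern.sub(lambda m: "**" + m.group(0) + "**", text)

sample = "The EU met today. Eu officials said the eu budget is final."
print(highlight(sample))
```

In the notebook, the result would be passed to printmd once per article, so the whole text is rendered in a single call instead of three separate prints.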
I'm making a simple text-based game as a learning project. I'm trying to add a feature where the user can input 'save' and their stats will be written onto a txt file named 'save.txt' so that after the program has been stopped, the player can then upload their previous stats and play from where they left off.
Here is the code for the saving:
# user inputs 'save' and class attributes are saved onto the text file as text, one line at a time
elif first_step == 'save':
    f = open("save.txt", "w")
    # player1.char_type has the value 'Wizard' here
    f.write(f'''{player1.name}
{player1.char_type}
{player1.life}
{player1.energy}
{player1.strength}
{player1.money}
{player1.weapon_lvl}
{player1.wakefulness}
{player1.days_left}
{player1.battle_count}''')
    f.close()
But, I also need the user to be able to load their saved stats next time they run the game. So they would enter 'load' and their stats will be updated.
I'm trying to read the text file one line at a time and then the value of that line would become the value of the relevant class attribute in order, one at a time. If I do this without converting it first to a string I get issues, such as some lines being skipped as python is reading 2 lines as one and putting them altogether as a list.
So, I tried the following:
In the example below, I'm only showing the data from the class attributes 'player1.name' and 'player1.char_type' seen above, so as to keep this question as short as possible.
elif first_step == 'load':
    f = open("save.txt", 'r')
    player1.name_saved = f.readline()  # reads the first line of the text file and assigns its value to player1.name_saved
    player1.name_saved2 = str(player1.name_saved)  # converts the value of player1.name_saved to a string and saves that string in player1.name_saved2
    player1.name = player1.name_saved2  # assigns the value of player1.name_saved2 to the class attribute player1.name
    player1.char_type_saved = f.readlines(1)  # reads the second line of the txt file and saves it in player1.char_type_saved
    player1.char_type_saved2 = str(player1.char_type_saved)  # converts the value of player1.char_type_saved into a string and assigns that value to player1.char_type_saved2
At this point, I would assign the value of player1.char_type_saved2 to the class attribute player1.char_type so that the value of player1.char_type enables the player to load the previous character type from the last time they played the game. This should make the value of player1.char_type = 'Wizard' but I'm getting '['Wizard\n']'
I tried the following to remove the brackets and \n:
final_player1.char_type = player1.char_type_saved2.translate({ord(c): None for c in "[']\n" }) #this is intended to remove everything from the string except for Wizard
For some reason, the above only removes the square brackets and punctuation marks but not \n from the end.
I then tried the following to remove \n:
final_player1.char_type = final_player1.char_type.replace("\n", "")
final_player1.char_type is still 'Wizard\n'
I've also tried using strip() but I've been unsuccessful.
If anyone could help me with this I would greatly appreciate it. Sorry if I have overcomplicated this question but it's hard to articulate it without lots of info. Let me know if this is too much or if more info is needed to answer.
If '\n' is always at the end it may be best to use:
s = 'wizard\n'
s = s[:-1]
print(s, s)
Output:
wizard wizard
But I still think strip() is best:
s = 'wizard\n'
s = s.strip()
print(s, s)
Output:
wizard wizard
Normally it should work with just
char_type = "Wizard\n"
char_type = char_type.replace("\n", "")  # replace() returns a new string, so assign it back
print(char_type)
The output will be "Wizard"
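One detail that may explain why the newline survived in the question: readlines(1) returns a list, and calling str() on a list produces its repr, in which the newline is written as two ordinary characters, a backslash and an "n". Replacing "\n" then finds nothing to replace. A minimal sketch of a loader that avoids the list-to-string conversion altogether is below; read_stats and the demo file are assumptions for illustration, not the asker's code:

```python
import os
import tempfile

def read_stats(path):
    """Return the saved values, one per line, without trailing newlines."""
    with open(path, "r") as f:
        return [line.rstrip("\n") for line in f]

# str() on a list gives its repr, so the newline becomes a backslash and an "n"
as_text = str(["Wizard\n"])
print(as_text)

# Demo with a throwaway file standing in for save.txt
demo_path = os.path.join(tempfile.mkdtemp(), "save.txt")
with open(demo_path, "w") as f:
    f.write("Merlin\nWizard\n100")

name, char_type, life = read_stats(demo_path)
life = int(life)  # numeric stats come back as text and need converting
print(name, char_type, life)
```

Each value could then be assigned to the matching attribute, for example player1.char_type = char_type.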
I'm trying to loop through some unstructured text data in python. End goal is to structure it in a dataframe. For now I'm just trying to get the relevant data in an array and understand the line, readline() functionality in python.
This is what the text looks like:
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
This same format is repeated for lots of text articles in the same file. So far I've figured out how to pull out lines that include certain text. For example, I can loop through it and put all of the article titles in a list like this:
a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            titleList.append(line)
Now I want to do the below:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
            # 1. Concatenate this line with each line after it, until I reach the line that includes "Subject:". Ignore the "Subject:" line, stop the "Full text:" subloop, add the concatenated full text to the list array.
            # 2. Continue the for loop within which all of this sits
As a Python beginner, I'm spinning my wheels searching google on this topic. Any pointers would be much appreciated.
If you want to stick with your for-loop, you're probably going to need something like this:
titles = []
texts = []
subjects = []
with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass
(A couple of things: Python is weird about names, and when you write list = [], you're actually rebinding the name of the built-in list class, which can cause you problems later. You should really treat list, set, and so on like keywords, even though Python technically doesn't, just to save yourself the headache. Also, the startswith method is a little more precise here, given your description of the data.)
Alternatively, you could wrap the file object in an iterator (i = iter(f), and then next(i)), but that's going to cause some headaches with catching StopIteration exceptions - but it would let you use a more classic while-loop for the whole thing. For myself, I would stick with the state-machine approach above, and just make it sufficiently robust to deal with all your reasonably expected edge-cases.
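For comparison, here is a rough sketch of the iterator-based alternative. It assumes a hypothetical function parse_articles and uses the two-argument form next(it, None), which returns None at the end of the input instead of raising StopIteration:

```python
def parse_articles(lines):
    """Parse Title / Full text / Subject records from an iterable of lines."""
    it = iter(lines)
    records = []
    line = next(it, None)  # the None default avoids StopIteration at end of input
    while line is not None:
        if line.startswith("Title:"):
            records.append({"Title": line[len("Title:"):].strip(),
                            "Full text": "", "Subject": ""})
        elif line.startswith("Full text:") and records:
            parts = [line[len("Full text:"):].strip()]
            line = next(it, None)
            # consume continuation lines until the Subject line (or end of input)
            while line is not None and not line.startswith("Subject:"):
                parts.append(line.strip())
                line = next(it, None)
            records[-1]["Full text"] = " ".join(parts)
            continue  # re-examine the line that ended the inner loop
        elif line.startswith("Subject:") and records:
            records[-1]["Subject"] = line[len("Subject:"):].strip()
        line = next(it, None)
    return records

sample_lines = [
    "Title: title of an article\n",
    "Full text: first line,\n",
    "second line.\n",
    "Subject: Python\n",
]
print(parse_articles(sample_lines))
```

A file object can be passed in directly, and the resulting list of dicts can be handed to pd.DataFrame. Whether this reads better than the state-machine loop is a matter of taste.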
As your goal is to construct a DataFrame, here is a re+numpy+pandas solution:
import re
import pandas as pd
import numpy as np
# read all file
with open('sample.txt', encoding="utf8") as f:
    text = f.read()
keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)
# split text on keys
chunks = re.split(regex, text)[1:]
# reshape flat list of records to group key/value and infos on the same article
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
Title Full text Subject
0 title of an article unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three.. Python
1 title of another article again unfortunately the full text of each article,\nis on numerous lines. Python
I'm using difflib to compare two sentences and get the difference.
Somewhat like this.
I have this code, but it analyzes the text letter by letter instead of word by word.
import difflib
# define original text
# taken from: https://en.wikipedia.org/wiki/Internet_Information_Services
original = ["IIS 8.5 has several improvements related"]
# define modified text
edited = ["It has several improvements related"]
# initiate the Differ object
d = difflib.Differ()
# calculate the difference between the two texts
diff = d.compare(original, edited)
# output the result
print ('\n'.join(diff))
If you remove the []'s from your strings, and call .split() on them in the .compare() perhaps you'll get what you want.
import difflib
# define original text
# taken from: https://en.wikipedia.org/wiki/Internet_Information_Services
original = "IIS 8.5 has several improvements related"
# define modified text
edited = "It has several improvements related"
# initiate the Differ object
d = difflib.Differ()
# calculate the difference between the two texts
diff = d.compare(original.split(), edited.split())
# output the result
print ('\n'.join(diff))
Output
+ It
- IIS
- 8.5
  has
  several
  improvements
  related
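If only the changed words are of interest, the Differ output can be filtered by its two-character prefix. A small sketch, assuming a helper named changed_words that is not part of the original answer:

```python
import difflib

def changed_words(original, edited):
    """Return only the added ('+ ') and removed ('- ') words of a word-level diff."""
    diff = difflib.Differ().compare(original.split(), edited.split())
    # unchanged words start with two spaces, hint lines with '? '
    return [token for token in diff if token.startswith(("+ ", "- "))]

original = "IIS 8.5 has several improvements related"
edited = "It has several improvements related"
print(changed_words(original, edited))
```

Passing lists of words rather than whole strings is what makes Differ compare word by word. A plain string is treated as a sequence of characters.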
I'm in need of some knowledge on how to fix an error I have made while collecting data. The collected data has the following structure:
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
I normally wouldn't have added "[" or "]" to .txt file when writing the data to it, line per line. However, the mistake was made and thus when loading the file it will separate it the following way:
Is there a way to load the data properly to pandas?
On the snippet that I can cut and paste from the question (which I named test.txt), I could successfully read a dataframe via
Purging square brackets (with sed on a Linux command line, but this can be done e.g. with a text editor, or in python if need be)
sed -i 's/^\[//g' test.txt # remove left square brackets assuming they are at the beginning of the line
sed -i 's/\]$//g' test.txt # remove right square brackets assuming they are at the end of the line
Loading the dataframe (in a python console)
import pandas as pd
pd.read_csv("test.txt", skipinitialspace = True, quotechar='"')
(not sure that this will work for the entirety of your file though).
Consider the code below, which reads the text in myfile.txt, which looks like this:
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words ,it's basically creating a mini tornado."]
The code below removes [ and ] from the text, then splits each line on commas, skipping the first line, which holds the headers. Some messages contain commas, which would otherwise produce an extra column (NaN for the other rows), so the code joins everything after the first comma back into a single string, as intended.
Code:
import pandas as pd

with open('myfile.txt', 'r') as my_file:
    text = my_file.read()
    text = text.replace("[", "")
    text = text.replace("]", "")

df = pd.DataFrame({
    'Author': [i.split(',')[0] for i in text.split('\n')[1:]],
    'Message': [''.join(i.split(',')[1:]) for i in text.split('\n')[1:]]
}).applymap(lambda x: x.replace('"', ''))
Output:
Author Message
0 littleblackcat There's a lot of redditors here that live in the area maybe/hopefully someone saw something.
1 Kruse In other words it's basically creating a mini tornado.
Here are a few more options to add to the mix:
You could parse the lines yourself using ast.literal_eval, and then load them into a pd.DataFrame directly using an iterator over the lines:
import pandas as pd
import ast
with open('data', 'r') as f:
    lines = (ast.literal_eval(line) for line in f)
    header = next(lines)
    df = pd.DataFrame(lines, columns=header)
    print(df)
Note, however, that calling ast.literal_eval once for each line may not be very fast, especially if your data file has a lot of lines. However, if the data file is not too big, this may be an acceptable, simple solution.
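To see what ast.literal_eval does with each line, here is a standard-library-only sketch. The function name parse_bracketed_lines is an assumption for illustration, and the sample lines are taken from the question:

```python
import ast

def parse_bracketed_lines(lines):
    """Turn lines such as '["a", "b"]' into a header list and a list of row dicts."""
    # each non-empty line is a valid Python list literal, so literal_eval can read it
    parsed = [ast.literal_eval(line.strip()) for line in lines if line.strip()]
    header, rows = parsed[0], parsed[1:]
    return header, [dict(zip(header, row)) for row in rows]

raw = [
    '["Author", "Message"]\n',
    '["Kruse", "In other words, it\'s basically creating a mini tornado."]\n',
]
header, rows = parse_bracketed_lines(raw)
print(header)
print(rows)
```

Because the quoting is handled by the parser, the comma inside the message stays part of the message. The list of dicts can be passed to pd.DataFrame as-is.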
Another option is to wrap an arbitrary iterator (which yields bytes) in an IterStream. This very general tool (thanks to Mechanical snail) allows you to manipulate the contents of any file and then re-package it into a file-like object. Thus, you can fix the contents of the file, and yet still pass it to any function which expects a file-like object, such as pd.read_csv. (Note: I've answered a similar question using the same tool, here.)
import io
import pandas as pd
def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.
    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).
    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None
        def readable(self):
            return True
        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0    # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

def clean(f):
    for line in f:
        yield line.strip()[1:-1]+b'\n'

with open('data', 'rb') as f:
    # https://stackoverflow.com/a/50334183/190597 (Davide Fiocco)
    df = pd.read_csv(iterstream(clean(f)), skipinitialspace=True, quotechar='"')
    print(df)
A pure pandas option is to change the separator from , to ", " in order to have only 2 columns, and then, strip the unwanted characters, which to my understanding are [,], " and space:
import pandas as pd
import io
string = '''
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
'''
df = pd.read_csv(io.StringIO(string),sep='\", \"', engine='python').apply(lambda x: x.str.strip('[\"] '))
# the \" instead of simply " is to make sure python does not interpret is as an end of string character
df.columns = [df.columns[0][2:],df.columns[1][:-2]]
print(df)
# Output (note that the space before "There's" is also gone)
# Author Message
# 0 littleblackcat There's a lot of redditors here that live in t...
# 1 Kruse In other words, it's basically creating a mini...
For now the following solution was found:
sep = '[|"|]'
Using a multi-character separator allowed for the brackets to be stored in different columns in a pandas dataframe, which were then dropped. This avoids having to strip the words line for line.
Question in Short: How can I use the find and replace option (Ctrl+H) using the Python-pptx module?
Example Code:
from pptx import Presentation
nameOfFile = "NewPowerPoint.pptx" #Replace this with: path name on your computer + name of the new file.
def open_PowerPoint_Presentation(oldFileName, newFileName):
    prs = Presentation(oldFileName)
    prs.save(newFileName)
open_PowerPoint_Presentation('Template.pptx', nameOfFile)
I have a Power Point document named "Template.pptx". With my Python program I am adding some slides and putting some pictures in them. Once all the pictures are put into the document it saves it as another power point presentation.
The problem is that this "Template.pptx" has all the old week numbers in it, Like "Week 20". I want to make Python find and replace all these word combinations to "Week 25" (for example).
Posting code from my own project because none of the other answers quite managed to hit the mark with strings that have complex text with multiple paragraphs without losing formatting:
from typing import List
from pptx import Presentation

prs = Presentation('blah.pptx')
# To get shapes in your slides
slides = [slide for slide in prs.slides]
shapes = []
for slide in slides:
    for shape in slide.shapes:
        shapes.append(shape)

def replace_text(replacements: dict, shapes: List):
    """Takes dict of {match: replacement, ... } and replaces all matches.
    Currently not implemented for charts or graphics.
    """
    for shape in shapes:
        for match, replacement in replacements.items():
            if shape.has_text_frame:
                if (shape.text.find(match)) != -1:
                    text_frame = shape.text_frame
                    for paragraph in text_frame.paragraphs:
                        for run in paragraph.runs:
                            cur_text = run.text
                            new_text = cur_text.replace(str(match), str(replacement))
                            run.text = new_text
            if shape.has_table:
                for row in shape.table.rows:
                    for cell in row.cells:
                        if match in cell.text:
                            new_text = cell.text.replace(match, replacement)
                            cell.text = new_text

replace_text({'string to replace': 'replacement text'}, shapes)
For those of you who just want some code to copy and paste into your program that finds and replaces text in a PowerPoint while KEEPING formatting (just like I was), here you go:
def search_and_replace(search_str, repl_str, input, output):
    """search and replace text in PowerPoint while preserving formatting"""
    #Useful Links ;)
    #https://stackoverflow.com/questions/37924808/python-pptx-power-point-find-and-replace-text-ctrl-h
    #https://stackoverflow.com/questions/45247042/how-to-keep-original-text-formatting-of-text-with-python-powerpoint
    from pptx import Presentation
    prs = Presentation(input)
    for slide in prs.slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                if(shape.text.find(search_str))!=-1:
                    text_frame = shape.text_frame
                    cur_text = text_frame.paragraphs[0].runs[0].text
                    new_text = cur_text.replace(str(search_str), str(repl_str))
                    text_frame.paragraphs[0].runs[0].text = new_text
    prs.save(output)
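The code above is a combination of many answers, but it gets the job done. It replaces search_str with repl_str in the first run of the first paragraph of every shape whose text contains search_str, so a match that sits in a later paragraph or run is left untouched.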
In the scope of this answer, you would use:
search_and_replace('Week 20', 'Week 25', "Template.pptx", "NewPowerPoint.pptx")
Merging the responses above and others in a way that worked well for me (Python 3). All the original formatting was kept:
from pptx import Presentation
def replace_text(replacements, shapes):
    """Takes dict of {match: replacement, ... } and replaces all matches.
    Currently not implemented for charts or graphics.
    """
    for shape in shapes:
        for match, replacement in replacements.items():
            if shape.has_text_frame:
                if (shape.text.find(match)) != -1:
                    text_frame = shape.text_frame
                    for paragraph in text_frame.paragraphs:
                        whole_text = "".join(run.text for run in paragraph.runs)
                        whole_text = whole_text.replace(str(match), str(replacement))
                        for idx, run in enumerate(paragraph.runs):
                            if idx != 0:
                                p = paragraph._p
                                p.remove(run._r)
                        if bool(paragraph.runs):
                            paragraph.runs[0].text = whole_text

if __name__ == '__main__':
    prs = Presentation('input.pptx')
    # To get shapes in your slides
    slides = [slide for slide in prs.slides]
    shapes = []
    for slide in slides:
        for shape in slide.shapes:
            shapes.append(shape)
    replaces = {
        '{{var1}}': 'text 1',
        '{{var2}}': 'text 2',
        '{{var3}}': 'text 3'
    }
    replace_text(replaces, shapes)
    prs.save('output.pptx')
You would have to visit each slide on each shape and look for a match using the available text features. It might not be pretty because PowerPoint has a habit of splitting runs up into what may seem like odd chunks. It does this to support features like spell checking and so forth, but its behavior there is unpredictable.
So finding the occurrences with things like Shape.text will probably be the easy part. Replacing them without losing any font formatting they have might be more difficult, depending on the particulars of your situation.
I know this question is old, but I have just finished a project that uses Python to update a PowerPoint daily. Basically, every morning the Python script is run and it pulls the data for that day from a database, places the data in the PowerPoint, and then executes PowerPoint Viewer to play the PowerPoint.
To answer your question, you would have to loop through all the shapes on the page and check if the string you're searching for is in shape.text. You can check whether the shape has text by checking if shape.has_text_frame is true. This avoids errors.
Here is where things get tricky. If you were to just replace the string in shape.text with the text you want to insert, you will probably lose formatting. shape.text is actually a concatenation of all the text in the shape. That text may be split into lots of 'runs', and all of those runs may have different formatting that will be lost if you write over shape.text or replace part of the string.
On the slide you have shapes, and shapes can have a text_frame, and text_frames have paragraphs (at least one, always, even when it's blank), and paragraphs can have runs. Any level can have formatting, and you have no way of determining how many runs your string is split over.
In my case I made sure that any string that was going to be replaced was in its own shape. You still have to drill all the way down to the run and set the text there so that all formatting is preserved. Also, the string you match in shape.text may actually be spread across multiple runs, so when setting the text in the first run, I also set the text in all other runs in that paragraph to blank.
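random code snippet:
from pptx import Presentation

testString = '{{thingToReplace}}'
replaceString = 'this will be inserted'
ppt = Presentation('somepptxfile.pptx')

def replaceText(shape, string, replaceString):
    # this is the hard part
    # you know the string is in there, but it may be across many runs
    # (one possible body, following the description above: put the replaced
    # text in the first run of the paragraph and blank the remaining runs)
    for paragraph in shape.text_frame.paragraphs:
        runs = paragraph.runs
        whole = ''.join(run.text for run in runs)
        if string in whole:
            runs[0].text = whole.replace(string, replaceString)
            for run in runs[1:]:
                run.text = ''

for slide in ppt.slides:
    for shape in slide.shapes:
        if shape.has_text_frame:
            if shape.text.find(testString) != -1:
                replaceText(shape, testString, replaceString)
Sorry if there are any typos. I'm at work.....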
I encountered a similar issue where the formatted placeholder is spread over multiple run objects. I wanted to keep the formatting, so I could not do the replacement at the paragraph level. In the end, I figured out a way to replace the placeholder.
import re

# `data` (a dict of replacement values) and `replace_variable_with` are defined elsewhere
variable_pattern = re.compile(r"{{(\w+)}}")

def process_shape_with_text(shape, variable_pattern):
    if not shape.has_text_frame:
        return
    whole_paragraph = shape.text
    matches = variable_pattern.findall(whole_paragraph)
    if len(matches) == 0:
        return
    is_found = False
    for paragraph in shape.text_frame.paragraphs:
        for run in paragraph.runs:
            matches = variable_pattern.findall(run.text)
            if len(matches) == 0:
                continue
            replace_variable_with(run, data, matches)
            is_found = True
    if not is_found:
        print("Not found the matched variables in the run segment but in the paragraph, target -> %s" % whole_paragraph)
        matches = variable_pattern.finditer(whole_paragraph)
        space_prefix = re.match(r"^\s+", whole_paragraph)
        match_container = [x for x in matches]
        need_modification = {}
        for i in range(len(match_container)):
            m = match_container[i]
            # re.match returns None when the text has no leading whitespace
            path_recorder = space_prefix.group(0) if space_prefix else ""
            (start_0, end_0) = m.span(0)
            (start_1, end_1) = m.span(1)
            if (i + 1) > len(match_container) - 1:
                right = end_0 + 1
            else:
                right = match_container[i + 1].start(0)
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    segment = run.text
                    path_recorder += segment
                    if len(path_recorder) >= start_0 + 1 and len(path_recorder) <= right:
                        print("find it")
                        if len(path_recorder) <= start_1:
                            need_modification[run] = run.text.replace('{', '')
                        elif len(path_recorder) <= end_1:
                            need_modification[run] = data[m.group(1)]
                        elif len(path_recorder) <= right:
                            need_modification[run] = run.text.replace('}', '')
                        else:
                            pass
        if len(need_modification) > 0:
            for key, value in need_modification.items():
                key.text = value
Since PowerPoint splits the text of a paragraph into seemingly random runs (and on top each run carries its own - possibly different - character formatting) you can not just look for the text in every run, because the text may actually be distributed over a couple of runs and in each of those you'll only find part of the text you are looking for.
Doing it at the paragraph level is possible, but you'll lose all character formatting of that paragraph, which might screw up your presentation quite a bit.
Using the text on paragraph level, doing the replacement and assigning that result to the paragraph's first run while removing the other runs from the paragraph is better, but will change the character formatting of all runs to that of the first one, again screwing around in places, where it shouldn't.
Therefore I wrote a rather comprehensive script that can be installed with
python -m pip install python-pptx-text-replacer
and that creates a command python-pptx-text-replacer that you can use to do those replacements from the command line, or you can use the class TextReplacer in that package in your own Python scripts. It is able to change text in tables, charts and wherever else some text might appear, while preserving any character formatting specified for that text.
Read the README.md at https://github.com/fschaeck/python-pptx-text-replacer for more detailed information on usage. And open an issue there if you got any problems with the code!
Also see my answer at python-pptx - How to replace keyword across multiple runs? for an example of how the script deals with character formatting...
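To make the idea of a replacement that spans runs more concrete, here is a simplified sketch that works on a plain list of strings standing in for the run.text values of one paragraph. It is not the code of the package mentioned above, and the name replace_across_runs is made up for this illustration. The replacement goes into the first run the match touches, and only the matched characters are cut from the following runs, so each run object (and with it the run's formatting) stays in place:

```python
def replace_across_runs(run_texts, search, repl):
    """Replace the first occurrence of `search`, even if it spans several runs.

    `run_texts` is a list of strings standing in for run.text values.
    """
    texts = list(run_texts)
    whole = "".join(texts)
    start = whole.find(search)
    if not search or start == -1:
        return texts
    end = start + len(search)
    pos = 0
    placed = False
    for i, text in enumerate(texts):
        run_start, run_end = pos, pos + len(text)
        pos = run_end
        # overlap of this run with the matched span [start, end)
        lo, hi = max(start, run_start), min(end, run_end)
        if lo >= hi:
            continue  # this run does not overlap the match
        before = text[:lo - run_start]
        after = text[hi - run_start:]
        if not placed:
            texts[i] = before + repl + after  # first touched run receives the replacement
            placed = True
        else:
            texts[i] = before + after  # later runs only lose the matched characters
    return texts

print(replace_across_runs(["Week ", "2", "0 report"], "Week 20", "Week 25"))
```

With python-pptx, the returned strings would be written back to the corresponding run.text attributes. Handling several matches, tables and charts would need more work than this sketch shows.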
Here's some code that could help. I found it here:
from pptx import Presentation

search_str = '{{{old text}}}'
repl_str = 'changed Text'
ppt = Presentation('Presentation1.pptx')
for slide in ppt.slides:
    for shape in slide.shapes:
        if shape.has_text_frame:
            # assigning to shape.text rewrites the whole text frame, so run-level formatting is not kept
            shape.text = shape.text.replace(search_str, repl_str)
ppt.save('Presentation1.pptx')