Python docx - find and replace words with italicized version

I have thought of a few ways to accomplish this, but each is uglier than the last. I'm trying to find a way to search for all instances of a word in a Word document and italicize them.
I can't upload a Word document, but here's what I had in mind:
A working example would find all instances of billybob, including the one in the table, and italicize them. The problem is that a word is frequently split across runs: one run might contain billy and the next one bob, so there's no straightforward way to find all of them.

I'm going to leave this open because the approach I came up with isn't perfect, but it works in the vast majority of the cases. Here is the code:
import re
from docx import Document

document = Document(<YOUR_DOC>)
for paragraph in <YOUR_PARAGRAPHS>:
    run_string = ""
    run_index = {}
    i = 0
    for x, run in enumerate(paragraph.runs):
        # Create a string consisting of all the runs' text. Theoretically this
        # should always be the same as paragraph.text, but I didn't check
        run_string = run_string + run.text
        # The index i represents the starting position of the run in question
        # within the string. We are creating a dictionary of form
        # {<run_start_location>: <index_of_run>}
        run_index[i] = x
        # This will be the start of the next run
        i = i + len(run.text)
    word_you_wanted_to_find = re.findall("some_regex", paragraph.text)
    for word in word_you_wanted_to_find:
        # re.escape treats the matched text literally, so any regex
        # metacharacters in it are not reinterpreted. m.start() gives the
        # starting position of each occurrence that was found
        for word_start in [m.start() for m in re.finditer(re.escape(word), run_string)]:
            word_end = word_start + len(word)
            # This will be a list of the start positions of the runs which
            # have part of the word we want to include
            included_runs = []
            for key in run_index.keys():
                # Remember, the key is the location in the string of the start of
                # the run. The run overlaps the word if the word starts before
                # the run ends (key + len(run)) and the word ends after the
                # run starts (key)
                if word_start < (key + len(paragraph.runs[run_index[key]].text)) and key < word_end:
                    included_runs.append(key)
                # If the key is larger than or equal to the end of the word,
                # this means we have found all relevant keys. We don't need
                # to loop over the rest (we could, it just wouldn't be efficient)
                if key >= word_end:
                    break
            # At this point, included_runs is a full list of keys for the
            # relevant runs so we can modify each one in turn.
            for run_key in included_runs:
                paragraph.runs[run_index[run_key]].italic = True
document.save(<MODIFIED_DOC>)
Problem 1
The problem with this approach is that, while uncommon (at least in my doc), it is possible for a single run to contain more than just your target word. So you might end up italicizing an entire run that includes your word and then some. For my use case it didn't make sense to fix that problem here.
Solution
If you were to perfect what I did above, you would have to change this code block:
if word_start < (key + len(paragraph.runs[run_index[key]].text)) and key < word_end:
    included_runs.append(key)
Here you have identified a run that contains at least part of your word. You would need to extend the code to split the word out into its own run and remove it from the current run. Then you could italicize that run separately.
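As a rough sketch of the first step of that split, the helper below works on plain strings rather than python-docx objects. It reports which slice of each run's text belongs to a match. The name split_points and the sample run texts are made up for illustration; creating the new runs in the document is left out.

```python
def split_points(run_texts, start, end):
    """Return (run_index, local_start, local_end) for each run overlapping [start, end)."""
    pieces = []
    offset = 0
    for index, text in enumerate(run_texts):
        run_start, run_end = offset, offset + len(text)
        # Clip the match to this run's boundaries
        lo, hi = max(start, run_start), min(end, run_end)
        if lo < hi:
            pieces.append((index, lo - run_start, hi - run_start))
        offset = run_end
    return pieces

runs = ["say billy", "bob twice"]  # hypothetical run texts
full = "".join(runs)
start = full.index("billybob")
print(split_points(runs, start, start + len("billybob")))
# [(0, 4, 9), (1, 0, 3)]
```

Each tuple says which characters of a run would need to move into their own italic run; everything outside those slices keeps its original formatting.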
Problem 2
The code shown above doesn't handle both the table and normal text. I didn't need to for my use case, but in the general case you would have to check both.
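One way to cover both cases is a small generator that yields body paragraphs and then the paragraphs inside table cells. This is a minimal sketch that assumes python-docx's document.paragraphs, document.tables, table.rows, row.cells and cell.paragraphs attributes. The name iter_paragraphs is made up. Nested tables, headers and footers are not handled, and merged cells may be visited more than once.

```python
def iter_paragraphs(document):
    """Yield body paragraphs first, then the paragraphs inside every table cell."""
    yield from document.paragraphs
    for table in document.tables:
        for row in table.rows:
            for cell in row.cells:
                yield from cell.paragraphs

# Possible usage with the code above (document comes from docx.Document(...)):
# for paragraph in iter_paragraphs(document):
#     ...build run_string and italicize the matching runs...
```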

Related

Create a function in python that replaces "to be honest" in a sentence with "TBH"

Create a function in python that replaces at least four different words or phrases with internet slang acronyms such as LOL, OMG, TBH. For example, if the user enters a sentence "Oh my god, I am scared to be honest." The output should be "OMG I am scared TBH". The program must not use any built-in find, replace, encode, index, or translate functions. The program can use indexing (i.e., [ ] ), slicing (i.e., :), the in operator, and the len() function.
This is what I have so far:
user_string = (input("Please enter a string: ")).lower()
punctuations = '''.,!##$%^&*()[]{};:-'"\|<>/?_~'''
new_string = ""
list = []
for i in range(0, len(user_string)):
    if (user_string[i] not in punctuations):
        new_string = new_string + user_string[i]
print(new_string)
slang = "to be honest"
for i in range(0, len(slang)):
    for j in range(0, len(new_string)):
        if (new_string[j] == slang[i]):
            list.append(j)
            if (i < len(slang)):
                i = i + 1
        elif (new_string[j] != slang[i]):
            if (len(list) > 0):
                list.pop()
print(list)
First I am getting the sentence from the user and removing all the punctuations from the sentence. Then I have created a variable called slang which holds the slang that I want to replace in the sentence with the acronym "TBH".
I have nested for loops which compare the string that the user has entered to the first letter of the slang variable. If the letters are the same, it compares the next letter of the string with the next letter of the slang.
I'm getting an error from the last part. How do I check if "to be honest" is in the string that the user has entered? And if it is in the string, how do I replace it with "TBH"?

Apart from one indexing hazard noted below, your guard clauses prevent most Python errors, so I will assume that what you mean by error is mainly the program not working as you intended.
With that in mind, the main problem with your code is that you have nested for loops. This means that for any one character in slang, you check it against every character in new_string.
If you run through your code with this in mind, you will see that for every character in slang, you are attempting to add one value to the list and remove len(slang) - 1 values from it. Your guard clause around list.pop(), however, prevents this from causing a Python error.
I would also like to mention that the statement
if (i < len(slang)):
    i = i + 1
is unnecessary because the for loop already advances i on every iteration. It is also the indexing hazard: the guard lets i reach len(slang), and the next comparison against slang[i] can then raise an IndexError.
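For comparison, here is a minimal single-loop sketch that stays within the exercise's rules (slicing, len() and string concatenation only). The function name replace_phrase is made up, and this is just one possible approach, not the only valid one.

```python
def replace_phrase(source, phrase, target):
    """Replace every non-overlapping occurrence of phrase, using slicing only."""
    if len(phrase) == 0:
        return source
    result = ""
    i = 0
    while i < len(source):
        # Compare a window of len(phrase) characters starting at i
        if source[i:i + len(phrase)] == phrase:
            result = result + target
            i = i + len(phrase)   # skip past the matched phrase
        else:
            result = result + source[i]
            i = i + 1
    return result

print(replace_phrase("oh my god i am scared to be honest", "to be honest", "TBH"))
# oh my god i am scared TBH
```

A while loop is used instead of a for loop so that the index can jump past a whole match without fighting the loop variable.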

If you're still stuck on this problem, here's my version on how to solve this exercise:
# This is a dictionary so we can automate the replacement in the `__main__` scope
targets = {'to be honest': 'TBH', 'oh my god': 'OMG'}
# Returns a list of intervals that tell where all occurrences of the
# `sequence` passed as parameter reside inside `source`.
#
# If `sequence` is not present, the list will be empty.
def findSequences(source, sequence):
    # This is our return value.
    intervals = []
    # len is O(1). But if you need to implement your own len function,
    # it might be handy to save the results and avoid the linear cost.
    srcLength = len(source)
    seqLength = len(sequence)
    # If the sequence is larger than source, it's not inside
    if (seqLength > srcLength):
        return intervals
    # If it's smaller than or equal to source, it might be
    else:
        buffer = ''
        for i in range(srcLength):
            buffer = ''
            # From a starting character on index `i`, we will create
            # a temporary buffer with the length of sequence.
            for j in range(seqLength):
                # We must take care not to go past the end of the source
                # string; otherwise, there's no point in continuing to
                # build the buffer.
                if (i+j >= srcLength):
                    break
                else:
                    buffer += source[i+j]
            # If this temporary buffer equals sequence, we found the
            # substring!
            if (buffer == sequence):
                # Store the (inclusive) interval of the substring
                intervals.append((i, i+j))
        # Out of the for-loop.
        return intervals
# Takes out any characters inside `punctuation` from source.
#
# Uses the `in` keyword in the if-statement. But as the post says,
# it's allowed.
def takeOutPunctuation(source, punctuation='.,!##$%^&*()[]{};:-\'"\\|<>/?_~'):
    buffer = ''
    for char in source:
        if (char not in punctuation):
            buffer += char
    return buffer
# A naive approach would be not to find all intervals, but to repeatedly find
# the first `phrase` occurrence inside the `source` string and replace it. If
# you do that, replacing "TBH" with "TBH2" loops forever, appending "2" to the
# string each time.
#
# This function is smart enough to avoid that.
#
# It replaces all occurrences of the `phrase` string with the `target` string.
#
# As `findSequences` returns a list of all captures' intervals, the
# replacement will not get stuck in an infinite loop if we use
# parameters such as: myReplace(..., "TBH", "TBH2")
def myReplace(source, phrase, target):
    intervals = findSequences(source, phrase)
    if (len(intervals) == 0):
        return source
    else:
        # Append everything until the first capture
        buffer = source[:intervals[0][0]]
        # We insert this first interval just to write less code inside the for-loop.
        #
        # This is not a capture, it's just so we can access (i-1) when the iteration
        # starts.
        intervals.insert(0, (0, intervals[0][0]))
        # Start at the second position of the `intervals` list so we can access (i-1)
        # at the start of the iteration.
        for i in range(1, len(intervals)):
            # For every `phrase` capture, we append:
            # - everything that comes before the capture
            # - the `target` string
            buffer += source[intervals[i-1][1]+1:intervals[i][0]] + target
        # Once the iteration ends, we must append everything that comes
        # after the last capture.
        buffer += source[intervals[-1][1]+1:]
        # Return the modified string
        return buffer
if __name__ == '__main__':
    # Note: I didn't write input() here so we can see what the actual input is.
    user_string = 'Oh my god, I am scared to be honest and to be honest and to be honest!'.lower()
    user_string = takeOutPunctuation(user_string)
    # Automated Replacement
    for key in targets:
        user_string = myReplace(user_string, key, targets[key])
    # Print the output:
    print(user_string)
    # -> OMG i am scared TBH and TBH and TBH
Note: I used Python 3.10.2 to run this script.

Extracting multiple data from a single list

I'm working on a text file that contains several kinds of information. I converted it into a list in Python and right now I'm trying to separate the different data into different lists. The data is laid out as follows:
CODE / DESCRIPTION / Unity / Value1 / Value2 / Value3 / Value4, and then it repeats. An example would be:
P03133 Auxiliar helper un 203.02 417.54 437.22 675.80
My approach so far has been:
Creating lists to store each piece of information:
codes = []
description = []
unity = []
cost = []
Looping to find a code, based on the code's structure, and using the code's index as the base to find the remaining values.
Finding a code is easy: it's a distinct type of information amongst the other data.
For the remaining values I made a loop to find the next numeric value after a code. That way I can delimit the rest of the indexes:
The unity would be at the code's index + index until isnumeric - 1, since it's the last item before the first numeric value in each line.
The cost would be at the code's index + index until isnumeric + 2; the third value is the only one I need to store.
The description is a little harder, because the number of elements that compose it varies across the list. So I used slicing starting at the code's index + 1 and ending at index until isnumeric - 2.
for i, carc in enumerate(txtl):
    if carc[0] == "P" and carc[1].isnumeric():
        codes.append(carc)
        j = 0
        while not txtl[i+j].isnumeric():
            j = j + 1
        description.append(" ".join(txtl[i+1:i+j-2]))
        unity.append(txtl[i+j-1])
        cost.append(txtl[i+j])
I'm facing a problem with this approach. Although there will always be more elements in the list after a code, I'm getting this error:
while not txtl[i+j].isnumeric():
txtl[i+j] list index out of range.
I'd welcome any fix for my code or an entirely new approach to the problem.
Note: I'm also going to have to do this with a very similar data source, but there the code is just a sequence of 7 digits, and thus harder to find amongst the other data. Any solution that covers this case is also appreciated!

A slight addition to your code should resolve this:
while i+j < len(txtl) and not txtl[i+j].isnumeric():
    j += 1
The first condition fails when out of bounds, so the second one doesn't get checked.
Also, please use a list of dict items instead of 4 different lists, e.g.:
thelist = []
thelist.append({'codes': 69, 'description': 'random text', 'unity': 'whatever', 'cost': 'your life'})
In this way you always have the correct values together in the list, and you don't need to keep track of where you are with indexes or other black magic...
EDIT after comment interactions:
Ok, so in this case you split the line you are processing on the space character, and then process the words in the line.
from pprint import pprint # just for pretty printing
textl = 'P03133 Auxiliar helper un 203.02 417.54 437.22 675.80'
the_list = []
def handle_line(textl: str):
    words = []
    values = []
    for word in textl.split()[1:]:
        # it splits on whitespace by default
        # you can ignore the first item in the list, as this will always be the code
        # str.isnumeric() doesn't work with floats, only integers. See https://stackoverflow.com/a/23639915/9267296
        if not word.replace(',', '').replace('.', '').isnumeric():
            words.append(word)
        else:
            values.append(word)
    # the last non-numeric word is the unity; everything before it is the description
    unity = words[-1] if words else None
    description = ' '.join(words[:-1])
    return {'code': textl.split()[0], 'description': description, 'unity': unity, 'values': values}
the_list.append(handle_line(textl))
pprint(the_list)
str.isnumeric() doesn't work with floats, only integers. See https://stackoverflow.com/a/23639915/9267296
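For the second data source mentioned in the question, where the code is a 7-digit number, a regular expression can describe both code styles at once. This is only a sketch under two assumptions not stated in the question: each record sits on one line, and every value starts with a digit. The names LINE_PATTERN and parse_line, and the second sample line, are made up for illustration.

```python
import re

LINE_PATTERN = re.compile(
    r"^(?P<code>P\d+|\d{7})\s+"               # code: 'P' plus digits, or exactly 7 digits
    r"(?P<description>.+?)\s+"                # description: as few words as possible
    r"(?P<unity>\S+)\s+"                      # unity: last token before the values
    r"(?P<values>\d[\d.,]*(?:\s+\d[\d.,]*)*)\s*$"  # one or more numeric values
)

def parse_line(line):
    """Return a dict for one record line, or None if the line does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["values"] = record["values"].split()
    return record

print(parse_line("P03133 Auxiliar helper un 203.02 417.54 437.22 675.80"))
print(parse_line("1234567 Steel bolt kg 1.50 2.00 2.50 3.00"))  # hypothetical 7-digit record
```

Because the description group is lazy, the regex engine extends it word by word until the rest of the line fits the unity-then-values shape, so descriptions of any length are handled without counting indexes.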

Return the rank of word in Gensim Word2vec

I am now working on a project using Gensim.word2vec, and I am a total freshman for this field.
I already have a trained model. Is there any way to get the similarity rank of one word relative to another? For example, the top 2 most similar words for the word 'girl' are 'lady' and then 'woman'. Is there a function I can use so that if I enter 'lady' it returns 1, and if I enter 'woman' it returns 2?
Thanks!

There's no gensim API for this, but you can use basic Python code to find which position (if any) a word appears in a longer sequence – such as the list of results given by gensim's most_similar().
For example:
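origin_word = 'apple'
query_word = 'orange'
# Ask for as many results as there are words, so every word gets a rank.
# (Assumes gensim 4.x, where the lookup lives on model.wv and a topn
# below 1 returns an empty list.)
all_sims = w2v_model.wv.most_similar(origin_word, topn=len(w2v_model.wv))
query_index = -1
for i, sim_tuple in enumerate(all_sims):
    if sim_tuple[0] == query_word:
        query_index = i
        break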
At the end of this code, query_index will either be the (0-based) position of 'orange' in the list-of-all-similars, or -1 if not found.
Note that the most expensive step is the creation of the all_sims ordered-list of all similar words; if you are going to be checking the ranks of multiple query words against one origin word, you'd definitely want to keep the all_sims around rather than re-compute it each time.
In fact, if you were sure you were going to do lots of such lookups, potentially down through the very-deepest words, you might do a single pass to change the results into a dict:
word_to_sims_index = {}
for i, sim_tuple in enumerate(all_sims):
    word_to_sims_index[sim_tuple[0]] = i
After that, finding the index of a word would be a (quick constant-time) dict lookup...
query_index = word_to_sims_index[query_word]
...that will throw a KeyError if the query word isn't in the dict. (You could use word_to_sims_index.get(query_word, -1) if you instead wanted a default -1 response when the key is not present.)
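To see the lookup in isolation, here is a small sketch that needs no gensim at all. The all_sims list and its scores are made-up stand-ins for the (word, similarity) tuples that most_similar() returns, and ranks are 1-based to match the question.

```python
# Hypothetical stand-in for the sorted (word, similarity) tuples
all_sims = [("lady", 0.82), ("woman", 0.79), ("boy", 0.71)]

# Map each word to its 1-based rank in a single pass
word_to_rank = {word: rank for rank, (word, _score) in enumerate(all_sims, start=1)}

print(word_to_rank.get("lady", -1))    # 1
print(word_to_rank.get("woman", -1))   # 2
print(word_to_rank.get("banana", -1))  # -1 (not in the list)
```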

I think this is a duplicate. As the other answer says, you can use model.rank('girl', 'lady') == 1.

Python - program for searching for relevant cells in excel does not work correctly

I've written a program to search for relevant cells in an Excel file. However, it does not work as well as I had hoped.
In pseudocode, this is what it should do:
Ask for input excel file
Ask for input textfile containing keywords to search for
Convert input textfile to list containing keywords
For each keyword in list, scan the excelfile
If the keyword is found within a cell, write it into a new excelfile
Repeat with next word
The code works, but some keywords are not found while they are present within the input excelfile. I think it might have something to do with the way I iterate over the list, since when I provide a single keyword to search for, it works correctly. This is my whole code: https://pastebin.com/euZzN3T3
This is the part I suspect is not working correctly. Splitting the textfile into a list works fine (I think).
#IF TEXTFILE
elif btext == True:
    #Split each line of textfile into a list
    file = open(txtfile, 'r')
    #Keywords in list
    for line in file:
        keywordlist = file.read().splitlines()
    nkeywords = len(keywordlist)
    print(keywordlist)
    print(nkeywords)
    #Iterate over each string in list, look for match in .xlsx file
    for i in range(1, nkeywords):
        nfound = 0
        ws_matches.cell(row = 1, column = i).value = str.lower(keywordlist[i-1])
        for j in range(1, worksheet.max_row + 1):
            cursor = worksheet.cell(row = j, column = c)
            cellcontent = str.lower(cursor.value)
            if match(keywordlist[i-1], cellcontent) == True:
                ws_matches.cell(row = 2 + nfound, column = i).value = cellcontent
                nfound = nfound + 1
and my match() function:
def match(keyword, content):
    """Check if the keyword is present within the cell content, return True if found, else False"""
    if content.find(keyword) == -1:
        return False
    else:
        return True
I'm new to Python so my apologies if the way I code looks like a warzone. Can someone help me see what I'm doing wrong (or could be doing better?)? Thank you for taking the time!

Splitting the textfile into a list works fine (I think).
This is something you should actually test (hint: iterating over the file with for line in file while also calling file.read() inside that loop can skip the first line). The best way to make easily testable code is to isolate functional units into separate functions, i.e. you could make a function that takes the name of a text file and returns a list of keywords. Then you can easily check if that bit of code works on its own. A more pythonic way to read lines from a file (which is what you do, assuming one word per line) is as follows:
with open(filename) as f:
    keywords = f.read().splitlines()
The rest of your code may actually work better than you expect. I'm not able to test it right now (and don't have your spreadsheet to try it on anyway), but if you're relying on nfound to give you an accurate count for all keywords, you've made a small but significant mistake: it's set to zero inside the loop, and thus you only get a count for the last keyword. Move nfound = 0 outside the loop.
In Python, the way to iterate over lists - or just about anything - is not to increment an integer and then use that integer to index the value in the list. Rather loop over the list (or other iterable) itself:
for keyword in keywordlist:
    ...
As a hint, you shouldn't need nkeywords at all.
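Reading the posted snippet, two things look like possible causes of the missing keywords. The for line in file loop consumes the first line before file.read() collects the rest, and range(1, nkeywords) stops one short of the last keyword. The sketch below isolates the two steps into testable functions and uses plain Python lists in place of the worksheet. The names read_keywords and find_matches and the sample data are made up for illustration.

```python
import os
import tempfile

def read_keywords(path):
    """Return one lowercased keyword per non-empty line of the text file."""
    with open(path) as f:
        return [line.strip().lower() for line in f if line.strip()]

def find_matches(keywords, cells):
    """Map each keyword to the cell texts that contain it (case-insensitive)."""
    # Skip empty cells and convert non-string values before lowercasing
    lowered = [str(cell).lower() for cell in cells if cell is not None]
    return {kw: [text for text in lowered if kw in text] for kw in keywords}

# Demonstration with a throwaway keyword file and made-up cell contents
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("apple\nPear\nkiwi\n")
keywords = read_keywords(tmp.name)
os.remove(tmp.name)

matches = find_matches(keywords, ["Green apple", "pear tree", None, "Kiwi jam"])
print(matches)
# {'apple': ['green apple'], 'pear': ['pear tree'], 'kiwi': ['kiwi jam']}
```

With the matching separated from the spreadsheet I/O, each function can be checked on its own before wiring it back to the workbook.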
I hope this gets you on the right track. When asking questions in future, it'd be a great help to provide more information about what goes wrong, and preferably enough to be able to reproduce the error.

Python issue with replace statement?

I've been writing this practice program for a while now. Its whole purpose is to take user input and generate passwords. Almost everything works, but the replace statements are driving me nuts. Maybe one of you smart programmers can help me, because I'm fairly new to programming. The issue is that the replace statement only seems to work on the first character in Strng, but not on the others. Of the other functions below, the last one runs first and then the middle one runs.
def Manip(Strng):
    #Strng = 'jayjay'
    print (Strng.replace('j','h',1))
    #Displays: 'hayjay'
    print (Strng.replace('j','h',4))
    #Displays: 'hayhay'
    return
def Add_nums(Strng):
    Size=len(str(Strng))
    Total_per = str(Strng).count('%')
    # Get The % Spots Position, So they only get replaced with numbers during permutation
    currnt_Pos = 0
    per = [] # % position per for percent
    rGen = ''
    for i in str(Strng):
        if i == str('%'):
            per.append(currnt_Pos)
        currnt_Pos+=1
    for num,pos in zip(str(self.ints),per):
        rGen = Strng.replace(str(Strng[pos]),str(num),4);
    return rGen
for pos in AlphaB: # DataBase Of The Positions Of Alphabets
    for letter in self.alphas: #letters in The User Inputs
        GenPass=(self.forms.replace(self.forms[pos],letter,int(pos)))
        # Not Fully Formatted yet; you got something like Cat%%%, so you can use another function to change % to nums
        # And use the permutations function to generate other passwrds and then
        # continue to the rest of this for loop which will generate something like cat222 or cat333
        Add_nums(GenPass) # The Function That will add numbers to the Cat%%%
        print (rGen);exit()
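A note on the behaviour seen above: str.replace(old, new, count) matches by value, not by position. It replaces the first count occurrences of old and returns a new string. Calling it on the original Strng in every loop iteration therefore keeps only the last result. One possible way to fill each % by position is sketched below. The name fill_placeholders is made up, and this is an illustration rather than a fix for the full program.

```python
def fill_placeholders(template, digits, placeholder="%"):
    """Replace each placeholder, left to right, with the next character of digits."""
    remaining = iter(digits)
    out = []
    for char in template:
        if char == placeholder:
            # Keep the placeholder itself if digits runs out
            out.append(next(remaining, placeholder))
        else:
            out.append(char)
    return "".join(out)

print(fill_placeholders("cat%%%", "123"))  # cat123
print("jayjay".replace("j", "h", 1))       # hayjay
```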
