I'm new to Python. I'm trying to reverse each word in a sentence. I wrote the following code for that, and it works perfectly.
My code:
import re

str = "I am Mike!"

def reverse_word(matchobj):
    return matchobj.group(1)[::-1]

res = re.sub(r"([A-Za-z]+)", reverse_word, str)
print(res)
But I want to add one condition: only words should be reversed, not any symbols (and alphanumeric words and words containing a hyphen should be reversed as whole units).
Update:
Sample:
input: "I am Mike! and123 my-age is 12"
current output: "I ma ekiM! dna123 ym-ega si 12"
required output: "I ma ekiM! 321dna ega-ym si 21"
The Regex: ([A-Za-z]+)
You can use the character class [A-Za-z] to match any word of one or more letters, capture it, and then reverse group 1 in a function passed to re.sub.
import re

str = "I am Mike!"

def reverse_word(matchobj):
    return matchobj.group(1)[::-1]

res = re.sub(r"([A-Za-z]+)", reverse_word, str)
print(res)
Outputting:
'I ma ekiM!'
Update:
You can tweak the code a little to achieve your results:
import re

str = "I am Mike! and123 my-age is 12"

def reverse_word(matchobj):
    hyphen_word_pattern = r"([A-Za-z]+)\-([A-Za-z]+)"
    match = re.search(hyphen_word_pattern, matchobj.group(1))
    if match:
        return re.sub(hyphen_word_pattern, f"{match.group(2)[::-1]}-{match.group(1)[::-1]}", match.group(0))
    else:
        return matchobj.group(1)[::-1]

res = re.sub(r"([A-Za-z]+\-?[A-Za-z]+)", reverse_word, str)
print(res)
Outputting:
I ma ekiM! dna123 ega-ym si 12
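For the updated requirement (digits and hyphenated pairs reversed as one unit), a shorter single pattern may do the whole job. This is a sketch, not part of the original answer:

```python
import re

s = "I am Mike! and123 my-age is 12"

# Treat a run of letters/digits, optionally joined by hyphens, as one word
# and reverse the entire match; punctuation like '!' stays in place.
res = re.sub(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*",
             lambda m: m.group(0)[::-1], s)
print(res)  # I ma ekiM! 321dna ega-ym si 21
```

This produces the required output from the question, including "and123" reversed to "321dna".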
Don't use re at all
def reverse_words_in_string(string):
    spl = string.split()
    for i, word in enumerate(spl):
        spl[i] = word[::-1]
    return ' '.join(spl)
gives
'I ma !ekiM 321dna ega-ym si 21'
One approach which might work is to make an additional pass over the list of words and use re.sub to move an optional leading punctuation character back to the end of the now-reversed word:
import re

s = "I am Mike!"
split_s = s.split()
r_word = [word[::-1] for word in split_s]
r_word = [re.sub(r'^([^\s\w])(.*)$', '\\2\\1', i) for i in r_word]
new_s = " ".join(r_word)
print(new_s)
I ma ekiM!
Input:
string = "My dear adventurer, do you understand the nature of the given discussion?"
expected output:
string = 'My dear ##########, do you ########## the nature ## the given ##########?'
How can you replace every third word in a string with a run of # of the same length, while not counting special characters found in the string such as apostrophes ('), quotation marks ("), full stops (.), commas (,), exclamation marks (!), question marks (?), colons (:) and semicolons (;)?
I took the approach of converting the string to a list of elements, but I'm finding it difficult to filter out the special characters and replace the words with their # equivalents. Is there a better way to go about it?
I solved it with:
s = "My dear adventurer, do you understand the nature of the given discussion?"
def replace_alphabet_with_char(word: str, replacement: str) -> str:
    new_word = []
    alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
    for c in word:
        if c in alphabet:
            new_word.append(replacement)
        else:
            new_word.append(c)
    return "".join(new_word)
every_nth_word = 3
s_split = s.split(' ')
result = " ".join([replace_alphabet_with_char(s_split[i], '#') if i % every_nth_word == every_nth_word - 1 else s_split[i] for i in range(len(s_split))])
print(result)
Output:
My dear ##########, do you ########## the nature ## the given ##########?
There are more efficient ways to solve this question, but I hope this is the simplest!
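As a side note, the hand-written alphabet lookup in replace_alphabet_with_char can be expressed with str.isalpha; a shorter equivalent sketch (for ASCII input this matches the a-z/A-Z check above, though isalpha also accepts accented letters):

```python
def replace_alphabet_with_char(word: str, replacement: str) -> str:
    # mask letters, keep punctuation and digits as-is
    return "".join(replacement if c.isalpha() else c for c in word)

print(replace_alphabet_with_char("adventurer,", "#"))  # ##########,
```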
My approach is:
Split the sentence into a list of the words
Using that, make a list of every third word.
Remove unwanted characters from this
Replace third words in original string with # times the length of the word.
Here's the code (explained in comments) :
# original line
line = "My dear adventurer, do you understand the nature of the given discussion?"

# printing original line
print(f'\n\nOriginal Line:\n"{line}"\n')

# printing something to indicate that the next few prints show what happens after each line
print('\n\nStages of parsing:')

# splitting by spaces, into a list
wordList = line.split(' ')

# printing word list
print(wordList)

# making a list of every third word
thirdWordList = [wordList[i-1] for i in range(1, len(wordList)+1) if i % 3 == 0]

# printing third-word list
print(thirdWordList)

# characters that you don't want hashed
unwantedCharacters = ['.','/','|','?','!','_','"',',','-','#','\n','\\',':',';','(',')','<','>','{','}','[',']','%','*','&','+']

# replacing these characters by empty strings in the list of third words
for unwantedchar in unwantedCharacters:
    for i in range(0, len(thirdWordList)):
        thirdWordList[i] = thirdWordList[i].replace(unwantedchar, '')

# printing the third-word list, now without punctuation
print(thirdWordList)

# replacing with #
for word in thirdWordList:
    line = line.replace(word, len(word)*'#')

# Voila! Printing the result:
print(f'\n\nFinal Output:\n"{line}"\n\n')
Hope this helps!
The following works and does not use regular expressions:
special_chars = {'.','/','|','?','!','_','"',',','-','#','\n','\\'}

def format_word(w, fill):
    if w[-1] in special_chars:
        return fill*(len(w) - 1) + w[-1]
    else:
        return fill*len(w)

def obscure(string, every=3, fill='#'):
    return ' '.join(
        (format_word(w, fill) if (i+1) % every == 0 else w)
        for (i, w) in enumerate(string.split())
    )
Here are some usage examples:
In [15]: obscure(string)
Out[15]: 'My dear ##########, do you ########## the nature ## the given ##########?'
In [16]: obscure(string, 4)
Out[16]: 'My dear adventurer, ## you understand the ###### of the given ##########?'
In [17]: obscure(string, 3, '?')
Out[17]: 'My dear ??????????, do you ?????????? the nature ?? the given ???????????'
With the help of some regex. Explanation in the comments.
import re

imp = "My dear adventurer, do you understand the nature of the given discussion?"
every_nth = 3  # in case you want to change this later
out_list = []

# split the input at spaces, enumerate the parts for looping
for idx, word in enumerate(imp.split(' ')):
    # only do the special logic for multiples of n (0-indexed, thus +1)
    if (idx + 1) % every_nth == 0:
        # find how many special chars there are in the current segment
        len_special_chars = len(re.findall(r'[.,!?:;\'"]', word))
        # ^ add more special chars here if needed
        # subtract the number of special chars from the length of the segment
        str_len = len(word) - len_special_chars
        # repeat '#' for every non-special char and keep the trailing special chars
        if len_special_chars > 0:
            out_list.append('#'*str_len + word[-len_special_chars:])
        else:
            out_list.append('#'*str_len)
    else:
        # if the index is not a multiple of n, just add the word
        out_list.append(word)

print(' '.join(out_list))
A mix of regex and string manipulation:
import re

string = "My dear adventurer, do you understand the nature of the given discussion?"
new_string = []
for i, s in enumerate(string.split()):
    if (i+1) % 3 == 0:
        s = re.sub(r'[^\.:,;\'"!\?]', '#', s)
    new_string.append(s)
new_string = ' '.join(new_string)
print(new_string)
I use a list of synonyms to replace words in my sentence. The function works, but there is a slight problem with the output.
#Function
eda(t, alpha_sr=0.1, num_aug=3)
Original : "Un abricot est bon."
New sentence : 'Un aubercot est bon .'
As you can see, the replacement was made, but the punctuation is separated from the last word by a space, unlike in the original. I would like to modify the code so that I obtain this result for every punctuation mark:
New sentence : 'Un aubercot est bon.'
augmented_sentences.append(' '.join(a_words)) # the problem arises here: since I joined the words after splitting them, the punctuation is also joined with a space.
Since I am working with some quite long reviews, the punctuation is really important.
The code is below :
def cleaning(texte):
    texte = re.sub(r"<iwer>.*?</iwer>", " ", str(texte))  # clean
    return texte

def eda(sentence, alpha_sr: float = 0.1, num_aug: int = 3):
    sentence = cleaning(sentence)
    sent_doc = nlp(sentence)
    words = [token.text for token in sent_doc if token.pos_ != "SPACE"]
    num_words = len(words)
    augmented_sentences = []
    num_new_per_technique = int(num_aug/4) + 1
    if (alpha_sr > 0):
        n_sr = max(1, int(alpha_sr*num_words))
        for _ in range(num_new_per_technique):
            a_words = synonym_replacement(words, n_sr)
            print(a_words)
            augmented_sentences.append(' '.join(a_words))  # the problem is here, since I joined the words after splitting them
    shuffle(augmented_sentences)
    # trim so that we have the desired number of augmented sentences
    if num_aug >= 1:
        augmented_sentences = augmented_sentences[:num_aug]
    else:
        keep_prob = num_aug / len(augmented_sentences)
        augmented_sentences = [s for s in augmented_sentences if random.uniform(0, 1) < keep_prob]
    # append the original sentence
    augmented_sentences.append(sentence)
    #print(len(augmented_sentences))
    return augmented_sentences
def synonym_replacement(words, n):
    new_words = words.copy()
    random_word_list = [word for word in words if word not in stop_words]
    random_word_list = ' '.join(new_words)
    #print("random list :", random_word_list)
    sent_doc = nlp(random_word_list)
    random_word_list = [token.lemma_ for token in sent_doc if token.pos_ == "NOUN" or token.pos_ == "ADJ" or token.pos_ == "VERB" or token.pos_ == "ADV"]
    random.shuffle(random_word_list)
    num_replaced = 0
    for random_word in random_word_list:
        synonyms = get_synonyms(random_word)
        if len(synonyms) >= 1:
            synonym = random.choice(list(synonyms))
            new_words = [synonym if word == random_word else word for word in new_words]
            #print("replaced", random_word, "with", synonym)
            num_replaced += 1
        if num_replaced >= n:  # only replace up to n words
            break
    # this is stupid but we need it, trust me
    sentence = ' '.join(new_words)
    new_words = sentence.split(' ')
    return new_words
def get_synonyms(word):
    synonyms = []
    for k_syn, v_syn in word_syn_map.items():
        if k_syn == word:
            print(v_syn)
            synonyms.extend(v_syn)
    synonyms = set(synonyms)
    if word in synonyms:
        synonyms.remove(word)
    return list(synonyms)
The dictionary of synonyms looks like this:
#word_syn_map
defaultdict(<class 'list'>,
{'ADN': ['acide désoxyribonucléique', 'acide désoxyribonucléique'],
'abdomen': ['bas-ventre',
'bide',
'opisthosome',
'panse',
'ventre',
'bas-ventre',
'bide',
'opisthosome',
'panse',
'ventre'],
'abricot': ['aubercot', 'michemis', 'aubercot', 'michemis']})
Tokenization:
import stanza
import spacy_stanza
stanza.download('fr')
nlp = spacy_stanza.load_pipeline('fr', processors='tokenize,mwt,pos,lemma')
Two answers to this:
1. I can't see your nlp function, so I don't know how you're tokenising the string, but it looks like punctuation is being treated as a separate token. That's why it's picking up a space: the punctuation is handled like any other word. Either adjust your tokenisation algorithm so that it keeps the punctuation attached to the word, or, if you can't do that, make an extra pass through the words list at the start to stick punctuation back onto the token it belongs to (i.e. if a given token is punctuation, and you'll need a list of punctuation tokens for this, glue it onto the token before it). Either way, you then need to adjust your matching algorithm so that it ignores punctuation and matches the rest of the word.
2. This feels like you're overcomplicating the problem. I'd be inclined to do something like this:
import re
import random

new_sentence = sentence  # strings are immutable, so a plain assignment is fine instead of a copy

for original_word, synonyms in word_syn_map.items():
    # match the word between the start or a space and a following space/punctuation mark
    wordexp = re.compile(r'(^| )' + re.escape(original_word) + r'([ .!?,-])', re.IGNORECASE)  # add more punctuation to this class
    new_sentence = wordexp.sub(r'\g<1>' + random.choice(synonyms) + r'\g<2>', new_sentence)
Not guaranteed to work, I haven't tested it (and you'll certainly need to do something to maintain capitalisation or it'll lowercase everything), but I'd do something with regexes, myself.
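A lighter-weight alternative, assuming the only problem is the space that ' '.join(a_words) puts before punctuation tokens: post-process the joined sentence and delete whitespace that precedes a punctuation mark. A sketch:

```python
import re

def fix_punctuation_spacing(sentence):
    # remove the space that ' '.join() inserted before . , ; : ! ?
    return re.sub(r"\s+([.,;:!?])", r"\1", sentence)

print(fix_punctuation_spacing("Un aubercot est bon ."))  # Un aubercot est bon.
```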
I have a string, and I want to write a regular expression in Python to find repeated three-character sequences, where the first and last characters must be the same in both occurrences and the middle one can be any character.
Sample string
s = 'timtimdsikmunmunjuityakbonbonjdjjdkitkatghdnjsamsunksuwjkhokhojeuhjjimjamjsju'
I want to extract all the highlighted words from the above string.
My solution, which does not match my requirement:
import re
s='timtimdsikmunmunjuityakbonbonjdjjdkitkatghdnjsamsunksuwjkhokhojeuhjjimjamjsju'
re.findall(r'([a-z].[a-z])(\1)',s)
This is giving me:
[('tim', 'tim'), ('mun', 'mun'), ('bon', 'bon'), ('kho', 'kho')]
but I want this:
[('kit', 'kat'), ('sam', 'sun'), ('jim', 'jam'),('nmu', 'nju')]
Thanks
You can use capturing groups and backreferences:
s='timtimdsikmunmunjuityakbonbonjdjjdkitkatghdnjsamsunksuwjkhokhojeuhjjimjamjsju'
import re
out = re.findall(r'((.).(.)\2.\3)', s)
[e[0] for e in out]
output:
['timtim', 'munmun', 'bonbon', 'kitkat', 'khokho', 'jimjam']
ensuring the middle letter is different:
[e[0] for e in re.findall(r'((.)(.)(.)\2(?!\3).\4)', s)]
output:
['nmunju', 'kitkat', 'jimjam']
edit: split output:
>>> [(e[0][:3], e[0][3:]) for e in re.findall(r'((.)(.)(.)\2(?!\3).\4)', s)]
[('nmu', 'nju'), ('kit', 'kat'), ('jim', 'jam')]
There is always the pure Python way:
s = 'timtimdsikmunmunjuityakbonbonjdjjdkitkatghdnjsamsunksuwjkhokhojeuhjjimjamjsju'
result = []
for i in range(len(s) - 5):
    word = s[i:(i+6)]
    if (word[0] == word[3] and word[2] == word[5] and word[1] != word[4]):
        result.append(word)
print(result)
['nmunju', 'kitkat', 'jimjam']
You can use this regex in python:
(?P<first>([a-z])(.)([a-z]))(?P<second>\2(?!\3).\4)
Group first is for first word and second is for the second word.
(?!\3) is negative lookahead to make sure second character is not same in 2nd word.
import re
rx = re.compile(r"(?P<first>([a-z])(.)([a-z]))(?P<second>\2(?!\3).\4)")
s = 'timtimdsikmunmunjuityakbonbonjdjjdkitkatghdnjsamsunksuwjkhokhojeuhjjimjamjsju'
for m in rx.finditer(s):
    print(m.group('first'), m.group('second'))
Output:
nmu nju
kit kat
jim jam
You can do it faster with a for loop:
result2 = []
for i in range(len(s) - 5):
    # also require a different middle character, to match the expected output
    if s[i] == s[i+3] and s[i+2] == s[i+5] and s[i+1] != s[i+4]:
        result2.append((s[i:i+3], s[i+3:i+6]))
print(result2)
I have a string. Now I want to split the string into parts whenever anything matches from two different lists. How can I do that? Here's what I have:
dummy_word = "I have a HTML file"
dummy_type = ["HTML","JSON","XML"]
dummy_file_type = ["file","document","paper"]
for e in dummy_type:
    if e in dummy_word:
        type_found = e
        print("type ->", e)
        dum = dummy_word.split(e)
        complete_dum = "".join(dum)

for c in dummy_file_type:
    if c in complete_dum:
        then = complete_dum.split("c")
        print("file type ->", then)
In the given scenario my expected output is ["I have a", "HTML","file"]
These sorts of tasks are handled pretty well by itertools.groupby(). Here the key function maps each word to itself if it is in the set of special words, or to False if it's not. This lets all the non-special words group together while each special word becomes its own element:
from itertools import groupby
dummy_word = "I have a HTML file"
dummy_type = ["HTML","JSON","XML"]
dummy_file_type = ["file","document","paper"]
words = set(dummy_type).union(dummy_file_type)
[" ".join(g) for k, g in
groupby(dummy_word.split(), key=lambda word: (word in words) and word)]
# ['I have a', 'HTML', 'file']
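To see why that key function works: for each word it evaluates to either the word itself or False, so consecutive non-special words share the key False and collapse into one group. A quick sketch of the keys:

```python
words = {"HTML", "JSON", "XML", "file", "document", "paper"}

# (word in words) and word -> False for ordinary words, the word itself otherwise
keys = [(w in words) and w for w in "I have a HTML file".split()]
print(keys)  # [False, False, False, 'HTML', 'file']
```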
This worked for me:
dummy_word = "I have a HTML file"
dummy_type = ["HTML","JSON","XML"]
dummy_file_type = ["file","document","paper"]
temp = ""
dummy_list = []
for word in dummy_word.split():
    if word in dummy_type or word in dummy_file_type:
        if temp:
            dummy_list.append(temp)
            print(temp, "delete")
        print(temp)
        new_word = word + " "
        dummy_list.append(new_word)
        temp = ""
    else:
        temp += word + " "
        print(temp)
print(dummy_list)
One more way using re:
>>> list(map(str.strip, re.sub("|".join(dummy_type + dummy_file_type), lambda x: "," + x.group(), dummy_word).split(',')))
['I have a', 'HTML', 'file']
>>>
First, form a regex pattern by concatenating all the types using join. Using re.sub, each matched token is prepended with a comma, and then the string is split on the comma separator. map is used to strip the whitespace.
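Unpacked into separate steps, that one-liner does roughly the following:

```python
import re

dummy_word = "I have a HTML file"
dummy_type = ["HTML", "JSON", "XML"]
dummy_file_type = ["file", "document", "paper"]

# 1. build the alternation pattern 'HTML|JSON|XML|file|document|paper'
pattern = "|".join(dummy_type + dummy_file_type)
# 2. prepend a comma to every matched token
marked = re.sub(pattern, lambda x: "," + x.group(), dummy_word)
# marked is now 'I have a ,HTML ,file'
# 3. split on the commas and strip stray whitespace
parts = [p.strip() for p in marked.split(",")]
print(parts)  # ['I have a', 'HTML', 'file']
```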
I have a user entered string and I want to search it and replace any occurrences of a list of words with my replacement string.
import re
prohibitedWords = ["MVGame","Kappa","DatSheffy","DansGame","BrainSlug","SwiftRage","Kreygasm","ArsonNoSexy","GingerPower","Poooound","TooSpicy"]
# word[1] contains the user entered message
themessage = str(word[1])

# would like to implement a foreach loop here but not sure how to do it in python
for themessage in prohibitedwords:
    themessage = re.sub(prohibitedWords, "(I'm an idiot)", themessage)
print themessage
The above code doesn't work, I'm sure I don't understand how python for loops work.
You can do that with a single call to sub:
big_regex = re.compile('|'.join(map(re.escape, prohibitedWords)))
the_message = big_regex.sub("repl-string", str(word[1]))
Example:
>>> import re
>>> prohibitedWords = ['Some', 'Random', 'Words']
>>> big_regex = re.compile('|'.join(map(re.escape, prohibitedWords)))
>>> the_message = big_regex.sub("<replaced>", 'this message contains Some really Random Words')
>>> the_message
'this message contains <replaced> really <replaced> <replaced>'
Note that using str.replace may lead to subtle bugs:
>>> words = ['random', 'words']
>>> text = 'a sample message with random words'
>>> for word in words:
... text = text.replace(word, 'swords')
...
>>> text
'a sample message with sswords swords'
while using re.sub gives the correct result:
>>> big_regex = re.compile('|'.join(map(re.escape, words)))
>>> big_regex.sub("swords", 'a sample message with random words')
'a sample message with swords swords'
As thg435 points out, if you want to replace words and not every substring you can add the word boundaries to the regex:
big_regex = re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, words)))
this would replace 'random' in 'random words' but not in 'pseudorandom words'.
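A quick check of the difference the \b boundaries make (a sketch):

```python
import re

words = ['random', 'words']
# every alternative is wrapped in word boundaries: \brandom\b|\bwords\b
big_regex = re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, words)))

print(big_regex.sub('swords', 'random words'))        # swords swords
print(big_regex.sub('swords', 'pseudorandom words'))  # pseudorandom swords
```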
try this:
prohibitedWords = ["MVGame","Kappa","DatSheffy","DansGame","BrainSlug","SwiftRage","Kreygasm","ArsonNoSexy","GingerPower","Poooound","TooSpicy"]
themessage = str(word[1])
for word in prohibitedWords:
    themessage = themessage.replace(word, "(I'm an idiot)")
print themessage
Based on Bakariu's answer, a simpler way to use re.sub would be like this:
import re

words = ['random', 'words']
text = 'a sample message with random words'
new_sentence = re.sub("random|words", "swords", text)
The output is "a sample message with swords swords"
Code:
prohibitedWords = ["MVGame","Kappa","DatSheffy","DansGame",
                   "BrainSlug","SwiftRage","Kreygasm",
                   "ArsonNoSexy","GingerPower","Poooound","TooSpicy"]
themessage = 'Brain'
self_criticism = '(I`m an idiot)'
final_message = [i.replace(themessage, self_criticism) for i in prohibitedWords]
print final_message
Result:
['MVGame', 'Kappa', 'DatSheffy', 'DansGame', '(I`m an idiot)Slug', 'SwiftRage',
'Kreygasm', 'ArsonNoSexy', 'GingerPower', 'Poooound','TooSpicy']