Conditionally replace substring in a string - python

I have a string that looks like this:
str1="lol-tion haha-futures-tion yo-tion ard-tion pomo-tion"
I want to replace the substring tion with cloud if the word contains only one -, as in lol-tion:
str2=str1.replace('tion','cloud')
But when the word has two -, for instance haha-futures-tion, I want to remove the suffix instead:
str3=str1.replace('tion','')
Expected output: haha-futures
How can I accomplish both of these conditional replacements?

You can try this code, which combines a regex with a condition:
import re
str1 = "lol-tion haha-futures-tion yo-tion ard-tion pomo-tion"
words = []
for sub_string in str1.split(' '):
    # only touch words made of letters and hyphens that end in -tion
    if re.search(r'^[a-zA-Z-]+-tion$', sub_string):
        if re.match(r"[a-zA-Z]+-[a-zA-Z]+-tion$", sub_string):
            # two hyphens: drop the whole '-tion' suffix, hyphen included
            sub_string = sub_string[:-len('-tion')]
        else:
            # otherwise: swap tion for cloud
            sub_string = sub_string.replace('tion', 'cloud')
    words.append(sub_string)
str2 = ' '.join(words)
print(str2)

Your description isn't really clear, nor does it fully specify what should happen in all cases, but here's one interpretation:
def replace_tion(word):
    replacements = {1: 'cloud', 2: ''}
    replacement = replacements.get(word.count('-'), 'tion')
    return word.replace('tion', replacement)
str1 = "lol-tion haha-futures-tion yo-tion ard-tion pomo-tion"
tions_replaced = ' '.join(replace_tion(word) for word in str1.split(' '))
# 'lol-cloud haha-futures- yo-cloud ard-cloud pomo-cloud'

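IIUC this will work:
# str.strip('-tion') strips any of the characters -, t, i, o, n from both ends,
# not the suffix itself, so slice the suffix off instead
new_str1 = ' '.join([string[:-len('-tion')]
                     if string.count('-') == 2
                     else string.replace('tion', 'cloud')
                     for string in str1.split(' ')])
print(new_str1)
# lol-cloud haha-futures yo-cloud ard-cloud pomo-cloud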
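The per-word condition discussed in the answers above can also be expressed as a single re.sub call with a replacement function. This is only a sketch: the helper name swap_suffix is made up for illustration, and leaving words with any other number of hyphens untouched is an assumption, since the question does not say what should happen to them.

```python
import re

def swap_suffix(match):
    """Decide the replacement for one word that ends in -tion."""
    word = match.group(0)
    stem = word[:-len("-tion")]      # text before the final "-tion"
    hyphens = word.count("-")
    if hyphens == 1:
        return stem + "-cloud"       # lol-tion -> lol-cloud
    if hyphens == 2:
        return stem                  # haha-futures-tion -> haha-futures
    return word                      # assumption: leave every other case alone

str1 = "lol-tion haha-futures-tion yo-tion ard-tion pomo-tion"
# \S+ grabs one whitespace-delimited word; -tion\b requires it to end in -tion
result = re.sub(r"\S+-tion\b", swap_suffix, str1)
print(result)
```

Because the function is called once per matching word, the string never has to be split and re-joined, so the original spacing is kept.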

Related

How to filter a sentence based on list of the allowed words in python?

I have allow_wd, a list of the words that I want to search for.
sentench is a list of sentences from the main database.
The output I need is:
Newsentench = ['one three','']
Please help
sentench=['one from twooo or three people are here','he is here']
allow_wd=['one','two','three','four']
It is difficult to understand what you are asking. Assuming you want to keep any word in sentench that contains one of the words in allow_wd, something like the following will work:
sentench=['one from twooo or three people are here','he is here']
allow_wd=['one','two','three','four']
result = []
for sentence in sentench:
    filtered = []
    for word in sentence.split():
        for allowed_word in allow_wd:
            if allowed_word.lower() in word.lower():
                filtered.append(word)
                break  # stop at the first hit so a word is not added twice
    result.append(" ".join(filtered))
print(result)
If you want each word to be exactly equal to an allowed word instead of just containing one, change if allowed_word.lower() in word.lower(): to if allowed_word.lower() == word.lower():
Using regex word boundaries with \b ensures that two is matched strictly and won't match twooo.
import re
sentench=['one from twooo or three people are here','he is here']
allow_wd=['one','two','three','four']
newsentench = []
for sent in sentench:
    output = []
    for wd in allow_wd:
        # \b on both sides: match wd only as a whole word
        if re.search(r'\b' + re.escape(wd) + r'\b', sent):
            output.append(wd)
    newsentench.append(' '.join(output))
print(newsentench)
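The word-boundary idea above can also be written with one precompiled pattern instead of one search per allowed word. The following is a sketch rather than a definitive version; the variable name pattern is introduced here for illustration. A side effect of this variant is that the kept words come out in the order they appear in the sentence, not in the order of allow_wd.

```python
import re

sentench = ['one from twooo or three people are here', 'he is here']
allow_wd = ['one', 'two', 'three', 'four']

# One alternation for all allowed words; re.escape guards against
# regex metacharacters in the word list.
pattern = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in allow_wd) + r")\b")

# findall returns every whole-word match, in sentence order
newsentench = [" ".join(pattern.findall(sent)) for sent in sentench]
print(newsentench)
```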
Thanks for your clarification, this should be what you want.
sentench=['one from twooo or three people are here','he is here']
allow_wd=['one','two','three','four']
print([" ".join([word for word in s.split(" ") if word in allow_wd]) for s in sentench])
returning: ['one three', '']
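For longer word lists, the same exact-match filter can be sketched with a set, which makes each membership test cheap. The function name filter_allowed is hypothetical, and the case-insensitive comparison is an assumption that the question does not state.

```python
def filter_allowed(sentences, allowed):
    """Keep only the words of each sentence that appear in `allowed`."""
    # Assumption: matching should ignore case.
    allowed_set = {word.lower() for word in allowed}
    return [
        " ".join(w for w in sentence.split() if w.lower() in allowed_set)
        for sentence in sentences
    ]

sentench = ['one from twooo or three people are here', 'he is here']
allow_wd = ['one', 'two', 'three', 'four']
print(filter_allowed(sentench, allow_wd))
```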

How to remove duplicate chars in a string?

I've got this problem and I simply can't get it right. I have to remove duplicated chars from a string.
phrase = "oo rarato roeroeu aa rouroupa dodo rerei dde romroma"
The output should be: "O rato roeu a roupa do rei de roma"
I tried things like:
def remove_duplicates(value):
    var=""
    for i in value:
        if i in value:
            if i in var:
                pass
            else:
                var=var+i
    return var
print(remove_duplicates(phrase))
But it's not there yet...
Any pointers to guide me here?
It seems from your example that you want to remove repeated sequences of characters, not duplicate characters across the whole string, so that is what I'm solving here.
You can use a regular expression. I'm not sure how efficient it is, but it works.
>>> import re
>>> phrase = str("oo rarato roeroeu aa rouroupa dodo rerei dde romroma")
>>> re.sub(r'(.+?)\1+', r'\1', phrase)
'o rato roeu a roupa do rei de roma'
How this substitution proceeds down the string:
oo -> o
" " -> " "
rara -> ra
to -> to
" "-> " "
roeroe -> roe
etc..
Edit: Works for the other example string which should not be modified:
>>> phrase = str("Barbara Bebe com Bernardo")
>>> re.sub(r'(.+?)\1+', r'\1', phrase)
'Barbara Bebe com Bernardo'
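To see which repeated sequences the substitution above actually collapses, one option is to list the matches with re.finditer. This is a small illustrative sketch; the names pattern and collapsed are not from the answer.

```python
import re

phrase = "oo rarato roeroeu aa rouroupa dodo rerei dde romroma"
pattern = re.compile(r"(.+?)\1+")

# For each match: the full repeated run, and the single unit that replaces it
collapsed = [(m.group(0), m.group(1)) for m in pattern.finditer(phrase)]
for run, unit in collapsed:
    print(run, "->", unit)

print(pattern.sub(r"\1", phrase))
```

The lazy (.+?) tries the shortest candidate unit first, and \1+ then requires at least one immediate repetition of it, which is why a name like Barbara, where no chunk is directly followed by itself, is left alone.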
What you can do is form a set out of the string and then sort the remaining letters according to their original order.
def remove_duplicates(word):
    unique_letters = set(word)
    sorted_letters = sorted(unique_letters, key=word.index)  # this will give you a list
    return ''.join(sorted_letters)
words = phrase.split(' ')
new_phrase = ' '.join(remove_duplicates(word) for word in words)
A string in Python is a sequence of characters, and a sequence can contain duplicates, while a set cannot. So if we convert a word to a set and back, we get its letters without duplicates.
I've seen a suggestion to use a regex for replacing patterns. That will work, but I find it overcomplicated and harder for a human to read.
Regex is a heavy tool for this.
Also, you are not removing duplicates from the whole string, but from each word in the string:
First, split your string into a list of words.
For each of the words, remove the duplicate letters.
Join the words back into a string.
phrase = "oo rarato roeroeu aa rouroupa dodo rerei dde romroma"
words = phrase.split(' ')
# words: ['oo', 'rarato', 'roeroeu', 'aa', 'rouroupa', 'dodo', 'rerei', 'dde', 'romroma']
words_without_duplicates = []
for word in words:
    word_without_duplicates = ''.join(set(word))
    words_without_duplicates.append(word_without_duplicates)
phrase = ' '.join(words_without_duplicates)
# phrase: 'o oatr oeur a auopr od eir ed oamr'
# (a set is unordered, so the letter order inside each word can differ from run to run)
Of course, that can be optimized, but you wanted to be guided, so this version is better for showing the idea. It will likely be faster than regex too.
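The set-based idea above loses the letter order, as its own output shows. A minimal order-preserving sketch uses dict.fromkeys, which keeps first-seen order (guaranteed for dicts since Python 3.7). The function name dedupe_letters is made up for illustration. Note that this removes every repeated letter within a word, so it suits the example phrase but would also shorten ordinary words that legitimately repeat a letter.

```python
def dedupe_letters(word):
    """Drop repeated letters from a word, keeping first occurrences in order."""
    return "".join(dict.fromkeys(word))

phrase = "oo rarato roeroeu aa rouroupa dodo rerei dde romroma"
new_phrase = " ".join(dedupe_letters(word) for word in phrase.split(" "))
print(new_phrase)
```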
I added a space at the end of the phrase. After that, this works:
Code
phrase = "oo rarato roeroeu aa rouroupa dodo rerei dde romroma "
ch = ""
for i in phrase:
    if i == " ":
        # end of a word: print the letters collected so far and start over
        print(ch)
        ch = ""
    elif i not in ch:
        ch = ch + i
Output
o
rato
roeu
a
roupa
do
rei
de
roma

Clean list of string containing escape sequence in python

I'm working on an OCR and the text extract from the image gets appended to a list that has a lot of escape sequences in it.
How can I clean a list of string like this
extracted = ["b'i)\\nSYRUP\\na\\n\\x0c'",
"b'mi.\\n\\x0c'",
"b'100\\n\\x0c'",
"b'Te eT ran\\nSYRUP\\n\\x0c'",
"b'tamol, Ambroxol k\\n\\x0c'",
"b'Guaiphenesin\\n\\x0c'",
"b'Syrup\\n\\x0c'",
"b'ol HCl &\\n\\x0c'",
"b'quantity.\\n\\x0c'"]
to this
cleaned= ["SYRUP",
"mi",
"100",
"Te eT ran SYRUP",
"tamol, Ambroxol k",
"Guaiphenesin",
"Syrup",
"ol HCl &",
"quantity"]
I tried replacing them but nothing works out and it goes back to how it was when extracted. Any suggestions? Thanks in advance.
For a start you could try:
for i, s in enumerate(extracted):
    extracted[i] = (s.replace("b'", '')
                     .replace("i)", '')
                     .replace('\\na', '')
                     .replace('\\n', '')
                     .replace("\\x0c'", '')
                     .replace('.', ''))
These look like string representations of bytes objects, which you can evaluate and then decode as UTF-8. We use literal_eval from ast for safe evaluation.
This will get you most of the way there; OCR oddities like i) you'll need to fix manually by replacing them.
import ast
extracted = [
"b'i)\\nSYRUP\\na\\n\\x0c'",
"b'mi.\\n\\x0c'",
"b'100\\n\\x0c'",
"b'Te eT ran\\nSYRUP\\n\\x0c'",
"b'tamol, Ambroxol k\\n\\x0c'",
"b'Guaiphenesin\\n\\x0c'",
"b'Syrup\\n\\x0c'",
"b'ol HCl &\\n\\x0c'",
"b'quantity.\\n\\x0c'"]
def fix_string(s):
    eval_str = ast.literal_eval(s)
    dec_str = eval_str.decode('utf-8')
    fix_str = dec_str.strip().replace('\n', ' ')
    return fix_str

for e in extracted:
    print(fix_string(e))
Output:
i) SYRUP a
mi.
100
Te eT ran SYRUP
tamol, Ambroxol k
Guaiphenesin
Syrup
ol HCl &
quantity.
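Building on the literal_eval approach above, the whitespace cleanup and the trailing periods can be handled in one step. This is a sketch under assumptions: the function name clean_ocr_repr is hypothetical, and dropping a trailing period is inferred from the desired output in the question. OCR noise such as i) is still left for manual cleanup.

```python
import ast

def clean_ocr_repr(s):
    """Turn a "b'...'" string into plain text with tidy whitespace."""
    raw = ast.literal_eval(s)       # the "b'...'" text becomes a bytes object
    text = raw.decode("utf-8")
    # split() with no arguments splits on any whitespace run,
    # which covers both \n and the form feed \x0c
    return " ".join(text.split()).rstrip(".")

extracted = ["b'i)\\nSYRUP\\na\\n\\x0c'",
             "b'mi.\\n\\x0c'",
             "b'100\\n\\x0c'",
             "b'Te eT ran\\nSYRUP\\n\\x0c'",
             "b'tamol, Ambroxol k\\n\\x0c'",
             "b'Guaiphenesin\\n\\x0c'",
             "b'Syrup\\n\\x0c'",
             "b'ol HCl &\\n\\x0c'",
             "b'quantity.\\n\\x0c'"]
cleaned = [clean_ocr_repr(s) for s in extracted]
print(cleaned)
```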
Here is an answer that assumes the substring you are looking for in each string is either between two newlines or at the beginning of a string and followed by a newline.
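import ast
import re

def find_substring(string):
    # literal_eval safely turns the "b'...'" text into bytes (eval would run arbitrary code)
    string = ast.literal_eval(string).decode('UTF-8')
    pattern = r"\n?.*\.?\n"
    lst = re.findall(pattern, string)
    if len(lst) == 1:
        # a single line: drop the trailing period and newline
        substring = lst[0].strip(".\n")
    else:
        # several lines: keep the one enclosed between two newlines
        pattern2 = r"\n.*\n"
        lst2 = re.findall(pattern2, "".join(lst))
        substring = lst2[0].strip("\n")
    return substring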
Then, map to the list like so.
list(map(find_substring,extracted))
This outputs:
['SYRUP',
'mi',
'100',
'SYRUP',
'tamol, Ambroxol k',
'Guaiphenesin',
'Syrup',
'ol HCl &',
'quantity']

Should I be using regex in Python

I have a string like so:
'cathy is a singer on fridays'
and I want to be able to replace the fourth word with other words,
so that it becomes, for example,
'cathy is a dancer on fridays'
I assumed the right way to do this would be to use a regex and stop at the third whitespace, since regex supports groupings and wildcards that accept any character, but I can't seem to get it working.
Any advice would be useful. I am new to Python, so please don't judge. Also, is regex appropriate for this, or should I use another method?
Thank you
No, Regex is not needed for this. See below:
>>> mystr = 'cathy is a singer on fridays'
>>> x = mystr.split()
>>> x
['cathy', 'is', 'a', 'singer', 'on', 'fridays']
>>> x[3] = "dancer"
>>> x
['cathy', 'is', 'a', 'dancer', 'on', 'fridays']
>>> " ".join(x)
'cathy is a dancer on fridays'
Or, more compact:
>>> mystr = 'cathy is a singer on fridays'
>>> x = mystr.split()
>>> " ".join(x[:3] + ["dancer"] + x[4:])
'cathy is a dancer on fridays'
>>>
The core principle here is the .split method of a string.
You can get what you want by splitting and joining the string after substituting the desired piece
stringlist = 'cathy is a singer on fridays'.split()
stringlist[3] = 'dancer'
print(' '.join(stringlist))
Here is a solution using backreferences and the sub function from re (see the re module documentation):
import re
msg = 'cathy is a singer on fridays'
# capture the first four words, then re-emit the first three followed by the new word
print(re.sub(r'(\w+) (\w+) (\w+) (\w+)', r'\1 \2 \3 dancer', msg, count=1))
Output
cathy is a dancer on fridays
If you really just want to replace the fourth word, split/slice/join is easier:
mytext = 'cathy is a singer on fridays'
mysplit = mytext.split(' ')
' '.join(mysplit[:3] + ['dancer',] + mysplit[4:])
regex can do much more complicated things, and there is a re.split, and there might be a faster way to do it, but this is reasonable and readable.
You can either split the string using split(' ') or a tokenizer like nltk which might also provide you some more functionality for this specific use case with part of speech analysis. If you are trying to replace it with random nouns of profession look for a word bank. Regex is overkill for what you need.
If you already know the position of the word you want to replace in the string, you could simply use:
def replace_word(sentence, new_word, position):
    sent_list = sentence.split()
    sent_list[position] = new_word
    return " ".join(sent_list)
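The split/join answers above normalize the spacing of the sentence. If the original whitespace should survive, one possible sketch splits with a capturing group so that the separators are kept. The function name replace_word_keep_spacing is hypothetical, and the code assumes that the sentence has no leading whitespace and that position is a valid 0-based word index.

```python
import re

def replace_word_keep_spacing(sentence, new_word, position):
    """Replace the word at `position` (0-based) without touching the whitespace."""
    # With a capturing group, re.split keeps the separators: words land at
    # even indices and whitespace runs at odd indices.
    # Assumption: the sentence does not start with whitespace.
    tokens = re.split(r"(\s+)", sentence)
    tokens[2 * position] = new_word
    return "".join(tokens)

print(replace_word_keep_spacing("cathy is  a singer on fridays", "dancer", 3))
```

For a single known position and ordinary single-space text, the plain split and join shown in the answers remains the simpler choice.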

String comparison in python words ending with

I have a set of words as follows:
['Hey, how are you?\n','My name is Mathews.\n','I hate vegetables\n','French fries came out soggy\n']
In the above sentences I need to identify all sentences ending with ?, . or gy, and print the final word.
My approach is as follows:
# words will contain the list I have pasted above.
word = [w for w in words if re.search('(?|.|gy)$', w)]
for i in word:
    print(i)
The result i get is:
Hey, how are you?
My name is Mathews.
I hate vegetables
French fries came out soggy
The expected result is:
you?
Mathews.
soggy
Use the endswith() method.
>>> for line in words:
...     for word in line.split():
...         if word.endswith(('?', '.', 'gy')):
...             print(word)
Output:
you?
Mathews.
soggy
Use endswith with a tuple.
lines = ['Hey, how are you?\n','My name is Mathews.\n','I hate vegetables\n','French fries came out soggy\n']
for line in lines:
    for word in line.split():
        if word.endswith(('?', '.', 'gy')):
            print(word)
Regular expression alternative:
import re
lines = ['Hey, how are you?\n','My name is Mathews.\n','I hate vegetables\n','French fries came out soggy\n']
for line in lines:
    for word in re.findall(r'\w+(?:\?|\.|gy\b)', line):
        print(word)
You were close.
You just need to escape the special characters (? and .) in the pattern:
re.search(r'(\?|\.|gy)$', w)
More details in the documentation.
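The answers above test every word in a line, while the question asks about sentences that end with one of the suffixes. A sketch that looks only at the last word of each line could look like this; the function name final_words and its suffixes parameter are made up for illustration.

```python
def final_words(lines, suffixes=("?", ".", "gy")):
    """Return the last word of each line that ends with one of `suffixes`."""
    result = []
    for line in lines:
        words = line.split()   # split() also discards the trailing newline
        # str.endswith accepts a tuple and checks each suffix in turn
        if words and words[-1].endswith(suffixes):
            result.append(words[-1])
    return result

lines = ['Hey, how are you?\n', 'My name is Mathews.\n',
         'I hate vegetables\n', 'French fries came out soggy\n']
print(final_words(lines))
```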
