replace a list of words with regex - python

I have a list of words: list_words = ["cat", "dog", "animals"]
And I have a text: text = "I have a lot of animals a cat and a dog"
I want a regex that adds a comma to the end of every word that comes directly before a word from the list.
I want my text to look like this: text = "I have a lot of, animals a, cat and a, dog"
My code so far:
import re
list_words = ["cat", "dog", "animals","adam"]
text = "I have a lot of animals, cat and a dog"
for word in list_words:
    if word in text:
        word = re.search(r" (\s*{})".format(word), text).group(1)
        text = text.replace(f" {word}", f", {word}")
print(text)
But I have two issues here:
1: If I have a text like this: text = "I have a lot of animals cat and a dogy"
it turns it into: text = "I have a lot of, animals, cat and a, dogy"
which is not the result I want. I only want to match the word itself, not a longer word that contains it, like dogy.
2: If I have a text like this: text = "I have a lot of animals, cat and a dogy"
it still adds another comma, which is not what I want.

You can use
,*(\s*\b(?:cat|dog|animals|adam))\b
See the regex demo. Details:
,* - zero or more commas
(\s*\b(?:cat|dog|animals|adam)) - Group 1:
    \s* - zero or more whitespaces
    \b - a word boundary
    (?:cat|dog|animals|adam) - one of the words
\b - a word boundary
See the Python demo:
import re
list_words = ["cat", "dog", "animals", "adam"]
text = "I have a lot of animals, cat and a dog"
pattern = r",*(\s*\b(?:{}))\b".format("|".join(list_words))
print( re.sub(pattern, r",\1", text) )
# => I have a lot of, animals, cat and a, dog
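A minimal sketch of the same idea wrapped in a reusable function. The name add_comma_before is hypothetical, and two details differ from the pattern above and are my own assumptions: every word goes through re.escape so that regex metacharacters in the list are matched literally, and \s+ is used instead of \s* so that a listed word at the very start of the text does not receive a leading comma.

```python
import re

def add_comma_before(words, text):
    """Put a single comma before each whole-word occurrence of a listed word."""
    # Escape the words so characters like "." or "+" are treated literally.
    alternation = "|".join(re.escape(w) for w in words)
    # ,* swallows a comma that is already there; \b...\b restricts to whole words.
    pattern = re.compile(r",*(\s+\b(?:{})\b)".format(alternation))
    return pattern.sub(r",\1", text)

list_words = ["cat", "dog", "animals", "adam"]
print(add_comma_before(list_words, "I have a lot of animals, cat and a dogy"))
# => I have a lot of, animals, cat and a dogy
```

Both issues from the question are covered here: "dogy" is left alone because of the trailing \b, and the existing comma before "cat" is replaced rather than doubled.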

All words get a comma:
import re
list_words = ["cat", "dog", "animals"]
text = "I have a lot of animals a cat and a dog"
for word in list_words:
    # \b keeps "dog" from matching inside a longer word such as "dogy"
    text = re.sub(rf"\s+\b{re.escape(word)}\b", f", {word}", text)
print(text)

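Go with a simpler method: split the text and append a comma to the token that precedes each listed word.
list_words = ["cat", "dog", "animals"]
text = "I have a lot of animals a cat and a dog"
result = []
for token in text.split(" "):
    # Add the comma to the previous token, unless it already ends with one
    if token in list_words and result and not result[-1].endswith(","):
        result[-1] += ","
    result.append(token)
print(" ".join(result))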

Related

How to replace multiple substring at once and not sequentially?

I want to replace multiple substrings at once. For instance, in the following statement I want to replace dog with cat and cat with dog:
I have a dog but not a cat.
However, when I use sequential replace string.replace('dog', 'cat') and then string.replace('cat', 'dog'), I get the following.
I have a dog but not a dog.
I have a long list of replacements to be done at once so a nested replace with temp will not help.
One way using re.sub:
import re
string = "I have a dog but not a cat."
d = {"dog": "cat", "cat": "dog"}
new_string = re.sub("|".join(d), lambda x: d[x.group(0)], string)
Output:
'I have a cat but not a dog.'
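For a long replacement list, a slightly more defensive sketch of the same single-pass idea may help. The function name replace_all is hypothetical. Two things are added on my own assumption rather than taken from the answer: keys are escaped with re.escape, and longer keys are tried first so that a key which is a prefix of another key does not win by accident.

```python
import re

def replace_all(text, mapping):
    """Apply every replacement in `mapping` in one pass over `text`."""
    if not mapping:
        return text
    # Longer keys first, so "catalog" is preferred over "cat" in the alternation.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # Each match is looked up once, so replaced text is never scanned again.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

print(replace_all("I have a dog but not a cat.", {"dog": "cat", "cat": "dog"}))
# I have a cat but not a dog.
```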
string.replace('dog', '#temp').replace('cat', 'dog').replace('#temp', 'cat')
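The simplest way is to use the count argument of str.replace (this only works here because each word occurs exactly once and dog comes before cat):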
words = 'I have a dog but not a cat'
words = words.replace('cat', 'dog').replace('dog', 'cat', 1)
print(words)
# I have a cat but not a dog
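To see how much the count-based trick depends on word order, here is a small sketch. The helper name swap_with_count is made up for illustration; it simply wraps the two chained replace calls from above.

```python
def swap_with_count(text):
    # Turn every "cat" into "dog", then turn only the first "dog" back into "cat".
    return text.replace("cat", "dog").replace("dog", "cat", 1)

# "dog" comes first: the words are swapped as intended.
print(swap_with_count("I have a dog but not a cat"))
# "cat" comes first: the sentence comes back unchanged.
print(swap_with_count("I have a cat but not a dog"))
```

With many replacements or repeated words, a single-pass re.sub with a lookup is probably the safer choice.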

Trying to remove symbol (" - ") with whitespace while keeping symbol ("-") without whitespace

I have a txt file I open in Python. And I'm trying to remove the symbols and order the remaining words alphabetically. Removing the periods, the commas etc. isn't a problem. However, I can't seem to remove the dash symbol with whitespaces when I add it to a list together with the rest of the symbols.
This is an example of what I open:
content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
This is what I want (periods removed, and dash symbols which aren't attached to a word removed):
content = "The quick brown fox who was hungry jumps over the 7-year old lazy dog"
But I either get this (all dash symbols removed):
content = "The quick brown fox who was hungry jumps over the 7year old lazy dog"
Or this (dash symbol unremoved):
content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog"
This is my entire code. Adding a content.replace() works. But that isn't what I want:
f = open("article.txt", "r")
# Create variable (Like this removing " - " works)
content = f.read()
content = content.replace(" - ", " ")
# Create list
wordlist = content.split()
# Which symbols (If I remove the line "content = content.replace(" - ", " ")", the " - " in this list doesn't get removed here)
chars = [",", ".", "'", "(", ")", "‘", "’", " - "]
# Remove symbols
words = []
for element in wordlist:
    temp = ""
    for ch in element:
        if ch not in chars:
            temp += ch
    words.append(temp)
# Print words, sort alphabetically and do not print duplicates
for word in sorted(set(words)):
    print(word)
It works like this. But when I remove the content = content.replace(" - ", " ") line, the "whitespace + dash symbol + whitespace" entry in chars doesn't get removed.
And if I replace it with "-" (no whitespace), I get this, which I don't want:
content = "The quick brown fox who was hungry jumps over the 7year old lazy dog"
Is it possible at all to do this with a list like chars, or is my only option to use .replace()?
And is there a particular reason why Python orders capitalized words alphabetically first, and uncapitalized words later separately?
Like this (The letters ABC are just added to emphasize what I'm trying to say):
7-year
A
B
C
The
brown
dog
fox
hungry
jumps
lazy
old
over
quick
the
was
who
You can use re.sub like this:
>>> import re
>>> strip_chars = re.compile(r"(?:[,.'()‘’])|(?:[-,]\s)")
>>> content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
>>> strip_chars.sub("", content)
'The quick brown fox who was hungry jumps over the 7-year old lazy dog'
>>> strip_chars.sub("", content).split()
['The', 'quick', 'brown', 'fox', 'who', 'was', 'hungry', 'jumps', 'over', 'the', '7-year', 'old', 'lazy', 'dog']
>>> print(*sorted(strip_chars.sub("", content).split()), sep='\n')
7-year
The
brown
dog
fox
hungry
jumps
lazy
old
over
quick
the
was
who
Summarizing my comments and putting it all together:
from pathlib import Path
from collections import Counter
import re
strip_chars = re.compile(r"(?:[,.'()‘’])|(?:[-,]\s)")
article = Path('/path/to/your/article.txt')
content = article.read_text()
words = Counter(strip_chars.sub('', content).split())
for word in sorted(words, key=lambda x: x.lower()):
    print(word)
If The and the, for example, count as duplicate words then you just need to convert content to lower case letters. The code would be this one instead:
from pathlib import Path
from collections import Counter
import re
strip_chars = re.compile(r"(?:[,.'()‘’])|(?:[-,]\s)")
article = Path('/path/to/your/article.txt')
content = article.read_text().lower()
words = Counter(strip_chars.sub('', content).split())
for word in sorted(words):
    print(word)
Finally, as a good side effect of using collections.Counter, you also get a words counter in words and you can answer questions like "what are the top ten most common words?" with something like:
words.most_common(10)
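On the ordering question: sorted() compares strings by Unicode code point, and the uppercase ASCII letters have lower code points than the lowercase ones, so capitalized words end up in front. A small sketch with made-up sample words; using str.casefold as the key is one way to get a case-insensitive order, not the only one.

```python
words = ["the", "The", "brown", "7-year", "Dog"]

# Default order: digits, then uppercase letters, then lowercase letters.
by_code_point = sorted(words)
print(by_code_point)     # ['7-year', 'Dog', 'The', 'brown', 'the']

# Compare the case-folded form instead; the original spelling is kept.
case_insensitive = sorted(words, key=str.casefold)
print(case_insensitive)  # ['7-year', 'brown', 'Dog', 'the', 'The']

print(ord("T"), ord("b"))  # 84 98
```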
After
wordlist = content.split()
your list no longer contains anything with leading or trailing whitespace.
str.split()
with no arguments treats runs of consecutive whitespace as a single separator, so there is no ' - ' in your split list.
Docs: https://docs.python.org/3/library/stdtypes.html#str.split
str.split(sep=None, maxsplit=-1)
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
Replacing ' - ' seems right - the other way to keep close to your code would be to remove exactly '-' from your split list:
content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
wordlist = content.split()
print(wordlist)
chars = [",", ".", "'", "(", ")", "‘", "’"] # modified
words = []
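for element in wordlist:
    if element == '-':  # skip a standalone dash
        continue
    temp = ""
    for ch in element:  # drop the characters listed in chars
        if ch not in chars:
            temp += ch
    words.append(temp)
# Print the unique words in sorted order
for word in sorted(set(words)):
    print(word)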
Output:
['The', 'quick', 'brown', 'fox', '-', 'who', 'was', 'hungry', '-',
'jumps', 'over', 'the', '7-year', 'old', 'lazy', 'dog.']
7-year
The
brown
dog
fox
hungry
jumps
lazy
old
over
quick
the
was
who
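The same "split first, then clean each token" approach can be sketched with str.translate instead of the inner character loop. The function name clean_words and its chars parameter are hypothetical; the character set is the one from the question, without the " - " entry.

```python
def clean_words(content, chars=",.'()‘’"):
    """Split on whitespace, drop standalone dashes, strip the given characters."""
    # A translation table that deletes every character in `chars`.
    table = str.maketrans("", "", chars)
    words = []
    for token in content.split():
        if token == "-":          # a dash surrounded by whitespace
            continue
        cleaned = token.translate(table)
        if cleaned:               # ignore tokens made only of removed characters
            words.append(cleaned)
    return words

content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
for word in sorted(set(clean_words(content)), key=str.lower):
    print(word)
```

The dash inside "7-year" survives because only tokens that are exactly "-" are skipped.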

If a certain word is not before the search word then add to list python

I would like the program to detect whether a certain word is before the search word and if it is not to add it to a list.
This is what I have come up with myself:
sentence = "today i will take my dog for a walk, tomorrow i will not take my dog for a walk"
all = ["take", "take"]
all2= [w for w in all if not(re.search(r'not' + w + r'\b', sentence))]
print(all2)
The expected output is ["take"], but the list stays the same: ["take", "take"].
Here is how it should be formulated: collect every occurrence of the word take that is not preceded by the word not:
import re
sentence = "today i will take my dog for a walk, tomorrow i will not take my dog for a walk"
search_word = 'take'
all_takes_without_not = re.findall(fr'(?<!\bnot)\s+({search_word})\b', sentence)
print(all_takes_without_not)
The output:
['take']
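A hedged sketch of the lookbehind idea as a function, with a few edge cases. The name occurrences_not_after and its parameters are hypothetical. The extra (?<!\s) is my own addition, not part of the answer: it forces \s+ to start at the first whitespace character, so that two spaces after "not" cannot slip past the lookbehind. A search word at the very start of the sentence is still not found, because the pattern requires leading whitespace.

```python
import re

def occurrences_not_after(sentence, word, blocker):
    """Return occurrences of `word` that are not directly preceded by `blocker`."""
    pattern = re.compile(
        r"(?<!\b{})(?<!\s)\s+({})\b".format(re.escape(blocker), re.escape(word))
    )
    return pattern.findall(sentence)

sentence = "today i will take my dog for a walk, tomorrow i will not take my dog for a walk"
print(occurrences_not_after(sentence, "take", "not"))         # ['take']
print(occurrences_not_after("do not   take", "take", "not"))  # []
print(occurrences_not_after("a knot take", "take", "not"))    # ['take']
```

The last call shows what \b is for: "knot" ends in "not" but is a different word, so it does not block the match.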
It may be simpler to first convert your sentence to a list of words.
from itertools import chain
sentence = "today i will take my dog for a walk, tomorrow i will not take my dog for a walk"
# Get individual words from the string
words = sentence.split()
# Create an iterator which yields the previous word at each position
previous = chain([None], words)
output = [word for prev, word in zip(previous, words) if word == 'take' and prev != 'not']
print(output)  # ['take']

Add quotes to a list of words in a sentence in python using regular expressions

I have a list of words like:
["apple", "orange", "plum"]
I would like to add quotes only to these words in a string :
Rita has apple ----> Rita has "apple"
Sita has "apple" and plum ----> Sita has "apple" and "plum"
How can I achieve this in python using regular expression?
You can use re.sub with an alternation pattern created by joining the words in the list. Enclose the alternation pattern in word boundary assertions \b so that it would only match whole words. Use negative lookbehind and lookahead to avoid matching words already enclosed in double quotes:
import re
words = ["apple", "orange", "plum"]
s = 'Sita has apple and "plum" and loves drinking snapple'
print(re.sub(r'\b(?<!")(%s)(?!")\b' % '|'.join(words), r'"\1"', s))
This outputs:
Sita has "apple" and "plum" and loves drinking snapple
Demo: https://ideone.com/Tf9Aka
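The description above (word boundaries plus a negative lookbehind and lookahead for the double quote) can be sketched as a small function. The name quote_words is hypothetical, and escaping the words with re.escape is an assumption on my part for lists that may contain regex metacharacters.

```python
import re

def quote_words(text, words):
    """Wrap whole-word occurrences of `words` in double quotes, skipping quoted ones."""
    alternation = "|".join(re.escape(w) for w in words)
    # (?<!") and (?!") skip a word that already has a quote on either side.
    pattern = re.compile(r'(?<!")\b({})\b(?!")'.format(alternation))
    return pattern.sub(r'"\1"', text)

words = ["apple", "orange", "plum"]
print(quote_words('Rita has apple', words))             # Rita has "apple"
print(quote_words('Sita has "apple" and plum', words))  # Sita has "apple" and "plum"
```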
re.sub can handle this for you nicely
import re
mystr = "Rita has apple"
mylist = ["apple", "orange", "plum"]
for item in mylist:
    # \b limits the match to whole words, so "snapple" is left alone
    mystr = re.sub(r'\b%s\b' % re.escape(item), '"%s"' % item, mystr)
print(mystr)
Solution without using regex:
txt = "Sita has apple and plum"
words = ["apple", "orange", "plum"]
txt = " ".join(["\""+w+"\"" if w in words else w for w in txt.split()])
print (txt)
txt = "Rita drinks apple flavored snapple?"
txt = " ".join(["\""+w+"\"" if w in words else w for w in txt.split()])
print (txt)

String comparison in python words ending with

I have a list of sentences as follows:
['Hey, how are you?\n','My name is Mathews.\n','I hate vegetables\n','French fries came out soggy\n']
In the above sentences I need to identify all sentences ending with ?, . or 'gy' and print the final word.
My approach is as follows:
# words will contain the list I have pasted above.
word = [w for w in words if re.search('(?|.|gy)$', w)]
for i in word:
    print(i)
The result I get is:
Hey, how are you?
My name is Mathews.
I hate vegetables
French fries came out soggy
The expected result is:
you?
Mathews.
soggy
Use the endswith() method.
>>> for line in testList:
...     for word in line.split():
...         if word.endswith(('?', '.', 'gy')):
...             print(word)
Output:
you?
Mathews.
soggy
Use endswith with a tuple.
lines = ['Hey, how are you?\n','My name is Mathews.\n','I hate vegetables\n','French fries came out soggy\n']
for line in lines:
    for word in line.split():
        if word.endswith(('?', '.', 'gy')):
            print(word)
Regular expression alternative:
import re
lines = ['Hey, how are you?\n','My name is Mathews.\n','I hate vegetables\n','French fries came out soggy\n']
for line in lines:
    for word in re.findall(r'\w+(?:\?|\.|gy\b)', line):
        print(word)
You were close.
You just need to escape the special characters (? and .) in the pattern:
re.search(r'(\?|\.|gy)$', w)
More details in the documentation.
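The question asks for the final word of each sentence that ends with one of the suffixes, while the word-by-word checks above test every word. A minimal sketch that checks only the end of each sentence; the function name final_words and its suffixes parameter are hypothetical.

```python
def final_words(lines, suffixes=("?", ".", "gy")):
    """Return the last word of every line that ends with one of `suffixes`."""
    result = []
    for line in lines:
        stripped = line.rstrip()          # drop the trailing newline first
        if stripped.endswith(suffixes):   # endswith accepts a tuple of suffixes
            result.append(stripped.split()[-1])
    return result

lines = ['Hey, how are you?\n', 'My name is Mathews.\n',
         'I hate vegetables\n', 'French fries came out soggy\n']
for word in final_words(lines):
    print(word)
```

For the sample data this gives the same three words as the answers above; the two approaches would differ for a sentence that has a matching word in the middle.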
