Suppose that I have the following sentence:
bean likes to sell his beans
and I want to replace all occurrences of specific words with other words. For example, bean to robert and beans to cars.
I can't just use str.replace, because in this case it'll also change "beans" to "roberts".
>>> "bean likes to sell his beans".replace("bean","robert")
'robert likes to sell his roberts'
I need to replace whole words only, not occurrences of a word inside another word. I think I can achieve this using regular expressions, but I don't know how to do it right.
If you use regex, you can specify word boundaries with \b:
import re
sentence = 'bean likes to sell his beans'
sentence = re.sub(r'\bbean\b', 'robert', sentence)
# 'robert likes to sell his beans'
Here 'beans' is not changed (to 'roberts') because the 's' on the end is not a boundary between words: \b matches the empty string, but only at the beginning or end of a word.
The second replacement for completeness:
sentence = re.sub(r'\bbeans\b', 'cars', sentence)
# 'robert likes to sell his cars'
If you replace each word one at a time, you might replace words several times (and not get what you want). To avoid this, you can use a function or lambda:
d = {'bean':'robert', 'beans':'cars'}
str_in = 'bean likes to sell his beans'
str_out = re.sub(r'\b(\w+)\b', lambda m:d.get(m.group(1), m.group(1)), str_in)
That way, once bean is replaced by robert, it won't be modified again (even if robert is also in your input list of words).
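For instance, a minimal sketch with a hypothetical mapping in which 'robert' is itself a key (the 'james' value is made up for illustration): each word is looked up exactly once, so the 'robert' produced from 'bean' is not rewritten again.
import re
d = {'bean': 'robert', 'robert': 'james'}  # hypothetical mapping
str_in = 'bean likes robert'
str_out = re.sub(r'\b(\w+)\b', lambda m: d.get(m.group(1), m.group(1)), str_in)
print(str_out)  # robert likes james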
As suggested by georg, I edited this answer with dict.get(key, default_value).
Alternative solution (also suggested by georg):
str_out = re.sub(r'\b(%s)\b' % '|'.join(d.keys()), lambda m:d.get(m.group(1), m.group(1)), str_in)
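One hedged caveat with this alternative: if your keys can contain regex metacharacters, escape them with re.escape; sorting longer keys first also keeps the alternation safe should you ever drop the trailing \b. A sketch:
import re
d = {'bean': 'robert', 'beans': 'cars'}
str_in = 'bean likes to sell his beans'
keys = sorted(d, key=len, reverse=True)  # try 'beans' before 'bean'
pattern = r'\b(%s)\b' % '|'.join(map(re.escape, keys))
str_out = re.sub(pattern, lambda m: d[m.group(1)], str_in)
print(str_out)  # robert likes to sell his cars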
This is a dirty way to do it, using folds:
import re
from functools import reduce  # reduce needs an import in Python 3

reduce(lambda x, y: re.sub(r'\b(' + y[0] + r')\b', y[1], x),
       [("bean", "robert"), ("beans", "cars")],
       "bean likes to sell his beans")
# 'robert likes to sell his cars'
"bean likes to sell his beans".replace("beans", "cars").replace("bean", "robert")
This will replace all instances of "beans" with "cars" and "bean" with "robert". It works because .replace() returns a modified copy of the original string, so you can think of it in stages. It essentially works this way:
>>> first_string = "bean likes to sell his beans"
>>> second_string = first_string.replace("beans", "cars")
>>> third_string = second_string.replace("bean", "robert")
>>> first_string
'bean likes to sell his beans'
>>> second_string
'bean likes to sell his cars'
>>> third_string
'robert likes to sell his cars'
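Note that the order of the chained calls matters: replacing "bean" first would also hit the "bean" inside "beans" and reintroduce the original problem:
>>> "bean likes to sell his beans".replace("bean", "robert").replace("beans", "cars")
'robert likes to sell his roberts'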
I need to filter out everything that is not a letter or digit, using .isalpha() and .isdigit(). Currently I am doing it with a function definition and .replace(). This is my program:
def remove_special_chars(string):
    output = []
    for c in string:
        if c.isalpha() or c.isdigit():
            output.append(c)
    return ''.join(output)
You can easily achieve that in the following way:
allowed_chars = set('0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')

def remove_special_chars(validation_string):
    if set(validation_string).issubset(allowed_chars):
        return validation_string  # nothing to remove from this string
    return ''.join(c for c in validation_string if c in allowed_chars)
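For example, with the invalid string from above:
>>> remove_special_chars('inv#lid')
'invlid'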
Use regular expression split, together with filter, isdigit and isalpha:
import re
sentence = "1 I have bought several of the Vitality canned dog food products and have found them all to be of good quality. huh? The product looks more like a stew than a processed meat and it smells better-looks better. My Labrador is finicky-pampered and she appreciates this product better than most."
sentence = sentence.lower()
# inside the character class the '-' acts as a range marker, so hyphenated
# tokens like 'better-looks' stay whole and are later dropped by isalpha()
words = re.split(r"[, |-|\?|\!|\.]", sentence)
words = filter(lambda w: (w.isdigit() or w.isalpha()) and len(w) > 0, words)
print(*words)
output:
1 i have bought several of the vitality canned dog food products and have found them all to be of good quality huh the product looks more like a stew than a processed meat and it smells better my labrador is and she appreciates this product better than most
Hi, I have a large document saved as a sentence, and a list of proper names that might be in the document. I would like to replace instances of those names with the tag [PERSON].
ex: sentence = "John and Marie went to school today....."
list = ["Maria", "John"....]
result = [PERSON] and [PERSON] went to school today
As you can see, there might be variations of a name that I still want to catch, like Maria and Marie, since they are spelled differently but are close.
I know I can use a loop, but since the list and the sentence are large, there might be a more efficient way to do this. Thanks
Use fuzzywuzzy to check whether each word in the sentence closely matches a name (with a match percentage above 80%), and if so, replace it with [PERSON]:
>>> from fuzzywuzzy import process, fuzz
>>> names = ["Maria", "John"]
>>> sentence = "John and Marie went to school today....."
>>>
>>> match = lambda word: process.extractOne(word, names, scorer=fuzz.ratio, score_cutoff=80)
>>> ' '.join('[PERSON]' if match(word) else word for word in sentence.split())
'[PERSON] and [PERSON] went to school today.....'
You can use regex inside your input list to match words with spelling variations. For example, if you need to match Marie and Maria, you can use Mari(e|a) as the regex. Here is the corresponding code you can use:
import re
mySentence = "John and Marie and Maria went to school today....."
myList = ["Mari(e|a)", "John"]
myNewSentence = re.compile("|".join(myList)).sub('[PERSON]', mySentence)
print(myNewSentence) # [PERSON] and [PERSON] and [PERSON] went to school today.....
I'm building a reddit bot for practice that converts US dollars into other commonly used currencies, and I've managed to get the conversion part working fine, but now I'm a bit stuck trying to pass the characters that directly follow a dollar sign to the converter.
This is sort of how I want it to work:
def run_bot():
    subreddit = r.get_subreddit("randomsubreddit")
    comments = subreddit.get_comments(limit=25)
    for comment in comments:
        comment_text = comment.body
        # If comment contains a string that starts with '$',
        # pass the rest of the 'word' to a variable
So for example, if it were going over a comment like this:
"I bought a boat for $5000 and it's awesome"
It would assign '5000' to a variable that I would then put through my converter
What would be the best way to do this?
(Hopefully that's enough information to go off, but if people are confused I'll add more)
You could use the re.findall function:
>>> import re
>>> re.findall(r'\$(\d+)', "I bought a boat for $5000 and it's awesome")
['5000']
>>> re.findall(r'\$(\d+(?:\.\d+)?)', "I bought two boats for $5000 $5000.45")
['5000', '5000.45']
OR
>>> s = "I bought a boat for $5000 and it's awesome"
>>> [i[1:] for i in s.split() if i.startswith('$')]
['5000']
If you are dealing with prices as float numbers, you can use this:
import re
s = "I bought a boat for $5000 and it's awesome"
matches = re.findall(r"\$(\d*\.\d+|\d+)", s)
print(matches)  # ['5000']
s2 = "I bought a boat for $5000.52 and it's awesome"
matches = re.findall(r"\$(\d*\.\d+|\d+)", s2)
print(matches)  # ['5000.52']
I am a beginner, been learning python for a few months as my very first programming language. I am looking to find a pattern from a text file. My first attempt has been using regex, which does work but has a limitation:
import re
noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
CC_list = ['and', 'or']
noun_list_pattern1 = r'\b\w+\b,\s\b\w+\b,\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b,\sor\s\b\w+\b|\b\w+\b,\s\b\w+\b\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b\sor\s\b\w+\b'
with open('test_sentence.txt', 'r') as input_f:
    read_input = input_f.read()
    word = re.findall(noun_list_pattern1, read_input)
    for w in word:
        print(w)
So at this point you may be asking why the lists are in this code, since they are not being used. Well, I have been racking my brains out, trying all sorts of for loops and if statements in functions, to try and find a way to replicate the regex pattern, but using the lists.
The limitation of the regex is that the \b\w+\b code, which appears a number of times in noun_list_pattern1, actually finds any words at all, not specific nouns. This could raise false positives. I want to narrow things down more by using the elements in the lists above instead of the regex.
Since I actually have 4 different regexes in the pattern (separated by |), I will just go with one of them here. So I would need to find a pattern such as:
'noun in noun_list' + ', ' + 'noun in noun_list' + ', ' + 'C in CC_list' + ' ' + 'noun in noun_list
Obviously, the quoted line above is not real Python code, but an expression of my thoughts about the match needed. Where I say noun in noun_list I mean an iteration through noun_list; C in CC_list is an iteration through CC_list; the ', ' is a literal string match for a comma and whitespace.
Hopefully I have made myself clear!
Here is the content of the test_sentence.txt file that I am using:
I need to buy are bacon, cheese and eggs.
I also need to buy milk, cheese, and bacon.
What's your favorite: milk, cheese or eggs.
What's my favorite: milk, bacon, or eggs.
Break your problem down a little. First, you need a pattern that will match the words from your list, but no others. You can accomplish that with the alternation operator | and the literal words. red|green|blue, for example, will match "red", "green", or "blue", but not "purple". Join the noun list with that character, and add the word-boundary metacharacters along with parentheses to group the alternations:
noun_patt = r'\b(' + '|'.join(nouns) + r')\b'
Do the same for your list of conjunctions:
conj_patt = r'\b(' + '|'.join(conjunctions) + r')\b'
The overall match you want to make is "one or more noun_patt match, each optionally followed by a comma, followed by a match for the conj_patt and then one more noun_patt match". Easy enough for a regex:
patt = r'({0},? )+{1} {0}'.format(noun_patt, conj_patt)
You don't really want to use re.findall(), but re.search(), since you're only expecting one match per line:
>>> for line in lines:
...     print(re.search(patt, line).group(0))
...
bacon, cheese and eggs
milk, cheese, and bacon
milk, cheese or eggs
milk, bacon, or eggs
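For reference, a runnable sketch assembling the pieces above, reading test_sentence.txt as in the question:
import re

nouns = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']

noun_patt = r'\b(' + '|'.join(nouns) + r')\b'
conj_patt = r'\b(' + '|'.join(conjunctions) + r')\b'
patt = r'({0},? )+{1} {0}'.format(noun_patt, conj_patt)

with open('test_sentence.txt') as input_f:
    for line in input_f:
        match = re.search(patt, line)
        if match:
            print(match.group(0))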
As a note, you're close to, if not rubbing up against, the limits of regular expressions as far as parsing English. Any more complex than this, and you will want to look into actual parsing, perhaps with NLTK.
In actuality, you don't necessarily need regular expressions, as there are a number of ways to do this using just your original lists.
noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']
# This assumes the file has been read into a list of newline-delimited lines called `rawlines`
for line in rawlines:
    matches = [noun for noun in noun_list if noun in line] + [conj for conj in conjunctions if conj in line]
    if len(matches) == 4:
        for match in matches:
            print(match)
The reason the match number is 4 is that a correct line contains exactly four matches: three nouns and one conjunction. (Note that repeated nouns or conjunctions could also produce this count.)
EDIT:
This version prints the lines that are matched and the words matched. It also fixes the possible multiple-word-match problem:
words_matched = []
matching_lines = []
for l in lst:
    matches = [noun for noun in noun_list if noun in l] + [conj for conj in conjunctions if conj in l]
    invalid = True
    valid_count = 0
    for match in matches:
        if matches.count(match) == 1:
            valid_count += 1
    if valid_count == len(matches):
        invalid = False
    if not invalid:
        words_matched.append(matches)
        matching_lines.append(l)

for line, matches in zip(matching_lines, words_matched):
    print(line, matches)
However, if this doesn't suit you, you can always build the regex as follows (using the itertools module):
import itertools

# The number of permutation choices is 3 (as revealed by your examples)
for nouns, conj in itertools.product(itertools.permutations(noun_list, 3), conjunctions):
    matches = [noun for noun in nouns]
    matches.append(conj)
    # matches[:2] is the sublist of the first 2 items, -1 is the last element, and
    # matches[2:-1] is the element before the last (with more than 3 nouns, this
    # would be the elements between the 2nd and the last)
    regex_string = r'\s,\s'.join(matches[:2]) + r'\s' + matches[-1] + r'\s' + r'\s,\s'.join(matches[2:-1])
    print(regex_string)
    # ... do regex-related matching here
The caveat of this method is that it is pure brute force, as it generates all the possible combinations (read: permutations) of both lists, which can then be tested to see whether each line matches. Hence, it is horrendously slow, but for examples like the ones given (with no comma before the conjunction), this will generate exact matches perfectly.
Adapt as required.
Example:
names = ['James John', 'Robert David', 'Paul' ... the list has 5K items]
text1 = 'I saw James today'
text2 = 'I saw James John today'
text3 = 'I met Paul'
is_name_in_text(text1, names) # this returns False, 'James' is not in the list
is_name_in_text(text2, names) # this returns 'James John'
is_name_in_text(text3, names) # this returns 'Paul'
is_name_in_text() searches if any of the name list is in text.
The easy way to do it is to just check whether the name is in the text by using the in operator, but the list has 5,000 items, so it is not efficient. I could just split the text into words and check whether each word is in the list, but that is not going to work when a name is more than one word. Line number 7 will fail in this case.
Make names into a set and use the in-operator for fast O(1) lookup.
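A minimal sketch of that idea; the candidate extraction (one- and two-word windows) is an assumption, since this answer itself only addresses the lookup:
names = set(['James John', 'Robert David', 'Paul'])  # O(1) membership tests

def is_name_in_text(text, names):
    words = text.split()
    # candidate phrases: single words plus two-word windows for full names
    candidates = words + [' '.join(pair) for pair in zip(words, words[1:])]
    for candidate in candidates:
        if candidate in names:
            return candidate
    return False

print(is_name_in_text('I saw James John today', names))  # James John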
You can use a regex to parse out the possible names in a sentence:
>>> import re
>>> findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*)?)')
>>> def is_name_in_text(text, names):
...     for possible_name in set(findnames.findall(text)):
...         if possible_name in names:
...             return possible_name
...     return False
>>> names = set(['James John', 'Robert David', 'Paul'])
>>> is_name_in_text('I saw James today', names)
False
>>> is_name_in_text('I saw James John today', names)
'James John'
>>> is_name_in_text('I met Paul', names)
'Paul'
Build a regular expression with all the alternatives. This way you don't have to worry about somehow pulling the names out of the phrases beforehand.
import re
names_re = re.compile(r'\b' +
r'\b|\b'.join(re.escape(name) for name in names) +
r'\b')
print(names_re.search('I saw James today'))
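Assuming names = ['James John', 'Robert David', 'Paul'] as in the question, a miss yields None while a hit yields a match object:
>>> names_re.search('I saw James today')  # no match, evaluates to None
>>> names_re.search('I saw James John today').group(0)
'James John'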
You may use Python's set in order to get good performance while using the in operator.
If you have a mechanism of pulling the names out of the phrases and don't need to worry about partial matches (the full name will always be in the string), you can use a set rather than a list.
Your code is exactly the same, with this addition at line 2:
names = set(names)
The in operation will now function much faster.
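A minimal sketch of the point:
names = set(['James John', 'Robert David', 'Paul'])  # or: names = set(names)
'James John' in names  # True -- average O(1) hash lookup
'James' in names       # False -- only exact, full names are members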