python regex selecting whole words - python

I am writing a script that introduces misspellings into a sentence. I am using Python's re module to replace the original word with the misspelling. The script looks like this:
# replacing original word by error
pattern = re.compile(r'%s' % original_word)
replace_by = r'\1' + err
modified_sentence = re.sub(pattern, replace_by, sentence, count=1)
The problem is that this replaces the match even when original_word is part of another word. For example, if I had:
original_word = 'in'
err = 'il'
sentence = 'eating food in'
it would replace the occurrence of 'in' inside 'eating', giving:
> 'eatilg food in'
I checked the re documentation, but it doesn't give any example of how to include regex options. For example, if my pattern is:
regex_pattern = '\b%s\b' % original_word
this should solve the problem, since \b represents a word boundary. But it doesn't seem to work.
I tried to work around it by doing:
pattern = re.compile(r'([^\w])%s' % original_word)
but that does not work either. For example:
original_word = 'to'
err = 'vo'
sentence = 'I will go tomorrow to the'
it changes the sentence to:
> I will go vomorrow to the
Thank you, any help appreciated

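See the re module documentation for an example of word boundaries (\b). You were close; you just need to put it all together. Note that \b only means "word boundary" when the pattern is a raw string (r'\b...'); in a normal string literal '\b' is a backspace character, which is most likely why your first attempt did not match. The following script gives the output you want: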
import re
original_word = 'to'
err = 'vo'
sentence = 'I will go tomorrow to the'
pattern = re.compile(r'\b%s\b' % re.escape(original_word))
modified_sentence = re.sub(pattern, err, sentence, count=1)
print(modified_sentence)
Output: I will go tomorrow vo the
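A minimal sketch of the same idea wrapped in a function, in case it is easier to reuse. The name misspell is made up for illustration; the only thing added to the answer above is a function replacement, so that backslashes in err are inserted literally instead of being read as group references.

```python
import re

def misspell(sentence, original_word, err):
    """Replace the first whole-word occurrence of original_word with err."""
    # re.escape keeps characters such as '.' or '+' in the word literal;
    # the raw string r"\b" is a word boundary (a plain "\b" is a backspace).
    pattern = re.compile(r"\b" + re.escape(original_word) + r"\b")
    # A function replacement inserts err verbatim, so backslashes in err
    # are not interpreted as group references.
    return pattern.sub(lambda match: err, sentence, count=1)

print(misspell("I will go tomorrow to the", "to", "vo"))  # I will go tomorrow vo the
print(misspell("eating food in", "in", "il"))             # eating food il
```

Note that \b follows re's own definition of a word, so a word that starts or ends with punctuation would need a different boundary check.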

Related

Find and remove slightly different substring on string

I want to find out whether a substring is contained in a string and remove it without touching the rest of the string. The catch is that the substring I search with is not exactly what the string contains. In particular, the problem comes from Spanish accented vowels combined with an uppercase substring, for example:
myString = "I'm júst a tésting stríng"
substring = 'TESTING'
Perform something to obtain:
resultingString = "I'm júst a stríng"
I've read that the difflib library can compare two strings and weigh their similarity somehow, but I'm not sure how to apply it to my case (not to mention that I failed to install this lib).
Thanks!
This normalize() method might be a little overkill, and maybe the code from @Harpe at https://stackoverflow.com/a/71591988/218663 works fine.
Here I am going to break the original string into "words" and then join all the non-matching words back into a string:
import unicodedata
def normalize(text):
    return unicodedata.normalize("NFD", text).encode('ascii', 'ignore').decode('utf-8').lower()
myString = "I'm júst a tésting stríng"
substring = "TESTING"
newString = " ".join(word for word in myString.split(" ") if normalize(word) != normalize(substring))
print(newString)
giving you:
I'm júst a stríng
If your "substring" could be multi-word I might think about switching strategies to a regex:
import re
import unicodedata
def normalize(text):
    return unicodedata.normalize("NFD", text).encode('ascii', 'ignore').decode('utf-8').lower()
myString = "I'm júst á tésting stríng"
substring = "A TESTING"
match = re.search(f"\\s{ normalize(substring) }\\s", normalize(myString))
if match:
    found_at = match.span()
    first_part = myString[:found_at[0]]
    second_part = myString[found_at[1]:]
    print(f"{first_part} {second_part}".strip())
I think that will give you:
I'm júst stríng
You can use the unicodedata module (part of the standard library) to normalize accented letters to plain ASCII letters like so:
import unicodedata
output = unicodedata.normalize('NFD', "I'm júst a tésting stríng").encode('ascii', 'ignore')
print(str(output))
which will give
b"I'm just a testing string"
You can then compare this with your input
"TESTING".lower() in str(output).lower()
which should return True.
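For what it's worth, here is a sketch that combines the ideas above: it searches on an accent-stripped, lowercased copy and maps the match back to positions in the original string, so it does not depend on the two strings having the same length. The helper names fold and remove_phrase are made up for this example, and collapsing repeated whitespace at the end is my own assumption about the desired output.

```python
import re
import unicodedata

def fold(text):
    """Return (accent-free lowercase copy, map from copy index to text index)."""
    folded_chars = []
    index_map = []
    for i, ch in enumerate(text):
        for c in unicodedata.normalize("NFD", ch):
            if unicodedata.combining(c):
                continue  # drop accents and other combining marks
            for lowered in c.lower():
                folded_chars.append(lowered)
                index_map.append(i)
    return "".join(folded_chars), index_map

def remove_phrase(text, phrase):
    """Remove the first accent- and case-insensitive occurrence of phrase."""
    folded, index_map = fold(text)
    target, _ = fold(phrase)
    if not target:
        return text
    # Lookarounds instead of \s so a match at the start or end also counts.
    match = re.search(r"(?<!\w)" + re.escape(target) + r"(?!\w)", folded)
    if not match:
        return text
    start = index_map[match.start()]
    end = index_map[match.end() - 1] + 1
    # Also swallow combining marks that trail the last matched character.
    while end < len(text) and unicodedata.combining(text[end]):
        end += 1
    # Assumption: collapsing runs of whitespace in the result is acceptable.
    return " ".join((text[:start] + text[end:]).split())

print(remove_phrase("I'm júst a tésting stríng", "TESTING"))    # I'm júst a stríng
print(remove_phrase("I'm júst á tésting stríng", "A TESTING"))  # I'm júst stríng
```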

Python Regex Replaces All Matches

I have a string such as "Hey people #Greetings how are we? #Awesome" and every time there is a hashtag I need to replace the word with another string.
I have the following code, which works when there is only one hashtag. The problem is that because re.sub replaces all instances, every hashtag ends up overwritten with the last replacement string.
match = re.findall(tagRE, content)
print(match)
for matches in match:
    print(matches)
    newCode = "The result is: " + matches + " is it correct?"
    match = re.sub(tagRE, newCode, content)
What should I be doing instead to replace just the current match? Is there a way of using re.finditer to replace the current match or another way?
Peter's method would work. You could also just supply the matched text as the regex pattern, so that only that specific match is replaced (wrap it in re.escape() if it may contain regex metacharacters). Like so:
newCode = "whatever" + matches + "whatever"
content = re.sub(matches, newCode, content)
I ran some sample code and this was the output.
import re
content = "This is a #wonderful experiment. It's #awesome!"
matches = re.findall(r'#\w+', content)
print(matches)
for match in matches:
    newCode = match[1:]
    print(content)
    content = re.sub(match, newCode, content)
    print(content)
#['#wonderful', '#awesome']
#This is a #wonderful experiment. It's #awesome!
#This is a wonderful experiment. It's #awesome!
#This is a wonderful experiment. It's #awesome!
#This is a wonderful experiment. It's awesome!
You can try it like this:
In [1]: import re
In [2]: s = "Hey people #Greetings how are we? #Awesome"
In [3]: re.sub(r'(?:^|\s)(\#\w+)', ' replace_with_new_string', s)
Out[3]: 'Hey people replace_with_new_string how are we? replace_with_new_string'
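Another option, sketched here under the assumption that tagRE is something like r"#\w+" (the question does not show it): re.sub accepts a function as the replacement, and that function is called once per match, so each hashtag gets its own replacement string in a single pass.

```python
import re

# Assumed definition; the question's tagRE is not shown.
TAG_RE = re.compile(r"#\w+")

def wrap_tag(match):
    """Build the replacement for one specific hashtag match."""
    tag = match.group(0)
    return "The result is: " + tag + " is it correct?"

content = "Hey people #Greetings how are we? #Awesome"
result = TAG_RE.sub(wrap_tag, content)
print(result)
```

Because the function receives the match object, there is no need to loop over findall results or to re-run re.sub for each tag.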

Python - remove parts of a string

I have many fill-in-the-blank sentences in strings,
e.g. "6d) We took no [pains] to hide it ."
How can I efficiently parse this string (in Python) to be
"We took no to hide it"?
I also would like to be able to store the word in brackets (e.g. "pains") in a list for use later. I think the regex module could be better than Python string operations like split().
This will give you all the words inside the brackets.
import re
s = "6d) We took no [pains] to hide it ."
matches = re.findall(r'\[(.*?)\]', s)
Then you can run this to remove all bracketed words.
re.sub(r'\[(.*?)\]', '', s)
Just for fun (to do the gathering and the substitution in one pass):
matches = []
def subber(m):
    matches.append(m.groups()[0])
    return ""
new_text = re.sub(r"\[(.*?)\]", subber, s)
print(new_text)
print(matches)
import re
s = 'this is [test] string'
m = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print(m.group(1))
Output
test
For your example you could use this regex:
(.*\))(.+)\[(.+)\](.+)
You will get four groups that you can use to build your resulting string, and you can save the third group for later use:
6d)
We took no
pains
to hide it .
I used .+ here because I don't know whether your strings always look like your example. You can change the .+ to an alphanumeric class or something more specific to your case.
import re
s = '6d) We took no [pains] to hide it .'
m = re.search(r"(.*\))(.+)\[(.+)\](.+)", s)
print(m.group(2) + m.group(4)) # " We took no  to hide it ." (groups 2 and 4 keep their surrounding spaces)
print(m.group(3)) # pains
import re
m = re.search(r".*\) (.*)\[.*\] (.*)", "6d) We took no [pains] to hide it .")
if m:
    g = m.groups()
    print(g[0] + g[1])
Output:
We took no to hide it .
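Putting the pieces above together, a rough sketch that returns both the cleaned sentence and the list of removed words. The function name split_blank and the LABEL pattern for a leading item label such as "6d)" are assumptions made for this example; the sketch keeps the trailing " ." rather than guessing how punctuation should be handled.

```python
import re

BRACKETED = re.compile(r"\[(.*?)\]")
# Assumed shape of the leading item label, e.g. "6d) ".
LABEL = re.compile(r"^\s*\w+\)\s*")

def split_blank(sentence):
    """Return (sentence without label and bracketed words, list of those words)."""
    words = BRACKETED.findall(sentence)
    stripped = BRACKETED.sub("", sentence)
    stripped = LABEL.sub("", stripped)
    # Collapse the double space left behind where the bracketed word was.
    cleaned = " ".join(stripped.split())
    return cleaned, words

text, words = split_blank("6d) We took no [pains] to hide it .")
print(text)   # We took no to hide it .
print(words)  # ['pains']
```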

Split group of special characters from string

In test.txt:
quiet confidence^_^
want:P
(:let's start
Code:
import re
file = open('test.txt').read()
for line in file.split('\n'):
    line = re.findall(r"[^\w\s$]+|[a-zA-z]+|[^\w\s$]+", line)
    print(" ".join(line))
The results were:
quiet confidence^_^
want : P
(: let ' s start
I tried to separate groups of special characters from the words, but the result is still incorrect.
Any suggestions?
Expected results:
quiet confidence ^_^
want :P
(: let's start
As @interjay said, you must define what you consider a word and what you consider "special characters". Still, I would use two separate regexes: one to find what is a word and one to find what is not.
word = re.compile(r"[a-zA-Z']+")
not_word = re.compile(r"[^a-zA-Z']+")
for line in file.split('\n'):
    matched_words = re.findall(word, line)
    non_matching_words = re.findall(not_word, line)
    print(" ".join(matched_words))
    print(" ".join(non_matching_words))
Keep in mind that runs of whitespace (\s+) will be grouped with the non-words.
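If the goal is the exact expected output from the question, one possible sketch is a single tokenizer regex that tries a short list of emoticon patterns before falling back to words and punctuation runs. The emoticon list here is an assumption that covers only the three sample lines (^_^, (: and things like :P); a real list would need to be defined by you.

```python
import re

# Assumed emoticon patterns; extend for your own data.
EMOTICONS = [
    r"\^_\^",                         # ^_^
    r"\(:",                           # (:
    r"[:;=][A-Za-z()](?![A-Za-z])",   # :P ;) =D, but not the ":P" in ":Please"
]
# Alternatives are tried left to right, so emoticons win over the generic rules.
TOKEN = re.compile("|".join(EMOTICONS) + r"|[\w']+|[^\w\s]+")

def separate(line):
    """Put a single space between words, emoticons and punctuation runs."""
    return " ".join(TOKEN.findall(line))

for sample in ["quiet confidence^_^", "want:P", "(:let's start"]:
    print(separate(sample))
```

For the three sample lines this prints "quiet confidence ^_^", "want :P" and "(: let's start".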

python regex question

What is the best way to search for matching words inside a string?
Right now I do something like the following:
if re.search('([h][e][l][l][o])',file_name_tmp, re.IGNORECASE):
Which works, but it's slow: I probably have around 100 different regex statements searching for full words, so I'd like to combine several using a | separator or something.
>>> import re
>>> words = ('hello', r'good\-bye', 'red', 'blue')
>>> pattern = re.compile('(' + '|'.join(words) + ')', re.IGNORECASE)
>>> sentence = 'SAY HeLLo TO reD, good-bye to Blue.'
>>> print(pattern.findall(sentence))
['HeLLo', 'reD', 'good-bye', 'Blue']
Can you try:
if 'hello' in longtext:
or
if 'HELLO' in longtext.upper():
to match hello/Hello/HELLO.
If you only need to check whether 'hello' appears anywhere in the string (as a substring, not necessarily as a whole word), you could also do
if 'hello' in stringToMatch:
    ... # Match found, do something
To find several strings at once, you could also use findall:
>>> toMatch = 'e3e3e3eeehellloqweweemeeeeefe'
>>> regex = re.compile("hello|me", re.IGNORECASE)
>>> print(regex.findall(toMatch))
['me']
>>> toMatch = 'e3e3e3eeehelloqweweemeeeeefe'
>>> print(regex.findall(toMatch))
['hello', 'me']
>>> toMatch = 'e3e3e3eeeHelLoqweweemeeeeefe'
>>> print(regex.findall(toMatch))
['HelLo', 'me']
You say you want to search for WORDS. What is your definition of a "word"? If you are looking for "meet", do you really want to match the "meet" in "meeting"? If not, you might like to try something like this:
>>> import re
>>> query = ("meet", "lot")
>>> text = "I'll meet a lot of friends including Charlotte at the town meeting"
>>> regex = r"\b(" + "|".join(query) + r")\b"
>>> re.findall(regex, text, re.IGNORECASE)
['meet', 'lot']
>>>
The \b at each end forces it to match only at word boundaries, using re's definition of "word" -- "isn't" isn't a word, it's two words separated by an apostrophe. If you don't like that, look at the nltk package.
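As a sketch of how the roughly 100 separate searches could be folded into one compiled pattern: escape each word, join them with |, and wrap the group in \b. The function name build_word_regex and the sample word list are made up for this example.

```python
import re

def build_word_regex(words):
    """Compile one case-insensitive regex matching any of words as a whole word."""
    # Longest first, so a longer alternative is tried before a shorter prefix.
    ordered = sorted(words, key=len, reverse=True)
    alternatives = "|".join(re.escape(word) for word in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

pattern = build_word_regex(["hello", "good-bye", "red", "blue"])
found = pattern.findall("SAY HeLLo TO reD, good-bye to Blue. Redo it.")
print(found)  # ['HeLLo', 'reD', 'good-bye', 'Blue']
```

Compiling once and reusing the pattern avoids rebuilding it on every search; "Redo" is skipped because there is no word boundary after "Red". Words that begin or end with punctuation would need something other than \b.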
