How do I replace a set of characters in a string unless they're part of a word? For example, if I have the text "ur the wurst person ur", I want to replace "ur" with "youre". So the final text would be "youre the wurst person youre". I don't want the "ur" inside of wurst to be changed because it's inside of a word. Is there a generic regex way to do this in python? I don't want to have to worry if "ur" has a space before or after, etc., only if it's part of another word. Thanks!
What I've tried so far is a simple
result = re.sub("ur", "youre", text)
but this also replaces the "ur" inside of "wurst". If I use the word boundaries as in
result = re.sub(r"\bur\b", "youre", text)
it will miss the last occurrence of "ur" in the string.
Without using regular expressions...
You could split the string at each space with string.split() and then, in a list comprehension, replace words 'ur' with 'youre'. This may look something like:
s = "ur the wurst person ur"
result = " ".join(['youre' if w == 'ur' else w for w in s.split()])
Hope this helps!
result = re.sub(r'\bur\b', r'youre', "ur the wurst person ur")
from the python documentation:
\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of Unicode alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore Unicode character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
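As a quick check of the documentation quoted above, here is a minimal sketch using the sample sentence from the question. It suggests that \b also matches at the very end of the string, so the last "ur" is replaced as well. The variable names are only illustrative.

```python
import re

text = "ur the wurst person ur"

# \b matches between a word character and a non-word character,
# and also between a word character and the start/end of the string.
result = re.sub(r"\bur\b", "youre", text)
print(result)  # youre the wurst person youre

# Punctuation next to the word still counts as a boundary.
punctuated = re.sub(r"\bur\b", "youre", "ur, (ur) wurst ur.")
print(punctuated)  # youre, (youre) wurst youre.
```

The "ur" inside "wurst" is left alone because it has word characters on both sides.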
I am learning regular expressions and have the following question.
I consulted the documentation and found this description:
\b Matches the empty string, but only at the beginning or end of a
word. A word is defined as a sequence of alphanumeric or underscore
characters, so the end of a word is indicated by whitespace or a
non-alphanumeric, non-underscore character. Note that formally, \b is
defined as the boundary between a \w and a \W character (or vice
versa), or between \w and the beginning/end of the string, so the
precise set of characters deemed to be alphanumeric depends on the
values of the UNICODE and LOCALE flags. For example, r'\bfoo\b'
matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or
'foo3'. Inside a character range, \b represents the backspace
character, for compatibility with Python’s string literals.
Code:
import re
abc="A \ncat and a rat"+ "\ncan't be friends."
print (abc)
if re.search(r'\bcat\b', abc):
    print("Found")
else:
    print("not found")
I would like to find all cases where there is either a number or whitespace before and after my string.
So '1cat4', 'cat', '1cat ', ' cat ', '(cat)' should return positive when I search for 'cat'.
How should I update my code?
Looks like you want to find any cat surrounded by non-alphabetic characters or at the beginning or end of the text:
abc="cat. A \ncat and a rat\ncan't be friends, how about 1cat23 and concatenate?"
re.findall(r'(?:[^a-zA-Z]|^)(cat)(?:[^a-zA-Z]|$)',abc)
#['cat', 'cat', 'cat']
Here are the contexts of the found cats:
re.findall(r'(?:[^a-zA-Z]|^)cat(?:[^a-zA-Z]|$)',abc)
#['cat.', '\ncat ', '1cat2']
Unfortunately, this regex does not recognize herds of cats ("catcat", "cat cat", and the like). If this is an issue, you can add more clauses to the regex.
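One possible way around the "cat cat" case, sketched here as an assumption rather than as part of the answer above, is to swap the consuming groups for lookarounds. They inspect the neighbouring character without consuming it, so one space can serve as context for two matches. The sample string is made up for illustration.

```python
import re

abc = "cat cat catcat 1cat23 concatenate (cat)"

# Negative lookarounds: no letter directly before or after "cat".
# Nothing around the word is consumed, so adjacent cats can share a space.
pattern = re.compile(r"(?<![a-zA-Z])cat(?![a-zA-Z])")
matches = pattern.findall(abc)
print(len(matches))  # 4

# The original consuming version misses the second cat in "cat cat".
consuming = re.findall(r"(?:[^a-zA-Z]|^)(cat)(?:[^a-zA-Z]|$)", abc)
print(len(consuming))  # 3
```

Note that "catcat" is still not matched by either version, because each "cat" in it touches a letter.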
If I have the following string 'some numbers 66666666666666666667867866 and serial 151283917503423 and 8888888' and I want to find 15-digit numbers (so only 151283917503423), how do I make sure it doesn't match part of the longer number? The string can also be just '151283917503423', so I cannot rely on there being spaces on both sides.
serial = re.compile('[0-9]{15}')
serial.findall('some numbers 66666666666666666667867866 and serial 151283917503423 and 8888888')
This returns both '666666666666666' (the first 15 digits of the longer number) and '151283917503423', but I only want the latter.
Use word boundaries:
serial = re.compile(r'\b[0-9]{15}\b')
\b Matches the empty string, but only at the beginning or end of a
word. A word is defined as a sequence of alphanumeric or underscore
characters, so the end of a word is indicated by whitespace or a
non-alphanumeric, non-underscore character. Note that formally, \b is
defined as the boundary between a \w and a \W character (or vice
versa), or between \w and the beginning/end of the string, so the
precise set of characters deemed to be alphanumeric depends on the
values of the UNICODE and LOCALE flags. For example, r'\bfoo\b'
matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or
'foo3'. Inside a character range, \b represents the backspace
character, for compatibility with Python’s string literals.
You need to use word boundaries to ensure you don't match unwanted text on either side of your match:
>>> serial = re.compile(r'\b\d{15}\b')
>>> serial.findall('some numbers 66666666666666666667867866 and serial 151283917503423 and 8888888')
['151283917503423']
Include word boundaries. Let s be your string. You can use
>>> re.findall(r'\b\d{15}\b' ,s)
['151283917503423']
where \b asserts a word boundary (^\w|\w$|\W\w|\w\W)
Since word boundaries \b contain 2 assertions each, I would use a single assertion
instead.
(?<![0-9])[0-9]{15}(?![0-9])
should be quicker?
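Here is a minimal sketch of the lookaround version applied to the string from the question. It makes no claim about speed; it only shows where the behaviour differs from \b. The `tagged` string is a made-up example.

```python
import re

s = "some numbers 66666666666666666667867866 and serial 151283917503423 and 8888888"

# Negative lookarounds: no digit immediately before or after the 15 digits.
digits_only = re.compile(r"(?<![0-9])[0-9]{15}(?![0-9])")
print(digits_only.findall(s))                  # ['151283917503423']
print(digits_only.findall("151283917503423"))  # ['151283917503423']

# Unlike \b, the lookarounds only care about digits, so letters or
# underscores touching the number do not prevent a match.
tagged = "id_151283917503423x"
print(re.findall(r"\b[0-9]{15}\b", tagged))  # []
print(digits_only.findall(tagged))           # ['151283917503423']
```

Which behaviour is right depends on whether a number glued to letters or underscores should count as a serial.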
I have a regex that matches all three-character words in a string:
\b[^\s]{3}\b
When I use it with the string:
And the tiger attacked you.
this is the result:
regex = re.compile(r"\b[^\s]{3}\b")
regex.findall(string)
[u'And', u'the', u'you']
As you can see, it matches "you" as a three-character word, but I want the expression to treat "you." (with the ".") as a four-character word.
I have the same problem with ",", ";", ":", etc.
I'm pretty new with regex but I guess it happens because those characters are treated like word boundaries.
Is there a way of doing this?
Thanks in advance,
EDIT
Thanks to the answers of @BrenBarn and @Kendall Frey, I managed to get to the regex I was looking for:
(?<!\w)[^\s]{3}(?=$|\s)
If you want to make sure the word is preceded and followed by a space (and not a period like is happening in your case), then use lookaround.
(?<=\s)\w{3}(?=\s)
If you need it to match punctuation as part of words (such as 'in.') then \w won't be adequate, and you can use \S (anything but a space)
(?<=\s)\S{3}(?=\s)
As described in the documentation:
A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.
So if you want a period to count as a word character and not a word boundary, you can't use \b to indicate a word boundary. You'll have to use your own character class. For instance, you can use a regex like \s[^\s]{3}\s if you want to match 3 non-space characters surrounded by spaces. If you still want the boundary to be zero-width (i.e., restrict the match but not be included in it), you could use lookaround, something like (?<=\s)[^\s]{3}(?=\s).
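One thing to watch with (?<=\s) and (?=\s) is that they need an actual whitespace character on each side, so a word at the very start or end of the string is skipped. A minimal sketch of that, using the sentence from the question, plus a negative-lookaround variant that is my own assumption and not part of the answer above:

```python
import re

text = "And the tiger attacked you."

# Positive lookarounds require real whitespace on both sides,
# so "And" at the start of the string is not found.
strict = re.findall(r"(?<=\s)[^\s]{3}(?=\s)", text)
print(strict)  # ['the']

# Negative lookarounds ("no non-space character here") also succeed
# at the beginning and end of the string.
relaxed = re.findall(r"(?<!\S)\S{3}(?!\S)", text)
print(relaxed)  # ['And', 'the']
```

In both cases "you." is not returned, since with the period it is four non-space characters long.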
This would be my approach. Also matches words that come right after punctuations.
import re
r = r'''
\b # word boundary
( # capturing parentheses
[^\s]{3} # anything but whitespace 3 times
\b # word boundary
(?=[^\.,;:]|$) # dont allow . or , or ; or : after word boundary but allow end of string
| # OR
[^\s]{2} # anything but whitespace 2 times
[\.,;:] # a . or , or ; or :
)
'''
s = 'And the tiger attacked you. on,bla tw; th: fo.tes'
print(re.findall(r, s, re.X))
output:
['And', 'the', 'on,', 'bla', 'tw;', 'th:', 'fo.', 'tes']
I want it to match only at the end of every word
example:
"i am test-ing., i am test.ing-, i am_, test_ing,"
output should be:
"i am test-ing i am test.ing i am test_ing"
>>> import re
>>> test = "i am test-ing., i am test.ing-, i am_, test_ing,"
>>> re.sub(r'([^\w\s]|_)+(?=\s|$)', '', test)
'i am test-ing i am test.ing i am test_ing'
This matches one or more punctuation characters or underscores, ([^\w\s]|_)+, followed by either whitespace (\s) or the end of the string ($). The (?= ) construct is a lookahead assertion: it makes sure that the matching whitespace is not included in the match, so it doesn't get replaced; only the ([^\w\s]|_)+ part gets replaced.
Okay, but why [^\w\s]|_, you ask? The first part matches anything that's not alphanumeric or an underscore ([^\w]) or whitespace ([^\s]), i.e. punctuation characters. Except we do want to eliminate underscores, so we then include those with |_.
Example;
X=This
Y=That
not matching;
ThisWordShouldNotMatchThat
ThisWordShouldNotMatch
WordShouldNotMatch
matching;
AWordShouldMatchThat
I tried (?<!...) but it doesn't seem to be easy :)
^(?!This).*That$
As a free-spacing regex:
^ # Start of string
(?!This) # Assert that "This" can't be matched here
.* # Match the rest of the string
That # making sure we match "That"
$ # right at the end of the string
This will match a single word that fulfills your criteria, but only if this word is the only input to the regex. If you need to find words inside a string of many other words, then use
\b(?!This)\w*That\b
\b is the word boundary anchor, so it matches at the start and at the end of a word. \w means "alphanumeric character" (including the underscore). If you also want to allow non-alphanumerics as part of your "word", then use \S instead - this will match anything that's not a space.
In Python, you could do words = re.findall(r"\b(?!This)\w*That\b", text).
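A short sketch of both patterns run against the four example words from the question (the list and variable names are only for illustration):

```python
import re

words = [
    "ThisWordShouldNotMatchThat",
    "ThisWordShouldNotMatch",
    "WordShouldNotMatch",
    "AWordShouldMatchThat",
]

# Anchored pattern: each word is tested on its own.
single = [w for w in words if re.search(r"^(?!This).*That$", w)]
print(single)  # ['AWordShouldMatchThat']

# Word-boundary pattern: the words are searched inside a longer text.
text = " ".join(words)
found = re.findall(r"\b(?!This)\w*That\b", text)
print(found)  # ['AWordShouldMatchThat']
```

The negative lookahead (?!This) is checked right after the start anchor or word boundary, which is why only words beginning with "This" are rejected.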