What is a word boundary when using regex in Python?

What is a word boundary in a Python regex? Can someone please explain this on these examples:
Example 1
>>> x = '456one two three123'
>>> y=re.search(r"\btwo\b",x)
>>> y
<_sre.SRE_Match object at 0x2aaaaab47d30>
Example 2
>>> y=re.search(r"two",x)
>>> y
<_sre.SRE_Match object at 0x2aaaaab47d30>
Example 3
>>> ip="192.168.254.1234"
>>> if re.search(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b",ip):
...     print(ip)
...
Example 4
>>> ip="192.168.254.1234"
>>> if re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",ip):
...     print(ip)
192.168.254.1234

"word boundary" means exactly what it says: the boundary of a word, i.e. either the beginning or the end.
It does not match any actual character in the input; it is a zero-width assertion that succeeds only if the current match position is at the beginning or end of a word, where a "word" is a run of \w characters (letters, digits and the underscore).
This is important because, unlike matching whitespace, it also succeeds at the beginning or end of the entire input.
So '\bfoo' will match 'foobar' and 'foo bar' and 'bar foo', but not 'barfoo'.
'foo\b' will match 'foo bar' and 'bar foo' and 'barfoo', but not 'foobar'.
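The two claims above can be checked directly. This is a minimal sketch; the list name candidates is only for illustration.

```python
import re

candidates = ["foobar", "foo bar", "bar foo", "barfoo"]

# \bfoo: "foo" must start at a word boundary.
starts = [s for s in candidates if re.search(r"\bfoo", s)]
# foo\b: "foo" must end at a word boundary.
ends = [s for s in candidates if re.search(r"foo\b", s)]

print(starts)  # ['foobar', 'foo bar', 'bar foo']
print(ends)    # ['foo bar', 'bar foo', 'barfoo']
```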

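Try this:
import re
ip = "192.168.254.1234"
res = re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", ip)
print(res)
Notice how the dots are escaped and the pattern is a raw string.
The IP is found (as 192.168.254.123) because the regex doesn't care what comes after the last 1-3 digits.
Now:
ip = "192.168.254.1234"
res = re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b", ip)
print(res)
This finds nothing, since the last group can take at most three digits and the position between 123 and 4 is NOT A BOUNDARY. The r prefix matters here: in a normal string literal, "\b" is a backspace character, not a word-boundary assertion.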
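A side-by-side sketch of the same idea, using raw strings throughout. The second input string ("ip: 192.168.254.123") is made up for illustration and is not from the question.

```python
import re

ip = "192.168.254.1234"

# Without a trailing \b, the last \d{1,3} stops after "123" and the match succeeds.
loose = re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", ip)
# With \b on both sides, the final digit run "1234" is too long, so nothing matches.
strict = re.findall(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b", ip)
# A well-formed address still matches the strict pattern.
ok = re.findall(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b", "ip: 192.168.254.123")

print(loose)   # ['192.168.254.123']
print(strict)  # []
print(ok)      # ['192.168.254.123']
```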

Related

Putting words in parenthesis with regex python

How can I put in brackets / parenthesis some words following another word in python?
For 2 words it looks like:
>>> p=re.compile(r"foo\s(\w+)\s(\w+)")
>>> p.sub( r"[\1] [\2]", "foo bar baz")
'[bar] [baz]'
I want this to work for an arbitrary number of words. I came up with this, but it doesn't seem to work.
>>> p=re.compile(r"foo(\s(\w+))*")
>>> p.sub( r"[\2] [\2] [\2]", "foo bar baz bax")
'[bax] [bax] [bax]'
The desired result in this case would be
'[bar] [baz] [bax]'
You may use a solution like
import re
p = re.compile(r"(foo\s+)([\w\s]+)")
r = re.compile(r"\w+")
s = "foo bar baz"
print( p.sub( lambda x: "{}{}".format(x.group(1), r.sub(r"[\g<0>]", x.group(2))), s) )
The first pattern, (foo\s+)([\w\s]+), captures foo followed by one or more whitespace characters into Group 1, and then one or more word and whitespace characters into Group 2.
Then, in p.sub, the replacement argument is a lambda that wraps every run of word characters in Group 2 in square brackets, using the second, simpler \w+ regex. (A regex is used here to preserve the original amount of whitespace between the words; otherwise it could be done without one.)
Note that the [\g<0>] replacement pattern inserts [, the whole match value (\g<0>), and then ]. For the sample string this prints foo [bar] [baz].
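For context, the asker's attempt printed [bax] three times because a repeated capturing group only keeps its last iteration. The sketch below shows that, and then a variant of the two-pass idea that drops the leading foo to produce the output requested in the question. Dropping foo is an assumption about the desired result, not part of the answer above.

```python
import re

text = "foo bar baz bax"

# A repeated capturing group only remembers its final iteration.
m = re.match(r"foo(\s(\w+))*", text)
print(m.group(2))  # bax

# Capture the whole tail once, then bracket each word in a second pass.
# The function's return value replaces the entire match, so "foo " is dropped.
result = re.sub(
    r"foo\s+([\w\s]+)",
    lambda mo: re.sub(r"\w+", r"[\g<0>]", mo.group(1)),
    text,
)
print(result)  # [bar] [baz] [bax]
```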
I suggest the following simple solution:
import re
s = "foo bar baz bu bi porte"
p = re.compile(r"foo\s([\w\s]+)")
m = p.match(s)
# Here: m.group(1) is "bar baz bu bi porte"
# m.group(1).split() is ['bar', 'baz', 'bu', 'bi', 'porte']
print(' '.join([f'[{i}]' for i in m.group(1).split()])) # for Python 3.6+ (due to f-strings)
# [bar] [baz] [bu] [bi] [porte]
print(' '.join(['[' + i + ']' for i in m.group(1).split()])) # for other Python versions
# [bar] [baz] [bu] [bi] [porte]

Python Regex matching already matched sub-string

I'm fairly new to Python Regex and I'm not able to understand the following:
I'm trying to find one small letter surrounded by three capital letters.
My first problem is that the below regex is giving only one match instead of the two matches that are present ['AbAD', 'DaDD']
>>> import re
>>>
>>> # String
... str = 'AbADaDD'
>>>
>>> pat = '[A-Z][a-z][A-Z][A-Z]'
>>> regex = re.compile(pat)
>>>
>>> print regex.findall(str)
['AbAD']
I guess the above is due to the fact that the last D of the first match is not available for matching any more? Is there any way to turn off this kind of matching?
The second issue is the following regex:
>>> import re
>>>
>>> # String
... str = 'AbADaDD'
>>>
>>> pat = '[^A-Z][A-Z][a-z][A-Z][A-Z][^A-Z]'
>>> regex = re.compile(pat)
>>>
>>> print regex.findall(str)
[]
Basically what I want is that there shouldn't be more than three capital letters surrounding a small letter, and therefore I placed a negative match around them. But ['AbAD'] should be matched, but it is not getting matched. Any ideas?
It's mainly because the matches overlap. Just put your regex inside a lookahead in order to handle this type of overlapping match.
(?=([A-Z][a-z][A-Z][A-Z]))
Code:
>>> s = 'AbADaDD'
>>> re.findall(r'(?=([A-Z][a-z][A-Z][A-Z]))', s)
['AbAD', 'DaDD']
For the 2nd one, you should use negative lookahead and lookbehind assertions, as below:
(?=(?<![A-Z])([A-Z][a-z][A-Z][A-Z])(?![A-Z]))
Code:
>>> re.findall(r'(?=(?<![A-Z])([A-Z][a-z][A-Z][A-Z])(?![A-Z]))', s)
['AbAD']
The problem with your second regex is that [^A-Z] consumes a character, and there is no character at all before the first A for it to consume. The negative lookbehind (?<![A-Z]) checks the same condition but does not consume anything: it only asserts that the match is not preceded by an uppercase letter, which is also true at the start of the string. That's why your version finds no match.
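The difference between consuming and asserting can be seen in a short sketch. Here consuming is the question's second pattern with a capture group added for comparison, and the string "xAbADy" is a made-up input, not one from the question.

```python
import re

s = "AbADaDD"
consuming = r"[^A-Z]([A-Z][a-z][A-Z][A-Z])[^A-Z]"
asserting = r"(?=(?<![A-Z])([A-Z][a-z][A-Z][A-Z])(?![A-Z]))"

# [^A-Z] must match a real character, and nothing precedes index 0.
print(re.findall(consuming, s))   # []
# Lookarounds only test the neighbours, so the string edge is acceptable.
print(re.findall(asserting, s))   # ['AbAD']

# Hypothetical input with neighbours on both sides: both patterns find the
# group, but the consuming one swallows "x" and "y" as part of the match.
t = "xAbADy"
print(re.search(consuming, t).group(0))  # xAbADy
print(re.findall(asserting, t))          # ['AbAD']
```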
The problem with your regex is that it eats up the string as it progresses, leaving nothing for the second match. Use a lookahead to make sure it does not consume the string.
pat = '(?=([A-Z][a-z][A-Z][A-Z]))'
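For your second regex, do the same again:
print(re.findall(r"(?=([A-Z][a-z][A-Z][A-Z](?=[^A-Z])))", s))
Note that (?=[^A-Z]) needs a following character, so a candidate at the very end of the string is rejected; (?![A-Z]) does not have that restriction.
To see why the original pattern finds only one match:
1) After the first match, the string left is aDD, as the first part has already been consumed.
2) aDD does not satisfy pat = '[A-Z][a-z][A-Z][A-Z]', so it is not part of any match.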
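The two steps above can be traced with match spans. This is a small sketch using the question's first pattern; the variable names are illustrative.

```python
import re

s = "AbADaDD"
pat = re.compile(r"[A-Z][a-z][A-Z][A-Z]")

first = pat.search(s)
print(first.group(), first.span())  # AbAD (0, 4)

# The scan resumes at index 4, where only "aDD" is left.
rest = s[first.end():]
print(rest)                         # aDD
print(pat.search(s, first.end()))   # None
```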
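For the first issue, you could use this pattern. Note that it matches only one capital letter after the small letter, so it returns three-character matches (the {1} quantifiers are redundant but harmless):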
r'([A-Z]{1}[a-z]{1}[A-Z]{1})'
Example
>>> import re
>>> str = 'AbADaDD'
>>> re.findall(r'([A-Z]{1}[a-z]{1}[A-Z]{1})', str)
['AbA', 'DaD']
For the second issue, you should use:
(?=(?<![A-Z])([A-Z]{1}[a-z]{1}[A-Z]{1}[A-Z]{1})(?![A-Z]))
Example
>>> import re
>>> str = 'AbADaDD'
>>> re.findall(r'(?=(?<![A-Z])([A-Z]{1}[a-z]{1}[A-Z]{1}[A-Z]{1})(?![A-Z]))', str)
['AbAD']

Extract first 3 numbers from string

How can I extract first 3 numbers of a string:
in:
"Box 123 (NO) 456"
out:
123
Just search for \d{3} and grab the first match:
match = re.search(r'\d{3}', inputstring)
if match:
    print(match.group(0))
Demo:
>>> import re
>>> inputstring = "Box 123 (NO) 456"
>>> match = re.search(r'\d{3}', inputstring)
>>> if match:
...     print(match.group(0))
...
123
Note that the above also matches a substring; if you have a number that is four digits long it'll match the first 3 digits of that number.
Your post is very sparse on details. Let's presume that the above is not enough, but that your numbers are delimited by whitespace; then you can match exactly 3 digits by using \b anchors:
match = re.search(r'\b\d{3}\b', inputstring)
which matches only 3 digits that sit between non-word characters or the edges of the string (whitespace, punctuation, etc.; anything that is not a letter, a digit, or an underscore):
>>> re.search(r'\b\d{3}\b', inputstring)
<_sre.SRE_Match object at 0x106c4f100>
>>> re.search(r'\b\d{3}\b', "Box 1234")
>>> re.search(r'\b\d{3}\b', "Box 123")
<_sre.SRE_Match object at 0x106c4f1d0>
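Wrapped up as a helper, the \b version might look like the sketch below. The function name first_three_digit_number is hypothetical, and the last line repeats the unanchored search for contrast.

```python
import re

def first_three_digit_number(text):
    """Return the first standalone 3-digit number in text, or None."""
    match = re.search(r"\b\d{3}\b", text)
    return match.group(0) if match else None

print(first_three_digit_number("Box 123 (NO) 456"))  # 123
print(first_three_digit_number("Box 1234"))          # None
# Without the anchors, the first three digits of a longer number match.
print(re.search(r"\d{3}", "Box 1234").group(0))      # 123
```

Because letters are word characters too, an input such as "Box123" has no boundary before the digits and would also return None.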

How do I separate words using regex in python while considering words with apostrophes?

I tried to find separate m's in a string with a Python regex, using word boundaries and findall. These m's should either have whitespace on both sides or begin/end the string:
r = re.compile("\\bm\\b")
re.findall(r, someString)
However, this method also finds m's within words like I'm since apostrophes are considered to be word boundaries. How do I write a regex that doesn't consider apostrophes as word boundaries?
I've tried this:
r = re.compile("(\\sm\\s) | (^m) | (m$)")
re.findall(r, someString)
but that just doesn't match any m. Odd.
Using lookaround assertion:
>>> import re
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "I'm a boy")
[]
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "I m a boy")
['m']
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "mama")
['m']
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "pm")
['m']
(?=...)
Matches if ... matches next, but doesn't consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.
(?<=...)
Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def will find a match in abcdef, ...
(from the Python documentation, Regular expression syntax)
BTW, with a raw string (r'this is a raw string') you don't need to escape the backslash:
>>> r'\s' == '\\s'
True
You don't even need look-around (unless you want to capture the m without the spaces), and your second example was inches away. It was the extra spaces around the | (fine in Python code, but significant inside a regex) that stopped it from working:
>>> re.findall(r'\sm\s|^m|m$', "I m a boy")
[' m ']
>>> re.findall(r'\sm\s|^m|m$', "mamam")
['m', 'm']
>>> re.findall(r'\sm\s|^m|m$', "mama")
['m']
>>> re.findall(r'\sm\s|^m|m$', "I'm a boy")
[]
>>> re.findall(r'\sm\s|^m|m$', "I'm a boym")
['m']
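If the spaces are wanted for readability, the re.VERBOSE flag makes the regex engine ignore unescaped whitespace in the pattern. This is a minimal sketch; the groups from the question are left out so that findall returns plain strings.

```python
import re

spaced = r"\sm\s | ^m | m$"

# Default mode: spaces in the pattern are literal characters to match.
print(re.findall(spaced, "I m a boy"))              # []
# re.VERBOSE: unescaped whitespace in the pattern is ignored.
print(re.findall(spaced, "I m a boy", re.VERBOSE))  # [' m ']
print(re.findall(spaced, "I'm a boy", re.VERBOSE))  # []
```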
falsetru's answer is almost the equivalent of "\b except apostrophes", but not quite. It will still find matches where a boundary is missing. Using one of falsetru's examples:
>>> import re
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "mama")
['m']
It finds 'm', but there is no occurrence of 'm' in 'mama' that would match '\bm\b'. The first 'm' matches '\bm', but that's as close as it gets.
The regex that implements "\b without apostrophes" is shown below:
(?<=\s)m(?=\s)|^m(?=\s)|(?<=\s)m$|^m$
This will find any of the following 4 cases:
'm' with white space before and after
'm' at beginning followed by white space
'm' at end preceded by white space
'm' with nothing preceding or following it (i.e. just literally the string "m")
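The four cases can be exercised as below. The compact pattern (?<!\S)m(?!\S) is an alternative introduced here, not part of the answer: it reads "not preceded and not followed by a non-whitespace character", and the loop checks that it agrees with the four-case regex on these sample strings only.

```python
import re

four_cases = re.compile(r"(?<=\s)m(?=\s)|^m(?=\s)|(?<=\s)m$|^m$")
# Assumed shorthand for the same idea, using negative lookarounds.
compact = re.compile(r"(?<!\S)m(?!\S)")

samples = ["I'm a boy", "I m a boy", "mama", "pm", "m", "m is a letter"]
results = {text: four_cases.findall(text) for text in samples}

for text, found in results.items():
    # Both patterns should agree on every sample.
    assert compact.findall(text) == found
    print(repr(text), found)
```

Only "I m a boy", "m" and "m is a letter" produce a match; "mama" and "pm" no longer do.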

Python regular expression to match either a quoted or unquoted string

I am trying to write a regular expression in Python that will match either a quoted string with spaces or an unquoted string without spaces. For example given the string term:foo the result would be foo and given the string term:"foo bar" the result would be foo bar. So far I've come up with the following regular expression:
r = re.compile(r'''term:([^ "]+)|term:"([^"]+)"''')
The problem is that the match can come in either group(1) or group(2) so I have to do something like this:
m = r.match(search_string)
term = m.group(1) or m.group(2)
Is there a way I can do this all in one step?
Avoid grouping, and instead use lookahead/lookbehind assertions to eliminate the parts that are not needed:
s = 'term:foo term:"foo bar" term:bar foo term:"foo term:'
re.findall(r'(?<=term:)[^" ]+|(?<=term:")[^"]+(?=")', s)
Gives:
['foo', 'foo bar', 'bar']
It doesn't seem that you really want re.match here. Your regex is almost right, but you're grouping too much. How about this?
>>> s
('xyz term:abc 123 foo', 'foo term:"abc 123 "foo')
>>> re.findall(r'term:([^ "]+|"[^"]+")', '\n'.join(s))
['abc', '"abc 123 "']
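Another way to get the value in one step while keeping the question's two-group pattern is the match object's lastindex attribute, which holds the index of the last capturing group that took part in the match. The helper name extract_term is hypothetical, and the sketch uses search rather than match.

```python
import re

# The pattern from the question, unchanged.
TERM = re.compile(r'''term:([^ "]+)|term:"([^"]+)"''')

def extract_term(search_string):
    """Return the term value without quotes, or None if there is no term."""
    m = TERM.search(search_string)
    # Exactly one of the two groups matches, and lastindex points at it.
    return m.group(m.lastindex) if m else None

print(extract_term('term:foo'))        # foo
print(extract_term('term:"foo bar"'))  # foo bar
print(extract_term('no term here'))    # None
```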
