Why do Python regex spans extend one place past the actual match? - python

Looking at the spans returned from my regex matches, I noticed that they always end one past the actual match; e.g. in this example from the Regular Expression HOWTO:
>>> print(p.match('::: message'))
None
>>> m = p.search('::: message'); print(m)
<_sre.SRE_Match object at 0x...>
>>> m.group()
'message'
>>> m.span()
(4, 11)
The resulting span in the example is (4, 11) vs. the actual location (4, 10). This causes some trouble for me as the left-hand and right-hand boundaries have different meanings and I need to compare the relative positions of the spans.
Is there a good reason for this or can I go ahead and modify the spans to my liking by subtracting one from the right boundary?

Because in Python the end value of slices and ranges is always exclusive, and '::: message'[4:11] reflects the actual matched text:
>>> '::: message'[4:11]
'message'
Thus, you can use the MatchObject.span() results to slice the matched text from the original string:
>>> import re
>>> p = re.compile('[a-z]+')
>>> s = '::: message'
>>> match = p.search(s)
>>> match.span()
(4, 11)
>>> s[slice(*match.span())]
'message'

Related

How do I extract the number at the beginning of a string in Python 3.7?

I'm using Python 3.7. I'm having difficulty extracting a number from the beginning of a string. The string is derived from an HTML element, like so
elt.text
'3 reviews'
However, when I try to get the number using the logic from "Extract Number from String in Python", I get the error below
int(filter(str.isdigit, elt.text))
Traceback (most recent call last):
File "<input>", line 1, in <module>
TypeError: int() argument must be a string, a bytes-like object or a number, not 'filter'
Is there a better way to get the number from the beginning of the string?
As the comments on that answer note, in Python 3, filter returns a lazy filter object (an iterator) rather than a string, so you must iterate over it and build a new string before you can call int:
>>> s = '3 reviews'
>>> filter(str.isdigit, s)
<filter object at 0x800ad5f98>
>>> int(''.join(filter(str.isdigit, s)))
3
However, as other answers in that same thread point out, this is not necessarily a good way to do the job at all:
>>> s = '3 reviews in 12 hours'
>>> int(''.join(filter(str.isdigit, s)))
312
It might be better to use a regular expression matcher to find the number at the front of the string. You can then decide whether to allow signs (+ and -) and leading white-space:
>>> import re
>>> m = re.match(r'\s*([-+])?\d+', s)
>>> m
<_sre.SRE_Match object; span=(0, 1), match='3'>
>>> m.group()
'3'
>>> int(m.group())
3
Now if your string contains a malformed number, m will be None, and if it contains a sign, the sign is allowed:
>>> m = re.match(r'\s*([-+])?\d+', 'not a number')
>>> print(m)
None
>>> m = re.match(r'\s*([-+])?\d+', ' -42')
>>> m
<_sre.SRE_Match object; span=(0, 4), match=' -42'>
>>> int(m.group())
-42
If you wish to inspect what came after the number, if anything, add more to the regular expression (including some parentheses for grouping) and use m.group(1) to get the matched number. Replace \d+ with \d* to allow an empty number-match, if that's meaningful (but then be mindful of matching a lone - or + sign, if you still allow signs).
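A minimal sketch of that extension, assuming you want both the number and whatever follows it. The names LEADING_INT and split_leading_int are hypothetical, and the sign is folded into group 1 here so that m.group(1) is the full signed number.

```python
import re

# Group 1: optional sign plus digits; group 2: whatever follows the number.
# re.DOTALL lets group 2 include newlines in multi-line input.
LEADING_INT = re.compile(r'\s*([-+]?\d+)(.*)', re.DOTALL)

def split_leading_int(text):
    """Return (number, rest), or None if text does not start with an integer."""
    m = LEADING_INT.match(text)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)

print(split_leading_int('3 reviews in 12 hours'))
print(split_leading_int(' -42'))
print(split_leading_int('not a number'))
```

Returning None for a non-match mirrors what re.match itself does, so the caller can use the same "if result is None" check.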
You can amend the top answer in the link you sent to this:
str1 = "3158 is a great number"
print(int("".join(filter(str.isdigit, str1))))
#3158
As to why the answer doesn't work now, I'm not sure.
The easiest way if the number is always at the beginning of the string, given it's a single digit:
number = int(elt.text[0])
Or for more than one digit:
number = int(elt.text.split()[0])
There's a more intuitive way to do it. I'll assume there's a possibility that more than one number appears in a given string, so you want to iterate over the words of the input.
numbers = [int(s) for s in input_string.split(' ') if s.isdigit()]
The first element of the list is the first number found in the given string; you can get it with numbers[0].
If you are certain that the first 'element' of the input string is always a number, you can just split the string on spaces (or whichever separator you are using) and cast the first piece to an integer:
int(input_string.split(' ')[0])
or to a float:
float(input_string.split(' ')[0])
If you aren't certain, wrap the conversion in a try and use either the result of the successful try or a fallback from the except.
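The try/except approach could look roughly like this. It is only a sketch: the function name first_number and the default parameter are assumptions, not something from the answers above.

```python
def first_number(input_string, default=None):
    """Parse the first whitespace-separated token as int, falling back to float.

    Returns `default` when the string is empty or the first token is not numeric.
    """
    tokens = input_string.split()
    if not tokens:
        return default
    for convert in (int, float):
        try:
            return convert(tokens[0])
        except ValueError:
            # This conversion failed; try the next one.
            continue
    return default

print(first_number('3 reviews'))
print(first_number('2.5 stars'))
print(first_number('no reviews'))
```

Trying int before float keeps '3' as an integer instead of turning it into 3.0.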

Python regular expression with or and re.search

Say I have two types of strings:
str1 = 'NUM-140 A Thing: Foobar Analysis NUM-140'
str2 = 'NUM-140 Foobar Analysis NUM-140'
For both of these, I want to match 'Foobar' (which could be anything). I have tried the following:
m = re.compile(r'((?<=Thing: ).+(?= Analysis))|((?<=\d ).+(?= Analysis))')
ind1 = m.search(str1).span()
match1 = str1[ind1[0]:ind1[1]]
ind2 = m.search(str2).span()
match2 = str2[ind2[0]:ind2[1]]
However, match1 comes out to 'A Thing: Foobar', which seems to be the match for the second pattern, not the first. Applied individually, (pattern 1 to str1 and pattern 2 to str2, without the |), both patterns match 'Foobar'. I expected this, then, to stop when matched by the first pattern. This doesn't seem to be the case. What am I missing?
According to the documentation,
As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy.
But the behavior seems to be different:
import re
THING = r'(?<=Thing: )(?P<THING>.+)(?= Analysis)'
NUM = r'(?<=\d )(?P<NUM>.+)(?= Analysis)'
MIXED = THING + '|' + NUM
str1 = 'NUM-140 A Thing: Foobar Analysis NUM-140'
str2 = 'NUM-140 Foobar Analysis NUM-140'
print(re.search(THING, str1))
# <... match='Foobar'>
print(re.search(NUM, str1))
# <... match='A Thing: Foobar'>
print(re.search(MIXED, str1))
# <... match='A Thing: Foobar'>
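We might expect that because THING matches 'Foobar', the MIXED pattern would return that 'Foobar' and stop searching. However, the left-to-right rule in the documentation only orders the branches tried at a single starting position; search() still scans the string from left to right and returns the first position at which any branch matches. At index 8 THING fails but NUM succeeds, so 'A Thing: Foobar' is found before the scan ever reaches index 17.
To prefer THING wherever it occurs, the solution has to rely on Python's or short-circuiting: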
print(re.search(THING, str1) or re.search(NUM, str1))
# <_sre.SRE_Match object; span=(17, 23), match='Foobar'>
print(re.search(THING, str2) or re.search(NUM, str2))
# <_sre.SRE_Match object; span=(8, 14), match='Foobar'>
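To see where the combined pattern picks its match, the following sketch compares the position at which each branch first succeeds on its own with the position the alternation reports. As far as I can tell, the result is consistent with the documentation quoted above: the branch order matters at one starting position, while search() still moves through the string from left to right.

```python
import re

THING = r'(?<=Thing: )(?P<THING>.+)(?= Analysis)'
NUM = r'(?<=\d )(?P<NUM>.+)(?= Analysis)'
MIXED = re.compile(THING + '|' + NUM)

str1 = 'NUM-140 A Thing: Foobar Analysis NUM-140'

# Where does each branch first succeed when tried on its own?
thing_start = re.search(THING, str1).start()
num_start = re.search(NUM, str1).start()

# The combined pattern reports the leftmost of those positions,
# and lastgroup names the branch that produced the match.
mixed = MIXED.search(str1)
print(thing_start, num_start, mixed.start(), mixed.lastgroup)
```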
If you use named groups, e.g. (?P<name>...), you'll be able to debug more easily. But note the docs for span.
https://docs.python.org/2/library/re.html#re.MatchObject.span
span([group]) For MatchObject m, return the 2-tuple (m.start(group),
m.end(group)). Note that if group did not contribute to the match,
this is (-1, -1). group defaults to zero, the entire match.
You're not passing in the group number.
Why are you using span anyway? Just use m.search(str1).groups() or similar
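A short sketch of what the named groups make visible, using the second string from the question. It relies only on standard match object methods (lastgroup, groupdict, span, group); the variable names are arbitrary.

```python
import re

pattern = re.compile(
    r'(?<=Thing: )(?P<THING>.+)(?= Analysis)'
    r'|(?<=\d )(?P<NUM>.+)(?= Analysis)'
)

m = pattern.search('NUM-140 Foobar Analysis NUM-140')

print(m.lastgroup)                    # name of the branch that matched
print(m.groupdict())                  # groups that did not match map to None
print(m.span('THING'))                # (-1, -1): group did not contribute
print(m.span('NUM'), m.group('NUM'))  # span and text of the matching group
```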

Python: Getting text of a Regex match

I have a regex match object in Python. I want to get the text it matched. Say if the pattern is '1.3', and the search string is 'abc123xyz', I want to get '123'. How can I do that?
I know I can use match.string[match.start():match.end()], but I find that to be quite cumbersome (and in some cases wasteful) for such a basic query.
Is there a simpler way?
You can simply use the match object's group function, like:
import re
match = re.search(r"1.3", "abc123xyz")
if match:
    doSomethingWith(match.group(0))
to get the entire match. EDIT: as thg435 points out, you can also omit the 0 and just call match.group().
Additional note: if your pattern contains parentheses, you can even get these submatches by passing 1, 2 and so on to group().
You need to put the regex inside "()" to be able to get that part
>>> var = 'abc123xyz'
>>> exp = re.compile(".*(1.3).*")
>>> exp.match(var)
<_sre.SRE_Match object at 0x691738>
>>> exp.match(var).groups()
('123',)
>>> exp.match(var).group(0)
'abc123xyz'
>>> exp.match(var).group(1)
'123'
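Note that match() only tries to match at the start of the string, so without the surrounding .* the same pattern returns None (search() would find it):
>>> var = 'abc123xyz'
>>> exp = re.compile("1.3")
>>> print(exp.match(var))
None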
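For comparison, a small sketch of match() versus search() on the same input, plus group numbering with parentheses. The two-group pattern at the end is made up for illustration.

```python
import re

var = 'abc123xyz'
exp = re.compile(r'1.3')

anchored = exp.match(var)   # only tries position 0, so this is None
found = exp.search(var)     # scans the whole string

print(anchored)
print(found.group(), found.span())

# group(0) is the whole match; group(1), group(2), ... are the parenthesised parts.
with_group = re.search(r'([a-z]+)(1.3)', var)
print(with_group.group(0), with_group.group(1), with_group.group(2))
```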

Python regular expression; why do the search & match appear to find alpha chars in a number string?

I'm running the search below in IDLE, with Python 2.7, in a Windows Bus. 64-bit environment.
According to RegexBuddy, the search pattern ('patternalphaonly') should not produce a match against a string of digits.
I looked at "http://docs.python.org/howto/regex.html", but did not see anything there that would explain why the search and match appear to be successful in finding something matching the pattern.
Does anyone know what I'm doing wrong, or misunderstanding?
>>> import re
>>> numberstring = '3534543234543'
>>> patternalphaonly = re.compile('[a-zA-Z]*')
>>> result = patternalphaonly.search(numberstring)
>>> print result
<_sre.SRE_Match object at 0x02CEAD40>
>>> result = patternalphaonly.match(numberstring)
>>> print result
<_sre.SRE_Match object at 0x02CEAD40>
Thanks
The star operator (*) indicates zero or more repetitions. Your string has zero repetitions of an English alphabet letter because it is entirely numbers, which is perfectly valid when using the star (repeat zero times). Instead use the + operator, which signifies one or more repetitions. Example:
>>> n = "3534543234543"
>>> r1 = re.compile("[a-zA-Z]*")
>>> r1.match(n)
<_sre.SRE_Match object at 0x07D85720>
>>> r2 = re.compile("[a-zA-Z]+") #using the + operator to make sure we have at least one letter
>>> r2.match(n)
Helpful link on repetition operators.
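To make the "zero repetitions" point concrete, this sketch inspects what the starred pattern actually matched: an empty string at position 0. It uses the same digit string as the question.

```python
import re

numberstring = '3534543234543'

star = re.compile('[a-zA-Z]*').search(numberstring)
plus = re.compile('[a-zA-Z]+').search(numberstring)

# The * version succeeds, but the match is empty and has zero width.
print(repr(star.group()), star.span())

# The + version needs at least one letter, so there is no match at all.
print(plus)
```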
Everything eldarerathis says is true. However, given a variable named 'patternalphaonly', I would assume that the author wants to verify that a string is composed of alpha chars only. If this is true then I would add start- and end-of-string anchors to the regex like so:
patternalphaonly = re.compile('^[a-zA-Z]+$')
result = patternalphaonly.search(numberstring)
Or, better yet, since this will only ever match at the beginning of the string, use the preferred match method:
patternalphaonly = re.compile('[a-zA-Z]+$')
result = patternalphaonly.match(numberstring)
(Which, as John Machin has pointed out, is evidently faster for some as-yet unexplained reason.)
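On Python 3.4 or later there is also re.fullmatch, which requires the whole string to match without explicit anchors. This is a sketch under that version assumption (the question above targets 2.7, where the anchored pattern is still the way to go); the helper name is_alpha_only is hypothetical.

```python
import re

ALPHA_ONLY = re.compile(r'[a-zA-Z]+')

def is_alpha_only(text):
    """True if text consists of one or more ASCII letters and nothing else."""
    return ALPHA_ONLY.fullmatch(text) is not None

print(is_alpha_only('abc'))
print(is_alpha_only('3534543234543'))

# '$' also matches just before a trailing newline, while fullmatch does not
# accept the newline.
print(re.match(r'[a-zA-Z]+$', 'abc\n') is not None, is_alpha_only('abc\n'))
```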

Python Regex - How to Get Positions and Values of Matches

How can I get the start and end positions of all matches using the re module? For example given the pattern r'[a-z]' and the string 'a1b2c3d4' I'd want to get the positions where it finds each letter. Ideally, I'd like to get the text of the match back too.
import re
p = re.compile("[a-z]")
for m in p.finditer('a1b2c3d4'):
    print(m.start(), m.group())
Taken from
Regular Expression HOWTO
span() returns both start and end indexes in a single tuple. Since the
match method only checks if the RE matches at the start of a string,
start() will always be zero. However, the search method of RegexObject
instances scans through the string, so the match may not start at zero
in that case.
>>> p = re.compile('[a-z]+')
>>> print p.match('::: message')
None
>>> m = p.search('::: message') ; print m
<re.MatchObject instance at 80c9650>
>>> m.group()
'message'
>>> m.span()
(4, 11)
Combine that with:
In Python 2.2, the finditer() method is also available, returning a sequence of MatchObject instances as an iterator.
>>> p = re.compile( ... )
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator
<callable-iterator object at 0x401833ac>
>>> for match in iterator:
...     print match.span()
...
(0, 2)
(22, 24)
(29, 31)
you should be able to do something on the order of
for match in re.finditer(r'[a-z]', 'a1b2c3d4'):
    print match.span()
For Python 3.x
from re import finditer
for match in finditer("pattern", "string"):
    print(match.span(), match.group())
This prints one line per hit: a tuple with the start index and the (exclusive) end index of the match, followed by the matched text.
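Note that span() and group() accept a group index when the regex has multiple capture groups; index 0 is the whole match:
import re
string = "a12B c345D"  # sample input, assumed for illustration
regex_with_3_groups = r"([a-z])([0-9]+)([A-Z])"
for match in re.finditer(regex_with_3_groups, string):
    for idx in range(0, 4):
        print(match.span(idx), match.group(idx))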
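If you would rather collect the positions and values than print them, a minimal sketch looks like this. The function name find_all is made up; it wraps re.finditer and returns (start, end, text) tuples, with end being exclusive as in span().

```python
import re

def find_all(pattern, text):
    """Return a list of (start, end, matched_text) for every match in text."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

hits = find_all(r'[a-z]', 'a1b2c3d4')
print(hits)
```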
