Python regular expression with or and re.search - python

Say I have two types of strings:
str1 = 'NUM-140 A Thing: Foobar Analysis NUM-140'
str2 = 'NUM-140 Foobar Analysis NUM-140'
For both of these, I want to match 'Foobar' (which could be anything). I have tried the following:
m = re.compile(r'((?<=Thing: ).+(?= Analysis))|((?<=\d ).+(?= Analysis))')
ind1 = m.search(str1).span()
match1 = str1[ind1[0]:ind1[1]]
ind2 = m.search(str2).span()
match2 = str2[ind2[0]:ind2[1]]
However, match1 comes out to 'A Thing: Foobar', which seems to be the match for the second pattern, not the first. Applied individually, (pattern 1 to str1 and pattern 2 to str2, without the |), both patterns match 'Foobar'. I expected this, then, to stop when matched by the first pattern. This doesn't seem to be the case. What am I missing?

According to the documentation,
As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy.
But the behavior seems to be different:
import re
THING = r'(?<=Thing: )(?P<THING>.+)(?= Analysis)'
NUM = r'(?<=\d )(?P<NUM>.+)(?= Analysis)'
MIXED = THING + '|' + NUM
str1 = 'NUM-140 A Thing: Foobar Analysis NUM-140'
str2 = 'NUM-140 Foobar Analysis NUM-140'
print(re.search(THING, str1))
# <... match='Foobar'>
print(re.search(NUM, str1))
# <... match='A Thing: Foobar'>
print(re.search(MIXED, str1))
# <... match='A Thing: Foobar'>
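We might expect that because THING matches 'Foobar', the MIXED pattern would return that 'Foobar' and quit searching. However, the documentation describes what happens at a single starting position. A search scans the string from left to right and tries the alternatives in order at each position. At index 8, THING fails (its lookbehind needs 'Thing: ') but NUM succeeds, so the search returns 'A Thing: Foobar' before it ever reaches index 17, where THING would match.
To prefer THING regardless of where it matches, try the patterns one at a time and rely on Python's or short-circuiting: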
print(re.search(THING, str1) or re.search(NUM, str1))
# <_sre.SRE_Match object; span=(17, 23), match='Foobar'>
print(re.search(THING, str2) or re.search(NUM, str2))
# <_sre.SRE_Match object; span=(8, 14), match='Foobar'>
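The leftmost-match behaviour can be checked directly by comparing where each branch first matches on its own. This is a minimal sketch that reuses the patterns above; the variable names thing_span, num_span and mixed are illustrative only.

```python
import re

THING = r'(?<=Thing: )(?P<THING>.+)(?= Analysis)'
NUM = r'(?<=\d )(?P<NUM>.+)(?= Analysis)'
MIXED = THING + '|' + NUM

str1 = 'NUM-140 A Thing: Foobar Analysis NUM-140'

# Where each branch first matches when it is searched on its own.
thing_span = re.search(THING, str1).span()
num_span = re.search(NUM, str1).span()

# The combined pattern reports the leftmost match. Alternation order only
# decides between branches that can match at the same starting position.
mixed = re.search(MIXED, str1)

print(thing_span, num_span, mixed.span(), mixed.lastgroup)
```

NUM starts matching earlier in the string than THING, which is why the combined pattern returns the NUM branch even though THING is listed first.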

If you use named groups, e.g. (?P<name>...), this will be easier to debug. But note the docs for span.
https://docs.python.org/2/library/re.html#re.MatchObject.span
span([group]) For MatchObject m, return the 2-tuple (m.start(group),
m.end(group)). Note that if group did not contribute to the match,
this is (-1, -1). group defaults to zero, the entire match.
You're not passing in the group number, so span() describes the entire match.
Why are you using span anyway? Just use m.search(str1).groups() or similar.
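As a rough illustration of the named-group suggestion, the sketch below wraps each branch of the question's pattern in a named group. The group names thing and num are made up for this example.

```python
import re

# Each alternative gets its own (hypothetical) group name.
m = re.compile(
    r'(?P<thing>(?<=Thing: ).+(?= Analysis))'
    r'|(?P<num>(?<=\d ).+(?= Analysis))'
)
str2 = 'NUM-140 Foobar Analysis NUM-140'

match = m.search(str2)
print(match.group())        # text of the entire match
print(match.lastgroup)      # name of the group that matched
print(match.groupdict())    # groups that did not participate map to None
print(match.span('thing'))  # (-1, -1) when the group did not contribute
```

Reading match.lastgroup or match.groupdict() shows which branch produced the result, without slicing the string by hand.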

Related

Python regex pattern max length in re.compile?

I am trying to compile a big pattern with re.compile in Python 3.
The pattern is composed of 500 small words (I want to remove them from a text). The problem is that the pattern stops after about 18 words.
Python doesn't raise any error.
What I do is:
stoplist = map(lambda s: "\\b" + s + "\\b", stoplist)
stopstring = '|'.join(stoplist)
stopword_pattern = re.compile(stopstring)
The stopstring is ok (all the words are in) but the pattern is much shorter. It even stops in the middle of a word!
Is there a max length for the regex pattern?
Consider this example:
import re
stop_list = map(lambda s: "\\b" + str(s) + "\\b", range(1000, 2000))
stopstring = "|".join(stop_list)
stopword_pattern = re.compile(stopstring)
If you try to print the pattern, you'll see something like
>>> print(stopword_pattern)
re.compile('\\b1000\\b|\\b1001\\b|\\b1002\\b|\\b1003\\b|\\b1004\\b|\\b1005\\b|\\b1006\\b|\\b1007\\b|\\b1008\\b|\\b1009\\b|\\b1010\\b|\\b1011\\b|\\b1012\\b|\\b1013\\b|\\b1014\\b|\\b1015\\b|\\b1016\\b|\\b1017\\b|\)
which seems to indicate that the pattern is incomplete. However, this is only a limitation of the repr of compiled pattern objects, which in CPython truncates the displayed pattern text (here after about 200 characters). The whole pattern is still compiled. If you try to perform a match against the "missing" part of the pattern, you'll see that it still succeeds:
>>> stopword_pattern.match("1999")
<_sre.SRE_Match object; span=(0, 4), match='1999'>
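To confirm that nothing was lost, compare the compiled object's pattern attribute with the string that was passed in. This is a small sketch under the same setup as the example above; the sample sentence is invented, and the length comparison assumes CPython's truncated repr.

```python
import re

stopstring = "|".join(r"\b" + str(n) + r"\b" for n in range(1000, 2000))
stopword_pattern = re.compile(stopstring)

# .pattern holds the full source string, whatever repr() chooses to display.
full_length = len(stopword_pattern.pattern)
shown_length = len(repr(stopword_pattern))

# A word from the end of the list is still found.
found = stopword_pattern.search("the year 1999 is long gone")

print(full_length, shown_length, found.group())
```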

Regex match if not before and after

How can I match 'suck' only if not part of 'honeysuckle'?
Using lookbehind and lookahead I can match suck if not 'honeysuck' or 'suckle', but it also fails to catch something like 'honeysucker'; here the expression should match, because it doesn't end in le:
re.search(r'(?<!honey)suck(?!le)', 'honeysucker')
You need to nest the lookaround assertions:
>>> import re
>>> regex = re.compile(r"(?<!honey(?=suckle))suck")
>>> regex.search("honeysuckle")
>>> regex.search("honeysucker")
<_sre.SRE_Match object at 0x00000000029B6370>
>>> regex.search("suckle")
<_sre.SRE_Match object at 0x00000000029B63D8>
>>> regex.search("suck")
<_sre.SRE_Match object at 0x00000000029B6370>
An equivalent solution would be suck(?!(?<=honeysuck)le).
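A quick way to check the nested lookaround against the cases from the question is to run it over a list of words. This is only a sketch; the results dictionary is introduced here for illustration.

```python
import re

# Reject "suck" only when it is preceded by "honey" AND followed by "le".
regex = re.compile(r"(?<!honey(?=suckle))suck")

words = ["honeysuckle", "honeysucker", "suckle", "suck"]
results = {word: bool(regex.search(word)) for word in words}

for word, matched in results.items():
    print(word, matched)
```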
Here is a solution without lookarounds: remove 'honeysuckle' from the string first:
s = s.replace('honeysuckle','')
and now:
re.search('suck',s)
and this would work for any of these strings: honeysuckle sucks, this sucks, and even regular expressions suck.
I believe you should separate your exceptions in a different Array, just in case in the future you wish to add a different rule. This will be easier to read, and will be faster in the future to change if needed.
My suggestion in Ruby is:
words = ['honeysuck', 'suckle', 'HONEYSUCKER', 'honeysuckle']
EXCEPTIONS = ['honeysuckle']
def match_suck word
  if (word =~ /suck/i) != nil
    # should not match any of the exceptions
    return true unless EXCEPTIONS.include? word.downcase
  end
  false
end
words.each{ |w|
  puts "Testing match of '#{w}' : #{match_suck(w)}"
}
>>> string = 'honeysucker'
>>> print 'suck' in string
True

python re, find expression containing an optional group

I have an expression to match that can take either of these forms:
(src://path/to/foldernames canhave spaces/file.xzy)
(src://path/to/foldernames canhave spaces/file.xzy "optional string")
These expressions occur within a much longer string (they are not individual strings). I am having trouble matching both forms when using re.search or re.findall (as there may be multiple expressions in the string).
It's straightforward enough to match either individually but how can I go about matching either case so that two groups are returned, the first with src://path/... and the second with the optional string if it exists or None if not?
I am thinking that I need to somehow specify OR groups---for instance, consider:
The pattern \((.*)( ".*")\) matches the second instance but not the first because it does not contain "...".
r = re.search(r'\((.*)( ".*")\)', '(src://path/to/foldernames canhave spaces/file.xzy)')
r.groups() # Nothing found
AttributeError: 'NoneType' object has no attribute 'groups'
While \((.*)( ".*")?\) matches the first group but does not individually identify the "optional string" as a group in the second instance.
r = re.search(r'\((.*)( ".*")?\)', '(src://path/to/foldernames canhave spaces/file.xzy "optional string")')
r.groups()
('src://path/to/foldernames canhave spaces/file.xzy "optional string"', None)
Any thoughts, ye' masters of expressions (of the regular variety)?
The simplest way is to make the first * non-greedy:
>>> import re
>>> string = "(src://path/to/foldernames canhave spaces/file.xzy)"
>>> string2 = \
... '(src://path/to/foldernames canhave spaces/file.xzy "optional string")'
>>> re.findall(r'\((.*?)( ".*")?\)', string2)
[('src://path/to/foldernames canhave spaces/file.xzy', ' "optional string"')]
>>> re.findall(r'\((.*?)( ".*")?\)', string)
[('src://path/to/foldernames canhave spaces/file.xzy', '')]
Since " aren't usually allowed to appear in file names, you can simply exclude them from the first group:
r = re.search(r'\(([^"]*)( ".*")?\)', input)
This is generally the preferred alternative to ungreedy repetition, because it tends to be a lot more efficient. If your file names can actually contain quotes for some reason, then ungreedy repetition (as in agf's answer) is your best bet.
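Since the question mentions several expressions inside one longer string and wants None for a missing optional part, here is a hedged sketch that combines both suggestions. It differs from the answers above in two ways: the optional string is matched with "[^"]*" rather than ".*" so that one match cannot run into the next, and the quotes are left out of the captured group. The sample text and the name PATTERN are invented for the example.

```python
import re

# Group 1: the path (no quotes allowed). Group 2: optional quoted string.
PATTERN = re.compile(r'\(([^"]*?)(?: "([^"]*)")?\)')

text = ('see (src://path/to/folder name/file.xzy) and '
        '(src://other dir/file2.xzy "optional string") here')

# finditer + groups() yields None for a group that did not participate,
# whereas findall would report an empty string instead.
found = [m.groups() for m in PATTERN.finditer(text)]

for path, note in found:
    print(path, note)
```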

Python: Getting text of a Regex match

I have a regex match object in Python. I want to get the text it matched. Say if the pattern is '1.3', and the search string is 'abc123xyz', I want to get '123'. How can I do that?
I know I can use match.string[match.start():match.end()], but I find that to be quite cumbersome (and in some cases wasteful) for such a basic query.
Is there a simpler way?
You can simply use the match object's group function, like:
match = re.search(r"1.3", "abc123xyz")
if match:
    doSomethingWith(match.group(0))
to get the entire match. EDIT: as thg435 points out, you can also omit the 0 and just call match.group().
Additional note: if your pattern contains parentheses, you can also get these submatches by passing 1, 2 and so on to group().
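A short sketch of these calls side by side, with one pair of parentheses added to the pattern purely for illustration:

```python
import re

match = re.search(r"1(.)3", "abc123xyz")

whole = match.group()     # same as match.group(0): the entire match
inner = match.group(1)    # text captured by the first parenthesised group
all_groups = match.groups()
where = match.span()      # start and end indices of the entire match

print(whole, inner, all_groups, where)
```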
To get only part of what was matched, put that part of the regex inside "()" and read it with group(1):
>>> var = 'abc123xyz'
>>> exp = re.compile(".*(1.3).*")
>>> exp.match(var)
<_sre.SRE_Match object at 0x691738>
>>> exp.match(var).groups()
('123',)
>>> exp.match(var).group(0)
'abc123xyz'
>>> exp.match(var).group(1)
'123'
Without the surrounding .*, match() returns None here, because match() only matches at the start of the string:
>>> var = 'abc123xyz'
>>> exp = re.compile("1.3")
>>> print exp.match(var)
None
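The None above comes from match() being anchored at the start of the string, not from the missing parentheses. A minimal comparison with search(), which scans the whole string (the variable names at_start and anywhere are illustrative):

```python
import re

var = 'abc123xyz'
exp = re.compile("1.3")

at_start = exp.match(var)    # only tries position 0, so this is None
anywhere = exp.search(var)   # scans forward until the pattern fits

print(at_start)
print(anywhere.group())
```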

Python regular expression; why do the search & match appear to find alpha chars in a number string?

I'm running the search below in IDLE, in Python 2.7 in a Windows Bus. 64-bit environment.
According to RegexBuddy, the search pattern ('patternalphaonly') should not produce a match against a string of digits.
I looked at "http://docs.python.org/howto/regex.html", but did not see anything there that would explain why the search and match appear to be successful in finding something matching the pattern.
Does anyone know what I'm doing wrong, or misunderstanding?
>>> import re
>>> numberstring = '3534543234543'
>>> patternalphaonly = re.compile('[a-zA-Z]*')
>>> result = patternalphaonly.search(numberstring)
>>> print result
<_sre.SRE_Match object at 0x02CEAD40>
>>> result = patternalphaonly.match(numberstring)
>>> print result
<_sre.SRE_Match object at 0x02CEAD40>
Thanks
The star operator (*) indicates zero or more repetitions. Your string has zero repetitions of an English alphabet letter because it is entirely numbers, which is perfectly valid when using the star (repeat zero times). Instead use the + operator, which signifies one or more repetitions. Example:
>>> n = "3534543234543"
>>> r1 = re.compile("[a-zA-Z]*")
>>> r1.match(n)
<_sre.SRE_Match object at 0x07D85720>
>>> r2 = re.compile("[a-zA-Z]+") #using the + operator to make sure we have at least one letter
>>> r2.match(n)
Helpful link on repetition operators.
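Inspecting the match object makes the zero-repetition case visible: the * version succeeds with an empty string at position 0. A small sketch, with variable names chosen for illustration:

```python
import re

n = "3534543234543"

star = re.search("[a-zA-Z]*", n)   # zero letters is an acceptable match
plus = re.search("[a-zA-Z]+", n)   # needs at least one letter

print(repr(star.group()), star.span())
print(plus)
```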
Everything eldarerathis says is true. However, with a variable named 'patternalphaonly', I would assume that the author wants to verify that a string is composed of alpha chars only. If this is true, then I would add start- and end-of-string anchors to the regex like so:
patternalphaonly = re.compile('^[a-zA-Z]+$')
result = patternalphaonly.search(numberstring)
Or, better yet, since this will only ever match at the beginning of the string, use the preferred match method:
patternalphaonly = re.compile('[a-zA-Z]+$')
result = patternalphaonly.match(numberstring)
(Which, as John Machin has pointed out, is evidently faster for some as-yet unexplained reason.)
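Wrapped in a helper, the anchored version might look like the sketch below. The function name is_alpha_only and the sample inputs are assumptions made for this example, not part of the original answer.

```python
import re

alpha_only = re.compile(r'[a-zA-Z]+$')

def is_alpha_only(text):
    """Return True if text consists only of ASCII letters (hypothetical helper)."""
    # match() anchors at the start; the trailing $ anchors at the end.
    return alpha_only.match(text) is not None

checks = {s: is_alpha_only(s) for s in ["abc", "abc123", "3534543234543", ""]}
print(checks)
```

Note that $ also matches just before a trailing newline, so "abc\n" would pass this check; \Z, or re.fullmatch in Python 3.4 and later, is stricter if that matters.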
