How to improve the performance of this regular expression? - python

Consider the regular expression
^(?:\s*(?:[\%\#].*)?\n)*\s*function\s
It is intended to match Octave/MATLAB script files that start with a function definition.
However, the performance of this regular expression is incredibly slow, and I'm not entirely sure why. For example, if I try evaluating it in Python,
>>> import re, time
>>> r = re.compile(r"^(?:\s*(?:[\%\#].*)?\n)*\s*function\s")
>>> t0=time.time(); r.match("\n"*15); print(time.time()-t0)
0.0178489685059
>>> t0=time.time(); r.match("\n"*20); print(time.time()-t0)
0.532235860825
>>> t0=time.time(); r.match("\n"*25); print(time.time()-t0)
17.1298530102
In English, that last line is saying that my regular expression takes 17 seconds to evaluate on a simple string containing 25 newline characters!
What is it about my regex that is making it so slow, and what could I do to fix it?
EDIT: To clarify, I would like my regex to match the following string containing comments:
# Hello world
function abc
including any amount of whitespace, but not
x = 10
function abc
because then the string does not start with "function". Note that comments can start with either "%" or with "#".

Replace your \s with [\t\f ] so that it cannot match newlines. Newlines should only be consumed by the \n at the end of the non-capturing group: (?:[\t\f ]*(?:[\%\#].*)?\n).
The problem is that you have three greedy consumers that can all match '\n' (\s*, (...\n)* and again \s*).
In your last timing example, the engine tries every way of splitting a prefix d of 25*'\n' into strings a, b and c (one for each consumer), where e is the part that is left unmatched, so that a+b+c == d and d+e == 25*'\n'.
Now count all combinations of a, b, c and e such that a+b+c+e == d+e == 25*'\n', allowing the empty string for one or more of the variables. It's too late for me to do the maths right now, but I bet the number is huge :D Your timings agree: each 5 extra newlines make the match roughly 30 times slower, which is about a doubling per newline, i.e. exponential growth.
By the way regex101 is a great site to try out regular expressions. They automatically break up expressions and explain their parts and they even provide a debugger.
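As a rough sketch of the [\t\f ] replacement suggested above (the name starts_with_function is made up for illustration, and the pattern assumes \n line endings), the rewritten expression can be checked against the two examples from the question and against a long run of newlines:

```python
import re
import time

# \s is replaced by [\t\f ] so that only the explicit \n can consume a newline.
# Each line can then be matched in exactly one way, so there is nothing to
# backtrack over when "function" is not found.
FUNCTION_FILE = re.compile(r"^(?:[\t\f ]*(?:[%#].*)?\n)*[\t\f ]*function\s")

def starts_with_function(source):
    """Return True if only blank or comment lines precede 'function'."""
    return FUNCTION_FILE.match(source) is not None

print(starts_with_function("# Hello world\nfunction abc"))  # True
print(starts_with_function("x = 10\nfunction abc"))         # False

# The input that was pathological for the original pattern, made much longer.
start = time.perf_counter()
starts_with_function("\n" * 1000)
print("1000 newlines took", time.perf_counter() - start, "seconds")
```

The original \s also matched \r and \v, so files with other line endings would need the character class adjusted accordingly.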

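To speed it up, you can use this regex:
p = re.compile(r"^\s*function\s", re.MULTILINE)
Since you're not actually capturing the lines starting with # or % anyway, you can use MULTILINE mode with p.search() and start matching on the line where the function keyword is found. Note that this no longer checks what comes before that line, so it also matches the "x = 10" example from the question.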

Related

Find first matching regex from list of regexes

Let's say I have a list of regexes like such (this is a simple example, the real code has more complex regexes):
regs = [r'apple', 'strawberry', r'pear', r'.*berry', r'fruit: [a-z]*']
I want to exactly match one of the regexes above (so ^regex$) and return the index. Additionally, I want to match the leftmost regex. So find('strawberry') should return 1 while find('blueberry') should return 3. I'm going to re-use the same set of regexes a lot, so precomputation is fine.
This is what I've coded, but it feels bad. The regex should be able to know which one got matched, and I feel this is terribly inefficient (keep in mind that the example above is simplified, and the real regexes are more complicated and in larger numbers):
import re
regs_compiled = [re.compile(reg) for reg in regs]
regs_combined = re.compile('^' +
                           '|'.join('(?:{})'.format(reg) for reg in regs) +
                           '$')
def find(s):
    if re.match(regs_combined, s):
        for i, reg in enumerate(regs_compiled):
            if re.match(reg, s):
                return i
    return -1
Is there a way to find out which subexpression(s) were used to match the regex without looping explicitly?
The only way to figure out which subexpression of the regular expression matched the string would be to use capturing groups for every one and then check which group is not None. But this would require that no subexpression uses capturing groups on its own.
E.g.
>>> regs_combined = re.compile('^' +
'|'.join('({})'.format(reg) for reg in regs) +
'$')
>>> m = re.match(regs_combined, 'strawberry')
>>> m.groups()
(None, 'strawberry', None, None, None)
>>> m.lastindex - 1
1
Other than that, the standard regular expression implementation does not provide further information. You could of course build your own engine that exposes that information, but apart from your very special use case, it’s difficult to make this practically work in other situations—which is probably why this is not provided by existing solutions.
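A minimal sketch of the capturing-group approach wrapped into the find() function from the question. It assumes Python 3.4+ (for fullmatch()) and, as noted above, that no subexpression contains capturing groups of its own:

```python
import re

regs = [r'apple', r'strawberry', r'pear', r'.*berry', r'fruit: [a-z]*']

# One capturing group per subexpression; group i+1 belongs to regs[i].
regs_combined = re.compile('|'.join('({})'.format(reg) for reg in regs))

def find(s):
    """Return the index of the leftmost regex that matches all of s, or -1."""
    m = regs_combined.fullmatch(s)
    # lastindex is the number of the only group that took part in the match.
    return m.lastindex - 1 if m else -1

print(find('strawberry'))  # 1
print(find('blueberry'))   # 3
print(find('kiwi'))        # -1
```

Because alternatives are tried from left to right, the first subexpression that can match the whole string wins, which gives the "leftmost regex" behaviour asked for.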

How to match the following regex python?

How to match the following with regex?
string1 = '1.0) The Ugly Duckling (TUD) (10 Dollars)'
string2 = '1.0) Little 1 Red Riding Hood (9.50 Dollars)'
I am trying the following:
groupsofmatches = re.match('(?P<booknumber>.*)\)([ \t]+)?(?P<item>.*)(\(.*\))?\(.*?((\d+)?(\.\d+)?).*([ \t]+)?Dollars(\))?', string1)
The issue is when I apply it to string2 it works fine, but when I apply the expression to string1, I am unable to get the "m.group(name)" because of the "(TUD)" part. I want to use a single expression that works for both strings.
I expect:
booknumber = 1.0
item = The Ugly Duckling (TUD)
Your problem is that .* matches greedily, and it may be consuming too much of the string. Printing all of the match groups will make this more obvious:
import re
string1 = '1.0) The Ugly Duckling (TUD) (10 Dollars)'
string2 = '1.0) Little 1 Red Riding Hood (9.50 Dollars)'
result = re.match(r'(.*?)\)([ \t]+)?(?P<item>.*)\(.*?(?P<dollaramount>(\d+)?(\.\d+)?).*([ \t]+)?Dollars(\))?', string1)
print(repr(result.groups()))
print(result.group('item'))
print(result.group('dollaramount'))
Changing them to .*? makes each match as short as possible.
This can be expensive in some RE engines, so you can also write e.g. \([^)]*\) to match a whole parenthesised group. If you're not processing a lot of text it probably doesn't matter.
By the way, you should really use raw strings (i.e. r'something') for regexps, to avoid surprising backslash behaviour and to give the reader a clue.
I see you had this group (\(.*?\))? which presumably was cutting out the (TUD), but if you actually want that in the title, just remove it.
You could impose some heavier restrictions on your repeated characters:
groupsofmatches = re.match(r'([^)]*)\)[ \t]*(?P<item>.*)\([^)]*?(?P<dollaramount>(?:\d+)?(?:\.\d+)?)[^)]*\)$', string1)
This will make sure that the numbers are taken from the last set of parentheses.
I would write it as:
num, name, value = re.match(r'(.+?)\) (.*?) \(([\d.]+) Dollars\)', s2).groups()
This is how I would do it:
(?P<booknumber>\d+(?:\.\d+)?)\)\s+(?P<item>.*?)\s+\(\d+(?:\.\d+)?\s+Dollars\)
I suggest using this regex pattern:
(?P<booknumber>[^)]*)\)\s+(?P<item>.*\S)\s+\((?!.*\()(?P<amount>\S+)\s+Dollars?\)
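To compare the suggestions, it helps to run one of them against both sample strings. The sketch below uses the last pattern above unchanged; the helper name parse_book is only illustrative:

```python
import re

BOOK = re.compile(
    r'(?P<booknumber>[^)]*)\)\s+(?P<item>.*\S)\s+'
    r'\((?!.*\()(?P<amount>\S+)\s+Dollars?\)'
)

def parse_book(line):
    """Return (booknumber, item, amount) or None if the line does not match."""
    m = BOOK.match(line)
    if m is None:
        return None
    return m.group('booknumber'), m.group('item'), m.group('amount')

string1 = '1.0) The Ugly Duckling (TUD) (10 Dollars)'
string2 = '1.0) Little 1 Red Riding Hood (9.50 Dollars)'
print(parse_book(string1))  # ('1.0', 'The Ugly Duckling (TUD)', '10')
print(parse_book(string2))  # ('1.0', 'Little 1 Red Riding Hood', '9.50')
```

The negative lookahead (?!.*\() makes sure the amount is taken from the last set of parentheses, so "(TUD)" stays part of the item.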

Python Regex match or potential match

Question:
How do I use Python's regular expression module (re) to determine if a match has been made, or that a potential match could be made?
Details:
I want a regex pattern which searches for a pattern of words in a correct order regardless of what's between them. I want a function which returns Yes if found, Maybe if a match could still be found or No if no match can be found. We are looking for the pattern One|....|Two|....|Three, here are some examples (Note the names, their count, or their order are not important, all I care about is the three words One, Two and Three, and the acceptable words in between are John, Malkovich, Stamos and Travolta).
Returns YES:
One|John|Malkovich|Two|John|Stamos|Three|John|Travolta
Returns YES:
One|John|Two|John|Three|John
Returns YES:
One|Two|Three
Returns MAYBE:
One|Two
Returns MAYBE:
One
Returns NO:
Three|Two|One
I understand the examples are not airtight, so here is what I have for the regex to get YES:
if re.match(r'One\|(John\||Malkovich\||Stamos\||Travolta\|)*Two\|(John\||Malkovich\||Stamos\||Travolta\|)*Three\|(John\||Malkovich\||Stamos\||Travolta\|)*', 'One|John|Malkovich|Two|John|Stamos|Three|John|Travolta') is not None:
    return 'Yes'
Obviously if the pattern is Three|Two|One the above will fail, and we can return No, but how do I check for the Maybe case? I thought about nesting the parentheses, like so (note, not tested)
if re.match(r'One\|((John\||Malkovich\||Stamos\||Travolta\|)*Two(\|(John\||Malkovich\||Stamos\||Travolta\|)*Three\|(John\||Malkovich\||Stamos\||Travolta\|)*)*)*', 'One|John|Malkovich|Two|John|Stamos|Three|John|Travolta') is not None:
    return 'Yes'
But I don't think that will do what I want it to do.
More Details:
I am not actually looking for Travoltas and Malkovichs (shocking, I know). I am matching against inotify event names such as IN_MOVE, IN_CREATE and IN_OPEN. I log them, get hundreds of them, and then look for a particular pattern such as IN_ACCESS...IN_OPEN....IN_MODIFY, but in some cases I don't want an IN_DELETE after the IN_OPEN and in others I do. Essentially, I'm pattern matching on inotify events to detect when text editors go wild and try to crush programmers' souls by doing a temporary-file-swap-save instead of just modifying the file. I don't want to free up those logs instantly, but I only want to hold on to them for as long as necessary. Maybe means don't erase the logs. Yes means do something, then erase the log. No means don't do anything but still erase the logs. As I will have multiple rules for each program (i.e. vim vs. gedit vs. emacs), I wanted to use a regular expression, which would be more human-readable and easier to write than creating a massive tree or, as user Joel suggested, just going over the words with a loop.
I wouldn't use a regex for this. But it's definitely possible:
regex = re.compile(
    r"""^                # Start of string
    (?:                  # Match...
     (?:                 # one of the following:
      One()              # One (use empty capturing group to indicate match)
     |                   # or
      \1Two()            # Two if One has matched previously
     |                   # or
      \1\2Three()        # Three if One and Two have matched previously
     |                   # or
      John               # any of the other strings
     |                   # etc.
      Malkovich
     |
      Stamos
     |
      Travolta
     )                   # End of alternation
     \|?                 # followed by optional separator
    )*                   # any number of repeats
    $                    # until the end of the string.""",
    re.VERBOSE)
Now you can check for YES and MAYBE by checking if you get a match at all:
>>> yes = regex.match("One|John|Malkovich|Two|John|Stamos|Three|John|Travolta")
>>> yes
<_sre.SRE_Match object at 0x0000000001F90620>
>>> maybe = regex.match("One|John|Malkovich|Two|John|Stamos")
>>> maybe
<_sre.SRE_Match object at 0x0000000001F904F0>
And you can differentiate between YES and MAYBE by checking whether all of the groups have participated in the match (i.e. are not None):
>>> yes.groups()
('', '', '')
>>> maybe.groups()
('', '', None)
And if the regex doesn't match at all, that's a NO for you:
>>> no = regex.match("Three|Two|One")
>>> no is None
True
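Putting the three checks together, a small wrapper along the following lines could return the Yes/Maybe/No answer directly. This is only a sketch: the function name classify is an assumption, and the pattern is the same one as above written on a single line.

```python
import re

# Same pattern as above, condensed. The empty groups 1-3 record whether
# One, Two and Three have been seen (in that order).
SEQUENCE = re.compile(
    r'^(?:(?:One()|\1Two()|\1\2Three()|John|Malkovich|Stamos|Travolta)\|?)*$'
)

def classify(events):
    """Return 'Yes', 'Maybe' or 'No' for a '|'-separated event string."""
    m = SEQUENCE.match(events)
    if m is None:
        return 'No'
    # All three marker groups participated: the full sequence was seen.
    if all(group is not None for group in m.groups()):
        return 'Yes'
    return 'Maybe'

print(classify('One|John|Malkovich|Two|John|Stamos|Three|John|Travolta'))  # Yes
print(classify('One|Two'))        # Maybe
print(classify('Three|Two|One'))  # No
```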
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski
Perhaps an algorithm like this would be more appropriate. Here is some pseudocode.
matchlist.current = matchlist.first()
for each word in input
    if word = matchlist.current
        matchlist.current = matchlist.next() // assuming next returns null if at end of list
    else if not allowedlist.contains(word)
        return 'No'
if matchlist.current = null // we hit the end of the list
    return 'Yes'
return 'Maybe'
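The pseudocode above translates fairly directly into Python. In this sketch the function name, the parameter names and the default word lists are assumptions taken from the example in the question:

```python
def classify_words(events,
                   required=('One', 'Two', 'Three'),
                   allowed=frozenset({'John', 'Malkovich', 'Stamos', 'Travolta'})):
    """Return 'Yes', 'Maybe' or 'No' for a '|'-separated event string."""
    position = 0  # index of the next required word we are waiting for
    for word in events.split('|'):
        if position < len(required) and word == required[position]:
            position += 1
        elif word not in allowed:
            # Neither the expected word nor an allowed filler word.
            return 'No'
    # Every required word was seen, in order.
    if position == len(required):
        return 'Yes'
    return 'Maybe'

print(classify_words('One|John|Two|John|Three|John'))  # Yes
print(classify_words('One|Two'))                       # Maybe
print(classify_words('Three|Two|One'))                 # No
```

Per-editor rules could then be expressed as different required/allowed arguments rather than as different regexes.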

Regular expression in python to capture multiple forms of badly formatted addresses

I have been tweaking a regular expression over several days to try to capture, with a single definition, several cases of inconsistent format in the address field of a database.
I am new to Python and regular expressions, and have gotten great feedback here on Stack Overflow. With my new knowledge, I built a regex that is getting close to the final result, but I still can't spot the problem.
import re
r1 = r"([\w\s+]+),?\s*\(?([\w\s+\\/]+)\)?\s*\(?([\w\s+\\/]+)\)?"
match1 = re.match(r1, 'caracas, venezuela')
match2 = re.match(r1, 'caracas (venezuela)')
match3 = re.match(r1, 'caracas, (venezuela) (df)')
group1 = match1.groups()
group2 = match2.groups()
group3 = match3.groups()
print(group1)
print(group2)
print(group3)
This thing should return 'caracas, venezuela' for groups 1 and 2, and 'caracas, venezuela, df' for group 3, instead, it returns:
('caracas', 'venezuel', 'a')
('caracas ', 'venezuel', 'a')
('caracas', 'venezuela', 'df')
The only perfect match is group 3. The other 2 are isolating the 'a' at the end, and the 2nd one has an extra space at the end of 'caracas '.
Thanks in advance for any insight.
Cheers!
Regular expressions might be overkill... what exactly is your problem statement? What do you need to capture?
Some things I caught (in order of appearance in your regex; sometimes it helps to read it out, left-to-right, English-style):
([\w\s+]+)
This says, "capture one or more characters, each of which is a word character, whitespace, or a literal +".
Do you really want to capture the spaces at the end of the city name? Also, you don't need (indeed, shouldn't have) the + inside your brackets [ ]: inside a character class it is a literal plus sign, not the 1-or-more symbol, and your regex is already matching one or more characters thanks to the outer +. I'd rewrite this part like this:
([\w\s]*\w)
Which will match eagerly up to the last alphanumeric character ("zero or more (letter or space) followed by a letter"). This does assume you have at least one character, but is better than your assumption that a single space would work as well.
Next you have:
,?\s*\(?
which looks okay to me except that it doesn't guarantee that you'll see either a comma or an open paren anymore. What about:
(?:,\s*\(|,\s*|\s*\()
which says, "non-capturingly match either (a comma with maybe some spaces and then an open paren) OR (a comma with maybe some spaces) OR (maybe some spaces and then an open paren)". This enforces that you must have either a comma or a paren or both.
Next you have the capturing expression, very similar to the first:
([\w\s+\\/]+)
Again, you don't want the spaces (or slashes in this case) at the end of the city name, and you don't want the + inside the [ ]:
([\w\s\\/]*\w)
The next expression is probably where you're getting your venezuel a problem; let's take a look:
\)?\s*\(?([\w\s+\\/]+)\)?
This is a rather long one, so let's break it down:
\)?\s*\(?
says to "maybe match a close paren, and then maybe some spaces, and then maybe an open paren". This is okay I guess, let's move on to the real problem:
([\w\s+\\/]+)
This capturing group MUST match at least one character. If the matcher sees "venezuela" at the end of your address, it will eagerly match the characters venezuel and then need to satisfy this final expression with what it has left, a. Try instead:
\)?\s*
Followed by making your entire final expression optional, and the outer expression non-capturing:
(?:\(?([\w\s+\\/]+)\)?)?
The final expression would be:
([\w\s]*\w)(?:,\s*\(|,\s*|\s*\()([\w\s\\/]*\w)\)?\s*(?:\(?([\w\s+\\/]+)\)?)?
Edit: fixed a problem that made the final group capture twice, once with the parens, once without. Now it should only capture the text inside the parens.
Testing it on your examples:
>>> re.match(r, 'caracas, venezuela').groups()
('caracas', 'venezuela', None)
>>> re.match(r, 'caracas (venezuela)').groups()
('caracas', 'venezuela', None)
>>> re.match(r, 'caracas, (venezuela) (df)').groups()
('caracas', 'venezuela', 'df')
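For completeness, here is a self-contained sketch that compiles the final expression from this answer and joins the captured parts back into the comma-separated form the question asks for. The helper name parse_address is hypothetical:

```python
import re

# The final expression from above, unchanged.
ADDRESS = re.compile(
    r"([\w\s]*\w)(?:,\s*\(|,\s*|\s*\()([\w\s\\/]*\w)\)?\s*(?:\(?([\w\s+\\/]+)\)?)?"
)

def parse_address(text):
    """Return the captured parts joined by ', ', or None if nothing matches."""
    m = ADDRESS.match(text)
    if m is None:
        return None
    # The third group is optional, so skip it when it did not participate.
    return ', '.join(part for part in m.groups() if part is not None)

for sample in ['caracas, venezuela', 'caracas (venezuela)', 'caracas, (venezuela) (df)']:
    print(parse_address(sample))
```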
Could you not just find all the words in the text?
E.g.:
>>> import re
>>> samples = ['caracas, venezuela','caracas (venezuela)','caracas, (venezuela) (df)']
>>>
>>> def find_words(text):
...     return re.findall(r'\w+', text)
...
>>> for sample in samples:
...     print(find_words(sample))
...
['caracas', 'venezuela']
['caracas', 'venezuela']
['caracas', 'venezuela', 'df']

Matching 2 regular expressions in Python

Is it possible to match 2 regular expressions in Python?
For instance, I have a use-case wherein I need to compare 2 expressions like this:
re.match('google\.com\/maps', 'google\.com\/maps2', re.IGNORECASE)
I would expect to be returned a RE object.
But obviously, Python expects a string as the second parameter.
Is there a way to achieve this, or is it a limitation of the way regex matching works?
Background: I have a list of regular expressions [r1, r2, r3, ...] that match a string and I need to find out which expression is the most specific match of the given string. The way I assumed I could make it work was by:
(1) matching r1 with r2.
(2) then match r2 with r1.
If both match, we have a 'tie'. If only (1) worked, r1 is a 'better' match than r2 and vice-versa.
I'd loop (1) and (2) over the entire list.
I admit it's a bit to wrap one's head around (mostly because my description is probably incoherent), but I'd really appreciate it if somebody could give me some insight into how I can achieve this. Thanks!
Outside of the syntax clarification on re.match, I think I am understanding that you are struggling with taking two or more unknown (user input) regex expressions and classifying which is a more 'specific' match against a string.
Recall for a moment that a Python regex really is a type of computer program. Most modern flavors, including Python's, are based on Perl. Perl's regexes have recursion, backtracking, and other features that defy trivial inspection. Indeed, a rogue regex can be used as a form of denial-of-service attack.
To see this on your own computer, try:
>>> re.match(r'^(a+)+$','a'*24+'!')
That takes about 1 second on my computer. Now increase the 24 in 'a'*24 to a slightly larger number, say 28. That takes a lot longer. Try 48... You will probably need to press CTRL+C now. The time increases exponentially with the number of a's.
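The growth is easy to observe with small inputs that still finish quickly. The sketch below times the nested pattern against ^a+$, which accepts exactly the same strings without the nested quantifier; the helper time_match and the chosen lengths are illustrative assumptions:

```python
import re
import time

def time_match(pattern, text):
    """Return (match result, elapsed seconds) for re.match(pattern, text)."""
    start = time.perf_counter()
    result = re.match(pattern, text)
    return result, time.perf_counter() - start

# Keep n small: the nested pattern roughly doubles its work per extra 'a'.
for n in (12, 14, 16, 18):
    text = 'a' * n + '!'
    _, nested = time_match(r'^(a+)+$', text)
    _, flat = time_match(r'^a+$', text)
    print('n=%2d  nested: %.6fs  flat: %.6fs' % (n, nested, flat))
```

Both patterns fail on these inputs because of the trailing '!', but only the nested one has to try every way of splitting the a's between the inner and outer + before giving up.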
You can read more about this issue in Russ Cox's wonderful paper 'Regular Expression Matching Can Be Simple And Fast'. Russ Cox is the Google engineer who built Google Code Search in 2006. As Cox observes, consider matching a regex of the form 'a?'*n + 'a'*n against the string 'a'*n with awk and Perl (or Python or PCRE or Java or PHP or ...). For a 100-character string, awk matches in about 200 microseconds, but Perl would require over 10^15 years because of exponential backtracking.
So the conclusion is: it depends! What do you mean by a more specific match? Look at some of Cox's regex simplification techniques in RE2. If your project is big enough to write your own libraries (or use RE2) and you are willing to restrict the regex grammar used (i.e., no backtracking or recursive forms), I think the answer is that you would classify 'a better match' in a variety of ways.
If you are looking for a simple way to state that (regex_3 < regex_1 < regex_2) when matched against some string using Python or Perl's regex language, I think that the answer is it is very very hard (i.e., this problem is NP Complete)
Edit
Everything I said above is true! However, here is a stab at sorting matching regular expressions based on one form of 'specific': How many edits to get from the regex to the string. The greater number of edits (or the higher the Levenshtein distance) the less 'specific' the regex is.
You be the judge if this works (I don't know what 'specific' means to you for your application):
import re
def ld(a, b):
    "Calculates the Levenshtein distance between a and b."
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a, b = b, a
        n, m = m, n
    current = list(range(n+1))
    for i in range(1, m+1):
        previous, current = current, [i]+[0]*n
        for j in range(1, n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)
    return current[n]
s = 'Mary had a little lamb'
d = {}
regs = [r'.*', r'Mary', r'lamb', r'little lamb', r'.*little lamb', r'\b\w+mb',
        r'Mary.*little lamb', r'.*[lL]ittle [Ll]amb', r'\blittle\b', s, r'little']
for reg in regs:
    m = re.search(reg, s)
    if m:
        print("'%s' matches '%s' with sub group '%s'" % (reg, s, m.group(0)))
        ld1 = ld(reg, m.group(0))
        ld2 = ld(m.group(0), s)
        score = max(ld1, ld2)
        print(" %i edits regex->match(0), %i edits match(0)->s" % (ld1, ld2))
        print(" score: ", score)
        d[reg] = score
        print()
    else:
        print("'%s' does not match '%s'" % (reg, s))
print(" ===== %s ===== === %s ===" % ('RegEx'.center(10), 'Score'.center(10)))
# Sort by score first, then by the regex text.
for key, value in sorted(d.items(), key=lambda kv: (kv[1], kv[0])):
    print(" %22s %5s" % (key, value))
The program is taking a list of regex's and matching against the string Mary had a little lamb.
Here is the sorted ranking from "most specific" to "least specific":
===== RegEx ===== === Score ===
Mary had a little lamb 0
Mary.*little lamb 7
.*little lamb 11
little lamb 11
.*[lL]ittle [Ll]amb 15
\blittle\b 16
little 16
Mary 18
\b\w+mb 18
lamb 18
.* 22
This is based on the (perhaps simplistic) assumption that the score reflects two things: a) the number of edits (the Levenshtein distance) needed to get from the regex itself to the matching substring, which is the result of wildcard expansions or replacements; b) the number of edits needed to get from the matching substring to the initial string. The score is the larger of the two.
As two simple examples:
.* (or .*.* or .*?.* etc) against any string is a large number of edits to get to the string, in fact equal to the string length. This is the max possible edits, the highest score, and the least 'specific' regex.
The regex of the string itself against the string is as specific as possible. No edits to change one to the other resulting in a 0 or lowest score.
As stated, this is simplistic. Anchors should increase specificity but they do not in this case. Very short strings don't work because the wildcard may be longer than the string.
Edit 2
I got anchor parsing to work pretty darn well using the undocumented sre_parse module in Python. Type >>> help(sre_parse) if you want to read more...
This is the goto worker module underlying the re module. It has been in every Python distribution since 2001 including all the P3k versions. It may go away, but I don't think it is likely...
Here is the revised listing:
import re
import sre_parse  # undocumented; deprecated since Python 3.11
def ld(a, b):
    "Calculates the Levenshtein distance between a and b."
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a, b = b, a
        n, m = m, n
    current = list(range(n+1))
    for i in range(1, m+1):
        previous, current = current, [i]+[0]*n
        for j in range(1, n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)
    return current[n]
s = 'Mary had a little lamb'
d = {}
regs = [r'.*', r'Mary', r'lamb', r'little lamb', r'.*little lamb', r'\b\w+mb',
        r'Mary.*little lamb', r'.*[lL]ittle [Ll]amb', r'\blittle\b', s, r'little',
        r'^.*lamb', r'.*.*.*b', r'.*?.*', r'.*\b[lL]ittle\b \b[Ll]amb',
        r'.*\blittle\b \blamb$', '^'+s+'$']
for reg in regs:
    m = re.search(reg, s)
    if m:
        ld1 = ld(reg, m.group(0))
        ld2 = ld(m.group(0), s)
        score = max(ld1, ld2)
        for t, v in sre_parse.parse(reg):
            if t == sre_parse.AT:  # anchor...
                # The original test here (v=='at_beginning' or 'at_end') was
                # always true, so every anchor is adjusted by 1 edit...
                score -= 1
                if v == sre_parse.AT_BOUNDARY:
                    # ...and \b by 2 more (3 in total); the scores listed
                    # below reflect this behaviour.
                    score -= 2
        d[reg] = score
    else:
        print("'%s' does not match '%s'" % (reg, s))
print()
print(" ===== %s ===== === %s ===" % ('RegEx'.center(15), 'Score'.center(10)))
# Sort by score first, then by the regex text.
for key, value in sorted(d.items(), key=lambda kv: (kv[1], kv[0])):
    print(" %27s %5s" % (key, value))
And the sorted regexes:
===== RegEx ===== === Score ===
Mary had a little lamb 0
^Mary had a little lamb$ 0
.*\blittle\b \blamb$ 6
Mary.*little lamb 7
.*\b[lL]ittle\b \b[Ll]amb 10
\blittle\b 10
.*little lamb 11
little lamb 11
.*[lL]ittle [Ll]amb 15
\b\w+mb 15
little 16
^.*lamb 17
Mary 18
lamb 18
.*.*.*b 21
.* 22
.*?.* 22
It depends on what kind of regular expressions you have; as @carrot-top suggests, if you actually aren't dealing with "regular expressions" in the CS sense, and instead have crazy extensions, then you are definitely out of luck.
However, if you do have traditional regular expressions, you might make a bit more progress. First, we could define what "more specific" means. Say R is a regular expression, and L(R) is the language generated by R. Then we might say R1 is more specific than R2 if L(R1) is a (strict) subset of L(R2) (L(R1) < L(R2)). That only gets us so far: in many cases, L(R1) is neither a subset nor a superset of L(R2), and so we might imagine that the two are somehow incomparable. An example, trying to match "mary had a little lamb", we might find two matching expressions: .*mary and lamb.*.
One non-ambiguous solution is to define specificity via implementation. For instance, convert your regular expression in a deterministic (implementation-defined) way to a DFA and simply count states. Unfortunately, this might be relatively opaque to a user.
Indeed, you seem to have an intuitive notion of how you want two regular expressions to compare, specificity-wise. Why not simply write down a definition of specificity, based on the syntax of regular expressions, that matches your intuition reasonably well?
Totally arbitrary rules follow:
Characters = 1.
Character ranges of n characters = n (and let's say \b = 5, because I'm not sure how you might choose to write it out long-hand).
Anchors are 5 each.
* divides its argument by 2.
+ divides its argument by 2, then adds 1.
. = -10.
Anyway, just food for thought, as the other answers do a good job of outlining some of the issues you're facing; hope it helps.
I don't think it's possible.
An alternative would be to try to calculate the number of strings of length n that the regular expression also matches. A regular expression that matches 1,000,000,000 strings of length 15 characters is less specific than one that matches only 10 strings of length 15 characters.
Of course, calculating the number of possible matches is not trivial unless the regular expressions are simple.
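For very small cases the count can simply be brute-forced, which at least makes the idea concrete. The sketch below is an illustration only: count_matches is a made-up helper, and it assumes a tiny alphabet and short lengths, since the number of candidate strings is len(alphabet) ** length.

```python
import re
from itertools import product

def count_matches(pattern, alphabet, length):
    """Count the strings of the given length over alphabet that fully match."""
    compiled = re.compile(pattern)
    return sum(
        1
        for chars in product(alphabet, repeat=length)
        if compiled.fullmatch(''.join(chars))
    )

# Over the alphabet {a, b} and length 3 there are 8 candidate strings.
print(count_matches(r'.*', 'ab', 3))   # 8: matches everything, least specific
print(count_matches(r'a.*', 'ab', 3))  # 4
print(count_matches(r'aba', 'ab', 3))  # 1: most specific
```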
Option 1:
Since users are supplying the regexes, perhaps ask them to also submit some test strings which they think are illustrative of their regex's specificity. (i.e. that show their regex is more specific than a competitor's regex.) Collect all the user's submitted test strings, and then test all the regexes against the complete set of test strings.
To design a good regex, the author must have put thought into what strings match and don't match their regex, so it should be easy for them to supply good test strings.
Option 2:
You might try a Monte Carlo approach: Starting with the string that both regexes match, write a generator which generates mutations of that string (permute characters, add/remove characters, etc.) If both regexes match or don't match the same way for each mutation, then the regexes "probably tie". If one matches a mutation that the other doesn't, and vice versa, then they "absolutely tie".
But if one matches a strict superset of mutations then it is "probably less specific" than the other.
The verdict after a large number of mutations may not always be correct, but may be reasonable.
Option 3:
Use ipermute or pyParsing's invert to generate strings which match each regex. This will only work on regexes that use a limited subset of regex syntax.
I think you could do it by looking at which regex gives the longest match:
>>> m = re.match(r'google\.com\/maps','google.com/maps/hello')
>>> len(m.group(0))
15
>>> m = re.match(r'google\.com\/maps2','google.com/maps/hello')
>>> print (m)
None
>>> m = re.match(r'google\.com\/maps','google.com/maps2/hello')
>>> len(m.group(0))
15
>>> m = re.match(r'google\.com\/maps2','google.com/maps2/hello')
>>> len(m.group(0))
16
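The same idea can be wrapped in a small helper that picks the pattern with the longest match. This is a sketch under the assumption that "longest match wins" is an acceptable definition of specificity; the name most_specific is hypothetical:

```python
import re

def most_specific(patterns, text):
    """Return the pattern whose match at the start of text is longest."""
    best, best_length = None, -1
    for pattern in patterns:
        m = re.match(pattern, text)
        if m is not None and len(m.group(0)) > best_length:
            best, best_length = pattern, len(m.group(0))
    return best  # None if no pattern matched

patterns = [r'google\.com/maps', r'google\.com/maps2']
print(most_specific(patterns, 'google.com/maps/hello'))   # google\.com/maps
print(most_specific(patterns, 'google.com/maps2/hello'))  # google\.com/maps2
```

One limitation of this measure is that a catch-all pattern such as .* always produces the longest possible match, so it would be ranked as the most specific.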
re.match('google\.com\/maps', 'google\.com\/maps2', re.IGNORECASE)
The second argument to re.match() above is a plain string, not a regex, and that's why it's not working: the pattern says to match a period after google, but the string has a backslash there instead. What you need to do is double up the backslashes in the regex that's being used as the pattern, so that they match the literal backslashes in the text of the other regex:
def compare_regexes(regex1, regex2):
    """returns regex2 if regex1 is 'smaller' than regex2
    returns regex1 if they are the same
    returns regex1 if regex1 is 'bigger' than regex2
    otherwise returns None"""
    regex1_mod = regex1.replace('\\', '\\\\')
    regex2_mod = regex2.replace('\\', '\\\\')
    if regex1 == regex2:
        return regex1
    if re.match(regex1_mod, regex2):
        return regex2
    if re.match(regex2_mod, regex1):
        return regex1
You can change the returns to whatever suits your needs best. Oh, and make sure you are using raw strings with re. r'like this, for example'
Is it possible to match 2 regular expressions in Python?
That certainly is possible. Use parenthetical match groups joined by | for alternation. If you arrange the parenthetical match groups from most specific regex to least specific, the rank in the tuple returned by m.groups() will show how specific your match is. You can also use named groups to name how specific your match is, such as s10 for a very specific match and s0 for a not so specific one.
>>> s1='google.com/maps2text'
>>> s2='I forgot my goggles at the house'
>>> s3='blah blah blah'
>>> m1=re.match(r'(^google\.com\/maps\dtext$)|(.*go[a-z]+)',s1)
>>> m2=re.match(r'(^google\.com\/maps\dtext$)|(.*go[a-z]+)',s2)
>>> m1.groups()
('google.com/maps2text', None)
>>> m2.groups()
(None, 'I forgot my goggles')
>>> patt=re.compile(r'(?P<s10>^google\.com\/maps\dtext$)|'
... r'(?P<s5>.*go[a-z]+)|(?P<s0>[a-z]+)')
>>> m3=patt.match(s3)
>>> m3.groups()
(None, None, 'blah')
>>> m3.groupdict()
{'s10': None, 's0': 'blah', 's5': None}
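Instead of scanning groupdict() for the entry that is not None, the match object's lastgroup attribute gives the name of the group that matched. A minimal sketch, assuming the alternatives contain no nested capturing groups (the function name specificity is made up):

```python
import re

# Alternatives ordered from most specific (s10) to least specific (s0).
PATTERNS = re.compile(
    r'(?P<s10>^google\.com/maps\dtext$)|(?P<s5>.*go[a-z]+)|(?P<s0>[a-z]+)'
)

def specificity(text):
    """Return the name of the alternative that matched, or None."""
    m = PATTERNS.match(text)
    return m.lastgroup if m else None

print(specificity('google.com/maps2text'))              # s10
print(specificity('I forgot my goggles at the house'))  # s5
print(specificity('blah blah blah'))                    # s0
```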
If you do not know ahead of time which regex is more specific, this is a much harder problem to solve. You want to have a look at this paper covering security of regex matches against file system names.
I realize that this is a non-solution, but as there is no unambiguous way to tell which is the "most specific match", certainly when it depends on what your users "meant", the easiest thing to do would be to ask them to provide their own priority. For example just by putting the regexes in the right order. Then you can simply take the first one that matches. If you expect the users to be comfortable with regular expressions anyway, this is maybe not too much to ask?
