Combine multiple regex expressions in Python - python

For clarity, I was looking for a way to compile multiple regexes at once.
For simplicity, let's say that every expression is in the format (.*) something (.*).
There will be no more than 60 expressions to test.
As seen here, I finally wrote the following.
import re
re1 = r'(.*) is not (.*)'
re2 = r'(.*) is the same size as (.*)'
re3 = r'(.*) is a word, not (.*)'
re4 = r'(.*) is world know, not (.*)'
sentences = ["foo2 is a word, not bar2"]
for sentence in sentences:
    match = re.compile("(%s|%s|%s|%s)" % (re1, re2, re3, re4)).search(sentence)
    if match is not None:
        print(match.group(1))
        print(match.group(2))
        print(match.group(3))
As the regexes are separated by a pipe, I thought matching would stop automatically once a rule had matched.
Executing the code, I get
foo2 is a word, not bar2
None
None
But by swapping re3 and re1 in re.compile, i.e. match = re.compile("(%s|%s|%s|%s)" % (re3, re2, re1, re4)).search(sentence), I get
foo2 is a word, not bar2
foo2
bar2
As far as I can understand, the first rule is executed but not the others.
Can someone please point me in the right direction on this case?
Kind regards,

There are various issues with your example:
You are using a capturing group, so it gets the index 1 that you'd expect to reference the first group of the inner regexes. Use a non-capturing group (?:%s|%s|%s|%s) instead.
Group indexes increase even inside |. So with (?:(a)|(b)|(c)) you'd get:
>>> re.match(r'(?:(a)|(b)|(c))', 'a').groups()
('a', None, None)
>>> re.match(r'(?:(a)|(b)|(c))', 'b').groups()
(None, 'b', None)
>>> re.match(r'(?:(a)|(b)|(c))', 'c').groups()
(None, None, 'c')
It seems like you'd expect to have only one group 1 that returns either a, b or c depending on the branch... no, indexes are assigned in order from left to right without taking into account the grammar of the regex.
The third-party regex module does what you want if you use named groups, because the same name may be reused in different branches. If you want to use the built-in re module, you'll have to live with the fact that the numbering is not the same between different branches of the regex:
>>> import regex
>>> regex.match(r'(?:(?P<x>a)|(?P<x>b)|(?P<x>c))', 'a').groups()
('a',)
>>> regex.match(r'(?:(?P<x>a)|(?P<x>b)|(?P<x>c))', 'b').groups()
('b',)
>>> regex.match(r'(?:(?P<x>a)|(?P<x>b)|(?P<x>c))', 'c').groups()
('c',)
(Trying to use that regex with re will give an error for duplicated groups).

Giacomo answered the question.
However, I also suggest: 1) put the "compile" before the loop, 2) gather non empty groups in a list, 3) think about using (.+) instead of (.*) in re1,re2,etc.
rex = re.compile("%s|%s|%s|%s" % (re1, re2, re3, re4))
for sentence in sentences:
    match = rex.search(sentence)
    if match:
        # Only the groups of the branch that matched are not None.
        l = [g for g in match.groups() if g is not None]
        print(l[0], l[1])
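If a single combined pattern is not a hard requirement, another option is to compile each expression separately and stop at the first one that matches. The following is only a sketch under that assumption; the names PATTERNS, COMPILED and first_match are made up for illustration, and the patterns use (.+) as suggested above.

```python
import re

# Hypothetical list of rules; each one has exactly two capturing groups.
PATTERNS = [
    r'(.+) is not (.+)',
    r'(.+) is the same size as (.+)',
    r'(.+) is a word, not (.+)',
    r'(.+) is world know, not (.+)',
]
# Compile once, outside any loop.
COMPILED = [re.compile(p) for p in PATTERNS]


def first_match(sentence):
    """Return (left, right) from the first rule that matches, else None."""
    for rx in COMPILED:
        m = rx.search(sentence)
        if m:
            # Groups are always numbered 1 and 2 because each rule is separate.
            return m.group(1), m.group(2)
    return None


print(first_match("foo2 is a word, not bar2"))  # ('foo2', 'bar2')
```

Because every rule is compiled on its own, group numbering never shifts between rules, at the cost of one search per rule in the worst case.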

Related

python regex where a set of options can occur at most once in a list, in any order

I'm wondering if there's any way in Python or Perl to build a regex where you can define a set of options that can each appear at most once, in any order. So for example I would like a derivative of foo(?: [abc])*, where a, b, c could only appear once. So:
foo a b c
foo b c a
foo a b
foo b
would all be valid, but
foo b b
would not be
You may use a regex with a capture group and a negative lookahead.
For Perl, you can use this variant with forward referencing:
^foo((?!.*\1) [abc])+$
RegEx Details:
^: Start
foo: Match foo
(: Start a capture group #1
(?!.*\1): Negative lookahead to assert that we don't match what we have in capture group #1 anywhere in input
 [abc]: Match a space followed by a or b or c
)+: End capture group #1. Repeat this group 1+ times
$: End
As mentioned earlier, this regex is using a feature called Forward Referencing which is a back-reference to a group that appears later in the regex pattern. JGsoft, .NET, Java, Perl, PCRE, PHP, Delphi, and Ruby allow forward references but Python doesn't.
Here is a work-around of same regex for Python that doesn't use forward referencing:
^foo(?!.* ([abc]).*\1)(?: [abc])+$
Here we use a negative lookahead before repeated group to check and fail the match if there is any repeat of allowed substrings i.e. [abc].
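As a quick check, the Python work-around can be run against the sample inputs from the question. This is a minimal sketch; the name NO_REPEAT and the helper is_valid are chosen here for illustration only.

```python
import re

# The Python-compatible pattern from above: the lookahead fails the match
# if any allowed letter is followed later in the string by the same letter.
NO_REPEAT = re.compile(r'^foo(?!.* ([abc]).*\1)(?: [abc])+$')


def is_valid(text):
    return NO_REPEAT.match(text) is not None


for text in ["foo a b c", "foo b c a", "foo a b", "foo b", "foo b b"]:
    print(text, is_valid(text))
```

The first four inputs are accepted and foo b b is rejected, which matches the behaviour asked for in the question.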
You can use a negative lookahead to assert that no space-and-letter pair occurs a second time further to the right:
foo(?!(?: [abc])*( [abc])(?: [abc])*\1)(?: [abc])*
foo Match literally
(?! Negative lookahead
(?: [abc])* Match optional repetitions of a space and either a, b or c
( [abc]) Capture group 1, used to compare against with the backreference
(?: [abc])* Match again optional repetitions of a space and either a, b or c
\1 Backreference to group 1
) Close lookahead
(?: [abc])* Match optional repetitions of a space and either a, b or c
If you don't want to match only foo, you can change the quantifier to 1 or more (?: [abc])+
A variant in perl reusing the first subpattern using (?1) which refers to the capture group ([abc])
^foo ([abc])(?: (?!\1)((?1))(?: (?!\1|\2)(?1))?)?$
If it doesn't have to be a regex:
import collections

# python >=3.10
def is_a_match(sentence):
    words = sentence.split()
    return (
        (len(words) > 0)
        and (words[0] == 'foo')
        and (collections.Counter(words) <= collections.Counter(['foo', 'a', 'b', 'c']))
    )

# python <3.10
def is_a_match(sentence):
    words = sentence.split()
    return (
        (len(words) > 0)
        and (words[0] == 'foo')
        and not (collections.Counter(words) - collections.Counter(['foo', 'a', 'b', 'c']))
    )

# TESTING
# foo a b c  True
# foo b c a  True
# foo a b    True
# foo b      True
# foo b b    False
Or with a set and the walrus operator:
def is_a_match(sentence):
    words = sentence.split()
    return (
        (len(words) > 0)
        and (words[0] == 'foo')
        and (
            (s := set(words[1:])) <= set(['a', 'b', 'c'])
            and len(s) == len(words) - 1
        )
    )
You can do it using references to previously captured groups.
foo(?: ([abc]))?(?: (?!\1)([abc]))?(?: (?!\1|\2)([abc]))?$
This gets quite long with many options. Such a regex can be generated dynamically, if necessary.
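def match_sequence_without_repeats(options, separator):
    def prevent_previous(n):
        # The first option has no earlier groups to exclude.
        if n == 0:
            return ""
        # Join the backreferences with | so that any earlier capture is rejected,
        # e.g. (?!\1|\2), as in the hand-written regex above.
        groups = "|".join(rf"\{i}" for i in range(1, n + 1))
        return f"(?!{groups})"
    return "".join(
        f"(?:{separator}{prevent_previous(i)}([{options}]))?"
        for i in range(len(options))
    )
print(f"foo{match_sequence_without_repeats('abc', ' ')}$")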
Here is a modified version of anubhava's answer, using a backreference (which works in Python, and is easier to understand at least for me) instead of a forward reference.
Match using [abc] inside a capturing group, then check that the text matched by the capturing group does not appear again anywhere after it:
^foo(?:( [abc])(?!.*\1))+$
^: Start
foo: Match foo
(?:: Start non-capturing group (?:( [abc])(?!.*\1))
( [abc]): Capturing Group 1, matching a space followed by either a, b, or c
(?!.*\1): Negative lookahead, failing to match if the text matched by the first capturing group occurs after zero or more characters matched by .
)+: End non-capturing group and match it 1 or more times
$: End
I have assumed that the elements of the string can be in any order and appear any number of times. For example, 'a foo' should match and 'a foo b foo' should not.
You can do that with a series of alternations employing lookaheads, one for each substring of interest, but it becomes a bit of a dog's breakfast when there are many strings to consider. Let's suppose you wanted to match zero or one "foo"'s and/or zero or one "a"'s. You could use the following regular expression:
^(?:(?!.*\bfoo\b)|(?=(?:(?!\bfoo\b).)*\bfoo\b(?!(.*\bfoo\b))))(?:(?!.*\ba\b)|(?=(?:(?!\ba\b).)*\ba\b(?!(.*\ba\b))))
Start your engine!
This matches, for example, 'foofoo', 'aa' and afooa. If they are not to be matched remove the word breaks (\b).
Notice that this expression begins by asserting the start of the string (^) followed by two positive lookaheads, one for 'foo' and one for 'a'. To also check for, say, 'c' one would tack on
(?:(?!.*\bc\b)|(?=(?:(?!\bc\b).)*\bc\b(?!(.*\bc\b))))
which is the same as
(?:(?!.*\ba\b)|(?=(?:(?!\ba\b).)*\ba\b(?!(.*\ba\b))))
with \ba\b changed to \bc\b.
It would be nice to be able to use back-references but I don't see how that could be done.
By hovering over the regular expression in the link an explanation is provided for each element of the expression. (If this is not clear I am referring to the cursor.)
Note that
(?!\bfoo\b).
matches a character provided it does not begin the word 'foo'. Therefore
(?:(?!\bfoo\b).)*
matches a substring that does not contain 'foo' and does not end with 'f' followed by 'oo'.
Would I advocate this approach in practice, as opposed to using simple string methods? Let me ponder that.
If the order of the strings doesn't matter, and you want to make sure every string occurs only once, you can turn the list into a set in Python:
my_lst = ['a', 'a', 'b', 'c']
my_set = set(my_lst)
print(my_set)
# {'a', 'c', 'b'} (set ordering may differ)
There is not much to add to the above answers except here is a regex that does not use back or forward references. Instead it uses 3 separate negative lookahead assertions to ensure that the input does not contain 2 occurrences of either a or b or c. The regex also allows for liberal uses of spaces.
^foo(?![^a]*a[^a]*a)(?![^b]*b[^b]*b)(?![^c]*c[^c]*c)( +[abc])* *$
^ - Matches start of string
(?![^a]*a[^a]*a) - Negative lookahead assertion that what follows does not contain two occurrences of a
(?![^b]*b[^b]*b) - Negative lookahead assertion that what follows does not contain two occurrences of b
(?![^c]*c[^c]*c) - Negative lookahead assertion that what follows does not contain two occurrences of c
( +[abc])* - Matches 0 or more occurrences of: 1 or more spaces followed by an a or b or c
 * - Matches 0 or more occurrences of space
$ - Matches the end of the string
The regex looks "clunky" but is very straightforward. With input foo a b c the successful match is done in 35 steps and with input foo b b the unsuccessful match is done in 13 steps. This compares favorably with the other answers.
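Since this pattern is just one lookahead per option, it can be generated for any set of options. The sketch below assumes every option is a single plain letter (no regex metacharacters); the function name at_most_once_pattern is hypothetical.

```python
import re


def at_most_once_pattern(prefix, options):
    """Build the lookahead-based regex; options is a string of single letters."""
    # One negative lookahead per letter: fail if that letter occurs twice.
    lookaheads = "".join(f"(?![^{o}]*{o}[^{o}]*{o})" for o in options)
    return re.compile(f"^{prefix}{lookaheads}( +[{options}])* *$")


pattern = at_most_once_pattern("foo", "abc")
print(pattern.pattern)
for text in ["foo a b c", "foo  c   a ", "foo", "foo b b"]:
    print(repr(text), bool(pattern.match(text)))
```

Note that, like the hand-written regex, this accepts a bare foo and extra spaces; only a repeated letter (as in foo b b) is rejected.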

python: re.search doesn't start at beginning of string?

I'm working on a Flask API, which takes the following regex as an endpoint:
([0-9]*)((OK)|(BACK)|(X))*
That means I'm expecting a series of numbers, and the OK, BACK, X keywords multiple times in succession after the numbers.
I want to split this regex and do different stuff depending which capture groups were present.
My approach was the following:
endp = endp.encode('ASCII', 'ignore')
match = re.search(r"([0-9]*)", str(endp), re.I)
if match:
    n = match.groups()
    logging.info('nums: ' + str(n[0]))
match = re.search(r"((OK)|(BACK)|(X))*", str(endp), re.I)
if match:
    s1 = match.groups()
    for i in s1:
        logging.info('str: ' + str(i[0]))
Using the /12OK endpoint, getting the numbers works just fine, but for some reason capturing the rest of the keywords is unsuccessful. I tried reducing the second capture group to only
match = re.search(r"(OK)*", str(endp), re.I)
I constantly find the following in s1 (using the reduced regex):
(None,)
originally (with the rest of the keywords):
(None, None, None, None)
Which I suppose means the regex pattern does not match anything in my endp string (why does it have 4 Nones? 1 for each keyword, but what the 4th is there for?). I validated my endpoint (the regex against the same string too) with a regex validator, it seems fine to me. I understand that re.match is supposed to get matches from the beginning, therefore I used the re.search method, as the documentation points out it's supposed to match anywhere in the string.
What am I missing here? Please advise, I'm a beginner in the python world.
Indeed it is a bit surprising that searching with * returns None:
>>> re.search("(OK|BACK|X)*", u'/12OK').groups()
(None,)
But it's "correct", since * matches zero or more, and any pattern matches zero times in any string, that's why you see None. Searching with + somewhat solves it:
>>> re.search("(OK|BACK|X)+", u'/12OK').groups()
('OK',)
But searching with this pattern in /12OKOK still reports only one value: the + lets the group repeat, so the overall match is OKOK, but a repeated group keeps only its last capture. To find all occurrences you need to use re.findall:
>>> re.findall("(OK|BACK|X)", u'/12OKOK')
['OK', 'OK']
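To see where each search actually matches, it can help to print the span of the match as well as the groups. This is a small sketch using only the standard re module; the variable names are arbitrary.

```python
import re

# With *, the pattern succeeds immediately at position 0 with an empty match,
# so the group never participates and is reported as None.
star = re.search(r"(OK|BACK|X)*", "/12OK")
print(star.span(), repr(star.group(0)), star.groups())    # (0, 0) '' (None,)

# With +, at least one keyword is required, so the search moves on to "OK".
plus = re.search(r"(OK|BACK|X)+", "/12OK")
print(plus.span(), repr(plus.group(0)), plus.groups())    # (3, 5) 'OK' ('OK',)

# A repeated group keeps only its last capture, even though group(0) is "OKOK".
twice = re.search(r"(OK|BACK|X)+", "/12OKOK")
print(repr(twice.group(0)), twice.groups())               # 'OKOK' ('OK',)
```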
With those findings, your code would look as follows: (note that you don't need to write i[0] since i is already a string, unless you want to log only the first char of the string):
import logging
import re
endp = endp.encode('ASCII', 'ignore')
match = re.search(r"([0-9]+)", str(endp))
if match:
    n = match.groups()
    logging.info('nums: ' + str(n))
match = re.findall(r"(OK|BACK|X)", str(endp), re.I)
for i in match:
    logging.info('str: ' + str(i))
If you want to match at least ONE of the groups, use + instead of *.
>>> endp = '/12OK'
>>> match = re.search(r"((OK)|(BACK)|(X))+", str(endp), re.I)
>>> match.groups()
('OK', 'OK', None, None)
>>> endp = '/12X'
>>> match = re.search(r"((OK)|(BACK)|(X))+", str(endp), re.I)
>>> match.groups()
('X', None, None, 'X')
Notice that you have 4 matching groups in your expression, one for each pair of parentheses. The first match is the outer parenthesis and the second one is the first of the nested groups. In the second example, you still get the first match for the outer parenthesis and then the last one is the third of the nested ones.
"((OK)|(BACK)|(X))*" will search for OK or BACK or X, 0 or more times. Note that the * means 0 or more, not more than 0. The above expression should have a + at the end not * as + means 1 or more.
I think you're having two different issues, and their intersection is causing more confusion than either of them would cause on their own.
The first issue is that you're using repeated groups. Python's re library is not able to capture multiple matches when a group is repeated. Matching with a pattern like (X)+ against 'XXXX' will only capture a single 'X' in the first group even though the whole string will be matched. The regex library (which is not part of the standard library) can do multiple captures, though I'm not sure of the exact commands required.
The second issue is using the * repetition operator in your pattern. The pattern you show at the top of the question will match on an empty string. Obviously, none of the groups will capture anything in that situation (which may be why you're seeing a lot of None entries in your results). You probably need to modify your pattern so that it requires some minimal amount of valid text to count as a match. Using + instead of * might be one solution, but it's not clear to me exactly what you want to match against so I can't suggest a specific pattern.
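Putting both points together, one possible approach is to validate the whole endpoint first and then extract the keywords with findall. The sketch below is only an illustration: it assumes the endpoint arrives as a str with a leading slash, and the names ENDPOINT, KEYWORD and parse_endpoint are invented here.

```python
import re

# Whole-string shape: digits followed by any number of keywords.
ENDPOINT = re.compile(r'([0-9]*)((?:OK|BACK|X)*)', re.I)
# Single keyword, used to split the keyword tail into a list.
KEYWORD = re.compile(r'OK|BACK|X', re.I)


def parse_endpoint(endp):
    """Return (digits, [keywords]) or None if the endpoint has another shape."""
    m = ENDPOINT.fullmatch(endp.lstrip('/'))
    if m is None:
        return None
    digits, tail = m.groups()
    # findall returns every keyword, not just the last repetition.
    return digits, KEYWORD.findall(tail)


print(parse_endpoint('/12OKBACKX'))  # ('12', ['OK', 'BACK', 'X'])
print(parse_endpoint('/12OKOK'))     # ('12', ['OK', 'OK'])
print(parse_endpoint('/abc'))        # None
```

Using fullmatch avoids the empty-match surprise, because the pattern must account for the entire string rather than succeed on nothing at position 0.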

Python Regular Expression Groups

Why does this regex print ('c',) and ()?
I thought "([abc])+" === "([abc])([abc])([abc])..."
>>> import re
>>> m = re.match("([abc])+", "abc")
>>> print(m.groups())
('c',)
>>> m.groups(0)
('c',)
>>> m = re.match("[abc]+", "abc")
>>> m.groups()
()
>>> m.groups(0)
()
From documentation about groups
Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.
In the first regex ([abc])+, it is matching character a or b or c but will store only the last match
([abc])+
<----->
Matches a or b or c
Observe carefully. Capturing groups are surrounding only the character class
So, only one character from the matched character class can be stored in capturing group.
If you want to capture string abc in a capturing group use
([abc]+)
Above will find string composed of a or b or c and will store it in capturing group.
In second regex [abc]+, there are no capturing groups, so an empty result is shown.
When you put plus after the parenthesis you are matching one character at a time, many times. So it matches a then b then c, each time overwriting the previous. Move the + inside ([abc]+) to match one or more.
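The difference between the three forms can be checked side by side. This is a minimal sketch using the standard re module; the variable names are arbitrary.

```python
import re

# Quantifier outside the group: the group is overwritten on each repetition.
last_only = re.match(r"([abc])+", "abc").groups()
print(last_only)       # ('c',)

# Quantifier inside the group: the group captures the whole run.
whole_run = re.match(r"([abc]+)", "abc").groups()
print(whole_run)       # ('abc',)

# To get every character separately, use findall instead of a repeated group.
each_char = re.findall(r"[abc]", "abc")
print(each_char)       # ['a', 'b', 'c']
```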

How to print substring using RegEx in Python?

These are two texts:
1) 'provider:sipoutilp1.ym.ms'
2) 'provider:sipoutqtm.ym.ms'
I would like to print ilp when it reaches the first line and qtm when it reaches the second line.
This is my solution but it is not working.
RE_PROVIDER = re.compile(r'(?P<provider>\((ilp+|qtm+)')
Also, for the line below,
182938,DOMINICAN REPUBLIC-MOBILE
I would like to get DOMINICAN REPUBLIC. Can I use the same approach with re.compile?
Thank you for any help.
Your regex is not correct because it requires a literal open parenthesis (\() before your keywords, and there is no such character in your lines.
As a more general way, you can capture the alphabetical characters after sipout or provider:sipout.
>>> s1 = 'provider:sipoutilp1.ym.ms'
>>> s2 = 'provider:sipoutqtm.ym.ms'
>>> RE_PROVIDER = re.compile(r'(?P<provider>(?<=sipout)(ilp|qtm))')
>>> RE_PROVIDER.search(s1).groupdict()
{'provider': 'ilp'}
>>> RE_PROVIDER.search(s2).groupdict()
{'provider': 'qtm'}
(?<=sipout) is a positive look-behind, which makes the regex engine match the pattern only where it is preceded by sipout.
After edit:
If you want to match multiple strings with different structures, you have to use optional preceding patterns for matching your keywords, and since you cannot use variable-length patterns within a look-behind, you cannot use it for this purpose. So instead you can use a capture group trick.
You can define the optional preceding patterns within a non-capturing group and your keyword within a capture group, then after the match get the first captured group (group(1); group(0) is the whole match).
>>> RE_PROVIDER = re.compile(r'(?:sipout|\d+,)(?P<provider>(ilp|qtm|[A-Z\s]+))')
>>> RE_PROVIDER.search(s1).groupdict()
{'provider': 'ilp'}
>>> RE_PROVIDER.search(s2).groupdict()
{'provider': 'qtm'}
>>> s3 = "182938,DOMINICAN REPUBLIC-MOBILE"
>>> RE_PROVIDER.search(s3).groupdict()
{'provider': 'DOMINICAN REPUBLIC'}
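The fixed-width restriction on look-behind mentioned above can be checked directly: the standard re module refuses to compile a look-behind whose alternatives have different lengths. The sketch below shows that, followed by the non-capturing-group alternative; the names lookbehind_accepted and provider_re are chosen here for illustration.

```python
import re

try:
    # "sipout" is 6 characters wide but "\d+," has no fixed width.
    re.compile(r'(?<=sipout|\d+,)(ilp|qtm|[A-Z\s]+)')
    lookbehind_accepted = True
except re.error as exc:
    lookbehind_accepted = False
    print("re rejected the pattern:", exc)

# Same idea without look-behind: consume the prefix in a non-capturing group.
provider_re = re.compile(r'(?:sipout|\d+,)(?P<provider>ilp|qtm|[A-Z\s]+)')
lines = [
    'provider:sipoutilp1.ym.ms',
    'provider:sipoutqtm.ym.ms',
    '182938,DOMINICAN REPUBLIC-MOBILE',
]
providers = [provider_re.search(line).group('provider') for line in lines]
print(providers)  # ['ilp', 'qtm', 'DOMINICAN REPUBLIC']
```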
Note that groupdict doesn't work in this case because it will return

Regular expression to return all characters between two special characters

How would I go about using regex to return all characters between two brackets?
Here is an example:
foobar['infoNeededHere']ddd
needs to return infoNeededHere
I found a regex to do it between curly brackets but all attempts at making it work with square brackets have failed. Here is that regex: (?<={)[^}]*(?=}) and here is my attempt to hack it
(?<=[)[^}]*(?=])
Final Solution:
import re
s = "foobar['InfoNeeded'],"  # renamed from str to avoid shadowing the built-in
match = re.match(r"^.*\['(.*)'\].*$", s)
print(match.group(1))
If you're new to REG(ular) EX(pressions) you can learn about them at Python Docs. Or, if you want a gentler introduction, you can check out the HOWTO. They use Perl-style syntax.
Regex
The expression that you need is .*?\[(.*)\].*. The group that you want will be \1.
- .*?: . matches any character but a newline. * is a meta-character and means Repeat this 0 or more times. ? makes the * non-greedy, i.e., . will match up as few chars as possible before hitting a '['.
- \[: \ escapes special meta-characters, which in this case, is [. If we didn't do that, [ would do something very weird instead.
- (.*): Parenthesis 'groups' whatever is inside it and you can later retrieve the groups by their numeric IDs or names (if they're given one).
- \].*: You should know enough by now to know what this means.
Implementation
First, import the re module -- it's not a built-in -- to where-ever you want to use the expression.
Then, use re.search(regex_pattern, string_to_be_tested) to search for the pattern in the string to be tested. This will return a MatchObject which you can store in a temporary variable. You should then call its group() method and pass 1 as an argument (to see the 'Group 1' we captured using parentheses earlier). It should now look like:
>>> import re
>>> pat = r'.*?\[(.*)].*' #See Note at the bottom of the answer
>>> s = "foobar['infoNeededHere']ddd"
>>> match = re.search(pat, s)
>>> match.group(1)
"'infoNeededHere'"
An Alternative
You can also use findall() to find all the non-overlapping matches by modifying the regex to (?<=\[).+?(?=\]).
- (?<=\[): (?<=) is called a look-behind assertion and checks for an expression preceding the actual match.
- .+?: + is just like * except that it matches one or more repetitions. It is made non-greedy by ?.
- (?=\]): (?=) is a look-ahead assertion and checks for an expression following the match w/o capturing it.
Your code should now look like:
>>> import re
>>> pat = r'(?<=\[).+?(?=\])' #See Note at the bottom of the answer
>>> s = "foobar['infoNeededHere']ddd[andHere] [andOverHereToo[]"
>>> re.findall(pat, s)
["'infoNeededHere'", 'andHere', 'andOverHereToo[']
Note: Always use raw Python strings by adding an 'r' before the string (E.g.: r'blah blah blah').
10x for reading! I wrote this answer when there were no accepted ones yet, but by the time I finished it, 2 more came up and one got accepted. :(
^.*\['(.*)'\].*$ will match a line and capture what you want in a group.
You have to escape the [ and ] with \
The documentation at the rubular.com proof link will explain how the expression is formed.
If there's only one of these [.....] tokens per line, then you don't need to use regular expressions at all:
In [7]: mystring = "Bacon, [eggs], and spam"
In [8]: mystring[ mystring.find("[")+1 : mystring.find("]") ]
Out[8]: 'eggs'
If there's more than one of these per line, then you'll need to modify Jarrod's regex ^.*\['(.*)'\].*$ to match multiple times per line, and to be non greedy. (Use the .*? quantifier instead of the .* quantifier.)
In [15]: mystring = "[Bacon], [eggs], and [spam]."
In [16]: re.findall(r"\[(.*?)\]",mystring)
Out[16]: ['Bacon', 'eggs', 'spam']
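The effect of the non-greedy quantifier is easiest to see by running both forms on the same string. This is a small sketch; the negated character class in the third pattern is an alternative not mentioned above and is included only for comparison.

```python
import re

mystring = "[Bacon], [eggs], and [spam]."

# Greedy: .* runs to the last ']' on the line, so everything is one match.
greedy = re.findall(r"\[(.*)\]", mystring)
print(greedy)    # ['Bacon], [eggs], and [spam']

# Non-greedy: .*? stops at the first ']' after each '['.
lazy = re.findall(r"\[(.*?)\]", mystring)
print(lazy)      # ['Bacon', 'eggs', 'spam']

# Alternative: forbid ']' inside the brackets instead of relying on laziness.
negated = re.findall(r"\[([^\]]*)\]", mystring)
print(negated)   # ['Bacon', 'eggs', 'spam']
```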
