How to get all matching iterations for a capture group - python

I made this regular expression and use it with re.findall():
SELECT.*{(?:\[([a-zA-Z0-9 ]*)\]\.\[([a-zA-Z0-9 ]*)\]\.\[([a-zA-Z0-9 ]*)\][,]{0,1}){1,}}.*
to match these lists of strings:
["dimSales","Product Title","All"],
["test","Product Title","All"]
in this haystack:
SELECT NON EMPTY Hierarchize({DrilldownLevel({[dimSales].[Product Title].[All],[test].[Product Title].[All]},,,INCLUDE_CALC_MEMBERS)}) DIMENSION PROPERTIES PARENT_UNIQUE_NAME,HIERARCHY_UNIQUE_NAME ON COLUMNS FROM [Model] CELL PROPERTIES VALUE, FORMAT_STRING, LANGUAGE, BACK_COLOR, FORE_COLOR, FONT_FLAGS
My regex only matches the last iteration of the outer group:
["test","Product Title","All"]
What do I need to change so that re.findall() returns all iterations, not just the last iteration of the outer group?

import re
string = "SELECT NON EMPTY Hierarchize({DrilldownLevel({[dimSales].[Product Title].[All],[test].[Product Title].[All]},,,INCLUDE_CALC_MEMBERS)}) DIMENSION PROPERTIES PARENT_UNIQUE_NAME,HIERARCHY_UNIQUE_NAME ON COLUMNS FROM [Model] CELL PROPERTIES VALUE, FORMAT_STRING, LANGUAGE, BACK_COLOR, FORE_COLOR, FONT_FLAGS"
print(re.findall(r"(?:SELECT .+\({|,)\[([\w ]+)\]\.\[([\w ]+)\]\.\[([\w ]+)\](?=[^}]*})", string))
Output:
[('dimSales', 'Product Title', 'All'), ('test', 'Product Title', 'All')]
Explanation:
(?:SELECT .+\({|,) # non-capturing group, match SELECT followed by 1 or more of any character then ({ OR a comma
\[([\w ]+)\] # group 1, 1 or more word character or space inside square brackets
\. # a dot
\[([\w ]+)\] # group 2, 1 or more word character or space inside square brackets
\. # a dot
\[([\w ]+)\] # group 3, 1 or more word character or space inside square brackets
(?=[^}]*}) # positive lookahead, make sure a closing curly bracket follows with no other closing curly bracket before it
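A possible alternative, sketched under the assumption that the member list always sits in the first brace-delimited block whose content starts with a square bracket: isolate that block first, then let re.findall collect every triple. The names query, block and triple are illustrative only, and the query string is a shortened copy of the one in the question. The first lines also show why the original pattern fell short: a capturing group inside a repeated group keeps only its last iteration.

```python
import re

# A capturing group inside a repetition only remembers its final iteration.
m = re.search(r"(?:\[(\w+)\],?)+", "[a],[b]")
last_only = m.group(1)
print(last_only)

# Shortened version of the query from the question (assumption: same layout).
query = ("SELECT NON EMPTY Hierarchize({DrilldownLevel({[dimSales].[Product Title].[All],"
         "[test].[Product Title].[All]},,,INCLUDE_CALC_MEMBERS)}) ON COLUMNS FROM [Model]")

# Step 1: isolate the first {...} block whose content starts with '['.
block = re.search(r"\{(\[[^{}]*)\}", query)

# Step 2: collect every [a].[b].[c] triple inside that block.
triple = re.compile(r"\[([\w ]+)\]\.\[([\w ]+)\]\.\[([\w ]+)\]")
members = triple.findall(block.group(1)) if block else []
print(members)
```

Splitting the work into two simpler patterns avoids the lookahead, at the cost of assuming there is only one such block of interest.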

What about this regex:
(\[\"[^\"]*\",\"[^\"]*\",\"[^\"]*\"\],\s*\[\"[^\"]*\",\"[^\"]*\",\"[^\"]*\"\])
demo:
https://regex101.com/r/LaddaK/2/
Explanations:
parentheses () form your capturing group; they can be removed if not necessary
\[\"[^\"]*\",\"[^\"]*\",\"[^\"]*\"\] to match an open bracket literally followed by a double quote, 0 to N non double quote characters ([^\"]*) followed by a double quote and a comma. You might have to surround all commas with \s* if you want to accept space characters around them.
You repeat the pattern \"[^\"]*\" another 2 times to match the 3 words enclosed in brackets (you might have to change [^\"]* to \w* depending on your exact constraints on the strings).
You repeat the whole block \[\"[^\"]*\",\"[^\"]*\",\"[^\"]*\"\] after a ,\s* to accept the whole pattern made of 2 blocks of brackets.
Notes:
You might want to surround your regex with anchors (^ and $)
I don't know your exact constraints, but if you want to parse JSON or any other format with arbitrarily deep nested patterns that repeat themselves (ex: fractals), you should not use regex.
EDIT after change of requirements:
import re
inputStr = '[dimSales,Product Title,All], [test,Product Title,All]'
print(re.findall(r'\[(?:[a-zA-Z0-9 ]*)(?:,[a-zA-Z0-9 ]*)*\]', inputStr))
output:
['[dimSales,Product Title,All]', '[test,Product Title,All]']
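If the goal is lists of strings rather than the bracketed text, each match can be post-processed. This is only a sketch that reuses the pattern above; the names input_str, blocks and rows are illustrative, and it assumes the items themselves never contain commas.

```python
import re

input_str = '[dimSales,Product Title,All], [test,Product Title,All]'
blocks = re.findall(r'\[(?:[a-zA-Z0-9 ]*)(?:,[a-zA-Z0-9 ]*)*\]', input_str)

# Drop the enclosing brackets, then split each block on its commas.
rows = [block[1:-1].split(',') for block in blocks]
print(rows)
```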

Related

Check in python if self designed pattern matches

I have a pattern which looks like:
abc*_def(##)
and I want to check whether it matches some strings.
For example, it matches:
abc1_def23
abc10_def99
but does not match for:
abc9_def9
So the * stands for a number which can have one or more digits.
The # stands for a number with one digit
I want the value in the parenthesis as result
What would be the easiest and simplest solution for this problem?
Replace the * and # with regular expressions and then check whether they match?
Like this:
pattern = pattern.replace('*', '[0-9]*')
pattern = pattern.replace('#', '[0-9]')
pattern = '^' + pattern + '$'
Or program it myself?
Based on your requirements, I would go for a regex for the simple reason it's already available and tested, so it's easiest as you were asking.
The only "complicated" thing in your requirements is avoiding after def the same digit you have after abc.
This can be done with a negative lookahead that contains a backreference. The regex you can use is:
\babc(\d+)_def((?!\1)\d{1,2})\b
\b matches word boundaries; if you enclose your regex between two \b you will restrict your search to words, i.e. text delimited by spaces, punctuation etc.
abc matches the string abc
\d+ matches one or more digits; if there is an upper limit to the number of digits you want, it has to be \d{1,MAX} where MAX is your maximum number of digits; anyway \d stands for a digit and + indicates 1 or more repetitions
(\d+) is a group: the use of parentheses defines \d+ as something you want to "remember" inside your regex; it's somehow similar to defining a variable; in this case, (\d+) is your first group since you defined no other groups before it (i.e. to its left)
_def matches the string _def
(?!\1) is the part where you say "I don't want to repeat the first group after _def". \1 represents the first group, while (?!whatever) is a check that succeeds if what follows the current position is NOT (the negation is given by !) whatever you want to negate.
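A small sketch of how this regex behaves on the sample strings from the question. The helper name def_digits and the extra sample abc1_def12 are illustrative additions, not part of the original answer.

```python
import re

pattern = re.compile(r'\babc(\d+)_def((?!\1)\d{1,2})\b')

def def_digits(text):
    """Return the digits captured after _def, or None when there is no match."""
    m = pattern.search(text)
    return m.group(2) if m else None

for sample in ['abc1_def23', 'abc10_def99', 'abc9_def9', 'abc1_def12']:
    print(sample, '->', def_digits(sample))
```

Note that abc1_def12 is rejected as well: the lookahead only checks that the text after _def does not start with the first group, so it is stricter than "the two numbers differ".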
I had the hardest time getting this to work. The trick was the $
#!python3
import re
yourlist = ['abc1_def23', 'abc10_def99', 'abc9_def9', 'abc955_def9', 'abc_def9', 'abc9_def9288', 'abc49_def9234']
for item in yourlist:
    # The trailing $ forces exactly two digits between _def and the end of the string
    if re.search(r'abc[0-9]+_def[0-9][0-9]$', item):
        print(item, 'is a match')
You could match your pattern like:
abc\d+_def(\d{2})
abc Match literally
\d+ Match 1 or more digits
_ Match underscore
def Match literally
( Capturing group (Your 2 digits will be in this group)
\d{2} Match 2 digits
) Close capturing group
Then you could for example use search to check for a match and use .group(1) to get the digits between parenthesis.
You could also add word boundaries:
\babc\d+_def(\d{2})\b
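The idea from the question of translating the home-made pattern into a regex can be sketched as follows. The function names wildcard_to_regex and captured are hypothetical, and the sketch assumes that ( and ) in the pattern mark the part to return rather than literal characters.

```python
import re

def wildcard_to_regex(pattern):
    """Translate '*' to one or more digits and '#' to one digit; keep ( ) as a capture group."""
    parts = []
    for ch in pattern:
        if ch == '*':
            parts.append(r'\d+')
        elif ch == '#':
            parts.append(r'\d')
        elif ch in '()':
            parts.append(ch)             # becomes a regex capture group
        else:
            parts.append(re.escape(ch))  # every other character is literal
    return re.compile(''.join(parts))

regex = wildcard_to_regex('abc*_def(##)')

def captured(text):
    """Return the value inside the parentheses, or None if the text does not match."""
    m = regex.fullmatch(text)
    return m.group(1) if m else None

for sample in ['abc1_def23', 'abc10_def99', 'abc9_def9']:
    print(sample, '->', captured(sample))
```

Using fullmatch plays the role of the ^ and $ anchors, so no anchors need to be added to the generated pattern.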

Extract text between double square brackets in Python

If I have a string that may look like this:
"[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n"
How do I extract the categories and put them into a list?
I'm having a hard time getting the regular expression to work.
To expand on the explanation of the regex used by Avinash in his answer:
Category:([^\[\]]*) consists of several parts:
Category: which matches the text "Category:"
(...) is a capture group meaning roughly "the expression inside this group is a block that I want to extract"
[^...] is a negated set which means "do not match any characters in this set".
\[ and \] match "[" and "]" in the text respectively.
* means "match zero or more of the preceding regex defined items"
Where I have used ... to indicate that I removed some characters that were not important for the explanation.
So putting it all together, the regex does this:
Finds "Category:" and then matches any number (including zero) of following characters that are not the excluded characters "[" or "]". When it hits an excluded character it stops, and the text matched by the regex inside the (...) part is returned. So the regex does not actually look for "[[" or "]]" as you might expect, and it will match even if they are left out. You could force it to look for the double square brackets at the beginning and end by changing it to \[\[Category:([^\[\]]*)\]\].
For the second regex, Category:[^\[\]]*, the capture group (...) is excluded, so Python returns everything matched which includes "Category:".
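To illustrate the difference between the loose pattern and the stricter one that requires the double square brackets, here is a minimal sketch. The variable names and the bare test string are made up for the illustration.

```python
import re

text = "[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n"
loose = r'Category:([^\[\]]*)'
strict = r'\[\[Category:([^\[\]]*)\]\]'

loose_hits = re.findall(loose, text)
strict_hits = re.findall(strict, text)
print(loose_hits)
print(strict_hits)

# Without the surrounding brackets only the loose pattern still matches.
bare = "Category:Not a wiki link"
loose_bare = re.findall(loose, bare)
strict_bare = re.findall(strict, bare)
print(loose_bare)
print(strict_bare)
```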
Seems like you want something like this,
>>> str = "[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n"
>>> re.findall(r'Category:([^\[\]]*)', str)
['Political culture', 'Political ideologies']
>>> re.findall(r'Category:[^\[\]]*', str)
['Category:Political culture', 'Category:Political ideologies']
When the pattern contains a capturing group, re.findall returns only the strings matched by that group. If the pattern contains no capturing group, findall returns the whole matches in a list. In our case, Category: matches the literal string Category: and ([^\[\]]*) captures zero or more characters other than [ or ]. findall therefore returns the text captured by group 1.
Python code:
s = "[[Category:Political culture]]\n\n [[Category:Political ideologies]]\n\n"
cats = [line.strip().strip("[").strip("]") for line in s.splitlines() if line]
print(cats)
Output:
['Category:Political culture', 'Category:Political ideologies']

Regular expression in python: removing square brackets and parts of the phrase inside of the brackets

I have a Wikipedia dump and am struggling to find an appropriate regex pattern to remove the double square brackets in the expression. Here is an example of the expressions:
line = 'is the combination of the code names for Herbicide Orange (HO) and Agent LNX, one of the [[herbicide]]s and [[defoliant]]s used by the [[United States armed forces|U.S. military]] as part of its [[herbicidal warfare]] program, [[Operation Ranch Hand]], during the [[Vietnam War]] from 1961 to 1971.'
I am looking to remove all of the square brackets with the following conditions:
if there is no vertical separator within square bracket, remove the brackets.
Example : [[herbicide]]s becomes herbicides.
if there is a vertical separator within the bracket, remove the bracket and only use the phrase after the separator.
Example : [[United States armed forces|U.S. military]] becomes U.S. military.
I tried using re.match and re.search but was not able to arrive at the desired output.
Thank you for your help!
What you need is re.sub. Note that both square brackets and pipes are meta-characters so they need to be escaped.
re.sub(r'\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]', r'\1', line)
The \1 in the replacement string refers to what was matched inside the parentheses that do not start with ?: (i.e. the text you want to keep in either case).
There are two caveats. This allows for only a single pipe between the opening and closing brackets. If there is more than one, you would need to specify whether you want everything after the first or everything after the last one. The other caveat is that a single ] between the opening and closing brackets is not allowed. If that is a problem, there would still be a regex solution but it would be considerably more complicated.
For a full explanation of the pattern:
\[\[ # match two literal [
(?: # start optional non-capturing subpattern for pre-| text
[^\]|] # this looks a bit confusing but it is a negated character class
# allowing any character except for ] and |
* # zero or more of those
\| # a literal |
)? # end of subpattern; make it optional
( # start of capturing group 1 - the text you want to keep
[^\]|]* # the same character class as above
) # end of capturing group
\]\] # match two literal ]
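The first caveat can be illustrated with a link that contains two pipes. The two variants below are sketches of how the character classes could be relaxed, not part of the original answer; the names single_pipe, after_last and after_first are illustrative.

```python
import re

link = '[[a|b|c]] and [[herbicide]]s'

single_pipe = r'\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]'  # pattern from the answer
after_last = r'\[\[(?:[^\]]*\|)?([^\]|]*)\]\]'    # variant: keep text after the last |
after_first = r'\[\[(?:[^\]|]*\|)?([^\]]*)\]\]'   # variant: keep text after the first |

# With two pipes the original pattern does not match, so that link is left untouched.
print(re.sub(single_pipe, r'\1', link))
print(re.sub(after_last, r'\1', link))
print(re.sub(after_first, r'\1', link))
```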
>>> import re
>>> re.sub(r'\[\[(?:[^|\]]*\|)?([^\]]*)]]', r'\1', line)
'is the combination of the code names for Herbicide Orange (HO) and Agent LNX, one of the herbicides and defoliants used by the U.S. military as part of its herbicidal warfare program, Operation Ranch Hand, during the Vietnam War from 1961 to 1971.'
Explanation:
\[\[ # match two opening square brackets
(?: # start optional non-capturing group
[^|\]]* # match any number of characters that are not '|' or ']'
\| # match a '|'
)? # end optional non-capturing group
( # start capture group 1
[^\]]* # match any number of characters that are not ']'
) # end capture group 1
]] # match two closing square brackets
By replacing matches of the above regex with the contents of capture group 1, you will get the contents of the square brackets, but only what is after the separator if it is present.
You can use re.sub to just find everything between [[ and ]], and I think it's slightly easier to pass in a lambda function to do the replacement (to take everything from the last '|' onwards)
>>> import re
>>> re.sub(r'\[\[(.*?)\]\]', lambda L: L.group(1).rsplit('|', 1)[-1], line)
'is the combination of the code names for Herbicide Orange (HO) and Agent LNX, one of the herbicides and defoliants used by the U.S. military as part of its herbicidal warfare program, Operation Ranch Hand, during the Vietnam War from 1961 to 1971.'

What does "?:" mean in a Python regular expression?

Below is the Python regular expression. What does the ?: mean in it? What does the expression do overall? How does it match a MAC address such as "00:07:32:12:ac:de:ef"?
re.compile(([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5}), string)
(?:...) is a set of non-capturing grouping parentheses.
Normally, when you write (...) in a regex, it 'captures' the matched material. When you use the non-capturing version, it doesn't capture.
You can get at the various parts matched by the regex using the methods in the re package after the regex matches against a particular string.
How does this regular expression match MAC address "00:07:32:12:ac:de:ef"?
That's a different question from what you initially asked. However, the regex part is:
([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})
The outer most pair of parentheses are capturing parentheses; what they surround will be available when you use the regex against a string successfully.
The [\dA-Fa-f]{2} part matches a pair ({2}) of hex digits, each either a digit (\d) or one of the letters [A-Fa-f]. It is followed by a non-capturing group that matches a colon or dash (: or -) and another pair of hex digits, with that group repeated exactly 5 times.
import re
p = re.compile(r'([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})')
m = p.match("00:07:32:12:ac:de:ef")
if m:
    print(m.group(1))
The last line should print the string "00:07:32:12:ac:de" because that is the first set of 6 pairs of hex digits (out of the seven pairs in total in the string). In fact, the outer grouping parentheses are redundant and if omitted, m.group(0) would work (it works even with them). If you need to match 7 pairs, then you change the 5 into a 6. If you need to reject such strings instead, then you'd put anchors into the regex:
p = re.compile(r'^([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})$')
The caret ^ matches the start of string; the dollar $ matches the end of string. With the 5, that would not match your sample string. With 6 in place of 5, it would match your string.
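A quick check of the anchored patterns against the sample string, as a sketch; the names six_pairs and seven_pairs are illustrative.

```python
import re

six_pairs = re.compile(r'^([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})$')
seven_pairs = re.compile(r'^([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){6})$')

sample = "00:07:32:12:ac:de:ef"  # seven pairs, as in the question

print(bool(six_pairs.match(sample)))               # anchored, {5}: rejected
print(bool(seven_pairs.match(sample)))             # anchored, {6}: accepted
print(bool(six_pairs.match("00:07:32:12:ac:de")))  # a regular six-pair address
```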
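Using ?: as in (?:...) makes the group non-capturing, so it gets no backreference number. That matters when you refer to groups in a replacement and when you retrieve groups after a search; it does not change what the pattern matches.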
Your RegEx means
r"""
( # Match the regular expression below and capture its match into backreference number 1
[\dA-Fa-f] # Match a single character present in the list below
# A single digit 0..9
# A character in the range between “A” and “F”
# A character in the range between “a” and “f”
{2} # Exactly 2 times
(?: # Match the regular expression below
[:-] # Match a single character present in the list below
# The character “:”
# The character “-”
[\dA-Fa-f] # Match a single character present in the list below
# A single digit 0..9
# A character in the range between “A” and “F”
# A character in the range between “a” and “f”
{2} # Exactly 2 times
){5} # Exactly 5 times
)
"""
Hope this helps.
It does not change the search process. But it affects the retrieval of the group after the match has been found.
For example:
Text:
text = 'John Wick'
pattern to find:
regex = re.compile(r'John(?:\sWick)') # here we are looking for 'John' and also for a group (space + Wick). the ?: makes this group unretrievable.
When we print the match - nothing changes:
<re.Match object; span=(0, 9), match='John Wick'>
But if you try to manually address the group with (?:) syntax:
res = regex.finditer(text)
for i in res:
    print(i)
    print(i.group(1)) # here we are trying to retrieve (?:\sWick) group
it gives us an error:
IndexError: no such group
Also, look:
Python docs:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
the link to the re page in docs:
https://docs.python.org/3/library/re.html
(?:...) means a non-capturing group. The group will not be captured.
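The effect is easiest to see by comparing the same pattern with and without ?:. This is a minimal sketch using the 'John Wick' text from above; the variable names are illustrative.

```python
import re

text = 'John Wick'
capturing = re.compile(r'John(\sWick)')
non_capturing = re.compile(r'John(?:\sWick)')

# Both patterns match exactly the same text.
print(capturing.search(text).group(0))
print(non_capturing.search(text).group(0))

# Only the capturing version creates a numbered group.
print(capturing.groups, non_capturing.groups)

# findall returns the group content when there is one, otherwise the whole match.
print(capturing.findall(text))
print(non_capturing.findall(text))
```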

How can I omit words in the middle of a regular expression in Python?

I have a multi-line string like this:
"...Togo...Togo...Togo...ACTIVE..."
I want to get everything between the third 'Togo' and 'ACTIVE' and the remainder of the string. I am unable to create a regular expression that can do this. If I try something like
reg = "(Togo^[Togo]*?)(ACTIVE.*)"
nothing is captured (the first and last parentheses are needed for capturing groups).
This matches just the desired parts:
.*(Togo.*?)(ACTIVE.*)
The leading .* is greedy, so the following Togo matches at the last possible place. The captured part starts at the last Togo.
In your expression ^[Togo]*? doesn't do the right thing. ^ tries to match the beginning of a line and [Togo] matches any of the characters T, o or g. Even [^Togo] wouldn't work since this just matches any character that is not T, o or g.
reg = "Togo.*Togo.*Togo(.*)ACTIVE"
Alternatively, if you want to match the string between the last occurrence of Togo and the following occurrence of ACTIVE, and the number of Togo occurrences is not necessarily three, try this:
reg = "Togo(([^T]|T[^o]|To[^g]|Tog[^o])*T?.?.?)ACTIVE"
"(Togo(?:(?!Togo).)*)(ACTIVE.*)"
The square brackets in your regex form a character class that matches one of the characters 'T', 'o', or 'g'. The caret ('^') matches the beginning of the input if it's not in a character class, and it can be used inside the square brackets to invert the character class.
In my regex, after matching the word "Togo" I match one character at a time, but only after I check that it isn't the start of another instance of "Togo". (?!Togo) is called a negative lookahead.
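A short sketch of the negative-lookahead pattern in use. The sample text is invented, and re.DOTALL is an assumption added here because the question mentions a multi-line string: without it, . does not match newline characters.

```python
import re

# Invented sample: three occurrences of Togo spread over several lines.
text = "x Togo 1\nTogo 2\nTogo 3\nACTIVE rest\nmore"

pattern = re.compile(r'(Togo(?:(?!Togo).)*)(ACTIVE.*)', re.DOTALL)
m = pattern.search(text)

# Group 1 starts at the last Togo before ACTIVE; group 2 is ACTIVE and the remainder.
print(repr(m.group(1)))
print(repr(m.group(2)))
```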
