Hi, I need a regexp that captures ALL the groups that match a text.
I have the following text:
"abc"
and this regexp
compiled = re.compile("(?P<group1>abc)|(?P<group2>abc)")
compiled.findall("abc")
but the output is the following:
[("abc", "")]
The output I expect is the following:
[("abc", "abc")] # one match per capturing group that matches
EDITED:
What I need to achieve
I have around 500 groups of things, and I want to assign a text to every group it belongs to, so I created one capturing group, with its own regexp, per group. This way I can run one big regexp once and check which capturing groups matched to know which groups apply.
for example, I have ingredients of desserts, and want to know to which desserts a text may belong:
test = re.compile('(?P<dessert1>(?:apple))|(?P<dessert2>(?:apple|banana))|(?P<others>(?:other))')
then if I have the string
apple
I would want to get the groups "dessert1" and "dessert2".
I can't run a separate regexp for each dessert, for performance reasons.
You might use a positive lookahead with one of the capturing groups
(?=(?P<group1>abc))(?P<group2>\1)
import re
regex = r"(?=(?P<group1>abc))(?P<group2>(?P=group1))"
test_str = "abc"
print(re.findall(regex, test_str))
Output
[('abc', 'abc')]
Or, to be more explicit, replace the numbered backreference \1 with (?P=group1), which matches the same text as the capturing group named group1:
(?=(?P<group1>abc))(?P<group2>(?P=group1))
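For the dessert use case in the edited question, the same lookahead idea can be stretched so that every category is tested at the same position. What follows is only a sketch, not something the answer above proposes: each named group sits inside its own lookahead with an empty alternative, so a category that does not match is skipped instead of failing the whole pattern. It only tests categories at the start of the text (via match), and the helper name matching_categories is made up for illustration.

```python
import re

# Each category is wrapped as (?:(?=(?P<name>...))|): try the lookahead,
# otherwise match the empty string so the next category is still tested.
pattern = re.compile(
    r"(?:(?=(?P<dessert1>apple))|)"
    r"(?:(?=(?P<dessert2>apple|banana))|)"
    r"(?:(?=(?P<others>other))|)"
)


def matching_categories(text):
    """Return the names of all groups that match at the start of text."""
    m = pattern.match(text)
    # The pattern can always match the empty string, so m is never None.
    return sorted(name for name, value in m.groupdict().items() if value is not None)


print(matching_categories("apple"))   # ['dessert1', 'dessert2']
print(matching_categories("banana"))  # ['dessert2']
print(matching_categories("other"))   # ['others']
```

Whether this is fast enough with around 500 categories is something to measure rather than assume.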
Related
So I'm trying to build a regex that searches for an occurrence of digits followed by a whitespace character, followed by one of many possible keywords (represented by test_cases in this case).
The first regex below does that successfully however I'm confused as to why it works. My understanding of capturing groups is that they allow you to put quantifiers on the group and also assist in specifying what data is returned. Why does this example need to be in the non-capturing group for it to be processed correctly?
test_string = "251 to 300 Vitality"
test_cases = ["Damage", "Pods", "Chance", "Vitality"]
print(re.findall(r'\d+\s(?:{})$'.format('|'.join(test_cases)), test_string)) # works
print(re.findall(r'\d+\s({})$'.format('|'.join(test_cases)), test_string)) # doesn't work
print(re.findall(r'\d+\s{}$'.format('|'.join(test_cases)), test_string)) # doesn't work
Output:
['300 Vitality']
['Vitality']
['Vitality']
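One way to see what is going on is to build the three patterns separately and compare what re.findall hands back for each. This is a sketch using the question's own data: with no capturing group, findall returns the whole match; with one capturing group it returns only that group; and with no group at all, the | splits the entire pattern into top-level alternatives.

```python
import re

test_string = "251 to 300 Vitality"
test_cases = ["Damage", "Pods", "Chance", "Vitality"]
alternatives = "|".join(test_cases)

non_capturing = r"\d+\s(?:{})$".format(alternatives)
capturing = r"\d+\s({})$".format(alternatives)
ungrouped = r"\d+\s{}$".format(alternatives)

# No capturing group: findall returns the whole match.
print(re.findall(non_capturing, test_string))  # ['300 Vitality']

# One capturing group: findall returns only the group's text,
# although the overall match is still '300 Vitality'.
print(re.findall(capturing, test_string))  # ['Vitality']
print([m.group(0) for m in re.finditer(capturing, test_string)])  # ['300 Vitality']

# No group at all: "|" has the lowest precedence, so the pattern reads as
#   \d+\sDamage  OR  Pods  OR  Chance  OR  Vitality$
print(ungrouped)
print(re.findall(ungrouped, test_string))  # ['Vitality']
```

So the second and third patterns both print ['Vitality'], but for different reasons: the second matches the digits and only reports the group, while the third never requires the digits before "Vitality" at all.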
I have a string s = '10000'.
Using only Python's re.findall, I need to count how many times 0\d0 occurs in the string s.
For example: for the string s = '10000' it should return 2
explanation:
the first occurrence is the 000 at indices 1-3 (1[000]0), while the second occurrence is the 000 at indices 2-4 (10[000])
I just need the number of occurrences; I am not interested in the matched text itself
I've tried the following regex statements:
re.findall(r'(0\d0)', s) #output: ['000']
re.findall(r'(0\d0)*', s) #output: ['', '', '000', '', '', '']
Finally, how can I make this regex generic, so that it matches any digit, then any digit (which may be the same one), then the first digit again?
How to get all possible occurrences?
The regex
As I mentioned in my comment, you can use the following pattern:
(?=(0\d0))
How it works:
(?=...) is a positive lookahead: it asserts that what follows matches, without consuming any characters. Because nothing is consumed, the pattern is tried at every position in the string; a regular match would instead resume after the characters it consumed and miss overlapping occurrences.
(0\d0) is a capture group matching 0, then any digit, then 0
The code
Your code becomes:
re.findall(r'(?=(0\d0))', s)
The result is:
['000', '000']
The Python documentation for re.findall states the following:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
This means that our matches are the results of capture group 1 rather than the full match as many would expect.
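Since the question only asks for the number of occurrences, the capture group can also be dropped and the matches counted. A minimal sketch, using the s from the question: without a group, findall returns one empty string per matching position, which is enough for counting.

```python
import re

s = '10000'

# Each lookahead match is zero-width, so findall yields '' per position.
count = len(re.findall(r'(?=0\d0)', s))
print(count)  # 2

# finditer shows where each overlapping occurrence starts.
starts = [m.start() for m in re.finditer(r'(?=0\d0)', s)]
print(starts)  # [1, 2]
```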
How to generalize the pattern?
The regex
You can use the following pattern:
(\d)\d\1
How this works:
(\d) captures any digit into capture group 1
\d matches any digit
\1 is a backreference that matches the same text as most recently matched by capture group 1
The code
Your code becomes:
x = re.findall(r'(?=((\d)\d\2))', s)
print([n[0] for n in x])
Note: The code above has two capture groups, so we need to change the backreference to \2 to match correctly. Since we now have two capture groups, we will get tuples as the documentation states and can use list comprehension to get the expected results.
The result is:
['000', '000']
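To see the generalized pattern do more than the '10000' case, here is a small sketch on a made-up input (the string '1213141' is an assumption chosen for illustration). It uses re.finditer and reads group 1 directly, which avoids the list of tuples.

```python
import re

pattern = re.compile(r'(?=((\d)\d\2))')

s = '1213141'  # hypothetical input with three overlapping occurrences
print([m.group(1) for m in pattern.finditer(s)])  # ['121', '131', '141']

# The original example still gives the same two occurrences.
print([m.group(1) for m in pattern.finditer('10000')])  # ['000', '000']
```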
I am trying to find multiple values in a large string.
For example, I want to capture each currency= value (ignoring entries where nothing is given), and then the number on the first following line that starts with [#
[Namex]
Name=jsdjsk
value=dfdfdf
currency=dollor
market=sfdsf
endvalue=xyz
[#1234#feagbdvsdf]
[Namey]
Name=jsdjsk
value=dfdfdf
currency=
endvalue=xyz
[#5777#feagbdvsdf]
[Namez]
Name=jsdjsk
currency=euro
market=sfdsf
[#98766#feagbdvsdf]
I am able to find the first value for currency using the line below, but unable to get the next value.
re.findall('currency=(.+)', s)
I am expecting below results:
dollor, 1234
euro, 98766
You can use re.findall with a pattern that captures the desired values in two groups:
re.findall(r'^currency=([^\n]+).*?\[#(\d+)', s, re.M | re.S)
This returns:
[('dollor', '1234'), ('euro', '98766')]
Demo: https://ideone.com/vTynYJ
Another option using re.findall and 2 capturing groups is to match currency= with its value, then lazily match whole lines until there is a line that starts with [#, and then capture the following 1+ digits in the second group.
This approach uses the multiline flag only
^currency=(\S+)(?:\n.*)*?\n\[#(\d+)
For example:
re.findall(r"^currency=(\S+)(?:\n.*)*?\n\[#(\d+)", s, re.MULTILINE)
Result
[('dollor', '1234'), ('euro', '98766')]
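Both answers can be checked against the sample from the question. The sketch below assumes s holds exactly the text shown in the question (the variable names pattern_dotall and pattern_multiline are only illustrative) and runs the two patterns side by side.

```python
import re

s = """[Namex]
Name=jsdjsk
value=dfdfdf
currency=dollor
market=sfdsf
endvalue=xyz
[#1234#feagbdvsdf]
[Namey]
Name=jsdjsk
value=dfdfdf
currency=
endvalue=xyz
[#5777#feagbdvsdf]
[Namez]
Name=jsdjsk
currency=euro
market=sfdsf
[#98766#feagbdvsdf]
"""

# First answer: re.S lets .*? run across newlines up to the next "[#".
pattern_dotall = re.compile(r'^currency=([^\n]+).*?\[#(\d+)', re.M | re.S)

# Second answer: consume whole lines lazily until a line starts with "[#".
pattern_multiline = re.compile(r'^currency=(\S+)(?:\n.*)*?\n\[#(\d+)', re.M)

print(pattern_dotall.findall(s))     # [('dollor', '1234'), ('euro', '98766')]
print(pattern_multiline.findall(s))  # [('dollor', '1234'), ('euro', '98766')]
```

The empty currency= entry is skipped in both cases because [^\n]+ and \S+ each require at least one character after the equals sign.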
I would like to capture n words surrounding a word x without whitespaces. I need a capture group for each word. I can achieve this in the following way (here words after x):
import regex
n = 2
x = 'beef tomato chicken trump Madonna'
right_word = '\s+(\S+)'
regex_right = r'^\S*{}\s*'.format(n*right_word)
m_right = regex.search(regex_right, x)
print(m_right.groups())
so if x = 'beef tomato chicken trump Madonna' and n = 2, then regex_right = '^\S*\s+(\S+)\s+(\S+)\s*', and I get two capture groups containing 'tomato' and 'chicken'. However, if n = 5 I capture nothing, which is not the behavior I was looking for. For n = 5 I want to capture all the words to the right of 'beef'.
I have tried using the greedy quantifier
regex_right = r'^\S*(\s+\S+){,n}\s*'
but I only get a single group (the last word) no matter how many matches I get (and I get the whitespace as well).
I finally tried using regex.findall, but I cannot limit it to n words; it seems I have to specify a number of characters instead?
Can anyone help?
Wiktor helped me (see below), thanks. However, I have an additional problem:
if
x = 'beef, tomato, chicken, trump Madonna'
I cannot figure out how to capture the words without the commas. I do not want groups such as 'tomato,'
You did not match all those words with the first approach because the pattern did not match the input string. You need to make the right_word pattern optional by enclosing it with (?:...)?:
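import re
x = 'beef tomato chicken trump Madonna'
n = 5
# Each following word is optional, so the pattern still matches when fewer than n words remain
right_word = r'(?:\s+(\S+))?'
regex_right = r'^\S*{}'.format(n * right_word)
print(regex_right)
m_right = re.search(regex_right, x)
if m_right:
    print(m_right.groups())  # groups that did not take part in the match are None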
See the Python demo.
The second approach will only work with the PyPI regex module, because Python's re does not keep repeated captures: once a quantified capturing group matches a substring again within the same match, its value is overwritten.
>>> right_word = '\s+(\S+)'
>>> n = 5
>>> regex_right = r'^\S*(?:\s+(\S+)){{1,{0}}}'.format(n)
>>> result = [x.captures(1) for x in regex.finditer(regex_right, "beef tomato chicken trump Madonna")]
>>> result
[['tomato', 'chicken', 'trump', 'Madonna']]
>>> print(regex_right)
^\S*(?:\s+(\S+)){1,5}
Note that ^\S*(?:\s+(\S+)){1,5} has capturing group #1 inside a non-capturing group that is quantified with the {1,5} limiting quantifier. Since PyPI regex keeps track of all values captured with repeated capturing groups, they are all accessible via .captures(1) here. You can test this feature with a .NET regex tester.
Your approach is right. However, Python's re module can't do what you're asking for: each time the capturing group matches again, its previous content is replaced. That is why the capturing group only returns the last word captured.
You can easily match n words, but you can't capture them separately without writing each capture group explicitly.
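For the follow-up with commas, a stdlib-only sketch is given below. It first shows the overwriting behaviour described above, then swaps in a different technique from the ones in the answers: tokenizing with re.findall on runs of characters that are neither whitespace nor commas, and slicing the resulting list. The assumption is that "word" means any such run and that the first word is the anchor to skip.

```python
import re

x = 'beef, tomato, chicken, trump Madonna'
n = 5

# With re, a repeated capturing group keeps only its last capture.
m = re.search(r'^\S*(?:\s+(\S+)){1,5}', x)
print(m.groups())  # ('Madonna',)

# Alternative: collect every run of non-space, non-comma characters,
# skip the first word and keep at most n of the words after it.
words = re.findall(r'[^\s,]+', x)
print(words[1:n + 1])  # ['tomato', 'chicken', 'trump', 'Madonna']
```

With n = 2 the same slice gives ['tomato', 'chicken'], and a slice never fails when fewer than n words are available.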
I'm using python and the re module to parse some strings and extract a 4 digits code associated with a prefix. Here are 2 examples of strings I would have to parse:
str1 = "random stuff tokenA1234 more stuff"
str2 = "whatever here tokenB5678 tokenA0123 and more there"
tokenA and tokenB are the prefixes and 1234, 5678, 0123 are the digits I need to grab. token A and B are just an example here. The prefix can be something like an address http://domain.com/ (tokenA) or a string like Id: ('[Ii]d:?\s?') (tokenB).
My regex looks like:
re.findall('.*?(?:tokenA([0-9]{4})|tokenB([0-9]{4})).*?', str1)
When parsing the 2 strings above, I get:
[('1234','')]
[('','5678'),('0123','')]
And I'd like to simply get ['1234'] or ['5678','0123'] instead of a tuple.
How can I modify the regex to achieve that? Thanks in advance.
You get tuples as a result since you have more than 1 capturing group in your regex. See re.findall reference:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
So, the solution is to use only one capturing group.
Both alternatives share the ([0-9]{4}) part and differ only in the token, so put the tokens into a non-capturing group with an alternation operator between them, and keep a single capturing group for the digits:
(?:tokenA|tokenB)([0-9]{4})
^^^^^^^^^^^^^^^^^
The regex means:
(?:tokenA|tokenB) - match but not capture tokenA or tokenB
([0-9]{4}) - match and capture into Group 1 four digits
IDEONE demo:
import re
s = "tokenA1234tokenB34567"
print(re.findall(r'(?:tokenA|tokenB)([0-9]{4})', s))
Result: ['1234', '3456']
Simply do this:
re.findall(r"token[AB](\d{4})", s)
[AB] is a character class, so it matches either A or B.
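The question mentions that the real prefixes may be a URL or a small pattern such as [Ii]d:?\s?. A sketch of how the single-group approach might be built from a list of prefixes follows; the helper build_pattern and the sample sentence are made up for illustration. Literal prefixes are passed through re.escape, while prefixes that are already regex fragments are used as they are.

```python
import re

str1 = "random stuff tokenA1234 more stuff"
str2 = "whatever here tokenB5678 tokenA0123 and more there"


def build_pattern(prefixes):
    """Join prefix sub-patterns into one non-capturing alternation."""
    return re.compile(r"(?:{})([0-9]{{4}})".format("|".join(prefixes)))


token_pattern = build_pattern(["tokenA", "tokenB"])
print(token_pattern.findall(str1))  # ['1234']
print(token_pattern.findall(str2))  # ['5678', '0123']

# Escape literal prefixes; pass regex prefixes unchanged.
mixed = build_pattern([re.escape("http://domain.com/"), r"[Ii]d:?\s?"])
print(mixed.findall("see http://domain.com/1234 or Id: 5678"))  # ['1234', '5678']
```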