Regex get all possible occurrence in Python - python

I have a string s = '10000',
Using only Python's re.findall, I need to count how many times the pattern 0\d0 occurs in the string s.
For example: for the string s = '10000' it should return 2
explanation:
the first occurrence is the 000 at indices 1-3 (1[000]0), while the second occurrence is the 000 at indices 2-4 (10[000])
I just need how many occurrences and not interested in the occurrence patterns
I've tried the following regex statements:
re.findall(r'(0\d0)', s) #output: ['000']
re.findall(r'(0\d0)*', s) #output: ['', '', '000', '', '', '']
Finally, if I want to make this regex generic to fetch any number then
any_number_included_my_number then the_same_number_again, how can I do it?

How to get all possible occurrences?
The regex
As I mentioned in my comment, you can use the following pattern:
(?=(0\d0))
How it works:
(?=...) is a positive lookahead: it asserts that what follows matches, without consuming any characters. Because nothing is consumed, the pattern is tried at every position in the string; a normal match would instead resume searching after the characters it consumed, which is why overlapping occurrences are missed.
(0\d0) is a capture group matching 0, then any digit, then 0
The code
Your code becomes:
re.findall(r'(?=(0\d0))', s)
The result is:
['000', '000']
The documentation for Python's re.findall states the following:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
This means that our matches are the results of capture group 1 rather than the full match as many would expect.
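Since the question only asks for the number of occurrences, a minimal sketch of the counting step (the variable names are just illustrative) is to take the length of the list that findall returns:

```python
import re

s = '10000'

# The lookahead consumes nothing, so every start position is tried
# and overlapping occurrences are all reported via capture group 1.
overlapping = re.findall(r'(?=(0\d0))', s)
count = len(overlapping)

print(overlapping)  # ['000', '000']
print(count)        # 2
```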
How to generalize the pattern?
The regex
You can use the following pattern:
(\d)\d\1
How this works:
(\d) captures any digit into capture group 1
\d matches any digit
\1 is a backreference that matches the same text as most recently matched by capture group 1
The code
Your code becomes:
x = re.findall(r'(?=((\d)\d\2))', s)
print([n[0] for n in x])
Note: The code above has two capture groups, so we need to change the backreference to \2 to match correctly. Since we now have two capture groups, we will get tuples as the documentation states and can use list comprehension to get the expected results.
The result is:
['000', '000']
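If you would rather not unpack tuples, one alternative sketch is to use re.finditer and ask each match object for group 1 directly. The helper name overlapping_matches is hypothetical, chosen only for this example:

```python
import re

def overlapping_matches(s):
    """Return every overlapping 'digit, any digit, same digit' substring."""
    # Group 1 is the whole three-character window, group 2 is the first digit,
    # and \2 requires the third character to repeat that first digit.
    return [m.group(1) for m in re.finditer(r'(?=((\d)\d\2))', s)]

print(overlapping_matches('10000'))  # ['000', '000']
print(overlapping_matches('12131'))  # ['121', '131']
```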

Related

Detecting alphanumeric/numeric values in python string

I'm trying to extract tokens/part of tokens that have numeric/alphanumeric characters that have a length greater than 8 from the text.
Example:
text = 'https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8'
The expected output would be :
59800512 510557XXXXXX2302 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg 69i57j0i22i30l8j0i390 4672j0j7
I have tried using the regular expression : ((\d+)|([A-Za-z]+\d)[\dA-Za-z]*) based on the answer Python Alphanumeric Regex. I got the following results :
[match for match in re.findall(r"((\d+)|([A-Za-z]+\d)[\dA-Za-z]*)",text)]
Output :
[('59800512', '59800512', ''),
('510557', '510557', ''),
('XXXXXX2302', '', 'XXXXXX2'),
('1601371803', '1601371803', ''),
('NhLw6NlR0EksRWkLddEo7NiEvrg', '', 'NhLw6'),
('69', '69', ''),
('i57j0i22i30l8j0i390', '', 'i5'),
('4672', '4672', ''),
('j0j7', '', 'j0'),
('8', '8', '')]
I'm getting a tuple of matching groups for each matching token.
It is possible to filter these tuples again. But I'm trying to make the code as efficient and pythonic as possible.
Could anyone suggest a solution? It need not be based on regular expressions.
Thanks in advance
Edit :
I expect alphanumeric values of length equal to or greater than 8
You get the tuples in the result, as re.findall returns the values of the capture groups.
But you can omit the capture groups and change the pattern to a single match, matching at least a digit between chars A-Z a-z and assert a minimum of 8 characters using a positive lookahead.
\b(?=[A-Za-z0-9]{8})[A-Za-z]*\d[A-Za-z\d]*\b
\b A word boundary
(?=[A-Za-z0-9]{8}) Positive lookahead, assert at least 8 occurrences of any of the listed ranges
[A-Za-z]* Optionally match chars A-Z a-z
\d Match a digit
[A-Za-z\d]* Optionally match chars A-Z a-z or digits
\b A word boundary
See a regex demo or a Python demo.
import re
from pprint import pprint
pattern = r"\b(?=[A-Za-z0-9]{8})[A-Za-z]*\d[A-Za-z\d]*\b"
s = "https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8"
pprint(re.findall(pattern, s))
Output
['59800512',
'510557XXXXXX2302',
'1601371803',
'NhLw6NlR0EksRWkLddEo7NiEvrg',
'69i57j0i22i30l8j0i390',
'4672j0j7']
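To see which part of the pattern rejects which kind of token, here is a small sketch that runs it against a few made-up tokens (the sample list is an assumption for illustration, not taken from the question):

```python
import re

pattern = re.compile(r"\b(?=[A-Za-z0-9]{8})[A-Za-z]*\d[A-Za-z\d]*\b")

# Hypothetical tokens: too short, no digit, exactly 8 with a digit,
# 7 digits, 8 digits.
samples = ["abc1", "abcdefgh", "abcdefg1", "1234567", "12345678"]

# The lookahead rejects anything shorter than 8 characters;
# the mandatory \d rejects tokens made only of letters.
matched = [token for token in samples if pattern.fullmatch(token)]
print(matched)  # ['abcdefg1', '12345678']
```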
I came up with:
\b[A-Za-z]{,7}\d[A-Za-z\d]{7,}\b
See an online demo
\b - Word boundary.
[A-Za-z]{,7} - 0-7 alphabetic chars.
\d - A single digit.
[A-Za-z\d]{7,} - 7+ times an alphanumeric char.
\b - Word boundary.
Some sample code:
import re
s = "https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8"
result = re.findall(r'\b[A-Za-z]{,7}\d[A-Za-z\d]{7,}\b', s)
print(result)
Prints:
['59800512', '510557XXXXXX2302', '1601371803', 'NhLw6NlR0EksRWkLddEo7NiEvrg', '69i57j0i22i30l8j0i390', '4672j0j7']
You could opt to match case-insensitive with:
(?i)\b[a-z]{,7}\d[a-z\d]{7,}\b
Although the selected answer returns the required output, it is not generic, and it fails to match specific cases (e.g., s = "thisword2H2g2d").
For a more generic regex that works for all combinations of alphanumeric values:
result = re.findall(r"(\d+[A-Za-z\d]+\d*)|([A-Za-z]+[\d]+[A-Za-z\d]*)", s)
See the demo here.
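The failing case can be checked directly. A quick sketch comparing the two patterns posted earlier on that example string (the variable names are illustrative):

```python
import re

s = "thisword2H2g2d"

# Pattern with the length check in a lookahead.
lookahead_pattern = r"\b(?=[A-Za-z0-9]{8})[A-Za-z]*\d[A-Za-z\d]*\b"
# Pattern that allows at most 7 letters before the first digit.
bounded_pattern = r"\b[A-Za-z]{,7}\d[A-Za-z\d]{7,}\b"

lookahead_result = re.findall(lookahead_pattern, s)
bounded_result = re.findall(bounded_pattern, s)

print(lookahead_result)  # ['thisword2H2g2d']
print(bounded_result)    # []
```

The bounded pattern fails here because eight letters precede the first digit, while the lookahead version places no limit on the leading letters.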

Issue with python regex query: why does (-)? not capture a match? [duplicate]

This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 2 years ago.
I want to capture numbers and number ranges from a list: ["op.15", "Op.16-17", "Op16,17,18"]
match = re.compile(r"\d+[-]?\d+").findall(text)
Gets the correct result
op.15 ['15']
Op.16-17 ['16-17']
Op16,17,18 ['16', '17', '18']
but this doesn't work:
match = re.compile(r"\d+(-)?\d+").findall(text)
op.15 ['']
Op.16-17 ['-']
Op16,17,18 ['', '', '']
What's the issue here? I want to add in alternative values to -, such as "to" i.e. -|to which doesn't work with [].
The documentation for findall in re module says
Return a list of all non-overlapping matches in the string. If one or
more capturing groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group. Empty matches are included in the result.
In your first regex you don't provide any capture groups, so you get back a list of non-overlapping matches, i.e. one or more digits, followed by zero or one hyphen, followed by one or more digits.
In your second regex you changed [ ], which means "match any character in this set", to ( ), which is a capture group. So now you are saying: match one or more digits, then match and capture zero or one hyphen, then match one or more digits.
Since the pattern now contains a capture group, per the documentation you are no longer returned the full non-overlapping match; you are returned only the capture group, i.e. whatever is inside the ( ), which is empty if there was no hyphen and - if there was one.
To fix the issue, use a non-capturing group: r"\d+(?:-)?\d+".
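The same non-capturing group also accommodates the alternation the question asks about. A minimal sketch, assuming "to" is the only extra separator needed; the last input string is a made-up example:

```python
import re

# Non-capturing alternation: "-" or "to" may join the two numbers,
# and findall still returns the whole match.
pattern = re.compile(r"\d+(?:-|to)?\d+")

texts = ["op.15", "Op.16-17", "Op16,17,18", "Op.19to20"]  # last item is hypothetical
results = {text: pattern.findall(text) for text in texts}

for text, found in results.items():
    print(text, found)
# op.15 ['15']
# Op.16-17 ['16-17']
# Op16,17,18 ['16', '17', '18']
# Op.19to20 ['19to20']
```

Like the original pattern, this one needs at least two digits in total, so a lone single-digit number would not match.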

python regex: capturing group within OR

I'm using python and the re module to parse some strings and extract a 4 digits code associated with a prefix. Here are 2 examples of strings I would have to parse:
str1 = "random stuff tokenA1234 more stuff"
str2 = "whatever here tokenB5678 tokenA0123 and more there"
tokenA and tokenB are the prefixes and 1234, 5678, 0123 are the digits I need to grab. token A and B are just an example here. The prefix can be something like an address http://domain.com/ (tokenA) or a string like Id: ('[Ii]d:?\s?') (tokenB).
My regex looks like:
re.findall('.*?(?:tokenA([0-9]{4})|tokenB([0-9]{4})).*?', str1)
When parsing the 2 strings above, I get:
[('1234','')]
[('','5678'),('0123','')]
And I'd like to simply get ['1234'] or ['5678','0123'] instead of a tuple.
How can I modify the regex to achieve that? Thanks in advance.
You get tuples as a result since you have more than 1 capturing group in your regex. See re.findall reference:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
So, the solution is to use only one capturing group.
Since you have tokens in your regex, you can use them inside a group. Since only tokens differ, ([0-9]{4}) part is common for both, just use an alternation operator between tokens put into a non-capturing group:
(?:tokenA|tokenB)([0-9]{4})
^^^^^^^^^^^^^^^^^
The regex means:
(?:tokenA|tokenB) - match but not capture tokenA or tokenB
([0-9]{4}) - match and capture into Group 1 four digits
IDEONE demo:
import re
s = "tokenA1234tokenB34567"
print(re.findall(r'(?:tokenA|tokenB)([0-9]{4})', s))
Result: ['1234', '3456']
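Applied to the two strings from the question, the single-capturing-group pattern should give plain lists rather than tuples. A short sketch:

```python
import re

str1 = "random stuff tokenA1234 more stuff"
str2 = "whatever here tokenB5678 tokenA0123 and more there"

# Only the four digits are captured; the prefix alternation is non-capturing.
pattern = re.compile(r'(?:tokenA|tokenB)([0-9]{4})')

codes1 = pattern.findall(str1)
codes2 = pattern.findall(str2)

print(codes1)  # ['1234']
print(codes2)  # ['5678', '0123']
```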
Simply do this:
re.findall(r"token[AB](\d{4})", s)
Put A and B inside a character class, [AB], so that it matches either A or B.

regular expression: may or may not contain a string

I want to match a floating number that might be in the form of 0.1234567 or 1.23e-5
Here is my python code:
import re
def main():
    m2 = re.findall(r'\d{1,4}:[-+]?\d+\.\d+(e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345')
    for svs_elem in m2:
        print(svs_elem)
main()
It prints blank... Based on my test, the problem was in (e-\d+)? part.
See emphasis:
Help on function findall in module re:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
You have a group, so it’s returned instead of the entire match, but it doesn’t match in any of your cases. Make it non-capturing with (?:e-\d+):
m2 = re.findall(r'\d{1,4}:[-+]?\d+\.\d+(?:e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345')
Use a non-capturing group. The matches are succeeding, but the output is the contents of the optional groups that don't actually match.
See the output when your input includes something like e-6:
>>> re.findall(r'\d{1,4}:[-+]?\d+\.\d+(e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345e-6')
['', '', 'e-6']
With a non-capturing group ((?:...)):
>>> re.findall(r'\d{1,4}:[-+]?\d+\.\d+(?:e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345e-6')
['1:0.00003', '3:0.123456', '8:-0.12345e-6']
Here are some simpler examples to demonstrate how capturing groups work and how they influence the output of findall. First, no groups:
>>> re.findall("a[bc]", "ab")
['ab']
Here, the string "ab" matched the regex, so we print everything the regex matched.
>>> re.findall("a([bc])", "ab")
['b']
This time, we put the [bc] inside a capturing group, so even though the entire string is still matched by the regex, findall only includes the part inside the capturing group in its output.
>>> re.findall("a(?:[bc])", "ab")
['ab']
Now, by converting the capturing group to a non-capturing group, findall again uses the match of the entire regex in its output.
>>> re.findall("a([bc])?", "a")
['']
>>> re.findall("a(?:[bc])?", "a")
['a']
In both of these final cases, the regular expression as a whole matches, so the return value is a non-empty list. In the first one, the capturing group itself doesn't match any text, though, so the empty string is part of the output. In the second, we don't have a capturing group, so the match of the entire regex is used for the output.
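If you need both the full match and the optional group, one option is re.finditer, which yields match objects instead of group contents. A minimal sketch using the pattern from the question (the name pairs is illustrative); note that an optional group that did not participate shows up as None here, not as the empty string findall gives:

```python
import re

pattern = re.compile(r'\d{1,4}:[-+]?\d+\.\d+(e-\d+)?')
text = '1:0.00003 3:0.123456 8:-0.12345e-6'

# group(0) is always the whole match; group(1) is the optional exponent.
pairs = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]

print(pairs)
# [('1:0.00003', None), ('3:0.123456', None), ('8:-0.12345e-6', 'e-6')]
```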

What do round brackets in Regex mean?

I don't understand why the regex ^(.)+$ matches the last letter of a string. I thought it would match the whole string.
Example in Python:
>>> text = 'This is a sentence'
>>> re.findall('^(.)+$', text)
['e']
If there's a capturing group (or groups), re.findall returns differently:
If one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group. Empty matches are included in the result unless they touch the
beginning of another match.
And according to MatchObject.group documentation:
If a group matches multiple times, only the last match is accessible:
If you want to get whole string, use a non-capturing group:
>>> re.findall('^(?:.)+$', text)
['This is a sentence']
or don't use capturing groups at all:
>>> re.findall('^.+$', text)
['This is a sentence']
or change the group to capturing all:
>>> re.findall('^(.+)$', text)
['This is a sentence']
>>> re.findall('(^.+$)', text)
['This is a sentence']
Alternatively, you can use re.finditer which yield match objects. Using MatchObject.group(), you can get the whole matched string:
>>> [m.group() for m in re.finditer('^(.)+$', text)]
['This is a sentence']
Because the capture group is just one character (.). The regex engine will continue to match the whole string because of the + quantifier, and each time, the capture group will be updated to the latest match. In the end, the capture group will be the last character.
Even if you use findall, the first time the regex is applied, because of the + quantifier it will continue to match the whole string up to the end. And since the end of the string was reached, the regex won't be applied again, and the call returns just one result.
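The "last repetition wins" behaviour can be inspected on a match object. A small sketch (variable names are illustrative):

```python
import re

text = 'This is a sentence'
m = re.match(r'^(.)+$', text)

whole = m.group(0)   # the entire match
last = m.group(1)    # the group only remembers its final repetition
span = m.span(1)     # where that final repetition sits in the string

print(whole)  # This is a sentence
print(last)   # e
print(span)   # (17, 18)
```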
If you remove the + quantifier (and the ^ and $ anchors, which would otherwise only allow a one-character string to match), the regex matches just one character at a time, so it is applied again and again until the whole string is consumed, and findall returns a list of all the characters in the string.
Note that + is greedy by default, so it matches all the characters up to the last one. Since only the dot is inside the capturing group, the above regex matches all the characters from the start but captures only the last character. Since findall gives preference to groups, it returns just the characters captured by the group.
re.findall('^(.+)$', text)
