Meaning of regex Python - python

Is the meaning of this regex: (\d+).*? - group a set of numbers, then take whatever comes after (at most one occurrence of it, except a newline)?
Is there a difference in: (\d+) and [\d]+?

Take as many digits as possible (at least one), then take as few characters as possible (anything except a newline). The non-greedy modifier (?) doesn't really help unless the rest of your pattern follows it; otherwise it just matches as little as possible, which in this case is always zero characters.
>>> import re
>>> re.match(r'(\d+).*?', '123').group()
'123'
>>> re.match(r'(\d+).*?', '123abc').group()
'123'
The difference between (\d+) and [\d]+ is the fact that the former groups and the latter doesn't. ([\d]+) would however be equivalent.
>>> re.match(r'(\d+)', '123abc').groups()
('123',)
>>> re.match(r'[\d]+', '123abc').groups()
()
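To see how the lazy quantifier behaves once something follows it, here is a small sketch. The extra patterns and the sample string '123abcb' are made up for illustration and are not from the question.

```python
import re

# With nothing after it, the lazy .*? matches zero characters.
trailing_lazy = re.match(r'(\d+).*?', '123abc').group()

# Once a literal "b" must follow, .*? expands only as far as needed ...
lazy_rest = re.match(r'(\d+)(.*?)b', '123abcb').group(2)

# ... while the greedy .* takes as much as it can and then backtracks.
greedy_rest = re.match(r'(\d+)(.*)b', '123abcb').group(2)

print(trailing_lazy)  # 123
print(lazy_rest)      # a
print(greedy_rest)    # abc
```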

(\d+) One or more digits, captured as a group,
.* followed by any characters,
? lazy operator i.e. return the minimum match.

group 1 will be at least one digit, and group 0 (the whole match) will contain group 1 and possibly other characters, though not necessarily.
edit to answer the edited question: AFAIK there should be no difference in the matching between those 2 other than the grouping.

Related

Not sure about how /?(.+) works in my regex [duplicate]

What is the difference between:
(.+?)
and
(.*?)
when I use it in my php preg_match regex?
They are called quantifiers.
* 0 or more of the preceding expression
+ 1 or more of the preceding expression
By default a quantifier is greedy, which means it matches as many characters as possible.
The ? after a quantifier makes that quantifier "ungreedy" (lazy), so it will match as few characters as possible.
Example greedy/ungreedy
For example on the string "abab"
a.*b will match "abab" (preg_match_all will return one match, the "abab")
while a.*?b will match only the starting "ab" (preg_match_all will return two matches, "ab")
You can test your regexes online e.g. on Regexr, see the greedy example here
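The same greedy/ungreedy comparison can be reproduced in Python. This is only a sketch: re.findall is used here as a rough stand-in for PHP's preg_match_all, which is an assumption made for illustration.

```python
import re

text = "abab"

# Greedy: .* runs to the end of the string, then backs up to the last "b".
greedy = re.findall(r'a.*b', text)

# Lazy: .*? stops at the first "b" it can, so the scan finds two matches.
lazy = re.findall(r'a.*?b', text)

print(greedy)  # ['abab']
print(lazy)    # ['ab', 'ab']
```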
The first (+) is one or more characters. The second (*) is zero or more characters. Both are non-greedy (?) and match anything (.).
In RegEx, {i,f} means "between i to f matches". Let's take a look at the following examples:
{3,7} means between 3 to 7 matches
{,10} means up to 10 matches with no lower limit (i.e. the low limit is 0)
{3,} means at least 3 matches with no upper limit (i.e. the high limit is infinity)
{,} means no upper limit or lower limit for the number of matches (i.e. the lower limit is 0 and the upper limit is infinity)
{5} means exactly 5
Most good languages contain abbreviations, so does RegEx:
+ is the shorthand for {1,}
* is the shorthand for {,}
? is the shorthand for {,1}
This means + requires at least 1 match while * accepts any number of matches or no matches at all and ? accepts no more than 1 match or zero matches.
Credit: Codecademy.com
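The shorthand equivalences can be checked with a short Python sketch. The helper name matches and the sample strings are hypothetical. The lower bound is written explicitly as 0 because some engines may not accept an omitted lower bound.

```python
import re

def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

samples = ['', 'a', 'aa', 'aaa', 'aaaaa']

# Each shorthand should accept exactly the same samples as its brace form.
plus_same = all(matches(r'a+', s) == matches(r'a{1,}', s) for s in samples)
star_same = all(matches(r'a*', s) == matches(r'a{0,}', s) for s in samples)
opt_same = all(matches(r'a?', s) == matches(r'a{0,1}', s) for s in samples)

# {5} means exactly five repetitions, {3,7} anything from three to seven.
exactly_five = [s for s in samples if matches(r'a{5}', s)]
three_to_seven = [s for s in samples if matches(r'a{3,7}', s)]

print(plus_same, star_same, opt_same)  # True True True
print(exactly_five)                    # ['aaaaa']
print(three_to_seven)                  # ['aaa', 'aaaaa']
```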
+ matches at least one character
* matches any number (including 0) of characters
The ? indicates a lazy expression, so it will match as few characters as possible.
A + matches one or more instances of the preceding pattern. A * matches zero or more instances of the preceding pattern.
So basically, if you use a + there must be at least one instance of the pattern, if you use * it will still match if there are no instances of it.
Consider the string below as the text to match.
ab
The pattern (ab.*) will match, and the capture group will contain ab.
The pattern (ab.+) will not match, so nothing is returned.
But if you change the string to the following, the pattern (ab.+) will return aba:
aba
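A quick way to confirm this in Python (a sketch, using re.search in place of whatever engine the answer was written for):

```python
import re

# (ab.*) : zero characters after "ab" is acceptable.
star_on_ab = re.search(r'(ab.*)', 'ab')

# (ab.+) : needs at least one character after "ab", so "ab" alone fails.
plus_on_ab = re.search(r'(ab.+)', 'ab')
plus_on_aba = re.search(r'(ab.+)', 'aba')

print(star_on_ab.group(1))   # ab
print(plus_on_ab)            # None
print(plus_on_aba.group(1))  # aba
```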
+ is minimal one, * can be zero as well.
A star is very similar to a plus, the only difference is that while the plus matches 1 or more of the preceding character/group, the star matches 0 or more.
I think the previous answers fail to highlight a simple example.
Say we have these inputs:
numbers = [5, 15]
Both ^[0-9]+ and ^[0-9]* match 5 and 15. The difference only shows on input that does not start with a digit, such as an empty string or abc: ^[0-9]+ fails there, while ^[0-9]* still succeeds with an empty match.
The + operator requires at least one occurrence of the preceding expression (not a duplicate of it), whereas * accepts zero occurrences.
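How the two patterns behave on 5, 15 and a string with no leading digit can be checked with a few lines of Python. The input 'abc' is an added assumption, not part of the original example.

```python
import re

inputs = ['5', '15', 'abc']

# + needs at least one digit at the start of the string.
plus_hits = [s for s in inputs if re.search(r'^[0-9]+', s)]

# * is satisfied by zero digits, so it also matches 'abc' with an empty match.
star_hits = [s for s in inputs if re.search(r'^[0-9]*', s)]
empty_match = re.search(r'^[0-9]*', 'abc').group()

print(plus_hits)          # ['5', '15']
print(star_hits)          # ['5', '15', 'abc']
print(repr(empty_match))  # ''
```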

Python regex - matching character sequences using prior matched characters

I wish to match strings such as "zxxz" and "vbbv" where a character is followed by a pair of identical characters that do not match the first, then followed by the first. Therefore I do not wish to match strings such as "zzzz" and "vvvv".
I started with the following Python regex that matches all of those examples:
(.)(.)\2\1
In an attempt to exclude the second set ("zzzz", "vvvv"), I tried this modification:
(.)([^\1])\2\1
My reasoning is that the second group can contain any single character provided it is not the same as that matched in the first group.
Unfortunately this does not seem to work as it still matches "zzzz" and "vvvv".
According to the Python 2.7.12 documentation:
\number
Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.
(My emphasis added).
I find this sentence ambiguous, or at least unclear, because it suggests to me that the numeric escape should resolve as a single excluded character in the set, but this does not seem to happen.
Additionally, the following regex does not seem to work as I would expect either:
(.)[^\1][^\1][\1]
This doesn't seem to match "zzzz" or "zxxz".
You want to do a negative lookahead assertion (?!...) on \1 in the second capture group, then it will work:
r'(.)((?!\1).)\2\1'
Testing your examples:
>>> import re
>>> re.match(r'(.)((?!\1).)\2\1', 'zxxz')
<_sre.SRE_Match object at 0x109b661c8>
>>> re.match(r'(.)((?!\1).)\2\1', 'vbbv')
<_sre.SRE_Match object at 0x109b663e8>
>>> re.match(r'(.)((?!\1).)\2\1', 'zzzz') is None
True
>>> re.match(r'(.)((?!\1).)\2\1', 'vvvv') is None
True
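Why the [^\1] attempt fails can be shown side by side with the lookahead version. In this sketch the variable names are hypothetical. Inside a character class Python's re treats \1 as the character with octal value 1, as the quoted documentation says, so [^\1] only excludes that control character.

```python
import re

# Inside [...] the escape \1 is an octal escape for chr(1), not group 1.
char_class = re.compile(r'(.)([^\1])\2\1')
lookahead = re.compile(r'(.)((?!\1).)\2\1')

# For each sample: (matched by the [^\1] version, matched by the lookahead version)
results = {
    s: (bool(char_class.match(s)), bool(lookahead.match(s)))
    for s in ['zxxz', 'vbbv', 'zzzz', 'vvvv']
}
for s, outcome in results.items():
    print(s, outcome)

# [^\1] rejects only the control character \x01 and accepts anything else.
rejects_ctrl = re.match(r'[^\1]', '\x01') is None
accepts_z = re.match(r'[^\1]', 'z') is not None
print(rejects_ctrl, accepts_z)  # True True
```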

regex conditional matching

I am trying to use re.findall to find this pattern:
01-234-5678
regex:
(\b\d{2}(?P<separator>[-:\s]?)\d{2}(?P=separator)\d{3}(?P=separator)\d{3}(?:(?P=separator)\d{4})?,?\.?\b)
however, some cases have shortened to 01-234-5 instead of 01-234-0005 when the last four digits are 3 zeros followed by a non-zero digit.
Since there doesn't seem to be any uniformity in formatting I had to account for a few different separator characters or possibly none at all. Luckily, I have only noticed this shortening when some separator has been used...
Is it possible to use a regex conditional to check if a separator does exist (not an empty string), then also check for the shortened variation?
So, something like if separator != '': re.findall(r'(\b\d{2}(?P<separator>[-:\s]?)\d{3}(?P=separator)(\d{4}|\d{1})\.?\b)', text)
Or is my only option to include all the possibly incorrect 6 digit patterns then check for a separator with python?
If you want the last group of digits to be "either one or four digits", try:
>>> import re
>>> example = "This has one pattern that you're expecting, 01-234-5678, and another that maybe you aren't: 23:456:7"
>>> pattern = re.compile(r'\b(\d{2}(?P<sep>[-:\s]?)\d{3}(?P=sep)\d(?:\d{3})?)\b')
>>> pattern.findall(example)
[('01-234-5678', '-'), ('23:456:7', ':')]
The last part of the pattern, \d(?:\d{3})?, means one digit, optionally followed by three more (i.e. one or four). Note that you don't need to include the optional full stop or comma; they're already covered by \b.
Given that you don't want to capture the case where there is no separator and the last section is a single digit, you could deal with that case separately:
r'\b(\d{9}|\d{2}(?P<sep>[-:\s])\d{3}(?P=sep)\d(?:\d{3})?)\b'
#    ^ exactly nine digits
#         ^ or
#                            ^ sep not optional
See this demo.
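As a rough check of the stricter pattern, the sketch below runs it over a made-up sample string. The text value is an assumption, chosen to cover the full form, the no-separator form, the shortened form, and a six-digit run that should be rejected.

```python
import re

pattern = re.compile(
    r'\b(\d{9}|\d{2}(?P<sep>[-:\s])\d{3}(?P=sep)\d(?:\d{3})?)\b'
)

text = "01-234-5678 012345678 23:456:7 012345"

# finditer + group(1) avoids the (match, sep) tuples findall would return.
found = [m.group(1) for m in pattern.finditer(text)]
print(found)  # ['01-234-5678', '012345678', '23:456:7']
```

The six-digit run 012345 is skipped because, without a separator, only exactly nine digits are accepted.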
It is not clear why you are using word boundaries, but I have not seen your data.
Otherwise you can shorten the entire thing to this:
re.compile(r'\d{2}(?P<separator>[-:\s]?)\d{3}(?P=separator)\d{1,4}')
Note that \d{1,4} matches a run of 1, 2, 3 or 4 digits.
A string with no separator, e.g. "012340008", will also match the regex above, because [-:\s]? matches zero or one times.
HTH
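A small sketch of what the shortened pattern accepts. The sample strings are assumptions. Note that it also lets through two- and three-digit tails, which may or may not be wanted.

```python
import re

loose = re.compile(r'\d{2}(?P<separator>[-:\s]?)\d{3}(?P=separator)\d{1,4}')

samples = ['01-234-5678', '01-234-5', '01-234-56', '012340008']

# Every sample is matched in full, including the 2-digit tail and no-separator case.
matched = [loose.search(s).group() for s in samples]
print(matched)
```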

Regular Expression in Python

I don't know how to find a string using a regular expression. The format of the string is below.
[ any symbol 0~n times any number 1~n times] 1~n times.
It looks like matching a phone number, but the difference is that any symbols and white space can be inserted between the numbers, for example
458###666###2##111####111
OR
(123)))444###555%%6222%%%%
I don't know if I have explained the question clearly.
Anyway, thanks for your reply.
I think this represents the pattern you described
^(?:(\D?)\1*\d+)+$
See it here on Regexr
^ matches the start of the string
(\D?)\1* will match an optional non digit (\D), put it into a backreference and match this same character again 0 or more times using \1*
\d+ at least 1 digit
(?:(\D?)\1*\d+)+ the complete non capturing group is repeated 1 or more times
$ matches the end of the string
It will match
458###666###2##111####111
(123)))444###555%%6222%%%%1
(((((((((123)))444###555%%6222%%%%1
But not
s(123)))444###555%%6222%%%%1
(123)))444###555%%6222%%%%
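The match and non-match lists above can be verified with Python's re module. This is a sketch: the list names are hypothetical and the pattern is copied unchanged from the answer.

```python
import re

pattern = re.compile(r'^(?:(\D?)\1*\d+)+$')

should_match = [
    '458###666###2##111####111',
    '(123)))444###555%%6222%%%%1',
    '(((((((((123)))444###555%%6222%%%%1',
]
should_not_match = [
    's(123)))444###555%%6222%%%%1',  # two different symbols before the digits
    '(123)))444###555%%6222%%%%',    # does not end with a digit
]

matched = [s for s in should_match if pattern.match(s)]
rejected = [s for s in should_not_match if not pattern.match(s)]

print(len(matched), len(rejected))  # 3 2
```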
Your statement:
[ any symbol 0~n times any number 1~n times] 1~n times.
does not fit to your second example (123)))444###555%%6222%%%% that does not end with a digit.
If you need to gather all the groups of digits from the string you can use \d+ regex:
>>> import re
>>> re.findall(r'\d+', '458###666###2##111####111 OR (123)))444###555%%6222%%%%')
['458', '666', '2', '111', '111', '123', '444', '555', '6222']
[ NOTE, I am ignoring the 'in python', opting instead for a more general 'build regular expressions' answer, in the hope that this will not only provide the desired answer but be something to take away for different RE-related problems ]
First, you want to match any symbol (or possibly any symbol except a digit), 0 or more times. That would be one of .* or [^0-9]* (the first is the 'anything wildcard', the second is a character class of everything except the digits 0 to 9). The * means 'match zero or more times'.
Second, you want to match one or more digits. That, too, is relatively easy: [0-9]+ (or, if you have a sufficiently old and pedantic RE library, [0-9][0-9]*, but that is highly unlikely to be the case outside a CS exam).
Third, you want to group that and repeat the grouping at least one time.
The general syntax for grouping is to enclose the group in parentheses (except in emacs, where you need \(, as the plain parenthesis is frequently matched). So, something along the lines of ([^0-9]*[0-9]+)+ should do the trick.
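Although this answer is deliberately language-neutral, the resulting expression can be tried out in Python. The sketch below uses a non-capturing group and re.fullmatch to anchor both ends, which are choices made here rather than part of the answer.

```python
import re

# Non-capturing variant of ([^0-9]*[0-9]+)+ ; fullmatch anchors both ends.
pattern = re.compile(r'(?:[^0-9]*[0-9]+)+')

first = '458###666###2##111####111'
second = '(123)))444###555%%6222%%%%'

first_ok = pattern.fullmatch(first) is not None
second_ok = pattern.fullmatch(second) is not None  # trailing symbols, no final digits
prefix = pattern.match(second).group()             # longest prefix that fits

print(first_ok)   # True
print(second_ok)  # False
print(prefix)     # (123)))444###555%%6222
```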
