Not sure about how /?(.+) works in my regex [duplicate] - python

What is the difference between:
(.+?)
and
(.*?)
when I use it in my php preg_match regex?

They are called quantifiers.
* 0 or more of the preceding expression
+ 1 or more of the preceding expression
By default a quantifier is greedy, which means it matches as many characters as possible.
A ? after a quantifier makes that quantifier "ungreedy" (lazy), which means it matches as few characters as possible.
Example greedy/ungreedy
For example, on the string "abab":
a.*b will match "abab" (preg_match_all will return one match, "abab"),
while a.*?b will match only the leading "ab" (preg_match_all will return two matches, each "ab").
You can test your regexes online, e.g. on Regexr.
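The same greedy/lazy contrast can be reproduced in Python. This is only a minimal sketch: it assumes re.findall as a rough stand-in for preg_match_all, and the variable names are illustrative.

```python
import re

text = "abab"

# Greedy: .* grabs as much as possible, so the match runs to the last "b".
greedy_matches = re.findall(r"a.*b", text)

# Lazy: .*? stops at the first "b" it can reach, so the scan finds two short matches.
lazy_matches = re.findall(r"a.*?b", text)

print(greedy_matches)  # ['abab']
print(lazy_matches)    # ['ab', 'ab']
```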

The first (+) is one or more characters. The second (*) is zero or more characters. Both are non-greedy (?) and match anything (.).
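A small sketch of where (.+?) and (.*?) actually differ. It assumes Python's re module and made-up angle-bracket delimiters around the group; neither comes from the question.

```python
import re

# (.*?) may capture nothing, so empty brackets still match.
star_empty = re.search(r"<(.*?)>", "<>")

# (.+?) needs at least one character between the brackets, so this fails.
plus_empty = re.search(r"<(.+?)>", "<>")

# With something to capture, both patterns capture the same text.
star_filled = re.search(r"<(.*?)>", "<b>")
plus_filled = re.search(r"<(.+?)>", "<b>")

print(repr(star_empty.group(1)))  # ''
print(plus_empty)                 # None
print(star_filled.group(1), plus_filled.group(1))  # b b
```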

In regex, {i,f} means "between i and f matches". Let's take a look at the following examples:
{3,7} means between 3 and 7 matches
{,10} means up to 10 matches with no lower limit (i.e. the lower limit is 0)
{3,} means at least 3 matches with no upper limit (i.e. the upper limit is infinity)
{,} means no upper or lower limit on the number of matches (i.e. the lower limit is 0 and the upper limit is infinity)
{5} means exactly 5
Not every engine accepts an omitted lower limit: Python's re does, but PCRE (which PHP's preg_match uses) has historically treated {,10} as literal text, so {0,10} and {0,} are the portable spellings.
Most good languages contain abbreviations, and so does regex:
+ is shorthand for {1,}
* is shorthand for {0,}
? is shorthand for {0,1}
This means + requires at least 1 match, * accepts any number of matches or none at all, and ? accepts either 1 match or none.
Credit: Codecademy.com
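One way to check these equivalences is to ask which repetition counts each quantifier accepts. The sketch below assumes Python's re module; accepted_lengths is a hypothetical helper written for this illustration, not a library function.

```python
import re

def accepted_lengths(pattern, max_len=9):
    """Return every n in 0..max_len for which "a" * n fully matches pattern."""
    return [n for n in range(max_len + 1) if re.fullmatch(pattern, "a" * n)]

print(accepted_lengths(r"a{3,7}"))  # [3, 4, 5, 6, 7]
print(accepted_lengths(r"a{5}"))    # [5]

# Each shorthand accepts exactly the same lengths as its brace form.
print(accepted_lengths(r"a+") == accepted_lengths(r"a{1,}"))   # True
print(accepted_lengths(r"a*") == accepted_lengths(r"a{0,}"))   # True
print(accepted_lengths(r"a?") == accepted_lengths(r"a{0,1}"))  # True
```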

+ matches at least one character
* matches any number (including 0) of characters
The ? indicates a lazy expression, so it will match as few characters as possible.

A + matches one or more instances of the preceding pattern. A * matches zero or more instances of the preceding pattern.
So basically, if you use +, there must be at least one instance of the pattern; if you use *, it will still match when there are no instances of it.

Consider the following string to match:
ab
The pattern (ab.*) matches, and the capture group holds ab.
The pattern (ab.+) does not match, so nothing is returned.
But if you change the string to the following, (ab.+) matches and returns aba:
aba
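The three cases above can be checked with a short sketch, assuming Python's re.search in place of whichever engine the answer had in mind (the variable names are illustrative):

```python
import re

star_on_ab = re.search(r"(ab.*)", "ab")    # .* may match nothing after "ab"
plus_on_ab = re.search(r"(ab.+)", "ab")    # .+ needs one more character
plus_on_aba = re.search(r"(ab.+)", "aba")  # the trailing "a" satisfies .+

print(star_on_ab.group(1))   # ab
print(plus_on_ab)            # None
print(plus_on_aba.group(1))  # aba
```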

+ requires at least one; * allows zero as well.

A star is very similar to a plus, the only difference is that while the plus matches 1 or more of the preceding character/group, the star matches 0 or more.

I think the previous answers fail to highlight a simple example.
For example, we have an array:
numbers = [5, 15]
The regex ^[0-9]+ matches both 5 and 15, because each contains at least one digit.
^[0-9]* also matches both 5 and 15. The difference is that the + operator requires at least one occurrence of the preceding expression, whereas * also succeeds on zero occurrences, so ^[0-9]* matches even an empty string.

Related

What is a regex expression that can prune down repeating identical characters down to a maximum of two repeats?

I feel I am having the most difficulty explaining this well enough for a search engine to pick up on what I'm looking for. The behavior is essentially this:
string = "aaaaaaaaare yooooooooou okkkkkk"
would become "aare yoou okk", with the maximum number of repeats for any given character being two.
Matching the excess duplicates, and then re.sub -ing it seems to me the approach to take, but I can't figure out the regex statement I need.
The only attempt I feel is even worth posting is this - (\w)\1{3,0}
Which matched only the first instance of a character repeating more than three times - so only one match, and the whole block of repeated characters, not just the ones exceeding the max of 2. Any help is appreciated!
The regexp should be (\w)\1{2,} to match a character followed by at least 2 repetitions. That's 3 or more when you include the initial character.
The replacement is then \1\1 to replace with just two repetitions.
import re

string = "aaaaaaaaare yooooooooou okkkkkk"
# Collapse any run of 3+ identical word characters to exactly two.
new_string = re.sub(r'(\w)\1{2,}', r'\1\1', string)
You could write
import re

string = "aaaaaaaaare yooooooooou okkkkkk"
rgx = r'(\w)\1*(?=\1\1)'  # raw string so \w and \1 reach the regex engine intact
re.sub(rgx, '', string)
#=> "aare yoou okk"
The regular expression can be broken down as follows.
(\w) # match one word character and save it to capture group 1
\1* # match the content of capture group 1 zero or more times
(?= # begin a positive lookahead
\1\1 # match the content of capture group 1 twice
) # end the positive lookahead
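The two approaches from these answers can be compared side by side. This is a sketch only: the function names squeeze_backref and squeeze_lookahead are made up for illustration, and the second test string is an extra assumption beyond the question's sample.

```python
import re

def squeeze_backref(text):
    # Replace a run of 3+ identical word characters with two copies.
    return re.sub(r"(\w)\1{2,}", r"\1\1", text)

def squeeze_lookahead(text):
    # Delete the leading part of a run, leaving the two copies seen by the lookahead.
    return re.sub(r"(\w)\1*(?=\1\1)", "", text)

sample = "aaaaaaaaare yooooooooou okkkkkk"
print(squeeze_backref(sample))    # aare yoou okk
print(squeeze_lookahead(sample))  # aare yoou okk

# Runs of one or two characters are left untouched by both versions.
print(squeeze_backref("aa bbb c"), "|", squeeze_lookahead("aa bbb c"))  # aa bb c | aa bb c
```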

Python regex with one operator or the other, but not both

I want to use a regular expression to match a lowercase letter followed by either a + and a digit or a - and a digit, or both, but not 2 times the same operator.
To be clear, these are acceptable
a
a+1
a-2
a+3-4
a-5+6
while these are not acceptable
a+1+2
a-3-4
My current expression is
r = re.compile(r"[a-z]{1}([+-]\d){0,2}?$")
which allows both the non-acceptable strings. How can I specify that if one operator has already been used, it cannot appear twice?
You can use a backreference within a negative lookahead (the overall regex will have to change a little bit though):
[a-z](?:([+-])\d(?:(?!\1)[+-]\d)?)?$
regex101 demo
Instead of ([+-]\d){0,2}?, I have made the possible two repeats like this: ([+-])\d(?:(?!\1)[+-]\d)?, the first occurrence of operator and number being ([+-])\d and the second (?:(?!\1)[+-]\d)?.
In the first occurrence, the regex stores the matched operator (either + or -) in group 1. In the second, (?!\1)[+-] makes sure the same operator is not matched again: (?! ... ) is the syntax for a negative lookahead, so [+-] cannot be whatever the lookahead matches, here the operator captured in group 1.
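A quick way to try the backreference pattern against the strings from the question. This sketch assumes re.match (which anchors at the start of the string) together with the pattern's own trailing $; the list and dictionary names are illustrative.

```python
import re

pattern = re.compile(r"[a-z](?:([+-])\d(?:(?!\1)[+-]\d)?)?$")

acceptable = ["a", "a+1", "a-2", "a+3-4", "a-5+6"]
not_acceptable = ["a+1+2", "a-3-4"]

# Map each test string to True/False depending on whether the pattern accepts it.
results = {s: bool(pattern.match(s)) for s in acceptable + not_acceptable}

for s, ok in results.items():
    print(f"{s!r}: {'match' if ok else 'no match'}")
```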
Try this:
[a-z](?!(\+\d\+\d)|(\-\d\-\d))((\+|\-)\d)*
And verbose version (which is better, use it):
[a-z] # find this
(?! # not followed by:
(\+\d\+\d) | (\-\d\-\d) # (this or that)
)
(
(\+|\-)\d # followed by this
)* # 0 or more times
You can branch these into two scenarios, like so:
r = re.compile(r'^[a-z]([+]\d([-]\d)?|[-]\d([+]\d)?)?$')
(regex101)
So we basically have two branches here:
[+]\d([-]\d)?: we start with a +, a digit and optionally a - and a digit; and
[-]\d([+]\d)?: we start with a -, a digit and optionally a + and a digit.
We then make a union between the two, and make this optional as well.
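To see exactly which operator sequences the branching pattern lets through, one option is to enumerate them. The sketch below is an illustration under assumptions: it fixes the letter to "a" and the digit to "1", and tries every combination of zero to three operators.

```python
import re
from itertools import product

branching = re.compile(r'^[a-z]([+]\d([-]\d)?|[-]\d([+]\d)?)?$')

# Build "a", "a+1", "a-1", "a+1+1", ... for every operator sequence of length 0..3.
candidates = [
    "a" + "".join(op + "1" for op in ops)
    for n in range(4)
    for ops in product("+-", repeat=n)
]

accepted = [s for s in candidates if branching.match(s)]
print(accepted)  # ['a', 'a+1', 'a-1', 'a+1-1', 'a-1+1']
```

Only sequences that use each operator at most once survive, which is the rule the question asked for.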
Try this one:
(?!(.+?\+){2,}|(.*?\-){2,})[a-z][\d+-]*
Demo at regex 101
Explanation:
(?!(.+?\+){2,}|(.*?\-){2,}) negative lookahead asserts that neither + nor - occurs two or more times
[a-z] matches lower case character
[\d+-]* matches zero or more digits, + or -

What is the use of second limit in the quantifier {m,n} in the regular expression in python if it used in a non-greedy way?

The regular expression re.compile(r'\w{3,5}?') in Python matches runs of at least three word characters (alphanumerics and underscore), without overlaps. My question is: does the second limit serve any purpose when the quantifier {3,5} is used non-greedily? Even if the 5 is replaced by any other number, the result would be the same, i.e. re.compile(r'\w{3,5}?')=re.compile(r'\w{3,6}?')=re.compile(r'\w{3,7}?')=re.compile(r'\w{3,}?')
Can someone give me an example where the second limit is of any use?
When a lazily quantified subpattern appears at the end of the pattern, it matches the minimum number of chars it needs to return a match. 123(\w*?) will always leave Group 1 empty, as *? matches zero or more chars, but as few as possible.
It means that the \w{3,5}? regex will always match exactly 3 word chars, and the second argument is "ignored", as matching 3 occurrences of the word char is enough.
If the lazy pattern is not at the end, the second argument is important.
See an example: Test: (\w{3,5}?)-(\d+) captures a different number of chars in Group 1 depending on how many word chars there are in the strings.
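A sketch of both situations, assuming Python's re module. The sample strings and the group1 helper are hypothetical, and the pattern is used without the literal "Test: " prefix.

```python
import re

# At the end of the pattern the upper limit changes nothing: always 3 chars per match.
tail_results = [re.findall(p, "abcdefgh") for p in (r"\w{3,5}?", r"\w{3,7}?", r"\w{3,}?")]
print(tail_results)  # [['abc', 'def'], ['abc', 'def'], ['abc', 'def']]

def group1(pattern, text):
    """Return Group 1 of the first match, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

# Followed by "-", the lazy quantifier must expand, and the upper limit caps how far.
short = group1(r"(\w{3,5}?)-(\d+)", "abcd-12")
bounded = group1(r"(\w{3,5}?)-(\d+)", "abcdefg-12")
unbounded = group1(r"(\w{3,}?)-(\d+)", "abcdefg-12")

print(short)      # abcd
print(bounded)    # cdefg
print(unbounded)  # abcdefg
```

With seven word chars before the hyphen, {3,5}? cannot stretch from the first character, so the match starts later and captures only five chars, while {3,}? captures all seven.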

Python regex with slash

Why does
len(re.findall('[0-9999][/][0-9999]', '15/11/2012'))
correctly return 2, but
len(re.findall('[0-9999][/][0-9999][/]', '15/11/2012'))
return 0? Shouldn’t it return 1?
You're misunderstanding character classes. The expression [abc123] matches a single character, namely one of the characters in the brackets. The - is a range operator in character classes, but regular expressions are not aware of numeric ranges, only character ranges. In other words, [0-9999] is equivalent to [0-9]; you're just specifying the 9 several times.
The reason you find 2 matches with the first regex is that you're matching 5/1 and 1/2. The second regex only allows a single digit between the two slashes, so it cannot match the two-digit 11 and fails.
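Expressions that match whole numbers rather than single digits would be, for example,
[0-9]+/[0-9]+
and
[0-9]+/[0-9]+/
Each returns 1 result on this string (15/11 and 15/11/ respectively), because findall only reports non-overlapping matches. The + is known as a quantifier.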
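The behaviour described in this answer can be reproduced with a short sketch (the variable names are illustrative, and the characters used for the class comparison are an arbitrary choice):

```python
import re

date = "15/11/2012"

# [0-9999] and [0-9] accept exactly the same single characters.
same_class = all(
    bool(re.fullmatch(r"[0-9999]", c)) == bool(re.fullmatch(r"[0-9]", c))
    for c in "0123456789a/"
)
print(same_class)  # True

# One digit on each side of the slash: two non-overlapping hits.
single_digit = re.findall(r"[0-9999][/][0-9999]", date)
print(single_digit)  # ['5/1', '1/2']

# One digit between two slashes never occurs in this date.
single_digit_slash = re.findall(r"[0-9999][/][0-9999][/]", date)
print(single_digit_slash)  # []

# With the + quantifier, whole numbers are matched.
multi_digit_slash = re.findall(r"[0-9]+/[0-9]+/", date)
print(multi_digit_slash)  # ['15/11/']
```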
