python regex: don't allow a specific character to repeat - python

I have a regex
^[a-z][a-z0-9\-]{6,10}[a-z0-9]$
It enforces the following rules:
8-12 characters in length
first character is a lowercase letter
last character is a lowercase letter or digit
internal characters may include a hyphen
It's reused a lot in a module, always alongside some other rules and regexes.
While writing unit tests, I noticed that it's always used in conjunction with one other specific rule:
hyphens may not repeat
I can't wrap my head around integrating that rule into this one. I've tried a few dozen approaches with lookbehinds and lookaheads, but have had no luck restricting the check to that one character while keeping the length requirement.

No repeating hyphen ^[a-z](?:[a-z0-9]|-(?!-)){6,10}[a-z0-9]$
Explained
^ [a-z]
(?:
[a-z0-9] # alnum
| # or
- (?! - ) # hyphen if not followed by hyphen
){6,10}
[a-z0-9] $
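A quick way to check the pattern above against each rule is a small Python sketch. The name SLUG_RE and the sample strings are made up for illustration; only the regex itself comes from the answer.

```python
import re

# Pattern from the answer: a hyphen is accepted only when the next
# character is not another hyphen.
SLUG_RE = re.compile(r"^[a-z](?:[a-z0-9]|-(?!-)){6,10}[a-z0-9]$")

# Hypothetical inputs, one per rule being exercised.
samples = [
    "abcdefgh",       # 8 characters, no hyphen
    "abc-def-gh",     # single hyphens only
    "abc--defgh",     # doubled hyphen
    "abcdefg",        # only 7 characters
    "abcdefghijklm",  # 13 characters
    "abcdefg-",       # ends with a hyphen
    "1bcdefgh",       # starts with a digit
]

results = {s: bool(SLUG_RE.match(s)) for s in samples}
for s, ok in results.items():
    print(f"{s!r:18} {ok}")
```

Only the first two samples should be accepted. The quantifier {6,10} still counts every internal character once, so the 8-12 length rule is preserved.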


Regex for matching number ranges with specific units

I need help to complete a regex pattern. I need a pattern to match a range of numbers including unit.
Examples:
The car drives 50,5 - 80 km/10min on the road.
The car drives 50,5 - 80 km / 10min on the road.
The car drives 40,5-80 km/h on the road.
The car drives 30-50 km/h on the road.
The car drives 40 - 60.8 km/ h on the road.
The car drives 40.90-60,8 km/h on the road.
I need to match each entire range. It would also be good to simplify the (?:km/10min|km / 10min|km/h|km/ h) part so that the spacing variants do not each have to be listed explicitly.
([,.\d]+)\s*(?:km/10min|km / 10min|km/h|km/ h)
https://regex101.com/r/Ey792V/1
Currently, unfortunately, only the first number is matched. Thanks in advance for the help.
You could make the pattern a bit more specific and match optional whitespace chars instead of hard-coding every spacing variation:
\b\d+(?:[.,]\d+)?(?:\s*-\s*\d+(?:[.,]\d+)?)?\s*km\s*/\s*(?:h|10min)\b
Explanation
\b A word boundary
\d+(?:[.,]\d+)? Match 1+ digits with an optional decimal part
(?: Non capture group
\s*-\s* Match - between optional whitespace chars
\d+(?:[.,]\d+)? Match 1+ digits with an optional decimal part
)? Close the non capture group and make it optional
\s*km\s*/\s* Match km/ surrounded with optional whitespace chars to match different variations
(?:h|10min) Match either h or 10min (Or use \d+min to match 1+ digits)
\b A word boundary
See a regex demo.
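As a rough check, the pattern can be run in Python over the six sentences from the question. The variable names are illustrative only; the regex is the one given above.

```python
import re

# Pattern from the answer, unchanged.
RANGE_RE = re.compile(
    r"\b\d+(?:[.,]\d+)?(?:\s*-\s*\d+(?:[.,]\d+)?)?\s*km\s*/\s*(?:h|10min)\b"
)

sentences = [
    "The car drives 50,5 - 80 km/10min on the road.",
    "The car drives 50,5 - 80 km / 10min on the road.",
    "The car drives 40,5-80 km/h on the road.",
    "The car drives 30-50 km/h on the road.",
    "The car drives 40 - 60.8 km/ h on the road.",
    "The car drives 40.90-60,8 km/h on the road.",
]

# group(0) is the whole match, i.e. the full range including the unit.
matches = [RANGE_RE.search(s).group(0) for s in sentences]
for m in matches:
    print(m)
```

Because the range part is optional in this pattern, a single value such as "60 km/h" is matched as well.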
Your question is not entirely clear as you framed it in terms of examples. To be precise you need to state the question in words, then use the examples for illustration. To take one example, the question does not make clear whether
"The car drives 40,5- 80 km /h on the road."
is to be matched.
Expressing a question in words is not always easy, but it is a skill that you need to acquire in order to write clear code specifications. A by-product is that it makes the code easier to write, as that amounts to merely translating the words into code.
Let's give it a try.
Match a string composed of six successive substrings:
One or more digits that are not preceded by a comma or period, optionally followed by a comma, hyphen or period, which, if present, is followed by one or more digits.
A hyphen, optionally preceded and/or followed by a space.
One or more digits, optionally followed by a comma or period, the comma or period, if present, being followed by one or more digits.
The literal " km".
A forward slash, optionally preceded and/or followed by a space.
The literal "h" or one or more digits followed by "min", followed by a word boundary.
I cannot be sure that this is what you want but you should be able to easily modify these requirements as necessary.
Now let's translate these requirements into a regular expression.
1. One or more digits that are not preceded by a comma or period, optionally followed by a comma, hyphen or period, which, if present, is followed by one or more digits.
(?<![,.])\d+(?:[,.-]\d+)?
(?<![,.]) is a negative lookbehind. It is needed to avoid matching, for example, the indicated part of the following string.
"The car drives 1,500.5 - 80 km/10min on the road."
^^^^^^^^^^^^^^^
2. A hyphen, optionally preceded and/or followed by a space.
?- ?
(The first question mark is preceded by a space.)
3. One or more digits, optionally followed by a comma or period, the comma or period, if present, being followed by one or more digits.
\d+(?:[,.]\d+)?
4. The literal " km".
 km
(The k is preceded by a space.)
5. A forward slash, optionally preceded and/or followed by a space.
?\/ ?
(The first question mark is preceded by a space.)
6. The literal "h" or one or more digits followed by "min", followed by a word boundary.
(?:h|\d+min)\b
Now we can simply join these pieces to form the regular expression.
(?<![,.])\d+(?:[,.-]\d+)? ?- ?\d+(?:[,.]\d+)? km ?\/ ?(?:h|\d+min)\b
Demo
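The joining step can be sketched in Python by concatenating the six pieces listed above, including the leading lookbehind and the space before km. The list name pieces and the test sentences are chosen for illustration; the forward slash is left unescaped because Python does not need the backslash.

```python
import re

# The six pieces from the walkthrough, in order.
pieces = [
    r"(?<![,.])\d+(?:[,.-]\d+)?",  # 1. first number
    r" ?- ?",                      # 2. hyphen with optional spaces
    r"\d+(?:[,.]\d+)?",            # 3. second number
    r" km",                        # 4. literal " km"
    r" ?/ ?",                      # 5. slash with optional spaces
    r"(?:h|\d+min)\b",             # 6. unit suffix and word boundary
]
pattern = re.compile("".join(pieces))

sentences = [
    "The car drives 50,5 - 80 km/10min on the road.",
    "The car drives 50,5 - 80 km / 10min on the road.",
    "The car drives 40,5-80 km/h on the road.",
    "The car drives 30-50 km/h on the road.",
    "The car drives 40 - 60.8 km/ h on the road.",
    "The car drives 40.90-60,8 km/h on the road.",
]
matches = [pattern.search(s).group(0) for s in sentences]

# This specification requires a range, so a single value does not match.
single = pattern.search("The car drives 80 km/h on the road.")
```

Keeping the pieces in a list makes it easy to change one requirement without rereading the whole expression.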
\d.+(?:h\b|min\b|s\b)
Would also work. Demo

Regex: match sub-string between keywords including the keywords [duplicate]

I have a string. The end is different, such as index.php?test=1&list=UL or index.php?list=UL&more=1. The one thing I'm looking for is &list=.
How can I match it, whether it's in the middle of the string or it's at the end? So far I've got [&|\?]list=.*?([&|$]), but the ([&|$]) part doesn't actually work; I'm trying to use that to match either & or the end of the string, but the end of the string part doesn't work, so this pattern matches the second example but not the first.
Use:
/(&|\?)list=.*?(&|$)/
Note that when you use a bracket expression, every character within it (with some exceptions) is going to be interpreted literally. In other words, [&|$] matches the characters &, |, and $.
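The difference between the bracket expression and the group can be seen in a short Python sketch using the two URLs from the question (variable names are illustrative):

```python
import re

urls = ["index.php?test=1&list=UL", "index.php?list=UL&more=1"]

# Inside [...] the characters &, | and $ are all literals.
bracket = re.compile(r"[&|\?]list=.*?([&|$])")
# Inside (...) the | is alternation and $ is the end-of-string anchor.
group = re.compile(r"(&|\?)list=.*?(&|$)")

bracket_hits = [bool(bracket.search(u)) for u in urls]
group_hits = [group.search(u).group(0) for u in urls]

# Each of the three characters is matched literally by the class.
literals = re.findall(r"[&|$]", "a&b|c$d")

print(bracket_hits, group_hits, literals)
```

The bracket version fails on the first URL because no literal &, | or $ follows the value, while the group version matches both.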
In short
Any zero-width assertions inside [...] lose their meaning of a zero-width assertion. [\b] does not match a word boundary (it matches a backspace, or, in POSIX, \ or b), [$] matches a literal $ char, [^] is either an error or, as in ECMAScript regex flavor, any char. Same with \z, \Z, \A anchors.
You may solve the problem using any of the below patterns:
[&?]list=([^&]*)
[&?]list=(.*?)(?=&|$)
[&?]list=(.*?)(?![^&])
If you need to check for the "absolute", unambiguous string end anchor, you need to remember that in various regex flavors it is expressed with different constructs:
[&?]list=(.*?)(?=&|$) - OK for ECMA regex (JavaScript, default C++ `std::regex`)
[&?]list=(.*?)(?=&|\z) - OK for .NET, Go, Onigmo (Ruby), Perl, PCRE (PHP, base R), Boost, ICU (R `stringr`), Java/Android
[&?]list=(.*?)(?=&|\Z) - OK for Python
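In Python the distinction matters when the input ends with a newline: $ also matches just before a trailing newline, while \Z matches only at the very end. A minimal sketch, with a made-up input string:

```python
import re

# Hypothetical input with a trailing newline, e.g. a line read from a file.
query = "index.php?list=UL\n"

# $ succeeds just before the final newline.
dollar = re.search(r"[&?]list=(.*?)(?=&|$)", query)
# \Z requires the true end of the string, and . does not cross the newline.
strict = re.search(r"[&?]list=(.*?)(?=&|\Z)", query)

print(dollar.group(1) if dollar else None)
print(strict)
```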
Matching between a char sequence and a single char or end of string (current scenario)
The .*?([YOUR_SINGLE_CHAR_DELIMITER(S)]|$) pattern (suggested by João Silva) is rather inefficient since the regex engine checks for the patterns that appear to the right of the lazy dot pattern first, and only if they do not match does it "expand" the lazy dot pattern.
In these cases it is recommended to use negated character class (or bracket expression in the POSIX talk):
[&?]list=([^&]*)
See demo. Details
[&?] - a positive character class matching either & or ? (note the relationships between chars/char ranges in a character class are OR relationships)
list= - a substring, char sequence
([^&]*) - Capturing group #1: zero or more (*) chars other than & ([^&]), as many as possible
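A small Python sketch of the negated character class, run against the two URLs from the question plus one hypothetical URL with an empty value:

```python
import re

LIST_RE = re.compile(r"[&?]list=([^&]*)")

urls = [
    "index.php?test=1&list=UL",
    "index.php?list=UL&more=1",
    "index.php?list=&more=1",  # hypothetical: empty value
]

# Group 1 holds everything up to the next & or the end of the string.
values = [LIST_RE.search(u).group(1) for u in urls]
print(values)
```

No end-of-string anchor is needed here, because [^&]* simply stops when it runs out of characters.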
Checking for the trailing single char delimiter presence without returning it or end of string
Most regex flavors (including JavaScript beginning with ECMAScript 2018) support lookarounds, constructs that only return true or false depending on whether their patterns match. They are crucial when consecutive matches that may start and end with the same char are expected (see the original pattern: it may match a string starting and ending with &). Although this is not expected in a query string, it is a common scenario.
In that case, you can use two approaches:
A positive lookahead with an alternation containing positive character class: (?=[SINGLE_CHAR_DELIMITER(S)]|$)
A negative lookahead with just a negative character class: (?![^SINGLE_CHAR_DELIMITER(S)])
The negative lookahead solution is a bit more efficient because it does not contain an alternation group that adds complexity to matching procedure. The OP solution would look like
[&?]list=(.*?)(?=&|$)
or
[&?]list=(.*?)(?![^&])
See this regex demo and another one here.
Certainly, in case the trailing delimiters are multichar sequences, only a positive lookahead solution will work since [^yes] does not negate a sequence of chars, but the chars inside the class (i.e. [^yes] matches any char but y, e and s).
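The effect on consecutive matches can be illustrated in Python with a made-up query string that repeats the key. When the trailing & is consumed, the next match loses its leading delimiter; with either lookahead it does not.

```python
import re

# Hypothetical input where two matches share the & between them.
query = "?list=A&list=B"

# Consuming the trailing delimiter: the second match has no & left to start on.
consuming = re.findall(r"[&?]list=(.*?)(?:&|$)", query)
# Positive lookahead with alternation.
positive = re.findall(r"[&?]list=(.*?)(?=&|$)", query)
# Negative lookahead with a negated character class.
negative = re.findall(r"[&?]list=(.*?)(?![^&])", query)

print(consuming, positive, negative)
```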

Python regex: How to make a group of words/character optional?

I am trying to make regex that can match all of them:
word
word-hyphen
word-hyphen-again
That is, the -\w+ part may occur any number of times, depending on how many words the term has. How can I make it optional?
Thing I made so far is given here:- https://regex101.com/r/Atpwze/1
Try using
\w+(-\w+)* for matching 0 or more hyphenated words after first word
\w+(-\w+){0,} same as first case
based on your exact requirement.
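A minimal Python sketch of the first suggestion, assuming the whole term should match (hence fullmatch) and using a non-capturing group since the parts are not needed individually. The rejected samples are made up for illustration.

```python
import re

# One word, then zero or more "-word" repetitions.
TERM_RE = re.compile(r"\w+(?:-\w+)*")

accepted = ["word", "word-hyphen", "word-hyphen-again"]
rejected = ["word-", "-word", "word--again"]

accepted_ok = [bool(TERM_RE.fullmatch(t)) for t in accepted]
rejected_ok = [bool(TERM_RE.fullmatch(t)) for t in rejected]
print(accepted_ok, rejected_ok)
```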
In order to eliminate some extreme cases like a-+-+---, you could use \w+(-\w+)*[^\W]
\W matches all non-word characters and ^\W negates the matching of non-word characters
To catch all of your examples, I think you could use:
^\w+(?:\w+\-?|\-\w+)+$
Beginning of the string ^
Match a word character one or more times \w+
Start a non capturing group (?:
Match a word character one or more times with an optional hyphen \w+\-?
Or |
A hyphen with one or more word characters \-\w+
Close the non capturing group )
End of the string $

Negative lookahead not working after character range with plus quantifier

I am trying to implement a regex which includes all the strings which have any number of words but cannot be followed by a : and ignore the match if it does. I decided to use a negative look ahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
Since I am using a character range, it matches one character at a time and only leaves out the last character before the : .
How do I ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1
The issue is related to backtracking: once your [a-zA-Z]+ comes to a :, the engine steps back from the failing position, re-checks the lookahead and finds a match whenever there are at least two letters before a colon, returning the part that is not immediately followed by :. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed by :.
Adding implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed with a colon, you can explicitly add one more condition that is implied: and not followed with another letter:
[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
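The backtracking behaviour and the fix can be reproduced in Python with the strings mentioned above (variable names are illustrative):

```python
import re

texts = ["lame:joker", "c:real", "real:c"]

# Original pattern: backtracking gives up the last letter before the colon.
original = {t: re.findall(r"[a-zA-Z]+(?!:)", t) for t in texts}
# Lookahead also rejects a following letter, so there is nowhere to back off to.
fixed = {t: re.findall(r"[A-Za-z]+(?![A-Za-z:])", t) for t in texts}

for t in texts:
    print(t, original[t], fixed[t])
```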
Preventing backtracking into a word-like pattern by using a word boundary
As @scnerd suggests, word boundaries can also help in these situations, but there is always a catch: the meaning of a word boundary is context dependent (see the number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letter, digits and underscore). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in "12," but not in "12:", "12c" or "12_"
\d+(?![\d:]) matches 12 in "12,", "12c" and "12_"; only "12:" is rejected.
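The two number patterns can be compared side by side in Python. The helper hit is a hypothetical convenience function, not part of the answer.

```python
import re

inputs = ["12,", "12:", "12c", "12_"]

with_boundary = re.compile(r"\d+\b(?!:)")
with_class = re.compile(r"\d+(?![\d:])")

def hit(pattern, s):
    """Return the matched text, or None when there is no match."""
    m = pattern.search(s)
    return m.group(0) if m else None

boundary_hits = {s: hit(with_boundary, s) for s in inputs}
class_hits = {s: hit(with_class, s) for s in inputs}

for s in inputs:
    print(f"{s!r:6} {boundary_hits[s]!r:6} {class_hits[s]!r}")
```

The word boundary version refuses digits glued to a letter or underscore, because no boundary exists between two word characters.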
Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)
Here's an example run.

Python regex: using or statement

I may not be saying this right (I'm a total regex newbie). Here's the code I currently have:
bugs.append(re.compile("^(\d+)").match(line).group(1))
I'd like to add to the regex so it looks at either '\d+' (starts with digits) or that it starts with 2 capital letters and contains a '-' before the first whitespace. I have the regex for the capital letters:
^[A-Z]{2,}
but I'm not sure how to add the '-' and the make an OR with the \d+. Does this make sense? Thanks!
The way to do an OR in regexps is with the "alternation" or "pipe" operator, |.
For example, to match either one or more digits, or two or more capital letters:
^(\d+|[A-Z]{2,})
Debuggex Demo
You may or may not sometimes need to add/remove/move parentheses to get the precedence right. The way I've written it, you've got one group that captures either the digit string or the capitals. While you're learning the rules (in fact, even after you've learned the rules) it's helpful to look at a regular expression visualizer/debugger like the one I used.
Your rule is slightly more complicated: you want 2 or more capital letters, and a hyphen before the first space. That's a bit hard to write as is, but if you change it to two or more capital letters, zero or more non-space characters, and a hyphen, that's easy:
^(\d+|[A-Z]{2,}\S*?-)
Debuggex Demo
(Notice the \S*?—that means we're going to match as few characters as possible, instead of as many as possible, so we'll only match up to the first hyphen in THIS-IS-A-TEST instead of up to the last. If you want the other one, just drop the ?.)
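A short Python sketch of the alternation, with the lazy and greedy variants compared. The helper leading_id and the sample lines are made up for illustration. The helper returns None instead of raising when a line has no match, which the chained .match(line).group(1) call would not do.

```python
import re

# Lazy variant from the answer: stop at the first hyphen.
BUG_ID_RE = re.compile(r"^(\d+|[A-Z]{2,}\S*?-)")
# Greedy variant: run on to the last hyphen before the first whitespace.
GREEDY_RE = re.compile(r"^(\d+|[A-Z]{2,}\S*-)")

def leading_id(line, pattern=BUG_ID_RE):
    """Return the captured prefix, or None when the line does not match."""
    m = pattern.match(line)
    return m.group(1) if m else None

lines = ["1234 fix crash", "THIS-IS-A-TEST of the parser", "no id here"]
ids = [leading_id(l) for l in lines]
greedy_id = leading_id("THIS-IS-A-TEST of the parser", GREEDY_RE)

print(ids, greedy_id)
```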
Write | for "or". For a sequence of zero or more non-whitespace characters, write \S*.
re.compile(r'^(\d+|[A-Z][A-Z]\S*-\s)')
re.compile(r"""
    ^              # beginning of the line
    (?:            # non-capturing group; do not return this group in .group()
        (\d+)      # one or more digits, captured as a group
        |          # Or
        [A-Z]{2}   # Exactly two uppercase letters
        \S*        # Any number of non-whitespace characters
        -          # the dash you wanted
    )              # end of the non-capturing group
    """,
    re.X)          # enable comments in the regex
