Python regex: How to make a group of words/character optional? - python

I am trying to write a regex that can match all of these:
word
word-hyphen
word-hyphen-again
That is, the -\w+ part could repeat many times, depending on how many words are in a term. How can I make it optional?
What I have so far is here: https://regex101.com/r/Atpwze/1

Try using
\w+(-\w+)* for matching 0 or more hyphenated words after the first word
\w+(-\w+){0,} same as first case
based on your exact requirement.
To rule out some extreme cases like a-+-+---, you could use \w+(-\w+)*[^\W]
\W matches any non-word character, and [^\W] negates that inside a character class, so it is equivalent to \w. Note that this variant therefore needs at least two word characters to match.
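As a quick check of the first suggestion, here is a minimal sketch that runs \w+(-\w+)* against the three terms from the question. The non-capturing group, the use of re.fullmatch, and the extra negative samples are assumptions added for illustration; they are not part of the original answer.

```python
import re

# Same idea as \w+(-\w+)*, with a non-capturing group so only the
# whole match is reported.
pattern = re.compile(r"\w+(?:-\w+)*")

# The first three samples come from the question; the rest are
# hypothetical edge cases.
samples = ["word", "word-hyphen", "word-hyphen-again", "word-", "-word", "a-+-+---"]

# fullmatch() succeeds only if the pattern covers the entire string.
results = {s: pattern.fullmatch(s) is not None for s in samples}

for text, ok in results.items():
    print(f"{text!r}: {ok}")
```

With re.fullmatch (or explicit ^ and $ anchors), inputs such as a-+-+--- are rejected without appending an extra character class.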

To catch all of your examples, I think you could use:
^\w+(?:\w+\-?|\-\w+)+$
Beginning of the string ^
Match a word character one or more times \w+
Start a non capturing group (?:
Match a word character one or more times with an optional hyphen \w+\-?
Or |
A hyphen with one or more word characters \-\w+
Close the non capturing group )
End of the string $
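A small sketch to see how this pattern behaves in practice. The two extra samples (w and word-) are hypothetical inputs added here to show the edges: because the group must match at least once, a single-character term is rejected, and because of \w+\-? a trailing hyphen is accepted.

```python
import re

pattern = re.compile(r"^\w+(?:\w+\-?|\-\w+)+$")

# First three samples are from the question; "w" and "word-" are
# hypothetical edge cases.
samples = ["word", "word-hyphen", "word-hyphen-again", "w", "word-"]

results = {s: pattern.search(s) is not None for s in samples}

for text, ok in results.items():
    print(f"{text!r}: {ok}")
```

If those two edge cases matter for your data, the simpler \w+(?:-\w+)* with anchors may be a better fit.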

Related

Ignoring a word in regex (negative lookahead)

I'm looking to try and ignore a word in regex, but the solutions I've seen here did not work correctly for me.
Regular expression to match a line that doesn't contain a word
The issue I'm facing is I have an existing regex:
(?P<MovieCode>[A-Za-z]{3,6}-\d{3,5})(?P<MoviePart>[A-C]{1}\b)?
That is matching on Deku-041114-575-boku.mp4.
However, I want this regex to fail to match for cases where the MovieCode group has Deku in it.
I tried
(?P<MovieCode>(?!Deku)[A-Za-z]{3,6}-\d{3,5})(?P<MoviePart>[A-C]{1}\b)?
but unfortunately it just matches eku-124 and I need it to fail.
I have a regex101 with my attempts.
https://regex101.com/r/xqALM2/2
The MovieCode group can match 3-6 letters and Deku has 4 letters. If that part should not contain Deku, you could use a negative lookahead preceded by the character class [A-Za-z]* repeated 0+ times, as it cannot cross the -.
To prevent matching eku-124, you could prepend a word boundary before the MovieCode group, or add (?<!\S) if there should be a whitespace boundary on the left.
Note that you can omit {1} from the pattern.
\b(?P<MovieCode>(?![A-Za-z]*Deku)[A-Za-z]{3,6}-\d{3,5})(?P<MoviePart>[A-C]\b)?
Regex demo
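The difference between the attempt from the question and the pattern above can be sketched as follows. The sample strings Deku-124, xDeku-124 and Abcd-1234B are made up for illustration.

```python
import re

# Attempt from the question: the lookahead only blocks a match that
# starts exactly at "Deku", so the engine retries one character later.
naive = re.compile(
    r"(?P<MovieCode>(?!Deku)[A-Za-z]{3,6}-\d{3,5})(?P<MoviePart>[A-C]\b)?"
)

# Pattern from the answer: word boundary plus a lookahead that scans
# the whole run of letters for "Deku".
fixed = re.compile(
    r"\b(?P<MovieCode>(?![A-Za-z]*Deku)[A-Za-z]{3,6}-\d{3,5})(?P<MoviePart>[A-C]\b)?"
)

naive_hit = naive.search("Deku-124")
fixed_hit = fixed.search("Deku-124")
fixed_prefixed = fixed.search("xDeku-124")
fixed_other = fixed.search("Abcd-1234B")

print(naive_hit.group("MovieCode"))      # the partial match the question describes
print(fixed_hit, fixed_prefixed)         # no match in either case
print(fixed_other.group("MovieCode"), fixed_other.group("MoviePart"))
```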

Regular Expression in Python strings

I want to validate a string that satisfies the below three conditions using regular expression
The special characters allowed are (. , _ , - ).
Should contain only lower-case characters.
Should not start or end with special character.
To satisfy the above conditions, I have created a format as below
^[^\W_][a-z\.,_-]+
This pattern works fine up to the second character. However, it fails for the third and subsequent characters if they contain any special or upper-case characters.
Example:
The pattern works for the string S#yanthan but not for Sa#yanthan. I am expecting that pattern to pass even if the third and subsequent characters contain any special or upper-case characters. Can you suggest where this pattern goes wrong, please? Below is the code snippet.
import re
a = "Sayanthan"
exp = re.search(r"^[^\W_][a-z\.,_-]+", a)
if exp:
    print(True)
else:
    print(False)
Based on your initial rules I'd go with:
^[a-z](?:[.,_-]*[a-z])*$
See the online demo.
However, you mentioned in the comments:
"Also the third condition is "should not start with Special character" instead of "should not start or end with Special character""
In that case you could use:
^[a-z][-.,_a-z]*$
See the online demo
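To compare the two patterns above side by side, here is a minimal sketch. The names strict and relaxed and most of the sample strings are assumptions made for this example.

```python
import re

# Must start and end with a lower-case letter.
strict = re.compile(r"^[a-z](?:[.,_-]*[a-z])*$")

# Only the first character must be a lower-case letter.
relaxed = re.compile(r"^[a-z][-.,_a-z]*$")

samples = ["sayanthan", "sa.yan_than", "sayanthan-", "Sayanthan", "sa#yanthan", ".sayanthan"]

strict_results = {s: strict.search(s) is not None for s in samples}
relaxed_results = {s: relaxed.search(s) is not None for s in samples}

for s in samples:
    print(f"{s!r}: strict={strict_results[s]}, relaxed={relaxed_results[s]}")
```

The only sample where the two disagree is sayanthan-, which ends with a special character.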
The pattern that you tried ^[^\W_][a-z.,_-]+ starts with [^\W_] which will match any word char except an underscore, so it could also be an uppercase char.
Then [a-z.,_-]+ will match 1+ times any of the listed, which means the string can also end with a comma for example.
Looking at the conditions listed, you could use:
^[a-z](?:[a-z.,_-]*[a-z])?\Z
^ Start of string
[a-z] Match a lower case char a-z
(?: Non capture group
[a-z.,_-]*[a-z] Match 0+ occurrences of the listed ending with a-z
)? Close group and make it optional
\Z End of string
Regex demo
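A short sketch of this pattern in use. The sample strings are hypothetical. The last sample shows one reason to prefer \Z over $ in Python: $ also matches just before a trailing newline, while \Z matches only at the very end of the string.

```python
import re

pattern = re.compile(r"^[a-z](?:[a-z.,_-]*[a-z])?\Z")

samples = ["a", "a-b", "a-", "sa#yanthan", "Sayanthan", "abc\n"]

results = {s: pattern.search(s) is not None for s in samples}

# Same pattern with $ instead of \Z, to show the trailing-newline case.
with_dollar = re.compile(r"^[a-z](?:[a-z.,_-]*[a-z])?$")
dollar_accepts_newline = with_dollar.search("abc\n") is not None

for text, ok in results.items():
    print(f"{text!r}: {ok}")
print("$ accepts a trailing newline:", dollar_accepts_newline)
```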

Negative lookahead not working after character range with plus quantifier

I am trying to implement a regex that matches strings with any number of words, but ignores a match if it is followed by a :. I decided to use a negative lookahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
Since I am using a character range, it is matching one character at a time and only ignoring the last character before the :.
How do I ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1
The issue is related to backtracking: once your [a-zA-Z]+ comes to a :, the engine steps back from the failing position, re-checks the lookahead and finds a match whenever there are at least two letters before a colon, returning the part that is not immediately followed by :. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed by :.
Adding implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed with a colon, you can explicitly add one more condition that is implied: and not followed with another letter:
[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
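The backtracking effect and the fix can be reproduced in Python with a minimal sketch. The question uses a JavaScript-style /.../gm literal; re.findall is used here as a rough equivalent of the g flag, which is an assumption of this example.

```python
import re

text = "lame:joker"

# Original pattern: the engine backtracks from "lame" to "lam",
# which is followed by "e" rather than ":", so "lam" is reported.
naive = re.findall(r"[a-zA-Z]+(?!:)", text)

# Fixed pattern: the match may not be followed by a letter either,
# so backtracking into the middle of a word no longer helps.
fixed = re.findall(r"[A-Za-z]+(?![A-Za-z:])", text)

print(naive)
print(fixed)
```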
Preventing backtracking into a word-like pattern by using a word boundary
As @scnerd suggests, word boundaries can also help in these situations, but there is always a catch: word boundary meaning is context dependent (see a number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letters, digits and underscore). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in 12, but not in 12:, 12c or 12_
\d+(?![\d:]) matches 12 in 12, 12c and 12_; it fails only in 12:
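The two number patterns can be compared with a short sketch. The helper first_match is a hypothetical convenience function written for this example; the last line also checks the word-boundary variant for letters on the lame:joker input from the question.

```python
import re

boundary = re.compile(r"\d+\b(?!:)")
lookahead = re.compile(r"\d+(?![\d:])")

samples = ["12,", "12:", "12c", "12_"]


def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


boundary_hits = {s: first_match(boundary, s) for s in samples}
lookahead_hits = {s: first_match(lookahead, s) for s in samples}

# Word-boundary variant for letters.
letters = re.findall(r"[A-Za-z]+\b(?!:)", "lame:joker")

print(boundary_hits)
print(lookahead_hits)
print(letters)
```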
Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)
Here's an example run.

Regular expression in python 2.7.11

I am not sure why the regex - \b((\+65[\s\-]*)?[3689]\d{3}[\s\-]*\d{4})\b doesn't work for +6565066859
Your pattern currently doesn't work because of the word boundary that is placed at the start. Note that a word boundary will match between a word-character and
a non-word-character
the start of a string
the end of a string
In your case \b is placed between the start of the string and the +, where it cannot match because + is not a word character. The first position where \b does match is between + and 6, so your optional first group will never match. The rest of the pattern consists of an 8-digit number (if we forget spaces and hyphens for a moment), but the number you try to test has 10 digits after the +, so both word boundaries can't match at the same time.
I think you can rewrite your pattern as ((?:(\+65[\s\-]*)|\b)[3689]\d{3}[\s\-]*\d{4})\b thus either matching +65 or the word boundary. Not sure if you use the capturing groups in your pattern, so I kept them as they are.
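A minimal sketch comparing the original pattern with the rewritten one. It is written so that it runs on Python 3 and should behave the same on 2.7; the 8-digit sample 65066859 without a prefix is an assumption added to exercise the \b branch.

```python
import re

original = re.compile(r"\b((\+65[\s\-]*)?[3689]\d{3}[\s\-]*\d{4})\b")
rewritten = re.compile(r"((?:(\+65[\s\-]*)|\b)[3689]\d{3}[\s\-]*\d{4})\b")

number = "+6565066859"

original_match = original.search(number)    # no match, as in the question
rewritten_match = rewritten.search(number)  # takes the +65 branch
local_match = rewritten.search("65066859")  # takes the \b branch

print(original_match)
print(rewritten_match.group(1))
print(local_match.group(1))
```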

Regex find word including "-"

I have the below regex (from this link: get python dictionary from string containing key value pairs)
r"\b(\w+)\s*:\s*([^:]*)(?=\s+\w+\s*:|$)"
Here is the explanation:
\b # Start at a word boundary
(\w+) # Match and capture a single word (1+ alnum characters)
\s*:\s* # Match a colon, optionally surrounded by whitespace
([^:]*) # Match any number of non-colon characters
(?= # Make sure that we stop when the following can be matched:
\s+\w+\s*: # the next dictionary key
| # or
$ # the end of the string
) # End of lookahead
My question is that when my string has a word with a "-" in it, for example movie-night, the above regex does not work, and I think it is due to the \b(\w+). How can I change this regex to work with words that include "-"? I have tried \b(\w+-) but it does not work. Thanks for your help in advance.
You could try something such as this:
r"\b([\w\-]+)\s*:\s*([^:]*)(?=\s+\w+\s*:|$)"
Note the [\w\-]+, which allows matching both a word character and a dash.
For readability in the future, you may also want to investigate re.X/re.VERBOSE, which can make regex more readable.
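Here is a sketch of both ideas: the suggested pattern applied with re.findall, and a re.VERBOSE rewrite. The sample strings are made up. The verbose variant also uses [\w-]+ inside the lookahead, which is an assumption that goes beyond the answer above: with \w+ there, a hyphenated key that is not the first key does not seem to be recognised as "the next dictionary key", and the pair before it gets dropped.

```python
import re

# Pattern from the answer above.
answer_pattern = re.compile(r"\b([\w\-]+)\s*:\s*([^:]*)(?=\s+\w+\s*:|$)")

# Verbose variant (assumption: the lookahead also allows hyphenated keys).
variant = re.compile(
    r"""
    \b([\w-]+)          # key: word characters and hyphens
    \s*:\s*             # colon, optionally surrounded by whitespace
    ([^:]*)             # value: any number of non-colon characters
    (?=                 # stop when the following can be matched:
        \s+[\w-]+\s*:   #   the next key (hyphens allowed)
      |                 #   or
        $               #   the end of the string
    )
    """,
    re.VERBOSE,
)

first_key_hyphenated = dict(answer_pattern.findall("movie-night: friday place: home"))
later_key_answer = dict(answer_pattern.findall("genre: comedy movie-night: friday"))
later_key_variant = dict(variant.findall("genre: comedy movie-night: friday"))

print(first_key_hyphenated)
print(later_key_answer)
print(later_key_variant)
```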
