I am trying to implement a regex which includes all the strings which have any number of words but cannot be followed by a : and ignore the match if it does. I decided to use a negative look ahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
Since I am using a character range, it matches one character at a time and only ignores the last character before the :.
How do I ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1
The issue is related to backtracking: once your [a-zA-Z]+ comes to a :, the engine steps back from the failing position, re-checks the lookahead and finds a match whenever there are at least two letters before a colon, returning the letters up to the one that is not immediately followed by :. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed by :.
Adding implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed with a colon, you can explicitly add one more condition that is implied: and not followed with another letter:
[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
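To make the backtracking visible, here is a minimal Python sketch (Python's re module is an assumption on my side, the question does not name a flavor) that runs the original pattern and the extended lookahead on the sample string from the question:

```python
import re

text = "lame:joker"

# Original pattern: after "lame" fails the lookahead, the engine gives back
# one letter and accepts "lam", because "lam" is followed by "e", not ":".
naive = re.findall(r"[a-zA-Z]+(?!:)", text)

# Extended lookahead: a shorter match is always followed by a letter,
# so backtracking can no longer produce a partial word.
strict = re.findall(r"[A-Za-z]+(?![A-Za-z:])", text)

print(naive)   # ['lam', 'joker']
print(strict)  # ['joker']
```

The only difference between the two patterns is the letter class added inside the lookahead.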
Preventing backtracking into a word-like pattern by using a word boundary
As @scnerd suggests, word boundaries can also help in these situations, but there is always a catch: word boundary meaning is context dependent (see a number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letter, digits and underscore). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in "12," but not in "12:", "12c" or "12_".
\d+(?![\d:]) matches 12 in "12,", "12c" and "12_", and fails only in "12:".
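The difference between the two number patterns can be checked with a short Python sketch. The helper name matched and the sample list are only illustrative:

```python
import re

samples = ["12,", "12:", "12c", "12_"]

with_boundary = re.compile(r"\d+\b(?!:)")
with_lookahead = re.compile(r"\d+(?![\d:])")

def matched(pattern, s):
    """Return the matched text, or None when the pattern finds nothing."""
    m = pattern.search(s)
    return m.group() if m else None

# \b needs a non-word char (or the end of the string) after the digits,
# so digits glued to a letter or an underscore are rejected as well.
boundary_results = {s: matched(with_boundary, s) for s in samples}

# The extended lookahead only forbids a following digit or colon.
lookahead_results = {s: matched(with_lookahead, s) for s in samples}

print(boundary_results)
print(lookahead_results)
```

With the word boundary only "12," yields a match, while the lookahead version also accepts the digits in "12c" and "12_".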
Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)
Here's an example run.
Related
I'm trying to extract tokens that satisfy many conditions out of which, I'm using lookahead to implement the following two conditions:
The tokens must be either numeric/alphanumeric (i.e, they must have at least one digit). They can contain few special characters like - '-','/','\','.','_' etc.,
I want to match strings like: 165271, agya678, yah#123, kj*12-
The tokens can't have consecutive special characters like: ajh12-&
I don't want to match strings like: ajh12-&, 671%&i^
I'm using a positive lookahead for the first condition: (?=\w*\d\w*) and a negative lookahead for the second condition: (?!=[\_\.\:\;\-\\\/\#\+]{2})
I'm not sure how to combine these two look-ahead conditions.
Any suggestions would be helpful. Thanks in advance.
Edit 1 :
I would like to extract complete tokens that are part of a larger string too (i.e., They may be present in middle of the string).
I would like to match all the tokens in the string:
165271 agya678 yah#123 kj*12-
and none of the tokens (not even a part of a token) in the string: ajh12-& 671%&i^
In order to force the regex to consider the whole string I've also used \b in the above regexs : (?=\b\w*\d\w*\b) and (?!=\b[\_\.\:\;\-\\\/\#\+]{2}\b)
You can use
^(?!.*[_.:;\-\\\/#+*]{2})(?=[^\d\n]*\d)[\w.:;\-\\\/#+*]+$
Regex demo
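The positive lookahead (?=[^\d\n]*\d) uses a negated character class to match any chars except a digit or a newline, and then matches a digit, which ensures that the string contains at least one digit.
Note that the negative lookahead is written (?!...), without an = after the !. You also have to add * to the character class (to match kj*12-), and most characters don't have to be escaped in the character class.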
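Using contrast, you could also turn the first .* into a negated character class to prevent some backtracking. Note that this variant only detects a doubled special character when no single special character comes before it in the string:
^(?![^_.:;\-\\\/#+*\n]*[_.:;\-\\\/#+*]{2})(?=[^\d\n]*\d)[\w.:;\-\\\/#+*]+$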
Edit
Without the anchors, you can use whitespace boundaries to the left (?<!\S) and to the right (?!\S)
(?<!\S)(?!\S*[_.:;\-\\\/#+*]{2})(?=[^\d\s]*\d)[\w.:;\-\\\/#+*]+(?!\S)
Regex demo
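As a rough check of the whitespace-boundary approach, here is a Python sketch. It assumes the negative lookahead is written as (?!...) and drops the escape before / (Python does not need it). The name TOKEN is mine:

```python
import re

# (?<!\S)  no non-whitespace char to the left (start of a token)
# (?!...)  no two consecutive special chars anywhere in the token
# (?=...)  at least one digit in the token
# (?!\S)   no non-whitespace char to the right (end of a token)
TOKEN = re.compile(
    r"(?<!\S)(?!\S*[_.:;\-\\/#+*]{2})(?=[^\d\s]*\d)[\w.:;\-\\/#+*]+(?!\S)"
)

good = "165271 agya678 yah#123 kj*12-"
bad = "ajh12-& 671%&i^"

print(TOKEN.findall(good))  # ['165271', 'agya678', 'yah#123', 'kj*12-']
print(TOKEN.findall(bad))   # []
```

The tokens in the second string are rejected because &, % and ^ are not in the allowed character class, so the right-hand boundary (?!\S) can never be reached.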
You can use multiple lookahead assertions to only match strings that:
(?!.*(?:\W|_){2,}.*) - do not have consecutive special characters, and
(?=.*\d.*) - have at least 1 digit
^(?!.*(?:\W|_){2,}.*)(?=.*\d.*).*$
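A small Python sketch of how this anchored pattern behaves. Since it uses ^ and $, I am assuming each token is tested on its own; the token list simply combines the examples from the question plus one digit-free string:

```python
import re

# Two lookaheads evaluated at the start of the string, then .* consumes it.
pattern = re.compile(r"^(?!.*(?:\W|_){2,}.*)(?=.*\d.*).*$")

tokens = ["165271", "agya678", "yah#123", "kj*12-", "ajh12-&", "671%&i^", "abc"]

accepted = [t for t in tokens if pattern.match(t)]
print(accepted)  # ['165271', 'agya678', 'yah#123', 'kj*12-']
```

"abc" is rejected by the digit lookahead, the two tokens with "-&" and "%&" by the consecutive-special-character lookahead.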
I'm coding a set of regex to match dates in text using python. One of my regex was designed to match dates in the format MM/YYYY only. The regex is the following:
r'\b((?:(?:0)[0-9])|(?:(?:1)[0-2])|(?:(?:[1-9])))(?:\/|\\)(\d{4})\b'
Looks like the word boundary is not working as it is matching parts of dates like 12/02/2020 (it should not match this date format at all).
In the attached image only the second pattern should have been recognized. The first one should not have matched at all, not even in part.
Remembering that the regex should match the MM/YYYY pattern in strings like:
"The range of dates go from 21/02/2020 to 21/03/2020 as specified above."
Can you help me find the error in my pattern to make it match only my goal format?
A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).
What is a word boundary in regex?
What happens is that the / character is not part of the \w class, so every / in your string creates a new word boundary.
You have not provided the full string you are matching, but for the example you posted you could solve it by just adding the anchors ^ and $:
^((?:(?:0)[0-9])|(?:(?:1)[0-2])|(?:(?:[1-9])))(?:\/|\\)(\d{4})$
https://regex101.com/r/xncZNN/1
edit:
Working on your full example and your regex I did some "clean up" because it was a bit confusing, but I think I understood the pattern you were trying to map
here is the new:
(?<=^|[a-zA-Z ])(0[0-9]|1[0-2]|[1-9])(?:\/|\\)([\d]{4})(?=[a-zA-Z ]|$)
I have replaced the word boundaries with a lookahead (?=...) and a lookbehind (?<=...), and specified the pattern I want to see before and after the date. You can adjust it to your specific need and add other characters like numbers or specific stuff.
https://regex101.com/r/xncZNN/4
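One caveat if you run this in Python, as the question does: the re module only accepts fixed-width lookbehinds, so an alternation like ^|[a-zA-Z ] inside (?<=...) is rejected. A possible workaround, shown here as a sketch that rests on my reading of the intent, is to flip both lookarounds into negative ones over the negated class; this also covers the start and end of the string. The name MONTH_YEAR and the second sample sentence are illustrative:

```python
import re

# (?<![^a-zA-Z ])  not preceded by anything other than a letter or a space
# (?![^a-zA-Z ])   not followed by anything other than a letter or a space
# Both also succeed at the very start / end of the string.
MONTH_YEAR = re.compile(
    r"(?<![^a-zA-Z ])(0[0-9]|1[0-2]|[1-9])[/\\](\d{4})(?![^a-zA-Z ])"
)

full_dates = "The range of dates go from 21/02/2020 to 21/03/2020 as specified above."
month_years = "Invoices issued in 03/2021 and 11/2021 are affected."

print(MONTH_YEAR.findall(full_dates))   # []
print(MONTH_YEAR.findall(month_years))  # [('03', '2021'), ('11', '2021')]
```

The two capture groups hold the month and the year, as in the original pattern.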
The problem is that \b\d{2}/\d{4}\b matches 02/2000 in the string 01/02/2000 because the first forward slash is a word break. The solution is to identify the characters that should not precede and follow the match and use negative lookarounds in place of word breaks. Here you could use the regular expression
r'(?<![\d/])(?:0[1-9]|1[0-2])/\d{4}(?![\d/])'
The negative lookbehind, (?<![\d/]), prevents the two digits representing the month from being preceded by a digit or forward slash; the negative lookahead, (?![\d/]), prevents the four digits representing the year from being followed by a digit or forward slash.
Regex demo
Python demo
If 6/2000 is to be matched as well as 06/2000, change (?:0[1-9] to (?:0?[1-9].
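For comparison, a short Python sketch that runs the word-boundary version and the lookaround version side by side. The sample sentence is made up for illustration:

```python
import re

naive = re.compile(r"\b\d{2}/\d{4}\b")
strict = re.compile(r"(?<![\d/])(?:0[1-9]|1[0-2])/\d{4}(?![\d/])")

text = "From 12/02/2020 to 21/03/2020, billed in 06/2020."

# The slash counts as a word break, so \b happily starts inside a full date.
print(naive.findall(text))   # ['02/2020', '03/2020', '06/2020']

# The lookarounds refuse a digit or slash on either side of MM/YYYY.
print(strict.findall(text))  # ['06/2020']
```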
Searching a large syslog repo and need to get a specific word to match with a certain condition.
I'm using regex to compile a search for this word. I've read the python docs on regex characters and I understand how to specify each criteria separately but somehow missing how to concatenate all together for my specific search. This is what I have so far but not working...
p = re.compile("^'[A-Z]\w+'$")
match = re.search(p, syslogline, )
The word is a username that is alphanumeric, always begins with an uppercase character (preceded by a blank space), can contain letters or digits, is 3-12 characters long and ends with a single quote.
An example would be: Epresley01' or J98473'
Brief
Based on your requirements (also stated below), your regex doesn't work because:
^' Asserts the position at the start of the line and ensures a ' is the first character of that line.
$ Asserts the position at the end of the line.
Having said that, you specify that the username is preceded by a space character (which isn't present in your pattern). Your pattern also checks for a leading ', which isn't the first character of the username. Given that you haven't given us a sample of your file, I can't confirm or deny that your string starts right before the username and ends right after it, but if that's not the case the anchors ^ and $ are also not helping you here.
Requirements
The requirements below are simply copied from the OP's question (rewritten) to outline the username format. The username:
Is preceded by a space character.
Starts with an uppercase letter.
Contains chars or nums. I'm assuming here that chars actually means letters and that all letters in the username (including the uppercase starting character) are ASCII.
Is 3-12 characters in length (excluding the preceding space and the end character stated below).
Ends with an apostrophe character '.
Code
See regex in use here
(?<= )[A-Z][^\W_]{2,11}'
Explanation
(?<= ) Positive lookbehind ensuring what precedes is a space character
[A-Z] Match any uppercase ASCII letter
[^\W_]{2,11} Match any word character except underscore _ between 2 and 11 times (for ASCII input, equivalent to [a-zA-Z0-9])
This appears a little confusing because it's actually a double-negative. It's saying match anything that's not in the set. The \W matches any non-word character. Since it's a double-negative, it's like saying don't match non-word characters. Adding _ to the set negates it.
' Match the apostrophe character ' literally
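Plugged into Python, this could look like the sketch below. The log line, the helper name find_username and the use of re.ASCII (to keep \W limited to ASCII, in line with the assumption in the requirements) are my own additions, not something taken from the OP's data:

```python
import re

# re.ASCII keeps [^\W_] restricted to a-z, A-Z and 0-9.
USERNAME = re.compile(r"(?<= )[A-Z][^\W_]{2,11}'", re.ASCII)

def find_username(line):
    """Return the first username in the line without the trailing apostrophe."""
    m = USERNAME.search(line)
    return m.group()[:-1] if m else None

# Hypothetical syslog line, for illustration only.
line = "Oct 11 22:14:15 host sshd: invalid user Epresley01' from 10.0.0.5"

print(find_username(line))                  # Epresley01
print(find_username("user J98473' ok"))     # J98473
print(find_username("user epresley01' x"))  # None (no uppercase first letter)
```

Note that re.search is used without ^ and $, since the username sits in the middle of the line.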
I think you can do it like this:
(Updated after the comment from @ctwheels)
See regex in use here
\s[A-Z][a-zA-Z0-9]{2,11}'
Explanation
Match a whitespace \s
Match an uppercase character [A-Z]
Match [a-zA-Z0-9] between 2 and 11 times, for a total length of 3-12
Match an apostrophe '
Demo
I am trying to make regex that can match all of them:
word
word-hyphen
word-hyphen-again
That is, -\w+ could occur many times, depending on the number of words in a term. How can I make it optional?
Thing I made so far is given here:- https://regex101.com/r/Atpwze/1
Try using
\w+(-\w+)* for matching 0 or more hyphenated words after first word
\w+(-\w+){0,} same as first case
based on your exact requirement.
In order to eliminate some extreme cases like a-+-+---, you could use \w+(-\w+)*[^\W]
\W matches all non-word characters, and [^\W] negates that, so it matches a single word character (the same as \w).
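A quick way to try the first suggestion in Python. The sketch assumes you want to validate whole terms, so it uses fullmatch and swaps the capture group for a non-capturing one; both choices are mine:

```python
import re

# One word, followed by zero or more "-word" parts.
HYPHENATED = re.compile(r"\w+(?:-\w+)*")

terms = ["word", "word-hyphen", "word-hyphen-again", "word-", "-word", "a-+-+---"]

valid = [t for t in terms if HYPHENATED.fullmatch(t)]
print(valid)  # ['word', 'word-hyphen', 'word-hyphen-again']
```

Because every hyphen must be followed by at least one word character, a leading or trailing hyphen makes the full match fail.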
To catch all of your examples, I think you could use:
^\w+(?:\w+\-?|\-\w+)+$
Beginning of the string ^
Match a word character one or more times \w+
Start a non capturing group (?:
Match a word character one or more times with an optional hyphen \w+\-?
Or |
A hyphen with one or more word characters \-\w+
Close the non capturing group )
End of the string $
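If you want to see how this pattern behaves, here is a small Python sketch. The extra inputs "w" and "word-" are my own test cases, added to show two side effects of the alternation:

```python
import re

PATTERN = re.compile(r"^\w+(?:\w+\-?|\-\w+)+$")

samples = ["word", "word-hyphen", "word-hyphen-again"]
print(all(PATTERN.match(s) for s in samples))  # True

# The group must match at least once, so a single character is rejected,
# and the optional hyphen in the first alternative allows a trailing "-".
print(bool(PATTERN.match("w")))      # False
print(bool(PATTERN.match("word-")))  # True
```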
I am not sure why the regex - \b((\+65[\s\-]*)?[3689]\d{3}[\s\-]*\d{4})\b doesn't work for +6565066859
Your pattern currently doesn't work because of the word boundary that is placed at the start. Note that a word boundary will match between a word-character and
a non-word-character
the start of a string
the end of a string
In your case \b is placed between the start of the string and the +, where it cannot match because + is a non-word character. The only word boundary near the start is between the + and the first 6, so your first optional group will never match. The rest of the pattern consists of an 8-digit number (if we forget spaces and hyphens for a moment), but the number you are testing has 10 digits after the +, so both word boundaries can't match at the same time.
I think you can rewrite your pattern as ((?:(\+65[\s\-]*)|\b)[3689]\d{3}[\s\-]*\d{4})\b thus either matching +65 or the word boundary. Not sure if you use the capturing groups in your pattern, so I kept them as they are.
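Both patterns can be compared in Python with a few lines. This is only a sketch: the question does not say which language is used, and the variable names are mine:

```python
import re

original = re.compile(r"\b((\+65[\s\-]*)?[3689]\d{3}[\s\-]*\d{4})\b")
rewritten = re.compile(r"((?:(\+65[\s\-]*)|\b)[3689]\d{3}[\s\-]*\d{4})\b")

number = "+6565066859"

# \b cannot match before a leading "+", so the prefix group is never tried.
print(original.search(number))            # None

# Either the +65 prefix or a word boundary may start the match.
print(rewritten.search(number).group(1))  # +6565066859
print(rewritten.search("65066859").group(1))  # 65066859
```

Group 2 still holds the optional +65 prefix (or None when it is absent), as in the original pattern.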