Regex - Word boundary not working even with raw-string - python

I'm coding a set of regex to match dates in text using python. One of my regex was designed to match dates in the format MM/YYYY only. The regex is the following:
r'\b((?:(?:0)[0-9])|(?:(?:1)[0-2])|(?:(?:[1-9])))(?:\/|\\)(\d{4})\b'
Looks like the word boundary is not working as it is matching parts of dates like 12/02/2020 (it should not match this date format at all).
In the attached image, only the second pattern should have been recognized. No part of the first one should have matched.
Remembering that the regex should match the MM/YYYY pattern in strings like:
"The range of dates go from 21/02/2020 to 21/03/2020 as specified above."
Can you help me find the error in my pattern to make it match only my goal format?

A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).
What is a word boundary in regex?
What happens is that the / character is not a word character (\w), so every / in your string creates a word boundary next to the adjacent digits. In 12/02/2020 there is therefore a boundary right before 02, and the pattern matches 02/2020.
You have not provided the full string you are matching, but for the example you posted you could solve it by just adding the anchors ^ and $:
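A quick way to see this is to list the positions where \b matches inside a date. This is only a minimal sketch; the sample string and the variable names are illustrative.

```python
import re

text = "12/02/2020"

# Offsets at which \b matches: the string edges plus both sides of every "/".
boundary_offsets = [m.start() for m in re.finditer(r"\b", text)]

# A simplified MM/YYYY pattern wrapped in \b still finds a hit inside the full date.
partial_hits = re.findall(r"\b\d{2}/\d{4}\b", text)

print(boundary_offsets)
print(partial_hits)
```

The boundary right after the first slash is what lets the MM/YYYY pattern start in the middle of a DD/MM/YYYY date.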
^((?:(?:0)[0-9])|(?:(?:1)[0-2])|(?:(?:[1-9])))(?:\/|\\)(\d{4})$
https://regex101.com/r/xncZNN/1
edit:
Working from your full example and your regex, I did some clean-up because the pattern was a bit confusing, but I think I understood what you were trying to match.
Here is the new version:
(?<=^|[a-zA-Z ])(0[0-9]|1[0-2]|[1-9])(?:\/|\\)([\d]{4})(?=[a-zA-Z ]|$)
I have replaced the word boundaries with a lookahead (?=...) and a lookbehind (?<=...), and specified what is allowed to appear before and after the date. You can adjust it to your specific needs and add other characters, such as digits or punctuation.
https://regex101.com/r/xncZNN/4
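Note that the pattern above targets the PCRE flavour used on regex101. Python's re module only accepts fixed-width lookbehinds, so an alternation such as ^|[a-zA-Z ] inside (?<=...) is rejected there. A minimal Python sketch of the same idea, assuming months 00 to 12 and the same "letters or spaces only" context, turns both lookarounds into negative ones over the complementary class. The names and the sample sentence are illustrative.

```python
import re

# Lookbehind alternatives of different widths (0 and 1) are not accepted by re.
try:
    re.compile(r"(?<=^|[a-zA-Z ])\d")
    lookbehind_accepted = True
except re.error:
    lookbehind_accepted = False

# Same condition for Python: "not preceded / not followed by a character that
# is neither a letter nor a space". Start and end of string pass as well.
MM_YYYY = re.compile(
    r"(?<![^a-zA-Z ])(0[0-9]|1[0-2]|[1-9])[/\\](\d{4})(?![^a-zA-Z ])"
)

text = "The range of dates go from 21/02/2020 to 21/03/2020, due 03/2021 at the latest."
matches = [m.group(0) for m in MM_YYYY.finditer(text)]

print(lookbehind_accepted)
print(matches)
```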

The problem is that \b\d{2}/\d{4}\b matches 02/2000 in the string 01/02/2000 because there is a word boundary between the first forward slash and the 0 that follows it. The solution is to identify the characters that should not precede and follow the match and use negative lookarounds in place of word boundaries. Here you could use the regular expression
r'(?<![\d/])(?:0[1-9]|1[0-2])/\d{4}(?![\d/])'
The negative lookbehind, (?<![\d/]), prevents the two digits representing the month from being preceded by a digit or forward slash; the negative lookahead, (?![\d/]), prevents the four digits representing the year from being followed by a digit or forward slash.
Regex demo
Python demo
If 6/2000 is to be matched as well as 06/2000, change (?:0[1-9] to (?:0?[1-9].
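As a rough check of both variants in Python, the sketch below runs them over the sentence from the question with two MM/YYYY dates appended. The appended dates and the variable names are assumptions made for illustration.

```python
import re

# Strict variant: the month must have two digits (01 to 12).
strict = re.compile(r"(?<![\d/])(?:0[1-9]|1[0-2])/\d{4}(?![\d/])")
# Lenient variant: the leading zero of the month is optional.
lenient = re.compile(r"(?<![\d/])(?:0?[1-9]|1[0-2])/\d{4}(?![\d/])")

text = (
    "The range of dates go from 21/02/2020 to 21/03/2020 as specified above. "
    "Next review: 06/2021 or 6/2022."
)

strict_hits = strict.findall(text)
lenient_hits = lenient.findall(text)

print(strict_hits)
print(lenient_hits)
```

Neither variant picks up any part of the DD/MM/YYYY dates, because every candidate start inside them is preceded by a digit or a slash.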

Related

Why positive lookahead is working but negative lookahead doesn't?

First of all, the regex needs to work in both Python and PCRE (PHP). I'm trying to ignore a match if it is followed by the letter 'x', to distinguish dimensions from strings like "number/number" in the example below:
dummy word 222/2334; Ø14 x Ø6,33/523,23 x 2311 mm
From here, I'm trying to extract 222/2334 but not the 6,33/523,23 since that part is actually part of dimensions. So far I came up with this regex
((\d*(?:,?\.?)\d*(?:,?\.?))\s?\/\s?(\d*(?:,?\.?)\d*(?:,?\.?)))(?=\s?x)
which extracts exactly the part I don't want (6,33/523,23). If I change the positive lookahead to a negative one, it captures both of them, except for the last '3' of 6,33/523,23. How can I capture only 222/2334? What am I doing wrong here?
Desired output:
222/2334
What I got
222/2334 6,33/523,2
You may use this simplified regex with negative lookahead:
((\d*(?:,?\.?)\d*(?:,?\.?))\s?\/\s?(\d*(?:,?\.?)\d*(?:,?\.?)))\b(?![.,]?\d|\s?x)
Updated RegEx Demo
It is important to use a word boundary at the end to avoid matching partial numbers (this is the reason your regex stopped one digit short).
Also include [.,]?\d in the negative lookahead so that the match cannot end just before the last comma.
This shorter (and more efficient) regex may also work for OP:
(\d+(?:[,.]\d+)*)\s*\/\s*(\d+(?:[,.]\d+)*)\b(?![.,]?\d|\s?x)
RegEx Demo 2
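To see what the word boundary changes, here is a small Python sketch that runs the shorter pattern with and without \b on the string from the question. The capture groups are dropped so that findall returns whole matches; the variable names are illustrative.

```python
import re

text = "dummy word 222/2334; Ø14 x Ø6,33/523,23 x 2311 mm"

number = r"\d+(?:[,.]\d+)*"

# Negative lookahead alone: the engine gives back one digit to satisfy it.
without_boundary = re.compile(rf"{number}\s*/\s*{number}(?!\s?x)")
# Word boundary plus the extended lookahead blocks that backtracking.
with_boundary = re.compile(rf"{number}\s*/\s*{number}\b(?![.,]?\d|\s?x)")

hits_without = without_boundary.findall(text)
hits_with = with_boundary.findall(text)

print(hits_without)
print(hits_with)
```

The first result reproduces the truncated 6,33/523,2 from the question, the second keeps only the wanted value.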
There are two easy options.
The first option is ugly and long, but basically negates a positive match on the string that is followed by x, then matches the patterns without it.
(?!PATTERN(?=x))PATTERN
See regex in use here
(?!\d+(?:[,.]\d+)?\s?\/\s?\d+(?:[,.]\d+)?(?=\s?x))(\d+(?:[,.]\d+)?)\s?\/\s?(\d+(?:[,.]\d+)?)
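A sketch of this first option in Python, building the expression from a single sub-pattern so the repetition is easier to read. The helper names are illustrative, and the capture groups are left out so that findall returns whole matches.

```python
import re

number = r"\d+(?:[,.]\d+)?"
ratio = rf"{number}\s?/\s?{number}"

# (?!PATTERN(?=x))PATTERN: refuse every start position from which the pattern
# could be matched so that it ends right before an optional space and "x".
not_a_dimension = re.compile(rf"(?!{ratio}(?=\s?x)){ratio}")

text = "dummy word 222/2334; Ø14 x Ø6,33/523,23 x 2311 mm"
ratios = not_a_dimension.findall(text)

print(ratios)
```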
The second option uses possessive quantifiers, which in Python require either the third-party regex module or Python 3.11+, where re supports them as well.
See regex in use here
(\d++(?:[,.]\d++)?+)\s?\/\s?(\d++(?:[,.]\d++)?+)(?!\s?x)
Additionally, I changed your subpattern to \d+(?:[,.]\d+)?. This will match one or more digits, then optionally match . or , followed by one or more digits.

Regex for parsing uid from URL

I am trying to parse UIDs from URLs. However, regex is not something I am good at, so I am seeking some help.
Example Input:
https://example.com/d/iazs9fEil/somethingelse?foo=bar
Example Output:
iazs9fEil
What I've tried so far is
([/d/]+[\d\x])\w+
Which somehow works, but returns it with the /d/ prefix, so the output is /d/iazs9fEil.
How to change the regex to not contain the /d/ prefix?
EDIT:
I've tried the regex ([^/d/]+[\d\x])\w+, which outputs the correct string, iazs9fEil, but also returns the rest of the URL, here somethingelse?foo=bar
In short, you may use
import re

match = re.search(r'/d/(\w+)', your_string)  # Look for a match
if match:  # Check if there is a match first
    print(match.group(1))  # Now, get Group 1 value
See this regex demo and a regex graph:
NOTE
/ is not any special metacharacter, do not escape it in Python string patterns
([/d/]+[\d\x])\w+ matches and captures into Group 1 one or more slashes or d letters (see [/d/]+, a positive character class), then a digit or \x (here, Python raises an error: sre_constants.error: incomplete escape \x; it might have been meant as a literal x, but that is not how it is parsed), and then matches 1+ word chars. Because you put /d/ into a character class, it stopped matching a char sequence: [/d/]+ matches slashes and d letters in any order and amount, and places that string into Group 1.
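Both points can be checked quickly in Python. The sketch below uses made-up test strings and illustrative variable names.

```python
import re

char_class = re.compile(r"[/d/]+")

# The class holds just two distinct characters: "/" and the letter "d".
matches_slashes_and_d = char_class.fullmatch("d//dd/") is not None
matches_digits = char_class.fullmatch("123") is not None

# \x must be followed by two hex digits, so the original pattern fails to compile.
try:
    re.compile(r"([/d/]+[\d\x])\w+")
    compiles = True
    error_message = ""
except re.error as exc:
    compiles = False
    error_message = str(exc)

print(matches_slashes_and_d, matches_digits, compiles, error_message)
```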
Try (?<=/d/)[^/]+
Explanation:
(?<=/d/) - positive lookbehind, asserts that what precedes is /d/
[^/]+ - match one or more characters other than /, so it matches everything until /
Demo
You could use a capturing group:
https?://.*?/d/([^/\s]+)
Regex demo
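For comparison, here is a short Python sketch that runs the three suggestions from this thread against the example URL. The variable names are illustrative.

```python
import re

url = "https://example.com/d/iazs9fEil/somethingelse?foo=bar"

# 1. Capture group right after the literal "/d/" prefix.
by_group = re.search(r"/d/(\w+)", url)
# 2. Lookbehind: the prefix is required but is not part of the match.
by_lookbehind = re.search(r"(?<=/d/)[^/]+", url)
# 3. Anchor on the scheme and capture up to the next slash or whitespace.
by_full_url = re.search(r"https?://.*?/d/([^/\s]+)", url)

uids = [
    by_group.group(1) if by_group else None,
    by_lookbehind.group(0) if by_lookbehind else None,
    by_full_url.group(1) if by_full_url else None,
]

print(uids)
```

All three agree on this input. They differ on UIDs that contain characters outside \w, such as a hyphen, where only the [^/]+ variants keep the whole segment.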

Python regex: How to make a group of words/character optional?

I am trying to make regex that can match all of them:
word
word-hyphen
word-hyphen-again
That is, -\w+ may occur any number of times, depending on how many words a term has. How can I make it optional?
What I have so far is here: https://regex101.com/r/Atpwze/1
Try using
\w+(-\w+)* for matching 0 or more hyphenated words after first word
\w+(-\w+){0,} same as first case
based on your exact requirement.
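A minimal Python sketch of the first suggestion, using fullmatch so that the whole term has to fit the pattern. The last two samples are assumptions added to show what gets rejected.

```python
import re

# One word, then zero or more "-word" groups (non-capturing here).
term = re.compile(r"\w+(?:-\w+)*")

samples = ["word", "word-hyphen", "word-hyphen-again", "word-", "-word"]
accepted = [s for s in samples if term.fullmatch(s)]

print(accepted)
```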
In order to eliminate some extreme cases like a-+-+---, you could use \w+(-\w+)*[^\W]
\W matches any non-word character, so the negated class [^\W] matches any word character (it is equivalent to \w)
To catch all of your examples, I think you could use:
^\w+(?:\w+\-?|\-\w+)+$
Beginning of the string ^
Match a word character one or more times \w+
Start a non capturing group (?:
Match a word character one or more times with an optional hyphen \w+\-?
Or |
A hyphen with one or more word characters \-\w+
Close the non capturing group )
End of the string $
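It can be worth running this pattern against a few edge cases before relying on it. In the sketch below the samples "a" and "word-" are assumptions added for illustration; they are not part of the question.

```python
import re

pattern = re.compile(r"^\w+(?:\w+\-?|\-\w+)+$")

samples = ["word", "word-hyphen", "word-hyphen-again", "a", "word-"]
results = {s: bool(pattern.search(s)) for s in samples}

for sample, matched in results.items():
    print(sample, matched)
```

Because the group must match at least once, a single-character word is rejected, and the optional hyphen in the first alternative lets a trailing hyphen through.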

Negative lookahead not working after character range with plus quantifier

I am trying to implement a regex which includes all the strings which have any number of words but cannot be followed by a : and ignore the match if it does. I decided to use a negative look ahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
Since I am using a character range, it matches one character at a time and only ignores the last character before the : .
How do I ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1
The issue is related to backtracking: once your [a-zA-Z]+ comes to a :, the engine steps back from the failing position, re-checks the lookahead and finds a match whenever there are at least two letters before a colon, returning the letters up to the one that is not immediately followed by :. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed by :.
Adding implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed with a colon, you can explicitly add one more condition that is implied: and not followed with another letter:
[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
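The backtracking effect and the fix can be reproduced in Python with the string from the question. This is a minimal sketch with illustrative variable names.

```python
import re

text = "lame:joker"

# Lookahead only checks for ":", so the engine gives back the "e" and
# settles for "lam", which is followed by a letter rather than a colon.
backtracked = re.findall(r"[a-zA-Z]+(?!:)", text)

# Also forbidding a following letter leaves no shorter match to fall back to.
whole_words = re.findall(r"[A-Za-z]+(?![A-Za-z:])", text)

print(backtracked)
print(whole_words)
```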
Preventing backtracking into a word-like pattern by using a word boundary
As @scnerd suggests, word boundaries can also help in these situations, but there is always a catch: word boundary meaning is context dependent (see a number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letter, digits and underscore). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in "12," but not in "12:", "12c" or "12_"
\d+(?![\d:]) matches 12 in "12,", "12c" and "12_", and fails only in "12:"
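The difference between the two approaches can be tabulated with a few lines of Python. The helper first_match is a hypothetical convenience function, not part of re.

```python
import re

samples = ["12,", "12:", "12c", "12_"]


def first_match(pattern, text):
    """Return the first matched substring, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# \b needs a non-word character (or the string end) right after the digits.
with_boundary = [first_match(r"\d+\b(?!:)", s) for s in samples]
# The lookahead only excludes a following digit or colon.
with_lookahead = [first_match(r"\d+(?![\d:])", s) for s in samples]
# For letters, the word boundary version handles the original input.
word_hits = re.findall(r"[A-Za-z]+\b(?!:)", "lame:joker")

print(with_boundary)
print(with_lookahead)
print(word_hits)
```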
Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)
Here's an example run.

Regular expression in python 2.7.11

I am not sure why the regex - \b((\+65[\s\-]*)?[3689]\d{3}[\s\-]*\d{4})\b doesn't work for +6565066859
Your pattern currently doesn't work because of the word boundary that is placed at the start. Note that a word boundary matches
between a word character and a non-word character
at the start of a string, if the first character is a word character
at the end of a string, if the last character is a word character
In your case the leading \b cannot match between the start of the string and the +, because + is not a word character; the first position where it matches is between the + and the 6. By then the + has already been skipped, so your first optional group can never match. The rest of the pattern consists of an 8-digit number (if we forget spaces and hyphens for a moment), but 10 digits follow the +, so the closing \b would have to fall between two digits, where there is no word boundary.
I think you can rewrite your pattern as ((?:(\+65[\s\-]*)|\b)[3689]\d{3}[\s\-]*\d{4})\b thus either matching +65 or the word boundary. Not sure if you use the capturing groups in your pattern, so I kept them as they are.
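A quick comparison of the two patterns, written as a sketch in Python 3 syntax (the regex behaviour described here is the same in 2.7). The second test number without a country code is an assumption added for illustration.

```python
import re

original = re.compile(r"\b((\+65[\s\-]*)?[3689]\d{3}[\s\-]*\d{4})\b")
rewritten = re.compile(r"((?:(\+65[\s\-]*)|\b)[3689]\d{3}[\s\-]*\d{4})\b")

with_prefix = "+6565066859"
without_prefix = "6506 6859"

# The leading \b forces the match to start after the "+", so this fails.
original_hit = original.search(with_prefix)

# Either the "+65" prefix or a word boundary may open the match.
rewritten_with = rewritten.search(with_prefix)
rewritten_without = rewritten.search(without_prefix)

print(original_hit)
print(rewritten_with.group(1) if rewritten_with else None)
print(rewritten_without.group(1) if rewritten_without else None)
```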
