Why positive lookahead is working but negative lookahead doesn't? - python

First of all, the regex needs to work in both Python and PCRE (PHP). I'm trying to skip a match when it is followed by the letter 'x', to distinguish dimensions from strings like "number/number" in the example below:
dummy word 222/2334; Ø14 x Ø6,33/523,23 x 2311 mm
From here, I'm trying to extract 222/2334 but not the 6,33/523,23 since that part is actually part of dimensions. So far I came up with this regex
((\d*(?:,?\.?)\d*(?:,?\.?))\s?\/\s?(\d*(?:,?\.?)\d*(?:,?\.?)))(?=\s?x)
which extracts exactly what I don't want it to extract. If I change the positive lookahead to a negative one, it captures both of them, except for the last '3' of 6,33/523,23. How can I capture only 222/2334? What am I doing wrong here?
Desired output:
222/2334
What I got
222/2334 6,33/523,2

You may use this simplified regex with negative lookahead:
((\d*(?:,?\.?)\d*(?:,?\.?))\s?\/\s?(\d*(?:,?\.?)\d*(?:,?\.?)))\b(?![.,]?\d|\s?x)
It is important to use a word boundary at the end to avoid matching partial numbers (this is why your regex stopped one digit early).
Also include [.,]?\d in the negative lookahead so that the match cannot end just before the last comma.
This shorter (and more efficient) regex may also work for OP:
(\d+(?:[,.]\d+)*)\s*\/\s*(\d+(?:[,.]\d+)*)\b(?![.,]?\d|\s?x)
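As a quick check in Python, the sketch below runs the shorter pattern against the sample line from the question, next to a variant that keeps only the (?!\s?x) lookahead. The names SAMPLE, FIXED and NAIVE are illustrative only.

```python
import re

SAMPLE = "dummy word 222/2334; Ø14 x Ø6,33/523,23 x 2311 mm"

# Shorter pattern from the answer: word boundary plus the two-branch negative lookahead.
FIXED = re.compile(r"(\d+(?:[,.]\d+)*)\s*\/\s*(\d+(?:[,.]\d+)*)\b(?![.,]?\d|\s?x)")

# Same pattern with only (?!\s?x): the engine gives back one digit to satisfy the lookahead.
NAIVE = re.compile(r"(\d+(?:[,.]\d+)*)\s*\/\s*(\d+(?:[,.]\d+)*)(?!\s?x)")

fixed_matches = [m.group(0) for m in FIXED.finditer(SAMPLE)]
naive_matches = [m.group(0) for m in NAIVE.finditer(SAMPLE)]

print(fixed_matches)  # ['222/2334']
print(naive_matches)  # ['222/2334', '6,33/523,2']
```

The second result reproduces the truncated 6,33/523,2 from the question: without the boundary, backtracking by one digit is enough to make the negative lookahead succeed.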

There are two easy options.
The first option is ugly and long, but basically negates a positive match on the string that is followed by x, then matches the patterns without it.
(?!PATTERN(?=x))PATTERN
(?!\d+(?:[,.]\d+)?\s?\/\s?\d+(?:[,.]\d+)?(?=\s?x))(\d+(?:[,.]\d+)?)\s?\/\s?(\d+(?:[,.]\d+)?)
The second option uses possessive quantifiers, which the re module only supports from Python 3.11 onward; on older versions you'll have to use the regex module instead of re.
(\d+(?:[,.]\d+)?+)\s?\/\s?(\d+(?:[,.]\d+)?+)(?!\s?x)
Additionally, I changed your subpattern to \d+(?:[,.]\d+)?. This will match one or more digits, then optionally match . or , followed by one or more digits.
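A minimal sketch of the first option with the standard re module, building the (?!PATTERN(?=x))PATTERN shape from smaller pieces. The variable names NUM, FRACTION and guarded are assumptions made for this illustration.

```python
import re

# One number with an optional decimal part, as in the answer's subpattern.
NUM = r"\d+(?:[,.]\d+)?"
FRACTION = rf"{NUM}\s?\/\s?{NUM}"

# (?!PATTERN(?=\s?x))PATTERN: refuse to start a match where the full
# pattern followed by "x" would match, then match the pattern itself.
guarded = re.compile(rf"(?!{FRACTION}(?=\s?x))({NUM})\s?\/\s?({NUM})")

text = "dummy word 222/2334; Ø14 x Ø6,33/523,23 x 2311 mm"
found = [m.group(0) for m in guarded.finditer(text)]
print(found)  # ['222/2334']
```

Because the negative lookahead is tested at every possible start position, the shorter candidates 33/523,23 and 3/523,23 are rejected as well.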


Regular Expression which matches two duplicate consecutive characters within string but not three or more. Should match if both 'aa' and 'bbb' exist

My original question was closed for being a duplicate. I disagree with it being a duplicate as this is a different use case looking at regular expression syntax. I have tried to clarify my question below.
Is it possible to create a regular expression which matches two duplicate consecutive characters within a string (in this example lowercase letters) but does not match a section of the string if the same characters are either side. e.g. match 'aa' but not 'aaa' or 'aaaa'?
Additionally:
Although I am using Python 3.10 I am trying to work out if this is possible using 'standard' regular expression syntax without utilising additional functionality provided by external modules. For example using Python this would mean a solution which uses the 're' module from the standard library.
If there are 3 or more duplicate consecutive characters, the string should still match if there are two duplicate consecutive characters elsewhere in the string, e.g. match 'aa' even if 'bbb' exists elsewhere in the string.
The string should also match if the two duplicate consecutive characters appear at the beginning or end of the string.
My examples are 16 character strings if a specific length makes a difference.
Examples:
ffumlmqwfcsyqpss should match either 'ff' or 'ss'.
zztdcqzqddaazdjp should match either 'zz','dd', 'aa'.
urrvucyrzzzooxhx should match 'rr' or 'oo' even though 'zzz' exists in the string.
zettygjpcoedwyio should match 'tt'.
dtfkgggvqadhqbwb should not match 'ggg'.
rwgwbwzebsnjmtln should not match.
What I had originally tried
([a-z])\1 to capture the duplicate character but this also matches when there are additional duplicate characters such as 'aaa' or 'aaaa' etc.
([a-z])\1(?!\1) to negate the third duplicate character but this just moves the match to the end of the duplicate character string.
Negative lookarounds to compensate for a match at the beginning but I think I am causing some kind of loop which will never match.
>>>import re
>>>re.search(r'([a-z])\1(?!\1)', 'dtfkgggvqadhqbwb')
<re.Match object; span=(5, 7), match='gg'> # should not match as 'gg' ('[gg]g' or 'g[gg]')
Currently offered solutions don't match described criteria.
Wiktor Stribiżew's solution uses the additional (*SKIP) functionality of the external python regex module.
Tim Biegeleisen's solution does not match duplicate pairs if there are duplicate triples etc in the same string.
In the linked question, Cary Swoveland's solutions do not work for duplicate pairs at the beginning or end of a string or match even when there is no duplicate in the string.
In the linked question, the fourth bird's solution does not match duplicate pairs at the beginning or end of strings.
Summary
So far the only answer which works is Wiktor Stribiżew's but this uses the (*SKIP) function of the external 'regex' module. Is a solution not possible using 'standard' regular expression syntax?
In Python re, the main problem with creating the right regex for this task is that you need to define the capturing group before using a backreference to it, while negative lookbehinds are usually placed before the captured pattern. Also, the regex101.com Python option does not always reflect the current state of the re library: it confuses users with a message like "This token can not be used in a lookbehind due to either making it non-fixed width or interfering with the pattern matching" when it sees a \1 in (?<!\1), while Python has allowed this since v3.5 for groups of fixed length.
The pattern you can use here is
(.)(?<!\1.)\1(?!\1)
Details
(.) - Capturing group 1: any single char (if re.DOTALL is used, even line break chars)
(?<!\1.) - a negative lookbehind that fails the match if there is the same char as captured in Group 1 and then any single char (we can use \1 instead of the . here, and it will work the same) immediately to the left of the current location
\1 - same char as in Group 1
(?!\1) - a negative lookahead that fails the match if there is the same char as in Group 1 immediately to the right of the current location.
See the Python test:
import re
tests = {'ffumlmqwfcsyqpss': ['ff','ss'],
         'zztdcqzqddaazdjp': ['zz','dd', 'aa'],
         'urrvucyrzzzooxhx': ['rr','oo'],
         'zettygjpcoedwyio': ['tt'],
         'dtfkgggvqadhqbwb': [],
         'rwgwbwzebsnjmtln': []
}
for test, answer in tests.items():
    matches = [m.group() for m in re.finditer(r'(.)(?<!\1.)\1(?!\1)', test, re.DOTALL)]
    if matches:
        print(f"Matches found in '{test}': {matches}. Is the answer expected? {set(matches)==set(answer)}.")
    else:
        print(f"No match found in '{test}'. Is the answer expected? {set(matches)==set(answer)}.")
Output:
Matches found in 'ffumlmqwfcsyqpss': ['ff', 'ss']. Is the answer expected? True.
Matches found in 'zztdcqzqddaazdjp': ['zz', 'dd', 'aa']. Is the answer expected? True.
Matches found in 'urrvucyrzzzooxhx': ['rr', 'oo']. Is the answer expected? True.
Matches found in 'zettygjpcoedwyio': ['tt']. Is the answer expected? True.
No match found in 'dtfkgggvqadhqbwb'. Is the answer expected? True.
No match found in 'rwgwbwzebsnjmtln'. Is the answer expected? True.
You may use the following regex pattern:
^(?![a-z]*([a-z])\1{2,})[a-z]*([a-z])\2[a-z]*$
This pattern says to match:
^ start of the string
(?![a-z]*([a-z])\1{2,}) assert that no letter occurs 3 or more times in a row anywhere in the string
[a-z]* zero or more letters
([a-z]) capture a letter
\2 which is followed by the same letter
[a-z]* zero or more letters
$ end of the string
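To see how this whole-string pattern behaves on the question's examples, here is a small sketch with the re module. The names whole, samples and results are illustrative.

```python
import re

# Whole-string pattern from the answer above.
whole = re.compile(r"^(?![a-z]*([a-z])\1{2,})[a-z]*([a-z])\2[a-z]*$")

samples = ["zettygjpcoedwyio", "urrvucyrzzzooxhx", "rwgwbwzebsnjmtln"]
results = {s: bool(whole.match(s)) for s in samples}

for s, matched in results.items():
    print(s, matched)
```

The first string matches because of 'tt'. The second is rejected as a whole because it contains 'zzz', even though 'rr' and 'oo' are present, which is the limitation the question points out for this approach. The third has no doubled letter at all.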

Regex - Word boundary not working even with raw-string

I'm coding a set of regex to match dates in text using python. One of my regex was designed to match dates in the format MM/YYYY only. The regex is the following:
r'\b((?:(?:0)[0-9])|(?:(?:1)[0-2])|(?:(?:[1-9])))(?:\/|\\)(\d{4})\b'
Looks like the word boundary is not working as it is matching parts of dates like 12/02/2020 (it should not match this date format at all).
In the attached image only the second pattern should have been recognized. No part of the first one should have matched.
Keep in mind that the regex has to look for the MM/YYYY pattern inside strings like:
"The range of dates go from 21/02/2020 to 21/03/2020 as specified above."
Can you help me find the error in my pattern to make it match only my goal format?
A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).
What happens is that the / character (like \) is not a word character (\w), so every slash in your string creates a new word boundary next to the adjacent digit.
You have not provided the full string you are matching, but for the example you posted you could solve it by just using the anchors ^ and $:
^((?:(?:0)[0-9])|(?:(?:1)[0-2])|(?:(?:[1-9])))(?:\/|\\)(\d{4})$
https://regex101.com/r/xncZNN/1
edit:
Working on your full example and your regex I did some "clean up" because it was a bit confusing, but I think I understood the pattern you were trying to map
Here is the new pattern:
(?<=^|[a-zA-Z ])(0[0-9]|1[0-2]|[1-9])(?:\/|\\)([\d]{4})(?=[a-zA-Z ]|$)
I have replaced the word boundaries with a lookbehind (?<=...) and a lookahead (?=...), and specified what may appear before and after the date. You can adjust this to your specific needs and allow other characters, such as digits. Note that Python's re module rejects the lookbehind as written because its alternatives have different widths; (?:^|(?<=[a-zA-Z ])) is an equivalent form that re accepts.
https://regex101.com/r/xncZNN/4
The problem is that \b\d{2}/\d{4}\b matches 02/2000 in the string 01/02/2000 because there is a word boundary between the first forward slash and the 0 that follows it. The solution is to identify the characters that should not precede and follow the match and use negative lookarounds in place of word boundaries. Here you could use the regular expression
r'(?<![\d/])(?:0[1-9]|1[0-2])/\d{4}(?![\d/])'
The negative lookbehind, (?<![\d/]), prevents the two digits representing the month from being preceded by a digit or forward slash; the negative lookahead, (?![\d/]), prevents the four digits representing the year from being followed by a digit or forward slash.
If 6/2000 is to be matched as well as 06/2000, change (?:0[1-9] to (?:0?[1-9].
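The difference between the word-boundary version and the lookaround version can be checked with a short Python sketch. The sample sentence and the variable names are made up for this illustration.

```python
import re

text = "Invoices from 02/2020 and 6/2021, issued 12/02/2020."

# With \b, the position right after the first "/" counts as a boundary,
# so the tail of the full date 12/02/2020 is picked up as well.
with_boundaries = re.findall(r"\b\d{2}/\d{4}\b", text)

# Lookarounds reject a candidate that touches another digit or slash.
strict = re.findall(r"(?<![\d/])(?:0[1-9]|1[0-2])/\d{4}(?![\d/])", text)

# Variant with an optional leading zero, so 6/2021 is accepted too.
lenient = re.findall(r"(?<![\d/])(?:0?[1-9]|1[0-2])/\d{4}(?![\d/])", text)

print(with_boundaries)  # ['02/2020', '02/2020']
print(strict)           # ['02/2020']
print(lenient)          # ['02/2020', '6/2021']
```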

Python regex conditional, don't match if

Sorry for the somewhat unhelpful title, I'm having a really hard time explaining this issue.
I have a list of unique identifiers that can appear in a number of different ways and I'm trying to use regex to normalize them so I can compare across several databases. Here are some examples of them:
AB1201
AB-1201
AB1201-T
AB-12-01L1
AB1201-TER
AB1201 Transit
I've written a line of code that pulls out all hyphens and spaces, and then used this regex:
([a-zA-Z]{2}[\d]{4})(L\d|Transit|T$)?
This works exactly as expected, returning a list looking like this:
AB1201
AB1201
AB1201T
AB1201L1
AB1201
AB1201T
The issue is, I have one identifier that looks like this: AB1201-02. I need this to be raised as an exception, and not included as a match.
Any ideas? I'm happy to provide more clarification if necessary. Thanks!
You can exclude matching the following hyphen and a digit (?!-\d) using a negative lookahead.
If it should start at the beginning of the string, you could use an anchor ^
Note that you could write [\d] as \d
^([a-zA-Z]{2}\d{4})(?!-\d)(L\d|Transit|T$)?
The pattern will look like
^ Start of string
( Capture group 1
[a-zA-Z]{2}\d{4} Match 2 times a-zA-Z and 4 digits
) Close group
(?!-\d) Negative lookahead, assert what is directly to the right is not - and a digit
(L\d|Transit|T$)? Optional capture group 2
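A minimal sketch of how this pattern could be used from Python, assuming it is applied to the raw identifier while the hyphen is still present (the lookahead needs it). The helper name parse_identifier is hypothetical.

```python
import re

ID_PATTERN = re.compile(r"^([a-zA-Z]{2}\d{4})(?!-\d)(L\d|Transit|T$)?")

def parse_identifier(raw):
    """Return (base, suffix) for an accepted identifier, or None if the pattern rejects it."""
    m = ID_PATTERN.match(raw)
    return (m.group(1), m.group(2)) if m else None

for raw in ["AB1201", "AB1201L1", "AB1201-TER", "AB1201-02"]:
    print(raw, "->", parse_identifier(raw))
```

Only AB1201-02 comes back as None, so the caller can treat it as the exception case. Identifiers with a hyphen inside the base part, such as AB-1201, do not match this anchored pattern either and would need separate handling.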
Try this regular expression
^([a-zA-Z]{2}[\d]{4})(?!-\d)(L\d|Transit|T|-[A-Z]{3})?$
I have added the (?!...) Negative Lookahead to avoid matching with the -02.
(?!...) Negative Lookahead: Starting at the current position in the expression, ensures that the given pattern will not match. Does not consume characters.

How to match digits only after a particular string, stop matching if non-digit is found - Python 2.7

I have a huge string like this: dsdasdludocid=15878284988193842600#lrd=0x3be04dcc5b5ac513:0xdc5b0011ebb625a8,2
I want to get the number after ludocid, only consecutive numbers.
I have tried this regex (ludocid).*(?=\d+\d+) and many more but no luck.
You can try ludocid=(\d+):
s = "dsdasdludocid=15878284988193842600#lrd=0x3be04dcc5b5ac513:0xdc5b0011ebb625a8,2"
import re
re.findall(r"ludocid=(\d+)", s)
# ['15878284988193842600']
You can use this regex:
ludocid\D*(\d+)
This will match literal ludocid followed by 0 or more non-digits and then it will match 1 or more digits in captured group #1
Code:
>>> s = 'dsdasdludocid=15878284988193842600#lrd=0x3be04dcc5b5ac513:0xdc5b0011ebb625a8,2'
>>> print re.search(r'ludocid\D*(\d+)', s).group(1)
15878284988193842600
It looks like you just threw a bunch of regex bits together... Let's work through that.
First, this is the correct regex: ludocid.(\d+)
(You would want to use it with re.search instead of re.match, by the way. re.match only succeeds if the regex matches at the very beginning of the string.)
But let's look at yours and see what went wrong and how we can get to the correct regex.
(ludocid).*(?=\d+\d+)
Imagine a regex as a function. You pass it the right things, and it gives you the appropriate result. When you wrap things in parentheses, you're saying "Find this and give it back to me." You don't need the ludocid given back to you, I'm guessing... so remove those paren.
ludocid.*(?=\d+\d+)
Now you've got a .*. This is dangerous in regular expressions because it literally says "Grab as many of anything as you possibly can!" Often I use the non-greedy version (.*?), but in this case it looks like we're just expecting a single extra character there. If you know the literal character you can use that, but to be safe I'll leave it as ., which says "Grab any one character."
ludocid.(?=\d+\d+)
Now let's go inside the parentheses. You've got \d+\d+, which says "Find a sequence of one or more digits, and then find another sequence of one or more digits." This equates to "Find a sequence of two or more digits." I don't think this is what you wanted (it's not how you described the problem, anyway), so let's reduce that:
ludocid.(?=\d+)
Okay, great. Now... what is (?=...) for? It's called a lookahead assertion. It says "If you find this string, match things in front of it." The example given in the Python 2.7 documentation is:
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
Essentially this means that your regex will never return the digits. Instead, it looks to see if digits exist, and then it returns things from the rest of the regex. Remove the lookahead assertion and we're there:
ludocid.(\d+)
When you use this with re.search, you'll get the group you want:
>>> s = "dsdasdludocid=15878284988193842600#lrd=0x3be04dcc5b5ac513:0xdc5b0011ebb625a8,2"
>>> import re
>>> re.search(r"ludocid.(\d+)", s).group(1)
'15878284988193842600'
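The effect of each step in this walkthrough can be checked directly. The following sketch (Python 3 syntax; the names peek and grab are illustrative) compares the lookahead version with the capturing version on the same string.

```python
import re

s = "dsdasdludocid=15878284988193842600#lrd=0x3be04dcc5b5ac513:0xdc5b0011ebb625a8,2"

# The lookahead only checks that digits follow; they are not part of the match.
peek = re.search(r"ludocid.(?=\d+)", s)
print(peek.group(0))  # ludocid=

# A capturing group consumes the digits and hands them back.
grab = re.search(r"ludocid.(\d+)", s)
print(grab.group(1))  # 15878284988193842600

# re.match is anchored at the start of the string, so it finds nothing here.
print(re.match(r"ludocid.(\d+)", s))  # None
```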
To match only the digits that follow, stopping at the first non-numeric char, try a positive look behind:
(?<=ludocid=)(\d+)
So:
re.findall(r"(?<=ludocid=)(\d+)", s)
The positive look behind will look for what you want, and only match if it is preceded by the 'flag' string.
Note: You may need to escape that second = sign like this: (?<=ludocid\=)(\d+)

Example of "use \G in negative variable-length lookbehinds to limit how far back the lookbehind goes"

In the pypi page of the awesome regex module (https://pypi.python.org/pypi/regex) it is stated that \G can be used "in negative variable-length lookbehinds to limit how far back the lookbehind goes". Very interesting, but the page doesn't give any example and my white-belt regex-fu simply chokes when I try to imagine one.
Could anyone describe some sample use case?
Here's an example that uses \G and a negative lookbehind creatively:
regex.match(r'\b\w+\b(?:\s(\w+\b)(?<!\G.*\b\1\b.*\b\1\b))*', words)
words should be a string of alphanumeric characters separated by a single whitespace, for example "a b c d e a b b c d".
The pattern will match a sequence of unique words.
\w+ - Match the first word.
(?:\s(\w+\b) )* - match additional words ...
(?<!\G.*\b\1\b.*\b\1\b) - ... but for each new word added, check it didn't already appear until we get to \G.
A lookbehind at the end of the pattern that is limited at \G can assert another condition on the current match, which would not have been possible otherwise. Basically, the pattern is a variation on using lookaheads for AND logic in regular expressions, but is not limited to the whole string.
Here's a working example in .Net, which shares the same features.
Trying the same pattern in Python 2 with findall and the regex module gives me a segmentation fault, but match seems to work.
One example I could think of is using \G in a positive lookbehind to split a CSV row by commas:
regex.split(r'(?<=\G(?:"[^"]*(?:""[^"]*)*"|[^"]*)),', csv)
This is a variant you can only do with variable length lookbehind and \G.
Usually, if you want to use split, you add a lookahead that runs to the end of the row and checks that the remaining records are all valid, as in ,(?=([^\"]*\"[^\"]*\")*[^\"]*$). This is somewhat annoying because you keep matching up to the end of the string over and over.
We also never explicitly mention that the unquoted values are not commas ([^,]) because we use \G to match since the previous comma.
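For comparison, the lookahead-based split mentioned above needs no variable-length lookbehind, so it can be sketched with the standard re module. The sample row is made up, and the group is written as non-capturing because re.split would otherwise add captured text to the result.

```python
import re

row = 'a,"b,c",d'

# Split on a comma only when an even number of quotes follows it
# up to the end of the row, i.e. the comma is outside any quoted field.
fields = re.split(r',(?=(?:[^"]*"[^"]*")*[^"]*$)', row)
print(fields)  # ['a', '"b,c"', 'd']
```

Every candidate comma triggers a scan to the end of the row, which is the repeated work the \G variant avoids.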
This is an example that might be more useful than creative.
It uses a variable-length negative lookbehind that starts from the end of the last match (\G) and checks forward to the current position.
In this case (?<!\Ga+)b, the result is that it matches every other b in a string where b's are separated by one or more a's.
This can also be done in a fixed width lookbehind like (?<!\Ga)b where the result is that it matches every other b in a string where b's are separated by one a.
This is kind of a template where a and b could be bigger expressions and have a little more meaning.
(One thing to be aware of is that when using \G in a negative lookbehind, it is fairly easy to satisfy the negative assertion. So these kinds of things have gotcha written all over them!)
Don't have Python (latest, beta?) to test this with, so below is using C# console app.
string strSrc = "abaabaabaabaabaab";
Regex rxGtest = new Regex(@"(?<!\Ga+)b");
Match _m = rxGtest.Match(strSrc);
while (_m.Success)
{
    Console.WriteLine("Found: {0} at position {1}", _m.Groups[0].Value, _m.Index);
    _m = _m.NextMatch();
}
Output:
Found: b at position 4
Found: b at position 10
Found: b at position 16
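For readers without the regex module at hand, the behaviour of (?<!\Ga+)b on this input can be emulated with the standard library. This is only a sketch of the semantics for this one pattern, not a general replacement for \G, and the function name find_unblocked_b is hypothetical.

```python
import re

def find_unblocked_b(text):
    r"""Emulate (?<!\Ga+)b: report each b not preceded by a run of a's starting at \G."""
    positions = []
    last_end = 0  # plays the role of \G: end of the previous match, or the start of the text
    for m in re.finditer("b", text):
        gap = text[last_end:m.start()]
        # The lookbehind \Ga+ succeeds when the gap consists of one or more a's only,
        # and in that case the negative assertion rejects this b.
        if re.fullmatch("a+", gap):
            continue
        positions.append(m.start())
        last_end = m.end()
    return positions

print(find_unblocked_b("abaabaabaabaabaab"))  # [4, 10, 16]
```

The positions agree with the C# output above.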
