Python regex conditional, don't match if - python

Sorry for the somewhat unhelpful title, I'm having a really hard time explaining this issue.
I have a list of unique identifiers that can appear in a number of different ways and I'm trying to use regex to normalize them so I can compare across several databases. Here are some examples of them:
AB1201
AB-1201
AB1201-T
AB-12-01L1
AB1201-TER
AB1201 Transit
I've written a line of code that pulls out all hyphens and spaces, and then used this regex:
([a-zA-Z]{2}[\d]{4})(L\d|Transit|T$)?
This works exactly as expected, returning a list looking like this:
AB1201
AB1201
AB1201T
AB1201L1
AB1201
AB1201T
The issue is, I have one identifier that looks like this: AB1201-02. I need this to be raised as an exception, and not included as a match.
Any ideas? I'm happy to provide more clarification if necessary. Thanks!

You can exclude matching the following hyphen and a digit (?!-\d) using a negative lookahead.
If it should start at the beginning of the string, you could use an anchor ^
Note that you could write [\d] as \d
^([a-zA-Z]{2}\d{4})(?!-\d)(L\d|Transit|T$)?
The pattern will look like
^ Start of string
( Capture group 1
[a-zA-Z]{2}\d{4} Match 2 times a-zA-Z and 4 digits
) Close group
(?!-\d) Negative lookahead, assert what is directly to the right is not - and a digit
(L\d|Transit|T$)? Optional capture group 2
Regex demo
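A minimal Python sketch of the pattern above, run against a few identifiers from the question. The helper name normalize_id is only illustrative and not part of the original post. One assumption worth flagging: the lookahead has to see the hyphen, so the pattern must run on the identifier before hyphens are stripped out.

```python
import re

# Pattern from the answer above; anchored at the start of the string.
ID_RE = re.compile(r'^([a-zA-Z]{2}\d{4})(?!-\d)(L\d|Transit|T$)?')


def normalize_id(raw):
    """Return the normalized identifier, or None if the pattern rejects it.

    normalize_id is a hypothetical helper name, not part of the question.
    """
    m = ID_RE.match(raw)
    if m is None:
        return None
    # Group 2 is optional, so it is None when no suffix was matched.
    return m.group(1) + (m.group(2) or '')


for raw in ['AB1201', 'AB1201T', 'AB1201-T', 'AB1201-02']:
    print(raw, '->', normalize_id(raw))
```

AB1201-02 comes back as None here, so the caller can raise or log it as an exception instead of treating it as a match.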

Try this regular expression
^([a-zA-Z]{2}[\d]{4})(?!-\d)(L\d|Transit|T|-[A-Z]{3})?$
I have added the (?!...) Negative Lookahead to avoid matching with the -02.
(?!...) Negative Lookahead: Starting at the current position in the expression, ensures that the given pattern will not match. Does not consume characters.
You can view a demo on this link.


regex to match coordinates

I am trying to write a regex expression in python that can match the following lines - I am just able to match the very first number by doing something like this
re.compile(r'\d.\d{14}\s+')
but could not do rest. Also tried doing [^-\d] to catch the negative sign - does not seem working.
Any help? Thanks!
First, let's start by looking at the numbers. You've already got a decent expression for finding a single number (\d.\d{14}\s+), but there are a couple of things wrong with it.
In regex, . indicates any single character. This means that your expression will accept any character after the first digit.
It's not taking into account the possibility that there could be a negative sign at the beginning.
Both of these problems are really easy to fix. The first can be fixed by simply escaping the period (\.). The second can be fixed by adding the negative sign to the pattern and giving it a quantifier. In this case, the ? quantifier will be the best option because it matches between 0 and 1 times. All this means is that it won't care if the symbol is there, but if it is it will match it. After these 2 changes, the pattern looks like this: -?\d\.\d{14}\s+.
Next, we need to tell it to match more than once. This can be done very easily by putting the pattern in a group and applying a quantifier to said group. Now the question is which quantifier should be used. In your example, there are only 3 numbers before the single character at the end of the line. You can match this pattern exactly 3 times by using the {3} quantifier. If you know there will be at least 1 but don't know how many in total there will be, you can use the + quantifier. For this example I will be using the {3} quantifier just so it's more specific to your question. After adding this, the pattern will look something like this: (-?\d\.\d{14}\s+){3}
Now all that's left is to match the character at the end. You can use \S to match any single non-whitespace character. You can add a quantifier to it, but again, for the purposes of your question, I won't be since there's only a single character. The final expression would look like (-?\d\.\d{14}\s+){3}\S.
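The final expression can be tried out with a short sketch like the one below. The sample line is hypothetical, since the question's actual data is not shown, and the separate NUMBER_RE pattern is an addition for pulling out the individual values.

```python
import re

# Whole-line pattern from the answer: three numbers, then one non-space character.
LINE_RE = re.compile(r'(-?\d\.\d{14}\s+){3}\S')
# The same number sub-pattern on its own, used to pull out each value.
NUMBER_RE = re.compile(r'-?\d\.\d{14}')

# Hypothetical input line; the question's real data is not shown.
sample = '0.12345678901234 -1.23456789012345 2.34567890123456 A'

match = LINE_RE.match(sample)
# A repeated group keeps only its last repetition, so use findall for all three.
numbers = [float(n) for n in NUMBER_RE.findall(sample)] if match else []
print(numbers)
```

The whole-line pattern is used only to validate the shape of the line. Because a quantified capture group remembers just its final repetition, the individual numbers are collected with findall instead.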

Regular Expression which matches two duplicate consecutive characters within string but not three or more. Should match if both 'aa' and 'bbb' exist

My original question was closed for being a duplicate. I disagree with it being a duplicate as this is a different use case looking at regular expression syntax. I have tried to clarify my question below.
Is it possible to create a regular expression which matches two duplicate consecutive characters within a string (in this example lowercase letters) but does not match a section of the string if the same characters are either side. e.g. match 'aa' but not 'aaa' or 'aaaa'?
Additionally:
Although I am using Python 3.10 I am trying to work out if this is possible using 'standard' regular expression syntax without utilising additional functionality provided by external modules. For example using Python this would mean a solution which uses the 're' module from the standard library.
If there are 3 or more duplicate consecutive characters, the string should still match if there are two duplicate consecutive characters elsewhere in the string. e.g match 'aa' even if 'bbb' exists elsewhere in the string.
The string should also match if the two duplicate consecutive characters appear at the beginning or end of the string.
My examples are 16 character strings if a specific length makes a difference.
Examples:
ffumlmqwfcsyqpss should match either 'ff' or 'ss'.
zztdcqzqddaazdjp should match either 'zz','dd', 'aa'.
urrvucyrzzzooxhx should match 'rr' or 'oo' even though 'zzz' exists in the string.
zettygjpcoedwyio should match 'tt'.
dtfkgggvqadhqbwb should not match 'ggg'.
rwgwbwzebsnjmtln should not match.
What I had originally tried
([a-z])\1 to capture the duplicate character but this also matches when there are additional duplicate characters such as 'aaa' or 'aaaa' etc.
([a-z])\1(?!\1) to negate the third duplicate character but this just moves the match to the end of the duplicate character string.
Negative lookarounds to compensate for a match at the beginning but I think I am causing some kind of loop which will never match.
>>> import re
>>> re.search(r'([a-z])\1(?!\1)', 'dtfkgggvqadhqbwb')
<re.Match object; span=(5, 7), match='gg'> # should not match as 'gg' ('[gg]g' or 'g[gg]')
Currently offered solutions don't match described criteria.
Wiktor Stribiżew's solution uses the additional (*SKIP) functionality of the external python regex module.
Tim Biegeleisen's solution does not match duplicate pairs if there are duplicate triples etc in the same string.
In the linked question, Cary Swoveland's solutions do not work for duplicate pairs at the beginning or end of a string or match even when there is no duplicate in the string.
In the linked question, the fourth bird's solution does not match duplicate pairs at the beginning or end of strings.
Summary
So far the only answer which works is Wiktor Stribiżew's but this uses the (*SKIP) function of the external 'regex' module. Is a solution not possible using 'standard' regular expression syntax?
In Python re, the main problem with writing the right regex for this task is that you need to define the capturing group before using a backreference to it, while negative lookbehinds are usually placed before the captured pattern. Also, the regex101.com Python testing option does not always reflect the current state of the re library: it confuses users with a message like "This token can not be used in a lookbehind due to either making it non-fixed width or interfering with the pattern matching" when it sees a \1 in (?<!\1), while Python has allowed this since v3.5 for groups of fixed length.
The pattern you can use here is
(.)(?<!\1.)\1(?!\1)
See the regex demo.
Details
(.) - Capturing group 1: any single char (if re.DOTALL is used, even line break chars)
(?<!\1.) - a negative lookbehind that fails the match if there is the same char as captured in Group 1 and then any single char (we can use \1 instead of the . here, and it will work the same) immediately to the left of the current location
\1 - same char as in Group 1
(?!\1) - a negative lookahead that fails the match if there is the same char as in Group 1 immediately to the right of the current location.
See the Python test:
import re
tests = {
    'ffumlmqwfcsyqpss': ['ff', 'ss'],
    'zztdcqzqddaazdjp': ['zz', 'dd', 'aa'],
    'urrvucyrzzzooxhx': ['rr', 'oo'],
    'zettygjpcoedwyio': ['tt'],
    'dtfkgggvqadhqbwb': [],
    'rwgwbwzebsnjmtln': [],
}
for test, answer in tests.items():
    matches = [m.group() for m in re.finditer(r'(.)(?<!\1.)\1(?!\1)', test, re.DOTALL)]
    if matches:
        print(f"Matches found in '{test}': {matches}. Is the answer expected? {set(matches)==set(answer)}.")
    else:
        print(f"No match found in '{test}'. Is the answer expected? {set(matches)==set(answer)}.")
Output:
Matches found in 'ffumlmqwfcsyqpss': ['ff', 'ss']. Is the answer expected? True.
Matches found in 'zztdcqzqddaazdjp': ['zz', 'dd', 'aa']. Is the answer expected? True.
Matches found in 'urrvucyrzzzooxhx': ['rr', 'oo']. Is the answer expected? True.
Matches found in 'zettygjpcoedwyio': ['tt']. Is the answer expected? True.
No match found in 'dtfkgggvqadhqbwb'. Is the answer expected? True.
No match found in 'rwgwbwzebsnjmtln'. Is the answer expected? True.
You may use the following regex pattern:
^(?![a-z]*([a-z])\1{2,})[a-z]*([a-z])\2[a-z]*$
Demo
This pattern says to match:
^ start of the string
(?![a-z]*([a-z])\1{2,}) same letter does not occur 3 times or more
[a-z]* zero or more letters
([a-z]) capture a letter
\2 which is followed by the same letter
[a-z]* zero or more letters
$ end of the string
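As a rough check, this pattern can be run against some of the strings from the question. The sketch below only reports whether each whole string matches. Because the leading negative lookahead rejects any string that contains a triple, a string such as urrvucyrzzzooxhx is rejected even though it also contains 'rr' and 'oo'.

```python
import re

# Pattern from the answer above.
PAIR_RE = re.compile(r'^(?![a-z]*([a-z])\1{2,})[a-z]*([a-z])\2[a-z]*$')

# Strings taken from the question.
tests = [
    'ffumlmqwfcsyqpss',  # pairs only
    'urrvucyrzzzooxhx',  # pairs plus the triple 'zzz'
    'dtfkgggvqadhqbwb',  # triple only
    'rwgwbwzebsnjmtln',  # no repeats
]

# Map each string to True/False depending on whether the pattern matches it.
results = {s: PAIR_RE.match(s) is not None for s in tests}
for s, ok in results.items():
    print(s, 'matches' if ok else 'does not match')
```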

Why positive lookahead is working but negative lookahead doesn't?

First of all, the regex needs to work in both Python and PCRE (PHP). I'm trying to ignore a regex match if it is followed by the letter 'x', to distinguish dimensions from strings like "number/number" in the example below:
dummy word 222/2334; Ø14 x Ø6,33/523,23 x 2311 mm
From here, I'm trying to extract 222/2334 but not the 6,33/523,23 since that part is actually part of dimensions. So far I came up with this regex
((\d*(?:,?\.?)\d*(?:,?\.?))\s?\/\s?(\d*(?:,?\.?)\d*(?:,?\.?)))(?=\s?x)
which can extract what I don't want it to extract and it looks like this. If I change the positive lookahead to negative it captures both of them except the last '3' from 6,33/523,23. It looks like this. How can I only capture 222/2334? What am I doing wrong here?
Desired output:
222/2334
What I got
222/2334 6,33/523,2
You may use this simplified regex with negative lookahead:
((\d*(?:,?\.?)\d*(?:,?\.?))\s?\/\s?(\d*(?:,?\.?)\d*(?:,?\.?)))\b(?![.,]?\d|\s?x)
Updated RegEx Demo
It is important to use a word boundary at the end to avoid matching partial numbers (this is why your regex stopped one digit early).
Also include [.,]?\d in the negative lookahead condition so that the match doesn't end just before the last comma.
This shorter (and more efficient) regex may also work for OP:
(\d+(?:[,.]\d+)*)\s*\/\s*(\d+(?:[,.]\d+)*)\b(?![.,]?\d|\s?x)
RegEx Demo 2
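A small Python sketch of the shorter pattern, applied to the sample string from the question. The variable names are illustrative only.

```python
import re

# Shorter pattern from the answer above.
RATIO_RE = re.compile(
    r'(\d+(?:[,.]\d+)*)\s*\/\s*(\d+(?:[,.]\d+)*)\b(?![.,]?\d|\s?x)'
)

# Sample string from the question.
text = 'dummy word 222/2334; Ø14 x Ø6,33/523,23 x 2311 mm'

# Collect the full text of every match.
matches = [m.group(0) for m in RATIO_RE.finditer(text)]
print(matches)
```

The word boundary and the [.,]?\d alternative together stop the engine from backtracking into a shortened match such as 6,33/523 or 6,33/523,2.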
There are two easy options.
The first option is ugly and long, but basically negates a positive match on the string that is followed by x, then matches the patterns without it.
(?!PATTERN(?=x))PATTERN
See regex in use here
(?!\d+(?:[,.]\d+)?\s?\/\s?\d+(?:[,.]\d+)?(?=\s?x))(\d+(?:[,.]\d+)?)\s?\/\s?(\d+(?:[,.]\d+)?)
The second option uses possessive quantifiers, but you'll have to use the regex module instead of re in Python (the standard re module only gained possessive quantifiers in Python 3.11).
See regex in use here
(\d+(?:[,.]\d+)?+)\s?\/\s?(\d+(?:[,.]\d+)?+)(?!\s?x)
Additionally, I changed your subpattern to \d+(?:[,.]\d+)?. This will match one or more digits, then optionally match . or , followed by one or more digits.
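The first option works with the standard re module, so it can be sketched as below. The NUM and PATTERN helper strings are introduced here only to make the (?!PATTERN(?=x))PATTERN shape visible; they are not part of the answer.

```python
import re

# Sub-pattern for one number, as described in the answer above.
NUM = r'\d+(?:[,.]\d+)?'
# number / number, with optional whitespace around the slash.
PATTERN = rf'{NUM}\s?\/\s?{NUM}'
# Option 1: reject positions where PATTERN is followed by x, then match PATTERN.
OPTION1_RE = re.compile(rf'(?!{PATTERN}(?=\s?x))({NUM})\s?\/\s?({NUM})')

# Sample string from the question.
text = 'dummy word 222/2334; Ø14 x Ø6,33/523,23 x 2311 mm'

found = [m.group(0) for m in OPTION1_RE.finditer(text)]
parts = [m.groups() for m in OPTION1_RE.finditer(text)]
print(found, parts)
```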

Python Regex Behaviour

I'm trying to parse a text document with data in the following format: 24036 -977. I need to separate the numbers into separate values, and the way I've done that is with the following steps.
import re
values = re.search(r"(.*?)\s(.*)", "24036 -977")
x = values.group(1)
y = values.group(2)
This does the job, however I was curious about why using (.*?) in the second group causes the regex to fail. I tested it in the online regex tester (https://regex101.com/r/bM2nK1/1), and adding the ? causes the second group to return nothing. As far as I know, .*? means to take any value unlimited times, as few times as possible, and .* is just the greedy version of that. What I'm confused about is why the non-greedy version .*? takes that definition to mean capturing nothing?
Because it means to match the previous token, the ., as few times as possible, which is 0 times. If you would like it to extend to the end of the string, add a $, which matches the end of the string. If you would like it to match at least one character, use + instead of *.
The reason the first group .*? matches 24036 is because you have the \s token after it, so the fewest amount of characters the .*? could match and be followed by a \s is 24036.
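The three variants described above can be compared side by side with a short sketch, using the sample value from the question:

```python
import re

line = '24036 -977'  # sample value from the question

# Lazy second group with nothing after it: it is satisfied with zero characters.
lazy = re.search(r'(.*?)\s(.*?)', line)
# Adding $ forces the lazy group to stretch to the end of the string.
anchored = re.search(r'(.*?)\s(.*?)$', line)
# Using +? instead of *? requires at least one character, and takes exactly one.
at_least_one = re.search(r'(.*?)\s(.+?)', line)

print(repr(lazy.group(2)))
print(repr(anchored.group(2)))
print(repr(at_least_one.group(2)))
```

Note that the +? variant stops after a single character, so the $ anchor is the one that recovers the whole second number.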
@iobender has pointed out the answer to your question.
But I think it's worth mentioning that if the numbers are separated by space, you can just use split:
>>> '24036 -977'.split()
['24036', '-977']
This is simpler, easier to understand and often faster than regex.

Python regex: Matching a URL

I have some confusion regarding the pattern matching in the following expression. I tried to look up online but couldn't find an understandable solution:
imgurUrlPattern = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
What exactly are the parentheses doing ? I understood up until the first asterisk , but I can't figure out what is happening after that.
Regular expressions can be represented as graphs to understand their operation. A parallel connection between nodes indicates that a part is optional, a serial connection indicates that it is mandatory, and a loop indicates repetition over the same node.
(http://i.imgur.com/(.*))(\?.*)?
Debuggex Demo
So this starts with an imgur URL http://i.imgur.com/(.*) (mandatory), followed by any characters until an optional '?' is encountered, and then any characters after the '?'. Notice that '?' has been escaped to strip its usual meaning. The pink highlights indicate the capture groups.
(http://i.imgur.com/(.*))(\?.*)?
The first capturing group (http://i.imgur.com/(.*)) means that the string should start with http://i.imgur.com/ followed by any number of characters (.*) (this is a poor regex, you shouldn't do it this way). (.*) is also the second capturing group.
The third capturing group (\?.*) means that this part of the string must start with ? and then contain any number of any characters, as above.
The last ? means that the last capturing group is optional.
EDIT:
These groups can then be used as:
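import re
p = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
m = p.match('http://i.imgur.com/abc.jpg')  # hypothetical URL; a string like 'ab' would not match and m would be None
m.group(0)  # the whole match
m.group(2)  # the text after the last slash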
To improve the regex, you must limit the engine to what characters you need, like:
(http://i.imgur.com/([A-Za-z0-9\-]+))(\?[^/]+)?
[A-Za-z0-9\-]+ limit to alphanumeric characters and hyphens
[^/] exclude /
The (.*) means any character repeated any number of times, and the (\?.*)? is meant to match the query string of a URL, for example (an imgur search for "cat"):
http://imgur.com/search?q=cat
http://imgur.com/search is the part that (http://i.imgur.com/(.*)) is meant to match (search specifically by the (.*) section), and ?q=cat is the part that (\?.*)? is meant to match. The trailing ? makes that group optional, so there might or might not be a query string; there is none in the URL http://www.imgur.com. The parentheses are used for grouping. We want to group (http://i.imgur.com/(.*)) as one thing because it matches the URL, and there is another group within it that matches the page you are requesting (this is (.*)). We want to group (\?.*)? because it matches the query string.
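To see what the groups capture in practice, here is a small sketch. The URL is hypothetical, and the second pattern (lazy (.*?), escaped dots and a $ anchor) is an assumed variant, not something from the question. With the original pattern, the greedy (.*) runs to the end of the string, so the optional query-string group is left empty.

```python
import re

# Hypothetical URL used only for illustration.
url = 'http://i.imgur.com/abc.jpg?1'

# Original pattern: the greedy (.*) swallows the query string as well.
greedy = re.match(r'(http://i.imgur.com/(.*))(\?.*)?', url)
print(greedy.groups())

# Assumed variant: lazy (.*?) plus a $ anchor, with the dots escaped,
# so the query string lands in the third group.
lazy = re.match(r'(http://i\.imgur\.com/(.*?))(\?.*)?$', url)
print(lazy.groups())
```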
