Excluding a string in a regex expression - python

I currently have the following regular expression:
/(_|[a-z]|[A-Z])(_|[a-z]|[A-Z]|[0-9])*/
I would like the expression not to match with "PI", however I failed to do so.
To clarify, I would like the following to be valid:
_PI, abcPI, PIpipipi
I just don't want to accept PI when it's on its own.

Before jumping to the solution, please have a look at your regex: wrapping single-range character classes in alternation groups is an inefficient way of writing regex patterns. For example, you may simply merge ([A-Z]|[0-9]|_)+ into [A-Z0-9_]+.
The solution may be a word boundary with a negative lookahead after it:
r"\b(?!PI\b)[_a-zA-Z][_a-zA-Z0-9]*"
See the regex demo. You may replace [a-zA-Z0-9_] with \w:
re.compile(r"\b(?!PI\b)[_a-zA-Z]\w*") # In Python 2.x, re.UNICODE is not enabled by default
re.compile(r"\b(?!PI\b)[_a-zA-Z]\w*", re.A) # In Python 3.x, make \w match ASCII only
Details
\b - word boundary
(?!PI\b) - immediately to the right, there can't be PI as a whole word
[_a-zA-Z] - an ASCII letter or _
[_a-zA-Z0-9]* - 0 or more underscores, ASCII letters or digits.
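A quick way to check the lookahead pattern against the examples from the question is sketched below. The sample inputs and variable names are illustrative assumptions, not part of the original answer.

```python
import re

# Pattern from the answer; re.ASCII keeps \w limited to [a-zA-Z0-9_] in Python 3.
IDENT = re.compile(r"\b(?!PI\b)[_a-zA-Z]\w*", re.ASCII)

samples = ["PI", "_PI", "abcPI", "PIpipipi"]  # hypothetical test inputs

# fullmatch() requires the whole string to be matched by the pattern.
accepted = [s for s in samples if IDENT.fullmatch(s)]
print(accepted)

# In running text, a standalone PI is skipped while longer names are kept.
found = IDENT.findall("PI _PI abcPI PIpipipi PI")
print(found)
```

Both lists should contain every sample except the bare PI.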

Submitting another answer:
^(((?!PI).)*)$|^.*(PI).+$|^.+(PI).*$
I broke it down into 3 cases using OR |:
1) Match a string that doesn't contain PI at all.
^(((?!PI).)*)$
2) Match a string that has PI in it but has at least one character after it, and optionally any characters before it.
^.*(PI).+$
3) Match a string that has PI in it but has at least one character before it, and optionally any characters after it.
^.+(PI).*$
Here it is with test cases:
https://regex101.com/r/7rzqpe/3
Please comment if you find a missing edge case.
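For anyone who prefers to try this locally rather than on regex101, here is a minimal sketch. Note that this pattern tests whole strings rather than identifiers, so it also accepts strings that are not valid identifiers; the sample inputs are assumptions chosen for illustration.

```python
import re

# The three alternatives from the answer, joined with |.
NOT_BARE_PI = re.compile(r"^(((?!PI).)*)$|^.*(PI).+$|^.+(PI).*$")

samples = ["PI", "_PI", "abcPI", "PIpipipi", "abc"]  # hypothetical inputs

# Map each sample to True (accepted) or False (rejected).
results = {s: bool(NOT_BARE_PI.match(s)) for s in samples}
print(results)
```

Only the bare PI should map to False.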

Not very nice, but I'll add it anyway for variety:
/^([A-OQ-Za-z_][A-Za-z0-9_]*|P([A-HJ-Za-z0-9_][A-Za-z0-9_]*|I[A-Za-z0-9_]+)?)$/

Related

Why does my Python regex not work as expected when including a forward slash?

I'm having a Python issue when I include a "not /" condition in my regex.
In the following example I only want to find a match if the string sitting in the first word boundary starts with a digit AND there isn't a / at any point afterwards.
Why does the following regex return 1ab as a group value? I was hoping it wouldn't find a match at all:
text = "1ab/"
regex = r"\b(\d[^/]*?)\b"
Whereas:
text = "1abc"
regex = r"\b(\d[^c]*?)\b"
does not return any match, which is the outcome I want for the / scenario.
Any help would be appreciated.
Thanks,
Roy
You can use a negative lookahead assertion:
r'\b(\d\w*?)\b(?!.*/)' (use flags=re.DOTALL with this or prepend (?s) to the regex)
(?!.*/) states that the rest of the input string does not contain a '/' character. If you don't want '/' to appear just as the next character, then use as the assertion (?!/).
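As a rough illustration of the lookahead version, the sketch below wraps the pattern in a small helper. The helper name and the sample strings are assumptions made for this example.

```python
import re

# Pattern from the answer; the inline (?s) flag lets . in the lookahead cross newlines.
FIRST_NUM_WORD = re.compile(r"(?s)\b(\d\w*?)\b(?!.*/)")

def first_match(text):
    """Return group 1 of the first match, or None if nothing matches."""
    m = FIRST_NUM_WORD.search(text)
    return m.group(1) if m else None

print(first_match("1ab/"))     # a slash follows the word
print(first_match("1ab"))      # no slash anywhere
print(first_match("1ab cd/"))  # a slash appears after a later word
```

Only the input without any slash should produce a match.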
You almost did it. However, the slash is not alphanumeric and thus cannot be inside a word, so it makes no sense to match or prohibit it between the start and the end of the word. You have to place the "not slash" sub-expression [^/] after the end of the word, and add a star, [^/]* (which matches a sequence of non-slash symbols), to address the case when a slash occurs toward the end of the string rather than immediately after the end of the first word.
Since you target the first word and the absence of a slash until the very end of the string, adding anchors for the start and the end of the string might help, especially if you use re.search. This results in:
^[\W]*\b(\d\w*)\b[^/]*\Z
You can play with it using an online debugger such as https://regex101.com/r/uO27vU/2
to better understand the expression or tune it.
Above, ^ is the start of the string, \Z is the end of the string, \W is a "non-word" symbol, and \w is a "word" symbol.
You can remove the first \b; I kept it because it may make the expression easier for you to understand.
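The anchored expression can be tried out with a short script like the one below. It is only a sketch; the helper name and the test strings are assumptions.

```python
import re

# Anchored pattern from the answer: optional leading non-word characters,
# a first word starting with a digit, then no slash up to the end of the string.
ANCHORED = re.compile(r"^[\W]*\b(\d\w*)\b[^/]*\Z")

def first_word(text):
    """Return the captured first word, or None if the string is rejected."""
    m = ANCHORED.search(text)
    return m.group(1) if m else None

print(first_word("1ab/"))      # rejected: slash right after the word
print(first_word("1ab cd/"))   # rejected: slash later in the string
print(first_word("  1ab cd"))  # accepted: leading spaces are skipped
print(first_word("abc 1ab"))   # rejected: first word does not start with a digit
```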
The second expression that you tried excludes words ending with c, but the first does not. [^c] stands for any symbol but c, and right after it you have \b, which denotes the end of the word. This reads: please, no "c"s at the end of the word.
Your first expression says: please, no slashes before the end of the word (a sequence of alphanumerics), which is the case for your test.
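The difference between the two original attempts can be reproduced directly. This is a small sketch using the exact strings and patterns from the question.

```python
import re

# [^/]*? can consume "ab"; the \b between "b" and "/" then succeeds,
# so the slash never gets a chance to block the match.
m1 = re.search(r"\b(\d[^/]*?)\b", "1ab/")

# [^c]*? can consume "ab", but there is no \b between "b" and "c",
# and "c" itself cannot be consumed, so the whole match fails.
m2 = re.search(r"\b(\d[^c]*?)\b", "1abc")

print(m1.group(1) if m1 else None)
print(m2)
```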
Always use a debugger to get an explanation of each symbol, and to test and tune your expressions: regex101.com/r/B6INGg/2
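Note that the list of symbols in a word might be affected by flags. When the LOCALE and UNICODE flags are not specified, \w matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. (This describes Python 2; in Python 3, str patterns are Unicode-aware by default, and re.ASCII restores the [a-zA-Z0-9_] behaviour.)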

Inverse regex match on group in Python

I see a lot of similarly worded questions, but I've had a strikingly difficult time coming up with the syntax for this.
Given a list of words, I want to print all the words that do not have special characters.
I have a regex which identifies words with special characters \w*[\u00C0-\u01DA']\w*. I've seen a lot of answers with fairly straightforward scenarios like a simple word. However, I haven't been able to find anything that negates a group - I've seen several different sets of syntax to include the negative lookahead ?!, but I haven't been able to come up with a syntax that works with it.
In my case given a string like: "should print nŌt thìs"
should print should and print but not the other two words. re.findall("(\w*[\u00C0-\u01DA']\w*)", paragraph.text) gives you the special characters - I just want to invert that.
For this particular case, you can simply specify the regular alphabet range in your search:
a = "should print nŌt thìs"
re.findall(r"(\b[A-Za-z]+\b)", a)
# ['should', 'print']
Of course you can add digits or anything else you want to match as well.
As for negative lookaheads, they use the syntax (?!...), with ? before !, and they must be in parentheses. To use one here, you can use:
r"\b(?!\w*[À-ǚ])\w*"
This:
Checks for a word boundary \b, like a space or the start of the input string.
Does the negative lookahead and stops the match if it finds any special character preceded by 0 or more word characters. You have to include the \w* because (?![À-ǚ]) would only check for the special character being the first letter in the word.
Finally, if it makes it past the lookahead, it matches any word characters.
Demo. Note in regex101.com you must specify Python flavor for \b to work properly with special characters.
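The first two options can be compared side by side with a sketch like the following. One adjustment is made here as an assumption of this example: the lookahead variant uses \w+ instead of \w*, because with \w* re.findall also returns empty matches at word ends. The character range is written with \u escapes, as in the question.

```python
import re

text = "should print nŌt thìs"

# Option 1: only plain ASCII letters between word boundaries.
ascii_words = re.findall(r"\b[A-Za-z]+\b", text)

# Option 2: negative lookahead rejecting any word that contains a character
# in the range U+00C0..U+01DA; \w+ avoids empty matches in findall().
no_special = re.findall(r"\b(?!\w*[\u00C0-\u01DA])\w+", text)

print(ascii_words)
print(no_special)
```

Both calls should list only the two unaccented words.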
There is a third option as well:
r"\b[^À-ǚ\s]*\b"
The middle part [^À-ǚ\s]* means match any character other than special characters or whitespace an unlimited number of times.
I know this is not a regex, but just a completely different idea you may not have had besides using regexes. I suppose it would be also much slower but I think it works:
>>> import unicodedata as ud
>>> [word for word in ['Cá', 'Lá', 'Aqui']\
if any(['WITH' in ud.name(letter) for letter in word])]
['Cá', 'Lá']
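To reverse the filter (keep only the words without such letters), use not any(...) instead of any(...); simply swapping in 'WITH' not in inside any() would keep every word that has at least one plain letter.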
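A script version of the same idea, applied to the sentence from the question, might look like this. The helper name is hypothetical, and the check relies on Unicode character names containing the word WITH for letters that carry a diacritic.

```python
import unicodedata as ud

def has_modified_letter(word):
    """True if any character's Unicode name contains 'WITH' (e.g. '... A WITH ACUTE')."""
    # ud.name() raises ValueError for unnamed characters unless a default is given.
    return any("WITH" in ud.name(ch, "") for ch in word)

words = "should print nŌt thìs".split()

# Keep only the words in which no letter carries a diacritic.
plain = [w for w in words if not has_modified_letter(w)]
print(plain)
```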

Why positive lookahead is working but negative lookahead doesn't?

First of all, the regex needs to work in both Python and PCRE (PHP). I'm trying to ignore a match if the regex pattern is followed by the letter 'x', to distinguish dimensions from strings like "number/number" in the example below:
dummy word 222/2334; Ø14 x Ø6,33/523,23 x 2311 mm
From here, I'm trying to extract 222/2334 but not the 6,33/523,23 since that part is actually part of dimensions. So far I came up with this regex
((\d*(?:,?\.?)\d*(?:,?\.?))\s?\/\s?(\d*(?:,?\.?)\d*(?:,?\.?)))(?=\s?x)
which can extract what I don't want it to extract and it looks like this. If I change the positive lookahead to negative it captures both of them except the last '3' from 6,33/523,23. It looks like this. How can I only capture 222/2334? What am I doing wrong here?
Desired output:
222/2334
What I got
222/2334 6,33/523,2
You may use this simplified regex with negative lookahead:
((\d*(?:,?\.?)\d*(?:,?\.?))\s?\/\s?(\d*(?:,?\.?)\d*(?:,?\.?)))\b(?![.,]?\d|\s?x)
Updated RegEx Demo
It is important to use a word boundary at the end to avoid matching partial numbers (this is why your regex matched only up to the digit before the last one).
Also include [.,]?\d in the negative lookahead condition so that the match doesn't end at the position just before the last comma.
This shorter (and more efficient) regex may also work for OP:
(\d+(?:[,.]\d+)*)\s*\/\s*(\d+(?:[,.]\d+)*)\b(?![.,]?\d|\s?x)
RegEx Demo 2
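In Python, the shorter regex can be checked against the sample line from the question as sketched below. finditer() with group(0) is used because findall() would return tuples of the two capture groups; the variable names are assumptions.

```python
import re

text = "dummy word 222/2334; Ø14 x Ø6,33/523,23 x 2311 mm"

# Shorter pattern from the answer: number / number, ending on a word boundary,
# not followed by more digits of the same number or by an "x".
RATIO = re.compile(r"(\d+(?:[,.]\d+)*)\s*\/\s*(\d+(?:[,.]\d+)*)\b(?![.,]?\d|\s?x)")

matches = [m.group(0) for m in RATIO.finditer(text)]
print(matches)
```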
There are two easy options.
The first option is ugly and long, but basically negates a positive match on the string that is followed by x, then matches the patterns without it.
(?!PATTERN(?=x))PATTERN
See regex in use here
(?!\d+(?:[,.]\d+)?\s?\/\s?\d+(?:[,.]\d+)?(?=\s?x))(\d+(?:[,.]\d+)?)\s?\/\s?(\d+(?:[,.]\d+)?)
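The (?!PATTERN(?=x))PATTERN idea can be assembled from smaller pieces to keep it readable. The sketch below does that with f-strings; it drops the two capture groups of the original for brevity, and the constant names are assumptions.

```python
import re

text = "dummy word 222/2334; Ø14 x Ø6,33/523,23 x 2311 mm"

NUM = r"\d+(?:[,.]\d+)?"          # digits, optionally followed by , or . and digits
PATTERN = rf"{NUM}\s?\/\s?{NUM}"  # number / number

# Reject any position where PATTERN followed by "x" could match,
# then match PATTERN itself.
NOT_DIMENSION = re.compile(rf"(?!{PATTERN}(?=\s?x)){PATTERN}")

matches = NOT_DIMENSION.findall(text)
print(matches)
```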
The second option uses possessive quantifiers. In Python, the built-in re module supports them only from version 3.11 onward; on older versions you'll have to use the regex module instead of re.
See regex in use here
(\d+(?:[,.]\d+)?+)\s?\/\s?(\d+(?:[,.]\d+)?+)(?!\s?x)
Additionally, I changed your subpattern to \d+(?:[,.]\d+)?. This will match one or more digits, then optionally match . or , followed by one or more digits.

Regex: match sub-string between keywords including the keywords [duplicate]

I have a string. The end is different, such as index.php?test=1&list=UL or index.php?list=UL&more=1. The one thing I'm looking for is &list=.
How can I match it, whether it's in the middle of the string or it's at the end? So far I've got [&|\?]list=.*?([&|$]), but the ([&|$]) part doesn't actually work; I'm trying to use that to match either & or the end of the string, but the end of the string part doesn't work, so this pattern matches the second example but not the first.
Use:
/(&|\?)list=.*?(&|$)/
Note that when you use a bracket expression, every character within it (with some exceptions) is going to be interpreted literally. In other words, [&|$] matches the characters &, |, and $.
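Although the question is not tied to a particular language, the point can be illustrated in Python. This is a sketch with made-up variable names: the first call shows the bracket expression matching three literal characters, the rest applies the corrected pattern to both URLs from the question.

```python
import re

# Inside brackets, &, | and $ are just three literal characters.
literal_hits = re.findall(r"[&|$]", "a&b|c$d")
print(literal_hits)

# With groups, | is alternation and $ is the end-of-string anchor.
FIXED = re.compile(r"(&|\?)list=.*?(&|$)")
urls = ["index.php?test=1&list=UL", "index.php?list=UL&more=1"]
matched = [FIXED.search(u).group(0) for u in urls]
print(matched)
```

Note that the match in the second URL includes the trailing & delimiter.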
In short
Any zero-width assertions inside [...] lose their meaning of a zero-width assertion. [\b] does not match a word boundary (it matches a backspace, or, in POSIX, \ or b), [$] matches a literal $ char, [^] is either an error or, as in ECMAScript regex flavor, any char. Same with \z, \Z, \A anchors.
You may solve the problem using any of the below patterns:
[&?]list=([^&]*)
[&?]list=(.*?)(?=&|$)
[&?]list=(.*?)(?![^&])
If you need to check for the "absolute", unambiguous string end anchor, you need to remember that in various regex flavors it is expressed with different constructs:
[&?]list=(.*?)(?=&|$) - OK for ECMA regex (JavaScript, default C++ `std::regex`)
[&?]list=(.*?)(?=&|\z) - OK for .NET, Go, Onigmo (Ruby), Perl, PCRE (PHP, base R), Boost, ICU (R `stringr`), Java/Android
[&?]list=(.*?)(?=&|\Z) - OK for Python
Matching between a char sequence and a single char or end of string (current scenario)
The .*?([YOUR_SINGLE_CHAR_DELIMITER(S)]|$) pattern (suggested by João Silva) is rather inefficient since the regex engine checks for the patterns that appear to the right of the lazy dot pattern first, and only if they do not match does it "expand" the lazy dot pattern.
In these cases it is recommended to use negated character class (or bracket expression in the POSIX talk):
[&?]list=([^&]*)
See demo. Details
[&?] - a positive character class matching either & or ? (note the relationships between chars/char ranges in a character class are OR relationships)
list= - a substring, char sequence
([^&]*) - Capturing group #1: zero or more (*) chars other than & ([^&]), as many as possible
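A minimal Python sketch of the negated character class approach, run on the two URLs from the question (the variable names are assumptions):

```python
import re

# [^&]* grabs everything up to the next & or the end of the string.
LIST_PARAM = re.compile(r"[&?]list=([^&]*)")

urls = ["index.php?test=1&list=UL", "index.php?list=UL&more=1"]
values = [LIST_PARAM.search(u).group(1) for u in urls]
print(values)
```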
Checking for the trailing single char delimiter presence without returning it or end of string
Most regex flavors (including JavaScript beginning with ECMAScript 2018) support lookarounds, constructs that only return true or false depending on whether their patterns match. They are crucial when consecutive matches that may start and end with the same char are expected (see the original pattern, it may match a string starting and ending with &). Although it is not expected in a query string, it is a common scenario.
In that case, you can use two approaches:
A positive lookahead with an alternation containing positive character class: (?=[SINGLE_CHAR_DELIMITER(S)]|$)
A negative lookahead with just a negative character class: (?![^SINGLE_CHAR_DELIMITER(S)])
The negative lookahead solution is a bit more efficient because it does not contain an alternation group that adds complexity to matching procedure. The OP solution would look like
[&?]list=(.*?)(?=&|$)
or
[&?]list=(.*?)(?![^&])
See this regex demo and another one here.
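The two lookahead variants can be compared in Python as sketched below. The positive variant uses \Z as the unambiguous end-of-string anchor in Python, and the variable names are assumptions. In both cases the delimiter is checked but not consumed.

```python
import re

urls = ["index.php?test=1&list=UL", "index.php?list=UL&more=1"]

# Positive lookahead: the value must be followed by & or the end of the string.
POSITIVE = re.compile(r"[&?]list=(.*?)(?=&|\Z)")
# Negative lookahead: the value must not be followed by a char other than &.
NEGATIVE = re.compile(r"[&?]list=(.*?)(?![^&])")

pos_values = [POSITIVE.search(u).group(1) for u in urls]
neg_values = [NEGATIVE.search(u).group(1) for u in urls]
# The whole match stops before the delimiter, since lookaheads are zero-width.
whole = NEGATIVE.search(urls[1]).group(0)

print(pos_values, neg_values, whole)
```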
Certainly, in case the trailing delimiters are multichar sequences, only a positive lookahead solution will work since [^yes] does not negate a sequence of chars, but the chars inside the class (i.e. [^yes] matches any char but y, e and s).

Ignore words containing substring using regular expressions

I am a beginner and have spent considerable amount of time on this. I was partially able to solve it.
Problem: I want to ignore all words that contain either the or The. E.g. atheist, others, The, the will be excluded. However, hottie shouldn't be excluded, because the doesn't occur inside it as a contiguous substring.
I am using Python's re engine.
Here's my regex:
\b - Start at word boundary
(?! - Negative lookahead to avoid starting with the or The
[t|T]he - the and The
)
\w+ - Other letters are fine
(?<! - Negative look behind
[t|T]he - the or The shouldn't occur before \w+
)
\b - Word boundary
Expected output for a given input:
Input: Atheist Others Their Hello the The bathe hottie tahaie theater
Expected Output: Hello hottie tahaie
As one can see in regex101, I am able to exclude most of the words except words like atheist--i.e. cases when the or The appear inside words. I searched for this on SO and found some threads such as How to exclude specific string using regex in Python?, but they don't seem to be directly related to what I am trying to do.
Any help will be greatly appreciated.
Please note that I am interested in solving this problem only using regex. I am not looking for solutions using python's string manipulation.
The approach is simpler than your original regular expression (note that [tT] is used instead of [t|T], since | is a literal character inside a character class):
\b(?!\w*[tT]he)\w+\b
We match a word, but make sure that there is no the within the word using a "padded" negative lookahead. Your original approach only disallowed the at the front or the back of the word as it allowed for no padding after/before the word boundary.
(?![tT]he) only checks at the current position, while (?!\w*[tT]he) allows the check to extend from the current position, because the \w* can be used as filler.
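Applied to the input from the question, the padded lookahead can be checked with a few lines of Python. This is a sketch; the variable names are assumptions, and the character class is written as [tT].

```python
import re

text = "Atheist Others Their Hello the The bathe hottie tahaie theater"

# Reject a word if "the" or "The" occurs anywhere after its start;
# \w* inside the lookahead acts as padding before the forbidden substring.
NO_THE = re.compile(r"\b(?!\w*[tT]he)\w+\b")

kept = NO_THE.findall(text)
print(kept)
```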
