Greedy match with negative lookahead in a regular expression - python

I have a regular expression in which I'm trying to extract every group of letters that is not immediately followed by a "(" symbol. For example, the following regular expression operates on a mathematical formula that includes variable names (x, y, and z) and function names (movav and movsum), both of which are composed entirely of letters but where only the function names are followed by an "(".
re.findall(r"[a-zA-Z]+(?!\()", "movav(x/2, 2)*movsum(y, 3)*z")
I would like the expression to return the array
['x', 'y', 'z']
but it instead returns the array
['mova', 'x', 'movsu', 'y', 'z']
I can see in theory why the regular expression would be returning the second result, but is there a way I can modify it to return just the array ['x', 'y', 'z']?

Another solution which doesn't rely on word boundaries:
Check that the letters aren't followed by either a ( or by another letter.
>>> re.findall(r'[a-zA-Z]+(?![a-zA-Z(])', "movav(x/2, 2)*movsum(y, 3)*z")
['x', 'y', 'z']

Add a word-boundary matcher \b:
>>> re.findall(r'[a-zA-Z]+\b(?!\()', "movav(x/2, 2)*movsum(y, 3)*z")
['x', 'y', 'z']
\b matches the empty string at the beginning or end of a word, so the pattern now looks for letters followed by a word boundary that isn't immediately followed by (. For more details, see the re docs.

You need to limit matches to whole words. So use \b to match the beginning or end of a word:
re.findall(r"\b[a-zA-Z]+\b(?!\()", "movav(x/2, 2)*movsum(y, 3)*z")

An alternate approach: find strings of letters followed by either end-of-string or by a non-letter, non-bracket character; then capture the letter portion.
re.findall("([a-zA-Z]+)(?:[^a-zA-Z(]|$)", "movav(x/2, 2)*movsum(y, 3)*z")
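For comparison, here is a small sketch that runs the original pattern and the four patterns suggested above against the sample formula. The names formula, greedy, and patterns are only illustrative and not from any of the answers.

```python
import re

formula = "movav(x/2, 2)*movsum(y, 3)*z"

# The original attempt: backtracking lets it give back one letter
# so that the lookahead no longer sees "(".
greedy = re.findall(r"[a-zA-Z]+(?!\()", formula)
print(greedy)  # ['mova', 'x', 'movsu', 'y', 'z']

# The four suggested fixes, in the order they appear above.
patterns = [
    r"[a-zA-Z]+(?![a-zA-Z(])",       # lookahead also rejects a following letter
    r"[a-zA-Z]+\b(?!\()",            # word boundary after the letters
    r"\b[a-zA-Z]+\b(?!\()",          # word boundary on both sides
    r"([a-zA-Z]+)(?:[^a-zA-Z(]|$)",  # capture letters, consume the next character
]
results = [re.findall(p, formula) for p in patterns]
for pattern, found in zip(patterns, results):
    print(pattern, found)  # each prints ['x', 'y', 'z']
```

The last pattern consumes the character after the name, which does not matter for this input but is a difference from the lookahead versions.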

Related

Python pattern matching with language-specific characters

From a list of strings, I want to extract all words and add them to a new list. I managed to do so using pattern matching of the form:
import re
p = re.compile('[a-z]+', re.IGNORECASE)
p.findall("02_Sektion_München_Gruppe_Süd")
Unfortunately, the language contains language-specific characters, so strings like the given example yield:
['Sektion', 'M', 'nchen', 'Gruppe', 'S', 'd']
I want it to yield:
['Sektion', 'München', 'Gruppe', 'Süd']
I would be grateful for suggestions on how to solve this problem.
You may use
import re
p = re.compile(r'[^\W\d_]+')
print(p.findall("02_Sektion_München_Gruppe_Süd"))
# => ['Sektion', 'München', 'Gruppe', 'Süd']
The [^\W\d_]+ pattern matches one or more characters that are not non-word characters, digits, or _, that is, letters only.
In Python 2.x you will have to add the re.UNICODE flag to make it match Unicode letters:
p = re.compile(r'[^\W\d_]+', re.U)
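A short sketch, assuming Python 3 (where str patterns are Unicode-aware by default), that contrasts the original [a-z]+ pattern with [^\W\d_]+. The variable names are illustrative. The re.ASCII line is only there to show that the result depends on Unicode matching being active.

```python
import re

s = "02_Sektion_München_Gruppe_Süd"

# [a-z] with IGNORECASE covers only the ASCII letters here, so "ü" splits the words.
ascii_only = re.findall(r"[a-z]+", s, re.IGNORECASE)
print(ascii_only)  # ['Sektion', 'M', 'nchen', 'Gruppe', 'S', 'd']

# "Not a non-word character, not a digit, not an underscore" leaves letters,
# including non-ASCII ones when Unicode matching is on (the Python 3 default).
letters = re.findall(r"[^\W\d_]+", s)
print(letters)  # ['Sektion', 'München', 'Gruppe', 'Süd']

# With re.ASCII, \W treats "ü" as a non-word character again.
letters_ascii_flag = re.findall(r"[^\W\d_]+", s, re.ASCII)
print(letters_ascii_flag)  # ['Sektion', 'M', 'nchen', 'Gruppe', 'S', 'd']
```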

Python seems to incorrectly identify case-sensitive string using regex

I'm checking for a case-sensitive string pattern using Python 2.7 and it seems to return an incorrect match. I've run the following tests:
>>> import re
>>> rex_str = "^((BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?.(?i)pdf$)"
>>> not re.match(rex_str, 'BOA_1988-148.pdf')
False
>>> not re.match(rex_str, 'BOA_1988-148.PDF')
False
>>> not re.match(rex_str, 'BOA1988-148.pdf')
True
>>> not re.match(rex_str, 'boa_1988-148.pdf')
False
The first three tests are correct, but the final test, 'boa_1988-148.pdf' should return True because the pattern is supposed to treat the first 3 characters (BOA) as case-sensitive.
I checked the expression with an online tester (https://regex101.com/) and the pattern was correct, flagging the final as a no match because the 'boa' was lower case. Am I missing something or do you have to explicitly declare a group as case-sensitive using a case-sensitive mode like (?c)?
Flags do not apply to portions of a regex. You told the regex engine to match case insensitively:
(?i)
From the syntax documentation:
(?aiLmsux)
(One or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), and re.X (verbose), for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function. Flags should be used first in the expression string.
Note "for the entire regular expression": the flag applies to the whole pattern, not just a substring. If you need to match just pdf or PDF, use that in your pattern directly (with the dot escaped as \., since a bare . matches any character):
r"^((BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\.(?:pdf|PDF)$)"
This matches either .pdf or .PDF. If you need to match any mix of uppercase and lowercase, use:
r"^((BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\.[pP][dD][fF]$)"
(?i) doesn’t only apply after itself or to the group that contains it. From the Python 2 re documentation:
(?iLmsux)
(One or more letters from the set 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags […] for the entire regular expression.
One option is to do it manually:
r"^(BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\.[Pp][Dd][Ff]\Z"
Another is to use a separate case-sensitive check:
rex_str = r"(?i)^(BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\.pdf\Z"
match = re.match(rex_str, s) if s.startswith("BOA_") else None
or separate case-insensitive one:
rex_str = r"^(BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\..{3}\Z"
match = re.match(rex_str, s) if s.lower().endswith(".pdf") else None
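The answers above target Python 2.7, where an inline flag is always global. If Python 3.6 or later is an option, the scoped form (?i:...) restricts case-insensitivity to one group. The sketch below assumes that newer version; the names rex, names, and results are illustrative.

```python
import re

# Assumes Python 3.6+: (?i:...) applies IGNORECASE only inside the group,
# so "BOA_" stays case-sensitive while the extension does not.
rex = re.compile(r"^(BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\.(?i:pdf)\Z")

names = [
    "BOA_1988-148.pdf",
    "BOA_1988-148.PDF",
    "BOA1988-148.pdf",
    "boa_1988-148.pdf",
]
results = {name: bool(rex.match(name)) for name in names}
for name, matched in results.items():
    print(name, matched)
```

With this pattern the first two names match and the last two do not, which is the behavior the question asked for.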

python split a string by comma not inside matrix expression

I want to split a string on commas that are not inside a matrix expression.
For example:
input:
value = 'MA[1,2],MA[1,3],der(x),x,y'
expected output:
['MA[1,2]','MA[1,3]','der(x)','x','y']
I tried value.split(','), but it also splits inside []. I also tried a regular expression to extract the text inside []:
import re
re.split(r'\[(.*?)\]', value)
I am not good at regular expressions. Any suggestions would be helpful.
You can use a negative lookbehind:
>>> import re
>>> value1 = 'MA[1,2],MA[1,3],der(x),x,y'
>>> value2 = 'M[a,b],x1,M[1,2],der(x),y1,y2,der(a,b)'
>>> pat = re.compile(r'(?<![[()][\d\w]),')
>>> pat.split(value1)
['MA[1,2]', 'MA[1,3]', 'der(x)', 'x', 'y']
>>> pat.split(value2)
['M[a,b]', 'x1', 'M[1,2]', 'der(x)', 'y1', 'y2', 'der(a,b)']
Explanation:
(?<![[()][\d\w]),
  (?<![[()][\d\w]) negative lookbehind: assert that the two characters just before the current position do not match the pattern below
    [[()] match a single character from the list: [ or (
    [\d\w] match a single character from the list below
      \d match a digit
      \w match any word character (letters, digits, and _), so \d is redundant here
  , match the character , literally
Because the lookbehind is exactly two characters wide, it only protects a comma that comes directly after a single character following [ or (.
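The lookbehind above only looks two characters back, so an input such as 'MA[10,2]' would still be split inside the brackets. As an alternative that does not use a regex, here is a minimal sketch that tracks bracket depth. The function name split_top_level is hypothetical, and the code assumes brackets in the input are balanced.

```python
def split_top_level(text, sep=","):
    """Split text on sep, ignoring separators nested inside [] or ()."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch in "[(":
            depth += 1
        elif ch in "])":
            depth = max(depth - 1, 0)  # tolerate a stray closing bracket
        if ch == sep and depth == 0:
            # Top-level separator: finish the current piece.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


print(split_top_level('MA[1,2],MA[1,3],der(x),x,y'))
# ['MA[1,2]', 'MA[1,3]', 'der(x)', 'x', 'y']
print(split_top_level('MA[10,2],der(a,b),y'))
# ['MA[10,2]', 'der(a,b)', 'y']
```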

Separating RegEx pattern matches that have the same potential starting characters

I would like to have a RegEx that matches several of the same character in a row, within a range of possible characters, but does not merge adjacent runs of different characters into one match. How can this be accomplished?
For clarification:
I want a pattern that starts with one character from [a-c] and matches any number of repetitions of that same character, but not the other characters in the range. In the sequence 'aafaabbybcccc' it would find patterns for:
('aa', 'aa', 'bb', 'b', 'cccc')
but would exclude the following:
('f', 'aabb', 'y', 'bcccc')
I don't want to use multiple RegEx pattern searches because the order in which I find the patterns will determine the output of another function. This question is for the purposes of self study (python), not homework. (I'm also under 15 rep but will come back and upvote when I can.)
Good question. Use a regex like:
(?P<L>[a-c])(?P=L)+
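You're not limited to a-c; you can replace it with a-z if you like. The pattern first captures any character within a-c as the named group L, then requires that same character to occur again one or more times. Note that re.findall() returns only the captured group when the pattern contains one, so use re.finditer() and take m.group(0) to get the full runs. With + a run must be at least two characters long; use * instead if a single character such as the lone 'b' should also match.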
You can use the backreferences \1 to \9 to refer to the text matched by the 1st to 9th capturing group.
/([a-c])(\1+)/
[a-c]: Matches one of the characters a, b, or c.
\1+ : Matches one or more further occurrences of the previously matched character.
Perl:
perl -e '@m = "ccccbbb" =~ /([a-c])(\1+)/; print $m[0], $m[1]'
cccc
Python:
>>> import re
>>> [m.group(0) for m in re.finditer(r"([a-c])\1+", 'aafaabbybcccc')]
['aa', 'aa', 'bb', 'cccc']
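The question's expected output also includes the single 'b', which \1+ skips because it needs at least two characters. A small sketch of the \1* variant follows, together with what re.findall() returns for a pattern that has a group. The variable names are illustrative.

```python
import re

text = 'aafaabbybcccc'

# \1* allows zero further repeats, so a lone character counts as a run.
runs = [m.group(0) for m in re.finditer(r"([a-c])\1*", text)]
print(runs)  # ['aa', 'aa', 'bb', 'b', 'cccc']

# findall() with one capturing group returns only that group's text,
# not the whole match, which is why finditer() is used above.
groups_only = re.findall(r"([a-c])\1+", text)
print(groups_only)  # ['a', 'a', 'b', 'c']
```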

Difference between character sets in Python and re2c regular expressions

Character sets in regular expressions are specified using []. Character sets match any one of the enclosed characters. For example, [abc] will match one of 'a', 'b', or 'c'.
I realize there are potentially differences between character sets in Python and re2c regular expressions. I know what is the same in both:
Both accept ranges, for example [a-z] matches all lowercase letters
Both accept inverse sets using [^...] notation
Both accept common alphanumeric and some other characters (spaces, etc.)
But I'm concerned about these possibly being different:
Characters that need to be escaped inside of the character set
Where to place a literal '-' or '^' inside the character set if I want to match that character and not specify an inverse set or a range
Can you explain the difference between Python and re2c character sets?
Looking at the re2c manual link that you provided, it appears that re2c uses the same syntax, just a subset of that syntax.
To address your specific questions about regex syntax
Characters that need to be escaped inside of the character set.
What characters are you referring to specifically?
Where to place a literal - or ^ inside the character set...
For ^, anywhere but the beginning should do, and for -, the beginning or the end of the set should do (anywhere it cannot be read as part of a range).
>>> import re
>>> match_literal_hyphen = "[ab-]"
>>> re.findall(match_literal_hyphen, "abc - def")
['a', 'b', '-']
>>> match_literal_caret = "[a^b]"
>>> re.findall(match_literal_caret, "abc ^ def")
['a', 'b', '^']
I would escape anything that causes confusion:
/[][]/ matches ']' or '['
/[[]]/ matches '[]'
/[]]]/ matches ']]'
/[[[]/ matches '['
/[]/ is an unmatched '[' error
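The bracket cases listed above can be checked in Python with a short sketch like the one below. It only verifies Python's re module; whether re2c treats each case the same way is not something this confirms. Recent Python versions emit a FutureWarning for a set that begins with [, so the sketch silences it. The names checks, results, and escaped are illustrative.

```python
import re
import warnings

# A set starting with "[" triggers a "possible nested set" FutureWarning.
warnings.simplefilter("ignore", FutureWarning)

checks = [
    (r"[][]", "]"),   # set containing ] and [
    (r"[][]", "["),
    (r"[[]]", "[]"),  # set containing [, then a literal ]
    (r"[]]]", "]]"),  # set containing ], then a literal ]
    (r"[[[]", "["),   # set containing [ twice
]
results = [bool(re.fullmatch(pattern, s)) for pattern, s in checks]
print(results)  # [True, True, True, True, True]

# An "empty" set never closes, so compiling it fails.
try:
    re.compile("[]")
    empty_set_error = None
except re.error as exc:
    empty_set_error = str(exc)
print(empty_set_error)

# Escaping every special character avoids having to remember the rules.
escaped = re.findall(r"[\]\[\-\^]", "a[b]c-d^e")
print(escaped)  # ['[', ']', '-', '^']
```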
