Regular Expression for Combined Look-ahead/Look-behind - python

I'm using python and trying to write a regular expression that matches a hyphen (-) if it is not preceded by a period (.) and not followed by one character and a period.
This one matches a hyphen that is not preceded by a period and not followed by a letter:
r'(?<!\.)(-(?![a-zA-Z]))'
Nothing I've tried seems to get me the right match for the negative look-ahead part (single character and period).
Any help appreciated. Even a totally different regex if I'm barking up the wrong tree altogether.
Edit
Thanks for the answers. I did actually try
r'(?<!\.)(-(?![a-zA-Z]\.))'
But I now realise that my logic was wrong, not my expression.
I've chosen the answer and upvoted the other correct ones :)

Assuming that by "character" you mean (and I base this assumption on your example and on @SimonO101's comment) [A-Za-z], I think you are looking for something like this:
>>> import re
>>> r = re.compile(r'(?<!\.)-(?![A-Za-z]\.)')
>>> r.search('k.-kj')
>>> r.search('k-l.')
>>> r.search('k-ll')
<_sre.SRE_Match object at 0x02D46758>
>>> r.search('k-.l')
<_sre.SRE_Match object at 0x02D46720>
>>> r.search('l-..')
<_sre.SRE_Match object at 0x02D46758>
There is no need to try to enclose the hyphen in a group that also captures the negative lookahead assertion. Trying to do this only complicates the matter.
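As a small sketch of the same pattern in reusable form, the helper below reports where the qualifying hyphens are. The names HYPHEN and hyphen_positions are made up for illustration, and "character" is again assumed to mean [A-Za-z].

```python
import re

# Hyphen not preceded by a period and not followed by letter-then-period.
# HYPHEN and hyphen_positions are illustrative names, not from the post.
HYPHEN = re.compile(r'(?<!\.)-(?![A-Za-z]\.)')

def hyphen_positions(text):
    """Return the index of every hyphen that satisfies both assertions."""
    return [m.start() for m in HYPHEN.finditer(text)]

# Only the hyphen in 'a-b' qualifies in this sample string.
positions = hyphen_positions('a-b c.-d e-f.')
```

Because both look-arounds are zero-width, the match itself is just the hyphen, so m.start() points straight at it.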

import re
ss = ' a-bc1 d-e.2 .-gh3 .-N.4'
print('The analysed string:\n' + ss)
print('\n' + r'(?!\.-[a-zA-Z]\.)')
print('NOT (preceded by a dot AND followed by character-and-dot)')
r = re.compile(r'(?!\.-[a-zA-Z]\.).-...')
print(r.findall(ss))
print('\n' + r'(?<!\.)-(?![a-zA-Z]\.)')
print('NOT (preceded by a dot OR followed by character-and-dot)')
q = re.compile(r'.(?<!\.)-(?![a-zA-Z]\.)...')
print(q.findall(ss))
result
The analysed string:
a-bc1 d-e.2 .-gh3 .-N.4
(?!\.-[a-zA-Z]\.)
NOT (preceded by a dot AND followed by character-and-dot)
['a-bc1', 'd-e.2', '.-gh3']
(?<!\.)-(?![a-zA-Z]\.)
NOT (preceded by a dot OR followed by character-and-dot)
['a-bc1']
Which case do you actually want?


Python regex match only if standalone

Using re in python3, I want to match appearances of percentages in text, and substitute them with a special token (e.g. substitute "A 30% increase" by "A #percent# increase").
I only want to match if the percent expression is a standalone item. For example, it should not match "The product's code is A322%n43%". However, it should match when a line contains only one percentage expression like "89%".
I've tried using delimiters in my regex like \b, but because % is itself a non-alphanumeric character, it doesn't catch the end of the expression. Using \s makes it impossible to catch expressions standing by themselves on a line.
At the moment, I have the code:
>>> re.sub(r"[+-]?[.,;]?(\d+[.,;']?)+%", ' #percent# ', "1,211.21%")
' #percent# '
which still matches if the expression is followed by letters or other text (like the product code example above).
>>> re.sub(r"[+-]?[.,;]?(\d+[.,;']?)+%", ' #percent# ', "EEE1,211.21%asd")
'EEE #percent# asd'
What would you recommend?
Looks like a perfect job for Negative Lookbehind and Negative Lookahead:
re.sub(r'''(?<![^\s]) [+-]?[.,;]? (\d+[.,;']?)+% (?![^\s.,;!?'"])''',
'#percent#', string, flags=re.VERBOSE)
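(?<![^\s]) means "no non-whitespace character is allowed immediately before the current position", so the match must start at the beginning of the string or right after whitespace (add more allowed characters to the class if you need).
(?![^\s.,;!?'"]) means "the next character, if there is one, must be whitespace, a period, a comma, etc.".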
Demo: https://regex101.com/r/khV7MZ/1.
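Here is a minimal sketch that wraps the pattern above in a function and runs it on the examples from the question. PERCENT and replace_percentages are illustrative names, and the set of punctuation allowed after the % is simply the one used in the answer.

```python
import re

# Pattern taken from the answer above; PERCENT and replace_percentages
# are illustrative names, not part of the original post.
PERCENT = re.compile(
    r'''(?<![^\s]) [+-]?[.,;]? (\d+[.,;']?)+% (?![^\s.,;!?'"])''',
    flags=re.VERBOSE,
)

def replace_percentages(text, token='#percent#'):
    """Replace standalone percentage expressions with a token."""
    return PERCENT.sub(token, text)

sentence = replace_percentages("A 30% increase")
alone = replace_percentages("89%")
code = replace_percentages("The product's code is A322%n43%")
```

With re.VERBOSE the spaces in the pattern are ignored, so they only serve to separate the look-behind, the number and the look-ahead visually.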
Try wrapping the "first" capture group in a "second" one, followed by \b:
original: r"[+-]?[.,;]?(\d+[.,;']?)+%"
suggested: r"[+-]?[.,;]?((\d+[.,;']?)+%)\b"
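One caveat worth checking before relying on the \b variant: \b needs a word character on one side, and % is not a word character, so directly after % it only matches when a letter, digit or underscore follows. The quick check below (variable names are illustrative) suggests that this pattern skips a standalone "89%" and matches inside the product code instead.

```python
import re

# The suggested pattern, with \b placed directly after the % sign.
suggested = re.compile(r"[+-]?[.,;]?((\d+[.,;']?)+%)\b")

# '%' at the end of the string: no word character follows, so \b fails.
standalone = suggested.search("89%")

# '%' followed by 'n': \b succeeds, so the product code is matched.
embedded = suggested.search("A322%n43%")
embedded_text = embedded.group(0) if embedded else None
```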

Why does the Python regex ".*PATTERN*" match "XXPATTERXX"?

Suppose I want to find "PATTERN" in a string, where "PATTERN" could be anywhere in the string. My first try was *PATTERN*, but this generates an error saying that there is "nothing to repeat", which I can accept, so I tried .*PATTERN*. However, this regex does not give the expected result; see below:
import re
p = re.compile(".*PATTERN*")
s = "XXPATTERXX"
if p.match(s):
    print(s + " match with '.*PATTERN*'")
The result is
XXPATTERXX match with '.*PATTERN*'
Why does "PATTER" match?
Note: I know that I could use .*PATTERN.* to get the expected result, but I am curious to find out why the asterisk on its own fails to give that result.
Your pattern matches 0 or more N characters at the end, but doesn't say anything about what comes after those N characters.
You could add $ to the pattern to anchor to the end of the input string to disallow the XX:
>>> import re
>>> p = re.compile(".*PATTERN*$")
>>> p.match("XXPATTERXX") is None
True
>>> p.match("XXPATTER") is None
False
>>> p.match("XXPATTER")
<_sre.SRE_Match object at 0x1004627e8>
You may want to look into the different types of anchor. \b may also fit your needs; it matches word boundaries (so between a \w and \W class character, or between \W and \w), or you could use negative look-ahead and look-behinds to disallow other characters around your PATTERN string.
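To see that the trailing * applies only to the final N, it can help to look at what the match actually consumed. This is a small sketch with illustrative variable names; it also shows that a plain re.search for the literal text needs no wildcards at all.

```python
import re

s = "XXPATTERXX"

# '*' quantifies only the preceding 'N', so zero N's are acceptable and
# the match stops right after 'PATTER'.
consumed = re.match(r".*PATTERN*", s).group(0)

# Searching for the literal text finds it anywhere in the string, and
# correctly reports nothing for 'XXPATTERXX'.
found_in_s = re.search(r"PATTERN", s) is not None
found_in_other = re.search(r"PATTERN", "XXPATTERNXX") is not None
```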

Regex: I have a flaw in my logic

I'm trying to match a pattern where the non-word characters in the first bracket never repeat and the pattern must end with the second set in the brackets. I just don't understand why this test case is failing:
import re
regexString = r'([\-\._]?[a-zA-Z0-9]+)*'
rgx = re.compile(regexString)
assert(rgx.match('dan--') == None)
Documentation for re.match: https://docs.python.org/2/library/re.html#re.match
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance.
In your case '([-._]?[a-zA-Z0-9]+)*' clearly matches 'dan' part of 'dan--' hence the result is not None but a MatchObject. If you don't want it to match anything other than what's in your group put your group between ^ and $.
If you want to check that the pattern matches the whole string, use the ^ and $ anchors.
>>> import re
>>> regexString = r'^([\-\._]?[a-zA-Z0-9]+)*$'
>>> rgx = re.compile(regexString)
>>> rgx.match('dan--')
>>> rgx.match('dan')
<_sre.SRE_Match object at 0x00000000029E0D50>
BTW, ^ is not strictly required because match matches only at the beginning of the string.
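A short sketch of the difference, using the same pattern with and without anchors (the variable names are illustrative). Inspecting group(0) shows which prefix the unanchored version settled for.

```python
import re

unanchored = re.compile(r'([-._]?[a-zA-Z0-9]+)*')
anchored = re.compile(r'^([-._]?[a-zA-Z0-9]+)*$')

# Without anchors, match() is satisfied by a prefix of the string.
prefix = unanchored.match('dan--').group(0)

# The group may repeat zero times, so even an empty prefix counts.
empty_prefix = unanchored.match('--dan--').group(0)

# With anchors the whole string has to fit the pattern.
rejected = anchored.match('dan--') is None
accepted = anchored.match('dan-smith_2') is not None
```

On Python 3.4 and later, re.fullmatch with the unanchored pattern should give the same whole-string behaviour.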
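Try to match '--dan--' with the anchored pattern. This does fail, so the assertion holds.
The reason is the ?, which means zero or one (but not two or more). Without the anchors, the trailing * lets the whole group repeat zero times, so match succeeds on the empty string and never returns None.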
[\-\._]? matches one or none of the characters in the brackets, and must be followed by one or more letters or digits. Because of the trailing *, the whole group in parentheses may also match nothing at all. But rgx.match('dan--') == None fails because it's okay to have -- after dan, since you're not specifying what should come after [a-zA-Z0-9]+. You need anchors. If you don't mind the underscore, then you could change [a-zA-Z0-9]+ to (\w|\d)+ (plain \w+ is equivalent, since \w already includes digits).
'^([\-\.]?[a-zA-Z0-9]+)*$'
# also matches '-underscore_dan'
'^([\-\.]?(\w|\d)+)*$'

Correct usage of \D in python?

I have some code where I am trying to find a certain set of numbers. The length varies and I do not want them to be found amongst other numbers. For example the following code:
import re
reg = r"\D12345\D"
string = "12345"
matchedResults = re.finditer(reg, string)
for match in matchedResults:
    print(match.group(0))
Does not work if the number is just by itself. However this will work if I put:
string="a12345"
but this will also match the a which is undesirable. Is there a better way to do this?
Use zero-width negative look-around assertions:
reg = r"(?<!\d)12345(?!\d)"
The look-around assertions (lookbehind and lookahead) match a position, not a character; the negative assertions only match if the preceding text or the following text respectively does not match the named pattern.
This means only locations that do not follow or precede a number will be matched; the start and end of a string will do for that purpose.
Demo:
>>> import re
>>> reg = re.compile(r"(?<!\d)12345(?!\d)")
>>> reg.search('12345')
<_sre.SRE_Match object at 0x102981ac0>
>>> reg.search('-12345-')
<_sre.SRE_Match object at 0x102a51238>
>>> reg.search('0123456')
>>> reg.search('012345-')
>>> reg.search('-123456')
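Since the look-arounds consume nothing, the match covers only the digits, which addresses the "also matches the a" complaint in the question. Below is a sketch that generalises this to any number; find_standalone is a hypothetical helper, and re.escape is used so the number is treated literally.

```python
import re

def find_standalone(number, text):
    """Return (start, end) spans where `number` is not adjacent to a digit.

    find_standalone is an illustrative helper, not part of the original answer.
    """
    pattern = re.compile(r"(?<!\d)" + re.escape(number) + r"(?!\d)")
    return [m.span() for m in pattern.finditer(text)]

# 'a12345b' is found without the surrounding letters; '0123456' is skipped.
spans = find_standalone("12345", "12345 a12345b 0123456 12345")
```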

Regular expression to return all characters between two special characters

How would I go about using regex to return all characters between two brackets?
Here is an example:
foobar['infoNeededHere']ddd
needs to return infoNeededHere
I found a regex to do it between curly brackets but all attempts at making it work with square brackets have failed. Here is that regex: (?<={)[^}]*(?=}) and here is my attempt to hack it
(?<=[)[^}]*(?=])
Final Solution:
import re
s = "foobar['InfoNeeded'],"
match = re.match(r"^.*\['(.*)'\].*$", s)
print(match.group(1))
If you're new to REG(ular) EX(pressions), you can learn about them in the Python docs. Or, if you want a gentler introduction, you can check out the HOWTO. They use Perl-style syntax.
Regex
The expression that you need is .*?\[(.*)\].*. The group that you want will be \1.
- .*?: . matches any character but a newline. * is a meta-character and means Repeat this 0 or more times. ? makes the * non-greedy, i.e., . will match up as few chars as possible before hitting a '['.
- \[: \ escapes special meta-characters, which in this case is [. If we didn't do that, [ would start a character class instead of matching a literal bracket.
- (.*): Parentheses 'group' whatever is inside them, and you can later retrieve the groups by their numeric IDs or names (if they're given one).
- \].*: You should know enough by now to know what this means.
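The greedy versus non-greedy distinction in the list above matters as soon as a string holds more than one bracketed item. This is a small sketch on a made-up input string:

```python
import re

s = "a[1]b[2]"

# Greedy: .* runs to the last ']' it can still match against.
greedy = re.search(r"\[(.*)\]", s).group(1)

# Non-greedy: .*? stops at the first ']' that lets the match succeed.
lazy = re.search(r"\[(.*?)\]", s).group(1)
```

With a single pair of brackets per string, as in the question, both forms capture the same text.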
Implementation
First, import the re module -- it's not a built-in -- to where-ever you want to use the expression.
Then, use re.search(regex_pattern, string_to_be_tested) to search for the pattern in the string to be tested. This will return a MatchObject, which you can store in a temporary variable. You should then call its group() method and pass 1 as an argument (to see the 'Group 1' we captured using parentheses earlier). It should now look like:
>>> import re
>>> pat = r'.*?\[(.*)].*' #See Note at the bottom of the answer
>>> s = "foobar['infoNeededHere']ddd"
>>> match = re.search(pat, s)
>>> match.group(1)
"'infoNeededHere'"
An Alternative
You can also use findall() to find all the non-overlapping matches by modifying the regex to (?<=\[).+?(?=\]).
- (?<=\[): (?<=) is called a look-behind assertion and checks for an expression preceding the actual match.
- .+?: + is just like * except that it matches one or more repetitions. It is made non-greedy by ?.
- (?=\]): (?=) is a look-ahead assertion and checks for an expression following the match w/o capturing it.
Your code should now look like:
>>> import re
>>> pat = r'(?<=\[).+?(?=\])' #See Note at the bottom of the answer
>>> s = "foobar['infoNeededHere']ddd[andHere] [andOverHereToo[]"
>>> re.findall(pat, s)
["'infoNeededHere'", 'andHere', 'andOverHereToo[']
Note: Always use raw Python strings by adding an 'r' before the string (E.g.: r'blah blah blah').
Thanks for reading! I wrote this answer when there were no accepted ones yet, but by the time I finished it, 2 more came up and one got accepted. :(
^.*\['(.*)'\].*$ will match a line and capture what you want in a group.
You have to escape the [ and ] with \
The documentation at the rubular.com proof link will explain how the expression is formed.
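For reference, here is how that expression behaves on the string from the question. It is only a sketch, and it assumes the value is always wrapped in single quotes inside the brackets, as in the example.

```python
import re

s = "foobar['infoNeededHere']ddd"

# The quotes are matched outside the group, so they are not captured.
m = re.match(r"^.*\['(.*)'\].*$", s)
captured = m.group(1) if m else None

# Without the quotes in the input, this pattern does not match at all.
unquoted = re.match(r"^.*\['(.*)'\].*$", "foobar[infoNeededHere]ddd")
```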
If there's only one of these [.....] tokens per line, then you don't need to use regular expressions at all:
In [7]: mystring = "Bacon, [eggs], and spam"
In [8]: mystring[ mystring.find("[")+1 : mystring.find("]") ]
Out[8]: 'eggs'
If there's more than one of these per line, then you'll need to modify Jarrod's regex ^.*\['(.*)'\].*$ to match multiple times per line, and to be non greedy. (Use the .*? quantifier instead of the .* quantifier.)
In [15]: mystring = "[Bacon], [eggs], and [spam]."
In [16]: re.findall(r"\[(.*?)\]",mystring)
Out[16]: ['Bacon', 'eggs', 'spam']
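One thing to watch with the str.find approach: find returns -1 when the character is missing, and the slice then quietly returns the wrong text. A guarded version might look like this sketch (first_bracketed is a hypothetical helper name):

```python
def first_bracketed(text):
    """Return the text inside the first [...] pair, or None if there is none.

    first_bracketed is an illustrative helper, not part of the original answer.
    """
    start = text.find("[")
    if start == -1:
        return None
    # Look for the closing bracket only after the opening one.
    end = text.find("]", start + 1)
    if end == -1:
        return None
    return text[start + 1:end]

eggs = first_bracketed("Bacon, [eggs], and spam")
```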
