Python Regex matching already matched sub-string

Python Regex matching already matched sub-string - python

I'm fairly new to Python Regex and I'm not able to understand the following:
I'm trying to find one small letter surrounded by three capital letters.
My first problem is that the below regex is giving only one match instead of the two matches that are present ['AbAD', 'DaDD']
>>> import re
>>>
>>> # String
... str = 'AbADaDD'
>>>
>>> pat = '[A-Z][a-z][A-Z][A-Z]'
>>> regex = re.compile(pat)
>>>
>>> print regex.findall(str)
['AbAD']
I guess the above is due to the fact that the last D in the first regex is not available for matching any more? Is there any way to turn off this kind of matching.
The second issue is the following regex:
>>> import re
>>>
>>> # String
... str = 'AbADaDD'
>>>
>>> pat = '[^A-Z][A-Z][a-z][A-Z][A-Z][^A-Z]'
>>> regex = re.compile(pat)
>>>
>>> print regex.findall(str)
[]
Basically what I want is that there shouldn't be more than three capital letters surrounding a small letter, and therefore I placed a negative match around them. But ['AbAD'] should be matched, but it is not getting matched. Any ideas?

It's mainly because of the overlapping of matches. Just put your regex inside a lookahead inorder to handle this type of overlapping matches.
(?=([A-Z][a-z][A-Z][A-Z]))
Code:
>>> s = 'AbADaDD'
>>> re.findall(r'(?=([A-Z][a-z][A-Z][A-Z]))', s)
['AbAD', 'DaDD']
DEMO
For the 2nd one, you should use negative lookahead and lookbehind assertion like below,
(?=(?<![A-Z])([A-Z][a-z][A-Z][A-Z])(?![A-Z]))
Code:
>>> re.findall(r'(?=(?<![A-Z])([A-Z][a-z][A-Z][A-Z])(?![A-Z]))', s)
['AbAD']
DEMO
The problem with your second regex is, [^A-Z] consumes a character (there isn't a character other than uppercase letter exists before first A) but the negative look-behind (?<![A-Z]) also do the same but it won't consume any character . It asserts that the match would be preceded by any but not of an uppercase letter. That;s why you won't get any match.

The problem with you regex is tha it is eating up the string as it progresses leaving nothing for second match.Use lookahead to make sure it does not eat up the string.
pat = '(?=([A-Z][a-z][A-Z][A-Z]))'
For your second regex again do the same.
print re.findall(r"(?=([A-Z][a-z][A-Z][A-Z](?=[^A-Z])))",s)
.For more insights see
1)After first match the string left is aDD as the first part has matched.
2)aDD does not satisfy pat = '[A-Z][a-z][A-Z][A-Z]'.So it is not a part of your match.

1st issue,
You should use this pattern,
r'([A-Z]{1}[a-z]{1}[A-Z]{1})'
Example
>>> import re
>>> str = 'AbADaDD'
>>> re.findall(r'([A-Z]{1}[a-z]{1}[A-Z]{1})', str)
['AbA', 'DaD']
2nd issue
You should use,
(?=(?<![A-Z])([A-Z]{1}[a-z]{1}[A-Z]{1}[A-Z]{1})(?![A-Z]))
Example
>>> import re
>>> str = 'AbADaDD'
>>> re.findall(r'(?=(?<![A-Z])([A-Z]{1}[a-z]{1}[A-Z]{1}[A-Z]{1})(?![A-Z]))', str)
['AbAD']

Related

Why does the Python regex ".PATTERN" match "XXPATTERXX"?

Suppose I want to find "PATTERN" in a string, where "PATTERN" could be anywhere in the string. My first try was *PATTERN*, but this generates an error saying that there is "nothing to repeat", which I can accept so I tried .*PATTERN*. This regex does however not give the expected result, see below
import re
p = re.compile(".*PATTERN*")
s = "XXPATTERXX"
if p.match(s):
print s + " match with '.*PATTERN*'"
The result is
XXPATTERXX match with '.*PATTERN*'
Why does "PATTER" match?
Note: I know that I could use .*PATTERN.* to get the expected result, but I am curious to find out why the asterisk on it self fails to get the results.

Your pattern matches 0 or more N characters at the end, but doesn't say anything about what comes after those N characters.
You could add $ to the pattern to anchor to the end of the input string to disallow the XX:
>>> import re
>>> re.compile(".*PATTERN*$")
<_sre.SRE_Pattern object at 0x10029fb90>
>>> import re
>>> p = re.compile(".*PATTERN*$")
>>> p.match("XXPATTERXX") is None
True
>>> p.match("XXPATTER") is None
False
>>> p.match("XXPATTER")
<_sre.SRE_Match object at 0x1004627e8>
You may want to look into the different types of anchor. \b may also fit your needs; it matches word boundaries (so between a \w and \W class character, or between \W and \w), or you could use negative look-ahead and look-behinds to disallow other characters around your PATTERN string.

Find and replace symbols with regex python

I have such sample:
sample = 'TEXT/xx_271802_1A'
p = re.compile("(/[a-z]{2})")
print p.match(sample)
in position of xx may be any from [a-z] in quantity of 2:
TEXT/qq_271802_1A TEXT/sg_271802_1A TEXT/ut_271802_1A
How can I find this xx and f.e. replace it with 'WW':
TEXT/WW_271802_1A TEXT/WW_271802_1A TEXT/WW_271802_1A
my code returns None

sample = 'TEXT/xx_271802_1A'
p = re.compile("(/[a-z]{2})")
print p.search(sample).group()
Your code return None as you are using match which matches from start.You need search or findall as you are finding anywhere in string and not at start.
For replacement use
re.sub(r'(?<=/)[a-z]{2}','WW',sample)

You can try the following Regular expression :
>>> sample = 'TEXT/xx_271802_1A'
>>> import re
>>> re.findall(r'([a-z])\1',sample)
['x']
>>> re.sub(r'([a-z])\1','WW',sample)
'TEXT/WW_271802_1A'
>>> sample = 'TEXT/WW_271802_1A TEXT/WW_271802_1A TEXT/WW_271802_1A'
>>> re.sub(r'([a-z])\1','WW',sample)
'TEXT/WW_271802_1A TEXT/WW_271802_1A TEXT/WW_271802_1A'
The RegEx ([a-z])\1 searches for 1 letter and then matches it if it repeats immediately.

you only need to do this:
sample = re.sub(r'(?<=/)[a-z]{2}', 'WW', sample)
No need to check the string before with match. re.sub makes the replacement when the pattern is found.
(?<=..) is a lookbehind assertion and means preceded by, it's only a check and is not part of the match result. So / is not replaced.
In the same way, you can add a lookahead (?=_) (followed by) at the end of the pattern, if you want to check if there is the underscore.

Regex pattern to extract substring

mystring = "q1)whatq2)whenq3)where"
want something like ["q1)what", "q2)when", "q3)where"]
My approach is to find the q\d+\) pattern then move till I find this pattern again and stop. But I'm not able to stop.
I did req_list = re.compile("q\d+\)[*]\q\d+\)").split(mystring)
But this gives the whole string.
How can I do it?

You could try the below code which uses re.findall function,
>>> import re
>>> s = "q1)whatq2)whenq3)where"
>>> m = re.findall(r'q\d+\)(?:(?!q\d+).)*', s)
>>> m
['q1)what', 'q2)when', 'q3)where']
Explanation:
q\d+\) Matches the string in the format q followed by one or more digits and again followed by ) symbol.
(?:(?!q\d+).)* Negative look-ahead which matches any char not of q\d+ zero or more times.

Correct usage of \D in python?

I have some code where I am trying to find a certain set of numbers. The length varies and I do not want them to be found amongst other numbers. For example the following code:
reg="\D12345\D"
string="12345"
matchedResults = re.finditer(reg, string)
for match in matchedResults:
print match.group(0)
Does not work if the number is just by itself. However this will work if I put:
string="a12345"
but this will also match the a which is undesirable. Is there a better way to do this?

Use zero-width negative look-around assertions:
reg = r"(?<!\d)12345(?!\d)"
The look-around assertions (lookbehind and lookahead) match a position, not a character; the negative assertions only match if the preceding text or the following text respectively does not match the named pattern.
This means only locations that do not follow or precede a number will be matched; the start and end of a string will do for that purpose.
Demo:
>>> import re
>>> reg = re.compile(r"(?<!\d)12345(?!\d)")
>>> reg.search('12345')
<_sre.SRE_Match object at 0x102981ac0>
>>> reg.search('-12345-')
<_sre.SRE_Match object at 0x102a51238>
>>> reg.search('0123456')
>>> reg.search('012345-')
>>> reg.search('-123456')

extracting multiple instances regex python

I have a string:
This is #lame
Here I want to extract lame. But here is the issue, the above string can be
This is lame
Here I dont extract anything. And then this string can be:
This is #lame but that is #not
Here i extract lame and not
So, output I am expecting in each case is:
[lame]
[]
[lame,not]
How do I extract these in robust way in python?

Use re.findall() to find multiple patterns; in this case for anything that is preceded by #, consisting of word characters:
re.findall(r'(?<=#)\w+', inputtext)
The (?<=..) construct is a positive lookbehind assertion; it only matches if the current position is preceded by a # character. So the above pattern matches 1 or more word characters (the \w character class) only if those characters were preceded by an # symbol.
Demo:
>>> import re
>>> re.findall(r'(?<=#)\w+', 'This is #lame')
['lame']
>>> re.findall(r'(?<=#)\w+', 'This is lame')
[]
>>> re.findall(r'(?<=#)\w+', 'This is #lame but that is #not')
['lame', 'not']
If you plan on reusing the pattern, do compile the expression first, then use the .findall() method on the compiled regular expression object:
at_words = re.compile(r'(?<=#)\w+')
at_words.findall(inputtext)
This saves you a cache lookup every time you call .findall().

You should use re lib here is an example:
import re
test case = "This is #lame but that is #not"
regular = re.compile("#[\w]*")
lst= regular.findall(test case)

This will give the output you requested:
import re
regex = re.compile(r'(?<=#)\w+')
print regex.findall('This is #lame')
print regex.findall('This is lame')
print regex.findall('This is #lame but that is #not')

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Regex matching already matched sub-string - python

Related

Why does the Python regex ".PATTERN" match "XXPATTERXX"?

Find and replace symbols with regex python

Regex pattern to extract substring

Correct usage of \D in python?

extracting multiple instances regex python

Categories

Resources

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Regex matching already matched sub-string - python

Related

Why does the Python regex ".*PATTERN*" match "XXPATTERXX"?

Find and replace symbols with regex python

Regex pattern to extract substring

Correct usage of \D in python?

extracting multiple instances regex python

Categories

Resources

Why does the Python regex ".PATTERN" match "XXPATTERXX"?