Why does the Python regex ".*PATTERN*" match "XXPATTERXX"?

Suppose I want to find "PATTERN" in a string, where "PATTERN" could be anywhere in the string. My first try was *PATTERN*, but this raises an error saying that there is "nothing to repeat". I can accept that, so I tried .*PATTERN* instead. However, this regex does not give the expected result, see below:
import re
p = re.compile(".*PATTERN*")
s = "XXPATTERXX"
if p.match(s):
    print(s + " match with '.*PATTERN*'")
The result is
XXPATTERXX match with '.*PATTERN*'
Why does "PATTER" match?
Note: I know that I could use .*PATTERN.* to get the expected result, but I am curious to find out why the asterisk by itself fails to give that result.

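The * applies only to the character immediately before it, so your pattern ends with "0 or more N characters" and doesn't say anything about what comes after those N characters. Since match only anchors at the start of the string, .*PATTER followed by zero N characters is enough for a match, and the trailing XX is simply ignored.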
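To see which part of the string the pattern actually consumed, you can inspect the match object. This is only a small sketch with the standard re module; the variable names are made up for illustration.

```python
import re

s = "XXPATTERXX"
m = re.match(r".*PATTERN*", s)

# .* takes "XX", the literal PATTER follows, and N* matches zero N characters.
matched = m.group()
leftover = s[m.end():]
print(matched)   # XXPATTER
print(leftover)  # XX

# A plain substring search for the full word finds nothing in this string.
found_full_word = re.search(r"PATTERN", s)
print(found_full_word)  # None
```

So the match stops after PATTER, and the final XX is never looked at.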
You could add $ to the pattern to anchor to the end of the input string to disallow the XX:
>>> import re
>>> re.compile(".*PATTERN*$")
<_sre.SRE_Pattern object at 0x10029fb90>
>>> import re
>>> p = re.compile(".*PATTERN*$")
>>> p.match("XXPATTERXX") is None
True
>>> p.match("XXPATTER") is None
False
>>> p.match("XXPATTER")
<_sre.SRE_Match object at 0x1004627e8>
You may want to look into the different types of anchor. \b may also fit your needs; it matches word boundaries (so between a \w and \W class character, or between \W and \w). Alternatively, you could use negative look-ahead and look-behind assertions to disallow other characters around your PATTERN string.
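As a rough illustration of those two options, here is a sketch that compares \b with a pair of negative lookarounds. Forbidding only uppercase neighbours in the lookaround version is an assumption made for the example, not something the question asks for.

```python
import re

# \b needs a word/non-word transition on each side of PATTERN.
word = re.compile(r"\bPATTERN\b")
# Lookarounds here only forbid uppercase letters next to PATTERN (illustrative choice).
isolated = re.compile(r"(?<![A-Z])PATTERN(?![A-Z])")

tests = ["a PATTERN here", "XXPATTERNXX", "12PATTERN34"]
word_hits = [bool(word.search(t)) for t in tests]
isolated_hits = [bool(isolated.search(t)) for t in tests]
print(word_hits)      # [True, False, False]
print(isolated_hits)  # [True, False, True]
```

The last test string shows the difference: digits count as \w, so \b rejects 12PATTERN34, while the lookaround version only cares about uppercase letters.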

Related

Python Regex matching already matched sub-string

I'm fairly new to Python Regex and I'm not able to understand the following:
I'm trying to find one small letter surrounded by three capital letters.
My first problem is that the below regex is giving only one match instead of the two matches that are present ['AbAD', 'DaDD']
>>> import re
>>>
>>> # String
... str = 'AbADaDD'
>>>
>>> pat = '[A-Z][a-z][A-Z][A-Z]'
>>> regex = re.compile(pat)
>>>
>>> print regex.findall(str)
['AbAD']
I guess the above is due to the fact that the last D of the first match is not available for matching any more? Is there any way to turn off this kind of matching?
The second issue is the following regex:
>>> import re
>>>
>>> # String
... str = 'AbADaDD'
>>>
>>> pat = '[^A-Z][A-Z][a-z][A-Z][A-Z][^A-Z]'
>>> regex = re.compile(pat)
>>>
>>> print regex.findall(str)
[]
Basically, what I want is that there shouldn't be more than three capital letters surrounding a small letter, and therefore I placed a negative match around them. ['AbAD'] should be matched, but it is not. Any ideas?
It's mainly because the matches overlap. Just put your regex inside a lookahead in order to handle this type of overlapping match.
(?=([A-Z][a-z][A-Z][A-Z]))
Code:
>>> s = 'AbADaDD'
>>> re.findall(r'(?=([A-Z][a-z][A-Z][A-Z]))', s)
['AbAD', 'DaDD']
For the 2nd one, you should use negative lookahead and lookbehind assertion like below,
(?=(?<![A-Z])([A-Z][a-z][A-Z][A-Z])(?![A-Z]))
Code:
>>> re.findall(r'(?=(?<![A-Z])([A-Z][a-z][A-Z][A-Z])(?![A-Z]))', s)
['AbAD']
The problem with your second regex is that [^A-Z] must consume a character, and there is no character before the first A for it to consume. The negative look-behind (?<![A-Z]) checks the same position without consuming any character: it only asserts that the match is not preceded by an uppercase letter, which is also true at the start of the string. That's why your version won't get any match.
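The difference between consuming a neighbour and merely asserting something about it can be checked side by side. This is a small sketch; the helper name core is only there to avoid repeating the pattern.

```python
import re

s = 'AbADaDD'
core = r'([A-Z][a-z][A-Z][A-Z])'

# [^A-Z] must consume a real character, and nothing precedes the first A.
consuming = re.findall(r'[^A-Z]' + core + r'[^A-Z]', s)
# Lookarounds only test the neighbours, so start/end of string are acceptable.
zero_width = re.findall(r'(?<![A-Z])' + core + r'(?![A-Z])', s)
# Wrapping the pattern in a lookahead reports overlapping candidates.
overlapping = [m.start(1) for m in re.finditer(r'(?=' + core + r')', s)]

print(consuming)    # []
print(zero_width)   # ['AbAD']
print(overlapping)  # [0, 3]
```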
The problem with your regex is that it eats up the string as it progresses, leaving nothing for the second match. Use a lookahead to make sure it does not eat up the string.
pat = '(?=([A-Z][a-z][A-Z][A-Z]))'
For your second regex again do the same.
print re.findall(r"(?=([A-Z][a-z][A-Z][A-Z](?=[^A-Z])))",s)
To see why the first regex finds only one match:
1) After the first match, the string left is aDD, as the first part has already been matched.
2) aDD does not satisfy pat = '[A-Z][a-z][A-Z][A-Z]', so it is not a part of your match.
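Those two steps can be reproduced by hand with search and its optional start position. This is just a sketch of what findall does internally when it resumes after a match.

```python
import re

s = 'AbADaDD'
pat = re.compile('[A-Z][a-z][A-Z][A-Z]')

first = pat.search(s)
remaining = s[first.end():]
# Resume scanning where the first match ended, as findall does.
second = pat.search(s, first.end())

print(first.span())  # (0, 4)
print(remaining)     # aDD
print(second)        # None
```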
1st issue,
You should use this pattern,
r'([A-Z]{1}[a-z]{1}[A-Z]{1})'
Example
>>> import re
>>> str = 'AbADaDD'
>>> re.findall(r'([A-Z]{1}[a-z]{1}[A-Z]{1})', str)
['AbA', 'DaD']
2nd issue
You should use,
(?=(?<![A-Z])([A-Z]{1}[a-z]{1}[A-Z]{1}[A-Z]{1})(?![A-Z]))
Example
>>> import re
>>> str = 'AbADaDD'
>>> re.findall(r'(?=(?<![A-Z])([A-Z]{1}[a-z]{1}[A-Z]{1}[A-Z]{1})(?![A-Z]))', str)
['AbAD']

How to print regex match results in python 3?

I was in IDLE, and decided to use regex to sort out a string. But when I typed in what the online tutorial told me to, all it would do was print:
<_sre.SRE_Match object at 0x00000000031D7E68>
Full program:
import re
reg = re.compile("[a-z]+8?")
str = "ccc8"
print(reg.match(str))
result:
<_sre.SRE_Match object at 0x00000000031D7ED0>
Could anybody tell me how to actually print the result?
You need to call .group() on the result of the match function to print the matched string; otherwise printing only shows whether a match happened or not. To print the chars which are captured by the capturing groups, you need to pass the corresponding group index to the .group() function.
>>> import re
>>> reg = re.compile("[a-z]+8?")
>>> str = "ccc8"
>>> print(reg.match(str).group())
ccc8
Regex with capturing group.
>>> reg = re.compile("([a-z]+)8?")
>>> print(reg.match(str).group(1))
ccc
re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
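The MULTILINE remark is easy to verify. In this sketch the two-line text is an invented example.

```python
import re

text = "first\nccc8"
pattern = re.compile(r"^[a-z]+8", re.MULTILINE)

at_start = pattern.match(text)   # only tries position 0
anywhere = pattern.search(text)  # may start at the beginning of any line

print(at_start)          # None
print(anywhere.group())  # ccc8
```

If you want "first match anywhere" behaviour, search is usually the function to reach for rather than match.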
If you need to get the whole match value, you should use
m = re.match(r"[a-z]+8?", text)
if m:  # Always check if a match occurred to avoid NoneType issues
    print(m.group())  # Print the match string
If you need to extract a part of the regex match, you need to use capturing groups in your regular expression. Enclose those patterns with a pair of unescaped parentheses.
To only print captured group results, use Match.groups:
Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.
So, to get ccc and 8 and display only those, you may use
import re
reg = re.compile("([a-z]+)(8?)")
s = "ccc8"
m = reg.match(s)
if m:
    print(m.groups())  # => ('ccc', '8')
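The default argument of groups only matters when a group did not take part in the match. The sketch below uses a slightly different pattern from the one above, with the whole second group made optional, purely to show that case.

```python
import re

reg = re.compile(r"([a-z]+)(8)?")  # here the whole second group is optional

m = reg.match("ccc")
whole = m.group()             # same as m.group(0)
default_none = m.groups()     # group 2 did not participate
default_empty = m.groups("")  # substitute "" for missing groups

print(whole)          # ccc
print(default_none)   # ('ccc', None)
print(default_empty)  # ('ccc', '')
```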

Regex: I have a flaw in my logic

I'm trying to match a pattern where the non-word characters in the first bracket never repeat and the pattern must end with the second set in brackets. I just don't understand why this test case is failing:
regexString = '([\-\._]?[a-zA-Z0-9]+)*'
rgx = re.compile(regexString)
assert(rgx.match('dan--') == None)
Documentation for re.match: https://docs.python.org/2/library/re.html#re.match
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance.
In your case '([-._]?[a-zA-Z0-9]+)*' clearly matches the 'dan' part of 'dan--', hence the result is not None but a MatchObject. If you don't want it to match anything other than what's in your group, put your group between ^ and $.
If you want to check that the pattern matches the whole string, use the ^ and $ anchors.
>>> import re
>>> regexString = r'^([\-\._]?[a-zA-Z0-9]+)*$'
>>> rgx = re.compile(regexString)
>>> rgx.match('dan--')
>>> rgx.match('dan')
<_sre.SRE_Match object at 0x00000000029E0D50>
BTW, ^ is not strictly required because match matches only at the beginning of the string.
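If you are on Python 3.4 or newer (an assumption, since the documentation linked above is for Python 2), fullmatch gives the same effect as adding the anchors. A quick sketch, with an extra test string invented for the example:

```python
import re

rgx = re.compile(r'([-._]?[a-zA-Z0-9]+)*')

prefix_only = rgx.match('dan--')       # succeeds on the 'dan' prefix
whole_string = rgx.fullmatch('dan--')  # must consume everything
valid = rgx.fullmatch('dan-smith_99')

print(prefix_only.group())  # dan
print(whole_string)         # None
print(valid.group())        # dan-smith_99
```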
Try to match '--dan--' against the anchored pattern. This would indeed fail, and the result of the assertion would be true. (Without the anchors it would not: the outer * lets the group repeat zero times, so match returns a zero-length match rather than None.)
The reason is the ?, meaning zero or one (but not two or more).
[\-\._]? is one or none of the characters in the brackets, which must be followed by one or more letters or digits. Because of the *, the whole group in parentheses may also repeat zero times and match nothing. But rgx.match('dan--') == None fails because it's okay to have -- after dan, since you're not specifying whether anything should come after [a-zA-Z0-9]+. You need anchors. If you don't mind the underscore, then you could change [a-zA-Z0-9]+ to (\w|\d)+ (or simply \w+, since \w already includes digits).
'^([\-\.]?[a-zA-Z0-9]+)*$'
# also matches '-underscore_dan'
'^([\-\.]?(\w|\d)+)*$'
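Since \w already covers digits and the underscore, the (\w|\d)+ form can be shortened. The sketch below compares the two on a few invented test strings; short_form is my own simplification, not part of the answer above.

```python
import re

# (\w|\d)+ and \w+ accept the same strings, because \w already includes digits and "_".
long_form = re.compile(r'^([-.]?(\w|\d)+)*$')
short_form = re.compile(r'^([-.]?\w+)*$')

samples = ['-underscore_dan', 'dan.99', 'dan--', 'dan-']
long_results = [bool(long_form.match(t)) for t in samples]
short_results = [bool(short_form.match(t)) for t in samples]
print(long_results)   # [True, True, False, False]
print(short_results)  # [True, True, False, False]
```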

Correct usage of \D in python?

I have some code where I am trying to find a certain set of numbers. The length varies and I do not want them to be found amongst other numbers. For example the following code:
reg="\D12345\D"
string="12345"
matchedResults = re.finditer(reg, string)
for match in matchedResults:
    print(match.group(0))
Does not work if the number is just by itself. However this will work if I put:
string="a12345"
but this will also match the a which is undesirable. Is there a better way to do this?
Use zero-width negative look-around assertions:
reg = r"(?<!\d)12345(?!\d)"
The look-around assertions (lookbehind and lookahead) match a position, not a character; the negative assertions only match if the preceding text or the following text respectively does not match the named pattern.
This means only locations that do not follow or precede a digit will be matched; the start and end of a string will do for that purpose.
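To compare the two approaches on one input, here is a short sketch; the sample text is invented for the example.

```python
import re

text = "12345 a12345b 0123456"

# \D consumes a neighbour on each side, so it needs one and includes it in the match.
with_D = [m.group(0) for m in re.finditer(r"\D12345\D", text)]
# Lookarounds only inspect the neighbours; the match is just the digits.
with_lookaround = [m.span() for m in re.finditer(r"(?<!\d)12345(?!\d)", text)]

print(with_D)           # ['a12345b']
print(with_lookaround)  # [(0, 5), (7, 12)]
```

The \D version misses the number at the start of the string and drags the letters along; the lookaround version finds both standalone numbers and skips the one embedded in 0123456.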
Demo:
>>> import re
>>> reg = re.compile(r"(?<!\d)12345(?!\d)")
>>> reg.search('12345')
<_sre.SRE_Match object at 0x102981ac0>
>>> reg.search('-12345-')
<_sre.SRE_Match object at 0x102a51238>
>>> reg.search('0123456')
>>> reg.search('012345-')
>>> reg.search('-123456')

python regex: get end digits from a string

I am quite new to python and regex (regex newbie here), and I have the following simple string:
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
I would like to extract only the last digits in the above string i.e 767980716 and I was wondering how I could achieve this using python regex.
I wanted to do something similar along the lines of:
re.compile(r"""-(.*?)""").search(str(s)).group(1)
indicating that I want to find the stuff in between (.*?) which starts with a "-" and ends at the end of string - but this returns nothing..
I was wondering if anyone could point me in the right direction..
Thanks.
You can use re.match to capture only the trailing digits:
>>> import re
>>> s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
>>> re.match('.*?([0-9]+)$', s).group(1)
'767980716'
Alternatively, re.finditer works just as well:
>>> next(re.finditer(r'\d+$', s)).group(0)
'767980716'
Explanation of all regexp components:
.*? is a non-greedy match and consumes only as little as possible (a greedy match would consume everything except for the last digit).
[0-9] and \d are two different ways of matching digits. Note that the latter also matches digits in other writing systems, like ୪ or ൨.
Parentheses (()) make the content of the expression a group, which can be retrieved with group(1) (or 2 for the second group, 0 for the whole match).
+ means one or more repetitions (at least one digit at the end).
$ matches only the end of the input.
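The points in that list can be checked directly. This sketch assumes Python 3, where \d on a str also matches non-ASCII digits unless re.ASCII is passed; the variable names are only illustrative.

```python
import re

s = "99-my-name-is-John-Smith-6376827-%^-1-2-767980716"

lazy = re.match(r'.*?([0-9]+)$', s).group(1)
greedy = re.match(r'.*([0-9]+)$', s).group(1)
# The attempt from the question: a lazy group with nothing after it matches the empty string.
original = re.search(r'-(.*?)', s).group(1)

print(lazy)            # 767980716
print(greedy)          # 6
print(repr(original))  # ''

# \d accepts the Oriya digit four unless the pattern is restricted to ASCII.
oriya_four = "\u0b6a"
unicode_digit = re.fullmatch(r'\d', oriya_four)
ascii_only = re.fullmatch(r'\d', oriya_four, re.ASCII)
print(unicode_digit is not None, ascii_only is None)  # True True
```

This also explains why the original attempt "returns nothing": the lazy group is happy to match zero characters right after the first hyphen.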
Nice and simple with findall:
import re
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
print re.findall('^.*-([0-9]+)$',s)
>>> ['767980716']
Regex Explanation:
^ # Match the start of the string
.* # Followed by anything
- # Up to the last hyphen
([0-9]+) # Capture the digits after the hyphen
$ # Up to the end of the string
Or, more simply, just match the digits at the end of the string: '([0-9]+)$'
Your Regex should be (\d+)$.
\d+ is used to match digit (one or more)
$ is used to match at the end of string.
So, your code should be:
>>> s = "99-my-name-is-John-Smith-6376827-%^-1-2-767980716"
>>> import re
>>> re.compile(r'(\d+)$').search(s).group(1)
'767980716'
And you don't need to use str function here, as s is already a string.
Use the below regex
\d+$
$ marks the end of the string.
\d is a digit
+ matches the preceding element one or more times
Save the regular expressions for something that requires more heavy lifting.
>>> def parse_last_digits(line): return line.split('-')[-1]
>>> s = parse_last_digits(r"99-my-name-is-John-Smith-6376827-%^-1-2-767980716")
>>> s
'767980716'
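One caveat with the split version is that it returns whatever follows the last hyphen, digits or not. A possible variant, offered only as a sketch, checks the tail with str.isdigit (which also accepts some non-ASCII digit characters):

```python
def parse_last_digits(line):
    """Return the text after the last hyphen if it is all digits, else None."""
    tail = line.rsplit('-', 1)[-1]
    return tail if tail.isdigit() else None

print(parse_last_digits("99-my-name-is-John-Smith-6376827-%^-1-2-767980716"))  # 767980716
print(parse_last_digits("ends-with-text"))  # None
print(parse_last_digits("abc123"))          # None
```

Note the last case: unlike the regex answers, this version does not pull 123 out of abc123, because it only looks at the whole segment after the last hyphen.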
I have been playing around with several of these solutions, but many seem to fail if there are no numeric digits at the end of the string. The following code should work.
import re
W = input("Enter a string:")
if re.match('.*?([0-9]+)$', W) is None:
    last_digits = "None"
else:
    last_digits = re.match('.*?([0-9]+)$', W).group(1)
print("Last digits of "+W+" are "+last_digits)
Try using \d+$ instead. That matches one or more numeric characters followed by the end of the string.
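Wrapped in a small helper that also copes with strings that have no trailing digits, that might look like the following. The function name last_digits is made up for this sketch.

```python
import re

def last_digits(text):
    """Return the run of digits at the end of text, or None if there is none."""
    m = re.search(r'\d+$', text)
    return m.group(0) if m else None

print(last_digits("99-my-name-is-John-Smith-6376827-%^-1-2-767980716"))  # 767980716
print(last_digits("abc123"))     # 123
print(last_digits("no digits"))  # None
```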
