Regex string doesn't match - python

I'm having trouble matching a digit in a string with Python. Although it should clearly match, the pattern fails with [0-9], [\d], or even just 0 alone. Where is my oversight?
import re
file_without_extension = "/test/folder/something/file_0"
if re.match("[\d]+$", file_without_extension):
    print("file matched!")

Read the documentation: http://docs.python.org/2/library/re.html#re.match
"If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object."
You want to use re.search (or re.findall).

re.match is "anchored" to the beginning of the string. Use re.search.

Use re.search instead of re.match.
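A minimal sketch of the difference, reusing the path from the question. Python 3 syntax is assumed, and the names anchored and found are only illustrative:

```python
import re

file_without_extension = "/test/folder/something/file_0"

# re.match only tries the pattern at index 0, where the string has "/",
# so the digits at the end are never reached.
anchored = re.match(r"\d+$", file_without_extension)

# re.search scans forward through the string and finds the trailing digit.
found = re.search(r"\d+$", file_without_extension)

print(anchored)       # None
print(found.group())  # 0
```

The pattern is the same in both calls; only the function changes.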

Related

Python Not Extracting Expected Pattern

I'm new to RegEx and I am trying to perform a simple match to extract a list of items using re.findall. However, I am not getting the expected result. Can you please help explain why I am also getting the first piece of this string based on the below regex pattern and what I need to modify to get the desired output?
import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''
print(re.findall('_\w+_\w+_bar_\d+', string))
Current Output:
['_1y345_xyz_orange_bar_1', '_123a5542_xyz_orange_bar_1', '_1z34512_abc_purple_bar_1']
Desired Output:
['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']
The \w pattern matches letters, digits and _ symbol. Depending on the Python version and options used, the letters and digits it can match may be from the whole Unicode range or just ASCII.
So, the best way to fix the issue is by replacing \w with [^\W_]:
import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''
print(re.findall(r'_[^\W_]+_[^\W_]+_bar_[0-9]+', string))
# => ['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']
See the Python demo.
Details:
_ - an underscore
[^\W_]+ - 1 or more chars that are either digits or letters ([^ starts a negated character class, \W matches any non-word char, and _ is added so that the class also excludes the underscore, leaving only word chars other than _)
_[^\W_]+ - same as above
_bar_ - a literal substring _bar_
[0-9]+ - 1 or more ASCII digits.
See the regex demo.
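The remark about Unicode versus ASCII can be checked with a short sketch. It assumes Python 3, where str patterns are Unicode-aware by default; the sample string is made up for illustration and is not from the question:

```python
import re

# Hypothetical sample with accented letters: "_señor_naïve_bar_1"
sample = "_se\u00f1or_na\u00efve_bar_1"
pattern = r"_[^\W_]+_[^\W_]+_bar_[0-9]+"

# Default (Unicode) matching: accented letters count as word characters.
unicode_hits = re.findall(pattern, sample)

# With re.ASCII, [^\W_] shrinks to [a-zA-Z0-9], so the accented
# letters break both fields and nothing matches.
ascii_hits = re.findall(pattern, sample, flags=re.ASCII)

print(unicode_hits)
print(ascii_hits)
```

If only ASCII identifiers are expected, passing re.ASCII (or spelling out [a-zA-Z0-9]) makes that intent explicit.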
_[a-z]+_\w+_bar_\d+ should work.
import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''
print(re.findall('_[a-z]+_\w+_bar_\d+', string))
Output:
['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']
Your problem is that \w also matches the underscore, so the greedy \w+ can run across several underscore-separated fields. Greedy quantifiers can sometimes be tamed by adding a ? (question mark) after the + (plus) sign, but that does not help here: the engine still reports the leftmost possible match, which starts at the first underscore (it could likely be done with some lookahead, but not in any simple way). However, you can choose another pattern that explicitly forbids matching the _ (underscore) character:
import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''
print(re.findall('_[^_\W]+_[^_\W]+_bar_\d+', string))
This will match what you hope for. The [^ ... ] construct means "not", so [^_\W] means not an underscore and not a non-word character, i.e. only letters and digits.
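As a quick check of the point about the question mark, the following sketch (Python 3 assumed; the variable names are illustrative) runs the greedy pattern, a lazy variant, and the underscore-excluding pattern on the data from the question:

```python
import re

string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''

# Original pattern: \w also matches "_", so a field can span underscores.
greedy = re.findall(r'_\w+_\w+_bar_\d+', string)

# Lazy quantifiers: the match still starts at the first underscore,
# so the result is the same as with the greedy version.
lazy = re.findall(r'_\w+?_\w+?_bar_\d+', string)

# Excluding "_" from each field gives the desired output.
no_underscore = re.findall(r'_[^_\W]+_[^_\W]+_bar_\d+', string)

print(lazy == greedy)  # True
print(no_underscore)   # ['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']
```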
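The problem with your code is that the \w pattern also matches the underscore: for ASCII text it is equivalent to the character set [a-zA-Z0-9_] (with Unicode str patterns in Python 3 it additionally matches non-ASCII letters and digits).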
I guess you need to match the same set but without an underscore:
import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''
print(re.findall('_[a-zA-Z0-9]+_[a-zA-Z0-9]+_bar_\d+', string))
The output:
['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']
Your \w usage is too permissive. It will find not only letters, but numbers and underscores as well. From the docs:
When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
Instead, use explicit character classes to match:
_[a-z]+_[a-z]+_bar_[0-9]+
If you actually need the complete matching of \w without the underscore, you can change the character groupings to:
[a-zA-Z0-9]

Python regex equivalent for perl

What is the equivalent of the following Perl condition in Python?
if($line=~/DramBase/)
I tried the following, but it didn't match the line shown at the bottom:
if(re.match( r'DramBase', line)):
I had to change it to
if(re.match( r'.*DramBase', line)):
to match this line
# -DF0.CCM0.DramBaseAddress1 0x00004001
Is there a flag to match it anywhere on the line without explicitly matching starting characters ?
You need to use re.search, not re.match. re.match only matches at the beginning of the string, while re.search matches anywhere, like in Perl.
See re — Regular expression operations for an explanation
re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).
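A short sketch of the Perl-style test in Python, using the line from the question. Python 3 is assumed; the variable names and the two-line text are made up for illustration:

```python
import re

line = "# -DF0.CCM0.DramBaseAddress1 0x00004001"

# Rough equivalent of Perl's  if ($line =~ /DramBase/)
hit = re.search(r"DramBase", line)
print(bool(hit))  # True

# re.match fails because the line starts with "#", not "DramBase".
print(re.match(r"DramBase", line))  # None

# Even with MULTILINE, re.match is tried only at the start of the string,
# while re.search with ^ can match at the start of any line.
text = "first line\nDramBase here"
match_ml = re.match(r"^DramBase", text, re.MULTILINE)
search_ml = re.search(r"^DramBase", text, re.MULTILINE)
print(match_ml)           # None
print(search_ml.group())  # DramBase
```

For a fixed substring like this one, the plain test "DramBase" in line would also work without a regex.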

Is there a way to refer to the entire matched expression in re.sub without the use of a group?

Suppose I want to prepend all occurrences of a particular expression with a character such as \.
In sed, it would look like this.
echo '__^^^%%%__FooBar' | sed 's/[_^%]/\\&/g'
Note that the & character is used to represent the original matched expression.
I have looked through the regex docs and the regex howto, but I do not see an equivalent to the & character that can be used to substitute in the matched expression.
The only workaround I have found is to use an extra set of () to group the expression and then reference the group, as follows.
import re
line = "__^^^%%%__FooBar"
print(re.sub("([_%^$])", r"\\\1", line))
Is there a clean way to reference the entire matched expression without the extra group creation?
From the docs:
The backreference \g<0> substitutes in the entire substring matched by the RE.
Example:
>>> print(re.sub("[_%^$]", r"\\\g<0>", line))
\_\_\^\^\^\%\%\%\_\_FooBar
You could also get the same result by using a positive lookahead:
>>> print(re.sub("(?=[_%^$])", r"\\", line))
\_\_\^\^\^\%\%\%\_\_FooBar
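The two group-free options can be sketched side by side. Python 3 is assumed; the replacement function is an alternative not mentioned in the answers above, and the variable names are illustrative:

```python
import re

line = "__^^^%%%__FooBar"

# \g<0> in the replacement template stands for the whole match,
# much like & in sed.
with_backref = re.sub(r"[_%^$]", r"\\\g<0>", line)

# A replacement function receives the match object; group(0) is the whole match.
with_callable = re.sub(r"[_%^$]", lambda m: "\\" + m.group(0), line)

print(with_backref)                   # \_\_\^\^\^\%\%\%\_\_FooBar
print(with_backref == with_callable)  # True
```

The function form is handy when the replacement depends on the matched text in ways a template cannot express.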

How to ignore \n in regular expressions in python?

So I have a regex that checks whether a string is an integer:
regex = '^(0|[1-9][0-9]*)$'
import re
bool(re.search(regex, '42\n'))
returns True, which it is not supposed to.
Where does the problem come from?
From the documentation:
'$'
Matches the end of the string or just before the newline at the end of the string
Try \Z instead.
Also, any time you find yourself writing a regular expression that starts with ^ or \A and ends with $ or \Z, if your intent is to only match the entire string, you should probably use re.fullmatch() instead of re.search() (and omit the boundary markers from the regex). Or, if you're using a version of Python that's too old to have re.fullmatch() (you really should upgrade), you can use re.match(), omit the beginning-of-string boundary marker, and keep \Z at the end.
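The three options can be compared directly. This is a sketch assuming Python 3.4 or later (for re.fullmatch); the name body and the result variables are illustrative:

```python
import re

body = r"(0|[1-9][0-9]*)"

# $ also matches just before a single trailing newline.
with_dollar = bool(re.search("^" + body + "$", "42\n"))

# \Z matches only at the very end of the string.
with_z = bool(re.search("^" + body + r"\Z", "42\n"))

# fullmatch requires the whole string to match, so no anchors are needed.
with_fullmatch = bool(re.fullmatch(body, "42\n"))
clean_input = bool(re.fullmatch(body, "42"))

print(with_dollar, with_z, with_fullmatch, clean_input)  # True False False True
```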
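Adding word boundaries, as in regex = r'\b^(0|[1-9][0-9]*)$\b', does not change the result: there is still a word boundary between '2' and the trailing newline, so '42\n' matches. (Without the r prefix, '\b' is a backspace character, so the pattern would look for literal backspaces.)
The regex in the question matches the start of the string, the digits, and the end of the string, and '42\n' satisfies that because $ also matches just before a trailing newline; that is why it returns True. Note that "!" has no special meaning in Python regular expressions and does not indicate NOT.
Refer to https://docs.python.org/2/library/re.html
regex = '!(0|[1-9][0-9]*)$'
bool(re.search(regex, '42\n')) => returns False, but only because the string contains no literal "!" character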
Yeah, $ matching just before a final \n is kind of a trap/inconsistency. Check out my list of regex traps for python: http://www.cofoh.com/advanced-regex-tutorial-python/traps

Using anchors in python regex to get exact match

I need to validate a version number consisting of 'v' plus a positive int, and nothing else,
e.g. "v4", "v1004".
I have:
import re
pattern = "\Av(?=\d+)\W"
m = re.match(pattern, "v303")
if m is None:
    print("noMatch")
else:
    print("match")
But this doesn't work! Removing the \A and \W will match for v303 but will also match for v30G, for example
Thanks
Pretty straightforward. First, put anchors on your pattern:
"^patternhere$"
Now, let's put together the pattern:
"^v\d+$"
That should do it.
I think you may want \b (word boundary) rather than \A (start of string) and \W (non word character), also you don't need to use lookahead (the (?=...)).
Try: "\bv(\d+)" if you need to capture the int, "\bv\d+" if you don't.
Edit: You probably want to use raw string syntax for Python regexes, r"\bv\d+\b", since "\b" is a backspace character in a regular string.
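Edit 2: Even though + is "greedy", the trailing \b is still needed: without it, "\bv\d+" also matches the "v30" prefix of "v30G". To reject anything else around the version, anchor the whole string instead of relying on \b.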
Simply use
\bv\d+\b
Or enclose it in anchors, ^\bv\d+\b$, to match the entire string.
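Putting the anchoring advice together, here is a minimal sketch of a validator. It assumes Python 3.4 or later; is_version and VERSION_RE are hypothetical names, [0-9] is used instead of \d to restrict matching to ASCII digits, and "v0" or leading zeros are not rejected:

```python
import re

# Hypothetical helper names; fullmatch makes explicit anchors unnecessary.
VERSION_RE = re.compile(r"v[0-9]+")

def is_version(text):
    """Return True only if the whole string is 'v' followed by digits."""
    return VERSION_RE.fullmatch(text) is not None

for candidate in ("v4", "v1004", "v30G", "v303\n", "xv3"):
    print(repr(candidate), is_version(candidate))
```

Unlike "^v\d+$", this also rejects a string with a trailing newline, because fullmatch must consume the entire input.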
