In python, how to use regex to conditionally pattern match [duplicate] - python

This question already has answers here:
How to make some parts of a regex pattern conditional?
(3 answers)
Closed 6 years ago.
I am trying to parse the following string with python's regex library:
recipe_a = 'run_list[sm_collectd::default#1.0.0]'
Using http://pythex.org/, I am trying out the following regexes:
\[(.*)::(.*)#(.*)\]
Which yields:
Match 1
1. sm_collectd
2. default
3. 1.0.0
Here is the problem:
recipe_a can be pattern matched with this regex, however, it fails when the #version is no longer specified in the string. The following example will fail to match the pattern:
recipe_b = 'run_list[sm_collectd::default]'
\[(.*)::(.*)#(.*)\] fails in this case because # was never matched. I could have Python logic that tries \[(.*)::(.*)#(.*)\] first and then falls back to \[(.*)::(.*)\], but that would be silly. It would be nice if I could accomplish this in one regex pattern.
I have tried solving this with a conditional regex statement. The general syntax I have tried is as follows:
(?(?=regex)then|else)
The (?=regex) part is a lookahead assertion: it matches without consuming. So we could have a conditional match on the # symbol:
if the # is matched, then use \[(.*)::(.*)#(.*)\], else use \[(.*)::(.*)\].
PROGRAMMATIC SOLUTION
import re

kitchen_recipe = 'recipe[my_cookbook::default#0.1.0]'
# str.strip() removes a set of characters, not a prefix, so slice instead.
recipe = kitchen_recipe[len('recipe['):-1]
if '#' in recipe:
    cookbook, recipe, cookbook_version = re.split('::|#', recipe)
else:
    cookbook, recipe = re.split('::', recipe)
    cookbook_version = None  # no version specified
REGEX SOLUTION
import re

kitchen_recipe = 'recipe[my_cookbook::default#0.1.0]'
run_list_pattern = r'\[(.*)::([^#]*)#?([0-9.]*)\]'
cookbook, recipe, cookbook_version = re.search(run_list_pattern,
                                               kitchen_recipe).groups()

A couple of small changes should work for your purposes; try \[(.*)::([^#]*)(?:#(.*))?\]
run_list[sm_collectd::default] yields
sm_collectd
default
None
run_list[sm_collectd::default#1.0.0] yields
sm_collectd
default
1.0.0
Explanation:
I changed the final #(.*) into (?:#(.*))?. The additional (?:) is a non-capturing group, and the following ? means "match 0 or 1 times".
I also changed the second (.*) into ([^#]*), which means match anything but a # as many times as possible, so it won't treat all of default#1.0.0 as the Chef recipe name.
Edit: As @gregory points out, you can avoid the non-capturing group by using \[(.*)::([^#]*)#?(.*)\]. With this form, a missing version comes back as an empty string rather than None.
If you'd like to limit the version number to only digits and .s: \[(.*)::([^#]*)#?([0-9.]*)\].
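As a quick check of the optional-group pattern above, here is a minimal sketch. The helper name parse_run_list_item is made up for illustration, and it assumes items always look like name[cookbook::recipe] with an optional #version suffix.

```python
import re

# Pattern from the answer: the "#version" part is wrapped in an
# optional non-capturing group, so group 3 is None when it is absent.
RUN_LIST_PATTERN = re.compile(r'\[(.*)::([^#]*)(?:#(.*))?\]')


def parse_run_list_item(item):
    """Return (cookbook, recipe, version) or None if the item does not match."""
    match = RUN_LIST_PATTERN.search(item)
    if match is None:
        return None
    return match.groups()


for item in ('run_list[sm_collectd::default]',
             'run_list[sm_collectd::default#1.0.0]'):
    print(item, '->', parse_run_list_item(item))
```

Checking for None before calling groups() keeps the helper from raising AttributeError on input that has no bracketed part at all.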

Related

Find all strings starting and ending with given substring in a string using regex in Python [duplicate]

This question already has an answer here:
Regex including overlapping matches with same start
(1 answer)
Closed 3 years ago.
I have been given a string
ATGCCAGGCTAGCTTATTTAA
and I have to find all substrings of it that start with ATG and end with any of TAA, TAG, or TGA.
Here is what I am doing:
import re

seq = "ATGCCAGGCTAGCTTATTTAA"
pattern = re.compile(r"(ATG[ACGT]*(TAG|TAA|TGA))")
for match in re.finditer(pattern, seq):
    coding = match.group(1)
    print(coding)
This code is giving me output:
ATGCCAGGCTAGCTTATTTAA
But the expected output is:
ATGCCAGGCTAGCTTATTTAA, ATGCCAGGCTAG
What should I change in my code?
tl;dr: can't use regex for this
The problem isn't greedy/non-greedy.
The problem isn't overlapping matches either: there's a solution for that (How to find overlapping matches with a regexp?)
The real problem with the OP's question is that regex isn't designed for multiple matches with the same start. The engine scans left to right and reports one match per starting position, then resumes after it. That's one of the reasons why it's fast. However, this prevents regex from returning several overlapping matches that start at the same character.
See
Regex including overlapping matches with same start
for more info.
Regex isn't the be-all-end-all of pattern matching. It's in the name: Regular expressions are all about single-interpretation symbol sequences, and DNA tends not to fit that paradigm.
In r"(ATG[ACGT]*(TAG|TAA|TGA))", the * operator is "greedy". Use the non-greedy modifier, like r"(ATG[ACGT]*?(TAG|TAA|TGA))", to tell the regexp to take the shortest matching string, not the longest. Note that this returns only the shortest match (ATGCCAGGCTAG here), not both.
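Since a single regex scan reports only one end per start position, one alternative is a plain nested scan over start and stop positions. This is a minimal sketch rather than a regex solution; the function name find_coding_substrings is made up for illustration, and like the original pattern it does not require the stop codon to be in frame with ATG.

```python
def find_coding_substrings(seq, start="ATG", stops=("TAA", "TAG", "TGA")):
    """Return every substring that begins with `start` and ends with a stop codon."""
    results = []
    for i in range(len(seq) - len(start) + 1):
        if not seq.startswith(start, i):
            continue
        # Look for every stop codon that begins after the start codon.
        for j in range(i + len(start), len(seq) - 2):
            if seq[j:j + 3] in stops:
                results.append(seq[i:j + 3])
    return results


print(find_coding_substrings("ATGCCAGGCTAGCTTATTTAA"))
```

The scan is quadratic in the worst case, which is usually acceptable for short sequences.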

Python regex search function weird behavior [duplicate]

This question already has answers here:
ip address validation in python using regex [duplicate]
(5 answers)
Closed 6 years ago.
I'm writing code to retrieve IP addresses from a text file, but I have an issue with the regex part:
import re
print re.search(r"[0-255].[0-255].[0-255].[0-255]","5.39.0.0")
This returns None, but it should return <_sre.SRE_Match object at 0x0000000001D7C510> (because "5.39.0.0" matches the expression). If I replace 39 with 0, it works.
Your regular expression won't work for several reasons (see the comments).
[0-255] is a character class that matches a single character (0, 1, 2, or 5), not a number from 0 to 255. An unescaped dot also matches any character; you want \. instead.
Try this regular expression:
(?:(?:[01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])
As a reference, check this site, which has similar examples:
Regular Expression Examples
P.S.: There are several sites for testing regular expressions on the web and you should use them, for example Regex101.
Edit: The alternatives in the last group must be ordered from longest to shortest. Otherwise a final octet of the form 2** is matched by the first alternative using only its first two characters; for example, 255.255.255.250 would be matched as 255.255.255.25 (the last digit is lost). Also, using non-capturing groups is recommended where individual groups (used for alternation or repetition) have no meaning or are not needed.
OK, I forgot some important things; here is the solution:
[0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}
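To compare the approaches above, here is a minimal sketch of a range-checked IPv4 matcher. The names OCTET, IPV4_PATTERN, is_ipv4 and find_ipv4 are made up for illustration, and re.fullmatch assumes Python 3.4 or later.

```python
import re

# One octet, 0-255, with the longest alternatives first.
OCTET = r'(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
# Three "octet + literal dot" groups followed by a final octet.
IPV4_PATTERN = r'(?:{octet}\.){{3}}{octet}'.format(octet=OCTET)


def is_ipv4(text):
    """True if the whole string is a dotted-quad IPv4 address."""
    return re.fullmatch(IPV4_PATTERN, text) is not None


def find_ipv4(text):
    """Return every IPv4-looking substring found in text."""
    return re.findall(IPV4_PATTERN, text)


print(is_ipv4("5.39.0.0"))
print(is_ipv4("256.1.1.1"))
print(find_ipv4("hosts 5.39.0.0 and 10.0.0.255"))
```

Because every group is non-capturing, re.findall returns whole addresses rather than fragments.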

Attribute Error in Python. How to use match and group? [duplicate]

This question already has an answer here:
Python regular expressions, how to search for a word starting with uppercase?
(1 answer)
Closed 7 years ago.
I'm trying to get the following to work. I've looked at the Python documentation, and I still don't know how to fix it. I'm getting an AttributeError, what am I doing wrong?
import re
text = '>:{abcd|}+)_(#)_#_Mitch_(#<$)_)*zersx!)Pamela#(_+)('
m = re.match(r'(?P<name1>[A-Z][A-Za-z]*) (?P<name2>[A-Z][A-Za-z]*)', text)
m.group('name1')
If the above is incorrect, how do I get it to output
>>> m.group('name1') = 'Mitch'
You're forgetting to check that the regex actually matched anything. If it doesn't then both the .match() and .search() functions will return None.
It may be that the named group you are trying to reference was not actually matched in that string for that pattern.
Try to call groups on the returned value and you will see a tuple of all matched groups.
Two things keep your pattern from matching. re.match() only matches at the very beginning of the string, so use re.search() to find a match further in. Also, you have _, not a space, after the name, so I suggest using a [\s_] character class to match both alternatives. To make the second name really optional, add an empty alternative inside its group.
So, in your case, you can do it like this:
(?P<name1>[A-Z][A-Za-z]*)[\s_](?P<name2>[A-Z][A-Za-z]*|)
                         ^^^^^                        ^^
See a regex demo
Sample IDEONE demo:
import re
p = re.compile(r'(?P<name1>[A-Z][A-Za-z]*)[\s_](?P<name2>[A-Z][A-Za-z]*|)')
test_str = ">:{abcd|}+)_(#)_#_Mitch_(#<$)_)*zersx!)Pamela#(_+)("
match = p.search(test_str)
if match:
    print(match.group("name1"))  # => Mitch
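The advice above about checking for None can be sketched as follows. This is an illustrative example rather than the asker's code: the simplified pattern NAME is an assumption, chosen only to show how match(), search() and findall() differ on the same text.

```python
import re

text = '>:{abcd|}+)_(#)_#_Mitch_(#<$)_)*zersx!)Pamela#(_+)('
NAME = re.compile(r'[A-Z][A-Za-z]*')  # a capitalized word

anchored = NAME.match(text)    # only tries position 0, so this is None
first = NAME.search(text)      # scans forward to the first capitalized word
all_names = NAME.findall(text)

# Guard against None before calling .group() to avoid AttributeError.
print(anchored.group() if anchored else 'no match at start')
print(first.group() if first else 'no match anywhere')
print(all_names)
```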

Issues with Python re.findall when matching variables [duplicate]

This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 3 years ago.
I am trying to match two string variables, and would like to catch multiple matches. re.findall seems like the obvious choice for this task, but it doesn't appear to be working the way I would expect it to. The following is an example:
a = 'a(pp)?le'
b = 'ale, apple, apol'
match = re.findall(a,b)
match
['','pp']
However, when I apply the same variables to re.search, it recognizes the embedded regular expression within the string, and picks up the first match:
match = re.search(a,b)
match.group()
'ale'
Can anyone explain why re.findall is not working in this instance? I would expect the following:
match = re.findall(a,b)
match
['ale','apple']
Thanks!
You are using a capturing group, whereas you want a non-capturing group:
a = 'a(?:pp)?le'
As stated in the docs, (...) in a regex creates a "capturing group", and the result of re.findall will be only what is inside the parens.
If you just want to group things (e.g. for the purpose of applying a ?), use (?:...), which creates a non-capturing group. With no capturing groups in the pattern, re.findall returns the whole match.
The key part of the re.findall docs is:
If one or more groups are present in the pattern, return a list of groups
this explains the difference in results between re.findall and re.search.
Let me quote the Python docs about re.findall():
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
And this is what your expression a(pp)?le does. It returns the content of your group, i.e. pp, or an empty string when the group did not take part in the match. You can always disable this special behavior of a group by using a non-capturing group (?:...).
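The difference described in the answers can be seen side by side in this short sketch, which uses the strings from the question. The re.finditer variant is an additional option for getting whole matches without changing the pattern.

```python
import re

text = 'ale, apple, apol'

# One capturing group: findall returns the group's content per match.
capturing = re.findall(r'a(pp)?le', text)

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r'a(?:pp)?le', text)

# Keep the capturing group but ask each match object for group 0.
whole_matches = [m.group(0) for m in re.finditer(r'a(pp)?le', text)]

print(capturing)
print(non_capturing)
print(whole_matches)
```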

Unexpected end of Pattern : Python Regex

When I use the following python regex to perform the functionality described below, I get the error Unexpected end of Pattern.
Regex:
modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)
(CODE[0-9]{3})(?!</a>)',r'\g<1>',input)
Purpose of this regex:
INPUT:
CODE876
CODE223
matchjustCODE657
CODE69743
code876
testing1CODE888
example2CODE098
http://replaced/CODE665
Should match:
CODE876
CODE223
CODE657
CODE697
and replace occurrences with
http://productcode/CODE876
http://productcode/CODE223
matchjusthttp://productcode/CODE657
http://productcode/CODE69743
Should Not match:
code876
testing1CODE888
testing2CODE776
example3CODE654
example2CODE098
http://replaced/CODE665
FINAL OUTPUT
http://productcode/CODE876
http://productcode/CODE223
matchjusthttp://productcode/CODE657
http://productcode/CODE69743
code876
testing1CODE888
example2CODE098
http://replaced/CODE665
EDIT and UPDATE 1
modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)',r'\g<1>',input)
The error no longer occurs, but this does not match any of the patterns as needed. Is there a problem with the capturing groups or with the matching itself? When I compile this regex as is, I get no match for my input.
EDIT AND UPDATE 2
f=open("/Users/mymac/Desktop/regex.txt")
s=f.read()
s1 = re.sub(r'((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)',
r'\g<1>\g<2>', s)
print s1
INPUT
CODE123 CODE765 testing1CODE123 example1CODE345 http://www.coding.com/CODE333 CODE345
CODE234
CODE333
OUTPUT
CODE123 CODE765 testing1CODE123 example1CODE345 http://www.coding.com/CODE333 CODE345
CODE234
CODE333
Regex works for Raw input, but not for string input from a text file.
See Input 4 and 5 for more results http://ideone.com/3w1E3
Your main problem is the (?-i) thingy which is wishful thinking as far as Python 2.7 and 3.2 are concerned. For more details, see below.
import re
# modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)
# (CODE[0-9]{3})(?!</a>)',r'\g<1>',input)
# observation 1: as presented, pattern has a line break in the middle, just after (?-i)
# ob 2: rather hard to read, should use re.VERBOSE
# ob 3: not obvious whether it's a compile-time or run-time problem
# ob 4: (?i) should be at the very start of the pattern (see docs)
# ob 5: what on earth is (?-i) ... not in 2.7 docs, not in 3.2 docs
pattern = r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)'
#### rx = re.compile(pattern)
# above line failed with "sre_constants.error: unexpected end of pattern"
# try without the (?-i)
pattern2 = r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)'
rx = re.compile(pattern2)
# This works, now you need to work on observations 1 to 4,
# and rethink your CODE/code strategy
Looks like suggestions fall on deaf ears ... Here's the pattern in re.VERBOSE format:
pattern4 = r'''
    ^
    (?i)
    (
        (?:
            (?!http://)
            (?!testing[0-9])
            (?!example[0-9])
            .                #### what is this for?
        )*?
    )                        ##### end of capturing group 1
    (CODE[0-9]{3})           #### not in capturing group 1
    (?!</a>)
'''
Okay, it looks like the problem is the (?-i), which is surprising. The purpose of the inline-modifier syntax is to let you apply modifiers to selected portions of the regex. At least, that's how they work in most flavors. In Python it seems they always modify the whole regex, same as the external flags (re.I, re.M, etc.). The alternative (?i:xyz) syntax doesn't work either.
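The behavior described above applies to the Python versions discussed here (2.7 and 3.2). On Python 3.6 and later, the re module accepts scoped inline flags such as (?i:...), so case-insensitivity can be limited to one part of a pattern. A minimal sketch, using a made-up pattern rather than the one from the question:

```python
import re

# Only the "prefix" part ignores case; CODE must be uppercase.
# Requires Python 3.6+ for the scoped (?i:...) group.
pattern = re.compile(r'(?i:prefix)CODE[0-9]{3}')


def matches(text):
    """True if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


print(matches('PreFixCODE123'))
print(matches('prefixcode123'))
```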
On a side note, I don't see any reason to use three separate lookaheads, as you did here:
(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?
Just OR them together:
(?:(?!http://|testing[0-9]|example[0-9]).)*?
EDIT: We seem to have moved from the question of why the regex throws exceptions, to the question of why it doesn't work. I'm not sure I understand your requirements, but the regex and replacement string below return the results you want.
s1 = re.sub(r'^((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)',
r'\g<1>\g<2>', s)
See it in action on ideone.com
Is that what you're after?
EDIT: We now know that the replacements are being done within a larger text, not on standalone strings. That makes the problem much more difficult, but we also know the full URLs (the ones that start with http://) only occur in already-existing anchor elements. That means we can split the regex into two alternatives: one to match complete <a>...</a> elements, and one to match our target strings.
(?s)(?:(<a\s+[^>]*>.*?</a>)|\b((?:(?!testing[0-9]|example[0-9])\w)*?)(CODE[0-9]{3}))
The trick is to use a function instead of a static string for the replacement. Whenever the regex matches an anchor element, the function will find it in group(1) and return it unchanged. Otherwise, it uses group(2) and group(3) to build a new one.
here's another demo (I know that's horrible code, but I'm too tired right now to learn a more pythonic way.)
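The function-as-replacement idea can be sketched like this. It is a hedged illustration, not the code from the linked demo: the names PRODUCT_URL, link_code and add_links are made up, and the output format (the code prefixed with http://productcode/) is an assumption taken from the question's desired output.

```python
import re

# Assumption: URL prefix taken from the question's desired output.
PRODUCT_URL = 'http://productcode/'

# Alternative 1 captures a whole anchor element (group 1).
# Alternative 2 captures an allowed prefix (group 2) and the code (group 3).
PATTERN = re.compile(
    r'(?s)(?:(<a\s+[^>]*>.*?</a>)'
    r'|\b((?:(?!testing[0-9]|example[0-9])\w)*?)(CODE[0-9]{3}))'
)


def link_code(match):
    """Return anchors unchanged; otherwise insert the URL before the code."""
    if match.group(1) is not None:
        return match.group(1)
    return match.group(2) + PRODUCT_URL + match.group(3)


def add_links(text):
    return PATTERN.sub(link_code, text)


sample = 'CODE876\nmatchjustCODE657\ncode876\ntesting1CODE888'
print(add_links(sample))
```

Passing a callable to re.sub lets each match decide its own replacement, which a static replacement string cannot do.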
The only problem I see is that you replace using the wrong capturing group.
modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)',r'\g<1>',input)
(The ( right after (?i) opens the first capturing group, (CODE[0-9]{3}) is the second one, and the replacement \g<1> uses the first group.)
Here I made the first one a non-capturing group as well:
^(?i)(?:(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)
See it here on Regexr
For complex regexes, use the re.X flag to document what you're doing and to make sure the brackets match up correctly (i.e. by using indentation to indicate the current level of nesting).
