Can overlapping matches with the same start position be found using regex? - python

I am looking for a regex or a regex flag in python/BigQuery that enables me to find overlapping occurrences.
For example, I have the string 1.2.5.6.8.10.12
and I would like to extract:
[1., 1.2., 1.2.5., 1.2.5.6., ..., 1.2.5.6.8.10.12]
I tried running the python code
re.findall("^(\d+(?:\.|$))+", string)
and it resulted in ['12']

Use below (BigQuery)
select text,
array(
select regexp_extract(text, r'((?:[^.]+.){' || i || '})')
from unnest(generate_array(1, array_length(split(text, '.')))) i
) as extracted
from your_table
with output

While the regex parser walks down the string each position gets consumed. To extract substrings with the same starting position it would be needed to look behind and capture matches towards start. Capturing overlapping matches needs to be done inside a lookaround for not consuming the captured parts. Python re does not support lookbehinds of variable length but PyPI regex does.
import regex as re
res = re.findall(r"(?<=(.*\d(?:\.|$)))", s)
See this Python demo at tio.run or a Regex101 demo (captures will be in the first group).
In PyPI there is even an overlapped=True option which lets avoid to capture inside the lookbehind. Together with (?r) another interesting flag for doing a reverse search it could also be achieved.
res = re.findall(r'(?r).*\d(?:\.|$)', s, overlapped=True)[::-1]
The result just needs to be reversed afterwards for receiving the desired order: Python demo
Using standard re an idea can be to reverse the string and do capturing inside a lookahead. The needed parts get captured from the reversed string and finally each list item gets reversed again before reversing the entire list. I don't know if this is worth the effort but it seems to work as well.
res = [x[::-1] for x in re.findall(r'(?=((?:\.\d|^).*))', s[::-1])][::-1]
Another Python demo at tio.run or a Regex101 demo (shows matching on the reversed string).

Related

Python: Regular Expressions on getting repeating set of numbers

I'm working with a file, that is a Genbank entry (similar to this)
My goal is to extract the numbers in the CDS line, e.g.:
CDS join(1200..1401,3490..4302)
but my regex should also be able to extract the numbers from multiple lines, like this:
CDS join(1200..1401,1550..1613,1900..2010,2200..2250,
2300..2660,2800..2999,3100..3333)
I'm using this regular expression:
import re
match=re.compile('\w+\D+\W*(\d+)\D*')
result=match.findall(line)
print(result)
This gives me the correct numbers but also numbers from the rest of the file, like
gene complement(3300..4037)
so how can I change my regex to get the numbers?
I should only use regex on it..
I'm going to use the numbers to print the coding part of the base sequence.
You could use the heavily improved regex module by Matthew Barnett (which provides the \G functionality). With this, you could come up with the following code:
import regex as re
rx = re.compile("""
(?:
CDS\s+join\( # look for CDS, followed by whitespace and join(
| # OR
(?!\A)\G # make sure it's not the start of the string and \G
[.,\s]+ # followed by ., or whitespace
)
(\d+) # capture these digits
""", re.VERBOSE)
string = """
CDS join(1200..1401,1550..1613,1900..2010,2200..2250,
2300..2660,2800..2999,3100..3333)
"""
numbers = rx.findall(string)
print numbers
# ['1200', '1401', '1550', '1613', '1900', '2010', '2200', '2250', '2300', '2660', '2800', '2999', '3100', '3333']
\G makes sure the regex engine looks for the next match at the end of the last match.
See a demo on regex101.com (in PHP as the emulator does not provide the same functionality for Python [it uses the original re module]).
A far inferior solution (if you are only allowed to use the re module), would be to use lookarounds:
(?<=[(.,\s])(\d+)(?=[,.)])
(?<=) is a positive lookbehind, while (?=) is a positive lookahead, see a demo for this approach on regex101.com. Be aware though there might be a couple of false positives.
The following re pattern might work:
>>> match = re.compile(\s+CDS\s+\w+\([^\)]*\))
But you'll need to call findall on the whole text body, not just a line at a time.
You can use parentheses just to grab out the numbers:
>>> match = re.compile(\s+CDS\s+\w+\(([^\)]*)\))
>>> match.findall(stuff)
1200..1401,3490..4302 # Numbers only
Let me know if that achieves what you want!

How to format a regex

I am trying to make a Warning waver which can look for known warnings in a log file.
The warnings in the waving file are copied directly from the log file during a review of the warnings.
The mission here is to make it as simple as possible. But i found that directly copying was a bit problematic due to that fact that the warnings could contain absolute paths.
So I added a "tag" which could be inserted into a warning which the system should look for. The whole string would then look like this.
WARNING:HDLParsers:817 - ":RE[.*]:/modules/top/hdl_src/top.vhd" Line :RE[.*]: Choice . is not a locally static expression.
The tag is :RE[Insert RegEx here]:.
In the above warning string there are two of these tags which I am trying to find using Python3 regex tool. And my pattern is the following:
(:RE\[.*\]\:)
See RegEx101 for reference
My problem with the above is that, when there are two tags in my string it finds only one result extended from the first to the last tag. how do i setup the regex so it will find each tag ?
Regards
You can use re.findall with the following regex that assumes that the regular expression inside the square brackets spans from :RE[ up to the ] that is followed by ]:
:RE\[.*?]:
See regex demo. The .*? matches 0 or more characters other than a newline but as few as possible. See rexegg.com description of a lazy quantifier solution:
The lazy .*? guarantees that the quantified dot only matches as many characters as needed for the rest of the pattern to succeed.
See IDEONE demo
import re
p = re.compile(r':RE\[.*?]:')
test_str = "# Even more commments\nWARNING:HDLParsers:817 - \":RE[.*]:/modules/top/hdl_src/cpu_0342.vhd\" Line :RE[.*]: Choice . is not a locally static expression."
print(p.findall(test_str))
If you need to get the contents between the [ and ], use a capturing group so that re.findall could extract just those contents:
p = re.compile(r':RE\[(.*?)]:')
See another demo
To obtain indices, use re.finditer (see this demo):
re.finditer(pattern, string, flags=0)
Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.
p = re.compile(r':RE\[(.*?)]:')
print([x.start(1) for x in p.finditer(test_str)])

Nongreedy Regex with Repetition

I am using the following regex:
((FFD8FF).+?((FFD9)(?:(?!FFD8).)*))
I need to do the following with regex:
Find FFD8FF
Find the last FFD9that comes before the next FFD8FF
Stop at the last FFD9 and not include any content after
What I've got does what I need except it finds and keeps any junk after the last FFD9. How can I get it to jump back to the last FFD9?
Here's the string that I'm searching with this expression:
asdfasdfasasdaFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9asdfasdfFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9
Thanks a lot for your help.
More info:
I have a list of start and end values I need to search for (FFD8FF and FFD9 are just one pair). They are in a list. Because of this, I'm using r.compile to dynamically create the expression in a for loop that goes through the different values. I have the following code, but it is returning 0 matches:
regExp = re.compile("FD8FF(?:[^F]|F(?!FD8FF))*FFD9")
matchObj = re.findall(regExp, contents)
In the above code, I'm just trying to use the plain regex without even getting the values from the list (that would look like this):
regExp = re.compile(typeItem[0] + "(?:[^" + typeItem[0][0] + "]|" + typeItem[0][0] + "(?!" + typeItem[0] + "))*" + typeItem[1])
Any other ideas why there aren't any matches?
EDIT:
I figured out that I forgot to include flags. Flags are now included to ignore case and multiline. I now have
regExp = re.compile(typeItem[0] + "(?:[^" + typeItem[0][0] + "]|" + typeItem[0][0] + "(?!" + typeItem[0] + "))*" + typeItem[1],re.M|re.I)
Although now I'm getting a memory error. Is there any way to make this more efficient? I am using the expression to search hundreds of thousands of lines (using the findall expression above)
an easy way is to use this:
FFD8FF(?:[^F]|F(?!FD8FF))*FFD9
explanation:
FFD8FF
(?: # this group describe the allowed content between the "anchors"
[^F] # all that is not a "F"
| # OR
F(?!FD8FF) # a "F" not followed by "FD8FF"
)* # repeat (greedy)
FFD9 # until the last FFD9 before FFD8FF
Even if a greedy quantifier is used for the group, the regex engine will backtrack to find the last "FFD9" substring.
If you want to ensure that FFD8FF is present, you can add a lookahead at the end of the pattern:
FFD8FF(?:[^F]|F(?!FD8FF))*FFD9(?=.*?FFD8FF)
You can optimize this pattern by emulating an atomic group that will limit the backtracking and allows to use quantifier inside the group:
FFD8FF(?:(?=([^F]+|F(?!FD8FF)))\1)*FFD9
This trick uses the fact that the content of a lookahead is naturally atomic once the closing parenthesis reached. So if you enclose a group inside a lookahead with a capture group inside, you only have to put the backreference after to obtain an "atom" (an indivisable substring).
When the regex engine need to backtrack, it will backtrack atom by atom instead of character by character that is much faster.
If you need a capture group before this trick, don't forget to update the number of the backreference, examples:
(FFD8FF(?:(?=([^F]+|F(?!FD8FF)))\2)*FFD9)
(FFD8FF((?:(?=([^F]+|F(?!FD8FF)))\3)*)FFD9)
working example:
>>> import re
>>> yourstr = 'asdfasdfasasdaFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9asdfasdfFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9'
>>> p = re.compile(r'(FFD8FF((?:(?=([^F]+|F(?!FD8FF)))\3)*)FFD9)(?=.*?FFD8FF)')
>>> re.findall(p, yourstr)
[('FFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9', 'asdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdf', 'D9asdflasdflasdf')]
variant:
(FFD8FF((?:(?=(F(?!FD8FF)[^F]*|[^F]+))\3)*)FFD9)(?=.*?FFD8FF)
Since you are not restricted to one regexp by your application's architecture, break it down into steps:
You want to break up the text in units that begin at each FFD8FF. Just use non-greedy search that ends just before the next FFD8FF: re.findall(r"FFD8FF.*?(?=FFD8FF)", contents). (This uses look-ahead, which is in my opinion overused; but it lets you save the final FFD8FF for the next string.)
You then want to trim each such string so that it ends at the last FFD9. Easiest way to do this is with greedy search: re.search(r"^.*FFD9", part). Like this:
for part in re.findall(r"FFD8FF.*?(?=FFD8FF)", contents):
print(re.search(r"^.*FFD9", part).group(0))
Simple, maintainable and efficient.
This is how I would do it:
>>> re.search(r'((FFD8FF).+?(FFD9))(?:((?!FFD9).)+FFD8FF)', s).groups()
('FFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9',
'FFD8FF',
'FFD9',
'f')
The second part just searches for a string not containing FFD9 that ends with FFD8FF.
It includes your search components, so you can still substitute them in your regex. However for something rather complicated like this I would avoid regex.
btw, thanks for posting a regex question that is high-quality and not the usual spam.

Backreferencing in Python: findall() method output for HTML string

I am trying to learn some regular expressions in Python. The following does not produce the output I expected:
with open('ex06-11.html') as f:
a = re.findall("<div[^>]*id\\s*=\\s*([\"\'])header\\1[^>]*>(.*?)</div>", f.read())
# output: [('"', 'Some random text')]
The output I was expecting (same code, but without the backreference):
with open('ex06-11.html') as f:
print re.findall("<div[^>]*id\\s*=\\s*[\"\']header[\"\'][^>]*>(.*?)</div>", f.read())
# output: ['Some random text']
The question really boils down to: why is there a quotation mark in my first output, but not in my second? I thought that ([abc]) ... //1 == [abc] ... [abc]. Am I incorrect?
From the docs on re.findall:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
If you want the entire match to be returned, remove the capturing groups or change them to non-capturing groups by adding ?: after the opening paren. For example you would change (foo) in your regex to (?:foo).
Of course in this case you need the capturing group for the backreference, so your best bet is to keep your current regex and then use a list comprehension with re.finditer() to get a list of only the second group:
regex = re.compile(r"""<div[^>]*id\s*=\s*(["'])header\1[^>]*>(.*?)</div>""")
with open('ex06-11.html') as f:
a = [m.group(2) for m in regex.finditer(f.read())
A couple of side notes, you should really consider using an HTML parser like BeautifulSoup instead of regex. You should also use triple-quoted strings if you need to include single or double quotes within you string, and use raw string literals when writing regular expressions so that you don't need to escape the backslashes.
The behaviour is clearly documented. See re.findall:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
So, if you have a capture group in your regex pattern, then findall method returns a list of tuple, containing all the captured groups for a particular match, plus the group(0).
So, either you use a non-capturing group - (?:[\"\']), or don't use any group at all, as in your 2nd case.
P.S: Use raw string literals for your regex pattern, to avoid escaping your backslashes. Also, compile your regex outside the loop, so that is is not re-compiled on every iteration. Use re.compile for that.
When I asked this question I was just starting with regular expressions. I have since read the docs completely, and I just wanted to share what I found out.
Firstly, what Rohit and F.J suggested, use raw strings (to make the regex more readable and less error-prone) and compile your regex beforehand using re.compile. To match an HTML string whose id is 'header':
s = "<div id='header'>Some random text</div>"
We would need a regex like:
p = re.compile(r'<div[^>]*id\s*=\s*([\"\'])header\1[^>]*>(.*?)</div>')
In the Python implementation of regex, a capturing group is made by enclosing part of your regex in parentheses (...). Capturing groups capture the span of text that they match. They are also needed for backreferencing. So in my regex above, I have two capturing groups: ([\"\']) and (.*?). The first one is needed to make the backreference \1 possible. The use of a backreferences (and the fact that they reference back to a capturing group) has consequences, however. As pointed out in the other answers to this question, when using findall on my pattern p, findall will return matches from all groups and put them in a list of tuples:
print p.findall(s)
# [("'", 'Some random text')]
Since we only want the plain text from between the HTML tags, this is not the output we're looking for.
(Arguably, we could use:
print p.findall(s)[0][1]
# Some random text
But this may be a bit contrived.)
So in order to return only the text from between the HTML tags (captured by the second group), we use the group() method on p.search():
print p.search(s).group(2)
# Some random text
I'm fully aware that all but the most simple HTML should not be handled by regex, and that instead you should use a parser. But this was just a tutorial example for me to grasp the basics of regex in Python.

Unexpected end of Pattern : Python Regex

When I use the following python regex to perform the functionality described below, I get the error Unexpected end of Pattern.
Regex:
modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)
(CODE[0-9]{3})(?!</a>)',r'\g<1>',input)
Purpose of this regex:
INPUT:
CODE876
CODE223
matchjustCODE657
CODE69743
code876
testing1CODE888
example2CODE098
http://replaced/CODE665
Should match:
CODE876
CODE223
CODE657
CODE697
and replace occurrences with
http://productcode/CODE876
http://productcode/CODE223
matchjusthttp://productcode/CODE657
http://productcode/CODE69743
Should Not match:
code876
testing1CODE888
testing2CODE776
example3CODE654
example2CODE098
http://replaced/CODE665
FINAL OUTPUT
http://productcode/CODE876
http://productcode/CODE223
matchjusthttp://productcode/CODE657
http://productcode/CODE69743
code876
testing1CODE888
example2CODE098
http://replaced/CODE665
EDIT and UPDATE 1
modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)',r'\g<1>',input)
The error is no more happening. But this does not match any of the patterns as needed. Is there a problem with matching groups or the matching itself. Because when I compile this regex as such, I get no match to my input.
EDIT AND UPDATE 2
f=open("/Users/mymac/Desktop/regex.txt")
s=f.read()
s1 = re.sub(r'((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)',
r'\g<1>\g<2>', s)
print s1
INPUT
CODE123 CODE765 testing1CODE123 example1CODE345 http://www.coding.com/CODE333 CODE345
CODE234
CODE333
OUTPUT
CODE123 CODE765 testing1CODE123 example1CODE345 http://www.coding.com/CODE333 CODE345
CODE234
CODE333
Regex works for Raw input, but not for string input from a text file.
See Input 4 and 5 for more results http://ideone.com/3w1E3
Your main problem is the (?-i) thingy which is wishful thinking as far as Python 2.7 and 3.2 are concerned. For more details, see below.
import re
# modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)
# (CODE[0-9]{3})(?!</a>)',r'\g<1>',input)
# observation 1: as presented, pattern has a line break in the middle, just after (?-i)
# ob 2: rather hard to read, should use re.VERBOSE
# ob 3: not obvious whether it's a complile-time or run-time problem
# ob 4: (?i) should be at the very start of the pattern (see docs)
# ob 5: what on earth is (?-i) ... not in 2.7 docs, not in 3.2 docs
pattern = r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)'
#### rx = re.compile(pattern)
# above line failed with "sre_constants.error: unexpected end of pattern"
# try without the (?-i)
pattern2 = r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)'
rx = re.compile(pattern2)
# This works, now you need to work on observations 1 to 4,
# and rethink your CODE/code strategy
Looks like suggestions fall on deaf ears ... Here's the pattern in re.VERBOSE format:
pattern4 = r'''
^
(?i)
(
(?:
(?!http://)
(?!testing[0-9])
(?!example[0-9])
. #### what is this for?
)*?
) ##### end of capturing group 1
(CODE[0-9]{3}) #### not in capturing group 1
(?!</a>)
'''
Okay, it looks like the problem is the (?-i), which is surprising. The purpose of the inline-modifier syntax is to let you apply modifiers to selected portions of the regex. At least, that's how they work in most flavors. In Python it seems they always modify the whole regex, same as the external flags (re.I, re.M, etc.). The alternative (?i:xyz) syntax doesn't work either.
On a side note, I don't see any reason to use three separate lookaheads, as you did here:
(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?
Just OR them together:
(?:(?!http://|testing[0-9]|example[0-9]).)*?
EDIT: We seem to have moved from the question of why the regex throws exceptions, to the question of why it doesn't work. I'm not sure I understand your requirements, but the regex and replacement string below return the results you want.
s1 = re.sub(r'^((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)',
r'\g<1>\g<2>', s)
see it in action one ideone.com
Is that what you're after?
EDIT: We now know that the replacements are being done within a larger text, not on standalone strings. That's makes the problem much more difficult, but we also know the full URLs (the ones that start with http://) only occur in already-existing anchor elements. That means we can split the regex into two alternatives: one to match complete <a>...</a> elements, and one to match our the target strings.
(?s)(?:(<a\s+[^>]*>.*?</a>)|\b((?:(?!testing[0-9]|example[0-9])\w)*?)(CODE[0-9]{3}))
The trick is to use a function instead of a static string for the replacement. Whenever the regex matches an anchor element, the function will find it in group(1) and return it unchanged. Otherwise, it uses group(2) and group(3) to build a new one.
here's another demo (I know that's horrible code, but I'm too tired right now to learn a more pythonic way.)
The only problem I see is that you replace using the wrong capturing group.
modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)',r'\g<1>',input)
^ ^ ^
first capturing group second one using the first group
Here I made the first one also a non capturing group
^(?i)(?:(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)
See it here on Regexr
For complex regexes, use the re.X flag to document what you're doing and to make sure the brackets match up correctly (i.e. by using indentation to indicate the current level of nesting).

Categories