Nongreedy Regex with Repetition - python

I am using the following regex:
((FFD8FF).+?((FFD9)(?:(?!FFD8).)*))
I need to do the following with regex:
Find FFD8FF
Find the last FFD9 that comes before the next FFD8FF
Stop at the last FFD9 and not include any content after
What I've got does what I need except it finds and keeps any junk after the last FFD9. How can I get it to jump back to the last FFD9?
Here's the string that I'm searching with this expression:
asdfasdfasasdaFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9asdfasdfFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9
Thanks a lot for your help.
More info:
I have a list of start and end values I need to search for (FFD8FF and FFD9 are just one pair). Because of this, I'm using re.compile to dynamically create the expression in a for loop that goes through the different values. I have the following code, but it is returning 0 matches:
regExp = re.compile("FD8FF(?:[^F]|F(?!FD8FF))*FFD9")
matchObj = re.findall(regExp, contents)
In the above code, I'm just trying to use the plain regex without even getting the values from the list (that would look like this):
regExp = re.compile(typeItem[0] + "(?:[^" + typeItem[0][0] + "]|" + typeItem[0][0] + "(?!" + typeItem[0] + "))*" + typeItem[1])
Any other ideas why there aren't any matches?
EDIT:
I figured out that I forgot to include flags. Flags are now included to ignore case and multiline. I now have
regExp = re.compile(typeItem[0] + "(?:[^" + typeItem[0][0] + "]|" + typeItem[0][0] + "(?!" + typeItem[0] + "))*" + typeItem[1],re.M|re.I)
Although now I'm getting a memory error. Is there any way to make this more efficient? I am using the expression to search hundreds of thousands of lines (using the findall expression above)

an easy way is to use this:
FFD8FF(?:[^F]|F(?!FD8FF))*FFD9
explanation:
FFD8FF
(?: # this group describes the allowed content between the "anchors"
[^F] # all that is not a "F"
| # OR
F(?!FD8FF) # a "F" not followed by "FD8FF"
)* # repeat (greedy)
FFD9 # until the last FFD9 before FFD8FF
Even if a greedy quantifier is used for the group, the regex engine will backtrack to find the last "FFD9" substring.
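Since the question builds the expression from a list of start/end pairs, the same pattern can be assembled from variables. This is only a sketch: build_pattern is a hypothetical helper name, and it assumes the markers are plain literal strings, which is why they go through re.escape. The sample string is a shortened stand-in for the one in the question.

```python
import re

def build_pattern(start, end, flags=0):
    """Match from `start` up to the last `end` before the next `start`."""
    # Escape the pieces so markers containing regex metacharacters stay literal.
    first, rest = re.escape(start[0]), re.escape(start[1:])
    # Allowed content: anything but the first char of `start`,
    # or that char when it does not begin another `start`.
    body = '(?:[^' + first + ']|' + first + '(?!' + rest + '))*'
    return re.compile(re.escape(start) + body + re.escape(end), flags)

sample = 'xxFFD8FFaaFFD9bbFFD9ccFFD8FFddFFD9'
matches = build_pattern('FFD8FF', 'FFD9').findall(sample)
print(matches)  # ['FFD8FFaaFFD9bbFFD9', 'FFD8FFddFFD9']
```

Flags such as re.I can be passed as the third argument if the data mixes upper and lower case.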
If you want to ensure that FFD8FF is present, you can add a lookahead at the end of the pattern:
FFD8FF(?:[^F]|F(?!FD8FF))*FFD9(?=.*?FFD8FF)
You can optimize this pattern by emulating an atomic group, which limits backtracking and lets you use a quantifier inside the group:
FFD8FF(?:(?=([^F]+|F(?!FD8FF)))\1)*FFD9
This trick uses the fact that the content of a lookahead is naturally atomic once the closing parenthesis is reached. So if you enclose a capture group inside a lookahead, you only have to put the backreference after it to obtain an "atom" (an indivisible substring).
When the regex engine needs to backtrack, it will backtrack atom by atom instead of character by character, which is much faster.
If you need a capture group before this trick, don't forget to update the number of the backreference, examples:
(FFD8FF(?:(?=([^F]+|F(?!FD8FF)))\2)*FFD9)
(FFD8FF((?:(?=([^F]+|F(?!FD8FF)))\3)*)FFD9)
working example:
>>> import re
>>> yourstr = 'asdfasdfasasdaFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9asdfasdfFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9'
>>> p = re.compile(r'(FFD8FF((?:(?=([^F]+|F(?!FD8FF)))\3)*)FFD9)(?=.*?FFD8FF)')
>>> re.findall(p, yourstr)
[('FFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9', 'asdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdf', 'D9asdflasdflasdf')]
variant:
(FFD8FF((?:(?=(F(?!FD8FF)[^F]*|[^F]+))\3)*)FFD9)(?=.*?FFD8FF)

Since you are not restricted to one regexp by your application's architecture, break it down into steps:
You want to break up the text in units that begin at each FFD8FF. Just use non-greedy search that ends just before the next FFD8FF: re.findall(r"FFD8FF.*?(?=FFD8FF)", contents). (This uses look-ahead, which is in my opinion overused; but it lets you save the final FFD8FF for the next string.)
You then want to trim each such string so that it ends at the last FFD9. Easiest way to do this is with greedy search: re.search(r"^.*FFD9", part). Like this:
for part in re.findall(r"FFD8FF.*?(?=FFD8FF)", contents):
    print(re.search(r"^.*FFD9", part).group(0))
Simple, maintainable and efficient.
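Wrapped up as a function, the two steps might look like the sketch below. Two things here are assumptions that go beyond the answer above: re.S is set so that . also crosses line breaks, and |\Z is added to the lookahead so that the final block, which has no following FFD8FF, is still picked up. extract_blocks is a hypothetical name.

```python
import re

def extract_blocks(contents, start='FFD8FF', end='FFD9'):
    """Return each block from `start` to the last `end` before the next `start`."""
    s, e = re.escape(start), re.escape(end)
    # Step 1: lazy match from one start marker up to the next one (or end of input).
    chunk_re = re.compile(s + r'.*?(?=' + s + r'|\Z)', re.S)
    # Step 2: greedy match so each chunk is trimmed at its last end marker.
    trim_re = re.compile(r'.*' + e, re.S)
    blocks = []
    for chunk in chunk_re.findall(contents):
        m = trim_re.match(chunk)
        if m:  # chunks without any end marker are skipped
            blocks.append(m.group(0))
    return blocks

print(extract_blocks('xxFFD8FFaaFFD9bbFFD9ccFFD8FFddFFD9zz'))
# ['FFD8FFaaFFD9bbFFD9', 'FFD8FFddFFD9']
```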

This is how I would do it:
>>> re.search(r'((FFD8FF).+?(FFD9))(?:((?!FFD9).)+FFD8FF)', s).groups()
('FFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9',
'FFD8FF',
'FFD9',
'f')
The second part just searches for a string not containing FFD9 that ends with FFD8FF.
It includes your search components, so you can still substitute them in your regex. However for something rather complicated like this I would avoid regex.
btw, thanks for posting a regex question that is high-quality and not the usual spam.

Related

Exact search of a string that has parenthesis using regex

I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking if the string after (941) is followed by \n, followed by a word, and ending with \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that because of the parenthesis in (941) the regex is taking 941 as a group. In the 3rd step the regex may be wrong, which I can fix later, but 1st I needed help in finding the 2nd (941) so then I can apply the 3rd step on that.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following regexes: (?:...), (941){1}, and escaping the parentheses to make them literal characters, like this: \(941\), with no useful results. Maybe I am using them wrong.
Just wanted to know if it is possible to do this using regex. I thought it might also be useful for others or a good share for future viewers.
Thanks!
Assuming:
You want to avoid matching only digits;
Want to match a substring made of word-characters (thus including possible digits);
Try to escape the variable and use it in the regular expression through f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]
print(m)
Prints:
Rivet
If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s)
This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next.
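To see why the unescaped variable behaves differently, here is a small comparison on the string from the question. It only illustrates the grouping issue; the variable names as_group and as_literal are made up for the example.

```python
import re

s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'

# Unescaped: the parentheses form a capture group, so only "941" is matched.
as_group = re.findall(num, s)
# Escaped: the parentheses are literal characters and part of the match.
as_literal = re.findall(re.escape(num), s)

print(as_group)    # ['941', '941']
print(as_literal)  # ['(941)', '(941)']
```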

How to search for and replace a term within another search term

I have a url I get from parsing a swagger's api.json file in Python.
The URL looks something like this and I want to replace the dashes with underscores, but only inside the curly brackets.
10.147.48.10:8285/pet-store-account/{pet-owner}/version/{pet-type-id}/pet-details-and-name
So, {pet-owner} will become {pet_owner}, but pet-store-account will remain the same.
I am looking for a regular expression that will allow me to perform a non-greedy search and then do a search-replace on each of the first search's findings.
a Python re approach is what I am looking for, but I will also appreciate if you can suggest a Vim one liner.
The expected final result is:
10.147.48.10:8285/pet-store-account/{pet_owner}/version/{pet_type_id}/pet-details-and-name
Provided that you expect all '{...}' blocks to be consistent, you may use a trailing context to determine whether a given dash is inside a block, actually just requiring it to be followed by '...}' where '.' is not a '{'
exp = re.compile(r'(?=[^{]*})-')
...
substituted_url = re.sub(exp,'_',url_string)
Using lookahead and lookbehind in Vim:
s/\({[^}]*\)\#<=-\([^{]*}\)\#=/_/g
The pattern has three parts:
\({[^}]*\)\#<= matches, but does not consume, an opening brace followed by anything except a closing brace, immediately behind the next part.
- matches a hyphen.
\([^{]*}\)\#= matches, but does not consume, anything except an opening brace, followed by a closing brace, immediately ahead of the previous part.
The same technique can't be exactly followed in Python regular expressions, because they only allow fixed-width lookbehinds.
Result:
Before
outside-braces{inside-braces}out-again{in-again}out-once-more{in-once-more}
After
outside-braces{inside_braces}out-again{in_again}out-once-more{in_once_more}
Because it checks for braces in the right place both before and after the hyphen, this solution (unlike others which use only lookahead assertions) behaves sensibly in the face of unmatched braces:
Before
b-c{d-e{f-g}h-i
b-c{d-e}f-g}h-i
b-c{d-e}f-g{h-i
b-c}d-e{f-g}h-i
After
b-c{d-e{f_g}h-i
b-c{d_e}f-g}h-i
b-c{d_e}f-g{h-i
b-c}d-e{f_g}h-i
Use a two-step approach:
import re
url = "10.147.48.10:8285/pet-store-account/{pet-owner}/version/{pet-type-id}/pet-details-and-name"
rx = re.compile(r'{[^{}]+}')
def replacer(match):
    return match.group(0).replace('-', '_')
url = rx.sub(replacer, url)
print(url)
Which yields
10.147.48.10:8285/pet-store-account/{pet_owner}/version/{pet_type_id}/pet-details-and-name
This looks for pairs of { and } and replaces every - with _ inside it.
There may be solutions with just one line but this one is likely to be understood in a couple of months as well.
Edit: For one-line-gurus:
url = re.sub(r'{[^{}]+}',
lambda x: x.group(0).replace('-', '_'),
url)
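As a quick check, the callback version can be run against the unmatched-brace inputs listed in the Vim answer above. This is a sketch; underscore_in_braces is a hypothetical wrapper around the same re.sub call.

```python
import re

BRACES = re.compile(r'{[^{}]+}')

def underscore_in_braces(text):
    """Replace '-' with '_' only inside innermost, properly closed {...} blocks."""
    return BRACES.sub(lambda m: m.group(0).replace('-', '_'), text)

# Inputs and expected results taken from the Before/After lists above.
cases = {
    'b-c{d-e{f-g}h-i': 'b-c{d-e{f_g}h-i',
    'b-c{d-e}f-g}h-i': 'b-c{d_e}f-g}h-i',
    'b-c{d-e}f-g{h-i': 'b-c{d_e}f-g{h-i',
    'b-c}d-e{f-g}h-i': 'b-c}d-e{f_g}h-i',
}
for before, expected in cases.items():
    assert underscore_in_braces(before) == expected
```

Because [^{}]+ refuses to cross another brace, a dash is only touched when it sits between a { and the nearest following }.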
Solution in Vim:
%s/\({.*\)\#<=-\(.*}\)\#=/_/g
Explanation of matched pattern:
\({.*\)\#<=-\(.*}\)\#=
\({.*\)\#<= Forces the match to have a {.* behind
- Specifies a dash (-) as the match
\(.*}\)\#= Forces the match to have a .*} ahead
Use python lookahead to ignore the string enclosed within curly brackets {}:
Description:
(?=...):
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
Solution
a = "10.147.48.10:8285/pet-store-account/**{pet-owner}**/version/**{pet-type-id}**/pet-details-and-name"
import re
re.sub(r"(?=[^{]*})-", "_", a)
Output:
'10.147.48.10:8285/pet-store-account/**{pet_owner}**/version/**{pet_type_id}**/pet-details-and-name'
Another way to do in Vim is to use a sub-replace-expression:
:%s/{\zs[^}]*\ze}/\=substitute(submatch(0),'-','_','g')/g
Using \zs and \ze we set the match between the { & } characters. Using \={expr} will evaluate {expr} as the replacement for each substitution. We then use Vim script's substitution function, substitute({text}, {pat}, {replace}, {flag}), on the entire match, submatch(0), to convert - to _.
For more help see:
:h sub-replace-expression
:h /\zs
:h submatch()
:h substitute()

Using regular expressions in Python

I have a couple of huge log files which contain a list of activity names and sub-activities, with a numerical value associated with each sub-activity. I need to write a script to automate the data analysis process. I used regex to get a pattern match for my main activity by doing a word-by-word search. Now, I have to find the sub-activity and get the numerical value associated with it.
For example: "Out: Packet Sizes Histogram Bucket 5=10". I need to check for the sub-activity Out: Packet Sizes and get the Histogram Bucket value 5=10. There is a list of sub-activities like this. With my word search technique I find it hard to get a pattern match for my sub-activity. What regex pattern should I use to get the 5=10 value when the pattern matches the entire text before that?
PS: All the sub-activities have the text "Histogram Bucket" repeated. I would greatly appreciate your suggestions to address this issue. I have just started learning regex and Python.
(1) If you want to use one regular expression you could use:
import re

known_activities = ['Out: Packet Sizes']
# you might have to use r'\s' or r'\ ' to protect the whitespace.
activity_exprs = [a.replace(' ', r'\s') for a in known_activities]
regexpr = r'('+'|'.join(activity_exprs)+r')\s*Histogram\sBucket\s(\d+=\d+)'
pattern = re.compile(regexpr)
match = pattern.match(input)
if match:
    print('Activity: '+match.group(1))
    print('Bucket: '+match.group(2))
(2) If you don't want to (or don't have to) match the activities, you could also simply go with:
regexpr = r'(.*?)\s*Histogram\sBucket\s(\d+=\d+)'
pattern = re.compile(regexpr)
match = pattern.match(input)
if match:
    print('Activity: '+match.group(1))
    print('Bucket: '+match.group(2))
(3) If you do want to match activities you can always do so in a separate step:
if match:
    activity = match.group(1)
    if activity in known_activities:
        print('Activity: '+activity)
        print('Bucket: '+match.group(2))
EDIT Some more details and explanations:
items = ['a','b','c']
'|'.join(items)
produces a|b|c. Used in regular expressions, | denotes alternatives, e.g. r'a(b|c)a' will match either 'aba' or 'aca'. So in (1) I basically chained all known activities together as alternatives. Each activity has to be a valid regular expression in itself (that is why any 'special' characters (e.g. whitespace) should be properly escaped).
One could simply mash together all alternatives by hand into one large regular expression, but that gets unwieldy and error-prone fast if there are more than a couple of activities.
All in all you are probably better off using (2) and, if necessary, (3) or a separate regular expression as a secondary stage.
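For the escaping step, re.escape can do the work instead of replacing spaces by hand. The following is a rough sketch along the lines of (1): parse_line is a hypothetical helper, the second activity name is invented purely to show the alternation, and re.search is used so the sub-activity may appear anywhere in the line.

```python
import re

# 'In: Packet Sizes' is a made-up second entry, only here to show the alternation.
known_activities = ['Out: Packet Sizes', 'In: Packet Sizes']

# Escape each name so it is treated as literal text, then join as alternatives.
alternatives = '|'.join(re.escape(a) for a in known_activities)
bucket_re = re.compile(r'(' + alternatives + r')\s+Histogram\sBucket\s(\d+)=(\d+)')

def parse_line(line):
    """Return (sub_activity, bucket, count) or None if the line does not match."""
    m = bucket_re.search(line)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3))

print(parse_line('Out: Packet Sizes Histogram Bucket 5=10'))
# ('Out: Packet Sizes', 5, 10)
```

Splitting 5=10 into two integers is a design choice for later analysis; keep (\d+=\d+) as a single group if the raw text is what you need.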
EDIT2
regarding your sample line you could also use:
regexpr = r'([^\s]*?)\s([^\s]*?)\s([^\s]*?)\s(.*?)\s*Histogram\sBucket\s(\d+=\d+)'
pattern = re.compile(regexpr)
match = pattern.match(input)
if match:
    print('Date: '+match.group(1))
    print('Time: '+match.group(2))
    print('Activity: '+match.group(3))
    print('Sub: '+match.group(4))
    print('Bucket: '+match.group(5))
EDIT3
pattern.match(input) expects to find the pattern directly at the beginning of the input string. That means 'a' will match 'a' or 'abc' but not 'ba'. If your pattern does not start at the beginning, you have to prepend '.*?' to your regular expression to consume as many arbitrary characters as necessary.
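A short illustration of that difference, using the sample text from the question (the variable names are only for this example). Note that pattern.search() is the other standard way to find a pattern that does not start at the beginning.

```python
import re

pattern = re.compile(r'Histogram\sBucket\s(\d+=\d+)')
line = 'Out: Packet Sizes Histogram Bucket 5=10'

# match() only tries at position 0, where the line starts with "Out:".
at_start = pattern.match(line)
# search() scans forward until the pattern fits somewhere in the line.
anywhere = pattern.search(line)

print(at_start)           # None
print(anywhere.group(1))  # 5=10
```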
'\s' matches any whitespace character, '[^\s]' matches any character that is NOT whitespace.
If you want to learn more about regular expressions, the python HOWTO on that matter is quite good.

Python regex: how to match anything up to a specific string and avoid backtracking when failing

I'm trying to craft a regex able to match anything up to a specific pattern. The regex then will continue looking for other patterns until the end of the string, but in some cases the pattern will not be present and the match will fail. Right now I'm stuck at:
.*?PATTERN
The problem is that, in cases where the string is not present, this takes too much time due to backtracking. In order to shorten this, I tried mimicking atomic grouping using positive lookahead as explained in this thread (btw, I'm using the re module in python-2.7):
Do Python regular expressions have an equivalent to Ruby's atomic grouping?
So I wrote:
(?=(?P<aux1>.*?))(?P=aux1)PATTERN
Of course, this is faster than the previous version when STRING is not present, but the trouble is that it doesn't match STRING anymore, as the . matches everything to the end of the string and the previous states are discarded after the lookahead.
So the question is, is there a way to do a match like .*?STRING and also be able to fail faster when the match is not present?
You could try using split
If the results are of length 1 you got no match. If you get two or more you know that the first one is the first match. If you limit the split to size one you'll short-circuit the later matching:
"HI THERE THEO".split("TH", 1) # ['HI ', 'ERE THEO']
The first element of the results is up to the match.
One-Regex Solution
^(?=(?P<aux1>(?:[^P]|P(?!ATTERN))*))(?P=aux1)PATTERN
Explanation
You wanted to use the atomic grouping like this: (?>.*?)PATTERN, right? This won't work. Problem is, you can't use lazy quantifiers at the end of an atomic grouping: the definition of the AG is that once you're outside of it, the regex won't backtrack inside.
So the regex engine will match the .*?, because of the laziness it will step outside of the group to check if the next character is a P, and if it's not it won't be able to backtrack inside the group to match that next character inside the .*.
What's usually used in Perl are structures like this: (?>(?:[^P]|P(?!ATTERN))*)PATTERN. That way, the equivalent of .* (here (?:[^P]|P(?!ATTERN))) won't "eat up" the wanted pattern.
This pattern is easier to read in my opinion with possessive quantifiers, which are made just for these occasions: (?:[^P]|P(?!ATTERN))*+PATTERN.
Translated with your workaround, this would lead to the above regex (added ^ since you should anchor the regex, either to the start of the string or to another regex).
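A small check of that regex with Python's re module, on two made-up inputs. It is only a sketch of the behaviour: the lookahead captures the longest run that cannot contain PATTERN, the backreference consumes it in one step, and the engine cannot backtrack into it afterwards.

```python
import re

rx = re.compile(r'^(?=(?P<aux1>(?:[^P]|P(?!ATTERN))*))(?P=aux1)PATTERN')

hit = rx.search('abcPATTERNxyz')
miss = rx.search('abcPxyz')  # contains a "P" but never "PATTERN"

print(hit.group(0))       # abcPATTERN
print(hit.group('aux1'))  # abc
print(miss)               # None
```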
The Python documentation includes a brief outline of the differences between the re.search() and re.match() functions http://docs.python.org/2/library/re.html#search-vs-match. In particular, the following quote is relevant:
Sometimes you’ll be tempted to keep using re.match(), and just add .* to the front of your RE. Resist this temptation and use re.search() instead. The regular expression compiler does some analysis of REs in order to speed up the process of looking for a match. One such analysis figures out what the first character of a match must be; for example, a pattern starting with Crow must match starting with a 'C'. The analysis lets the engine quickly scan through the string looking for the starting character, only trying the full match if a 'C' is found.
Adding .* defeats this optimization, requiring scanning to the end of the string and then backtracking to find a match for the rest of the RE. Use re.search() instead.
In your case, it would be preferable to define your pattern simply as:
pattern = re.compile("PATTERN")
And then call pattern.search(...), which will not backtrack when the pattern is not found.
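If the text before the pattern is what you are after, it can be sliced out using the match position. A minimal sketch, where prefix_before is a hypothetical helper and "PATTERN" stands in for the real expression:

```python
import re

pattern = re.compile("PATTERN")

def prefix_before(text):
    """Return everything before the first PATTERN, or None if it is absent."""
    m = pattern.search(text)
    if m is None:
        return None
    # m.start() is the index where the match begins.
    return text[:m.start()]

print(repr(prefix_before('abc PATTERN def')))  # 'abc '
print(prefix_before('nothing to see here'))    # None
```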

Unexpected end of Pattern : Python Regex

When I use the following python regex to perform the functionality described below, I get the error Unexpected end of Pattern.
Regex:
modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)
(CODE[0-9]{3})(?!</a>)',r'\g<1>',input)
Purpose of this regex:
INPUT:
CODE876
CODE223
matchjustCODE657
CODE69743
code876
testing1CODE888
example2CODE098
http://replaced/CODE665
Should match:
CODE876
CODE223
CODE657
CODE697
and replace occurrences with
http://productcode/CODE876
http://productcode/CODE223
matchjusthttp://productcode/CODE657
http://productcode/CODE69743
Should Not match:
code876
testing1CODE888
testing2CODE776
example3CODE654
example2CODE098
http://replaced/CODE665
FINAL OUTPUT
http://productcode/CODE876
http://productcode/CODE223
matchjusthttp://productcode/CODE657
http://productcode/CODE69743
code876
testing1CODE888
example2CODE098
http://replaced/CODE665
EDIT and UPDATE 1
modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)',r'\g<1>',input)
The error is no more happening. But this does not match any of the patterns as needed. Is there a problem with matching groups or the matching itself. Because when I compile this regex as such, I get no match to my input.
EDIT AND UPDATE 2
f=open("/Users/mymac/Desktop/regex.txt")
s=f.read()
s1 = re.sub(r'((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)',
r'\g<1>\g<2>', s)
print s1
INPUT
CODE123 CODE765 testing1CODE123 example1CODE345 http://www.coding.com/CODE333 CODE345
CODE234
CODE333
OUTPUT
CODE123 CODE765 testing1CODE123 example1CODE345 http://www.coding.com/CODE333 CODE345
CODE234
CODE333
Regex works for Raw input, but not for string input from a text file.
See Input 4 and 5 for more results http://ideone.com/3w1E3
Your main problem is the (?-i) thingy which is wishful thinking as far as Python 2.7 and 3.2 are concerned. For more details, see below.
import re
# modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)
# (CODE[0-9]{3})(?!</a>)',r'\g<1>',input)
# observation 1: as presented, pattern has a line break in the middle, just after (?-i)
# ob 2: rather hard to read, should use re.VERBOSE
# ob 3: not obvious whether it's a compile-time or run-time problem
# ob 4: (?i) should be at the very start of the pattern (see docs)
# ob 5: what on earth is (?-i) ... not in 2.7 docs, not in 3.2 docs
pattern = r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)'
#### rx = re.compile(pattern)
# above line failed with "sre_constants.error: unexpected end of pattern"
# try without the (?-i)
pattern2 = r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)'
rx = re.compile(pattern2)
# This works, now you need to work on observations 1 to 4,
# and rethink your CODE/code strategy
Looks like suggestions fall on deaf ears ... Here's the pattern in re.VERBOSE format:
pattern4 = r'''
    ^
    (?i)
    (
        (?:
            (?!http://)
            (?!testing[0-9])
            (?!example[0-9])
            . #### what is this for?
        )*?
    ) ##### end of capturing group 1
    (CODE[0-9]{3}) #### not in capturing group 1
    (?!</a>)
'''
Okay, it looks like the problem is the (?-i), which is surprising. The purpose of the inline-modifier syntax is to let you apply modifiers to selected portions of the regex. At least, that's how they work in most flavors. In Python it seems they always modify the whole regex, same as the external flags (re.I, re.M, etc.). The alternative (?i:xyz) syntax doesn't work either.
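That describes the Python versions discussed in this thread (2.7 and 3.2). Python 3.6 and later do accept scoped inline flags of the form (?i:...) and (?-i:...), so on a newer interpreter the case-insensitive part can be limited to a group. A minimal sketch, assuming Python 3.6+; the pattern is simplified from the question for illustration:

```python
import re

# Only "testing" is case-insensitive; "CODE" must still be upper case.
# Requires Python 3.6 or later.
rx = re.compile(r'(?i:testing)[0-9](CODE[0-9]{3})')

upper_prefix = rx.search('TESTING1CODE888')
lower_code = rx.search('TESTING1code888')

print(upper_prefix.group(1))  # CODE888
print(lower_code)             # None
```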
On a side note, I don't see any reason to use three separate lookaheads, as you did here:
(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?
Just OR them together:
(?:(?!http://|testing[0-9]|example[0-9]).)*?
EDIT: We seem to have moved from the question of why the regex throws exceptions, to the question of why it doesn't work. I'm not sure I understand your requirements, but the regex and replacement string below return the results you want.
s1 = re.sub(r'^((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)',
r'\g<1>\g<2>', s)
See it in action on ideone.com
Is that what you're after?
EDIT: We now know that the replacements are being done within a larger text, not on standalone strings. That makes the problem much more difficult, but we also know the full URLs (the ones that start with http://) only occur in already-existing anchor elements. That means we can split the regex into two alternatives: one to match complete <a>...</a> elements, and one to match the target strings.
(?s)(?:(<a\s+[^>]*>.*?</a>)|\b((?:(?!testing[0-9]|example[0-9])\w)*?)(CODE[0-9]{3}))
The trick is to use a function instead of a static string for the replacement. Whenever the regex matches an anchor element, the function will find it in group(1) and return it unchanged. Otherwise, it uses group(2) and group(3) to build a new one.
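A rough sketch of such a replacement function follows. The names link_codes and PREFIX are hypothetical, and what gets built for a bare code is an assumption: here it is simply prefixed with http://productcode/ as in the question's expected output, since the exact markup used in the linked demo is not shown.

```python
import re

# Assumed prefix, taken from the expected output in the question.
PREFIX = 'http://productcode/'

rx = re.compile(
    r'(?s)(?:(<a\s+[^>]*>.*?</a>)'
    r'|\b((?:(?!testing[0-9]|example[0-9])\w)*?)(CODE[0-9]{3}))'
)

def link_codes(match):
    # Group 1 is set when a whole anchor element matched: return it untouched.
    if match.group(1) is not None:
        return match.group(1)
    # Otherwise keep the leading word characters and prefix the code.
    return match.group(2) + PREFIX + match.group(3)

text = ('CODE876 <a href="http://replaced/CODE665">CODE665</a> '
        'testing1CODE888 matchjustCODE657')
result = rx.sub(link_codes, text)
print(result)
```

The anchor alternative comes first, so anything already inside <a>...</a> is consumed before the second alternative gets a chance to see it.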
here's another demo (I know that's horrible code, but I'm too tired right now to learn a more pythonic way.)
The only problem I see is that you replace using the wrong capturing group.
modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)',r'\g<1>',input)
                       ^                                                        ^                         ^
                       first capturing group                                    second one                using the first group
Here I made the first one also a non capturing group
^(?i)(?:(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)
See it here on Regexr
For complex regexes, use the re.X flag to document what you're doing and to make sure the brackets match up correctly (i.e. by using indentation to indicate the current level of nesting).
