Python Regular Expression - right-to-left - python

I am trying to use regular expressions in python to match the frame number component of an image file in a sequence of images. I want to come up with a solution that covers a number of different naming conventions. If I put it into words I am trying to match the last instance of one or more numbers between two dots (eg .0100.). Below is an example of how my current logic falls down:
import os
import re

def sub_frame_number_for_frame_token(path, token='#'):
    folder = os.path.dirname(path)
    name = os.path.basename(path)
    pattern = r'\.(\d+)\.'
    matches = list(re.finditer(pattern, name) or [])
    if not matches:
        return path
    # Get last match.
    match = matches[-1]
    frame_token = token * len(match.group(1))
    start, end = match.span()
    apetail_name = '%s.%s.%s' % (name[:start], frame_token, name[end:])
    return os.path.join(folder, apetail_name)

# Success
eg1 = 'xx01_010_animation.0100.exr'
eg1 = sub_frame_number_for_frame_token(eg1) # result: xx01_010_animation.####.exr
# Failure
eg2 = 'xx01_010_animation.123.0100.exr'
eg2 = sub_frame_number_for_frame_token(eg2) # result: xx01_010_animation.###.0100.exr
I realise there are other ways in which I can solve this issue (I have already implemented a solution where I split the path at the dots and take the last item that is a number), but I am taking this opportunity to learn something about regular expressions. It appears the regular expression creates the groups from left to right and cannot use characters in the pattern more than once. Firstly, is there any way to search the string from right to left? Secondly, why doesn't the pattern find two matches in eg2 (123 and 0100)?
Cheers

finditer will return an iterator "over all non-overlapping matches in the string".
In your example, the last . of the first match will "consume" the first . of the second. Basically, after making the first match, the remaining string of your eg2 example is 0100.exr, which doesn't match.
To avoid this, you can use a lookahead assertion (?=...), which checks for the trailing dot without consuming it, so the next match can start at that dot:
>>> pattern = re.compile(r'\.(\d+)(?=\.)')
>>> pattern.findall(eg1)
['0100']
>>> pattern.findall(eg2)
['123', '0100']
>>> eg3 = 'xx01_010_animation.123.0100.500.9000.1234.exr'
>>> pattern.findall(eg3)
['123', '0100', '500', '9000', '1234']
# and "right to left"
>>> pattern.findall(eg3)[::-1]
['1234', '9000', '500', '0100', '123']
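
Putting the lookahead back into the function from the question might look like the sketch below. The name replace_last_frame_number is made up for illustration, and the code assumes the frame number is the last run of digits that sits between two dots in the file name.

```python
import os
import re

# A dot, then digits, then a dot that is checked but not consumed.
FRAME_RE = re.compile(r'\.(\d+)(?=\.)')

def replace_last_frame_number(path, token='#'):
    """Replace the last '.<digits>.' component of the file name with tokens."""
    folder, name = os.path.split(path)
    matches = list(FRAME_RE.finditer(name))
    if not matches:
        return path
    # span(1) covers only the digits, so both dots stay in place.
    start, end = matches[-1].span(1)
    new_name = name[:start] + token * (end - start) + name[end:]
    return os.path.join(folder, new_name)

print(replace_last_frame_number('xx01_010_animation.0100.exr'))
print(replace_last_frame_number('xx01_010_animation.123.0100.exr'))
```

Because the trailing dot is never consumed, the matches no longer compete for it, and taking the last item of the list gives the right-most frame number.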

My solution uses a very simple hackish way of fixing it. It reverses the string path in the beginning of your function and reverses the return value at the end of it. It basically uses regular expressions to search the backwards version of your given strings. Hackish, but it works. I used the syntax shown in this question to reverse the string.
import os
import re

def sub_frame_number_for_frame_token(path, token='#'):
    # Work on the reversed string so the "last" number becomes the first match.
    # Note: this only behaves for bare file names; a path with directories
    # would be split incorrectly once it is reversed.
    path = path[::-1]
    folder = os.path.dirname(path)
    name = os.path.basename(path)
    pattern = r'\.(\d+)\.'
    matches = list(re.finditer(pattern, name) or [])
    if not matches:
        # Undo the reversal before returning the unchanged path.
        return path[::-1]
    # Get last match.
    match = matches[-1]
    frame_token = token * len(match.group(1))
    start, end = match.span()
    apetail_name = '%s.%s.%s' % (name[:start], frame_token, name[end:])
    return os.path.join(folder, apetail_name)[::-1]

# Success
eg1 = 'xx01_010_animation.0100.exr'
eg1 = sub_frame_number_for_frame_token(eg1) # result: xx01_010_animation.####.exr
# Previously a failure, now a success
eg2 = 'xx01_010_animation.123.0100.exr'
eg2 = sub_frame_number_for_frame_token(eg2) # result: xx01_010_animation.123.####.exr
print(eg1)
print(eg2)

I believe the problem is that finditer returns only non-overlapping matches. Because both '.' characters are part of the regular expression, it doesn't consider the second dot as a possible start of another match. You can probably use the lookahead construct (?=...) to match the second dot without consuming it, i.e. (?=\.).
Because of the way regular expressions work, I don't think there is an easy way to search right-to-left (though I suppose you could reverse the string and write the pattern backwards...).

If all you care about is the last \.(\d+)\., then anchor your pattern from the end of the string and do a simple re.search(_):
\.(\d+)\.(?:.*?)$
where (?:.*?) is non-capturing and non-greedy, so the characters between your real target and the end of the string will not show up in the match groups. Be aware, though, that re.search scans from the left and returns the first position at which the whole pattern can succeed, so with several dotted numbers (as in eg2) this still captures the first one, 123, rather than the last.
(Caveat: that is one ugly regex, so add a comment explaining what it's doing.)
UPDATE: Actually I guess you could just do a ^.*(\.\d+\.) and let the implicitly greedy .* match as much as possible (including matches that occur earlier in the string) while still matching your group. That makes for a simpler regex, and it does pick the last dotted number, but I think it makes your intentions less clear.
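
A quick way to see how the two patterns above behave is to run both against eg2. This is only a sketch: the greedy pattern here has its group moved so that it captures just the digits, which is an adjustment for illustration rather than the exact pattern from the answer.

```python
import re

eg2 = 'xx01_010_animation.123.0100.exr'

# Lazy tail anchored at the end: re.search still returns the leftmost match.
lazy = re.search(r'\.(\d+)\.(?:.*?)$', eg2)

# Greedy prefix: .* grabs as much as it can, so the group is the last number.
greedy = re.match(r'.*\.(\d+)\.', eg2)

print(lazy.group(1))    # 123
print(greedy.group(1))  # 0100
```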

Related

wildcard match & replace and/or multiple string wildcard matching

I have two very related questions:
I want to match a string pattern with a wildcard (i.e. containing one or more '*' or '?')
and then form a replacement string with a second wildcard pattern. There the placeholders should refer to the same matched substring
(As for instance in the DOS copy command)
Example: pattern='*.txt' and replacement-pattern='*.doc':
I want aaa.txt --> aaa.doc and xx.txt.txt --> xx.txt.doc
Ideally it would work with multiple, arbitrarily placed wildcards: e.g., pattern='*.*' and replacement-pattern='XX*.*'.
Of course one needs to apply some constraints (e.g. greedy strategy). Otherwise patterns such as X*X*X are not unique for string XXXXXX.
or, alternatively, form a multi-match. That is I have one or more wildcard patterns each with the same number of wildcard characters. Each pattern is matched to one string but the wildcard characters should refer to the same matching text.
Example: pattern1='*.txt' and pattern2='*-suffix.txt'
Should match the pair string1='XX.txt' and string2='XX-suffix.txt' but not
string1='XX.txt' and string2='YY-suffix.txt'
In contrast to the first this is a more well defined problem as it avoids the ambiguity problem but is perhaps quite similar.
I am sure there are algorithms for these tasks, however, I am unable to find anything useful.
The Python library has fnmatch, but this does not support what I want to do.
There are many ways to do this, but I came up with the following, which should work for your first question. Based on your examples I’m assuming you don’t want to match whitespace.
This function turns the first passed pattern into a regex and the passed replacement pattern into a string suitable for the re.sub function.
import re

def replaceWildcards(string, pattern, replacementPattern):
    splitPattern = re.split(r'([*?])', pattern)
    splitReplacement = re.split(r'([*?])', replacementPattern)
    if (len(splitPattern) != len(splitReplacement)):
        raise ValueError("Provided pattern wildcards do not match")
    reg = ""
    sub = ""
    for idx, (regexPiece, replacementPiece) in enumerate(zip(splitPattern, splitReplacement)):
        if regexPiece in ["*", "?"]:
            if replacementPiece != regexPiece:
                raise ValueError("Provided pattern wildcards do not match")
            reg += f"(\\S{regexPiece if regexPiece == '*' else ''})" # Match anything but whitespace
            sub += f"\\{idx + 1}" # Regex matches start at 1, not 0
        else:
            reg += f"({re.escape(regexPiece)})"
            sub += f"{replacementPiece}"
    return re.sub(reg, sub, string)
Sample output:
replaceWildcards("aaa.txt xx.txt.txt aaa.bat", "*.txt", "*.doc")
# 'aaa.doc xx.txt.doc aaa.bat'
replaceWildcards("aaa10.txt a1.txt aaa23.bat", "a??.txt", "b??.doc")
# 'aab10.doc a1.txt aaa23.bat'
replaceWildcards("aaa10.txt a1-suffix.txt aaa23.bat", "a*-suffix.txt", "b*-suffix.doc")
# 'aaa10.txt b1-suffix.doc aaa23.bat'
replaceWildcards("prefix-2aaa10-suffix.txt a1-suffix.txt", "prefix-*a*-suffix.txt", "prefix-*b*-suffix.doc")
# 'prefix-2aab10-suffix.doc a1-suffix.txt'
Note f-strings require Python >=3.6.
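
For the second question (the multi-match), one possible approach is to turn each wildcard pattern into a regex with one capture group per wildcard and then compare what the groups captured. The sketch below is an assumption about what is wanted, not an established algorithm; wildcard_to_regex and wildcards_agree are hypothetical names, and only the single greedy way of matching each string is considered.

```python
import re

def wildcard_to_regex(pattern):
    """Compile a wildcard pattern, with one capture group per '*' or '?'."""
    out = []
    for part in re.split(r'([*?])', pattern):
        if part == '*':
            out.append('(.*)')   # any run of characters, greedy
        elif part == '?':
            out.append('(.)')    # exactly one character
        else:
            out.append(re.escape(part))
    return re.compile(''.join(out))

def wildcards_agree(pairs):
    """pairs is a list of (pattern, string) tuples.

    Returns True if every string matches its pattern and all wildcards
    captured the same text across the pairs.
    """
    captured = []
    for pattern, string in pairs:
        m = wildcard_to_regex(pattern).fullmatch(string)
        if m is None:
            return False
        captured.append(m.groups())
    return all(groups == captured[0] for groups in captured)

print(wildcards_agree([('*.txt', 'XX.txt'), ('*-suffix.txt', 'XX-suffix.txt')]))
print(wildcards_agree([('*.txt', 'XX.txt'), ('*-suffix.txt', 'YY-suffix.txt')]))
```

With several wildcards in one pattern a string can match in more than one way, so a fuller solution would have to try the alternative splits as well.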

I need help formulating a specific regex

I do not consider myself a newbie in regex, but I seem to have found a problem that stumped me (it's also Friday evening, so brain not at peak performance).
I am trying to substitute a place-holder inside a string with some other value. I am having great difficulty getting a syntax that behaves the way I want.
My place-holder has this format: {swap}
I want it to capture and replace these:
{swap} # NewValue
x{swap}x # xNewValuex
{swap}x # NewValuex
x{swap} # xNewValue
But I want it to NOT match these:
{{swap}} # NOT {NewValue}
x{{swap}}x # NOT x{NewValue}x
{{swap}}x # NOT {NewValue}x
x{{swap}} # NOT x{NewValue}
In all of the above, x can be any string, of any length, be it "word" or not.
I'm trying to do this using python3's re.sub() but anytime I satisfy one subset of criteria I lose another in the process. I'm starting to think it might not be possible to do in a single command.
Cheers!
If you're able to use the newer regex module, you can use (*SKIP)(*FAIL):
{{.*?}}(*SKIP)(*FAIL)|{.*?}
See a demo on regex101.com.
Broken down, this says:
{{.*?}}(*SKIP)(*FAIL) # match any {{...}} and "throw them away"
| # or ...
{.*?} # match your desired pattern
In Python this would be:
import regex as re
rx = re.compile(r'{{.*?}}(*SKIP)(*FAIL)|{.*?}')
string = """
{swap}
x{swap}x
{swap}x
x{swap}
{{swap}}
x{{swap}}x
{{swap}}x
x{{swap}}"""
string = rx.sub('NewValue', string)
print(string)
This yields:
NewValue
xNewValuex
NewValuex
xNewValue
{{swap}}
x{{swap}}x
{{swap}}x
x{{swap}}
For the sake of completeness, you can also achieve this with Python's own re module but here, you'll need a slightly adjusted pattern as well as a replacement function:
import re
rx = re.compile(r'{{.*?}}|({.*?})')
string = """
{swap}
x{swap}x
{swap}x
x{swap}
{{swap}}
x{{swap}}x
{{swap}}x
x{{swap}}"""

def repl(match):
    if match.group(1) is not None:
        return "NewValue"
    else:
        return match.group(0)

string = rx.sub(repl, string)
print(string)
Use negative lookahead and lookbehind:
import re
s1 = "x{swap}x"
s2 = "x{{swap}}x"
pattern = r"(?<!\{)\{[^}]+\}(?!})"
re.sub(pattern, "foo", s1)
#'xfoox'
re.sub(pattern, "foo", s2)
#'x{{swap}}x'
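
To check the lookaround idea against all eight cases from the question, a small helper can be run over them. This is a sketch: the function name swap is made up, and the character class is tightened to [^{}] (an adjustment to the pattern above) so that a placeholder cannot contain a brace.

```python
import re

# A '{' not preceded by '{', brace-free content, and a '}' not followed by '}'.
PLACEHOLDER = re.compile(r'(?<!\{)\{[^{}]+\}(?!\})')

def swap(text, value='NewValue'):
    # A function replacement keeps backslashes in value from being interpreted.
    return PLACEHOLDER.sub(lambda m: value, text)

for case in ['{swap}', 'x{swap}x', '{swap}x', 'x{swap}',
             '{{swap}}', 'x{{swap}}x', '{{swap}}x', 'x{{swap}}']:
    print(case, '->', swap(case))
```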

Parse a string using regex to obtain matches beginning with a certain word

I tried to search but the information that I am getting seems to be kinda overwhelming and far from what I need. I can't seem to get it to work.
The requirement is to get the function that starts with "meta" and its parentheses.
input:
one metaOmph(uno)
one metaAsdf(dos)
one metaPoil(tres)
output:
[ metaOmph , (uno) ]
[ metaAsdf, (dos) ]
[ metaPoil, (tres)]
The one that I currently have just gets the entire line if it starts with "meta". so I have the entire "one meta<>" if it's a match, would it be possible do what I'm aiming for?
Edit: It's one input/line at a time.
I'd love to post what I did earlier but I closed repl.it due to my frustration. I'll keep it in mind on my next post. (quite new here)
import re
s = """one metaOmph(uno)
one metaAsdf(dos)
one metaPoil(tres)"""
print(re.findall(r".+(meta\w+)(\(\w+\))", s))
Outputs:
[('metaOmph', '(uno)'), ('metaAsdf', '(dos)'), ('metaPoil', '(tres)')]
re.findall() approach with valid regex pattern:
import re
s = '''
one metaOmph(uno)
one metaAsdf(dos)
one metaPoil(tres)
'''
result = re.findall(r'\b(meta\w+)(\([^()]+\))', s)
print(result)
The output:
[('metaOmph', '(uno)'), ('metaAsdf', '(dos)'), ('metaPoil', '(tres)')]
If you are going to pass a multiline string, it would seem simple to use the module level re.findall function.
text = '''one metaOmph(uno)
one metaAsdf(dos)
one metaPoil(tres)'''
r = re.findall(r'\b(meta.*?)(\(.*?\))', text, re.M)
print(r)
[('metaOmph', '(uno)'), ('metaAsdf', '(dos)'), ('metaPoil', '(tres)')]
If you are going to be passing 1-line strings as input to a loop, it might make more sense to compile the pattern beforehand, using re.compile and re.search inside a function:
pat = re.compile(r'\b(meta.*?)(\(.*?\))')

def find(text):
    return pat.search(text)

for text in list_of_texts: # assuming you're passing in your strings from a list, or elsewhere
    m = find(text)
    if m:
        print(list(m.groups()))
['metaOmph', '(uno)']
['metaAsdf', '(dos)']
['metaPoil', '(tres)']
Note that m will be either a match object or None, depending on whether a match was found. You'll want to check the return value before using it; otherwise you'll receive an AttributeError: 'NoneType' object has no attribute 'groups', or something along those lines.
Alternatively, if you want to append the result to a list, you might instead use:
r_list = []
for text in list_of_texts:
    m = find(text)
    if m:
        r_list.append(list(m.groups()))
print(r_list)
[['metaOmph', '(uno)'], ['metaAsdf', '(dos)'], ['metaPoil', '(tres)']]
Regex Details
\b # word boundary (thought to add this in thanks to Roman's answer)
(
meta # literal 'meta'
.*? # non-greedy matchall
)
(
\( # literal opening parenthesis (escaped)
.*? # non-greedy matchall
\) # literal closing parenthesis (escaped)
)
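
The same breakdown can be kept inside the code by compiling the pattern with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is a sketch with two deliberate changes from the pattern above: \w* and [^()]* replace the non-greedy .*? so each group is limited to a name and one pair of parentheses. META_CALL and parse_line are names invented for the example.

```python
import re

META_CALL = re.compile(r"""
    \b              # word boundary, so 'meta' must start a word
    (meta\w*)       # group 1: the function name
    (\([^()]*\))    # group 2: the argument, parentheses included
""", re.VERBOSE)

def parse_line(line):
    """Return [name, args] for the first meta call on the line, or None."""
    m = META_CALL.search(line)
    return list(m.groups()) if m else None

for line in ['one metaOmph(uno)', 'one metaAsdf(dos)', 'one other(tres)']:
    print(parse_line(line))
```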

Search for any number of unknown substrings in place of * in a list of string

First of all, sorry if the title isn't very explicit, it's hard for me to formulate it properly. That's also why I haven't found if the question has already been asked, if it has.
So, I have a list of string, and I want to perform a "procedural" search replacing every * in my target-substring by any possible substring.
Here is an example:
strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
searchFor('mesh_*')
# should return: ['mesh_1_TMP', 'mesh_2_TMP']
In this case where there is just one * I just split each string with * and use startswith() and/or endswith(), so that's ok.
But I don't know how to do the same thing if there are multiple * in the search string.
So my question is, how do I search for any number of unknown substrings in place of * in a list of string?
For example:
strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
searchFor('*_1_*')
# should return: ['obj_1_mesh', 'mesh_1_TMP']
Hope everything is clear enough. Thanks.
Consider using 'fnmatch' which provides Unix-like file pattern matching. More info here http://docs.python.org/2/library/fnmatch.html
from fnmatch import fnmatch
strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
searchFor = '*_1_*'
resultSubList = [x for x in strList if fnmatch(x, searchFor)]
This should do the trick
I would use the regular expression package for this if I were you. You'll have to learn a little bit of regex to make correct search queries, but it's not too bad. '.+' is pretty similar to '*' in this case.
import re

def search_strings(str_list, search_query):
    regex = re.compile(search_query)
    result = []
    for string in str_list:
        match = regex.match(string)
        if match is not None:
            result += [match.group()]
    return result

strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
print(search_strings(strList, '.+_1_.+'))
This should return ['obj_1_mesh', 'mesh_1_TMP']. I tried to replicate the '*_1_*' case. For 'mesh_*' you could make the search_query 'mesh_.+'. Here is the link to the python regex api: https://docs.python.org/2/library/re.html
The simplest way to do this is to use fnmatch, as shown in ma3oun's answer. But here's a way to do it using Regular Expressions, aka regex.
First we transform your searchFor pattern so it uses '.+?' as the "wildcard" instead of '*'. Then we compile the result into a regex pattern object so we can efficiently use it for multiple tests.
For an explanation of regex syntax, please see the docs. But briefly, the dot means any character (on this line), the + means look for one or more of them, and the ? means do non-greedy matching, i.e., match the smallest string that conforms to the pattern rather than the longest, (which is what greedy matching does).
import re
strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
searchFor = '*_1_*'
pat = re.compile(searchFor.replace('*', '.+?'))
result = [s for s in strList if pat.match(s)]
print(result)
output
['obj_1_mesh', 'mesh_1_TMP']
If we use searchFor = 'mesh_*' the result is
['mesh_1_TMP', 'mesh_2_TMP']
Please note that this solution is not robust. If searchFor contains other characters that have special meaning in a regex they need to be escaped. Actually, rather than doing that searchFor.replace transformation, it would be cleaner to just write the pattern using regex syntax in the first place.
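
One way to handle the escaping concern is to escape each literal piece before joining the pieces with the wildcard expression. The sketch below assumes '*' is the only wildcard and, unlike the '.+?' version above, lets '*' match an empty string as fnmatch does. The names star_pattern_to_regex and search_for are invented for the example.

```python
import re

def star_pattern_to_regex(pattern):
    # Escape every literal piece, then join the pieces with '.*'.
    pieces = [re.escape(piece) for piece in pattern.split('*')]
    return re.compile('.*'.join(pieces))

def search_for(strings, pattern):
    regex = star_pattern_to_regex(pattern)
    # fullmatch anchors the pattern at both ends of each string.
    return [s for s in strings if regex.fullmatch(s)]

strList = ['obj_1_mesh', 'obj_2_mesh', 'obj_TMP',
           'mesh_1_TMP', 'mesh_2_TMP', 'meshTMP']
print(search_for(strList, '*_1_*'))
print(search_for(strList, 'mesh_*'))
```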
If the string you are looking for always looks like *string* (a fixed piece of text with a wildcard on either side), you can just use the find function with that fixed text; you'll get something like:
for s in strList:
    if s.find(searchFor) != -1:
        do_something()
If you have more than one string to look for (like abc*123*test), you're going to need to look for each string in turn: find the second one in the same string, starting at the index where you found the first plus its length, and so on.
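
The piece-by-piece search described above can be sketched without regular expressions. This version is an assumption about the intended semantics: the text before the first '*' must start the string, the text after the last '*' must end it, and the middle pieces must appear in order. matches_wildcard is a hypothetical name.

```python
def matches_wildcard(text, pattern):
    pieces = pattern.split('*')
    if len(pieces) == 1:
        # No wildcard at all: require an exact match.
        return text == pattern
    first, last = pieces[0], pieces[-1]
    if not text.startswith(first):
        return False
    pos = len(first)
    # Find each middle piece after the end of the previous one.
    for piece in pieces[1:-1]:
        idx = text.find(piece, pos)
        if idx == -1:
            return False
        pos = idx + len(piece)
    # The final piece must sit at the end without overlapping earlier pieces.
    return len(text) - len(last) >= pos and text.endswith(last)

strList = ['obj_1_mesh', 'obj_2_mesh', 'obj_TMP',
           'mesh_1_TMP', 'mesh_2_TMP', 'meshTMP']
print([s for s in strList if matches_wildcard(s, '*_1_*')])
print(matches_wildcard('abcXX123YYtest', 'abc*123*test'))
```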

python regex for repeating string

I am wanting to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
import re
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print(' matched.groups()', matched.groups())
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ; and then use a regular expression to return all matches, not just a single match, with re.findall('c?[0-9]+', text).
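
The two-step approach could be sketched as follows. parse_codes is a name made up for the example, and returning None for a line that fails verification is an assumption about how the caller wants failures reported.

```python
import re

def parse_codes(line):
    """Verify the 'start: ... ;' framing, then pull out every code."""
    m = re.match(r'start:([^;]*);', line)
    if m is None:
        return None
    # findall returns every non-overlapping match, not just one capture.
    return re.findall(r'c?[0-9]+', m.group(1))

print(parse_codes("start: c12354, c3456, 34526; other stuff that I don't care about"))
```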
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # returns a boolean if it starts with this string
s.endswith(";") # returns a boolean if it ends with this string
s[6:-1].strip().split(', ') # strips the leading space, then gives you a list of tokens separated by the string ", "
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException:
            # Oh noes: string doesn't match!
            continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.
import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = sstr.match(mystr)
if match:
    # Search only the captured part between 'start:' and ';'.
    res = slst.findall(match.group(1))
results in
['12354', '3456', '34526']
