I have two very related questions:
I want to match a string against a wildcard pattern (i.e. one containing one or more '*' or '?')
and then form a replacement string from a second wildcard pattern, where the placeholders refer to the same matched substrings
(as, for instance, in the DOS copy command).
Example: pattern='*.txt' and replacement-pattern='*.doc':
I want aaa.txt --> aaa.doc and xx.txt.txt --> xx.txt.doc
Ideally it would work with multiple, arbitrarily placed wildcards: e.g., pattern='*.*' and replacement-pattern='XX*.*'.
Of course one needs to apply some constraints (e.g. greedy strategy). Otherwise patterns such as X*X*X are not unique for string XXXXXX.
Or, alternatively, form a multi-match. That is, I have one or more wildcard patterns, each with the same number of wildcard characters. Each pattern is matched against one string, but the wildcard characters should refer to the same matching text.
Example: pattern1='*.txt' and pattern2='*-suffix.txt'
Should match the pair string1='XX.txt' and string2='XX-suffix.txt' but not
string1='XX.txt' and string2='YY-suffix.txt'
In contrast to the first, this is a better-defined problem, as it avoids the ambiguity issue, but the two are probably quite similar.
I am sure there are algorithms for these tasks, however, I am unable to find anything useful.
The Python standard library has fnmatch, but it does not support what I want to do.
There are many ways to do this, but I came up with the following, which should work for your first question. Based on your examples I’m assuming you don’t want to match whitespace.
This function turns the first passed pattern into a regex and the passed replacement pattern into a string suitable for the re.sub function.
import re

def replaceWildcards(string, pattern, replacementPattern):
    splitPattern = re.split(r'([*?])', pattern)
    splitReplacement = re.split(r'([*?])', replacementPattern)
    if len(splitPattern) != len(splitReplacement):
        raise ValueError("Provided pattern wildcards do not match")
    reg = ""
    sub = ""
    for idx, (regexPiece, replacementPiece) in enumerate(zip(splitPattern, splitReplacement)):
        if regexPiece in ["*", "?"]:
            if replacementPiece != regexPiece:
                raise ValueError("Provided pattern wildcards do not match")
            reg += f"(\\S{regexPiece if regexPiece == '*' else ''})"  # Match anything but whitespace
            sub += f"\\{idx + 1}"  # Group backreferences start at 1, not 0
        else:
            reg += f"({re.escape(regexPiece)})"
            sub += f"{replacementPiece}"
    return re.sub(reg, sub, string)
Sample output:
replaceWildcards("aaa.txt xx.txt.txt aaa.bat", "*.txt", "*.doc")
# 'aaa.doc xx.txt.doc aaa.bat'
replaceWildcards("aaa10.txt a1.txt aaa23.bat", "a??.txt", "b??.doc")
# 'aab10.doc a1.txt aaa23.bat'
replaceWildcards("aaa10.txt a1-suffix.txt aaa23.bat", "a*-suffix.txt", "b*-suffix.doc")
# 'aaa10.txt b1-suffix.doc aaa23.bat'
replaceWildcards("prefix-2aaa10-suffix.txt a1-suffix.txt", "prefix-*a*-suffix.txt", "prefix-*b*-suffix.doc")
# 'prefix-2aab10-suffix.doc a1-suffix.txt'
Note f-strings require Python >=3.6.
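As for the second, multi-match variant of the question, here is a minimal sketch along the same lines, assuming the same greedy, whitespace-excluding conventions as the function above; translate and multi_match are hypothetical helper names, not part of the answer:

import re

def translate(pattern):
    # Turn a wildcard pattern into a regex with one capture group per wildcard.
    parts = re.split(r'([*?])', pattern)
    return ''.join(
        r'(\S*)' if p == '*' else r'(\S)' if p == '?' else re.escape(p)
        for p in parts
    ) + '$'

def multi_match(patterns, strings):
    # True if every pattern matches its string and all wildcards captured
    # the same text, position by position.
    captured = None
    for pattern, string in zip(patterns, strings):
        m = re.match(translate(pattern), string)
        if not m:
            return False
        if captured is None:
            captured = m.groups()
        elif m.groups() != captured:
            return False
    return True

print(multi_match(['*.txt', '*-suffix.txt'], ['XX.txt', 'XX-suffix.txt']))  # True
print(multi_match(['*.txt', '*-suffix.txt'], ['XX.txt', 'YY-suffix.txt']))  # False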
I have this code to check if a string is a valid IPv4 address:
import re

def is_ip4(IP):
    label = "([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])"
    pattern = re.compile("(" + label + "\.){3}" + label + "$")
    if pattern.match(IP):
        print("matched!")
    else:
        print("No!")
It works fine, but if I remove the parentheses from the label, like this:
import re

def is_ip4(IP):
    label = "[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]"
    pattern = re.compile("(" + label + "\.){3}" + label + "$")
    if pattern.match(IP):
        print("matched!")
    else:
        print("No!")
it shows a valid IP for "2090.1.11.0" and "20.1.11.0", but not for "2.1.11.0". I'm actually a bit confused about the cases with vs. without parentheses. Can someone explain this to me? Thanks.
The reason you need the parentheses is the two-step process you're using. By themselves, the parentheses don't do anything (other than capturing a group). But you're also doing this:
pattern = re.compile("(" + label + "\.){3}" + label + "$")
The label regex is pasted in twice: first for three repetitions, each followed by a period. That copy is (almost) fine, because in that statement it is enclosed in parentheses once more. However, the second copy is outside any parentheses, so you end up with a regex like this (simplified):
pattern == '(a|ab|abc\.){3}a|ab|abc$'
This matches if either (a|ab|abc\.){3}a matches, or ab or abc. With parentheses, it would be like:
pattern == '((a|ab|abc)\.){3}(a|ab|abc)$'
So, although the parentheses appear superfluous, they are not, for two reasons: they keep the period separate from the last option abc, and they keep the final alternatives grouped together and apart from the first part.
However, you shouldn't be doing this in the first place. Just use:
from ipaddress import ip_address

def is_ip4(ip):
    try:
        ip_address(ip)
        return True
    except ValueError:
        return False
No installation required; it's part of the standard library.
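For reference, a quick usage sketch against the addresses from the question (note that ip_address also accepts IPv6 strings, so use ipaddress.IPv4Address instead if you need to restrict the check to IPv4):

print(is_ip4("2.1.11.0"))     # True
print(is_ip4("20.1.11.0"))    # True
print(is_ip4("2090.1.11.0"))  # False, octet out of range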
The reason you get a match for '2090.1.11.0' is that matching it against this:
'([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]\\.){3}[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]$'
comes down to matching it against this:
'([0-9]){3}[0-9]'
Since [0-9] is the first option in the 'or' expression in parentheses (repeated three times), and the second [0-9] is the first option in the 'or' expression after the {3}.
Note that the $ you put in to ensure the entire string is matched gets lumped in with the final 'or' option, so it doesn't do anything here.
Try running the below and note the identical first match:
import re
print(re.findall('([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]\\.){3}[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]$', '2090.1.11.0'))
print(re.findall('([0-9]){3}[0-9]', '2090.1.11.0'))
(ignore the second match on the first line, not as relevant)
First of all, sorry if the title isn't very explicit; it's hard for me to formulate it properly. That's also why I haven't been able to find out whether the question has already been asked.
So, I have a list of strings, and I want to perform a "procedural" search, replacing every * in my target substring with any possible substring.
Here is an example:
strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
searchFor('mesh_*')
# should return: ['mesh_1_TMP', 'mesh_2_TMP']
In this case, where there is just one *, I just split the search string on * and use startswith() and/or endswith(), so that's OK.
But I don't know how to do the same thing if there are multiple * in the search string.
So my question is, how do I search for any number of unknown substrings in place of * in a list of strings?
For example:
strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
searchFor('*_1_*')
# should return: ['obj_1_mesh', 'mesh_1_TMP']
Hope everything is clear enough. Thanks.
Consider using fnmatch, which provides Unix shell-style pattern matching. More info here: http://docs.python.org/2/library/fnmatch.html
from fnmatch import fnmatch
strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
searchFor = '*_1_*'
resultSubList = [x for x in strList if fnmatch(x, searchFor)]
This should do the trick
I would use the regular expression package for this if I were you. You'll have to learn a little bit of regex to make correct search queries, but it's not too bad. '.+' is pretty similar to '*' in this case.
import re

def search_strings(str_list, search_query):
    regex = re.compile(search_query)
    result = []
    for string in str_list:
        match = regex.match(string)
        if match is not None:
            result += [match.group()]
    return result

strList = ['obj_1_mesh',
           'obj_2_mesh',
           'obj_TMP',
           'mesh_1_TMP',
           'mesh_2_TMP',
           'meshTMP']

print(search_strings(strList, '.+_1_.+'))
This should return ['obj_1_mesh', 'mesh_1_TMP']. I tried to replicate the '*_1_*' case. For 'mesh_*' you could make the search_query 'mesh_.+'. Here is the link to the python regex api: https://docs.python.org/2/library/re.html
The simplest way to do this is to use fnmatch, as shown in ma3oun's answer. But here's a way to do it using Regular Expressions, aka regex.
First we transform your searchFor pattern so it uses '.+?' as the "wildcard" instead of '*'. Then we compile the result into a regex pattern object so we can use it efficiently for multiple tests.
For an explanation of regex syntax, please see the docs. But briefly, the dot means any character (except a newline), the + means look for one or more of them, and the ? means do non-greedy matching, i.e., match the smallest string that conforms to the pattern rather than the longest (which is what greedy matching does).
import re
strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
searchFor = '*_1_*'
pat = re.compile(searchFor.replace('*', '.+?'))
result = [s for s in strList if pat.match(s)]
print(result)
output
['obj_1_mesh', 'mesh_1_TMP']
If we use searchFor = 'mesh_*' the result is
['mesh_1_TMP', 'mesh_2_TMP']
Please note that this solution is not robust. If searchFor contains other characters that have special meaning in a regex, they need to be escaped. Actually, rather than doing that searchFor.replace transformation, it would be cleaner to just write the pattern using regex syntax in the first place.
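If you do want to keep the wildcard syntax while staying robust, one option (a sketch, not part of the answer above) is to let fnmatch.translate build the regex, since it escapes any metacharacters for you:

import re
from fnmatch import translate

strList = ['obj_1_mesh', 'obj_2_mesh', 'obj_TMP',
           'mesh_1_TMP', 'mesh_2_TMP', 'meshTMP']

searchFor = '*_1_*'
pat = re.compile(translate(searchFor))  # translate expands the wildcards and escapes everything else
result = [s for s in strList if pat.match(s)]
print(result)  # ['obj_1_mesh', 'mesh_1_TMP']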
If the string you are looking for is always a literal substring (no wildcards), you can just use the find function; you'll get something like:
for s in strList:
    if s.find(searchFor) != -1:
        do_something()
If you have more than one piece to look for (like abc*123*test), you're going to need to look for each piece in turn: find the second one in the same string starting at the index where you found the first plus its length, and so on, as in the sketch below.
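A rough sketch of that manual approach (my own; the matches helper is hypothetical), assuming the pieces come from splitting the pattern on '*':

def matches(string, pattern):
    pieces = pattern.split('*')
    # The first and last pieces are anchored to the ends of the string.
    if not string.startswith(pieces[0]) or not string.endswith(pieces[-1]):
        return False
    pos = len(pieces[0])
    for piece in pieces[1:-1]:
        idx = string.find(piece, pos)
        if idx == -1:
            return False
        pos = idx + len(piece)
    # Make sure the trailing piece does not overlap what we already consumed.
    return pos + len(pieces[-1]) <= len(string)

print([s for s in ['obj_1_mesh', 'mesh_1_TMP', 'meshTMP'] if matches(s, '*_1_*')])
# ['obj_1_mesh', 'mesh_1_TMP']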
I have a file with two different types of data I'd like to parse with a regex; however, the data is similar enough that I can't find the correct way to distinguish it.
Some lines in my file are of form:
AED=FRI
AFN=FRI:SAT
AMD=SUN:SAT
Other lines are of form
AED=20180823
AMD=20150914
AMD=20150921
The remaining lines are headers and I'd like to discard them. For example
[HEADER: BUSINESS DATE=20160831]
My attempt so far is to match the first three capital letters and an equals sign,
r'\b[A-Z]{3}=\b'
but after that I'm not sure how to distinguish between dates (e.g. 20180823) and days (e.g. FRI:SAT:SUN).
The results I'd expect from these parsing functions:
Regex weekday_rx = new Regex(<EXPRESSION FOR TYPES LIKE AED=FRI>);
Regex date_rx = new Regex(<EXPRESSION FOR TYPES LIKE AED=20160816>);
weekdays = [weekday_rx.Match(line) for line in infile.read()]
dates = [date_rx.Match(line) for line in infile.read()]
r'\S*\d$'
Will match all non-whitespace characters that end in a digit
Will match AED=20180823
r'\S*[a-zA-Z]$'
Matches all non-whitespace characters that end in a letter.
will match AED=FRI
AFN=FRI:SAT
AMD=SUN:SAT
Neither will match
[HEADER: BUSINESS DATE=20160831]
This will match both
r'(\S*[a-zA-Z]$|\S*\d$)'
Replacing the * with the number of occurrences you expect will be safer; the (a|b) construct means match a or match b.
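For what it's worth, a quick Python sketch (the question's snippet is C#-style pseudocode) of how those two patterns could split the lines; the sample lines are taken from the question:

import re

weekday_rx = re.compile(r'\S*[a-zA-Z]$')
date_rx = re.compile(r'\S*\d$')

lines = ['AED=FRI', 'AFN=FRI:SAT', 'AMD=20150914',
         '[HEADER: BUSINESS DATE=20160831]']

weekdays = [ln for ln in lines if weekday_rx.search(ln)]
dates = [ln for ln in lines if date_rx.search(ln)]
print(weekdays)  # ['AED=FRI', 'AFN=FRI:SAT']
print(dates)     # ['AMD=20150914']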
The following is a solution in Python :)
import re
p = re.compile(r'\b([A-Z]{3})=((\d)+|([A-Z])+)')
str_test_01 = "AMD=SUN:SAT"
m = p.search(str_test_01)
print (m.group(1))
print (m.group(2))
str_test_02 = "AMD=20150921"
m = p.search(str_test_02)
print (m.group(1))
print (m.group(2))
"""
<Output>
AMD
SUN
AMD
20150921
"""
Use pipes to express alternatives in a regex. The pattern '[A-Z]{3}:[A-Z]{3}|[A-Z]{3}' will match both ABC and ABC:ABC. Then use parentheses to group the results:
import re
match = re.match(r'([A-Z]{3}:[A-Z]{3})|([A-Z]{3})', 'ABC:ABC')
assert match.groups() == ('ABC:ABC', None)
match = re.match(r'([A-Z]{3}:[A-Z]{3})|([A-Z]{3})', 'ABC')
assert match.groups() == (None, 'ABC')
You can research the concept of named groups to make this even more readable. Also, take a look at the docs for the match object for useful info and methods.
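For illustration, a sketch of the same idea with named groups (my own; the group names code, days and date are arbitrary, and the pattern is anchored so the header line is skipped):

import re

pattern = re.compile(r'^(?P<code>[A-Z]{3})=(?:(?P<days>[A-Z]{3}(?::[A-Z]{3})*)|(?P<date>\d{8}))$')

for line in ['AED=FRI', 'AMD=SUN:SAT', 'AMD=20150921', '[HEADER: BUSINESS DATE=20160831]']:
    m = pattern.match(line)
    if m:
        print(m.group('code'), m.group('days') or m.group('date'))
# AED FRI
# AMD SUN:SAT
# AMD 20150921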
I am trying to use regular expressions in Python to match the frame number component of an image file in a sequence of images. I want to come up with a solution that covers a number of different naming conventions. If I put it into words, I am trying to match the last instance of one or more digits between two dots (e.g. .0100.). Below is an example of how my current logic falls down:
import os
import re

def sub_frame_number_for_frame_token(path, token='#'):
    folder = os.path.dirname(path)
    name = os.path.basename(path)
    pattern = r'\.(\d+)\.'
    matches = list(re.finditer(pattern, name) or [])
    if not matches:
        return path
    # Get last match.
    match = matches[-1]
    frame_token = token * len(match.group(1))
    start, end = match.span()
    apetail_name = '%s.%s.%s' % (name[:start], frame_token, name[end:])
    return os.path.join(folder, apetail_name)

# Success
eg1 = 'xx01_010_animation.0100.exr'
eg1 = sub_frame_number_for_frame_token(eg1)  # result: xx01_010_animation.####.exr

# Failure
eg2 = 'xx01_010_animation.123.0100.exr'
eg2 = sub_frame_number_for_frame_token(eg2)  # result: xx01_010_animation.###.0100.exr
I realise there are other ways in which I can solve this issue (I have already implemented a solution where I split the path at the dots and take the last item that is a number), but I am taking this opportunity to learn something about regular expressions. It appears the regular expression creates the groups from left to right and cannot use characters in the pattern more than once. Firstly, is there any way to search the string from right to left? Secondly, why doesn't the pattern find two matches in eg2 (123 and 0100)?
Cheers
finditer will return an iterator "over all non-overlapping matches in the string".
In your example, the last . of the first match will "consume" the first . of the second. Basically, after making the first match, the remaining string of your eg2 example is 0100.exr, which doesn't match.
To avoid this, you can use a lookahead assertion (?=), which doesn't consume the first match:
>>> pattern = re.compile(r'\.(\d+)(?=\.)')
>>> pattern.findall(eg1)
['0100']
>>> pattern.findall(eg2)
['123', '0100']
>>> eg3 = 'xx01_010_animation.123.0100.500.9000.1234.exr'
>>> pattern.findall(eg3)
['123', '0100', '500', '9000', '1234']
# and "right to left"
>>> pattern.findall(eg3)[::-1]
['1234', '9000', '500', '0100', '123']
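For completeness, here is a sketch (mine, not part of the answer) of how the lookahead pattern could slot back into the question's function. Because (?=\.) does not consume the trailing dot, the match span now ends just before it, so the rebuilt name only re-inserts the leading dot:

import os
import re

def sub_frame_number_for_frame_token(path, token='#'):
    folder = os.path.dirname(path)
    name = os.path.basename(path)
    matches = list(re.finditer(r'\.(\d+)(?=\.)', name))
    if not matches:
        return path
    match = matches[-1]                        # last frame-number candidate
    start, end = match.span()                  # span covers '.<digits>' only
    frame_token = token * len(match.group(1))
    new_name = '%s.%s%s' % (name[:start], frame_token, name[end:])
    return os.path.join(folder, new_name)

print(sub_frame_number_for_frame_token('xx01_010_animation.123.0100.exr'))
# xx01_010_animation.123.####.exr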
My solution uses a very simple, hackish way of fixing it. It reverses the path string at the beginning of your function and reverses the return value at the end of it. It basically uses regular expressions to search the backwards version of your given strings. Hackish, but it works. I used the syntax shown in this question to reverse the string.
import os
import re

def sub_frame_number_for_frame_token(path, token='#'):
    path = path[::-1]
    folder = os.path.dirname(path)
    name = os.path.basename(path)
    pattern = r'\.(\d+)\.'
    matches = list(re.finditer(pattern, name) or [])
    if not matches:
        return path[::-1]  # un-reverse before returning an unchanged path
    # Get last match.
    match = matches[-1]
    frame_token = token * len(match.group(1))
    start, end = match.span()
    apetail_name = '%s.%s.%s' % (name[:start], frame_token, name[end:])
    return os.path.join(folder, apetail_name)[::-1]

# Success
eg1 = 'xx01_010_animation.0100.exr'
eg1 = sub_frame_number_for_frame_token(eg1)  # result: xx01_010_animation.####.exr

# Previously failing case, now succeeds
eg2 = 'xx01_010_animation.123.0100.exr'
eg2 = sub_frame_number_for_frame_token(eg2)  # result: xx01_010_animation.123.####.exr

print(eg1)
print(eg2)
I believe the problem is that finditer returns only non-overlapping matches. Because both '.' characters are part of the regular expression, it doesn't consider the second dot as a possible start of another match. You can probably use the lookahead construct (?=\.) to match the second dot without consuming it.
Because of the way regular expressions work, I don't think there is an easy way to search right-to-left (though I suppose you could reverse the string and write the pattern backwards...).
If all you care about is the last \.(\d+)\., then anchor your pattern to the end of the string and do a simple re.search:
\.(\d+)\.(?:.*?)$
where (?:.*?) is non-capturing and non-greedy, so it will consume as few characters as possible between your real target and the end of the string, and those characters will not show up in matches.
(Caveat 1: I have not tested this. Caveat 2: That is one ugly regex, so add a comment explaining what it's doing.)
UPDATE: Actually, I guess you could just do ^.*(\.\d+\.) and let the implicitly greedy .* match as much as possible (including matches that occur earlier in the string) while still matching your group. That makes for a simpler regex, but I think it makes your intentions less clear.
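For what it's worth, a quick check of the UPDATE pattern (my own test, not from the answer): the greedy ^.* pushes the capture to the last '.digits.' run, which is the behaviour the question asks for.

import re

m = re.search(r'^.*(\.\d+\.)', 'xx01_010_animation.123.0100.exr')
print(m.group(1))  # .0100.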
I would like to fill the variables of a regex pattern with strings.
import re
hReg = re.compile("/robert/(?P<action>([a-zA-Z0-9]*))/$")
hMatch = hReg.match("/robert/delete/")
args = hMatch.groupdict()
The args variable is now a dict: {"action": "delete"}.
How can I reverse this process? Given the args dict and the regex pattern, how can I obtain the string "/robert/delete/"?
Is it possible to have a function like this?
def reverse(pattern, dictArgs):
Thank you
This function should do it
import re

def reverse(regex, dict):
    replacer_regex = re.compile(r'''
        \(\?P\<      # Match the opening
        (.+?)        # Match the group name into group 1
        \>\(.*?\)\)  # Match the rest
        ''', re.VERBOSE)
    return replacer_regex.sub(lambda m: dict[m.group(1)], regex)
You basically match the (\?P...) block and replace it with a value from the dict.
EDIT: regex is the regex string in my example. You can get it from the compiled pattern with
regex_compiled.pattern
EDIT2: verbose regex added
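A quick usage sketch against the pattern from the question (my own test). Note that anchors such as the trailing $ are left in place, since only the (?P<...>...) block is substituted:

print(reverse(r"/robert/(?P<action>([a-zA-Z0-9]*))/$", {"action": "delete"}))
# /robert/delete/$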
Actually, I think it's doable for some narrow cases, but it's a pretty complex thing in the general case.
You'll need to write some sort of finite state machine that parses your regex string, splits it into different parts, and then takes the appropriate action for each part:
For regular symbols, simply put the symbols as-is into the result string.
For named groups, put values from dictArgs in their place.
For optional blocks, put some of their possible values.
And so on.
One regular expression can often match a big (or even infinite) set of strings, so this "reverse" function wouldn't be very useful.
Building upon #Dimitri's answer, more sanitisation is possible.
import re

retype = type(re.compile('hello, world'))

def reverse(ptn, dict):
    if isinstance(ptn, retype):
        ptn = ptn.pattern
    ptn = ptn.replace(r'\.', '.')
    replacer_regex = re.compile(r'''
        \(\?P      # Match the opening
        \<(.+?)\>
        (.*?)
        \)         # Match the rest
        ''', re.VERBOSE)
    # return replacer_regex.findall(ptn)
    res = replacer_regex.sub(lambda m: dict[m.group(1)], ptn)
    return res
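A brief usage sketch (mine, with a hypothetical pattern): a compiled pattern is accepted directly thanks to the isinstance check, and the escaped dot is un-escaped by the extra sanitisation step:

hReg = re.compile(r"/robert\.(?P<action>[a-zA-Z0-9]*)/$")
print(reverse(hReg, {"action": "delete"}))
# /robert.delete/$  (the trailing $ anchor is left untouched)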