There is a list of strings A, each of which somehow matches an entry in another list of strings B. I want to replace each string in A with the matching string from B using regular expressions. However, I am not getting the correct result.
The solution should be A == ["Yogesh","Numita","Hero","Yogesh"].
import re
A = ["yogeshgovindan","TNumita","Herohonda","Yogeshkumar"]
B=["Yogesh","Numita","Hero"]
for i in A:
    for j in B:
        replaced = re.sub('i', 'j', i)
        print(replaced)
This one works for me:
lst=[]
for a in A:
    lst.append([b for b in B if b.lower() in a.lower()][0])
This returns the element from list B if it is found in the list A entry. It's necessary to compare lowercased words. The [0] is added to get a string instead of a list from the list comprehension.
If looping over B, you don't need a regular expression; you can simply use membership testing.
A regex might result in better performance, as membership testing will scan each string in A for every string in B, resulting in O(len(A) * len(B)) behavior.
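For reference, a minimal membership-testing version (no regex, using the question's data) might look like this; note that next raises StopIteration if an entry has no match, so this sketch assumes every entry matches:

```python
A = ["yogeshgovindan", "TNumita", "Herohonda", "Yogeshkumar"]
B = ["Yogesh", "Numita", "Hero"]

# For each entry in A, take the first term in B that occurs in it (case-insensitive).
result = [next(b for b in B if b.lower() in a.lower()) for a in A]
print(result)  # ['Yogesh', 'Numita', 'Hero', 'Yogesh']
```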
As long as the individual terms don't contain any metacharacters and can appear in any context, the simplest way to form the regex is to join the entries of B with the alternation operation:
reTerms = re.compile('|'.join(B), re.I)
However, to be safe, the entries should first be escaped, in case any contains a metacharacter:
# map-based
reTerms = re.compile('|'.join(map(re.escape, B)), re.I)
# comprehension-based
reTerms = re.compile('|'.join([re.escape(b) for b in B]), re.I)
If there are any restrictions on the context in which the terms appear, sub-patterns for the restrictions need to be prepended and appended to the pattern. For example, if the terms must appear as full words:
reTerms = re.compile(rf"\b(?:{'|'.join(map(re.escape, B))})\b", re.I)
Note the raw f-string (rf"..."): without the r prefix, \b inside an f-string is a backspace character, not a word boundary.
The regex (here, the unrestricted version) can be applied to each item of A to get the matching text:
replaced = [reTerms.search(name).group(0) for name in A]
# result: ['yogesh', 'Numita', 'Hero', 'Yogesh']
Since the terms in the regex are straight string matches, the content will be correct, but the case may not. This could be corrected by a normalization step, passing the matched text through a dict:
normed = {term.lower():term for term in B}
replaced = [normed[reTerms.search(name).group(0).lower()] for name in A]
# result: ['Yogesh', 'Numita', 'Hero', 'Yogesh']
One issue remains: what if an item of A doesn't match? Then reTerms.search returns None, which doesn't have a group attribute. If None-propagating attribute access is ever added to Python (as proposed in PEP 505), this would be easy to address:
names = ["yogeshgovindan","TNumita","Herohonda","Yogeshkumar", "hrithikroshan"]
normed[None] = None
replaced = [normed[reTerms.search(name)?.group(0).lower()] for name in names]
In the absence of such a feature, there are various approaches, such as using a ternary expression and walrus assignment. In the sample below, a list is used as a stand-in to provide a default value for the match:
import re
names = ["yogeshgovindan","TNumita","Herohonda","Yogeshkumar", "hrithikroshan"]
terms = ["Yogesh","Numita","Hero"]
normed = {term.lower():term for term in terms}
normed[''] = None
reTerms = re.compile('|'.join(map(re.escape, terms)), re.I)
# index may need to be changed if `reTerms` includes any context
replaced = [normed[(reTerms.search(name) or [''])[0].lower()] for name in names]
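For comparison, here is a sketch of the ternary-plus-walrus variant mentioned above (Python 3.8+), which avoids the list stand-in entirely; unmatched names map to None:

```python
import re

names = ["yogeshgovindan", "TNumita", "Herohonda", "Yogeshkumar", "hrithikroshan"]
terms = ["Yogesh", "Numita", "Hero"]
normed = {term.lower(): term for term in terms}
reTerms = re.compile('|'.join(map(re.escape, terms)), re.I)

# The walrus operator names the match so one expression can both test and use it.
replaced = [normed[m.group(0).lower()] if (m := reTerms.search(name)) else None
            for name in names]
print(replaced)  # ['Yogesh', 'Numita', 'Hero', 'Yogesh', None]
```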
Suppose I have a string that has the same sub-string repeated multiple times and I want to replace each occurrence with a different element from a list.
For example, consider this scenario:
pattern = "_____" # repeated pattern
s = "a(_____), b(_____), c(_____)"
r = [0,1,2] # elements to insert
The goal is to obtain a string of the form:
s = "a(_000_), b(_001_), c(_002_)"
The number of occurrences is known, and the list r has the same length as the number of occurrences (3 in the previous example) and contains increasing integers starting from 0.
I've come up with this solution:
import re
pattern = "_____"
s = "a(_____), b(_____), c(_____)"
l = [m.start() for m in re.finditer(pattern, s)]
i = 0
for el in l:
    s = s[:el] + f"_{str(i).zfill(5 - 2)}_" + s[el + 5:]
    i += 1
print(s)
Output: a(_000_), b(_001_), c(_002_)
This solves my problem, but it seems to me a bit cumbersome, especially the for-loop. Is there a better way, maybe more "pythonic" (intended as concise, possibly elegant, whatever it means) to solve the task?
You can simply use the re.sub() method to replace each occurrence of the pattern with a different element from the list.
import re
pattern = re.compile("_____")
s = "a(_____), b(_____), c(_____)"
r = [0,1,2]
for val in r:
    s = re.sub(pattern, f"_{val:03d}_", s, count=1)
print(s)
You can also go with this approach, which avoids re by using the values in the r list together with their indexes:
r = [0,1,2]
s = ", ".join(f"{'abc'[i]}(_{val:03d}_)" for i, val in enumerate(r))
print(s)
a(_000_), b(_001_), c(_002_)
TL;DR
Use re.sub with a replacement callable and an iterator:
import re
p = re.compile("_____")
s = "a(_____), b(_____), c(_____)"
r = [0, 1, 2]
it = iter(r)
print(re.sub(p, lambda _: f"_{next(it):03d}_", s))
Long version
Generally speaking, it is a good idea to re.compile your pattern once ahead of time. If you are going to use that pattern repeatedly later, this makes the regex calls much more efficient. There is basically no downside to compiling the pattern, so I would just make it a habit.
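As a quick illustration of that habit (a toy example, not from the question), the compiled pattern object is created once and then reused across calls:

```python
import re

# Compile once, reuse across many calls.
word = re.compile(r"\w+")
texts = ["one two", "three four five"]
counts = [len(word.findall(t)) for t in texts]
print(counts)  # [2, 3]
```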
As for avoiding the for-loop altogether, the re.sub function allows us to pass a callable as the repl argument, which takes a re.Match object as its only argument and returns a string. Wouldn't it be nice if we could have such a replacement function that takes the next element from our replacements list every time it is called?
Well, since you have an iterable of replacement elements, we can leverage the iterator protocol to avoid explicit looping over the elements. All we need to do is give our replacement function access to an iterator over those elements, so that it can grab a new one via the next function every time it is called.
The string format specification that Jamiu used in his answer is great if you know that the sub-string to be replaced will always be exactly five underscores (_____) and that your replacement numbers will always fit in three digits.
So in its simplest form, a function doing what you described, could look like this:
import re
from collections.abc import Iterable
def multi_replace(
    pattern: re.Pattern[str],
    replacements: Iterable[int],
    string: str,
) -> str:
    iterator = iter(replacements)

    def repl(_match: re.Match[str]) -> str:
        return f"_{next(iterator):03d}_"

    return re.sub(pattern, repl, string)
Trying it out with your example data:
if __name__ == "__main__":
    p = re.compile("_____")
    s = "a(_____), b(_____), c(_____)"
    r = [0, 1, 2]
    print(multi_replace(p, r, s))
Output: a(_000_), b(_001_), c(_002_)
In this simple application, we aren't doing anything with the Match object in our replacement function.
If you want to make it a bit more flexible, there are a few avenues possible. Let's say the sub-strings to replace might (perhaps unexpectedly) be a different number of underscores. Let's further assume that the numbers might get bigger than 999.
First of all, the pattern would need to change a bit. And if we still want to center the replacement in an arbitrary number of underscores, we'll actually need to access the match object in our replacement function to check the number of underscores.
The format specifiers are still useful because they allow centering the inserted number with the ^ align code.
import re
from collections.abc import Iterable
def dynamic_replace(
    pattern: re.Pattern[str],
    replacements: Iterable[int],
    string: str,
) -> str:
    iterator = iter(replacements)

    def repl(match: re.Match[str]) -> str:
        replacement = f"{next(iterator):03d}"
        length = len(match.group())
        return f"{replacement:_^{length}}"

    return re.sub(pattern, repl, string)
if __name__ == "__main__":
    p = re.compile("(_+)")
    s = "a(_____), b(_____), c(_____), d(_______), e(___)"
    r = [0, 1, 2, 30, 4000]
    print(dynamic_replace(p, r, s))
Output: a(_000_), b(_001_), c(_002_), d(__030__), e(4000)
Here we are building the replacement string based on the length of the match group (i.e. the number of underscores) to ensure the number is always centered.
I think you get the idea. As always, separation of concerns is a good idea. You can put the replacement logic in its own function and refer to that, whenever you need to adjust it.
I don't see regex as the best fit for this situation.
pattern = "_____" # repeated pattern
s = "a(_____), b(_____), c(_____)"
r = [0,1,2] # elements to insert
fstring = s.replace(pattern, "_{}_")
str_out = fstring.format(*r)
str_out_pad = fstring.format(*[str(entry).zfill(3) for entry in r])
print(str_out)
print(str_out_pad)
--
a(_0_), b(_1_), c(_2_)
a(_000_), b(_001_), c(_002_)
I have a regex in Python that contains several named groups. However, patterns that match one group can be missed if previous groups have matched because overlaps don't seem to be allowed. As an example:
import re
myText = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegex = re.compile('(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))')
x = re.findall(myRegex,myText)
print(x)
Produces the output:
[('AAA', '')]
The 'long' group does not find a match because 'AAA' was used up in matching the preceding 'short' group.
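For illustration (not a general fix, since each position still yields only one group): alternation is tried left to right, so putting the longer branch first makes it win here:

```python
import re

myText = 'sgasgAAAaoasgosaegnsBBBausgisego'

# Branches are tried left to right, so the longer branch now takes priority.
reordered = re.compile('(?P<long>AAA.*BBB)|(?P<short>AAA)')
m = reordered.search(myText)
print(m.lastgroup, m.group())  # long AAAaoasgosaegnsBBB
```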
I've tried to find a method to allow overlapping but failed. As an alternative, I've been looking for a way to run each named group separately. Something like the following:
for g in myRegex.groupindex.keys():
    match = re.findall(***regex_for_named_group_g***, myText)
Is it possible to extract the regex for each named group?
Ultimately, I'd like to produce a dictionary output (or similar) like:
{'short':'AAA',
'long':'AAAaoasgosaegnsBBB'}
Any and all suggestions would be gratefully received.
There really doesn't appear to be a nicer way to do this, but here's another approach, along the lines of this other answer but somewhat simpler. It will work provided that a) your patterns are always formed as a series of named groups separated by pipes, and b) the named group patterns never contain named groups themselves.
The following would be my approach if you're interested in all matches of each pattern. The argument to re.split looks for a literal pipe followed (via lookahead) by (?P<, the beginning of a named group. It compiles each subpattern and uses the groupindex attribute to extract the name.
def nameToMatches(pattern, string):
    result = dict()
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        rx = re.compile(subpattern)
        name = list(rx.groupindex)[0]
        result[name] = rx.findall(string)
    return result
With your given text and pattern, returns {'long': ['AAAaoasgosaegnsBBB'], 'short': ['AAA']}. Patterns that don't match at all will have an empty list for their value.
If you only want one match per pattern, you can make it a bit simpler still:
def nameToMatch(pattern, string):
    result = dict()
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        match = re.search(subpattern, string)
        if match:
            result.update(match.groupdict())
    return result
This gives {'long': 'AAAaoasgosaegnsBBB', 'short': 'AAA'} for your givens. If one of the named groups doesn't match at all, it will be absent from the dict.
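A quick check of that function against the question's data (self-contained for convenience):

```python
import re

def nameToMatch(pattern, string):
    # Split the top-level alternation on pipes that precede a named group,
    # then search each subpattern independently.
    result = dict()
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        match = re.search(subpattern, string)
        if match:
            result.update(match.groupdict())
    return result

myText = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegex = '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))'
print(nameToMatch(myRegex, myText))
# {'short': 'AAA', 'long': 'AAAaoasgosaegnsBBB'}
```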
There didn't seem to be an obvious answer, so here's a hack. It needs a bit of finessing but basically it splits the original regex into its component parts and runs each group regex separately on the original text.
import re
myTextStr = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegexStr = '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))'
myRegex = re.compile(myRegexStr) # This is actually no longer needed
print("Full regex with multiple groups")
print(myRegexStr)
# Use a regex to split the original regex into separate regexes
# based on group names
mySplitGroupsRegexStr = r'\(\?P<(\w+)>(\([\w\W]+?\))\)(?:\||\Z)'
mySplitGroupsRegex = re.compile(mySplitGroupsRegexStr)
mySepRegexesList = re.findall(mySplitGroupsRegex,myRegexStr)
print("\nList of separate regexes")
print(mySepRegexesList)
# Convert separate regexes to a dict with group name as key
# and regex as value
mySepRegexDict = {reg[0]:reg[1] for reg in mySepRegexesList}
print("\nDictionary of separate regexes with group names as keys")
print(mySepRegexDict)
# Step through each key and run the group regex on the original text.
# Results are stored in a dictionary with group name as key and
# extracted text as value.
myGroupRegexOutput = {}
for g, r in mySepRegexDict.items():
    m = re.findall(re.compile(r), myTextStr)
    myGroupRegexOutput[g] = m[0]
print("\nOutput of overlapping named group regexes")
print(myGroupRegexOutput)
The resulting output is:
Full regex with multiple groups
(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))
List of separate regexes
[('short', '(?:AAA)'), ('long', '(?:AAA.*BBB)')]
Dictionary of separate regexes with group names as keys
{'short': '(?:AAA)', 'long': '(?:AAA.*BBB)'}
Output of overlapping named group regexes
{'short': 'AAA', 'long': 'AAAaoasgosaegnsBBB'}
This might be useful to someone somewhere.
First of all, sorry if the title isn't very explicit, it's hard for me to formulate it properly. That's also why I haven't found if the question has already been asked, if it has.
So, I have a list of strings, and I want to perform a "procedural" search, replacing every * in my target substring with any possible substring.
Here is an example:
strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
searchFor('mesh_*')
# should return: ['mesh_1_TMP', 'mesh_2_TMP']
In this case, where there is just one *, I just split the search string on * and use startswith() and/or endswith(), so that's OK.
But I don't know how to do the same thing if there are multiple * in the search string.
So my question is: how do I search for any number of unknown substrings in place of * in a list of strings?
For example:
strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
searchFor('*_1_*')
# should return: ['obj_1_mesh', 'mesh_1_TMP']
Hope everything is clear enough. Thanks.
Consider using 'fnmatch' which provides Unix-like file pattern matching. More info here http://docs.python.org/2/library/fnmatch.html
from fnmatch import fnmatch
strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
searchFor = '*_1_*'
resultSubList = [x for x in strList if fnmatch(x, searchFor)]
This should do the trick
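As a side note, fnmatch.filter (assuming the same data) does the filtering in a single call:

```python
from fnmatch import filter as fnmatch_filter

strList = ['obj_1_mesh', 'obj_2_mesh', 'obj_TMP',
           'mesh_1_TMP', 'mesh_2_TMP', 'meshTMP']

print(fnmatch_filter(strList, '*_1_*'))   # ['obj_1_mesh', 'mesh_1_TMP']
print(fnmatch_filter(strList, 'mesh_*'))  # ['mesh_1_TMP', 'mesh_2_TMP']
```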
I would use the regular expression package for this if I were you. You'll have to learn a little bit of regex to make correct search queries, but it's not too bad. '.+' is pretty similar to '*' in this case.
import re
def search_strings(str_list, search_query):
    regex = re.compile(search_query)
    result = []
    for string in str_list:
        match = regex.match(string)
        if match is not None:
            result.append(match.group())
    return result
strList= ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
print(search_strings(strList, '.+_1_.+'))
This should return ['obj_1_mesh', 'mesh_1_TMP']. I tried to replicate the '*_1_*' case. For 'mesh_*' you could make the search_query 'mesh_.+'. Here is the link to the python regex api: https://docs.python.org/2/library/re.html
The simplest way to do this is to use fnmatch, as shown in ma3oun's answer. But here's a way to do it using Regular Expressions, aka regex.
First we transform your searchFor pattern so it uses '.+?' as the "wildcard" instead of '*'. Then we compile the result into a regex pattern object so we can use it efficiently for multiple tests.
For an explanation of regex syntax, please see the docs. But briefly, the dot means any character (on this line), the + means look for one or more of them, and the ? means do non-greedy matching, i.e., match the smallest string that conforms to the pattern rather than the longest, (which is what greedy matching does).
import re
strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
searchFor = '*_1_*'
pat = re.compile(searchFor.replace('*', '.+?'))
result = [s for s in strList if pat.match(s)]
print(result)
output
['obj_1_mesh', 'mesh_1_TMP']
If we use searchFor = 'mesh_*' the result is
['mesh_1_TMP', 'mesh_2_TMP']
Please note that this solution is not robust. If searchFor contains other characters that have special meaning in a regex they need to be escaped. Actually, rather than doing that searchFor.replace transformation, it would be cleaner to just write the pattern using regex syntax in the first place.
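A slightly more robust sketch (hypothetical helper name) escapes the whole query first and then turns the escaped * back into a wildcard, which is roughly what fnmatch.translate does for glob syntax:

```python
import re

def glob_to_regex(query: str) -> re.Pattern:
    # Escape every regex metacharacter, then rewrite the escaped '*' as '.*'
    # and anchor the pattern at the end of the string.
    return re.compile(re.escape(query).replace(r'\*', '.*') + r'\Z')

strList = ['obj_1_mesh', 'obj_2_mesh', 'obj_TMP',
           'mesh_1_TMP', 'mesh_2_TMP', 'meshTMP']

pat = glob_to_regex('*_1_*')
print([s for s in strList if pat.match(s)])  # ['obj_1_mesh', 'mesh_1_TMP']
```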
If the string you are looking for is always a plain literal (no wildcards), you can just use the find function; you'll get something like:
for s in strList:
    if s.find(searchFor) != -1:
        do_something()
If you have more than one string to look for (like abc*123*test), you are going to need to look for each part in turn: find the next part in the same string, starting at the index where the previous part was found plus its length, and so on.
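That sequential-find idea could be sketched as follows (hypothetical helper; note it does not anchor the first and last parts, so a leading or trailing * is effectively implied):

```python
def matches(text: str, query: str) -> bool:
    # Each literal part must appear in order, after where the previous part ended.
    pos = 0
    for part in query.split('*'):
        idx = text.find(part, pos)
        if idx == -1:
            return False
        pos = idx + len(part)
    return True

strList = ['obj_1_mesh', 'obj_2_mesh', 'obj_TMP',
           'mesh_1_TMP', 'mesh_2_TMP', 'meshTMP']
print([s for s in strList if matches(s, '*_1_*')])  # ['obj_1_mesh', 'mesh_1_TMP']
```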
I have a string with a lot of recurrencies of a single pattern like
a = 'eresQQQutnohnQQQjkhjhnmQQQlkj'
and I have another string like
b = 'rerTTTytu'
I want to substitute the entire second string having as a reference the 'QQQ' and the 'TTT', and I want to find in this case 3 different results:
'ererTTTytuohnQQQjkhjhnmQQQlkj'
'eresQQQutnrerTTTytujhnmQQQlkj'
'eresQQQutnohnQQQjkhjrerTTTytu'
I've tried using re.sub
re.sub(r'\w{3}QQQ\w{3}', b, a)
but I obtain only the first one, and I don't know how to get the other two solutions.
Edit: As you requested, the two characters surrounding 'QQQ' will be replaced as well now.
I don't know if this is the most elegant or simplest solution for the problem, but it works:
import re

a = 'eresQQQutnohnQQQjkhjhnmQQQlkj'
b = 'rerTTTytu'

# Find all occurrences of ??QQQ?? in a - where ? is any non-whitespace character
matches = [x.start() for x in re.finditer(r'\S{2}QQQ\S{2}', a)]

# Replace each ??QQQ?? with b
results = [a[:idx] + re.sub(r'\S{2}QQQ\S{2}', b, a[idx:], count=1) for idx in matches]

print(results)
print(results)
Output
['errerTTTytunohnQQQjkhjhnmQQQlkj',
'eresQQQutnorerTTTytuhjhnmQQQlkj',
'eresQQQutnohnQQQjkhjhrerTTTytuj']
Since you didn't specify the output format, I just put it in a list.
There is a known "pattern" to get the captured group value or an empty string if no match:
match = re.search('regex', 'text')
if match:
    value = match.group(1)
else:
    value = ""
or:
match = re.search('regex', 'text')
value = match.group(1) if match else ''
Is there a simple and pythonic way to do this in one line?
In other words, can I provide a default for a capturing group in case it's not found?
For example, I need to extract all alphanumeric characters (and _) from the text after the key= string:
>>> import re
>>> PATTERN = re.compile(r'key=(\w+)')
>>> def find_text(text):
...     match = PATTERN.search(text)
...     return match.group(1) if match else ''
...
>>> find_text('foo=bar,key=value,beer=pub')
'value'
>>> find_text('no match here')
''
Is it possible for find_text() to be a one-liner?
It is just an example, I'm looking for a generic approach.
Quoting from the Match objects docs:
Match objects always have a boolean value of True. Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement:
match = re.search(pattern, string)
if match:
    process(match)
Since there is no other option, and since you are already using a function, I would like to present this alternative:
def find_text(text, matches=lambda x: x.group(1) if x else ''):
    return matches(PATTERN.search(text))
assert find_text('foo=bar,key=value,beer=pub') == 'value'
assert find_text('no match here') == ''
It is the same exact thing, but only the check which you need to do has been default parameterized.
Thinking of #Kevin's solution and #devnull's suggestions in the comments, you can do something like this
def find_text(text):
    return next((item.group(1) for item in PATTERN.finditer(text)), "")
This takes advantage of the fact that next accepts a default to be returned as an argument. But it has the overhead of creating a generator expression on every call, so I would stick with the first version.
You can play with the pattern, using an empty alternative at the end of the string in the capture group:
>>> re.search(r'((?<=key=)\w+|$)', 'foo=bar,key=value').group(1)
'value'
>>> re.search(r'((?<=key=)\w+|$)', 'no match here').group(1)
''
It's possible to refer to the result of a function call twice in a single one-liner: create a lambda expression and call the function in the arguments.
value = (lambda match: match.group(1) if match else '')(re.search(regex,text))
However, I don't consider this especially readable. Code responsibly - if you're going to write tricky code, leave a descriptive comment!
One-line version:
if re.findall(pattern,string): pass
The issue here is that you want to prepare for multiple matches or ensure that your pattern only hits once. Expanded version:
# matches is a list
matches = re.findall(pattern,string)
# condition on the list fails when list is empty
if matches:
    pass
So for your example "extract all alphanumeric characters (and _) from the text after the key= string":
# Returns
def find_text(text):
    return re.findall("(?<=key=)[a-zA-Z0-9_]*", text)[0]
One line for you, although not quite Pythonic.
find_text = lambda text: (lambda m: m and m.group(1) or '')(PATTERN.search(text))
Indeed, in the Scheme programming language, all local variable constructs can be derived from lambda applications.
Re: "Is there a simple and pythonic way to do this in one line?" The answer is no. Any means to get this to work in one line (without defining your own wrapper), is going to be uglier to read than the ways you've already presented. But defining your own wrapper is perfectly Pythonic, as is using two quite readable lines instead of a single difficult-to-read line.
Update for Python 3.8+: The new "walrus operator" introduced with PEP 572 does allow this to be a one-liner without convoluted tricks:
value = match.group(1) if (match := re.search('regex', 'text')) else ''
Many would consider this Pythonic, particularly those who supported the PEP. However, it should be noted that there was fierce opposition to it as well. The conflict was so intense that Guido van Rossum stepped down from his role as Python's BDFL the day after announcing his acceptance of the PEP.
You can do it as:
value = re.search('regex', 'text').group(1) if re.search('regex', 'text') else ''
Although it's not terribly efficient considering the fact that you run the regex twice.
Or to run it only once as #Kevin suggested:
value = (lambda match: match.group(1) if match else '')(re.search(regex,text))
One liners, one liners... Why can't you write it on 2 lines?
getattr(re.search('regex', 'text'), 'group', lambda x: '')(1)
Your second solution is fine. Make a function of it if you wish. My solution is for demonstration purposes and it's in no way Pythonic.
Starting with Python 3.8 and the introduction of assignment expressions (PEP 572) (the := operator), we can name the regex search expression pattern.search(text) in order to both check if there is a match (since pattern.search(text) returns either None or a re.Match object) and use it to extract the matching group:
# pattern = re.compile(r'key=(\w+)')
match.group(1) if (match := pattern.search('foo=bar,key=value,beer=pub')) else ''
# 'value'
match.group(1) if (match := pattern.search('no match here')) else ''
# ''