I have a list of regex patterns.
rgx_list = ['pattern_1', 'pattern_2', 'pattern_3']
I am using a function to loop through the list, compile each regex, and apply findall to grab the matched terms. I would then like a way of deleting those terms from the text.
def clean_text(rgx_list, text):
    matches = []
    for r in rgx_list:
        rgx = re.compile(r)
        found_matches = re.findall(rgx, text)
        matches.append(found_matches)
I want to do something like text.delete(matches) so that all of the matches will be deleted from the text and then I can return the cleansed text.
Does anyone know how to do this? My current code will only work for one match of each pattern, but the text may have more than one occurrence of the same pattern and I would like to eliminate all matches.
Use sub to replace matched patterns with an empty string. No need to separately find the matches first.
import re

def clean_text(rgx_list, text):
    new_text = text
    for rgx_match in rgx_list:
        # re.sub replaces every occurrence of the pattern, not just the first
        new_text = re.sub(rgx_match, '', new_text)
    return new_text
For simple patterns you can OR the expressions together using a "|" and make a single pass over the text. There are examples of combining regexes this way on Stack Overflow.
For really complex patterns I would loop through the list instead, since a large combined regex can become slow enough to cause timeouts.
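As a rough sketch of the OR approach, assuming the patterns are simple enough to combine safely: each one is wrapped in a non-capturing group so that a "|" inside one pattern cannot leak into its neighbours. The name clean_text_combined is only illustrative.

```python
import re

def clean_text_combined(rgx_list, text):
    # Wrap each pattern in (?:...) so its own alternations stay contained,
    # then join everything into one alternation.
    combined = re.compile("|".join("(?:{})".format(r) for r in rgx_list))
    # A single pass removes every occurrence of every pattern.
    return combined.sub("", text)

print(clean_text_combined([r"\d+", r"foo"], "a1 foo b22 foo"))
```

Unlike the loop, this scans the text once, so text produced by deleting one match is never re-examined by another pattern.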
curP = "<a href='https://programmers.co.kr/learn/courses/4673'></a>#!Muzi#Muzi!)jayg07con&&"
I want to count the occurrences of Muzi in this string with a regex.
For example:
MuziMuzi: count 0, because it is considered a single word
Muzi&Muzi: count 2, because the & separates the two words
7Muzi7Muzi: count 2
I tried to use a regex to find all the matches:
curP = "<a href='https://programmers.co.kr/learn/courses/4673'></a>#!Muzi#Muzi!)jayg07con&&"
pattern = re.compile('[^a-zA-Z]muzi[^a-zA-Z]')
print(pattern.findall(curP))
I expected ['!muzi#','#Muzi!']
but the result is
['!muzi#']
You need to use this as your regex:
pattern = re.compile('[^a-zA-Z]muzi(?=[^a-zA-Z])', flags=re.IGNORECASE)
(?=[^a-zA-Z]) says that muzi must be followed by a character matching [^a-zA-Z], but because it is a lookahead it does not consume that character. So the first match is only !Muzi, leaving the following # available to start the next match.
Your original regex was consuming !Muzi# leaving Muzi!, which would not match the regex.
Your matches will now be:
['!Muzi', '#Muzi']
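If only the count matters, a variation is to use negative lookarounds on both sides, so that no separator is consumed at all and the start and end of the string also count as boundaries. This is a sketch rather than the regex above; count_standalone is a made-up helper name.

```python
import re

def count_standalone(word, text):
    # (?<![a-zA-Z]) and (?![a-zA-Z]) assert "no letter here" without
    # consuming anything, so one separator can serve two neighbouring words.
    pattern = re.compile(
        r"(?<![a-zA-Z])" + re.escape(word) + r"(?![a-zA-Z])",
        flags=re.IGNORECASE,
    )
    return len(pattern.findall(text))

print(count_standalone("muzi", "#!Muzi#Muzi!)"))  # 2
print(count_standalone("muzi", "MuziMuzi"))       # 0
print(count_standalone("muzi", "Muzi&Muzi"))      # 2
```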
As I understand it, you want to get the characters that appear on both sides of your keyword Muzi.
That means that the #, in this case, has to be shared by both output values.
One way to do that with a regex that consumes both neighbouring characters is to modify the string as you find patterns.
Here is my solution:
import re

# Define the function to find the pattern
def find_pattern(curP):
    pattern = re.compile('([^a-zA-Z]muzi[^a-zA-Z])', flags=re.IGNORECASE)
    return pattern.findall(curP)[0]

curP = "<a href='https://programmers.co.kr/learn/courses/4673'></a>#!Muzi#Muzi!)jayg07con&&"
pattern_array = []
# Find the first appearance of the pattern in the string
pattern_array.append(find_pattern(curP))
# Remove the first occurrence of the keyword from the string
curP = curP.replace('Muzi', '', 1)
# Find the second appearance of the pattern in the string
pattern_array.append(find_pattern(curP))
print(pattern_array)
Output:
['!Muzi#', '#Muzi!']
I'm looking to find words in a string that match a specific pattern.
Problem is, if the words are part of an email address, they should be ignored.
To simplify, the pattern of the "proper words" is \w+\.\w+ - one or more word characters, an actual period, and another series of word characters.
The sentence that causes problem, for example, is a.a b.b:c.c d.d#e.e.e.
The goal is to match only [a.a, b.b, c.c] . With most Regexes I build, e.e returns as well (because I use some word boundary match).
For example:
>>> re.findall(r"(?:^|\s|\W)(?<!#)(\w+\.\w+)(?!#)\b", "a.a b.b:c.c d.d#e.e.e")
['a.a', 'b.b', 'c.c', 'e.e']
How can I match only among words that do not contain "#"?
I would definitely clean it up first and simplify the regex.
First, split the text into words:
words = re.split(r':|\s', "a.a b.b:c.c d.d#e.e.e")
Then filter out the words that have a # in them:
words = [word for word in words if re.search(r'^((?!#).)*$', word)]
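Putting the two steps together might look like the sketch below. It swaps the lookahead regex for a plain "#" membership test and adds a fullmatch check against the word pattern from the question; proper_words is just an illustrative name.

```python
import re

def proper_words(text):
    # Split on runs of colons and whitespace to get candidate tokens.
    tokens = re.split(r"[:\s]+", text)
    # Keep tokens with no '#' that are exactly word.word.
    return [t for t in tokens if "#" not in t and re.fullmatch(r"\w+\.\w+", t)]

print(proper_words("a.a b.b:c.c d.d#e.e.e"))  # ['a.a', 'b.b', 'c.c']
```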
Properly parsing email addresses with a regex is extremely hard, but for your simplified case, with a simple definition of word ~ \w+\.\w+ and email ~ any sequence that contains #, you might find that this regex does what you need:
>>> re.findall(r"(?:^|[:\s]+)(\w+\.\w+)(?=[:\s]+|$)", "a.a b.b:c.c d.d#e.e.e")
['a.a', 'b.b', 'c.c']
The trick here is not to focus on what comes in the next or previous word, but on what the word currently captured has to look like.
Another trick is in properly defining word separators. Before the word we'll allow multiple whitespaces, : and string start, consuming those characters, but not capturing them. After the word we require almost the same (except string end, instead of start), but we do not consume those characters - we use a lookahead assertion.
You may match the email-like substrings with \S+#\S+\.\S+ and match and capture your pattern with (\w+\.\w+) in all other contexts. Use re.findall to only return captured values and filter out empty items (they will be in re.findall results when there is an email match):
import re
rx = r"\S+#\S+\.\S+|(\w+\.\w+)"
s = "a.a b.b:c.c d.d#e.e.e"
res = list(filter(None, re.findall(rx, s)))  # list() is needed in Python 3, where filter returns an iterator
print(res)
# => ['a.a', 'b.b', 'c.c']
I have a regex that combines thousands of different regexes, e.g. r"reg1|reg2|...".
I'd like to know which one of the regexes gave a match in re.search(r"reg1|reg2|...", text), and I cannot figure out how to do it, since re.search(r"reg1|reg2|...", text).re.pattern gives the whole regex.
For example, if my regex is r"foo[0-9]|bar" and my text is "foo1", I'd like to get "foo[0-9]" as the answer.
Is there any way to do this?
Wrap each sub-regexp in (). After the match, you can go through all the groups of the match object (match.group(index)). The group that is not None will be the one that matched.
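A minimal sketch of that idea, assuming none of the sub-patterns contains capturing groups of its own (otherwise the group numbers no longer line up with the list). It uses the match object's lastindex attribute instead of looping over the groups; sub_patterns and which_pattern are illustrative names.

```python
import re

sub_patterns = [r"foo[0-9]", r"bar"]
# One capturing group per sub-pattern: group i belongs to sub_patterns[i - 1].
combined = re.compile("|".join("({})".format(p) for p in sub_patterns))

def which_pattern(text):
    m = combined.search(text)
    if m is None:
        return None
    # Only one top-level alternative can match, so lastindex identifies it.
    return sub_patterns[m.lastindex - 1]

print(which_pattern("foo1"))  # foo[0-9]
```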
You could put each possible regex into a list and then check them in series. This may be faster than one very large regex, and it lets you figure out which one matched, as you need to:
import re

mystring = "Some string you're searching in."
regs = ['reg1', 'reg2', 'reg3']  # ... and so on
matching_reg = None
match = None
for reg in regs:
    match = re.search(reg, mystring)
    if match:
        matching_reg = reg
        break
After that, match and matching_reg will both be None if no match was found. If a match was found, match will contain the regex result and matching_reg will contain the regex search string from regs that matched.
Note that break is used to stop attempting to match as soon as a match is found.
I need to find, process and remove (one by one) any substrings that match a rather long regex:
# p is a compiled regex
# s is a string
while 1:
    m = p.match(s)
    if m is None:
        break
    process(m.group(0))  # do something with the matched pattern
    s = re.sub(m.group(0), '', s)  # remove it from string s
The code above is not good for 2 reasons:
It doesn't work if m.group(0) happens to contain any regex-special characters (like *, +, etc.).
It feels like I'm duplicating the work: first I search the string for the regular expression, and then I have to kinda go look for it again to remove it.
What's a good way to do this?
The re.sub function can take a function as an argument so you can combine the replacement and processing steps if you wish:
# p is a compiled regex
# s is a string
def process_match(m):
    # Process the match here.
    return ''

s = p.sub(process_match, s)
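For example, a small sketch where "processing" just means collecting each matched string (remove_and_collect is a made-up helper). Note that sub visits every match anywhere in the string, whereas the loop in the question only looked at the start of the string with match.

```python
import re

def remove_and_collect(pattern, text):
    seen = []

    def process_match(m):
        # Stand-in for the real processing step.
        seen.append(m.group(0))
        # Returning '' removes the matched text from the result.
        return ''

    cleaned = pattern.sub(process_match, text)
    return cleaned, seen

print(remove_and_collect(re.compile(r"\d+"), "a1b22c"))  # ('abc', ['1', '22'])
```

Because the replacement comes from the match object itself, the matched text is never re-interpreted as a regex, so special characters such as * or + are not a problem.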
I'm parsing strings that could have any number of quoted strings inside them (I'm parsing code, and trying to avoid PLY). I want to find out if a substring is quoted, and I have the substring's index. My initial thought was to use re to find all the matches and then figure out the range of indexes they represent.
It seems like I should use re with a regex like \"[^\"]+\"|'[^']+' (I'm avoiding dealing with triple quoted and such strings at the moment). When I use findall() I get a list of the matching strings, which is somewhat nice, but I need indexes.
My substring might be as simple as c, and I need to figure out if this particular c is actually quoted or not.
This is what you want, from the Python re documentation:
re.finditer(pattern, string[, flags])
Return an iterator yielding MatchObject instances over all
non-overlapping matches for the RE pattern in string. The string is
scanned left-to-right, and matches are returned in the order found. Empty
matches are included in the result unless they touch the beginning of
another match.
You can then get the start and end positions from the MatchObjects.
e.g.
[(m.start(0), m.end(0)) for m in re.finditer(pattern, string)]
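Applied to the question, a sketch could look like this. It reuses the quote regex from the question (so triple-quoted strings and escaped quotes are still not handled); is_quoted is an illustrative name.

```python
import re

QUOTED = re.compile(r"\"[^\"]+\"|'[^']+'")

def is_quoted(text, index):
    # True if index falls inside any quoted span, quote characters included.
    return any(m.start() <= index < m.end() for m in QUOTED.finditer(text))

code = 'x = "abc" + c'
print(is_quoted(code, code.index("c")))   # True: the c inside "abc"
print(is_quoted(code, code.rindex("c")))  # False: the bare c at the end
```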
To get the indices of all occurrences:
import re

S = input()  # Source string
k = input()  # Pattern to search for
pattern = re.compile(k)
r = pattern.search(S)
if not r:
    print("(-1, -1)")
while r:
    print("({0}, {1})".format(r.start(), r.end() - 1))
    # Restart one character after the previous match so overlaps are found too
    r = pattern.search(S, r.start() + 1)
This should solve your issue:
pattern=r"(?=(\"[^\"]+\"|'[^']+'))"
Then use the following to get all overlapping indices:
indicesTuple = [(mObj.start(1),mObj.end(1)-1) for mObj in re.finditer(pattern,input)]