Python Regex, how to substitute multiple occurrences with a single pattern? - python

I'm trying to make a fuzzy autocomplete suggestion box that highlights the searched characters with HTML <b></b> tags.
For example, if the user types 'ldi' and one of the suggestions is "Leonardo DiCaprio", the desired outcome is "<b>L</b>eonar<b>d</b>o D<b>i</b>Caprio". The first occurrence of each character is highlighted in order of appearance.
What I'm doing right now is:
import re

def prototype_finding_chars_in_string():
    test_string_list = ["Leonardo DiCaprio", "Brad Pitt", "Claire Danes", "Tobey Maguire"]
    comp_string = "ldi"  # chars to highlight
    regex = ".*?" + ".*?".join([f"({x})" for x in comp_string]) + ".*?"  # results in .*?(l).*?(d).*?(i).*?
    regex_compiled = re.compile(regex, re.IGNORECASE)
    for x in test_string_list:
        re_search_result = re.search(regex_compiled, x)  # correctly filters the test list to include only entries that feature the search chars in order
        if re_search_result:
            print(f"char combination {comp_string} are in {x} result group: {re_search_result.groups()}")
results in
char combination ldi are in Leonardo DiCaprio result group: ('L', 'D', 'i')
Now I want to replace each occurrence in the result groups with <b>[whatever in the result]</b> and I'm not sure how to do it.
What I'm currently doing is looping over the result and using the built-in str.replace method to replace the occurrences:
def replace_with_bold(result_groups, original_string):
    output_string: str = original_string
    for result in result_groups:
        output_string = output_string.replace(result, f"<b>{result}</b>", 1)
    return output_string
This results in:
Highlighted string: <b>L</b>eonar<b>d</b>o D<b>i</b>Caprio
But I think looping like this over the results when I already have the match groups is wasteful. Furthermore, it's not even correct, because it checks the string from the beginning on each iteration. So for the input 'ooo' this is the result:
char combination ooo are in Leonardo DiCaprio result group: ('o', 'o', 'o')
Highlighted string: Le<b><b><b>o</b></b></b>nardo DiCaprio
When it should be Le<b>o</b>nard<b>o</b> DiCapri<b>o</b>
Is there a way to simplify this? Maybe regex here is overkill?

A way using re.split:
import re

test_string_list = ["Leonardo DiCaprio", "Brad Pitt", "Claire Danes", "Tobey Maguire"]

def filter_and_highlight(strings, letters):
    pat = re.compile('(' + (')(.*?)('.join(letters)) + ')', re.I)
    results = []
    for s in strings:
        parts = pat.split(s, 1)
        if len(parts) == 1:
            continue
        res = ''
        for i, p in enumerate(parts):
            if i & 1:
                p = '<b>' + p + '</b>'
            res += p
        results.append(res)
    return results

filter_and_highlight(test_string_list, 'lir')
A particularity of re.split is that captured groups are included by default as parts in the result. Also, even if the first capture matches at the start of the string, an empty part is returned before it; that means the searched letters always end up at odd indexes in the list of substrings.
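For instance, here is a quick sketch of what such a split returns for the question's 'ldi' search; the empty leading part and the letters at odd indexes are visible:

import re

pat = re.compile('(l)(.*?)(d)(.*?)(i)', re.I)
print(pat.split("Leonardo DiCaprio", 1))
# ['', 'L', 'eonar', 'd', 'o D', 'i', 'Caprio']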

This should work:
for result in result_groups:
    output_string = re.sub(fr'(.*?(?!<b>))({result})((?!</b>).*)',
                           r'\1<b>\2</b>\3',
                           output_string,
                           flags=re.IGNORECASE)
On each iteration, the first occurrence of result is replaced by <b>result</b>; the ? makes .* lazy, which is what restricts the replacement to the first occurrence. The replacement is skipped if the character is already enclosed by tags; the (?!<b>) and (?!</b>) lookaheads handle that part. \1, \2 and \3 refer to the first, second and third groups, and the IGNORECASE flag makes the match case-insensitive.
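As a quick check (a sketch reusing the loop above), running it over the 'ooo' groups from the question produces the expected result:

import re

result_groups = ('o', 'o', 'o')  # match groups for the input 'ooo'
output_string = "Leonardo DiCaprio"
for result in result_groups:
    output_string = re.sub(fr'(.*?(?!<b>))({result})((?!</b>).*)',
                           r'\1<b>\2</b>\3',
                           output_string,
                           flags=re.IGNORECASE)
print(output_string)  # Le<b>o</b>nard<b>o</b> DiCapri<b>o</b>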


how to get a pattern repeating multiple times in a string using regular expression

I am still new to regular expressions and the Python re library.
I want to extract all the proper nouns as one whole unit when they are adjacent and separated by a space.
I tried
result = re.findall(r'(\w+)\w*/NNP (\w+)\w*/NNP', tagged_sent_str)
Input: I have a string like
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
Output expected:
[('European Community'), ('European')]
Current output:
[('European','Community')]
But this will only give the pairs, not the single ones. I want all of them.
IIUC, itertools.groupby is more suited for this kind of job:
from itertools import groupby

def join_token(string_, type_='NNP'):
    res = []
    for k, g in groupby([i.split('/') for i in string_.split()], key=lambda x: x[1]):
        if k == type_:
            res.append(' '.join(i[0] for i in g))
    return res

join_token(tagged_sent_str)
Output:
['European Community', 'European']
and it doesn't require a modification if you expect three or more consecutive types:
str2 = "European/NNP Community/NNP Union/NNP French/JJ European/NNP export/VB"
join_token(str2)
Output:
['European Community Union', 'European']
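To make the grouping step concrete, here is a small sketch (not part of the original answer) of what groupby sees for the original string: consecutive tokens are grouped by their tag, and only the NNP runs end up joined.

from itertools import groupby

tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
pairs = [i.split('/') for i in tagged_sent_str.split()]
for tag, group in groupby(pairs, key=lambda x: x[1]):
    print(tag, [word for word, _ in group])
# NNP ['European', 'Community']
# JJ ['French']
# NNP ['European']
# VB ['export']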
Interesting requirement. The code is explained in the comments; it is a very fast solution using only regex:
import re

# make it more complex
text = "export1/VB European0/NNP export/VB European1/NNP Community1/NNP Community2/NNP French/JJ European2/NNP export/VB European2/NNP"

# 1: First clean up the target words, turning word/NNP into word.
#    You could use str.replace, but this shows the technique of
#    back-referencing a group with \index_of_group:
#      re.sub(r'/NNP', '', text)
#      text.replace('/NNP', '')
_text = re.sub(r'(\w+)/NNP', r'\1', text)

# this pattern strips the leading and trailing spaces
RE_FIND_ALL = r'(?:\s+|^)((?:(?:\s|^)?\w+(?=\s+|$)?)+)(?:\s+|$)'
print('RESULT : ', re.findall(RE_FIND_ALL, _text))
OUTPUT:
RESULT : ['European0', 'European1 Community1 Community2', 'European2', 'European2']
Explaining the regex:
(?:\s+|^): skip leading spaces
((?:(?:\s)?\w+(?=\s+|$))+): capture a group built from the non-capturing subgroup (?:(?:\s)?\w+(?=\s+|$)). The subgroup matches each word followed by spaces or the end of line, and the whole sequence of such words is captured by the outer group. If we didn't do this, the match would return only the first word.
(?:\s+|$): strip the trailing space of the sequence
I needed to remove /NNP from the target words because you want to keep a sequence of word/NNP tokens in a single group. A pattern like (word)/NNP (word)/NNP returns two elements in one group rather than a single text, so by removing the suffix the text becomes word word and a regex like ((?:\w+\s)+) can capture the whole sequence of words. It is not quite that simple, though, because we must only capture words that do not still end with /sequence_of_letters; with this approach there is no need to loop over the matched groups and concatenate elements to build a valid text.
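For clarity, here is a quick check of what the intermediate string looks like after the /NNP clean-up step (reusing the text value from above):

import re

text = "export1/VB European0/NNP export/VB European1/NNP Community1/NNP Community2/NNP French/JJ European2/NNP export/VB European2/NNP"
_text = re.sub(r'(\w+)/NNP', r'\1', text)
print(_text)
# export1/VB European0 export/VB European1 Community1 Community2 French/JJ European2 export/VB European2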
NOTE: both solutions work fine if all words are in the format word/sequence_of_letters; if you have words that are not in this format,
you need to fix them first. If you want to keep such a word, add /NNP at its end; otherwise add /DUMMY so it is removed.
Using re.split, but slower because I'm using a list comprehension to fix the result:
import re

# make it more complex
text = "export1/VB Europian0/NNP export/VB Europian1/NNP Community1/NNP Community2/NNP French/JJ Europian2/NNP export/VB Europian2/NNP export/VB export/VB"

RE_SPLIT = r'\w+/[^N]\w+'
result = [x.replace('/NNP', '').strip() for x in re.split(RE_SPLIT, text) if x.strip()]
print('RESULT: ', result)
You'd like to get a pattern but with some parts deleted from it.
You can get it with two successive regexes:
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
[ re.sub(r"/NNP","",s) for s in re.findall(r"\w+/NNP(?:\s+\w+/NNP)*",tagged_sent_str) ]
['European Community', 'European']

searching a word in the column pandas dataframe python

I have two text columns and I would like to find whether a word from one column is present in the other. I wrote the code below, which works very well, but it detects whether a word is present anywhere in the string. For example, it will find "ha" in "ham". I want to use a regular expression instead, but I am stuck. I came across this post and looked at the second answer, but I haven't been able to modify it for my purpose. I would like to do something similar.
I would appreciate help and/or any pointers
import pandas as pd

d = {'emp': ['abc d. efg', 'za', 'sdfadsf '], 'vendor': ['ABCD enterprise', 'za industries', '']}
df = pd.DataFrame(data=d)
df['clean_empy_name'] = df["emp"].str.lower().str.replace(r'\W', ' ')

def check_subset(vendor, employee):
    s = []
    for n in employee.split():
        # n = " " + n + "[^a-zA-Z\d:]"
        if (str(n) in vendor.lower()) & (len(str(n)) > 1):
            s.append(n)
    return s

check_subset("ABC-xy 54", "54 xy")

df['emp_name_find_in_vendor'] = df.apply(lambda row: check_subset(row['vendor'], row['clean_empy_name']), axis=1)
df
Update 2:
I updated my dataframe as below:

d = {'emp': ['abc d. efg', 'za', 'sdfadsf ', 'abc', 'yuma'], 'vendor': ['ABCD enterprise', 'za industries', '', 'Person Vue\Cisco', 'U OF M CONTLEARNING']}
df = pd.DataFrame(data=d)
df['clean_empy_name'] = df["emp"].str.lower().str.replace(r'\W', ' ')
I used the code provided in the first answer and it fails:
- in the case of 'Person Vue\Cisco' it throws the error "error: bad escape \c"; if I remove the \ in 'Person Vue\Cisco', the code runs fine
- in the case of 'U OF M CONTLEARNING' it returns 'u' and 'm' when clearly they are not a match
Yes, you can! It is going to be a little bit messy, so let me construct it in a few steps:
First, let's just create a regular expression for the single case of check_subset("ABC-xy 54", "54 xy"):
We will use re.findall(pattern, string) to find all the occurrences of pattern in string
The regex pattern will basically say "any of the words":
for the "any" we use the | (or) operator
for constructing words we need parentheses to group characters together. However, plain parentheses (word) create a capturing group that the engine keeps track of so the groups can be reused later; since we are not interested in that, we can create a non-capturing group by adding ?:, as in (?:word)
import re
re.findall('(?:54)|(?:xy)', 'ABC-xy 54')
# -> ['xy', '54']
Now, we have to construct the pattern each time:
Split into words
Wrap each word inside a non-capturing group (?:)
Join all of these groups by |
re.findall('|'.join(['(?:'+x+')' for x in '54 xy'.split()]), 'ABC-xy 54')
One minor thing: since the last row's vendor is empty and you seem to want no matches for it (technically, the empty string matches everything), we have to add a minor check. So we can rewrite your function to be:
def check_subset_regex(vendor, employee):
    if vendor == '':
        return []
    pattern = '|'.join(['(?:' + x + ')' for x in vendor.lower().split(' ')])
    return re.findall(pattern, employee)
And then we can apply the same way:
df['emp_name_find_in_vendor_regex'] = df.apply(lambda row: check_subset_regex(row['vendor'],row['clean_empy_name']), axis=1)
One final comment is that your solution matches partial words, so the employee Tom Sawyer would match "Tom" to the vendor "Atomic S.A.". The regex function I provided here will not count this as a match; should you want that behaviour, the regex would become a little more complicated.
EDIT: Removing punctuation marks from vendors
You could either add a new column as you did with clean_empy_name, or simply add the removal to the function, like so (you will need to import string to get string.punctuation, or just use a string containing all the symbols you want to substitute):
import re
import string

def check_subset_regex(vendor, employee):
    if vendor == '':
        return []
    clean_vnd = re.sub('[' + string.punctuation + ']', '', vendor)
    pattern = '|'.join(['(?:' + x + ')' for x in clean_vnd.lower().split(' ')])
    return re.findall(pattern, employee)
In the spirit of teaching to fish :), in regex the square brackets [] denote "any of these characters", so [abc] is the same as a|b|c.
So the re.sub line substitutes any occurrence of the string.punctuation characters (which evaluate to !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~) with '' (removing them).
EDIT2: Adding the possibility of a single non-alphanumeric character at the end of each searchword:
def check_subset_regex(vendor, employee):
    if vendor == '':
        return []
    clean_vnd = re.sub('[' + string.punctuation + ']', '', vendor)
    pattern = '|'.join(['(?:' + x + '[^a-zA-Z0-9]?)' for x in clean_vnd.lower().split(' ')])
    return re.findall(pattern, employee)
In this case we are using:
- ^ as the first character inside [] (called a character class) denotes any character except those specified in the class, e.g. [^abc] would match anything that is not a, b or c (so d, or a white space, or #)
- and the ?, which means the previous symbol is optional...
So, [^a-zA-Z0-9]? means an optional single non-alphanumeric character.
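A quick illustration of that optional trailing character class, using made-up strings (a sketch, not from the original answer):

import re

# 'abc-' keeps the single trailing non-alphanumeric character,
# while the 'abc' in 'abcd' matches without one
print(re.findall(r'(?:abc[^a-zA-Z0-9]?)', 'abc-xy abcd'))  # ['abc-', 'abc']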

How to replace a word which occurs before another word in python

I want to replace (re-spell) a word A in a text string with another word B if word A occurs before an operator. Word A can be any word.
E.G:
Hi I am Not == you
Since "Not" occurs before operator "==", I want to replace it with alist["Not"]
So the above sentence should change to
Hi I am alist["Not"] == you
Another example
My height > your height
should become
My alist["height"] > your height
Edit:
On @Paul's suggestion, I am putting in the code which I wrote myself.
It works, but it's too bulky and I am not happy with it.
operators = ["==", ">", "<", "!="]
text_list = text.split(" ")
for index in range(len(text_list)):
    if text_list[index] in operators:
        prev = text_list[index - 1]
        if "." in prev:
            tokens = prev.split(".")
            prev = "alist"
            for token in tokens:
                prev = "%s[\"%s\"]" % (prev, token)
        else:
            prev = "alist[\"%s\"]" % prev
        text_list[index - 1] = prev
text = " ".join(text_list)
This can be done using regular expressions
import re
...
def replacement(match):
    return "alist[\"{}\"]".format(match.group(0))
...
re.sub(r"[^ ]+(?= +==)", replacement, s)
If the space between the word and the "==" in your case is not needed, the last line becomes:
re.sub(r"[^ ]+(?= *==)", replacement, s)
I'd highly recommend looking into regular expressions, and the Python implementation of them, as they are really useful.
Explanation for my solution:
re.sub(pattern, replacement, s) replaces occurrences of pattern, given as a regular expression, with a given string or with the output of a function.
I use the output of a function that puts the whole matched text into the alist["..."] construct (match.group(0) returns the whole match).
[^ ] matches anything but a space.
+ matches the last subpattern as often as possible, but at least once.
* matches the last subpattern as often as possible, but it is optional (zero or more times).
(?=...) is a lookahead. It checks if the stuff after the current cursor position matches the pattern inside the parentheses, but doesn't include them in the final match (at least not in .group(0), if you have groups inside a lookahead, those are retrievable by .group(index)).
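As a side note, not part of the original answer: the same idea can be generalized to all of the operators from the question (==, >, <, !=) by putting an alternation inside the lookahead. A rough sketch, where respell is a hypothetical helper name:

import re

def respell(s):
    # replace the word in front of any comparison operator with alist["word"]
    return re.sub(r'[^ ]+(?= +(?:==|!=|>|<)(?: |$))',
                  lambda m: 'alist["{}"]'.format(m.group(0)),
                  s)

print(respell("Hi I am Not == you"))       # Hi I am alist["Not"] == you
print(respell("My height > your height"))  # My alist["height"] > your height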
str = "Hi I am Not == you"
s = str.split()
y = ''
str2 = ''
for x in s:
    if x == "==":  # only handles the "==" operator
        str2 = str.replace(y, 'alist["' + y + '"]')
        break
    y = x
print(str2)
You could try using the regular expression library. I was able to create a simple solution to your problem, as shown here.
import re
data = "Hi I am Not == You"
x = re.search(r'(\w+) ==', data)
print(x.groups())
In this code, re.search looks for the pattern of (1 or more) alphanumeric characters followed by operator (" ==") and stores the result ("Hi I am Not ==") in variable x.
Then, for swapping, you could use the re.sub() method which CodenameLambda suggested.
I'd also recommend learning how to use regular expressions, as they are useful for solving many different problems and are similar across different programming languages.

Substitute specific matches using regex

I want to execute substitutions using regex, not for all matches but only for specific ones. However, re.sub substitutes for all matches. How can I do this?
Here is an example.
Say, I have a string with the following content:
FOO=foo1
BAR=bar1
FOO=foo2
BAR=bar2
BAR=bar3
What I want to do is this:
re.sub(r'^BAR', '#BAR', s, index=[1,2], flags=re.MULTILINE)
to get the below result.
FOO=foo1
BAR=bar1
FOO=foo2
#BAR=bar2
#BAR=bar3
You could pass a replacement function to re.sub that keeps track of the count and checks whether the given index should be substituted:
import re

s = '''FOO=foo1
BAR=bar1
FOO=foo2
BAR=bar2
BAR=bar3'''

i = 0
index = {1, 2}

def repl(x):
    global i
    if i in index:
        res = '#' + x.group(0)
    else:
        res = x.group(0)
    i += 1
    return res

print(re.sub(r'^BAR', repl, s, flags=re.MULTILINE))
Output:
FOO=foo1
BAR=bar1
FOO=foo2
#BAR=bar2
#BAR=bar3
You could:
- Split your string using s.splitlines()
- Iterate over the individual lines in a for loop
- Track how many matches you have found so far
- Only perform substitutions on those matches in the numerical ranges you want (e.g. matches 1 and 2)
- Then join the lines back into a single string (if need be), as in the sketch below.
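A minimal sketch of that approach (sub_selected is a hypothetical helper, not from the original answer):

import re

s = '''FOO=foo1
BAR=bar1
FOO=foo2
BAR=bar2
BAR=bar3'''

def sub_selected(pattern, repl, text, wanted=(1, 2)):
    # split into lines, count matches of `pattern`, substitute only the
    # match indexes listed in `wanted`, then join the lines back together
    out, seen = [], 0
    for line in text.splitlines():
        if re.match(pattern, line):
            if seen in wanted:
                line = re.sub(pattern, repl, line)
            seen += 1
        out.append(line)
    return '\n'.join(out)

print(sub_selected(r'^BAR', '#BAR', s))
# FOO=foo1
# BAR=bar1
# FOO=foo2
# #BAR=bar2
# #BAR=bar3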

replacing all regex matches in single line

I have a dynamic regexp and I don't know in advance how many groups it has.
I would like to replace all matches with XML tags.
example
re.sub("(this).*(string)","this is my string",'<markup>\anygroup</markup>')
>> "<markup>this</markup> is my <markup>string</markup>"
is that even possible in single line?
For a constant regexp like in your example, do
re.sub("(this)(.*)(string)",
r'<markup>\1</markup>\2<markup>\3</markup>',
text)
Note that you need to enclose .* in parentheses as well if you don't want to lose it.
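To see why, compare what happens when the middle part is not captured and re-inserted (a quick sketch):

import re

text = "this is my string"
# the " is my " in the middle is swallowed by the uncaptured .*
print(re.sub("(this).*(string)", r'<markup>\1</markup><markup>\2</markup>', text))
# <markup>this</markup><markup>string</markup>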
Now if you don't know what the regexp looks like, it's more difficult, but should be doable.
pattern = "(this)(.*)(string)"
re.sub(pattern,
lambda m: ''.join('<markup>%s</markup>' % s if n % 2 == 0
else s for n, s in enumerate(m.groups())),
text)
If the first thing matched by your pattern doesn't necessarily have to be marked up, use this instead, with the first group optionally matching some prefix text that should be left alone:
pattern = "()(this)(.*)(string)"
re.sub(pattern,
lambda m: ''.join('<markup>%s</markup>' % s if n % 2 == 1
else s for n, s in enumerate(m.groups())),
text)
You get the idea.
If your regexps are complicated and you're not sure you can make everything part of a group, where only every second group needs to be marked up, you might do something smarter with a more complicated function:
pattern = "(this).*(string)"
def replacement(m):
s = m.group()
n_groups = len(m.groups())
# assume groups do not overlap and are listed left-to-right
for i in range(n_groups, 0, -1):
lo, hi = m.span(i)
s = s[:lo] + '<markup>' + s[lo:hi] + '</markup>' + s[hi:]
return s
re.sub(pattern, replacement, text)
If you need to handle overlapping groups, you're on your own, but it should be doable.
re.sub() will replace everything it can. If you pass it a function for repl then you can do even more.
Yes, this can be done in a single line.
>>> re.sub(r"\b(this|string)\b", r"<markup>\1</markup>", "this is my string")
'<markup>this</markup> is my <markup>string</markup>'
\b ensures that only complete words are matched.
So if you have a list of words that you need to mark up, you could do the following:
>>> mywords = ["this", "string", "words"]
>>> myre = r"\b(" + "|".join(mywords) + r")\b"
>>> re.sub(myre, r"<markup>\1</markup>", "this is my string with many words!")
'<markup>this</markup> is my <markup>string</markup> with many <markup>words</markup>!'
