searching a word in the column pandas dataframe python - python

I have two text columns and I would like to find whether a word from one column is present in another. I wrote the below code, which works very well, but it detects if a word is present anywhere in the string. For example, it will find "ha" in "ham". I want to use regex expression instead, but I am stuck. I came across this post and looked at the second answer, but I haven't been able to modify it for my purpose. I would like to do something similar.
I would appreciate help and/or any pointers
d = {'emp': ['abc d. efg', 'za', 'sdfadsf '], 'vendor': ['ABCD enterprise', 'za industries', '' ]}
df = pd.DataFrame(data=d)
df['clean_empy_name']=df["emp"].str.lower().str.replace('\W', ' ')
def check_subset(vendor, employee):
s = []
for n in employee.split():
# n=" " + n +"[^a-zA-Z\d:]"
if ((str(n) in vendor.lower()) & (len(str(n))>1)):
s.append(n)
return s
check_subset("ABC-xy 54", "54 xy")
df['emp_name_find_in_vendor'] = df.apply(lambda row: check_subset(row['vendor'],row['clean_empy_name']), axis=1)
df
#########update 2
i updated my dataframe as below
d = {'emp': ['abc d. efg', 'za', 'sdfadsf ','abc','yuma'], 'vendor': ['ABCD enterprise', 'za industries', '','Person Vue\Cisco','U OF M CONTLEARNING' ]}
df = pd.DataFrame(data=d)
df['clean_empy_name']=df["emp"].str.lower().str.replace('\W', ' ')
I used code provided by first answer and it fails
in case of 'Person Vue\Cisco' it throws the error error: bad escape \c. If i remove \ in 'Person Vue\Cisco', code runs fine
in case of 'U OF M CONTLEARNING' it return u and m when clearly they are not a match

Yes, you can! It is going to be a little bit messy so let me construct in a few steps:
First, let's just create a regular expression for the single case of check_subset("ABC-xy 54", "54 xy"):
We will use re.findall(pattern, string) to find all the occurrences of pattern in string
The regex pattern will basically say "any of the words":
for the "any" we use the | (or) operator
for constructing words we need to use the parenthesis to group together... However, parenthesis (word) create a group that keeps track, so we could later call reuse these groups, since we are not interested we can create a non-capturing group by adding ?: as follows: (?:word)
import re
re.findall('(?:54)|(?:xy)', 'ABC-xy 54')
# -> ['xy', '54']
Now, we have to construct the pattern each time:
Split into words
Wrap each word inside a non-capturing group (?:)
Join all of these groups by |
re.findall('|'.join(['(?:'+x+')' for x in '54 xy'.split()]), 'ABC-xy 54')
One minor thing, since the last row's vendor is empty and you seem to want no matches (technically, the empty string matches with everything) we have to add a minor check. So we can rewrite your function to be:
def check_subset_regex(vendor, employee):
if vendor == '':
return []
pattern = '|'.join(['(?:'+x+')' for x in vendor.lower().split(' ')])
return re.findall(pattern, employee)
And then we can apply the same way:
df['emp_name_find_in_vendor_regex'] = df.apply(lambda row: check_subset_regex(row['vendor'],row['clean_empy_name']), axis=1)
One final comment is that your solution matches partial words, so employee Tom Sawyer would match "Tom" to the vendor "Atomic S.A.". The regex function I provided here will not give this as a match, should you want to do this the regex would become a little more complicated.
EDIT: Removing punctuation marks from vendors
You could either add a new column as you did with clean_employee, or simply add the removal to the function, as so (you will need to import string to get the string.punctuation, or just add in there a string with all the symbols you want to substitute):
def check_subset_regex(vendor, employee):
if vendor == '':
return []
clean_vnd = re.sub('[' + string.punctuation + ']', '', vendor)
pattern = '|'.join(['(?:'+x+')' for x in clean_vnd.lower().split(' ')])
return re.findall(pattern, employee)
In the spirit of teaching to fish :), in regex the [] denote any of these characters... So [abc] would be the same to a|b|c.
So the re.sub line will substitute any occurrence of the string.punctuation (which evaluates to !"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~) characters by a '' (removing them).
EDIT2: Adding the possibility of a single non-alphanumeric character at the end of each searchword:
def check_subset_regex(vendor, employee):
if vendor == '':
return []
clean_vnd = re.sub('[' + string.punctuation + ']', '', vendor)
pattern = '|'.join(['(?:'+x+'[^a-zA-Z0-9]?)' for x in clean_vnd.lower().split(' ')])
return re.findall(pattern, employee)
In this case we are using:
- ^ as the first character inside a [] (called character class), denotes any character except for those specified in the character class, e.g. [^abc] would match anything that is not a or b or c (so d, or a white space, or #)
- and the ?, which means the previous symbol is optional...
So, [^a-zA-Z0-9]? means an optional single non-alphanumeric character.

Related

Python Regex, how to substitute multiple occurrences with a single pattern?

I'm trying to make a fuzzy autocomplete suggestion box that highlights searched characters with HTML tags <b></b>
For example, if the user types 'ldi' and one of the suggestions is "Leonardo DiCaprio" then the desired outcome is "Leonardo DiCaprio". The first occurrence of each character is highlighted in order of appearance.
What I'm doing right now is:
def prototype_finding_chars_in_string():
test_string_list = ["Leonardo DiCaprio", "Brad Pitt","Claire Danes","Tobey Maguire"]
comp_string = "ldi" #chars to highlight
regex = ".*?" + ".*?".join([f"({x})" for x in comp_string]) + ".*?" #results in .*?(l).*?(d).*?(i).*
regex_compiled = re.compile(regex, re.IGNORECASE)
for x in test_string_list:
re_search_result = re.search(regex_compiled, x) # correctly filters the test list to include only entries that features the search chars in order
if re_search_result:
print(f"char combination {comp_string} are in {x} result group: {re_search_result.groups()}")
results in
char combination ldi are in Leonardo DiCaprio result group: ('L', 'D', 'i')
Now I want to replace each occurrence in the result groups with <b>[whatever in the result]</b> and I'm not sure how to do it.
What I'm currently doing is looping over the result and using the built-in str.replace method to replace the occurrences:
def replace_with_bold(result_groups, original_string):
output_string: str = original_string
for result in result_groups:
output_string = output_string.replace(result,f"<b>{result}</b>",1)
return output_string
This results in:
Highlighted string: <b>L</b>eonar<b>d</b>o D<b>i</b>Caprio
But I think looping like this over the results when I already have the match groups is wasteful. Furthermore, it's not even correct because it checked the string from the beginning each loop. So for the input 'ooo' this is the result:
char combination ooo are in Leonardo DiCaprio result group: ('o', 'o', 'o')
Highlighted string: Le<b><b><b>o</b></b></b>nardo DiCaprio
When it should be Le<b>o</b>nard<b>o</b> DiCapri<b>o</b>
Is there a way to simplify this? Maybe regex here is overkill?
A way using re.split:
test_string_list = ["Leonardo DiCaprio", "Brad Pitt", "Claire Danes", "Tobey Maguire"]
def filter_and_highlight(strings, letters):
pat = re.compile( '(' + (')(.*?)('.join(letters)) + ')', re.I)
results = []
for s in strings:
parts = pat.split(s, 1)
if len(parts) == 1: continue
res = ''
for i, p in enumerate(parts):
if i & 1:
p = '<b>' + p + '</b>'
res += p
results.append(res)
return results
filter_and_highlight(test_string_list, 'lir')
A particularity of re.split is that captures are included by default as parts in the result. Also, even if the first capture matches at the start of the string, an empty part is returned before it, that means that searched letters are always at odd indexes in the list of substrings.
This should work:
for result in result_groups:
output_string = re.sub(fr'(.*?(?!<b>))({result})((?!</b>).*)',
r'\1<b>\2</b>\3',
output_string,
flags=re.IGNORECASE)
on each iteration first occurrence of result (? makes .* lazy this together does the magic of first occurrence) will be replaced by <b>result</b> if it is not enclosed by tag before ((?!<b>) and (?!</b>) does that part) and \1 \2 \3 are first, second and third group additionally we will use IGNORECASE flag to make it case insensitive.

Regex .search function to ignore name suffixes

I created a regex definition that should read suffixes (eg., jr/sr/etc.) at the end of a name (space or comma) and then return the name if the suffix is in the name and then move on the next part of the if-then-else statement, which splits and does a reverse join on names with the last name, first name format. I can't figure out what the problem is...but the re.search function is returning all values, instead of just the ones that are part of the name suffixes. Please help!
d = {'Person': ['red robin, jr', 'bluejay, bluie', 'finch, mustard e', 'awing blackcrow' ]}
df = pd.DataFrame(data=d)
def separatetypes(name):
if re.search(r'(?:\,|\s+(?:i|ii|iii|iv|jr|sr))*$', name):
return name
elif ',' in name:
namesplit = name.split(',',1)
newname = str(namesplit[1]) + ' ' + str(namesplit[0])
return newname
else:
return name
df['Person'] = df['Person'].apply(separatetypes)
You have a * in the pattern, which means "zero or more repetitions"; as a result, it's returning a match when it found zero suffixes.
The pattern you probably want is r'(?:,|\s+(?:i|ii|iii|iv|jr|sr))$' (without the * and omitting the unnecessary \ before the comma) or r'(?:,|\s+)(?:i|ii|iii|iv|jr|sr)$' (which allows a suffix separated by comma, rather than a trailing comma).
As a general tool, sites like https://regex101.com/ (there are a bunch of them) can help develop regexes by explaining what's going on and by immediately showing results.

how to get a pattern repeating multiple times in a string using regular expression

I am still new to regular expressions, as in the Python library re.
I want to extract all the proper nouns as a whole word if they are separated by space.
I tried
result = re.findall(r'(\w+)\w*/NNP (\w+)\w*/NNP', tagged_sent_str)
Input: I have a string like
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
Output expected:
[('European Community'), ('European')]
Current output:
[('European','Community')]
But this will only give the pairs not the single ones. I want all the kinds
IIUC, itertools.groupby is more suited for this kind of job:
from itertools import groupby
def join_token(string_, type_ = 'NNP'):
res = []
for k, g in groupby([i.split('/') for i in string_.split()], key=lambda x:x[1]):
if k == type_:
res.append(' '.join(i[0] for i in g))
return res
join_token(tagged_sent_str)
Output:
['European Community', 'European']
and it doesn't require a modification if you expect three or more consecutive types:
str2 = "European/NNP Community/NNP Union/NNP French/JJ European/NNP export/VB"
join_token(str2)
Output:
['European Community Union', 'European']
Interesting requirement. Code is explained in the comments, a very fast solution using only REGEX:
import re
# make it more complex
text = "export1/VB European0/NNP export/VB European1/NNP Community1/NNP Community2/NNP French/JJ European2/NNP export/VB European2/NNP"
# 1: First clean app target words word/NNP to word,
# you can use str.replace but just to show you a technique
# how to to use back reference of the group use \index_of_group
# re.sub(r'/NNP', '', text)
# text.replace('/NNP', '')
_text = re.sub(r'(\w+)/NNP', r'\1', text)
# this pattern strips the leading and trailing spaces
RE_FIND_ALL = r'(?:\s+|^)((?:(?:\s|^)?\w+(?=\s+|$)?)+)(?:\s+|$)'
print('RESULT : ', re.findall(RE_FIND_ALL, _text))
OUTPUT:
RESULT : ['European0', 'European1 Community1 Community2', 'European2', 'European2']
Explaining REGEX:
(?:\s+|^) : skip leading spaces
((?:(?:\s)?\w+(?=\s+|$))+): capture a group of non copture subgroup (?:(?:\s)?\w+(?=\s+|$)) subgroup will match all sequence words folowed by spaces or end of line. and that match will be captured by the global group. if we don't do this the match will return only the first word.
(?:\s+|$) : remove trailing space of the sequence
I needed to remove /NNP from the target words because you want to keep the sequence of word/NNP in a single group, doing something like this (word)/NNP (word)/NPP this will return two elements in one group but not as a single text, so by removing it the text will be word word so REGEX ((?:\w+\s)+) will capture the sequence of word but it's not a simple as this because we need to capture the word that doesn't contain /sequence_of_letter at the end, no need to loop over the matched groups to concatenate element to build a valid text.
NOTE: both solutions work fine if all words are in this format word/sequence_of_letters; if you have words that are not in this format
you need to fix those. If you want to keep them add /NPP at the end of each word, else add /DUMMY to remove them.
Using re.split but slow because I'm using list comprehensive to fix result:
import re
# make it more complex
text = "export1/VB Europian0/NNP export/VB Europian1/NNP Community1/NNP Community2/NNP French/JJ Europian2/NNP export/VB Europian2/NNP export/VB export/VB"
RE_SPLIT = r'\w+/[^N]\w+'
result = [x.replace('/NNP', '').strip() for x in re.split(RE_SPLIT, text) if x.strip()]
print('RESULT: ', result)
You'd like to get a pattern but with some parts deleted from it.
You can get it with two successive regexes:
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
[ re.sub(r"/NNP","",s) for s in re.findall(r"\w+/NNP(?:\s+\w+/NNP)*",tagged_sent_str) ]
['European Community', 'European']

Python re.sub() optimization

I have a python list with each string being one of the following 4 possible options like this (of course the names would be different):
Mr: Smith\n
Mr: Smith; John\n
Smith\n
Smith; John\n
I want these to be corrected to:
Mr,Smith,fname\n
Mr,Smith,John\n
title,Smith,fname\n
title,Smith,John\n
Easy enough to do with 4 re.sub():
with open ("path/to/file",'r') as fileset:
dataset = fileset.readlines()
for item in dataset:
dataset = [item.strip() for item in dataset] #removes some misc. white noise
item = re.sub((.*):\W(.*);\W,r'\g<1>'+','+r'\g<2>'+',',item)
item = re.sub((.*);\W(.*),'title,'+r'\g<1>'+','+r'\g<2>',item)
item = re.sub((.*):\W(.*),r'\g<1>'+','+r'\g<2>'+',fname',item)
item = re.sub((*.),'title,'+r'\g<1>'+',fname',item)
While this is fine for the dataset I'm using, I want to be more efficient.
Is there a single operation that can simplify this process?
Please pardon if I forgot a quote or some such; I'm not at my workstation now and I'm aware I've stripped the newline (\n).
Thank you,
Brief
Instead of running two loops, you can reduce it to just one line. Adapted from How to iterate over the file in Python (and using the code in my Code section):
f = open("path/to/file",'r')
while True:
x = f.readline()
if not x: break
print re.sub(r, repl, x)
See Python - How to use regexp on file, line by line, in Python for other alternatives.
Code
For viewing sake I've changed your file to an array.
See regex in use here
^(?:([^:\r\n]+):\W*)?([^;\r\n]+)(?:;\W*(.+))?
Note: You don't need all that in python, I do in order to show it on regex101, so your regex would actually just be ^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?
Usage
See code in use here
import re
a = [
"Mr: Smith",
"Mr: Smith; John",
"Smith",
"Smith; John"
]
r = r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?"
def repl(m):
return (m.group(1) or "title" ) + "," + m.group(2) + "," + (m.group(3) or "fname")
for s in a:
print re.sub(r, repl, s)
Explanation
^ Assert position at the start of the line
(?:([^:]+):\W*)? Optionally match the following
([^:]+) Capture any character except : one or more times into capture group 1
: Match this literally
\W* Match any number of non-word characters (copied from OP's original code, I assume \s* can be used instead)
([^;]+) Group any character except ; one or more times into capture group 2
(?:;\W*(.+))? Optionally match the following
; Match this literally
\W* Match any number of non-word characters (copied from OP's original code, I assume \s* can be used instead)
(.+) Capture any character one or more times into capture group 3
Given the above explanation of the regex part. The re.sub(r, repl, s) works as follows:
repl is a callback to the repl function which returns:
group 1 if it captured anything, title otherwise
group 2 (it's supposedly always set - using OP's logic here again)
group 3 if it captured anything, fname otherwise
IMHO, RegEx are just too complex here, you can use classic string function to split your string item in chunks. For that, you can use partition (or rpartition).
First, split your item string in "records", like that:
item = "Mr: Smith\n Mr: Smith; John\n Smith\n Smith; John\n"
records = item.splitlines()
# -> ['Mr,Smith,fname', 'Mr,Smith,John', 'title,Smith,fname', 'title,Smith,John']
Then, you can create a short function to normalize each "record".
Here is an example:
def normalize_record(record):
# type: (str) -> str
name, _, fname = record.partition(';')
title, _, name = name.rpartition(':')
title = title.strip() or 'title'
name = name.strip()
fname = fname.strip() or 'fname'
return "{0},{1},{2}".format(title, name, fname)
This function is easier to understand than a collection of RegEx. And, in most case, it is faster.
For a better integration, you can define another function to handle each item:
def normalize(row):
records = row.splitlines()
return "\n".join(normalize_record(record) for record in records) + "\n"
Demo:
item = "Mr: Smith\n Mr: Smith; John\n Smith\n Smith; John\n"
item = normalize(item)
You get:
'Mr,Smith,fname\nMr,Smith,John\ntitle,Smith,fname\ntitle,Smith,John\n'

How can I use regex to search inside sentence -not a case sensitive

I'm a newbie to Regular expression in Python :
I have a list that i want to search if it's contain a employee name.
The employee name can be :
it can be at the beginning followed by space.
followed by ®
OR followed by space
OR Can be at the end and space before it
not a case sensitive
ListSentence = ["Steve®", "steveHotel", "Rob spring", "Car Daniel", "CarDaniel","Done daniel"]
ListEmployee = ["Steve", "Rob", "daniel"]
The output from the ListSentence is:
["Steve®", "Rob spring", "Car Daniel", "Done daniel"]
First take all your employee names and join them with a | character and wrap the string so it looks like:
(?:^|\s)((?:Steve|Rob|Daniel)(?:®)?)(?=\s|$)
By first joining all the names together you avoid the performance overhead of using a nested set of for next loops.
I don't know python well enough to offer a python example, however in powershell I'd write it like this
[array]$names = #("Steve", "Rob", "daniel")
[array]$ListSentence = #("Steve®", "steveHotel", "Rob spring", "Car Daniel", "CarDaniel","Done daniel")
# build the regex, and insert the names as a "|" delimited string
$Regex = "(?:^|\s)((?:" + $($names -join "|") + ")(?:®)?)(?=\s|$)"
# use case insensitive match to find any matching array values
$ListSentence -imatch $Regex
Yields
Steve®
Rob spring
Car Daniel
Done daniel
Why do you want to use regular expressions? I'd generally recommend avoiding them in Python - you can use string methods instead.
For example:
def string_has_employee_name_in_it(test_string):
test_string = test_string.lower() # case insensitive
for name in ListEmployee:
name = name.lower()
if name == test_string:
return True
elif name + '®' == test_string:
return True
elif test_string.endswith(' ' + name):
return True
elif test_string.startswith(name + ' '):
return True
elif (' ' + name + ' ') in test_string:
return True
return False
final_list = []
for string in ListSentence:
if string_has_employee_name_in_it(string):
final_list.append(string)
final_list is the list you want. This is longer than a regex, but it's also a lot easier to parse and maintain. You can make it a lot shorter in various ways (e.g. combining the tests in the function, and using a list comprehension instead of a loop), but as you're starting out with Python it's a good idea to be clear with what's going on.
I don't think you need to check for all of those scenarios. I think all you need to do is check for word breaks.
You can join the ListEmployee list with | to make an either or regex (also, lowercase it for case-insensitivity), surrounded by \b for word breaks, and that should work:
regex = '|'.join(ListEmployee).lower()
import re
[l for l in ListSentence if re.search(r'\b(%s)\b' % regex, l.lower())]
Should output:
['Steve\xb6\xa9', 'Rob spring', 'Car Daniel', 'Done daniel']
If you're just looking for strings containing a space, as your example indicates, it should be something like this:
[i for i in ListSentence if i.endswith('®') or (' ' in i)]
A possible solution:
import re
ListSentence = ["Steve®", "steveHotel", "Rob spring", "Car Daniel", "CarDaniel","Done daniel"]
ListEmployee = ["Steve", "Rob", "daniel"]
def findEmployees(employees, sentence):
retval = []
for employee in employees:
expr = re.compile(r'(^%(employee)s(®)?(\s|$))|((^|\s)%(employee)s(®)?(\s|$))|((^|\s)%(employee)s(®)?$)'
% {'employee': employee},
re.IGNORECASE)
for part in sentence:
if expr.search(part):
retval.append(part)
return retval
findEmployees(ListEmployee, ListSentence)
>> Returns ['Steve\xc3\x82\xc2\xae', 'Rob spring', 'Car Daniel', 'Done daniel']

Categories