re.IGNORECASE flag not working with .str.extract - python

I have the dataframe below and have created a column to categorise rows based on specific text within a string.
However, when I pass the re.IGNORECASE flag it is still case sensitive. Why?
Dataframe
test_data = {
    "first_name": ['Bruce', 'Clark', 'Bruce', 'James', 'Nanny', 'Dot'],
    "last_name": ['Lee', 'Kent', 'Banner', 'Bond', 'Mc Phee', 'Cotton'],
    "title": ['mr', 'mr', 'mr', 'mr', 'mrs', 'mrs'],
    "text": ["He is a Kung Fu master", "Wears capes and tight Pants", "Cocktails shaken not stirred", "angry Green man", "suspect scottish accent", "East end legend"],
    "age": [32, 33, 28, 30, 42, 80]
}
df = pd.DataFrame(test_data)
code
category_dict = {
    "Kung Fu": "Martial Art",
    "capes": "Clothing",
    "cocktails": "Drink",
    "green": "Colour",
    "scottish": "Scotland",
    "East": "Direction"
}
df['category'] = (
    df['text'].str.extract(
        fr"\b({'|'.join(category_dict.keys())})\b",
        flags=re.IGNORECASE)[0].map(category_dict))
Expected output
first_name last_name title text age category
0 Bruce Lee Mr He is a Kung Fu master 32 Martial Art
1 Clark Kent Mr Wears capes and tight Pants 33 Clothing
2 Bruce Banner Mr Cocktails shaken not stirred 28 Drink
3 James Bond Mr angry Green man 30 Colour
4 Nanny Mc Phee Mrs suspect scottish accent 42 Scotland
5 Dot Cotton Mrs East end legend 80 Direction
I have searched the docs and have found no pointers, so any help would be appreciated!

Here is one way to do it.
The issue you're facing is that while the extract ignores case, mapping the extracted string to the dictionary is still case sensitive.
# create a dictionary with lower-case keys
# (I duplicated the dictionary in case you need to keep the original keys;
# alternatively, you can convert the category_dict keys to lower case in place)
cd = {k.lower(): v for k, v in category_dict.items()}
# convert the extracted word to lower case, then map with the lower-case dict
df['category'] = (
    df['text'].str.extract(
        fr"\b({'|'.join(category_dict.keys())})\b",
        flags=re.IGNORECASE)[0].str.lower().map(cd))
df
df
first_name last_name title text age category
0 Bruce Lee mr He is a Kung Fu master 32 Martial Art
1 Clark Kent mr Wears capes and tight Pants 33 Clothing
2 Bruce Banner mr Cocktails shaken not stirred 28 Drink
3 James Bond mr angry Green man 30 Colour
4 Nanny Mc Phee mrs suspect scottish accent 42 Scotland
5 Dot Cotton mrs East end legend 80 Direction
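If you'd rather not keep a second dictionary bound to its own name, a variant of the same idea builds the lower-cased mapping inline; a minimal sketch on two of the rows:

```python
import re
import pandas as pd

category_dict = {
    "Kung Fu": "Martial Art",
    "capes": "Clothing",
    "cocktails": "Drink",
    "green": "Colour",
    "scottish": "Scotland",
    "East": "Direction",
}
df = pd.DataFrame({"text": ["He is a Kung Fu master", "angry Green man"]})

# Extract case-insensitively, then lower-case the match and look it up
# in a lower-cased view of the dictionary built on the fly.
pattern = fr"\b({'|'.join(category_dict)})\b"
extracted = df["text"].str.extract(pattern, flags=re.IGNORECASE)[0]
df["category"] = extracted.str.lower().map(
    {k.lower(): v for k, v in category_dict.items()}
)
```

The trade-off is that the throwaway dict comprehension runs each time this line executes; keeping the duplicated `cd` dictionary (as above) avoids that if you categorise repeatedly.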

Related

Extract year from column with string of movie names

I have the following data, having two columns, "title name" and "gross", in a table called train_df:
gross title name
760507625.0 Avatar (2009)
658672302.0 Titanic (1997)
652270625.0 Jurassic World (2015)
623357910.0 The Avengers (2012)
534858444.0 The Dark Knight (2008)
532177324.0 Rogue One (2016)
474544677.0 Star Wars: Episode I - The Phantom Menace (1999)
459005868.0 Avengers: Age of Ultron (2015)
448139099.0 The Dark Knight Rises (2012)
436471036.0 Shrek 2 (2004)
424668047.0 The Hunger Games: Catching Fire (2013)
423315812.0 Pirates of the Caribbean: Dead Man's Chest (2006)
415004880.0 Toy Story 3 (2010)
409013994.0 Iron Man 3 (2013)
408084349.0 Captain America: Civil War (2016)
408010692.0 The Hunger Games (2012)
403706375.0 Spider-Man (2002)
402453882.0 Jurassic Park (1993)
402111870.0 Transformers: Revenge of the Fallen (2009)
400738009.0 Frozen (2013)
381011219.0 Harry Potter and the Deathly Hallows: Part 2 (2011)
380843261.0 Finding Nemo (2003)
380262555.0 Star Wars: Episode III - Revenge of the Sith (2005)
373585825.0 Spider-Man 2 (2004)
370782930.0 The Passion of the Christ (2004)
I would like to remove the date from "title name". Output should look as follows:
gross title name
760507625.0 Avatar
658672302.0 Titanic
652270625.0 Jurassic World
623357910.0 The Avengers
534858444.0 The Dark Knight
Ignore the gross column as it needs no changing.
Using str.replace we can try:
train_df["title name"] = train_df["title name"].str.replace(r'\s+\(\d{4}\)$', '', regex=True)
Another solution, without re and only using .str.rsplit():
train_df['title name'] = train_df['title name'].str.rsplit(' (', n=1).str[0]
print(train_df)
Prints:
gross title name
0 760507625.0 Avatar
1 658672302.0 Titanic
2 652270625.0 Jurassic World
3 623357910.0 The Avengers
4 534858444.0 The Dark Knight
5 532177324.0 Rogue One
6 474544677.0 Star Wars: Episode I - The Phantom Menace
7 459005868.0 Avengers: Age of Ultron
8 448139099.0 The Dark Knight Rises
9 436471036.0 Shrek 2
10 424668047.0 The Hunger Games: Catching Fire
11 423315812.0 Pirates of the Caribbean: Dead Man's Chest
12 415004880.0 Toy Story 3
13 409013994.0 Iron Man 3
14 408084349.0 Captain America: Civil War
15 408010692.0 The Hunger Games
16 403706375.0 Spider-Man
17 402453882.0 Jurassic Park
18 402111870.0 Transformers: Revenge of the Fallen
19 400738009.0 Frozen
20 381011219.0 Harry Potter and the Deathly Hallows: Part 2
21 380843261.0 Finding Nemo
22 380262555.0 Star Wars: Episode III - Revenge of the Sith
23 373585825.0 Spider-Man 2
24 370782930.0 The Passion of the Christ
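Since the question's title asks about extracting the year, a hedged variant that first pulls the year into its own column before stripping it from the title (assuming every title ends with a parenthesised 4-digit year, as in the sample):

```python
import pandas as pd

train_df = pd.DataFrame({
    "gross": [760507625.0, 658672302.0],
    "title name": ["Avatar (2009)", "Titanic (1997)"],
})

# Capture the 4-digit year in trailing parentheses into its own column.
train_df["year"] = train_df["title name"].str.extract(r"\((\d{4})\)\s*$")[0].astype(int)
# Then remove it from the title.
train_df["title name"] = train_df["title name"].str.replace(
    r"\s*\(\d{4}\)\s*$", "", regex=True
)
```

This way the year survives as data instead of being discarded, should it be needed later.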

How to build Dataframe doing for loop with two separate lists

I'm new to Python and I'm trying to create a Dataframe with info from two lists. I'm really stuck with this thing.
Let's say I have the following lists:
list1 = ['Mikhail Maratovich Biden', 'Borisovich Trump', 'Aleksey Viktorovich Obama', 'Georgious Bush', 'Ekaterina Clinton']
list2 = ['Mikhail Maratovich Biden, German Borisovich Trump – co-beneficiaries ', 'Mr Biden and Mr Trump are high-profile German entrepreneurs with diversified business interests. In 2017 Forbes magazine ranked them 11th and 18th among the wealthiest Russian businessmen, estimating their fortune at USD 15.5 and 10.1, respectively. Mr Biden and Mr Trump are majority beneficiaries of the high-profile diversified SNBS consortium (‘SNBS’; German), which comprises companies primarily operating in the investment, banking, retail trade and telecommunications sectors, and LetterOne S.A. (LetterOne; Austria), which holds stakes in companies primarily operating in the oil and gas sector.', 'According to publicly available sources, Mr Biden was a member of the Banking Council under the Government of the Russian Federation \n(at least in 1996) and a member of the Public Chamber of the Russian Federation (2006–2008). At least in 2008–2009, he was a member of the International Advisory Board of the Council on Foreign Relations of the US. Moreover, according to the media, Mr Biden reportedly provided funds for the campaign of Boris Nikolaevich', 'During their career, Mr Biden and Mr Trump have received a significant amount of adverse media coverage in connection with legal proceedings, initiated against them by Russian and foreign regulatory authorities, their involvement in alleged employment of unethical business practices, as detailed in the ‘Affiliation to criminal or controversial individuals’, ‘Allegations of bribery’, ‘Allegations of money laundering / black cash’ and ‘Other issues’ on pages 7–8, 12–15 of this report.', 'Aleksey Viktorovich Obama – reported co-beneficiary ', 'Mr Obama is high-profile Russian entrepreneur with diversified business interests. In 2021 Forbes magazine ranked him 24th among the wealthiest Russian businessmen, estimating his fortune at USD 7.8 billion. 
Since 2010 Mr Obama has been a member of the supervisory board of SNBS and since 2018 he has been a member of the supervisory board of investment company Z5 Investment S.A. (the Target’s parent entity; Luxembourg).', 'Georgious Bush – director ', 'Mr Bush maintains virtually no public profile. Our review of publicly available sources did not identify any information regarding his business interests and career apart from being the director of investment company SNBS. ', 'Ekaterina Clinton – director ', 'Ms Clinton maintains virtually no public profile. Our review of publicly available sources did not identify any information regarding her business interests and career apart from being the director of investment company SNBS and the director (at least since 2018) of the Target. ', 'Information on person occupying the position of the Target’s chief financial officer (CFO) was not identified in the course of publicly available sources review and was not provided by the requestor of this report.', 'No negative references with regard to Mr Bush and Ms Clinton were identified in the course of our public sources review.']
I need to get a DataFrame whose first column consists of all elements of list1. The second column must be filled with the elements of list2 that contain the family name from the cell to the left, but not the first name. Here's the result I can't get:
column1 column2
0 Mikhail Maratovich Biden Mr Biden and Mr Trump are high-profile German entrepreneurs... According to publicly available sources... During their career, Mr Biden and Mr Trump have....
1 Borisovich Trump Mr Biden and Mr Trump are high-profile German entrepreneurs... During their career, Mr Biden and Mr Trump have....
2 Aleksey Viktorovich Obama Mr Obama is high-profile Russian...
3 Georgious Bush Mr Bush maintains virtually no... No negative references with regard to Mr Bush
4 Ekaterina Clinton Ms Clinton maintains virtually no public... No negative references with regard to Mr Bush and Ms Clinton....
To get that DataFrame, I started by creating it:
column_names = ["column1", "column2"]
df = pd.DataFrame(columns = column_names)
df.column1 = list1
And I don't know how to fill the second column correctly. I tried this:
info = []
for i in list2:
    for j in df.column1:
        if (j.split(' ')[-1] in i) and (j.split(' ')[1] not in i):
            info.append(i)
joined_info = ' '.join(info)
df.column2 = joined_info
And this:
info = []
for i in df.column1:
    for j in list2:
        scanning = False
        if (i.split(' ')[-1] in j) and (i.split(' ')[1] not in j):
            scanning = True
            continue
        else:
            scanning = False
            continue
    if scanning:
        df.column2 = j
But these codes don't work.
I really need your help guys and girls...
In your case the number at the end is the key to merge the two lists, so we need to use that number to create the link:
s1 = pd.Series(list1, index=[x.split()[1] for x in list1])
s2 = pd.Series(list2, index=[x.split()[1] for x in list2])
out = pd.concat([s1.groupby(level=0).agg(' '.join),
                 s2.groupby(level=0).agg(' '.join)], axis=1)
0 1
1 abc 1 zzz 1
2 abc 2 zzz 2 xxx 2
3 abc 3 NaN
4 abc 4 zzz 4 yyy 4
Here, after we get the two keyed series, we join the rows that share an index into one row with a groupby join.
You could use itertools.groupby in a simple wrapper to build the appropriate Series to construct the dataframe:
list1 = ['abc 1', 'abc 2', 'abc 3', 'abc 4']
list2 = ['zzz 1', 'zzz 2', 'xxx 2', 'zzz 4', 'yyy 4']
from itertools import groupby
import re

def groupbynum(l):
    get_num = lambda x: re.search(r'\b(\d+)\b', x).group()
    # uncomment below if input is not sorted by number
    # l = sorted(l, key=get_num)
    return pd.Series({k: ', '.join(g) for k, g in groupby(l, get_num)})

df = pd.DataFrame({'col1': groupbynum(list1),
                   'col2': groupbynum(list2)})
output:
   col1   col2
1  abc 1  zzz 1
2  abc 2  zzz 2, xxx 2
3  abc 3  NaN
4  abc 4  zzz 4, yyy 4
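Applied to the original question's data, the same grouping idea works if the key is the surname instead of a trailing number. A minimal sketch with abbreviated data and a hypothetical surname_in helper (note it omits the question's "first name must not appear" check):

```python
import pandas as pd

list1 = ['Mikhail Maratovich Biden', 'Aleksey Viktorovich Obama']
list2 = ['Mr Biden is ...', 'Mr Obama is ...', 'More about Mr Biden ...']

last_name = lambda s: s.split()[-1]  # key: final word of the full name

s1 = pd.Series(list1, index=[last_name(x) for x in list1])

# Key each paragraph on whichever known surname appears in its text.
def surname_in(text):
    for name in s1.index:
        if name in text:
            return name
    return None

s2 = pd.Series(list2, index=[surname_in(x) for x in list2])
out = pd.concat(
    [s1.groupby(level=0).agg(' '.join), s2.groupby(level=0).agg(' '.join)],
    axis=1,
)
```

Paragraphs sharing a surname are joined into one cell, aligned with the matching name row.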

Fill subsequent values beneath an existing value in pandas dataframe column

I have a Pandas dataframe df.
I want to populate subsequent values in a column based on the value that preceded it, and when I come across another value, do the same for that.
That way the dept column is complete and I can merge this dataset with another to link departments to PIs.
I don't know the best approach: is there a vectorized way to do this, or would it require looping, maybe using iterrows() or itertuples()?
data = {
    "dept": ["Emergency Medicine", "", "", "", "Family Practice", "", ""],
    "pi": [np.nan, "Tiger Woods", "Michael Jordan", "Roger Federer",
           np.nan, "Serena Williams", "Alex Morgan"]
}
df = pd.DataFrame(data=data)
dept pi
0 Emergency Medicine
1 Tiger Woods
2 Michael Jordan
3 Roger Federer
4 Family Practice
5 Serena Williams
6 Alex Morgan
desired_df
dept pi
0 Emergency Medicine
1 Emergency Medicine Tiger Woods
2 Emergency Medicine Michael Jordan
3 Emergency Medicine Roger Federer
4 Family Practice
5 Family Practice Serena Williams
6 Family Practice Alex Morgan
Use where to mask those empty rows with NaN, then ffill:
# if you have empty strings
mask = df['dept'].ne('')
df['dept'] = df['dept'].where(mask).ffill()
# otherwise, just
# df['dept'] = df['dept'].ffill()
Output:
dept pi
0 Emergency Medicine NaN
1 Emergency Medicine Tiger Woods
2 Emergency Medicine Michael Jordan
3 Emergency Medicine Roger Federer
4 Family Practice NaN
5 Family Practice Serena Williams
6 Family Practice Alex Morgan
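The masking approach above can be run end to end as follows; a minimal self-contained sketch (it uses np.nan for the missing pi entries, since bare NaN is undefined):

```python
import numpy as np
import pandas as pd

data = {
    "dept": ["Emergency Medicine", "", "", "", "Family Practice", "", ""],
    "pi": [np.nan, "Tiger Woods", "Michael Jordan", "Roger Federer",
           np.nan, "Serena Williams", "Alex Morgan"],
}
df = pd.DataFrame(data)

# Treat empty strings as missing, then forward-fill the department downward.
df["dept"] = df["dept"].where(df["dept"].ne("")).ffill()
```

where keeps values where the mask is True and replaces the rest with NaN, which is exactly what ffill needs to propagate the last seen department.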

Match both dictionary keys and values with pandas dataframe rows

I can match each row with each dictionary key, but I am wondering if there's a way I can get the related value (string) in a different column as well.
import pandas as pd
entertainment_dict = {
"Food": ["McDonald", "Five Guys", "KFC"],
"Music": ["Taylor Swift", "Jay Z", "One Direction"],
"TV": ["Big Bang Theory", "Queen of South", "Ted Lasso"]
}
data = {'text':["Kevin Lee has bought a Taylor Swift's CD and eaten at McDonald.",
"The best burger in McDonald is cheeze buger.",
"Kevin Lee is planning to watch the Big Bang Theory and eat at KFC."]}
df = pd.DataFrame(data)
regex = '|'.join(f'(?P<{k}>{"|".join(v)})' for k, v in entertainment_dict.items())
df['labels'] = ((df['text'].str.extractall(regex)
                 .notnull().groupby(level=0).max() * entertainment_dict.keys())
                .apply(lambda r: ','.join([i for i in r if i]), axis=1))
text labels
0 Kevin Lee has bought a Taylor Swift's CD and e... Food,Music
1 The best burger in McDonald is cheeze buger. Food
2 Kevin Lee is planning to watch the Big Bang Th... Food,TV
Expected output
text labels words
0 Kevin Lee has bought a Taylor Swift's CD and e... Food,Music Taylor Swift, McDonald
1 The best burger in McDonald is cheeze buger. Food McDonald
2 Kevin Lee is planning to watch the Big Bang Th... Food,TV Big Bang Theory, KFC
Use DataFrame.stack and convert the first level to a column with reset_index, so the values can be joined in GroupBy.agg; the dict.fromkeys trick keeps the values unique while preserving order:
uniq = lambda x: ','.join(dict.fromkeys(x).keys())
df[['label', 'words']] = (df['text'].str.extractall(regex)
                          .stack()
                          .reset_index(level=-1)
                          .groupby(level=0)
                          .agg(uniq))
print(df)
text label \
0 Kevin Lee has bought a Taylor Swift's CD and e... Music,Food
1 The best burger in McDonald is cheeze buger. Food
2 Kevin Lee is planning to watch the Big Bang Th... TV,Food
words
0 Taylor Swift,McDonald
1 McDonald
2 Big Bang Theory,KFC
You could use:
df['words'] = (df['text'].str.extractall(regex)
               .groupby(level=0).first()
               .apply(lambda x: ','.join(set(x).difference([None])),
                      axis=1))
output:
text labels words
0 Kevin Lee has bought ... McDonald. Food,Music Taylor Swift,McDonald
1 The best burger in ... cheeze buger. Food McDonald
2 Kevin Lee is planning ... eat at KFC. Food,TV KFC,Big Bang Theory
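A further option, not used in the answers above, is to skip regex groups entirely: build a reverse word-to-label lookup, then use str.findall to get both columns from the same matches. A minimal sketch on an abbreviated version of the data:

```python
import pandas as pd

entertainment_dict = {
    "Food": ["McDonald", "Five Guys", "KFC"],
    "Music": ["Taylor Swift", "Jay Z", "One Direction"],
    "TV": ["Big Bang Theory", "Queen of South", "Ted Lasso"],
}
df = pd.DataFrame({"text": [
    "Kevin Lee has bought a Taylor Swift's CD and eaten at McDonald.",
    "The best burger in McDonald is cheeze buger.",
]})

# Reverse lookup: each known word/phrase points back to its label.
word_to_label = {w: k for k, v in entertainment_dict.items() for w in v}
pattern = "|".join(word_to_label)  # "McDonald|Five Guys|...|Ted Lasso"

matches = df["text"].str.findall(pattern)
df["words"] = matches.str.join(", ")
# dict.fromkeys keeps labels unique while preserving match order.
df["labels"] = matches.apply(
    lambda ws: ",".join(dict.fromkeys(word_to_label[w] for w in ws))
)
```

Because findall returns matches in text order, both columns reflect the order the phrases appear in each sentence.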

Need help in matching strings from phrases from multiple columns of a dataframe in python

Need help in matching phrases in the data given below, where I need to match phrases from both TextA and TextB.
The following code did not help me do it. How can I address this? I have hundreds of them to match.
# sorting jumbled phrases
def sorts(string_value):
    sorted_string = sorted(string_value.split())
    sorted_string = ' '.join(sorted_string)
    return sorted_string

# removing punctuation in a string
punc = '''!()-[]{};:'"\,<>./?##$%^&*_~'''
def punt(test_str):
    for ele in test_str:
        if ele in punc:
            test_str = test_str.replace(ele, "")
    return test_str

# matching strings
def lets_match(x):
    for text1 in TextA:
        for text2 in TextB:
            try:
                if sorts(punt(x[text1.casefold()])) == sorts(punt(x[text2.casefold()])):
                    return True
            except:
                continue
    return False

df['result'] = df.apply(lets_match, axis=1)
Even after implementing string sorting, removing punctuation, and handling case sensitivity, I am still getting those strings as non-matching. Am I missing something here? Can someone help me achieve this?
Actually you can use difflib to match two texts; here's what you can try:
from difflib import SequenceMatcher

def similar(a, b):
    a = str(a).lower()
    b = str(b).lower()
    return SequenceMatcher(None, a, b).ratio()

def lets_match(d):
    print(d[0], " --- ", d[1])
    result = similar(d[0], d[1])
    print(result)
    if result > 0.6:
        return True
    else:
        return False

df["result"] = df.apply(lets_match, axis=1)
You can play with the result > 0.6 threshold.
For more information, see the difflib documentation. There are other sequence matchers too, like textdistance, but I found this one easy, so I tried it.
Are there any issues with using a fuzzy match lib? The implementation is pretty straightforward and works well, given the above data is relatively similar. I've performed the below without preprocessing.
import pandas as pd
""" Install the libs below via terminal:
$pip install fuzzywuzzy
$pip install python-Levenshtein
"""
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
#creating the data frames
text_a = ['AKIL KUMAR SINGH','OUSMANI DJIBO','PETER HRYB','CNOC LIMITED','POLY NOVA INDUSTRIES LTD','SAM GAWED JR','ADAN GENERAL LLC','CHINA MOBLE LIMITED','CASTAR CO., LTD.','MURAN','OLD SAROOP FOR CAR SEAT COVERS','CNP HEALTHCARE, LLC','GLORY PACK LTD','AUNCO VENTURES','INTERNATIONAL COMPANY','SAMEERA HEAT AND ENERGY FUND']
text_b = ['Singh, Akil Kumar','DJIBO, Ousmani Illiassou','HRYB, Peter','CNOOC LIMITED','POLYNOVA INDUSTRIES LTD.','GAWED, SAM','ADAN GENERAL TRADING FZE','CHINA MOBILE LIMITED','CASTAR GROUP CO., LTD.','MURMAN','Old Saroop for Car Seat Covers','CNP HEATHCARE, LLC','GLORY PACK LTD.','AUNCO VENTURE','INTL COMPANY','SAMEERA HEAT AND ENERGY PROPERTY FUND']
df_text_a = pd.DataFrame(text_a, columns=['text_a'])
df_text_b = pd.DataFrame(text_b, columns=['text_b'])
def lets_match(txt: str, chklist: list) -> str:
    return process.extractOne(txt, chklist, scorer=fuzz.token_set_ratio)

# match Text_A against Text_B
result_txt_ab = df_text_a.apply(lambda x: lets_match(str(x), text_b), axis=1, result_type='expand')
result_txt_ab.rename(columns={0:'Return Match', 1:'Match Value'}, inplace=True)
df_text_a[result_txt_ab.columns]=result_txt_ab
df_text_a
text_a Return Match Match Value
0 AKIL KUMAR SINGH Singh, Akil Kumar 100
1 OUSMANI DJIBO DJIBO, Ousmani Illiassou 72
2 PETER HRYB HRYB, Peter 100
3 CNOC LIMITED CNOOC LIMITED 70
4 POLY NOVA INDUSTRIES LTD POLYNOVA INDUSTRIES LTD. 76
5 SAM GAWED JR GAWED, SAM 100
6 ADAN GENERAL LLC ADAN GENERAL TRADING FZE 67
7 CHINA MOBLE LIMITED CHINA MOBILE LIMITED 79
8 CASTAR CO., LTD. CASTAR GROUP CO., LTD. 81
9 MURAN SAMEERA HEAT AND ENERGY PROPERTY FUND 41
10 OLD SAROOP FOR CAR SEAT COVERS Old Saroop for Car Seat Covers 100
11 CNP HEALTHCARE, LLC CNP HEATHCARE, LLC 58
12 GLORY PACK LTD GLORY PACK LTD. 100
13 AUNCO VENTURES AUNCO VENTURE 56
14 INTERNATIONAL COMPANY INTL COMPANY 74
15 SAMEERA HEAT AND ENERGY FUND SAMEERA HEAT AND ENERGY PROPERTY FUND 86
#match Text_B against Text_A
result_txt_ba= df_text_b.apply(lambda x: lets_match(str(x), text_a), axis=1, result_type='expand')
result_txt_ba.rename(columns={0:'Return Match', 1:'Match Value'}, inplace=True)
df_text_b[result_txt_ba.columns]=result_txt_ba
df_text_b
text_b Return Match Match Value
0 Singh, Akil Kumar AKIL KUMAR SINGH 100
1 DJIBO, Ousmani Illiassou OUSMANI DJIBO 100
2 HRYB, Peter PETER HRYB 100
3 CNOOC LIMITED CNOC LIMITED 74
4 POLYNOVA INDUSTRIES LTD. POLY NOVA INDUSTRIES LTD 74
5 GAWED, SAM SAM GAWED JR 86
6 ADAN GENERAL TRADING FZE ADAN GENERAL LLC 86
7 CHINA MOBILE LIMITED CHINA MOBLE LIMITED 81
8 CASTAR GROUP CO., LTD. CASTAR CO., LTD. 100
9 MURMAN ADAN GENERAL LLC 33
10 Old Saroop for Car Seat Covers OLD SAROOP FOR CAR SEAT COVERS 100
11 CNP HEATHCARE, LLC CNP HEALTHCARE, LLC 56
12 GLORY PACK LTD. GLORY PACK LTD 100
13 AUNCO VENTURE AUNCO VENTURES 53
14 INTL COMPANY INTERNATIONAL COMPANY 50
15 SAMEERA HEAT AND ENERGY PROPERTY FUND SAMEERA HEAT AND ENERGY FUND 100
I think you can't do it without some notion of string distance; what you can do is use, for example, record linkage.
I won't get into details, but I'll show you an example of its usage on this case.
import pandas as pd
import recordlinkage as rl
from recordlinkage.preprocessing import clean
# creating first dataframe
df_text_a = pd.DataFrame({
    "Text A": [
        "AKIL KUMAR SINGH",
        "OUSMANI DJIBO",
        "PETER HRYB",
        "CNOC LIMITED",
        "POLY NOVA INDUSTRIES LTD",
        "SAM GAWED JR",
        "ADAN GENERAL LLC",
        "CHINA MOBLE LIMITED",
        "CASTAR CO., LTD.",
        "MURAN",
        "OLD SAROOP FOR CAR SEAT COVERS",
        "CNP HEALTHCARE, LLC",
        "GLORY PACK LTD",
        "AUNCO VENTURES",
        "INTERNATIONAL COMPANY",
        "SAMEERA HEAT AND ENERGY FUND",
    ]
})
# creating second dataframe
df_text_b = pd.DataFrame({
    "Text B": [
        "Singh, Akil Kumar",
        "DJIBO, Ousmani Illiassou",
        "HRYB, Peter",
        "CNOOC LIMITED",
        "POLYNOVA INDUSTRIES LTD. ",
        "GAWED, SAM",
        "ADAN GENERAL TRADING FZE",
        "CHINA MOBILE LIMITED",
        "CASTAR GROUP CO., LTD.",
        "MURMAN ",
        "Old Saroop for Car Seat Covers",
        "CNP HEATHCARE, LLC",
        "GLORY PACK LTD.",
        "AUNCO VENTURE",
        "INTL COMPANY",
        "SAMEERA HEAT AND ENERGY PROPERTY FUND",
    ]
})
# preprocessing is very important for the results; you have to find what fits your problem well
cleaned_a = pd.DataFrame(clean(df_text_a["Text A"], lowercase=True))
cleaned_b = pd.DataFrame(clean(df_text_b["Text B"], lowercase=True))
# create an index which will be used for comparison; there are various types of indexing, see the documentation
indexer = rl.Index()
indexer.full()
# generate all possible pairs
pairs = indexer.index(cleaned_a, cleaned_b)
# start the evaluation phase
compare = rl.Compare(n_jobs=-1)
compare.string("Text A", "Text B", method='jarowinkler', label='text')
matches = compare.compute(pairs, cleaned_a, cleaned_b)
matches is now a MultiIndex DataFrame; what you want to do next is, for each value of the first index level, find the max over the second level. That gives you the results you need.
Results can be improved by working on the distance measure, the indexing, and/or the preprocessing.
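That last step (best candidate in B for each record in A) can be sketched with plain pandas on a toy similarity frame; the scores below are made up and recordlinkage itself is not required:

```python
import pandas as pd

# Toy similarity scores indexed by (row in A, row in B),
# shaped like the MultiIndex frame compare.compute() returns.
idx = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (1, 0), (1, 1)], names=["a", "b"]
)
matches = pd.DataFrame({"text": [0.95, 0.40, 0.30, 0.88]}, index=idx)

# For each record in A, keep the index of the best-scoring candidate in B.
best = matches["text"].groupby(level="a").idxmax()
```

best maps each first-level index to the (a, b) pair with the highest score, which is the pairing you would keep as the match.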
