I can match each row against each dictionary key, but I am wondering if there is a way to also get the matched value (string) in a separate column.
import pandas as pd
entertainment_dict = {
"Food": ["McDonald", "Five Guys", "KFC"],
"Music": ["Taylor Swift", "Jay Z", "One Direction"],
"TV": ["Big Bang Theory", "Queen of South", "Ted Lasso"]
}
data = {'text':["Kevin Lee has bought a Taylor Swift's CD and eaten at McDonald.",
"The best burger in McDonald is cheeze buger.",
"Kevin Lee is planning to watch the Big Bang Theory and eat at KFC."]}
df = pd.DataFrame(data)
regex = '|'.join(f'(?P<{k}>{"|".join(v)})' for k,v in entertainment_dict.items())
df['labels'] = ((df['text'].str.extractall(regex).notnull().groupby(level=0).max()*entertainment_dict.keys())
.apply(lambda r: ','.join([i for i in r if i]) , axis=1)
)
text labels
0 Kevin Lee has bought a Taylor Swift's CD and e... Food,Music
1 The best burger in McDonald is cheeze buger. Food
2 Kevin Lee is planning to watch the Big Bang Th... Food,TV
Expected output
text labels words
0 Kevin Lee has bought a Taylor Swift's CD and e... Food,Music Taylor Swift, McDonald
1 The best burger in McDonald is cheeze buger. Food McDonald
2 Kevin Lee is planning to watch the Big Bang Th... Food,TV Big Bang Theory, KFC
Use DataFrame.stack, then convert the last index level (the group names) to a column with reset_index, so that both the labels and the matched words can be joined in GroupBy.agg. The dict.fromkeys trick keeps unique values in their original order:
uniq = lambda x: ','.join(dict.fromkeys(x).keys())
df[['label','words']] = (df['text'].str.extractall(regex)
.stack()
.reset_index(level=-1)
.groupby(level=0)
.agg(uniq))
print (df)
text label \
0 Kevin Lee has bought a Taylor Swift's CD and e... Music,Food
1 The best burger in McDonald is cheeze buger. Food
2 Kevin Lee is planning to watch the Big Bang Th... TV,Food
words
0 Taylor Swift,McDonald
1 McDonald
2 Big Bang Theory,KFC
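The pairing of label and matched word can also be sketched with the standard library alone, which may make it easier to see what the named groups contribute. This is a minimal sketch, not the answer's method: the helper name labels_and_words is hypothetical, and re.escape is added on the assumption that search terms might contain regex metacharacters.

```python
import re

entertainment_dict = {
    "Food": ["McDonald", "Five Guys", "KFC"],
    "Music": ["Taylor Swift", "Jay Z", "One Direction"],
    "TV": ["Big Bang Theory", "Queen of South", "Ted Lasso"],
}

# One named group per dictionary key; re.escape guards against
# regex metacharacters in the search terms (an added assumption).
pattern = re.compile("|".join(
    f"(?P<{label}>{'|'.join(map(re.escape, words))})"
    for label, words in entertainment_dict.items()
))

def labels_and_words(text):
    """Return (labels, words) as comma-joined strings, in order of appearance."""
    # m.lastgroup is the name of the group that matched, m.group() the matched text
    pairs = [(m.lastgroup, m.group()) for m in pattern.finditer(text)]
    # dict.fromkeys removes duplicates while keeping first-seen order
    labels = ",".join(dict.fromkeys(label for label, _ in pairs))
    words = ",".join(dict.fromkeys(word for _, word in pairs))
    return labels, words

result = labels_and_words(
    "Kevin Lee has bought a Taylor Swift's CD and eaten at McDonald."
)
print(result)
```

Because every match is visited, a row that mentions two terms from the same category keeps both words. The function could be applied row by row with Series.map if the pandas reshaping is not needed.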
You could use:
df['words'] = (df['text'].str.extractall(regex)
.groupby(level=0).first()
.apply(lambda x: ','.join(set(x).difference([None])),
axis=1)
)
output:
text labels words
0 Kevin Lee has bought ... McDonald. Food,Music Taylor Swift,McDonald
1 The best burger in ... cheeze buger. Food McDonald
2 Kevin Lee is planning ... eat at KFC. Food,TV KFC,Big Bang Theory
I have the dataframe below and have created a column to categorise rows based on specific text within a string.
However, when I pass the re.IGNORECASE flag, the result is still case sensitive. Why?
Dataframe
import re
import pandas as pd
test_data = {
"first_name": ['Bruce', 'Clark', 'Bruce', 'James', 'Nanny', 'Dot'],
"last_name": ['Lee', 'Kent', 'Banner', 'Bond', 'Mc Phee', 'Cotton'],
"title": ['mr', 'mr', 'mr', 'mr', 'mrs', 'mrs'],
"text": ["He is a Kung Fu master", "Wears capes and tight Pants", "Cocktails shaken not stirred", "angry Green man", "suspect scottish accent", "East end legend"],
"age": [32, 33, 28, 30, 42, 80]
}
df = pd.DataFrame(test_data)
code
category_dict = {
"Kung Fu":"Martial Art",
"capes":"Clothing",
"cocktails": "Drink",
"green": "Colour",
"scottish": "Scotland",
"East": "Direction"
}
df['category'] = (
df['text'].str.extract(
fr"\b({'|'.join(category_dict.keys())})\b",
flags=re.IGNORECASE)[0].map(category_dict))
Expected output
first_name last_name title text age category
0 Bruce Lee Mr He is a Kung Fu master 32 Martial Art
1 Clark Kent Mr Wears capes and tight Pants 33 Clothing
2 Bruce Banner Mr Cocktails shaken not stirred 28 Drink
3 James Bond Mr angry Green man 30 Colour
4 Nanny Mc Phee Mrs suspect scottish accent 42 Scotland
5 Dot Cotton Mrs East end legend 80 Direction
I have searched the docs and have found no pointers, so any help would be appreciated!
Here is one way to do it.
The issue is that while extract ignores case, the extracted string keeps the case it has in the text, and mapping it against the dictionary is still case sensitive.
#create a dictionary with lower case keys
cd= {k.lower(): v for k,v in category_dict.items()}
# alternately, you can convert the category_dict keys to lower case
# I duplicated the dictionary, in case you need to keep the original keys
# convert the extracted word to lowercase and then map with the lowercase dict
df['category'] = (
df['text'].str.extract(
fr"\b({'|'.join((category_dict.keys()))})\b",
flags=re.IGNORECASE)[0].str.lower().map(cd))
df
first_name last_name title text age category
0 Bruce Lee mr He is a Kung Fu master 32 Martial Art
1 Clark Kent mr Wears capes and tight Pants 33 Clothing
2 Bruce Banner mr Cocktails shaken not stirred 28 Drink
3 James Bond mr angry Green man 30 Colour
4 Nanny Mc Phee mrs suspect scottish accent 42 Scotland
5 Dot Cotton mrs East end legend 80 Direction
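The mismatch can be reproduced without pandas, which may help show where the case sensitivity comes from. The sketch below uses a shortened copy of the dictionary; the function name categorise and the use of re.escape are assumptions added for illustration.

```python
import re

category_dict = {
    "Kung Fu": "Martial Art",
    "cocktails": "Drink",
    "green": "Colour",
}

# Same idea as the answer: a lookup table with lower-cased keys
lowered = {key.lower(): value for key, value in category_dict.items()}

pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, category_dict)) + r")\b",
    re.IGNORECASE,
)

# The match is found regardless of case, but the captured text keeps
# the case used in the input, so a direct dictionary lookup fails.
raw = pattern.search("Cocktails shaken not stirred").group(1)
print(raw, raw in category_dict)  # Cocktails False

def categorise(text):
    """Return the category for the first keyword found, or None."""
    match = pattern.search(text)
    if match is None:
        return None
    return lowered[match.group(1).lower()]

print(categorise("Cocktails shaken not stirred"))  # Drink
```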
I have a Pandas dataframe df
I want to fill the empty values in a column with the value that preceded them, and when I come across another value, do the same for that one.
That way the dept column is complete, and I can merge this dataset with another to link department info to PIs.
I don't know the best approach. Is there a vectorized way to do this, or would it require looping, maybe with iterrows() or itertuples()?
import numpy as np
import pandas as pd
data = {"dept": ["Emergency Medicine", "", "", "", "Family Practice", "", ""],
"pi": [np.nan, "Tiger Woods", "Michael Jordan", "Roger Federer", np.nan, "Serena Williams", "Alex Morgan"]
}
df = pd.DataFrame(data=data)
dept pi
0 Emergency Medicine
1 Tiger Woods
2 Michael Jordan
3 Roger Federer
4 Family Practice
5 Serena Williams
6 Alex Morgan
desired_df
dept pi
0 Emergency Medicine
1 Emergency Medicine Tiger Woods
2 Emergency Medicine Michael Jordan
3 Emergency Medicine Roger Federer
4 Family Practice
5 Family Practice Serena Williams
6 Family Practice Alex Morgan
Use where to replace the empty strings with NaN, then ffill to propagate the last valid value downward:
# if you have empty strings
mask = df['dept'].ne('')
df['dept'] = df['dept'].where(mask).ffill()
# otherwise, just
# df['dept'] = df['dept'].ffill()
Output:
dept pi
0 Emergency Medicine NaN
1 Emergency Medicine Tiger Woods
2 Emergency Medicine Michael Jordan
3 Emergency Medicine Roger Federer
4 Family Practice NaN
5 Family Practice Serena Williams
6 Family Practice Alex Morgan
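Since the question mentions merging afterwards, here is a small self-contained sketch of the same where + ffill idea on shortened data. Two things are assumptions that go beyond the answer: whitespace-only cells are treated as empty, and the department header rows (those without a PI) are dropped before merging. The name people is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "dept": ["Emergency Medicine", "", " ", "Family Practice", ""],
    "pi": [np.nan, "Tiger Woods", "Michael Jordan", np.nan, "Serena Williams"],
})

# True where the cell holds real text (assumption: whitespace-only counts as empty)
has_text = df["dept"].str.strip().ne("")

# where() keeps values where the mask is True and puts NaN elsewhere;
# ffill() then copies the last valid department downward
df["dept"] = df["dept"].where(has_text).ffill()

# Assumption: rows without a PI are only department headers, so drop them
people = df.dropna(subset=["pi"]).reset_index(drop=True)
print(people)
```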
I have 2 datasets (in CSV format) of different sizes, as follows:
df_old:
index category text
0 spam you win much money
1 spam you are the winner of the game
2 not_spam the weather in Chicago is nice
3 not_spam pizza is an Italian food
4 neutral we have a party now
5 neutral they are driving to downtown
df_new:
index category text
0 spam you win much money
14 spam London is the capital of Canada
15 not_spam no more raining in winter
25 not_spam the soccer game plays on HBO
4 neutral we have a party now
31 neutral construction will be done
I am using code that concatenates df_new to df_old so that the rows of df_new go on top of the rows of df_old within each category.
The code is:
(pd.concat([df_new,df_old], sort=False).sort_values('category', ascending=False, kind='mergesort'))
Now, the problem is that rows with the same index, category and text (all three together in the same row) end up duplicated, like [0, spam, you win much money], and I want to avoid this.
The expected output should be:
df_concat:
index category text
14 spam London is the capital of Canada
0 spam you win much money
1 spam you are the winner of the game
15 not_spam no more raining in winter
25 not_spam the soccer game plays on HBO
2 not_spam the weather in Chicago is nice
3 not_spam pizza is an Italian food
31 neutral construction will be done
4 neutral we have a party now
5 neutral they are driving to downtown
I tried this and this but these are removing either the category or text.
To remove duplicates on specific column(s), use subset in drop_duplicates:
df.drop_duplicates(subset=['index', 'category', 'text'], keep='first')
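Try concat + drop_duplicates + sort_values:
res = pd.concat((df_new, df_old)).drop_duplicates()
# map each category to its rank; mergesort is stable, so rows keep their concat order within a category
res = res.sort_values(by=['category'], key=lambda x: x.map({'spam' : 0, 'not_spam' : 1, 'neutral': 2}), kind='mergesort')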
print(res)
Output
index category text
0 0 spam you win much money
1 14 spam London is the capital of Canada
1 1 spam you are the winner of the game
2 15 not_spam no more raining in winter
3 25 not_spam the soccer game plays on HBO
2 2 not_spam the weather in Chicago is nice
3 3 not_spam pizza is an Italian food
4 31 neutral construction will be done
4 4 neutral we have a party now
5 5 neutral they are driving to downtown
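In the output above, the shared row (index 0) sits above index 14, whereas the expected output in the question lists 14 first. A sketch of one way to reproduce the expected order is to keep the last occurrence of each duplicate, so that a shared row stays at its df_old position. It assumes that 'index' is an ordinary column and that a pandas version supporting the key argument of sort_values is available; the name order is hypothetical.

```python
import pandas as pd

df_old = pd.DataFrame({
    "index": [0, 1, 2, 3, 4, 5],
    "category": ["spam", "spam", "not_spam", "not_spam", "neutral", "neutral"],
    "text": ["you win much money", "you are the winner of the game",
             "the weather in Chicago is nice", "pizza is an Italian food",
             "we have a party now", "they are driving to downtown"],
})
df_new = pd.DataFrame({
    "index": [0, 14, 15, 25, 4, 31],
    "category": ["spam", "spam", "not_spam", "not_spam", "neutral", "neutral"],
    "text": ["you win much money", "London is the capital of Canada",
             "no more raining in winter", "the soccer game plays on HBO",
             "we have a party now", "construction will be done"],
})

# Rank used for sorting the categories
order = {"spam": 0, "not_spam": 1, "neutral": 2}

df_concat = (
    pd.concat([df_new, df_old], ignore_index=True)
    # keep="last" drops the df_new copy of a shared row and keeps the df_old one
    .drop_duplicates(subset=["index", "category", "text"], keep="last")
    # mergesort is stable: within a category, df_new rows stay above df_old rows
    .sort_values("category", key=lambda s: s.map(order), kind="mergesort")
    .reset_index(drop=True)
)
print(df_concat)
```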
Your code seems right. Try adding drop_duplicates to the concat result; it will remove the duplicates:
# these first lines create a new column 'index' so that the rest of the code works
df_new = df_new.reset_index()
df_old = df_old.reset_index()
df_concat = (pd.concat([df_new,df_old], sort=False).sort_values('category', ascending=False, kind='mergesort'))
df_concat = df_concat.drop_duplicates()
If you also want to renumber the rows (without changing the 'index' column, of course), you can do:
df_concat = df_concat.drop_duplicates(ignore_index=True)
You can always do combine_first
out = df_new.combine_first(df_old)
I have the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': ['Steve Smith', 'Joe Nadal',
'Roger Federer'],
'birthdat/company': ['1995-01-26Sharp, Reed and Crane',
'1955-08-14Price and Sons',
'2000-06-28Pruitt, Bush and Mcguir']})
df[['data_time','full_company_name']] = df['birthdat/company'].str.split('[0-9]{4}-[0-9]{2}-[0-9]{2}', expand=True)
df
With my code I get the following:
____|____Name______|__birthdat/company_______________|_birthdate_|____company___________
0 |Steve Smith |1995-01-26Sharp, Reed and Crane | |Sharp, Reed and Crane
1 |Joe Nadal |1955-08-14Price and Sons | |Price and Sons
2 |Roger Federer |2000-06-28Pruitt, Bush and Mcguir| |Pruitt, Bush and Mcguir
What I want is for the part that matches this regex ('[0-9]{4}-[0-9]{2}-[0-9]{2}') to go to the birthdate column, and the rest to go to the column "full_company_name":
____|____Name______|_birthdate_|____company_name_______
0 |Steve Smith |1995-01-26 |Sharp, Reed and Crane
1 |Joe Nadal |1955-08-14 |Price and Sons
2 |Roger Federer |2000-06-28 |Pruitt, Bush and Mcguir
Updated Question:
How could I handle missing values for the birthdate or the company name?
For example, birthdate/company = "NaApple" or birthdate/company = "2003-01-15Na". The missing values are not limited to Na.
You may use
df[['data_time','full_company_name']] = df['birthdat/company'].str.extract(r'^([0-9]{4}-[0-9]{2}-[0-9]{2})(.*)', expand=False)
>>> df
Name Age ... data_time full_company_name
0 Steve Smith 32 ... 1995-01-26 Sharp, Reed and Crane
1 Joe Nadal 34 ... 1955-08-14 Price and Sons
2 Roger Federer 36 ... 2000-06-28 Pruitt, Bush and Mcguir
[3 rows x 5 columns]
The Series.str.extract is used here because you need to get two parts without losing the date.
The regex is
^ - start of string
([0-9]{4}-[0-9]{2}-[0-9]{2}) - your date pattern captured into Group 1
(.*) - the rest of the string captured into Group 2.
See the regex demo.
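The difference between splitting and extracting can be seen on a single value with the standard re module. This is only an illustration of the regex behaviour, not of the pandas call itself.

```python
import re

value = "1995-01-26Sharp, Reed and Crane"
date_re = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"

# Splitting treats the date as a separator and throws it away,
# leaving an empty string in front of it.
split_parts = re.split(date_re, value)
print(split_parts)  # ['', 'Sharp, Reed and Crane']

# Two capture groups keep both the date and the rest of the string.
extracted = re.match(rf"^({date_re})(.*)", value).groups()
print(extracted)  # ('1995-01-26', 'Sharp, Reed and Crane')
```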
split splits the string on the separator and discards the separator itself, which is why the date is lost. I think you want extract with two capture groups:
df[['data_time','full_company_name']] = \
df['birthdat/company'].str.extract('^([0-9]{4}-[0-9]{2}-[0-9]{2})(.*)')
Output:
Name birthdat/company data_time full_company_name
-- ------------- --------------------------------- ----------- -----------------------
0 Steve Smith 1995-01-26Sharp, Reed and Crane 1995-01-26 Sharp, Reed and Crane
1 Joe Nadal 1955-08-14Price and Sons 1955-08-14 Price and Sons
2 Roger Federer 2000-06-28Pruitt, Bush and Mcguir 2000-06-28 Pruitt, Bush and Mcguir
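For the updated question about missing values, one possible sketch is to let the leading part be either a date or a known placeholder token. Everything here is an assumption: the PLACEHOLDERS tuple is hypothetical and must be replaced with the tokens that actually occur in the data, the helper name split_value is made up, and a company name that happens to start with a placeholder (for example "Nautilus") would be split incorrectly.

```python
import re

# Hypothetical placeholder tokens; replace with the ones found in the real data
PLACEHOLDERS = ("Na", "None")
_placeholder_re = "|".join(map(re.escape, PLACEHOLDERS))

# Leading part: either a date (captured) or a placeholder (not captured)
PATTERN = re.compile(
    rf"^(?:(?P<date>\d{{4}}-\d{{2}}-\d{{2}})|{_placeholder_re})(?P<company>.*)$"
)

def split_value(value):
    """Return (birthdate, company); either element is None when missing."""
    match = PATTERN.match(value)
    if match is None:
        # Neither a date nor a placeholder in front: treat it all as the company
        return None, value
    company = match.group("company").strip()
    if company == "" or company in PLACEHOLDERS:
        company = None
    return match.group("date"), company

for raw in ["1995-01-26Sharp, Reed and Crane", "NaApple", "2003-01-15Na"]:
    print(raw, "->", split_value(raw))
```

The function could be mapped over the 'birthdat/company' column; how the resulting pairs are turned into two columns is left open here.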
I have two columns in a DataFrame: crewname is a list of crew members who worked on a film, and Director_loc is the position of the director within that list.
I want to create a new column which has the name of the director.
crewname Director_loc
[John Lasseter, Joss Whedon, Andrew Stanton, J... 0
[Larry J. Franco, Jonathan Hensleigh, James Ho... 3
[Howard Deutch, Mark Steven Johnson, Mark Stev... 0
[Forest Whitaker, Ronald Bass, Ronald Bass, Ez... 0
[Alan Silvestri, Elliot Davis, Nancy Meyers, N... 5
[Michael Mann, Michael Mann, Art Linson, Micha... 0
[Sydney Pollack, Barbara Benedek, Sydney Polla... 0
[David Loughery, Stephen Sommers, Peter Hewitt... 2
[Peter Hyams, Karen Elise Baldwin, Gene Quinta... 0
[Martin Campbell, Ian Fleming, Jeffrey Caine, ... 0
I've tried a number of approaches using list comprehension, enumerate, etc. I'm a bit embarrassed to put them here.
Any help will be appreciated.
Use indexing with list comprehension:
df['name'] = [a[b] for a , b in zip(df['crewname'], df['Director_loc'])]
print (df)
crewname Director_loc \
0 [John Lasseter, Joss Whedon, Andrew Stanton] 2
1 [Larry J. Franco, Jonathan Hensleigh] 1
name
0 Andrew Stanton
1 Jonathan Hensleigh
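If some Director_loc values might point outside the list, plain indexing raises IndexError. The sketch below adds a small guard around the same zip-based comprehension. The helper name pick, the column name director, and the out-of-range third row are illustrative assumptions, not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({
    "crewname": [
        ["John Lasseter", "Joss Whedon", "Andrew Stanton"],
        ["Larry J. Franco", "Jonathan Hensleigh"],
        ["Howard Deutch"],
    ],
    # The last position is deliberately out of range for illustration
    "Director_loc": [0, 1, 5],
})

def pick(crew, loc):
    """Return crew[loc], or None when loc is not a valid position."""
    return crew[loc] if 0 <= loc < len(crew) else None

# Same pairing as the answer: walk both columns in parallel
df["director"] = [pick(crew, loc)
                  for crew, loc in zip(df["crewname"], df["Director_loc"])]
print(df[["Director_loc", "director"]])
```

The third row ends up with a missing value instead of stopping the whole assignment with an exception.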