I have two dataframes: one with full names and another with nicknames. The nickname is always a portion of the person's full name, and the data is not sorted or indexed, so I can't just merge the two.
What I want as output is a single dataframe that contains the full name and the associated nickname, matched by a simple search: find the nickname inside the full name and pair them up.
Any solutions to this?
df = pd.DataFrame({'fullName': ['Claire Daines', 'Damian Lewis', 'Mandy Patinkin', 'Rupert Friend', 'F. Murray Abraham']})
df2 = pd.DataFrame({'nickName': ['Rupert','Abraham','Patinkin','Daines','Lewis']})
Thanks
Use Series.str.extract with the nicknames joined by | into one regex, each wrapped in \b word boundaries:
pat = '|'.join(r"\b{}\b".format(x) for x in df2['nickName'])
df['nickName'] = df['fullName'].str.extract('('+ pat + ')', expand=False)
print(df)
fullName nickName
0 Claire Daines Daines
1 Damian Lewis Lewis
2 Mandy Patinkin Patinkin
3 Rupert Friend Rupert
4 F. Murray Abraham Abraham
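If a nickname could ever contain a regex metacharacter (a dot, parentheses, etc.), escaping each one first is safer. A minimal sketch of that variant, reusing the same two frames (note that \b still assumes the nickname begins and ends with a word character):
import re
import pandas as pd

df = pd.DataFrame({'fullName': ['Claire Daines', 'Damian Lewis', 'Mandy Patinkin',
                                'Rupert Friend', 'F. Murray Abraham']})
df2 = pd.DataFrame({'nickName': ['Rupert', 'Abraham', 'Patinkin', 'Daines', 'Lewis']})

# escape each nickname so literal metacharacters are not treated as regex syntax
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in df2['nickName'])
df['nickName'] = df['fullName'].str.extract('(' + pat + ')', expand=False)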
I have two dataframes with actor names (their types are object) that look like the following:
df = pd.DataFrame({'Actors': ['Christian Bale', 'Ben Kingsley', 'Halley Bailey', 'Aaron Paul', ...]})
df2 = pd.DataFrame({'id': ['Halley Bailey - 1998', 'Coco Jones – 1998', ...]})
Normally I would use the following code to check whether one item is present in the other dataframe, but because of the years appended in the second dataframe I get 0 matches. Is there any smart way of getting around this?
df.assign(indf=df.Actors.isin(df2.id).astype(int))
Obviously, the code above did not work.
You can extract the actor names from df2['id'] and check if df['Actors'] is in it:
df.assign(indf=df['Actors'].isin(df2['id'].str.extract(r'(.*)(?=\s[-–])',
                                                       expand=False)).astype(int))
output:
Actors indf
0 Christian Bale 0
1 Ben Kingsley 0
2 Halley Bailey 1
3 Aaron Paul 0
Another, more generic, approach relying on a regex:
import re
regex = '|'.join(map(re.escape, df['Actors']))
# 'Christian\\ Bale|Ben\\ Kingsley|Halley\\ Bailey|Aaron\\ Paul'
actors = df2['id'].str.extract(f'({regex})', expand=False).dropna()
df.assign(indf=df['Actors'].isin(actors).astype(int))
used inputs:
df = pd.DataFrame({'Actors': ['Christian Bale', 'Ben Kingsley', 'Halley Bailey', 'Aaron Paul']})
df2 = pd.DataFrame({'id': ['Halley Bailey - 1998', 'Coco Jones – 1998']})
I have a pandas dataframe with a fullnames field. I want to change the logic so that the first and last name take the first and last word respectively, and everything in between goes into the middle name field.
Note: the full name can contain just two words, in which case the middle name will be null, and there may also be extra spaces between the names.
Current Logic:
fullnames = "Walter John Ross Schmidt"
first, middle, *last = fullnames.split()
print("First = {first}".format(first=first))
print("Middle = {middle}".format(middle=middle))
print("Last = {last}".format(last=" ".join(last)))
Output :
First = Walter
Middle = John
Last = Ross Schmidt
Expected Output :
FirstName = Walter
Middle = John Ross
Last = Schmidt
You can use negative indexing to get the last item in the list for the last name and also use a slice to get all but the first and last for the middle name:
fullnames = "Walter John Ross Schmidt"
parts = fullnames.split()          # split once on any whitespace
first = parts[0]
last = parts[-1]
middle = " ".join(parts[1:-1])
print("First = {first}".format(first=first))
print("Middle = {middle}".format(middle=middle))
print("Last = {last}".format(last=last))
PS: if you are working with a DataFrame, you can use:
df = pd.DataFrame({'fullnames':['Walter John Ross Schmidt']})
df = df.assign(**{
    'first': df['fullnames'].str.split().str[0],
    'middle': df['fullnames'].str.split().str[1:-1].str.join(' '),
    'last': df['fullnames'].str.split().str[-1]
})
Output:
fullnames first middle last
0 Walter John Ross Schmidt Walter John Ross Schmidt
You can use capture groups in the regex passed to str.extract(), which will let you do this in a single operation:
df = pd.DataFrame({
    "name": [
        "Walter John Ross Schmidt",
        "John Quincy Adams"
    ]
})
import re

rx = re.compile(r'^(\w+)\s+(.*?)\s+(\w+)$')
df[['first', 'middle', 'last']] = df['name'].str.extract(pat=rx, expand=True)
This gives you:
name first middle last
0 Walter John Ross Schmidt Walter John Ross Schmidt
1 John Quincy Adams John Quincy Adams
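The pattern above assumes at least three words, so a two-word name such as John Smith would not match. If the two-word case from the question also needs handling, one hedged variant is to make the middle group optional (middle then comes back as NaN):
import re
import pandas as pd

df = pd.DataFrame({"name": ["Walter John Ross Schmidt", "John Smith"]})

# the (?:\s+(.*?))? group is optional, so two-word names still match
rx = re.compile(r'^(\w+)(?:\s+(.*?))?\s+(\w+)$')
df[['first', 'middle', 'last']] = df['name'].str.extract(pat=rx, expand=True)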
I would use str.replace and str.extract here:
df["FirstName"] = df["FullName"].str.extract(r'^(\w+)')
df["Middle"] = df["FullName"].str.replace(r'^\w+\s+|\s+\w+$', '')
df["Last"] = df["FullName"].str.extract(r'(\w+)$')
You can use the following instead; joining the captured middle words reproduces the expected output:
first, *middle, last = fullnames.split()
middle = " ".join(middle)
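As a sketch of how that unpacking could be applied across a DataFrame column (assuming a fullnames column as in the snippet above and that every name has at least two words):
import pandas as pd

df = pd.DataFrame({'fullnames': ['Walter John Ross Schmidt', 'John Smith']})

def split_name(fullnames):
    # everything between the first and last word becomes the middle name
    first, *middle, last = fullnames.split()
    return pd.Series([first, " ".join(middle), last])

df[['first', 'middle', 'last']] = df['fullnames'].apply(split_name)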
Using the existing column Name, add a new column first_name to df that splits each name into words and takes the first word as the first name. For example, if the name is Elon Musk, it is split into the list ['Elon', 'Musk'] and the first word Elon is taken as the first name. If the name has only one word, that word itself is taken as the first name.
A snippet of the data frame
Name
Alemsah Ozturk
Igor Arinich
Christopher Maloney
DJ Holiday
Brian Tracy
Philip DeFranco
Patrick Collison
Peter Moore
Dr.Darrell Scott
Atul Gawande
Everette Taylor
Elon Musk
Nelly_Mo
This is what I have so far; I am not sure how to extract the name after I tokenize it:
import nltk
first = df['Name'].apply(lambda x: nltk.word_tokenize(x))
df["first_name"] = ...  # this is where I'm stuck
Try this snippet:
df["first_name"] = df['Name'].map(lambda x: x.split(' ')[0])
df["last_name"] = df['Name'].map(lambda x: x.split(' ')[1])
I have the following list :
personnages = ['Stanley','Kevin', 'Franck']
I want to use str.contains function to create a new pandas dataframe df3 :
df3 = df2[df2['speaker'].str.contains('|'.join(personnages))]
However, if a row of the speaker column contains 'Stanley & Kevin', I don't want it in df3.
How can I improve my code to do this?
Here is what I would do:
# toy data
df = pd.DataFrame({'speaker': ['Stanley & Kevin', 'Everybody',
                               'Kevin speaks', 'The speaker is Franck', 'Nobody']})
personnages = ['Stanley', 'Kevin', 'Franck']
pattern = '|'.join(personnages)

s = (df['speaker'].str
       .extractall(f'({pattern})')   # extract all personnages
       .groupby(level=0)[0]          # group by df's row
       .nunique().eq(1)              # keep rows with exactly one unique match
     )
df.loc[s.index[s]]
Output:
speaker
2 Kevin speaks
3 The speaker is Franck
You'll want to denote the start and end of the string in your regex, so that a row only matches when it contains a single name and nothing else:
import pandas as pd
speakers = ['Stanley', 'Kevin', 'Frank', 'Kevin & Frank']
df = pd.DataFrame([{'speaker': speaker} for speaker in speakers])
speaker
0 Stanley
1 Kevin
2 Frank
3 Kevin & Frank
r = '|'.join(speakers[:-1]) # gets all but the last one for the sake of example
# the ^ marks start of string, and $ is the end
df[df['speaker'].str.contains(f'^({r})$')]
speaker
0 Stanley
1 Kevin
2 Frank
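If the speaker column holds exact names as in this toy data and pandas 1.0+ is available, Series.str.fullmatch gives the same filter without writing the anchors by hand (a sketch under those assumptions):
import pandas as pd

speakers = ['Stanley', 'Kevin', 'Frank', 'Kevin & Frank']
df = pd.DataFrame({'speaker': speakers})

pattern = '|'.join(speakers[:-1])            # 'Stanley|Kevin|Frank'
df[df['speaker'].str.fullmatch(pattern)]     # keeps rows 0, 1 and 2 only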
I have a data frame where each row represent a full name and a website. I need to split that into 2 columns: name and website.
I've tried to use pandas str.split but I'm struggling to create a regex pattern that catches any initial 'http' plus the rest of the website. I have websites starting with http and https.
df = pd.DataFrame([['John Smith http://website.com'],['Alan Delon https://alandelon.com']])
I want to have a pattern that correctly identify the website to split my data. Any help would be very much appreciated.
Using str.split:
pd.DataFrame(df[0].str.split(r'\s(?=http)').tolist()).rename({0: 'Name', 1: 'Website'}, axis=1)
Output
Name Website
0 John Smith http://website.com
1 Alan Delon https://alandelon.com
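A raw-string pattern together with expand=True (and n=1 so only the first whitespace before http splits the string) produces the two columns directly; a minimal sketch, assuming the unnamed column 0 from the frame above:
import pandas as pd

df = pd.DataFrame([['John Smith http://website.com'], ['Alan Delon https://alandelon.com']])

df[['Name', 'Website']] = df[0].str.split(r'\s(?=http)', n=1, expand=True)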
Using str.extract
Ex:
df = pd.DataFrame([['John Smith http://website.com'],['Alan Delon https://alandelon.com']], columns=["data"])
df[["Name", "Url"]] = df["data"].str.extract(r"(.*?)(http.*)")
print(df)
Output:
data Name Url
0 John Smith http://website.com John Smith http://website.com
1 Alan Delon https://alandelon.com Alan Delon https://alandelon.com