startswith() function help needed in Pandas Dataframe - python

I have a Name column in a DataFrame that contains multiple names.
DataFrame:
import pandas as pd
df = pd.DataFrame({'name': ['Brailey, Mr. William Theodore Ronald', 'Roger Marie Bricoux',
                            "Mr. Roderick Robert Crispin",
                            "Cunningham", "Mr. Alfred Fleming"]})
OUTPUT
Name
0 Brailey, Mr. William Theodore Ronald
1 Roger Marie Bricoux
2 Mr. Roderick Robert Crispin
3 Cunningham
4 Mr. Alfred Fleming
I wrote a row classification function: given a row, it should return the class for that name.
mus = ['Brailey, Mr. William Theodore Ronald', 'Roger Marie Bricoux', 'John Frederick Preston Clarke']
def classify_role(row):
    if row.loc['name'] in mus:
        return 'musician'
Calling the function:
is_brailey = df['name'].str.startswith('Brailey')
print(classify_role(df[is_brailey].iloc[0]))
This should show 'musician'.
But the output shows a different class, so I think something is wrong in classify_role().
It must be this line:
if row.loc['name'] in mus:
Summary:
I need a solution where, if I pass the first name of a person to startswith() and that person is in mus, it returns 'musician'.

EDIT: If you want to test whether values exist in lists, you can create a dictionary and test membership with Series.isin:
mus = ['Brailey, Mr. William Theodore Ronald', 'Roger Marie Bricoux',
'John Frederick Preston Clarke']
cat1 = ['Mr. Alfred Fleming','Cunningham']
d = {'musician':mus, 'category':cat1}
for k, v in d.items():
    df.loc[df['Name'].isin(v), 'type'] = k
print(df)
Name type
0 Brailey, Mr. William Theodore Ronald musician
1 Roger Marie Bricoux musician
2 Mr. Roderick Robert Crispin NaN
3 Cunningham category
4 Mr. Alfred Fleming category
Your solution should be changed so that the function takes a single value and is applied to the Name column:
mus = ['Brailey, Mr. William Theodore Ronald', 'Roger Marie Bricoux',
       'John Frederick Preston Clarke']
def classify_role(row):
    if row in mus:
        return 'musician'
df['type'] = df['Name'].apply(classify_role)
print(df)
Name type
0 Brailey, Mr. William Theodore Ronald musician
1 Roger Marie Bricoux musician
2 Mr. Roderick Robert Crispin None
3 Cunningham None
4 Mr. Alfred Fleming None
You can pass a tuple of values to Series.str.startswith; the solution can be extended to match more categories with a dictionary:
d = {'musician': ['Brailey, Mr. William Theodore Ronald'],
     'cat1': ['Roger Marie Bricoux', 'Cunningham']}
for k, v in d.items():
    df.loc[df['Name'].str.startswith(tuple(v)), 'type'] = k
print(df)
Name type
0 Brailey, Mr. William Theodore Ronald musician
1 Roger Marie Bricoux cat1
2 Mr. Roderick Robert Crispin NaN
3 Cunningham cat1
4 Mr. Alfred Fleming NaN
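Since the question asks about matching on a first name only, the same loop can be fed short prefixes instead of full names. The following is a minimal sketch, not the answerer's exact code: the `prefixes` mapping and its labels are hypothetical, and passing a tuple to `Series.str.startswith` assumes a reasonably recent pandas (1.5 or later).

```python
import pandas as pd

df = pd.DataFrame({"Name": [
    "Brailey, Mr. William Theodore Ronald",
    "Roger Marie Bricoux",
    "Mr. Roderick Robert Crispin",
    "Cunningham",
    "Mr. Alfred Fleming",
]})

# Hypothetical mapping: label -> prefixes that identify members of that class.
prefixes = {
    "musician": ["Brailey", "Roger"],
    "cat1": ["Cunningham"],
}

# Start with no label, then fill in every row whose Name starts with a known prefix.
df["type"] = None
for label, starts in prefixes.items():
    mask = df["Name"].str.startswith(tuple(starts))
    df.loc[mask, "type"] = label

print(df)
```

Rows that match no prefix keep a missing value, so they can be inspected afterwards with `df[df["type"].isna()]`.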

Related

Matching two strings columns in "A vs B", then assign label into new column

I have a dataframe that looks like this:
Name F_Name L_Name Title
John Down John Down sth vs Down John
Dave Brown Dave Brown sth v Brown Dave
Mary Sith Mary Sith Sith Mary vs sth
Sam Walker Sam Walker sth vs Sam Walker
Chris Humpy Chris Humpy Humpy
John Hunter John Hunter John Hunter
Nola Smith Nola Smith Nola
Chuck Bass Chuck Bass Bass v sth
Rob Bank Rob Bank Rob v sth
Chris Ham Chris Ham Chris Ham
Angie Poppy Angie Poppy Poppy Angie
Joe Exhaust Joe Exhaust sth vs Joe
: : :
Tony Start Tony Start sth v Start
Tony Start Tony Start sth v james bb
Tony Start Tony Start Dave Sins
I would like to match the Name column against the Title column. If the Name appears before v or vs, the new column Label should be first. Otherwise, it should be second. If the Title column contains only the name, without v or vs, Label should be null.
Here is what the output dataframe would look like:
Name F_Name L_Name Title Label
John Down John Down sth vs Down John second
Dave Brown Dave Brown sth v Brown Dave second
Mary Sith Mary Sith Sith Mary vs sth first
Sam Walker Sam Walker sth vs Sam Walker second
Chris Humpy Chris Humpy Humpy null
John Hunter John Hunter John Hunter null
Nola Smith Nola Smith Nola null
Chuck Bass Chuck Bass Bass v sth first
Rob Bank Rob Bank Rob v sth first
Chris Ham Chris Ham Chris Ham null
Angie Poppy Angie Poppy Poppy Angie null
Joe Exhaust Joe Exhaust sth vs Joe second
: : : :
Tony Start Tony Start sth v Start second
Tony Start Tony Start sth v james bb null
Tony Start Tony Start Dave Sins null
I am thinking of splitting the Title column on v or vs into two new columns and then matching them against the Name column. But I do not know how to add the condition that checks whether the name appears before or after the v or vs. So I am wondering whether there is a better way to do this without splitting the Title column.
The idea is to split Title on v or vs, split each side on whitespace and convert it to a set, then test whether each set is a subset of the words in Name. A second condition uses Series.str.contains to detect titles without v or vs, and everything is passed to numpy.select:
import numpy as np
#split the Name column by spaces and convert to sets
names = df['Name'].str.split().apply(set)
#split Title by vs or v and convert both sides to sets; use an empty set for missing values
df1 = (df['Title'].str.split(r'\s+vs|v\s+', expand=True)
                  .apply(lambda x: x.str.split())
                  .applymap(lambda x: set(x) if isinstance(x, list) else set()))
#test subsets for both columns in df1
m11 = [label.issubset(name) for label, name in zip(df1[0], names)]
m12 = [label.issubset(name) for label, name in zip(df1[1], names)]
#test for titles with no vs or v
m2 = ~df['Title'].str.contains(r'\s+vs|v\s+')
#set values
df['Label'] = np.select([m2, m11, m12], [None, 'first', 'second'], None)
print(df)
Name F_Name L_Name Title Label
0 John Down John Down sth vs Down John second
1 Dave Brown Dave Brown sth v Brown Dave second
2 Mary Sith Mary Sith Sith Mary vs sth first
3 Sam Walker Sam Walker sth vs Sam Walker second
4 Chris Humpy Chris Humpy Humpy None
5 John Hunter John Hunter John Hunter None
6 Nola Smith Nola Smith Nola None
7 Chuck Bass Chuck Bass Bass v sth first
8 Rob Bank Rob Bank Rob v sth first
9 Chris Ham Chris Ham Chris Ham None
10 Angie Poppy Angie Poppy Poppy Angie None
11 Joe Exhaust Joe Exhaust sth vs Joe second
12 Tony Start Tony Start sth v Start second
13 Tony Start Tony Start sth v james bb None
14 Tony Start Tony Start Dave Sins None
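The same rule can also be written as a plain row-wise function, which may be easier to follow than the set columns above. This is a sketch under a few assumptions: `label_side` is a hypothetical helper name, the separator is taken to be a standalone `v` or `vs` surrounded by whitespace (slightly stricter than the pattern in the answer), and a side counts as a match only when all of its words appear in Name.

```python
import re
import pandas as pd

def label_side(name, title):
    """Return 'first', 'second' or None depending on where the name sits in the title."""
    # Split once on a standalone "v" or "vs" surrounded by whitespace.
    parts = re.split(r"\s+vs?\s+", title, maxsplit=1)
    if len(parts) != 2:
        return None  # no separator, so there is no side to report
    words = set(name.split())
    before, after = (set(p.split()) for p in parts)
    if before and before <= words:
        return "first"
    if after and after <= words:
        return "second"
    return None

df = pd.DataFrame({
    "Name": ["John Down", "Mary Sith", "Chris Humpy", "Chuck Bass", "Tony Start"],
    "Title": ["sth vs Down John", "Sith Mary vs sth", "Humpy", "Bass v sth", "sth v james bb"],
})
df["Label"] = [label_side(n, t) for n, t in zip(df["Name"], df["Title"])]
print(df)
```

A Python-level loop like this is slower than the vectorised version on large frames, but it keeps the matching rule in one place.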

How to slice pandas column with index list?

I'm trying to extract the first two words from a string in a DataFrame.
df["Name"]
Name
Anthony Frank Hawk
John Rodney Mullen
Robert Dean Silva Burnquis
Geoffrey Joseph Rowley
To get the index of the second space (" ") I tried this, but find returns NaN instead of the number of characters up to the second space.
df["temp"] = df["Name"].str.find(" ")+1
df["temp"] = df["Status"].str.find(" ", start=df["Status"], end=None)
df["temp"]
0 NaN
1 NaN
2 NaN
3 NaN
The last step is to slice those names. I tried this code, but it doesn't work either.
df["Status"] = df["Status"].str.slice(0,df["temp"])
df["Status"]
0 NaN
1 NaN
2 NaN
3 NaN
expected return
0 Anthony Frank
1 John Rodney
2 Robert Dean
3 Geoffrey Joseph
If you have a more efficient way to do this, please let me know!
df['temp'] = df.Name.str.rpartition().get(0)
df
Output
Name temp
0 Anthony Frank Hawk Anthony Frank
1 John Rodney Mullen John Rodney
2 Robert Dean Silva Burnquis Robert Dean Silva
3 Geoffrey Joseph Rowley Geoffrey Joseph
EDIT
If only first two elements are required in output.
df['temp'] = df.Name.str.split().str[:2].str.join(' ')
df
OR
df['temp'] = df.Name.str.split().apply(lambda x:' '.join(x[:2]))
df
OR
df['temp'] = df.Name.str.split().apply(lambda x:' '.join([x[0], x[1]]))
df
Output
Name temp
0 Anthony Frank Hawk Anthony Frank
1 John Rodney Mullen John Rodney
2 Robert Dean Silva Burnquis Robert Dean
3 Geoffrey Joseph Rowley Geoffrey Joseph
You can use str.index(substring) instead of str.find; it returns the lowest index at which the substring (such as a single space " ") is found in the string, and raises ValueError if it is not found. You can then split the string at that index and apply the same step to the second piece of the resulting list.
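The index-based idea can be sketched per string as follows. This is an illustration rather than the answerer's code: `first_two_words` is a hypothetical helper, and it uses Python's `str.find` (which returns -1 when nothing is found) in place of `str.index` so that names with fewer than three words do not raise.

```python
import pandas as pd

def first_two_words(text):
    """Cut the string just before its second space, if there is one."""
    first = text.find(" ")
    if first == -1:
        return text  # single word: nothing to cut
    second = text.find(" ", first + 1)
    return text if second == -1 else text[:second]

names = pd.Series([
    "Anthony Frank Hawk",
    "John Rodney Mullen",
    "Robert Dean Silva Burnquis",
    "Geoffrey Joseph Rowley",
])
result = names.map(first_two_words)
print(result)
```

The original attempt returned NaN because `Series.str.find` and `Series.str.slice` expect scalar positions, not another Series; applying a function to each element avoids that.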

Can you make value_counts on a specific interval of characters with pandas?

So, I have a column "Names". If I do:
df['Names'].value_counts()
I get this:
Mr. Richard Vance 1
Mrs. Angela Bell 1
Mr. Stewart Randall 1
Mr. Andrew Ogden 1
Mrs. Maria Berry 1
..
Mrs. Lillian Wallace 1
Mr. William Bailey 1
Mr. Paul Ball 1
Miss Pippa Bond 1
Miss Caroline Gray 1
That's fine: there are lots of DISTINCT names. But what I want is to run value_counts() only on the leading characters up to the first space (i.e. the space that divides, for instance, Miss or Mrs. from Lillian Wallace), so that the output would be, for example:
Mrs. 1000
Mr. 2000
Miss 2000
I just want to know how many distinct variants there are in the Names column so that, in a second stage, I can create another variable (namely gender) based on those variants.
You can use value_counts(dropna=False) on str[0] after a str.split():
df = pd.DataFrame({'Names': ['Mr. Richard Vance','Mrs. Angela Bell','Mr. Stewart Randall','Mr. Andrew Ogden','Mrs. Maria Berry','Mrs. Lillian Wallace','Mr. William Bailey','Mr. Paul Ball','Miss Pippa Bond','Miss Caroline Gray','']})
df.Names.str.split().str[0].value_counts(dropna=False)
# Mr. 5
# Mrs. 3
# Miss 2
# NaN 1
# Name: Names, dtype: int64
If you want to know the unique values and if there's always a space you can do this.
df = pd.DataFrame(['Mr. Richard Vance',
'Mrs. Angela Bell',
'Mr. Stewart Randall',
'Mr. Andrew Ogden',
'Mrs. Maria Berry',
'Mrs. Lillian Wallace',
'Mr. William Bailey',
'Mr. Paul Ball',
'Miss Pippa Bond',
'Miss Caroline Gray'], columns=['names'])
df['names'].str.split(' ').str[0].unique().tolist()
Output is a list:
['Mr.', 'Mrs.', 'Miss']
Here is a solution. You can use regex:
#Dataset
Names
0 Mr. Richard Vance
1 Mrs. Angela Bell
2 Mr. Stewart Randall
3 Mr. Andrew Ogden
4 Mrs. Maria Berry
5 Mrs. Lillian Wallace
df['Names'].str.extract(r'(\w+\.\s)').value_counts()
#Output:
Mr. 3
Mrs. 3
Note: (\w+\.\s) extracts the Mr. and Mrs. parts (or any title ending in a period, like Dr.) from the names; a title without a period, such as Miss, is not matched.
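To count titles whether or not they end in a period, one option is to capture everything up to the first whitespace. The sketch below also covers the second stage mentioned in the question; the `gender_by_title` mapping is a hypothetical example, not something given in the thread.

```python
import pandas as pd

names = pd.Series([
    "Mr. Richard Vance", "Mrs. Angela Bell", "Mr. Stewart Randall",
    "Mr. Andrew Ogden", "Mrs. Maria Berry", "Mrs. Lillian Wallace",
    "Mr. William Bailey", "Mr. Paul Ball", "Miss Pippa Bond",
    "Miss Caroline Gray",
], name="Names")

# Capture the leading run of non-whitespace characters (the title).
titles = names.str.extract(r"^(\S+)", expand=False)
counts = titles.value_counts()
print(counts)

# Hypothetical second stage: derive a gender column from the title.
gender_by_title = {"Mr.": "male", "Mrs.": "female", "Miss": "female"}
gender = titles.map(gender_by_title)
```

Titles missing from the mapping come out as NaN in `gender`, which makes unexpected variants easy to spot.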

Partial match string in two dataframe columns

I have two dataframes. df1 contains partial names of persons, and df2 contains full names of persons, their dob, etc. I want to partially match the df1['Partial_names'] column against the df2['Full_names'] column. For example, for Martin Lues, all rows in Full_names that contain Martin should be fetched. The dataframes and the expected result are shown below.
df1 = pd.DataFrame()
df2 = pd.DataFrame()
df1['Partial_names'] = ['John Smith', 'Leo Lauana', 'Adam Marry', 'Martin Lues']
df2['Full_names'] = ['John Smith Wade', 'Adam Blake Marry', 'Riley Leo Lauana',
                     'Martin Smith', 'Martin Flex Leo']
Partial_names
1 John Smith
2 Leo Lauana
3 Adam Marry
4 Martin Lues
5 Martin Author
Full_names
1 Martin Smith
2 Riley Leo Lauana
3 Adam Blake Marry
4 Jeff Hard Jin
5 Martin Flex Leo
Partial_names Resulting_Column_with_full_names
1 John Smith John Smith Wade
2 Leo Lauana Riley Leo Lauana
3 Adam Marry Adam Blake Marry
4 Martin Lues Martin Smith
Martin Flex Leo
In reality, both dataframes have more rows.
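The question does not spell out the matching rule, so the following is only a sketch under an assumed rule that happens to reproduce the expected table: a full name matches when it contains every word of the partial name, and if no full name does, the match falls back to the first word alone. `find_matches` and the `matches` column are hypothetical names.

```python
import pandas as pd

df1 = pd.DataFrame({"Partial_names": ["John Smith", "Leo Lauana", "Adam Marry", "Martin Lues"]})
df2 = pd.DataFrame({"Full_names": ["John Smith Wade", "Adam Blake Marry", "Riley Leo Lauana",
                                   "Martin Smith", "Martin Flex Leo"]})

full_names = df2["Full_names"].tolist()

def find_matches(partial):
    """Assumed rule: all words must match; otherwise fall back to the first word."""
    words = partial.split()
    strict = [f for f in full_names if set(words) <= set(f.split())]
    if strict:
        return strict
    return [f for f in full_names if words[0] in f.split()]

# One list of matches per partial name, then one row per match.
df1["matches"] = df1["Partial_names"].map(find_matches)
result = df1.explode("matches", ignore_index=True)
print(result)
```

This compares every partial name with every full name, so for large frames it may be worth indexing the full names by word first.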

A better/faster way to handle human names in Pandas columns?

I am dealing with a large amount of data that includes the standard five columns for human names (prefix, firstname, middlename, lastname, suffix) and I would like to merge them in a separate column as a readable name. The issue I have is with handling blank values - the issue creates spacing problems. Also, I cannot modify the original columns. My current process feels a little insane (but it works!) so I am looking for a more elegant solution.
My current code:
def add_space_prefix(x):
    x = str(x)
    if len(x) > 0:
        return x + ' '
    else:
        return x
def add_space_middle(x):
    x = str(x)
    if len(x) > 0:
        return ' ' + x
    else:
        return x
def add_space_suffix(x):
    x = str(x)
    if len(x) > 0:
        return ', ' + x
    else:
        return x
df["middlename"] = df["middlename"].map(lambda x: add_space_middle(x))
df["prefix"] = df["prefix"].map(lambda x: add_space_prefix(x))
df["suffix"] = df["suffix"].map(lambda x: add_space_suffix(x))
df['fullname'] = df["prefix"] + df["firstname"] + df[
    "middlename"] + ' ' + df["lastname"] + df['suffix']
Sample Dataframe
prefix firstname middlename lastname suffix fullname
0 Michael Hobart Jr. Michael Hobart, Jr.
1 Mr. Alan Lilt Mr. Alan Lilt
2 Jon A. Smith III Jon A. Smith, III
3 Joe Miller Joe Miller
4 Mika Jennifer Shabosky Mika Jennifer Shabosky
5 Mrs. Angela Calder Mrs. Angela Calder
6 Boris Al Bert Esq. Boris Al Bert, Esq.
7 Dr. Natasha Chorus Dr. Natasha Chorus
8 Bill Gibbons Bill Gibbons
Option 1
' '.join and pd.Series.str
In this solution we join the entire row by spaces. This may lead to spaces at the beginning or end of the string or with 2 or more spaces in the middle. We handle this by chaining string accessor methods.
df.assign(
    lastname=df.lastname + ','
).apply(' '.join, 1).str.replace(r'\s+', ' ', regex=True).str.strip(' ,')
0 Michael Hobart, Jr.
1 Mr. Alan Lilt
2 Jon A. Smith, III
3 Joe Miller
4 Mika Jennifer Shabosky
5 Mrs. Angela Calder
6 Boris Al Bert, Esq.
7 Dr. Natasha Chorus
8 Bill Gibbons
dtype: object
df['fullname'] = df.assign(
    lastname=df.lastname + ','
).apply(' '.join, 1).str.replace(r'\s+', ' ', regex=True).str.strip(' ,')
df
prefix firstname middlename lastname suffix fullname
0 Michael Hobart Jr. Michael Hobart, Jr.
1 Mr. Alan Lilt Mr. Alan Lilt
2 Jon A. Smith III Jon A. Smith, III
3 Joe Miller Joe Miller
4 Mika Jennifer Shabosky Mika Jennifer Shabosky
5 Mrs. Angela Calder Mrs. Angela Calder
6 Boris Al Bert Esq. Boris Al Bert, Esq.
7 Dr. Natasha Chorus Dr. Natasha Chorus
8 Bill Gibbons Bill Gibbons
Option 2
list comprehension
In this solution, we perform the same activities as with the first solution, but we bundle the string operations together and within a comprehension.
import re
[re.sub(r'\s+', ' ', ' '.join(s)).strip(' ,')
 for s in df.assign(lastname=df.lastname + ',').values.tolist()]
['Michael Hobart, Jr.',
'Mr. Alan Lilt',
'Jon A. Smith, III',
'Joe Miller',
'Mika Jennifer Shabosky',
'Mrs. Angela Calder',
'Boris Al Bert, Esq.',
'Dr. Natasha Chorus',
'Bill Gibbons']
df['fullname'] = [re.sub(r'\s+', ' ', ' '.join(s)).strip(' ,')
                  for s in df.assign(lastname=df.lastname + ',').values.tolist()]
df
prefix firstname middlename lastname suffix fullname
0 Michael Hobart Jr. Michael Hobart, Jr.
1 Mr. Alan Lilt Mr. Alan Lilt
2 Jon A. Smith III Jon A. Smith, III
3 Joe Miller Joe Miller
4 Mika Jennifer Shabosky Mika Jennifer Shabosky
5 Mrs. Angela Calder Mrs. Angela Calder
6 Boris Al Bert Esq. Boris Al Bert, Esq.
7 Dr. Natasha Chorus Dr. Natasha Chorus
8 Bill Gibbons Bill Gibbons
Option 3
pd.DataFrame.replace and pd.DataFrame.stack
This one is a bit different in that we replace blanks '' with np.nan so that, when we stack, the np.nan values are naturally dropped. This makes joining with ' ' more straightforward.
df.assign(
    lastname=df.lastname + ','
).replace('', np.nan).stack().groupby(level=0).apply(' '.join).str.strip(',')
0 Michael Hobart, Jr.
1 Mr. Alan Lilt
2 Jon A. Smith, III
3 Joe Miller
4 Mika Jennifer Shabosky
5 Mrs. Angela Calder
6 Boris Al Bert, Esq.
7 Dr. Natasha Chorus
8 Bill Gibbons
dtype: object
df['fullname'] = df.assign(
    lastname=df.lastname + ','
).replace('', np.nan).stack().groupby(level=0).apply(' '.join).str.strip(',')
df
prefix firstname middlename lastname suffix fullname
0 Michael Hobart Jr. Michael Hobart, Jr.
1 Mr. Alan Lilt Mr. Alan Lilt
2 Jon A. Smith III Jon A. Smith, III
3 Joe Miller Joe Miller
4 Mika Jennifer Shabosky Mika Jennifer Shabosky
5 Mrs. Angela Calder Mrs. Angela Calder
6 Boris Al Bert Esq. Boris Al Bert, Esq.
7 Dr. Natasha Chorus Dr. Natasha Chorus
8 Bill Gibbons Bill Gibbons
Timing
Bundling the string operations within a comprehension is fastest:
%timeit df.assign(fullname=df.replace('', np.nan).stack().groupby(level=0).apply(' '.join))
%timeit df.assign(fullname=df.apply(' '.join, 1).str.replace(r'\s+', ' ', regex=True).str.strip())
%timeit df.assign(fullname=[re.sub(r'\s+', ' ', ' '.join(s)).strip() for s in df.values.tolist()])
100 loops, best of 3: 2.51 ms per loop
1000 loops, best of 3: 979 µs per loop
1000 loops, best of 3: 384 µs per loop
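For readers who want something to run end to end, here is a self-contained variant of the comprehension approach. It is a sketch, not one of the three options above: it assumes blank fields are stored as empty strings, `join_name` is a hypothetical helper, and instead of appending a comma to lastname and stripping it later, it adds the comma only when a suffix is present.

```python
import pandas as pd

cols = ["prefix", "firstname", "middlename", "lastname", "suffix"]
df = pd.DataFrame([
    ["", "Michael", "", "Hobart", "Jr."],
    ["Mr.", "Alan", "", "Lilt", ""],
    ["", "Jon", "A.", "Smith", "III"],
], columns=cols)

def join_name(prefix, first, middle, last, suffix):
    """Join the non-empty name parts with spaces; a suffix goes after a comma."""
    core = " ".join(part for part in (prefix, first, middle, last) if part)
    return f"{core}, {suffix}" if suffix else core

# The original five columns are left untouched; only fullname is added.
df["fullname"] = [join_name(*row) for row in df[cols].to_numpy().tolist()]
print(df["fullname"].tolist())
```

If blanks can also be NaN, calling `df[cols].fillna("")` before the comprehension keeps the helper unchanged.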
