I'm trying to extract the first two words from a string in a dataframe.
df["Name"]
Name
Anthony Frank Hawk
John Rodney Mullen
Robert Dean Silva Burnquis
Geoffrey Joseph Rowley
To get the index of the second space, I tried this, but find returns NaN instead of the number of characters up to the second space.
df["temp"] = df["Name"].str.find(" ")+1
df["temp"] = df["Status"].str.find(" ", start=df["Status"], end=None)
df["temp"]
0 NaN
1 NaN
2 NaN
3 NaN
The last step is to slice those names. I tried this code, but it doesn't work either.
df["Status"] = df["Status"].str.slice(0,df["temp"])
df["Status"]
0 NaN
1 NaN
2 NaN
3 NaN
expected return
0 Anthony Frank
1 John Rodney
2 Robert Dean
3 Geoffrey Joseph
If you have a more efficient way to do this, please let me know!
df['temp'] = df.Name.str.rpartition().get(0)
df
Output
Name temp
0 Anthony Frank Hawk Anthony Frank
1 John Rodney Mullen John Rodney
2 Robert Dean Silva Burnquis Robert Dean Silva
3 Geoffrey Joseph Rowley Geoffrey Joseph
EDIT
If only the first two elements are required in the output:
df['temp'] = df.Name.str.split().str[:2].str.join(' ')
df
OR
df['temp'] = df.Name.str.split().apply(lambda x:' '.join(x[:2]))
df
OR
df['temp'] = df.Name.str.split().apply(lambda x:' '.join([x[0], x[1]]))
df
Output
Name temp
0 Anthony Frank Hawk Anthony Frank
1 John Rodney Mullen John Rodney
2 Robert Dean Silva Burnquis Robert Dean
3 Geoffrey Joseph Rowley Geoffrey Joseph
You can use str.index(substring) instead of str.find. It returns the lowest index at which the substring (such as " ", a single space) is found in the string, and raises ValueError where str.find would return -1. Then you can split the string at that index and reapply the above to the second string in the resulting list.
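A minimal sketch of the find-then-slice idea from the question, done row by row with Python's built-in str.find rather than Series.str.find, whose start argument is documented as a single integer and not a per-row Series. The helper name first_two_words is hypothetical, every value in Name is assumed to be a string, and names with fewer than three words are assumed to be returned unchanged.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Anthony Frank Hawk", "John Rodney Mullen",
                            "Robert Dean Silva Burnquis", "Geoffrey Joseph Rowley"]})

def first_two_words(name):
    # hypothetical helper: cut the string just before the second space
    first = name.find(" ")
    if first == -1:
        return name  # only one word, nothing to cut
    second = name.find(" ", first + 1)  # search again, starting after the first space
    return name if second == -1 else name[:second]

df["temp"] = df["Name"].map(first_two_words)
print(df)
```

This keeps the two-step logic visible; the split-based answers above are shorter when the intermediate index is not needed.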
I have a dataframe that looks like this:
Name F_Name L_Name Title
John Down John Down sth vs Down John
Dave Brown Dave Brown sth v Brown Dave
Mary Sith Mary Sith Sith Mary vs sth
Sam Walker Sam Walker sth vs Sam Walker
Chris Humpy Chris Humpy Humpy
John Hunter John Hunter John Hunter
Nola Smith Nola Smith Nola
Chuck Bass Chuck Bass Bass v sth
Rob Bank Rob Bank Rob v sth
Chris Ham Chris Ham Chris Ham
Angie Poppy Angie Poppy Poppy Angie
Joe Exhaust Joe Exhaust sth vs Joe
: : :
Tony Start Tony Start sth v Start
Tony Start Tony Start sth v james bb
Tony Start Tony Start Dave Sins
I would like to match the Name column against the Title column. If the name appears before v or vs, the new column Label should be first; if it appears after, it should be second. If the Title column has only the name, without v or vs, Label should be null.
Here is what the output dataframe would look like:
Name F_Name L_Name Title Label
John Down John Down sth vs Down John second
Dave Brown Dave Brown sth v Brown Dave second
Mary Sith Mary Sith Sith Mary vs sth first
Sam Walker Sam Walker sth vs Sam Walker second
Chris Humpy Chris Humpy Humpy null
John Hunter John Hunter John Hunter null
Nola Smith Nola Smith Nola null
Chuck Bass Chuck Bass Bass v sth first
Rob Bank Rob Bank Rob vs sth first
Chris Ham Chris Ham Chris Ham null
Angie Poppy Angie Poppy Poppy Angie null
Joe Exhaust Joe Exhaust sth vs Joe second
: : : :
Tony Start Tony Start sth v Start second
Tony Start Tony Start sth v james b null
Tony Start Tony Start Dave Sins null
I am thinking of splitting the Title column on v or vs into two new columns and then matching them with the Name column. But I do not know how to add the conditions that check whether the name appears before the v or vs. Is there a better way to do this without splitting the Title column?
The idea for matching: split the Title on v or vs, split each part on spaces and convert it to a set, then test whether each set is a subset of the Name words. Rows with no v or vs are detected with Series.str.contains, and everything is passed to numpy.select:
import numpy as np

# regex that matches v or vs only as a standalone word between whitespace
sep = r'\s+vs?\s+'
# convert the Name column, split by spaces, to sets
names = df['Name'].str.split().apply(set)
# split Title by v or vs and convert both parts to sets; a missing part becomes an empty set
df1 = (df['Title'].str.split(sep, expand=True)
         .apply(lambda x: x.str.split())
         .applymap(lambda x: set(x) if isinstance(x, list) else set()))
# test whether each part is a subset of the name words
m11 = [label.issubset(name) for label, name in zip(df1[0], names)]
m12 = [label.issubset(name) for label, name in zip(df1[1], names)]
# test for titles with no v or vs
m2 = ~df['Title'].str.contains(sep)
# set values
df['Label'] = np.select([m2, m11, m12], [None, 'first', 'second'], None)
print (df)
Name F_Name L_Name Title Label
0 John Down John Down sth vs Down John second
1 Dave Brown Dave Brown sth v Brown Dave second
2 Mary Sith Mary Sith Sith Mary vs sth first
3 Sam Walker Sam Walker sth vs Sam Walker second
4 Chris Humpy Chris Humpy Humpy None
5 John Hunter John Hunter John Hunter None
6 Nola Smith Nola Smith Nola None
7 Chuck Bass Chuck Bass Bass v sth first
8 Rob Bank Rob Bank Rob v sth first
9 Chris Ham Chris Ham Chris Ham None
10 Angie Poppy Angie Poppy Poppy Angie None
11 Joe Exhaust Joe Exhaust sth vs Joe second
12 Tony Start Tony Start sth v Start second
13 Tony Start Tony Start sth v james bb None
14 Tony Start Tony Start Dave Sins None
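For comparison, the same rule can be sketched as a plain row-wise function. This is an illustration under assumptions, not the answer's method: the helper name label_for is hypothetical, the separator is assumed to be a standalone v or vs surrounded by whitespace, and only the first separator in a title is considered.

```python
import re
import pandas as pd

def label_for(name, title):
    # hypothetical helper: split the title once on a standalone "v" or "vs"
    parts = re.split(r"\s+vs?\s+", title, maxsplit=1)
    if len(parts) < 2:
        return None  # no v / vs in the title
    name_words = set(name.split())
    before, after = (set(p.split()) for p in parts)
    if before and before <= name_words:
        return "first"   # every word before the separator belongs to the name
    if after and after <= name_words:
        return "second"  # every word after the separator belongs to the name
    return None

df = pd.DataFrame({
    "Name": ["John Down", "Mary Sith", "Chris Humpy", "Tony Start"],
    "Title": ["sth vs Down John", "Sith Mary vs sth", "Humpy", "sth v james bb"],
})
df["Label"] = [label_for(n, t) for n, t in zip(df["Name"], df["Title"])]
print(df)
```

A Python-level loop like this is slower than the set-based masks on large frames, but it is easier to step through when a row gets an unexpected label.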
I have two dataframes. One contains contact information for constituents. The other was created to pair up constituents that might be part of the same household.
Sample:
data1 = {'Household_0':['1234567','2345678','3456789','4567890'],
'Individual_0':['1111111','2222222','3333333','4444444'],
'Individual_1':['5555555','6666666','7777777','']}
df1=pd.DataFrame(data1)
data2 = {'Constituent Id':['1234567','2345678','3456789','4567890',
'1111111','2222222','3333333','4444444',
'5555555','6666666','7777777'],
'Display Name':['Clark Kent and Lois Lane','Bruce Banner and Betty Ross',
'Tony Stark and Pepper Pots','Steve Rogers','Clark Kent','Bruce Banner',
'Tony Stark','Steve Rogers','Lois Lane','Betty Ross','Pepper Pots']}
df2=pd.DataFrame(data2)
Resulting in:
df1
Household_0 Individual_0 Individual_1
0 1234567 1111111 5555555
1 2345678 2222222 6666666
2 3456789 3333333 7777777
3 4567890 4444444
df2
Constituent Id Display Name
0 1234567 Clark Kent and Lois Lane
1 2345678 Bruce Banner and Betty Ross
2 3456789 Tony Stark and Pepper Pots
3 4567890 Steve Rogers
4 1111111 Clark Kent
5 2222222 Bruce Banner
6 3333333 Tony Stark
7 4444444 Steve Rogers
8 5555555 Lois Lane
9 6666666 Betty Ross
10 7777777 Pepper Pots
I would like to take df1, reference the Constituent Id out of df2, and create a new dataframe that has the names of the constituents instead of their IDs, so that we can ensure they are truly family/household members.
I believe I can do this by iterating, but that seems like the wrong approach. Is there a straightforward way to do this?
You can map each column of df1 with a Series built from df2: set Constituent Id as the index and select the Display Name column. Use apply to repeat the operation on each column.
print (df1.apply(lambda x: x.map(df2.set_index('Constituent Id')['Display Name'])))
Household_0 Individual_0 Individual_1
0 Clark Kent and Lois Lane Clark Kent Lois Lane
1 Bruce Banner and Betty Ross Bruce Banner Betty Ross
2 Tony Stark and Pepper Pots Tony Stark Pepper Pots
3 Steve Rogers Steve Rogers NaN
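A small variation on the same idea, sketched with the sample data from the question: the lookup Series is built once instead of inside the lambda, and the result is stored in a new dataframe so df1 keeps its IDs. The names lookup and named are illustrative, and Constituent Id is assumed to be unique in df2.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Household_0": ["1234567", "2345678", "3456789", "4567890"],
    "Individual_0": ["1111111", "2222222", "3333333", "4444444"],
    "Individual_1": ["5555555", "6666666", "7777777", ""],
})
df2 = pd.DataFrame({
    "Constituent Id": ["1234567", "2345678", "3456789", "4567890",
                       "1111111", "2222222", "3333333", "4444444",
                       "5555555", "6666666", "7777777"],
    "Display Name": ["Clark Kent and Lois Lane", "Bruce Banner and Betty Ross",
                     "Tony Stark and Pepper Pots", "Steve Rogers", "Clark Kent",
                     "Bruce Banner", "Tony Stark", "Steve Rogers", "Lois Lane",
                     "Betty Ross", "Pepper Pots"],
})

# build the id -> name lookup once; ids missing from df2 (such as '') map to NaN
lookup = df2.set_index("Constituent Id")["Display Name"]
named = df1.apply(lambda col: col.map(lookup))
print(named)
```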
You can pipeline melt, merge and pivot_table.
df3 = (
df1
.reset_index()
.melt('index')
.merge(df2, left_on='value', right_on='Constituent Id')
.pivot_table(values='Display Name', index='index', columns='variable', aggfunc='last')
)
print(df3)
outputs
variable Household_0 Individual_0 Individual_1
index
0 Clark Kent and Lois Lane Clark Kent Lois Lane
1 Bruce Banner and Betty Ross Bruce Banner Betty Ross
2 Tony Stark and Pepper Pots Tony Stark Pepper Pots
3 Steve Rogers Steve Rogers NaN
You can also try using .applymap() to link the two together.
reference = df2.set_index('Constituent Id')['Display Name'].to_dict()
df1[df1.columns] = df1[df1.columns].applymap(reference.get)
I have two columns in a DataFrame. crewname is a list of crew members who worked on a film. Director_loc is the position of the director within that list.
I want to create a new column which has the name of the director.
crewname Director_loc
[John Lasseter, Joss Whedon, Andrew Stanton, J... 0
[Larry J. Franco, Jonathan Hensleigh, James Ho... 3
[Howard Deutch, Mark Steven Johnson, Mark Stev... 0
[Forest Whitaker, Ronald Bass, Ronald Bass, Ez... 0
[Alan Silvestri, Elliot Davis, Nancy Meyers, N... 5
[Michael Mann, Michael Mann, Art Linson, Micha... 0
[Sydney Pollack, Barbara Benedek, Sydney Polla... 0
[David Loughery, Stephen Sommers, Peter Hewitt... 2
[Peter Hyams, Karen Elise Baldwin, Gene Quinta... 0
[Martin Campbell, Ian Fleming, Jeffrey Caine, ... 0
I've tried a number of codes using list comprehension, enumerate etc. I'm a bit embarrassed to put them here.
Any help will be appreciated.
Use indexing with list comprehension:
df['name'] = [a[b] for a, b in zip(df['crewname'], df['Director_loc'])]
print (df)
crewname Director_loc \
0 [John Lasseter, Joss Whedon, Andrew Stanton] 2
1 [Larry J. Franco, Jonathan Hensleigh] 1
name
0 Andrew Stanton
1 Jonathan Hensleigh
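If some Director_loc values might fall outside the crew list, plain a[b] indexing would raise IndexError. A hedged sketch of a guarded version follows; the helper name pick_director and the choice to return None for an out-of-range position are assumptions, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "crewname": [["John Lasseter", "Joss Whedon", "Andrew Stanton"],
                 ["Larry J. Franco", "Jonathan Hensleigh"]],
    "Director_loc": [2, 5],  # 5 is deliberately out of range for the second list
})

def pick_director(crew, loc):
    # hypothetical helper: return None instead of raising when loc is invalid
    return crew[loc] if 0 <= loc < len(crew) else None

df["name"] = [pick_director(a, b) for a, b in zip(df["crewname"], df["Director_loc"])]
print(df[["Director_loc", "name"]])
```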
Input:
I have a Dataframe as follows
Full_Name Name1 Name2
John Mathew Davidson John Davidson
Paul Theodre Luther Paul Theodre
Victor George Mary George Mary
Output:
I need to find the Remaining_name column as shown below
Full_Name Name1 Name2 Remaining_name
John Mathew Davidson John Davidson Mathew
Paul Theodre Luther Paul Theodre Luther
Victor George Mary George Mary Victor
Clarification:
I need to compare more than one column's value (word) in another column's value (sentence) and find the unmatched words which could be in any position of the whole string.
This is the data you provided:
import pandas as pd
full_name = ['John Mathew Davidson', 'Paul Theodre Luther', 'Victor George Mary']
name_1 = ['John', 'Paul', 'George']
name_2 = ['Davidson', 'Theodre', 'Mary']
df = pd.DataFrame({'Full_Name':full_name, 'Name1':name_1, 'Name2':name_2 })
To perform an action on multiple columns in a row, the best approach is to define the function separately. It makes the code more readable and easier to debug.
The function will take a DataFrame row as an input:
def find_missing_name(row):
    # collect the known names in a list to check against later
    known_names = [row['Name1'], row['Name2']]
    # convert the full name to a list by splitting it on spaces
    # WARNING: this works only if the items in 'Full_Name' are separated by a space
    full_name_list = row['Full_Name'].split(' ')
    # keep only the words of the full name that are not in known_names
    missing_name = [x for x in full_name_list if x not in known_names]
    # if more than one name is missing, join them into one comma-separated string
    missing_name = ','.join(missing_name)
    return missing_name
Now apply the function to the existing DataFrame:
df['missing_name'] = df.apply(find_missing_name, axis=1) ## axis=1 means 'apply to each row', where axis=0 means 'apply to each column'
Output:
Hope this helps :)
You can do so in one line with:
df['Remaining_name'] = df.apply(lambda x: [i for i in x['Full_Name'].split() if all(i not in x[c] for c in df.columns[1:])], axis=1)
This returns the Remaining_name column as lists, which is helpful when a name has more than three substrings, for example:
Full_Name Name1 Name2 Remaining_name
0 John Mathew Davidson John Davidson [Mathew]
1 Paul Theodre Luther Paul Theodre [Luther]
2 Victor George Mary George Mary [Victor]
3 Henry Patrick John Harrison Patrick Henry [John, Harrison]
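One thing to watch in the one-liner: `i not in x[c]` is a substring test on strings, so a word like John counts as matched when a name column holds Johnson. A sketch that compares whole words instead is below; the extra "John Johnson Smith" row is invented purely to show the difference, and the name_cols list and the helper name remaining are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "Full_Name": ["John Mathew Davidson", "Paul Theodre Luther",
                  "Victor George Mary", "John Johnson Smith"],  # last row is made up
    "Name1": ["John", "Paul", "George", "Johnson"],
    "Name2": ["Davidson", "Theodre", "Mary", "Smith"],
})
name_cols = ["Name1", "Name2"]

def remaining(row):
    # compare whole words against the set of known names, not substrings
    known = {row[c] for c in name_cols}
    return [w for w in row["Full_Name"].split() if w not in known]

df["Remaining_name"] = df.apply(remaining, axis=1)
print(df)
```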
Solution using replace (apply with axis=1 still runs row by row, so it is not truly vectorized):
df['Remaining_name'] = df.apply(lambda x: x['Full_Name'].replace(x['Name1'], '').replace(x['Name2'], ''), axis=1).str.strip()
Full_Name Name1 Name2 Remaining_name
0 John Mathew Davidson John Davidson Mathew
1 Paul Theodre Luther Paul Theodre Luther
2 Victor George Mary George Mary Victor
Edit: If you have many columns starting with Name, you can select that slice of columns and replace the values in Full_Name based on a regex pattern:
df['tmp'] = df[df.columns[df.columns.str.startswith('Name')]].apply('|'.join, axis = 1)
df['Remaining_name'] = df.apply(lambda x: x.replace(x['tmp'], '', regex = True), axis = 1)['Full_Name'].str.strip()
df.drop('tmp', axis =1, inplace = True)
Full_Name Name1 Name2 Remaining_name
0 John Mathew Davidson John Davidson Mathew
1 Paul Theodre Luther Paul Theodre Luther
2 Victor George Mary George Mary Victor
3 Henry Patrick John Harrison Henry John Patrick Harrison
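Because the joined names are used as a regex pattern, a name containing characters such as "." would be interpreted as regex syntax, and a short name could match inside a longer word. A sketch that escapes each name and anchors it on word boundaries is shown here; the helper name strip_known is hypothetical, and all name columns are assumed to hold non-empty strings.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "Full_Name": ["John Mathew Davidson", "Paul Theodre Luther",
                  "Victor George Mary", "Henry Patrick John Harrison"],
    "Name1": ["John", "Paul", "George", "Henry"],
    "Name2": ["Davidson", "Theodre", "Mary", "John"],
})
name_cols = [c for c in df.columns if c.startswith("Name")]

def strip_known(row):
    # build \b(?:name1|name2|...)\b with every name escaped for regex use
    pattern = r"\b(?:" + "|".join(re.escape(row[c]) for c in name_cols) + r")\b"
    leftover = re.sub(pattern, " ", row["Full_Name"])
    return " ".join(leftover.split())  # collapse the whitespace left behind

df["Remaining_name"] = df.apply(strip_known, axis=1)
print(df)
```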
I have a dataframe that has 20 or so columns in it. One of the columns is called 'director_name' and has values such as 'John Doe' or 'Jane Doe'. I want to split this into 2 columns, 'First_Name' and 'Last_Name'. When I run the following it works as expected and splits the string into 2 columns:
data[['First_Name', 'Last_Name']] = data.director_name.str.split(' ', expand=True)
data
First_Name Last_Name
John Doe
It works great, however it does NOT work when I have NULL (NaN) values under 'director_name'. It throws the following error:
'Columns must be same length as key'
I'd like to add a function which checks if the value != null, then do the command listed above, otherwise enter 'NA' for First_Name and 'Last_Name'
Any ideas how I would go about that?
EDIT:
I just checked the file and I'm not sure if NULL is the issue. I have some names that are 3-4 strings long. i.e.
John Allen Doe
John Allen Doe Jr
Maybe I can't split this into First_Name and Last_Name.
Hmmmm
Here is one way: split and take, say, the first two values as first name and last name.
Id name
0 1 James Cameron
1 2 Martin Sheen
2 3 John Allen Doe
3 4 NaN
df['First_Name'] = df.name.str.split(' ', expand = True)[0]
df['Last_Name'] = df.name.str.split(' ', expand = True)[1]
You get
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen
3 4 NaN NaN None
Use str.split (no parameter, because the default separator is whitespace) and index with str to select list elements by position:
print (df.name.str.split())
0 [James, Cameron]
1 [Martin, Sheen]
2 [John, Allen, Doe]
3 NaN
Name: name, dtype: object
df['First_Name'] = df.name.str.split().str[0]
df['Last_Name'] = df.name.str.split().str[1]
# data borrowed from A-Za-z's answer
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen
3 4 NaN NaN NaN
It is also possible to use the parameter n to limit the number of splits, so that everything after the first name is kept together as the last name:
df['First_Name'] = df.name.str.split().str[0]
df['Last_Name'] = df.name.str.split(n=1).str[1]
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen Doe
3 4 NaN NaN NaN
Solution with str.rsplit:
df['First_Name'] = df.name.str.rsplit(n=1).str[0]
df['Last_Name'] = df.name.str.rsplit().str[-1]
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen Doe
3 4 NaN NaN NaN
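The "Columns must be same length as key" error from the question appears when expand=True produces more than two columns for a two-column assignment. A minimal sketch that caps the split at one with n=1 so there are at most two parts, then fills missing values with 'NA' as the question asked, follows. The sample rows are taken from the question's examples, and the reindex step is an assumption added to cover the case where every name is a single word.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"director_name": ["John Doe", np.nan, "John Allen Doe Jr"]})

# n=1 allows at most one split, so expand=True yields at most two columns
parts = data["director_name"].str.split(n=1, expand=True)
# guarantee exactly two columns even if no name contained a space
parts = parts.reindex(columns=[0, 1])

data[["First_Name", "Last_Name"]] = parts.fillna("NA")
print(data)
```

With this choice "John Allen Doe Jr" becomes first name John and last name "Allen Doe Jr"; the rsplit variant above keeps the given names together instead.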
df['First_Name'] = df.name.str.split(' ', expand = True)[0]
df['Last_Name'] = df.name.str.split(' ', expand = True)[1]
This should do
This should fix your problem
Setup
data= pd.DataFrame({'director_name': {0: 'John Doe', 1: np.nan, 2: 'Alan Smith'}})
data
Out[457]:
director_name
0 John Doe
1 NaN
2 Alan Smith
Solution
#use a lambda function to check nan before splitting the column.
data[['First_Name', 'Last_Name']] = data.apply(lambda x: pd.Series([np.nan,np.nan] if pd.isnull(x.director_name) else x.director_name.split()), axis=1)
data
Out[446]:
director_name First_Name Last_Name
0 John Doe John Doe
1 NaN NaN NaN
2 Alan Smith Alan Smith
If you need to take only the first 2 names, you can do:
data[['First_Name', 'Last_Name']] = data.apply(lambda x: pd.Series([np.nan,np.nan] if pd.isnull(x.director_name) else x.director_name.split()).iloc[:2], axis=1)