This has to be so simple - but I can't figure it out. I have a "name" column within a DataFrame and I'm trying to reverse the order of ['First Name', 'Middle Name', 'Last Name'] to ['Last Name', 'First Name', 'Middle Name'].
Here is my code:
for i in range(2114):
    bb = a['Approved by User'][i].split(" ",2)[2]
    aa = a['Approved by User'][i].split(" ",2)[0]
    a['Full Name]'] = bb+','+aa
Unfortunately I keep getting IndexError: list index out of range with the current code.
This is what I want:
Old column Name| Jessica Mary Simpson
New column Name| Simpson Jessica Mary
One way to do it is to split the string and join the pieces back together in reverse order.
Like so:
import pandas as pd
d = {"name": ["Jessica Mary Simpson"]}
df = pd.DataFrame(d)
a = df.name.str.split()
a = a.apply(lambda x: " ".join(x[::-1])).reset_index()
print(a)
output:
index name
0 0 Simpson Mary Jessica
With the samples you have shown, you could try the following.
Let's say this is the df:
fullname
0 Jessica Mary Simpson
1 Ravinder avtar singh
2 John jonny janardan
Here is the code:
df['fullname'].replace(r'^([^ ]*) ([^ ]*) (.*)$', r'\3 \1 \2',regex=True)
OR
df['fullname'].replace(r'^(\S*) (\S*) (.*)$', r'\3 \1 \2',regex=True)
output will be as follows:
0 Simpson Jessica Mary
1 singh Ravinder avtar
2 janardan John jonny
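Note that both calls return a new Series rather than modifying df, so the result has to be assigned to be kept. A minimal sketch, assuming a hypothetical target column named fullname_reordered and an extra two-word row (John Doe) that is not in the sample above. It is there to show that names with fewer than three parts are left unchanged, because the pattern does not match them:

```python
import pandas as pd

df = pd.DataFrame({"fullname": ["Jessica Mary Simpson", "Ravinder avtar singh", "John Doe"]})

# Move the third group (everything after the second space) to the front.
# Rows that do not match the pattern (fewer than three words) come back unchanged.
df["fullname_reordered"] = df["fullname"].str.replace(
    r"^(\S*) (\S*) (.*)$", r"\3 \1 \2", regex=True
)
print(df)
```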
I think the problem is in your data (some names have fewer than three parts, so index 2 does not exist). Here is a solution using the pandas text functions Series.str.split, indexing, and Series.str.join:
df['Full Name'] = df['Approved by User'].str.split(n=2).str[::-1].str.join(' ')
print (df)
Approved by User Full Name
0 Jessica Mary Simpson Simpson Mary Jessica
1 John Doe Doe John
2 Mary Mary
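The output above reverses every word, which gives Simpson Mary Jessica rather than the Simpson Jessica Mary asked for in the question. If only the last word should move to the front, one possible variation (a sketch, not the answerer's code) is to split once from the right with Series.str.rsplit and swap the two pieces:

```python
import pandas as pd

df = pd.DataFrame({"Approved by User": ["Jessica Mary Simpson", "John Doe", "Mary"]})

# Split once from the right: ['Jessica Mary', 'Simpson'], then swap the two pieces.
parts = df["Approved by User"].str.rsplit(n=1)
df["Full Name"] = parts.str[::-1].str.join(" ")
print(df)
```

Single-word names pass through unchanged, so this should not raise the IndexError from the question.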
Related
I have a dataframe which consists of two columns, full name and last name. Sometimes, the last name column is not filled properly. In such cases, the last name would be found as the last word in the full name column between parenthesis. I would like to update my last name column for those cases where parenthesis are found to be equal to the word between parenthesis.
Code
import pandas as pd
df = pd.DataFrame({
'full':['bob john smith','sam alan (james)','zack joe mac', 'alan (gracie) jacob (arnold)'],
'last': ['ross', '-', 'mac', '-']
})
result_to_be = pd.DataFrame({
'full':['bob john smith','sam alan (james)','zack joe mac', 'alan (gracie) jacob (arnold)'],
'last': ['ross', 'james', 'mac', 'arnold']
})
print(df)
print(result_to_be)
I have tried to use the contains function as a mask, but it seems to break on the regex check when the pattern contains a ')' or '(' character:
df['full'].str.contains(')')
The error it shows is
re.error: unbalanced parenthesis at position 0
You can use .str.findall to get the value between the parentheses and df.loc to assign that where last is -:
df.loc[df['last'] == '-', 'last'] = df['full'].str.findall(r'\((.+?)\)').str[-1]
Output:
>>> df
full last
0 bob john smith ross
1 sam alan (james) james
2 zack joe mac mac
3 alan (gracie) jacob (arnold) arnold
For a slightly different syntax, you could also use extract:
df.loc[df['last'] == '-', 'last'] = df['full'].str.extract(r'.*\((.*)\)', expand=False)
Output:
full last
0 bob john smith ross
1 sam alan (james) james
2 zack joe mac mac
3 alan (gracie) jacob (arnold) arnold
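As for the original re.error: str.contains treats its pattern as a regular expression by default, and a lone ')' is not a valid one. A small sketch of two ways to build the mask the question was after. The variable names has_paren and has_paren_regex are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "full": ["bob john smith", "sam alan (james)", "zack joe mac", "alan (gracie) jacob (arnold)"],
    "last": ["ross", "-", "mac", "-"],
})

# regex=False searches for the literal character instead of compiling a pattern.
has_paren = df["full"].str.contains(")", regex=False)

# Equivalent mask using an escaped regular expression.
has_paren_regex = df["full"].str.contains(r"\)")

# Only overwrite 'last' in rows where a parenthesised word exists.
df.loc[has_paren, "last"] = df["full"].str.findall(r"\((.+?)\)").str[-1]
print(df)
```

Masking on the parentheses rather than on last == '-' is a design choice; which one fits depends on how the placeholder values look in the real data.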
I'm trying to replace the "Name" column with a new variable "Gender", based on the first letters found in each name.
INPUT:
df['Name'].value_counts()
OUTPUT:
Mr. Gordon Hemmings 1
Miss Jane Wilkins 1
Mrs. Audrey North 1
Mrs. Wanda Sharp 1
Mr. Victor Hemmings 1
..
Miss Heather Abraham 1
Mrs. Kylie Hart 1
Mr. Ian Langdon 1
Mr. Gordon Watson 1
Miss Irene Vance 1
Name: Name, Length: 4999, dtype: int64
Now, see the Mr., Mrs., and Miss? The first question that comes to mind is: how many different titles are there?
INPUT
df.Name.str.split().str[0].value_counts(dropna=False)
Mr. 3351
Mrs. 937
Miss 711
NaN 1
Name: Name, dtype: int64
Now I'm trying to:
#Replace missing value
df['Name'].fillna('Mr.', inplace=True)
# Create Column Gender
df['Gender'] = df['Name']
for i in range(0, df[0]):
    A = df['Name'].values[i][0:3]=="Mr."
    df['Gender'].values[i] = A
df.loc[df['Gender']==True, 'Gender']="Male"
df.loc[df['Gender']==False, 'Gender']="Female"
del df['Name'] #Delete column 'Name'
df
But I'm missing something since I get the following error:
KeyError: 0
The KeyError is raised because df[0] looks up a column called 0, which doesn't exist. However, I would ditch that code and try something more efficient.
You can use np.where with str.contains to search for names containing Mr. after using fillna(). Then just drop the Name column (this needs import numpy as np):
df['Name'] = df['Name'].fillna('Mr.')
df['Gender'] = np.where(df['Name'].str.contains(r'Mr\.'), 'Male', 'Female')
df = df.drop('Name', axis=1)
df
Full example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': {0: 'Mr. Gordon Hemmings',
                            1: 'Miss Jane Wilkins',
                            2: 'Mrs. Audrey North',
                            3: 'Mrs. Wanda Sharp',
                            4: 'Mr. Victor Hemmings'},
                   'Value': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1}})
print(df)
# Missing names are treated as 'Mr.' before the title check
df['Name'] = df['Name'].fillna('Mr.')
# r'Mr\.' matches a literal 'Mr.'; 'Mrs.' does not contain it
df['Gender'] = np.where(df['Name'].str.contains(r'Mr\.'), 'Male', 'Female')
df = df.drop('Name', axis=1)
print('\n')
print(df)
Name Value
0 Mr. Gordon Hemmings 1
1 Miss Jane Wilkins 1
2 Mrs. Audrey North 1
3 Mrs. Wanda Sharp 1
4 Mr. Victor Hemmings 1
Value Gender
0 1 Male
1 1 Female
2 1 Female
3 1 Female
4 1 Male
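An alternative worth considering, sketched here under the assumption that the only titles are the three shown by value_counts() in the question: take the first token of each name and map it through a lookup table. The dictionary title_to_gender is a hypothetical helper, not part of the answer above. Unlike the fillna('Mr.') approach, a missing name stays missing instead of being counted as Male:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Mr. Gordon Hemmings", "Miss Jane Wilkins",
                            "Mrs. Audrey North", np.nan]})

# Hypothetical lookup table covering the three titles seen in value_counts().
title_to_gender = {"Mr.": "Male", "Mrs.": "Female", "Miss": "Female"}

# The first whitespace-separated token is the title; unknown or missing titles map to NaN.
df["Gender"] = df["Name"].str.split().str[0].map(title_to_gender)
print(df)
```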
Background
I have a toy df
import pandas as pd
df = pd.DataFrame({'Text' : ['Jon J Mmith is Here',
'Mary Lisa Hder found here',
'Jane A Doe is also here',
'Tom T Tcker is here too'],
'P_ID': [1,2,3,4],
'P_Name' : ['MMITH, JON J', 'HDER, MARY LISA', 'DOE, JANE A', 'TCKER, TOM T'],
'N_ID' : ['A1', 'A2', 'A3', 'A4']
})
#rearrange columns
df = df[['Text','N_ID', 'P_ID', 'P_Name']]
df
Text N_ID P_ID P_Name
0 Jon J Mmith is Here A1 1 MMITH, JON J
1 Mary Lisa Hder found here A2 2 HDER, MARY LISA
2 Jane A Doe is also here A3 3 DOE, JANE A
3 Tom T Tcker is here too A4 4 TCKER, TOM T
Goal
1) Change the P_Name column from df into a format that looks like my desired output; that is, change the current format (e.g. MMITH, JON J) to a format (e.g. Mmith, Jon J) where the first name, last name, and middle initial all start with a capital letter
2) Create this in a new column P_Name_New
Desired Output
Text N_ID P_ID P_Name P_Name_New
0 Jon J Mmith is Here A1 1 MMITH, JON J Mmith, Jon J
1 Mary Lisa Hder found here A2 2 HDER, MARY LISA Hder, Mary Lisa
2 Jane A Doe is also here A3 3 DOE, JANE A Doe, Jane A
3 Tom T Tcker is here too A4 4 TCKER, TOM T Tcker, Tom T
Question
How do I achieve my desired goal?
Simply with str.title() function:
In [98]: df['P_Name_New'] = df['P_Name'].str.title()
In [99]: df
Out[99]:
Text N_ID P_ID P_Name P_Name_New
0 Jon J Smith is Here A1 1 SMITH, JON J Smith, Jon J
1 Mary Lisa Rider found here A2 2 RIDER, MARY LISA Rider, Mary Lisa
2 Jane A Doe is also here A3 3 DOE, JANE A Doe, Jane A
3 Tom T Tucker is here too A4 4 TUCKER, TOM T Tucker, Tom T
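The same call can be checked against the names from the question itself. This is only a sketch; the last line uses a made-up name (MCDONALD, ANN) to show one caveat, namely that str.title() lower-cases everything after the first letter of each word, so internal capitals are not preserved:

```python
import pandas as pd

df = pd.DataFrame({"P_Name": ["MMITH, JON J", "HDER, MARY LISA", "DOE, JANE A", "TCKER, TOM T"]})

# str.title() upper-cases the first letter of every word and lower-cases the rest.
df["P_Name_New"] = df["P_Name"].str.title()
print(df["P_Name_New"].tolist())

# Caveat with a hypothetical name: 'MCDONALD' becomes 'Mcdonald', not 'McDonald'.
caveat = pd.Series(["MCDONALD, ANN"]).str.title().iloc[0]
print(caveat)
```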
Input:
I have a Dataframe as follows
Full_Name Name1 Name2
John Mathew Davidson John Davidson
Paul Theodre Luther Paul Theodre
Victor George Mary George Mary
Output:
I need to find the Remaining_name column as shown below
Full_Name Name1 Name2 Remaining_name
John Mathew Davidson John Davidson Mathew
Paul Theodre Luther Paul Theodre Luther
Victor George Mary George Mary Victor
Clarification:
I need to compare the values (words) of more than one column against another column's value (sentence) and find the unmatched words, which could be in any position in the whole string.
This is the data you provided:
import pandas as pd
full_name = ['John Mathew Davidson', 'Paul Theodre Luther', 'Victor George Mary']
name_1 = ['John', 'Paul', 'George']
name_2 = ['Davidson', 'Theodre', 'Mary']
df = pd.DataFrame({'Full_Name':full_name, 'Name1':name_1, 'Name2':name_2 })
In order to perform an action on multiple columns in a row, the best approach is to define the function separately. It makes the code more readable and easier to debug.
The function will take a DataFrame row as an input:
def find_missing_name(row):
    known_names = [row['Name1'], row['Name2']]  # collect the known names in a list to check against later
    full_name_list = row['Full_Name'].split(' ')  # convert the full name to a list by splitting it on spaces
    # WARNING! this function works well only if you are sure your 'Full_Name' column items are separated by a space.
    missing_name = [x for x in full_name_list if x not in known_names]  # loop through the full name list and keep only the words not in known_names
    missing_name = ','.join(missing_name)  # if more than one name is missing, join them into one comma-separated string
    return missing_name
Now apply the function to the existing DataFrame:
df['missing_name'] = df.apply(find_missing_name, axis=1) ## axis=1 means 'apply to each row', where axis=0 means 'apply to each column'
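Output:
Full_Name Name1 Name2 missing_name
0 John Mathew Davidson John Davidson Mathew
1 Paul Theodre Luther Paul Theodre Luther
2 Victor George Mary George Mary Victor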
Hope this helps :)
You can do so in one line with:
df['Remaining_name'] = df.apply(lambda x: [i for i in x['Full_Name'].split() if all(i not in x[c] for c in df.columns[1:])], axis=1)
This returns the Remaining_name column as a list per row, which is helpful when a full name has more than three parts, for example:
Full_Name Name1 Name2 Remaining_name
0 John Mathew Davidson John Davidson [Mathew]
1 Paul Theodre Luther Paul Theodre [Luther]
2 Victor George Mary George Mary [Victor]
3 Henry Patrick John Harrison Patrick Henry [John, Harrison]
A solution using replace, applied row by row with apply:
df['Remaining_name'] = df.apply(lambda x: x['Full_Name'].replace(x['Name1'], '').replace(x['Name2'], ''), axis=1).str.strip()
Full_Name Name1 Name2 Remaining_name
0 John Mathew Davidson John Davidson Mathew
1 Paul Theodre Luther Paul Theodre Luther
2 Victor George Mary George Mary Victor
Edit: If you have many columns starting with Name, you can select them as a slice and replace their values in Full_Name using a regex pattern:
df['tmp'] = df[df.columns[df.columns.str.startswith('Name')]].apply('|'.join, axis = 1)
df['Remaining_name'] = df.apply(lambda x: x.replace(x['tmp'], '', regex = True), axis = 1)['Full_Name'].str.strip()
df.drop('tmp', axis =1, inplace = True)
Full_Name Name1 Name2 Remaining_name
0 John Mathew Davidson John Davidson Mathew
1 Paul Theodre Luther Paul Theodre Luther
2 Victor George Mary George Mary Victor
3 Henry Patrick John Harrison Henry John Patrick Harrison
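The replace-based versions match substrings, so a known name such as John would also be cut out of a longer word like Johnson, and removing a word from the middle can leave a double space behind. A possible word-level variant is sketched below. The helper remaining_name and the list name_cols are names invented for this example, and the sketch assumes the words in Full_Name are separated by whitespace:

```python
import pandas as pd

df = pd.DataFrame({
    "Full_Name": ["John Mathew Davidson", "Paul Theodre Luther",
                  "Victor George Mary", "Henry Patrick John Harrison"],
    "Name1": ["John", "Paul", "George", "Henry"],
    "Name2": ["Davidson", "Theodre", "Mary", "John"],
})

# Every column whose label starts with 'Name' holds a known name.
name_cols = [c for c in df.columns if c.startswith("Name")]

def remaining_name(row):
    # Compare whole words, so 'John' does not match inside 'Johnson'.
    known = set(row[name_cols])
    return " ".join(w for w in row["Full_Name"].split() if w not in known)

df["Remaining_name"] = df.apply(remaining_name, axis=1)
print(df)
```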
I have a dataframe that has 20 or so columns in it. One of the columns is called 'director_name' and has values such as 'John Doe' or 'Jane Doe'. I want to split this into 2 columns, 'First_Name' and 'Last_Name'. When I run the following it works as expected and splits the string into 2 columns:
data[['First_Name', 'Last_Name']] = data.director_name.str.split(' ', expand=True)
data
First_Name Last_Name
John Doe
It works great, however it does NOT work when I have NULL (NaN) values under 'director_name'. It throws the following error:
'Columns must be same length as key'
I'd like to add a function which checks if the value != null, then do the command listed above, otherwise enter 'NA' for First_Name and 'Last_Name'
Any ideas how I would go about that?
EDIT:
I just checked the file and I'm not sure if NULL is the issue. I have some names that are 3-4 strings long. i.e.
John Allen Doe
John Allen Doe Jr
Maybe I can't split this into First_Name and Last_Name.
Hmmmm
Here is one way: split the name and take, say, the first two values as first name and last name.
Id name
0 1 James Cameron
1 2 Martin Sheen
2 3 John Allen Doe
3 4 NaN
df['First_Name'] = df.name.str.split(' ', expand = True)[0]
df['Last_Name'] = df.name.str.split(' ', expand = True)[1]
You get
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen
3 4 NaN NaN None
Use str.split (no parameter is needed, because the default separator is whitespace) and then index with str to select elements of the lists by position:
print (df.name.str.split())
0 [James, Cameron]
1 [Martin, Sheen]
2 [John, Allen, Doe]
3 NaN
Name: name, dtype: object
df['First_Name'] = df.name.str.split().str[0]
df['Last_Name'] = df.name.str.split().str[1]
# data borrowed from A-Za-z's answer
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen
3 4 NaN NaN NaN
It is also possible to use the parameter n to limit the number of splits, so that the last name keeps everything after the first word:
df['First_Name'] = df.name.str.split().str[0]
df['Last_Name'] = df.name.str.split(n=1).str[1]
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen Doe
3 4 NaN NaN NaN
Solution with str.rsplit:
df['First_Name'] = df.name.str.rsplit(n=1).str[0]
df['Last_Name'] = df.name.str.rsplit().str[-1]
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen Doe
3 4 NaN NaN NaN
df['First_Name'] = df.name.str.split(' ', expand = True)[0]
df['Last_Name'] = df.name.str.split(' ', expand = True)[1]
This should do
This should fix your problem
Setup
import numpy as np
import pandas as pd
data = pd.DataFrame({'director_name': {0: 'John Doe', 1: np.nan, 2: 'Alan Smith'}})
data
Out[457]:
director_name
0 John Doe
1 NaN
2 Alan Smith
Solution
#use a lambda function to check nan before splitting the column.
data[['First_Name', 'Last_Name']] = data.apply(lambda x: pd.Series([np.nan,np.nan] if pd.isnull(x.director_name) else x.director_name.split()), axis=1)
data
Out[446]:
director_name First_Name Last_Name
0 John Doe John Doe
1 NaN NaN NaN
2 Alan Smith Alan Smith
If you need to take only the first 2 names, you can do:
data[['First_Name', 'Last_Name']] = data.apply(lambda x: pd.Series([np.nan,np.nan] if pd.isnull(x.director_name) else x.director_name.split()).iloc[:2], axis=1)
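The 'Columns must be same length as key' error from the question most likely comes from the three- and four-word names rather than from the NaN rows: with expand=True, a four-word name produces four columns, which cannot be assigned to two. A sketch of one way around it without apply, assuming that everything after the first word should be treated as the last name and that at least one name in the column has two or more words (otherwise only one column comes back):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"director_name": ["John Doe", np.nan, "John Allen Doe Jr"]})

# n=1 splits on the first run of whitespace only, so at most two columns come back.
parts = data["director_name"].str.split(n=1, expand=True)
data[["First_Name", "Last_Name"]] = parts
print(data)
```

Rows where director_name is missing end up with missing values in both new columns, so no explicit null check is needed here.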