Replace column based on string - python

I'm trying to replace column "Names" by a new variable "Gender" based on the first letters that we find in column name.
INPUT:
df['Name'].value_counts()
OUTPUT:
Mr. Gordon Hemmings 1
Miss Jane Wilkins 1
Mrs. Audrey North 1
Mrs. Wanda Sharp 1
Mr. Victor Hemmings 1
..
Miss Heather Abraham 1
Mrs. Kylie Hart 1
Mr. Ian Langdon 1
Mr. Gordon Watson 1
Miss Irene Vance 1
Name: Name, Length: 4999, dtype: int64
Now, see the Miss, Mrs., and Miss? The first question that comes to mind is: how many different words there are?
INPUT
df.Name.str.split().str[0].value_counts(dropna=False)
Mr. 3351
Mrs. 937
Miss 711
NaN 1
Name: Name, dtype: int64
Now I'm trying to:
#Replace missing value
df['Name'].fillna('Mr.', inplace=True)
# Create Column Gender
df['Gender'] = df['Name']
for i in range(0, df[0]):
A = df['Name'].values[i][0:3]=="Mr."
df['Gender'].values[i] = A
df.loc[df['Gender']==True, 'Gender']="Male"
df.loc[df['Gender']==False, 'Gender']="Female"
del df['Name'] #Delete column 'Name'
df
But I'm missing something since I get the following error:
KeyError: 0

The KeyError is because you don't have a column called 0. However, I would ditch that code and try something more efficient.
You can use np.where with str.contains to search for names with Mr. after using fillna(). Then, just drop the Name column.:
df['Name'] = df['Name'].fillna('Mr.')
df['Gender'] = np.where(df['Name'].str.contains('Mr\.'), 'Male', 'Female')
df = df.drop('Name', axis=1)
df
Full example:
df = pd.DataFrame({'Name': {0: 'Mr. Gordon Hemmings',
1: 'Miss Jane Wilkins',
2: 'Mrs. Audrey North',
3: 'Mrs. Wanda Sharp',
4: 'Mr. Victor Hemmings'},
'Value': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1}})
print(df)
df['Name'] = df['Name'].fillna('Mr.')
df['Gender'] = np.where(df['Name'].str.contains('Mr\.'), 'Male', 'Female')
df = df.drop('Name', axis=1)
print('\n')
print(df)
Name Value
0 Mr. Gordon Hemmings 1
1 Miss Jane Wilkins 1
2 Mrs. Audrey North 1
3 Mrs. Wanda Sharp 1
4 Mr. Victor Hemmings 1
Value Gender
0 1 Male
1 1 Female
2 1 Female
3 1 Female
4 1 Male

Related

Extract last word in DataFrame column

This has to be so simple - but I can't figure it out. I have a "name" column within a DataFrame and I'm trying to reverse the order of ['First Name', 'Middle Name', 'Last Name'] to ['Last Name', 'First Name', 'Middle Name'].
Here is my code:
for i in range(2114):
bb = a['Approved by User'][i].split(" ",2)[2]
aa = a['Approved by User'][i].split(" ",2)[0]
a['Full Name]'] = bb+','+aa
Unfortunately I keep getting IndexError: list index out of range with the current code.
This is what I want:
Old column Name| Jessica Mary Simpson
New column Name| Simpson Jessica Mary
One way to do it is to split the string and joinit later on in a function.
like so:
import pandas as pd
d = {"name": ["Jessica Mary Simpson"]}
df = pd.DataFrame(d)
a = df.name.str.split()
a = a.apply(lambda x: " ".join(x[::-1])).reset_index()
print(a)
output:
index name
0 0 Simpson Mary Jessica
With your shown samples, you could try following.
Let's say following is the df:
fullname
0 Jessica Mary Simpson
1 Ravinder avtar singh
2 John jonny janardan
Here is the code:
df['fullname'].replace(r'^([^ ]*) ([^ ]*) (.*)$', r'\3 \1 \2',regex=True)
OR
df['fullname'].replace(r'^(\S*) (\S*) (.*)$', r'\3 \1 \2',regex=True)
output will be as follows:
0 Simpson Jessica Mary
1 singh Ravinder avtar
2 janardan John jonny
I think problem is in your data, here is your solution in pandas text functions Series.str.split, indexing and Series.str.join:
df['Full Name'] = df['Approved by User'].str.split(n=2).str[::-1].str.join(' ')
print (df)
Approved by User Full Name
0 Jessica Mary Simpson Simpson Mary Jessica
1 John Doe Doe John
2 Mary Mary

Fillna() in column based on condition

I have created a small dictionary, where a specific title is assigned a median age.
Age
Title
Master. 3.5
Miss. 21.0
Mr. 30.0
Mrs. 35.0
other 44.5
Now I want to use this dictionary to fill the missing values in a single column in a dataframe, based on that title. So, for rows where the "Age" is missing, and the title = "Master.", I want to insert the value 3.5 and so on.
I tried this piece of code, but it does not work; it doesn't produce an error, but it also doesn't replace the missing values. What am I doing wrong?
for title in piv.keys():
train[["Age"]][train["Title"]==title].fillna(piv[title], inplace=True)
where "piv" is the name of the dictionary, and "train" is the name of the dataframe.
Also, is there a more elegant way to do this?
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S Mr.
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C Mrs.
{'Master.': 3.5, 'Miss.': 21.0, 'Mr.': 30.0, 'Mrs.': 35.0, 'other': 44.5}
One option:
train['Age'] = train.groupby('Title')['Age'].transform(lambda x: x.fillna(x.mean()))
Another option:
pivdict = piv.set_index('Title').squeeze().to_dict()
train['Age'] = train['Age'].fillna(train['Title'].map(pivdict))
One method:
# create lookup dictionary
title = ['Master', 'Miss.', 'Mr.', 'Mrs.', 'other']
age = [3.5, 21, 30, 35, 44]
title_dict = dict(zip(title, age))
# mock dataframe
df = pd.DataFrame({'Name': ['Bob', 'Alice', 'Charles', 'Mary'],
'Age': [12, 27, None, None],
'Title': ['Master', 'Miss.', 'Mr.', 'other']})
# if age is Na then look it up in dictionary
df['Age'] = df['Age'].fillna(df['Title'].map(title_dict))
Input:
Name Age Title
0 Bob 12.0 Master
1 Alice 27.0 Miss.
2 Charles NaN Mr.
3 Mary NaN other
Output:
Name Age Title
0 Bob 12.0 Master
1 Alice 27.0 Miss.
2 Charles 30.0 Mr.
3 Mary 44.0 other

Change names in pandas column to start with uppercase letters

Background
I have a toy df
import pandas as pd
df = pd.DataFrame({'Text' : ['Jon J Mmith is Here',
'Mary Lisa Hder found here',
'Jane A Doe is also here',
'Tom T Tcker is here too'],
'P_ID': [1,2,3,4],
'P_Name' : ['MMITH, JON J', 'HDER, MARY LISA', 'DOE, JANE A', 'TCKER, TOM T'],
'N_ID' : ['A1', 'A2', 'A3', 'A4']
})
#rearrange columns
df = df[['Text','N_ID', 'P_ID', 'P_Name']]
df
Text N_ID P_ID P_Name
0 Jon J Mmith is Here A1 1 MMITH, JON J
1 Mary Lisa Hder found here A2 2 HDER, MARY LISA
2 Jane A Doe is also here A3 3 DOE, JANE A
3 Tom T Tcker is here to A4 4 TCKER, TOM T
Goal
1) Change the P_Name column from df into a format that looks like my desired output; that is, change the current format (e.g.MMITH, JON J) to a format (e.g. Mmith, Jon J) where the first and last names and middle letter all start with a capital letter
2) Create this in a new column P_Name_New
Desired Output
Text N_ID P_ID P_Name P_Name_New
0 Jon J Mmith is Here A1 1 MMITH, JON J Mmith, Jon J
1 Mary Lisa Hder found here A2 2 HDER, MARY LISA Hder, Mary Lisa
2 Jane A Doe is also here A3 3 DOE, JANE A Doe, Jane A
3 Tom T Tcker is here too A4 4 TCKER, TOM T Tcker, Tom T
Question
How do I achieve my desired goal?
Simply with str.title() function:
In [98]: df['P_Name_New'] = df['P_Name'].str.title()
In [99]: df
Out[99]:
Text N_ID P_ID P_Name P_Name_New
0 Jon J Smith is Here A1 1 SMITH, JON J Smith, Jon J
1 Mary Lisa Rider found here A2 2 RIDER, MARY LISA Rider, Mary Lisa
2 Jane A Doe is also here A3 3 DOE, JANE A Doe, Jane A
3 Tom T Tucker is here too A4 4 TUCKER, TOM T Tucker, Tom T

Adding a function to a string split command in Pandas

I have a dataframe that has 20 or so columns in it. One of the columns is called 'director_name' and has values such as 'John Doe' or 'Jane Doe'. I want to split this into 2 columns, 'First_Name' and 'Last_Name'. When I run the following it works as expected and splits the string into 2 columns:
data[['First_Name', 'Last_Name']] = data.director_name.str.split(' ', expand
= True)
data
First_Name Last_Name
John Doe
It works great, however it does NOT work when I have NULL (NaN) values under 'director_name'. It throws the following error:
'Columns must be same length as key'
I'd like to add a function which checks if the value != null, then do the command listed above, otherwise enter 'NA' for First_Name and 'Last_Name'
Any ideas how I would go about that?
EDIT:
I just checked the file and I'm not sure if NULL is the issue. I have some names that are 3-4 strings long. i.e.
John Allen Doe
John Allen Doe Jr
Maybe I can't split this into First_Name and Last_Name.
Hmmmm
Here is a way is to split and choose say the first two values as first name and last name
Id name
0 1 James Cameron
1 2 Martin Sheen
2 3 John Allen Doe
3 4 NaN
df['First_Name'] = df.name.str.split(' ', expand = True)[0]
df['Last_Name'] = df.name.str.split(' ', expand = True)[1]
You get
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen
3 4 NaN NaN None
Use str.split (no parameter, because splitter by default whitespace) with indexing with str for select lists by position:
print (df.name.str.split())
0 [James, Cameron]
1 [Martin, Sheen]
2 [John, Allen, Doe]
3 NaN
Name: name, dtype: object
df['First_Name'] = df.name.str.split().str[0]
df['Last_Name'] = df.name.str.split().str[1]
#data borrow from A-Za-z answer
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen
3 4 NaN NaN NaN
There is also possible use paramter n for selecting second or first 2 names:
df['First_Name'] = df.name.str.split().str[0]
df['Last_Name'] = df.name.str.split(n=1).str[1]
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen Doe
3 4 NaN NaN NaN
Solution with str.rstrip
df['First_Name'] = df.name.str.rsplit(n=1).str[0]
df['Last_Name'] = df.name.str.rsplit().str[-1]
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen Doe
3 4 NaN NaN NaN
df['First_Name'] = df.name.str.split(' ', expand = True)[0]
df['Last_Name'] = df.name.str.split(' ', expand = True)[1]
This should do
This should fix your problem
Setup
data= pd.DataFrame({'director_name': {0: 'John Doe', 1: np.nan, 2: 'Alan Smith'}})
data
Out[457]:
director_name
0 John Doe
1 NaN
2 Alan Smith
Solution
#use a lambda function to check nan before splitting the column.
data[['First_Name', 'Last_Name']] = data.apply(lambda x: pd.Series([np.nan,np.nan] if pd.isnull(x.director_name) else x.director_name.split()), axis=1)
data
Out[446]:
director_name First_Name Last_Name
0 John Doe John Doe
1 NaN NaN NaN
2 Alan Smith Alan Smith
If you need to take only the first 2 names, you can do:
data[['First_Name', 'Last_Name']] = data.apply(lambda x: pd.Series([np.nan,np.nan] if pd.isnull(x.director_name) else x.director_name.split()).iloc[:2], axis=1)

Joining string of a columns over several index while keeping other colums

Here is an example data set:
>>> df1 = pandas.DataFrame({
"Name": ["Alice", "Marie", "Smith", "Mallory", "Bob", "Doe"],
"City": ["Seattle", None, None, "Portland", None, None],
"Age": [24, None, None, 26, None, None],
"Group": [1, 1, 1, 2, 2, 2]})
>>> df1
Age City Group Name
0 24.0 Seattle 1 Alice
1 NaN None 1 Marie
2 NaN None 1 Smith
3 26.0 Portland 2 Mallory
4 NaN None 2 Bob
5 NaN None 2 Doe
I would like to merge the Name column for all index of the same group while keeping the City and the Age wanting someting like:
>>> df1_summarised
Age City Group Name
0 24.0 Seattle 1 Alice Marie Smith
1 26.0 Portland 2 Mallory Bob Doe
I know those 2 columns (Age, City) will be NaN/None after the first index of a given group from the structure of my starting data.
I have tried the following:
>>> print(df1.groupby('Group')['Name'].apply(' '.join))
Group
1 Alice Marie Smith
2 Mallory Bob Doe
Name: Name, dtype: object
But I would like to keep the Age and City columns...
try this:
In [29]: df1.groupby('Group').ffill().groupby(['Group','Age','City']).Name.apply(' '.join)
Out[29]:
Group Age City
1 24.0 Seattle Alice Marie Smith
2 26.0 Portland Mallory Bob Doe
Name: Name, dtype: object
using dropna and assign with groupby
docs to assign
df1.dropna(subset=['Age', 'City']) \
.assign(Name=df1.groupby('Group').Name.apply(' '.join).values)
timing
per request
update
use groupby and agg
I thought of this and it feels far more satisfying
df1.groupby('Group').agg(dict(Age='first', City='first', Name=' '.join))
to get the exact output
df1.groupby('Group').agg(dict(Age='first', City='first', Name=' '.join)) \
.reset_index().reindex_axis(df1.columns, 1)

Categories