I have created a small dictionary in which each title is assigned a median age.
          Age
Title
Master.   3.5
Miss.    21.0
Mr.      30.0
Mrs.     35.0
other    44.5
Now I want to use this dictionary to fill the missing values in a single column in a dataframe, based on that title. So, for rows where the "Age" is missing, and the title = "Master.", I want to insert the value 3.5 and so on.
I tried this piece of code, but it does not work; it doesn't produce an error, but it also doesn't replace the missing values. What am I doing wrong?
for title in piv.keys():
    train[["Age"]][train["Title"] == title].fillna(piv[title], inplace=True)
where "piv" is the name of the dictionary, and "train" is the name of the dataframe.
Also, is there a more elegant way to do this?
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S Mr.
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C Mrs.
{'Master.': 3.5, 'Miss.': 21.0, 'Mr.': 30.0, 'Mrs.': 35.0, 'other': 44.5}
Your code fails because `train[["Age"]][train["Title"]==title]` is chained indexing: it returns a copy of the data, so `fillna(..., inplace=True)` fills that copy rather than `train` itself. One option that avoids the problem is to recompute the per-title statistic on the fly (swap in x.median() if your lookup values are medians, as stated in the question):
train['Age'] = train.groupby('Title')['Age'].transform(lambda x: x.fillna(x.mean()))
Another option, if `piv` is a DataFrame with a Title column rather than a plain dict, is to convert it first and then map:
pivdict = piv.set_index('Title').squeeze().to_dict()
train['Age'] = train['Age'].fillna(train['Title'].map(pivdict))
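If `piv` is already the plain dict shown in the question, the conversion step isn't needed; a minimal sketch:

# assuming piv is the plain dict shown above ({'Master.': 3.5, ...})
train['Age'] = train['Age'].fillna(train['Title'].map(piv))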
One method:
import pandas as pd

# create lookup dictionary
title = ['Master', 'Miss.', 'Mr.', 'Mrs.', 'other']
age = [3.5, 21, 30, 35, 44]
title_dict = dict(zip(title, age))

# mock dataframe
df = pd.DataFrame({'Name': ['Bob', 'Alice', 'Charles', 'Mary'],
                   'Age': [12, 27, None, None],
                   'Title': ['Master', 'Miss.', 'Mr.', 'other']})

# if Age is NaN, look it up in the dictionary by Title
df['Age'] = df['Age'].fillna(df['Title'].map(title_dict))
Input:
Name Age Title
0 Bob 12.0 Master
1 Alice 27.0 Miss.
2 Charles NaN Mr.
3 Mary NaN other
Output:
Name Age Title
0 Bob 12.0 Master
1 Alice 27.0 Miss.
2 Charles 30.0 Mr.
3 Mary 44.0 other
So I have a list of dictionaries that itself contains lists of dictionaries, like this:
myDict = [{'First_Name': 'Jack', 'Last_Name': 'Smith',
           'Job_Data': [{'Company': 'Amazon'}, {'Hire_Date': '2011-04-01', 'Company': 'Target'}]},
          {'First_Name': 'Jill', 'Last_Name': 'Smith',
           'Job_Data': [{'Hire_Date': '2009-11-16', 'Company': 'Sears'}, {'Hire_Date': '2011-04-01'}]}]
However, as you can see, some of the keys are shared, and sometimes data elements are missing, like Jack missing a Hire_Date and Jill missing a Company. So what I want to do is preserve the data and write it to multiple rows so that my final output looks like this:
First_Name Last_Name Hire_Date Company
0 Jack Smith NaN Amazon
1 Jack Smith 2011-04-01 Target
2 Jill Smith 2009-11-16 Sears
3 Jill Smith 2011-04-01 NaN
Edit: Follow-up question. Say now that I have a dictionary that adds in an extra key, and I want to produce a similar output with the new data included:
myDict = [{'First_Name': 'Jack', 'Last_Name': 'Smith',
           'Job_Data': [{'Company': 'Amazon'}, {'Hire_Date': '2011-04-01', 'Company': 'Target'}],
           'Dependent_data': [{'Dependent': 'Susan Smith'}, {'Dependent': 'Will Smith'}]},
          {'First_Name': 'Jill', 'Last_Name': 'Smith',
           'Job_Data': [{'Hire_Date': '2009-11-16', 'Company': 'Sears'}, {'Hire_Date': '2011-04-01'}]}]
Output:
First_Name Last_Name Hire_Date Company Dependent
0 Jack Smith NaN Amazon Susan Smith
1 Jack Smith 2011-04-01 Target Will Smith
2 Jill Smith 2009-11-16 Sears NaN
3 Jill Smith 2011-04-01 NaN NaN
Using json_normalize
df = pd.json_normalize(data=myDict, meta=["First_Name", "Last_Name"], record_path="Job_Data")
print(df)
Company Hire_Date First_Name Last_Name
0 Amazon NaN Jack Smith
1 Target 2011-04-01 Jack Smith
2 Sears 2009-11-16 Jill Smith
3 NaN 2011-04-01 Jill Smith
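The answer above doesn't cover the follow-up. Here is a hedged sketch for the Dependent_data case, assuming dependents pair up positionally with jobs within each person (as in the expected output):

import pandas as pd

frames = []
for person in myDict:
    jobs = pd.DataFrame(person.get('Job_Data', []))
    deps = pd.DataFrame(person.get('Dependent_data', []))
    combined = pd.concat([jobs, deps], axis=1)  # row i of jobs pairs with dependent i
    combined['First_Name'] = person['First_Name']
    combined['Last_Name'] = person['Last_Name']
    frames.append(combined)

out = pd.concat(frames, ignore_index=True)  # people without dependents get NaN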
I'm trying to replace the "Name" column with a new variable "Gender", based on the title at the start of each name.
INPUT:
df['Name'].value_counts()
OUTPUT:
Mr. Gordon Hemmings 1
Miss Jane Wilkins 1
Mrs. Audrey North 1
Mrs. Wanda Sharp 1
Mr. Victor Hemmings 1
..
Miss Heather Abraham 1
Mrs. Kylie Hart 1
Mr. Ian Langdon 1
Mr. Gordon Watson 1
Miss Irene Vance 1
Name: Name, Length: 4999, dtype: int64
Now, see the Mr., Mrs., and Miss prefixes? The first question that comes to mind is: how many distinct title words are there?
INPUT:
df.Name.str.split().str[0].value_counts(dropna=False)
OUTPUT:
Mr. 3351
Mrs. 937
Miss 711
NaN 1
Name: Name, dtype: int64
Now I'm trying to:
# Replace missing value
df['Name'].fillna('Mr.', inplace=True)

# Create column Gender
df['Gender'] = df['Name']
for i in range(0, df[0]):
    A = df['Name'].values[i][0:3] == "Mr."
    df['Gender'].values[i] = A

df.loc[df['Gender']==True, 'Gender'] = "Male"
df.loc[df['Gender']==False, 'Gender'] = "Female"
del df['Name']  # delete column 'Name'
df
But I'm missing something since I get the following error:
KeyError: 0
The KeyError is because `df[0]` looks up a column named 0, which doesn't exist (you probably meant `len(df)`). However, I would ditch that code and try something more efficient.
You can use np.where with str.contains to search for names containing Mr. after using fillna(). Then just drop the Name column:
df['Name'] = df['Name'].fillna('Mr.')
df['Gender'] = np.where(df['Name'].str.contains(r'Mr\.'), 'Male', 'Female')
df = df.drop('Name', axis=1)
df
Full example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': {0: 'Mr. Gordon Hemmings',
                            1: 'Miss Jane Wilkins',
                            2: 'Mrs. Audrey North',
                            3: 'Mrs. Wanda Sharp',
                            4: 'Mr. Victor Hemmings'},
                   'Value': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1}})
print(df)

df['Name'] = df['Name'].fillna('Mr.')
df['Gender'] = np.where(df['Name'].str.contains(r'Mr\.'), 'Male', 'Female')
df = df.drop('Name', axis=1)
print('\n')
print(df)
Name Value
0 Mr. Gordon Hemmings 1
1 Miss Jane Wilkins 1
2 Mrs. Audrey North 1
3 Mrs. Wanda Sharp 1
4 Mr. Victor Hemmings 1
Value Gender
0 1 Male
1 1 Female
2 1 Female
3 1 Female
4 1 Male
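If more than two titles ever need handling, a mapping approach (not from the answer above) may scale better; a minimal sketch, assuming the title is always the first word of the name and using a hypothetical lookup dict:

# hypothetical title-to-gender lookup; extend with more titles as needed
title_to_gender = {'Mr.': 'Male', 'Mrs.': 'Female', 'Miss': 'Female'}
df['Gender'] = df['Name'].str.split().str[0].map(title_to_gender)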
I have defined a simple function to replace missing values in numerical columns with the average of the non-missing values for each column. The function is syntactically correct and generates correct values; however, the missing values are not getting replaced.
Below is the code snippet:
def fillmissing_with_mean(df1):
    df2 = df1._get_numeric_data()
    for i in range(len(df2.columns)):
        df2[df2.iloc[:, i].isnull()].iloc[:, i] = df2.iloc[:, i].mean()
    return df2

fillmissing_with_mean(df)
The data frame which is passed looks like this:
age gender job name height
NaN F student alice 165.0
26.0 None student john 180.0
NaN M student eric 175.0
58.0 None manager paul NaN
33.0 M engineer julie 171.0
34.0 F scientist peter NaN
Your function doesn't replace anything because `df2[df2.iloc[:,i].isnull()].iloc[:,i] = ...` is chained indexing: the assignment lands on a temporary copy, so df2 is never modified. That said, you do not need to worry about selecting the numeric columns at all: taking the mean only yields values for numeric columns, and fillna accepts a pd.Series mapping column names to fill values.
df.fillna(df.mean(numeric_only=True))  # numeric_only=True is required on recent pandas
Out[1398]:
age gender job name height
0 37.75 F student alice 165.00
1 26.00 None student john 180.00
2 37.75 M student eric 175.00
3 58.00 None manager paul 172.75
4 33.00 M engineer julie 171.00
5 34.00 F scientist peter 172.75
More info:
df.mean(numeric_only=True)
Out[1399]:
age 37.75
height 172.75
dtype: float64
This may be what you need. skipna=True is the default, but I've included it explicitly so you can see what it's doing.
for col in ['age', 'height']:
    df[col] = df[col].fillna(df[col].mean(skipna=True))
# age gender job name height
# 0 37.75 F student alice 165.00
# 1 26.00 None student john 180.00
# 2 37.75 M student eric 175.00
# 3 58.00 None manager paul 172.75
# 4 33.00 M engineer julie 171.00
# 5 34.00 F scientist peter 172.75
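For completeness, a sketch of the original loop repaired with .loc, so the assignment hits the frame itself instead of a chained-indexing copy (assuming the goal is to return a new frame and leave the input untouched):

def fillmissing_with_mean(df1):
    df2 = df1.copy()
    for col in df2.select_dtypes('number').columns:
        # .loc writes into df2 directly, avoiding the temporary copy
        df2.loc[df2[col].isnull(), col] = df2[col].mean()
    return df2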
I have a dataframe that has 20 or so columns in it. One of the columns is called 'director_name' and has values such as 'John Doe' or 'Jane Doe'. I want to split this into 2 columns, 'First_Name' and 'Last_Name'. When I run the following it works as expected and splits the string into 2 columns:
data[['First_Name', 'Last_Name']] = data.director_name.str.split(' ', expand=True)
data
First_Name Last_Name
John Doe
It works great; however, it does NOT work when I have NULL (NaN) values under 'director_name'. It throws the following error:
'Columns must be same length as key'
I'd like to add a function which checks whether the value is null: if not, run the command listed above; otherwise enter 'NA' for First_Name and Last_Name.
Any ideas how I would go about that?
EDIT:
I just checked the file and I'm not sure if NULL is the issue. I have some names that are 3-4 words long, e.g.
John Allen Doe
John Allen Doe Jr
Maybe I can't split this into First_Name and Last_Name.
Hmmmm
Here is a way: split and take, say, the first two values as first name and last name.
Id name
0 1 James Cameron
1 2 Martin Sheen
2 3 John Allen Doe
3 4 NaN
df['First_Name'] = df.name.str.split(' ', expand=True)[0]
df['Last_Name'] = df.name.str.split(' ', expand=True)[1]
You get
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen
3 4 NaN NaN None
Use str.split (with no parameter, because the default separator is whitespace) together with str indexing to select list elements by position:
print (df.name.str.split())
0 [James, Cameron]
1 [Martin, Sheen]
2 [John, Allen, Doe]
3 NaN
Name: name, dtype: object
df['First_Name'] = df.name.str.split().str[0]
df['Last_Name'] = df.name.str.split().str[1]
# data borrowed from A-Za-z's answer
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen
3 4 NaN NaN NaN
It is also possible to use the parameter n, so that everything after the first name becomes the last name:
df['First_Name'] = df.name.str.split().str[0]
df['Last_Name'] = df.name.str.split(n=1).str[1]
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen Doe
3 4 NaN NaN NaN
Solution with str.rsplit:
df['First_Name'] = df.name.str.rsplit(n=1).str[0]
df['Last_Name'] = df.name.str.rsplit().str[-1]
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen Doe
3 4 NaN NaN NaN
This should fix your problem
Setup
import numpy as np
import pandas as pd

data = pd.DataFrame({'director_name': {0: 'John Doe', 1: np.nan, 2: 'Alan Smith'}})
data
Out[457]:
director_name
0 John Doe
1 NaN
2 Alan Smith
Solution
# use a lambda function to check for NaN before splitting the column
data[['First_Name', 'Last_Name']] = data.apply(lambda x: pd.Series([np.nan, np.nan] if pd.isnull(x.director_name) else x.director_name.split()), axis=1)
data
Out[446]:
director_name First_Name Last_Name
0 John Doe John Doe
1 NaN NaN NaN
2 Alan Smith Alan Smith
If you need to take only the first 2 names, you can do:
data[['First_Name', 'Last_Name']] = data.apply(lambda x: pd.Series([np.nan,np.nan] if pd.isnull(x.director_name) else x.director_name.split()).iloc[:2], axis=1)
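For what it's worth, the 'Columns must be same length as key' error usually comes from names with more than two words producing extra columns, not from NaN itself; capping the split sidesteps both. A hedged one-liner, assuming everything after the first word should count as the last name:

# n=1 caps the split at two parts; NaN rows simply stay NaN
data[['First_Name', 'Last_Name']] = data['director_name'].str.split(n=1, expand=True)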
Here is an example data set:
>>> df1 = pandas.DataFrame({
"Name": ["Alice", "Marie", "Smith", "Mallory", "Bob", "Doe"],
"City": ["Seattle", None, None, "Portland", None, None],
"Age": [24, None, None, 26, None, None],
"Group": [1, 1, 1, 2, 2, 2]})
>>> df1
Age City Group Name
0 24.0 Seattle 1 Alice
1 NaN None 1 Marie
2 NaN None 1 Smith
3 26.0 Portland 2 Mallory
4 NaN None 2 Bob
5 NaN None 2 Doe
I would like to merge the Name column for all rows of the same group while keeping the City and the Age, wanting something like:
>>> df1_summarised
Age City Group Name
0 24.0 Seattle 1 Alice Marie Smith
1 26.0 Portland 2 Mallory Bob Doe
I know those 2 columns (Age, City) will be NaN/None after the first row of a given group, given the structure of my starting data.
I have tried the following:
>>> print(df1.groupby('Group')['Name'].apply(' '.join))
Group
1 Alice Marie Smith
2 Mallory Bob Doe
Name: Name, dtype: object
But I would like to keep the Age and City columns...
try this:
In [29]: df1.groupby('Group').ffill().groupby(['Group','Age','City']).Name.apply(' '.join)
Out[29]:
Group Age City
1 24.0 Seattle Alice Marie Smith
2 26.0 Portland Mallory Bob Doe
Name: Name, dtype: object
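Note: on recent pandas versions, groupby(...).ffill() no longer includes the 'Group' key column in its result, so the chained call above can fail with a KeyError. A hedged equivalent that fills within each group first:

import pandas as pd

filled = df1.assign(Age=df1.groupby('Group')['Age'].ffill(),
                    City=df1.groupby('Group')['City'].ffill())
out = filled.groupby(['Group', 'Age', 'City']).Name.apply(' '.join)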
Using dropna and assign with groupby:
df1.dropna(subset=['Age', 'City']) \
   .assign(Name=df1.groupby('Group').Name.apply(' '.join).values)
Update: use groupby and agg. I thought of this and it feels far more satisfying:
df1.groupby('Group').agg(dict(Age='first', City='first', Name=' '.join))
To get the exact output:
df1.groupby('Group').agg(dict(Age='first', City='first', Name=' '.join)) \
   .reset_index().reindex(df1.columns, axis=1)  # reindex_axis was removed from modern pandas
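On pandas 0.25+, named aggregation expresses the same thing a bit more readably; a sketch under the same column assumptions:

out = (df1.groupby('Group', as_index=False)
          .agg(Age=('Age', 'first'), City=('City', 'first'), Name=('Name', ' '.join)))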