Filling missing values with groupby object on Pandas - Titanic dataset - python

I've seen similar questions, but none of them answers mine, or I didn't understand the answers. I am new to ML and am exploring numpy and pandas with well-known datasets on Kaggle. Currently I am working on the Titanic dataset. I have two separate datasets: train and test. I have to fill the missing values in the "Age" column of both. My criterion is a grouped object I created from the train dataset, grouping by "Sex", "Pclass", and "Title" (extracted from each passenger's name).
grouped = train.groupby(["Sex","Title","Pclass"])
# select "Age" before aggregating: median() over text columns such as the name raises a TypeError in pandas 2.x
grouped_m = grouped["Age"].median().reset_index()
Output is:
Sex Title Pclass Age
0 female Miss 1 30.0
1 female Miss 2 24.0
2 female Miss 3 18.0
3 female Mrs 1 40.0
4 female Mrs 2 32.0
5 female Mrs 3 31.0
6 female Officer 1 49.0
7 female Royalty 1 40.5
8 male Master 1 4.0
9 male Master 2 1.0
10 male Master 3 4.0
11 male Mr 1 40.0
12 male Mr 2 31.0
13 male Mr 3 26.0
14 male Officer 1 51.0
15 male Officer 2 46.5
16 male Royalty 1 40.0
These are the values I want to apply to the "Age" column of the test dataset. For example, when a row in the test dataset has Sex = female, Title = Miss, Pclass = 1 and Age = NaN, the NaN must be filled with the matching value from the table above, which is Age = 30.
Before filling:
train["Age"].isna().sum()
Output is:
177
I tried this:
train["Age"] = train["Age"].fillna(grouped["Age"].transform("median"))
It perfectly filled NaN values on train set.
After filling:
train["Age"].isna().sum()
Output is:
0
But when I apply this to the test dataset, it changes nothing and does not raise any error.
Before filling:
test["Age"].isna().sum()
Output is:
86
Then I apply the same fill, using the grouped object that I created from the train dataset:
test["Age"] = test["Age"].fillna(grouped["Age"].transform("median"))
test["Age"].isna().sum()
Output is:
86
The NaN values are still there in the test dataset. How should I fill the NaN values in the test dataset using the grouped object that I created from the train dataset?

We want to fill in the missing ages instead of just dropping the rows where the age is missing. One way to do this is to fill in the mean age of all the passengers (imputation).
Alternatively, check the average age by passenger class and impute per class. For example:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Box plot to see how age differs by passenger class
plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass', y='Age', data=train, palette='winter')
def impute_age(cols):
    # cols is one row holding the 'Age' and 'Pclass' values
    Age = cols['Age']
    Pclass = cols['Pclass']
    if pd.isnull(Age):
        # approximate typical age for each class
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age
train['Age'] = train[['Age', 'Pclass']].apply(impute_age, axis=1)  # fill the missing values
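The per-class values do not have to be hard-coded. A minimal sketch, assuming a toy frame in place of the real Titanic data (the ages below are made up so that the class medians come out as 37, 29 and 24), computes them with groupby and transform:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic training set (values are made up for illustration).
train = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age":    [38.0, np.nan, 36.0, 29.0, np.nan, 22.0, 26.0, np.nan],
})

# Median age per class, broadcast back to one value per row.
class_median = train.groupby("Pclass")["Age"].transform("median")

# Only the NaN entries are replaced; known ages are left alone.
train["Age"] = train["Age"].fillna(class_median)
```

This works row by row because the transformed Series has the same index as train, so fillna can match the two on their labels.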

Edit:
I merged the data using the DataFrame.merge() method, as @ALollz suggested, and it works. Here's the code:
# First filling NaN on train set as I did before.
grouped = train.groupby(["Sex","Title", "Pclass"])
grouped_m = grouped["Age"].median().reset_index()  # select "Age" first so text columns are not aggregated
train["Age"] = train["Age"].fillna(grouped["Age"].transform("median"))
# Then used pd.DataFrame.merge() to apply the same grouped features on the test data.
med = train.groupby(['Sex', 'Pclass', 'Title'],
                    as_index=False)['Age'].median()
test = test.merge(med, on=['Sex','Pclass','Title'], how='left', suffixes=('','_'))
test['Age'] = test['Age'].fillna(test.pop('Age_'))
Thank you everyone!
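For readers wondering why the merge is needed: fillna with a Series matches values by index label, and the Series returned by transform carries the row labels of train, not the group keys. The following is a minimal sketch with made-up toy frames (not the Kaggle data) that lines the train medians up with test by group key instead:

```python
import numpy as np
import pandas as pd

keys = ["Sex", "Pclass", "Title"]

# Toy frames standing in for the Kaggle train/test sets (made-up values).
train = pd.DataFrame({
    "Sex":    ["female", "female", "male", "male", "male"],
    "Pclass": [1, 1, 3, 3, 3],
    "Title":  ["Miss", "Miss", "Mr", "Mr", "Mr"],
    "Age":    [28.0, 32.0, 20.0, 26.0, 32.0],
})
test = pd.DataFrame({
    "Sex":    ["male", "female", "male"],
    "Pclass": [3, 1, 3],
    "Title":  ["Mr", "Miss", "Mr"],
    "Age":    [np.nan, np.nan, 40.0],
}, index=[100, 101, 102])  # row labels that do not occur in train

# One median per group, computed on train only.
med = train.groupby(keys, as_index=False)["Age"].median()

# A left merge matches the medians by group key rather than by row label.
merged = test.merge(med, on=keys, how="left", suffixes=("", "_train"))
filled = merged["Age"].fillna(merged["Age_train"])

# merge returns a fresh 0..n-1 index, so write the result back by position.
test["Age"] = filled.to_numpy()
```

A group that appears in test but not in train gets no median from the left merge, so its ages would stay NaN and need a fallback.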

Related

A question about how to do calculations after a pandas groupby

I'm working with a Data Frame with categorical values where my input DataFrame is below:
df
Age Gender Smoke
18 Female Yes
24 Female No
18 Female Yes
34 Male Yes
34 Male No
I want to group my DataFrame by the columns "Age" and "Gender", with an "Occurrence" column that counts the rows in each group. Then I want two more columns: "Smoke Yes", the share of smokers in the group, and "Smoke No", the share of non-smokers:
Age Gender Occurrence Smoke Yes Smoke No
18 Woman 2 0.50 0.50
24 Woman 1 0 1
34 Man 2 0.5 0.5
In order to do that, I used the following code
#Group and sort
df1=df.groupby(['Age', 'Gender']).size().reset_index(name='Frequency').sort_values('Frequency', ascending=False)
#Delete index
df1.reset_index(drop=True,inplace=True)
However, the df['Smoke'] column has disappeared, so I can't continue my calculation. Does anyone have an idea how I can obtain the output DataFrame shown above?
You can use groupby and value_counts with normalize=True to get the share of each value, then unstack. You can also use a dictionary to replace the values in the Gender column so they match the desired output.
d = {"Female":"Woman","Male":"Man"}
u = (df.groupby(['Age','Gender'])['Smoke'].value_counts(normalize=True)
       .unstack().fillna(0))
s = df.groupby("Age")['Gender'].value_counts()
u.columns = u.columns.name+"_"+u.columns
out=u.rename_axis(None,axis=1).assign(Occurance=s).reset_index().replace({"Gender":d})
print(out)
Age Gender Smoke_No Smoke_Yes Occurance
0 18 Woman 0.0 1.0 2
1 24 Woman 1.0 0.0 1
2 34 Man 0.5 0.5 2
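An alternative sketch of the same idea uses pd.crosstab with normalize="index", which computes the per-group shares in one call. The frame is rebuilt from the question's sample data; the names shares and out are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Age":    [18, 24, 18, 34, 34],
    "Gender": ["Female", "Female", "Female", "Male", "Male"],
    "Smoke":  ["Yes", "No", "Yes", "Yes", "No"],
})

# Share of each Smoke value within every (Age, Gender) group.
shares = pd.crosstab([df["Age"], df["Gender"]], df["Smoke"], normalize="index")
shares = shares.add_prefix("Smoke_").rename_axis(columns=None)

# Group sizes use the same (Age, Gender) index, so they align on assignment.
shares["Occurrence"] = df.groupby(["Age", "Gender"]).size()

out = shares.reset_index()
```

Both 18-year-old rows in the sample are smokers, so that group comes out as Smoke_Yes = 1.0, in line with the printed result above.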

Python pandas problem calculating percent in dataFrame and making it to a list

I have a problem calculating percent within a dataframe.
I have the following dataframe called dfGender:
age gender impressions
0 13-17 female 234561
1 13-17 male 34574
2 25-34 female 120665
3 25-34 male 234560
4 35-44 female 5134
5 35-44 male 2405
6 45-54 female 423
7 45-54 male 324
Now I would like to get a list with the percentage of total impressions for female and male, like this: [female%, male%].
My idea is to pivot_table with the following code:
df_genderSum = dfGender.pivot_table(columns='gender', values='impressions', aggfunc='sum')
Then calculating the total of them all:
df_genderSum['total'] = df_genderSum.sum(axis=1)
Then after this making the percent calculations through:
df_genderSum['female%'] = (df_genderSum['female']/df_genderSum['total'])*100
df_genderSum['male%'] = (df_genderSum['male']/df_genderSum['total'])*100
This gives me the correct results, although I think the code is really messy.
I have 2 questions:
1: Is there a simpler way to do this, where you get a dataframe consisting only of:
gender female% male%
impressions "number" "number"
2: How do I turn it into a list? I was thinking of the following code:
list = df_genderSum.reset_index().values.tolist()
Any help is appreciated!
You can try:
df.groupby('gender')['impressions'].apply(lambda x : (sum(x)/sum(df['impressions'])*100))
gender
female 57.0276
male 42.9724
and
df.groupby('gender')['impressions'].apply(lambda x : (sum(x)/sum(df['impressions'])*100)).to_list()
[57.02762682448004, 42.972373175519957]
If you want the exact dataframe that you asked for, save the above as "s" and do the following:
s=df.groupby('gender')['impressions'].apply(lambda x : (sum(x)/sum(df['impressions'])*100))
pd.DataFrame(s).T
gender female male
impressions 57.027627 42.972373
Here you go:
df_agg = df.drop(['age'], axis=1).groupby('gender').sum()
print(df_agg['impressions']/df_agg['impressions'].sum()*100)
Prints (can be different based on your data):
F 71.428571
M 28.571429
Name: impressions, dtype: float64
df_genderSum = dfGender.groupby('gender')['impressions'].sum() # same result as pivot_table
df_genderSum = df_genderSum / df_genderSum.sum() * 100 # percentage of the total
# now it is a Series, reshape as needed
df_genderSum = df_genderSum.to_frame().T
you can try this one :
(df.groupby('gender').sum()['impressions']/df['impressions'].sum()).to_frame(name = 'impressions').T
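Putting the pieces together, here is a minimal end-to-end sketch on the question's data that produces both the list and the one-row frame. The names totals, percent, percent_list and percent_frame are illustrative only:

```python
import pandas as pd

dfGender = pd.DataFrame({
    "age": ["13-17", "13-17", "25-34", "25-34", "35-44", "35-44", "45-54", "45-54"],
    "gender": ["female", "male"] * 4,
    "impressions": [234561, 34574, 120665, 234560, 5134, 2405, 423, 324],
})

# Total impressions per gender (groupby sorts the keys: female, male).
totals = dfGender.groupby("gender")["impressions"].sum()

# Each gender's share of all impressions, in percent.
percent = totals / totals.sum() * 100

percent_list = percent.tolist()                        # [female%, male%]
percent_frame = percent.to_frame().T.add_suffix("%")   # one row labelled 'impressions'
```

Summing once per group and dividing the resulting Series avoids calling Python's sum inside a lambda for every group.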

How to change categorical column values?

In my data frame I have a column 'countries', and I am trying to change its values into 'developed countries' and 'developing countries'. My data frame is as follows:
countries age gender
1 India 21 Male
2 China 22 Female
3 USA 23 Male
4 UK 25 Male
I have following two arrays:
developed = ['USA','UK']
developing = ['India', 'China']
I want to convert the data frame into the following:
countries age gender
1 developing 21 Male
2 developing 22 Female
3 developed 23 Male
4 developed 25 Male
I tried the following code, but I got a 'SettingWithCopyWarning':
df[df['countries'].isin(developed)]['countries'] = 'developed'
I also tried the following loop, but I got the same 'SettingWithCopyWarning' and my Jupyter notebook hung:
for i, x in enumerate(df['countries']):
    if x in developed:
        df['countries'][i] = 'developed'
Is there an alternative way to change the column's categories?
use np.where:
import numpy as np
df['countries']=np.where(df['countries'].isin(developed),'developed','developing')
print(df)
countries age gender
1 developing 21 Male
2 developing 22 Female
3 developed 23 Male
4 developed 25 Male
Also you can use DataFrame.loc:
c=df['countries'].isin(developed)
df.loc[c,'countries']='developed'
df.loc[~c,'countries']='developing'
print(df)
countries age gender
1 developing 21 Male
2 developing 22 Female
3 developed 23 Male
4 developed 25 Male
You can also try the replace method, which does not give an error.
Updated_DataSet1 = data_set.replace("India", "Developing")
Updated_DataSet2 = Updated_DataSet1.replace("China","Developing")
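Another option, sketched here under the assumption that every country appears in one of the two lists, is to build a lookup dictionary and use Series.map. The name category is illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    "countries": ["India", "China", "USA", "UK"],
    "age": [21, 22, 23, 25],
    "gender": ["Male", "Female", "Male", "Male"],
}, index=[1, 2, 3, 4])

developed = ["USA", "UK"]
developing = ["India", "China"]

# One lookup table: country name -> category.
category = {c: "developed" for c in developed}
category.update({c: "developing" for c in developing})

# map assigns the whole column in one step, so no chained indexing is involved.
df["countries"] = df["countries"].map(category)
```

Countries missing from both lists would become NaN with map, which makes unclassified rows easy to spot.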

Python - Creating a data frame,transpose and merge it to get a table

I am learning Python and I have a question about creating a data frame for every 5 rows, transposing it, and merging the data frames.
I have a .txt file with the following input. It has thousands of rows and I need to go through each line until the end of the file.
Name,Kamath
Age,23
Sex,Male
Company,ACC
Vehicle,Car
Name,Ram
Age,32
Sex,Male
Company,CCA
Vehicle,Bike
Name,Reena
Age,26
Sex,Female
Company,BARC
Vehicle,Cycle
I need to get this as my output:
Name,Age,Sex,Company,Vehicle
Kamath,23,Male,ACC,Car
Ram,32,Male,CCA,Bike
Reena,26,Female,BARC,Cycle
Use read_csv to build a DataFrame, then pivot, using cumcount as a counter for the new index:
from io import StringIO
import pandas as pd
temp=u"""Name,Kamath
Age,23
Sex,Male
Company,ACC
Vehicle,Car
Name,Ram
Age,32
Sex,Male
Company,CCA
Vehicle,Bike
Name,Reena
Age,26
Sex,Female
Company,BARC
Vehicle,Cycle"""
#after testing replace 'StringIO(temp)' with 'filename.txt'
df = pd.read_csv(StringIO(temp), names=['a','b'])
print (df)
a b
0 Name Kamath
1 Age 23
2 Sex Male
3 Company ACC
4 Vehicle Car
5 Name Ram
6 Age 32
7 Sex Male
8 Company CCA
9 Vehicle Bike
10 Name Reena
11 Age 26
12 Sex Female
13 Company BARC
14 Vehicle Cycle
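# number the repeats of each key with cumcount, then pivot on that counter
df = (df.assign(row=df.groupby('a').cumcount())
        .pivot(index='row', columns='a', values='b')
        .rename_axis(index=None))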
print (df)
a Age Company Name Sex Vehicle
0 23 ACC Kamath Male Car
1 32 CCA Ram Male Bike
2 26 BARC Reena Female Cycle
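The pivot sorts the columns alphabetically. If every record really has the same five fields in the same order (an assumption about the file, not something pandas checks), a reshape keeps the original column order. A minimal sketch, with raw, pairs and table as illustrative names:

```python
from io import StringIO

import pandas as pd

raw = """Name,Kamath
Age,23
Sex,Male
Company,ACC
Vehicle,Car
Name,Ram
Age,32
Sex,Male
Company,CCA
Vehicle,Bike
Name,Reena
Age,26
Sex,Female
Company,BARC
Vehicle,Cycle"""

# For a real file, pass its path instead of StringIO(raw).
pairs = pd.read_csv(StringIO(raw), names=["field", "value"])

n_fields = 5  # assumed: each record is exactly these 5 lines, always in this order
columns = pairs["field"].iloc[:n_fields].tolist()

# Cut the value column into rows of n_fields entries each.
table = pd.DataFrame(pairs["value"].to_numpy().reshape(-1, n_fields), columns=columns)

csv_text = table.to_csv(index=False)
```

The reshape raises a ValueError when the number of lines is not a multiple of n_fields, which gives an early warning about incomplete records.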

Write value in next available cell csv

I have code that writes people's names, ages and scores for a quiz that I made. I simplified the code to write the names and ages together rather than separately, but I can't write the score along with them because it is produced in a separate part of the code. The CSV file looks like this:
name, age, score
Alfie, 15, 20
Michael, 16, 19
Alfie, 15, #After I simplified
Dylan, 16,
As you can see, I don't know how to write a value in the 3rd column. Does anyone know how to write a value into the next available cell of column 2 (the score column) in a CSV file? I'm new to programming, so any help would be greatly appreciated.
Michael
This is your data:
import pandas as pd
df = pd.DataFrame({'name':['Alfie','Michael','Alfie','Dylan'], 'age':[15,16,15,16], 'score':[20,19,None,None]})
Out:
name age score
0 Alfie 15 20.0
1 Michael 16 19.0
2 Alfie 15 nan
3 Dylan 16 nan
If you need to read the CSV into pandas, use:
import pandas as pd
df = pd.read_csv('Your_file_name.csv')
I suggest two ways to solve your problem:
df.fillna(0, inplace=True) fills every missing cell (with 0 in this example).
Out:
name age score
0 Alfie 15 20.0
1 Michael 16 19.0
2 Alfie 15 0.0
3 Dylan 16 0.0
df.loc[2,'score'] = 22 fills one specific cell.
Out:
name age score
0 Alfie 15 20.0
1 Michael 16 19.0
2 Alfie 15 22.0
3 Dylan 16 nan
If you then need to write the fixed data back to CSV, use:
df.to_csv('New_name.csv', sep=',', index=False)
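To write into the "next available" score cell without knowing its row number, one possible sketch is to look up the first row whose score is missing. The helper fill_next_score is hypothetical (not a pandas function), and the CSV text is a cleaned-up copy of the question's file without spaces after the commas:

```python
from io import StringIO

import pandas as pd

csv_text = """name,age,score
Alfie,15,20
Michael,16,19
Alfie,15,
Dylan,16,
"""

# For a real file, pass its path instead of StringIO(csv_text).
df = pd.read_csv(StringIO(csv_text))


def fill_next_score(frame, score):
    """Write score into the first row whose 'score' cell is empty (hypothetical helper)."""
    missing = frame.index[frame["score"].isna()]
    if len(missing) == 0:
        raise ValueError("no empty score cell left")
    frame.loc[missing[0], "score"] = score
    return frame


df = fill_next_score(df, 22)

# Write back with the header row and without the extra index column.
new_csv = df.to_csv(index=False)
```

Calling fill_next_score again would fill Dylan's row, and a third call would raise because no empty cell remains.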
