Creating a new variable by aggregation in Python

I have data on births that looks like this:
Date Country Sex
1.1.20 USA M
1.1.20 USA M
1.1.20 Italy F
1.1.20 England M
2.1.20 Italy F
2.1.20 Italy M
3.1.20 USA F
3.1.20 USA F
My purpose is to get a new dataframe in which each row is a date-and-country combination, followed by the total number of births, the number of male births, and the number of female births. It should look like this:
Date Country Births Males Females
1.1.20 USA 2 2 0
1.1.20 Italy 1 0 1
1.1.20 England 1 1 0
2.1.20 Italy 2 1 1
3.1.20 USA 2 0 2
I tried using this code:
df.groupby(by=['Date', 'Country', 'Sex']).size()
but it only gave me the total births, with a separate row for each sex within every date+country combination.
any help will be appreciated.
Thanks,
Eran

You can group the dataframe on the Date and Country columns, aggregate the Sex column with value_counts, reshape with unstack, and finally create the Births column by summing along axis=1:
out = df.groupby(['Date', 'Country'], sort=False)['Sex']\
        .value_counts().unstack(fill_value=0)
out.assign(Births=out.sum(axis=1)).reset_index()\
   .rename(columns={'M': 'Male', 'F': 'Female'})
Or you can use a very similar approach with pd.crosstab instead of groupby + value_counts:
out = pd.crosstab([df['Date'], df['Country']], df['Sex'], colnames=[None])
out.assign(Births=out.sum(axis=1)).reset_index()\
   .rename(columns={'M': 'Male', 'F': 'Female'})
Date Country Female Male Births
0 1.1.20 USA 0 2 2
1 1.1.20 Italy 1 0 1
2 1.1.20 England 0 1 1
3 2.1.20 Italy 1 1 2
4 3.1.20 USA 2 0 2
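If you want the exact Males/Females column names the question asks for, a named-aggregation alternative can also work; the sample frame below is reconstructed from the question's data:

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    'Date': ['1.1.20', '1.1.20', '1.1.20', '1.1.20',
             '2.1.20', '2.1.20', '3.1.20', '3.1.20'],
    'Country': ['USA', 'USA', 'Italy', 'England',
                'Italy', 'Italy', 'USA', 'USA'],
    'Sex': ['M', 'M', 'F', 'M', 'F', 'M', 'F', 'F'],
})

# Named aggregation: count rows per group, and sum boolean masks per sex
out = (df.assign(is_m=df['Sex'].eq('M'), is_f=df['Sex'].eq('F'))
         .groupby(['Date', 'Country'], sort=False)
         .agg(Births=('Sex', 'size'),
              Males=('is_m', 'sum'),
              Females=('is_f', 'sum'))
         .reset_index())
print(out)
```

sort=False keeps the groups in order of first appearance, matching the expected output.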

Related

Calculate nonzeros percentage for specific columns and each row in Pandas

If I have the following dataframe:
df = pd.DataFrame({'name': ['john', 'mary', 'peter', 'jeff', 'bill', 'lisa', 'jose'],
                   'gender': ['M', 'F', 'M', 'M', 'M', 'F', 'M'],
                   'state': ['california', 'dc', 'california', 'dc', 'california', 'texas', 'texas'],
                   'num_children': [2, 0, 0, 3, 2, 1, 4],
                   'num_pets': [5, 1, 0, 5, 2, 2, 3]})
name gender state num_children num_pets
0 john M california 2 5
1 mary F dc 0 1
2 peter M california 0 0
3 jeff M dc 3 5
4 bill M california 2 2
5 lisa F texas 1 2
6 jose M texas 4 3
I want to create a new row and a new column pct. holding the percentage of zero values in the columns num_children and num_pets
Expected output:
name gender state num_children num_pets pct.
0 pct. 28.6% 14.3%
1 john M california 2 5 0%
2 mary F dc 0 1 50%
3 peter M california 0 0 100%
4 jeff M dc 3 5 0%
5 bill M california 2 2 0%
6 lisa F texas 1 2 0%
7 jose M texas 4 3 0%
I have calculated the percentage of zeros in each row for the target columns:
df['pct'] = df[['num_children', 'num_pets']].astype(bool).sum(axis=1)/2
df['pct.'] = 1-df['pct']
del df['pct']
df['pct.'] = pd.Series(["{0:.0f}%".format(val * 100) for val in df['pct.']], index = df.index)
name gender state num_children num_pets pct.
0 john M california 2 5 0%
1 mary F dc 0 1 50%
2 peter M california 0 0 100%
3 jeff M dc 3 5 0%
4 bill M california 2 2 0%
5 lisa F texas 1 2 0%
6 jose M texas 4 3 0%
But I don't know how to insert the results below into the pct. row of the expected output. Please help me get the expected result in a more pythonic way. Thanks.
df[['num_children', 'num_pets']].astype(bool).sum(axis=0)/len(df.num_children)
Out[153]:
num_children 0.714286
num_pets 0.857143
dtype: float64
UPDATE: the same thing, but for sums; great thanks to @jezrael:
df['sums'] = df[['num_children', 'num_pets']].sum(axis=1)
df1 = (df[['num_children', 'num_pets']].sum()
       .to_frame()
       .T
       .assign(name='sums'))
df = pd.concat([df1.reindex(columns=df.columns, fill_value=''), df],
               ignore_index=True, sort=False)
print (df)
name gender state num_children num_pets sums
0 sums 12 18
1 john M california 2 5 7
2 mary F dc 0 1 1
3 peter M california 0 0 0
4 jeff M dc 3 5 8
5 bill M california 2 2 4
6 lisa F texas 1 2 3
7 jose M texas 4 3 7
You can compare the values to 0 with DataFrame.eq and take the mean of the boolean mask (since sum/len is the mean by definition), then multiply by 100 and format the percentage with apply:
s = df[['num_children', 'num_pets']].eq(0).mean(axis=1)
df['pct'] = s.mul(100).apply("{0:.0f}%".format)
For the first row, create a new DataFrame with the same columns as the original and concat them together:
df1 = (df[['num_children', 'num_pets']].eq(0)
       .mean()
       .mul(100)
       .apply("{0:.1f}%".format)
       .to_frame()
       .T
       .assign(name='pct.'))
df = pd.concat([df1.reindex(columns=df.columns, fill_value=''), df],
               ignore_index=True, sort=False)
print (df)
name gender state num_children num_pets pct
0 pct. 28.6% 14.3%
1 john M california 2 5 0%
2 mary F dc 0 1 50%
3 peter M california 0 0 100%
4 jeff M dc 3 5 0%
5 bill M california 2 2 0%
6 lisa F texas 1 2 0%
7 jose M texas 4 3 0%
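Putting both steps together, here is a self-contained sketch using the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'name': ['john', 'mary', 'peter', 'jeff', 'bill', 'lisa', 'jose'],
                   'gender': ['M', 'F', 'M', 'M', 'M', 'F', 'M'],
                   'state': ['california', 'dc', 'california', 'dc',
                             'california', 'texas', 'texas'],
                   'num_children': [2, 0, 0, 3, 2, 1, 4],
                   'num_pets': [5, 1, 0, 5, 2, 2, 3]})

cols = ['num_children', 'num_pets']
# Per-row percentage of zeros across the target columns
df['pct.'] = df[cols].eq(0).mean(axis=1).mul(100).apply('{0:.0f}%'.format)
# Per-column percentage of zeros, shaped as a one-row frame
top = (df[cols].eq(0).mean().mul(100).apply('{0:.1f}%'.format)
       .to_frame().T.assign(name='pct.'))
# Prepend the summary row, filling the remaining columns with ''
df = pd.concat([top.reindex(columns=df.columns, fill_value=''), df],
               ignore_index=True, sort=False)
print(df)
```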

How to group rows so as to use value_counts on the created groups with pandas?

I have some customer data such as this in a data frame:
S No Country Sex
1 Spain M
2 Norway F
3 Mexico M
...
I want to have an output such as this:
Spain
M = 1207
F = 230
Norway
M = 33
F = 102
...
I have a basic notion that I want to group my rows by country with something like df.groupby(df.Country), and then run something like df.Sex.value_counts() on each group.
Thanks!
I think you need crosstab:
df = pd.crosstab(df.Sex, df.Country)
Or, if you want to use your solution, add unstack to move the first level of the MultiIndex into columns:
df = df.groupby(df.Country).Sex.value_counts().unstack(level=0, fill_value=0)
print (df)
Country Mexico Norway Spain
Sex
F 0 1 0
M 1 0 1
EDIT:
If you want to add more columns, you can set which index level is converted to columns via the level parameter:
df1 = df.groupby([df.No, df.Country]).Sex.value_counts().unstack(level=0, fill_value=0).reset_index()
print (df1)
No Country Sex 1 2 3
0 Mexico M 0 0 1
1 Norway F 0 1 0
2 Spain M 1 0 0
df2 = df.groupby([df.No, df.Country]).Sex.value_counts().unstack(level=1, fill_value=0).reset_index()
print (df2)
Country No Sex Mexico Norway Spain
0 1 M 0 0 1
1 2 F 0 1 0
2 3 M 1 0 0
df2 = df.groupby([df.No, df.Country]).Sex.value_counts().unstack(level=2, fill_value=0).reset_index()
print (df2)
Sex No Country F M
0 1 Spain 0 1
1 2 Norway 1 0
2 3 Mexico 0 1
You can also use pandas.pivot_table:
res = df.pivot_table(index='Country', columns='Sex', aggfunc='count', fill_value=0)
print(res)
SNo
Sex F M
Country
Mexico 0 1
Norway 1 0
Spain 0 1
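If totals are also wanted alongside the per-sex counts, crosstab can append them via margins=True; the small frame below is reconstructed from the question:

```python
import pandas as pd

df = pd.DataFrame({'No': [1, 2, 3],
                   'Country': ['Spain', 'Norway', 'Mexico'],
                   'Sex': ['M', 'F', 'M']})

# margins=True appends an 'All' row and column holding the totals
res = pd.crosstab(df['Country'], df['Sex'], margins=True)
print(res)
```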

Counting elements in Pandas

Let's say I have a pandas DataFrame like this
import pandas as pd
a=pd.Series([{'Country':'Italy','Name':'Augustina','Gender':'Female','Number':1}])
b=pd.Series([{'Country':'Italy','Name':'Piero','Gender':'Male','Number':2}])
c=pd.Series([{'Country':'Italy','Name':'Carla','Gender':'Female','Number':3}])
d=pd.Series([{'Country':'Italy','Name':'Roma','Gender':'Female','Number':4}])
e=pd.Series([{'Country':'Greece','Name':'Sophia','Gender':'Female','Number':5}])
f=pd.Series([{'Country':'Greece','Name':'Zeus','Gender':'Male','Number':6}])
df=pd.DataFrame([a,b,c,d,e,f])
then I set a MultiIndex, like
df.set_index(['Country','Gender'],inplace=True)
Now, I would like to know how to count how many people are from Italy, or how many Greek females I have in the dataframe.
I've tried
df['Italy'].count()
and
df['Greece']['Female'].count()
None of them works.
Thanks
I think you need groupby with aggregating size:
What is the difference between size and count in pandas?
a=pd.DataFrame([{'Country':'Italy','Name':'Augustina','Gender':'Female','Number':1}])
b=pd.DataFrame([{'Country':'Italy','Name':'Piero','Gender':'Male','Number':2}])
c=pd.DataFrame([{'Country':'Italy','Name':'Carla','Gender':'Female','Number':3}])
d=pd.DataFrame([{'Country':'Italy','Name':'Roma','Gender':'Female','Number':4}])
e=pd.DataFrame([{'Country':'Greece','Name':'Sophia','Gender':'Female','Number':5}])
f=pd.DataFrame([{'Country':'Greece','Name':'Zeus','Gender':'Male','Number':6}])
df=pd.concat([a,b,c,d,e,f], ignore_index=True)
print (df)
Country Gender Name Number
0 Italy Female Augustina 1
1 Italy Male Piero 2
2 Italy Female Carla 3
3 Italy Female Roma 4
4 Greece Female Sophia 5
5 Greece Male Zeus 6
df = df.groupby('Country').size()
print (df)
Country
Greece 2
Italy 4
dtype: int64
df = df.groupby(['Country', 'Gender']).size()
print (df)
Country Gender
Greece Female 1
Male 1
Italy Female 3
Male 1
dtype: int64
If you need only some of the sizes, select by MultiIndex with xs or slicers:
df.set_index(['Country','Gender'],inplace=True)
print (df)
Name Number
Country Gender
Italy Female Augustina 1
Male Piero 2
Female Carla 3
Female Roma 4
Greece Female Sophia 5
Male Zeus 6
print (df.xs('Italy', level='Country'))
Name Number
Gender
Female Augustina 1
Male Piero 2
Female Carla 3
Female Roma 4
print (len(df.xs('Italy', level='Country').index))
4
print (df.xs(('Greece', 'Female'), level=('Country', 'Gender')))
Name Number
Country Gender
Greece Female Sophia 5
print (len(df.xs(('Greece', 'Female'), level=('Country', 'Gender')).index))
1
# Without sort_index, slicing raises KeyError: 'MultiIndex Slicing requires
# the index to be fully lexsorted tuple len (2), lexsort depth (0)'
df.sort_index(inplace=True)
idx = pd.IndexSlice
print (df.loc[idx['Italy', :],:])
Name Number
Country Gender
Italy Female Augustina 1
Female Carla 3
Female Roma 4
Male Piero 2
print (len(df.loc[idx['Italy', :],:].index))
4
print (df.loc[idx['Greece', 'Female'],:])
Name Number
Country Gender
Greece Female Sophia 5
print (len(df.loc[idx['Greece', 'Female'],:].index))
1
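On newer pandas (1.1+), DataFrame.value_counts offers a compact alternative for the same counts; a sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Italy', 'Italy', 'Italy', 'Italy', 'Greece', 'Greece'],
                   'Gender': ['Female', 'Male', 'Female', 'Female', 'Female', 'Male'],
                   'Name': ['Augustina', 'Piero', 'Carla', 'Roma', 'Sophia', 'Zeus'],
                   'Number': [1, 2, 3, 4, 5, 6]})

# DataFrame.value_counts counts unique combinations of the subset columns
counts = df.value_counts(subset=['Country', 'Gender'], sort=False)
print(counts)
print(counts['Italy'].sum())         # people from Italy
print(counts[('Greece', 'Female')])  # Greek females
```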

Create columns based on subsequent row values in Pandas

I have a dataframe containing about 300 000 rows with a structure like this:
name Jack
gender M
year 1993
country USA
city Odessa
name John
gender M
year 1992
name Sam
country Canada
city Toronto
Is there a way to make the dataframe look like this using Pandas?
name gender year country city
Jack M 1993 USA Odessa
John M 1992
Sam Canada Toronto
The "name" row is always there, but the others may be absent. I tried using iterrows with no success.
In [17]:
g = np.cumsum(df.iloc[:, 0] == 'name')
In [15]:
df.groupby(g).apply(lambda x: pd.DataFrame(x.set_index([0]).T,
                    columns=['name', 'gender', 'year', 'country', 'city']))
Out[15]:
name gender year country city
0
1 1 Jack M 1993 USA Odessa
2 1 John M 1992 NaN NaN
3 1 Sam NaN NaN Canada Toronto
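A groupby-free alternative sketch: build a record id with cumsum and reshape with unstack. Column labels 0 and 1 are assumptions here (an unnamed two-column key/value frame, as in the accepted answer):

```python
import pandas as pd

# Key/value pairs reconstructed from the question (two unnamed columns)
df = pd.DataFrame([('name', 'Jack'), ('gender', 'M'), ('year', '1993'),
                   ('country', 'USA'), ('city', 'Odessa'),
                   ('name', 'John'), ('gender', 'M'), ('year', '1992'),
                   ('name', 'Sam'), ('country', 'Canada'), ('city', 'Toronto')])

# Each 'name' row starts a new record; the cumulative sum gives a record id
rec = df[0].eq('name').cumsum().rename('rec')
out = (df.set_index([rec, 0])[1]
         .unstack()
         .reindex(columns=['name', 'gender', 'year', 'country', 'city'])
         .reset_index(drop=True))
print(out)
```

Missing fields come out as NaN, matching the blanks in the expected output.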

Group by values across two columns and filter in Pandas

I have a DataFrame like this:
name sex births year
0 Mary F 7433 2000
1 John M 6542 2000
2 Emma F 2342 2000
3 Ron M 5432 2001
4 Bessie F 4234 2001
5 Jennie F 2413 2002
6 Nick M 2343 2002
7 Ron M 4342 2002
I need to get a new DataFrame where the data is grouped by year and sex, and the last two columns are the name with the most births and the max births value, like this:
year sex name births
0 2000 F Mary 7433
1 2000 M John 6542
2 2001 F Bessie 4234
3 2001 M Ron 5432
4 2002 F Jennie 2413
5 2002 M Ron 4342
It can be done using the following groupby operation:
>>> df.groupby(['year', 'sex'], as_index=False).max()
year sex name births
0 2000 F Mary 7433
1 2000 M John 6542
2 2001 F Bessie 4234
3 2001 M Ron 5432
4 2002 F Jennie 2413
5 2002 M Ron 4342
as_index=False stops the groupby keys from becoming the index in the returned DataFrame.
Note, however, that .max() computes the maximum of each column independently, so the name may not come from the row with the most births. To reliably get the desired output, sort by the 'births' column and use groupby.first():
df = df.sort_values(by='births', ascending=False)
df.groupby(['year', 'sex'], as_index=False).first()
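Another option that keeps whole rows intact is groupby.idxmax, sketched here with the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Mary', 'John', 'Emma', 'Ron', 'Bessie', 'Jennie', 'Nick', 'Ron'],
                   'sex': ['F', 'M', 'F', 'M', 'F', 'F', 'M', 'M'],
                   'births': [7433, 6542, 2342, 5432, 4234, 2413, 2343, 4342],
                   'year': [2000, 2000, 2000, 2001, 2001, 2002, 2002, 2002]})

# idxmax returns the row label of the maximum births in each group,
# so the selected row keeps name and births paired correctly
out = (df.loc[df.groupby(['year', 'sex'])['births'].idxmax()]
         .sort_values(['year', 'sex'])
         .reset_index(drop=True))
print(out[['year', 'sex', 'name', 'births']])
```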
