Let's say I have a pandas DataFrame like this:
import pandas as pd
a={'Country':'Italy','Name':'Augustina','Gender':'Female','Number':1}
b={'Country':'Italy','Name':'Piero','Gender':'Male','Number':2}
c={'Country':'Italy','Name':'Carla','Gender':'Female','Number':3}
d={'Country':'Italy','Name':'Roma','Gender':'Female','Number':4}
e={'Country':'Greece','Name':'Sophia','Gender':'Female','Number':5}
f={'Country':'Greece','Name':'Zeus','Gender':'Male','Number':6}
df=pd.DataFrame([a,b,c,d,e,f])
Then I set a MultiIndex, like this:
df.set_index(['Country','Gender'],inplace=True)
Now, I would like to know how to count how many people are from Italy, or how many Greek females I have in the dataframe.
I've tried
df['Italy'].count()
and
df['Greece']['Female'].count()
Neither of them works.
Thanks
I think you need groupby with the aggregating function size:
See also: What is the difference between size and count in pandas?
a=pd.DataFrame([{'Country':'Italy','Name':'Augustina','Gender':'Female','Number':1}])
b=pd.DataFrame([{'Country':'Italy','Name':'Piero','Gender':'Male','Number':2}])
c=pd.DataFrame([{'Country':'Italy','Name':'Carla','Gender':'Female','Number':3}])
d=pd.DataFrame([{'Country':'Italy','Name':'Roma','Gender':'Female','Number':4}])
e=pd.DataFrame([{'Country':'Greece','Name':'Sophia','Gender':'Female','Number':5}])
f=pd.DataFrame([{'Country':'Greece','Name':'Zeus','Gender':'Male','Number':6}])
df=pd.concat([a,b,c,d,e,f], ignore_index=True)
print (df)
Country Gender Name Number
0 Italy Female Augustina 1
1 Italy Male Piero 2
2 Italy Female Carla 3
3 Italy Female Roma 4
4 Greece Female Sophia 5
5 Greece Male Zeus 6
df1 = df.groupby('Country').size()
print (df1)
Country
Greece 2
Italy 4
dtype: int64
df2 = df.groupby(['Country', 'Gender']).size()
print (df2)
Country  Gender
Greece   Female    1
         Male      1
Italy    Female    3
         Male      1
dtype: int64
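On the size versus count question referenced above, here is a minimal sketch of the difference. The three-row frame is made up for illustration and is not the data from the question: size counts every row in a group, while count skips missing values in the selected column.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the question): one Italian row has no Name.
people = pd.DataFrame({
    "Country": ["Italy", "Italy", "Greece"],
    "Name": ["Piero", np.nan, "Zeus"],
})

# size() counts all rows per group, including rows with missing values.
sizes = people.groupby("Country").size()

# count() counts only the non-missing values of the selected column.
counts = people.groupby("Country")["Name"].count()

print(sizes)
print(counts)
```

When no values are missing, as in the question's data, both give the same numbers.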
If you need only some of the sizes, select from the MultiIndex with xs or slicers:
df.set_index(['Country','Gender'],inplace=True)
print (df)
                     Name  Number
Country Gender
Italy   Female  Augustina       1
        Male        Piero       2
        Female      Carla       3
        Female       Roma       4
Greece  Female     Sophia       5
        Male         Zeus       6
print (df.xs('Italy', level='Country'))
Name Number
Gender
Female Augustina 1
Male Piero 2
Female Carla 3
Female Roma 4
print (len(df.xs('Italy', level='Country').index))
4
print (df.xs(('Greece', 'Female'), level=('Country', 'Gender')))
Name Number
Country Gender
Greece Female Sophia 5
print (len(df.xs(('Greece', 'Female'), level=('Country', 'Gender')).index))
1
# Slicing an unsorted MultiIndex raises
# KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (0)'
# so sort the index first:
df.sort_index(inplace=True)
idx = pd.IndexSlice
print (df.loc[idx['Italy', :],:])
                     Name  Number
Country Gender
Italy   Female  Augustina       1
        Female      Carla       3
        Female       Roma       4
        Male        Piero       2
print (len(df.loc[idx['Italy', :],:].index))
4
print (df.loc[idx['Greece', 'Female'],:])
Name Number
Country Gender
Greece Female Sophia 5
print (len(df.loc[idx['Greece', 'Female'],:].index))
1
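If the index has already been set, another option is to group on the index levels directly. This is a sketch of an alternative to the xs and slicer selections above, not a replacement for them; it rebuilds the question's data so that it runs on its own.

```python
import pandas as pd

records = [
    {"Country": "Italy", "Name": "Augustina", "Gender": "Female", "Number": 1},
    {"Country": "Italy", "Name": "Piero", "Gender": "Male", "Number": 2},
    {"Country": "Italy", "Name": "Carla", "Gender": "Female", "Number": 3},
    {"Country": "Italy", "Name": "Roma", "Gender": "Female", "Number": 4},
    {"Country": "Greece", "Name": "Sophia", "Gender": "Female", "Number": 5},
    {"Country": "Greece", "Name": "Zeus", "Gender": "Male", "Number": 6},
]
df = pd.DataFrame(records).set_index(["Country", "Gender"])

# Group by index levels instead of columns; no prior sort_index is needed.
by_country = df.groupby(level="Country").size()
by_country_gender = df.groupby(level=["Country", "Gender"]).size()

print(by_country["Italy"])                        # people from Italy
print(by_country_gender[("Greece", "Female")])    # Greek females
```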
The sex column of my DataFrame holds, for each row, either several elements, such as (female, female) or (male, female), or a single one (female). Now I want to count the number of males and females in each row. Can you help me, please?
import collections
for female in df_fundinv['borrower_genders'][0]:
collections.counter()
type(liste)
collections.Counter(liste)
0 female
1 female, female
2 female
3 female
4 female
...
646828 female
646829 female
646830 female, female
646831 female, female
646832 female
Name: borrower_genders, Length: 646833, dtype: object
gender.values
männlich={}
for element in gender.values:
try:
'element'!= 'male'
except:
männlich.append(dict(zip(male, counts)))
männlich
What I expect:
male (new column) female (new column)
0 female 1
1 female, female 2
2 female 1
3 female
4 female
...
646828 female
646829 female
646830 female, female 2 2
646831 female, female
646832 female
Name: borrower_genders, Length: 646833, dtype: object
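One possible way to get the per-row counts is Series.str.count with a word-boundary regex, so that "male" is not matched inside "female". This is only a sketch: the four sample values are made up to mirror the shapes described in the question, and the new column names male and female are assumptions.

```python
import pandas as pd

# Made-up sample in the same shape as the borrower_genders column.
gender = pd.Series(
    ["female", "female, female", "male, female", "male"],
    name="borrower_genders",
)

# \b marks a word boundary, so r"\bmale\b" does not match inside "female".
result = gender.to_frame().assign(
    male=gender.str.count(r"\bmale\b"),
    female=gender.str.count(r"\bfemale\b"),
)

print(result)
```

Rows with a missing value stay missing in both new columns, since str.count propagates NaN.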
I want to add a suffix to only the first N columns, but I can't work out how.
This is how to add a suffix to all columns:
import pandas as pd
df = pd.DataFrame( {"name" : ["John","Alex","Kate","Martin"], "surname" : ["Smith","Morgan","King","Cole"],
"job": ["Engineer","Dentist","Coach","Teacher"],"Age":[25,20,25,30],
"Id": [1,2,3,4]})
df.add_suffix("_x")
And this is the result:
name_x surname_x job_x Age_x Id_x
0 John Smith Engineer 25 1
1 Alex Morgan Dentist 20 2
2 Kate King Coach 25 3
3 Martin Cole Teacher 30 4
But I want to add the suffix only to the first N columns, say the first 3. The desired output is:
name_x surname_x job_x Age Id
0 John Smith Engineer 25 1
1 Alex Morgan Dentist 20 2
2 Kate King Coach 25 3
3 Martin Cole Teacher 30 4
Work with the indices and take slices to modify a subset of them:
df.columns = (df.columns[:3]+'_x').union(df.columns[3:], sort=False)
print(df)
name_x surname_x job_x Age Id
0 John Smith Engineer 25 1
1 Alex Morgan Dentist 20 2
2 Kate King Coach 25 3
3 Martin Cole Teacher 30 4
This should work:
N=3
cols=[i for i in df.columns[:N]]
new_cols=[i+'_x' for i in df.columns[:N]]
dict_cols=dict(zip(cols,new_cols))
df.rename(dict_cols,axis=1)
Set the column labels using a list comprehension:
n = 3
df.columns = [f'{c}_x' if i < n else c for i, c in enumerate(df.columns)]
This results in:
name_x surname_x job_x Age Id
0 John Smith Engineer 25 1
1 Alex Morgan Dentist 20 2
2 Kate King Coach 25 3
3 Martin Cole Teacher 30 4
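The list-comprehension idea can be wrapped in a small helper that leaves the original frame untouched. The function name add_suffix_to_first_n is hypothetical, chosen only for this sketch.

```python
import pandas as pd


def add_suffix_to_first_n(frame, n, suffix):
    """Return a copy of frame with suffix appended to its first n column labels."""
    new_labels = [
        f"{col}{suffix}" if i < n else col
        for i, col in enumerate(frame.columns)
    ]
    out = frame.copy()
    out.columns = new_labels
    return out


df = pd.DataFrame({
    "name": ["John", "Alex", "Kate", "Martin"],
    "surname": ["Smith", "Morgan", "King", "Cole"],
    "job": ["Engineer", "Dentist", "Coach", "Teacher"],
    "Age": [25, 20, 25, 30],
    "Id": [1, 2, 3, 4],
})

renamed = add_suffix_to_first_n(df, 3, "_x")
print(renamed.columns.tolist())
```

Renaming by position rather than through a label-to-label dict also behaves predictably if two columns happen to share a name.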
I have a CSV file with some messy data.
I have the following dataframe in pandas:
Name    Age   Sex     Salary  Status
John    32    Nan     NaN     NaN
Nan     Male  4000    Single  NaN
May     20    Female  5000    Married
teresa  45
Desired output:
Name Age Sex Salary Status
0 John 32 Male 4000 Single
1 May 20 Female 5000 Married
2 teresa 45
Does anyone know how to do this with pandas?
You can use a bit of numpy magic to drop the NaNs and reshape the underlying array:
a = df.replace({'Nan': float('nan')}).values.flatten()
pd.DataFrame(a[~pd.isna(a)].reshape(-1, len(df.columns)),
columns=df.columns)
output:
Name Age Sex Salary Status
0 John 32 Male 4000 Single
1 May 20 Female 5000 Married
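The reshape above needs the number of remaining values to be a multiple of the column count, so it fails on an incomplete trailing record such as the teresa row. A hedged sketch of one workaround is to pad the flattened array with NaN first. It assumes missing cells occur only at the very end of the data, and the input frame below is a reconstruction of the question's table, not the asker's actual file.

```python
import numpy as np
import pandas as pd

# Reconstruction of the messy frame from the question (an assumption).
df = pd.DataFrame(
    [
        ["John", 32, "Nan", np.nan, np.nan],
        ["Nan", "Male", 4000, "Single", np.nan],
        ["May", 20, "Female", 5000, "Married"],
        ["teresa", 45, np.nan, np.nan, np.nan],
    ],
    columns=["Name", "Age", "Sex", "Salary", "Status"],
)

# Treat the literal string 'Nan' as missing, flatten row by row, drop the gaps.
flat = df.replace("Nan", np.nan).to_numpy().ravel()
flat = flat[~pd.isna(flat)]

# Pad with NaN so the length becomes a multiple of the column count.
n_cols = len(df.columns)
pad = (-len(flat)) % n_cols
flat = np.concatenate([flat, np.full(pad, np.nan, dtype=object)])

clean = pd.DataFrame(flat.reshape(-1, n_cols), columns=df.columns)
print(clean)
```

If a value is genuinely missing in the middle of a record, every later value shifts one cell to the left, so this only suits data where the gaps are pure misalignment.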
Try groupby:
>>> df.groupby(df['Name'].notna().cumsum()).apply(lambda x: x.apply(lambda x: next(iter(x.dropna()), np.nan))).reset_index(drop=True)
Name Age Sex Salary Status
0 John 32 4000 Single NaN
1 May 20 Female 5000 Married
I have data on births that looks like this:
Date Country Sex
1.1.20 USA M
1.1.20 USA M
1.1.20 Italy F
1.1.20 England M
2.1.20 Italy F
2.1.20 Italy M
3.1.20 USA F
3.1.20 USA F
My goal is to get a new dataframe in which each row is one date and country combination, with the total number of births, the number of male births and the number of female births. It's supposed to look like this:
Date Country Births Males Females
1.1.20 USA 2 2 0
1.1.20 Italy 1 0 1
1.1.20 England 1 1 0
2.1.20 Italy 2 1 1
3.1.20 USA 2 0 2
I tried using this code:
df.groupby(by=['Date', 'Country', 'Sex']).size()
but it only gave me a new column of total births, with different rows for each sex in every date+country combination.
Any help will be appreciated.
Thanks,
Eran
You can group the dataframe on the columns Date and Country, then aggregate the column Sex using value_counts followed by unstack to reshape. Finally, assign the Births column by summing the frequencies along axis=1:
out = df.groupby(['Date', 'Country'], sort=False)['Sex']\
.value_counts().unstack(fill_value=0)
out.assign(Births=out.sum(1)).reset_index()\
.rename(columns={'M': 'Male', 'F': 'Female'})
Or you can use a very similar approach with .crosstab instead of groupby + value_counts:
out = pd.crosstab([df['Date'], df['Country']], df['Sex'], colnames=[None])
out.assign(Births=out.sum(1)).reset_index()\
.rename(columns={'M': 'Male', 'F': 'Female'})
Date Country Female Male Births
0 1.1.20 USA 0 2 2
1 1.1.20 Italy 1 0 1
2 1.1.20 England 0 1 1
3 2.1.20 Italy 1 1 2
4 3.1.20 USA 2 0 2
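For comparison, here is a sketch of a different technique, named aggregation, which yields the column names and column order of the desired output directly. It is an alternative to the two approaches above rather than a restatement of them; the helper columns is_male and is_female are introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["1.1.20"] * 4 + ["2.1.20"] * 2 + ["3.1.20"] * 2,
    "Country": ["USA", "USA", "Italy", "England", "Italy", "Italy", "USA", "USA"],
    "Sex": ["M", "M", "F", "M", "F", "M", "F", "F"],
})

out = (
    # Boolean helper columns: summing them counts the True values.
    df.assign(is_male=df["Sex"].eq("M"), is_female=df["Sex"].eq("F"))
      # sort=False keeps the groups in order of first appearance.
      .groupby(["Date", "Country"], sort=False)
      .agg(
          Births=("Sex", "size"),
          Males=("is_male", "sum"),
          Females=("is_female", "sum"),
      )
      .reset_index()
)

print(out)
```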
I have two dataframes:
df1:
Gender Registered
female 1
male 0
female 0
female 1
male 1
male 0
df2:
Gender
female
female
male
male
I want to modify df2, so that there is a new column 'Count' with the count of registered = 1 for corresponding gender values from df1. For example, in df1 there are 2 registered females and 1 registered male. I want to transform the df2 so that the output is as follows:
output:
Gender Count
female 2
female 2
male 1
male 1
I tried many things and got close but couldn't make it fully work.
sum + map:
v = df1.groupby('Gender').Registered.sum()
df2.assign(Count=df2.Gender.map(v))
Gender Count
0 female 2
1 female 2
2 male 1
3 male 1
pd.merge
pd.merge(df2, df1.groupby('Gender', as_index=False).sum())
Gender Registered
0 female 2
1 female 2
2 male 1
3 male 1
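Both approaches can be run end to end as follows. This sketch rebuilds df1 and df2 from the question; the rename to Count in the merge variant is an addition so that the result matches the requested column name.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Gender": ["female", "male", "female", "female", "male", "male"],
    "Registered": [1, 0, 0, 1, 1, 0],
})
df2 = pd.DataFrame({"Gender": ["female", "female", "male", "male"]})

# Registered is 0/1, so the per-gender sum is the count of registered rows.
registered = df1.groupby("Gender")["Registered"].sum()

# sum + map: look up each df2 gender in the per-gender totals.
mapped = df2.assign(Count=df2["Gender"].map(registered))

# merge: the same lookup as a left join, with the column renamed to Count.
merged = df2.merge(registered.rename("Count").reset_index(), on="Gender", how="left")

print(mapped)
print(merged)
```

A gender that appears in df2 but not in df1 ends up as NaN in Count with either variant.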