The Background
I have a data set of a simulated population of people. They have the following attributes:
Age (0-120 years)
Gender (male,female)
Race (white, black, hispanic, asian, other)
df.head()
Age Race Gender in_population
0 32 0 0 1
1 53 0 0 1
2 49 0 1 1
3 12 0 0 1
4 28 0 0 1
There is another variable, "In_Population"*, which is a boolean. I am using groupby in pandas to group the population by the possible combinations of the 3 attributes and to calculate a table of counts by summing the "In_Population" variable in each possible category of person.
There are 2 genders * 5 races * 121 ages = 1210 total possible groups that every individual in the population will fall under.
If a particular group of people in a particular year has no members (e.g. 0 year old male 'other'), then I still want that group to show up in my group-by dataframe, but with a zero in the count. This happens correctly in the data sample below (Age = 0, Gender = {0,1}, and Race = 4). There were no 'other' zero year olds in this particular population.
grouped_obj = df.groupby( ['Age','Gender','Race'] )
groupedAGR = grouped_obj.sum()
groupedAGR.head(10)
in_population
Age Gender Race
0 0 0 16
1 8
2 63
3 5
4 0
1 0 22
1 4
2 64
3 12
4 0
The issue
This only happens for some of the Age-Gender-Race combinations.
Sometimes the groups with a zero sum get skipped entirely. The following is the data for age 45. I was expecting to see a 0, indicating that there are no 45 year old males of race 'other' in this data set.
>>> groupedAGR.xs( 45, level = 'Age' )
in_population
Gender Race
0 0 515
1 68
2 40
3 20
1 0 522
1 83
2 48
3 29
4 3
Notes
*"In_Population"
Basically filters out "newborns" and "immigrants" who are not part of the relevant population when calculating "Mortality Rates"; the deaths in the population happen before immigration and births do, so I exclude newborns and immigrants from the calculations. I suspected this had something to do with the issue, since the zero year olds were showing zero counts while other age groups were not showing anything at all, but that's not the case.
>>> groupedAGR.xs( 88, level = 'Age' )
in_population
Gender Race
0 0 52
2 1
3 0
1 0 62
1 3
2 5
3 3
4 1
There are no 88 year old Asian men in the population, so there's a zero in the category. There are no 88 year old 'other' men in the population either, but they don't show up at all.
EDIT: I added the code showing how I'm building the groupby object in pandas and how I'm summing to find the counts in each group.
Use reindex with a predefined index and fill_value=0
groupby only creates groups for combinations that actually occur in the data: the zero counts you do see (e.g. Age 0, Race 4) are there because rows for those combinations exist but have in_population == 0, while combinations with no rows at all are dropped from the result entirely. Reindexing against the full set of 1210 combinations puts the missing groups back with a count of 0.
import numpy as np
import pandas as pd

ages = np.arange(21, 26)
genders = ['male', 'female']
races = ['white', 'black', 'hispanic', 'asian', 'other']
sim_size = 10000

midx = pd.MultiIndex.from_product([
    ages,
    genders,
    races
], names=['Age', 'Gender', 'Race'])

sim_df = pd.DataFrame({
    # I use ages[1:-1] to explicitly skip some age groups
    'Age': np.random.choice(ages[1:-1], sim_size),
    'Gender': np.random.choice(genders, sim_size),
    'Race': np.random.choice(races, sim_size)
})
These will have missing age groups
counts = sim_df.groupby(sim_df.columns.tolist()).size()
counts.unstack()
This fills in missing age groups
counts.reindex(midx, fill_value=0).unstack()
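Applied back to the question's frame, a minimal sketch (assuming the integer Age, Gender and Race codes shown in the question: ages 0-120, genders 0-1 and races 0-4) would be:

full_idx = pd.MultiIndex.from_product(
    [np.arange(121), [0, 1], [0, 1, 2, 3, 4]],
    names=['Age', 'Gender', 'Race']
)

groupedAGR = df.groupby(['Age', 'Gender', 'Race']).sum()

# every Age/Gender/Race combination with no rows now shows up with a zero count
groupedAGR_full = groupedAGR.reindex(full_idx, fill_value=0)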
Related
I'm a python newbie and getting a bit lost in how to transform my data.
Here's an example dataset:
import numpy as np
import pandas as pd

np.random.seed(123)  # seed numpy's RNG; random.seed would not affect np.random calls
df = pd.DataFrame({'pp': list(range(1, 11)),
                   'age': list(np.random.randint(1, 9, 10) * 10),
                   'gender': list(np.random.randint(1, 3, 10)),
                   'yes/no': list(np.random.randint(0, 2, 10))})
>>> df
pp age gender yes/no
0 1 20 1 1
1 2 50 1 0
2 3 10 2 1
3 4 50 1 1
4 5 40 2 0
5 6 60 2 0
6 7 30 2 1
7 8 70 1 0
8 9 30 2 0
9 10 70 1 0
I want to create three new columns within my dataframe which represent the ratios between my different variables, namely:
ratio between gender 1 and 2 per yes/no category,
ratio between all existing age groups per yes/no category,
ratio between age and gender combination per yes/no category
For the first example I got something working like this:
df.groupby(["gender", "yes/no"]).size()/df.groupby(["yes/no"]).size()
But I'd actually want to get the output values as a new column, one value per pp.
Anyone know a neat way to do this?
Try this:
(df.groupby(["gender", "yes/no"]).size()/df.groupby(["yes/no"]).size()).rename('ratio').reset_index()
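To get one value per pp as a new column, as asked, one option is to merge the ratio table back onto the original frame on the grouping keys (a sketch, assuming the df above):

ratio = (df.groupby(['gender', 'yes/no']).size()
           / df.groupby(['yes/no']).size()).rename('ratio').reset_index()

# merge on the grouping keys so every participant row carries its group's ratio
df = df.merge(ratio, on=['gender', 'yes/no'], how='left')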
I have a dataframe with 9 columns, two of which are gender and smoker status. Every row in the dataframe is a person, and each column is their entry on a particular trait.
I want to count the number of entries that satisfy the condition of being both a smoker and male.
I have tried using a sum function:
maleSmoke = sum(1 for i in data['gender'] if i is 'm' and i in data['smoker'] if i is 1 )
but this always returns 0. This method works when I only check one criterion, however, and I can't figure out how to expand it to a second.
I also tried writing a function that counted its way through every entry into the dataframe but this also returns 0 for all entries.
def countSmokeGender(df):
    maleSmoke = 0
    femaleSmoke = 0
    maleNoSmoke = 0
    femaleNoSmoke = 0
    for i in range(20000):
        if df['gender'][i] is 'm' and df['smoker'][i] is 1:
            maleSmoke = maleSmoke + 1
        if df['gender'][i] is 'f' and df['smoker'][i] is 1:
            femaleSmoke = femaleSmoke + 1
        if df['gender'][i] is 'm' and df['smoker'][i] is 0:
            maleNoSmoke = maleNoSmoke + 1
        if df['gender'][i] is 'f' and df['smoker'][i] is 0:
            femaleNoSmoke = femaleNoSmoke + 1
    return maleSmoke, femaleSmoke, maleNoSmoke, femaleNoSmoke
I've tried pulling out the data sets as numpy arrays and counting those but that wasn't working either.
Are you using pandas?
Assuming you are, you can simply do this (as a side note, the attempts above return 0 because the is operator tests object identity, not equality; use == for value comparisons):
# How many male smokers
len(df[(df['gender']=='m') & (df['smoker']==1)])
# How many female smokers
len(df[(df['gender']=='f') & (df['smoker']==1)])
# How many male non-smokers
len(df[(df['gender']=='m') & (df['smoker']==0)])
# How many female non-smokers
len(df[(df['gender']=='f') & (df['smoker']==0)])
Or, you can use groupby:
df.groupby(['gender'])['smoker'].sum()
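Note that summing the smoker column only counts the smokers per gender; grouping on both columns gives all four combinations at once:

df.groupby(['gender', 'smoker']).size()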
Another alternative, which is great for data exploration: .pivot_table
With a DataFrame like this
id gender smoker other_trait
0 0 m 0 0
1 1 f 1 1
2 2 m 1 0
3 3 m 1 1
4 4 f 1 0
.. .. ... ... ...
95 95 f 0 0
96 96 f 1 1
97 97 f 0 1
98 98 m 0 0
99 99 f 1 0
you could do
result = df.pivot_table(
    index="smoker", columns="gender", values="id", aggfunc="count"
)
to get a result like
gender f m
smoker
0 32 16
1 27 25
If you want to display the partial counts you can add the margins=True option and get
gender f m All
smoker
0 32 16 48
1 27 25 52
All 59 41 100
If you don't have a column to count over (you can't use smoker and gender because they are used for the labels) you could add a dummy column:
result = df.assign(dummy=1).pivot_table(
    index="smoker", columns="gender", values="dummy", aggfunc="count",
    margins=True
)
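For pure counting, pd.crosstab is a related shortcut that needs no values or dummy column at all (a sketch using the same gender and smoker columns):

result = pd.crosstab(df['smoker'], df['gender'], margins=True)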
I have a dataset which I want to group by the age.
So, here is the first part of the dataset:
It is a simulation of inventory data. Größe is the number of people with the age (Alter) 15, Risiko gives every person a number, and Geschlecht is female or male.
I want to add a column "Group" and give all people aged 15-19 one number, those aged 20-24 another number, and so on.
How can I do this?
You can use map and a lambda to create a new column like so:
def return_age_from_range(age):
    # Max value in range is excluded, so remember to add +1 to the range you want
    if age in range(15, 20):
        return 1
    elif age in range(20, 25):
        return 2
    # and so on...

df['group'] = df['Alter'].map(lambda x: return_age_from_range(x))
Use numpy.select:
In [488]: import numpy as np
In [489]: conds = [df['Alter'].between(15,19), df['Alter'].between(20,24), df['Alter'].between(25,29)]
In [490]: choices = [1,2,3]
In [493]: df['Group'] = np.select(conds, choices)
In [494]: df
Out[494]:
Größe Risiko Geschlecht Alter Group
0 95 1 F 15 1
1 95 2 F 15 1
2 95 3 M 15 1
3 95 4 F 15 1
4 95 5 M 15 1
5 95 6 M 15 1
6 95 7 M 15 1
7 95 8 F 15 1
8 95 9 M 15 1
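For many evenly spaced bins, pd.cut saves writing each range by hand; a sketch using the Alter column from this question:

# bin edges are exclusive on the left and inclusive on the right:
# (14, 19] -> ages 15-19, (19, 24] -> ages 20-24, (24, 29] -> ages 25-29
df['Group'] = pd.cut(df['Alter'], bins=[14, 19, 24, 29], labels=[1, 2, 3])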
I have a Dataset that lists individual transactions by country, quarter, division, the transaction type and the value. I would like to sum it up based on the first three variables but create new columns for the other two. The dataset looks like this:
Country Quarter Division Type Value
A 1 Sales A 50
A 2 Sales A 150
A 3 Sales B 20
A 1 Sales A 250
A 2 Sales B 50
A 3 Sales B 50
A 2 Marketing A 50
Now I would like to aggregate the data to get the number of transactions by type as a new variable. The overall number of transactions grouped by the first three variables is easy:
df.groupby(['Country', 'Quarter', 'Division'], as_index=False).agg({'Type':'count', 'Value':'sum'})
However, I would like my new dataframe to look as follows:
Country Quarter Division Type_A Type_B Value_A Value_B
A 1 Sales 2 0 300 0
A 2 Sales 1 1 150 50
A 3 Sales 0 2 0 70
A 2 Marketing 1 0 50 0
How do I do that?
Select the Value column after the groupby and pass (name, function) tuples to agg so the new columns get both their names and their aggregation functions, then reshape with DataFrame.unstack and finally flatten the MultiIndex columns with map:
df1 = (df.groupby(['Country', 'Quarter', 'Division', 'Type'])['Value']
         .agg([('Type', 'count'), ('Value', 'sum')])
         .unstack(fill_value=0))

df1.columns = df1.columns.map('_'.join)
df1 = df1.reset_index()
print(df1)
Country Quarter Division Type_A Type_B Value_A Value_B
0 A 1 Sales 2 0 300 0
1 A 2 Marketing 1 0 50 0
2 A 2 Sales 1 1 150 50
3 A 3 Sales 0 2 0 70
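A pivot_table sketch gets close to the same result in one call, though the flattened column names come out as count_A/sum_A rather than Type_A/Value_A:

df1 = df.pivot_table(index=['Country', 'Quarter', 'Division'], columns='Type',
                     values='Value', aggfunc=['count', 'sum'], fill_value=0)
df1.columns = [f'{a}_{b}' for a, b in df1.columns]
df1 = df1.reset_index()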
I was working with the Gun violence dataset from Kaggle which had the age column like this:
In [5]: df['participant_age_group'].head()
Out [5]:
0 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A...
1 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A...
2 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A...
3 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A...
4 0::Adult 18+||1::Adult 18+||2::Teen 12-17||3::...
Name: participant_age_group, dtype: object
Where 0::, 1:: correspond to the participant index. I want to split them and form a whole new dataframe, say df_age, with the number of people belonging to each age group. For example:
Age Group No_of_people
18 300
25 210
30 100
So that I can then group by Age_group, look at No_of_people, and visualize which age group is responsible for the most gun violence.
Unfortunately I'm only able to split the strings, but I can't get from there to what I want.
I started from this input:
df = pd.DataFrame({'participant_age_group': ['0::Adult 18+||1::Adult 18+||2::Adult 18+||',
                                             '0::Adult 18+||1::Adult 18+||2::Adult 18+||',
                                             '0::Adult 25+||1::Adult 25+||2::Adult 30+||',
                                             '0::Adult 18+||1::Adult 18+||2::Teen 12-17||']})
then to create the df_age:
df_age = (df['participant_age_group'].str.replace('+', '', regex=False)
            .str.split(r'\|{2}', expand=True).stack()
            .str.split(' ', expand=True).dropna()
            .groupby(1, as_index=False).count()
            .rename(columns={0: 'No_of_people', 1: 'Age_group'}))
Some explanation of the code.
str.split(r'\|{2}', expand=True).stack() splits each row wherever the symbol || appears in the string, and stack pivots the resulting columns back into rows. You get something like this, where the first level of the index is the row number in your original df.
0 0 0::Adult 18
1 1::Adult 18
2 2::Adult 18
3
1 0 0::Adult 18
1 1::Adult 18
...
(I don't print all the data.) Then str.split(' ', expand=True).dropna() splits each string where the space is (before the age) and also drops the empty rows to get:
0 1
0 0 0::Adult 18
1 1::Adult 18
2 2::Adult 18
1 0 0::Adult 18
1 1::Adult 18
...
Here you can see you have created two columns, 0 and 1, and column 1 contains the ages, so you just have to group by this column and count the occurrence of each age with groupby(1, as_index=False).count()
With my input, df_age is like:
Age_group No_of_people
0 12-17 1
1 18 8
2 25 2
3 30 1
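An alternative sketch that builds the same table with value_counts (assuming the same sample df as above):

tokens = (df['participant_age_group']
            .str.replace('+', '', regex=False)
            .str.split(r'\|{2}', expand=True).stack())
tokens = tokens[tokens != '']            # drop the empty pieces left by the trailing ||
df_age = (tokens.str.split(' ').str[-1]  # keep the age part after the space
                .value_counts()
                .sort_index()
                .rename_axis('Age_group')
                .reset_index(name='No_of_people'))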