Pandas groupby with minimal group size

I have a dataframe df of shape (450 000, 15) containing information about users. Each row is a different user, with 13 characteristics (age, gender, home town...) and 1 boolean variable indicating whether or not the user has a car.
I would like to group my users to find out which groups have the most cars, but I need to keep at least 2500 users in each group for the results to stay statistically relevant.
test= df.groupby(['Gender'])
test.size() # check the groups size
Gender
Female 150000
Male 300000
dtype: int64
So far so good: I have way more than 2500 users per group. So I added another grouping criterion:
test2= df.groupby(['Gender','Age'])
test2.size()
Gender  Age
Female  <30       15022
        30-90    134960
        90+          18
Male    <20        5040
        20-90    291930
        90+        3030
dtype: int64
As expected, I now have groups with very few users...
I would like to have something like this:
Gender
Female    150000  # Don't split here because the groups would be too small
# Here I can split, because every group size > 2500:
Gender  Age
Male    <20        5040
        20-90    291930
        90+        3030
dtype: int64
I didn't find a way to split a groupby result based on a size criterion, so I was wondering what would be a Pythonic way to handle this.
As I have 13 characteristics to group my users, I am also wondering about the grouping order : should I group by gender first and then by age, or the opposite? This has an impact when using multiple variables with a stop condition.
I don't need to use pandas, but I thought it would be appropriate. The output should look like :
name_of_group   group_size
Female              150000
Male, <20             5040
Male, 20-90         291930
Male, 90+             3030

groupby has to group on a "key" which must be separately computable for each row. That is, there's no way to group by some criterion that depends on aggregate characteristics that you won't know until after the group is created (like its size). You could write code that tries different groupings and uses some heuristic to decide which is "best", but there's nothing built in for this.
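One such heuristic can be sketched as follows. This is only an illustration, not something built into pandas: the helper name split_with_min_size, the toy data and the rule "split on the next column only if every resulting subgroup keeps at least min_size rows" are all assumptions chosen to reproduce the output asked for in the question.

```python
import pandas as pd


def split_with_min_size(df, columns, min_size, prefix=()):
    """Greedy sketch (hypothetical helper): split on `columns` in order,
    but only when every resulting subgroup keeps at least `min_size` rows."""
    if not columns:
        return {prefix: len(df)}
    first, rest = columns[0], columns[1:]
    sizes = df.groupby(first).size()
    if (sizes < min_size).any():
        # Splitting here would create a group that is too small, so stop.
        return {prefix: len(df)}
    result = {}
    for value, sub in df.groupby(first):
        result.update(split_with_min_size(sub, rest, min_size, prefix + (value,)))
    return result


# Toy data standing in for the real 450 000-row dataframe.
rows = ([("F", "<30")] * 4 + [("F", "90+")]
        + [("M", "<20")] * 2 + [("M", "20-90")] * 2 + [("M", "90+")] * 2)
toy = pd.DataFrame(rows, columns=["Gender", "Age"])

group_sizes = split_with_min_size(toy, ["Gender", "Age"], min_size=2)
for name, size in group_sizes.items():
    print(", ".join(name), size)
```

With this rule the order of the columns changes the result, so the grouping order the question worries about is a real design decision rather than a detail.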

Do you want all the groups to have at least 2500 users?
You could do something like this:
import pandas as pd

# List of all sets of categories you want to test
group_ids_list = [['Gender'], ['Age'], ['Gender', 'Age']]
# Will be filled with groups that pass your test
valid_groups = []
group_sizes = {}
for group_ids in group_ids_list:
    grouped_df = df.groupby(group_ids)
    for key, group in grouped_df:
        # Keep only groups with at least 2500 users
        if len(group) >= 2500:
            valid_groups.append(group)
            group_sizes[key] = len(group)
group_sizes = pd.Series(group_sizes)
And then you can work using only the valid groups.
Hope the pseudo-code helps; otherwise please provide a reproducible example.

I think FLab's answer is probably more complete and correct. But if you're after a quick fix:
column = 'Gender'
minimum_size = 2500
# Keep only the values of `column` that occur at least `minimum_size` times
valid_groups = [g for g in set(df[column]) if sum(df[column] == g) >= minimum_size]
mask = df[column].isin(valid_groups)
df[mask].groupby(column)
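The same filter can probably be written with value_counts, which counts every group in one pass instead of scanning the column once per value. A minimal sketch on made-up data (the toy dataframe and the threshold of 2 are assumptions for illustration):

```python
import pandas as pd

# Toy data: "X" appears only once and should be dropped.
df = pd.DataFrame({"Gender": ["F", "F", "F", "M", "M", "X"]})
column = "Gender"
minimum_size = 2

# Count the rows per value once, then keep the values that are large enough.
counts = df[column].value_counts()
valid_groups = counts[counts >= minimum_size].index

kept = df[df[column].isin(valid_groups)]
print(kept.groupby(column).size())
```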

Related

Selection of a condition by a range that includes strings (letter + numbers)

Please help. I have two tables: 1 report and 1 data file.
The data table is presented as follows:
PATIENTS_ID  POL  Age  ICD10
10848754     0    22   H52
10848754     0    22   R00
10848754     0    22   Z01
10848754     0    22   Z02
10850478     1    26   H52
And so on.
The report file asks to collect the following data:
ICD10      | Male (20-29) | Male (30-39) | Female (20-29) | Female (30-39)
C00 - C97  |              |              |                |
E10 - E14  |              |              |                |
I00 - I99  |              |              |                |
So I need to collect all "ICD10" values that fall in the range C00 to C99 and aggregate them by gender and age span. I know that SQL has "BETWEEN", which can quite easily build such a range and select values like "C00, C01, C02" without additional conditions.
Is there something similar in python/pandas?
Logical expressions like ">= C00 <= C99" also include other letters; I have already tried that. I would be grateful for help. Creating a separate parser/filter seems too heavy for such a job.

I'll assume that Excel can be used for a solution.
Let's say PATIENTS_ID is column A ... ICD10 is column D.
You can use this expression to count in range:
=COUNTIFS(D:D,">=C00",D:D,"<=C99")
I'll assume that "POL" is gender and 0 is male. So formula for Male (20-29) and ICD10 C00 - C99 would be:
=COUNTIFS(D:D,">=C00",D:D,"<=C99",C:C,">=20",C:C,"<=29",B:B,"0")

If there is only one letter as the "identifier", as in C02, E34, etc., you can split your ICD10 column into two columns: the first holds the first character of ICD10 and the second holds the numbers.
df.loc[:, "Letter_identifier"] = df["ICD10"].str[0]
df.loc[:, "Number_identifier"] = df["ICD10"].str[1:].astype(int)
Then you can create a mask like:
(df["Letter_identifier"] == "C") & (df["Number_identifier"] >= 0) & (df["Number_identifier"] <= 99)
You can split your dataframe with such masks, aggregate on those sub-dataframes and concat the results.
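As a rough sketch of the whole report, the ranges can also be compared as strings, since codes of the form letter plus two digits sort correctly in lexicographic order; Series.between plays the role of SQL's BETWEEN. The toy data, the helper icd_range, the assumption that POL 0 means male and the use of only the first three characters of the code are all illustrative assumptions, not facts from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "PATIENTS_ID": [1, 1, 2, 3, 4, 5],
    "POL": [0, 0, 1, 0, 1, 1],          # assumed: 0 = male, 1 = female
    "Age": [22, 22, 26, 35, 31, 24],
    "ICD10": ["C50", "H52", "C97", "E11", "I10", "C98"],
})

ranges = {"C00 - C97": ("C00", "C97"),
          "E10 - E14": ("E10", "E14"),
          "I00 - I99": ("I00", "I99")}

code = df["ICD10"].str[:3]
# A single range works like SQL BETWEEN (both ends inclusive).
in_c_range = code.between("C00", "C97")


def icd_range(value):
    """Hypothetical helper: return the report row a code belongs to, or None."""
    for label, (low, high) in ranges.items():
        if low <= value <= high:
            return label
    return None


df["icd_range"] = code.map(icd_range)
df["gender"] = df["POL"].map({0: "Male", 1: "Female"})
# Age bands [20, 30) and [30, 40), labelled as in the report header.
df["age_band"] = pd.cut(df["Age"], bins=[20, 30, 40], right=False,
                        labels=["20-29", "30-39"])

matched = df.dropna(subset=["icd_range", "age_band"])
group = (matched["gender"] + " (" + matched["age_band"].astype(str) + ")").rename("group")
report = pd.crosstab(matched["icd_range"], group)
print(report)
```

This counts rows; if each patient should be counted once per range, duplicates would have to be dropped on PATIENTS_ID and icd_range first.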

How can i add the gender column?

I have this dataset with 5 columns and lots of rows. I've been asked to get the total number of males and females. The values are strings and I can't figure it out. I have to use numpy too.
Please help.
Thanks
ls = gender.values.tolist()
ls
top = []
for i in ls:
    if i == 'M':
        top.append(i)
print(i)
I need the sum of the males and females in the above dataset.

You can use value_counts from pandas to see total number of occurrences of a categorical column.
In your case, it can be
df['gender'].value_counts()
It will return the counts of males and females.
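Since the question says numpy has to be used, the same counts can be obtained with np.unique. A small sketch, assuming the column is called gender and holds the strings 'M' and 'F' (the toy data is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"gender": ["M", "F", "M", "M", "F"]})

# np.unique returns the sorted distinct values and how often each occurs.
values, counts = np.unique(df["gender"].to_numpy(), return_counts=True)
totals = {value: int(count) for value, count in zip(values, counts)}
print(totals)

# Counting a single category: a boolean array sums to the number of True values.
male_total = int((df["gender"].to_numpy() == "M").sum())
```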

How to use pandas to count rows in which two columns must have one specific string from a specified set of strings for each column?

I have a dataset that includes, among other things, a column for level of education and yearly salary (represented for some godforsaken reason as >50k, >=50k, <50k, etc). I need to figure out how many people with higher education AKA bachelors, masters, and doctorate make more than 50k. That means that I need to select the rows in which there is either a doctorate, bachelors, or masters in the education column, AND the first character of the salary column is '>'. What is the proper syntax for that? Will give more information if needed. Please help.

To select only people with higher education you can use isin, passing the list of education degrees. For the yearly salary, if you test only against the > character (e.g. str.startswith('>')), you could end up including the rows where Year_Salary is >=50k, which also covers salaries equal to 50k.
import pandas as pd
import numpy as np
#setup
np.random.seed(42)
d = {
    'Year_Salary': np.random.choice(['>50k', '>=50k', '<50k'], size=(50,)),
    'Education': np.random.choice(['doctorate', 'bachelors', 'masters', 'undergraduate'], size=(50,))
}
df = pd.DataFrame(d)
#code
filtered_df = df[df['Education'].isin(['doctorate', 'bachelors', 'masters'])
                 & df['Year_Salary'].str.startswith('>')]
print(filtered_df)
print(filtered_df.shape[0]) # 20 (number of matches)
Output from filtered_df
Year_Salary Education
1 >50k doctorate
4 >50k bachelors
7 >=50k masters
14 >=50k masters
...
To get only the rows where Year_Salary is strictly greater than 50k you could use str.match with the regex ^>\d+, which matches a string that starts with a literal > followed by one or more digits.
df[df['Education'].isin(['doctorate','bachelors','masters']) & (df['Year_Salary'].str.match(r'^>\d+'))]
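If only the number of matching rows is needed, the boolean mask can be summed directly instead of building a filtered dataframe. A minimal sketch on made-up rows (the five sample records are an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Education": ["doctorate", "bachelors", "masters", "undergraduate", "masters"],
    "Year_Salary": [">50k", ">=50k", "<50k", ">50k", ">50k"],
})

higher_education = df["Education"].isin(["doctorate", "bachelors", "masters"])
# Matches '>' immediately followed by a digit, so '>=50k' is excluded.
over_50k = df["Year_Salary"].str.match(r">\d")

# True counts as 1 and False as 0, so the sum is the number of matching rows.
count = int((higher_education & over_50k).sum())
print(count)
```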

You can use the statement below to filter the dataframe based on a condition:
newdf = df[(df.val > 0.5) & (df.val2 == 1)]
OR
you can iterate through the rows and update the column. Refer to the code below:
for index, row in df.iterrows():
....

Pandas count average number of unique numbers across groups

I have a dataset that contains the columns household_key, age_group, income_group and day. For each household, there is a row for each day that household went shopping. I want to find, on average, how many distinct days the households in each age group went shopping in the study period. I tried grouping by age group and counting the number of unique dates, but I want the unique dates per household in each group, not just the unique dates in each group, and then I want things like the mean and standard deviation. I have tried:
df.groupby('age_group', as_index=False).agg({'DAY': 'nunique'})
But this ignores the households, I also tried:
df.groupby(['age_group', 'household_key'], as_index=False).agg({'DAY': 'nunique'})
but this gets me one group per household (each household is in one age group). Then I don't know how to get the information by age group. I want to do some sort of multilevel group but I don't know how. I'm using Pandas in Python 3.

IIUC, first you want to aggregate over each age and household:
agg = (df.groupby(['age_group', 'household_key'])
         .agg({'DAY': 'nunique'})
      )
and then groupby again for the mean, e.g.,
agg.groupby('age_group').mean()
will give you the mean for each age_group across the household_key.
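A small worked example of this two-step aggregation, including the standard deviation the question asks for. The toy data and the integer day numbers are assumptions; the column names follow the question's code.

```python
import pandas as pd

df = pd.DataFrame({
    "household_key": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "age_group": [25, 25, 25, 25, 30, 30, 30, 25, 25, 25],
    "DAY": [1, 5, 8, 15, 1, 8, 10, 1, 5, 10],
})

# Step 1: distinct shopping days per household (one value per household).
per_household = df.groupby(["age_group", "household_key"])["DAY"].nunique()

# Step 2: summarise those per-household values within each age group.
summary = per_household.groupby(level="age_group").agg(["mean", "std", "count"])
print(summary)
```

Age group 30 contains a single household here, so its standard deviation comes out as NaN.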

If I understand correctly what you want to achieve you can try something like this:
import pandas as pd
data = {'household_key':[1,1,1,1,2,2,2,3,3,3],
'age_group':[25,25,25,25,30,30,30,25,25,25],
'income_group':[40,40,40,40,40,40,40,30,30,30],
'day':['2019-01-01','2019-01-05','2019-01-08','2019-01-15','2019-01-01','2019-01-08','2019-01-10','2019-01-01','2019-01-05','2019-01-10']}
df = pd.DataFrame(data)
# get group by household
group1 = df.groupby(['household_key', 'age_group']).agg({'day': 'nunique'})
# get group by age_group
group2 = df.groupby(['age_group']).agg({'day': 'nunique'})
# join the results
group = group2.merge(group1, how='right', left_index=True, right_index=True)
group.columns = ['unique_days_in_group', 'unique_days_in_household']
print(group)
the result will be like this:
                         unique_days_in_group  unique_days_in_household
household_key age_group
1             25                            5                         4
2             30                            3                         3
3             25                            5                         3

Calculating a probability based on several variables in a Pandas dataframe

I'm still very new to Python and Pandas, so bear with me...
I have a dataframe of passengers on a ship that sank. I have broken this down into other dataframes by male and female, and also by class, to create probabilities for survival. I made a function that compares one dataframe to a dataframe of only survivors, and calculates the probability of survival among this group:
def survivability(total_pass_df, column, value):
    survivors = sum(did_survive[column] == value)
    total = len(total_pass_df)
    survival_prob = round((survivors / total), 2)
    return survival_prob
But now I'm trying to compare survivability among smaller groups, for example male first class passengers vs female third class passengers. I did make dataframes for both of these groups, but I still can't use my survivability function because I'm comparing two different columns (sex and class) rather than just one.
I know exactly how I'd do it with Python - loop through the 'survived' column (which is either a 1 or 0), in the dataframe, if it equals 1, then add one to an index value, and once all the data has been gone through, divide the index value by the length of the dataframe to get the probability of survival....
But I'm supposed to use Pandas for this, and I can't for the life of me work out in my head how to do it....
:/

Without a sample of the data frames you're working with, I can't be sure if I understand your question correctly. But based on your description of the pure-Python procedure,
I know exactly how I'd do it with Python - loop through the 'survived' column (which is either a 1 or 0), in the dataframe, if it equals 1, then add one to an index value, and once all the data has been gone through, divide the index value by the length of the dataframe to get the probability of survival....
you can do this in Pandas by simply writing
dataframe['survived'].mean()
That's it. Given that all the values are either 1 or 0, the mean will be the number of 1's divided by the total number of rows.
If you start out with a data frame that has columns like survived, sex, class, and so on, you can elegantly combine this with Pandas' boolean indexing to pick out the survival rates for different groups. Let me use the Socialcops Titanic passengers data set as an example to demonstrate. Assuming the DataFrame is called df, if you want to analyze only male passengers, you can get those records as
df[df['sex'] == 'male']
and then you can take the survived column of that and get the mean.
>>> df[df['sex'] == 'male']['survived'].mean()
0.19198457888493475
So 19% of male passengers survived. If you want to narrow down to male second-class passengers, you'll need to combine the conditions using &, like this:
>>> df[(df['sex'] == 'male') & (df['pclass'] == 2)]['survived'].mean()
0.14619883040935672
This is getting a little unwieldy, but there's an easier way that actually lets you do multiple categories at once. (The catch is that this is a somewhat more advanced Pandas technique and it might take a while to understand it.) Using the DataFrame.groupby() method, you can tell Pandas to group the rows of the data frame according to their values in certain columns. For example,
df.groupby('sex')
tells Pandas to group the rows by their sex: all male passengers' records are in one group, and all female passengers' records are in another group. The thing you get from groupby() is not a DataFrame, it's a special kind of object that lets you apply aggregation functions - that is, functions which take a whole group and turn it into one number (or something). So, for example, if you do this
>>> df.groupby('sex').mean()
pclass survived age sibsp parch fare \
sex
female 2.154506 0.727468 28.687071 0.652361 0.633047 46.198097
male 2.372479 0.190985 30.585233 0.413998 0.247924 26.154601
body
sex
female 166.62500
male 160.39823
you see that for each column, Pandas takes the average of that column's values over all the male passengers' records, and also over all the female passengers' records. All you care about here is the survival rate, so just use
>>> df.groupby('sex').mean()['survived']
sex
female 0.727468
male 0.190985
One big advantage of this is that you can give more than one column to group by, if you want to look at small groups. For example, sex and class:
>>> df.groupby(['sex', 'pclass']).mean()['survived']
sex pclass
female 1 0.965278
2 0.886792
3 0.490741
male 1 0.340782
2 0.146199
3 0.152130
(you have to give groupby a list of column names if you're giving more than one)
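A variation that may be worth knowing: selecting the survived column before aggregating avoids averaging every other column, and recent pandas versions raise an error when mean() is applied to a grouped frame that still contains non-numeric columns. The sketch below uses a tiny made-up table, not the real passenger data, and also reports the group sizes so that small groups are easy to spot.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["male", "male", "female", "female", "male", "female"],
    "pclass": [1, 3, 1, 3, 3, 1],
    "survived": [0, 0, 1, 0, 1, 1],
})

# Pick the column first, then aggregate: survival rate and group size together.
rates = df.groupby(["sex", "pclass"])["survived"].agg(["mean", "count"])
print(rates)
```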

Have you tried merging the two dataframes by passenger ID and then doing a pivot table in Pandas with whatever row subtotals and aggfunc=numpy.mean?
import pandas as pd
import numpy as np
# Passenger List
p_list = pd.DataFrame()
p_list['ID'] = [1,2,3,4,5,6]
p_list['Class'] = ['1','2','2','1','2','1']
p_list['Gender'] = ['M','M','F','F','F','F']
# Survivor List
s_list = pd.DataFrame()
s_list['ID'] = [1,2,3,4,5,6]
s_list['Survived'] = [1,0,0,0,1,0]
# Merge the datasets
merged = pd.merge(p_list,s_list,how='left',on=['ID'])
# Pivot to get sub means
result = pd.pivot_table(merged,index=['Class','Gender'],values=['Survived'],aggfunc=np.mean, margins=True)
# Reset the index
for x in range(result.index.nlevels-1,-1,-1):
    result.reset_index(level=x,inplace=True)
print(result)
  Class Gender  Survived
0     1      F  0.000000
1     1      M  1.000000
2     2      F  0.500000
3     2      M  0.000000
4   All         0.333333
