Pandas count average number of unique numbers across groups - python

I have a dataset that contains the columns household_key, age_group, income_group and day. For each household, there is a row for each day that household went shopping. I want to find, on average, how many distinct days each age group went shopping over the study period. I tried grouping by age group and counting the number of unique dates, but I want the unique dates per household within each group, not just the unique dates in each group, and then statistics like the mean and standard deviation. I have tried:
df.groupby('age_group', as_index=False).agg({'DAY': 'nunique'})
But this ignores the households. I also tried:
df.groupby(['age_group', 'household_key'], as_index=False).agg({'DAY': 'nunique'})
but this gives me one group per household (each household belongs to a single age group), and then I don't know how to aggregate that information by age group. I think I need some sort of multilevel groupby, but I don't know how. I'm using Pandas in Python 3.

IIUC, first you want to aggregate over each age group and household:
agg = (df.groupby(['age_group', 'household_key'])
         .agg({'DAY': 'nunique'}))
and then groupby again for the mean, e.g.,
agg.groupby('age_group').mean()
will give you the mean for each age_group across its households.
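A compact sketch along the same lines (assuming the same df and column names) that also returns the standard deviation the question mentions:
per_household = df.groupby(['age_group', 'household_key'])['DAY'].nunique()
stats = per_household.groupby('age_group').agg(['mean', 'std'])
print(stats)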

If I understand correctly what you want to achieve, you can try something like this:
import pandas as pd
data = {'household_key': [1,1,1,1,2,2,2,3,3,3],
        'age_group': [25,25,25,25,30,30,30,25,25,25],
        'income_group': [40,40,40,40,40,40,40,30,30,30],
        'day': ['2019-01-01','2019-01-05','2019-01-08','2019-01-15','2019-01-01',
                '2019-01-08','2019-01-10','2019-01-01','2019-01-05','2019-01-10']}
df = pd.DataFrame(data)
# get group by household
group1 = df.groupby(['household_key', 'age_group']).agg({'day': 'nunique'})
# get group by age_group
group2 = df.groupby(['age_group']).agg({'day': 'nunique'})
# join the results
group = group2.merge(group1, how='right', left_index=True, right_index=True)
group.columns = ['unique_days_in_group', 'unique_days_in_household']
print(group)
the result will be like this:
                         unique_days_in_group  unique_days_in_household
household_key age_group
1             25                            5                         4
2             30                            3                         3
3             25                            5                         3
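From here, the average (and spread) of distinct shopping days per household within each age group can be read off the household-level column; a short follow-up sketch using the merged group from above:
stats_by_age = group.groupby('age_group')['unique_days_in_household'].agg(['mean', 'std'])
print(stats_by_age)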

Related

Pandas - using group by and including value counts which are larger than n

I have a table which includes salary and company_location.
I was trying to calculate the mean salary per country, and it works:
wage = df.groupby('company_location').mean()['salary']
However, many company_location values have fewer than 5 entries, and I would like to exclude them from the report.
I know how to get the 5 countries with the most entries:
Top_5 = df['company_location'].value_counts().head(5)
I am just having trouble combining those two variables and making a graph out of it...
Thank you.
You can remove rows whose value occurrence is below a threshold:
df = df[df.groupby('company_location')['company_location'].transform('size') > 5]
You can do the following to only apply the groupby and aggregation to those with more than 5 records:
mask = (df['company_location'].map(df['company_location'].value_counts()) > 5)
wage = df[mask].groupby('company_location')['salary'].mean()
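To then turn that into a graph, as asked, one option is pandas' built-in plotting (a sketch; it assumes matplotlib is available):
import matplotlib.pyplot as plt

ax = wage.sort_values(ascending=False).plot(kind='bar')
ax.set_ylabel('mean salary')
plt.tight_layout()
plt.show()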

How to convert rows into columns (as value but not header) in Python

In the following dataset, I need to convert the "description" rows for each "name" value (for example, inventory1, inventory2 and inventory3) into two separate columns (namely description1 and description2, respectively). If I use either pivot_table or groupby, the description values become headers instead of values under a column. What would be the way to generate the desired output? Thanks
import pandas as pd
df1 = {'item': ['item1','item2','item3','item4','item5','item6'],
       'name': ['inventory1','inventory1','inventory2','inventory2','inventory3','inventory3'],
       'code': [1,1,2,2,3,3],
       'description': ['sales number decrease compared to last month',
                       'Sales number decreased',
                       'sales number increased',
                       'Sales number increased, need to keep kpi',
                       'no sales this month',
                       'item out of stock']}
df1 = pd.DataFrame(df1)
desired output as below:
You can actually use pd.concat:
new_df = pd.concat([
    (
        df1.drop_duplicates('name')
           .drop('description', axis=1)
           .reset_index(drop=True)
    ),
    (
        pd.DataFrame([pd.Series(l) for l in df1.groupby('name')['description'].agg(list).tolist()])
          .add_prefix('description')
    ),
], axis=1)
Output:
>>> new_df
    item        name  code                                   description0                              description1
0  item1  inventory1     1  sales number decrease compared to last month                    Sales number decreased
1  item3  inventory2     2                         sales number increased  Sales number increased, need to keep kpi
2  item5  inventory3     3                            no sales this month                         item out of stock
One-liner version of the above, in case you want it:
pd.concat([df1.drop_duplicates('name').drop('description', axis=1).reset_index(drop=True), pd.DataFrame([pd.Series(l) for l in df1.groupby('name')['description'].agg(list).tolist()]).add_prefix('description')], axis=1)
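An alternative sketch of the same idea, in case you find it more readable: number the descriptions within each name with cumcount, pivot them into columns, and merge the other fields back (this assumes exactly the df1 from the question):
wide = (df1.assign(n=df1.groupby('name').cumcount())
           .pivot(index='name', columns='n', values='description')
           .add_prefix('description')
           .reset_index())
new_df = (df1.drop_duplicates('name')
             .drop('description', axis=1)
             .merge(wide, on='name'))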

How to use pandas to count rows in which two columns must have one specific string from a specified set of strings for each column?

I have a dataset that includes, among other things, a column for level of education and yearly salary (represented for some godforsaken reason as >50k, >=50k, <50k, etc). I need to figure out how many people with higher education AKA bachelors, masters, and doctorate make more than 50k. That means that I need to select the rows in which there is either a doctorate, bachelors, or masters in the education column, AND the first character of the salary column is '>'. What is the proper syntax for that? Will give more information if needed. Please help.
To select only people with higher education you can use isin, passing the list of education degrees. For the yearly salary, if you test only for a leading > (e.g. str.startswith('>')), you will also include the rows where Year_Salary is '>=50k', i.e. possibly equal to 50k.
import pandas as pd
import numpy as np
#setup
np.random.seed(42)
d = {
    'Year_Salary': np.random.choice(['>50k', '>=50k', '<50k'], size=(50,)),
    'Education': np.random.choice(['doctorate', 'bachelors', 'masters', 'undergraduate'], size=(50,)),
}
df = pd.DataFrame(d)
#code
filtered_df = df[df['Education'].isin(['doctorate', 'bachelors', 'masters'])
                 & df['Year_Salary'].str.startswith('>')]
print(filtered_df)
print(filtered_df.shape[0]) # 20 (number of matches)
Output from filtered_df
Year_Salary Education
1 >50k doctorate
4 >50k bachelors
7 >=50k masters
14 >=50k masters
...
To get only the rows where Year_Salary is strictly greater than 50k you could use str.match with the regex ^>\d+, i.e. a string that starts with a literal > followed by one or more digits.
df[df['Education'].isin(['doctorate','bachelors','masters']) & (df['Year_Salary'].str.match(r'^>\d+'))]
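If all you need is the count of matching people rather than the rows themselves, summing the boolean mask is enough (a small sketch reusing the same conditions):
n_matches = (df['Education'].isin(['doctorate', 'bachelors', 'masters'])
             & df['Year_Salary'].str.match(r'^>\d+')).sum()
print(n_matches)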
You can use the statement below to filter the DataFrame based on a condition:
newdf = df[(df.val > 0.5) & (df.val2 == 1)]
OR
you can iterate through the rows and update the column. Refer to the code below:
for index, row in df.iterrows():
    ....

How to find customers who made 2nd purchase within 30 days?

I need your quick help. I want to find a list of customer_id's and first purchase_date for customers who have made their second purchase within 30 days of their first purchase.
i.e. customer_id's 1, 2, 3 have made their 2nd purchase within 30 days.
I need customer_id's 1, 2, 3 and their respective first purchase_date.
I have more than 100k customer_id's.
How can I achieve this in pandas?
You can do it with groupby (this assumes purchase_date is a datetime column and the rows are sorted by date within each customer; the len check guards customers with only one purchase):
s = df.groupby('Customer_id')['purchase_date'].apply(
    lambda x: len(x) > 1 and (x.iloc[1] - x.iloc[0]).days < 30)
out = df.loc[df.Customer_id.isin(s.index[s])].drop_duplicates('Customer_id')
Here is a way:
df2 = (df.loc[df['purchase_date']
              .lt(df['Customer_id']
                  .map((df.sort_values('purchase_date').groupby('Customer_id').first()
                        + pd.to_timedelta(30, 'd'))
                       .squeeze()))])
df2 = (df2.loc[df2.duplicated('Customer_id', keep=False)]
          .groupby('Customer_id').first())
You can set a boolean mask to filter the groups of customers who have made their second purchase within 30 days, as follows:
# Pre-processing: convert the date column to datetime first, then sort within each customer
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
df = df.sort_values(['Customer_id', 'purchase_date'])
# Set boolean mask
mask = (((df['purchase_date'] - df['purchase_date'].groupby(df['Customer_id']).shift()).dt.days <= 30)
        .groupby(df['Customer_id'])
        .transform('any'))
Then we can filter the transaction records of customers with a second purchase within 30 days using:
df[mask]
To further show the customer_id's and their respective first purchase_date, you can use:
df[mask].groupby('Customer_id', as_index=False).first()
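For reference, a minimal end-to-end sketch of this mask approach on made-up sample data (column names assumed to match the question):
import pandas as pd

df = pd.DataFrame({
    'Customer_id': [1, 1, 2, 2, 3, 3, 4],
    'purchase_date': ['2021-01-01', '2021-01-15', '2021-02-01', '2021-02-20',
                      '2021-03-01', '2021-03-10', '2021-04-01'],
})
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
df = df.sort_values(['Customer_id', 'purchase_date'])

# days between consecutive purchases within each customer
gap = (df['purchase_date'] - df.groupby('Customer_id')['purchase_date'].shift()).dt.days
mask = gap.le(30).groupby(df['Customer_id']).transform('any')

# customers 1, 2, 3 with their first purchase_date; customer 4 is excluded
print(df[mask].groupby('Customer_id', as_index=False)['purchase_date'].first())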

Calculating a probability based on several variables in a Pandas dataframe

I'm still very new to Python and Pandas, so bear with me...
I have a dataframe of passengers on a ship that sank. I have broken this down into other dataframes by male and female, and also by class, to create probabilities for survival. I made a function that compares one dataframe to a dataframe of only survivors and calculates the probability of survival for that group:
def survivability(total_pass_df, column, value):
    survivors = sum(did_survive[column] == value)
    total = len(total_pass_df)
    survival_prob = round((survivors / total), 2)
    return survival_prob
But now I'm trying to compare survivability among smaller groups - male first class passengers vs female third class passengers, for example. I did make dataframes for both of these groups, but I still can't use my survivability function because I'm comparing two different columns - sex and class - rather than just one.
I know exactly how I'd do it with Python - loop through the 'survived' column (which is either a 1 or 0), in the dataframe, if it equals 1, then add one to an index value, and once all the data has been gone through, divide the index value by the length of the dataframe to get the probability of survival....
But I'm supposed to use Pandas for this, and I can't for the life of me work out in my head how to do it....
:/
Without a sample of the data frames you're working with, I can't be sure if I understand your question correctly. But based on your description of the pure-Python procedure,
I know exactly how I'd do it with Python - loop through the 'survived' column (which is either a 1 or 0), in the dataframe, if it equals 1, then add one to an index value, and once all the data has been gone through, divide the index value by the length of the dataframe to get the probability of survival....
you can do this in Pandas by simply writing
dataframe['survived'].mean()
That's it. Given that all the values are either 1 or 0, the mean will be the number of 1's divided by the total number of rows.
If you start out with a data frame that has columns like survived, sex, class, and so on, you can elegantly combine this with Pandas' boolean indexing to pick out the survival rates for different groups. Let me use the Socialcops Titanic passengers data set as an example to demonstrate. Assuming the DataFrame is called df, if you want to analyze only male passengers, you can get those records as
df[df['sex'] == 'male']
and then you can take the survived column of that and get the mean.
>>> df[df['sex'] == 'male']['survived'].mean()
0.19198457888493475
So 19% of male passengers survived. If you want to narrow down to male second-class passengers, you'll need to combine the conditions using &, like this:
>>> df[(df['sex'] == 'male') & (df['pclass'] == 2)]['survived'].mean()
0.14619883040935672
This is getting a little unwieldy, but there's an easier way that actually lets you do multiple categories at once. (The catch is that this is a somewhat more advanced Pandas technique and it might take a while to understand it.) Using the DataFrame.groupby() method, you can tell Pandas to group the rows of the data frame according to their values in certain columns. For example,
df.groupby('sex')
tells Pandas to group the rows by their sex: all male passengers' records are in one group, and all female passengers' records are in another group. The thing you get from groupby() is not a DataFrame, it's a special kind of object that lets you apply aggregation functions - that is, functions which take a whole group and turn it into one number (or something). So, for example, if you do this
>>> df.groupby('sex').mean()
          pclass  survived        age     sibsp     parch       fare       body
sex
female  2.154506  0.727468  28.687071  0.652361  0.633047  46.198097  166.62500
male    2.372479  0.190985  30.585233  0.413998  0.247924  26.154601  160.39823
you see that for each column, Pandas takes the average of that column's values over the male passengers' records, and also over all the female passengers' records. All you care about here is the survival rate, so just use
>>> df.groupby('sex').mean()['survived']
sex
female 0.727468
male 0.190985
One big advantage of this is that you can give more than one column to group by, if you want to look at small groups. For example, sex and class:
>>> df.groupby(['sex', 'pclass']).mean()['survived']
sex pclass
female 1 0.965278
2 0.886792
3 0.490741
male 1 0.340782
2 0.146199
3 0.152130
(you have to give groupby a list of column names if you're giving more than one)
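Applied to the comparison from the question (male first-class vs. female third-class passengers), you can index that grouped Series directly; a small sketch assuming the same df and column names:
rates = df.groupby(['sex', 'pclass'])['survived'].mean()
print(rates.loc[('male', 1)])    # survival rate of male 1st-class passengers
print(rates.loc[('female', 3)])  # survival rate of female 3rd-class passengers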
Have you tried merging the two dataframes by passenger ID and then doing a pivot table in Pandas with whatever row subtotals and aggfunc=numpy.mean?
import pandas as pd
import numpy as np
# Passenger List
p_list = pd.DataFrame()
p_list['ID'] = [1,2,3,4,5,6]
p_list['Class'] = ['1','2','2','1','2','1']
p_list['Gender'] = ['M','M','F','F','F','F']
# Survivor List
s_list = pd.DataFrame()
s_list['ID'] = [1,2,3,4,5,6]
s_list['Survived'] = [1,0,0,0,1,0]
# Merge the datasets
merged = pd.merge(p_list,s_list,how='left',on=['ID'])
# Pivot to get sub means
result = pd.pivot_table(merged,index=['Class','Gender'],values=['Survived'],aggfunc=np.mean, margins=True)
# Reset the index
for x in range(result.index.nlevels-1, -1, -1):
    result.reset_index(level=x, inplace=True)
print(result)
  Class Gender  Survived
0     1      F  0.000000
1     1      M  1.000000
2     2      F  0.500000
3     2      M  0.000000
4   All         0.333333
